SUPERFAMILY

Lead Research Organisation: MRC LABORATORY OF MOLECULAR BIOLOGY

Department Name: Structural Studies

Abstract

SUPERFAMILY is a popular resource that detects and classifies protein domains of known structure in genome sequences.

Small proteins are a single unit but larger proteins can be made up of multiple subunits we call domains. Domains are modular evolutionary blocks which form components of whole proteins via duplication and recombination of existing units. X-ray crystallography and NMR experiments provide the 3D atomic resolution structure of proteins allowing the domains to be grouped into related families which often share a common or related function. The SUPERFAMILY database contains a library of profiles of these domain families in the form of hidden Markov models. These models are a computational tool which can detect the presence of domains in the sequences of proteins.

Some years ago the complete human genome was experimentally characterised, giving us a list of all the sequences of the proteins. The genomes of many other organisms have also been sequenced, and we now have the complete protein sequences of thousands of organisms. The SUPERFAMILY model library is run against all the genomes to identify the domains in the proteins. The data produced by the SUPERFAMILY analysis can be used, for example, by biologists working on specific proteins in the laboratory, by larger projects working on a whole genome, or to improve our understanding of molecular evolution across all genomes and all kingdoms of life.

The SUPERFAMILY website enables users to enter sequences to search against the model library. The results of the domain assignments to all the genomes are stored in a database and can also be viewed on the website. There are many tools and ways of browsing the data that let researchers compare different organisms, proteins and domains in order to answer biological questions. The data, software and model library are available for people to download wholesale to carry out their own analysis. The information contained in SUPERFAMILY feeds into several other websites and resources, e.g. the ENSEMBL human genome website, which bring together different specialist sources of data to display alongside each other.

Technical Summary

The SUPERFAMILY resource detects and classifies protein domains in genome sequences. The domain definitions are taken from the SCOP2, SCOPe and CATH hierarchies and searched against all completely sequenced genomes using hidden Markov models. The resource provides the pre-computed results for a comprehensive collection of genomes via data download and an interactive website. There are alignments, structural models, statistics, comparative and enrichment tools plus visualisations.
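To make the idea of scanning a sequence with a profile concrete, here is a deliberately simplified sketch. It uses a profile with match states only (no insert or delete states), which reduces to sliding a position-specific log-odds table along the sequence. Real profile HMMs, including those used by SUPERFAMILY, also model insertions and deletions. The emission probabilities, the uniform background, the floor value and the function names below are all invented for illustration.

```python
import math

# Hypothetical 3-position profile: per-position emission probabilities
# for a few residues. These numbers are made up for the example.
PROFILE = [
    {"G": 0.7, "A": 0.2},
    {"K": 0.6, "R": 0.3},
    {"S": 0.5, "T": 0.4},
]
BACKGROUND = 1.0 / 20.0  # assumed uniform background over 20 amino acids
FLOOR = 0.005            # assumed probability for residues not listed


def window_score(window):
    """Log-odds score (in bits) of one window against the profile."""
    return sum(
        math.log2(column.get(residue, FLOOR) / BACKGROUND)
        for column, residue in zip(PROFILE, window)
    )


def best_hit(sequence):
    """Return (start, score) of the best window; start is 0-based.

    Raises ValueError if the sequence is shorter than the profile.
    """
    width = len(PROFILE)
    scores = [
        (window_score(sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    ]
    score, start = max(scores)
    return start, score


start, score = best_hit("MAAGKSLL")
print(start, round(score, 2))
```

A positive score means the window is better explained by the profile than by the background; a real search would convert such scores into E-values before reporting a domain.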

The gold standard Structural Classification of Proteins (SCOP) database classified the proteins of solved 3D structure in the PDB but has now been superseded by SCOP2 and SCOPe. The CATH classification is the most significant alternative. Domains are defined as minimum units of evolution, and the domains are hierarchically grouped into superfamilies and families. SUPERFAMILY maps these families and superfamilies onto sequence datasets including all completely sequenced genomes, totalling over 100 million sequences. SUPERFAMILY currently has comprehensive inclusion of proteomes, but advances in sequencing technology are increasing the demand for inclusion of nucleotide datasets.

The analysis is achieved using hidden Markov model (HMM) technology, enhanced indirectly via structural knowledge. A hand-curated library of models representing the superfamilies forms part of an assignment procedure which detects domains in protein sequences. The assignment procedure then classifies each domain into the relevant superfamily and family, also listing the closest solved structure. SUPERFAMILY does not merely implement cutting-edge software: its development also involves creating new algorithms and contributing to the development of HMM technology.
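One step any such assignment procedure needs is deciding which candidate model hits to keep when several match overlapping regions of the same protein. The sketch below shows one common way to do this, a greedy selection by E-value. It is an assumption for illustration, not the procedure SUPERFAMILY actually uses: the `Hit` fields, the `1e-4` cutoff and all identifiers such as `sf_A` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    superfamily: str        # hypothetical superfamily label
    family: str             # hypothetical family label
    closest_structure: str  # hypothetical ID of the closest solved structure
    start: int              # 1-based, inclusive
    end: int                # 1-based, inclusive
    evalue: float


def overlaps(a, b):
    """True if the two hits share at least one residue position."""
    return a.start <= b.end and b.start <= a.end


def assign_domains(hits, max_evalue=1e-4):
    """Greedily keep the most significant non-overlapping hits.

    The cutoff is an assumed value, chosen only for this example.
    """
    accepted = []
    for hit in sorted(hits, key=lambda h: h.evalue):
        if hit.evalue > max_evalue:
            break  # remaining hits are even less significant
        if not any(overlaps(hit, kept) for kept in accepted):
            accepted.append(hit)
    # Report domains in order along the protein sequence.
    return sorted(accepted, key=lambda h: h.start)


candidates = [
    Hit("sf_A", "fam_A1", "struct_1", 5, 120, 1e-30),
    Hit("sf_B", "fam_B1", "struct_2", 100, 200, 1e-10),
    Hit("sf_C", "fam_C1", "struct_3", 130, 250, 1e-20),
    Hit("sf_D", "fam_D1", "struct_4", 260, 300, 0.5),
]
for domain in assign_domains(candidates):
    print(domain.superfamily, domain.family, domain.start, domain.end)
```

Here `sf_B` is dropped because it overlaps the stronger `sf_A` hit, and `sf_D` is dropped because its E-value exceeds the cutoff. A greedy rule is simple to reason about, though it does not always maximise the total number of residues covered.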

The analysis tools are an essential part of the resource, enabling users inexperienced in computational work to share the deeper benefits available from data-mining, enrichment analysis, comparative genomics and visualisation.

Planned Impact

Because the main purpose of SUPERFAMILY is to serve the biological community, success in that role means that its main impact will be indirect, arising from others using it to generate their own impact. Commercially and economically, SUPERFAMILY is having an effect: it is cited in at least 70 patents. These are mostly regarding protein mutants, but also include patents relating to the innate immune system, detergents, antimicrobial agents, computer software, plant yield and even cake mix. Protein mutants have largely medical applications (as would applications to the innate immune system), but detergents and antimicrobial agents are of more general benefit to healthy people. Plant yield is key to the BBSRC strategic priority of food security, whereas the benefits of cake mix might be more debatable.

One of the strong new directions in this proposal is to move to include nucleotide sequences. This is partly driven by demand, but also by the BBSRC strategic priority on food security. For example, Gough is a co-PI on another BBR project for wheat genomics led by Prof. Edwards, who is on the SUPERFAMILY SAB (see Edwards letter of support). There is a great deal of experimental data produced on wheat varieties, but since wheat has a complex hexaploid genome, there is not yet a complete genome assembly available. Thus for SUPERFAMILY to contribute to wheat, central to the BBSRC food security priority, the ability to include nucleotide and variant data (via FATHMM) is of the utmost importance. This is but one example, as there are increasing quantities of variant data becoming available on crops and livestock. See letters of support from Blundell and Edwards. SUPERFAMILY, via sister resources FATHMM and dcGO, will provide tools for variant analysis and functional enrichment, aiding researchers in this area. This unified presentation of sister resources in the SUPERFAMILY interface is the other major objective of this proposal.

As stated above, the main impact of SUPERFAMILY is indirect, via other researchers using the database to support their research. The section on academic beneficiaries (a) demonstrates the extent of the impact via thousands of citations, (b) shows the strength of the contribution to UK academia via the high proportion of citations coming from leading UK institutions, and (c) gives a breakdown of the areas of biology in which this impact is taking place. Amongst these citations are publications from top journals (e.g. more than 10 from Nature journals, others from Cell, Science etc.) and some very highly-cited publications.

SUPERFAMILY is also having a cooperative impact by influencing and working with the UK bioinformatics community, e.g. through frequent interactions with the EBI via ENSEMBL, InterPro, UniProt, Pfam and PDBe (see letters of support). SUPERFAMILY contributes to two major consortia: InterPro, and the UK-based Genome3D consortium for structural bioinformatics (founding member).

The most significant impact outside the BBSRC remit is clearly medical, with SUPERFAMILY and its sister resources contributing to cancer research via the COSMIC database at the Sanger Centre (see letter of support from McDermott).

Funded Value:

£487,765

Funded Period:

Jan 17 - Sep 21

Funder:

BBSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

BB/N019431/2

Principal Investigator:

Julian Gough

Research Subject:

Biomolecules & biochemistry (35%)

Tools, technologies & methods (56%)

Research Topic:

Bioinformatics (56%)

Chemical Biology (14%)

Structural biology (21%)

Organisations

People	ORCID iD
Julian Gough (Principal Investigator)

Publications

Blum M (2021) The InterPro protein families and domains database: 20 years on. in Nucleic acids research

Carraro M (2017) Performance of in silico tools for the evaluation of p16INK4a (CDKN2A) variants in CAGI. in Human mutation

Lu C (2020) Genetic risk factors for death with SARS-CoV-2 from the UK Biobank

Mitchell AL (2019) InterPro in 2019: improving coverage, classification and access to protein sequence annotations. in Nucleic acids research

Pandurangan AP (2020) Prediction of impacts of mutations on protein structure and interactions: SDM, a statistical approach, and mCSM, using machine learning. in Protein science : a publication of the Protein Society

Pandurangan AP (2019) The SUPERFAMILY 2.0 database: a significant proteome update and a new webserver. in Nucleic acids research

Sillitoe I (2020) Genome3D: integrating a collaborative data pipeline to expand the depth and breadth of consensus protein structure annotation. in Nucleic acids research

Zhou B (2017) A Subset of Ubiquitin-Conjugating Enzymes Is Essential for Plant Immunity. in Plant physiology

Zhou N (2019) The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens. in Genome biology

Related Projects

Project Reference	Relationship	Related To	Start	End	Award Value
BB/N019431/1			30/09/2016	31/12/2016	£511,680
BB/N019431/2	Transfer	BB/N019431/1	01/01/2017	29/09/2021	£487,766

Key Findings

Description	The SUPERFAMILY database has a new website release and several times more models, including new SCOP models as well as CATH and ECOD models.
Exploitation Route	The website and database are highly accessed and used by the community.
Sectors	Pharmaceuticals and Medical Biotechnology
URL	http://supfam.org

Impact Summary

Description	The SUPERFAMILY database has been cited in hundreds of patents.
Sector	Agriculture, Food and Drink, Healthcare, Pharmaceuticals and Medical Biotechnology
Impact Types	Economic

Collaboration

Description	Genome3D
Organisation	University College London
Country	United Kingdom
Sector	Academic/University
PI Contribution	The Genome3D consortium was founded with SUPERFAMILY as a founding member
Start Year	2011

Engagement Activities

Description	I'm a Scientist, Stay at home - Arun Pandurangan
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Schools
Results and Impact	Over 4200 school students across the UK participated in live chats with scientists during the COVID pandemic. It was a great opportunity for both students and scientists to share knowledge broadly in the area of medical research. Because of the online format of the event, students were able to reach out to as many researchers as possible with their burning questions. More importantly, this event had a great impact in successfully engaging students and researchers during the period of COVID lockdown, which was really good for science communication in these challenging times.
Year(s) Of Engagement Activity	2020
URL	https://medical20.imascientist.org.uk