SUPERFAMILY

Lead Research Organisation: MRC Laboratory of Molecular Biology
Department Name: Structural Studies

Abstract

SUPERFAMILY is a popular resource that detects and classifies protein domains of known structure in genome sequences.

Small proteins consist of a single unit, but larger proteins can be made up of multiple units, which we call domains. Domains are modular evolutionary building blocks that form the components of whole proteins via duplication and recombination of existing units. X-ray crystallography and NMR experiments provide the 3D atomic-resolution structure of proteins, allowing the domains to be grouped into related families, which often share a common or related function. The SUPERFAMILY database contains a library of profiles of these domain families in the form of hidden Markov models. These models are a computational tool that can detect the presence of domains in the sequences of proteins.

Some years ago the complete human genome was experimentally characterised, giving us a list of all the sequences of the proteins. The genomes of many other organisms have also been sequenced, and at this time we have the complete sequences for the proteins of thousands of organisms. The SUPERFAMILY model library is run against all the genomes to identify the domains in the proteins. The data produced by the SUPERFAMILY analysis can be used, for example, by biologists working on specific proteins in the laboratory, by larger projects working on a whole genome, or to improve our understanding of molecular evolution across all genomes and all kingdoms of life.

The SUPERFAMILY website enables users to enter sequences to search against the model library. The results of the domain assignments to all the genomes are stored in a database and can also be viewed on the website. Many tools and ways of browsing the data support the comparison of different organisms, proteins and domains, helping researchers to answer biological questions. The data, software and model library are available for people to download wholesale to carry out their own analysis. The information contained in SUPERFAMILY feeds into several other websites and resources, e.g. the ENSEMBL human genome website, which bring together different specialist sources of data to display alongside each other.

Technical Summary

The SUPERFAMILY resource detects and classifies protein domains in genome sequences. The domain definitions are taken from the SCOP2, SCOPe and CATH hierarchies and searched against all completely sequenced genomes using hidden Markov models. The resource provides the pre-computed results for a comprehensive collection of genomes via data download and an interactive website. There are alignments, structural models, statistics, comparative and enrichment tools plus visualisations.

The gold standard Structural Classification of Proteins (SCOP) database classified the proteins of solved 3D structure in the PDB but has now been superseded by SCOP2 and SCOPe. The CATH classification is the most significant alternative. Domains are defined as minimum units of evolution, and the domains are hierarchically grouped into superfamilies and families. SUPERFAMILY maps these families and superfamilies onto sequence datasets including all completely sequenced genomes, totalling over 100 million sequences. SUPERFAMILY currently has comprehensive inclusion of proteomes, but advances in sequencing technology are increasing the demand for inclusion of nucleotide datasets.

The analysis is achieved using hidden Markov model (HMM) technology, enhanced indirectly via structural knowledge. A hand-curated library of models representing the superfamilies forms part of an assignment procedure which detects domains in protein sequences. The assignment procedure then classifies each domain into the relevant superfamily and family, also listing the closest solved structure. SUPERFAMILY does not just implement cutting-edge software: its development process also involves creating new algorithms and contributing to the development of HMM technology.
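As an illustration only, the step of turning raw HMM hits into a set of domain assignments can be sketched as below. This is not the SUPERFAMILY assignment procedure: the Hit record, the superfamily labels, the E-value cutoff and the overlap threshold are all hypothetical, and the greedy best-E-value-first strategy is simply one common way to resolve overlapping hits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One HMM match on a protein sequence (hypothetical record layout)."""
    superfamily: str  # illustrative label, not a real classification ID
    start: int        # 1-based, inclusive
    end: int          # inclusive
    evalue: float     # lower means more significant


def overlap(a: Hit, b: Hit) -> int:
    """Number of residues shared by two hits (0 if they do not overlap)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def assign_domains(hits, evalue_cutoff=1e-4, max_overlap_fraction=0.2):
    """Greedily keep the most significant hits that do not overlap too much.

    Both thresholds are assumed values chosen for this sketch.
    """
    accepted = []
    for hit in sorted(hits, key=lambda h: h.evalue):
        if hit.evalue > evalue_cutoff:
            break  # hits are sorted, so all remaining ones are weaker
        hit_len = hit.end - hit.start + 1
        compatible = all(
            overlap(hit, other)
            <= max_overlap_fraction * min(hit_len, other.end - other.start + 1)
            for other in accepted
        )
        if compatible:
            accepted.append(hit)
    # Report domains in sequence order.
    return sorted(accepted, key=lambda h: h.start)


example_hits = [
    Hit("A", 1, 100, 1e-30),
    Hit("B", 50, 150, 1e-10),   # overlaps A heavily, so it is dropped
    Hit("C", 110, 200, 1e-20),
    Hit("D", 300, 350, 0.5),    # fails the E-value cutoff
]
assigned = assign_domains(example_hits)
print([h.superfamily for h in assigned])
```

A greedy pass like this is easy to reason about, but a real procedure would also need to handle details such as discontinuous domains and family-level classification, which are outside the scope of this sketch.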

The analysis tools are an essential part of the resource, enabling users who are inexperienced in computational work to share the deeper benefits available from data-mining, enrichment analysis, comparative genomics and visualisation.
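To give a flavour of what an enrichment analysis computes, the sketch below applies a one-sided hypergeometric test, a widely used way to ask whether a domain superfamily appears in a chosen set of proteins more often than expected by chance. The function name and the numbers are hypothetical, and the source does not state which statistical test the SUPERFAMILY tools actually use.

```python
from math import comb


def enrichment_pvalue(population, population_hits, sample, sample_hits):
    """One-sided hypergeometric p-value: P(X >= sample_hits).

    population      -- total number of proteins considered
    population_hits -- proteins in the population containing the domain
    sample          -- number of proteins in the set of interest
    sample_hits     -- proteins in that set containing the domain
    """
    upper = min(population_hits, sample)
    total = comb(population, sample)
    tail = sum(
        comb(population_hits, k) * comb(population - population_hits, sample - k)
        for k in range(sample_hits, upper + 1)
    )
    return tail / total


# Hypothetical example: 5 of 10 proteins carry the domain overall,
# and all 5 proteins in the selected set carry it.
p = enrichment_pvalue(population=10, population_hits=5, sample=5, sample_hits=5)
print(p)
```

In practice many superfamilies are tested at once, so the resulting p-values would normally need a multiple-testing correction before being interpreted.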

Planned Impact

The main purpose of SUPERFAMILY is to serve the biological community, so if it succeeds in that role its main impact will by nature be indirect, via others making use of it to generate their own impact. Commercially and economically, SUPERFAMILY can be seen to be having an effect in that it is cited in at least 70 patents. These mostly concern protein mutants, but also include patents relating to the innate immune system, detergents, antimicrobial agents, computer software, plant yield and even cake mix. Protein mutants have largely medical applications (as would applications to the innate immune system), but detergents and antimicrobial agents are of more general benefit to healthy people. Plant yield is key to the BBSRC strategic priority of food security, whereas the benefits of cake mix might be more debatable.

One of the strong new directions in this proposal is to move to include nucleotide sequences. This is partly driven by demand, but also by the BBSRC strategic priority on food security. For example, Gough is a co-PI on another BBR project for wheat genomics led by Prof. Edwards, who is on the SUPERFAMILY SAB (see Edwards letter of support). There is a great deal of experimental data produced on wheat varieties, but since wheat has a complex hexaploid genome, there is not yet a complete genome assembly available. Thus for SUPERFAMILY to contribute to wheat, central to the BBSRC food security priority, the ability to include nucleotide and variant data (via FATHMM) is of the utmost importance. This is but one example, as there are increasing quantities of variant data becoming available on crops and livestock. See letters of support from Blundell and Edwards. SUPERFAMILY, via sister resources FATHMM and dcGO, will provide tools for variant analysis and functional enrichment, aiding researchers in this area. This unified presentation of sister resources in the SUPERFAMILY interface is the other major objective of this proposal.

As stated above, the main impact of SUPERFAMILY is indirect, via other researchers using the database to support their research. The section on academic beneficiaries (a) demonstrates the extent of the impact via thousands of citations, (b) shows the strength of the contribution to UK academia via the high proportion of citations coming from leading UK institutions, and (c) gives a breakdown of the areas of biology in which this impact is taking place. Amongst these citations are publications from top journals (e.g. more than 10 from Nature journals, others from Cell, Science etc.) and some very highly-cited publications.

SUPERFAMILY is also having a cooperative impact in influencing and working with the UK bioinformatics community with e.g. frequent interactions with the EBI via ENSEMBL, InterPro, UniProt, Pfam, PDBe (see letters of support). SUPERFAMILY contributes to two major consortia: InterPro, and the UK based Genome3D consortium for structural bioinformatics (founding member).

The most significant impact outside the BBSRC remit is clearly medical, with SUPERFAMILY and its sister resources contributing to cancer research via the COSMIC database at the Sanger Centre (see letter of support from McDermott).
 
Description The SUPERFAMILY database has a new website release and several times more models including new SCOP models and CATH and ECOD models.
Exploitation Route The website and database are highly accessed and used by the community.
Sectors Pharmaceuticals and Medical Biotechnology

URL http://supfam.org
 
Description The SUPERFAMILY database has been cited in hundreds of patents.
Sector Agriculture, Food and Drink, Healthcare, Pharmaceuticals and Medical Biotechnology
Impact Types Economic

 
Description Genome3D 
Organisation University College London
Country United Kingdom 
Sector Academic/University 
PI Contribution The Genome3D consortium was founded with SUPERFAMILY as a founding member
Start Year 2011
 
Description I'm a Scientist, Stay at home - Arun Pandurangan 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Schools
Results and Impact Over 4200 school students across the UK participated in live chats with scientists during the COVID pandemic. It was a great opportunity for both students and scientists to share knowledge broadly in the area of medical research. Because of the online format of the event, it was very useful for students to reach out to as many researchers as possible with their burning questions. More importantly, this event had a great impact in successfully engaging students and researchers during the period of COVID lockdown, which was really good for science communication in these challenging times.
Year(s) Of Engagement Activity 2020
URL https://medical20.imascientist.org.uk