Support for the SUPERFAMILY protein domain resource.

Lead Research Organisation: University of Bristol
Department Name: Computer Science

Abstract

The SUPERFAMILY resource detects and classifies protein domains of known structure in genome sequences. Small proteins are a single unit but larger proteins can be made up of multiple subunits we call domains. Domains are modular evolutionary blocks which are assembled into whole proteins via duplication and recombination. X-ray crystallography and NMR experiments provide the 3D atomic resolution of proteins allowing the domains to be grouped into related families which often share a common or related function. The SUPERFAMILY database contains a library of profiles of these domain families in the form of hidden Markov models. These models are a computational tool which can detect the presence of domains in the sequences of proteins. Some years ago the first complete genome was experimentally characterised, giving us a list of all the sequences of the proteins which make up that organism. Subsequently the human genome was sequenced and now we have the complete sequences for the proteins of approaching 1,000 organisms. The SUPERFAMILY model library is run against all the genomes to identify the domains in the proteins. Our knowledge of domain families is not complete, so the assignments from the hidden Markov models cover only about half of the protein sequences, but this is still extremely valuable information. The data produced by the SUPERFAMILY analysis can be used for example by biologists working on specific proteins in the laboratory, larger projects working on a whole genome, or to improve our understanding of molecular evolution across all genomes and all kingdoms of life. The SUPERFAMILY website enables users to enter sequences to search against the model library. The results of the domain assignments to all the genomes are stored in a database and can also be viewed on the website. There are many tools and ways of browsing the data which allow the comparison of different organisms, proteins and domains to allow researchers to answer biological questions. 
The data, software and model library are available for people to download wholesale to carry out their own analysis. The information contained in SUPERFAMILY feeds into several other websites and resources, e.g. the ENSEMBL human genome website, which bring together different specialist sources of data to display alongside each other.

Technical Summary

The SUPERFAMILY resource detects and classifies protein domains in genome sequences. The domain definitions are taken from the SCOP hierarchy and searched against all completely sequenced genomes using hidden Markov models. The resource contains four main components accessed by end users: a database of over 14 million domain assignments, a library of over 14 thousand hidden Markov models, numerous analysis tools, and a web interface to all of these. The Structural Classification of Proteins (SCOP) database classifies the proteins of solved 3D structure in the PDB. Domains are defined as minimum units of evolution, and the domains are hierarchically grouped into superfamilies and families. There are 3464 families contained in 1777 superfamilies, totalling 97178 domain definitions. SUPERFAMILY maps these families and superfamilies onto sequence datasets including all completely sequenced genomes, totalling over 14 million domains. SUPERFAMILY currently has comprehensive inclusion of genomes, but advances in sequencing technology are rapidly increasing the number which need to be included. The detection and classification of domains in genome sequences is achieved using hidden Markov model (HMM) technology, enhanced indirectly via structural knowledge. A hand-curated library of models representing the superfamilies forms part of an assignment procedure which detects domains in protein sequences. The assignment procedure then classifies the domain into the relevant superfamily and family, also listing the closest solved structure. SUPERFAMILY does not just implement cutting-edge software; its development process also involves creating new algorithms and contributing to the development of HMM technology. The analysis tools are an essential part of the resource, enabling users inexperienced in computational work to share the deeper benefits of data-mining, comparative genomics and visualisation, which are usually accessible only to more expert users.
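
The idea of scoring a sequence against a hidden Markov model can be illustrated with a small sketch. The example below is not the SUPERFAMILY assignment procedure or its model library: it is a toy two-state HMM over a reduced two-letter alphabet (H for hydrophobic, P for polar), with made-up probabilities, scored with the standard forward algorithm in log space. All names (STATES, forward_log_likelihood, log_odds_score) and all parameter values are assumptions chosen for illustration only.

```python
import math

# Toy model (illustrative assumption, not a SUPERFAMILY model):
# "D" is a domain-like state that prefers hydrophobic residues,
# "B" is a background state with no preference.
STATES = ("D", "B")
START = {"D": 0.5, "B": 0.5}
TRANS = {"D": {"D": 0.9, "B": 0.1}, "B": {"D": 0.1, "B": 0.9}}
EMIT = {"D": {"H": 0.8, "P": 0.2}, "B": {"H": 0.5, "P": 0.5}}

# Work in log space so long sequences do not underflow.
LOG_START = {s: math.log(p) for s, p in START.items()}
LOG_TRANS = {s: {t: math.log(p) for t, p in row.items()} for s, row in TRANS.items()}
LOG_EMIT = {s: {x: math.log(p) for x, p in row.items()} for s, row in EMIT.items()}


def log_sum_exp(values):
    """Return log(sum(exp(v))) computed stably."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def forward_log_likelihood(seq, states, log_start, log_trans, log_emit):
    """Forward algorithm: log P(seq | model), summed over all state paths."""
    # alpha[s] holds log P(symbols seen so far, current state = s).
    alpha = {s: log_start[s] + log_emit[s][seq[0]] for s in states}
    for symbol in seq[1:]:
        alpha = {
            s: log_emit[s][symbol]
            + log_sum_exp([alpha[p] + log_trans[p][s] for p in states])
            for s in states
        }
    return log_sum_exp(list(alpha.values()))


def log_odds_score(seq, null_prob=0.5):
    """Log-odds of the toy HMM against a null model of independent symbols."""
    model = forward_log_likelihood(seq, STATES, LOG_START, LOG_TRANS, LOG_EMIT)
    null = len(seq) * math.log(null_prob)
    return model - null


if __name__ == "__main__":
    for example in ("HHHHHHHH", "PPPPPPPP"):
        print(example, round(log_odds_score(example), 3))
```

In this sketch a positive log-odds score means the sequence is better explained by the toy model than by the null model. A real profile HMM has match, insert and delete states for each position of a family alignment and emits the 20 amino acids, and a real assignment procedure would compare scores across many models using a significance threshold.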

Publications

Abroi A (2011) Are viruses a source of new protein folds for organisms? - Virosphere structure space and evolution. in BioEssays : news and reviews in molecular, cellular and developmental biology

Baumgarten S (2015) The genome of Aiptasia, a sea anemone model for coral symbiosis in Proceedings of the National Academy of Sciences

De Lima Morais DA (2011) SUPERFAMILY 1.75 including a domain-centric gene ontology method. in Nucleic acids research

Dunker AK (2011) Sequences and topology: intrinsic disorder in the evolving universe of protein structure. in Current opinion in structural biology

 
Description This is a resource rather than a research project and consists mostly of deliverables rather than findings.



There was a user survey conducted both in person at ISMB and online for those not attending the conference. The key findings were that most users were satisfied with the resource, but that most of them were not aware of the more advanced features.
Exploitation Route We have a large userbase who access the website and resources.
Sectors Healthcare,Pharmaceuticals and Medical Biotechnology

URL http://supfam.org
 
Description The SUPERFAMILY resource is cited in approximately 100 patents, mostly regarding protein mutants, but also including patents relating to the innate immune system, detergents, antimicrobial agents, computer software, and plant yield. There is even one on cake mix.
First Year Of Impact 2004
Sector Agriculture, Food and Drink,Chemicals,Healthcare,Pharmaceuticals and Medical Biotechnology,Retail
Impact Types Economic

 
Description Astra Zeneca Blue Skies fund
Amount £100,000 (GBP)
Organisation AstraZeneca 
Department Astra Zeneca
Sector Private
Country United States
Start 04/2018 
End 04/2019
 
Description Faculty awards
Amount $50,000 (USD)
Organisation Google 
Sector Private
Country United States
Start 01/2011 
 
Description Genome3D
Amount £93,404 (GBP)
Funding ID BB/I02500X/1 
Organisation Biotechnology and Biological Sciences Research Council (BBSRC) 
Sector Public
Country United Kingdom
Start 11/2011 
End 10/2013
 
Description TRDF
Amount £80,501 (GBP)
Funding ID BB/L018543/1 
Organisation Biotechnology and Biological Sciences Research Council (BBSRC) 
Sector Public
Country United Kingdom
Start 05/2014 
End 04/2015
 
Description FANTOM 
Organisation RIKEN
Country Japan 
Sector Public 
PI Contribution Bioinformatics
Collaborator Contribution High-throughput data production.
Impact Multi-disciplinary bioinformatics and molecular biology.
 
Description Genome3D 
Organisation Imperial College London
Country United Kingdom 
Sector Academic/University 
PI Contribution The Genome3D consortium was founded with SUPERFAMILY as a founding member
Start Year 2011
 
Description Genome3D 
Organisation Medical Research Council (MRC)
Department MRC Laboratory of Molecular Biology (LMB)
Country United Kingdom 
Sector Academic/University 
PI Contribution The Genome3D consortium was founded with SUPERFAMILY as a founding member
Start Year 2011
 
Description Genome3D 
Organisation University College London
Country United Kingdom 
Sector Academic/University 
PI Contribution The Genome3D consortium was founded with SUPERFAMILY as a founding member
Start Year 2011
 
Company Name Genetrainer 
Description Genetically guided fitness 
Year Established 2013 
Impact No public product launch yet.
Website http://genetrainer.com
 
Description Cambridge Academy for Science and Technology: Challenge Project 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Schools
Results and Impact Arun Pandurangan's activity involved a talk to secondary school students. The activity was part of the Challenge projects conducted by the Cambridge Academy for Science and Technology in partnership with MRC-LMB. During my activity, I shared my journey in science and explained to the students the importance of scientific posters and how to prepare and present them. The presentation was followed by a Q&A session.
Year(s) Of Engagement Activity 2018
 
Description STEM career talk 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Undergraduate students
Results and Impact Delivered a STEM career talk at Cambridge Regional College (CRC) to raise awareness of STEM subjects and the importance of interdisciplinary research. I addressed a class on the Access to Higher Education programme.
Year(s) Of Engagement Activity 2019
 
Description Sidney Sussex open day 
Form Of Engagement Activity Participation in an open day or visit at my research institution
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Schools
Results and Impact Delivered an interdisciplinary talk on computational biology to A-level students as part of their visit to the College during the Open Day.
Year(s) Of Engagement Activity 2019