Support for the SUPERFAMILY protein domain resource.
Lead Research Organisation:
University of Bristol
Department Name: Computer Science
Abstract
The SUPERFAMILY resource detects and classifies protein domains of known structure in genome sequences. Small proteins are a single unit but larger proteins can be made up of multiple subunits we call domains. Domains are modular evolutionary blocks which are assembled into whole proteins via duplication and recombination. X-ray crystallography and NMR experiments provide the 3D atomic resolution of proteins allowing the domains to be grouped into related families which often share a common or related function. The SUPERFAMILY database contains a library of profiles of these domain families in the form of hidden Markov models. These models are a computational tool which can detect the presence of domains in the sequences of proteins. Some years ago the first complete genome was experimentally characterised, giving us a list of all the sequences of the proteins which make up that organism. Subsequently the human genome was sequenced and now we have the complete sequences for the proteins of approaching 1,000 organisms. The SUPERFAMILY model library is run against all the genomes to identify the domains in the proteins. Our knowledge of domain families is not complete, so the assignments from the hidden Markov models cover only about half of the protein sequences, but this is still extremely valuable information. The data produced by the SUPERFAMILY analysis can be used for example by biologists working on specific proteins in the laboratory, larger projects working on a whole genome, or to improve our understanding of molecular evolution across all genomes and all kingdoms of life. The SUPERFAMILY website enables users to enter sequences to search against the model library. The results of the domain assignments to all the genomes are stored in a database and can also be viewed on the website. There are many tools and ways of browsing the data which allow the comparison of different organisms, proteins and domains to allow researchers to answer biological questions. 
The data, software and model library are available for people to download wholesale to carry out their own analysis. The information contained in SUPERFAMILY feeds into several other websites and resources, e.g. the ENSEMBL human genome website, which bring together different specialist sources of data to display alongside each other.
Technical Summary
The SUPERFAMILY resource detects and classifies protein domains in genome sequences. The domain definitions are taken from the SCOP hierarchy and searched against all completely sequenced genomes using hidden Markov models. The resource contains four main components accessed by end users: a database of over 14 million domain assignments, a library of over 14 thousand hidden Markov models, numerous analysis tools, and a web interface to all of these. The Structural Classification of Proteins (SCOP) database classifies the proteins of solved 3D structure in the PDB. Domains are defined as minimum units of evolution, and the domains are hierarchically grouped into superfamilies and families. There are 3464 families contained in 1777 superfamilies, totalling 97178 domain definitions. SUPERFAMILY maps these families and superfamilies onto sequence datasets including all completely sequenced genomes, totalling over 14 million domains. SUPERFAMILY currently has comprehensive inclusion of genomes, but advances in sequencing technology are rapidly increasing the number which need to be included. The detection and classification of domains in genome sequences is achieved using hidden Markov model (HMM) technology, enhanced indirectly via structural knowledge. A hand-curated library of models representing the superfamilies forms part of an assignment procedure which detects domains in protein sequences. The assignment procedure then classifies the domain into the relevant superfamily and family, also listing the closest solved structure. SUPERFAMILY does not just implement cutting-edge software: its development process also involves creating new algorithms and contributing to the development of HMM technology. The analysis tools are an essential part of the resource, enabling users inexperienced in computational work to share the deeper benefits available from data mining, comparative genomics and visualisation, which are usually accessible only to more expert users.
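The assignment step described above can be pictured with a small sketch. This is not the SUPERFAMILY assignment procedure itself, which the summary does not specify in detail; it is a minimal illustration of one common way to turn model-versus-sequence matches into domain assignments, namely greedy selection of non-overlapping hits ordered by E-value. The `Hit` record, the `assign_domains` function, the superfamily labels and the `1e-4` threshold are all hypothetical and chosen only for illustration.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One hypothetical model-vs-sequence match (field names are illustrative)."""
    superfamily: str  # superfamily label of the matching model
    start: int        # first residue of the match, 1-based inclusive
    end: int          # last residue of the match, inclusive
    evalue: float     # smaller means more significant


def overlaps(a: Hit, b: Hit) -> bool:
    """True when the two residue ranges share at least one position."""
    return a.start <= b.end and b.start <= a.end


def assign_domains(hits: list[Hit], max_evalue: float = 1e-4) -> list[Hit]:
    """Keep significant hits, best E-value first, skipping overlapping ones.

    The default threshold is an assumed value, not one taken from SUPERFAMILY.
    """
    chosen: list[Hit] = []
    for hit in sorted(hits, key=lambda h: h.evalue):
        if hit.evalue > max_evalue:
            break  # hits are sorted, so every remaining hit is weaker
        if not any(overlaps(hit, kept) for kept in chosen):
            chosen.append(hit)
    # Report the accepted domains in sequence order.
    return sorted(chosen, key=lambda h: h.start)


# Made-up hits on a single protein sequence.
example_hits = [
    Hit("superfamily_A", 5, 120, 1e-30),
    Hit("superfamily_B", 100, 210, 1e-12),  # overlaps A and is weaker
    Hit("superfamily_C", 130, 260, 1e-8),
    Hit("superfamily_D", 300, 380, 0.5),    # above the assumed threshold
]
domains = assign_domains(example_hits)
for d in domains:
    print(d.superfamily, d.start, d.end, d.evalue)
```

In this toy input the hit for `superfamily_B` is dropped because it overlaps a stronger hit, and `superfamily_D` is dropped for failing the threshold. A real pipeline would typically tolerate small overlaps and would also assign a family and closest structure within each superfamily, which this sketch does not attempt.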
People |
ORCID iD |
Julian Gough (Principal Investigator) |
Publications
De Lima Morais DA
(2011)
SUPERFAMILY 1.75 including a domain-centric gene ontology method.
in Nucleic acids research
Lizio M
(2017)
Update of the FANTOM web resource: high resolution transcriptome of diverse cell types in mammals.
in Nucleic acids research
Smithers B
(2019)
'Why genes in pieces?'-revisited.
in Nucleic acids research
Smithers B
(2015)
Splice junctions are constrained by protein disorder.
in Nucleic acids research
Oates ME
(2015)
The SUPERFAMILY 1.75 database in 2014: a doubling of data.
in Nucleic acids research
Fang H
(2013)
DcGO: database of domain-centric ontologies on functions, phenotypes, diseases and more.
in Nucleic acids research
Oates ME
(2013)
D²P²: database of disordered protein predictions.
in Nucleic acids research
Fang H
(2011)
A topology-preserving selection and clustering approach to multidimensional biological data.
in Omics : a journal of integrative biology
Zhou B
(2017)
A Subset of Ubiquitin-Conjugating Enzymes Is Essential for Plant Immunity.
in Plant physiology
Pethica R
(2010)
TreeVector: scalable, interactive, phylogenetic trees for the web.
in PloS one
Description | This is a resource rather than a research project and consists mostly of deliverables rather than findings. A user survey was conducted both in person at ISMB and online for those not attending the conference. The key findings were that most users were satisfied with the resource, but that most of them were not aware of the more advanced features. |
Exploitation Route | We have a large userbase who access the website and resources. |
Sectors | Healthcare,Pharmaceuticals and Medical Biotechnology |
URL | http://supfam.org |
Description | The SUPERFAMILY resource is cited in approximately 100 patents, mostly regarding protein mutants, but also including patents relating to the innate immune system, detergents, antimicrobial agents, computer software, and plant yield. There is even one on cake mix. |
First Year Of Impact | 2004 |
Sector | Agriculture, Food and Drink,Chemicals,Healthcare,Pharmaceuticals and Medical Biotechnology,Retail |
Impact Types | Economic |
Description | Astra Zeneca Blue Skies fund |
Amount | £100,000 (GBP) |
Organisation | AstraZeneca |
Department | Astra Zeneca |
Sector | Private |
Country | United States |
Start | 04/2018 |
End | 04/2019 |
Description | Faculty awards |
Amount | $50,000 (USD) |
Organisation | |
Sector | Private |
Country | United States |
Start | 01/2011 |
Description | Genome3D |
Amount | £93,404 (GBP) |
Funding ID | BB/I02500X/1 |
Organisation | Biotechnology and Biological Sciences Research Council (BBSRC) |
Sector | Public |
Country | United Kingdom |
Start | 11/2011 |
End | 10/2013 |
Description | TRDF |
Amount | £80,501 (GBP) |
Funding ID | BB/L018543/1 |
Organisation | Biotechnology and Biological Sciences Research Council (BBSRC) |
Sector | Public |
Country | United Kingdom |
Start | 05/2014 |
End | 04/2015 |
Description | FANTOM |
Organisation | RIKEN |
Country | Japan |
Sector | Public |
PI Contribution | Bioinformatics |
Collaborator Contribution | High-throughput data production. |
Impact | Multi-disciplinary bioinformatics and molecular biology. |
Description | Genome3D |
Organisation | Imperial College London |
Country | United Kingdom |
Sector | Academic/University |
PI Contribution | The Genome3D consortium was founded with SUPERFAMILY as a founding member |
Start Year | 2011 |
Description | Genome3D |
Organisation | Medical Research Council (MRC) |
Department | MRC Laboratory of Molecular Biology (LMB) |
Country | United Kingdom |
Sector | Academic/University |
PI Contribution | The Genome3D consortium was founded with SUPERFAMILY as a founding member |
Start Year | 2011 |
Description | Genome3D |
Organisation | University College London |
Country | United Kingdom |
Sector | Academic/University |
PI Contribution | The Genome3D consortium was founded with SUPERFAMILY as a founding member |
Start Year | 2011 |
Company Name | Genetrainer |
Description | Genetically guided fitness |
Year Established | 2013 |
Impact | No public product launch yet. |
Website | http://genetrainer.com |
Description | Cambridge Academy for Science and Technology: Challenge Project |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | Local |
Primary Audience | Schools |
Results and Impact | Arun Pandurangan's activity involved a presentation/talk before Secondary School students. The activity was part of the Challenge projects conducted by Cambridge Academy for Science and Technology in partnership with MRC-LMB. During my activity, I shared my journey in Science and explained to students the importance of scientific posters and how to prepare and present them. The presentation was followed by a Q&A session. |
Year(s) Of Engagement Activity | 2018 |
Description | STEM career talk |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | Local |
Primary Audience | Undergraduate students |
Results and Impact | Delivered a STEM career talk at the Cambridge Regional College (CRC) to raise awareness about STEM subjects and the importance of doing interdisciplinary research. I addressed a class doing the Access to Higher Education programme. |
Year(s) Of Engagement Activity | 2019 |
Description | Sidney Sussex open day |
Form Of Engagement Activity | Participation in an open day or visit at my research institution |
Part Of Official Scheme? | No |
Geographic Reach | Local |
Primary Audience | Schools |
Results and Impact | Delivered an interdisciplinary talk on Computational Biology to A-level students as part of their visit to the College during the Open Day. |
Year(s) Of Engagement Activity | 2019 |