Support for the SUPERFAMILY protein domain resource.
Lead Research Organisation:
University of Bristol
Department Name: Computer Science
Abstract
The SUPERFAMILY resource detects and classifies protein domains of known structure in genome sequences. Small proteins are a single unit but larger proteins can be made up of multiple subunits we call domains. Domains are modular evolutionary blocks which are assembled into whole proteins via duplication and recombination. X-ray crystallography and NMR experiments provide the 3D atomic resolution of proteins allowing the domains to be grouped into related families which often share a common or related function. The SUPERFAMILY database contains a library of profiles of these domain families in the form of hidden Markov models. These models are a computational tool which can detect the presence of domains in the sequences of proteins. Some years ago the first complete genome was experimentally characterised, giving us a list of all the sequences of the proteins which make up that organism. Subsequently the human genome was sequenced and now we have the complete sequences for the proteins of approaching 1,000 organisms. The SUPERFAMILY model library is run against all the genomes to identify the domains in the proteins. Our knowledge of domain families is not complete, so the assignments from the hidden Markov models cover only about half of the protein sequences, but this is still extremely valuable information. The data produced by the SUPERFAMILY analysis can be used for example by biologists working on specific proteins in the laboratory, larger projects working on a whole genome, or to improve our understanding of molecular evolution across all genomes and all kingdoms of life. The SUPERFAMILY website enables users to enter sequences to search against the model library. The results of the domain assignments to all the genomes are stored in a database and can also be viewed on the website. There are many tools and ways of browsing the data which allow the comparison of different organisms, proteins and domains to allow researchers to answer biological questions. 
The data, software and model library are available for people to download wholesale to carry out their own analysis. The information contained in SUPERFAMILY feeds into several other websites and resources, e.g. the ENSEMBL human genome website, which bring together different specialist sources of data to display alongside each other.
Technical Summary
The SUPERFAMILY resource detects and classifies protein domains in genome sequences. The domain definitions are taken from the SCOP hierarchy and searched against all completely sequenced genomes using hidden Markov models. The resource contains four main components accessed by end users: a database of over 14 million domain assignments, a library of over 14 thousand hidden Markov models, numerous analysis tools, and a web interface to all of these. The Structural Classification of Proteins (SCOP) database classifies the proteins of solved 3D structure in the PDB. Domains are defined as minimum units of evolution, and the domains are hierarchically grouped into superfamilies and families. There are 3464 families contained in 1777 superfamilies, totalling 97178 domain definitions. SUPERFAMILY maps these families and superfamilies onto sequence datasets including all completely sequenced genomes, totalling over 14 million domains. SUPERFAMILY currently has comprehensive inclusion of genomes, but advances in sequencing technology are rapidly increasing the number which need to be included. The detection and classification of domains in genome sequences is achieved using hidden Markov model (HMM) technology, enhanced indirectly via structural knowledge. A hand-curated library of models representing the superfamilies forms part of an assignment procedure which detects domains in protein sequences. The assignment procedure then classifies the domain into the relevant superfamily and family, also listing the closest solved structure. SUPERFAMILY does not just implement cutting-edge software: its development process also involves creating new algorithms and contributing to the development of HMM technology. The analysis tools are an essential part of the resource, enabling users inexperienced in computational work to share the deeper benefits of data mining, comparative genomics and visualisation, which are usually accessible only to experts.
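As a rough illustration of how a library of hidden Markov models can assign a sequence to a class, the sketch below scores a sequence against each model with the forward algorithm and keeps the best log-odds score relative to a background model. Everything in it is an assumption made for illustration: the two-letter alphabet (`H` for hydrophobic, `P` for polar), the two toy models, the model names, and the zero threshold are hypothetical and are not taken from the SUPERFAMILY library or its software, which uses far richer profile models.

```python
import math

NEG_INF = float("-inf")


def _log(p):
    # Natural log that maps zero probability to -inf instead of raising.
    return math.log(p) if p > 0 else NEG_INF


def _log_sum_exp(values):
    # Numerically stable log(sum(exp(v))) over a list of log-probabilities.
    m = max(values)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(v - m) for v in values))


def forward_log_likelihood(seq, model):
    # Forward algorithm: log P(seq | model) for a discrete HMM.
    # Assumes seq is non-empty.
    start, trans, emit = model["start"], model["trans"], model["emit"]
    states = range(len(start))
    alpha = [_log(start[s]) + _log(emit[s].get(seq[0], 0.0)) for s in states]
    for symbol in seq[1:]:
        alpha = [
            _log_sum_exp([alpha[p] + _log(trans[p][s]) for p in states])
            + _log(emit[s].get(symbol, 0.0))
            for s in states
        ]
    return _log_sum_exp(alpha)


# Hypothetical toy library: one model favours alternating H/P, one favours runs of H.
LIBRARY = {
    "alternating": {
        "start": [0.5, 0.5],
        "trans": [[0.1, 0.9], [0.9, 0.1]],
        "emit": [{"H": 0.9, "P": 0.1}, {"H": 0.1, "P": 0.9}],
    },
    "core": {
        "start": [1.0],
        "trans": [[1.0]],
        "emit": [{"H": 0.9, "P": 0.1}],
    },
}

# Background (null) model: every symbol equally likely at every position.
NULL_MODEL = {"start": [1.0], "trans": [[1.0]], "emit": [{"H": 0.5, "P": 0.5}]}


def assign_domain(seq, threshold=0.0):
    # Return (model_name, log_odds) for the best model, or None if no model
    # scores above the threshold relative to the background model.
    null_ll = forward_log_likelihood(seq, NULL_MODEL)
    scores = {
        name: forward_log_likelihood(seq, model) - null_ll
        for name, model in LIBRARY.items()
    }
    best = max(scores, key=scores.get)
    return (best, scores[best]) if scores[best] > threshold else None
```

With these toy models, `assign_domain("HPHPHP")` selects the `alternating` model, `assign_domain("HHHHHH")` selects `core`, and a sequence that neither model explains better than the background, such as `"PPPPPP"`, is left unassigned, loosely mirroring the fact that only part of any genome receives an assignment.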
People |
ORCID iD |
Julian Gough (Principal Investigator) |
Publications
Rackham OJ
(2010)
The evolution and structure prediction of coiled coils across all genomes.
in Journal of molecular biology
Pethica R
(2010)
TreeVector: scalable, interactive, phylogenetic trees for the web.
in PloS one
Madera M
(2010)
Improving protein secondary structure prediction using a simple k-mer model.
in Bioinformatics (Oxford, England)
Ravasi T
(2010)
An atlas of combinatorial transcriptional regulation in mouse and man.
in Cell
Fang H
(2010)
Transcriptome analysis of early organogenesis in human embryos.
in Developmental cell
Fang H
(2010)
The evolutionary dynamics of protein networks
in Genome Biology
Wang K
(2010)
PML/RARalpha targets promoter regions containing PU.1 consensus and RARE half sites in acute promyelocytic leukemia.
in Cancer cell
Hunter S
(2011)
InterPro in 2011: new developments in the family and domain prediction database
in Nucleic Acids Research
Dunker AK
(2011)
Sequences and topology: intrinsic disorder in the evolving universe of protein structure.
in Current opinion in structural biology
De Lima Morais DA
(2011)
SUPERFAMILY 1.75 including a domain-centric gene ontology method.
in Nucleic acids research
Fang H
(2011)
A topology-preserving selection and clustering approach to multidimensional biological data.
in Omics : a journal of integrative biology
Abroi A
(2011)
Are viruses a source of new protein folds for organisms? - Virosphere structure space and evolution.
in BioEssays : news and reviews in molecular, cellular and developmental biology
Pethica RB
(2012)
Evolutionarily consistent families in SCOP: sequence, structure and function.
in BMC structural biology
Fang H
(2013)
A domain-centric solution to functional genomics via dcGO Predictor.
in BMC bioinformatics
Fang H
(2013)
A disease-drug-phenotype matrix inferred by walking on a functional domain network.
in Molecular bioSystems
Radivojac P
(2013)
A large-scale evaluation of computational protein function prediction.
in Nature methods
Fang H
(2013)
A daily-updated tree of (sequenced) life as a reference for genome research.
in Scientific reports
Shihab HA
(2013)
Predicting the functional, molecular, and phenotypic consequences of amino acid substitutions using hidden Markov models.
in Human mutation
Fang H
(2013)
DcGO: database of domain-centric ontologies on functions, phenotypes, diseases and more.
in Nucleic acids research
Oates ME
(2013)
D²P²: database of disordered protein predictions.
in Nucleic acids research
Shihab HA
(2013)
Predicting the functional consequences of cancer-associated amino acid substitutions.
in Bioinformatics (Oxford, England)
Gough J
(2013)
Sequences and topology: disorder, modularity, and post/pre translation modification.
in Current opinion in structural biology
Burton BR
(2014)
Sequential transcriptional changes dictate safe and effective antigen-specific immunotherapy.
in Nature communications
Venkatakrishnan AJ
(2014)
Structured and disordered facets of the GPCR fold.
in Current opinion in structural biology
FANTOM Consortium And The RIKEN PMI And CLST (DGT)
(2014)
A promoter-level mammalian expression atlas.
in Nature
Vavoulis DV
(2014)
Non-Parametric Bayesian Modelling of Digital Gene Expression Data
in Journal of Computer Science & Systems Biology
Van Der Lee R
(2014)
Classification of intrinsically disordered regions and proteins.
in Chemical reviews
Shihab HA
(2014)
Ranking non-synonymous single nucleotide polymorphisms based on disease concepts.
in Human genomics
Fang H
(2014)
supraHex: an R/Bioconductor package for tabular omics data analysis using a supra-hexagonal map.
in Biochemical and biophysical research communications
Sardar AJ
(2014)
The evolution of human cells in terms of protein innovation.
in Molecular biology and evolution
Fang H
(2014)
The 'dnet' approach promotes emerging research on cancer patient survival.
in Genome medicine
Mitchell A
(2015)
The InterPro protein families database: the classification resource after 15 years.
in Nucleic acids research
Baumgarten S
(2015)
The genome of Aiptasia, a sea anemone model for coral symbiosis
in Proceedings of the National Academy of Sciences
Vavoulis DV
(2015)
DGEclust: differential expression analysis of clustered count data.
in Genome biology
Lewis TE
(2015)
Genome3D: exploiting structure to help users understand their sequences.
in Nucleic acids research
Zaucha J
(2015)
A proteome quality index.
in Environmental microbiology
Shihab HA
(2015)
An integrative approach to predicting the functional effects of non-coding and coding sequence variation.
in Bioinformatics (Oxford, England)
Smithers B
(2015)
Splice junctions are constrained by protein disorder.
in Nucleic acids research
Oates ME
(2015)
The SUPERFAMILY 1.75 database in 2014: a doubling of data.
in Nucleic acids research
Linkeviciute V
(2015)
Function-selective domain architecture plasticity potentials in eukaryotic genome evolution.
in Biochimie
Smithers B
(2016)
Three reasons protein disorder analysis makes more sense in the light of collagen.
in Protein science : a publication of the Protein Society
Ryu T
(2016)
Hologenome analysis of two marine sponges with different microbiomes.
in BMC genomics
Harish A
(2016)
Did Viruses Evolve As a Distinct Supergroup from Common Ancestors of Cells?
in Genome biology and evolution
Latysheva N
(2016)
Molecular Principles of Gene Fusion Mediated Rewiring of Protein Interaction Networks in Cancer
in Molecular Cell
Jiang Y
(2016)
An expanded evaluation of protein function prediction methods shows an improvement in accuracy.
in Genome biology
Zhou B
(2017)
A Subset of Ubiquitin-Conjugating Enzymes Is Essential for Plant Immunity.
in Plant physiology
Description | This is a resource rather than a research project and consists mostly of deliverables rather than findings. A user survey was conducted both in person at ISMB and online for those not attending the conference. The key findings were that most users were satisfied with the resource, but that most of them were not aware of the more advanced features. |
Exploitation Route | We have a large userbase who access the website and resources. |
Sectors | Healthcare,Pharmaceuticals and Medical Biotechnology |
URL | http://supfam.org |
Description | The SUPERFAMILY resource is cited in approximately 100 patents, mostly regarding protein mutants, but also including patents relating to the innate immune system, detergents, antimicrobial agents, computer software, and plant yield. There is even one on cake mix. |
First Year Of Impact | 2004 |
Sector | Agriculture, Food and Drink,Chemicals,Healthcare,Pharmaceuticals and Medical Biotechnology,Retail |
Impact Types | Economic |
Description | Astra Zeneca Blue Skies fund |
Amount | £100,000 (GBP) |
Organisation | AstraZeneca |
Department | Astra Zeneca |
Sector | Private |
Country | United States |
Start | 04/2018 |
End | 04/2019 |
Description | Faculty awards |
Amount | $50,000 (USD) |
Organisation | |
Sector | Private |
Country | United States |
Start | 01/2011 |
Description | Genome3D |
Amount | £93,404 (GBP) |
Funding ID | BB/I02500X/1 |
Organisation | Biotechnology and Biological Sciences Research Council (BBSRC) |
Sector | Public |
Country | United Kingdom |
Start | 11/2011 |
End | 10/2013 |
Description | TRDF |
Amount | £80,501 (GBP) |
Funding ID | BB/L018543/1 |
Organisation | Biotechnology and Biological Sciences Research Council (BBSRC) |
Sector | Public |
Country | United Kingdom |
Start | 05/2014 |
End | 04/2015 |
Description | FANTOM |
Organisation | RIKEN |
Country | Japan |
Sector | Public |
PI Contribution | Bioinformatics |
Collaborator Contribution | High-throughput data production. |
Impact | Multi-disciplinary bioinformatics and molecular biology. |
Description | Genome3D |
Organisation | Imperial College London |
Country | United Kingdom |
Sector | Academic/University |
PI Contribution | The Genome3D consortium was founded with SUPERFAMILY as a founding member |
Start Year | 2011 |
Description | Genome3D |
Organisation | Medical Research Council (MRC) |
Department | MRC Laboratory of Molecular Biology (LMB) |
Country | United Kingdom |
Sector | Academic/University |
PI Contribution | The Genome3D consortium was founded with SUPERFAMILY as a founding member |
Start Year | 2011 |
Description | Genome3D |
Organisation | University College London |
Country | United Kingdom |
Sector | Academic/University |
PI Contribution | The Genome3D consortium was founded with SUPERFAMILY as a founding member |
Start Year | 2011 |
Company Name | Genetrainer |
Description | Genetically guided fitness |
Year Established | 2013 |
Impact | No public product launch yet. |
Website | http://genetrainer.com |
Description | Cambridge Academy for Science and Technology: Challenge Project |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | Local |
Primary Audience | Schools |
Results and Impact | Arun Pandurangan's activity involved a presentation/talk to secondary school students. The activity was part of the Challenge projects conducted by Cambridge Academy for Science and Technology in partnership with MRC-LMB. During my activity, I shared my journey in science and explained to the students the importance of scientific posters and how to prepare and present them. The presentation was followed by a Q&A session. |
Year(s) Of Engagement Activity | 2018 |
Description | STEM career talk |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | Local |
Primary Audience | Undergraduate students |
Results and Impact | Delivered a STEM career talk at the Cambridge Regional College (CRC) to raise awareness of STEM subjects and the importance of interdisciplinary research. I addressed a class doing the Access to Higher Education programme. |
Year(s) Of Engagement Activity | 2019 |
Description | Sidney Sussex open day |
Form Of Engagement Activity | Participation in an open day or visit at my research institution |
Part Of Official Scheme? | No |
Geographic Reach | Local |
Primary Audience | Schools |
Results and Impact | Delivered an interdisciplinary talk on Computational Biology to A-level students as part of their visit to the College during the Open Day. |
Year(s) Of Engagement Activity | 2019 |