Support for the SUPERFAMILY protein domain resource.

Lead Research Organisation: University of Bristol

Department Name: Computer Science

Abstract

The SUPERFAMILY resource detects and classifies protein domains of known structure in genome sequences. Small proteins are a single unit, but larger proteins can be made up of multiple subunits we call domains. Domains are modular evolutionary blocks which are assembled into whole proteins via duplication and recombination. X-ray crystallography and NMR experiments provide the 3D structures of proteins at atomic resolution, allowing the domains to be grouped into related families, which often share a common or related function. The SUPERFAMILY database contains a library of profiles of these domain families in the form of hidden Markov models. These models are a computational tool which can detect the presence of domains in the sequences of proteins. Some years ago the first complete genome was experimentally characterised, giving us a list of all the sequences of the proteins which make up that organism. Subsequently the human genome was sequenced, and we now have the complete protein sequences of nearly 1,000 organisms. The SUPERFAMILY model library is run against all the genomes to identify the domains in the proteins. Our knowledge of domain families is not complete, so the assignments from the hidden Markov models cover only about half of the protein sequences, but this is still extremely valuable information. The data produced by the SUPERFAMILY analysis can be used, for example, by biologists working on specific proteins in the laboratory, by larger projects working on a whole genome, or to improve our understanding of molecular evolution across all genomes and all kingdoms of life. The SUPERFAMILY website enables users to enter sequences to search against the model library. The results of the domain assignments to all the genomes are stored in a database and can also be viewed on the website. Many tools and ways of browsing the data allow different organisms, proteins and domains to be compared, helping researchers to answer biological questions.
The data, software and model library are available for people to download wholesale to carry out their own analysis. The information contained in SUPERFAMILY feeds into several other websites and resources, e.g. the ENSEMBL human genome website, which bring together different specialist sources of data to display alongside each other.

Technical Summary

The SUPERFAMILY resource detects and classifies protein domains in genome sequences. The domain definitions are taken from the SCOP hierarchy and searched against all completely sequenced genomes using hidden Markov models. The resource contains four main components accessed by end users: a database of over 14 million domain assignments, a library of over 14 thousand hidden Markov models, numerous analysis tools, and a web interface to all of these. The Structural Classification of Proteins (SCOP) database classifies the proteins of solved 3D structure in the PDB. Domains are defined as minimum units of evolution, and the domains are hierarchically grouped into superfamilies and families. There are 3464 families contained in 1777 superfamilies, totalling 97178 domain definitions. SUPERFAMILY maps these families and superfamilies onto sequence datasets including all completely sequenced genomes, totalling over 14 million domains. SUPERFAMILY currently has comprehensive inclusion of genomes, but advances in sequencing technology are rapidly increasing the number which need to be included. The detection and classification of domains in genome sequences is achieved using hidden Markov model (HMM) technology, enhanced indirectly via structural knowledge. A hand-curated library of models representing the superfamilies forms part of an assignment procedure which detects domains in protein sequences. The assignment procedure then classifies the domain into the relevant superfamily and family, also listing the closest solved structure. SUPERFAMILY does not just implement cutting-edge software; its development process also involves creating new algorithms and contributing to the development of HMM technology. The analysis tools are an essential part of the resource, enabling users inexperienced in computational work to share the deeper benefits available from data mining, comparative genomics and visualisation, which are usually accessible only to more expert users.
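
The assignment step described above can be illustrated with a minimal sketch. It is not the SUPERFAMILY procedure itself: it assumes that an HMM search has already produced candidate hits, and it applies one simple, commonly used rule (accept hits in order of increasing E-value, skipping any hit that overlaps a region already assigned). The `Hit` type, the superfamily labels, the coordinates, the E-values and the `max_evalue` threshold are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One candidate domain hit on a protein sequence (hypothetical record)."""
    superfamily: str  # illustrative label, not a real SCOP identifier
    start: int        # first residue covered, 1-based inclusive
    end: int          # last residue covered, inclusive
    evalue: float     # smaller means more significant


def overlaps(a: Hit, b: Hit) -> bool:
    """Return True if the two hits share at least one residue."""
    return a.start <= b.end and b.start <= a.end


def assign_domains(hits, max_evalue=1e-4):
    """Greedily keep the most significant non-overlapping hits.

    max_evalue is an assumed cutoff for this sketch only.
    """
    accepted = []
    for hit in sorted(hits, key=lambda h: h.evalue):
        if hit.evalue > max_evalue:
            break  # all remaining hits are less significant
        if not any(overlaps(hit, kept) for kept in accepted):
            accepted.append(hit)
    # Report assignments in sequence order
    return sorted(accepted, key=lambda h: h.start)


# Invented example hits for a single protein
hits = [
    Hit("sf_A", 5, 120, 1e-30),
    Hit("sf_B", 100, 210, 1e-8),
    Hit("sf_C", 130, 260, 1e-15),
    Hit("sf_D", 270, 330, 0.5),
]

for domain in assign_domains(hits):
    print(domain.superfamily, domain.start, domain.end, domain.evalue)
```

In this toy input, `sf_B` is dropped because it overlaps the more significant `sf_A`, and `sf_D` is dropped because its E-value is above the assumed cutoff. A real pipeline would also need to handle partial overlaps, discontinuous domains and family-level classification, none of which is attempted here.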

Funded Value:

£684,409

Funded Period:

Feb 10 - Jan 15

Funder:

BBSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

BB/G022771/1

Principal Investigator:

Julian Gough

Research Subject:

Biomolecules & biochemistry (38%)

Omic sciences & technologies (24%)

Tools, technologies & methods (12%)

Research Topic:

Bioinformatics (12%)

Protein expression (13%)

Protein folding / misfolding (13%)

Proteomics (24%)

Structural biology (12%)

Organisations

People	ORCID iD
Julian Gough (Principal Investigator)

Publications

Author Name Title Publication Date Published

Rackham OJ (2010) The evolution and structure prediction of coiled coils across all genomes. in Journal of molecular biology

Pethica R (2010) TreeVector: scalable, interactive, phylogenetic trees for the web. in PloS one

Madera M (2010) Improving protein secondary structure prediction using a simple k-mer model. in Bioinformatics (Oxford, England)

Ravasi T (2010) An atlas of combinatorial transcriptional regulation in mouse and man. in Cell

Fang H (2010) Transcriptome analysis of early organogenesis in human embryos. in Developmental cell

Fang H (2010) The evolutionary dynamics of protein networks in Genome Biology

Wang K (2010) PML/RARalpha targets promoter regions containing PU.1 consensus and RARE half sites in acute promyelocytic leukemia. in Cancer cell

Hunter S (2011) InterPro in 2011: new developments in the family and domain prediction database in Nucleic Acids Research

Dunker AK (2011) Sequences and topology: intrinsic disorder in the evolving universe of protein structure. in Current opinion in structural biology

De Lima Morais DA (2011) SUPERFAMILY 1.75 including a domain-centric gene ontology method. in Nucleic acids research

Chavali S (2011) Evolution of eukaryotic genome architecture: Insights from the study of a rapidly evolving metazoan, Oikopleura dioica. Non-adaptive forces such as elevated mutation rates may influence the evolution of genome architecture. in BioEssays

Fang H (2011) A topology-preserving selection and clustering approach to multidimensional biological data. in Omics : a journal of integrative biology

Abroi A (2011) Are viruses a source of new protein folds for organisms? - Virosphere structure space and evolution. in BioEssays : news and reviews in molecular, cellular and developmental biology

Pethica RB (2012) Evolutionarily consistent families in SCOP: sequence, structure and function. in BMC structural biology

Gough J (2013) Protein Families - Relating Protein Sequence, Structure, and Function

Fang H (2013) A domain-centric solution to functional genomics via dcGO Predictor. in BMC bioinformatics

Fang H (2013) A disease-drug-phenotype matrix inferred by walking on a functional domain network. in Molecular bioSystems

Radivojac P (2013) A large-scale evaluation of computational protein function prediction. in Nature methods

Fang H (2013) A daily-updated tree of (sequenced) life as a reference for genome research. in Scientific reports

Shihab HA (2013) Predicting the functional, molecular, and phenotypic consequences of amino acid substitutions using hidden Markov models. in Human mutation

Fang H (2013) DcGO: database of domain-centric ontologies on functions, phenotypes, diseases and more. in Nucleic acids research

Oates ME (2013) D²P²: database of disordered protein predictions. in Nucleic acids research

Shihab HA (2013) Predicting the functional consequences of cancer-associated amino acid substitutions. in Bioinformatics (Oxford, England)

Gough J (2013) Sequences and topology: disorder, modularity, and post/pre translation modification. in Current opinion in structural biology

Burton BR (2014) Sequential transcriptional changes dictate safe and effective antigen-specific immunotherapy. in Nature communications

Venkatakrishnan AJ (2014) Structured and disordered facets of the GPCR fold. in Current opinion in structural biology

FANTOM Consortium And The RIKEN PMI And CLST (DGT) (2014) A promoter-level mammalian expression atlas. in Nature

Vavoulis DV (2014) Non-Parametric Bayesian Modelling of Digital Gene Expression Data in Journal of Computer Science & Systems Biology

Van Der Lee R (2014) Classification of intrinsically disordered regions and proteins. in Chemical reviews

Shihab HA (2014) Ranking non-synonymous single nucleotide polymorphisms based on disease concepts. in Human genomics

Fang H (2014) supraHex: an R/Bioconductor package for tabular omics data analysis using a supra-hexagonal map. in Biochemical and biophysical research communications

Sardar AJ (2014) The evolution of human cells in terms of protein innovation. in Molecular biology and evolution

Fang H (2014) The 'dnet' approach promotes emerging research on cancer patient survival. in Genome medicine

Mitchell A (2015) The InterPro protein families database: the classification resource after 15 years. in Nucleic acids research

Baumgarten S (2015) The genome of Aiptasia, a sea anemone model for coral symbiosis in Proceedings of the National Academy of Sciences

Vavoulis DV (2015) DGEclust: differential expression analysis of clustered count data. in Genome biology

Lewis TE (2015) Genome3D: exploiting structure to help users understand their sequences. in Nucleic acids research

Zaucha J (2015) A proteome quality index. in Environmental microbiology

Shihab HA (2015) An integrative approach to predicting the functional effects of non-coding and coding sequence variation. in Bioinformatics (Oxford, England)

Smithers B (2015) Splice junctions are constrained by protein disorder. in Nucleic acids research

Oates ME (2015) The SUPERFAMILY 1.75 database in 2014: a doubling of data. in Nucleic acids research

Linkeviciute V (2015) Function-selective domain architecture plasticity potentials in eukaryotic genome evolution. in Biochimie

Smithers B (2016) Three reasons protein disorder analysis makes more sense in the light of collagen. in Protein science : a publication of the Protein Society

Ryu T (2016) Hologenome analysis of two marine sponges with different microbiomes. in BMC genomics

Harish A (2016) Did Viruses Evolve As a Distinct Supergroup from Common Ancestors of Cells? in Genome biology and evolution

Latysheva N (2016) Molecular Principles of Gene Fusion Mediated Rewiring of Protein Interaction Networks in Cancer in Molecular Cell

Jiang Y (2016) An expanded evaluation of protein function prediction methods shows an improvement in accuracy. in Genome biology

Zhou B (2017) A Subset of Ubiquitin-Conjugating Enzymes Is Essential for Plant Immunity. in Plant physiology

Key Findings
Impact Summary
Further Funding
Collaboration
Spin Outs
Engagement Activities


Key Findings

Description	This is a resource rather than a research project and consists mostly of deliverables rather than findings. A user survey was conducted both in person at ISMB and online for those not attending the conference. The key findings were that most users were satisfied with the resource, but that most of them were not aware of the more advanced features.
Exploitation Route	We have a large userbase who access the website and resources.
Sectors	Healthcare; Pharmaceuticals and Medical Biotechnology
URL	http://supfam.org


Impact Summary

Description	The SUPERFAMILY resource is cited in approximately 100 patents, mostly regarding protein mutants, but also including patents relating to the innate immune system, detergents, antimicrobial agents, computer software, and plant yield. There is even one on cake mix.
First Year Of Impact	2004
Sector	Agriculture, Food and Drink; Chemicals; Healthcare; Pharmaceuticals and Medical Biotechnology; Retail
Impact Types	Economic


Further Funding

Description	Astra Zeneca Blue Skies fund
Amount	£100,000 (GBP)
Organisation	AstraZeneca
Department	Astra Zeneca
Sector	Private
Country	United States
Start	04/2018
End	04/2019


Description	Faculty awards
Amount	$50,000 (USD)
Organisation	Google
Sector	Private
Country	United States
Start	01/2011


Description	Genome3D
Amount	£93,404 (GBP)
Funding ID	BB/I02500X/1
Organisation	Biotechnology and Biological Sciences Research Council (BBSRC)
Sector	Public
Country	United Kingdom
Start	11/2011
End	10/2013


Description	TRDF
Amount	£80,501 (GBP)
Funding ID	BB/L018543/1
Organisation	Biotechnology and Biological Sciences Research Council (BBSRC)
Sector	Public
Country	United Kingdom
Start	05/2014
End	04/2015


Collaboration

Description	FANTOM
Organisation	RIKEN
Country	Japan
Sector	Public
PI Contribution	Bioinformatics
Collaborator Contribution	High-throughput data production.
Impact	Multi-disciplinary bioinformatics and molecular biology.


Description	Genome3D
Organisation	Imperial College London
Country	United Kingdom
Sector	Academic/University
PI Contribution	The Genome3D consortium was founded with SUPERFAMILY as a founding member
Start Year	2011


Description	Genome3D
Organisation	Medical Research Council (MRC)
Department	MRC Laboratory of Molecular Biology (LMB)
Country	United Kingdom
Sector	Academic/University
PI Contribution	The Genome3D consortium was founded with SUPERFAMILY as a founding member
Start Year	2011


Description	Genome3D
Organisation	University College London
Country	United Kingdom
Sector	Academic/University
PI Contribution	The Genome3D consortium was founded with SUPERFAMILY as a founding member
Start Year	2011


Spin Outs

Company Name	Genetrainer
Description	Genetically guided fitness
Year Established	2013
Impact	No public product launch yet.
Website	http://genetrainer.com


Engagement Activities

Description	Cambridge Academy for Science and Technology: Challenge Project
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	Local
Primary Audience	Schools
Results and Impact	Arun Pandurangan's activity involved a presentation/talk to secondary school students. The activity was part of the Challenge projects conducted by Cambridge Academy for Science and Technology in partnership with MRC-LMB. During my activity, I shared my journey in science and explained to students the importance of scientific posters and how to prepare and present them. The presentation was followed by a Q&A session.
Year(s) Of Engagement Activity	2018


Description	STEM career talk
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	Local
Primary Audience	Undergraduate students
Results and Impact	Delivered a STEM career talk at the Cambridge Regional College (CRC) to raise awareness of STEM subjects and the importance of interdisciplinary research. I addressed a class on the Access to Higher Education programme.
Year(s) Of Engagement Activity	2019


Description	Sidney Sussex open day
Form Of Engagement Activity	Participation in an open day or visit at my research institution
Part Of Official Scheme?	No
Geographic Reach	Local
Primary Audience	Schools
Results and Impact	Delivered an interdisciplinary talk on Computational Biology to A-level students as part of their visit to the College during the Open Day.
Year(s) Of Engagement Activity	2019
