GENOME-3D: a UK network providing structure-based annotations for genotype to phenotype studies

Lead Research Organisation: University College London
Department Name: Structural Molecular Biology

Abstract

The 3D structures of proteins are essential to fully characterise the sites mediating their molecular functions and their interactions with other proteins. However, whilst revolutionary technologies have enabled the sequencing of thousands of complete genomes, it is more challenging to determine the 3D structures of the proteins. Although the sequence repositories now contain >10 million protein sequences, fewer than 70,000 protein structures have been determined. Fortunately, in parallel with developments in sequencing technologies, powerful computational methods have emerged to predict the structure of a protein from its sequence. Currently these methods provide putative structures for ~80% of domain sequences from completed genomes, although the accuracy of this data varies from reasonably precise, when structures are modelled using templates based on close relatives, through to quite approximate, for models based on remote relatives and where proteins have no structurally characterised relatives. This project will bring together 6 internationally renowned UK groups involved in (1) classifying protein domains into evolutionary families (as this facilitates structure and function prediction) and/or (2) protein structure prediction. As regards the first activity, classification of protein structures, the two groups involved (SCOP, CATH) are the only groups, worldwide, providing this data. However, each applies somewhat different methodologies to make its assignments. Collaboration between these groups, in GENOME-3D, will involve comparison of domain structures and family classifications, leading to refinements of assignments and/or confidence levels where the methods disagree. Since manual curation of the data is essential and since the rate at which the structures are determined is increasing, collaborations will speed up classification by allowing the groups to share information on the more challenging assignments and to discuss outcomes.
For the second activity, structure prediction, the groups involved use technologies that vary in their sensitivity and in their ability to handle large numbers of sequences. Whilst SUPERFAMILY (based on SCOP) and Gene3D (based on CATH) provide greater coverage, they are less likely to recognise very remote homologues, where methods such as GenTHREADER, Phyre and Fugue perform better. For each sequence, we will combine predictions from these different resources and assign confidence for each residue position in a query sequence based on the number of methods that agree in their structural prediction. We will provide pre-calculated assignments and also allow dynamic queries on the methods. We will also build 3D models for the sequences with residue positions highlighted according to agreement between the methods. We will develop computational platforms that integrate the information provided by each resource. To distribute this data to the biological and medical community we will build a dedicated web site. We will also establish web servers that link the methods, i.e. run all the methods on query sequences and then report consensus assignments and highlight differences. In addition, the consensus classification and annotation data will be provided via two major international sites, the PDBe and InterPro. The sequence repositories are expanding at phenomenal rates as metagenomics and next-generation sequencing initiatives bring in sequences from diverse microbial environments and report sequence variants occurring across different human populations or associated with different disease phenotypes. Structural data will enhance the insights available from this data. For example, known or predicted structures can reveal whether residue mutations occur near sites important for protein function or interaction with other proteins in complexes and signalling pathways.
Collaborations within GENOME-3D will lead to more accurate and more comprehensive structural data for use by these initiatives.
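
The per-residue confidence described above (based on the number of methods that agree) could be computed roughly as follows. This is a minimal sketch rather than the project's implementation: the function name, the method names and the family labels are hypothetical, and it assumes that the predictions from all methods have already been expressed in a common family namespace.

```python
from collections import Counter


def residue_confidence(seq_len, predictions):
    """Return one (family, confidence) pair per residue.

    predictions maps a method name to a list of (start, end, family)
    domain assignments, with 1-based inclusive residue positions.
    Confidence is the fraction of methods that support the most
    frequently predicted family at that residue (an assumed scoring
    scheme, chosen here only for illustration).
    """
    n_methods = len(predictions)
    result = []
    for pos in range(1, seq_len + 1):
        votes = Counter()
        for domains in predictions.values():
            for start, end, family in domains:
                if start <= pos <= end:
                    votes[family] += 1
                    break  # each method contributes at most one vote per residue
        if votes:
            family, count = votes.most_common(1)[0]
            result.append((family, count / n_methods))
        else:
            result.append((None, 0.0))
    return result


# Hypothetical predictions from three methods for an 8-residue sequence.
example_predictions = {
    "method_a": [(1, 5, "fam1")],
    "method_b": [(3, 8, "fam1")],
    "method_c": [(4, 6, "fam2")],
}
example_confidence = residue_confidence(8, example_predictions)
```

A residue covered by two of the three methods with the same family label receives a confidence of 2/3 under this scheme; residues that no method covers receive 0.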

Technical Summary

We will develop the GENOME-3D: (1) website, presenting integrated information from the consortium's resources; (2) webserver, allowing users to submit query sequences/structures to run against the consortium's methods and return consensus predictions.

(1) GENOME-3D website
We will develop SOAP/REST based web services for:
- Exporting data from individual resources to GENOME-3D, i.e. domain boundaries, superfamily classifications and domain structure predictions
- Combining data, identifying consensus regions and calculating confidence values
We will develop Taverna workflows which plug together the above web services to provide consensus data. We will build a web portal to display this data (see figure 1 main text). The website will exploit an Oracle database and will provide facilities for querying with protein structure ids (PDB ids) or sequence ids (UniProt or GI codes). All partners have extensive experience in web design. CATH-Gene3D has tools for visualising multiple structure/multiple sequence alignments and highlighting conserved residues on representative structures. These will be adopted by GENOME-3D. We will design a questionnaire to capture feedback on the site and use this to improve design.

(2) GENOME-3D webserver
As well as providing predetermined classifications/annotations via the website (some data is manually curated), we will establish a server that allows structure/sequence based queries and automatically returns consensus domain classifications/predictions (no manual curation). We will develop SOAP/REST based web services for:
- Scanning query structures against classification methods, i.e. structure comparison (CATHEDRAL) and homologue recognition (HMMscan), to give uncurated SCOP/CATH assignments
- Performing multiple structure alignments
- Scanning query sequences against individual methods predicting domain structures and structural features, e.g. membrane regions
- Generating consensus data from multiple prediction methods
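
As an illustration of the "identifying consensus regions" step named above, the sketch below merges domain boundaries exported by several methods into the stretches of sequence that a minimum number of methods cover. The function name, the `min_methods` threshold and the example boundaries are assumptions made for this example and are not taken from the GENOME-3D services.

```python
def consensus_regions(seq_len, boundaries_by_method, min_methods=2):
    """Return maximal (start, end) regions covered by >= min_methods methods.

    boundaries_by_method maps a method name to a list of (start, end)
    domain boundaries, with 1-based inclusive residue positions.
    """
    support = [0] * (seq_len + 1)  # index 0 is unused
    for domains in boundaries_by_method.values():
        covered = set()
        for start, end in domains:
            # Clip to the sequence so out-of-range boundaries are ignored.
            covered.update(range(max(1, start), min(seq_len, end) + 1))
        for pos in covered:
            support[pos] += 1  # one count per method, even if its domains overlap

    regions = []
    run_start = None
    for pos in range(1, seq_len + 1):
        if support[pos] >= min_methods:
            if run_start is None:
                run_start = pos
        elif run_start is not None:
            regions.append((run_start, pos - 1))
            run_start = None
    if run_start is not None:
        regions.append((run_start, seq_len))
    return regions


# Hypothetical domain boundaries from three methods on a 150-residue protein.
example_boundaries = {
    "method_a": [(1, 50), (80, 120)],
    "method_b": [(10, 60)],
    "method_c": [(90, 130)],
}
example_regions = consensus_regions(150, example_boundaries, min_methods=2)
```

Raising `min_methods` trades coverage for agreement: with the boundaries above, a threshold of 3 leaves no consensus region at all.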

Planned Impact

SUMMARY OF RESOURCE This proposal is to establish a resource (GENOME-3D) for the bioscience and biomedical communities, providing integrated information on the 3D structures of proteins and relating this data to protein function. GENOME-3D will comprise information from major UK groups in structural bioinformatics. The individual resources are extensively used by the community: combined access to the different databases is >50,000 visits per month and the total number of jobs run on all the servers is 20,000 per month. This testifies to the importance of this structural and functional information for both the academic and commercial communities. Producing a combined resource will enhance the value of the individual components by enabling comparisons and cross-referencing. The resource will have an impact on many applications of bioscience and biomedical research. This proposal is endorsed by letters of support from several major UK pharmaceutical, biotech and agricultural companies: Syngenta, UCB, GSK, Isogenica, Heptares, Syntaxin and Astex. SCIENCE COMMUNITY Food security - Increasingly the sequences of plants, agricultural pests and agents of disease will be the focus of genome sequencing and structural studies. GENOME-3D will assist in interpreting sequence variations between plant strains and help identify the best strain to meet requirements for yield, water use, colour, taste and resistance to pests and disease. The information could benefit chemical discovery and marker identification for crop breeding programs. Bio-energy and bio-industry - The manipulation of individual molecules and pathways will yield new sources of energy and materials. Synthetic pathways can be engineered to make molecules, such as fuels, more efficiently. In addition, novel molecules can be designed and synthesised. Detailed structural knowledge of a protein family can be used to suggest the critical changes to alter function.
At the pathway level, GENOME-3D will help to identify the components based on sequence and structural information of families of proteins. Health - The central role of protein structure in the design of novel and improved pharmaceuticals is well established. Provision of the highest quality 3D models from gene sequence will therefore directly enhance the discovery of new hits. The refinement of these hits into leads will benefit from information about a family of molecules to highlight the relationship of stereochemistry, ligand binding and activity. Therapeutic molecules will span the spectrum from low molecular weight compounds, through peptides into proteins, including antibodies. A major development in the next few years will be the sequencing of many individuals and relating their sequence variations (single nucleotide polymorphisms, SNPs) to disease susceptibility. This will provide major insights into biological processes in humans, the development of personalised medicine and the identification of novel drug targets. Central to the interpretation of SNP effects in protein coding regions will be knowledge in GENOME-3D of the inter-relationships between protein sequence, structure, function and pathways. POLICY MAKERS AND THE LAY PUBLIC GENOME-3D will involve several UK groups working together to develop a world-leading bioinformatics resource. The success of the project could inform policy makers about the value of collaborative work for bioinformatics and other scientific resources within the UK, within Europe and worldwide. Similarly, GENOME-3D can demonstrate to the general public (including schools) the value of bioinformatics resources and collaborative research. GENOME-3D has applied to become a node within the ELIXIR funding framework.
Participation in this new mechanism for promoting collaborative development and maintenance of major European resources will help shape policy and provide exemplars of how ELIXIR can benefit the wider European community.

Publications

 
Description Knowing the 3D structure of a protein can be very valuable for understanding the mechanism by which it functions and for designing drugs that could inhibit or modify its function. However, fewer than 10% of known protein sequences have known protein structures. This project provided predictions of protein structures for structurally uncharacterised proteins from ten model organisms including human, fly, mouse and yeast. These predictions were made on the basis of homology to known protein structures. In order to increase confidence in the accuracy of the predictions, information was combined from five independent resources (SUPERFAMILY, Gene3D, PHYRE, FUGUE, pDomTHREADER), and the groups developing these resources collaborated to provide this data.

A computational platform was developed for integrating the data from the different resources and a public website was launched one year after the project started. The new Genome3D resource was publicised in the special issue of Nucleic Acids Research (NAR) on databases and presented in a technology track at the ISMB international meeting for computational biology in July 2013. A workshop demonstrating Genome3D was held at UCL in July 2014.

The coverage of sequences with predicted structures has recently been extended to include representatives from all Pfam families. Pfam families account for nearly 80% of known domains in nature. A further publication in NAR reports the structural coverage of Pfam families.

Since the different Genome3D prediction methods use either CATH or SCOP domain families, a mapping between these resources was developed to determine when methods were predicting that proteins adopted structures from equivalent families in SCOP or CATH.
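
The use of such a mapping to decide whether a SCOP-based method and a CATH-based method are predicting equivalent families can be sketched as below. This is only an illustration: the function names are hypothetical, and the two mapping entries are illustrative examples that have not been copied from the Genome3D SCOP/CATH mapping, which would be loaded from the resource itself in practice.

```python
# Illustrative SCOP superfamily -> CATH superfamily entries (assumption:
# shown only to make the example concrete, not taken from Genome3D).
ILLUSTRATIVE_SCOP_TO_CATH = {
    "c.37.1": "3.40.50.300",
    "b.1.1": "2.60.40.10",
}


def to_cath_family(classification, superfamily_id, scop_to_cath):
    """Express a prediction as a CATH superfamily id, or None if unmapped."""
    if classification == "CATH":
        return superfamily_id
    if classification == "SCOP":
        return scop_to_cath.get(superfamily_id)
    raise ValueError(f"unknown classification: {classification!r}")


def predictions_agree(pred_a, pred_b, scop_to_cath):
    """True if two (classification, superfamily_id) predictions are equivalent.

    Predictions whose SCOP superfamily has no mapped CATH equivalent are
    treated as not agreeing, since equivalence cannot be established.
    """
    fam_a = to_cath_family(pred_a[0], pred_a[1], scop_to_cath)
    fam_b = to_cath_family(pred_b[0], pred_b[1], scop_to_cath)
    return fam_a is not None and fam_a == fam_b


agree = predictions_agree(
    ("SCOP", "c.37.1"), ("CATH", "3.40.50.300"), ILLUSTRATIVE_SCOP_TO_CATH
)
```

Mapping everything into one namespace keeps the agreement test a simple equality check; a many-to-many mapping would need sets of equivalent families instead of a single id.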
Exploitation Route Since fewer than 10% of protein sequences have known structures, Genome3D data is likely to be valuable for a wide range of biologists and structural biologists. For example, the domain structure boundary information will be useful for guiding construct design for protein structure determination. Structural data is also valuable for understanding the mechanisms by which proteins function.

The structural data in Genome3D will be valuable for biologists/biomedical researchers seeking to understand the likely impact of a genetic variation or nsSNP in a query protein.

Genome3D data is also useful for target selection by structural genomics consortia. Since Genome3D provides the most comprehensive structural coverage of Pfam families, it has been used by the NIH PSI structural genomics consortia to target structurally uncharacterised Pfam families.

The structural annotations will also be useful for bioinformaticians and chemical biologists in pharmaceutical companies seeking to determine whether a drug target has structural analogs that could bind the drug, resulting in side effects.
Sectors Digital/Communication/Information Technologies (including Software),Pharmaceuticals and Medical Biotechnology

URL http://www.genome3d.eu
 
Description The Genome3D website has only been live for two years and so the web stats are lower than for any of the established resources contributing the prediction data. However, there is clear evidence that the biological community is starting to use the site, with a 50% increase in the number of web sessions and a 56% increase in the number of users (comparing Nov 2012-Nov 2013 with Nov 2013-Nov 2014). The consensus structural annotation data is being used by PDBe to provide information on protein sequences related to structures deposited in the PDB. The SCOP/CATH mapping is being used by InterPro to guide integration of the structural annotations from SUPERFAMILY and Gene3D in InterPro. Genome3D structural annotations of Pfam families are being used by the NIH funded Midwest Center for Structural Genomics to guide target selection for structure determination. This will significantly aid the structural coverage of sequence space.
First Year Of Impact 2012
Sector Digital/Communication/Information Technologies (including Software),Pharmaceuticals and Medical Biotechnology
Impact Types Economic

 
Description Structural Bioinformatics Consortium (ELIXIR)
Geographic Reach Europe 
Policy Influence Type Influenced training of practitioners or researchers
Impact We are part of a consortium of 17 research groups developing training material in structural bioinformatics. This work is being co-ordinated by the Genome3D consortium, which is managed by Orengo. Each group within the consortium is developing its own training material relating to its particular research area. This material will be integrated via on-line workflows which are being developed as a part of the TeSS platform, an on-line training catalogue and training facility being organised by the ELIXIR UK node. The CATH-Gene3D training material was developed for an ECCB workshop on protein structure to function held in July 2013, organised by Christine Orengo, Nicholas Furnham and Romain Studer. Christine Orengo is also deputy lead of the Functional Effects domain in Structural Bioinformatics, which is integrating tools and resources from the 17 structural bioinformatics research groups mentioned. This integrated resource will be used for the interpretation of genetic variations related to health and disease. Training material is also being developed in this context.
 
Title Genome3D 
Description Please note that this research database is still being continuously developed and improved. Genome3D provides consensus structural annotations and 3D models for sequences from model organisms, including human. These data are generated by several UK based resources in the Genome3D consortium: SCOP, CATH, SUPERFAMILY, Gene3D, FUGUE, THREADER, PHYRE. In addition Genome3D integrates structural classification data from SCOP and CATH. An overview of some of the features this resource provides:
- Structural Annotations -- regions of protein sequences that have been matched to structural domains (from CATH or SCOP)
- Structural Models -- regions of protein sequences that have been modelled in 3D (based on similarity to a CATH or SCOP domain)
- Consensus Superfamilies -- an official collaboration between the structural domain classification databases CATH and SCOP
The resource provides annotations based on over 160,000 UniProtKB sequences from 10 model organisms plus a representative set of proteins from Pfam. The annotations include over 1,000,000 predicted structural domains and over 350,000 predicted 3D structural models. The upcoming release of Genome3D updates the existing set of protein sequences and includes 3 additional model organisms (pig, wheat and TB), which more than doubles the total number of sequences (over 400,000 UniProtKB entries). 
Type Of Material Database/Collection of data 
Year Produced 2012 
Provided To Others? Yes  
Impact The development of the Genome3D research involved building strong working collaborations between leading structural bioinformatics groups in the UK. Providing a single portal that shows annotations from each of these resources has helped to advertise and improve the quality of these individual resources. Having this strong collaboration in place helped us to bring together an effective working group to represent structural bioinformatics for the training node in ELIXIR UK. 
URL http://www.genome3d.eu
 
Description ELIXIR 
Organisation ELIXIR
Department ELIXIR UK
Country United Kingdom 
Sector Charity/Non Profit 
PI Contribution We are part of the 3D-BioInfo ELIXIR Community in Structural Bioinformatics, which was established in January 2019 and is being coordinated by Christine Orengo. CATH-Gene3D contributes to two of the four major activities in 3D-BioInfo. Activity I relates to integration of functional sites in PDBe Knowledge Base (PDBe-KB). CATH Functional Families (FunFams) are being used to identify functional sites for domain families and this data is being integrated in PDBe-KB. Activity II relates to integration of tools and data associated with protein structure prediction. CATH functional families are being used to identify templates for homology modelling of structurally uncharacterised proteins. 3D models have been generated for 14 model organisms including human, mouse, rat, Arabidopsis, fly, yeast and E. coli. These 3D models are then integrated in the Genome3D resource, managed by Orengo. 3D-BioInfo Activity II involves integration of 3D models from Genome3D in PDBe-KB with links to UniProt. CATH-Gene3D recently received ELIXIR implementation study funding to collaborate with the SWISS-MODEL team in Switzerland to use the SWISS-MODEL pipeline together with template data from CATH functional families to build more accurate 3D models. We are planning to extend this activity to include more European partners through collaborations facilitated by 3D-BioInfo workshops. We are also part of an ELIXIR UK consortium of 17 research groups developing training material in structural bioinformatics. This work is being co-ordinated by the Genome3D consortium managed by Orengo. CATH-Gene3D training material was developed for an ECCB workshop on protein structure to function held in July 2013, organised by Christine Orengo, Nicholas Furnham and Romain Studer. This material has been adapted for the ELIXIR training workflows. 
Christine Orengo is also deputy lead of the Functional Effects Domain in Structural Bioinformatics, which is integrating tools and resources from the 17 structural bioinformatics research groups mentioned above. The Domain is part of Genomics England and is headed by Ewan Birney. The aim is to establish an integrated resource that will be used for the interpretation of genetic variations related to health and disease. Training material is also being developed in this context. ELIXIR UK funding was allocated in March 2017 to develop training workflows for predicting the impacts of genetic variations. These workflows have now been developed and are accessible via the ELIXIR TeSS training website.
Collaborator Contribution As regards the ELIXIR 3D-BioInfo collaborations, research groups from 15 European countries are involved in this collaboration. For the Activities that CATH-Gene3D contributes to, more than 10 groups are involved from 7 countries including the UK. All are contributing predicted functional site data to PDBe-KB. We all participate in workshops held at the EBI regularly to discuss ontologies and export/import mechanisms and APIs. As regards the ELIXIR UK training workflows, each group within the consortium is developing their own training material relating to their particular research area.
Impact All predicted functional site data will be made available via the PDBe-KB. Predicted domain structure data will be made available through Genome3D and also through PDBe-KB once the exchange mechanisms for that have been completed. All training material will be integrated via on-line workflows which are being developed as a part of the TeSS platform, an on-line training catalogue and training facility being organised by the ELIXIR UK node.
Start Year 2013
 
Description InterPro 
Organisation EMBL European Bioinformatics Institute (EMBL - EBI)
Country United Kingdom 
Sector Academic/University 
PI Contribution InterPro provides functional analysis of proteins by classifying them into families and predicting domains and important sites. It combines protein signatures from a number of member databases into a single searchable resource, capitalising on their individual strengths to produce a powerful integrated database and diagnostic tool. Our research team has provided the following contributions to the InterPro resource:
- Structural annotations from CATH
- Structural annotations from Genome3D
- Mapping between CATH and SCOP protein structure classifications
CATH-Gene3D provide domain family HMMs and structure annotations to InterPro on a regular basis. We have recently provided a new tool, CATH-Resolve-Hits, for generating accurate multi-domain architecture information from sequence matches to the CATH domain HMM libraries. We currently have BBSRC BBR funding to extend the mapping between SCOP and CATH, integrate Genome3D annotations in InterPro for selected model organisms, and provide a 3D viewer for the structural annotations.
Collaborator Contribution Annotations from other sources, manual curations, central database and web site.
Impact Publications Community resource to further biological research.
Start Year 2007
 
Description PDBe 
Organisation EMBL European Bioinformatics Institute (EMBL - EBI)
Country United Kingdom 
Sector Academic/University 
PI Contribution Our resource CATH provides high quality annotations to improve the quality of the information provided by the PDBe, primarily the location of structural domains and the identification of distant evolutionary relationships between known protein structures. Our Gene3D resource provides structural annotations for genome sequences from ~20,000 species. These annotations are also incorporated in the Genome3D resource for selected model organisms. Collaborations between research groups involved in the Genome3D project have resulted in a high quality mapping between the CATH and SCOP structural classification databases. This is being implemented by the PDBe to improve the clarity and coverage of structural annotations in their resource. We currently have a BBSRC BBR funded collaboration with PDBe and InterPro to provide our CATH-Gene3D structural annotations to these resources, via the Genome3D portal.
Collaborator Contribution Host, maintain and curate the central PDBe resource and website.
Impact Publications Community resources to further scientific research.
Start Year 2006
 
Description Swiss-Model - 3D Models for CATH domain sequences 
Organisation University of Basel
Country Switzerland 
Sector Academic/University 
PI Contribution This is an ELIXIR funded collaboration between the Orengo Group and the Swiss-Model Team, led by Prof. Torsten Schwede. The Orengo group will be building a computational platform to provide domain sequences predicted to belong to CATH functional families (FunFams). FunFams are generated by agglomerative clustering of the domain sequences in each superfamily, guided by a protocol that assesses similarity in specificity-determining residues.
Collaborator Contribution The SwissModel team will be building computational pipelines to import the CATH sequences data and then submit these sequences to the established Swiss-Model homology modelling platforms. The 3D models generated will be made available to the biology community via the Swiss-Model, CATH-Gene3D, PDBe and InterPro websites.
Impact We have built APIs that allow exchange of data between CATH and SWISS-MODEL. Using these we have imported 3D models for structurally uncharacterised CATH-FunFams into CATH. This pilot work has led to a more substantial collaboration between the partners as part of the 3D-Gateway project, which is establishing the 3D-Beacons portal to integrate 3D models from different resources (SWISS-MODEL, PHYRE, Rosetta, DomTHREADER).
Start Year 2017
 
Title SSAP- structure comparison program 
Description Algorithm for aligning protein structures. It exploits a double dynamic algorithm to handle insertions and deletions and so can be used to align very distantly related homologues as well as close homologues. It has been used to identify the structural relationships on which the CATH classification was based. 
Type Of Technology Software 
Impact This software is licensed by UCLi and has been sold to several companies, including CellTech and Pfizer India. 
 
Title cath-superpose: flexible superpositions of protein structures 
Description cath-superpose provides the optimal structural superposition between two protein structures. When deciding on which residues to use for the superposition, the tool takes into account the structural environment of each residue. This focuses the superposition on the parts of the alignment that align well rather than on variable regions that can disrupt superpositions. In contrast with methods that simply attempt to minimise the RMSD, this approach can be used to build superpositions of hundreds of protein structures that clearly show the highly conserved ancient structural core within distantly related protein domain structures.
- written and tested in strict C++
- removed a number of local dependencies to allow the tool to be used by the wider community
- source code released on GitHub under the GPLv3 license (as part of the cath-tools suite)
- incorporated into a robust continuous integration (CI) build with tests and releases 
Type Of Technology Software 
Year Produced 2016 
Open Source License? Yes  
Impact - used as a tool to superpose predicted structures from the Genome3D collaboration - used to provide superpositions of entire superfamilies for the CATH database (previously not possible) 
URL http://cath-tools.readthedocs.io/en/latest/tools/cath-superpose/
 
Description Genome3D workshop (UCL, 2014) 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Postgraduate students
Results and Impact We organised a workshop at UCL which brought together 90 people from a variety of different fields (academia and industry) and geographic locations.

The workshop was split into two sections: the first included presentations from all member databases of the Genome3D collaboration (including CATH-Gene3D); the second was a hands-on tutorial where each participant could work through the prepared examples on the web site.
Year(s) Of Engagement Activity 2014
URL http://genome3d.eu/
 
Description Public talks and workshops 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Public/other audiences
Results and Impact We have given several talks on CATH-Gene3D and Genome3D in local schools in London.

We participated in a Wellcome Trust funded workshop on chronic pain, at which we gave a talk and demonstration of how CATH-Gene3D and Genome3D data was being used to provide structural and functional information on genes involved in chronic pain.

The schools reported that our talks had generated a lot of interest in proteins and structural biology and that several students had decided to seek further information on undergraduate courses with study modules on computational biology.

Our talks include images of protein structures which help in intuitively conveying information on the mechanisms by which proteins function.

The Wellcome workshop on chronic pain was very well received with excellent responses to the feedback questionnaire.
Year(s) Of Engagement Activity 2009,2011,2012,2014,2015,2016