GENOME-3D: a UK network providing structure-based annotations for genotype to phenotype studies

Lead Research Organisation: European Bioinformatics Institute
Department Name: Protein Data Bank in Europe

Abstract

The 3D structures of proteins are essential to fully characterise the sites mediating their molecular functions and their interactions with other proteins. However, whilst revolutionary technologies have enabled the sequencing of thousands of complete genomes, it is more challenging to determine the 3D structures of the proteins. Although the sequence repositories now contain >10 million protein sequences, fewer than 70,000 protein structures have been determined. Fortunately, in parallel with developments in sequencing technologies, powerful computational methods have emerged to predict the structure of a protein from its sequence. Currently these methods provide putative structures for ~80% of domain sequences from completed genomes, although the accuracy of this data varies: it is reasonably precise when structures are modelled using templates based on close relatives, and quite approximate for models based on remote relatives or where proteins have no structurally characterised relatives. This project will bring together six internationally renowned UK groups involved in (1) classifying protein domains into evolutionary families (as this facilitates structure and function prediction) and/or (2) protein structure prediction. As regards the first activity - classification of protein structures - the two groups involved (SCOP, CATH) are the only groups, worldwide, providing this data. However, each applies somewhat different methodologies to make its assignments. Collaboration between these groups, in GENOME-3D, will involve comparison of domain structures and family classifications, leading to refinements of assignments and/or confidence levels where the methods disagree. Since manual curation of the data is essential and since the rate at which structures are determined is increasing, collaboration will speed up classification by allowing the groups to share information on the more challenging assignments and to discuss outcomes.
For the second activity, structure prediction, the groups involved use technologies that vary in their sensitivity and in their ability to handle large numbers of sequences. Whilst SUPERFAMILY (based on SCOP) and Gene3D (based on CATH) provide greater coverage, they are less likely to recognise very remote homologues, where methods such as GenTHREADER, Phyre and Fugue perform better. For each sequence, we will combine predictions from these different resources and assign a confidence to each residue position in a query sequence based on the number of methods that agree in their structural prediction. We will provide pre-calculated assignments and also allow dynamic queries on the methods. We will also build 3D models for the sequences with residue positions highlighted according to agreement between the methods. We will develop computational platforms that integrate the information provided by each resource. To distribute this data to the biological and medical community we will build a dedicated web site. We will also establish web servers that link the methods, i.e. run all the methods on query sequences and then report consensus assignments and highlight differences. In addition, the consensus classification and annotation data will be provided via two major international sites - the PDBe and InterPro. The sequence repositories are expanding at phenomenal rates as metagenomics and next-generation sequencing initiatives bring in sequences from diverse microbial environments and report sequence variants occurring across different human populations or associated with different disease phenotypes. Structural data will enhance the insights available from this data. For example, known or predicted structures can reveal whether residue mutations occur near sites important for protein function or interaction with other proteins in

Technical Summary

We will develop the GENOME-3D: (1) website - presenting integrated information from the consortium's resources; (2) webserver - allowing users to submit query sequences/structures to run against the consortium's methods and return consensus predictions.

(1) GENOME-3D website
We will develop SOAP/REST based web services for:
- Exporting data from individual resources to GENOME-3D, i.e. domain boundaries/superfamily classifications/domain structure predictions
- Combining data, identifying consensus regions and calculating confidence values
We will develop Taverna workflows which plug together the above web services to provide consensus data. We will build a web portal to display this data (see figure 1 main text). The website will exploit an Oracle database and will provide facilities for querying with protein structure ids (PDB ids) or sequence ids (UniProt or GI codes). All partners have extensive experience in web design. CATH-Gene3D has tools for visualising multiple structure/multiple sequence alignments and highlighting conserved residues on representative structures. These will be adopted by GENOME-3D. We will design a questionnaire to capture feedback on the site and use this to improve the design.

(2) GENOME-3D webserver
As well as providing predetermined classifications/annotations via the website (some data is manually curated), we will establish a server that allows structure/sequence based queries and automatically returns consensus domain classifications/predictions (no manual curation). We will develop SOAP/REST based web services for:
- Scanning query structures against classification methods, i.e. structure comparison (CATHEDRAL) and homologue recognition (HMMscan), to give uncurated SCOP/CATH assignments
- Performing multiple structure alignments
- Scanning query sequences against individual methods predicting domain structures and structural features, e.g. membrane regions
- Generating consensus data from multiple prediction methods
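
The step of combining data, identifying consensus regions and calculating confidence values can be illustrated with a minimal Python sketch. It is not the project's implementation: the function name `residue_consensus`, the method names, the superfamily labels and the scoring rule (fraction of methods that agree on the most common label at each residue) are all assumptions made for illustration.

```python
from collections import Counter


def residue_consensus(seq_len, predictions):
    """Per-residue consensus over domain predictions from several methods.

    predictions maps a (hypothetical) method name to a list of
    (start, end, label) domain assignments, 1-based and inclusive.
    Returns one (label, confidence) pair per residue, where confidence is
    the fraction of methods assigning the most common label (an assumed rule).
    """
    votes = [Counter() for _ in range(seq_len)]
    for domains in predictions.values():
        # Each method casts at most one vote per residue.
        labels = [None] * seq_len
        for start, end, label in domains:
            for pos in range(max(start, 1), min(end, seq_len) + 1):
                labels[pos - 1] = label
        for i, label in enumerate(labels):
            if label is not None:
                votes[i][label] += 1

    n_methods = len(predictions)
    consensus = []
    for counter in votes:
        if counter:
            label, count = counter.most_common(1)[0]
            consensus.append((label, count / n_methods))
        else:
            # No method predicted a domain at this residue.
            consensus.append((None, 0.0))
    return consensus


# Illustrative input only: method names and labels are made up.
example = {
    "method_a": [(1, 5, "sf1")],
    "method_b": [(3, 8, "sf1")],
    "method_c": [(4, 6, "sf2")],
}
for position, (label, conf) in enumerate(residue_consensus(10, example), start=1):
    print(position, label, round(conf, 2))
```

Residues covered by more agreeing methods receive a higher score, which is the kind of value that could drive the highlighting of residue positions by agreement. A real service would also need to reconcile different classification schemes (for example SCOP and CATH labels) before counting votes.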

Planned Impact

SUMMARY OF RESOURCE
This proposal is to establish a resource (GENOME-3D) for the bioscience and biomedical communities that provides integrated information on the 3D structures of proteins and relates this data to protein function. GENOME-3D will comprise information from major UK groups in structural bioinformatics. The individual resources are extensively used by the community - combined access to the different databases is >50,000 visits per month and the total number of jobs run on all the servers is 20,000 per month. This testifies to the importance of this structural and functional information for both the academic and commercial communities. Producing a combined resource will enhance the value of the individual components by enabling comparisons and cross-referencing. The resource will have an impact on many applications of bioscience and biomedical research. This proposal is endorsed by letters of support from several major UK pharmaceutical, biotech and agricultural companies - Syngenta, UCB, GSK, Isogenica, Heptares, Syntaxin and Astex.

SCIENCE COMMUNITY
Food security - Increasingly, the sequences of plants, agricultural pests and agents of disease will be the focus of genome sequencing and structural studies. GENOME-3D will assist in interpreting the sequence variations between plant strains and help identify the best strain to meet requirements for yield, water use, colour, taste and resistance to pests and disease. The information could benefit chemical discovery and marker identification for crop breeding programmes.

Bio-energy and bio-industry - The manipulation of individual molecules and pathways will yield new sources of energy and materials. Synthetic pathways can be engineered to make molecules, such as fuels, more efficiently. In addition, novel molecules can be designed and synthesised. Detailed structural knowledge of a protein family can be used to suggest the critical changes needed to alter function. At the pathway level, GENOME-3D will help to identify the components based on sequence and structural information for families of proteins.

Health - The central role of protein structure in the design of novel and improved pharmaceuticals is well established. Provision of the highest quality 3D models from gene sequence will therefore directly enhance the discovery of new hits. The refinement of these hits into leads will benefit from information about a family of molecules that highlights the relationship between stereochemistry, ligand binding and activity. Therapeutic molecules will span the spectrum from low molecular weight compounds, through peptides, to proteins, including antibodies. A major development in the next few years will be the sequencing of many individuals and relating their sequence variations (single nucleotide polymorphisms, SNPs) to disease susceptibility. This will provide major insights into biological processes in humans, the development of personalised medicine and the identification of novel drug targets. Central to the interpretation of SNP effects in protein-coding regions will be the knowledge in GENOME-3D of the inter-relationships between protein sequence, structure, function and pathways.

POLICY MAKERS AND THE LAY PUBLIC
GENOME-3D will involve several UK groups working together to develop a world-leading bioinformatics resource. The success of the project could inform policy makers about the value of collaborative work for bioinformatics and other scientific resources within the UK, within Europe and worldwide. Similarly, GENOME-3D can demonstrate to the general public (including schools) the value of bioinformatics resources and collaborative research. GENOME-3D has applied to become a node within the ELIXIR funding framework. Participation in this new mechanism for promoting collaborative development and maintenance of major European resources will help shape policy and provide exemplars of how ELIXIR can benefit the wider European community.

Publications

Berman HM (2013) How community has shaped the Protein Data Bank. in Structure (London, England : 1993)

Gutmanas A (2014) PDBe: Protein Data Bank in Europe. in Nucleic acids research

Gutmanas A (2013) The role of structural bioinformatics resources in the era of integrative structural biology. in Acta crystallographica. Section D, Biological crystallography

Sen S (2014) Small molecule annotation for the Protein Data Bank. in Database : the journal of biological databases and curation

 
Description The Genome3D portal has integrated annotations of structural features from various UK-based prediction servers for model organisms including human, mouse, baker's yeast, E. coli, Arabidopsis and many others. For the first time, this integration provides information on the consensus between these independent and diverse methods, making it possible for users to select high-confidence annotations of structural features.
The project has also mapped the correspondence of the structure domain data from CATH and SCOP. This has highlighted the differences and errors in domain annotations making it possible to improve both CATH and SCOP data.
Exploitation Route The resource makes 3D structure feature annotations available for sequences from model organisms, including human. This makes structure data available for protein molecules that do not have experimentally determined structure information in the Protein Data Bank. The integration of different resources also makes it possible to derive high-confidence predictions based on the consensus between different methods. This data can be used by academic users as well as biotechnology and pharmaceutical researchers to understand the function of the protein molecules. This can lead to a better understanding of the effects of variation in protein sequences on the function of the molecule, and in turn to a better understanding of disease mechanisms. The information can also help in the design of novel and improved pharmaceutical molecules. The structure prediction data can also help in altering function or in designing new biomolecules for biotechnology applications.
Sectors Education, Pharmaceuticals and Medical Biotechnology

URL http://genome3d.eu/
 
Description Biomacromolecular structure information can provide insights into the function of a protein and also help in understanding the effects of changes to its amino acid sequence on the structure, and consequently the function, of the molecule. There are several approaches to predicting the structure and function of protein molecules. This project aims to bring together the leading UK resources for structure-function prediction, providing a unique resource that gives users access to this information in one place. The integration also allows the developers to compare their method against all the other methods, leading to improvements to the existing methods. The project also helped map the two structure domain resources, CATH and SCOP, that provide structure domain classification data. This information has made it possible to improve both the CATH and SCOP resources, helping many in the structural and structure-function prediction community.
First Year Of Impact 2013
Sector Education, Pharmaceuticals and Medical Biotechnology
Impact Types Economic

 
Title Genome3D website 
Description Website providing Genome3D data 
Type Of Technology Webtool/Application 
Year Produced 2012 
Impact The predicted structural features for sequences are made available via the website 
URL http://genome3d.eu/
 
Description Booth with handouts on "PDBe-KB aggregated views of proteins" 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Manned a booth and distributed handouts to participants at the RSC NMR Discussion Group meeting organised at the University of Leeds.
Year(s) Of Engagement Activity 2019
URL https://www.rsc.org/events/detail/37139/nmr-in-biophysics-and-molecular-biology
 
Description Invited lecture and workshop titled "Data deposition and validation at the Worldwide Protein Data Bank" 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Poster presented at the CCPEM Spring Symposium 2019 held at the University of Nottingham, UK.
Year(s) Of Engagement Activity 2019
URL https://www.ebi.ac.uk/pdbe/about/events/ccp-em-spring-symposium
 
Description PDBe Knowledge Base (PDBe-KB) - infrastructure for FAIR structural and functional annotations 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Talk given at the BCA Spring Meeting 2019 organised at the University of Nottingham, UK.
Year(s) Of Engagement Activity 2019
URL https://www.ebi.ac.uk/pdbe/about/events/bca-spring-meeting-2019
 
Description Poster titled "Functional annotations in the PDBe Knowledge Base" 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Poster presented during the CCPEM Spring Symposium 2019 held at the University of Nottingham, UK.
Year(s) Of Engagement Activity 2019
URL https://www.ebi.ac.uk/pdbe/about/events/ccp-em-spring-symposium
 
Description Poster titled "Functional annotations in the PDBe Knowledge Base" 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Poster presented during the Instruct ERIC Structural Biology Conference 2019 meeting held at the University of Alcala, Spain.
Year(s) Of Engagement Activity 2019
URL https://www.structuralbiology.eu/biennial2019
 
Description Poster titled "Functional annotations in the PDBe Knowledge Base" 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Poster presented during the BCA Spring Meeting 2019 held at the University of Nottingham, UK.
Year(s) Of Engagement Activity 2019
URL https://www.ebi.ac.uk/pdbe/about/events/bca-spring-meeting-2019
 
Description Poster titled "PDBe-KB aggregated views of proteins" 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact This poster was presented by PDBe-KB at the 12th International BioCuration Conference in the UK.
Year(s) Of Engagement Activity 2019
 
Description Poster titled "PDBe-KB: Aggregated views of protein structural data for drug development" 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact This poster was presented at the UKQSAR Spring meeting 2019 at Downing College, Cambridge, an event hosted by Astex Pharmaceuticals and themed around structure-based drug discovery.
Year(s) Of Engagement Activity 2019
 
Description Talk and workshop entitled "Finding and understanding macromolecular structures at PDBe"
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Workshop and talk presented in Pavia, Italy as part of a wider EMBL-EBI workshop titled "Resources and tools for genomics, protein interactions and structural applications".
Year(s) Of Engagement Activity 2019
URL https://www.ebi.ac.uk/training/events/2019/embl-ebi-workshop-resources-and-tools-genomics-protein-in...
 
Description What does PDB do to improve data quality? 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact One talk and one institutional seminar were presented at the Proteopedia training workshop organised at the University of Strasbourg, France.
Year(s) Of Engagement Activity 2019