GENOME-3D: a UK network providing structure-based annotations for genotype to phenotype studies

Lead Research Organisation: European Bioinformatics Institute

Department Name: Protein Data Bank in Europe

Abstract

The 3D structures of proteins are essential to fully characterise the sites mediating their molecular functions and their interactions with other proteins. However, whilst revolutionary technologies have enabled the sequencing of thousands of complete genomes, it is more challenging to determine the 3D structures of the proteins. Although the sequence repositories now contain >10 million protein sequences, fewer than 70,000 protein structures have been determined. Fortunately, in parallel with developments in sequencing technologies, powerful computational methods have emerged to predict the structure of a protein from its sequence. Currently these methods provide putative structures for ~80% of domain sequences from completed genomes, although the accuracy of this data varies from reasonably precise, when structures are modelled using templates based on close relatives, through to quite approximate, for models based on remote relatives and where proteins have no structurally characterised relatives. This project will bring together six internationally renowned UK groups involved in (1) classifying protein domains into evolutionary families (as this facilitates structure and function prediction) and/or (2) protein structure prediction. As regards the first activity - classification of protein structures - the two groups involved (SCOP, CATH) are the only groups, worldwide, providing this data. However, each applies somewhat different methodologies to make their assignments. Collaboration between these groups, in GENOME-3D, will involve comparison of domain structures and family classifications, leading to refinements of assignments and/or confidence levels where the methods disagree. Since manual curation of the data is essential and since the rate at which the structures are determined is increasing, collaborations will speed up classification by allowing the groups to share information on the more challenging assignments and to discuss outcomes.
For the second activity, structure prediction, the groups involved use technologies that vary in their sensitivity and in their ability to handle large numbers of sequences. Whilst SUPERFAMILY (based on SCOP) and Gene3D (based on CATH) provide greater coverage, they are less likely to recognise very remote homologues, where methods such as GenTHREADER, Phyre and Fugue perform better. For each sequence, we will combine predictions from these different resources and assign confidence for each residue position in a query sequence based on the number of methods that agree in their structural prediction. We will provide pre-calculated assignments and also allow dynamic queries on the methods. We will also build 3D models for the sequences with residue positions highlighted according to agreement between the methods. We will develop computational platforms that integrate the information provided by each resource. To distribute this data to the biological and medical community we will build a dedicated web site. We will also establish web servers that link the methods, i.e. run all the methods on query sequences and then report consensus assignments and highlight differences. In addition, the consensus classification and annotation data will be provided via two major international sites - the PDBe and InterPro. The sequence repositories are expanding at phenomenal rates as metagenomics and next-generation sequencing initiatives bring in sequences from diverse microbial environments and report sequence variants occurring across different human populations or associated with different disease phenotypes. Structural data will enhance the insights available from this data. For example, known or predicted structures can reveal whether residue mutations occur near sites important for protein function or interaction with other proteins in

Technical Summary

We will develop the GENOME-3D: (1) website - presenting integrated information from the consortium's resources; (2) webserver - allowing users to submit query sequences/structures to run against the consortium's methods and return consensus predictions.

(1) GENOME-3D website

We will develop SOAP/REST based web services for:
- Exporting data from individual resources to GENOME-3D, i.e. domain boundaries/superfamily classifications/domain structure predictions
- Combining data, identifying consensus regions and calculating confidence values

We will develop Taverna workflows which plug together the above web services to provide consensus data. We will build a web portal to display this data (see figure 1 main text). The website will exploit an Oracle database and will provide facilities for querying with protein structure ids (PDB ids) or sequence ids (UniProt or GI codes). All partners have extensive experience in web design. CATH-Gene3D has tools for visualising multiple structure/multiple sequence alignments and highlighting conserved residues on representative structures. These will be adopted by GENOME-3D. We will design a questionnaire to capture feedback on the site and use this to improve design.

(2) GENOME-3D webserver

As well as providing predetermined classifications/annotations via the website (some data is manually curated), we will establish a server that allows structure/sequence based queries and automatically returns consensus domain classifications/predictions (no manual curation). We will develop SOAP/REST based web services for:
- Scanning query structures against classification methods, i.e. structure comparison (CATHEDRAL) and homologue recognition (HMMscan), to give uncurated SCOP/CATH assignments
- Performing multiple structure alignments
- Scanning query sequences against individual methods predicting domain structures and structural features, e.g. membrane regions
- Generating consensus data from multiple prediction methods
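
The consensus step named in this summary (combining data from several prediction methods and calculating confidence values per residue) could be sketched roughly as below. This is only an illustration and not the project's implementation: the function name `residue_consensus`, the method names, the labels `fam1`/`fam2` and the input layout are all hypothetical, and defining confidence as the fraction of methods that agree is an assumption. The proposal says only that confidence is based on the number of methods in agreement.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

# Hypothetical input layout: each method name maps to a per-residue list of
# predicted domain labels, with None where the method makes no prediction.
Predictions = Dict[str, List[Optional[str]]]


def residue_consensus(
    predictions: Predictions, length: int
) -> List[Tuple[Optional[str], float]]:
    """Return (consensus label, confidence) for each residue position.

    Assumption: confidence is the fraction of all methods that agree on the
    most frequently predicted label at that position.
    """
    n_methods = len(predictions)
    result: List[Tuple[Optional[str], float]] = []
    for i in range(length):
        # Count the labels assigned to position i, ignoring missing predictions.
        votes = Counter(
            labels[i]
            for labels in predictions.values()
            if i < len(labels) and labels[i] is not None
        )
        if not votes:
            result.append((None, 0.0))
            continue
        label, count = votes.most_common(1)[0]
        result.append((label, count / n_methods))
    return result


# Toy example with three hypothetical methods and a four-residue sequence.
example = {
    "method_a": ["fam1", "fam1", None, None],
    "method_b": ["fam1", "fam1", "fam2", None],
    "method_c": ["fam1", None, "fam2", None],
}
consensus = residue_consensus(example, 4)
```

In this toy example all three methods agree at the first position, two of three agree at the second and third, and none makes a prediction at the fourth, so confidence falls from full agreement to zero along the sequence.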

Planned Impact

SUMMARY OF RESOURCE

This proposal is to establish a resource (GENOME-3D) for the bioscience and biomedical communities, providing integrated information on the 3D structures of proteins and relating this data to protein function. GENOME-3D will comprise information from major UK groups in structural bioinformatics. The individual resources are extensively used by the community - combined access to the different databases is >50,000 visits per month and the total number of jobs run on all the servers is 20,000 per month. This testifies to the importance of this structural and functional information for both the academic and commercial communities. Producing a combined resource will enhance the value of the individual components by enabling comparisons and cross-referencing. The resource will have an impact on many applications of bioscience and biomedical research. This proposal is endorsed by letters of support from several major UK pharmaceutical, biotech and agricultural companies - Syngenta, UCB, GSK, Isogenica, Heptares, Syntaxin and Astex.

SCIENCE COMMUNITY

Food security - Increasingly the sequences of plants, agricultural pests and agents of disease will be the focus of genome sequencing and structural studies. GENOME-3D will assist in interpreting sequence variations between plant strains and help identify the best strain in terms of yield, water requirements, colour, taste and resistance to pests and disease. The information could benefit chemical discovery and marker identification for crop breeding programs.

Bio-energy and bio-industry - The manipulation of individual molecules and pathways will yield new sources of energy and materials. Synthetic pathways can be engineered to make molecules, such as fuels, more efficiently. In addition, novel molecules can be designed and synthesised. Detailed structural knowledge of a protein family can be used to suggest the critical changes to alter function. At the pathway level, GENOME-3D will help to identify the components based on sequence and structural information of families of proteins.

Health - The central role of protein structure in the design of novel and improved pharmaceuticals is well established. Provision of the highest quality 3D models from gene sequence will therefore directly enhance the discovery of new hits. The refinement of these hits into leads will benefit from information about a family of molecules to highlight the relationship of stereochemistry, ligand binding and activity. Therapeutic molecules will span the spectrum from low molecular weight compounds, through peptides into proteins, including antibodies. A major development in the next few years will be the sequencing of many individuals and relating their sequence variations (single nucleotide polymorphisms, SNPs) to disease susceptibility. This will provide major insights into biological processes in humans, the development of personalised medicine and the identification of novel drug targets. Central to the interpretation of SNP effects in protein coding regions will be knowledge in GENOME-3D of the inter-relationships between protein sequence, structure, function and pathways.

POLICY MAKERS AND THE LAY PUBLIC

GENOME-3D will involve several UK groups working together to develop a world-leading bioinformatics resource. The success of the project could inform policy makers about the value of collaborative work for bioinformatics and other scientific resources within the UK, within Europe and worldwide. Similarly, GENOME-3D can demonstrate to the general public (including schools) the value of bioinformatics resources and collaborative research. GENOME-3D has applied to become a node within the ELIXIR funding framework. Participation in this new mechanism for promoting collaborative development and maintenance of major European resources will help shape policy and provide exemplars of how ELIXIR can benefit the wider European community.

Funded Value:

£54,794

Funded Period:

Aug 12 - Aug 14

Funder:

BBSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

BB/I02576X/1

Principal Investigator:

Gerard Kleywegt

Research Subject:

Biomolecules & biochemistry (65%)

Omic sciences & technologies (13%)

Tools, technologies & methods (13%)

Research Topic:

Protein expression (39%)

Protein folding / misfolding (13%)

Proteomics (13%)

Structural biology (13%)

Theoretical biology (13%)

Organisations

European Bioinformatics Institute (Lead Research Organisation)

People	ORCID iD
Gerard Kleywegt (Principal Investigator)

Publications

Berman HM (2013) How community has shaped the Protein Data Bank. in Structure (London, England : 1993)

Dutta S (2014) Improving the representation of peptide-like inhibitor and antibiotic molecules in the Protein Data Bank. in Biopolymers

Gutmanas A (2013) The role of structural bioinformatics resources in the era of integrative structural biology. in Acta crystallographica. Section D, Biological crystallography

Gutmanas A (2014) PDBe: Protein Data Bank in Europe. in Nucleic acids research

Lewis TE (2015) Genome3D: exploiting structure to help users understand their sequences. in Nucleic acids research

Lewis TE (2013) Genome3D: a UK collaborative project to annotate genomic sequences with predicted 3D structures based on SCOP and CATH domains. in Nucleic acids research

Sen S (2014) Small molecule annotation for the Protein Data Bank. in Database : the journal of biological databases and curation

Key Findings
Impact Summary
Software and Technical Products
Engagement Activities


Description	The Genome3D portal has integrated annotations of structural features from various UK-based prediction servers for model organisms including human, mouse, baker's yeast, E. coli, Arabidopsis and many others. For the first time, this integration provides information on the consensus between these independent and diverse methods, allowing users to select high-confidence annotations of structural features. The project has also mapped the correspondence between the structural domain data from CATH and SCOP. This has highlighted differences and errors in domain annotations, making it possible to improve both CATH and SCOP data.
Exploitation Route	The resource makes 3D structural feature annotations available for sequences from model organisms, including human. This makes structural data available for protein molecules that do not have experimentally determined structure information in the Protein Data Bank. The integration of different resources also makes it possible to derive high-confidence predictions based on the consensus between different methods. This data can be used by academic users as well as biotechnology and pharmaceutical researchers to understand the function of the protein molecules. This can lead to a better understanding of the effects of variation in protein sequences on the function of the molecule, and in turn of disease mechanisms. The information can also help in the design of novel and improved pharmaceutical molecules. The structure prediction data can also help in altering function or in the design of new biomolecules for biotechnology applications.
Sectors	Education, Pharmaceuticals and Medical Biotechnology
URL	http://genome3d.eu/
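
The CATH/SCOP correspondence mapping mentioned in the description above depends on comparing domain boundaries assigned to the same chain by the two classifications. A minimal sketch of one possible way to score such a comparison is given below. The `domain_overlap` function, the segment representation and the choice to normalise by the larger domain are illustrative assumptions, not the criteria the project actually applied.

```python
from typing import List, Set, Tuple

# A domain is represented as a list of inclusive residue ranges (start, end),
# which allows for discontiguous domains. This layout is hypothetical.
Segment = Tuple[int, int]


def residues(segments: List[Segment]) -> Set[int]:
    """Expand a list of inclusive residue ranges into a set of positions."""
    covered: Set[int] = set()
    for start, end in segments:
        covered.update(range(start, end + 1))
    return covered


def domain_overlap(a: List[Segment], b: List[Segment]) -> float:
    """Shared residues as a fraction of the larger of the two domains.

    Assumption: normalising by the larger domain penalises cases where one
    classification splits a region that the other treats as a single domain.
    """
    ra, rb = residues(a), residues(b)
    if not ra or not rb:
        return 0.0
    return len(ra & rb) / max(len(ra), len(rb))


# Two hypothetical assignments for the same chain, shifted by ten residues.
score = domain_overlap([(1, 100)], [(11, 110)])
```

Pairs scoring below a chosen threshold could then be flagged for manual inspection; any such threshold would need to be set by the curators.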


Description	Biomacromolecular structure information can provide insights into the function of a protein and also help in understanding how changes to its amino acid sequence affect the structure, and consequently the function, of the molecule. There are several approaches to predicting the structure and function of protein molecules. This project aims to bring together the leading UK resources for structure-function predictions, providing a unique resource that gives users access to this information in one place. The integration also allows the developers to compare their method against all the other methods, leading to improvements to the existing methods. The project also helped map the two structure domain resources, CATH and SCOP, that provide structure domain classification data. This information has made it possible to improve both the CATH and SCOP resources, helping many in the structural and structure-function prediction community.
First Year Of Impact	2013
Sector	Education, Pharmaceuticals and Medical Biotechnology
Impact Types	Economic


Title	Genome3D website
Description	Website providing Genome3D data
Type Of Technology	Webtool/Application
Year Produced	2012
Impact	The predicted structural features for sequences are made available via the website
URL	http://genome3d.eu/


Description	Booth with handouts on "PDBe-KB aggregated views of proteins"
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Manned a booth and distributed handouts to participants at the RSC NMR Discussion Group meeting organised at the University of Leeds.
Year(s) Of Engagement Activity	2019
URL	https://www.rsc.org/events/detail/37139/nmr-in-biophysics-and-molecular-biology


Description	Invited lecture and workshop titled "Data deposition and validation at the Worldwide Protein Data Bank"
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Poster presented at the CCPEM Spring Symposium 2019 held at the University of Nottingham, UK.
Year(s) Of Engagement Activity	2019
URL	https://www.ebi.ac.uk/pdbe/about/events/ccp-em-spring-symposium


Description	PDBe Knowledge Base (PDBe-KB) - infrastructure for FAIR structural and functional annotations
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Talk given at the BCA Spring Meeting 2019 organised at the University of Nottingham, UK.
Year(s) Of Engagement Activity	2019
URL	https://www.ebi.ac.uk/pdbe/about/events/bca-spring-meeting-2019


Description	Poster titled "Functional annotations in the PDBe Knowledge Base"
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Poster presented during the CCPEM Spring Symposium 2019 held at the University of Nottingham, UK.
Year(s) Of Engagement Activity	2019
URL	https://www.ebi.ac.uk/pdbe/about/events/ccp-em-spring-symposium


Description	Poster titled "Functional annotations in the PDBe Knowledge Base"
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Poster presented during the BCA Spring Meeting 2019 held at the University of Nottingham, UK.
Year(s) Of Engagement Activity	2019
URL	https://www.ebi.ac.uk/pdbe/about/events/bca-spring-meeting-2019


Description	Poster titled "Functional annotations in the PDBe Knowledge Base"
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Poster presented during the Instruct ERIC Structural Biology Conference 2019 meeting held at the University of Alcala, Spain.
Year(s) Of Engagement Activity	2019
URL	https://www.structuralbiology.eu/biennial2019


Description	Poster titled "PDBe-KB aggregated views of proteins"
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	This poster was presented by PDBe-KB at the 12th International BioCuration Conference in the UK.
Year(s) Of Engagement Activity	2019


Description	Poster titled "PDBe-KB: Aggregated views of protein structural data for drug development"
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Professional Practitioners
Results and Impact	This poster was presented at the UKQSAR Spring meeting 2019 at Downing College, Cambridge, an event hosted by Astex Pharmaceuticals and themed around structure-based drug discovery.
Year(s) Of Engagement Activity	2019


Description	Talk and workshop entitled "Finding and understanding macromolecular structures at PDBe"
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Workshop and talk presented in Pavia, Italy as part of a wider EMBL-EBI workshop titled "Resources and tools for genomics, protein interactions and structural applications".
Year(s) Of Engagement Activity	2019
URL	https://www.ebi.ac.uk/training/events/2019/embl-ebi-workshop-resources-and-tools-genomics-protein-in...


Description	What does PDB do to improve data quality?
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	One talk and one institutional seminar were presented at the Proteopedia training workshop organised at the University of Strasbourg, France.
Year(s) Of Engagement Activity	2019