GENOME-3D: a UK network providing structure-based annotations for genotype to phenotype studies

Lead Research Organisation: Imperial College London
Department Name: Life Sciences

Abstract

The 3D structures of proteins are essential to fully characterise the sites mediating their molecular functions and their interactions with other proteins. However, whilst revolutionary technologies have enabled the sequencing of thousands of complete genomes, it is more challenging to determine the 3D structures of the proteins. Although the sequence repositories now contain >10 million protein sequences, fewer than 70,000 protein structures have been determined. Fortunately, in parallel with developments in sequencing technologies, powerful computational methods have emerged to predict the structure of a protein from its sequence. Currently these methods provide putative structures for ~80% of domain sequences from completed genomes, although the accuracy of this data varies from reasonably precise, when structures are modelled using templates based on close relatives, through to quite approximate for models based on remote relatives and where proteins have no structurally characterised relatives. This project will bring together 6 internationally renowned UK groups involved in (1) classifying protein domains into evolutionary families (as this facilitates structure and function prediction) and/or (2) protein structure prediction. As regards the first activity - classification of protein structures - the two groups involved (SCOP, CATH) are the only groups, worldwide, providing this data. However, each applies somewhat different methodologies to make its assignments. Collaboration between these groups, in GENOME-3D, will involve comparison of domain structures and family classifications, leading to refinements of assignments and/or confidence levels where the methods disagree. Since manual curation of the data is essential and since the rate at which the structures are determined is increasing, collaborations will speed up classification by allowing the groups to share information on the more challenging assignments and to discuss outcomes.
For the second activity, structure prediction, the groups involved use technologies that vary in their sensitivity and in their ability to handle large numbers of sequences. Whilst SUPERFAMILY (based on SCOP) and Gene3D (based on CATH) provide greater coverage, they are less likely to recognise very remote homologues, where methods such as GenTHREADER, Phyre and Fugue perform better. For each sequence, we will combine predictions from these different resources and assign confidence for each residue position in a query sequence based on the number of methods that agree in their structural prediction. We will provide pre-calculated assignments and also allow dynamic queries on the methods. We will also build 3D models for the sequences with residue positions highlighted according to agreement between the methods. We will develop computational platforms that integrate the information provided by each resource. To distribute this data to the biological and medical community we will build a dedicated web site. We will also establish web servers that link the methods, i.e. run all the methods on query sequences and then report consensus assignments and highlight differences. In addition, the consensus classification and annotation data will be provided via two major international sites - the PDBe and InterPro. The sequence repositories are expanding at phenomenal rates as metagenomics and next-generation sequencing initiatives bring in sequences from diverse microbial environments and report sequence variants occurring across different human populations or associated with different disease phenotypes. Structural data will enhance the insights available from this data. For example, known or predicted structures can reveal whether residue mutations oc

Technical Summary

We will develop the GENOME-3D: (1) website - presenting integrated information from the consortium's resources (2) webserver - allowing users to submit query sequences/structures to run against the consortium's methods and return consensus predictions.

(1) GENOME-3D website
We will develop SOAP/REST based web services for:
- Exporting data from individual resources to GENOME-3D, i.e. domain boundaries/superfamily classifications/domain structure predictions
- Combining data, identifying consensus regions and calculating confidence values
We will develop Taverna workflows which plug together the above web services to provide consensus data. We will build a web portal to display this data (see figure 1 main text). The website will exploit an Oracle database and will provide facilities for querying with protein structure ids (PDB ids) or sequence ids (UniProt or GI codes). All partners have extensive experience in web design. CATH-Gene3D has tools for visualising multiple structure/multiple sequence alignments and highlighting conserved residues on representative structures. These will be adopted by GENOME-3D. We will design a questionnaire to capture feedback on the site and use this to improve design.

(2) GENOME-3D webserver
As well as providing predetermined classifications/annotations via the website (some data is manually curated), we will establish a server that allows structure/sequence based queries and automatically returns consensus domain classifications/predictions (no manual curation). We will develop SOAP/REST based web services for:
- Scanning query structures against classification methods, i.e. structure comparison (CATHEDRAL) and homologue recognition (HMMscan), to give uncurated SCOP/CATH assignments
- Performing multiple structure alignments
- Scanning query sequences against individual methods predicting domain structures and structural features, e.g. membrane regions
- Generating consensus data from multiple prediction methods
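
The consensus step described above (combining predictions, scoring each residue by how many methods agree, and reporting consensus regions) can be illustrated with a minimal sketch. This is not the GENOME-3D implementation: the function names, the input layout (a mapping from method name to 1-based inclusive domain ranges) and the use of a simple fraction of agreeing methods as the confidence value are all assumptions made for illustration. It also only checks whether methods agree that a residue lies in a domain, not whether they assign the same superfamily.

```python
def residue_confidence(seq_len, predictions):
    """Return, for each residue, the fraction of methods placing it in a domain.

    predictions: hypothetical layout, {method_name: [(start, end), ...]} with
    1-based inclusive residue ranges. Each method counts once per residue.
    """
    if not predictions:
        raise ValueError("at least one prediction method is required")
    support = [0] * seq_len
    for domains in predictions.values():
        covered = set()
        for start, end in domains:
            # Clip ranges to the sequence so out-of-range input cannot index past the ends.
            covered.update(range(max(1, start), min(seq_len, end) + 1))
        for pos in covered:
            support[pos - 1] += 1
    n_methods = len(predictions)
    return [count / n_methods for count in support]


def consensus_regions(confidence, threshold):
    """Return 1-based inclusive ranges where confidence >= threshold."""
    regions = []
    start = None
    for pos, value in enumerate(confidence, start=1):
        if value >= threshold:
            if start is None:
                start = pos
        elif start is not None:
            regions.append((start, pos - 1))
            start = None
    if start is not None:
        regions.append((start, len(confidence)))
    return regions


# Illustrative (made-up) domain ranges for a 10-residue query sequence.
example = {
    "Gene3D": [(1, 5)],
    "SUPERFAMILY": [(2, 6)],
    "Phyre": [(3, 8)],
}
confidence = residue_confidence(10, example)
full_agreement = consensus_regions(confidence, 1.0)
majority_agreement = consensus_regions(confidence, 0.6)
```

Here `full_agreement` covers residues 3 to 5, where all three methods overlap, and `majority_agreement` covers residues 2 to 6, where at least two of the three agree. A real service would presumably also weight methods by reliability and compare the superfamily assigned, which this sketch leaves out.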

Planned Impact

SUMMARY OF RESOURCE This proposal is to establish a resource (GENOME-3D) for the bioscience and biomedical communities, providing integrated information on the 3D structures of proteins and relating this data to protein function. GENOME-3D will comprise information from major UK groups in structural bioinformatics. The individual resources are extensively used by the community - combined access to the different databases is >50,000 visits per month and the total number of jobs run on all the servers is 20,000 jobs per month. This testifies to the importance of this structure and functional information for both the academic and commercial communities. Producing a combined resource will enhance the value of the individual components by enabling comparisons and cross-referencing. The resource will have an impact on many applications of bioscience and biomedical research. This proposal is endorsed by letters of support from several major UK pharmaceutical, biotech and agricultural companies - Syngenta, UCB, GSK, Isogenica, Heptares, Syntaxin and Astex. SCIENCE COMMUNITY Food security - Increasingly the sequences of plants, agricultural pests and agents of disease will be the focus of genome sequencing and structural studies. GENOME-3D will assist in interpreting sequence variations between plant strains and help identify the best strain to meet requirements for yield, water use, colour, taste, and resistance to pests and disease. The information could benefit chemical discovery and marker identification for crop breeding programs. Bio-energy and bio-industry - The manipulation of individual molecules and pathways will yield new sources of energy and materials. Synthetic pathways can be engineered to make molecules, such as fuels, more efficiently. In addition, novel molecules can be designed and synthesised. Detailed structural knowledge of a protein family can be used to suggest the critical changes to alter function.
At the pathway level, GENOME-3D will help to identify the components based on sequence and structural information of families of proteins. Health - The central role of protein structure in the design of novel and improved pharmaceuticals is well established. Provision of the highest quality 3D models from gene sequence will therefore directly enhance the discovery of new hits. The refinement of these hits into leads will benefit from information about a family of molecules to highlight the relationship of stereochemistry, ligand binding, and activity. Therapeutic molecules will span the spectrum from low molecular weight compounds, through peptides into proteins, including antibodies. A major development in the next few years will be the sequencing of many individuals and relating their sequence variations (single nucleotide polymorphisms, SNPs) to disease susceptibility. This will provide major insights into biological processes in humans, the development of personalised medicine, and the identification of novel drug targets. Central to the interpretation of SNP effects in protein coding regions will be knowledge in GENOME-3D of the inter-relationships between protein sequence, structure, function and pathways. POLICY MAKERS AND THE LAY PUBLIC GENOME-3D will involve several UK groups working together to develop a world-leading bioinformatics resource. The success of the project could inform policy makers about the value of collaborative work for bioinformatics and other scientific resources within the UK, within Europe and worldwide. GENOME-3D can demonstrate to the general public (including schools) the value of bioinformatics resources and collaborative research. GENOME-3D has applied to become a node within the ELIXIR funding framework. Participation in this new mechanism for promoting collaborative development and maintenance of major European resources will help shape policy and provide exemplars of how ELIXIR can benefit the European community.
 
Description Several groups in the UK collaborated to develop an integrated resource providing structural annotations of proteins.
Exploitation Route Further development of reliability of annotations. Inclusion of functional annotations.
Sectors Agriculture, Food and Drink; Chemicals; Pharmaceuticals and Medical Biotechnology

URL http://www.genome3d.eu
 
Description Genome3D, available at http://www.genome3d.eu, is a new collaborative project that integrates UK-based structural resources to provide a unique perspective on sequence-structure-function relationships. Leading structure prediction resources (DomSerf, FUGUE, Gene3D, pDomTHREADER, Phyre and SUPERFAMILY) provide annotations for UniProt sequences to indicate the locations of structural domains (structural annotations) and their 3D structures (structural models). Structural annotations and 3D model predictions are currently available for three model genomes (Homo sapiens, E. coli and baker's yeast), and the project will extend to other genomes in the near future. As these resources exploit different strategies for predicting structures, the main aim of Genome3D is to enable comparisons between all the resources so that biologists can see where predictions agree and are therefore more trusted. Furthermore, as these methods differ in whether they build their predictions using CATH or SCOP, Genome3D also contains the first official mapping between these two databases. This has identified pairs of similar superfamilies from the two resources at various degrees of consensus.
First Year Of Impact 2013
Sector Agriculture, Food and Drink; Pharmaceuticals and Medical Biotechnology
Impact Types Economic

 
Description Biomedical Resource Development Fund
Amount £830,000 (GBP)
Funding ID WT104955MA 
Organisation Wellcome Trust 
Department Wellcome Trust Institutional Strategic Support Fund
Sector Charity/Non Profit
Country United Kingdom
Start 01/2015 
End 12/2020
 
Description EPSRC PhD Studentship
Amount £65,000 (GBP)
Funding ID EP/K502856/1 
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Public
Country United Kingdom
Start 10/2012 
End 03/2016
 
Title Genome3D 
Description Genome3D, available at http://www.genome3d.eu, is a new collaborative project that integrates UK-based structural resources to provide a unique perspective on sequence-structure-function relationships. Leading structure prediction resources (DomSerf, FUGUE, Gene3D, pDomTHREADER, Phyre and SUPERFAMILY) provide annotations for UniProt sequences to indicate the locations of structural domains (structural annotations) and their 3D structures (structural models). Structural annotations and 3D model predictions are currently available for three model genomes (Homo sapiens, E. coli and baker's yeast), and the project will extend to other genomes in the near future. As these resources exploit different strategies for predicting structures, the main aim of Genome3D is to enable comparisons between all the resources so that biologists can see where predictions agree and are therefore more trusted. Furthermore, as these methods differ in whether they build their predictions using CATH or SCOP, Genome3D also contains the first official mapping between these two databases.
Type Of Technology Webtool/Application 
Year Produced 2013 
Impact More reliable proteome structural annotation for use by the community. 
URL http://www.genome3d.eu
 
Title Phyre2 - A portal for protein modelling 
Description Phyre2 is the second generation of Phyre, in which a user pastes a protein sequence and the server returns a predicted 3D structure and provides additional protein modelling.
Type Of Technology Webtool/Application 
Year Produced 2011 
Impact During 2013, Phyre2 had over 40,00 unique visitors and since 2012, over 80,000 distinct users. 
URL http://www.sbg.bio.ic.ac.uk/phyre2/html/page.cgi?id=index
 
Description Human - computer interaction 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Public/other audiences
Results and Impact Demonstrated the first playable prototype of a docking game to a general audience.

Too early to report
Year(s) Of Engagement Activity 2014
 
Description Imperial Festival & Fringe (open to public) 
Form Of Engagement Activity Participation in an open day or visit at my research institution
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Public/other audiences
Results and Impact The Imperial Festival is attended by over 10,000 visitors, ranging from policy makers to the general public, including children of all ages. We demonstrated the implications of understanding protein structure. At our stand we had over 100 visitors.
Year(s) Of Engagement Activity 2014,2016
URL https://www.imperial.ac.uk/be-inspired/festival/
 
Description Lecture - Art and Science 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Public/other audiences
Results and Impact Talk highlighted link of structural biology and art.

Follow-up invitation to talk at a human/computer interaction conference
Year(s) Of Engagement Activity 2013
 
Description School lecture (London) 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Schools
Results and Impact Talk to school children to spark interest in science

Requests for work experience
Year(s) Of Engagement Activity 2012
 
Description Talk at school 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Schools
Results and Impact Describing use of bioinformatics in medical research
Year(s) Of Engagement Activity 2015
 
Description Work experience for 16-18 year old pupils 
Form Of Engagement Activity Participation in an open day or visit at my research institution
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Schools
Results and Impact We provided one week's work experience for about 6 students each year. They visited facilities at Imperial and were introduced to computer programming and molecular graphics.
Year(s) Of Engagement Activity 2014,2015