GENOME-3D: a UK network providing structure-based annotations for genotype to phenotype studies

Lead Research Organisation: MRC Centre Cambridge

Department Name: LMB Structural Studies

Abstract

The 3D structures of proteins are essential to fully characterise the sites mediating their molecular functions and their interactions with other proteins. However, whilst revolutionary technologies have enabled the sequencing of thousands of complete genomes, it is more challenging to determine the 3D structures of the proteins. Although the sequence repositories now contain >10 million protein sequences, fewer than 70,000 protein structures have been determined. Fortunately, in parallel with developments in sequencing technologies, powerful computational methods have emerged to predict the structure of a protein from its sequence. Currently these methods provide putative structures for ~80% of domain sequences from completed genomes, although the accuracy of this data varies from reasonably precise, when structures are modelled using templates based on close relatives, through to quite approximate for models based on remote relatives and for proteins that have no structurally characterised relatives.
This project will bring together six internationally renowned UK groups involved in (1) classifying protein domains into evolutionary families (as this facilitates structure and function prediction) and/or (2) protein structure prediction. As regards the first activity, classification of protein structures, the two groups involved (SCOP and CATH) are the only groups worldwide providing this data. However, each applies somewhat different methodologies to make its assignments. Collaboration between these groups in GENOME-3D will involve comparison of domain structures and family classifications, leading to refinements of assignments and/or confidence levels where the methods disagree. Since manual curation of the data is essential and the rate at which structures are determined is increasing, collaboration will speed up classification by allowing the groups to share information on the more challenging assignments and to discuss outcomes.
For the second activity, structure prediction, the groups involved use technologies that vary in their sensitivity and in their ability to handle large numbers of sequences. Whilst SUPERFAMILY (based on SCOP) and Gene3D (based on CATH) provide greater coverage, they are less likely to recognise very remote homologues, where methods such as GenTHREADER, Phyre and Fugue perform better. For each sequence, we will combine predictions from these different resources and assign a confidence to each residue position in a query sequence based on the number of methods that agree in their structural prediction. We will provide pre-calculated assignments and also allow dynamic queries on the methods. We will also build 3D models for the sequences, with residue positions highlighted according to agreement between the methods. We will develop computational platforms that integrate the information provided by each resource. To distribute this data to the biological and medical community we will build a dedicated web site. We will also establish web servers that link the methods, i.e. run all the methods on query sequences, then report consensus assignments and highlight differences. In addition, the consensus classification and annotation data will be provided via two major international sites, the PDBe and InterPro. The sequence repositories are expanding at phenomenal rates as metagenomics and next-generation sequencing initiatives bring in sequences from diverse microbial environments and report sequence variants occurring across different human populations or associated with different disease phenotypes. Structural data will enhance the insights available from this data. For example, known or predicted structures can reveal whether residue mutat
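The per-residue confidence described in the abstract (counting how many methods agree at each position) could be sketched roughly as follows. This is only an illustration, not the project's implementation: the function name `residue_confidence`, the method keys and the labels "X", "Y", "Z" are hypothetical, and it assumes each method reports one structural label (or nothing) per residue of the same query sequence.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def residue_confidence(
    predictions: Dict[str, List[Optional[str]]]
) -> List[Tuple[Optional[str], int]]:
    """For each residue, return (most supported label, number of methods agreeing).

    `predictions` maps a method name to its per-residue labels; None means
    the method made no assignment at that position. (Hypothetical data model.)
    """
    lengths = {len(labels) for labels in predictions.values()}
    if len(lengths) != 1:
        raise ValueError("every method must annotate the same sequence length")
    n_residues = lengths.pop()

    result: List[Tuple[Optional[str], int]] = []
    for i in range(n_residues):
        # Count votes for each label at this position, ignoring abstentions.
        votes = Counter(
            labels[i] for labels in predictions.values() if labels[i] is not None
        )
        if votes:
            result.append(votes.most_common(1)[0])
        else:
            result.append((None, 0))
    return result


# Hypothetical example: three methods annotating a four-residue query.
example = {
    "method_a": ["X", "X", None, "Y"],
    "method_b": ["X", "X", "X", "Y"],
    "method_c": ["X", None, None, "Z"],
}
confidence = residue_confidence(example)
```

A simple vote count treats all methods as equally reliable; a real pipeline might weight methods differently or treat related superfamily assignments as partial agreement.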

Technical Summary

We will develop the GENOME-3D (1) website, presenting integrated information from the consortium's resources, and (2) webserver, allowing users to submit query sequences/structures to run against the consortium's methods and return consensus predictions.

(1) GENOME-3D website
We will develop SOAP/REST-based web services for:
- Exporting data from individual resources to GENOME-3D, i.e. domain boundaries, superfamily classifications and domain structure predictions
- Combining data, identifying consensus regions and calculating confidence values
We will develop Taverna workflows which plug together the above web services to provide consensus data. We will build a web portal to display this data (see figure 1 main text). The website will exploit an Oracle database and will provide facilities for querying with protein structure ids (PDB ids) or sequence ids (UniProt or GI codes). All partners have extensive experience in web design. CATH-Gene3D has tools for visualising multiple structure/multiple sequence alignments and highlighting conserved residues on representative structures. These will be adopted by GENOME-3D. We will design a questionnaire to capture feedback on the site and use this to improve design.

(2) GENOME-3D webserver
As well as providing predetermined classifications/annotations via the website (some data is manually curated), we will establish a server that allows structure/sequence-based queries and automatically returns consensus domain classifications/predictions (no manual curation). We will develop SOAP/REST-based web services for:
- Scanning query structures against classification methods, i.e. structure comparison (CATHEDRAL) and homologue recognition (HMMscan), to give uncurated SCOP/CATH assignments
- Performing multiple structure alignments
- Scanning query sequences against individual methods predicting domain structures and structural features, e.g. membrane regions
- Generating consensus data from multiple prediction methods
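As a rough illustration of the "identifying consensus regions" step listed above, the sketch below turns per-residue agreement counts into contiguous regions. It is an assumption-laden example, not the consortium's method: the function name `consensus_regions`, the thresholds `min_methods` and `min_length`, and the 1-based inclusive coordinates are all hypothetical choices.

```python
from typing import List, Tuple


def consensus_regions(
    agreement: List[int], min_methods: int = 2, min_length: int = 1
) -> List[Tuple[int, int]]:
    """Return (start, end) residue ranges, 1-based and inclusive, where at
    least `min_methods` methods agree for at least `min_length` residues.

    `agreement[i]` is the number of methods agreeing at residue i + 1.
    (Hypothetical thresholds; the source does not specify any.)
    """
    regions: List[Tuple[int, int]] = []
    start = None  # start of the run currently being extended, if any
    for pos, count in enumerate(agreement, start=1):
        if count >= min_methods:
            if start is None:
                start = pos
        elif start is not None:
            # The run ended at pos - 1; keep it only if it is long enough.
            if pos - start >= min_length:
                regions.append((start, pos - 1))
            start = None
    # Close a run that extends to the end of the sequence.
    if start is not None and len(agreement) - start + 1 >= min_length:
        regions.append((start, len(agreement)))
    return regions


# Hypothetical agreement counts for an eight-residue query.
regions = consensus_regions([3, 2, 1, 0, 2, 2, 3, 1], min_methods=2)
```

Raising `min_length` filters out short runs, which may be useful when isolated residues of agreement are not meaningful as domain-level annotations.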

Planned Impact

SUMMARY OF RESOURCE
This proposal is to establish a resource (GENOME-3D) for the bioscience and biomedical communities, providing integrated information on the 3D structures of proteins and relating this data to protein function. GENOME-3D will comprise information from major UK groups in structural bioinformatics. The individual resources are extensively used by the community: combined access to the different databases is >50,000 visits per month and the total number of jobs run on all the servers is 20,000 per month. This testifies to the importance of this structural and functional information for both the academic and commercial communities. Producing a combined resource will enhance the value of the individual components by enabling comparisons and cross-referencing. The resource will have an impact on many applications of bioscience and biomedical research. This proposal is endorsed by letters of support from several major UK pharmaceutical, biotech and agricultural companies - Syngenta, UCB, GSK, Isogenica, Heptares, Syntaxin and Astex.

SCIENCE COMMUNITY
Food security - Increasingly the sequences of plants, agricultural pests and agents of disease will be the focus of genome sequencing and structural studies. GENOME-3D will assist in interpreting the sequence variations between plant strains and help identify the best strain to meet requirements for yield, water use, colour, taste, and resistance to pests and disease. The information could benefit chemical discovery and marker identification for crop breeding programs.
Bio-energy and bio-industry - The manipulation of individual molecules and pathways will yield new sources of energy and materials. Synthetic pathways can be engineered to make molecules, such as fuels, more efficiently. In addition, novel molecules can be designed and synthesised. Detailed structural knowledge of a protein family can be used to suggest the critical changes needed to alter function. At the pathway level, GENOME-3D will help to identify the components based on sequence and structural information of families of proteins.
Health - The central role of protein structure in the design of novel and improved pharmaceuticals is well established. Provision of the highest quality 3D models from gene sequence will therefore directly enhance the discovery of new hits. The refinement of these hits into leads will benefit from information about a family of molecules to highlight the relationship of stereochemistry, ligand binding, and activity. Therapeutic molecules will span the spectrum from low molecular weight compounds, through peptides, to proteins, including antibodies. A major development in the next few years will be the sequencing of many individuals and relating their sequence variations (single nucleotide polymorphisms, SNPs) to disease susceptibility. This will provide major insights into biological processes in humans, the development of personalised medicine, and the identification of novel drug targets. Central to the interpretation of SNP effects in protein coding regions will be the knowledge in GENOME-3D of the inter-relationships between protein sequence, structure, function and pathways.

POLICY MAKERS AND THE LAY PUBLIC
GENOME-3D will involve several UK groups working together to develop a world-leading bioinformatics resource. The success of the project could inform policy makers about the value of collaborative work for bioinformatics and other scientific resources within the UK, within Europe and worldwide. Similarly, GENOME-3D can demonstrate to the general public (including schools) the value of bioinformatics resources and collaborative research. GENOME-3D has applied to become a node within the ELIXIR funding framework. Participation in this new mechanism for promoting collaborative development and maintenance of major European resources will help shape policy and provide exemplars of how ELIXIR can benefit the wider European community.

Funded Value:

£62,391

Funded Period:

Jan 12 - Jul 13

Funder:

BBSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

BB/I024917/1

Principal Investigator:

Alexey Murzin

Research Subject:

Biomolecules & biochemistry (65%)

Omic sciences & technologies (13%)

Tools, technologies & methods (13%)

Research Topic:

Protein expression (39%)

Protein folding / misfolding (13%)

Proteomics (13%)

Structural biology (13%)

Theoretical biology (13%)

Organisations

MRC Centre Cambridge (Lead Research Organisation)

People	ORCID iD
Alexey Murzin (Principal Investigator)
Cyrus Chothia (Co-Investigator)

Publications

Andreeva A (2015) Investigating Protein Structure and Evolution with SCOP2. in Current protocols in bioinformatics

Andreeva A (2020) The SCOP database in 2020: expanded classification of representative family and superfamily domains of known protein structures. in Nucleic acids research

Andreeva A (2014) SCOP2 prototype: a new approach to protein structure mining. in Nucleic acids research

Lewis T (2012) Genome3D: a UK collaborative project to annotate genomic sequences with predicted 3D structures based on SCOP and CATH domains. in Nucleic acids research

Lewis TE (2015) Genome3D: exploiting structure to help users understand their sequences. in Nucleic acids research

Sillitoe I (2020) Genome3D: integrating a collaborative data pipeline to expand the depth and breadth of consensus protein structure annotation. in Nucleic acids research

Key Findings

Description	Genome3D is a collaborative resource that provides structure-based predictions for gene products from several representative complete genomes, including human, to help users learn more about their gene sequences. By combining results from independent resources, it allows the user to assess agreement and hence gauge confidence. More details on the key findings associated with this grant can be found in the report of the leading PI, Prof. Christine Orengo of UCL.
Exploitation Route	Genome3D is a resource for exploring the insights that structure can bring to sequence, and a gateway to learning more through the groups' individual resources.
Sectors	Agriculture, Food and Drink; Healthcare; Pharmaceuticals and Medical Biotechnology; Other
URL	http://www.genome3d.eu

Further Funding

Description	BBSRC Research Grant Genome3D
Amount	£62,391 (GBP)
Funding ID	BB/I024917/1
Organisation	Biotechnology and Biological Sciences Research Council (BBSRC)
Sector	Public
Country	United Kingdom
Start	01/2012
End	07/2013