GENOME-3D: UK network providing structure-based annotations for genotype to phenotype studies

Lead Research Organisation: University of Cambridge
Department Name: Biochemistry

Abstract

The 3D structures of proteins are essential to fully characterise the sites mediating their molecular functions and their interactions with other proteins. However, whilst revolutionary technologies have enabled the sequencing of thousands of complete genomes, it is more challenging to determine the 3D structures of the proteins. Although the sequence repositories now contain >10 million protein sequences, fewer than 70,000 protein structures have been determined. Fortunately, in parallel with developments in sequencing technologies, powerful computational methods have emerged to predict the structure of a protein from its sequence. Currently these methods provide putative structures for ~80% of domain sequences from completed genomes, although the accuracy of this data varies from reasonably precise, when structures are modelled using templates based on close relatives, through to quite approximate for models based on remote relatives and where proteins have no structurally characterised relatives. This project will bring together 6 internationally renowned UK groups involved in (1) classifying protein domains into evolutionary families (as this facilitates structure and function prediction) and/or (2) protein structure prediction. As regards the first activity, classification of protein structures, the two groups involved (SCOP, CATH) are the only groups worldwide providing this data. However, each applies somewhat different methodologies to make its assignments. Collaboration between these groups in GENOME-3D will involve comparison of domain structures and family classifications, leading to refinements of assignments and/or confidence levels where the methods disagree. Since manual curation of the data is essential and since the rate at which structures are determined is increasing, collaboration will speed up classification by allowing the groups to share information on the more challenging assignments and to discuss outcomes.
For the second activity, structure prediction, the groups involved use technologies that vary in their sensitivity and in their ability to handle large numbers of sequences. Whilst SUPERFAMILY (based on SCOP) and Gene3D (based on CATH) provide greater coverage, they are less likely to recognise very remote homologues, where methods such as GenTHREADER, Phyre and FUGUE perform better. For each sequence, we will combine predictions from these different resources and assign a confidence to each residue position in a query sequence based on the number of methods that agree in their structural prediction. We will provide pre-calculated assignments and also allow dynamic queries on the methods. We will also build 3D models for the sequences, with residue positions highlighted according to agreement between the methods. We will develop computational platforms that integrate the information provided by each resource. To distribute this data to the biological and medical community we will build a dedicated web site. We will also establish web servers that link the methods, i.e. run all the methods on query sequences and then report consensus assignments and highlight differences. In addition, the consensus classification and annotation data will be provided via two major international sites, the PDBe and InterPro. The sequence repositories are expanding at phenomenal rates as metagenomics and next-generation sequencing initiatives bring in sequences from diverse microbial environments and report sequence variants occurring across different human populations or associated with different disease phenotypes. Structural data will enhance the insights available from this data. For example, known or predicted structures can reveal whether residue mutations oc

Technical Summary

We will develop GENOME-3D as: (1) a website presenting integrated information from the consortium's resources; (2) a webserver allowing users to submit query sequences/structures to run against the consortium's methods and return consensus predictions.
(1) GENOME-3D website
We will develop SOAP/REST-based web services for:
- Exporting data from individual resources to GENOME-3D, i.e. domain boundaries, superfamily classifications and domain structure predictions
- Combining data, identifying consensus regions and calculating confidence values
We will develop Taverna workflows which plug together the above web services to provide consensus data. We will build a web portal to display this data (see figure 1 main text). The website will exploit an Oracle database and will provide facilities for querying with protein structure ids (PDB ids) or sequence ids (UniProt or GI codes). All partners have extensive experience in web design. CATH-Gene3D has tools for visualising multiple structure/multiple sequence alignments and highlighting conserved residues on representative structures. These will be adopted by GENOME-3D. We will design a questionnaire to capture feedback on the site and use this to improve the design.
(2) GENOME-3D webserver
As well as providing predetermined classifications/annotations via the website (some data is manually curated), we will establish a server that allows structure/sequence-based queries and automatically returns consensus domain classifications/predictions (no manual curation). We will develop SOAP/REST-based web services for:
- Scanning query structures against classification methods, i.e. structure comparison (CATHEDRAL) and homologue recognition (HMMscan), to give uncurated SCOP/CATH assignments
- Performing multiple structure alignments
- Scanning query sequences against individual methods predicting domain structures and structural features, e.g. membrane regions
- Generating consensus data from multiple prediction methods
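The consensus step named in this summary (combining data, identifying consensus regions and calculating confidence values) could be sketched roughly as below. This is an illustrative Python sketch only, not the consortium's implementation: the function name `residue_consensus`, the input layout, the superfamily labels `sfA`/`sfB` and the choice of "fraction of methods agreeing" as the confidence value are all assumptions. It also assumes that SCOP-based and CATH-based labels have already been mapped onto a common scheme so that votes from different methods are comparable.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

# Hypothetical input layout: each method supplies a list of
# (start, end, superfamily_label) domain assignments, with 1-based
# inclusive residue positions on the query sequence.
Assignment = Tuple[int, int, str]


def residue_consensus(
    seq_len: int, predictions: Dict[str, List[Assignment]]
) -> List[Tuple[Optional[str], float]]:
    """Return, for each residue, the most-voted label and the fraction
    of methods that assigned it (an assumed confidence measure)."""
    n_methods = len(predictions)
    if n_methods == 0:
        raise ValueError("at least one prediction method is required")

    result: List[Tuple[Optional[str], float]] = []
    for pos in range(1, seq_len + 1):
        votes: Counter = Counter()
        for domains in predictions.values():
            # A method votes at most once per label at a given residue.
            labels = {sf for start, end, sf in domains if start <= pos <= end}
            votes.update(labels)
        if votes:
            # Highest vote count wins; ties are broken by label name so
            # the output is deterministic.
            label, count = max(votes.items(), key=lambda kv: (kv[1], kv[0]))
            result.append((label, count / n_methods))
        else:
            result.append((None, 0.0))
    return result


# Toy example with made-up assignments for a 10-residue query.
example = {
    "Gene3D": [(1, 6, "sfA")],
    "SUPERFAMILY": [(3, 8, "sfA")],
    "FUGUE": [(3, 6, "sfB")],
}
consensus = residue_consensus(10, example)
for position, (label, confidence) in enumerate(consensus, start=1):
    print(position, label, round(confidence, 2))
```

In this toy case residues 3 to 6 are assigned `sfA` by two of the three methods, while residues 9 and 10 are covered by no method and receive no label. A real service would also need to reconcile differing domain boundaries and weight methods by reliability, which this sketch does not attempt.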

Planned Impact

SUMMARY OF RESOURCE
This proposal is to establish a resource (GENOME-3D) for the bioscience and biomedical communities to access an integrated source of information on the 3D structures of proteins and relate this data to protein function. GENOME-3D will consist of information generated by major UK groups in structural bioinformatics. The individual resources are extensively used by the community: combined access to the different databases is over 50,000 visits per month and the total number of jobs run on all the servers is 20,000 per month. This testifies to the importance of this structural and functional information for both the academic and commercial communities. Producing a combined resource will enhance the value of the individual components by enabling comparisons and cross-referencing. The impact of the resource will be extensive and span most of the applications of bioscience and biomedical research. This proposal is endorsed by letters of support from several major UK pharmaceutical, biotech and agricultural companies: Syngenta, UCB, GSK, Isogenica, Heptares, Syntaxin and Astex.
SCIENCE COMMUNITY
Food security - Increasingly the sequences of plants, agricultural pests and agents of disease will be the focus of genome sequencing and structural studies. GENOME-3D will assist in interpreting sequence variations between plant strains and help in the identification of the best strain to meet objectives such as yield, water requirements, colour and taste, and resistance to pests and disease. The information could benefit chemical discovery and marker identification for crop breeding programs.
Bio-energy and bio-industry - The manipulation of individual molecules and of pathways will be central in the exploitation of bioscience to yield new sources of energy and materials. Synthetic pathways can be engineered to make molecules, such as fuels, more efficiently. In addition, novel molecules can be designed and synthesised. Detailed knowledge of the structures of a family of proteins can be used to suggest the critical changes needed to alter function. At the pathway level, GENOME-3D will help to identify the components based on sequence and structural information for families of proteins.
Health - The central role of protein structure in the design of novel and improved pharmaceuticals is well established. Provision of the highest quality 3D models from gene sequence will therefore directly enhance the discovery of new hits. The refinement of these hits into leads will benefit from information about a family of molecules to highlight the relationship of stereochemistry, ligand binding, and activity. Therapeutic molecules will span the spectrum from low molecular weight compounds, through peptides, to proteins, including antibodies. A major development over the next few years will be the sequencing of many individuals and relating their sequence variations (single nucleotide polymorphisms, SNPs) to disease susceptibility. This will provide major insights into biological processes in humans, the development of personalised medicine, and the identification of novel drug targets. Central to the interpretation of SNP effects in protein coding regions will be the knowledge available in GENOME-3D of the inter-relationships between protein sequence, structure, function and pathways.
POLICY MAKERS AND THE LAY PUBLIC
GENOME-3D will be an integrated resource with several UK groups working together to develop a world-leading bioinformatics resource. The success of the project could inform policy makers about the value of collaborative work for bioinformatics and other scientific resources within the UK, within Europe and worldwide. Similarly, GENOME-3D can serve as an example to the general public (including schools) that demonstrates both bioinformatics resources and the added value of collaborative research.
 
Description The project has integrated several UK-based structural resources to provide a unique perspective on sequence-structure-function relationships. The Blundell group has used FUGUE, alongside other leading structure prediction resources (DomSerf, Gene3D, pDomTHREADER, Phyre and SUPERFAMILY), to provide annotations for UniProt sequences that indicate the locations of structural domains (structural annotations) and their 3D structures (structural models). Structural annotations and 3D model predictions are currently available for three model genomes (Homo sapiens, E. coli and baker's yeast), with a further seven in preparation:



Mus musculus (mouse)

Arabidopsis thaliana (mouse-ear cress)

Drosophila melanogaster (fruit fly)

Caenorhabditis elegans (nematode)

Plasmodium falciparum (malaria parasite)

Staphylococcus aureus

Schizosaccharomyces pombe (fission yeast)



The Blundell group through Dr Ochoa has overseen (as independent arbitrator) the first official mapping between the SCOP and CATH databases, identifying "consensus" superfamily pairs that overlap substantially between the two resources. The pairs are categorised into bronze standard (532 pairs), silver standard (527 pairs) and gold standard (370 pairs).
Exploitation Route Genome3D provides annotations of value to structure-guided drug discovery. The database is of interest to drug discovery companies. It will also be of interest to companies involved in making scaffolds, antibodies or peptides that bind protein domains.
The Genome3D website is freely available at http://www.genome3d.eu. Structural annotations and 3D model predictions are currently available for three model genomes (Homo sapiens, E. coli and baker's yeast).
Sectors Agriculture, Food and Drink, Chemicals, Healthcare, Manufacturing, including Industrial Biotechnology, Pharmaceuticals and Medical Biotechnology

URL http://www.genome3d.eu
 
Description The Genome3D resource is used by pharmaceutical and biotech companies to investigate targets of phenotypic screens for druggability and for optimisation of initial hits. It is also of value in understanding mutations (nsSNPs) associated with human genetic variation and genetic disease.
First Year Of Impact 2013
Sector Chemicals, Healthcare, Pharmaceuticals and Medical Biotechnology
Impact Types Economic,Policy & public services

 
Description Chair, BBSRC (Tom Blundell)
Geographic Reach National 
Policy Influence Type Implementation circular/rapid advice/letter to e.g. Ministry of Health
Impact The BBSRC has been a leading influence in basic and strategic research underpinning agriculture and food, biotechnology, and animal and human health (One Health).
URL http://www.bbsrc.ac.uk
 
Description President of UK Science Council (Tom Blundell)
Geographic Reach National 
Policy Influence Type Influenced training of practitioners or researchers
Impact As President of the Science Council I have overseen the introduction of professional accreditation, such as CSci, across 40 professional societies. This is a major initiative to recognise the contributions of professional scientists in
URL http://sciencecouncil.org
 
Description Tom Blundell, Guest Lecturer International Chair of Therapeutic Innovation, an initiative of the Laboratory of Excellence in Research on Medication and Innovative Therapeutics: dissemination and training program.
Geographic Reach Europe 
Policy Influence Type Influenced training of practitioners or researchers
Impact Tom Blundell was Guest Lecturer and Discussant for three days in Paris as part of the dissemination and training program of the Laboratory of Excellence in Research on Medication and Innovative Therapeutics (LERMIT).
URL http://www.labex-lermit.fr/en/formation/chaire-internationale-d-innovation-therapeutique
 
Description Shorten-TB: Collaboration coordinated by FNIH, involving NIH, Cape Town and Dundee
Amount $3,000,000 (USD)
Funding ID FNIH #BLUN17STB SHORTEN-TB 
Organisation Bill and Melinda Gates Foundation 
Sector Charity/Non Profit
Country United States
Start 03/2017 
End 02/2020
 
Title Update of Site-directed Mutator software for prediction of impacts of mutations in genetic disease and antimicrobial resistance 
Description The computer program SDM uses genome information to predict the impacts of mutations. Much of the focus in the past year has been on antimicrobial resistance and target selection in the design of new antibacterials for tuberculosis. My comments refer to the version updated in the past year.
Type Of Material Improvements to research infrastructure 
Year Produced 2016 
Provided To Others? Yes  
Impact The method has been combined with other tools such as mCSM, a machine learning approach developed in the Blundell team, to analyse the genomes from different strains of tuberculosis and to identify causes of drug resistance 
URL http://mordred.bioc.cam.ac.uk/~sdm/sdm.php
 
Title Credo 
Description A database of protein interactions, including protein-protein and protein-ligand interactions
Type Of Material Database/Collection of data 
Year Produced 2013 
Provided To Others? Yes  
Impact Used to understand drug interactions with protein targets 
URL http://marid.bioc.cam.ac.uk/credo
 
Description Collaborative Research 
Organisation Wellcome Trust
Department Wellcome Trust Bloomsbury Centre
Country United Kingdom 
Sector Charity/Non Profit 
PI Contribution Development of computer programs to search for distant homologues (FUGUE) and to model proteins (Modeller). Expertise on relating structure to function.
Collaborator Contribution Complementary expertise in genomics, bioinformatics and classification of protein sequences and structures; functional annotation of sequences
Impact Continuing collaboration through employment of Harry Jubb
Start Year 2016
 
Description Identifying targets from phenotypic screening in tuberculosis 
Organisation University of Dundee
Department College of Life Sciences
Country United Kingdom 
Sector Academic/University 
PI Contribution Collaboration funded by the Gates Foundation to identify new targets for drug discovery arising from phenotypic screens. My team has contributed knowledge, databases and software focusing on protein targets in Mycobacterium tuberculosis.
Collaborator Contribution Dundee has contributed software and expertise in medicinal chemistry
Impact Talks in meetings identified elsewhere by various participants. Discussions with HIT-TB Consortium
Start Year 2013
 
Description Shorten-TB 
Organisation National Institute of Allergy and Infectious Diseases (NIAID)
Country United States 
Sector Public 
PI Contribution Analysis of structure, function and druggability of targets in tuberculosis
Collaborator Contribution Drug screening and development
Impact None yet
Start Year 2017
 
Description Shorten-TB 
Organisation University of Cape Town
Department Institute of Infectious Disease and Molecular Medicine (IIDMM)
Country South Africa 
Sector Academic/University 
PI Contribution Analysis of structure, function and druggability of targets in tuberculosis
Collaborator Contribution Drug screening and development
Impact None yet
Start Year 2017
 
Description Shorten-TB 
Organisation University of Dundee
Department College of Life Sciences
Country United Kingdom 
Sector Academic/University 
PI Contribution Analysis of structure, function and druggability of targets in tuberculosis
Collaborator Contribution Drug screening and development
Impact None yet
Start Year 2017
 
Description Antimicrobial Resistance Workshop 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Workshop with policy makers, clinicians, scientists and research students to understand and combat the impacts of antimicrobial resistance, mainly in tuberculosis
Year(s) Of Engagement Activity 2016
 
Description Cambridge Therapeutics Forum: Pharma, Biotech, Clinical School and University 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Industry/Business
Results and Impact A short presentation by Tom Blundell to a mixed group involved in research ecosystems, exemplified by the foundation of Astex in my lab, its progression to the science park, candidate drugs into man, phase III, and sale for $886 million.

The second talk was given by Sir Greg Winter.
Year(s) Of Engagement Activity 2015
URL http://www.onenucleus.com/cambridge-new-therapeutics-forum
 
Description Indian National Science Congress 2016, Mysore 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Policymakers/politicians
Results and Impact General theme: Science & Technology for Indigenous Development in India. Tom Blundell was Plenary Lecturer on drug discovery for infectious disease in India, where budgets have to be low: Open Source Drug Discovery, biotech spin-outs and academia in research ecosystems.
Year(s) Of Engagement Activity 2016
URL http://www.isc103.in
 
Description Joint Workshop for PhD researchers 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Postgraduate students
Results and Impact Increased use of Genome3D databases and software

Questions from PhD students about visiting the lab
Year(s) Of Engagement Activity 2014
 
Description School Visit by Tom Blundell (Suffolk) 
Form Of Engagement Activity A press release, press conference or response to a media enquiry/interview
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Public/other audiences
Results and Impact Talk and unveiling of a plaque to Dorothy Hodgkin commemorating her attendance at a secondary school in Beccles. Designed to encourage young students to do science, with the local MP and relatives of Dorothy attending. Extensive media coverage in local papers.
Year(s) Of Engagement Activity 2015
 
Description Tom Blundell appointed 8th Distinguished Technopreneur 2015, Singapore 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Policymakers/politicians
Results and Impact A discussion by Tom Blundell of research ecosystems, based on experience of forming companies in London and Cambridge, and looking at options for Singapore.

Discussions with Deputy Prime Minister of Singapore; visit of Head of Research to my company on the Science Park
Year(s) Of Engagement Activity 2015
URL http://www.science50.com.sg/dts.html
 
Description Two lectures at the University of Pretoria, the first to a broad audience of students, policy makers and teachers; the second to students from the local Ndebele township
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Undergraduate students
Results and Impact Lectures leading to small discussions with groups of mainly Ndebele-speaking undergraduate students, followed by a visit to the local township for discussions mediated by Dr Gugu Motshwene, an ex-student and now lecturer at the University of Pretoria
Year(s) Of Engagement Activity 2016
 
Description Weaver Endowed Lecture at UC Davis California 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Public/other audiences
Results and Impact A general lecture to a broad audience about the contributions of my science to drug discovery over the past 50 years
Year(s) Of Engagement Activity 2016