BBSRC-NSF/BIO Expanding the fold library in the twilight zone to facilitate structure determination of macromolecular machines

Lead Research Organisation: University College London
Department Name: Structural Molecular Biology

Abstract

Abstracts are not currently available in GtR for all funded research. This is normally because the abstract was not required at the time of proposal submission, but may be because it included sensitive information such as personal details.

Technical Summary

We will significantly expand structural coverage of sequence space by applying a powerful method, Rosetta, to predict structures of novel folds or of very remote homologues of known folds. Recent developments that detect co-evolving, contacting residues exploit vast sequence data and have revolutionised structural biology. The method is also valuable for modelling macromolecular assemblies, because it predicts co-varying residues that form interfaces.
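As a rough illustration of the co-variation idea, the sketch below scores pairs of alignment columns by mutual information. This is a deliberately simple stand-in, not the contact-prediction method used with Rosetta; the function names and the toy alignment are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def column_mutual_information(msa, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    # Ignore sequences that have a gap in either column.
    pairs = [(seq[i], seq[j]) for seq in msa if seq[i] != "-" and seq[j] != "-"]
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        # p_ab / (p_a * p_b), written with raw counts
        mi += p_ab * math.log2(p_ab * n * n / (left[a] * right[b]))
    return mi


def rank_covarying_pairs(msa, min_separation=1):
    """Return (score, i, j) tuples, highest-scoring column pairs first."""
    length = len(msa[0])
    scores = [
        (column_mutual_information(msa, i, j), i, j)
        for i, j in combinations(range(length), 2)
        if j - i >= min_separation
    ]
    return sorted(scores, reverse=True)


# Hypothetical toy alignment: columns 0, 1 and 3 vary together, column 2 is constant.
toy_msa = ["AKLE", "AKLE", "GRLD", "GRLD"]
ranked = rank_covarying_pairs(toy_msa)
```

Deeper and more diverse alignments give more reliable column statistics, which is why the depth of the MSA matters for this kind of signal.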

The quality of Rosetta models will be improved by using multiple sequence alignments (MSAs) from FunFams, clusters of structurally coherent relatives. FunFams will be vastly expanded with metagenome sequences to increase sequence diversity, giving deeper, more informative MSAs. The Baker group established a vast library of metagenome sequences from collaborations with the Joint Genome Institute (14522 metagenome sets). The library also includes 27 algal, 92 plant, 772 fungal, 142 worm, 48 bird, 93 insect, 370 eukaryotic genomes from Ensembl and 1915 curated fungal genomes, giving a total of 9 billion sequences. Fast protocols that generate coarse clusters will cope with this volume of data by exploiting k-mer hashing, followed by HMM-HMM protocols. Subsequently, the FunFamer algorithm will identify structurally coherent FunFams in each coarse cluster. It uses HHpred for HMM-HMM comparison and GroupSim for detecting specificity-determining positions (SDPs), and it generates multiple sequence alignments using MAFFT.
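The coarse clustering step can be pictured with a minimal sketch. The summary says only that k-mer hashing is used, so the bottom-s MinHash signature, the greedy assignment, the threshold and all function names below are assumptions made for illustration, not the project's protocol.

```python
import hashlib


def kmer_hashes(sequence, k=3):
    """Return the set of 64-bit hashes of all overlapping k-mers."""
    return {
        int.from_bytes(
            hashlib.blake2b(sequence[i:i + k].encode(), digest_size=8).digest(), "big"
        )
        for i in range(len(sequence) - k + 1)
    }


def minhash_sketch(sequence, k=3, sketch_size=16):
    """Keep the smallest hashes as a fixed-size signature (bottom-s MinHash)."""
    return frozenset(sorted(kmer_hashes(sequence, k))[:sketch_size])


def sketch_similarity(a, b):
    """Jaccard similarity of two sketches, a rough proxy for k-mer overlap."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def coarse_clusters(sequences, threshold=0.5, k=3, sketch_size=16):
    """Greedy clustering: join the first cluster whose founder is similar enough."""
    representatives = []  # sketches of cluster founders
    clusters = []         # lists of sequence identifiers
    for name, seq in sequences.items():
        sketch = minhash_sketch(seq, k, sketch_size)
        for idx, rep in enumerate(representatives):
            if sketch_similarity(sketch, rep) >= threshold:
                clusters[idx].append(name)
                break
        else:
            # No existing cluster matched, so this sequence founds a new one.
            representatives.append(sketch)
            clusters.append([name])
    return clusters
```

Comparing small fixed-size signatures rather than full sequences is what keeps this kind of first pass cheap; the resulting coarse clusters would then go to the slower HMM-HMM comparison.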

Interface residues predicted by Rosetta will enhance PISA predictions of biological assemblies. We will analyse assemblies annotated in the PDB to validate predictions and use known interface information from IntAct.

Predicted structures will be integrated in Genome3D and novel confidence measures developed. Novel web visualisations will show known and predicted structures, enabling clear differentiation. Complementing experimental structures in PDB with predicted models in Genome3D will help elucidation of large structural complexes by EM and by molecular replacement.

Planned Impact

The accuracy and reliability of predicted 3D structure models built from close homologues (>50% sequence identity) is clearly demonstrated by their frequent use in X-ray structure determination pipelines as templates for molecular replacement. Recently, powerful new approaches have emerged that allow prediction of reliable 3D structure models for more remote homologues, even below 30% sequence identity, based on predicted residue contacts. These approaches use co-variation information derived from vast amounts of sequence data. The methods also facilitate modelling of molecular assemblies by predicting cross-subunit co-variation of residues forming the assembly interfaces.

Many predicted 3D models are not archived in a centralised repository, but a recent BBSRC-funded resource, Genome3D, integrates predicted 3D models built by complementary methods for sequences of important model organisms (e.g. human, mouse, wheat, E. coli). Genome3D is therefore also the obvious home for the models derived using residue co-variation information, and this project will significantly expand Genome3D with accurate models for protein domains that are remote in sequence from known structures and likely to have significant structural novelty. We will build a web portal displaying known and predicted structures together to ensure maximum impact of the experimentally and computationally obtained 3D structure models. We will also develop appropriate visualisations that allow users to easily distinguish the experimentally determined models and annotations from the computationally derived structure models and predicted annotations.

Major beneficiaries of this data will be structural biologists who will be able to use the expanded library of domain structures for molecular replacement and for interpreting electron microscopy data. These libraries and associated predictions of interface residues will considerably facilitate the assembly of large macromolecular complexes and thereby provide important insights into the biological role of the proteins.

The other major beneficiaries of this new portal will be biologists in academia and industry using the structural data to guide drug design and the design of new proteins. Protein structure data is also key to understanding whether a residue mutation is likely to disrupt the structure or modify the function of a protein. Extensive next generation sequencing projects increasingly reveal these genetic variations (e.g. for different strains of wheat) and biomedical researchers and food biologists will therefore greatly benefit from being able to interpret this variant data in a structural context.

In 2017, the structure data in PDB was downloaded >500 million times by >500K distinct users via the PDBe website (pdbe.org). Genome3D (genome3d.eu) is a relatively new resource with lower exposure, but BBSRC funds integration of the Genome3D data in InterPro, a very highly accessed resource with >90,000 users per month. There will therefore be a large user community of life scientists from academia and industry who will benefit from the availability of these data.

In summary, the impact will be realised by:
1. Direct use of the resources by the non-academic sector such as pharmaceutical companies who extensively use macromolecular structure models in target identification and design of compounds. The availability of the combined experimental and computational models will also help in the design of modified proteins by the synthetic biology community.
2. The models will also aid interpretation of the impact of disease specific variants providing possible molecular explanations for the observed phenotypes.
3. The structural biology community will benefit from access to 3D structure models together with the information on interfaces for interpretation of Electron Microscopy electric potential maps. The models will also serve as search templates for molecular replacement in crystallographic structure determination pipelines.

Publications

 
Description We analysed the value of using multiple sequence alignments from CATH functional families as input to protein structure prediction by the Rosetta method. In some cases these were helpful; in others the alignments were too shallow.
We compared the quality of RoseTTAFold models with AlphaFold models for a set of structurally uncharacterised, disease-associated human proteins. In most cases the AlphaFold models had high quality, but a reasonable number of RoseTTAFold models were of better quality.
We modelled the human proteins whose mutations lead to disease conditions in collaboration with experts in protein structure prediction in America (the David Baker group). We also extracted models of these proteins from the state-of-the-art AlphaFold database. Using two modelling techniques gave us higher confidence in the predicted models. We have developed a method to predict the impact of disease-associated mutations based on their proximity to predicted functional sites, using our CATH-FunFams and other publicly available tools. We also included stability and pathogenicity predictors to study the impact of these mutations. Mutations close to functional sites, or predicted to be pathogenic or destabilising, might have functional and structural impacts on the protein, leading to a disease phenotype. This work has now been published in Briefings in Bioinformatics.
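The proximity criterion can be sketched as follows. This is a minimal illustration, assuming one C-alpha coordinate per residue and a distance cutoff; the 8 angstrom value, the function names and the input layout are hypothetical and are not taken from the published method.

```python
import math


def min_distance_to_sites(coords, mutated_residue, site_residues):
    """Smallest C-alpha distance (angstroms) from the mutated residue to any site residue."""
    origin = coords[mutated_residue]
    return min(math.dist(origin, coords[r]) for r in site_residues)


def flag_mutations_near_sites(coords, mutations, site_residues, cutoff=8.0):
    """Return the mutated positions lying within `cutoff` angstroms of a predicted site."""
    return [
        pos for pos in mutations
        if min_distance_to_sites(coords, pos, site_residues) <= cutoff
    ]


# Hypothetical coordinates keyed by residue number (x, y, z in angstroms).
example_coords = {10: (0.0, 0.0, 0.0), 11: (3.8, 0.0, 0.0), 50: (30.0, 0.0, 0.0)}
near_site = flag_mutations_near_sites(example_coords, [11, 50], site_residues=[10])
```

In practice such a flag would be combined with the stability and pathogenicity predictions mentioned above rather than used on its own.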
Exploitation Route The 3D models generated by this project will be valuable for understanding the structural and functional impacts of genetic variations, i.e. residue mutations, linked to diseases (in humans or agricultural organisms) or to antimicrobial resistance in bacterial proteins.

Some of the models are very high quality and may be useful for drug design.
Sectors Digital/Communication/Information Technologies (including Software),Manufacturing, including Industrial Biotechnology,Pharmaceuticals and Medical Biotechnology

 
Description Our comparison of AlphaFold and RoseTTAFold models for human disease proteins has been published and will provide useful information for users in the pharmaceutical industry working on these proteins. The methods we developed for analysing the impacts of variants in proteins are being added to a Nextflow workflow and will be helpful to users in academia and industry analysing variant impacts.
First Year Of Impact 2022
Sector Digital/Communication/Information Technologies (including Software),Healthcare,Manufacturing, including Industrial Biotechnology,Pharmaceuticals and Medical Biotechnology
 
Description 3D Beacons Network 
Organisation EMBL European Bioinformatics Institute (EMBL - EBI)
Country United Kingdom 
Sector Academic/University 
PI Contribution Our research team has been part of the core development team responsible for the API schema, architecture and tools underpinning the 3D Beacons framework. Specifically, we have taken direct responsibility for delivering an example client implementation of the 3D Beacons API, with the intention that this will allow research groups with minimal coding skills and/or technical resources to be part of the 3D Beacons network.
Collaborator Contribution The PDBe team have direct responsibility for delivering the 3D Beacons "Hub", which gathers information from all the nodes on the network. This also includes developing the front-end public web pages through which users can access these data. The SWISS-MODEL team are responsible for the quality metrics that are used to normalise the predicted models from various methods.
Impact Aims of the project: 1.) A major aim of the 3D-Gateway project (renamed as 3D Beacons) is to combine access to experimental and predicted structures to increase the coverage of structure data and structure-based functional annotations available for UniProt sequences from key model organisms linked to human health and agriculture. To increase the predicted structural data we will expand Genome3D and develop a mechanism (the 3D-Beacons network) for providing access to models from other model providers (i.e., SWISS-MODEL, Rosetta, ModBase) in order to significantly increase coverage, and to increase reliability by assessing agreement across multiple models. 3D-Gateway will increase the amount of structural information available for UniProt sequences at least 10-fold based on Genome3D data, and considerably more taking into account projected model acquisition from the other external resources. 2.) As well as increasing the structural information available for UniProt, the 3D-Gateway project will integrate structure-based functional annotations from the PDBe Knowledge Base (PDBe-KB) with the predicted models. These annotations will also be used to build new UniRules, the consensus rules used in the annotation of UniProt sequences, including the assignment of functional residues. These data on functional motifs on 3D structures/models will enable a significant expansion of annotations of automatically curated sequences in UniProt. 3.) Another goal is to make these structural data and added-value annotations available to non-expert users by building web pages for displaying the 3D structure models (both experimental and predicted) and added-value annotations for UniProt sequences. We will ensure that the information is presented in a way that clearly demonstrates data provenance. Web-based teaching materials and workshops will help biologists to exploit the new data and understand their benefits and limitations.
This project has become more timely due to the improved accuracy of protein structure modelling brought by advances in AlphaFold and the pending release of very large scale data (~100m structural models). 3D Beacons will ensure that structural analysis becomes more central to biological and biomedical research. Outcomes to date: 1) Established the specification of the API that defines communication protocols across the 3D Beacons Network; 2) Established the 3D Beacons Hub to aggregate queries and responses across the Network; 3) Developed an exemplar client implementation to enable groups to join the 3D Beacons Network; 4) The Hub web pages have been launched and are available to the public; 5) 3D Beacons has been adopted as a central activity of the ELIXIR 3DBioInfo structural bioinformatics community.
Start Year 2019
 
Description 3D Beacons Network 
Organisation University of Basel
Department Biozentrum Basel
Country Switzerland 
Sector Academic/University 
PI Contribution Our research team has been part of the core development team responsible for the API schema, architecture and tools underpinning the 3D Beacons framework. Specifically, we have taken direct responsibility for delivering an example client implementation of the 3D Beacons API, with the intention that this will allow research groups with minimal coding skills and/or technical resources to be part of the 3D Beacons network.
Collaborator Contribution The PDBe team have direct responsibility for delivering the 3D Beacons "Hub", which gathers information from all the nodes on the network. This also includes developing the front-end public web pages through which users can access these data. The SWISS-MODEL team are responsible for the quality metrics that are used to normalise the predicted models from various methods.
Impact Aims of the project: 1.) A major aim of the 3D-Gateway project (renamed as 3D Beacons) is to combine access to experimental and predicted structures to increase the coverage of structure data and structure-based functional annotations available for UniProt sequences from key model organisms linked to human health and agriculture. To increase the predicted structural data we will expand Genome3D and develop a mechanism (the 3D-Beacons network) for providing access to models from other model providers (i.e., SWISS-MODEL, Rosetta, ModBase) in order to significantly increase coverage, and to increase reliability by assessing agreement across multiple models. 3D-Gateway will increase the amount of structural information available for UniProt sequences at least 10-fold based on Genome3D data, and considerably more taking into account projected model acquisition from the other external resources. 2.) As well as increasing the structural information available for UniProt, the 3D-Gateway project will integrate structure-based functional annotations from the PDBe Knowledge Base (PDBe-KB) with the predicted models. These annotations will also be used to build new UniRules, the consensus rules used in the annotation of UniProt sequences, including the assignment of functional residues. These data on functional motifs on 3D structures/models will enable a significant expansion of annotations of automatically curated sequences in UniProt. 3.) Another goal is to make these structural data and added-value annotations available to non-expert users by building web pages for displaying the 3D structure models (both experimental and predicted) and added-value annotations for UniProt sequences. We will ensure that the information is presented in a way that clearly demonstrates data provenance. Web-based teaching materials and workshops will help biologists to exploit the new data and understand their benefits and limitations.
This project has become more timely due to the improved accuracy of protein structure modelling brought by advances in AlphaFold and the pending release of very large scale data (~100m structural models). 3D Beacons will ensure that structural analysis becomes more central to biological and biomedical research. Outcomes to date: 1) Established the specification of the API that defines communication protocols across the 3D Beacons Network; 2) Established the 3D Beacons Hub to aggregate queries and responses across the Network; 3) Developed an exemplar client implementation to enable groups to join the 3D Beacons Network; 4) The Hub web pages have been launched and are available to the public; 5) 3D Beacons has been adopted as a central activity of the ELIXIR 3DBioInfo structural bioinformatics community.
Start Year 2019
 
Description PDBe 
Organisation EMBL European Bioinformatics Institute (EMBL - EBI)
Country United Kingdom 
Sector Academic/University 
PI Contribution Our resource CATH provides high-quality annotations that improve the information provided by the PDBe, primarily by locating structural domains and identifying distant evolutionary relationships between known protein structures. Our Gene3D resource provides structural annotations for genome sequences from ~20,000 species. These annotations are also incorporated in the Genome3D resource for selected model organisms. Collaborations between research groups involved in the Genome3D project have resulted in a high-quality mapping between the CATH and SCOP structural classification databases. This is being implemented by the PDBe to improve the clarity and coverage of structural annotations in their resource. We currently have a BBSRC BBR-funded collaboration with PDBe and InterPro to provide our CATH-Gene3D structural annotations to these resources via the Genome3D portal.
Collaborator Contribution Host, maintain and curate the central PDBe resource and website.
Impact Publications Community resources to further scientific research.
Start Year 2006
 
Description Oral presentation at ISCB/ISMB 2022 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact This work was presented in the 3D-SIG track of the ISCB/ISMB 2022 conference. Around 30 researchers/scientists attended the talk. The models of human disease-associated proteins were used to infer predicted functional sites. We then checked whether known mutations were in close proximity to these functional sites and hence whether they might affect the functioning of the proteins. We also checked the structural effect and pathogenicity of these mutations using known predictors.
Year(s) Of Engagement Activity 2022