BBSRC-NSF/BIO - Expanding fold library in the twilight zone to facilitate structure determination of macromolecular machines

Lead Research Organisation: European Bioinformatics Institute
Department Name: MSCB Macromolec, structural and chem bio

Abstract

The Protein Data Bank (PDB) is the single global archive of three-dimensional (3D) structures of large biological molecules. PDBe (pdbe.org) is the European partner in the global consortium managing the PDB. The PDB is one of the oldest biological archives, with 144,000+ entries and nearly 2 million downloads daily by users worldwide in academic and industrial settings, working on topics ranging from food security and human health through to the design of more efficient enzymes for biotechnology. Despite a steady increase in its holdings (13,000+ entries added in 2017), the growth of the PDB is far outstripped by the growth in the available protein sequence data.

Resources like Genome3D (genome3d.eu), funded by the BBSRC, aim to fill the gap in structure coverage of the protein sequence space with reliable predictions of structures. This resource combines data from a number of UK and overseas groups who apply complementary methods for protein structure prediction. These approaches largely model proteins that are closely related to a protein of known structure (i.e. the protein relatives share more than 30% identical residues in their sequences). The Rosetta method for predicting protein structures, a world-leading approach developed by the Baker lab in the USA, was recently enhanced with information derived from evolutionary analyses of protein sequence data, yielding reliable models even for cases where sequence identity between the model and the available experimental structures is very low (below 30%). We will integrate Rosetta models into Genome3D to expand the coverage of structural data for organisms important for health (e.g. human) and food security (e.g. wheat).

This project will also enrich both the experimentally determined and computationally predicted structures with valuable functional annotations, such as information pertaining to surface interfaces, a key ingredient in understanding how proteins interact with each other and with other biological molecules. By focussing on proteins dissimilar to those with known structures, this portal will help fill the gaps in structure coverage of the protein sequence space and will make structure data much more readily available and accessible. Finally, novel visualisation tools integrating the presentation of the predicted and experimentally determined structures will be developed, maintaining a clear distinction between what is predicted and what is experimentally determined.

The expanded set of 3D models derived from this project will in turn help to expand the coverage of sequence space even further, since these models can be used to guide the experimental determination of protein structures being obtained by powerful new structural biology techniques like cryo-Electron Microscopy (EM). This project will also endeavour, where possible, to improve the assembly of individual protein structures into macromolecular complexes which can be analysed to determine their biological role.

We anticipate that scientists in both academic and industrial sectors (e.g. pharmaceutical companies) will benefit from access to such an integrated portal, assisting them in designing new medicines, understanding the mechanism of disease, or in designing proteins with novel properties. The recent "resolution revolution" in Electron Microscopy allows near-routine determination of the structures of large molecular machines, but interpreting the experimental results requires a large repertoire of "building blocks". The new portal will partially address this need by providing expanded domain structure libraries. The portal will also offer programmatic access to the assembled data, benefiting power users such as software developers and maintainers of other resources.

Technical Summary

We will significantly expand structural coverage of sequence space by applying a powerful method, Rosetta, to predict structures of novel folds or of very remote homologues of known folds. Recent developments in detecting co-evolving, contacting residues exploit vast sequence data and have revolutionised structural biology. The method is also valuable for macromolecular assembly, since it can predict the co-varying residues that form interfaces.
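As an illustration of how co-varying residue pairs can be detected from sequence data, the following minimal sketch scores pairs of alignment columns by mutual information. This is a deliberately simple stand-in and is not the method used by Rosetta or by this project, which rely on more sophisticated statistical models; the function names and the toy alignment are assumptions made purely for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def column(msa, i):
    """Return the residues found at alignment position i in every sequence."""
    return [seq[i] for seq in msa]


def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), n_ab in count_ab.items():
        p_ab = n_ab / n
        mi += p_ab * math.log2(p_ab / ((count_a[a] / n) * (count_b[b] / n)))
    return mi


def top_covarying_pairs(msa, top_n=3):
    """Rank all column pairs of an equal-length MSA by mutual information."""
    length = len(msa[0])
    scores = [
        ((i, j), mutual_information(column(msa, i), column(msa, j)))
        for i, j in combinations(range(length), 2)
    ]
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:top_n]


# Toy alignment (hypothetical): positions 0 and 3 always change together.
toy_msa = ["AKLE", "AKLE", "DKLK", "DKLK", "ARLE", "DRLK"]
best_pair, best_score = top_covarying_pairs(toy_msa, top_n=1)[0]
print(best_pair, round(best_score, 3))
```

In the toy alignment, positions 0 and 3 form the highest-scoring pair because a change at one is always accompanied by a change at the other. Real pipelines additionally correct for phylogenetic bias and indirect couplings, which this sketch ignores.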

The quality of Rosetta models will be improved by using multiple sequence alignments (MSAs) from FunFams, clusters of structurally coherent relatives. FunFams will be vastly expanded with metagenome sequences to increase sequence diversity, giving deeper, more informative MSAs. The Baker group has established a vast library of metagenome sequences through collaborations with the Joint Genome Institute (14,522 metagenome sets); the library also includes 27 algal, 92 plant, 772 fungal, 142 worm, 48 bird, 93 insect, 370 eukaryotic genomes from Ensembl and 1,915 curated fungal genomes, giving a total of 9 billion sequences. Fast protocols to generate coarse clusters will cope with this volume of data by exploiting k-mer hashing, followed by HMM-HMM protocols. Subsequently, the FunFamer algorithm will identify structurally coherent FunFams in each coarse cluster. FunFamer uses HHpred for HMM-HMM comparison and groupsim for the detection of specificity-determining positions (SDPs), and generates multiple sequence alignments using MAFFT.
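The idea of coarse clustering by k-mer hashing can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual protocol: it hashes each sequence's k-mers, keeps a small "bottom-s" MinHash sketch, and greedily assigns each sequence to the first cluster whose representative has a sufficiently high estimated Jaccard similarity. The function names, the k-mer length, the sketch size and the threshold are all hypothetical choices.

```python
import zlib


def kmer_hashes(seq, k=3):
    """Hash every overlapping k-mer of a sequence (CRC32 is deterministic)."""
    return {zlib.crc32(seq[i:i + k].encode()) for i in range(len(seq) - k + 1)}


def sketch(seq, k=3, size=16):
    """Bottom-s MinHash sketch: the `size` smallest k-mer hash values."""
    return sorted(kmer_hashes(seq, k))[:size]


def estimated_jaccard(sk_a, sk_b, size=16):
    """Estimate the Jaccard similarity of two k-mer sets from their sketches."""
    merged = sorted(set(sk_a) | set(sk_b))[:size]
    if not merged:
        return 0.0
    shared = set(sk_a) & set(sk_b)
    return sum(1 for h in merged if h in shared) / len(merged)


def coarse_clusters(seqs, k=3, size=16, threshold=0.5):
    """Greedy clustering: join the first cluster whose representative matches."""
    clusters = []  # list of (representative sketch, member indices)
    for idx, seq in enumerate(seqs):
        sk = sketch(seq, k, size)
        for rep, members in clusters:
            if estimated_jaccard(sk, rep, size) >= threshold:
                members.append(idx)
                break
        else:
            clusters.append((sk, [idx]))
    return [members for _, members in clusters]


# Hypothetical input: two identical sequences and one unrelated sequence.
example = [
    "MKTAYIAKQRQISFVKSHFSRQ",
    "MKTAYIAKQRQISFVKSHFSRQ",
    "GGGGGGGGGGWWWWWWWWWW",
]
print(coarse_clusters(example))
```

Because only fixed-size sketches are compared, the cost per comparison does not grow with sequence length, which is what makes this family of techniques attractive for billions of sequences. The finer, more sensitive grouping would then be left to profile-based (HMM-HMM) comparison, as described above.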

Rosetta-predicted "interface" residues will enhance PISA predictions of biological assemblies. We will analyse assemblies annotated in the PDB to validate predictions and use known interface information from IntAct.

Predicted structures will be integrated in Genome3D and novel confidence measures developed. Novel web visualisations will show known and predicted structures, enabling clear differentiation. Complementing experimental structures in PDB with predicted models in Genome3D will help elucidation of large structural complexes by EM and by molecular replacement.

Planned Impact

The accuracy and reliability of predicted 3D structure models built from close homologues (>50% sequence identity) is clearly demonstrated by their frequent use in X-ray structure determination pipelines as templates for molecular replacement. Recently, powerful new approaches have emerged that allow prediction of reliable 3D structure models for more remote homologues, even below 30% sequence identity, based on predicted residue contacts. These approaches use co-variation information derived from vast amounts of sequence data. The methods also facilitate modelling of molecular assemblies by predicting cross-subunit co-variation of residues forming the assembly interfaces.
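The sequence identity thresholds quoted above can be made concrete with a small sketch. It computes percent identity over the aligned (non-gap) columns of a pairwise alignment and labels the result using the >50% and <30% cut-offs mentioned in the text. The choice of denominator and the "intermediate" label for the range in between are assumptions made for illustration; other definitions of sequence identity are in common use.

```python
def percent_identity(aln_a, aln_b, gap="-"):
    """Percent identity over columns where neither aligned sequence has a gap."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")
    aligned = [(a, b) for a, b in zip(aln_a, aln_b) if a != gap and b != gap]
    if not aligned:
        return 0.0
    matches = sum(a == b for a, b in aligned)
    return 100.0 * matches / len(aligned)


def homology_regime(identity):
    """Label an identity value using the thresholds quoted in the text."""
    if identity > 50:
        return "close"          # typical molecular replacement templates
    if identity < 30:
        return "remote"         # the "twilight zone" targeted by this project
    return "intermediate"       # hypothetical label for the range in between


# Hypothetical aligned pair: four matches over five gap-free columns.
identity = percent_identity("ACDE-G", "ACDFHG")
print(identity, homology_regime(identity))
```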

Many predicted 3D models are not archived in a centralised repository, but a recent BBSRC-funded resource, Genome3D, integrates predicted 3D models built by complementary methods for sequences of important model organisms (e.g. human, mouse, wheat, E. coli). Genome3D is therefore also the obvious home for the models derived using residue co-variation information, and this project will significantly expand Genome3D with accurate models for protein domains that are remote in sequence from known structures and likely to have significant structural novelty. We will build a web portal displaying known and predicted structures together to ensure maximum impact of the experimentally and computationally obtained 3D structure models, and we will develop appropriate visualisations that allow users to easily distinguish the experimentally determined models and annotations from the computationally derived structure models and predicted annotations.

Major beneficiaries of this data will be structural biologists who will be able to use the expanded library of domain structures for molecular replacement and for interpreting electron microscopy data. These libraries and associated predictions of interface residues will considerably facilitate the assembly of large macromolecular complexes and thereby provide important insights into the biological role of the proteins.

The other major beneficiaries of this new portal will be biologists in academia and industry using the structural data to guide drug design and the design of new proteins. Protein structure data is also key to understanding whether a residue mutation is likely to disrupt the structure or modify the function of a protein. Extensive next generation sequencing projects increasingly reveal these genetic variations (e.g. for different strains of wheat) and biomedical researchers and food biologists will therefore greatly benefit from being able to interpret this variant data in a structural context.

In 2017, the structure data in the PDB was downloaded >500 million times by >500K distinct users via the PDBe website (pdbe.org). Genome3D (genome3d.eu) is a relatively new resource with lower exposure, but BBSRC funds integration of the Genome3D data in InterPro, a very highly accessed resource with >90,000 users per month. There will therefore be a large user community of life scientists from academia and industry who will benefit from the availability of these data.

In summary, the impact will be realised by:
1. Direct use of the resources by the non-academic sector such as pharmaceutical companies who extensively use macromolecular structure models in target identification and design of compounds. The availability of the combined experimental and computational models will also help in the design of modified proteins by the synthetic biology community
2. The models will also aid interpretation of the impact of disease specific variants providing possible molecular explanations for the observed phenotypes
3. The structural biology community will benefit from access to 3D structure models together with the information on interfaces for interpretation of Electron Microscopy electric potential maps. The models will also serve as search templates for molecular replacement in crystallographic structure determination pipelines
 
Description Collaboration on covariation analysis with Daniel Rigden 
Organisation University of Liverpool
Country United Kingdom 
Sector Academic/University 
PI Contribution We shared datasets for the analysis of covariation signals.
Collaborator Contribution They shared some of their data, and also presented data visualisation and analysis plans.
Impact Shared data
Start Year 2020
 
Title Covariations package 
Description This Python package is a data pipeline for calculating covariation signals. The initial release focuses on homomeric complexes. 
Type Of Technology Software 
Year Produced 2023 
Open Source License? Yes  
Impact We use this pipeline to generate covariation data across the complete PDB archive. 
URL https://github.com/PDBe-KB/pdbe_covariations
 
Title PISA analysis package 
Description A Python package for analysing the interaction output produced by PISA. 
Type Of Technology Software 
Year Produced 2023 
Open Source License? Yes  
Impact We use this package to prepare interactions data for loading into PDBe. 
URL https://github.com/PDBe-KB/pisa-analysis
 
Description 3D-BioInfo Annual Meeting 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact PDBe-KB (FunPDBe, BioChemGraph, covariation-related works), 3D-Beacons and AlphaFold DB were presented at the 3D-BioInfo Annual Meeting 2021.
Year(s) Of Engagement Activity 2021
 
Description ELIXIR 3D-BioInfo AGM 2022 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact This presentation gave an overview of PDBe-KB, AlphaFold DB and 3D-Beacons to around 100 international attendees.
Year(s) Of Engagement Activity 2022
 
Description Infection Biology Retreat 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact PDBe-KB, 3D-Beacons and the AlphaFold DB were presented at the EMBL Infection Biology Retreat in the context of infectious diseases.
Year(s) Of Engagement Activity 2021
 
Description PDBe-KB & 3D-Beacons presentation at Annual General Meeting of the 3D-BioInfo ELIXIR Community 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact PDBe-KB, 3D-Beacons and the covariation project were presented during the Annual General Meeting of the 3D-BioInfo ELIXIR Community.
Year(s) Of Engagement Activity 2020
 
Description PDBe-KB at ECCB 2022 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact A presentation that gave an update on the latest developments in PDBe-KB, including work related to the BioChemGraph project.
Year(s) Of Engagement Activity 2022