3D-Gateway - Gateway to protein structure and function

Lead Research Organisation: University College London
Department Name: Structural Molecular Biology

Abstract

Proteins comprise long chains of organic molecules (amino acids) that fold into compact, globular 3-dimensional structures. Knowing this structure can give valuable insights into the clefts, pockets or other surface features important for binding other molecules in the cell, eg small molecules or proteins. Knowledge of the structure is also essential for designing drugs that bind to these features and inhibit the protein. It can also help in understanding whether mutations in the protein's residues affect its stability or function, leading to disease.

Experimentally determining the structure can be challenging, which is why only a small percentage of known proteins (~145,000 out of 120 million) have been characterised. However, powerful computational methods have been developed that predict protein structures by inheriting structural information from evolutionarily related proteins whose structures are known. These prediction techniques have recently been made even more powerful, as new ways of exploiting the evolutionary data have been found that more accurately constrain contacts in the protein.

Applying these techniques, structures can be predicted for a large proportion of uncharacterised proteins. For example, for human proteins about 5% of the structures are known but a further 88% can be modelled, some to very high accuracy, thereby providing important frameworks for designing drugs to treat human diseases. When inheriting structural data between distant relatives, one has to be much more cautious, and most prediction methods return a confidence score for the models produced.

This project will build an infrastructure (3D-Beacons) that aggregates experimentally determined structures with predicted structures generated by groups applying different algorithms. This will be done for proteins from selected organisms relevant to food security and human health - some will be pathogenic bacteria that threaten humans or animals/crops.

We will use this data to annotate proteins in the UniProt resource, widely used by more than 750,000 unique users each month. Since the prediction methods reside in many different labs, by pooling the data in this way we can significantly increase the number of proteins with structural data. In addition, combining models built by independent algorithms allows us to compare 3D-models to find which parts agree regardless of method and which parts vary between methods and are clearly harder to model. Therefore, we will use this aggregated data to research the best strategies for calculating model quality at each position in the protein.
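The idea of comparing independently built models position by position can be sketched in a few lines. This is only an illustration, not the project's method: it assumes every model covers the same residues, that the models have already been superposed into a common frame, and that one C-alpha coordinate per residue is available. The function names and the 2.0 Angstrom cutoff are hypothetical choices made for the example.

```python
from itertools import combinations
from math import dist


def per_residue_spread(models):
    """Mean pairwise C-alpha distance at each position across models.

    `models` maps a (hypothetical) method name to a list of (x, y, z)
    tuples, one per residue. All models are assumed to be superposed
    already and to have the same length.
    """
    names = list(models)
    lengths = {len(models[name]) for name in names}
    if len(names) < 2 or len(lengths) != 1:
        raise ValueError("need at least two models of equal length")
    n_residues = lengths.pop()
    spread = []
    for i in range(n_residues):
        # Distances between every pair of methods at this position.
        pairs = [dist(models[a][i], models[b][i])
                 for a, b in combinations(names, 2)]
        spread.append(sum(pairs) / len(pairs))
    return spread


def split_by_agreement(spread, cutoff=2.0):
    """Return 1-based positions where models agree and where they vary.

    The cutoff (in Angstroms) is an arbitrary illustrative value.
    """
    agree = [i for i, s in enumerate(spread, start=1) if s <= cutoff]
    vary = [i for i, s in enumerate(spread, start=1) if s > cutoff]
    return agree, vary


if __name__ == "__main__":
    toy = {
        "method_a": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
        "method_b": [(0.0, 0.0, 0.0), (1.0, 0.0, 5.0)],
        "method_c": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    }
    print(split_by_agreement(per_residue_spread(toy)))
```

Positions with a small spread are those where the methods agree; positions with a large spread are candidates for lower per-residue confidence.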

We will build web pages to display the known and predicted structures for a given protein. It can be difficult to determine the structure of the whole protein so, where appropriate, we will display both experimental and predicted structures, taking great care to label the structures with information on the source (eg method used) and reliability of the data (eg confidence).

We will also use our 3D-Beacons infrastructure to aggregate information on known and predicted functional sites on the protein structure and display this data on web pages, together with information on source and confidence. The site data mapped onto structure will be particularly helpful for developing rules that allow us to gauge whether a protein with no experimental characterisation has the same function as an evolutionarily related protein with experimental characterisation. Relatives sharing the same function should have the same key functional site residues. With these rules we will be able to provide structural and functional annotations for millions of proteins in UniProt. The new data will represent a tenfold or more increase in the number of UniProt sequences which have structural and functional site information. UniProt is also widely used by researchers in industry and thus this expansion in information will have a very significant impact.
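The principle that relatives sharing a function should share the key functional site residues can be illustrated with a very small check. This is a sketch under stated assumptions, not the UniRule logic: it assumes a pairwise alignment is already available as two equal-length strings with "-" for gaps, and that functional site positions are given as 1-based positions in the ungapped characterised (reference) sequence. The function name is hypothetical, and exact residue identity is a deliberately strict stand-in for whatever criterion a real rule would use.

```python
def site_is_conserved(aligned_ref, aligned_query, site_positions):
    """Check that the query keeps the reference's functional site residues.

    aligned_ref, aligned_query: equal-length aligned sequences, '-' = gap.
    site_positions: 1-based positions in the ungapped reference sequence.
    Returns True only if every site position is aligned to an identical
    residue in the query.
    """
    if len(aligned_ref) != len(aligned_query):
        raise ValueError("aligned sequences must have equal length")
    wanted = set(site_positions)
    seen = set()
    ref_pos = 0
    for ref_res, query_res in zip(aligned_ref, aligned_query):
        if ref_res == "-":
            continue  # insertion in the query; no reference position here
        ref_pos += 1
        if ref_pos in wanted:
            seen.add(ref_pos)
            if query_res != ref_res:
                return False  # substituted or deleted site residue
    # Any site position beyond the end of the reference counts as a failure.
    return seen == wanted


if __name__ == "__main__":
    # Toy alignment: reference sites at ungapped positions 3 (H) and 4 (D).
    print(site_is_conserved("MK-HDS", "MKAHDT", [3, 4]))
```

In practice a rule would probably tolerate conservative substitutions and attach a confidence value rather than return a plain yes or no.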

Technical Summary

Despite significant advances in protein structure determination, the majority of proteins have no experimental structural data. Significant improvements in structure prediction methods can fill this gap and provide valuable data for understanding protein functions. Presently, structure data are archived in distinct resources (the PDB for experimental structures, and Genome3D and other specialist resources for predicted models), which impedes access by the wider user community. The 3D-Beacons infrastructure will allow seamless access to all structure models, providing a mechanism for maximising structural coverage of UniProtKB.

The 3D-Beacons network will also simplify the comparison of models from different model repositories, allowing development of better confidence measures. Collaboration between resources will ensure the sustainability of the system. The proposed uniform data access mechanism (REST API) will simplify integration of structural data by other resources such as InterPro and Ensembl and by tools such as Jalview, Chimera and PyMOL, providing an essential foundation for understanding the impacts of genetic variations on protein functions. Access to model structures is also valuable in structure determination and analysis pipelines.
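A client of such a uniform REST API might look roughly like the sketch below. Everything specific here is an assumption made for illustration: the base URL, the "uniprot/summary/<accession>.json" path and the "structures" / "summary" / "provider" / "model_identifier" field names are not taken from this document and should be checked against the published API documentation before use.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumed base URL and path layout; verify against the API documentation.
BASE_URL = "https://www.ebi.ac.uk/pdbe/pdbe-kb/3dbeacons/api"


def summary_url(accession, base_url=BASE_URL):
    """Build the (assumed) summary URL for one UniProt accession."""
    return f"{base_url}/uniprot/summary/{quote(accession, safe='')}.json"


def fetch_summary(accession, timeout=30):
    """Download and decode the JSON summary (performs a network request)."""
    with urlopen(summary_url(accession), timeout=timeout) as response:
        return json.load(response)


def models_by_provider(summary):
    """Group model identifiers by provider.

    Expects the assumed shape:
    {"structures": [{"summary": {"provider": ..., "model_identifier": ...}}]}
    Missing keys are tolerated so a schema change fails softly.
    """
    grouped = {}
    for entry in summary.get("structures", []):
        info = entry.get("summary", {})
        provider = info.get("provider", "unknown")
        grouped.setdefault(provider, []).append(info.get("model_identifier"))
    return grouped


if __name__ == "__main__":
    # Offline example using a hand-written response in the assumed shape.
    example = {"structures": [
        {"summary": {"provider": "PDBe", "model_identifier": "model-1"}},
        {"summary": {"provider": "SWISS-MODEL", "model_identifier": "model-2"}},
    ]}
    print(models_by_provider(example))
```

Grouping by provider keeps the source of each model explicit, which matches the emphasis on labelling where structural data come from.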

We will also develop mechanisms for transferring structure-based functional annotations from the PDBe Knowledgebase to UniProt proteins, and derive a confidence measure for the annotations. This infrastructure will allow integration and display of these annotations for UniProt sequences from key model organisms including important agricultural organisms and their pathogens.

Furthermore, these functional annotations will be built into the UniProt UniRule system enabling both (i) large scale annotation of the UniProt KnowledgeBase (UniProtKB) and (ii) their use by other groups annotating completed genomes or metagenome data, through the UniFIRE (the UNIprot Functional annotation Inference Rule Engine) system.

Planned Impact

Protein structure data provides valuable insights into the mechanisms by which proteins function and can thus provide explanation for impacts of genetic variation. It also aids drug design and protein engineering e.g. for greater stability or higher catalytic efficiency. The impact of structural data is evident from the significant uptake of the data by the community. For example, structural data in PDBe is accessed by >60,000 unique users/month. Genome3D is an integrated resource with structural data from 5 world-leading UK resources, whose sites typically attract 10,000 - 15,000 users/month. The Genome3D data is also disseminated via InterPro which has 135,000 unique users/month.

Despite significant advances in protein structure determination, a significant proportion of proteins have no experimental structural data in the PDB. However, protein structure prediction methods have improved significantly and the models produced can provide valuable data for understanding protein functions and the impacts of genetic variations. By expanding the predicted structural data in Genome3D and implementing the 3D-Beacons network to integrate additional predicted data from other internationally acclaimed resources (i.e. ModBase, Rosetta, SWISS-MODEL), we will maximise the structural coverage of sequences in UniProtKB and provide valuable data benefitting a very wide community of biologists. As well as aggregating known and predicted structural data, the 3D-Beacons network will aggregate structure-based functional annotations from PDBe-KB. Our 3D-Gateway pages will be carefully designed to display all this information for a given UniProt sequence, in a highly intuitive manner that makes the source of, and confidence in, the data clear. The impact of UniProt in the biological community is extremely high, with access by >750,000 biologists each month.

Our project has clear deliverables likely to have impact on research studies:
(1) 3D-Beacons network-based aggregation of structural and functional data will also allow individual groups to download aggregated structural data for sets of UniProt proteins. This gives a mechanism for other data providers, eg Interactome3D, to combine the data with their own information, eg on protein interactions and drug targets.
(2) Dedicated 3D-Gateway webpages showing structural and functional annotations will provide biologists access to functional information on a protein they are studying. In this context information provided by UniProt on known disease variants will be enriched by structural and functional information, provided by our 3D-Gateway project, highlighting key residues.
(3) The incorporation of structural and functional annotations in UniRules will allow safe transference of annotations to an even greater set of UniProt sequences and these rules will also be available to genome curators to enable functional annotation and comparative genome studies.

Industry will also benefit from the structural and functional annotations of UniProt sequences on the 3D-Gateway pages, to guide drug design.

As an activity in the ELIXIR Community of Structural Bioinformatics, 3D-Beacons will make the aggregated data available to groups across Europe and beyond, who in turn will contribute their own data to 3D-Beacons for display on the web pages. This community will also be involved in exploring mechanisms to ensure the quality of the aggregated data (eg by highlighting outlying data and developing sound confidence measures).

Furthermore, the ELIXIR 3D-BioInfo Community is building links with the ELIXIR Rare Disease and Galaxy Communities to develop workflows for accessing known and predicted structural data. ELIXIR funding is already supporting development of web-based training workflows in TESS, by PDBe and Genome3D, for exploiting structural data to gauge the impacts of genetic variations. This training material will ensure wider uptake and exploitation of data from the 3D-Gateway project.

Description Proteins are fundamental to many essential biological processes but less than 10% have had their structures experimentally characterised. New developments in protein structure prediction mean that high quality 3D models can now be generated using computational approaches. This project has built a platform that will allow biologists to search for good quality models for a protein of interest and return information from key providers of structural models.

This project has involved successful collaborations with key groups across Europe and in the United States. The 3D Beacons portal has now been launched by the EBI.
Exploitation Route The high quality 3D models will be valuable for understanding the impacts of genetic variations and for drug design.
Sectors Agriculture, Food and Drink; Digital/Communication/Information Technologies (including Software); Education; Manufacturing, including Industrial Biotechnology; Pharmaceuticals and Medical Biotechnology

URL https://www.ebi.ac.uk/pdbe/pdbe-kb/3dbeacons/
 
Description The 3D models provided by 3D Beacons will be used by researchers in industry to understand impacts of genetic variation and to design drugs.
First Year Of Impact 2021
Sector Agriculture, Food and Drink; Digital/Communication/Information Technologies (including Software); Manufacturing, including Industrial Biotechnology; Pharmaceuticals and Medical Biotechnology
 
Title 3D Beacons Client: an exemplar implementation of 3D Beacon API 
Description The 3D Beacons Network provides a framework that connects heterogeneous data resources across the globe, allowing users to find predicted models of protein structure. One aspect of this project has been to develop a standalone client that implements the minimum functionality required for a new resource to join the 3D Beacons Network. The main goals for the client are to: - lower the barrier of entry for new groups to contribute to the network (eg groups that might not have technical
Type Of Material Improvements to research infrastructure 
Year Produced 2021 
Provided To Others? Yes  
Impact The 3D Beacons Network project started in 2019 with a core collaboration between EBI, SWISS-MODEL and Genome3D, and has subsequently expanded to include contributions from many groups across ELIXIR (EU) and research groups in the US. This network is particularly relevant now that DeepMind has released AlphaFold2, a powerful algorithm that allows research groups all around the world to build their own highly accurate 3D structures from protein sequence.
URL https://github.com/3D-Beacons/3d-beacons-client
 
Description BioHackathon Europe 2021 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact The BioHackathon Europe events involve 100-150 people working for a week on international, open source projects within computational biology. Our research team led one of the research projects in the hackathon on behalf of the 3D Beacons collaboration.

The general aims of BioHackathon are:

- Advance the development of an open source infrastructure for data integration to accelerate scientific innovation. We will focus on technology implementations such as FAIR, identifiers, metadata standards, ontologies and metadata catalogues that support the operations across ELIXIR Platforms, Communities and Focus Groups.
- Engage technical people in the bioinformatics community inside and outside ELIXIR to work together on topics of common interest aligned to ELIXIR activities
- Strengthen the interactions with ELIXIR Platforms, Communities and Focus Groups to establish and reinforce collaborations through hands-on programming activities.
Year(s) Of Engagement Activity 2021
URL https://biohackathon-europe.org/
 
Description Biohackathon Conference - 'Progress on the 3D-Beacons Network' - Ian Sillitoe - 9-11/11/20 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact A Biohackathon conference is a more collaborative 'hands-on' meeting of coders and scientists, connecting people and projects in an effort to build reproducible software to tackle a problem in biological sciences. Ian Sillitoe (Research Manager in the group) led a hackathon on 3DBeacons, a shared set of tools and web APIs to interconnect databases and websites dealing with protein structures.
Year(s) Of Engagement Activity 2020
URL https://biohackathon-europe.org/index.html
 
Description Oral presentation at ISCB/ISMB 2022 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact This work was presented in the 3D-SIG track of the ISMB/ISCB 2022 conference. Around 30 researchers and scientists attended the talk. Models of human disease-associated proteins were used to infer predicted functional sites. We then checked whether known mutations were in close proximity to these functional sites and hence whether they might affect the functioning of the proteins. We also checked the structural effect and pathogenicity of these mutations using known predictors.
Year(s) Of Engagement Activity 2022