Intermediate-to-low resolution feature detection in cryoEM maps using cascaded neural networks

Lead Research Organisation: Science and Technology Facilities Council
Department Name: Scientific Computing Department

Abstract

Understanding the function of biomolecules is fundamental to comprehending how life is sustained and to designing specific therapeutics for diseases associated with their function. Proteins form the largest fraction of cell constituents and often assemble together, and also with other biomolecules, into large molecular machines that perform vital roles in many cellular processes. The three-dimensional (3D) structure of a molecular machine forms the platform for its function, and determining the 3D structure is crucial to understanding the details of its activity.

Cryogenic electron microscopy (cryo-EM) has had an immense impact on the structure determination of such large molecular assemblies in a near-native state. These assemblies can either be studied in isolation (single particle analysis) or in the native cellular environment (electron tomography). Advances in technology and software for cryo-EM have helped to push the level of detail that can be discerned. Nonetheless, intrinsic properties of biological samples often make them less amenable to high resolution structure determination. 88% of cryo-EM structures deposited in the public repository EMDB are worse than 3.5 Å resolution, and therefore do not contain atomic detail.

The 3D structural details of a macromolecular assembly are obtained as a density map. Interpreting details of the map requires detection of structural features of the components of the assembly. Intermediate (between 3.5 Å and 6 Å) and low (>6 Å) resolution maps are extremely difficult to interpret using standard automated tools. Available methods usually detect structural features by six-dimensional search procedures that are computationally expensive and are associated with a large number of false positives. Moreover, most of these methods require that the structural details of each component of the assembly are known. Related problems that go hand in hand are the validation of features derived from low resolution map data and the representation of these low resolution models themselves.

The basic structural organization of proteins and the process of 3D folding from a 1D sequence of amino acids have been studied over several decades. Proteins use a finite set of modular features like secondary structures and folds, and the functional form consists of a unique arrangement of these features. An intermediate level of features is also observed, where a few secondary structures organize into stable motifs or sub-folds.

We plan to exploit the hierarchical feature organization of protein structures. Using powerful deep learning approaches established for pattern recognition, we aim to address the problem of feature recognition in intermediate-to-low resolution maps. We will use structural feature libraries of different sizes, ranging from secondary structures and smaller motifs (e.g. turns of the protein chain) to sub-folds and folds. A specialized set of motifs or sub-folds covering the intermediate size features will be generated based on compactness (contacts).

Deep neural network architectures will be designed to detect these 3D structural features in the map, with layers arranged to reflect the structural hierarchy. We also plan to use the developed networks for validation of existing structure models derived from low resolution data. In the future we would like to extend this work to potentially build structural models by assembling the features using additional sequence based information.

The developed approach would help to extend structure interpretability at intermediate and low resolutions and make better use of such data to gain insights into the mechanisms of biological function. The proposed development will be implemented as a user-friendly tool and distributed to the scientific community. We anticipate that other scientific fields could potentially benefit from the machine learning architecture designed for such multi-label 3D segmentation from noisy data.

Technical Summary

Cryogenic electron microscopy (cryo-EM) currently enables structure determination of large macromolecular machines at close to atomic resolutions. However, 88% of cryo-EM data (in the EM DataBank) are worse than 3.5 Å and the average resolution currently achieved using single particle analysis is only 5.7 Å. Further developments in the field are likely to bring this number down, but structure interpretability and model validation beyond 3.5 Å is a clear challenge at the moment. We aim to extend structure interpretability beyond 3.5 Å by generating feature libraries of different size ranges and training machine learning models to detect these features in single particle maps and sub-tomogram averages.

We propose to use libraries of structural features at three levels based on the spatial extent:
1) 'secondary structure like' motifs comprising alpha helix, beta strand, polyproline helix, etc and other frequently occurring small motifs identified by the PDBeMotif database
2) Sub-folds made up of unique arrangements of two secondary structure elements
3) Sub-folds made of more than two secondary structure elements
To generate the last two libraries, we will use a method to segment protein structures based on amino acid contacts, and cluster them by shape.
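
As a rough illustration of what segmenting a structure "based on amino acid contacts" can involve, the sketch below computes residue contacts from C-alpha coordinates and scores how self-contained a candidate segment is. It is not the method used in the project: the 8 Å cutoff, the minimum sequence separation, and the names `contact_pairs` and `compactness` are all assumptions made for this example.

```python
import math
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def contact_pairs(ca: Sequence[Coord], cutoff: float = 8.0,
                  min_sep: int = 3) -> List[Tuple[int, int]]:
    """Return residue pairs (i, j), j - i >= min_sep, with C-alpha atoms within cutoff (Å).

    The 8 Å cutoff and the sequence separation are illustrative assumptions.
    """
    return [(i, j)
            for i in range(len(ca))
            for j in range(i + min_sep, len(ca))
            if math.dist(ca[i], ca[j]) <= cutoff]


def compactness(pairs: Sequence[Tuple[int, int]], start: int, end: int) -> float:
    """Fraction of the contacts touching residues [start, end) that stay inside that range."""
    inside = sum(1 for i, j in pairs if start <= i < end and start <= j < end)
    touching = sum(1 for i, j in pairs if start <= i < end or start <= j < end)
    return inside / touching if touching else 0.0


# Toy chain: two four-residue clusters placed 50 Å apart.
chain = [(0, 0, 0), (3, 0, 0), (3, 3, 0), (0, 3, 0),
         (50, 0, 0), (53, 0, 0), (53, 3, 0), (50, 3, 0)]
pairs = contact_pairs(chain)
whole_unit = compactness(pairs, 0, 4)   # segment matching one cluster
straddling = compactness(pairs, 2, 6)   # segment cutting across both clusters
```

A segment that coincides with one cluster keeps all of its contacts internal, whereas a segment that straddles the two clusters keeps none. A contact-based segmentation would favour cut points that maximise this kind of score.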

We will develop a machine learning model to recognise these features in maps at different resolutions, using existing fitted models for training and testing. We will test different deep learning architectures to address this problem of multi-label (feature) segmentation. Larger features are composed of a unique arrangement of smaller features, and hence the contextual/neighborhood information and hierarchical nature (cascaded architecture) are important. An additional network will interpret the output features in terms of overall fold. Upon testing the proof of principle, this work could be extended to assign sequences and build structural models by assembling the features together.

Planned Impact

Understanding the function of biomolecules is fundamental to comprehending how life is sustained and to designing specific therapeutics for diseases associated with their function. Proteins form the largest fraction of cell constituents and often assemble together, and also with other biomolecules, into large molecular machines that perform vital roles in many cellular processes. The three-dimensional (3D) structure of a molecular machine forms the platform for its function, and determining the 3D structure is crucial to understanding the details of its activity.

In this proposal, we aim to develop a tool which will help experimental structural biologists interpret 3D volumes of molecular machines determined by the techniques of electron cryo-microscopy and tomography. Cryogenic electron microscopy (cryo-EM) has had an immense impact on the structure determination of such large molecular assemblies in a near-native state. These assemblies can either be studied in isolation (single particle analysis) or in the native cellular environment (electron tomography). Cryo-EM has been adopted widely by the academic community studying the molecular basis of disease or developing biotechnology applications. The technique has also been adopted in the last couple of years by the pharmaceutical industry, agritechnology and biotechnology companies for the insight it gives, for example, on particular drug targets. The vast majority of economic and societal impacts of this work will be achieved indirectly by improving the outputs of these academic and industrial scientists.

We will integrate the tool into the software suite of the Collaborative Computational Project for Electron cryo-Microscopy (CCP-EM) which is already used by thousands of structural biologists in academia and industry. CCP-EM also organises and hosts several training workshops per year, and has close links with the electron Bio-Imaging Centre (eBIC) on the Harwell campus.

We will also work closely with our collaborators at the Electron Microscopy and Protein Data Banks to see how the tool can improve the interpretation and annotation of structures already deposited in their databases. This could lead to vital new insights into known structures, and impact on the many downstream users of these databases.

There could be additional academic or industrial beneficiaries (users of our software, or of its libraries and algorithms) in domains where 3D pattern recognition from noisy data is required.

The PDRA will receive valuable training in the specialist areas of structural biology and machine learning. The role will expose the incumbent to multidisciplinary techniques, and add a valuable skillset to the UK workforce.

STFC is active in public engagement activities. STFC has hosted visits from school parties and engages in providing basic scientific exposure and internships to students at school and graduate levels. The imaging from electron microscopy is very visual in nature, and is an excellent focus to help make a connection between science and biology and the everyday world most people experience. Our aim is to raise awareness of science among people who would not otherwise have contact with it, and to inspire school children to take an interest.

We have a media officer who targets alerts to the public press or trade publications with topical findings.

Description The structural biology technique of cryogenic electron microscopy determines 3D shapes (cryoEM volumes) of biological macromolecules, important in health and disease of all organisms. In order to understand the chemistry and function of these macromolecules, it is necessary to interpret these shapes in terms of discrete atoms. This remains a challenge, especially when the shape is at low resolution.

We have shown that it is possible to apply the latest machine learning techniques to learn characteristic features of these shapes, so that particular groups of atoms (e.g. defining a helix or a fold) can be located. We are considering different levels of atomic features, from secondary structure elements up to sub-domain motifs. For the latter, we are collaborating with a French group who have developed Protein Units, which form recognisable motifs of larger macromolecules. We are applying our method down to resolutions of 12 Å, where features of the molecules are less well resolved, but still recognisable.

We have used a supervised machine learning approach which requires annotated training data, i.e. cryoEM volumes where the atomic features have already been identified. We have generated this training data by starting from a set of known macromolecular structures and calculating cryoEM volumes (at several resolutions) from these. We have done this for 2,378 different structures, and annotating these with structural motifs has been a major task. However, this represents a valuable resource in itself, as methods developers often struggle to find good test data. We will therefore be making this available to the community.

We have trained a 3-level cascaded machine learning model, meaning that we learn structural features at 3 different levels of detail. The outputs of one level are fed as inputs of another level, and we have been able to show that this improves the performance of the combined model. In the end, we can predict that a particular region of a cryoEM volume represents e.g. an alpha helix which is part of an alpha-turn-alpha motif. Such information is invaluable for interpreting the outcome of experiments.
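
The data flow of such a cascade can be sketched in a few lines. This is a toy illustration only, not the project's TensorFlow implementation: the trained networks are replaced by placeholder functions, the label counts (3, 2 and 4) are invented, and passing the density alongside all earlier outputs to each later level is an assumption.

```python
from typing import Callable, List, Sequence

Channels = List[float]                      # feature channels for one voxel
Level = Callable[[List[Channels]], List[Channels]]


def run_cascade(density: Sequence[float],
                levels: Sequence[Level]) -> List[List[Channels]]:
    """Apply the levels in order, feeding each level's scores to the next as extra channels."""
    features = [[value] for value in density]   # one starting channel: the density
    outputs = []
    for level in levels:
        scores = level(features)                # per-voxel label scores for this level
        outputs.append(scores)
        # Assumption: later levels see the density plus all earlier scores.
        features = [f + s for f, s in zip(features, scores)]
    return outputs


def placeholder_level(n_labels: int, seen_widths: List[int]) -> Level:
    """Stand-in for a trained network: returns uniform scores and records its input width."""
    def level(features: List[Channels]) -> List[Channels]:
        seen_widths.append(len(features[0]))
        return [[1.0 / n_labels] * n_labels for _ in features]
    return level


widths: List[int] = []
levels = [placeholder_level(3, widths),   # e.g. helix / strand / turn (hypothetical)
          placeholder_level(2, widths),   # e.g. two-element motifs (hypothetical)
          placeholder_level(4, widths)]   # e.g. larger motifs (hypothetical)
outputs = run_cascade([0.1, 0.9, 0.5], levels)
```

After the run, `widths` shows each level receiving progressively more input channels, which is the sense in which a later level can use the context supplied by the earlier ones.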

This work was disrupted by the pandemic, but we have now completed a proof of principle showing that a hierarchical approach can improve on a simple segmentation of the cryoEM volume. We continue to refine this approach, and will publish the approach when we are happy that it is robust. The software that implements this approach is now distributed as part of the CCP-EM Macromolecular Machine Learning Toolbox. The latter handles many common technical issues, such as file format conversion and memory issues when dividing up large volumes.
Exploitation Route The project has close links to the CCP-EM project for cryoEM software. We will use core CCP-EM resources to maintain the software, and seek ways (e.g. studentships) to take this work forwards. The software has been made available via GitLab under an Open Source (MIT) licence, mainly for other developers to build on.
Sectors Agriculture, Food and Drink; Pharmaceuticals and Medical Biotechnology

URL https://gitlab.com/ccpem/ml-protein-toolbox
 
Title Biomolecular volumes at multiple resolutions annotated with structural motifs 
Description As part of our work on segmentation of biomolecular volumes, we have created a test dataset of approximately 2300 proteins and complexes. Simulated cryoEM maps have been generated at resolutions of 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0 and 12.0 Å. Voxels of these maps have been annotated with multiple structural labels, for example when a voxel lies in a region contained within an alpha helix which is in turn part of a larger motif. Annotation labels cover several spatial levels, including secondary structure, turns, and super-secondary motifs.
Type Of Material Database/Collection of data 
Year Produced 2022 
Provided To Others? Yes  
Impact So far, the dataset has only been used for internal development of our volume segmentation tool. We expect that it would also be of interest to other methods developers with interest in volume segmentation, subvolume identification and atomistic model building. While biomolecular volumes are available from the Electron Microscopy Data Bank (EMDB), these are not structurally annotated (except what can be inferred from a fitted atomistic model) and are not available at multiple resolutions. Methods developers always require curated, standardised datasets for testing, and we hope that ours will be suitable. For the moment, the dataset is available on request from ourselves. We can supply the whole dataset (500GB) or subsets identified according to requested metadata (e.g. structural type or resolution). We are working on making the dataset available from the CCP-EM website. 
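
One simple way to picture a voxel carrying several structural labels at once, as described for this dataset, is a bit flag per label. The sketch below is illustrative only: the label names, the bit assignments and the sparse dictionary layout are assumptions, not the dataset's actual storage format.

```python
from enum import IntFlag
from typing import Dict, Tuple


class Motif(IntFlag):
    """Hypothetical label set; one bit per structural label."""
    NONE = 0
    ALPHA_HELIX = 1
    BETA_STRAND = 2
    TURN = 4
    HELIX_TURN_HELIX = 8   # example of a larger, super-secondary motif


# Simulated resolutions listed in the dataset description (Å).
RESOLUTIONS = [float(r) for r in range(3, 13)]

Voxel = Tuple[int, int, int]             # (z, y, x) grid index
labels: Dict[Voxel, Motif] = {}          # sparse map: only annotated voxels are stored


def annotate(voxel: Voxel, motif: Motif) -> None:
    """Add a label to a voxel without discarding labels it already carries."""
    labels[voxel] = labels.get(voxel, Motif.NONE) | motif


annotate((4, 7, 2), Motif.ALPHA_HELIX)
annotate((4, 7, 2), Motif.HELIX_TURN_HELIX)   # same voxel, second spatial level
annotate((9, 1, 1), Motif.BETA_STRAND)
```

A membership test such as `Motif.ALPHA_HELIX in labels[(4, 7, 2)]` then answers whether a voxel belongs to a given feature, independently of any other labels it holds.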
 
Title Cascaded neural network for structural classification of cryoEM volumes 
Description We have implemented a 3-level cascaded neural network for segmentation of cryoEM volumes. The lowest level predicts secondary structural element or turn motifs. The second level predicts motifs consisting of two secondary structural elements and the third level predicts larger structural motifs. The second and third levels are based on Protein Units defined in previous work by our collaborators. The cascaded network has been trained on a large set of simulated cryoEM maps, at multiple resolutions. The output of the cascaded network is a voxel-wise prediction of multiple structural motif labels. These can be used for segmentation or other analyses. The model consists of trained weights, together with code for making predictions from the model. Our code is based on an in-house library, which in turn uses TensorFlow.
Type Of Material Computer model/algorithm 
Year Produced 2021 
Provided To Others? No  
Impact To date, the model has been used for in-house testing. We are working on releasing the model with CCP-EM, which has a large base of potential users. We wish to finish extensive testing before release. 
 
Description AIMLAC CDT - Aberystwyth, Bangor, Cardiff, Swansea, Bristol 
Organisation Swansea University
Country United Kingdom 
Sector Academic/University 
PI Contribution We provide placements for students on this CDT programme. Specifically for the 2020/2021 cohort, we provided 2 placements. The students worked on projects concerning denoising of electron micrographs for cryoEM, and modelling of neutron reflectometry data. Each student has completed a 2-week initial placement, followed by the main 6-month placement. For the 2021/2022 cohort, we provided a further 2 placements. One will continue to refine the cascade machine learning model for segmentation of molecular volumes from cryoEM. The other will work on the CoVal server for linking SARS-CoV-2 variant data with experimental structures. For the 2022/2023 cohort, we are providing one placement on machine learning in cryoEM.
Collaborator Contribution The CDT administers the programme, and matches us up with specific students. The students themselves contribute to our on-going research programme. Typically, they deliver a small piece of coding which can be included in our larger software packages.
Impact One of the students has contributed code to the Macromolecular Machine Learning Toolbox for 3D cryo-EM data segmentation, which is publicly available. The collaboration is multi-disciplinary in the sense that the students come from a background of AI in physical sciences, and contribute to projects in the biosciences when with us.
Start Year 2020
 
Title Macromolecular Machine Learning (MML) Toolbox for 3D cryo-EM data segmentation 
Description We have developed a collaborative software toolbox that includes a number of methods and pre-processing steps common to applying machine learning to 3D macromolecular data. The aim is to improve the accessibility of machine learning techniques to the members of the community and lower the technical entry barrier to applying them.
Notable features:
* Command line tool and Python API
* Set of custom loss functions to handle volume background imbalance
* 3 modes of pre- and post-processing including .mrc headers
* customisable architecture
* loading and saving data
* data structure for holding maps and models
* 8 different metrics and visualisations for performance tracking
The toolbox is being used in several internal and external machine learning projects. It has also been released to collaborators in the CCP-EM consortium, for example at Delft, NL. These application projects are in turn driving the further development of the toolbox. The toolbox is the main destination of code developed under the BBSRC-funded project to develop a cascaded neural network for identifying low resolution features in cryoEM maps.
Recent additions include:
1) Set up the cascade architecture in the ML-protein-toolbox
2) Add a data augmentation routine by initial volume rotations and filter by class representation.
3) Add an intermittent HDF5 storage for batches of tiles and a tracking mechanism (tile origin and index) to reduce memory issues with cascade
4) Implement DeepLabV3+ architecture which is one of the best performers in image segmentation tasks (makes use of atrous convolutions) but computationally heavy.
The toolbox is expected to be production ready in future releases of the CCP-EM software suite.
Type Of Technology Software 
Year Produced 2020 
Open Source License? Yes  
Impact The toolbox has been used in at least 5 different software development projects, and has an impact indirectly through these. 
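
The tile tracking mentioned in the toolbox description (a tile origin and index stored with each tile) can be illustrated with a short sketch. This is not the toolbox's API: the function name `tile_origins`, its parameters, and the choice to add an extra tile flush with the far edge of each axis are assumptions made for the example, and the HDF5 storage itself is left out.

```python
from itertools import product
from typing import Iterator, List, Sequence, Tuple


def tile_origins(shape: Sequence[int], tile: Sequence[int],
                 stride: Sequence[int]) -> Iterator[Tuple[int, Tuple[int, ...]]]:
    """Return an iterator of (index, origin) pairs for tiles that cover the whole volume."""
    per_axis: List[List[int]] = []
    for size, edge, step in zip(shape, tile, stride):
        if edge > size:
            raise ValueError("tile is larger than the volume")
        if not 1 <= step <= edge:
            raise ValueError("stride must be between 1 and the tile edge to avoid gaps")
        starts = list(range(0, size - edge + 1, step))
        if starts[-1] != size - edge:
            starts.append(size - edge)    # extra tile flush with the far edge
        per_axis.append(starts)
    # The running index plus the origin is enough to put predictions back in place.
    return enumerate(product(*per_axis))


# A 10 x 10 x 10 volume cut into 4 x 4 x 4 tiles with a stride of 4 voxels.
tiles = list(tile_origins((10, 10, 10), (4, 4, 4), (4, 4, 4)))
```

Keeping only the index and origin alongside each stored tile means that tiles can be written out in batches and the full-size prediction reassembled later without holding the whole volume in memory.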
 
Description Cryo-EM Validation in the Age of SARS-CoV-2: Methods, Tools, Applications 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact This meeting was organised by the UK EM Validation Network to present current thinking on cryo-EM map and map/model validation, and to consider future research directions that anticipate the growing needs in this area. The community effort to apply cryo-EM to structural proteins of SARS-CoV-2 has been a success story but has also highlighted the importance of structure validation. As key members of the Validation Network, CCP-EM were closely involved, and STFC provided the logistical support for the online event.
94 attendees responded to a survey, with overwhelmingly positive feedback.
Year(s) Of Engagement Activity 2020
URL https://www.ccpem.ac.uk/training/validation_symposium_2020/Cryo-EM_Validation_in_the_Age_of_SARS-CoV...