CCP4 Advanced integrated approaches to macromolecular structure determination

Lead Research Organisation: Diamond Light Source
Department Name: Science Division

Abstract

Abstracts are not currently available in GtR for all funded research. This is normally because the abstract was not required at the time of proposal submission, but may be because it included sensitive information such as personal details.

Planned Impact

The impact of macromolecular crystallography and CCP4 on fundamental biomedical research, as well as on the pharmaceutical industry, is described in the Pathways to Impact section.
The popularity of macromolecular crystallography and cryoEM has resulted in these techniques being increasingly applied to the study of progressively more challenging macromolecular structures. These typically exhibit intrinsically position-dependent mobility, resulting in limited data and a signal-to-noise ratio that varies between different parts of the maps. Moreover, in crystallography, data collected from multiple crystals are merged together; serial crystallography has become a standard tool for structural biologists. The widespread use of CC1/2 to select the parts of the data that are suitable for structure elucidation means that the signal-to-noise ratio of the selected data can be very low, falling to 1 or even less. Also, the criteria currently used to assess quality, such as resolution and R-factors, are becoming increasingly confusing for practical structural biologists and journal referees alike. It is timely to re-evaluate quality indicators, ensuring that all data collected during the experiment are optimally utilised. A new Fourier-optics-based quality indicator will address this problem by giving an objective indication of the resolvability of peaks in the calculated maps. This measure will depend on direction- and time-dependent data quality, data completeness, and the current state of the statistical model (including the atomic model). Such an indicator will also address the problem of local resolution, and will be used for position- and direction-dependent map de-blurring, thus making maps more interpretable.
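For readers unfamiliar with CC1/2: it is the Pearson correlation coefficient between the mean intensities obtained from two random halves of the measurements of each reflection. The following is a minimal Python sketch of that idea only. It is not the implementation used in DIALS or in any CCP4 program, and the function names and the dictionary layout of the input are assumptions made for illustration.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def cc_half(observations, seed=0):
    """Estimate CC1/2 from repeated intensity measurements.

    `observations` (hypothetical layout) maps a reflection index (h, k, l)
    to a list of measured intensities of that reflection.
    """
    rng = random.Random(seed)
    half1, half2 = [], []
    for measurements in observations.values():
        if len(measurements) < 2:
            continue  # a single measurement cannot be split into two halves
        shuffled = list(measurements)
        rng.shuffle(shuffled)  # random assignment to the two half data sets
        mid = len(shuffled) // 2
        half1.append(sum(shuffled[:mid]) / mid)
        half2.append(sum(shuffled[mid:]) / (len(shuffled) - mid))
    return pearson(half1, half2)
```

With noise-free repeated measurements the two half data sets agree exactly and the function returns 1; as measurement noise grows relative to the spread of true intensities, the value falls towards 0.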
The techniques developed will be implemented in the new data-scaling program developed by the DIALS group. This tool will calculate the limit of useful data as well as the maximum expected resolvability, providing structural biologists with a way to decide whether the experiment should be continued (i.e. whether more data are required). These techniques will also be implemented in the refinement program REFMAC5, allowing the difference between current and maximum resolvability to be analysed and used for decision making by practical crystallographers and automatic pipelines. Resolvability for each data set, with and without the refined model, will be calculated for the METRIX data. This will be included in the feature vector that is used by machine learning algorithms for map quality assessment.
This work package will also address the growing popularity of microED, i.e. electron diffraction by macromolecular microcrystals. One of the elements of WP1 will focus on the joint refinement of electron and X-ray diffraction data using the joint conditional probability distribution of two related data sets, with corresponding atomic models reflecting electrostatic potential and electron density, respectively. Under the first Born approximation, electrons are diffracted by the electrostatic potential and X-rays are scattered by the electron charge cloud. These two quantities are related by the Poisson equation. This fact will be used for joint refinement, parameterised in Fourier space by the Mott-Bethe formula, which reduces the effective number of parameters. Using this formula means that one set of atomic scattering factors can be used for both electron and X-ray diffraction. We will explore the possibility of point charge refinement when high quality electron diffraction data are available, possibly together with X-ray diffraction data. To perform such refinement we will need to account for effects such as absorption and radiation damage, which can change the charge distribution dramatically. Consequently, unmerged data must be used for such refinement. This part of WP1 will be carried out in collaboration with WP4.
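As an illustration of how the Mott-Bethe formula links the two kinds of scattering factor, the sketch below converts an X-ray scattering factor f_x(s) into an electron scattering factor f_e(s) = (Z - f_x(s)) / (8 pi^2 a_0 s^2), with s = sin(theta)/lambda in inverse Angstrom and a_0 the Bohr radius. This is a minimal Python sketch under stated assumptions, not the REFMAC5 implementation: the function and constant names are hypothetical, the input values in any call are up to the user, and relativistic corrections are ignored.

```python
from math import pi

# Bohr radius in Angstrom (CODATA 2018 value).
BOHR_RADIUS_ANGSTROM = 0.529177210903

# Mott-Bethe prefactor 1 / (8 pi^2 a_0), approximately 0.023934 per Angstrom.
MOTT_BETHE_CONSTANT = 1.0 / (8.0 * pi ** 2 * BOHR_RADIUS_ANGSTROM)


def electron_scattering_factor(z, f_x, s):
    """Electron scattering factor (in Angstrom) from the Mott-Bethe formula.

    z   : atomic number (nuclear charge)
    f_x : X-ray scattering factor at s, in electrons
    s   : sin(theta)/lambda in inverse Angstrom; must be positive because
          the formula is singular at s = 0
    """
    if s <= 0:
        raise ValueError("s must be positive; the formula is singular at s = 0")
    return MOTT_BETHE_CONSTANT * (z - f_x) / s ** 2
```

Because only Z and f_x(s) enter the expression, a single table of X-ray scattering factors is enough to evaluate both quantities. For an ion, f_x(0) differs from Z, so f_e grows without bound as s approaches 0, which is one reason electron diffraction is sensitive to charge.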
All software developed will be distributed by CCP4, making it accessible to the structural biology community worldwide.

 
Description CCP4 champion for equality, diversity and inclusion
Geographic Reach National 
Policy Influence Type Membership of a guideline committee
 
Description Lead of machine learning and AI working group within CCP4's WG2
Geographic Reach Multiple continents/international 
Policy Influence Type Contribution to new or Improved professional practice
 
Title METRIX 
Description The database holds a collection of tables describing crystallographic diffraction data and the metadata derived from them. Some of the metadata have been extracted from reference files, whereas other information has been generated independently through customised data analysis pipelines such as intensity integration, data reduction and structure phasing. Additionally, information based on protein sequence analysis is available. The database is described in more detail at https://doi.org/10.1107/S2052252520000895. The information within the database was used to develop machine learning based decision-making tools, which have been implemented in automated structure solution pipelines at user facilities. This database is the expanded and redesigned version of what was used in a previous grant. 
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
Impact This information is currently being compiled. 
 
Title Python based data analysis tools 
Description A series of Python wrapper scripts that allow easy execution of crystallographic software for the analysis of test data sets. 
Type Of Technology Software 
Year Produced 2016 
Open Source License? Yes  
Impact All scripts can be easily called and run on computing clusters and have greatly increased the speed of data analysis for the test data (539 data sets). The total analysis time has decreased from a couple of weeks to two or three days. 
 
Title Python based experimental phasing prediction 
Description Python and its machine learning extensions were used to create a tool that predicts the outcome of experimental phasing. The application makes use of a collection of classifiers which have been trained using the data held in the METRIX database (also part of this project). The project currently awaits incorporation into the general facility infrastructure so that it becomes available to the user community. 
Type Of Technology Software 
Year Produced 2019 
Open Source License? Yes  
Impact Not yet available. The expected outcome is that, by giving facility users a probability of success or failure for each experimental phasing attempt, they can focus on those attempts with a high chance of success, improving the use of resources. 
 
Title Python-topaz3 
Description This software package uses volumetric data, here electron density maps of proteins, to identify features in such maps with the goal of determining the atomic structure of a given protein. Identification is performed by deep learning using neural networks. 
Type Of Technology Software 
Year Produced 2019 
Open Source License? Yes  
Impact This is currently being assessed. 
 
Title SQLite based database system to manage test data sets and data analysis output 
Description This is a freely available (on GitHub) SQLite database to manage the collection of test data sets referred to in section "Databases and Models". It additionally links statistics and metrics from the various data analysis steps. 
Type Of Technology Software 
Year Produced 2016 
Open Source License? Yes  
Impact Rather than the outcome for the various test data sets having to be tracked manually, the database takes on this task. Besides data management, it also allows statistics and metrics to be extracted conveniently as CSV files for use in machine learning tools further downstream, which provide the basis of user guidance at the beamlines. 
 
Title pediip - Protein electron density map identification in Python 
Description A Python-based software package using machine learning, convolutional neural networks (CNNs) in particular, to classify electron density maps derived from X-ray diffraction experiments. The CNNs use either 2D or 3D convolutional layers. The former are used to classify 2D images created from the electron density map and the latter use a standardised volume of the density. Electron density maps for training the neural networks were produced in a series of molecular replacement and refinement experiments using publicly available data, i.e. protein structures and structure factors from the Protein Data Bank. 
Type Of Technology Software 
Year Produced 2022 
Open Source License? Yes  
Impact Not available yet. 
 
Description CCP4 Developers meeting (2022) 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Other audiences
Results and Impact Attendance at the CCP4 Developers meeting, Cosener's House, Abingdon, UK (in person) to present a project status report for this research.
Year(s) Of Engagement Activity 2022
 
Description Co-editor for IUCr Journals, Acta Cryst D, on CCP4 Study Weekend 2022 Special Issue (M Vollmar) 
Form Of Engagement Activity A magazine, newsletter or online publication
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact M Vollmar was co-editor for a special issue of Acta Cryst D collecting papers from the CCP4 Study Weekend 2022. This is typically a very well cited journal and issue. The issue is to be published in 2023.
Year(s) Of Engagement Activity 2022
 
Description Gordon Research Conference, Diffraction Methods in Structural Biology (attendance by M Vollmar) 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Attendance by invitation at the Gordon Research Conference on Diffraction Methods in Structural Biology, Bates College, Maine, US (in person). Melanie Vollmar chaired a scientific session at the conference and gave an introductory talk on "Artificial Intelligence in Structural Biology" based directly on her experience gained from conducting this research activity.
Year(s) Of Engagement Activity 2022
URL https://www.grc.org/diffraction-methods-in-structural-biology-conference/2022/
 
Description School Visit (East Hendred) 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Schools
Results and Impact This presentation was given as part of the school's special activities during National Science Week. I was one of several parents presenting their scientific work and daily activities as a researcher to primary school children. As a whole school event, this included over 100 children from Reception (5 years of age) up to Year 6 (11 years of age). The main focus was on interesting the children in science and research centred on the life and biological sciences, and on opening up the perspective that there is no "standard" scientist and that anyone can be one. Besides giving some details about my scientific work, I also explained some key decisions I had to make in order to become a scientist. After the presentation, the children were given the opportunity to experiment with light and magnets, as these are key elements of experiments in X-ray crystallography. I also explained to them the concept of (electromagnetic) waves. They were also able to build small molecules with plasticine and straws, as well as a crystal lattice made of skewers and marshmallows.
Year(s) Of Engagement Activity 2022