CCP4 Advanced integrated approaches to macromolecular structure determination

Lead Research Organisation: Diamond Light Source
Department Name: Science Division


Proteins, DNA and RNA are the active machines of the cells which make up living organisms, and are collectively known as macromolecules. They carry out all of the functions that sustain life, from metabolism through replication to the exchange of information between a cell and its environment. They are coded for by a 'blueprint' in the form of the DNA sequence in the genome, which describes how to make them as linear strings of building blocks. In order to function, however, most macromolecules fold into a precise 3D structure, which in turn depends primarily on the sequence of building blocks from which they are made. Knowledge of the molecule's 3D structure allows us both to understand its function, and to design chemicals to interfere with it.
Due to advances in molecular biology, a number of projects, including the Human Genome Project, have led to the determination of the complete DNA sequences of many organisms, from which we can now read the linear blueprints for many macromolecules. As yet, however, the 3D structure cannot be predicted from knowledge of the sequence alone. One way to "see" macromolecules, and so to determine their 3D structure, involves initially crystallising the molecule under investigation, and subsequently imaging it with suitable radiation.
Macromolecules are too small to see with normal light, and so a different approach is required. With an optical microscope we cannot see objects smaller than the wavelength of light, roughly one millionth of a metre; atoms are about 1000 times smaller than this. X-rays, however, have a wavelength about the same as the size of atoms. For this reason, in order to resolve the atomic detail of macromolecular structure, we image macromolecules with X-rays rather than with visible light.
The process of imaging the structures of macromolecules that have been crystallised is known as X-ray crystallography. X-ray crystallography is like using a microscope to magnify objects that are too small to be seen with visible light. Unfortunately X-ray crystallography is complicated because, unlike a microscope, there is no lens system for X-rays and so additional information and complex computation are required to reconstruct the final image. This information may come from known protein structures using the Molecular Replacement (MR) method, or from other sources including Electron Microscopy (EM).
Once the structure is known, it is easier to pinpoint how macromolecules contribute to the living cellular machinery. Pharmaceutical research uses this as the basis for designing drugs to turn the molecules on or off when required. Drugs are designed to interact with the target molecule to either block or promote the chemical processes which they perform within the body. Other applications include protein engineering and carbohydrate engineering.
The aim of this project is to improve the key computational tools needed to extract a 3D structure from X-ray and electron diffraction experiments. It will provide continuing support to a Collaborative Computing Project (CCP4, first established in 1979), which has become one of the leading sources of software for this task. The project will help efficient and effective use to be made of the synchrotrons that produce the X-rays used in most crystallographic experiments, but will also extend to electron microscopes, which have gained much recent publicity with the award of a Nobel Prize to researchers in the field. It will provide more powerful tools to allow users to exploit information from known protein structures even when the match to the unknown structure is very poor. Finally, it will allow structures to be solved even when only poor-quality and very small crystals are obtained.

Technical Summary

This proposal incorporates four related work packages.
In WP1 we will expand on our work using established and novel metrics of data quality and consistency to quantify the relationship between diffraction and map quality. These tools will be used to optimise structure determination from multiple-crystal or serial crystallography data, enabling optimal selection of the collected data and full use of all the information in structural refinement. WP1 will also develop and implement methods for electron diffraction data collection, integration and refinement.
WP2 will generalise shift field refinement, extend its use to hybrid refinement approaches, and develop new software libraries to enhance and speed up protein structure model building and refinement across a wide resolution range.
In WP3 we will develop and implement contact prediction methods for use in crystallography. These will help identify protein domain boundaries and define new search model approaches. Contact prediction will also be used to validate Molecular Replacement solutions and to assist in the interpretation of crystallographically derived protein:protein contacts.
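One way contact predictions can validate a Molecular Replacement solution is to measure what fraction of the predicted residue-residue contacts is actually satisfied by the candidate model. The sketch below illustrates the idea with a toy scoring function; the 8 Å C-beta cutoff is a common convention, and the function names and data are illustrative, not the project's actual software.

```python
import numpy as np

def contact_satisfaction(predicted_pairs, cb_coords, cutoff=8.0):
    """Fraction of predicted residue-residue contacts satisfied by a
    candidate model, e.g. a Molecular Replacement solution.

    predicted_pairs : iterable of (i, j) residue index pairs from a predictor
    cb_coords       : (n_residues, 3) array of C-beta coordinates
    cutoff          : distance in Angstroms below which a contact counts
    """
    cb = np.asarray(cb_coords, dtype=float)
    pairs = list(predicted_pairs)
    hits = sum(np.linalg.norm(cb[i] - cb[j]) <= cutoff for i, j in pairs)
    return hits / len(pairs)

# Toy model with three residues: residues 0 and 1 are close, 0 and 2 are not.
cb = [[0.0, 0.0, 0.0], [5.0, 0.0, 0.0], [20.0, 0.0, 0.0]]
score = contact_satisfaction([(0, 1), (0, 2)], cb)
```

A low score relative to other candidate placements would flag an implausible solution.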
In WP4 we will develop a model of electron scattering from macromolecular samples to inform software development and experimental design. These models will be used to develop and implement new scaling algorithms for electron diffraction data within DIALS.

Planned Impact

The impact of macromolecular crystallography and CCP4 on fundamental biomedical research, as well as on the pharmaceutical industry, is described in the Pathways to Impact section.
The popularity of macromolecular crystallography and cryoEM has resulted in these techniques being applied to increasingly challenging macromolecular structures. These typically exhibit intrinsically position-dependent mobility, resulting in limited data and varying signal-to-noise ratio in different parts of the maps. Moreover, in crystallography, data collected from multiple crystals are often merged together, and serial crystallography has become a standard tool for structural biologists. The popularity of using CC1/2 to select the parts of the data that are suitable for structure elucidation means that the signal-to-noise ratio can be very low, diminishing to 1 or even less. Also, the criteria currently used to assess quality - such as resolution and R-factors - are becoming increasingly confusing for practical structural biologists and journal referees alike. It is therefore timely to re-evaluate quality indicators, ensuring that all data collected during the experiment are optimally utilised. A new Fourier-optics-based quality indicator will address this problem by giving an objective indication of the resolvability of peaks in the calculated maps. This measure will depend on directional and time-dependent data quality, data completeness, and the current state of the statistical model (including the atomic model). Such an indicator will also address the problem of local resolution, and will be used for position- and direction-dependent map de-blurring, thus making maps more interpretable.
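For readers unfamiliar with CC1/2: it is the Pearson correlation between intensities merged from two random halves of the measurements. A minimal numpy sketch with synthetic data is shown below; this is a toy illustration of the statistic only, not the DIALS implementation.

```python
import numpy as np

def cc_half(half1, half2):
    """CC1/2: Pearson correlation between the merged intensities of two
    random half data sets covering the same set of reflections."""
    h1 = np.asarray(half1, dtype=float)
    h2 = np.asarray(half2, dtype=float)
    return np.corrcoef(h1, h2)[0, 1]

# Synthetic example: a common set of 'true' intensities, with independent
# measurement noise added to each half data set.
rng = np.random.default_rng(0)
true_i = rng.exponential(scale=100.0, size=1000)
half_a = true_i + rng.normal(0.0, 80.0, size=true_i.size)
half_b = true_i + rng.normal(0.0, 80.0, size=true_i.size)
value = cc_half(half_a, half_b)  # well below 1 because of the added noise
```

Computed in thin resolution shells, this statistic is what is used in practice to judge where the signal runs out.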
Developed techniques will be implemented in the new data-scaling program developed by the DIALS group. This tool will calculate the limit of useful data as well as the maximum expected resolvability, providing structural biologists with a way to decide whether the experiment should be continued (i.e. more data are required). These techniques will also be implemented in the refinement program REFMAC5, allowing the difference between current and maximum resolvability to be analysed and utilised for decision making by practical crystallographers and automatic pipelines. Resolvability for each data set, with and without the refined model, will be calculated for the METRIX data. This will be included in the feature vector that is used by machine learning algorithms for map quality assessment.
This work package will also address the growing popularity of microED: electron diffraction from macromolecular microcrystals. One element of WP1 will focus on the joint refinement of electron and X-ray diffraction data using the joint conditional probability distribution of two related data sets, with corresponding atomic models reflecting electrostatic potential and electron density, respectively. Under the first Born approximation, electrons are diffracted by the electrostatic potential and X-rays are scattered by the electron charge cloud. These are related by the Poisson equation. This fact will be used for joint refinement, as parameterised in Fourier space by the Mott-Bethe formula, allowing a reduction of the effective number of parameters. Using this formula means that one set of atomic scattering factors can be used for both electron and X-ray diffraction. We will explore the possibility of point charge refinement when high-quality electron diffraction data are available, possibly together with X-ray diffraction data. To perform such refinement we will need to account for effects such as absorption and radiation damage; such effects can change the charge distribution dramatically. Consequently, unmerged data must be used for such refinement. This part of WP1 will be carried out in collaboration with WP4.
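The Mott-Bethe relationship mentioned above converts an X-ray atomic scattering factor into an electron scattering factor. A minimal sketch follows; the constant is the standard tabulated value, but the example input value for f_x is an illustrative assumption, and this is in no way the parameterisation used in REFMAC5 or DIALS.

```python
def mott_bethe(z, f_x, s):
    """Electron scattering factor f_e (in Angstroms) from the X-ray
    scattering factor via the Mott-Bethe formula.

    z   : atomic number
    f_x : X-ray scattering factor (electrons) at the same s
    s   : sin(theta)/lambda in 1/Angstrom (must be non-zero)
    """
    C = 0.023934  # m0*e^2 / (8*pi*eps0*h^2), standard value in 1/Angstrom
    return C * (z - f_x) / (s * s)

# Illustrative example: a carbon atom (Z = 6) with an assumed
# f_x of 3.0 electrons at s = 0.5 per Angstrom.
f_e = mott_bethe(6, 3.0, 0.5)
```

Because f_e is derived from f_x and Z, refining against both data types with this parameterisation needs only one set of scattering factors, which is the parameter reduction the text refers to.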
All developed software will be distributed through CCP4, making it accessible to the structural biology community worldwide.


Description The database holds a collection of tables describing crystallographic diffraction data and the metadata derived from them. Some of the metadata have been extracted from reference files, whereas other information has been generated independently through customised data analysis pipelines such as intensity integration, data reduction and structure phasing. Additionally, information based on protein sequence analysis is available. The database is described in more detail here: The information within the database was used to develop machine-learning-based decision-making tools which have been implemented in automated structure solution pipelines at the user facility. This database is an expanded and redesigned version of the one used in a previous grant. 
Type Of Material Database/Collection of data 
Year Produced 2020 
Provided To Others? Yes  
Impact This information is currently being compiled. 
Title Python based data analysis tools 
Description A series of Python wrapper scripts which allow easy execution of crystallographic software for the analysis of test data sets. 
Type Of Technology Software 
Year Produced 2016 
Open Source License? Yes  
Impact All scripts can be easily called and run on computing clusters and have greatly increased the speed of data analysis for the test data (539 data sets). The total analysis time has decreased from a couple of weeks to two or three days. 
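A wrapper of this kind typically shells out to the underlying program and captures its log. The sketch below is a minimal, hypothetical illustration of that pattern using only the standard library; the command strings and log naming are assumptions, not the project's actual scripts.

```python
import shlex
import subprocess

def run_job(command, log_path):
    """Run one external program invocation (e.g. a crystallographic tool),
    sending stdout and stderr to a log file; returns the exit code."""
    with open(log_path, "w") as log:
        result = subprocess.run(shlex.split(command),
                                stdout=log, stderr=subprocess.STDOUT)
    return result.returncode

def run_all(commands, log_prefix="job"):
    """Run a list of independent command lines, logging each one."""
    return [run_job(cmd, f"{log_prefix}_{n}.log")
            for n, cmd in enumerate(commands)]
```

In the project's setting, the same per-job pattern would be dispatched to a computing cluster's queue rather than run in a plain loop, which is what made the large speed-up over manual analysis possible.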
Title Python based experimental phasing prediction 
Description Python and its machine learning extensions were used to create a prediction tool for experimental phasing outcome. The application makes use of a collection of classifiers which have been trained using the data held in the METRIX database (also part of this project). The project currently awaits incorporation into the general facility infrastructure to become available to the user community. 
Type Of Technology Software 
Year Produced 2019 
Open Source License? Yes  
Impact Not yet available. The expected outcome is that, by giving facility users a probability of success or failure for their experimental phasing attempt, users will focus only on attempts with a high chance of success, improving the use of resources. 
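The decision logic described above, combining several trained classifiers into one success probability and a go/no-go recommendation, can be sketched as follows. This is a hypothetical illustration of the ensemble idea only; the real tool's classifiers, features and threshold are not specified in this record.

```python
import numpy as np

def ensemble_success_probability(classifier_probs):
    """Average the 'success' probabilities from a collection of trained
    classifiers to obtain one probability per data set.

    classifier_probs : (n_classifiers, n_data_sets) array-like of
                       per-classifier success probabilities.
    """
    p = np.asarray(classifier_probs, dtype=float)
    return p.mean(axis=0)

def advise(prob, threshold=0.7):
    """Turn a probability into simple user guidance (threshold assumed)."""
    return "proceed" if prob >= threshold else "reconsider"

# Three hypothetical classifiers scoring two data sets.
probs = ensemble_success_probability([[0.9, 0.2],
                                      [0.8, 0.3],
                                      [0.7, 0.4]])
```

Averaging is the simplest combination rule; weighted voting or stacking would follow the same interface.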
Title Python-topaz3 
Description This software package uses volumetric data, here electron density maps of proteins, to identify features in such maps with the goal of determining the atomic structure of a given protein. Identification is performed by deep learning using neural networks. 
Type Of Technology Software 
Year Produced 2019 
Open Source License? Yes  
Impact This is currently being assessed. 
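Training a network on volumetric maps usually starts by cutting each map into fixed-size cubic patches. The sketch below shows that preprocessing step in plain numpy; the patch and stride sizes are illustrative assumptions, not parameters of the Python-topaz3 package.

```python
import numpy as np

def extract_patches(density_map, patch=16, stride=8):
    """Cut a 3D map into overlapping cubic patches, a common preprocessing
    step before training a 3D convolutional neural network on map features."""
    m = np.asarray(density_map, dtype=float)
    out = []
    for x in range(0, m.shape[0] - patch + 1, stride):
        for y in range(0, m.shape[1] - patch + 1, stride):
            for z in range(0, m.shape[2] - patch + 1, stride):
                out.append(m[x:x + patch, y:y + patch, z:z + patch])
    return np.stack(out)  # shape: (n_patches, patch, patch, patch)

patches = extract_patches(np.zeros((24, 24, 24)))
```

Each patch would then be labelled (e.g. by map quality or by features present) and fed to the network.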
Title SQLite based database system to manage test data sets and data analysis output 
Description This is a freely available (on GitHub) SQLite database to manage the collection of test data sets referred to in the section "Databases and Models", which additionally links statistics and metrics from the various data analysis steps. 
Type Of Technology Software 
Year Produced 2016 
Open Source License? Yes  
Impact Rather than having to track the outcome for the various test data sets manually, the database takes on this task. Besides data management, it also allows statistics and metrics to be extracted conveniently, as CSV files, for use in machine learning tools further downstream, which form the basis of user guidance at the beamlines.
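The CSV export step described above can be done entirely with the Python standard library. The sketch below is a generic illustration of dumping one table to CSV; the table and column names are hypothetical, not the schema of the actual METRIX database.

```python
import csv
import sqlite3

def export_table_to_csv(db_path, table, csv_path):
    """Dump one table of an SQLite results database to a CSV file for use
    in downstream machine learning tools.

    Note: the table name is interpolated directly, so it must come from a
    trusted source (no user input).
    """
    con = sqlite3.connect(db_path)
    try:
        cur = con.execute(f"SELECT * FROM {table}")
        header = [col[0] for col in cur.description]
        with open(csv_path, "w", newline="") as fh:
            writer = csv.writer(fh)
            writer.writerow(header)
            writer.writerows(cur)
    finally:
        con.close()
```

Each analysis-step table (integration statistics, phasing metrics, and so on) could be exported this way and joined into a feature matrix for the machine learning tools.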