CCP4 Grant Renewal 2014-2019: Question-driven crystallographic data collection and advanced structure solution

Lead Research Organisation: Diamond Light Source
Department Name: Science Division


Proteins, DNA and RNA are the active machines of the cells which make up living organisms, and are collectively known as macromolecules. They carry out all of the functions that sustain life, from metabolism through replication to the exchange of information between a cell and its environment. They are coded for by a 'blueprint' in the form of the DNA sequence in the genome, which describes how to make them as linear strings of building blocks. In order to function, however, most macromolecules fold into a precise 3D structure, which in turn depends primarily on the sequence of building blocks from which they are made. Knowledge of the molecule's 3D structure allows us both to understand its function, and to design chemicals to interfere with it.

Due to advances in molecular biology, a number of projects, including the Human Genome Project, have led to the determination of the complete DNA sequences of many organisms, from which we can now read the linear blueprints for many macromolecules. As yet, however, the 3D structure cannot be predicted from knowledge of the sequence alone. One way to "see" macromolecules, and so to determine their 3D structure, involves initially crystallising the molecule under investigation, and subsequently imaging it with suitable radiation.

Macromolecules are too small to see with normal light, and so a different approach is required. With an optical microscope we cannot see objects which are smaller than the wavelength of light, roughly 1 millionth of a metre; atoms are about 1000 times smaller than this. However, X-rays have a wavelength about the same as the size of atoms. For this reason, in order to resolve the atomic detail of macromolecular structure, we image them with X-rays rather than with visible light. The process of imaging the structures of macromolecules that have been crystallised is known as X-ray crystallography. X-ray crystallography is like using a microscope to magnify objects that are too small to be seen with visible light. Unfortunately, X-ray crystallography is complicated because, unlike in a microscope, there is no lens system for X-rays, and so additional information and complex computation are required to reconstruct the final image. This information may come from known protein structures using the Molecular Replacement (MR) method, or from other sources including Electron Microscopy (EM).

Once the structure is known, it is easier to pinpoint how macromolecules contribute to the living cellular machinery. Pharmaceutical research uses this as the basis for designing drugs to turn the molecules on or off when required. Drugs are designed to interact with the target molecule to either block or promote the chemical processes which they perform within the body. Other applications include protein engineering and carbohydrate engineering.

The aim of this project is to improve the key computational tools needed to extract a 3D structure from X-ray crystallography experiments. It will provide continuing support to a Collaborative Computing Project (CCP4, first established in 1979), which has become one of the leading sources of software for this task. The project will help users make efficient and effective use of the synchrotrons that produce the X-rays used in most crystallographic experiments. It will provide more powerful tools to allow users to exploit information from known protein structures when the match to the unknown structure is very poor. It will also automate the use of information from electron microscopy, even when the crystal structure has been distorted by the process of growing the protein crystal. Finally, it will allow structures to be solved even when only poor-quality and very small crystals are obtained.

Technical Summary

This proposal incorporates five related work packages.

In WP1 we will track synchrotron-collected data through computational structure determination, to find whether the most useful data can be recognised a priori using established or novel metrics of data quality and consistency. We will then enable data collection software to communicate with pipelines and graphics programs to assess when sufficient data have been collected for a given scientific question, and so to prioritise further beamtime usage. We will also communicate extra information about diffraction data to structure determination programs, and so support the statistical models and algorithms being developed in WP4.

WP2 will improve the key MR step of model preparation, especially from diverged, NMR, or ab initio models. One development will be to extend the size limit of ab initio search model generation by exploiting sequence covariance algorithms.

In WP3 we will use our description of electron density maps as a field of control points to better use electron density or atomic models positioned by MR. Restrained manipulation of these points provides a low-order parameterisation of refinement decoupled from atomic models, and therefore suitable for highly diverged atomic models or EM-derived maps. We will extend this approach to characterise local protein mobility without the requirement of TLS for predefinition of rigid groups.

In WP4 we will statistically model non-idealities in experimental data, including non-isomorphism, spot overlap, and radiation damage. The resulting models, implemented in REFMAC, will be applied to refinement using data that are annotated by WP1 tools and tracked by WP0.

WP0 will provide the tools to integrate the other WPs. For this, it will create a cloud environment where storage and compute resources can be utilised optimally, and where rich information can be passed among beamlines, pipelines, and graphics programs.

Planned Impact

With the tremendous improvements in beamline technology, it is in principle possible to collect many high-quality datasets per hour on a single synchrotron beamline. Nevertheless, few of these datasets convert to useful structures. Recognizing the potential impact on academic and commercial structural biology of improving this success rate, Diamond Light Source is directing effort and resources towards increasing productivity for the most challenging problems while also making low-hanging fruit routine work. For straightforward cases, automated pipelines can perform the bulk of the data analysis, in successful cases leading quickly to a partially or fully built molecular model. In the most difficult cases, however, the main benefit of automation is to take over the most laborious, time-consuming tasks (e.g. sample exchange, automated assessment of diffraction strength and sample alignment), enabling the crystallographer to focus effort on the more complex tasks. Automated data analysis frequently fails in such challenging cases, typically because they require multiple sets of data from multiple crystal samples to be gathered together in a unique way. The current Diamond automated pipelines are linear brute-force systems that use high-performance computing to attempt structure solution on virtually every data set recorded from the beamlines; however, users get little feedback to help improve their measurements or analysis. The provision of robust metrics within data analysis streams and their implementation in decision-making algorithms will transform the way structural biologists perform synchrotron experiments by

1) rationalising the use of beam time and increasing data set to structure conversion rate
2) providing users with a visual way of assessing data quality at every stage of analysis
3) facilitating decisions about ongoing experiments based on prior data and ongoing analysis
4) making more efficient use of HPC resource by prioritizing jobs based on likelihood of success
5) drastically reducing the distance between diffraction experiments and useful electron density

This package of work will run in parallel with the DIALS project (a collaboration between Diamond, CCP4 and other European partner synchrotrons), which will deliver advanced data integration software to tackle weaker and high-mosaicity data while addressing the very rapid frame rates (>100 Hz) expected from next-generation detectors. Together, the delivery of the DIALS software to synchrotrons by 2015 and the provision of assisted data collection and analysis tools from WP1, connected to CCP4 Cloud infrastructure from WP0, will create the opportunity for a leap forward in the level of complexity of crystallographic problem that UK users can address.

Diamond and CCP4 both have a track record in training students and young researchers. The tools developed within this grant will be promoted through practical workshops and courses specifically aimed at increasing the level of crystallographic expertise of our next generation of UK structural biologists.

As part of Diamond's continued engagement with the general public several open days are run yearly to communicate the science and technology of Diamond in an entertaining and memorable fashion. The use of visual props and games to explain the fundamental ideas and the importance of crystallography in biology and medicine has been a major part of this. Diamond/CCP4 cooperation in this project will provide an ideal opportunity to showcase CCP4's and the UK's contribution to biological crystallography over decades.


Description We have generated a comprehensive assessment of data analysis metrics and begun an assessment of their relative power in guiding automated data analysis. By performing a detailed review of literature from experts in the field we have been able to determine which of these metrics are likely to perform better at particular stages of analysis.
In 2016 we consolidated 539 test cases, including raw data and refined atomic structure information, into an annotated database accessible, in the first instance, to all methods developers in Diamond. This database has accelerated the rate at which we can now perform the tests described above.
In 2017 the database was first used to perform early machine learning trials to understand whether the available database could be used to generate predictive knowledge regarding the chances of successful structure determination. These were successful and generated some unexpected but positive results. The METRIX database has been used by the Cowtan group at the University of York to test its crystallographic software.
In 2018 three groups of metrics and protein/crystal properties representing different stages of the experimental process were selected to build machine learning classifiers, which were tested using datasets measured by Arnaud Basle from the University of Newcastle, UK. These tests have clearly indicated the potential power of ML classifiers for predicting the likely outcome of an experimental phasing experiment but have also clearly illustrated the need for a more comprehensive training database based on actual user data from Diamond.
Exploitation Route The test data set library is currently being annotated so that this can be of value to other developers of crystallographic methods in the UK and indeed worldwide. Such a resource has been severely lacking within our community and we believe it will be welcomed by all of our colleagues.
Although our test data set database is currently only used internally at Diamond we plan to increase its size and make it openly available to all methods developers in the field of X-ray crystallography towards the end of the award period.
We will now develop the METRIX database into a live database that is automatically updated based on results obtained by Diamond users. This will ensure that the ML classifiers are relevant and reflect the current capabilities and performance of the beamlines being used by users.
Further, we are now implementing the classifiers in the CCP4 online cloud analysis scheme so that users are able to assess in real time the likelihood of success using their current data and/or phasing approach.
Sectors Agriculture, Food and Drink, Energy, Environment, Healthcare, Pharmaceuticals and Medical Biotechnology

Description To date our findings are being used internally within the group to guide further developments of our automated data analysis pipelines and externally with selected users in order to test the applicability of the machine learning classifiers to user data. Initial results are very promising but have also indicated how essential it is to have a comprehensive database of structure solutions for training the classifiers. The findings are being used to develop a comprehensive plan for data analysis from X-ray data collection through to electron density inspection in the context of automated decision making. Incremental improvements based on our findings are implemented in the automated macromolecular crystallography data analysis pipelines at Diamond Light Source and are of benefit to industrial users of the beamlines, principally those in the pharmaceutical industries. The recent development of an X-ray crystallography dataset database is now allowing routine testing of automated pipelines and accelerating our research into predictive metrics for the early assessment of the usefulness of data. This will again accelerate the pace at which we can address the challenges of delivering high-quality structural and chemical information to academic and industrial researchers alike.
First Year Of Impact 2018
Sector Healthcare, Pharmaceuticals and Medical Biotechnology
Impact Types Economic

Description Member of AthenaSWAN self-assessment panel
Geographic Reach Local/Municipal/Regional 
Policy Influence Type Participation in an advisory committee
Title A classification tool for crystallographic data sets for their chances of experimental phasing success using machine learning 
Description A computational tool was created which uses data processing statistics to predict the likely experimental phasing outcome for a diffraction data set in protein crystallography. The application makes use of machine learning and statistical techniques and methods. 
Type Of Material Improvements to research infrastructure 
Year Produced 2019 
Provided To Others? No  
Impact Not yet available. 
Title Database for Diffraction Data in Macromolecular Crystallography 
Description This database is a collection of diffraction data produced in macromolecular crystallography (MX). The data has been measured in facilities worldwide and therefore covers varying levels of data quality, experimental accuracy, equipment settings, data resolution etc. The aim is to make this database available to software developers in the MX community, at least within the UK, by the end of the grant. As this is still work in progress no URL or DOI are available yet. 
Type Of Material Database/Collection of data 
Year Produced 2015 
Provided To Others? Yes  
Impact The database is already being used in-house at Diamond Light Source for software development and to monitor performance of computational services in macromolecular crystallography. 2016: The database is now routinely used to stress-test software developments in-house. This focuses mainly on XIA2 and DIALS to improve their performance and robustness. The number of test sets is currently 539. 2017: Diamond bought additional data storage space in order to be able to store the test data collection more reliably than at the current location, which is publicly shared and prone to off-line time for maintenance.
Description Effective Resolution in Protein Crystallography 
Organisation Medical Research Council (MRC)
Country United Kingdom 
Sector Academic/University 
PI Contribution Melanie Vollmar is in charge of providing test data. She is located at Diamond Light Source and measures data according to project requirements but also maintains a database of test data, which is a collection of data acquired in various labs all over the world. Melanie is given new or adapted software code by collaborators at LMB-MRC, which is then challenged with the test data. Based on the test outcome, the software is further adapted. The test data available cover a broad spectrum of data quality and data resolution and hence serve as a good starting point for determining the effective resolution of crystallographic data and structures.
Collaborator Contribution Phil Evans and Garib Murshudov at LMB-MRC provide software and write code. During discussions about other problems related to macromolecular crystallography, a need to describe effective data and structure resolution was identified. Both collaborators are well-established developers of macromolecular crystallography software but were lacking a sufficiently large collection of test data to progress with their development. After changes and amendments to code have been made, the new versions of software are given to Melanie for testing.
Impact no outcome yet
Start Year 2015
Description External testing of METRIX database 
Organisation University of York
Department Department of Chemistry
Country United Kingdom 
Sector Academic/University 
PI Contribution My team have developed the METRIX database against which crystallographic software developments can be rigorously tested.
Collaborator Contribution The groups of Prof K.S. Wilson and Dr K. Cowtan at York have tested the METRIX database and provided valuable feedback on its content and quality of curation.
Impact Improved quality of METRIX database and the possibility of expanding the database through inclusion of diffraction data from York structural biology groups. The collaboration is multidisciplinary, touching on scientific software development, structural biology and database development.
Start Year 2017
Title Python based data analysis tools 
Description A series of Python wrapper scripts that allow easy execution of crystallographic software for analysis of test data sets. 
Type Of Technology Software 
Year Produced 2016 
Open Source License? Yes  
Impact All scripts can be easily called and run on computing clusters and have greatly increased the speed of data analysis for the test data (539 data sets). The total analysis time has decreased from a couple of weeks to two or three days. 
Title Python based experimental phasing prediction 
Description Python and its machine learning extensions were used to create a prediction tool for experimental phasing outcome. The application makes use of a collection of classifiers which have been trained using the data held in METRIX database (also part of this project). The project currently awaits incorporation into the general facility infrastructure to be available to the user community. 
Type Of Technology Software 
Year Produced 2019 
Open Source License? Yes  
Impact Not yet available. The expected outcome is that giving facility users a probability of success or failure for each experimental phasing attempt will let them focus on the attempts with a high chance of success, improving the usage of resources. 
Title SQLite based database system to manage test data sets and data analysis output 
Description This is a freely available (on GitHub) SQLite database to manage the collection of test data sets referred to in section "Databases and Models", which additionally links statistics and metrics from various data analysis steps. 
Type Of Technology Software 
Year Produced 2016 
Open Source License? Yes  
Impact Rather than the outcome for the various test data sets having to be tracked manually, the database takes on this task. Besides data management, it also allows statistics and metrics to be extracted conveniently as CSV files for use in machine learning tools further downstream, which provide the basis of user guidance at the beamlines. 
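The extraction of metrics from an SQLite database into CSV, as described for this tool, might look roughly like the sketch below. It is illustrative only and is not the project's code: the table names (`datasets`, `metrics`), the column names, the function names and the sample values are all hypothetical, and only the Python standard library (`sqlite3`, `csv`) is used.

```python
import csv
import io
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical two-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE datasets (
            id   INTEGER PRIMARY KEY,
            name TEXT UNIQUE NOT NULL
        );
        CREATE TABLE metrics (
            dataset_id INTEGER NOT NULL REFERENCES datasets(id),
            step       TEXT NOT NULL,   -- analysis step that produced the value
            metric     TEXT NOT NULL,
            value      REAL NOT NULL
        );
        """
    )
    # Made-up rows, for illustration only.
    conn.executemany(
        "INSERT INTO datasets (id, name) VALUES (?, ?)",
        [(1, "set_a"), (2, "set_b")],
    )
    conn.executemany(
        "INSERT INTO metrics VALUES (?, ?, ?, ?)",
        [
            (1, "integration", "resolution", 2.1),
            (1, "scaling", "completeness", 99.2),
            (2, "integration", "resolution", 3.4),
        ],
    )
    return conn


def export_metrics_csv(conn, out):
    """Write one CSV row per (data set, step, metric); return the row count."""
    cursor = conn.execute(
        "SELECT d.name AS dataset, m.step AS step, "
        "       m.metric AS metric, m.value AS value "
        "FROM datasets AS d JOIN metrics AS m ON m.dataset_id = d.id "
        "ORDER BY d.name, m.step, m.metric"
    )
    writer = csv.writer(out, lineterminator="\n")
    # Header row taken from the column aliases in the query.
    writer.writerow([column[0] for column in cursor.description])
    rows = cursor.fetchall()
    writer.writerows(rows)
    return len(rows)


if __name__ == "__main__":
    buffer = io.StringIO()
    export_metrics_csv(build_demo_db(), buffer)
    print(buffer.getvalue(), end="")
```

A long format with one row per metric is assumed here because it is straightforward to pivot into a feature table for downstream machine learning tools; a real schema may well differ.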
Description Guided tours to individual researchers/members of the public/school children 
Form Of Engagement Activity Participation in an open day or visit at my research institution
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Public/other audiences
Results and Impact short tours given to undergraduate students when interviewed for summer placements;
short tours to personal friends from general public;
short tours to pupils either as personal favour or as part of their stay on the wider campus for work experience
Year(s) Of Engagement Activity 2015,2016,2017,2018