Validation of NMR protein structures using FIRST and RCI

Lead Research Organisation: University of Sheffield
Department Name: School of Biosciences

Abstract

Protein structures are essential for understanding protein function, and for drug design. In order to make use of structures, it is vital for users to know how good the structures are. The structures are generated mainly from X-ray crystallography and NMR. For crystal structures, there are reliable ways of knowing how good the structure is. These are based on the fact that a structure can be used to calculate exactly what the input data should look like: a comparison with the actual diffraction data therefore gives a reliable quantitative measure of quality. For NMR, there is no such measure, meaning that so far it is very difficult to know how good an NMR structure is. This is a problem not only for users of structural information, but also for the scientists who calculate the structures, since they also have no way to judge how good their structures are.
In this proposal we describe a method for calculating the quality of NMR structures (ie, validation), based on comparing two measures of local rigidity, one derived from the structures and one from the original input data. The first measure is calculated using an established method for identifying rigid clusters based on graph theory, called FIRST, and developed by our collaborator Dr Sljoka. The second measure uses the Random Coil Index (RCI), which is a program based on the simple idea that the NMR frequencies ('chemical shifts') of protein backbone atoms have very characteristic 'random coil' shifts when the protein is locally disordered, and therefore that the experimentally measured shifts in a protein can be used to quantify to what extent a given amino acid residue is disordered. A comparison of these two measures of local rigidity therefore provides a residue-by-residue test of how well the rigidity of the structures compares to the experimentally determined 'true' rigidity. Although this is not a direct comparison between structure and input, it is likely to be as close as one can get for NMR structures, and is a major improvement in the NMR structure determination process. The proposal describes how we will go about implementing the comparison and checking that it works as expected, and then how we will make it available to the community and use it to examine NMR structures, for example by reporting on the quality of all existing protein NMR structures (objective 1).

Having developed the validation tool, we then propose to apply it to some useful ends. The first of these (objective 2) is to identify sets of 'good' and 'bad' NMR structures. So far there has been no good way to know how good structures are: by identifying such structures we expect to generate an important resource for the structural biology community by marking out quality criteria and therefore stimulating further research into structure quality.

Whereas crystal structures are typically represented by a single set of coordinates at the average position (together with 'B factors' that represent the uncertainty in each coordinate), NMR structures are always represented as an ensemble of (typically 20) structures. There is a valid reason for this: NMR structures are inherently less well defined than crystal structures. Nevertheless, it is confusing and unnecessary. We aim to apply our method to define more closely how many structures in an ensemble are really necessary, and whether some are simply wrong. In order to assist the process, we will improve current methods for calculating chemical shifts from structures, by modifying them to work on ensembles.

Finally, we shall use our methods to look at an important class of protein structures called Intrinsically Disordered Proteins, to test whether current methods provide a correct representation of the true conformational ensemble. These represent roughly one third of human proteins (including many responsible for signalling), so are an important topic.

Technical Summary

Currently there is no good way to validate NMR structures, because there is no direct connection between the input data (NMR spectra) and the structures, as there is for crystal structures. We present preliminary results for a method that comes as close to this as possible, namely a comparison between chemical shifts (the Random Coil Index, RCI) and the program FIRST, which calculates the local rigidity of a structure or an ensemble of structures using mathematically rigorous methods. Both calculate the local rigidity of a protein. Crucially, the shifts are not used as part of the structure calculation, and represent only a small abstraction from the original NMR spectra. The comparison therefore comes as close as possible to a crystallographic R-free. We will test and refine the method, with the aim of rapidly and automatically generating a per-residue quality index for every NMR structure in the PDB, thereby for the first time allowing PDB users to know how good (accurate) any NMR structure is. Equally importantly, the method can be used by NMR groups to measure the accuracy of their structures at any stage in the structure calculation, and should therefore be a useful tool to improve structure calculations by identifying problems during the calculation.
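
The per-residue comparison described above can be illustrated with a minimal sketch. It assumes that the RCI output and the FIRST output have each already been reduced to one rigidity value per residue on a common scale. The function name, the sample numbers, and the choice of Pearson correlation and RMSD as summary statistics are illustrative assumptions, not the scoring used by the project's software.

```python
import math


def compare_rigidity(shift_rigidity, structure_rigidity):
    """Compare two per-residue rigidity profiles (hypothetical helper).

    shift_rigidity:     values derived from chemical shifts (e.g. via RCI)
    structure_rigidity: values derived from the structure (e.g. via FIRST)
    Returns (pearson_r, rmsd). Both inputs are assumed to be on the same scale.
    """
    if len(shift_rigidity) != len(structure_rigidity):
        raise ValueError("profiles must cover the same residues")
    n = len(shift_rigidity)
    if n < 2:
        raise ValueError("need at least two residues")

    mean_a = sum(shift_rigidity) / n
    mean_b = sum(structure_rigidity) / n
    pairs = list(zip(shift_rigidity, structure_rigidity))

    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    var_b = sum((b - mean_b) ** 2 for _, b in pairs)

    # Correlation is undefined if either profile is completely flat.
    if var_a == 0 or var_b == 0:
        pearson_r = float("nan")
    else:
        pearson_r = cov / math.sqrt(var_a * var_b)

    # RMSD captures absolute disagreement, which correlation alone misses.
    rmsd = math.sqrt(sum((a - b) ** 2 for a, b in pairs) / n)
    return pearson_r, rmsd


# Made-up numbers for a five-residue stretch, purely for illustration.
from_shifts = [0.9, 0.8, 0.3, 0.2, 0.85]
from_structure = [0.95, 0.7, 0.6, 0.25, 0.8]
r, rmsd = compare_rigidity(from_shifts, from_structure)
print(f"correlation = {r:.2f}, rmsd = {rmsd:.2f}")
```

Reporting both numbers is a deliberate choice in this sketch: the correlation shows whether rigid and flexible regions fall in the same places, while the RMSD shows whether the structure is too rigid or too floppy overall.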

All NMR structures in the PDB are deposited as ensembles. The relevance of the individual members of an ensemble is far from clear, and the selection process is opaque. Our method will throw light on this, and hopefully stimulate a change in behaviour, or at least debate. It will show how well individual members match, and thus identify outliers. We propose to update our (1993) program for calculating protein chemical shifts to operate on ensembles rather than single (crystal) structures. This will allow us to identify outliers using a second independent method, and thus work towards re-defining NMR ensembles. Finally, we shall better characterise residual structure in Intrinsically Disordered Proteins.

Planned Impact

This work will only have Impact if it is taken up and used by the scientific community, in particular structural biologists. Hence a key aim of the proposal is to make sure that the programs are adopted and used widely once the methodology has been tested and checked. We have therefore built this aim firmly into the proposal:

1. The work will be disseminated as widely as possible, for example by publication in international scientific journals, and presentations at relevant conferences. MPW is on the organising committee for the next ICMRBS meeting in Dublin in August 2018, and if the project is suitably advanced by that point, he will propose a workshop satellite meeting at ICMRBS to cover validation.

2. Part of Objective 1 is to carry out calculations of the quality of all NMR structures in the PDB, publish them on a website and make them available in an easily accessible archive. This will make it easy for anyone to check the accuracy of an NMR structure, and should increase the usage and therefore the impact.

3. The PDB has a task force on protein validation, which has published its preliminary findings (reference 8 in the proposal). It proposed three phases for developing validation, of which the third recognises the need to develop new tools, specifically based around chemical shifts. We propose to put a lot of effort into integrating our methodology with the validation software made available on the PDB website, with the aim of getting PDB to include our method as one of the standard measures for validating NMR structures. In the UK, the two key people who would be involved in this dialogue are Aleksandras Gutmanas, who works at PDBe in Hinxton, Cambridgeshire, and Geerten Vuister, Professor in the Department of Molecular and Cell Biology at the University of Leicester. Gutmanas has a specific responsibility for NMR structures and NMR validation in PDB, while Vuister has worked in NMR validation for many years, is a member of the PDB validation task force (as is Gutmanas' boss at PDBe, Gerard Kleywegt) and is also chair of CCPN. CCPN is the Collaborative Computational Project for NMR, funded by the BBSRC from 2000 to 2012 and by MRC from 2013. It develops programs for NMR analysis, and aims to 'determine and spread best practice in NMR'. We have initiated discussions with both of them. MPW is also in occasional touch with Guy Montelione, chair of the PDB NMR validation task force, and with John Markley, director of the BioMagResBank, which is the repository for all NMR protein chemical data. He has also held detailed discussions with Naohiro Kobayashi, who is the NMR expert in PDBj (the Japanese wing of PDB) and who has a major interest in validation: MPW shared an office with him for a year while on sabbatical in Osaka a few years ago. We therefore feel that we are well placed to get our software incorporated into the PDB NMR validation suite. We also hope to get it linked into the CcpNMR Analysis website.

4. Training. One output from this research is that the PDRA involved (who we expect to be a computer scientist) will be trained in protein structure and NMR. Such cross-disciplinary training is increasingly important.

5. Engagement with the public. We shall use virtual reality displays to explain and delight, focussing on 'wrong' structures. We shall also explore using an app for mobile phones that enables users to see objects in 3D on their mobiles using cheap and readily available cardboard glasses, which is an engaging introduction to protein structure in general, and 'mistakes' in particular. The concept that scientists sometimes make mistakes is worth working at, so that it is presented clearly and entertainingly without being sensationalist.
 
Description The first aim of this project is to develop tools for validating NMR structures (ie, measuring how correct they are). We have now done this, and a research paper has been published that describes the method, as well as its first application to comparing the quality of NMR and crystal structures. The second stage of the project (as set out in the original research proposal) was to look at the NMR structures deposited in the Protein Data Bank and carry out an analysis of their quality, and this work has also been published. We have given several public lectures on the method. We have been engaging with the Protein Data Bank and CCPN (see below). We have two more publications ongoing. One (now submitted) looks at the value of NMR structures as compared to those predicted by the computer AI method AlphaFold: what is the point of calculating an NMR structure if AlphaFold can do it quicker and (arguably) better? Finally, we are completing work on another paper that uses ANSURR to follow the progress of NMR structure calculation, in particular using hydrogen bond restraints. We expect this paper to be of major interest to the many groups that calculate protein structures using NMR.
Exploitation Route We anticipate that this work will be used widely to measure the quality of NMR structures. All the methods developed have been deposited in publicly available sites, to facilitate this process. The software that was a major aim of the project is now publicly available, as are data on the quality of all NMR structures in the Protein Data Bank.
Sectors Agriculture, Food and Drink, Pharmaceuticals and Medical Biotechnology

URL http://ansurr.com
 
Title ANSURR computer program for determining accuracy of protein structures in solution 
Description This is a computer program resulting from the work in this project, which is used to determine the accuracy of protein structures in solution, most obviously NMR structures. The program is available for download, a website publishes results from ANSURR, and a web server is on its way. We have already published two outcomes from this, with one more submitted and one about to be submitted, plus a grant proposal currently with BBSRC. 
Type Of Material Technology assay or reagent 
Year Produced 2020 
Provided To Others? Yes  
Impact Too early as yet. We are in discussions with the Protein Data Bank about adding this tool to their validation program, which we are keen to do. 
URL http://ansurr.com
 
Description Database of NMR structure quality 
Organisation University of Leicester
Country United Kingdom 
Sector Academic/University 
PI Contribution We have a novel method for characterising the quality of NMR structures
Collaborator Contribution Prof Geerten Vuister at the University of Leicester has constructed a database that contains a range of parameters that relate to the quality of protein structures
Impact Prof Vuister has given us access to the database and we have embarked on a joint investigation
Start Year 2018
 
Description PDB NMR Validation task force 
Organisation Rensselaer Polytechnic Institute
Country United States 
Sector Academic/University 
PI Contribution This is a subgroup of PDB, set up to monitor and improve the quality of NMR structures. Our work, including work carried out under the funding provided by this grant, has led to ongoing discussions with the task force, which we hope will lead to our methodology (ANSURR) being adopted by PDB as a recommended validation tool. This will have the effect of improving the quality of all NMR structures submitted to PDB.
Collaborator Contribution The PDB is the recognised repository for experimental protein structures. Structures in the PDB are provided by X-ray crystallography, NMR, and cryo-electron microscopy. NMR structures are under-used and considered less reliable, because up till now there has been no way to check their accuracy. The task force was set up to provide guidelines on the best way to measure and document accuracy. So far they have produced guidelines but little tangible help. It is our expectation that through this collaboration, we will provide a significant input to the validation process.
Impact Our work has been referenced in publications from Guy Montelione (chair of the task force).
Start Year 2021
 
Title ANSURR 
Description The program is used to measure the accuracy of protein structures in solution, by comparing the rigidity of the structure (calculated using graph theory) to the rigidity of the structure measured using NMR backbone chemical shifts in solution. 
Type Of Technology Webtool/Application 
Year Produced 2020 
Open Source License? Yes  
Impact too early as yet 
URL http://ansurr.com
 
Title ANSURR 
Description Used to determine the accuracy of protein structures in solution 
Type Of Technology Software 
Year Produced 2020 
Open Source License? Yes  
Impact too early as yet 
URL http://ansurr.com