CCPNGrid: A framework for high throughput computing in NMR spectroscopy

Lead Research Organisation: University of Cambridge
Department Name: Biochemistry

Abstract

Proteins are the workhorses of a living organism. They are involved in many functions, and without them life as we know it could not exist. In the human body, for example, certain proteins transport oxygen in the blood, others defend us against bacteria and viruses, and still others help to digest food. All proteins are composed of amino acids. These amino acids are the building blocks of proteins, and they are connected to each other to form a long chain. There are 20 naturally occurring types of amino acid, each with a different shape. Because each protein has a unique amino acid sequence, how the protein chain folds in 3D space is also unique. For example, a part of the chain can fold back on itself (beta-hairpin), or it can fold into a coil-like structure (alpha-helix). These structural elements then interact with each other to form the complete fold of the protein. To understand how a protein works, we need to know how this long chain of amino acids folds. This can be determined using two techniques: X-ray crystallography and Nuclear Magnetic Resonance (NMR) spectroscopy. This proposal is directed at NMR. With NMR, you can determine which atoms in the protein are close to each other in space. For example, if the protein chain forms a circle, you can determine that the atoms of the first amino acid are close in space to the atoms of the last amino acid. NMR experiments produce a lot of this type of distance information, and a lengthy calculation on a computer is necessary to determine the exact fold of the protein chain. These calculations convert distance information into three-dimensional coordinates. This is called a 'structure calculation', and it can be done in many different ways. These structure calculations are also quite complex and require a lot of expertise to set up on a computer. We propose to automatically set up and run the latest and most sophisticated structure calculation software on a set of fast computers.
This software would be available over the internet to researchers, so that they can use state-of-the-art software with little effort. They could also install it in their own laboratories if they have sufficiently fast computers of their own. Even with the best software, it is still possible that there are problems with the results of the structure calculation. This can be due to mistakes made when analysing the NMR data, or simply because there was not enough information to get a good answer when the calculation was started. For this reason, we will also automatically run validation programs that analyse the structures resulting from the calculation. This validation will help the researcher find out whether the results are scientifically correct. Finally, we can use the calculation setup to recalculate old structures. The Protein Data Bank (PDB) stores the structures of proteins that were calculated by people all over the world. The way different scientists calculate the structures can, however, be very different, and it can be difficult to directly compare the structures to each other. Recalculating the structures using the same program will improve the quality of the structures. They will also be more consistent with each other, and it will be easier to compare them directly.
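To make the idea of converting distance information into three-dimensional coordinates concrete, the following is a minimal Python sketch. It assumes a complete table of exact pairwise distances and places the points one at a time by simple geometry. This is a textbook illustration only, not the method used by the software named in this proposal: real NMR data gives sparse and imprecise distance information, which is why dedicated structure calculation programs are needed. The function names and the toy coordinates are invented for the example.

```python
import math


def build_coordinates(d):
    """Place points in 3D from a complete matrix of exact pairwise distances.

    Assumes at least three points, that points 0-2 are not collinear and
    that points 0-3 are not coplanar. The result matches the original shape
    up to rotation, translation and mirror image.
    """
    x1 = d[0][1]  # point 1 lies on the x-axis
    x2 = (d[0][2] ** 2 + x1 ** 2 - d[1][2] ** 2) / (2 * x1)
    y2 = math.sqrt(max(d[0][2] ** 2 - x2 ** 2, 0.0))  # point 2 lies in the xy-plane
    coords = [(0.0, 0.0, 0.0), (x1, 0.0, 0.0), (x2, y2, 0.0)]
    for k in range(3, len(d)):
        r2 = d[0][k] ** 2
        # x and y follow from the distances to points 0, 1 and 2
        x = (r2 + x1 ** 2 - d[1][k] ** 2) / (2 * x1)
        y = (r2 - d[2][k] ** 2 + d[0][2] ** 2 - 2 * x * x2) / (2 * y2)
        z = math.sqrt(max(r2 - x ** 2 - y ** 2, 0.0))
        if k == 3:
            coords.append((x, y, z))  # the choice of mirror image is arbitrary
        else:
            # Two mirror-image solutions; keep the one that agrees with point 3
            candidates = [(x, y, z), (x, y, -z)]
            coords.append(min(
                candidates,
                key=lambda p: abs(math.dist(p, coords[3]) - d[3][k]),
            ))
    return coords


def distance_matrix(points):
    """All pairwise Euclidean distances between the given points."""
    return [[math.dist(p, q) for q in points] for p in points]


# Toy "molecule" of five atoms (made-up coordinates)
original = [
    (1.0, 2.0, 3.0),
    (2.5, 2.0, 3.4),
    (1.2, 3.3, 2.8),
    (0.5, 2.1, 4.2),
    (2.0, 1.0, 1.5),
]
distances = distance_matrix(original)
rebuilt = build_coordinates(distances)

# The rebuilt points sit in a different frame but reproduce every distance
max_error = max(
    abs(a - b)
    for row_a, row_b in zip(distance_matrix(rebuilt), distances)
    for a, b in zip(row_a, row_b)
)
print(f"largest distance error: {max_error:.2e}")
```

The rebuilt coordinates differ from the originals by a rotation and shift, which is expected: distances alone fix the shape of the molecule, not where it sits in space.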

Technical Summary

Nuclear Magnetic Resonance (NMR) spectroscopy has become a key tool for determining the 3D structure of biomolecules. The two main steps that determine the speed with which biomolecular NMR data can be processed are the extraction and analysis of information from the NMR spectra, and the subsequent 3D structure calculation. The software available to perform these steps is not as well developed as it is in, for example, X-ray crystallography, which limits the application of NMR. This project intends to provide the UK NMR community with the means to execute state-of-the-art 3D structure calculation and validation software, so that the quality and scientific value of structural coordinates from NMR can be improved. The main aim of this project is the creation of a framework in which novel computational methods that require computing resources not typically available in NMR laboratories can be automatically executed via the Grid, using data stored in the data model provided by the Collaborative Computing Project for the NMR community (CCPN). Whilst this project will provide a central calculation facility for small NMR laboratories, it will also enable larger NMR groups with their own compute clusters to install the framework for internal use. In this pump-priming application we will implement automated NOE assignment and NMR structure calculations using the ARIA software package, which uses the Crystallography and NMR System (CNS) for structure calculations. Once a framework for executing calculations on high performance computing facilities and clusters of workstations has been established, this resource will be extended to other software being implemented in the CCPN project (e.g. CLOUDS and Inferential Structure Determination, validation programs like QUEEN, and other software being developed within the EU Extend-NMR project). The development within this project is shared by three groups.
The EDL group at the University of Cambridge has a long history of NMR studies of biomolecules and NMR methods/software development, and is central to this project through its coordination of the CCPN project. The software framework developed by CCPN will be used to handle and validate the NMR and molecule information required for the structure calculations. The Cambridge eScience Centre will play a key role in the project by providing expertise for the implementation of calculations, initially on the High Performance Computing Facility (HPCF) and CamGrid at Cambridge, and later at other locations on the Grid. In particular, they will create a workflow tool that can handle the different steps involved in NMR structure calculations. The Macromolecular Structure Database group at the European Bioinformatics Institute is part of the Worldwide Protein Data Bank (wwPDB), and will provide expertise in handling molecular data for creating topology files for the calculations, and in upgrading the RECOORD database. There are several immediate benefits resulting from this project: 1) a resource will be provided for the NMR community for automated structure calculation and validation using the latest protocols; 2) a tool will be developed to generate topology files for ARIA and CNS, which will be especially useful for scientists working with protein complexes; 3) the workflow tools developed as part of this project at the Cambridge eScience Centre can be used in a wider context (e.g. to set up other projects in computational chemistry); and 4) the RECOORD database, which contains PDB entries that have been recalculated with the latest structure calculation protocols, can be further extended and automatically updated. In the long term, it will be possible to make available all the calculation and validation protocols (e.g. CLOUDS and Inferential Structure Determination) that are being implemented in the CCPN project.
This will provide the NMR community with an invaluable resource to calculate and analyse protein structures.
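
As an illustration of the kind of check that validation of a calculated structure can involve, the sketch below lists the distance restraints that a structure fails to satisfy. It is a simplified example under stated assumptions, not the analysis performed by ARIA, QUEEN or any other program named above: the `DistanceRestraint` class, the atom labels, the coordinates and the 0.5 Å tolerance are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class DistanceRestraint:
    """An upper limit on the distance between two atoms (hypothetical format)."""
    atom_a: str
    atom_b: str
    upper_bound: float  # in angstroms


def find_violations(coords, restraints, tolerance=0.5):
    """Return (restraint, excess) pairs where the distance in the structure
    exceeds the upper bound by more than the tolerance."""
    violations = []
    for restraint in restraints:
        actual = math.dist(coords[restraint.atom_a], coords[restraint.atom_b])
        excess = actual - restraint.upper_bound
        if excess > tolerance:
            violations.append((restraint, excess))
    return violations


# Made-up atom positions from a calculated structure
coords = {
    "1.HA": (0.0, 0.0, 0.0),
    "2.HN": (2.0, 0.0, 0.0),
    "9.HB": (0.0, 6.0, 0.0),
}
restraints = [
    DistanceRestraint("1.HA", "2.HN", 3.0),  # satisfied: actual distance is 2.0
    DistanceRestraint("1.HA", "9.HB", 5.0),  # violated: actual distance is 6.0
    DistanceRestraint("2.HN", "9.HB", 6.0),  # over the bound, but within tolerance
]

violations = find_violations(coords, restraints)
for restraint, excess in violations:
    print(f"{restraint.atom_a} - {restraint.atom_b}: over by {excess:.2f} A")
```

A restraint that is violated in every calculated structure may point to a mistake made when analysing the NMR data, so a report of this kind gives the researcher a concrete place to start looking.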

Publications

Doreleijers JF (2012) CING: an integrated residue-based structure validation program suite. Journal of Biomolecular NMR

Sousa Da Silva AW (2012) ACPYPE - AnteChamber PYthon Parser interfacE. BMC Research Notes

Wassenaar T (2012) WeNMR: Structural Biology on the Grid. Journal of Grid Computing

 
Description CCPNGrid has already delivered a functional web service that can be accessed by the NMR community (http://www.ccpn.ac.uk/ccpn/projects/ccpngrid, login as user 'test', password 'alan123', click on the 'Job status' link at the top of the page, then click on 'confirm' followed by 'Results' for either project). This prototype is now able to automatically generate a set of ARIA input files, run ARIA calculations on CamGrid (the Cambridge campus grid infrastructure), and display the results of these calculations, together with a detailed restraint violation analysis, via a secure web connection. It has also provided the RECOORD project, which recalculates structures from NMR PDB entries using standard protocols, with a framework and scripts to recalculate existing structures automatically.
Exploitation Route The server and the code written to run it have served as a starting point for further CCPN development in the field of semi-automatic software pipeline development, specifically the development of the Workflow Management System as part of the WeNMR collaboration. It has likewise been a starting point for the integration of structure calculation programs in the CcpNmr suite, versions 2 and 3 (under development).
Sectors Chemicals, Healthcare, Pharmaceuticals and Medical Biotechnology

URL http://www.ccpn.ac.uk/software/web-apps-general
 
Description The server has been in regular use as a publicly accessible portal for running NMR structure calculations, mainly using the ARIA program.
First Year Of Impact 2008
Sector Healthcare, Pharmaceuticals and Medical Biotechnology
Impact Types Economic, Policy & public services