Exploiting High Performance Computing to Provide Functional Annotations via CATH-Gene3D

Lead Research Organisation: University College London
Department Name: Structural Molecular Biology

Abstract

Over the last ten years there have been intense efforts to determine the protein compositions of different organisms, including human and other model organisms from all kingdoms of life. Currently more than 1,000 organisms have been completely sequenced and nearly 10 million protein sequences determined. A working draft of the human genome was announced in 2000, and the latest estimates say it contains between 23,000 and 25,000 protein-coding genes. It is difficult, expensive and time-consuming to determine the functional properties of all these proteins, and for many organisms, including human, fewer than 15% of the proteins have been directly experimentally characterised to determine their function. Therefore, a major activity and challenge for bioinformatics groups has been the need to devise computational methods for inferring the functions of proteins. Most predictive methods exploit the premise that proteins in different species are related to each other (homologues) because they have evolved from a common ancestral protein. These homologous proteins frequently share similar functional properties, conserved during evolution. Therefore, many methods search for similarities in the sequences of proteins, indicative of an evolutionary relationship, which then allows functional information to be inherited. In other words, a protein that has been experimentally characterised in fly, for example, can be used to assign functional properties to an evolutionarily related protein identified in human. The main challenge faced by these approaches is that gene duplication occurs in all organisms throughout evolution. Therefore, as well as the original copy of a protein, derived from an ancestral protein, there can be additional copies which may have evolved slightly modified functions to expand the functional repertoire of the organism, thereby enhancing its survival.
We have developed a resource (CATH-Gene3D) which groups proteins into evolutionary families on the basis of similarities in their 3D structures (where available) and their sequences. Currently, more than 2,200 families are classified in CATH-Gene3D, accounting for the majority of protein domain sequences. Some of these families contain very many sequences because the proteins have been highly duplicated in organisms. These families pose a challenge to function prediction methods because the functions of the relatives have frequently diverged. We have designed a new method (GeMMA) which uses a sophisticated approach for comparing sets of evolutionarily related sequences to group them into subfamilies of proteins that are very likely to share functional properties. Whilst GeMMA has been shown to be accurate in transferring functional information between relatives, it can take a long time to run for the very large families in CATH-Gene3D. Therefore, to speed it up, this project will modify the GeMMA protocol so that we can run it on a wide range of publicly available HPC resources. We will also develop highly intuitive web pages to make the information provided by the GeMMA subfamilies very accessible to the biology community. This web site will also allow biologists to submit a query protein of unknown function, which will then be searched against the GeMMA subfamilies to predict a putative function. CATH-Gene3D is already widely used by biologists, and this new functional sub-classification will make the resource even more valuable to these researchers by providing more precise functional annotations for the novel proteins they are studying.

Technical Summary

The major and technically most challenging part of our project is the porting of GeMMA to publicly available HPC facilities so that it can be run for each CATH-Gene3D release. We will extend our web sites and servers to present GeMMA annotations by using methodologies well established for CATH-Gene3D. We will refine the GeMMA protocol so that it can be ported to multiple public HPC facilities. This will involve modifying the current HPC strategy (which uses local compute clusters) to exploit other, much larger, public services such as:
- the UK National Grid Service (NGS)
- the HECToR supercomputer
- the European grid consortium EGEE (Enabling Grids for E-sciencE)
- the BlueGene facility at Argonne National Laboratory, US
In addition, we will use paid infrastructure-on-demand services such as the Amazon EC2 compute cloud and the corresponding Amazon S3 storage service. While porting the current GeMMA HPC implementation to the systems listed above should be relatively straightforward, the Amazon services will require substantial changes to the protocol. Amazon virtual machines can be 'rented' for weeks or months and used either in a cluster-like scenario resembling the current HPC implementation (e.g. via Sun SGE's new 'cloud adapter' software) or in a purely parallel way, e.g. each running one large superfamily at a time. In either case, scripts have to schedule and monitor the individual processing tasks. We will develop a pipeline which allows us to run GeMMA once or twice a year, i.e. with each release of CATH-Gene3D. Over the last two years, CATH-Gene3D has doubled the number of sequences classified, to ~5 million distinct protein sequences coming from a number of sequence repositories. However, international sequencing efforts, particularly the JGI's GEBA genomes project and the large metagenome initiatives, will lead to even greater expansions of the classification.
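The "purely parallel" scenario, in which each worker handles one superfamily at a time while a script schedules and monitors the tasks, can be illustrated with a minimal Python sketch. It is not the project's pipeline: `cluster_superfamily` is a hypothetical stand-in for launching one GeMMA run, the superfamily names and sizes are invented, and a local thread pool stands in for rented machines.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def cluster_superfamily(name, n_sequences):
    """Hypothetical stand-in for one GeMMA run on one superfamily."""
    # A real task would launch the clustering job on a remote machine.
    return f"{name}: clustered {n_sequences} sequences"


def run_all(superfamilies, max_workers=4):
    """Run one task per superfamily, largest first, and record each outcome."""
    # Submitting the largest families first helps keep workers busy at the end.
    ordered = sorted(superfamilies.items(), key=lambda kv: kv[1], reverse=True)
    status = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(cluster_superfamily, name, size): name
            for name, size in ordered
        }
        # Monitor tasks as they finish so a failure is recorded, not lost.
        for future in as_completed(futures):
            name = futures[future]
            try:
                status[name] = ("done", future.result())
            except Exception as exc:
                status[name] = ("failed", str(exc))
    return status


if __name__ == "__main__":
    # Invented example input: superfamily label -> number of sequences.
    for name, outcome in run_all({"sf_a": 10, "sf_b": 500}).items():
        print(name, outcome)
```

Recording a per-task status rather than stopping at the first error reflects the need to monitor many long-running jobs; a real deployment would also need retries and persistent storage of results.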

Planned Impact

Communications and Engagement
The modified GeMMA protocol will allow us to provide more accurate functional annotations for all the major protein domain superfamilies in nature. We will disseminate this information relying on our extensive resource and service design expertise:
- Extend the CATH-Gene3D website with new subfamily pages and a subfamily assignment server. Users will be able to submit query sequences for subfamily assignment and investigate functional annotations. The complete GeMMA profile library will also be available for download.
- Distribute GeMMA annotations via the InterPro web site. CATH-Gene3D is one of the InterPro member databases and we regularly provide superfamily HMMs to InterPro, forming an important part of this renowned annotation meta-server. The GeMMA subfamily profiles will become part of this package. InterPro receives nearly 5 million web page accesses per month. CATH-Gene3D annotations are also hosted on the CARGO website for cancer mutations (CNIO, Madrid) and the e-pipe website for splice variants (TU Denmark, Lyngby).
- Provide the annotations through web services. We already supply annotations for CATH-Gene3D superfamilies via the DAS (Distributed Annotations Services) Registry at the EBI (http://www.dasregistry.org/). The GeMMA functional annotations will be made available through DAS, the EMBRACE registry and BioCatalogue (http://www.biocatalogue.org/).
The CATH-Gene3D website receives 1 million web hits per month (excluding search engine robots), corresponding to 372,104 page impressions per month from 8,444 unique hosts. CATH-Gene3D is widely used in teaching undergraduate and postgraduate students because of the intuitive presentation of the data. Many other highly accessed sites (e.g. InterPro, PDB, Pfam, PSI-Knowledgebase, PDBsum) provide links to CATH-Gene3D. We will further publicise the new subclassification in workshops, e.g. within IMPACT and ENFIN.
Collaboration
We are involved in several collaborations with experimental groups who will benefit directly from the GeMMA classification:
- Protein Structure Initiative (PSI): This is a large initiative funded by the NIH in the US. We are members of the Midwest Consortium for Structural Genomics (MCSG), one of the four major centres involved in PSI. MCSG includes 8 groups comprising more than 50 structural biologists (with >200 for PSI as a whole), and the data is accessed and exploited by many other scientists involved in similar initiatives. We are currently using GeMMA to identify subfamilies within very highly populated and functionally diverse superfamilies as targets for structure determination. The ultimate aim is to represent each of these subfamilies, i.e. each function, by at least one solved structure.
- London Pain Consortium: This Wellcome funded network includes 7 experimental groups studying neuropathic pain. GeMMA will be used to functionally characterise genes identified by proteomics and microarray studies as being associated with signalling pathways involved in pain.
- EU ENFIN Network of Excellence for Systems Biology: This network pairs computational groups with experimental groups. We are collaborating with several experimental groups working on angiogenesis, mitotic spindle and the PLK1 and LKB1 signalling pathways implicated in cancer. As above, GeMMA functional annotations will be used to characterise genes identified by microarray and proteomics studies.
- Collaborations with Metagenomics Initiatives: As a member of the DOE funded Centre for Structural Genomics in Infectious Diseases (CSGID) we collaborate with groups analysing metagenome sequences at the J. Craig Venter Institute (JCVI), whose annotators will exploit the GeMMA profiles. The functional repertoire of metagenomic datasets can reveal targets for structure determination, e.g. structural features of subfamilies highly expressed in enterobacterial pathogens could guide drug development.

Publications

Description Less than 10% of known protein sequences have detailed experimental characterisation, even in important model organisms like human, fly and mouse. Therefore computational methods which exploit the conservation of protein functions in evolutionary families of proteins are valuable for suggesting putative functions for uncharacterised proteins, which can then be tested experimentally. Although 2,700 domain structure families have been identified in our in-house CATH-Gene3D resource, some of these are very diverse in their structures and functions. Whilst these diverse families account for less than 10% of families, they comprise nearly 70% of all known sequences. Therefore, it is important to know how function has diverged within these families in order to improve the inheritance of functional properties to uncharacterised relatives assigned to them.

This project developed a novel algorithm (DFX) for protein function prediction based on functional sub-classification of superfamilies in the CATH database. The method involved a computationally expensive clustering protocol that grouped relatives according to similarities in their sequences. In order to run this method on all the large, diverse families, a new computational platform was developed that allowed the method to be run on the Amazon Cloud and on a large compute farm (Legion) at UCL. The method was independently validated in an international function prediction competition and came 7th out of 56 methods for prediction of molecular function. It was also found to perform well in predicting residue sites likely to be involved in the function of the protein.
Exploitation Route The DFX method has improved the functional classification of relatives in CATH and will be used for regularly updating this data. CATH functional information is widely used by the biological community - the CATH resource receives more than 2 million web page accesses per month from more than 10,000 unique visitors.

CATH is a member database in the widely accessed InterPro resource hosted at the EBI, which has more than 5 million webpage accesses per month. Sequence profiles generated by DFX are being used by InterPro to provide more accurate functional annotations for protein domain regions.

Functional families derived by DFX can be used for assigning functions to metagenome sequences and will therefore be valuable for researchers interpreting this data.
Sectors Digital/Communication/Information Technologies (including Software),Pharmaceuticals and Medical Biotechnology

URL http://www.cathdb.info
 
Description Functional families derived by DFX have been used by two structural genomics consortia in the States to select uncharacterised relatives from protein domain families for structure determination:
- Midwest Centre for Structural Genomics, to target important functional subfamilies found in bacterial organisms in the human gut microbiome
- Centre for Structural Genomics in Infectious Diseases, to target functional sub-families likely to be associated with virulence or drug resistance in pathogenic bacteria.
DFX functional families have also been used to predict which proteins are likely to be interacting/associating in signalling networks implicated in neuropathic pain; this was performed by inheriting protein interactions across functional families.
First Year Of Impact 2012
Sector Digital/Communication/Information Technologies (including Software),Pharmaceuticals and Medical Biotechnology
Impact Types Economic
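
The idea of inheriting protein interactions across functional families, mentioned in the description above, can be illustrated with a minimal Python sketch. It is an assumption about the general shape of such a transfer, not the procedure used in the project: the function name, protein identifiers and family labels are all hypothetical, and a real analysis would weigh the evidence for each transferred interaction.

```python
from itertools import product


def inherit_interactions(known, family_of):
    """Predict interactions between proteins whose families hold a known interacting pair."""
    # Group proteins by their (hypothetical) functional family label.
    members = {}
    for protein, family in family_of.items():
        members.setdefault(family, set()).add(protein)

    known_pairs = {frozenset(pair) for pair in known}
    predicted = set()
    for a, b in known:
        # Every relative of a is paired with every relative of b.
        for x, y in product(members[family_of[a]], members[family_of[b]]):
            pair = frozenset((x, y))
            if x != y and pair not in known_pairs:
                predicted.add(pair)
    return predicted


if __name__ == "__main__":
    # Invented example: P1 and P2 interact; Q1 and Q2 are uncharacterised relatives.
    family_of = {"P1": "F1", "Q1": "F1", "P2": "F2", "Q2": "F2"}
    for pair in sorted(sorted(p) for p in inherit_interactions([("P1", "P2")], family_of)):
        print(pair)
```

Restricting the transfer to functional families rather than whole superfamilies is the point of the approach: relatives in the same functional family are more likely to share interaction partners than distant relatives whose functions have diverged.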

 
Title GeMMA 
Description GeMMA (Genome Modelling and Model Annotation) is an approach to automatic functional subfamily classification within families and superfamilies of protein sequences. It is a profile-based agglomerative clustering algorithm that uses COMPASS to compare the profiles derived from the multiple sequence alignments (MSAs) of the clusters present at each stage of the clustering. At each iteration, cluster profiles that match above a threshold are merged and profiles are generated for the new clusters. The iterations continue until a single cluster remains, giving a hierarchical clustering tree built from the leaf nodes to the root.
Type Of Material Improvements to research infrastructure 
Provided To Others? No  
Impact The GeMMA clustering program allows subclassification of the CATH-Gene3D superfamilies into smaller groups of sequence relatives that are functionally and structurally related to each other, which is of great importance in understanding the structure-function relationships in a superfamily. The use of GeMMA to increase the functional annotation coverage of functionally diverse Pfam families has been demonstrated. GeMMA clusters can also help to predict the impact of experimentally determining a protein domain structure on comparative protein modelling coverage, in the context of structural genomics.
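
The agglomerative scheme described for GeMMA can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions, not the GeMMA implementation: COMPASS profile-profile comparison is replaced by a toy score (mean pairwise identity between pre-aligned, equal-length sequences), only the single best-scoring pair is merged per iteration, and merging stops once no pair reaches the threshold. The function names and example sequences are hypothetical.

```python
from itertools import combinations


def identity(a, b):
    """Fraction of identical positions between two equal-length aligned sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_score(c1, c2):
    """Toy stand-in for a profile-profile score: mean pairwise identity."""
    return sum(identity(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))


def agglomerate(sequences, threshold):
    """Merge the best-scoring cluster pair until none scores at or above threshold."""
    clusters = [(s,) for s in sequences]  # each leaf starts as its own cluster
    merges = []                           # merge history, leaves towards root
    while len(clusters) > 1:
        score, i, j = max(
            (cluster_score(clusters[i], clusters[j]), i, j)
            for i, j in combinations(range(len(clusters)), 2)
        )
        if score < threshold:
            break
        merges.append((clusters[i], clusters[j], score))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters, merges


if __name__ == "__main__":
    # Invented toy sequences, assumed to be already aligned.
    clusters, merges = agglomerate(["AAAA", "AAAT", "GGGG", "GGGC"], threshold=0.5)
    print(clusters)
    print(len(merges))
```

With a threshold of 0.0 the loop runs until a single cluster remains, so the recorded merges trace a full tree; a higher threshold stops earlier and leaves several clusters, which is the sense in which a cut of the tree yields candidate subfamilies.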