CATH-FunL: Improving Gene Target Selection by Predicting Functional Modules in Biological Systems

Lead Research Organisation: University College London
Department Name: Structural Molecular Biology

Abstract

In the past decades, a marked increase in data availability has revolutionized the study of biology. Advances in experimental techniques mean that we now have an abundance of information about the genes and proteins in our cells and their interactions. This unprecedented volume of data presents a challenge for biologists: how to best combine and exploit different data sources to gain meaningful biological insights.

CATH-FunL is a tool designed to address this problem. FunL will allow users to predict novel proteins ('targets') likely to be associated with a set of proteins they are interested in - for example, known components in a protein signalling pathway. CATH-FunL will also allow users to gain further insight into these predicted targets by organizing and annotating this list of predicted genes. Finally, CATH-FunL will provide intuitive visualizations of the predicted targets and the functional relations between them.

CATH-FunL's prediction methods are based on the well-documented concept of guilt-by-association. Much of the data produced by modern experimental techniques can be used to infer whether proteins participate in the same biological process - that is, whether they are functionally associated. Evidence for functional association includes physical binding between proteins, correlation in expression patterns and numerous other, more indirect indicators. Guilt-by-association methods represent this information as a network of functional associations between proteins and attempt to use the structure of the network to predict new associations.

The simplest methods make predictions based only on the direct network neighbours of a protein. This, however, ignores the rich information present in the overall topology of the network: for example, groups of proteins relating to the same function are known to form densely connected clusters within the network, with fewer connections to other proteins. FunL aims to exploit this type of structure using a powerful and well-studied approach known as graph kernels.
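To make the contrast concrete, here is a minimal sketch comparing direct-neighbour scoring with one simple graph kernel, the von Neumann diffusion kernel K = sum over k of alpha^k A^k. The abstract does not say which kernel FunL uses, so the choice of kernel, the toy network, the protein names and the value of alpha are all assumptions made for illustration.

```python
# Toy comparison of direct-neighbour scoring with a diffusion-style graph kernel.
# The network, protein names and alpha are hypothetical illustration values.

EDGES = [("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"),
         ("C", "X"), ("D", "X"), ("X", "Y"), ("Y", "Z")]
SEEDS = {"A", "B"}  # proteins already known to belong to the process


def build_adjacency(edges):
    """Return an undirected adjacency map: node -> set of neighbours."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def direct_neighbour_scores(adjacency, seeds):
    """Score each non-seed protein by how many seeds it touches directly."""
    return {node: len(nbrs & seeds)
            for node, nbrs in adjacency.items() if node not in seeds}


def von_neumann_scores(adjacency, seeds, alpha=0.2, n_terms=50):
    """Sum the kernel entries K[node, seed] over all seeds, where
    K = sum_k alpha^k A^k, truncated after n_terms powers of A.
    The series converges when alpha < 1 / (largest eigenvalue of A);
    alpha * max_degree < 1 is a sufficient condition for that."""
    max_degree = max(len(nbrs) for nbrs in adjacency.values())
    if alpha * max_degree >= 1:
        raise ValueError("alpha too large for the series to converge")
    term = {node: (1.0 if node in seeds else 0.0) for node in adjacency}
    total = dict(term)
    for _ in range(n_terms):
        # Multiply the current term by alpha * A (walks one step longer).
        term = {node: alpha * sum(term[nbr] for nbr in adjacency[node])
                for node in adjacency}
        for node in adjacency:
            total[node] += term[node]
    return {node: score for node, score in total.items() if node not in seeds}


adjacency = build_adjacency(EDGES)
direct = direct_neighbour_scores(adjacency, SEEDS)
kernel = von_neumann_scores(adjacency, SEEDS)
```

In this toy network the direct-neighbour score cannot separate X, Y and Z, since none of them touches a seed, whereas the kernel score falls off with distance from the seed cluster and rewards X for being reachable through several short paths.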

CATH-FunL will integrate a large volume of protein interaction/association information from several public repositories and our own in-house tools for protein association prediction. These data will be represented as networks, combined, and then transformed using kernel-based methods into a ranked list of potential targets, given a set of query proteins and a set of known proteins provided by the user. Query proteins will be ranked by the strength of their association to known proteins.
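The integration and ranking steps can be sketched as follows. This is an illustration under stated assumptions, not the FunL pipeline: the source names, scores and source weights are invented, and the kernel transformation is omitted, so the combined edge weights stand in for kernel similarities.

```python
# Hypothetical per-source association scores between protein pairs (0 to 1).
SOURCES = {
    "physical_binding": {("K1", "Q1"): 0.9, ("K2", "Q1"): 0.7, ("K1", "Q2"): 0.2},
    "coexpression": {("K1", "Q1"): 0.6, ("K2", "Q2"): 0.4, ("K2", "Q3"): 0.1},
}
# Assumed relative trust in each source; not taken from the project.
SOURCE_WEIGHTS = {"physical_binding": 0.6, "coexpression": 0.4}


def combine_networks(sources, weights):
    """Merge per-source networks into one by a weighted sum of edge scores."""
    combined = {}
    for name, edges in sources.items():
        for (u, v), score in edges.items():
            key = frozenset((u, v))  # undirected edge
            combined[key] = combined.get(key, 0.0) + weights[name] * score
    return combined


def rank_queries(combined, query, known):
    """Rank query proteins by total association strength to the known set."""
    scores = {
        q: sum(combined.get(frozenset((q, k)), 0.0) for k in known)
        for q in query
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


combined = combine_networks(SOURCES, SOURCE_WEIGHTS)
ranking = rank_queries(combined, query=["Q1", "Q2", "Q3"], known=["K1", "K2"])
```

A weighted sum is only one way to combine sources; it keeps the example short and makes the contribution of each source easy to trace.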

FunL will provide further insight into the target proteins by providing information about their function. Functional annotation is often performed using terms from the Gene Ontology (GO). However, on average, <10% of genes in an organism have been experimentally characterised, so GO annotations can be sparse or unreliable for many proteins. We will therefore supplement experimental GO annotations with predicted annotations from state-of-the-art, in-house, sequence-based prediction methods.

Once the target list has been computed, CATH-FunL will organise the list into functionally coherent sub-groups. This will allow users to detect potential patterns in the predicted targets and to focus on particular biological processes of interest to them. Because much of the computational work involved in this clustering will already be done by FunL at the query stage, this provides a very efficient way of classifying the target list proteins.
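As a rough illustration of reusing similarities that were already computed at the query stage, the sketch below groups targets by linking any pair whose similarity meets a cutoff and taking the connected components. This single-linkage stand-in is an assumption; the abstract does not specify the clustering method, and the similarity values, target names and cutoff are hypothetical.

```python
# Hypothetical pairwise similarities between predicted targets, assumed to
# have been computed already when the targets were ranked.
SIMILARITY = {
    ("T1", "T2"): 0.8, ("T2", "T3"): 0.7, ("T1", "T3"): 0.6,
    ("T4", "T5"): 0.9, ("T3", "T4"): 0.1,
}


def cluster_targets(targets, similarity, cutoff=0.5):
    """Group targets into connected components of the graph that links
    every pair with similarity >= cutoff (single-linkage at a fixed cutoff)."""
    parent = {t: t for t in targets}

    def find(t):
        # Follow parent links to the group representative, compressing the path.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for (u, v), score in similarity.items():
        if u in parent and v in parent and score >= cutoff:
            parent[find(u)] = find(v)  # merge the two groups

    groups = {}
    for t in targets:
        groups.setdefault(find(t), []).append(t)
    return sorted(sorted(members) for members in groups.values())


groups = cluster_targets(["T1", "T2", "T3", "T4", "T5", "T6"], SIMILARITY)
```

Because the similarities are looked up rather than recomputed, the grouping step costs little beyond one pass over the stored pairs, which is the efficiency argument made above.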

Finally, FunL will visualise the results in an intuitive way. We will use network-based visualisations and also explore more innovative approaches related to the kernel-based methods.

In summary, CATH-FunL will allow users to combine their own datasets of experimentally analysed genes with information from heterogeneous publicly available repositories and our in-house functional annotation datasets to gain valuable functional insights into biological processes they are interested in.

Technical Summary

CATH-FunL will be a new tool in the CATH-Gene3D resource for prioritising proteins in a large query set generated by a high-throughput experiment. It will allow biologists to identify a subset of genes likely to be associated with a biological system of interest, for more detailed experimental characterisation. Users will specify a biological process, eg by providing a set of proteins or GO terms known to be associated with the process. The novel feature of FunL will be its ability to identify network modules within this prioritised list that are enriched in the known proteins or the relevant GO terms.
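One common way to quantify whether a module is "enriched" in known proteins or in a GO term is a one-sided hypergeometric test. The summary does not state which statistic FunL uses, so the sketch below is an assumption, and the function name and the counts in the example are hypothetical.

```python
from math import comb


def enrichment_p_value(population, annotated, module_size, module_annotated):
    """One-sided hypergeometric p-value: the probability of seeing at least
    `module_annotated` annotated proteins in a module of `module_size`
    proteins drawn without replacement from `population` proteins, of which
    `annotated` carry the annotation of interest."""
    upper = min(annotated, module_size)
    total = comb(population, module_size)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, module_size - k)
        for k in range(module_annotated, upper + 1)
    )
    return tail / total


# Hypothetical example: 20 ranked proteins, 5 annotated with the GO term,
# and a module of 4 proteins containing 3 of the annotated ones.
p_value = enrichment_p_value(population=20, annotated=5,
                             module_size=4, module_annotated=3)
```

In practice many modules and terms are tested at once, so a multiple-testing correction would normally be applied to p-values obtained this way.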

CATH-FunL will use protein interaction/association data for ten model organisms (including human) from a range of public sources (eg iRefIndex). Data from each source will be transformed into similarity matrices using our well-established, kernel-based approaches. These matrices will then be combined and transformed into a final matrix, and a new in-house method, COMPASS, will be applied to give a prioritised list of genes.

We already have a small pilot FunL platform, built to handle small query datasets (<100). This will be significantly improved by including the more powerful COMPASS method for ranking the proteins and another novel method for identifying enriched network modules in the ranked list, to better prioritise the proteins. COMPASS applies partial least squares regression to prioritise targets more effectively. Functional enrichment analysis will be enabled by annotating proteins in the network with our in-house CATH-Gene3D functional family data.
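To show what partial least squares regression does in a prioritisation setting, here is a minimal single-component PLS1 sketch in plain Python. It is not COMPASS: the feature layout (each protein described by its similarity to two pathway members), the training labels and all numbers are hypothetical, and a real implementation would use several components and kernel-derived features.

```python
from math import sqrt


def fit_pls1(X, y):
    """Fit a one-component PLS1 model. Returns (x_mean, y_mean, w, q) where
    w is the unit weight vector maximising covariance between X w and y,
    and q is the regression coefficient of y on the latent score t = X w."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]  # centred X
    yc = [value - y_mean for value in y]                         # centred y
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = sqrt(sum(v * v for v in w))
    if norm == 0:
        raise ValueError("features carry no covariance with the labels")
    w = [v / norm for v in w]
    t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]
    q = sum(t[i] * yc[i] for i in range(n)) / sum(v * v for v in t)
    return x_mean, y_mean, w, q


def predict_pls1(model, x):
    """Predict the label score for one feature vector."""
    x_mean, y_mean, w, q = model
    return y_mean + q * sum((x[j] - x_mean[j]) * w[j] for j in range(len(w)))


# Hypothetical training data: similarity of each protein to two pathway
# members; label 1 = known pathway protein, 0 = known non-member.
X_train = [[0.9, 0.8], [0.8, 0.9], [0.1, 0.2], [0.2, 0.1]]
y_train = [1, 1, 0, 0]
model = fit_pls1(X_train, y_train)

candidates = {"cand_high": [0.7, 0.6], "cand_low": [0.1, 0.3]}
ranked = sorted(candidates, key=lambda name: predict_pls1(model, candidates[name]),
                reverse=True)
```

The appeal of PLS here is that it projects many correlated similarity features onto a few latent directions chosen for their covariance with the labels, which suits kernel matrices whose columns are far from independent.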

CATH-FunL will be re-engineered to be robust to multiple queries from groups submitting large datasets. This will be done by pre-calculating the underlying matrices and protein functional annotation data on the UCL 5500-node compute farm (Legion) and on the Cloud. We will also explore porting user queries to the Cloud and running the whole project externally on Google Cloud (https://cloud.google.com).

Planned Impact

Who will benefit from the research

As described already, FunL will address BBSRC strategic areas by aiding experimental groups involved in high-throughput studies, eg those generating next generation sequencing and proteomics data. We already work with a number of such groups, eg on ageing, pain, fly development and cancer, who would be willing to continue testing the CATH-FunL tool for us.

Apart from experimental groups involved in high-throughput 'functional genomics' style studies, groups involved in high-throughput structural biology will also find the tool valuable. For example, we collaborate with two large structural genomics consortia who use CATH-Gene3D functional annotations to guide selection of suitable targets in metagenomics studies, a priority area for the BBSRC. Structural genomics groups such as these, and structural biologists more generally, will benefit from FunL when selecting new targets for structure determination. Perhaps more valuably, they will also be able to use FunL to suggest possible interactors for proteins they are interested in. Knowing the interactors for a protein target can considerably aid solubilisation and purification of proteins during the crystallisation process, especially where these proteins form stable complexes with the target protein.

Another potentially large group of beneficiaries is researchers in industry. There is growing interest in industry in exploiting protein networks to aid target selection. For example, identification of network modules enriched in highly expressed genes that correlate with a particular phenotype can suggest suitable targets for drug design. Here, the links between FunL and the CATH-Gene3D superfamilies will be particularly valuable, as researchers will be able to identify the domain constituents in an enriched module (CATH domain IDs will be reported alongside the GO functions of node proteins). This will help in determining whether a poly-pharmacological strategy could be employed, eg where a weakly binding drug increases in efficacy because it targets multiple copies of a particular CATH domain within a protein network module.

In this context, development of the new CATH-FunL tool will benefit from a collaboration between the Orengo group and computational researchers at GlaxoSmithKline (GSK) on a project exploring how drug poly-pharmacology can be enhanced by targeting specific domains within protein network modules. This project has EU funding which supports a Marie Curie Fellow, Dr Aurelio Garcia, within the Orengo group for the next two years.

All the data generated by CATH-FunL (ie predicted protein interactions/associations, similarity matrices produced by the kernel-based analysis of the graph network topology, and GO functional annotations of proteins in the model organisms used by FunL) will be freely available to all users to download from the CATH-Gene3D site.

The PDRA who will be working on the project, Sonja Lehtinen, is already experienced in protein network generation and analysis, having worked as a PhD student in the Orengo group for the last 3 years. She developed the powerful COMPASS tool, which is competitive with, and in some cases outperforms, the widely used GeneMANIA algorithm, which also exploits protein network topology for target prioritisation. This one-year project will give Sonja the opportunity to extend her network analysis skills by developing a novel clustering approach to detect modules in networks. It will also give her experience of using a large compute farm and Google Cloud, and of web-page construction. All these skills are likely to be valuable when seeking future academic or industry-based posts, as there is a shortage of skilled researchers in this area and a significant demand for this expertise to analyse large-scale functional genomics data, such as next generation sequencing and proteomics data.
 
Description: The FunL resource exploits protein interaction networks to prioritise genes that appear to be associated with genes in a query set.

To increase the power of FunL over the last year we have:

1. Extended the predicted protein interaction data in CATH-Gene3D by expanding Gene3D with new domain sequences which provide greater links to experimental data on protein interactions.

2. Imported further information on protein interactions (known and predicted) from public sources (eg iRefIndex, STRING).

3. Extended the networks by using text mining data.

4. Improved our kernel-based COMPASS method, which uses partial least squares regression to predict associations between proteins by exploiting network topology and provides a ranked list of genes likely to be associated with a given query gene set. COMPASS now outperforms the widely used GeneMANIA method on our benchmarks. We also explicitly explored problems associated with the non-independence of functional association data and test data. The latest implementation of COMPASS was published in PLoS One in 2015.

5. Improved the FunL website by adding functionality and more intuitive web pages, expanded it to include model organisms and made it publicly available. The updated FunL was published in NAR in 2015.

6. Assessed various options for scalability and performance and found that the UCL ChuCKLE computer cluster (with ~5000 compute nodes) provided the best solution. We have moved the Oracle database that helps power FunL and COMPASS onto a virtual machine on the cluster. This allows computing power and storage to be dynamically allocated as needed and will facilitate the maintenance and development of these resources as network datasets continue to grow. We will be moving our server for these resources onto a similar virtual machine setup.

7. Exploited the new network data generated by the project to assist in the analysis of the biogenesis of epithelial junctions with collaborators at Imperial College. This work led to a publication in Nature Communications on which the PDRA was an author.

Exploitation Route: Fun-L is a tool that will allow biologists to identify a subset of genes likely to be associated with a biological system of interest. It has been engineered to allow many queries to be run at the same time, even with large datasets, and can help direct biologists towards a shortlist of candidate genes for detailed experimental characterisation.
Sectors: Education; Manufacturing, including Industrial Biotechnology; Other

URL: http://funl.org/
 
Description: The FunL website is linked to the CATH-Gene3D resource, which is widely used by companies in the pharmaceutical and biotech industries.
First Year Of Impact: 2015
Sector: Pharmaceuticals and Medical Biotechnology
Impact Types: Economic

 
Title: Fun-L
Description: Please note that this tool is still being developed and improved. Fun-L (Functional Lists) is a target prioritisation tool for experimentalists. Given a set of query genes known to be involved in a pathway of interest, Fun-L ranks the remainder of the genome by likelihood of shared pathway membership with the query set. Testing the candidates near the top of the ranking improves the success rate of subsequent experiments.
Type Of Material: Improvements to research infrastructure
Year Produced: 2014
Provided To Others? Yes
Impact: The predictions from Fun-L have been validated with independent RNAi screens to confirm that the lists produced by Fun-L are enriched in genes with the expected phenotypes.
URL: http://funl.org
 
Title: PainNetworks
Description: PainNetworks supports network analysis of diseases, with a focus on different types of pain. It also aims to facilitate drug discovery by combining experimental, network and drug binding data in one resource.
Type Of Material: Improvements to research infrastructure
Year Produced: 2014
Provided To Others? Yes
Impact: PainNetworks has been used to support publications from other groups relating to pain.
URL: http://painnetworks.org/
 
Description: Computational Biology conference in July 2017 (Prague, Czech Republic)
Form Of Engagement Activity: Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach: International
Primary Audience: Professional Practitioners
Results and Impact: Intelligent Systems for Molecular Biology (ISMB) is an annual academic conference on bioinformatics and computational biology organised by the International Society for Computational Biology (ISCB). In July 2017, ISMB/ECCB was held in Prague. The principal focus of the conference is the development and application of advanced computational methods for biological problems. Talks and posters were presented during various sessions at this conference. Christine Orengo gave a talk on computational analyses exploiting CATH-Gene3D and Genome3D data.
Year(s) Of Engagement Activity: 2017
URL: https://www.iscb.org/ismbeccb2017