A GPU-based high performance system for discovering consensus domain architecture and functional annotation of protein families

Lead Research Organisation: Royal Holloway University of London
Department Name: Computer Science

Abstract

The list of organisms with completed genome sequence is continuously growing and this has led to the identification of thousands of genes whose function is still unknown. These genes could potentially be involved in important biological cell functions and could represent important targets for diagnostic and pharmacogenomics studies and be of industrial and agronomical importance. A major undertaking for biology is therefore that of identifying the function of these uncharacterized genes on a genomic scale. The challenge for bioinformatics is then to develop algorithms that, given a gene, can predict a hypothesis for its function.

Comparisons of sequences from complete genomes have revealed that gene duplication, divergence and rearrangement are predominant mechanisms that drive the expansion of the set of proteins of a given organism during evolution. This means that proteins can be grouped into families, where members are likely to perform similar functions. The identification of these protein families is therefore central as it can provide important clues for the function of proteins.

Proteins are often composed of several domains. A domain is a segment of protein sequence that can evolve independently of the rest of the protein chain. Each domain forms a compact three-dimensional structure and it can appear in a variety of different proteins. Protein function depends on the mutual interplay between the distinct domains and the links between them. In other words, protein function depends on the domain architecture of the protein.

Therefore we would like to have a tool that can group proteins into families according to their architecture: all proteins with the same architecture should belong to the same group. The development of such a tool is exactly the goal of this project. Moreover the tool that we plan here will also be able to suggest possible functional roles for the various architectures.
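
The grouping idea described above can be illustrated with a minimal Python sketch. It assumes that domain assignments are already available for each protein and that an architecture is simply the ordered tuple of domain names; the function name, protein IDs and domain names are hypothetical and are not taken from the project's software.

```python
from collections import defaultdict


def group_by_architecture(domains_by_protein):
    """Group protein IDs by their ordered tuple of domain names."""
    families = defaultdict(list)
    for protein_id, domains in domains_by_protein.items():
        # The tuple keeps domain order, so "kinase, SH2" differs from "SH2, kinase".
        families[tuple(domains)].append(protein_id)
    return dict(families)


# Hypothetical toy input: protein ID -> domains listed from N- to C-terminus.
example = {
    "P1": ["kinase", "SH2"],
    "P2": ["kinase", "SH2"],
    "P3": ["SH2", "kinase"],
    "P4": ["kinase"],
}
families = group_by_architecture(example)
```

In this toy input P1 and P2 fall into the same family, while P3 and P4 each form a family of their own.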

Our tool is aimed at working on very large sets of proteins. The amount of calculations for problems of this size is only feasible by taking advantage of the latest advances in graphical processing unit (GPU) technology. Modern GPUs are very efficient for graphics but their highly parallel structure makes them extremely effective for algorithms where processing of large blocks of data is done in parallel - even more effective than general-purpose CPUs.
The use of GPU technology will allow us to create a web application that scientists can use to obtain the architectures for very large sets of proteins, together with possible functional roles for the various architectures. Importantly, we shall periodically run our system on the major genomes available and will thus be able to provide, through our web server, architectures and related annotation for all the proteins in those genomes. All these web services will be made freely available to the scientific community.

Technical Summary

Proteins can be grouped into families, where members are likely to perform similar functions. The identification of these protein families is important as it can provide important clues for the function of proteins. Proteins are often composed of more than one domain (basic structural or functional units of evolution) and protein function depends on the mutual interplay between the distinct domains and the links between them. In other words, protein function depends on the domain architecture of the protein.

The preliminary work for this project is constituted by GFam, a system that we have recently developed which is able to group proteins into families where proteins share common domain architecture. GFam has been applied to sets of about 30k proteins.

The current proposal is aimed at creating tools for providing consensus signature architectures for very large sequence datasets. To do this we shall develop a new high performance implementation of the GFam pipeline. This will run in parallel on multi-processor servers and multiple GPGPUs.
Moreover we shall develop a web application that will provide a display of the architectures through a user-friendly web interface. The interface will allow users to retrieve proteins with the same architecture in either the same or in different organisms. Importantly, it will also provide sets of functional annotation terms associated with each of the different consensus signature architectures.
The system will be run periodically on all complete genome projects and will provide protein family architectures and their functional annotation for all the proteins in those genomes.

Planned Impact

This project will benefit biologists interested in protein function, both experimental and computational scientists from academia and industry. The results will impact any biologist interested in understanding how organisms and life processes arise through natural selection mechanisms acting on the protein repertoire encoded by the genome of the organism.

Our method will make a serious impact towards understanding the large proportion of uncharacterized genes and proteins as genome sequencing efforts have left us with near-complete knowledge of hundreds of full genomic sequences, but without a comparably exhaustive inventory of what all these genes and proteins do; in many cases we have no clues to the function of these genes. This is critical as obtaining even some clue on the function of the 40% of functionally uncharacterized proteins in model organism genomes can have significant impact in a broad variety of areas e.g. drug, antibody and vaccine design, agronomic trait improvement, biochemical engineering, protein design and even nanotechnology.

The key impact of our research is that our web portal seeks to serve as a one-stop source on proteins, their organization in the form of domains and their myriad biological, biochemical and cellular functions. Given that our web portal will be an integrated resource, it has the potential to be a great teaching resource for undergraduate and graduate courses in genome biology, protein science and evolutionary biology.

Since this project aims to use the latest computing advancements in hardware technology through the use of Graphics Processor Units, the staff working on this project will take with them transferable skills that are much sought after in academia and industry. For staff interested in continuing in academia, the dissemination of the research through peer-reviewed journal articles will help attain their career goals in the form of faculty or postdoc positions.

Publications

 
Description With the advent of faster and cheaper sequencing technologies the availability of genomic sequences has grown in an exponential manner. Knowing the sequences of an organism is a necessary step for their understanding at a molecular level: certain sequences may represent important targets for diagnostic and pharmacogenomics studies and be of industrial and agronomical importance. However, the process of obtaining these genetic sequences needs to be followed by the understanding of the role (function) played by each one of these genes or their corresponding proteins. A fundamental goal is therefore to identify the function of uncharacterized genes on a genomic scale. It is difficult to design functional assays for uncharacterized genes so a major current challenge in bioinformatics is to devise algorithmic methods that, given a gene, can predict a hypothesis for its function that can then be validated experimentally.

Comparisons of sequences from complete genomes have revealed that gene duplication, divergence and rearrangement are predominant mechanisms that drive the expansion of the set of proteins of a given organism during evolution. This means that proteins can be grouped into families, where members are likely to perform similar functions. The identification of these protein families is therefore central as it can provide important clues for the function of proteins. Besides, proteins are often composed of several domains. A domain is a segment of protein sequence that can evolve independently of the rest of the protein chain. Each domain forms a compact three-dimensional structure and it can appear in a variety of different proteins. Protein function depends on the mutual interplay between the distinct domains and the links between them (the domain architecture of the protein).

We have produced a computational method that can group proteins into families according to their architecture. All proteins with the same architecture belong to the same group and therefore this can be used to assign them properties (functions) within that group. We provide two types of functional descriptions: (a) Gene Ontology terms (functions from a controlled vocabulary of biological functions which is a de facto standard in biology); (b) English words with an associated numerical score (weight), describing the protein in plain English terms. Moreover:

1) we have built a tool (ConSAT) aimed at working on very large sets of proteins (i.e. whole genomes) and producing functional assignments for each sequence. This tool is written in Python and it is available at https://github.com/alfonsoeromero/ConSAT under a free software licence (GPLv3), which means that anyone is free to download, modify and use it for any purpose.

2) we have precomputed a functional assignment for all UniProtKB proteins (all the publicly available protein sequences) and we have displayed our prediction on the ConSAT webserver http://paccanarolab.org/consat. On the website the user can query, browse and download the predictions for any given protein or set of proteins. The website is freely available for any user at no charge.
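
As an illustration of how properties can be assigned within an architecture group, the following Python sketch transfers Gene Ontology terms to a group by simple majority voting among its already annotated members. This voting rule, the `min_fraction` threshold, the function name and the toy data are assumptions made for the example; the prediction methods actually used by ConSAT are not described here.

```python
from collections import Counter


def transfer_terms(members, known_terms, min_fraction=0.6):
    """Return terms carried by at least min_fraction of the annotated members."""
    annotated = [p for p in members if known_terms.get(p)]
    if not annotated:
        return set()  # nothing to transfer from
    counts = Counter(t for p in annotated for t in set(known_terms[p]))
    return {t for t, c in counts.items() if c / len(annotated) >= min_fraction}


# Hypothetical group sharing one architecture; P3 has no annotation yet.
members = ["P1", "P2", "P3"]
known_terms = {
    "P1": {"GO:0004672", "GO:0005524"},
    "P2": {"GO:0004672"},
}
predicted = transfer_terms(members, known_terms)
```

With these toy annotations only the term shared by both annotated members passes the threshold and would be suggested for P3.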
Exploitation Route ConSAT is of high interest to researchers focusing on protein domain architectures and functional genomics. To name a few, potential beneficiaries of this project are:

1. The biological community at large, interested in comprehensive annotation of genomes.

2. Genome database annotators involved in curating model organism resources like Flybase (Drosophila), Wormbase (round worm), ZFIN (zebra fish), MGI (mouse), TAIR (Arabidopsis) etc.

3. Agriculture: predicting function for plant genes should enable the design of genetic methods to improve plant performance.

4. Structural genomics projects that try to understand 3D structure and function of domains with only sequence information.

5. Genome researchers interested in comparative genomic tools and applications for understanding how organisms evolve and species radiate.

6. Researchers interested in designing molecular switches through domain engineering strategies.

7. Pharmaceutical companies looking to attack species-specific pathways containing a certain protein domain architecture.

8. New sequencing efforts: this method could enable researchers to rapidly assign putative function to new genes in freshly sequenced organisms.

So far we are aware of our work being used by researchers of the second group (Rajkumar Sasidharan, at TAIR), third group (Pablo Sotelo, Asuncion) and the eighth group (Matteo Pellegrini, UCLA).
Sectors Agriculture, Food and Drink,Chemicals,Energy,Environment,Healthcare,Manufacturing, including Industrial Biotechnology,Pharmaceuticals and Medical Biotechnology

URL http://www.paccanarolab.org/consat
 
Description We are aware of ConSAT being used by two independent research groups which are actively working with organisms of high practical interest for crop production and for biofuel production. 1) Pablo Sotelo, from the Universidad Nacional de Asuncion (Paraguay), is working with the fungus Macrophomina phaseolina, a plant pathogen affecting more than 500 plant species (many crops among them). ConSAT has been used to produce a functional annotation of the proteome of this fungus, which is the first step for its characterization and is aimed at finding better and more targeted mechanisms for pest control. Since Paraguay is one of the largest producers of soya in the world, this work has important economic implications for the country. 2) Matteo Pellegrini (University of California, Los Angeles) leads a lab with a high interest in algal genomics. The lab is currently sequencing the genome of the unicellular alga Cyclotella cryptica, a model organism for lipid accumulation. This work has application in the biofuel production industry. The Pellegrini lab has been using the functional predictions provided by ConSAT to annotate this algal genome. In our group, we have used ConSAT to participate in the second CAFA challenge, a protein function prediction competition. Although this activity is within the academic domain, we think it has been important for acquiring visibility and engaging further collaborations. Finally, GoSSToWeb (our web tool for computing semantic similarities between gene products), although it was published very recently (August 2014), has been used numerous times by more than 50 different researchers. We have also received more than ten communications from different users (mainly life scientists) showing interest in the tool and asking questions about its usage.
First Year Of Impact 2013
Sector Agriculture, Food and Drink,Chemicals,Energy,Environment,Healthcare,Pharmaceuticals and Medical Biotechnology
Impact Types Economic

 
Description ABI innovation
Amount $1,203,514 (USD)
Organisation National Science Foundation (NSF) 
Sector Public
Country United States
Start 09/2017 
End 09/2020
 
Description BBSRC Tools and Resources Development Fund
Amount £114,257 (GBP)
Funding ID BB/K004131/1 
Organisation Biotechnology and Biological Sciences Research Council (BBSRC) 
Sector Public
Country United Kingdom
Start 07/2012 
End 12/2013
 
Description EU, Marie Curie Fellowship to Dr Beatrix Horvath
Amount € 309,235 (EUR)
Organisation Marie Sklodowska-Curie Actions 
Sector Charity/Non Profit
Country Global
Start 03/2013 
End 05/2015
 
Description EU, Marie Curie Fellowship to Dr Fabio Manfredini (with Prof Mark Brown)
Amount € 221,606 (EUR)
Organisation Marie Sklodowska-Curie Actions 
Sector Charity/Non Profit
Country Global
Start 04/2014 
End 04/2016
 
Description EU, Marie Curie Fellowship to Dr Papdi Csaba (with Prof L. Bogre)
Amount € 221,606 (EUR)
Organisation Marie Sklodowska-Curie Actions 
Sector Charity/Non Profit
Country Global
Start 04/2013 
End 04/2015
 
Title ClusterONE 
Description ClusterONE (Clustering with Overlapping Neighborhood Expansion) is a graph clustering algorithm that is able to handle weighted graphs and readily generates overlapping clusters. Owing to these properties, it is especially useful for detecting protein complexes in protein-protein interaction networks with associated confidence values. ClusterONE is available as a standalone command-line application, as a plugin to Cytoscape or ProCope and as a web application. 
Type Of Material Computer model/algorithm 
Year Produced 2012 
Provided To Others? Yes  
Impact ClusterONE was one of the key steps in our Soluble Human Protein Complexes project, which provided the largest catalogue to date of human protein complexes from cell culture. The original publication describing the ClusterONE algorithm has received in excess of 130 citations so far (Google Scholar). 
URL http://www.paccanarolab.org/clusterone
 
Title ConSAT 
Description ConSAT is a database of Consensus Signature Architectures. A consensus architecture is a set of non-overlapping domain assignments (considering insertions) which tries to define uniquely each protein. These architectures are used for prediction of GO categories, and to assign weighted words derived from mining PubMed abstracts. The database is available at http://paccanarolab.org/consat 
Type Of Material Database/Collection of data 
Year Produced 2014 
Provided To Others? Yes  
Impact The results contained in this database are currently being used by two research groups who are actively working with organisms of high practical interest for crop production and for biofuel production (Pablo Sotelo, Universidad Nacional de Asuncion (Paraguay); Matteo Pellegrini, University of California, Los Angeles (USA)). 
URL http://paccanarolab.org/consat
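
The notion of an architecture as a set of non-overlapping domain assignments, as described in the ConSAT entry above, can be sketched in Python as follows. This is a simple greedy selection by score and is only an illustration: it ignores insertions (one domain nested inside another), which the description says ConSAT considers, and the function name, tuple layout, scores and domain names are all hypothetical.

```python
def select_non_overlapping(hits):
    """Greedily keep the best-scoring hits whose residue ranges do not overlap.

    Each hit is (domain_name, start, end, score) with inclusive coordinates.
    """
    chosen = []
    for hit in sorted(hits, key=lambda h: h[3], reverse=True):
        _, start, end, _ = hit
        # Keep the hit only if it lies entirely before or after every kept hit.
        if all(end < s or start > e for _, s, e, _ in chosen):
            chosen.append(hit)
    # Report the kept domains in sequence order.
    return sorted(chosen, key=lambda h: h[1])


# Hypothetical candidate assignments for one protein.
hits = [
    ("kinase", 10, 250, 95.0),
    ("kinase_like", 30, 240, 40.0),
    ("SH2", 260, 350, 80.0),
]
architecture = [name for name, *_ in select_non_overlapping(hits)]
```

Here the lower-scoring `kinase_like` hit is dropped because it overlaps the `kinase` hit, leaving the architecture kinase followed by SH2.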
 
Title Disease Similarity 
Description We introduce a MeSH-based method that accurately quantifies similarity between heritable diseases at molecular level. This method effectively brings together the existing information about diseases that is scattered across the vast corpus of biomedical literature. We prove that sets of MeSH terms provide a highly descriptive representation of heritable disease and that the structure of MeSH provides a natural way of combining individual MeSH vocabularies. We show that our measure can be used effectively in the prediction of candidate disease genes. 
Type Of Material Computer model/algorithm 
Year Produced 2015 
Provided To Others? Yes  
Impact There are no impacts yet; this work appeared only about 3 months ago. 
 
Title LanDis 
Description Disease similarity measures quantify the distance between disease modules on the interactome. These measures can provide a starting point for in-depth exploration of the diseases at molecular level, and are of particular relevance for orphan diseases. LanDis is an explorable database, containing the disease similarities of 28.5 million pairs of heritable diseases. These are calculated by summarising the existing phenotype information about diseases through large scale analysis of hand curated data. 
Type Of Material Database/Collection of data 
Year Produced 2016 
Provided To Others? Yes  
Impact The paper presenting this database/model is still under review, so most scientists are not aware of its existence yet. However, I have already presented it at conferences and meetings, receiving extremely good feedback from everyone who tried it, especially clinician scientists. 
URL http://www.paccanarolab.org/landis/
 
Title MAPK 
Description This is a general repository of MAPK sequences and orthologues in the plant kingdom. Orthologues were inferred using the InParanoid and Plaza orthologue identifier programs. This site also contains pointers to published evidence for constructing MAPK networks in Arabidopsis, yeast and human, including high throughput and targeted experiments. The base dataset included here appeared in the paper by Dóczi, Ökrész, Romero, Paccanaro and Bögre (see reference). 
Type Of Material Database/Collection of data 
Year Produced 2012 
Provided To Others? Yes  
Impact The original paper has been cited more than 20 times (Google Scholar). 
URL http://paccanarolab.org/static_content/MAPKevol/index.html
 
Title S2F 
Description S2F (Sequence to Function) is a software package implementing our diffusion-based method for predicting protein function in organisms for which little or no experimental data is available and the only available information is the set of protein sequences. Protein function is predicted with respect to terms in the Gene Ontology (GO). For a given protein the system provides a probability distribution over the GO terms, which is consistent with the ontology structure, i.e. the probability of a more general term is always higher than the probability of a more specific one. The stand-alone package is self-contained, including tools for generating a set of initial seed functional labels to diffuse as well as methods for inferring the biological networks onto which to diffuse the labels. 
Type Of Material Computer model/algorithm 
Year Produced 2012 
Provided To Others? Yes  
Impact The results obtained by this algorithm are currently being used by two research groups who are actively working with organisms of high practical interest for crop production and for biofuel production (Pablo Sotelo, Universidad Nacional de Asuncion (Paraguay); Matteo Pellegrini, University of California, Los Angeles (USA)). 
URL http://paccanarolab.org/s2f
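
The consistency property mentioned in the S2F entry above (a more general GO term should not score lower than a more specific one) can be illustrated with a short Python sketch. It enforces the property by raising every ancestor to at least the score of its descendants; this max-propagation rule is an assumption chosen for illustration and is not the diffusion method implemented in S2F. The term names and the `parents` mapping are hypothetical.

```python
def make_consistent(scores, parents):
    """Raise each ancestor's score to at least that of its descendants."""
    consistent = dict(scores)

    def lift(term, value):
        # Walk upwards, stopping where an ancestor is already high enough.
        for parent in parents.get(term, ()):
            if consistent.get(parent, 0.0) < value:
                consistent[parent] = value
                lift(parent, value)

    for term, value in scores.items():
        lift(term, value)
    return consistent


# Hypothetical three-level ontology: specific -> general -> root.
parents = {"specific": ["general"], "general": ["root"]}
scores = {"root": 0.9, "general": 0.3, "specific": 0.6}
consistent = make_consistent(scores, parents)
```

After the call, `general` has been raised to the score of `specific`, while `root` keeps its already higher score.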
 
Title SemanticSimilarity 
Description The introduction of ontologies for gene functional annotation allows us to compare genes by quantifying the similarity of the terms with which they are annotated. These comparisons are important as they contribute to the inference of functional relationships between gene products by providing a perspective that complements both experimental information and sequence-based approaches. The proposed measure, which we call the random walk contribution (RWC) can be integrated with any standard semantic similarity measure, which we call host similarity measure (HSM), to yield an integrated similarity measure (ISM) that takes into account the whole ontology structure. In other words our random walk similarity measure is a kind of 'add on' to one's favourite underlying similarity measure. 
Type Of Material Computer model/algorithm 
Year Produced 2012 
Provided To Others? Yes  
Impact One of the key steps in our Soluble Human Complexes project was the application of our Semantic Similarity method for calculating semantic similarities between human genes on the Gene Ontology. To date, the publication containing the method itself has been cited 22 times (Google Scholar). 
URL http://www.paccanarolab.org/static_content/gosim/
 
Title SolubleComplexes 
Description Our research on diffusion methods for protein function prediction led to the development of methods for inference and structure discovery in biological networks. We applied some of these methods within a collaboration project with the labs of Andrew Emili (University of Toronto) and Edward Marcotte (University of Texas, Austin) which was aimed at detecting human protein complexes. In particular, for this project we deployed: ClusterONE, our algorithm for detecting overlapping protein complexes from PPI networks; GOSSTO, our method for calculating semantic similarities on the Gene Ontology; an information diffusion method we developed for denoising protein interaction data. The protein interaction networks identified experimentally in Emili's lab were enriched with networks generated using comparative genomics approaches in Marcotte's lab. Then, in my lab, we integrated this network with a semantic similarity graph (obtained using GOSSTO), applied our denoising procedure, and finally clustered the resulting graph using ClusterONE. We thus obtained the largest catalogue to date of human protein complexes from cell culture. The human protein complexes repository contains all the data generated in this study in an easily navigable format. These include all the pairwise protein interactions obtained through integration of the experimental data with public genomic evidence and the subunit composition of the 622 putative protein complexes obtained by clustering using ClusterONE. 
Type Of Material Database/Collection of data 
Year Produced 2012 
Provided To Others? Yes  
Impact The original publication where this dataset was first released has been cited, to date, more than 100 times (Google Scholar). 
URL http://human.med.utoronto.ca/php/data_download.php
 
Title mutation3d 
Description A new algorithm and Web server, mutation3D (http://mutation3d.org), proposes driver genes in cancer by identifying clusters of amino acid substitutions within tertiary protein structures. We demonstrated the feasibility of using a 3D clustering approach to implicate proteins in cancer based on explorations of single proteins using the mutation3D Web interface. 
Type Of Material Computer model/algorithm 
Year Produced 2016 
Provided To Others? Yes  
Impact No notable impacts yet; the paper only appeared about a month ago. 
URL http://mutation3d.org/
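
The idea of finding clusters of substitutions within a tertiary structure, as described in the mutation3D entry above, can be sketched in Python. The example groups mutated residues by single-linkage clustering with a distance cutoff, implemented with a small union-find; this is a stand-in technique chosen for brevity and should not be read as the clustering procedure used by mutation3D. The residue numbers, coordinates and cutoff are hypothetical.

```python
import math


def cluster_positions(coords, cutoff):
    """Single-linkage grouping: residues within cutoff of each other share a cluster."""
    ids = list(coords)
    parent = {i: i for i in ids}

    def find(i):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a_index, a in enumerate(ids):
        for b in ids[a_index + 1:]:
            if math.dist(coords[a], coords[b]) <= cutoff:
                parent[find(a)] = find(b)

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical residue number -> 3D coordinate of each substituted residue.
coords = {12: (0, 0, 0), 45: (3, 0, 0), 78: (0, 4, 0), 200: (30, 30, 30)}
clusters = cluster_positions(coords, cutoff=6.0)
```

In this toy structure residues 12, 45 and 78 are far apart in sequence but close in space, so they form one cluster, while residue 200 stays on its own.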
 
Description Cancer genomics -- Haiyuan Yu (Cornell University) 
Organisation Cornell University
Country United States 
Sector Academic/University 
PI Contribution We recently started a collaboration with the Yu lab in the field of cancer genomics, where we contributed to the development of a clustering method to predict cancer mutation hotspots in proteins. We used our expertise in clustering methods to provide an efficient solution and integrate it into a comprehensive analysis pipeline.
Collaborator Contribution Prof Yu and his lab have great expertise in the field of cancer genomics. They have contributed the biological question and the data.
Impact A journal paper describing the method is currently under review in BMC Biology. The collaboration is multi-disciplinary involving biologists and computational scientists.
Start Year 2013
 
Description Clustering of protein interaction networks -- Haiyuan Yu (Cornell University) 
Organisation Cornell University
Country United States 
Sector Academic/University 
PI Contribution We developed ClusterONE, a new method for protein complex detection using clustering on protein-protein interaction networks.
Collaborator Contribution Haiyuan Yu is an expert in protein-protein interaction screening and protein-protein interaction prediction, and he proposed different ways to evaluate the quality of the predictions. He also gave important feedback on the method. ClusterONE was published in 2012 in Nature Methods (see below).
Impact This is an interdisciplinary collaboration between molecular biologists (Yu lab) and computational scientists (our lab). The collaboration has produced one clustering algorithm for detecting protein complexes from protein-protein interaction networks, and its corresponding implementation (ClusterONE). The details of the publication are the following: T. Nepusz, H. Yu, and A. Paccanaro Detecting overlapping protein complexes in protein-protein interaction networks Nature Methods, vol. 9, pp. 471-472, 2012. The software (ClusterONE) is available on our website ( http://www.paccanarolab.org/clusterone/ ) and is released under a free software license (it can be freely downloaded, executed and modified). The collaboration is multi-disciplinary involving biologists and computational scientists.
Start Year 2011
 
Description Development of a web resource for protein functional annotation -- Raj Sasidharan (BASF) 
Organisation BASF
Country Germany 
Sector Private 
PI Contribution We developed ConSAT, a tool for protein functional annotation using protein consensus domain architectures. In this project a new algorithm was developed and a web resource (ConSAT) with precomputed results was created (available at http://paccanarolab.org/consat ). The method includes three different types of functional prediction methods, two assigning Gene Ontology terms from the protein architecture, and one assigning English weighted words.
Collaborator Contribution Rajkumar Sasidharan's help was very important for the development of this project, mainly in two different fields: first, he provided expert knowledge in structural biology; second, he helped giving feedback on the usability of the web server, leading to its improvement.
Impact The project main output is the above referenced website. Publications are currently being written. The collaboration is multi-disciplinary involving biologists and computational scientists.
Start Year 2012
 
Description Disease gene prioritisation by the combination of gene networks -- Giorgio Valentini (Milan) 
Organisation University of Milan
Country Italy 
Sector Academic/University 
PI Contribution We preprocessed, cleaned and provided a set of biological datasets to Giorgio Valentini to assist in the development of several methods of gene networks combination for disease-gene prioritisation (that is, finding new causative genes for diseases). We provided, among others, several semantic similarity networks among sets of human genes. We also suggested new evaluation measures for this task.
Collaborator Contribution Giorgio Valentini developed a set of algorithms for finding new disease-gene associations. In that context he proposed many different ways in which different gene networks (both weighted and unweighted) could be combined to produce a resulting network resembling a relation based on the fact that two linked genes are supposed to share an underlying disease. The new predictions are given as an output of the paper (available at http://homes.di.unimi.it/re/suppmat/genesmeshnetwpred/supmatTBL1.html ).
Impact Apart from the above mentioned URL, the collaboration led to the following publication: G Valentini, A Paccanaro, H Caniza, AE Romero, M Re An extensive analysis of disease-gene associations using network integration and fast kernel-based gene prioritization methods Artificial Intelligence in Medicine 61 (2), 63-78
Start Year 2013
 
Description Drug side effect prediction (with Mark Gerstein and Shantao Li, Yale University) 
Organisation Yale University
Country United States 
Sector Academic/University 
PI Contribution We have developed a new method for predicting side effects of drugs. Our preliminary results show that our method represents a great improvement with respect to the existing state of the art in terms of side effect prediction. Moreover, it is the first method that can predict the expected frequency of side effects in the population.
Collaborator Contribution They are helping us to provide an explanation of some aspects of our models in terms of the biology/biochemistry/pharmacology.
Impact A journal article is in preparation. The collaboration is multi-disciplinary involving biologists and computer scientists.
Start Year 2017
 
Description Enhancer prediction using epigenetic signals in different mouse tissues (with Mark Gerstein and Mengting Gu, Yale University) 
Organisation Yale University
Department Department of Molecular Biophysics and Biochemistry
Country United States 
Sector Academic/University 
PI Contribution Apply machine learning, signal processing and pattern recognition methods for improving the performance of the enhancer prediction for different tissues in the mouse genome. Preliminary results indicate that ensemble methods perform better than other classifiers. More advanced methods for feature extraction such as deep learning are going to be tested on the data.
Collaborator Contribution Members of the Gerstein Lab developed a pattern recognition method called matched filters for enhancer prediction. However, our preliminary results show that advanced machine learning may improve prediction accuracy. The Gerstein Lab supplied the data and will interpret the results in the context of enhancer and promoters in the genome.
Impact The collaboration is multi-disciplinary involving biologists and computer scientists.
Start Year 2017
 
Description Functional prediction for Cyclotella cryptica -- Matteo Pellegrini (UCLA) 
Organisation University of California, Los Angeles (UCLA)
Country United States 
Sector Academic/University 
PI Contribution The Pellegrini Lab is interested in better understanding certain metabolic pathways in the genome of the alga Cyclotella cryptica. This alga is of particular economic interest to the growing algal biofuels industry because of its high levels of lipid production. In order to better understand those pathways, an important step is to provide a functional annotation of the genes of the organism. Our contribution to Prof Pellegrini's research has been to provide a functional annotation of this alga using ConSAT and S2F (the function annotation tools that we developed in the context of our grants).
Collaborator Contribution Though this collaboration is still ongoing, feedback from the Pellegrini lab has been incorporated into our tool to make it more usable.
Impact We expect journal publications to be written soon. The collaboration is multi-disciplinary involving biologists and computational scientists.
Start Year 2012
 
Description Functional prediction for Macrophomina phaseolina -- Pablo Sotelo (Universidad Nacional de Asuncion) 
Organisation National University of Asuncion
Country Paraguay 
Sector Academic/University 
PI Contribution We have provided the Sotelo lab with a complete functional annotation of the fungus Macrophomina phaseolina. This was done using both S2F and ConSAT, our systems for protein function prediction. Macrophomina phaseolina has recently been sequenced and is responsible for a disease affecting many crops, particularly soya, of which Paraguay is one of the largest producers in the world. Our contribution will ultimately help both the development of new pesticides to fight this fungus and research into genetically modified varieties of soya that are resistant to it.
Collaborator Contribution The Sotelo lab has been providing us with feedback on the accuracy of our predictions, which is helping us to improve our system.
Impact This is a multidisciplinary collaboration between computational scientists (Paccanaro lab) and life scientists (Sotelo lab). We expect to produce a joint publication in the near future as an output of this collaboration.
Start Year 2014
 
Description Gene prioritisation for lymphoma growth in a mutagenesis study 
Organisation Medical Research Council (MRC)
Department MRC Clinical Sciences Centre (CSC)
Country United Kingdom 
Sector Public 
PI Contribution Prediction of lymphoma growth stage by analysis of gene clonality values from a sample. Prioritisation of genes selected from broad loci sources involved in lymphomagenesis. This process yielded a set of about 20 genes selected for further studies.
Collaborator Contribution Studies of lymphoma developed through mutagenesis in over 500 mice, with the corresponding sample clonality analysis. Ongoing gene relevance analysis.
Impact Studies are still ongoing on the relevance of the selected genes. We expect to obtain a publication about this work when the process finishes. The study is multi-disciplinary and it comprises the following disciplines: cancer genomics, molecular biotechnology, systems biology, computer science, big data analysis, bioinformatics.
Start Year 2015
 
Description GoSSTo, a Tool for computing Gene Ontology Semantic Similarities -- Giorgio Valentini (University of Milan) 
Organisation University of Milan
Country Italy 
Sector Academic/University 
PI Contribution We developed GoSSTo, a command-line tool to compute semantic similarities between gene products. The tool implements an algorithm previously published by our group, with the aim of making it accessible to any researcher. We also implemented GoSSToWeb, a web server providing easier access to this tool for biological researchers.
Collaborator Contribution Giorgio Valentini and his lab helped with the development of the web interface of our tool for computing semantic similarities, which was recently published, and also provided user feedback on the command-line tool.
Impact The outputs are our software tools (GoSSTo and GoSSToWeb). Our web tool, available at www.paccanarolab.org/gosstoweb, has had over 50 registered users and 70 submitted jobs thus far. Moreover, the collaboration has resulted in the following publication: H. Caniza, A. E. Romero, S. Heron, H. Yang, A. Devoto, M. Frasca, M. Mesiti, G. Valentini, and A. Paccanaro, GOssTo: a user-friendly stand-alone and web tool for calculating semantic similarities on the Gene Ontology, Bioinformatics, vol. 30, iss. pp. 2235-2236, 2014. A preliminary version of this paper was accepted at the ISMB conference in 2013: H. Caniza, A. E. Romero, S. Heron, H. Yang, M. Frasca, M. Mesiti, G. Valentini, and A. Paccanaro. 'GOssTo and GOssToWeb: user-friendly tools for calculating semantic similarities on the Gene Ontology.' Bio-Ontologies SIG 2013-ISMB 2013 (2013).
Start Year 2012
 
Description Human Protein Complexes -- Emili (Un. Toronto), Marcotte (Un. Texas, Austin) 
Organisation University of Toronto
Country Canada 
Sector Academic/University 
PI Contribution Our research on diffusion methods for protein function prediction led to the development of methods for inference and structure discovery in biological networks. We applied some of these methods within a collaboration project with the labs of Andrew Emili (University of Toronto) and Edward Marcotte (University of Texas, Austin) which was aimed at detecting human protein complexes. In particular, for this project we deployed: ClusterONE, our algorithm for detecting overlapping protein complexes from PPI networks; GOSSTO, our method for calculating semantic similarities on the Gene Ontology; an information diffusion method we developed for denoising protein interaction data. The protein interaction networks identified experimentally in Emili's lab were enriched with networks generated using comparative genomics approaches in Marcotte's lab. Then, in my lab, we integrated this network with a semantic similarity graph (obtained using GOSSTO), applied our denoising procedure, and finally clustered the resulting graph using ClusterONE. We thus obtained the largest catalogue to date of human protein complexes from cell culture.
Collaborator Contribution The protein interaction networks identified experimentally in Emili's lab were enriched with networks generated using comparative genomics approaches in Marcotte's lab. Then, in my lab, we integrated this network with a semantic similarity graph (obtained using GOSSTO), applied our denoising procedure, and finally clustered the resulting graph using ClusterONE.
Impact 1) The human protein complexes repository contains all the data generated in this study in an easily navigable format. These include all the pairwise protein interactions obtained through integration of the experimental data with public genomic evidence and the subunit composition of the 622 putative protein complexes obtained by clustering using ClusterONE. 2) P. C. Havugimana, T. G. Hart, T. Nepusz, H. Yang, A. L. Turinsky, Z. Li, P. I. Wang, D. R. Boutz, V. Fong, S. Phanse, M. Babu, S. A. Craig, P. Hu, C. Wan, J. Vlasblom, V. U. Dar, A. Bezginov, G. W. Clark, G. C. Wu, S. J. Wodak, E. R. Tillier, A. Paccanaro, E. M. Marcotte, and A. Emili, A census of human soluble protein complexes, Cell, vol. 150, iss. 5, pp. 1068-1081, 2012. The collaboration is multi-disciplinary involving biologists and computational scientists.
Start Year 2009
 
Description Learning disease-gene associations by exploiting disease similarities (with Mark Gerstein, Yale University) 
Organisation Yale University
Department Department of Molecular Biophysics and Biochemistry
Country United States 
Sector Academic/University 
PI Contribution We recently developed a disease similarity measure and calculated all the disease-disease similarities between OMIM diseases. We established a prior disease-gene association probability and provided training and testing datasets for the learning. We fitted the model.
Collaborator Contribution Developed a Lipschitz diffusion model, that we used to spread the disease-gene association through the interactome, and a fully functional fast implementation of the algorithm.
Impact The collaboration is multi-disciplinary involving biologists and computer scientists.
Start Year 2017
 
Description Network-based Genome Analysis Reveals Structural and Functional Properties of Genes (with Mark Gerstein and Koon-Kiu Yan, Yale University) 
Organisation Yale University
Country United States 
Sector Academic/University 
PI Contribution We have analysed the spatial proximity of all pathway genes (KEGG Database) across various cancer cell lines. Our preliminary results provide strong evidence for a relationship between disease pathways and cancer. The study also helps identify candidate genes for a number of diseases.
Collaborator Contribution They have successfully applied network community detection techniques to Hi-C data (three-dimensional architecture of genomes) in order to identify topologically associating domains (TADs) of genomic regions.
Impact The collaboration is multi-disciplinary involving biologists and computer scientists.
Start Year 2017
 
Description The objective of the project is to elucidate the mechanism of action of a drug for multiple sclerosis 
Organisation Imperial College London
Department Faculty of Medicine
Country United Kingdom 
Sector Academic/University 
PI Contribution To analyse transcriptomics data obtained from a trial on human patients using network medicine approaches.
Collaborator Contribution They hosted a trial with human patients and extracted transcriptomics data at different times.
Impact No outputs yet. This collaboration is multidisciplinary involving: computer science, network science, machine learning, medicine, biology and pharmacology.
Start Year 2015
 
Title CONSAT 
Description ConSAT is a terminal-based application which can be used to functionally annotate a set of proteins using their consensus domain architectures. Proteins are assigned Gene Ontology terms based on the domain composition of the architecture and on the already known experimental terms of proteins with a given architecture. To help produce a description of a protein sequence, it also assigns weighted English words derived from mining PubMed articles. ConSAT is written in Python. 
Type Of Technology Software 
Year Produced 2014 
Open Source License? Yes  
Impact ConSAT has been used to produce the database of the same name (see 'databases'), which is being used in two external collaborations (with Pablo Sotelo and Matteo Pellegrini, see 'collaborations'). ConSAT has been used for our participation in the second CAFA challenge, organized by an international research community of more than 50 research groups devoted to the study of protein function prediction methods. 
URL http://paccanarolab.org/ConSAT
 
Title ClusterONE 
Description ClusterONE (Clustering with Overlapping Neighborhood Expansion) is a graph clustering algorithm that is able to handle weighted graphs and readily generates overlapping clusters. Owing to these properties, it is especially useful for detecting protein complexes in protein-protein interaction networks with associated confidence values. ClusterONE is available as a standalone command-line application, as a plugin to Cytoscape or ProCope. 
Type Of Technology Software 
Year Produced 2012 
Open Source License? Yes  
Impact For the creation of the Human protein complexes repository (http://human.med.utoronto.ca/) the standalone version of ClusterONE was used to produce the putative protein complexes. This project provided the largest catalogue to date of human protein complexes from cell culture. All versions of the ClusterONE Cytoscape plugin have been downloaded a total of 4801 times, with 5 releases produced so far. The ClusterONE publication has in excess of 130 citations. 
URL http://paccanarolab.org/clusterone
 
Title GOSSTO 
Description Semantic similarity calculations aim to provide a quantifiable measure of functional relatedness of genes by assessing the similarity of the functional terms with which they are annotated. GOSSTO (Gene Ontology Semantic Similarity Tool) is a tool for calculating this measure with respect to Gene Ontology terms. It implements an improved diffusion-based measure developed in this project, as well as several well-established measures, such as those proposed by Resnik, Lin and Jiang, and simUI. GOSSTO also includes extension capabilities that let the user add new similarity measures. GOSSTO is available as a standalone command-line application running on Windows, GNU/Linux and MacOS, as well as a web tool. The web tool is available at www.paccanarolab.org/gosstoweb 
Type Of Technology Software 
Year Produced 2014 
Open Source License? Yes  
Impact For the creation of the Human protein complexes repository (http://human.med.utoronto.ca/) the standalone version of GOSSTO was used to compute semantic similarities between human genes in the Gene Ontology. This project provided the largest catalogue to date of human protein complexes from cell culture. Our web tool, available at www.paccanarolab.org/gosstoweb, has had over 50 registered users and 70 submitted jobs thus far. 
URL http://paccanarolab.org/gossto
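
To illustrate the kind of information-content-based measures named in the GOSSTO description (Resnik and Lin), the following is a minimal Python sketch on a toy ontology. The term names, annotation counts and function names are all hypothetical; this is not GOSSTO's code and it does not implement the diffusion-based measure developed in the project.

```python
import math

# Hypothetical toy ontology: term -> set of parent terms (not real GO terms).
PARENTS = {
    "root": set(),
    "binding": {"root"},
    "catalysis": {"root"},
    "dna_binding": {"binding"},
    "rna_binding": {"binding"},
}

# Hypothetical number of gene products directly annotated with each term.
DIRECT_COUNTS = {
    "root": 0,
    "binding": 2,
    "catalysis": 4,
    "dna_binding": 3,
    "rna_binding": 1,
}


def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content():
    """IC(t) = -log p(t), where p(t) counts annotations to t or its descendants."""
    cumulative = {term: 0 for term in PARENTS}
    for term, count in DIRECT_COUNTS.items():
        for anc in ancestors(term):
            cumulative[anc] += count
    total = cumulative["root"]
    return {term: -math.log(count / total) for term, count in cumulative.items()}


def resnik(t1, t2, ic):
    """Resnik similarity: IC of the most informative common ancestor."""
    return max(ic[a] for a in ancestors(t1) & ancestors(t2))


def lin(t1, t2, ic):
    """Lin similarity: Resnik similarity normalised by the ICs of both terms."""
    denom = ic[t1] + ic[t2]
    if denom == 0:
        # Undefined when both terms carry no information; 0.0 by convention here.
        return 0.0
    return 2 * resnik(t1, t2, ic) / denom


IC = information_content()
print(resnik("dna_binding", "rna_binding", IC))
print(lin("dna_binding", "rna_binding", IC))
```

In this toy example the two binding terms share the informative ancestor "binding", whereas "dna_binding" and "catalysis" share only the root, so their Resnik similarity is zero.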
 
Title JustClust 
Description JustClust is a tool for analysing biological data with cluster analysis. JustClust can handle many formats of data and cluster the data with many state-of-the-art techniques. The aim of JustClust is to provide an easy-to-use application which can perform any analysis on any data. 
Type Of Technology Software 
Year Produced 2014 
Open Source License? Yes  
Impact The manuscript is currently being finalised. 
URL http://paccanarolab.org/justclust
 
Title LanDis 
Description Disease similarity measures quantify the distance between disease modules on the interactome. These measures can provide a starting point for in-depth exploration of the diseases at molecular level, and are of particular relevance for orphan diseases. LanDis is a freely available web-based interactive tool that allows domain experts, medical doctors and the larger community to graphically navigate the landscape of human disease similarities. LanDis is designed to explore the similarity landscape of over 28.5 million pairs of heritable diseases, introducing a fully interactive and navigable plot in which diseases are represented as nodes and their pairwise similarity as the links joining them. 
Type Of Technology Webtool/Application 
Year Produced 2016 
Impact The paper presenting this web tool is still under review, so most scientists are not aware of its existence yet. However, I have already presented it at conferences and meetings, receiving extremely good feedback from everyone who tried it, especially clinician scientists. 
URL http://www.paccanarolab.org/landis
 
Title S2F 
Description S2F (Sequence-to-Function) is a software package implementing our diffusion-based method for predicting protein function in organisms for which little or no experimental data is available and the only available information is the set of protein sequences. Protein function is predicted with respect to terms in the Gene Ontology (GO). For a given protein the system provides a probability distribution over the GO terms, which is consistent with the ontology structure, i.e. the probability of a more general term is always higher than the probability of a more specific one. The stand-alone package is self-contained, including tools for generating a set of initial seed functional labels to diffuse as well as methods for inferring the biological networks onto which to diffuse the labels. 
Type Of Technology Software 
Year Produced 2014 
Open Source License? Yes  
Impact The results obtained using S2F are currently being used by two research groups who are actively working with organisms of high practical interest for crop production and for biofuel production (Pablo Sotelo, Universidad Nacional de Asuncion (Paraguay); Matteo Pellegrini, University of California, Los Angeles (USA)). S2F has been used for our participation in two CAFA challenges, organized by an international research community of more than 50 research groups devoted to the study of protein function prediction methods. 
URL http://paccanarolab.org/s2f
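
The consistency property mentioned in the S2F description (a more general GO term should not score lower than a more specific one) can be illustrated with a small Python sketch. This is only one common way of enforcing the property, by raising each term's score to the maximum over its descendants; it is an assumption for illustration and not necessarily how S2F does it. The ontology, scores and function name are hypothetical.

```python
# Hypothetical toy ontology: term -> list of parent terms (not real GO terms).
TOY_PARENTS = {
    "root": [],
    "binding": ["root"],
    "catalysis": ["root"],
    "dna_binding": ["binding"],
}

# Hypothetical raw scores that violate the ontology structure:
# "dna_binding" scores higher than its ancestors "binding" and "root".
TOY_SCORES = {"root": 0.5, "binding": 0.2, "catalysis": 0.1, "dna_binding": 0.9}


def enforce_consistency(scores, parents):
    """Raise each term's score to at least the maximum score of its descendants."""
    # Invert the parent map so that each term knows its direct children.
    children = {term: [] for term in parents}
    for child, term_parents in parents.items():
        for parent in term_parents:
            children[parent].append(child)

    consistent = {}

    def resolve(term):
        # Memoised depth-first pass over the DAG, children before parents.
        if term not in consistent:
            child_scores = [resolve(child) for child in children[term]]
            consistent[term] = max([scores[term]] + child_scores)
        return consistent[term]

    for term in parents:
        resolve(term)
    return consistent


CONSISTENT = enforce_consistency(TOY_SCORES, TOY_PARENTS)
print(CONSISTENT)
```

After this pass every parent scores at least as high as each of its children, which is the structural constraint the description refers to.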
 
Title mutation3D 
Description mutation3D is a functional prediction and visualization tool for studying the spatial arrangement of amino acid substitutions on protein models and structures. It is intended to be used to identify clusters of amino acid substitutions arising from somatic cancer mutations across many patients in order to identify functional hotspots and fuel downstream hypotheses. It is also useful for clustering other kinds of mutational data, or simply as a tool to quickly assess relative locations of amino acids in proteins. 
Type Of Technology Webtool/Application 
Year Produced 2016 
Impact It is still too early to assess the impact, as the tool was released about a month ago. 
URL http://mutation3d.org/
 
Description Bristol2012 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact The talks led to interesting discussions and new contacts

Some plans were made for future collaboration
Year(s) Of Engagement Activity 2012
 
Description Cambridge2013 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact Talks about our research and methods with peers

Setting up collaboration activities with our peers
Year(s) Of Engagement Activity 2013
 
Description ClusterONE press release 
Form Of Engagement Activity A press release, press conference or response to a media enquiry/interview
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Media (as a channel to the public)
Results and Impact We advertised on the Royal Holloway college website the publication of the ClusterONE algorithm and of its accompanying software in Nature Methods. The advertisements sparked a lot of interest in the algorithm in the college.

As a consequence of the advertisement, we were approached by biologists in the School of Biological Sciences at Royal Holloway, with whom we started collaborating to cluster the large-scale experimental co-expression networks that they were producing.
Year(s) Of Engagement Activity 2012
URL https://www.royalholloway.ac.uk/computerscience/news/newsarticles/researchersalgorithmpublishedinsci...
 
Description Cornell2013 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact The talk sparked discussions with other scientists. The feedback I obtained was useful for my current research. The talk was important to advertise my research and to make contacts for future collaborations.

A collaboration was initiated with the group of Prof. Haiyuan Yu for a new joint research project aimed at finding hotspot mutations in cancer proteins. The collaboration is ongoing and a paper is currently under review in BMC Biology.
Year(s) Of Engagement Activity 2013
 
Description GlaxoSmithKline2013 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact Engagement with contacts and discussions of mutual interests

Plans for collaboration with some contacts made
Year(s) Of Engagement Activity 2013
 
Description ISMB BioOntologies SIG 2013 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact The poster presentation led to some interesting discussions and new contacts

The feedback from the activity was useful for the further development of our research
Year(s) Of Engagement Activity 2013
URL http://www.iscb.org/ismbeccb2013-program/ismbeccb2013-satellite-meetings#bio
 
Description ISMB NetBIO SIG 2013 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact After the poster presentation, some contacts were made and we had interesting discussions of the presented work

We discussed our work with other researchers, who helped us improve it further
Year(s) Of Engagement Activity 2013
URL http://www.iscb.org/ismbeccb2013-program/ismbeccb2013-satellite-meetings#netbio
 
Description Invited participation in experts' roundtable at The Bioinformatics Strategy Meeting in London 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Industry/Business
Results and Impact I participated in an experts' roundtable together with other academics and members of industry
Year(s) Of Engagement Activity 2016
 
Description MRC2012 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact Discussions about biological problems being studied in their community that we could help with

Establishing links with biologists and creating collaboration networks
Year(s) Of Engagement Activity 2012
 
Description Poster ClusterONE 2013 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact The poster presentation led to discussions on the work with fellow researchers

The feedback provided by our peers was useful for further development
Year(s) Of Engagement Activity 2013
URL http://www.iscb.org/ismbeccb2013
 
Description RHUL Open Days 
Form Of Engagement Activity Participation in an open day or visit at my research institution
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Public/other audiences
Results and Impact The University opens to the public and each department presents a showcase of its research in a way that is accessible to a wider, non-specialist audience.
This generated interest in the research done by the CS Department.

Many students joined the Computer Science Department
Year(s) Of Engagement Activity 2009,2010,2011,2012,2013,2014
 
Description Talks to the groups of Martin Wilkins and Paul Matthews -- summer 2015 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Other audiences
Results and Impact I presented our recent results in the area of Network Medicine to Prof Martin Wilkins and Prof Paul Matthews and their groups (I gave two separate talks) at the Department of Medicine, Imperial College, Hammersmith Hospital. The talk sparked interesting discussions and it was the beginning of a very interesting collaboration with the lab of Prof Matthews in the area of Multiple Sclerosis.
Year(s) Of Engagement Activity 2015
 
Description Taster courses 
Form Of Engagement Activity Participation in an open day or visit at my research institution
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Schools
Results and Impact One-day courses open to school pupils. They enquired about the courses that Computer Science departments offer and about possibilities for future study.

Some students chose to follow the lead we gave them and engaged in Computer Science studies in our department.
Year(s) Of Engagement Activity 2009,2010,2011,2012,2013,2014
 
Description UCAS open days 
Form Of Engagement Activity Participation in an open day or visit at my research institution
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Schools
Results and Impact During this talk I try to convey to school pupils what computer science is and why it is an exciting field of study.
Often the talk sparked questions and discussions.

A high percentage of school pupils who came to the talk decided to study Computer Science and many of these chose to study it in our department at Royal Holloway.
Year(s) Of Engagement Activity 2008,2009,2010,2011,2012,2013,2014
 
Description UCLondon2012 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact Through the talks we met multiple peers and engaged in interesting conversations

We made plans based on the discussions we had with some peers
Year(s) Of Engagement Activity 2012
 
Description Venice2012 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other academic audiences (collaborators, peers etc.)
Results and Impact Our presentation led to meetings with contacts

We developed some plans for collaborations
Year(s) Of Engagement Activity 2009,2012
 
Description talk at Galway -- May 2015 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Other audiences
Results and Impact I presented my work at the School of Mathematics, Statistics and Applied Mathematics at Galway University, Ireland. The talk sparked discussions with other scientists. The feedback I obtained was useful for my current research. The talk was important to advertise my research and to make contacts for future collaborations.
Year(s) Of Engagement Activity 2015