Development of a graph-theoretic approach to predict protein function by integrating large scale heterogeneous data

Lead Research Organisation: Royal Holloway University of London

Department Name: Computer Science

Abstract

The list of organisms with completed genome sequence is continuously growing and this has led to the identification of thousands of genes whose function is still unknown. These genes could potentially be involved in important biological cell functions and could represent important targets for diagnostic and pharmacogenomics studies and be of industrial and agronomical importance. A major undertaking for biology is therefore that of identifying the function of these uncharacterized genes on a genomic scale. The challenge for bioinformatics is then to devise algorithmic methods that, given a gene, can predict a hypothesis for its function that can then be validated by wet-lab assays. Luckily, new experimental techniques have become available, producing data which offer clues about protein function and can therefore be employed for function prediction, e.g. protein interaction data, gene expression data. Some experimental and computational data have a natural representation as networks (e.g. protein interaction data), others are inherently 'one-dimensional' (e.g. sequence patterns). Three facts have recently become clear: while each data type contains important information that can help in determining the function of a protein, no single data type by itself suffices; large-scale functional inference greatly improves by integrating evidence from different sources; for those data types which can be represented as networks, the best results are obtained by algorithms that take advantage of the networks' topologies. So far, methods that make functional inferences on networks are very limited in the type of data they can integrate, while methods that can integrate a greater variety of data do not take advantage of the networks' topologies. 
I intend to investigate a general method that can integrate essentially any data type currently available, taking into account its intrinsic structure: it takes advantage of the graph topology for network data, and it can integrate this evidence together with one-dimensional information. I shall develop graph-theoretical methods that use the diffusion of information over graphs to generate functional evidence from network data. This evidence is then combined with other one-dimensional information using machine learning techniques. The strength of the methodology lies in its ability to use diverse sets of noisy data, and to combine them to obtain sound statistical inferences; the weak signal contained in each dataset is enhanced by integrating the data. The methodology will first be developed on yeast, and I shall then transfer this approach to higher organisms such as C. elegans, D. melanogaster, A. thaliana, and H. sapiens. For all these organisms the performance of the algorithms will then be evaluated 'in silico' by means of test sets; that is, I shall verify the accuracy of the methods at predicting the function for genes whose annotation is known. The approach will then be tested 'in vivo' on a sub-network of genes that form signalling pathways (MAPK signalling) and function to transmit information from receptors to gene expression. MAPK pathway components are highly diversified in the model plant, Arabidopsis thaliana, with 123 components. For many of these we do not know how they connect up and what their biological functions are. These will be predicted by the algorithms and then functionally tested by silencing their expression using RNA interference and in mutant lines. I shall also design and implement stand-alone and web-based software tools incorporating the algorithms developed.
The applications will enable the biologist to easily apply the algorithms through a user-friendly interface; to visualize the relevant biological networks thus making the inference process transparent and providing an explanation for the functional annotation predicted by the system. A web tool will also be created. All these tools will be made freely available to the scientific community.
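
The diffusion of information over graphs mentioned in the abstract can be illustrated with a small sketch. It is not the project's algorithm: the update rule (a weighted neighbour average mixed with the initial seed score), the parameter `alpha`, the iteration count and the toy network are all assumptions chosen for illustration.

```python
def diffuse_labels(adjacency, seeds, alpha=0.8, iterations=50):
    """Spread seed scores for one function over a weighted undirected graph.

    adjacency: dict mapping node -> dict of neighbour -> edge weight
    seeds: dict mapping node -> initial score (1.0 = known annotation)
    alpha: share of each score taken from neighbours (assumed value)
    """
    nodes = list(adjacency)
    initial = {node: seeds.get(node, 0.0) for node in nodes}
    scores = dict(initial)
    for _ in range(iterations):
        updated = {}
        for node in nodes:
            total_weight = sum(adjacency[node].values())
            if total_weight > 0:
                # Weighted average of the neighbours' current scores.
                neighbour_avg = sum(
                    weight * scores[other]
                    for other, weight in adjacency[node].items()
                ) / total_weight
            else:
                neighbour_avg = 0.0  # isolated node: nothing to receive
            # Keep part of the original evidence, take the rest from neighbours.
            updated[node] = (1 - alpha) * initial[node] + alpha * neighbour_avg
        scores = updated
    return scores


# Hypothetical toy network: a chain A-B-C-D plus an isolated protein E.
network = {
    "A": {"B": 1.0},
    "B": {"A": 1.0, "C": 1.0},
    "C": {"B": 1.0, "D": 1.0},
    "D": {"C": 1.0},
    "E": {},
}
scores = diffuse_labels(network, seeds={"A": 1.0})
for protein, score in sorted(scores.items()):
    print(protein, round(score, 3))
```

In this toy case the score decays with distance from the annotated protein A, and the isolated protein E receives no evidence at all, which is why network-derived evidence would still need to be combined with one-dimensional information.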

Technical Summary

Statistically sound large-scale protein function prediction can be obtained only by integrating evidence from different sources. Functional inference methods that exploit the topology of biological networks offer good performance, but so far such methods are limited in the type of data they can integrate, while methods that can integrate a greater variety of data do not take advantage of the networks' topologies. I propose a general method that can integrate essentially any data type available, taking into account the intrinsic structure of each data type: it uses graph-theoretic methods to produce functional evidence from network data, and it integrates it with evidence from one-dimensional information using machine learning techniques. Defining function in terms of the Gene Ontology, I shall collect datasets for S. cerevisiae, C. elegans, D. melanogaster, A. thaliana and H. sapiens. Algorithm development and testing will be done on S. cerevisiae. I shall then verify how these methods transfer to the other organisms. Performance on these organisms will be evaluated 'in silico', by means of test sets. The approach will also be tested 'in vivo' by predicting the Biological Process for a group of MAP kinases that belong to the signalling pathways of A. thaliana. These predictions will be tested through functional assays: 1. an RNAi screen and quantitative measurements of MAPK signalling outputs, MAPK activities and promoter activations in cultured Arabidopsis cells; 2. quantitative phenotypic tests for selected phenotypes in cell differentiation (e.g. stomata development) and stress responses. I shall design and implement stand-alone and web-based software tools incorporating the algorithms developed. These will enable the biologist to easily apply the algorithms through a user-friendly interface; visualization tools will make the functional inference process transparent to the user. All these tools will be made freely available to the scientific community.

Funded Value:

£419,814

Funded Period:

Sep 08 - Feb 12

Funder:

BBSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

BB/F00964X/1

Principal Investigator:

Alberto Paccanaro

Research Subject:

Omic sciences & technologies (25%)

Tools, technologies & methods (37%)

Research Topic:

Bioinformatics (37%)

Functional genomics (13%)

Proteomics (12%)

Organisations

People	ORCID iD
Alberto Paccanaro (Principal Investigator)
Laszlo Bogre (Co-Investigator)

Publications

Menges M (2008) Comprehensive gene expression atlas for the Arabidopsis MAP kinase signalling pathways. in The New phytologist

Yang, H. (2008) A Maximal Eigenvalue Method for Detecting Process Representative Genes by Integrating Data from Multiple Sources

Hu P (2009) Global functional atlas of Escherichia coli encompassing previously uncharacterized proteins. in PLoS biology

Gianoulis TA (2009) Quantifying environmental adaptation of metabolic pathways in metagenomics. in Proceedings of the National Academy of Sciences of the United States of America

Umbrasaite J (2010) MAPK phosphatase AP2C3 induces ectopic proliferation of epidermal cells leading to stomata development in Arabidopsis. in PloS one

Nepusz T (2010) SCPS: a fast implementation of a spectral method for detecting protein families on a genome-wide scale. in BMC bioinformatics

Dóczi R (2011) Mitogen-activated protein kinase activity and reporter gene assays in plants. in Methods in molecular biology (Clifton, N.J.)

Havugimana PC (2012) A census of human soluble protein complexes. in Cell

Sasidharan R (2012) GFam: a platform for automatic annotation of gene families. in Nucleic acids research

Bhat P (2012) Computational selection of transcriptomics experiments improves Guilt-by-Association analyses. in PloS one

Nepusz T (2012) Detecting overlapping protein complexes in protein-protein interaction networks. in Nature methods

Dóczi R (2012) Exploring the evolutionary path of plant MAPK networks. in Trends in plant science

Abbruscato P (2012) OsWRKY22, a monocot WRKY gene, plays a role in the resistance response to blast. in Molecular plant pathology

Yang H (2012) Improving GO semantic similarity measures by exploring the ontology beneath the terms and modelling uncertainty. in Bioinformatics (Oxford, England)

Radivojac P (2013) A large-scale evaluation of computational protein function prediction. in Nature methods

Caniza H (2014) GOssTo: a stand-alone application and a web tool for calculating semantic similarities on the Gene Ontology. in Bioinformatics (Oxford, England)

Nepusz T (2014) Springer Handbook of Bio-/Neuroinformatics

Pérez-Salamó I (2014) The heat shock factor A4A confers salt tolerance and is regulated by oxidative stress and the mitogen-activated protein kinases MPK3 and MPK6. in Plant physiology

Valentini G (2014) An extensive analysis of disease-gene associations using network integration and fast kernel-based gene prioritization methods. in Artificial intelligence in medicine

Smieszek SP (2014) Progressive promoter element combinations classify conserved orthogonal plant circadian gene expression modules. in Journal of the Royal Society, Interface

Caniza H (2015) A network medicine approach to quantify distance between hereditary disease modules on the interactome. in Scientific reports

Kohoutová L (2015) The Arabidopsis mitogen-activated protein kinase 6 is associated with γ-tubulin on microtubules, phosphorylates EB1c and maintains spindle orientation under nitrosative stress. in The New phytologist

Nagy SK (2015) Activation of AtMPK9 through autophosphorylation that makes it independent of the canonical MAPK cascades. in The Biochemical journal

Galeano D (2016) Drug targets prediction using chemical similarity

Caceres J (2016) Combining interactomes from multiple organisms: A case study on human-mouse

Jiang Y (2016) An expanded evaluation of protein function prediction methods shows an improvement in accuracy. in Genome biology

Meyer MJ (2016) mutation3D: Cancer Gene Prediction Through Atomic Clustering of Coding Variants in the Structural Proteome. in Human mutation

Manfredini F (2017) Neurogenomic Signatures of Successes and Failures in Life-History Transitions in a Key Insect Pollinator. in Genome biology and evolution

Torres M (2017) Drug cocktail selection for the treatment of chagas disease: A multi-objective approach

Webster P (2017) Tracking Subclonal Mutation Frequencies Throughout Lymphomagenesis Identifies Cancer Drivers in Mouse Models of Lymphoma

Caniza H (2017) Mining the biomedical literature to predict shared drug targets in DrugBank

Webster P (2018) Subclonal mutation selection in mouse lymphomagenesis identifies known cancer loci and suggests novel candidates. in Nature communications

Cáceres JJ (2019) Disease gene prediction for molecularly uncharacterized diseases. in PLoS computational biology

Galeano D (2019) Predicting the Frequency of Drug Side effects

Zhou N (2019) MOESM2 of The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens

Webster P (2019) Author Correction: Subclonal mutation selection in mouse lymphomagenesis identifies known cancer loci and suggests novel candidates. in Nature communications

Zhou N (2019) The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens. in Genome biology

Frasca F (2019) Learning Interpretable Disease Self-Representations for Drug Repositioning

Zhou N (2019) The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens

Zhou N (2019) MOESM1 of The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens

Gliozzo J (2020) Network modeling of patients' biomolecular profiles for clinical phenotype/outcome prediction. in Scientific reports

Ye C (2020) The corrected gene proximity map for analyzing the 3D genome organization using Hi-C data. in BMC bioinformatics

Dawes J (2020) Additional file 1 of LUMI-PCR: an Illumina platform ligation-mediated PCR protocol for integration site cloning, provides molecular quantitation of integration sites

Dawes JC (2020) LUMI-PCR: an Illumina platform ligation-mediated PCR protocol for integration site cloning, provides molecular quantitation of integration sites. in Mobile DNA

Galeano D (2020) Predicting the frequencies of drug side effects. in Nature communications

Torres M (2021) Protein function prediction for newly sequenced organisms. in Nature Machine Intelligence

McDonald JT (2021) Role of miR-2392 in driving SARS-CoV-2 infection. in Cell reports

Galeano D (2022) Machine learning prediction of side effects for drugs in clinical trials. in Cell reports methods

Santos SS (2022) Machine learning and network medicine approaches for drug repositioning for COVID-19. in Patterns (New York, N.Y.)

Gliozzo J (2022) Heterogeneous data integration methods for patient similarity networks. in Briefings in bioinformatics

Artistic and Creative Products
Key Findings
Impact Summary
Further Funding
Research Databases and Models
Research Tools and Methods
Collaboration
Software and Technical Products
Engagement Activities


Title	Artist in residence Kerry Lemon
Description	Drawing of plants with increased understanding of how development shapes growth
Type Of Art	Artwork
Year Produced	2014
Impact	Stimulating discussions with students. Media release. Planned exhibition.
URL	http://www.kerrylemon.co.uk/


Description	The list of organisms with completed genome sequence is continuously growing and this has led to the identification of thousands of genes whose function is still unknown. These genes could potentially be involved in important biological cell functions and could represent important targets for diagnostic and pharmacogenomics studies and be of industrial and agronomical importance. A major undertaking for biology is therefore that of identifying the function of these uncharacterized genes on a genomic scale. The challenge for bioinformatics is then to devise algorithmic methods that, given a gene, can predict a hypothesis for its function that can then be validated by wet-lab assays. In this grant we focused on the problem of predicting protein function for organisms for which little or no experimental data is available and the only available information is the set of protein sequences. This is a relevant problem with important implications for both industry and human health - this is the case, for example, for newly sequenced bacterial genomes. We successfully developed a new method for solving this problem based on a recent development in computer science: the diffusion of information over graphs. These methods emulate the way in which heat diffuses on a metal bar. We also developed two further methods that predict protein function by grouping proteins into families. The first method, called GFam (Gene Family Annotation and Maintenance), groups proteins so that members of the same group share a common domain architecture, and hence function. The second, SCPS (Spectral Clustering of Protein Sequences), groups proteins according to their sequence similarity - similar proteins are likely to have evolved from a common ancestor and therefore are likely to share a similar function. Our research in protein function prediction also led to the development of novel methods for inference and structure discovery in biological networks.
This included ClusterONE, an algorithm for detecting protein complexes from experimental data, and GOSSTO, a method for quantifying the functional similarity between two genes. Importantly, we applied these methods within a collaboration project with the labs of Andrew Emili (University of Toronto) and Edward Marcotte (University of Texas, Austin) which was aimed at detecting human protein complexes - the fundamental molecular machineries in the cell. We were able to obtain the largest catalogue to date of human protein complexes from cell culture. In total, we detected 622 complexes encompassing 2,634 distinct proteins. Notably, the majority (62%; 385/622) of the complexes were previously unknown (i.e., only 237 were already present in curated public databases). This catalogue constitutes a first draft of human protein complexes and therefore it provides a glimpse into the global physical molecular organization of human cells. An important output of this project is a set of user-friendly and reliable software packages implementing the algorithms that we developed. We have created a piece of software for every algorithm developed in this project, namely: S2F, GFam, SCPS, ClusterONE, GOSSTO. These tools allow biologists and bioinformaticians to easily deploy our methods, without needing to re-implement our algorithms. All our software packages are freely available for the scientific community as downloadable applications from the lab website. Some of our tools are also available as web applications hosted on our servers. The high number of downloads of our tools testifies to their importance for the scientific community; for example, ClusterONE has already been downloaded 4801 times.
Exploitation Route	The problem of protein function prediction is central in today's biology. Possible beneficiaries include: 1. The biological community at large, interested in comprehensive annotation of genomes. 2. The medical community, since elucidating human gene function can help us associate genes with certain human diseases. 3. Agriculture: predicting function for plant genes should enable us to design genetic methods to improve plant performance. Particularly, the signaling pathways on which we worked in this grant are important for plant adaptation to environmental changes. 4. Pharmaceutical companies looking to attack specific pathways. 5. New sequencing efforts: our software enables scientists to rapidly assign putative function to new genes in freshly sequenced organisms without conducting expensive functional assays.
Sectors	Agriculture, Food and Drink; Chemicals; Energy; Environment; Healthcare; Manufacturing, including Industrial Biotechnology; Pharmaceuticals and Medical Biotechnology
URL	http://www.paccanarolab.org


Description	We are in contact with two research groups who have been using the output of S2F for organisms of high practical interest for crop production and for biofuel production. 1) Pablo Sotelo, from the Universidad Nacional de Asuncion (Paraguay), is working with the fungus Macrophomina phaseolina, a plant pathogen affecting more than 500 plant species (many crops among them, including soya). S2F has been used to produce a functional annotation for the proteome of this fungus, which is the first step for its characterization. This work is aimed at finding better and more targeted mechanisms for pest-control. Paraguay is one of the largest producers of soya in the world, and this work has important economic implications for the country. 2) Matteo Pellegrini (University of California, Los Angeles) leads a lab with a high interest in algal genomics. The lab is currently sequencing the genome of the unicellular alga Cyclotella cryptica, a model organism for lipid accumulation. This work has application in the biofuel production industry. The Pellegrini lab has been using the functional predictions provided by S2F to annotate this algal genome. Importantly, the algorithms we developed for specific biological networks can be applied to other types of networks. Therefore, some of our algorithms have impact not only on those problems for which we originally developed them, but also on different problems in Systems Biology as well as in other disciplines such as Pharmacology, Medicine or even Social Networks. For example, we originally developed ClusterONE for detecting protein complexes from protein interaction networks. However, ClusterONE is a general algorithm for overlapping clustering on weighted large-scale networks. Therefore other research groups have successfully applied ClusterONE and proved its usefulness in several different domains.
Some examples include: 1. Medicine: Clustering a genome-scale network obtained by integrating SNP array, gene expression microarray, array-CGH, CGH, GWAS and gene mutation data. This study was aimed at identifying key functional modules in lung adenocarcinoma. 2. Pharmacology: Associating drugs with protein domains in the context of myocardial infarction. 3. Pharmacology: Studying the mechanisms of adverse side effects of Torcetrapib, a drug being developed to treat hypercholesterolemia (elevated cholesterol levels) and prevent cardiovascular disease (its development was halted in 2006). 4. Social Networks: Detecting communities in Social Networks. Our research on diffusion methods for protein function prediction led to the development of methods for inference and structure discovery in biological networks. We applied some of these methods within a collaboration project with the labs of Andrew Emili (University of Toronto) and Edward Marcotte (University of Texas, Austin) which was aimed at detecting human protein complexes. In particular, for this project we deployed: ClusterONE, our algorithm for detecting overlapping protein complexes from PPI networks; GOSSTO, our method for calculating semantic similarities on the Gene Ontology; an information diffusion method we developed for denoising protein interaction data. The protein interaction networks identified experimentally in Emili's lab were enriched with networks generated using comparative genomics approaches in Marcotte's lab. Then, in our lab, we integrated this network with a semantic similarity graph (obtained using GOSSTO), applied our denoising procedure, and finally clustered the resulting graph using ClusterONE. We thus obtained the largest catalogue to date of human protein complexes from cell culture. The human protein complexes repository contains all the data generated in this study in an easily navigable format.
These include all the pairwise protein interactions obtained through integration of the experimental data with public genomic evidence and the subunit composition of the 622 putative protein complexes obtained by clustering using ClusterONE. In our group, we have used S2F to participate in the second CAFA challenge, a competition of protein function prediction. Although this activity is within the academic domain, we think it has been important for acquiring visibility and engaging further collaborations. Finally, GFam was successfully used on Arabidopsis and the family groupings it provided were included in the TAIR10 genome release.
First Year Of Impact	2012
Sector	Agriculture, Food and Drink; Chemicals; Energy; Environment; Healthcare; Pharmaceuticals and Medical Biotechnology
Impact Types	Economic


Description	11. Ara-MKK-D: A bioinformatics and systems biology approach for the functional analysis of a growth-regulating MAP kinase pathway in Arabidopsis.
Amount	€ 189,670 (EUR)
Funding ID	41909
Organisation	European Commission
Sector	Public
Country	European Union (EU)
Start	09/2007
End	10/2009


Description	ABI innovation
Amount	$1,203,514 (USD)
Organisation	National Science Foundation (NSF)
Sector	Public
Country	United States
Start	08/2017
End	09/2020


Description	BBSRC Tools and Resources Development Fund
Amount	£114,257 (GBP)
Funding ID	BB/K004131/1
Organisation	Biotechnology and Biological Sciences Research Council (BBSRC)
Sector	Public
Country	United Kingdom
Start	06/2012
End	12/2013


Description	EU, Marie Curie Fellowship to Dr Beatrix Horvath
Amount	€ 309,235 (EUR)
Organisation	Marie Sklodowska-Curie Actions
Sector	Charity/Non Profit
Country	Global
Start	03/2013
End	05/2015


Description	EU, Marie Curie Fellowship to Dr Fabio Manfredini (with Prof Mark Brown)
Amount	€ 221,606 (EUR)
Organisation	Marie Sklodowska-Curie Actions
Sector	Charity/Non Profit
Country	Global
Start	03/2014
End	04/2016


Description	EU, Marie Curie Fellowship to Dr Papdi Csaba (with Prof L. Bogre)
Amount	€ 221,606 (EUR)
Organisation	Marie Sklodowska-Curie Actions
Sector	Charity/Non Profit
Country	Global
Start	03/2013
End	04/2015


Description	Inference of RBR network and dynamic RBR complexes during leaf development.
Amount	€ 319,888 (EUR)
Funding ID	330789
Organisation	European Commission
Sector	Public
Country	European Union (EU)
Start	03/2013
End	03/2015


Description	MAPK signalling network to adapt leaf growth to drought conditions.
Amount	€ 221,765 (EUR)
Funding ID	330713
Organisation	European Commission
Sector	Public
Country	European Union (EU)
Start	04/2013
End	05/2015


Description	Molecular signatures: a systems biology tool to understand how leaf development is constrained by drought.
Amount	€ 121,869 (EUR)
Funding ID	255035
Organisation	European Commission
Sector	Public
Country	European Union (EU)
Start	07/2010
End	07/2011


Description	Newton International Fellowship to Dr Tamas Nepusz
Amount	£98,000 (GBP)
Organisation	The Royal Society
Sector	Charity/Non Profit
Country	United Kingdom
Start	02/2009
End	02/2011


Title	Purification of protein complexes
Description	Use genomic tagged GFP lines for rapid purification of protein complexes and identification of protein complex components
Type Of Material	Biological samples
Year Produced	2016
Provided To Others?	Yes
Impact	Established collaborations and accepted manuscript in EMBO J in 2017


Title	mutant lines, antibodies, GFP-tagged lines
Description	Tools for lipid signalling kinases, MAPKs, E2F-RBR such as antibodies, mutant lines, GFP-tagged lines
Type Of Material	Cell line
Provided To Others?	Yes
Impact	Shared research material facilitates research in other groups.


Title	ClusterONE
Description	ClusterONE (Clustering with Overlapping Neighborhood Expansion) is a graph clustering algorithm that is able to handle weighted graphs and readily generates overlapping clusters. Owing to these properties, it is especially useful for detecting protein complexes in protein-protein interaction networks with associated confidence values. ClusterONE is available as a standalone command-line application, as a plugin to Cytoscape or ProCope and as a web application.
Type Of Material	Computer model/algorithm
Year Produced	2012
Provided To Others?	Yes
Impact	ClusterONE was one of the key steps in our Soluble Human Protein Complexes project, which provided the largest catalogue to date of human protein complexes from cell culture. The original publication describing the ClusterONE algorithm has received in excess of 130 citations so far (Google Scholar).
URL	http://www.paccanarolab.org/clusterone
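
The idea of growing overlapping clusters on a weighted graph can be sketched as follows. This is a simplified illustration only, not the released ClusterONE implementation: the cohesiveness score used here (internal edge weight divided by internal plus boundary edge weight), the greedy add-only growth rule and the toy network are assumptions made for this example.

```python
from itertools import combinations


def build_graph(edges):
    """Build an undirected weighted graph as a dict of dicts."""
    graph = {}
    for u, v, weight in edges:
        graph.setdefault(u, {})[v] = weight
        graph.setdefault(v, {})[u] = weight
    return graph


def cohesiveness(graph, group):
    """Internal edge weight divided by internal plus boundary edge weight."""
    inside = 0.0
    boundary = 0.0
    for node in group:
        for neighbour, weight in graph[node].items():
            if neighbour in group:
                inside += weight / 2.0  # each internal edge is visited twice
            else:
                boundary += weight
    total = inside + boundary
    return inside / total if total > 0 else 0.0


def grow_from_seed(graph, seed):
    """Greedily add the neighbouring node that most increases cohesiveness."""
    group = {seed}
    while True:
        frontier = {m for n in group for m in graph[n]} - group
        best_node, best_score = None, cohesiveness(graph, group)
        for candidate in sorted(frontier):
            score = cohesiveness(graph, group | {candidate})
            if score > best_score:
                best_node, best_score = candidate, score
        if best_node is None:
            return group  # no single addition improves the score
        group.add(best_node)


# Hypothetical network: two fully connected groups sharing the protein "d".
clique_one = ["a", "b", "c", "d"]
clique_two = ["d", "e", "f", "g"]
edges = [
    (u, v, 1.0)
    for clique in (clique_one, clique_two)
    for u, v in combinations(clique, 2)
]
graph = build_graph(edges)
first = grow_from_seed(graph, "a")
second = grow_from_seed(graph, "g")
print(sorted(first), sorted(second), sorted(first & second))
```

Because each cluster is grown independently from its own seed, the shared protein ends up in both groups, which is the kind of overlap a protein belonging to more than one complex would produce.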


Title	ConSAT
Description	ConSAT is a database of Consensus Signature Architectures. A consensus architecture is a set of non-overlapping domain assignments (considering insertions) which tries to define uniquely each protein. These architectures are used for prediction of GO categories, and to assign weighted words derived from mining PubMed abstracts. The database is available at http://paccanarolab.org/consat
Type Of Material	Database/Collection of data
Year Produced	2014
Provided To Others?	Yes
Impact	The results contained in this database are currently being used by two research groups who are actively working with organisms of high practical interest for crop production and for biofuel production (Pablo Sotelo, Universidad Nacional de Asuncion (Paraguay); Matteo Pellegrini, University of California, Los Angeles (USA)).
URL	http://paccanarolab.org/consat


Title	Disease Similarity
Description	We introduce a MeSH-based method that accurately quantifies similarity between heritable diseases at molecular level. This method effectively brings together the existing information about diseases that is scattered across the vast corpus of biomedical literature. We prove that sets of MeSH terms provide a highly descriptive representation of heritable disease and that the structure of MeSH provides a natural way of combining individual MeSH vocabularies. We show that our measure can be used effectively in the prediction of candidate disease genes.
Type Of Material	Computer model/algorithm
Year Produced	2015
Provided To Others?	Yes
Impact	There are no impacts yet, this work appeared only about 3 months ago.


Title	GFam
Description	GFam (Gene Family Annotation and Maintenance) is a command-line tool for automatic functional annotation of gene families. GFam offers a framework for complete genome initiatives and model organism resources to build domain-based gene families, derive meaningful functional labels and maintain family annotation across genome releases seamlessly. Our approach constitutes a unified system for grouping proteins based on evolutionary and functional relationships.
Type Of Material	Computer model/algorithm
Year Produced	2012
Provided To Others?	Yes
Impact	The family groupings provided by GFam for Arabidopsis were included in the tenth (and last) release of TAIR (The Arabidopsis Information Resource). The dataset produced with our method can be found at ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR10_genome_release/TAIR10_domain_architectures.tab.t10
URL	http://www.paccanarolab.org/gfam
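
Grouping proteins that share a domain architecture can be sketched in a few lines. This is an illustration of the general idea only, not GFam's actual procedure: the protein identifiers, the domain names and the choice to treat an architecture as an ordered tuple of domains are assumptions made for this example.

```python
from collections import defaultdict


def group_by_architecture(domain_assignments):
    """Group protein identifiers by their ordered tuple of domains."""
    families = defaultdict(list)
    for protein, domains in domain_assignments.items():
        # The ordered tuple of domains acts as the family key.
        families[tuple(domains)].append(protein)
    return dict(families)


# Hypothetical domain assignments for four proteins.
assignments = {
    "protein_1": ["kinase"],
    "protein_2": ["SH2", "kinase"],
    "protein_3": ["kinase"],
    "protein_4": ["SH2", "kinase"],
}
families = group_by_architecture(assignments)
for architecture, members in families.items():
    print(" + ".join(architecture), "->", members)
```

A functional label attached to one member of a family could then be proposed for the other members, on the assumption that a shared architecture implies a shared function.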


Title	LanDis
Description	Disease similarity measures quantify the distance between disease modules on the interactome. These measures can provide a starting point for in-depth exploration of the diseases at molecular level, and are of particular relevance for orphan diseases. LanDis is an explorable database, containing the disease similarities of 28.5 million pairs of heritable diseases. These are calculated by summarising the existing phenotype information about diseases through large scale analysis of hand curated data.
Type Of Material	Database/Collection of data
Year Produced	2016
Provided To Others?	Yes
Impact	The paper presenting this database/model is still under review, so most scientists are not yet aware of its existence. However, I have already presented it at conferences and meetings, receiving extremely good feedback from everyone who tried it, especially clinician scientists.
URL	http://www.paccanarolab.org/landis/


Title	MAPK
Description	This is a general repository of MAPK sequences and orthologues in the plant kingdom. Orthologues were inferred using the InParanoid and Plaza orthologue identifier programs. This site also contains pointers to published evidence for constructing MAPK networks in Arabidopsis, yeast and human, including high-throughput and targeted experiments. The base dataset included here appeared in the paper by Dóczi, Ökrész, Romero, Paccanaro and Bögre (see reference).
Type Of Material	Database/Collection of data
Year Produced	2012
Provided To Others?	Yes
Impact	The original paper has been cited more than 20 times (Google Scholar).
URL	http://paccanarolab.org/static_content/MAPKevol/index.html


Title	S2F
Description	S2F (Sequence to Function) is a software package implementing our diffusion-based method for predicting protein function in organisms for which little or no experimental data is available and the only available information is the set of protein sequences. Protein function is predicted with respect to terms in the Gene Ontology (GO). For a given protein the system provides a probability distribution over the GO terms, which is consistent with the ontology structure, i.e. the probability of a more general term is always higher than the probability of a more specific one. The stand-alone package is self-contained, including tools for generating a set of initial seed functional labels to diffuse as well as methods for inferring the biological networks onto which to diffuse the labels.
Type Of Material	Computer model/algorithm
Year Produced	2012
Provided To Others?	Yes
Impact	The results obtained by this algorithm are currently being used by two research groups who are actively working with organisms of high practical interest for crop production and for biofuel production (Pablo Sotelo, Universidad Nacional de Asuncion (Paraguay); Matteo Pellegrini, University of California, Los Angeles (USA)).
URL	http://paccanarolab.org/s2f
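
The ontology-consistency property described above (a more general GO term never scores lower than a more specific one) can be illustrated with a minimal sketch. The record does not say how S2F enforces this property, so the max-propagation rule below is an assumption chosen for illustration, and the function name, term names and scores are all hypothetical.

```python
from typing import Dict, List


def enforce_ontology_consistency(
    scores: Dict[str, float], parents: Dict[str, List[str]]
) -> Dict[str, float]:
    """Return scores in which no term scores lower than any of its descendants.

    Assumption for illustration only: each term is raised to the maximum
    score found among its descendants. This is not taken from S2F itself.
    """
    # Invert the child -> parents map into parent -> children.
    children: Dict[str, List[str]] = {}
    for term, term_parents in parents.items():
        for parent in term_parents:
            children.setdefault(parent, []).append(term)

    resolved: Dict[str, float] = {}

    def resolve(term: str) -> float:
        # Memoised depth-first pass: a term's score is the maximum of its
        # own raw score and the resolved scores of all its children.
        if term not in resolved:
            best = scores.get(term, 0.0)
            for child in children.get(term, []):
                best = max(best, resolve(child))
            resolved[term] = best
        return resolved[term]

    all_terms = set(scores) | set(parents) | set(children)
    return {term: resolve(term) for term in all_terms}


# Hypothetical three-term chain: root <- binding <- protein_binding.
toy_parents = {"root": [], "binding": ["root"], "protein_binding": ["binding"]}
raw_scores = {"root": 0.2, "binding": 0.5, "protein_binding": 0.9}
consistent = enforce_ontology_consistency(raw_scores, toy_parents)
```

In this toy case the specific term keeps its raw score and both of its ancestors are raised to match it, so the ordering required by the ontology holds.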


Title	SemanticSimilarity
Description	The introduction of ontologies for gene functional annotation allows us to compare genes by quantifying the similarity of the terms with which they are annotated. These comparisons are important as they contribute to the inference of functional relationships between gene products by providing a perspective that complements both experimental information and sequence-based approaches. The proposed measure, which we call the random walk contribution (RWC), can be integrated with any standard semantic similarity measure, which we call the host similarity measure (HSM), to yield an integrated similarity measure (ISM) that takes into account the whole ontology structure. In other words, our random walk similarity measure is a kind of 'add-on' to one's favourite underlying similarity measure.
Type Of Material	Computer model/algorithm
Year Produced	2012
Provided To Others?	Yes
Impact	One of the key steps in our Soluble Human Complexes project was the application of our Semantic Similarity method for calculating semantic similarities between human genes on the Gene Ontology. To date, the publication containing the method itself has been cited 22 times (Google Scholar).
URL	http://www.paccanarolab.org/static_content/gosim/


Title	SolubleComplexes
Description	Our research on diffusion methods for protein function prediction led to the development of methods for inference and structure discovery in biological networks. We applied some of these methods within a collaboration project with the labs of Andrew Emili (University of Toronto) and Edward Marcotte (University of Texas, Austin) which was aimed at detecting human protein complexes. In particular, for this project we deployed: ClusterONE, our algorithm for detecting overlapping protein complexes from PPI networks; GOSSTO, our method for calculating semantic similarities on the Gene Ontology; an information diffusion method we developed for denoising protein interaction data. The protein interaction networks identified experimentally in Emili's lab were enriched with networks generated using comparative genomics approaches in Marcotte's lab. Then, in my lab, we integrated this network with a semantic similarity graph (obtained using GOSSTO), applied our denoising procedure, and finally clustered the resulting graph using ClusterONE. We thus obtained the largest catalogue to date of human protein complexes from cell culture. The human protein complexes repository contains all the data generated in this study in an easily navigable format. These include all the pairwise protein interactions obtained through integration of the experimental data with public genomic evidence and the subunit composition of the 622 putative protein complexes obtained by clustering using ClusterONE.
Type Of Material	Database/Collection of data
Year Produced	2012
Provided To Others?	Yes
Impact	The original publication in which this dataset was first released has been cited, to date, more than 100 times (Google Scholar).
URL	http://human.med.utoronto.ca/php/data_download.php


Title	mutation3d
Description	A new algorithm and Web server, mutation3D (http://mutation3d.org), proposes driver genes in cancer by identifying clusters of amino acid substitutions within tertiary protein structures. We demonstrated the feasibility of using a 3D clustering approach to implicate proteins in cancer based on explorations of single proteins using the mutation3D Web interface.
Type Of Material	Computer model/algorithm
Year Produced	2016
Provided To Others?	Yes
Impact	No notable impacts yet; the paper appeared only about a month ago.
URL	http://mutation3d.org/


Description	Albrecht Von Arnim
Organisation	University of Tennessee
Department	Department of Geography
Country	United States
Sector	Academic/University
PI Contribution	TOR and S6K signalling, EBP1
Collaborator Contribution	regulation of translation, making constructs for root meristem specific analysis of translatome and translational regulation
Impact	project partner, manuscripts in preparation
Start Year	2015


Description	Cancer genomics -- Haiyuan Yu (Cornell University)
Organisation	Cornell University
Country	United States
Sector	Academic/University
PI Contribution	We recently started a collaboration with the Yu lab in the field of cancer genomics, where we contributed to the development of a clustering method to predict cancer mutation hotspots in proteins. We used our expertise in clustering methods to provide an efficient solution and integrate it into a comprehensive analysis pipeline.
Collaborator Contribution	Prof Yu and his lab have great expertise in the field of cancer genomics. They have contributed the biological question and the data.
Impact	A journal paper describing the method is currently under review in BMC Biology. The collaboration is multi-disciplinary involving biologists and computational scientists.
Start Year	2013


Description	Clustering of protein interaction networks -- Haiyuan Yu (Cornell University)
Organisation	Cornell University
Country	United States
Sector	Academic/University
PI Contribution	We developed ClusterONE, a new method for protein complex detection using clustering on protein-protein interaction networks.
Collaborator Contribution	Haiyuan Yu is an expert in protein-protein interaction screening and protein-protein interaction prediction, and he proposed different ways to evaluate the quality of the predictions. He also gave important feedback on the method. ClusterONE was published in 2012 in Nature Methods (see below).
Impact	This is an interdisciplinary collaboration between molecular biologists (Yu lab) and computational scientists (our lab). The collaboration has produced one clustering algorithm for detecting protein complexes from protein-protein interaction networks, and its corresponding implementation (ClusterONE). The details of the publication are the following: T. Nepusz, H. Yu, and A. Paccanaro Detecting overlapping protein complexes in protein-protein interaction networks Nature Methods, vol. 9, pp. 471-472, 2012. The software (ClusterONE) is available on our website ( http://www.paccanarolab.org/clusterone/ ) and is released under a free software license (it can be freely downloaded, executed and, if desired, modified). The collaboration is multi-disciplinary involving biologists and computational scientists.
Start Year	2011


Description	Development of a web resource for protein functional annotation -- Raj Sasidharan (BASF)
Organisation	BASF
Country	Germany
Sector	Private
PI Contribution	We developed ConSAT, a tool for protein functional annotation using protein consensus domain architectures. In this project a new algorithm was developed and a web resource (ConSAT) with precomputed results was created (available at http://paccanarolab.org/consat ). The method includes three different types of functional prediction methods, two assigning Gene Ontology terms from the protein architecture, and one assigning English weighted words.
Collaborator Contribution	Rajkumar Sasidharan's help was very important for the development of this project, mainly in two different fields: first, he provided expert knowledge in structural biology; second, he helped giving feedback on the usability of the web server, leading to its improvement.
Impact	The project's main output is the above referenced website. Publications are currently being written. The collaboration is multi-disciplinary involving biologists and computational scientists.
Start Year	2012


Description	Disease gene prioritisation by the combination of gene networks -- Giorgio Valentini (Milan)
Organisation	University of Milan
Country	Italy
Sector	Academic/University
PI Contribution	We preprocessed, cleaned and provided a set of biological datasets to Giorgio Valentini to assist in the development of several methods of gene networks combination for disease-gene prioritisation (that is, finding new causative genes for diseases). We provided, among others, several semantic similarity networks among sets of human genes. We also suggested new evaluation measures for this task.
Collaborator Contribution	Giorgio Valentini developed a set of algorithms for finding new disease-gene associations. In that context he proposed many different ways in which different gene networks (both weighted and unweighted) could be combined to produce a network in which two linked genes are expected to share an underlying disease. The new predictions are given as an output of the paper (available at http://homes.di.unimi.it/re/suppmat/genesmeshnetwpred/supmatTBL1.html ).
Impact	Apart from the above mentioned URL, the collaboration led to the following publication: G Valentini, A Paccanaro, H Caniza, AE Romero, M Re An extensive analysis of disease-gene associations using network integration and fast kernel-based gene prioritization methods Artificial Intelligence in Medicine 61 (2), 63-78
Start Year	2013


Description	Dr Tamas Meszaros
Organisation	Semmelweis University
Country	Hungary
Sector	Academic/University
PI Contribution	In vitro translation of RBR and E2Fs and CDK kinases for protein-protein interaction and phosphorylation studies. In vitro translation of MAPKs and MKKs. Study protein-protein interaction and activation.
Collaborator Contribution	In vitro protein interaction and phosphorylation screen
Impact	Joint publications, projects
Start Year	2015


Description	Dr Zoltan Magyar
Organisation	Hungarian Academy of Sciences (MTA)
Department	Biological Research Centre (BRC)
Country	Hungary
Sector	Academic/University
PI Contribution	Working on RBR-E2F, connecting translational regulation and cell cycle
Collaborator Contribution	Providing antibodies and mutants in the RBR-E2F pathway
Impact	research papers, collaboration with Bayer Crop Science


Description	Drug side effect prediction (with Mark Gerstein and Shantao Li, Yale University)
Organisation	Yale University
Country	United States
Sector	Academic/University
PI Contribution	We have developed a new method for predicting side effects of drugs. Our preliminary results show that our method represents a great improvement with respect to the existing state of the art in terms of side effect prediction. Moreover, it is the first method that can predict the expected frequency of side effects in the population.
Collaborator Contribution	They are helping us to provide an explanation of some aspects of our models in terms of the biology/biochemistry/pharmacology.
Impact	A journal article is in preparation. The collaboration is multi-disciplinary involving biologists and computer scientists.
Start Year	2017


Description	Enhancer prediction using epigenetic signals in different mouse tissues (with Mark Gerstein and Mengting Gu, Yale University)
Organisation	Yale University
Department	Department of Molecular Biophysics and Biochemistry
Country	United States
Sector	Academic/University
PI Contribution	Apply machine learning, signal processing and pattern recognition methods for improving the performance of the enhancer prediction for different tissues in the mouse genome. Preliminary results indicate that ensemble methods perform better than other classifiers. More advanced methods for feature extraction such as deep learning are going to be tested on the data.
Collaborator Contribution	Members of the Gerstein Lab developed a pattern recognition method called matched filters for enhancer prediction. However, our preliminary results show that advanced machine learning may improve prediction accuracy. The Gerstein Lab supplied the data and will interpret the results in the context of enhancers and promoters in the genome.
Impact	The collaboration is multi-disciplinary involving biologists and computer scientists.
Start Year	2017


Description	Finding evolutionary relations between plant MAPKs -- Laszlo Bogre (Royal Holloway)
Organisation	Royal Holloway, University of London
Department	School of Biological Sciences
Country	United Kingdom
Sector	Academic/University
PI Contribution	We collaborated with the Bogre lab in the elucidation of the evolutionary relations between the different Mitogen-activated protein kinases (MAPKs) in different model plants. Using computational techniques we were able to depict some of these relations, ultimately leading to the construction of the 'Plant MAPK Network Resource', available at http://www.paccanarolab.org/static_content/MAPKevol/ .
Collaborator Contribution	Prof Bogre and his team provided us with their MAPK dataset, their expert knowledge in the field and their biological questions. This led to the improvement of our methods for ortholog detection. The collaboration is still ongoing and we are currently developing new computational methods to detect relations between MAPKs and substrates.
Impact	The outputs of this project are two: one web resource (the plant MAPK network resource, see above) and one joint publication: R. Dóczi, L. Ökrész, A. E. Romero, A. Paccanaro, and L. Bögre Exploring the evolutionary path of plant MAPK networks Trends in Plant Science, vol. 17, iss. 9, pp. 518-525, 2012. The collaboration is multi-disciplinary involving biologists and computational scientists.
Start Year	2011


Description	Functional prediction for Cyclotella cryptica -- Matteo Pellegrini (UCLA)
Organisation	University of California, Los Angeles (UCLA)
Country	United States
Sector	Academic/University
PI Contribution	The Pellegrini Lab is interested in better understanding certain metabolic pathways in the alga Cyclotella cryptica. This alga is economically important to the growing algal biofuels industry because of its high levels of lipid production. An important step towards understanding those pathways is to provide a functional annotation of the organism's genes. Our contribution to Prof Pellegrini's research has been to provide a functional annotation of this alga using ConSAT and S2F (the function annotation tools that we developed in the context of our grants).
Collaborator Contribution	Though this collaboration is still ongoing, feedback from the Pellegrini lab has been incorporated into our tool to make it more usable.
Impact	We expect journal publications to be written soon. The collaboration is multi-disciplinary involving biologists and computational scientists.
Start Year	2012


Description	Functional prediction for Macrophomina phaseolina -- Pablo Sotelo (Universidad Nacional de Asuncion)
Organisation	National University of Asuncion
Country	Paraguay
Sector	Academic/University
PI Contribution	We have provided the Sotelo lab with a complete functional annotation of the fungus Macrophomina phaseolina, produced using both S2F and ConSAT, our systems for protein function prediction. Macrophomina phaseolina has recently been sequenced and is responsible for a plague affecting many crops, particularly soya, of which Paraguay is one of the largest producers in the world. Our contribution will ultimately help both the development of new pesticides to fight this fungus and research into genetically modified varieties of soya that are resistant to it.
Collaborator Contribution	The Sotelo lab has been providing us with feedback on our system and on the accuracy of our predictions, which is very helpful for improving it.
Impact	This is a multidisciplinary collaboration, between computational scientists (Paccanaro lab) and life scientists (Sotelo lab). We expect to produce a joint publication in the near future as an output of this collaboration. The collaboration is multi-disciplinary involving biologists and computational scientists.
Start Year	2014


Description	GFam, a tool to predict protein architectures -- Raj Sasidharan (UCLA, TAIR)
Organisation	BASF
Country	Germany
Sector	Private
PI Contribution	Our contribution (the GFam software) was motivated by the need of TAIR (The Arabidopsis Information Resource, a public hub initiative to understand plant genomes) for an automatic tool to curate the functional categories assigned to the official release of the Arabidopsis thaliana genome. GFam was specifically created for this purpose, although it was published as a general tool for protein function annotation. GFam was used to produce the tenth official release of the functional annotation of Arabidopsis thaliana, the model organism for plants.
Collaborator Contribution	Several of the ideas implemented in GFam came from the semi-manual procedures used in TAIR by Raj Sasidharan and others to perform functional annotation of protein sequences.
Impact	The GFam families for Arabidopsis can be found in TAIR (ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR10_genome_release/TAIR10_domain_architectures.tab.t10) The collaboration also led to a publication: R. Sasidharan, T. Nepusz, D. Swarbreck, E. Huala, and A. Paccanaro GFam: a platform for automatic annotation of gene families Nucleic Acids Research, vol. 40, iss. 19, p. 152, 2012. The collaboration is multi-disciplinary involving biologists and computational scientists.
Start Year	2008


Description	GFam, a tool to predict protein architectures -- Raj Sasidharan (UCLA, TAIR)
Organisation	University of California, Los Angeles (UCLA)
Country	United States
Sector	Academic/University
PI Contribution	Our contribution (the GFam software) was motivated by the need of TAIR (The Arabidopsis Information Resource, a public hub initiative to understand plant genomes) for an automatic tool to curate the functional categories assigned to the official release of the Arabidopsis thaliana genome. GFam was specifically created for this purpose, although it was published as a general tool for protein function annotation. GFam was used to produce the tenth official release of the functional annotation of Arabidopsis thaliana, the model organism for plants.
Collaborator Contribution	Several of the ideas implemented in GFam came from the semi-manual procedures used in TAIR by Raj Sasidharan and others to perform functional annotation of protein sequences.
Impact	The GFam families for Arabidopsis can be found in TAIR (ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR10_genome_release/TAIR10_domain_architectures.tab.t10) The collaboration also led to a publication: R. Sasidharan, T. Nepusz, D. Swarbreck, E. Huala, and A. Paccanaro GFam: a platform for automatic annotation of gene families Nucleic Acids Research, vol. 40, iss. 19, p. 152, 2012. The collaboration is multi-disciplinary involving biologists and computational scientists.
Start Year	2008


Description	Gene prioritisation for lymphoma growth on mutagenesis study
Organisation	Medical Research Council (MRC)
Department	MRC Clinical Sciences Centre (CSC)
Country	United Kingdom
Sector	Public
PI Contribution	Prediction of lymphoma growth stage by analysis of gene clonality values from a sample. Prioritisation of genes selected from broad loci sources involved in lymphomagenesis. This process yielded a set of about 20 genes selected for further studies.
Collaborator Contribution	Mutagenesis developed lymphoma studies on over 500 mice, with the corresponding sample clonality analysis. Ongoing gene relevance analysis.
Impact	Studies are still ongoing on the relevance of the selected genes. We expect to obtain a publication about this work when the process finishes. The study is multi-disciplinary and it comprises the following disciplines: cancer genomics, molecular biotechnology, systems biology, computer science, big data analysis, bioinformatics.
Start Year	2015


Description	GoSSTo, a Tool for computing Gene Ontology Semantic Similarities -- Giorgio Valentini (University of Milan)
Organisation	University of Milan
Country	Italy
Sector	Academic/University
PI Contribution	We developed GoSSTo, a command-line tool to compute semantic similarities between gene products. The tool implements an algorithm previously published by our group, with the aim of making it accessible to any researcher. We also implemented GoSSToWeb, a web server providing easier access to this tool for biological researchers.
Collaborator Contribution	Giorgio Valentini and his lab provided help for the development of the web interface of our tool for computing semantic similarities which was recently published, and also provided user feedback on the command line tool.
Impact	The output is constituted by our software tools (GoSSTo and GoSSToWeb). Our web tool, available at www.paccanarolab.org/gosstoweb has had over 50 registered users and 70 submitted jobs thus far. Moreover, the collaboration is manifested in the following publication: H. Caniza, A. E. Romero, S. Heron, H. Yang, A. Devoto, M. Frasca, M. Mesiti, G. Valentini, and A. Paccanaro, GOssTo: a user-friendly stand-alone and web tool for calculating semantic similarities on the Gene Ontology Bioinformatics, vol. 30, iss. pp. 2235-2236, 2014. A preliminary version of this paper was submitted and accepted to the ISMB conference in 2013: H. Caniza, A. E. Romero, S. Heron, H. Yang, M. Frasca, M. Mesiti, G. Valentini, and A. Paccanaro. 'GOssTo and GOssToWeb: user-friendly tools for calculating semantic similarities on the Gene Ontology.' Bio-Ontologies SIG 2013-ISMB 2013 (2013).
Start Year	2012


Description	Human Protein Complexes -- Emili (Un. Toronto), Marcotte (Un. Texas, Austin)
Organisation	University of Toronto
Country	Canada
Sector	Academic/University
PI Contribution	Our research on diffusion methods for protein function prediction led to the development of methods for inference and structure discovery in biological networks. We applied some of these methods within a collaboration project with the labs of Andrew Emili (University of Toronto) and Edward Marcotte (University of Texas, Austin) which was aimed at detecting human protein complexes. In particular, for this project we deployed: ClusterONE, our algorithm for detecting overlapping protein complexes from PPI networks; GOSSTO, our method for calculating semantic similarities on the Gene Ontology; an information diffusion method we developed for denoising protein interaction data. The protein interaction networks identified experimentally in Emili's lab were enriched with networks generated using comparative genomics approaches in Marcotte's lab. Then, in my lab, we integrated this network with a semantic similarity graph (obtained using GOSSTO), applied our denoising procedure, and finally clustered the resulting graph using ClusterONE. We thus obtained the largest catalogue to date of human protein complexes from cell culture.
Collaborator Contribution	The protein interaction networks identified experimentally in Emili's lab were enriched with networks generated using comparative genomics approaches in Marcotte's lab. Then, in my lab, we integrated this network with a semantic similarity graph (obtained using GOSSTO), applied our denoising procedure, and finally clustered the resulting graph using ClusterONE.
Impact	1) The human protein complexes repository contains all the data generated in this study in an easily navigable format. These include all the pairwise protein interactions obtained through integration of the experimental data with public genomic evidence and the subunit composition of the 622 putative protein complexes obtained by clustering using ClusterONE. 2) P. C. Havugimana, T. G. Hart, T. Nepusz, H. Yang, A. L. Turinsky, Z. Li, P. I. Wang, D. R. Boutz, V. Fong, S. Phanse, M. Babu, S. A. Craig, P. Hu, C. Wan, J. Vlasblom, V. U. Dar, A. Bezginov, G. W. Clark, G. C. Wu, S. J. Wodak, E. R. Tillier, A. Paccanaro, E. M. Marcotte, and A. Emili A census of human soluble protein complexes Cell, vol. 150, iss. 5, pp. 1068-1081, 2012. The collaboration is multi-disciplinary involving biologists and computational scientists.
Start Year	2009


Description	Learning disease-gene associations by exploiting disease similarities (with Mark Gerstein, Yale University)
Organisation	Yale University
Department	Department of Molecular Biophysics and Biochemistry
Country	United States
Sector	Academic/University
PI Contribution	We recently developed a disease similarity measure and calculated all the disease-disease similarities between OMIM diseases. We established a prior disease-gene association probability and provided training and testing datasets for the learning. We fitted the model.
Collaborator Contribution	Developed a Lipschitz diffusion model, that we used to spread the disease-gene association through the interactome, and a fully functional fast implementation of the algorithm.
Impact	The collaboration is multi-disciplinary involving biologists and computer scientists.
Start Year	2017


Description	Network-based Genome Analysis Reveals Structural and Functional Properties of Genes (with Mark Gerstein and Koon-Kiu Yan, Yale University)
Organisation	Yale University
Country	United States
Sector	Academic/University
PI Contribution	We have analysed the spatial proximity of all pathway genes (KEGG Database) across various cancer cell lines. Our preliminary results provide strong evidence for a relationship between disease pathways and cancer. The study also helps identify candidate genes for a number of diseases.
Collaborator Contribution	They have successfully applied network community detection techniques to Hi-C data (three-dimensional architecture of genomes) in order to identify topologically associating domains (TADs) of genomic regions.
Impact	The collaboration is multi-disciplinary involving biologists and computer scientists.
Start Year	2017


Description	Objective of the project is to elucidate the mechanism of action of a drug for multiple sclerosis
Organisation	Imperial College London
Department	Faculty of Medicine
Country	United Kingdom
Sector	Academic/University
PI Contribution	To analyse transcriptomics data obtained from a trial on human patients using network medicine approaches.
Collaborator Contribution	They hosted a trial with human patients and extracted transcriptomics data at different times.
Impact	No outputs yet. This collaboration is multidisciplinary involving: computer science, network science, machine learning, medicine, biology and pharmacology.
Start Year	2015


Description	Pavla Binarova
Organisation	Academy of Sciences of the Czech Republic
Country	Czech Republic
Sector	Academic/University
PI Contribution	Analysing RBR phosphorylation and interaction with microtubules.
Collaborator Contribution	Microtubules, cell biology
Impact	research papers, joint projects
Start Year	2010


Description	Robert Doczi. MAPK evolutionary network, MAPK substrate prediction.
Organisation	Hungarian Academy of Sciences (MTA)
Department	Centre for Agricultural Research (ATK)
Country	Hungary
Sector	Academic/University
PI Contribution	The Paccanaro group analysed MAPK docking sites, and MAPK-MKK interaction surfaces when there is no canonical docking site.
Collaborator Contribution	Developed a high throughput in vivo MAPK activation screen
Impact	publications. Multidisciplinary collaboration. Computer Science, Biology
Start Year	2010


Title	CONSAT
Description	ConSAT is a terminal-based application that functionally annotates a set of proteins using their consensus domain architectures. Proteins are assigned Gene Ontology terms based on the domain composition of the architecture and on the experimentally known terms of proteins with a given architecture. To help produce a description of a protein sequence, it also assigns weighted English words derived from mining PubMed articles. ConSAT is written in Python.
Type Of Technology	Software
Year Produced	2014
Open Source License?	Yes
Impact	ConSAT has been used to produce the database of the same name (see 'databases'), which is being used in two external collaborations (with Pablo Sotelo and Matteo Pellegrini, see 'collaborations'). ConSAT was also used for our participation in the second CAFA challenge, organized by an international research community of more than 50 research groups devoted to the study of protein function prediction methods.
URL	http://paccanarolab.org/ConSAT


Title	ClusterONE
Description	ClusterONE (Clustering with Overlapping Neighborhood Expansion) is a graph clustering algorithm that is able to handle weighted graphs and readily generates overlapping clusters. Owing to these properties, it is especially useful for detecting protein complexes in protein-protein interaction networks with associated confidence values. ClusterONE is available as a standalone command-line application, as a plugin to Cytoscape or ProCope.
Type Of Technology	Software
Year Produced	2012
Open Source License?	Yes
Impact	For the creation of the Human protein complexes repository (http://human.med.utoronto.ca/) the standalone version of ClusterONE was used to produce the putative protein complexes. This project provided the largest catalogue to date of human protein complexes from cell culture. All versions of the ClusterONE Cytoscape plugin have been downloaded a total of 4801 times, with 5 releases produced so far. The ClusterONE publication has in excess of 130 citations.
URL	http://paccanarolab.org/clusterone
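
To give a feel for how a cluster can be grown on a weighted graph, here is a simplified sketch of greedy neighbourhood expansion. It is not the ClusterONE implementation: the cohesiveness score (internal weight divided by internal plus boundary weight plus a size penalty) is written from the general form described in the ClusterONE publication, the steps that merge and filter overlapping clusters are omitted, and all function names, parameter names and the toy graph are hypothetical.

```python
from typing import Dict, Set

# Undirected weighted graph stored as a symmetric adjacency map.
Graph = Dict[str, Dict[str, float]]


def cohesiveness(graph: Graph, group: Set[str], penalty: float = 0.0) -> float:
    """Internal weight / (internal + boundary weight + penalty * group size)."""
    w_in = 0.0
    w_bound = 0.0
    for u in group:
        for v, w in graph[u].items():
            if v in group:
                w_in += w  # every internal edge is visited from both ends
            else:
                w_bound += w
    w_in /= 2.0
    denom = w_in + w_bound + penalty * len(group)
    return w_in / denom if denom > 0 else 0.0


def grow_from_seed(graph: Graph, seed: str, penalty: float = 0.0) -> Set[str]:
    """Greedily add or remove one node at a time while cohesiveness improves."""
    group = {seed}
    while True:
        best_score = cohesiveness(graph, group, penalty)
        best_group = None
        frontier = {v for u in group for v in graph[u]} - group
        # Candidate moves: add one neighbouring node ...
        candidates = [group | {v} for v in sorted(frontier)]
        # ... or remove one member (never emptying the group).
        if len(group) > 1:
            candidates += [group - {u} for u in sorted(group)]
        for candidate in candidates:
            score = cohesiveness(graph, candidate, penalty)
            if score > best_score:
                best_score, best_group = score, candidate
        if best_group is None:
            return group  # no single move improves the score
        group = best_group


def add_edge(graph: Graph, u: str, v: str, w: float) -> None:
    graph.setdefault(u, {})[v] = w
    graph.setdefault(v, {})[u] = w


# Toy graph: two tight triangles joined by one low-confidence edge.
toy: Graph = {}
for u, v in [("a", "b"), ("b", "c"), ("a", "c"), ("d", "e"), ("e", "f"), ("d", "f")]:
    add_edge(toy, u, v, 1.0)
add_edge(toy, "c", "d", 0.1)

cluster_a = grow_from_seed(toy, "a")
cluster_f = grow_from_seed(toy, "f")
```

On the toy graph, each seed recovers its own triangle: crossing the weak bridge would add more boundary weight than internal weight, so the expansion stops.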


Title	GFAM
Description	GFam (Gene Family Annotation and Maintenance) is a command-line tool for automatic functional annotation of gene families. GFam offers a framework for complete genome initiatives and model organism resources to build domain-based gene families, derive meaningful functional labels and maintain family annotation across genome releases seamlessly. Our approach constitutes a unified system for grouping proteins based on evolutionary and functional relationships.
Type Of Technology	Software
Year Produced	2012
Open Source License?	Yes
Impact	The family groupings provided by GFam for Arabidopsis were included in TAIR10 genome release. The results are available from the official TAIR (The Arabidopsis Information Resource) website: ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR10_genome_release/TAIR10_domain_architectures.tab.t10
URL	http://paccanarolab.org/gfam


Title	GOSSTO
Description	Semantic similarity calculations aim to provide a quantifiable measure of functional relatedness of genes by assessing the similarity of the functional terms with which they are annotated. GOSSTO (Gene Ontology Semantic Similarity Tool) is a tool for calculating this measure with respect to Gene Ontology terms. It implements an improved diffusion-based measure developed in this project, as well as several well-established measures, such as those proposed by Resnik, Lin, Jiang, simUI. Powerful extension capabilities are included in GOSSTO, enabling the user to extend it with new similarity measures. GOSSTO is available as a standalone command-line application running on Windows, GNU/Linux and MacOS as well as a web tool. The webtool is available at www.paccanarolab.org/gosstoweb
Type Of Technology	Software
Year Produced	2014
Open Source License?	Yes
Impact	For the creation of the Human protein complexes repository (http://human.med.utoronto.ca/) the standalone version of GOSSTO was used to compute semantic similarities between human genes in the Gene Ontology. This project provided the largest catalogue to date of human protein complexes from cell culture. Our web tool, available at www.paccanarolab.org/gosstoweb has had over 50 registered users and 70 submitted jobs thus far.
URL	http://paccanarolab.org/gossto
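
As an illustration of one of the well-established measures mentioned above, the following sketch computes Resnik similarity, which scores two terms by the information content of their most informative common ancestor. It does not use GOSSTO's code or interface; the ontology, annotation counts and function names are hypothetical toy values chosen for illustration.

```python
import math
from typing import Dict, List, Set


def ancestors(term: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Return the term itself plus every ancestor reachable via `parents`."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(
    parents: Dict[str, List[str]], direct_counts: Dict[str, int]
) -> Dict[str, float]:
    """IC(t) = -log p(t), where p(t) counts annotations to t or its descendants."""
    totals: Dict[str, int] = {}
    for term, count in direct_counts.items():
        # An annotation to a term also counts for all of its ancestors.
        for anc in ancestors(term, parents):
            totals[anc] = totals.get(anc, 0) + count
    n = sum(direct_counts.values())
    return {term: -math.log(total / n) for term, total in totals.items() if total > 0}


def resnik(t1: str, t2: str, parents: Dict[str, List[str]], ic: Dict[str, float]) -> float:
    """Information content of the most informative common ancestor."""
    common = ancestors(t1, parents) & ancestors(t2, parents)
    return max((ic.get(term, 0.0) for term in common), default=0.0)


# Hypothetical toy ontology (child -> parents) and direct annotation counts.
toy_parents = {
    "root": [],
    "binding": ["root"],
    "catalysis": ["root"],
    "dna_binding": ["binding"],
    "rna_binding": ["binding"],
}
toy_counts = {"dna_binding": 2, "rna_binding": 2, "catalysis": 4}

ic = information_content(toy_parents, toy_counts)
sim_siblings = resnik("dna_binding", "rna_binding", toy_parents, ic)
sim_distant = resnik("dna_binding", "catalysis", toy_parents, ic)
```

Here the two binding terms share the ancestor "binding", which covers half of all annotations, so their similarity is log 2; the binding and catalysis terms share only the root, whose information content is zero.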


Title	JustClust
Description	JustClust is a tool for analysing biological data with cluster analysis. JustClust can handle many formats of data and cluster the data with many state-of-the-art techniques. The aim of JustClust is to provide an easy-to-use application which can perform any analysis on any data.
Type Of Technology	Software
Year Produced	2014
Open Source License?	Yes
Impact	The manuscript is currently being finalised.
URL	http://paccanarolab.org/justclust


Title	LanDis
Description	Disease similarity measures quantify the distance between disease modules on the interactome. These measures can provide a starting point for in-depth exploration of the diseases at the molecular level, and are of particular relevance for orphan diseases. LanDis is a freely available web-based interactive tool that allows domain experts, medical doctors and the larger community to graphically navigate the landscape of human disease similarities. LanDis is designed to explore the similarity landscape of over 28.5 million pairs of heritable diseases, introducing a fully interactive and navigable plot in which diseases are represented as nodes and their pairwise similarity as the links joining them.
Type Of Technology	Webtool/Application
Year Produced	2016
Impact	The paper presenting this web tool is still under review, so most scientists are not yet aware of its existence. However, I have already presented it at conferences and meetings, receiving extremely good feedback from everyone who tried it, especially clinician scientists.
URL	http://www.paccanarolab.org/landis


Title	S2F
Description	S2F (Sequence-to-Function) is a software package implementing our diffusion-based method for predicting protein function in organisms for which little or no experimental data is available and the only available information is the set of protein sequences. Protein function is predicted with respect to terms in the Gene Ontology (GO). For a given protein the system provides a probability distribution over the GO terms, which is consistent with the ontology structure, i.e. the probability of a more general term is always higher than the probability of a more specific one. The stand-alone package is self-contained, including tools for generating a set of initial seed functional labels to diffuse as well as methods for inferring the biological networks onto which to diffuse the labels.
Type Of Technology	Software
Year Produced	2014
Open Source License?	Yes
Impact	The results obtained using S2F are currently being used by two research groups who are actively working with organisms of high practical interest for crop production and for biofuel production (Pablo Sotelo, Universidad Nacional de Asuncion (Paraguay); Matteo Pellegrini, University of California, Los Angeles (USA)). S2F has been used for our participation in two CAFA challenges, organized by an international research community of more than 50 research groups devoted to the study of protein function prediction methods.
URL	http://paccanarolab.org/s2f
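
The ontology-consistency property mentioned in the S2F description can be illustrated with a minimal sketch. One simple way to obtain consistent scores, assumed here and not necessarily the one used by S2F, is to raise the score of every term to the maximum score found among its descendants, so that a general term never scores lower than a more specific one. The toy ontology, the raw scores and the function names are hypothetical.

```python
# Hypothetical toy ontology (child -> parents); not real Gene Ontology terms.
PARENTS = {
    "root": [],
    "binding": ["root"],
    "catalysis": ["root"],
    "dna_binding": ["binding"],
    "rna_binding": ["binding"],
}


def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def make_consistent(scores):
    """Give each term the maximum score found among itself and its descendants."""
    consistent = dict(scores)
    for term, score in scores.items():
        for anc in ancestors(term):
            consistent[anc] = max(consistent[anc], score)
    return consistent


# Hypothetical raw per-term scores for one protein; "binding" scores lower
# than its child "dna_binding", which violates the ontology structure.
raw_scores = {"root": 0.9, "binding": 0.4, "catalysis": 0.2,
              "dna_binding": 0.7, "rna_binding": 0.1}
consistent_scores = make_consistent(raw_scores)
print(consistent_scores)
```

In this sketch the scores are per-term values that are not renormalised; the point is only the ordering constraint between general and specific terms.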


Title	SCPS
Description	SCPS (Spectral Clustering of Protein Sequences) is an efficient, user-friendly, scalable and multi-platform implementation of a spectral clustering method for clustering homologous proteins. SCPS also implements connected component analysis and hierarchical clustering, integrates TribeMCL and interfaces with external tools such as Cytoscape and NCBI BLAST.
Type Of Technology	Software
Year Produced	2010
Open Source License?	Yes
Impact	The paper is classified as 'highly accessed' on the journal website. The work has been cited 28 times already. Many of the papers citing SCPS make use of the software for large scale clustering of protein sequences in practical, real world applications.
URL	http://paccanarolab.org/scps
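
To give a flavour of what spectral clustering of a similarity network involves, the following is a minimal sketch of spectral bisection in pure Python. It is not the SCPS algorithm: it only splits a graph into two groups using the sign of the second eigenvector of the normalised similarity matrix, computed by power iteration. The similarity matrix stands in for pairwise sequence similarities (for example derived from BLAST scores) and its values, like the function name, are hypothetical.

```python
import math


def spectral_bisection(weights, iterations=200):
    """Split a weighted graph in two using the second eigenvector of
    M = D^-1/2 W D^-1/2. Assumes a symmetric matrix with no isolated nodes."""
    n = len(weights)
    degree = [sum(row) for row in weights]
    m = [[weights[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
         for i in range(n)]
    # The leading eigenvector of M (eigenvalue 1) is proportional to sqrt(degree).
    norm = math.sqrt(sum(degree))
    top = [math.sqrt(d) / norm for d in degree]
    # Deterministic start vector that is not parallel to the leading eigenvector.
    v = [float(i + 1) for i in range(n)]
    for _ in range(iterations):
        # Multiply by (M + I) / 2 so that all eigenvalues lie in [0, 1].
        v = [0.5 * (sum(m[i][j] * v[j] for j in range(n)) + v[i])
             for i in range(n)]
        # Remove the component along the leading eigenvector (deflation).
        projection = sum(a * b for a, b in zip(v, top))
        v = [a - projection * b for a, b in zip(v, top)]
        length = math.sqrt(sum(a * a for a in v))
        v = [a / length for a in v]
    # The sign of each entry assigns the node to one of the two clusters.
    return [1 if a > 0 else 0 for a in v]


# Hypothetical similarities: two tight groups of three sequences each,
# joined by one weak similarity between sequence 2 and sequence 3.
W = [
    [0.0, 1.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 1.0, 0.0],
]
labels = spectral_bisection(W)
print(labels)
```

A full spectral clustering method would typically use several eigenvectors and a final clustering step to find more than two groups; the bisection above is the simplest instance of the idea.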


Title	mutation3D
Description	mutation3D is a functional prediction and visualization tool for studying the spatial arrangement of amino acid substitutions on protein models and structures. It is intended to be used to identify clusters of amino acid substitutions arising from somatic cancer mutations across many patients in order to identify functional hotspots and fuel downstream hypotheses. It is also useful for clustering other kinds of mutational data, or simply as a tool to quickly assess relative locations of amino acids in proteins.
Type Of Technology	Webtool/Application
Year Produced	2016
Impact	It is still too early to assess the impact: the tool was released about a month ago.
URL	http://mutation3d.org/


Description	Artist in Residence Kerry Lemon
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	Local
Primary Audience	Public/other audiences
Results and Impact	Stimulating discussions to bridge the gap between science and artistic thinking. Artistic drawing with understanding of plant development.
Year(s) Of Engagement Activity	2014,2015,2016,2017
URL	http://www.kerrylemon.co.uk/


Description	Bristol2012
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Other academic audiences (collaborators, peers etc.)
Results and Impact	The talks led to interesting discussions and new contacts. Some plans were made for future collaboration.
Year(s) Of Engagement Activity	2012


Description	Cambridge2013
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Other academic audiences (collaborators, peers etc.)
Results and Impact	Talks about our research and methods with peers. Setting up collaboration activities with our peers.
Year(s) Of Engagement Activity	2013


Description	ClusterONE press release
Form Of Engagement Activity	A press release, press conference or response to a media enquiry/interview
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Media (as a channel to the public)
Results and Impact	We advertised on the Royal Holloway college website the publication of the ClusterONE algorithm and of its accompanying software in Nature Methods. The advertisement sparked a lot of interest in the algorithm in the college. As a consequence, we were approached by biologists in the School of Biological Sciences at Royal Holloway, with whom we started collaborating on clustering the large-scale experimental co-expression networks that they were producing.
Year(s) Of Engagement Activity	2012
URL	https://www.royalholloway.ac.uk/computerscience/news/newsarticles/researchersalgorithmpublishedinsci...


Description	Co-PI Talk
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	Local
Primary Audience	Schools
Results and Impact	The Co-PI gave a talk at a conference organized for grammar school pupils at Aylesbury Grammar School. The talk led to discussions of the complexities of signalling pathways and why they are important. As the conference focused on plant biology, a number of pupils decided to find out more about options for studying Biological Sciences after high school.
Year(s) Of Engagement Activity	2010


Description	Cornell2010
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Other academic audiences (collaborators, peers etc.)
Results and Impact	Meeting with researchers and interesting work discussions. Elaboration of plans for future collaboration.
Year(s) Of Engagement Activity	2010


Description	Cornell2013
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Other academic audiences (collaborators, peers etc.)
Results and Impact	The talk sparked discussions with other scientists. The feedback I obtained was useful for my current research. The talk was important to advertise my research and to make contacts for future collaborations. A collaboration was initiated with the group of Prof. Haiyuan Yu for a new joint research project aimed at finding hotspot mutations in Cancer proteins. The collaboration is ongoing and a paper is currently under review in BMC Biology.
Year(s) Of Engagement Activity	2013


Description	GlaxoSmithKline2013
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Other academic audiences (collaborators, peers etc.)
Results and Impact	Engagement with contacts and discussions of mutual interests. Plans for collaboration with some contacts made.
Year(s) Of Engagement Activity	2013


Description	ISMB 2010
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Other academic audiences (collaborators, peers etc.)
Results and Impact	The poster presentation generated interest and positive feedback from the participants of the event. Other participants provided interesting ideas that helped us in our research.
Year(s) Of Engagement Activity	2010
URL	http://www.iscb.org/archive/conferences/iscb/ismb2010.html


Description	ISMB BioOntologies SIG 2013
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Other academic audiences (collaborators, peers etc.)
Results and Impact	The poster presentation led to some interesting discussions and new contacts. The feedback from the activity was useful for further developing our research.
Year(s) Of Engagement Activity	2013
URL	http://www.iscb.org/ismbeccb2013-program/ismbeccb2013-satellite-meetings#bio


Description	ISMB NetBIO SIG 2013
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Other academic audiences (collaborators, peers etc.)
Results and Impact	After the poster presentation, some contacts were made and we had interesting discussions of the presented work. We analysed our work with other researchers, who helped us improve it further.
Year(s) Of Engagement Activity	2013
URL	http://www.iscb.org/ismbeccb2013-program/ismbeccb2013-satellite-meetings#netbio


Description	Invited participation in experts' roundtable at The Bioinformatics Strategy Meeting in London
Form Of Engagement Activity	A formal working group, expert panel or dialogue
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Industry/Business
Results and Impact	I participated in an Experts' roundtable together with other academics and members of Industry
Year(s) Of Engagement Activity	2016


Description	London Area Plant Molecular Sciences
Form Of Engagement Activity	A formal working group, expert panel or dialogue
Part Of Official Scheme?	No
Geographic Reach	Local
Primary Audience	Professional Practitioners
Results and Impact	Increased the cohesion of plant science in the London area. Yearly meetings were repeated for 10 years.
Year(s) Of Engagement Activity	Pre-2006,2006,2007,2008,2009,2010


Description	MRC2012
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Other academic audiences (collaborators, peers etc.)
Results and Impact	Discussions about biological problems analysed in their community that we could help with. Establishing links with biologists and creating collaboration networks.
Year(s) Of Engagement Activity	2012


Description	Milan2009-Biology
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Other academic audiences (collaborators, peers etc.)
Results and Impact	The talks continued with an analysis of some other problems related to the research we talked about. Some plans were made for future collaboration with our new contacts.
Year(s) Of Engagement Activity	2009


Description	Milan2009-CS
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Other academic audiences (collaborators, peers etc.)
Results and Impact	The talks generated interesting discussions and we met some contacts. Some plans for future collaboration were made with the University.
Year(s) Of Engagement Activity	2009


Description	NIPS 2008
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Other academic audiences (collaborators, peers etc.)
Results and Impact	After the poster presentation, we made interesting contacts and had positive discussions about our work. We got feedback that allowed further development of our research.
Year(s) Of Engagement Activity	2008
URL	http://nips.cc/Conferences/2008/


Description	Poster ClusterONE 2013
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Other academic audiences (collaborators, peers etc.)
Results and Impact	The poster presentation led to discussions on the work with fellow researchers. The feedback provided by our peers was useful for further development.
Year(s) Of Engagement Activity	2013
URL	http://www.iscb.org/ismbeccb2013


Description	RHUL Open Days
Form Of Engagement Activity	Participation in an open day or visit at my research institution
Part Of Official Scheme?	No
Geographic Reach	Local
Primary Audience	Public/other audiences
Results and Impact	The University opens to the public and each department presents a showcase of its research in a way that is accessible to a wider, non-specialist audience. This generated interest in the research done by the CS Department. Many students joined the Computer Science Department.
Year(s) Of Engagement Activity	2009,2010,2011,2012,2013,2014


Description	School visits
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	Regional
Primary Audience	Schools
Results and Impact	Increased awareness of plant research. Increased interest and motivation among school pupils.
Year(s) Of Engagement Activity	Pre-2006,2006,2007,2008,2009,2010,2011,2012,2013,2014


Description	Science Club at the Desborough School
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	Local
Primary Audience	Schools
Results and Impact	Dr Safina Khan organized a Science Club at the Desborough School (Maidenhead, UK) for a period of one year. This consisted of weekly one-hour meetings during which pupils performed experiments designed by Dr Khan and discussed scientific ideas with her, including concepts from this project. This generated interest and discussions among the students. Recently Dr Khan obtained a grant, funded by the Royal Society Partnership Grant Scheme, to continue this work together with the Desborough School.
Year(s) Of Engagement Activity	2009


Description	Talks to the groups of Martin Wilkins and Paul Matthews -- summer 2015
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	Regional
Primary Audience	Other audiences
Results and Impact	I presented our recent results in the area of Network Medicine to Prof Martin Wilkins and Prof Paul Matthews and their groups (I gave two separate talks) at the Department of Medicine, Imperial College, Hammersmith Hospital. The talks sparked interesting discussions and were the beginning of a very interesting collaboration with the lab of Prof Matthews in the area of Multiple Sclerosis.
Year(s) Of Engagement Activity	2015


Description	Taster courses
Form Of Engagement Activity	Participation in an open day or visit at my research institution
Part Of Official Scheme?	No
Geographic Reach	Local
Primary Audience	Schools
Results and Impact	One-day courses open to school pupils. They enquired about the courses that Computer Science departments offer and about possibilities for future study. Some students chose to follow the lead we gave them and went on to study Computer Science in our department.
Year(s) Of Engagement Activity	2009,2010,2011,2012,2013,2014


Description	UCA-Py2009
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Other academic audiences (collaborators, peers etc.)
Results and Impact	The talks generated interest and requests from some students for more information on our work. We recruited a full-time PhD student for the Computer Science department at Royal Holloway.
Year(s) Of Engagement Activity	2009


Description	UCAS open days
Form Of Engagement Activity	Participation in an open day or visit at my research institution
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Schools
Results and Impact	During this talk I try to convey to school pupils what computer science is and why it is an exciting field of study. The talk often sparked questions and discussions. A high percentage of school pupils who came to the talk decided to study Computer Science, and many of these chose to study it in our department at Royal Holloway.
Year(s) Of Engagement Activity	2008,2009,2010,2011,2012,2013,2014


Description	UCLondon2012
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Other academic audiences (collaborators, peers etc.)
Results and Impact	Using the talks as a medium, we met multiple peers and engaged in interesting conversations. We drew up plans based on the conversations we had with some peers.
Year(s) Of Engagement Activity	2012


Description	UNA-Py2009
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Other academic audiences (collaborators, peers etc.)
Results and Impact	We presented our research and the departmental study programmes, which led to requests for more information and to meeting research contacts. We extended our contact network for collaborations.
Year(s) Of Engagement Activity	2009


Description	Venice2009
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Other academic audiences (collaborators, peers etc.)
Results and Impact	Our talk sparked some interesting discussions with new research contacts. Some plans to collaborate with the researchers were made.
Year(s) Of Engagement Activity	2009


Description	Venice2012
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Other academic audiences (collaborators, peers etc.)
Results and Impact	Our presentation led to meetings with contacts. We developed some plans for collaborations.
Year(s) Of Engagement Activity	2009,2012


Description	Yale2010
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Other academic audiences (collaborators, peers etc.)
Results and Impact	Talks about our work and meetings with new contacts. Construction of plans for future collaboration.
Year(s) Of Engagement Activity	2010


Description	talk at Galway -- May 2015
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Other audiences
Results and Impact	I presented my work at the School of Mathematics, Statistics and Applied Mathematics at Galway University, Ireland. The talk sparked discussions with other scientists. The feedback I obtained was useful for my current research. The talk was important to advertise my research and to make contacts for future collaborations.
Year(s) Of Engagement Activity	2015
