Mining protein interaction data and its context from the scientific literature

Lead Research Organisation: University of Manchester
Department Name: Life Sciences


The main archive of life sciences literature currently contains more than 17 million references and grows by approximately 2,000 articles every day. This information is invaluable and represents a rich source of knowledge for academic, biomedical and industrial researchers. However, its current size, let alone its future size, is making it virtually impossible for individual scientists to keep pace with publications in their own area, let alone related ones. It is therefore likely that a significant amount of scientific effort is spent re-discovering phenomena that have already been studied in similar experiments. This has led to the generation of extensive secondary data sets mined from the published literature, e.g., for yeast (Reguly et al., 2006; J. Biol. 5:11), microbes (Rajagopala et al., 2008; Bioinformatics 24:2622) and HIV (Pinney et al., 2009; AIDS 23:549), among others. In recent years much emphasis has been placed on using text mining to identify protein interactions, and several relatively successful systems have been developed in this area (Krallinger et al., 2008; Genome Biol. 9:S4). However, the extracted information is typically represented as simple interacting pairs, with limited background information to characterise the interaction: little attempt is made to capture the context of such information (e.g. experimental conditions, methods used, how reliable it is, what the nature of the interaction is, etc.). Furthermore, literature-curated data can be problematic, as it can contain curation errors and redundant data. In addition, a diverse collection of experimental methods will have been used to determine interactions. In this project we propose to study the way findings, experiments and knowledge about protein interactions are presented in the literature, and in particular how contextual information that details an interaction is encoded and presented.
The aim will be to put interaction data into its semantic and biological context. To do this, we will implement a text mining framework to extract contextual information from full-text articles, and link and contrast it with data in other (structured) resources. The knowledge extracted will be characterised by both qualitative and quantitative features. Qualitative attributes will model experimental context (e.g. outcomes, interaction types, conditions, constraints, methods, model organisms, etc.). We will explore and, if necessary, customise existing modelling frameworks (including, for example, PSI-MI and EXPO) to represent experimental context extracted from the literature. Quantitative measures will represent features that may be indicative of data quality or relevance for a specific data set. We will explore bibliometrics assigned to protein interactions, such as the number of citations and mentions, peaks and changes over time, and association with specific entities such as experimental methods, model systems, drug associations and outcomes. To achieve these aims, the student will develop a generic framework in which interaction data will be systematically collected from the literature, and then integrated, explored and visualised. The specific methodology will follow a hybrid approach that combines existing biomedical resources, e.g., terminological dictionaries and ontologies, with a rule-based approach to bootstrap data set-specific patterns, while suitable machine-learning-based methods will be developed to improve the coverage of the information extracted. The information will be presented via interaction networks augmented with context data, which will facilitate more biologically informed exploration of protein-protein relationships.
Importantly, the general framework developed for placing biological 'facts' in context will be applicable across biological and text-mining domains, but will be implemented and evaluated in a specific context in collaboration with the industrial partner.
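
As a purely illustrative sketch of the dictionary-plus-rule part of the hybrid approach described above, the following Python fragment extracts an interacting pair from a sentence and attaches any experimental method mentioned alongside it. The dictionaries, the pattern and all names (PROTEINS, METHODS, INTERACTION_RULE, Interaction, extract) are assumptions made for this example only and do not describe the project's actual implementation; a real system would draw on curated terminological resources and ontologies rather than hand-written sets.

```python
import re
from dataclasses import dataclass, field

# Hypothetical toy dictionaries standing in for curated terminological resources.
PROTEINS = {"Cdc28", "Cln2", "Sic1"}
METHODS = {"yeast two-hybrid", "co-immunoprecipitation"}

# Hypothetical hand-written rule: "<A> interacts with | binds to | binds <B>".
INTERACTION_RULE = re.compile(
    r"\b(\w+)\s+(interacts with|binds to|binds)\s+(\w+)\b"
)


@dataclass
class Interaction:
    protein_a: str
    protein_b: str
    verb: str
    # Context attached to the pair: here, only the methods named in the sentence.
    methods: list = field(default_factory=list)


def extract(sentence):
    """Return interactions found in one sentence, with method context attached."""
    lowered = sentence.lower()
    # Dictionary lookup for experimental methods (case-insensitive substring match).
    methods = sorted(m for m in METHODS if m in lowered)
    results = []
    for a, verb, b in INTERACTION_RULE.findall(sentence):
        # Keep a match only if both arguments are known protein names.
        if a in PROTEINS and b in PROTEINS:
            results.append(Interaction(a, b, verb, list(methods)))
    return results


if __name__ == "__main__":
    text = "Cdc28 binds to Cln2 as shown by co-immunoprecipitation."
    for hit in extract(text):
        print(hit)
```

In this sketch the rule supplies the candidate pair and the dictionaries supply both the filter and the context; the machine-learning component mentioned above would sit on top of such rules to widen coverage and is not shown here.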

