Mining term associations from literature to support knowledge discovery in biology

Lead Research Organisation: University of Manchester
Department Name: Computer Science


Technical Summary

In this project we propose to combine various text mining approaches to establish associations among biological terms. Our aim is to support biological knowledge discovery and to develop novel text mining techniques that extract and present non-trivial knowledge and term associations (e.g. related proteins, their molecular functions, localisations, etc.). More specifically, the objectives of the proposal are: to implement text-based methods for determining term similarity from large document collections; to investigate, implement and evaluate a novel term kernel method for biological text mining; to identify, implement and evaluate suitable kernel-based technologies for solving user-elicited biological text mining scenarios; and to make the tools available to the wider research community via the National Centre for Text Mining.

Terms are vital for processing scientific texts. A term is the lexical realisation of a concept. It can be a single word (e.g. protein) or a multiword phrase (e.g. son of sevenless). We will focus on extracting term relationships as a basis for text mining, combining lexical, syntactic and contextual similarities extracted from the literature. Lexical term similarity will be measured by considering substrings that are shared among bio-terms; in addition, we will investigate the use of string and subsequence kernels for this task. Syntactic similarity will be measured by the co-occurrence of terms within term enumerations, coordinations and conjunctions, i.e. in expressions where a sequence of terms appears as a single syntactic unit. Contextual term similarity will be measured by comparing the contexts in which terms appear. The context of a term will be represented by a regular expression containing different elements, such as part-of-speech and syntactic tags, terminological and additional ontological information, and lemmatised contextual elements. Contexts will be mined automatically from documents, linguistically normalised and biologically generalised, and then compared using a vector representation. We will select sensible weighting schemes and test their performance in detecting term associations using existing resources for validation.

The endpoint of mining term similarities is the appropriate representation, analysis and visualisation of information in order to support biologists in knowledge discovery. By combining these similarities, terms can be linked into semantic networks and further used for text mining. Based on term similarities, we will develop a novel term kernel for biological text mining. Once we have such a kernel, we can use the whole gamut of emergent kernelised data mining methods. The technologies to be investigated for supporting knowledge discovery include term clustering, classification, principal component analysis, regression, and correlation. For example, discovering correlations between textual and non-textual information derived from post-genomic techniques such as expression array and sequence analysis is a powerful hypothesis generation method. For instance, entities that appear similar from the results of text mining might behave very differently under a particular set of experimental conditions; this suggests that the experiment is uncovering something previously unknown and worthy of further investigation. At present, there are no good tools for detecting these types of patterns; we will develop such tools.

We will demonstrate the utility of these technologies in solving user-elicited biological text mining scenarios. Scenarios are small-scale but real-world problems that we can help solve using term-based text mining. They will include, but not be limited to, the following: compound toxicity prediction, quantification and classification; and linking genes from quantitative trait loci and expression array data using the literature. These scenarios will be defined and evaluated in close collaboration with biologists.
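
As a rough illustration of how the lexical and contextual similarities described in this summary might be computed and combined, the following is a minimal sketch and not the project's actual method. It assumes a Dice coefficient over shared character trigrams for the lexical score, a cosine over raw bags of context tokens for the contextual score, and a simple weighted mix of the two. All function names, the trigram length and the weight `w_lex` are hypothetical choices made for this example.

```python
from collections import Counter
from math import sqrt


def char_ngrams(term, n=3):
    """Return the set of character n-grams of a lower-cased term."""
    text = term.lower()
    if len(text) < n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def lexical_similarity(a, b, n=3):
    """Dice coefficient over shared character n-grams (an assumed measure)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def contextual_similarity(ctx_a, ctx_b):
    """Cosine similarity between two bags of context tokens."""
    vec_a, vec_b = Counter(ctx_a), Counter(ctx_b)
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = sqrt(sum(c * c for c in vec_a.values()))
    norm_b = sqrt(sum(c * c for c in vec_b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def combined_similarity(a, b, ctx_a, ctx_b, w_lex=0.5):
    """Weighted mix of both scores; the weight is a hypothetical choice."""
    lexical = lexical_similarity(a, b)
    contextual = contextual_similarity(ctx_a, ctx_b)
    return w_lex * lexical + (1 - w_lex) * contextual


if __name__ == "__main__":
    # Toy context tokens invented for this example, not mined from literature.
    ctx_1 = ["phosphorylate", "substrate", "activate"]
    ctx_2 = ["phosphorylate", "receptor", "activate"]
    score = combined_similarity("protein kinase", "tyrosine kinase", ctx_1, ctx_2)
    print(round(score, 3))
```

In a fuller pipeline the context tokens would come from the normalised, generalised contexts mined from documents, and the raw counts would be replaced by whichever weighting scheme performs best in validation.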


Nenadic G (2006) Mining semantically related terms from biomedical literature in ACM Transactions on Asian Language Information Processing (TALIP)

Yang H (2009) Assigning roles to protein mentions: the case of transcription factors. in Journal of biomedical informatics

Yang H (2009) A text mining approach to the prediction of disease status from clinical discharge summaries. in Journal of the American Medical Informatics Association : JAMIA

Description This project has developed a text mining framework to identify a number of key biological concepts in text and use various classification and prediction algorithms to suggest new potential relationships that might be worth investigating.
Exploitation Route The methodology developed has been used in real biological/bioinformatics projects to support knowledge management and discovery.
Sectors Agriculture, Food and Drink; Chemicals; Digital/Communication/Information Technologies (including Software); Healthcare; Government, Democracy and Justice; Culture, Heritage, Museums and Collections; Pharmaceuticals and Medical Biotechnology

Description Pubmed2ensembl: a Resource for Linking Biological Literature to Genome Sequences
Amount £99,000 (GBP)
Funding ID BB/G000093/1 
Organisation Biotechnology and Biological Sciences Research Council (BBSRC) 
Sector Public
Country United Kingdom
Start 02/2009 
End 02/2010

Title bioMITA
Description Extraction of concepts and relationships from the literature. 
Type Of Material Model of mechanisms or symptoms - human 
Year Produced 2008 
Provided To Others? Yes  
Impact Used for pubmed2ensembl project.

Title BioMITA
Description Term recognition and classification using CRF 
Type Of Technology Software 
Year Produced 2009 
Open Source License? Yes  
Impact Used in pubmed2ensembl