Mining term associations from literature to support knowledge discovery in biology

Lead Research Organisation: University of Manchester

Department Name: Computer Science

Abstract

Abstracts are not currently available in GtR for all funded research. This is normally because the abstract was not required at the time of proposal submission, but may be because it included sensitive information such as personal details.

Technical Summary

In this project we propose combining various text mining approaches to establish associations among biological terms. Our aim is to support biological knowledge discovery and to develop novel text mining techniques that extract and present non-trivial knowledge and term associations (e.g. related proteins, their molecular functions, localisations, etc.). More specifically, the objectives of the proposal are: to implement text-based methods for determining term similarity from large document collections; to investigate, implement and evaluate a novel term kernel method for biological text mining; to identify, implement and evaluate suitable kernel-based technologies for solving user-elicited biological text mining scenarios; and to make the tools available to the wider research community via the National Centre for Text Mining.

Terms are vital for processing scientific texts. A term is the lexical realisation of a concept. It can be a single word (e.g. "protein") or a multiword phrase (e.g. "son of sevenless"). We will focus on extracting term relationships as a basis for text mining, combining lexical, syntactic and contextual similarities extracted from the literature. Lexical term similarity will be measured by considering substrings that are shared among bio-terms. In addition, we will investigate the use of string and subsequence kernels for this task. Syntactic similarity will be measured by the co-occurrence of terms within term enumerations, coordinations and conjunctions, i.e. in expressions where a sequence of terms appears as a single syntactic unit. Contextual term similarity will be measured by comparing the contexts in which terms appear. The context of a term will be represented by a regular expression containing different elements, such as part-of-speech and syntactic tags, terminological and additional ontological information, and lemmatised contextual elements.
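As an illustration of the lexical similarity described above, the following is a minimal sketch only. The summary does not say how shared substrings are scored, so the use of fixed-length character n-grams, the Dice coefficient, and the names `char_ngrams` and `lexical_similarity` are all assumptions made for this example, not the project's method.

```python
from typing import Set


def char_ngrams(term: str, n: int = 3) -> Set[str]:
    """Return the set of character n-grams of a lower-cased term.

    Assumption: substrings are approximated by fixed-length n-grams.
    Terms shorter than n are kept whole so they still compare equal to themselves.
    """
    text = term.lower()
    if len(text) < n:
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def lexical_similarity(a: str, b: str, n: int = 3) -> float:
    """Dice coefficient over shared character n-grams, in the range [0, 1].

    Assumption: Dice is one plausible score; the source does not name one.
    """
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 0.0
    shared = len(grams_a & grams_b)
    return 2.0 * shared / (len(grams_a) + len(grams_b))
```

Under these assumptions, "protein" and "proteins" share five of their trigrams and score 10/11 (about 0.91), while terms with no trigram in common score 0.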
Contexts will be mined automatically from documents, linguistically normalised and biologically generalised, and then compared using a vector representation. We will select sensible weighting schemes and test their performance in detecting term associations using existing resources for validation. The endpoint of mining term similarities is appropriate representation, analysis and visualisation of information in order to support biologists in knowledge discovery. By combining these similarities, terms can be linked into semantic networks and further used for text mining. Based on term similarities, we will develop a novel term kernel for biological text mining. Once we have such a kernel, we can use the whole gamut of emergent kernelised data mining methods. The technologies to be investigated for supporting knowledge discovery include term clustering, classification, principal component analysis, regression, and correlation. For example, discovery of correlations between textual and non-text information derived from post-genomic techniques such as expression array and sequence analysis is a powerful hypothesis generation method. For instance, entities that appear similar from the results of text mining might behave very differently under a particular set of experimental conditions; this suggests the experiment is uncovering something that was previously unknown and is worthy of further investigation. At present, there are no good tools for detecting these types of patterns; we will develop such tools. We will demonstrate the utility of these technologies in solving user-elicited biological text mining scenarios. Scenarios are small-scale but real-world problems that we can help solve using term-based text mining. These scenarios will include, but will not be limited to, the following: compound toxicity prediction, quantification and classification; linking genes from quantitative trait loci and expression array data using the literature, etc.
These scenarios will be defined and evaluated in close collaboration with biologists.
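
The vector comparison of contexts and the combination of similarity types described in this summary can be sketched roughly as follows. This is an illustrative sketch under stated assumptions: raw feature counts stand in for the weighting schemes (which the summary leaves open), cosine is one possible comparison, the combination is a plain weighted average, and the names `context_vector`, `cosine` and `combined_similarity` are hypothetical. It is not the term kernel developed in the project.

```python
import math
from collections import Counter
from typing import Dict, Iterable, Mapping


def context_vector(contexts: Iterable[Iterable[str]]) -> Counter:
    """Count context features (e.g. lemmas or tags) over all contexts of a term.

    Assumption: raw counts are used as weights.
    """
    vec: Counter = Counter()
    for ctx in contexts:
        vec.update(ctx)
    return vec


def cosine(u: Mapping[str, float], v: Mapping[str, float]) -> float:
    """Cosine similarity between two sparse feature vectors (0.0 if either is empty)."""
    dot = sum(w * v.get(f, 0) for f, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def combined_similarity(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-type similarity scores (lexical, syntactic, contextual).

    Assumption: a linear combination; the weights are hypothetical tuning parameters.
    """
    total = sum(weights[k] for k in scores)
    if total == 0:
        return 0.0
    return sum(weights[k] * scores[k] for k in scores) / total
```

For example, `combined_similarity({"lexical": 1.0, "contextual": 0.5}, {"lexical": 1.0, "contextual": 1.0})` gives 0.75. Whether a combined score of this kind is a valid (positive semi-definite) kernel depends on the component similarities and would need to be checked before using it with kernelised methods.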

Funded Value:

£192,905

Funded Period:

Jan 06 - Jun 09

Funder:

BBSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

BB/C007360/1

Principal Investigator:

Goran Nenadic

Research Topic:

Unclassified

Organisations

University of Manchester (Lead Research Organisation)

People	ORCID iD
Goran Nenadic (Principal Investigator)
John Keane (Co-Investigator)

Publications

Nenadic G (2006) Mining semantically related terms from biomedical literature in ACM Transactions on Asian Language Information Processing

Yang H (2007) A cascaded approach to normalising gene mentions in biomedical literature. in Bioinformation

Yang H (2008) Identification of transcription factor contexts in literature using machine learning approaches. in BMC bioinformatics

Yang H (2009) A text mining approach to the prediction of disease status from clinical discharge summaries. in Journal of the American Medical Informatics Association : JAMIA

Yang H (2009) Assigning roles to protein mentions: the case of transcription factors. in Journal of biomedical informatics

Key Findings

Description	This project has developed a text mining framework to identify a number of key biological concepts in text and use various classification and prediction algorithms to suggest new potential relationships that might be worth investigating.
Exploitation Route	The methodology developed has been used in real biological/bioinformatics projects to support knowledge management and discovery.
Sectors	Agriculture, Food and Drink; Chemicals; Digital/Communication/Information Technologies (including Software); Healthcare; Government, Democracy and Justice; Culture, Heritage, Museums and Collections; Pharmaceuticals and Medical Biotechnology

Further Funding

Description	Pubmed2ensembl: a Resource for Linking Biological Literature to Genome Sequences
Amount	£99,000 (GBP)
Funding ID	BB/G000093/1
Organisation	Biotechnology and Biological Sciences Research Council (BBSRC)
Sector	Public
Country	United Kingdom
Start	02/2009
End	02/2010

Research Tools and Methods

Title	bioMITA
Description	Extraction of concepts and relationships from the literature.
Type Of Material	Model of mechanisms or symptoms - human
Year Produced	2008
Provided To Others?	Yes
Impact	Used for pubmed2ensembl project.

Software and Technical Products

Title	BioMITA
Description	Term recognition and classification using CRF
Type Of Technology	Software
Year Produced	2009
Open Source License?	Yes
Impact	Used in pubmed2ensembl
