Mining term associations from literature to support knowledge discovery in biology
Lead Research Organisation: University of Manchester
Department Name: Computer Science
Technical Summary
In this project we propose combining various text mining approaches to establish associations among biological terms. Our aim is to support biological knowledge discovery by developing novel text mining techniques that extract and present non-trivial knowledge and term associations (e.g. related proteins, their molecular functions, localisations, etc.). More specifically, the objectives of the proposal are: to implement text-based methods for determining term similarity from large document collections; to investigate, implement and evaluate a novel term kernel method for biological text mining; to identify, implement and evaluate suitable kernel-based technologies for solving user-elicited biological text mining scenarios; and to make the tools available to the wider research community via the National Centre for Text Mining.
Terms are vital for processing scientific texts. A term is the lexical realisation of a concept. It can be a single word (e.g. protein) or a multiword phrase (e.g. son of sevenless). We will focus on extracting term relationships as a basis for text mining, combining lexical, syntactic and contextual similarities extracted from the literature. Lexical term similarity will be measured from the substrings that bio-terms share; in addition, we will investigate the use of string and subsequence kernels for this task. Syntactic similarity will be measured from the co-occurrence of terms within term enumerations, coordinations and conjunctions, i.e. in expressions where a sequence of terms appears as a single syntactic unit. Contextual term similarity will be measured by comparing the contexts in which terms appear. The context of a term will be represented by a regular expression containing different elements, such as part-of-speech and syntactic tags, terminological and additional ontological information, and lemmatised contextual elements.
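To make the lexical measure above concrete, the following is a minimal sketch, not the project's implementation. It assumes that "shared substrings" can be approximated by character n-grams, and it uses a normalised spectrum kernel as one simple example of a string kernel. The function names, the trigram length and the example terms are illustrative assumptions.

```python
from collections import Counter
from math import sqrt


def char_ngrams(term: str, n: int = 3) -> Counter:
    """Count the character n-grams of a lower-cased term."""
    s = term.lower()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def dice_similarity(a: str, b: str, n: int = 3) -> float:
    """Dice coefficient over the n-grams two terms share (multiset overlap)."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    total = sum(ga.values()) + sum(gb.values())
    if total == 0:  # both terms are shorter than n characters
        return 0.0
    shared = sum((ga & gb).values())
    return 2.0 * shared / total


def spectrum_kernel(a: str, b: str, n: int = 3) -> float:
    """Normalised spectrum kernel: cosine of the two n-gram count vectors."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    dot = sum(count * gb[gram] for gram, count in ga.items())
    norm = sqrt(sum(c * c for c in ga.values())) * sqrt(sum(c * c for c in gb.values()))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    for other in ("protein kinase A", "son of sevenless"):
        print(other,
              round(dice_similarity("protein kinase C", other), 3),
              round(spectrum_kernel("protein kinase C", other), 3))
```

Under these assumptions, terms that share a long common stem score close to 1, while terms with no trigram in common score 0. A gapped subsequence kernel would relax the requirement that the shared characters be contiguous.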
Contexts will be mined automatically from documents, linguistically normalised and biologically generalised, and then compared using a vector representation. We will select sensible weighting schemes and test their performance in detecting term associations using existing resources for validation. The endpoint of mining term similarities is appropriate representation, analysis and visualisation of information in order to support biologists in knowledge discovery. By combining these similarities, terms can be linked into semantic networks and further used for text mining.
Based on term similarities, we will develop a novel term kernel for biological text mining. Once we have such a kernel, we can use the whole gamut of emergent kernelised data mining methods. The technologies to be investigated for supporting knowledge discovery include term clustering, classification, principal component analysis, regression, and correlation. For example, discovering correlations between textual and non-textual information derived from post-genomic techniques such as expression array and sequence analysis is a powerful hypothesis generation method. For instance, entities that appear similar from the results of text mining might behave very differently under a particular set of experimental conditions; this suggests the experiment is uncovering something that was previously unknown and is worthy of further investigation. At present, there are no good tools for detecting these types of patterns; we will develop such tools.
We will demonstrate the utility of these technologies in solving user-elicited biological text mining scenarios. Scenarios are small-scale but real-world problems that we can help solve using term-based text mining. These scenarios will include, but not be limited to, the following: compound toxicity prediction, quantification and classification; and linking genes from quantitative trait loci and expression array data using the literature. These scenarios will be defined and evaluated in close collaboration with biologists.
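The contextual comparison and the combination of similarities into a single term kernel can be sketched as follows. This is an illustration under stated assumptions, not the method the project used: the context patterns are invented toy data, the TF-IDF style weighting is only one possible "sensible weighting scheme", the bigram Jaccard function is a stand-in for a lexical measure, and the 0.7/0.3 weights are arbitrary.

```python
from collections import Counter
from math import log, sqrt

# Hypothetical toy data: each term maps to the normalised context patterns
# it was observed in (pattern syntax invented for illustration).
contexts = {
    "RAR": ["V:activate NP", "PREP:by TERM", "V:bind NP"],
    "RXR": ["V:activate NP", "V:bind NP", "V:bind NP"],
    "insulin": ["V:secrete TERM", "PREP:of NP"],
}


def weighted_vectors(contexts):
    """Build one sparse TF-IDF style vector per term (assumed weighting)."""
    n_terms = len(contexts)
    df = Counter(p for patterns in contexts.values() for p in set(patterns))
    vectors = {}
    for term, patterns in contexts.items():
        tf = Counter(patterns)
        # Smoothed IDF keeps every weight positive.
        vectors[term] = {p: c * (log((1 + n_terms) / (1 + df[p])) + 1.0)
                         for p, c in tf.items()}
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(p, 0.0) for p, w in u.items())
    norm = sqrt(sum(w * w for w in u.values())) * sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def bigram_jaccard(a, b):
    """Stand-in lexical similarity: Jaccard overlap of character bigrams."""
    sa = {a.lower()[i:i + 2] for i in range(len(a) - 1)}
    sb = {b.lower()[i:i + 2] for i in range(len(b) - 1)}
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def combined_kernel(terms, similarity_fns, weights):
    """Gram matrix of a non-negative weighted sum of similarity functions."""
    return [[sum(w * f(a, b) for f, w in zip(similarity_fns, weights))
             for b in terms] for a in terms]


vectors = weighted_vectors(contexts)


def contextual(a, b):
    """Contextual similarity of two terms via their weighted vectors."""
    return cosine(vectors[a], vectors[b])


terms = list(contexts)
K = combined_kernel(terms, [contextual, bigram_jaccard], [0.7, 0.3])

if __name__ == "__main__":
    for term, row in zip(terms, K):
        print(term, [round(x, 3) for x in row])
```

A non-negative weighted sum of valid kernels is itself a valid kernel, so provided each component similarity is positive semi-definite, a matrix such as `K` could be passed to kernelised clustering, classification or principal component analysis.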
Organisations
People | ORCID iD |
Goran Nenadic (Principal Investigator) | |
John Keane (Co-Investigator) | |
Publications
Rebholz-Schuhmann D (2006) Annotation and disambiguation of semantic types in biomedical text: A cascaded approach to named entity recognition, in Proceedings of the 5th Workshop on NLP and XML: Multi-Dimensional Markup in Natural Language Processing, NLPXML 2006 at the 11th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2006
Nenadic G (2006) Mining semantically related terms from biomedical literature, in ACM Transactions on Asian Language Information Processing
Nenadic G (2006) Towards a terminological resource for biomedical text mining, in Proceedings of the 5th International Conference on Language Resources and Evaluation, LREC 2006
Yang H (2007) A cascaded approach to normalising gene mentions in biomedical literature, in Bioinformation
Yang H (2008) Identification of transcription factor contexts in literature using machine learning approaches, in BMC Bioinformatics
Yang H (2009) Assigning roles to protein mentions: the case of transcription factors, in Journal of Biomedical Informatics
Yang H (2009) A text mining approach to the prediction of disease status from clinical discharge summaries, in Journal of the American Medical Informatics Association: JAMIA
Description | This project has developed a text mining framework that identifies key biological concepts in text and uses classification and prediction algorithms to suggest potential new relationships that might be worth investigating. |
Exploitation Route | The methodology developed has been used in real biological/bioinformatics projects to support knowledge management and discovery. |
Sectors | Agriculture, Food and Drink; Chemicals; Digital/Communication/Information Technologies (including Software); Healthcare; Government, Democracy and Justice; Culture, Heritage, Museums and Collections; Pharmaceuticals and Medical Biotechnology |
Description | Pubmed2ensembl: a Resource for Linking Biological Literature to Genome Sequences |
Amount | £99,000 (GBP) |
Funding ID | BB/G000093/1 |
Organisation | Biotechnology and Biological Sciences Research Council (BBSRC) |
Sector | Public |
Country | United Kingdom |
Start | 02/2009 |
End | 02/2010 |
Title | bioMITA |
Description | Extraction of concepts and relationships from the literature. |
Type Of Material | Model of mechanisms or symptoms - human |
Year Produced | 2008 |
Provided To Others? | Yes |
Impact | Used for pubmed2ensembl project. |
Title | BioMITA |
Description | Term recognition and classification using CRF |
Type Of Technology | Software |
Year Produced | 2009 |
Open Source License? | Yes |
Impact | Used in pubmed2ensembl |