Lexical Acquisition for the Biomedical Domain

Lead Research Organisation: University of Cambridge
Department Name: Computer Science and Technology

Abstract

Natural Language Processing (NLP) is now critically needed to assist the processing, mining and extraction of knowledge from the rapidly growing literature in the area of biomedicine. In recent years, considerable progress has been made in the development of basic NLP techniques for biomedicine. The current challenge is to improve these techniques with richer and deeper analysis capable of supporting a wide range of real-world tasks. High-quality lexical resources (e.g. accurate and comprehensive lexicons and word classifications) are critically needed for this. Most lexical resources used in current systems are developed manually by linguists. Manual work is extremely costly, and the resulting resources require extensive labour-intensive porting to new (sub-)domains and tasks. Automatic acquisition or updating of lexical information from repositories of un-annotated text (e.g. corpora of biomedical articles) is a more promising avenue to pursue. Since lexical acquisition gathers usage and frequency information directly from relevant data, it can considerably enhance the viability and portability of NLP technology. Research into automatic lexical acquisition is now starting to produce large-scale resources useful for practical NLP tasks. However, the application of such techniques to biomedical texts has been limited because many existing techniques require adaptation before they can perform optimally in this linguistically challenging domain. In this project, we will take existing techniques capable of acquiring basic syntactic-semantic information for verbs from corpus data and will adapt them to the biomedical domain. We will focus on verbal (i) subcategorization frames, (ii) selectional preferences, and (iii) lexical-semantic classes. This information, when tailored to the domain in question, can aid key NLP tasks such as parsing, anaphora resolution, Information Extraction (IE), and question-answering (QA).
Building on our pilot studies and expanding on the adaptive, state-of-the-art text processing tools available to us, we will improve existing techniques further and extend them with novel unsupervised and semi-supervised methods capable of supporting efficient domain adaptation. We will evaluate and demonstrate the capabilities of our techniques directly and in the context of practical BIO-NLP tasks. We will use the final version of the system to acquire a substantial lexical database from a biomedical corpus. The resulting resource will be distributed freely to the research community, along with the software which can be used to tune the frequency information stored in the database to particular biomedical sub-domains/tasks. We expect this project to (i) advance BIO-NLP and improve its usefulness for practical tasks in biomedicine, (ii) advance NLP by improving the accuracy, robustness and portability of lexical acquisition to real-world tasks, and (iii) provide an important large-scale study of domain-adaptation in the critical area of lexical acquisition.
 
Description Natural Language Processing (NLP) is now critically needed to assist the processing, mining and extraction of knowledge from the rapidly growing literature in the area of biomedicine. In recent years, considerable progress has been made in the development of basic NLP techniques for biomedicine. The current challenge is to improve these techniques with richer and deeper analysis capable of supporting a wide range of real-world tasks. High-quality lexical resources (e.g. accurate and comprehensive lexicons and word classifications) are critically needed for this. Most lexical resources used in current systems are developed manually by linguists. Manual work is extremely costly, and the resulting resources require extensive labour-intensive porting to new (sub-)domains and tasks. Automatic acquisition or updating of lexical information from repositories of un-annotated text (e.g. corpora of biomedical articles) is a more promising avenue to pursue. Since lexical acquisition gathers usage and frequency information directly from relevant data, it can considerably enhance the viability and portability of NLP technology. Research into automatic lexical acquisition is now starting to produce large-scale resources useful for practical NLP tasks. However, the application of such techniques to biomedical texts has been limited because many existing techniques require adaptation before they can perform optimally in this linguistically challenging domain.



In this project, we first investigated sub-domain variation in biomedicine and discovered that its sub-domains vary considerably in their lexical, syntactic and semantic properties. This reinforced the need to develop adaptive (i.e. not just domain-adapted) technology for lexical acquisition and also highlighted the need for techniques that are fairly robust to data sparsity and do not depend crucially on the availability of relevant annotated training sets.



We then took existing techniques capable of acquiring basic syntactic-semantic information for verbs from corpus data and investigated their limitations when applied to the biomedical domain. We focussed mainly on verbal (i) subcategorization frames, (ii) selectional preferences, and (iii) lexical-semantic classes. These types of information, when tailored to the domain in question, can aid key NLP tasks such as parsing, anaphora resolution, Information Extraction (IE), and question-answering (QA). Building on our pilot studies and expanding on the adaptive, state-of-the-art text processing tools available to us, we developed novel unsupervised and minimally supervised acquisition techniques based on, for example, Bayesian graphical models, tensor factorization and active learning.



We evaluated and demonstrated the capabilities of our techniques directly and in the context of practical tasks in biomedicine such as text mining, literature review, extractive summarization and generation of research hypotheses in cancer research. These evaluations yielded promising results and led to several conference and journal publications. We also used the best of our techniques to acquire lexicons from biomedical texts and made them freely available to the research community. Finally, realising that research on minimally supervised learning requires more encouragement in the research community, we organized workshops on this topic in conjunction with two major conferences, EMNLP 2010 and EACL 2011.



In sum, our project (i) advanced BIO-NLP and improved its usefulness for practical tasks in biomedicine, (ii) advanced NLP by improving the accuracy, robustness and portability of lexical acquisition to real-world tasks, and (iii) provided an important large-scale study of domain-adaptation in the critical area of lexical acquisition.
Exploitation Route The technology can benefit health-related organizations and biomedical industries requiring access to information in published literature. This research improves NLP techniques for processing biomedical texts. The techniques benefit important tasks and applications in biomedicine that involve accessing, classifying and reviewing information in scientific articles. Examples include literature curation and literature-based research, e.g. in academia, health-related organizations (clinical and non-clinical), and biomedical industries.
Sectors Digital/Communication/Information Technologies (including Software), Healthcare, Pharmaceuticals and Medical Biotechnology

 
Description MRC methodology grant, standard responsive mode
Amount £399,229 (GBP)
Funding ID MR/M013049/1 
Organisation Medical Research Council (MRC) 
Sector Public
Country United Kingdom
Start 06/2015 
End 05/2018
 
Description Swedish Research Council
Amount £150,000 (GBP)
Funding ID Swedish Research Council 2009-19295-70061-32 
Organisation Karolinska Institute 
Sector Academic/University
Country Sweden
Start  
 
Description Text Mining for Improved Cancer Risk Assessment
Amount £150,000 (GBP)
Organisation Government of Sweden 
Sector Public
Country Sweden
Start 01/2010 
End 12/2012