Lexical Disambiguation for the Biomedical Domain

Lead Research Organisation: University of Sheffield
Department Name: Computer Science

Abstract

In biomedicine the amount of published material has been growing exponentially in recent years, particularly in very productive areas such as genomics. Managing the information derived from this material poses a problem for researchers, who cannot cope with this volume of data and find it increasingly difficult to locate the information they need for their research. Automatic processing of these documents would provide the means to access the information they contain efficiently, by building tools to search for documents and identify facts within them. However, this is made difficult by the fact that texts in the biomedical domain, like those on other topics, contain a range of ambiguities.

This research project aims to develop tools and algorithms to resolve lexical ambiguity in the biomedical domain. We will apply novel unsupervised word sense disambiguation methods to three distinct forms of lexical ambiguity. The problem of obtaining adequate amounts of training data for these approaches will be addressed by adapting techniques for automatically generating disambiguated sentences using information from a domain ontology and unannotated corpora. Finally, the resulting disambiguation systems will be integrated into Termino, a publicly available terminology recognition tool, to improve its functionality.

The three forms of biomedical lexical ambiguity that we will focus on in this project are the following:

(i) Terms which refer to multiple concepts. The phenomenon of polysemy, where a word may have several possible meanings, occurs in all texts. A special consideration when resolving this form of ambiguity is that words may be used with meanings that are unlikely to occur in text from other domains. For example, "cold" has six possible meanings in the Unified Medical Language System (UMLS) Metathesaurus, including well-known meanings like "common cold" and "cold sensation", but there are also domain-specific usages, for example "Chronic Obstructive Airway Disease (COLD)".

(ii) Abbreviations with more than one possible expansion. Abbreviations are used frequently in biomedical text, but a single abbreviation may have several expanded forms. It has been reported that abbreviations in MEDLINE consisting of six characters or fewer have an average of 4.61 possible meanings, making it difficult to interpret these texts automatically.

(iii) Systematic relations between the possible meanings of a set of terms. This form of ambiguity, which is particularly common in genomics literature, occurs when the same term can refer to a gene, a protein or mRNA. This is a form of regular polysemy, since the meanings are related (the gene produces mRNA and a protein). Since the same term can refer to a variety of compounds, it can be difficult to determine which one is meant in a particular usage.

Resolving these various forms of lexical ambiguity is critical for the automatic processing of biomedical texts.
