Language Processing for Literature Based Discovery in Medicine

Lead Research Organisation: University of Sheffield
Department Name: Computer Science

Abstract

The amount of published material in biomedicine has been growing exponentially in recent years, particularly in highly productive areas such as genomics. The knowledge it contains is now so vast that it is no longer possible for any individual or research group to keep up with the advances relevant to their area. The research literature is also fragmented: researchers naturally concentrate on their own area of expertise, so they may miss relevant research that appears outside the literature of their own scientific discipline. However, medical research is becoming increasingly interdisciplinary, with progress being made by combining outputs from various fields.

Hidden knowledge occurs when a connection can be inferred by combining information from multiple documents, but that connection has not been noticed. Literature Based Discovery (LBD) provides tools that analyse the research literature to identify hidden knowledge automatically. Connections it has been used to identify include treatments for diseases (e.g. that fish oil can be used to treat Raynaud's syndrome) and causes of diseases (e.g. that migraines can be related to magnesium deficiency). Despite these successes, the knowledge that has been discovered has been limited by the relatively simple techniques used to analyse the research literature.
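The idea of inferring a connection across documents can be illustrated with a minimal sketch. It assumes the simple co-occurrence approach often described in the LBD literature as the ABC model: if A appears with B in one document and B appears with C in another, but A and C never appear together, then A and C are a candidate hidden connection. The toy documents, concept names and function names below are illustrative assumptions, not the project's data or method.

```python
from collections import defaultdict
from itertools import combinations

# Toy "documents": each is the set of concepts it mentions (illustrative only).
DOCS = [
    {"fish oil", "blood viscosity"},
    {"fish oil", "platelet aggregation"},
    {"blood viscosity", "raynaud's syndrome"},
    {"platelet aggregation", "raynaud's syndrome"},
    {"magnesium", "vascular tone"},
]


def cooccurrence(docs):
    """Map each concept to the set of concepts it shares a document with."""
    links = defaultdict(set)
    for doc in docs:
        for a, b in combinations(sorted(doc), 2):
            links[a].add(b)
            links[b].add(a)
    return links


def hidden_connections(docs, source):
    """Return {target: linking concepts} for targets never seen with source."""
    links = cooccurrence(docs)
    direct = links[source]
    found = defaultdict(set)
    for middle in direct:
        for target in links[middle]:
            # Keep only targets that are not already directly connected.
            if target != source and target not in direct:
                found[target].add(middle)
    return dict(found)


result = hidden_connections(DOCS, "fish oil")
print(result)
```

In this toy collection no single document mentions both fish oil and Raynaud's syndrome, yet the two are linked through the intermediate concepts they each co-occur with. Plain co-occurrence of this kind is an example of the relatively simple analysis that the paragraph above describes as a limitation.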

This project will develop new approaches to LBD by applying recent advances in the automatic processing of biomedical literature. This analysis will provide an LBD system with more detailed and accurate information about this literature than has previously been possible. In particular, the project will make use of two language processing technologies, Information Extraction and Word Sense Disambiguation, which can now be applied to the biomedical literature on a large scale. Information Extraction will be used to identify connections between items mentioned in documents and will provide more accurate analysis than the simple techniques used by previous LBD systems. Word Sense Disambiguation will be used to avoid the problems caused by polysemy and synonymy (which lead, respectively, to spurious connections being suggested and genuine connections being missed) and which can adversely affect LBD performance.
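The effect of polysemy and synonymy can be shown with a small sketch. It assumes that a disambiguation step has already attached a concept identifier to each mention, and compares linking mentions by their surface strings with linking them by those identifiers. The mentions, the concept labels (which are made up and are not real UMLS identifiers) and the function name are hypothetical.

```python
from collections import defaultdict

# Each mention: (document id, surface string, concept label a WSD step might assign).
# The concept labels are invented for illustration, not real UMLS identifiers.
MENTIONS = [
    ("d1", "cold", "COMMON_COLD"),
    ("d1", "zinc", "ZINC"),
    ("d2", "cold", "LOW_TEMPERATURE"),
    ("d2", "raynaud's syndrome", "RAYNAUD"),
    ("d3", "raynaud's phenomenon", "RAYNAUD"),
    ("d3", "fish oil", "FISH_OIL"),
]

SURFACE, CONCEPT = 1, 2  # which field of a mention to use as the linking key


def neighbours(mentions, key_index, key):
    """Return the keys that share at least one document with `key`."""
    by_doc = defaultdict(set)
    for mention in mentions:
        by_doc[mention[0]].add(mention[key_index])
    found = set()
    for keys in by_doc.values():
        if key in keys:
            found |= keys - {key}
    return found


# Polysemy: the string "cold" merges two senses and bridges unrelated documents.
surface_cold = neighbours(MENTIONS, SURFACE, "cold")
concept_cold = neighbours(MENTIONS, CONCEPT, "COMMON_COLD")

# Synonymy: two strings for one concept hide the document that mentions fish oil.
surface_raynaud = neighbours(MENTIONS, SURFACE, "raynaud's syndrome")
concept_raynaud = neighbours(MENTIONS, CONCEPT, "RAYNAUD")
```

Keyed by surface string, "cold" connects zinc to Raynaud's syndrome, which is a spurious link, while "raynaud's syndrome" never meets fish oil because that document uses a different name. Keyed by concept label, the spurious link disappears and the missed one is recovered.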

The project will implement an LBD system and test it on two domains: oncology and neuroscience. The effectiveness of the system will be judged by researchers working in these areas with interests in melanoma and Parkinson's disease.

Planned Impact

This project has the potential to improve the quality of life, health and wellbeing of a significant portion of society through the development of novel treatments and therapies for important diseases. In particular, the project will investigate the causes of and treatments for cancer and Parkinson's Disease, both of which affect large portions of society. Cancer has been estimated to cause around 13% of human deaths worldwide, while Parkinson's Disease affects around 4% of those over 80 years of age.

There is also an economic benefit to providing treatments for these diseases; the annual cost of Parkinson's Disease to the UK has been estimated to be between 449 million and 3.3 billion pounds [1].

The project will also improve the capability of groups carrying out medical research, including the NHS, university departments and independent research institutes. The systems developed will allow these groups to access medical literature more effectively and to identify the hidden knowledge it contains. More generally, the techniques could benefit any organisation that carries out automatic analysis of large bodies of text. Examples of such organisations include intelligence agencies, internet search engines, marketing companies and the police.

The project will enhance the UK's position as a leader in the language technology industry. This industry has been estimated to be worth 8.4 billion Euros and to be growing at a rate of 10% per year [2]. It has also been estimated that the growth rate for companies that focus on particular industries, such as the life sciences, exceeds the industry average [3]. The technologies developed in this project will improve the automatic processing of documents in the life sciences.

[1] L. Findley (2007) "The Economic Impact of Parkinson's Disease" Parkinsonism and Related Disorders 5(6):525-535
[2] http://www.euractiv.com/en/culture/eu-language-industry-worth-84bn-euros/article-187814
[3] S. Grimes (2002) "Text Technologies in the Mainstream" report at Text Analytics Summit 2008

Publications

McInnes BT (2014) Determining the difficulty of Word Sense Disambiguation. in Journal of biomedical informatics

Preiss J (2015) Exploring relation types for literature-based discovery. in Journal of the American Medical Informatics Association : JAMIA

Preiss J (2016) The effect of word sense disambiguation accuracy on literature based discovery. in BMC medical informatics and decision making

Preiss J (2013) DALE: A Word Sense Disambiguation System for Biomedical Documents Trained using Automatically Labeled Examples. in Proceedings of the 2013 NAACL HLT Demonstration Session

Roller R (2013) Applying UMLS for Distantly Supervised Relation Detection. in Proceedings of the Fifth International Workshop on Health Text Mining and Information Analysis

 
Description: The project developed novel techniques to explore the information contained in large collections of documents by carrying out data mining to generate new hypotheses about potential connections. The approaches developed in the project can be applied to larger document collections than those that were previously available.
Exploitation Route: The techniques developed in the project could be used to extract potentially useful information from any large collection of documents. The most obvious area of application is the field of medicine (i.e. the area explored in the project), although there are many others. The techniques could be applied in collaboration with researchers who can see the potential of the technology to assist them in exploring research literature and reducing the effort required to generate novel hypotheses from it. The techniques are general and could also be applied to other areas in which large amounts of information are available in document collections.
Sectors: Aerospace, Defence and Marine; Digital/Communication/Information Technologies (including Software); Healthcare; Pharmaceuticals and Medical Biotechnology; Security and Diplomacy

 
Description: Defence Growth Partnership (DGP) Innovation Challenge
Amount: £66,584 (GBP)
Organisation: Defence Science & Technology Laboratory (DSTL)
Sector: Public
Country: United Kingdom
Start: 01/2016
End: 06/2016
 
Description: Invited talk at workshop
Form Of Engagement Activity: A talk or presentation
Part Of Official Scheme?: No
Geographic Reach: International
Primary Audience: Professional Practitioners
Results and Impact: Delivered an invited talk at the Hypothesis Generating in Genetics and Biomedical Text Mining workshop in Lancaster on 8 January 2019. A range of participants attended, including high-profile academics from Germany and the USA. A number of participants commented that the material I presented led them to change their approach to large-scale biomedical knowledge discovery.
Year(s) Of Engagement Activity: 2019
URL: http://wp.lancs.ac.uk/btm/hg2btm/