Literature-based discovery for cancer biology

Lead Research Organisation: University of Cambridge
Department Name: Linguistics

Abstract

Over the past decades, the volume of published science has increased dramatically, particularly in rapidly developing areas such as biomedicine. PubMed (the US National Library of Medicine's literature service) provides access to more than 23M citations, adding thousands of records daily. It is now impossible for scientists to read all the literature relevant to their field, let alone adjacent fields. As a consequence, critical hypothesis-generating evidence is often discovered long after it was first published, leading to wasted research time and resources. This hinders progress on fundamental problems such as understanding the mechanisms underlying diseases and developing the means for their effective treatment and prevention.

Automated Literature Based Discovery (LBD) aims to address this problem. It generates new knowledge by combining what is already known in the literature. By facilitating large-scale hypothesis testing and generation from huge collections of literature, LBD could significantly support scientific research. It has been used to identify new connections between, for example, genes, drugs and diseases in texts, and it has resulted in new scientific discoveries (e.g. identification of candidate genes and treatments for illnesses). However, because it is based on fairly shallow techniques (e.g. dictionary matching), current LBD captures only some of the information available in the literature.

Enabling automatic analysis of biomedical texts, Text Mining (TM) could open the doors to much deeper, wider coverage and dynamic LBD better capable of evolving with the development of science. The last decade has seen massive application of TM to biomedicine and has resulted in tools supporting important tasks such as literature curation and the development of semantic databases. Although TM could similarly support LBD, little work exists in this area. Extending recent developments in adaptive Natural Language Processing (NLP) and TM, we will develop improved methodology for identifying concepts, events and relations in diverse biomedical texts. We will also introduce novel, improved methodology for knowledge discovery which uses link prediction for high quality LBD in the complex network of concepts resulting from TM. Link prediction can optimally exploit the rich information generated by TM, can improve the accuracy of LBD and can yield output which is more useful for scientists.

To evaluate and demonstrate the benefits of the resulting approach, we will initially target this methodology to the literature-intensive, interdisciplinary area of cancer biology. We will develop an LBD tool in close collaboration with cancer researchers and will evaluate the tool by using it to conduct case studies which investigate current research problems in cancer biology. The most promising findings will be evaluated and validated via laboratory experiments.

All the data, resources, results and technology resulting from this research will be made freely available.
We expect our project (i) to improve the capacity of LBD so that it can, in the future, support scientific discovery in a manner similar to widely employed retrieval and sequencing tools, (ii) to improve the adaptability and portability of TM and LBD, (iii) to produce the first dedicated LBD tool for cancer biology, and (iv) to provide an important case study on the integration of advanced TM- and Data Mining (DM)-based LBD in real-life biomedical research.

Technical Summary

We will improve automated LBD so that it can better support biomedical research. Our idea is to replace current, fairly shallow LBD methodology (e.g. dictionary matching) by methodology capable of deeper, wider and more dynamic knowledge discovery. We will base this methodology on advanced Text Mining (TM) and Data Mining (DM). Application of TM to LBD is challenged by the fact that real-life knowledge discovery integrates knowledge from different areas of biomedicine, while most current TM is optimised to perform well in a clearly defined area. Extending recent advances in Natural Language Processing (NLP) and TM, we will develop adaptive, minimally supervised techniques for identification of concepts, events and relations in diverse biomedical texts. We will formulate our models as probabilistic graphical models and will optimise their performance via use of joint learning and inference of closely related tasks and via guiding learning across tasks and domains with easily obtainable expert (e.g. shared task, domain) knowledge. We will also introduce novel, improved methodology for knowledge discovery which uses link prediction for high quality LBD in the complex network of concepts resulting from TM. Link prediction can optimally exploit the rich information generated by TM, improve the accuracy of LBD and yield more useful output. We will initially target this methodology to the literature-intensive, interdisciplinary area of cancer biology. We will develop an LBD tool in close collaboration with cancer researchers and will evaluate the tool by using it to conduct case studies in cancer biology. Findings will be validated via laboratory experiments. All the outputs from this project will be made freely available and access to them will be maximised by following the best international practices on data standards, management, collection and sharing (e.g. as defined by Open Annotation Data Model, ELIXIR).
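The idea of link prediction over a concept network can be illustrated with a minimal sketch. The summary does not say which link predictor the project uses, so the Adamic-Adar score below is an assumption chosen only because it is simple. The toy edge list is hypothetical and echoes a classic LBD textbook example, not any output of this project. The function names `build_graph`, `adamic_adar` and `predict_links` are likewise illustrative.

```python
import math
from collections import defaultdict

# Hypothetical toy network: an edge means two concepts co-occur in the
# literature. In a real system these edges would come from text mining.
EDGES = [
    ("fish_oil", "blood_viscosity"),
    ("fish_oil", "platelet_aggregation"),
    ("fish_oil", "triglycerides"),
    ("blood_viscosity", "raynaud_disease"),
    ("platelet_aggregation", "raynaud_disease"),
    ("aspirin", "platelet_aggregation"),
]


def build_graph(edges):
    """Build an undirected adjacency map from (concept, concept) pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def adamic_adar(graph, u, v):
    """Sum 1/log(degree) over the neighbours shared by u and v.

    Rare shared neighbours count more than very common ones.
    """
    score = 0.0
    for w in graph[u] & graph[v]:
        degree = len(graph[w])
        if degree > 1:  # guard against log(1) == 0
            score += 1.0 / math.log(degree)
    return score


def predict_links(graph, source):
    """Rank concepts not yet linked to `source` by Adamic-Adar score."""
    ranked = []
    for candidate in sorted(graph):
        if candidate == source or candidate in graph[source]:
            continue  # skip the source itself and already-known links
        score = adamic_adar(graph, source, candidate)
        if score > 0:
            ranked.append((candidate, score))
    # Highest score first; ties broken alphabetically for determinism.
    ranked.sort(key=lambda item: (-item[1], item[0]))
    return ranked


if __name__ == "__main__":
    graph = build_graph(EDGES)
    for concept, score in predict_links(graph, "fish_oil"):
        print(f"{concept}\t{score:.3f}")
```

In this toy network, `raynaud_disease` is not directly linked to `fish_oil` but shares two intermediate concepts with it, so it is ranked above `aspirin`, which shares only one. A realistic system would score candidates over a far larger network and would likely use richer features than shared neighbours alone.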

Planned Impact

Vast amounts of new information are generated daily. Much of this information is in the form of written text, and a significant proportion of it has scientific, economic and/or societal value. Text Mining (TM) is needed to access this information and to fully exploit its potential. According to recent surveys (D. McDonald and U. Kelly. 2012. "Value and benefits of text mining". JISC; I. Hargreaves. 2011. "Digital Opportunity: A Review of Intellectual Property and Growth". UK Intellectual Property Office), TM has the potential to yield significant benefits in key areas of society, including science, knowledge infrastructure, health and the economy. Our project will advance basic TM so that it becomes more adaptive and portable, and thus capable of supporting a wider range of real-life applications.

In this project, we will use adaptive TM to support LBD. LBD aims to generate new knowledge from existing knowledge in literature. It is particularly interesting for scientific research. Our technology can improve the efficiency of scientific research, reduce research costs and enable scientists to spend more of their time on intellectually challenging tasks.
We will demonstrate the usefulness of our research for biomedicine, a literature-intensive area with high societal and economic value. Our TM and LBD focus on key biomedical concepts which are relevant for many sub-domains of biomedicine and their application areas. Given the dramatic rise of health problems and the resulting economic burden, technologies supporting biomedicine are much needed. We target our tool specifically to cancer biology. Scientific discoveries in this area can lead to improved understanding of the underlying mechanisms of cancer development and result in more effective ways to prevent and treat cancer. This is a significant benefit, considering cancer is one of the leading causes of death globally, and the one which has the most devastating economic impact.

During the lifetime of the project, we will focus on demonstrating the usefulness of LBD for scientific research in cancer biology and will develop a dedicated research tool for this community. As cancer biology is also of interest to industrial R&D sectors (e.g. biotechnology, pharmaceuticals) and public health, we plan to engage with these communities during the project via events and our personal contacts in order to raise wider awareness of our technology.

Drug discovery is one important industrial application area, as there are over 600 specialist suppliers in the UK supporting companies that develop and market medicines. In the genetic cancer epidemiology field, genome-wide association study (GWAS) has been expensively applied for comprehensive understanding of cancer predisposition to facilitate drug development, early diagnosis of cancer and its prevention, and choice of optimal therapy indication. LBD could provide an excellent alternative and complementary approach to GWAS, and thus potentially contribute to enhancing public health. Another highly relevant application area related to public health is cancer risk assessment of chemicals. With recent European legislation related to chemical substances and environmental protection, there is an increasing need for risk assessment throughout Europe (with worldwide knock-on effects). Not only academics but also industry, government agencies and international health organisations (e.g. WHO) are involved in this. Our previous project (see Case for Support, 2.1. People and Track Record) produced a tool specifically aimed at supporting literature review in cancer risk assessment. The tool we are going to develop in this project can be used to test and discover specific hypotheses related to how chemicals cause cancer.

Publications

 
Title Basic infrastructure for text mining technology for the needs of cancer risk assessment 
Description A taxonomy which captures the scientific evidence needed for cancer risk assessment, over 1000 MEDLINE abstracts annotated according to the taxonomy, and text mining technology for automatic classification of MEDLINE abstracts to taxonomy classes. 
Type Of Material Improvements to research infrastructure 
Provided To Others? No  
Impact The most immediate impact so far is the publications, which act as a proof of concept for the idea of this innovative project. We hope that these publications will make it easier to obtain funding for a larger project aimed at developing a fully functional, publicly available text mining tool for the risk assessment community.
 
Title Cancer Hallmarks Analytics Tool (CHAT) 
Description CHAT is a research tool based on an extensive Hallmarks of Cancer taxonomy and automatic text mining methodology. It is capable of retrieving and organizing millions of cancer-related references from PubMed into the taxonomy. The correlations identified by the tool show that it offers great potential to organize and correctly classify cancer-related literature. Cancer researchers can use the tool for many purposes, e.g. to identify hallmarks associated with extrinsic factors, biomarkers and therapeutic targets.
Type Of Material Improvements to research infrastructure 
Year Produced 2017 
Provided To Others? Yes  
Impact The tool has been published only recently. In our publication we demonstrated the usefulness of the tool for case studies in cancer research. 
URL http://chat.lionproject.net
 
Title LION - a literature-based discovery system for cancer biology 
Description LION, a literature-based discovery (LBD) system, enables researchers to navigate published information and supports hypothesis generation and testing. The system is built with a particular focus on the molecular biology of cancer using state-of-the-art machine learning and natural language processing methods, including named entity recognition and grounding to domain ontologies covering a wide range of entity types and a novel approach to detecting references to the hallmarks of cancer in text. LION LBD implements a broad selection of co-occurrence-based metrics for analyzing the strength of entity associations. Its design allows real-time search to discover indirect associations between entities in a database of tens of millions of publications, while preserving the ability of users to explore each mention in its original context in the literature.
Type Of Material Improvements to research infrastructure 
Year Produced 2017 
Provided To Others? Yes  
Impact This system has been made available to the research community only very recently. Our own evaluation demonstrates its ability to identify undiscovered links and rank relevant concepts highly among potential connections in cancer research literature. 
URL http://lbd.lionproject.net
 
Title LION LBD tool 
Description The system, called LION LBD and developed by computer scientists and cancer researchers at the University of Cambridge, has been designed to assist scientists in the search for cancer-related discoveries. It is the first literature-based discovery system aimed at supporting cancer research. The results are reported in the journal Bioinformatics. 
Type Of Material Improvements to research infrastructure 
Year Produced 2018 
Provided To Others? Yes  
Impact LION LBD is the first system developed specifically for the needs of cancer research. It has a particular focus on the molecular biology of cancer and uses state-of-the-art machine learning and natural language processing techniques to detect references to the hallmarks of cancer in text. Evaluations of the system have demonstrated its ability to identify undiscovered links and to rank relevant concepts highly among potential connections.
URL http://lbd.lionproject.net
 
Title Bio-SimVerb 
Description Evaluation dataset: the samples (words) in Bio-SimVerb (verbs) and Bio-SimLex (nouns) are collected from a pre-processed PubMed Central Open Access subset (PMC). POS tags and tokens in this resource are generated using the BLLIP constituency parser, trained on a biomedical corpus. The resource covers over 1.4M full articles with more than 388M parsed sentences. Details can be found in Section 3 of the paper: Bio-SimVerb and Bio-SimLex: wide-coverage evaluation sets of word similarity in biomedicine
Type Of Material Database/Collection of data 
Year Produced 2018 
Provided To Others? Yes  
 
Title Code supporting: A Neural Network Multi-Task Learning Approach to Biomedical Named Entity Recognition
Description Code for the single-task and multi-task models described in the paper: A Neural Network Multi-Task Learning Approach to Biomedical Named Entity Recognition.
Type Of Technology Software 
Year Produced 2017 
 
Description Press release and interview for national news 
Form Of Engagement Activity A press release, press conference or response to a media enquiry/interview
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Media (as a channel to the public)
Results and Impact The university issued a press release about the LION LBD system (the end product of the project). It was picked up by various international news channels and by a local TV station.
Year(s) Of Engagement Activity 2018
URL https://www.cam.ac.uk/research/news/ai-system-may-accelerate-search-for-cancer-discoveries