Automated Biological Event Extraction from the Literature for Drug Discovery

Lead Research Organisation: University of Manchester
Department Name: Computer Science

Abstract

The development of new drugs is both expensive and time-consuming: it can take over a decade for a new drug to be proven effective and safe, even with the many advances we have seen in the life sciences. From a batch of promising early candidates, only a few will eventually be approved. The longer a candidate lasts before being found unusable (attrition), the greater the cost, especially if clinical trials have been involved. Attrition rates run at ca. 90%, and attrition is thus ruinously costly to the pharmaceutical industry, so there is an urgent need to reduce its impact. UK researchers, leading in biological and pharmaceutical research, would benefit greatly from means to identify as early as possible drug candidates that are likely to fail, preferably long before the clinical stage is reached. Another current area of concern is how drugs may be targeted to groups of individuals: not every individual responds in the same way to the same drug. If we can discover which genes are implicated in this, then we can hope both to focus on the more promising drug candidates and to find ways of tailoring treatments to (groups of) individuals. Unfortunately, however, scientists are faced with a severe knowledge gap: no scientist can keep up, using traditional means, with the vast amount of experimental data and especially its massive associated literature that is being (and has been) generated in the life sciences. Moreover, much knowledge is hidden in the literature: it has been shown that entirely new knowledge has been available for discovery in the literature, often for many years, but that the vastness of the literature has prevented researchers from achieving the required level of information retrieval, which is the first step in linking and synthesising it into new, previously unsuspected knowledge.
The main target of information finding is the MEDLINE resource, which currently contains some 17 million abstracts: this is seemingly large but is nevertheless a fraction of the information and hidden knowledge contained in the associated full-text scientific articles. The proposed project is designed to help scientists overcome this knowledge gap by developing automatic means to filter information and to synthesise new knowledge from the scientific literature. As a direct link between one or more proteins and a physiological or pathophysiological process is not always described explicitly in a text, we must hunt for indirect evidence. This involves looking for indications of biological processes that are associated with proteins. When writing, biologists essentially describe 'events' such as phosphorylation that are involved in higher-order bioprocesses such as angiogenesis. By identifying and extracting such events, and the particular biological entities involved (proteins, diseases), we can collect many fragments of information about bioprocesses from many thousands of texts. These fragments can then be used to find new knowledge by establishing associations among them. To achieve such extraction of fragments for knowledge finding, powerful semantic text mining techniques are required that can handle the special languages of biologists and that can achieve appropriate levels of abstraction far beyond mere word search. This project will customise the generic tools of the National Centre for Text Mining (NaCTeM) and carry out research to find the best ways of extracting events concerning biological processes from the literature. AstraZeneca will be closely involved, both in informing the research and in providing practical domain expertise, requirements, data and concrete evaluation scenarios. Their interest is also manifest in a substantial cash contribution to the project.
The result of this programme will be a text mining service for academic researchers, offered by NaCTeM, supporting them in their task of discovering protein-bioprocess associations from the literature.

Technical Summary

In establishing drug target confidence, it is essential to have evidence of the type of relationship between the target and key protein-bioprocesses. However, the primary starting point for target choice, and the context for interpretation of all pre-clinical observations, is the literature. Text mining (TM) is ideally suited to support the discovery of reliable drug targets. But for TM systems to help researchers understand the role proteins play in biological processes, they have to extract, normalise and identify the context of complex relationships between genes, diseases and their underlying bioprocesses. Our TM techniques will recognise diverse surface forms in text describing bioprocesses and will link them with events and the proteins associated with them. Our methods are based on a combination of advanced semantic text mining (deep parsing, named entity recognition) and machine learning techniques, as we shall automatically identify events (involving proteins) such as decrease [in concentration], phosphorylation, ubiquitination, etc. Bioprocesses such as angiogenesis are composed of individual events described in the literature. We propose to identify these bioprocesses automatically and to link them with the associated events. A combination of kernel methods with knowledge resources and annotated texts (evaluated by biologists) will be used to learn automatically which events are linked with the bioprocesses underlying higher-level processes. We shall concentrate on angiogenesis as an example. We shall thereby produce and make available a text mining service for researchers working in drug discovery. Both the software tools used for event extraction and the annotated texts used for training purposes will be made available. Co-funded by EPSRC under the RCUK Cross-Council Funding Agreement.

Publications

 
Description We enabled biologists working in drug discovery to extract information automatically from the literature.
a) we customised deep semantic text mining techniques to extract protein-biological process associations automatically;
b) we extracted biological events pertaining to protein-disease associations automatically from the literature;
c) we supported the semi-automatic production of annotated texts pertaining to biological information for text mining applications;
d) we automatically identified bioprocesses linked with protein-disease events;
e) we produced a text mining service supporting biologists researching protein-bioprocesses in the vast amount of literature.
Exploitation Route Our resources http://www.nactem.ac.uk/MLEE/ and http://www.nactem.ac.uk/anatomy/ are widely used by other teams working in drug discovery using text mining.
Recognisers for anatomy http://nactem.ac.uk/anatomytagger/ have been used by the EU PubMedCentral project to develop search services for biosciences. http://www.nactem.ac.uk/EvidenceFinderAnatomyMK/
All resources and trained models are open under a BY-SA licence.
Sectors Chemicals,Healthcare,Manufacturing, including Industrial Biotechnology,Pharmaceuticals and Medical Biotechnology

URL http://www.nactem.ac.uk/az/
 
Description The development of new drugs is both expensive and time-consuming: it can take over a decade for a new drug to be proven effective and safe, even with the many advances we have seen in the life sciences. From a batch of promising early candidates, only a few will eventually be approved. The longer a candidate lasts before being found unusable (attrition), the greater the cost, especially if clinical trials have been involved. Attrition rates run at ca. 90%, and attrition is thus ruinously costly to the pharmaceutical industry, so there is an urgent need to reduce its impact. UK researchers, leading in biological and pharmaceutical research, would benefit greatly from means to identify as early as possible drug candidates that are likely to fail, preferably long before the clinical stage is reached. Another current area of concern is how drugs may be targeted to groups of individuals: not every individual responds in the same way to the same drug. If we can discover which genes are implicated in this, then we can hope both to focus on the more promising drug candidates and to find ways of tailoring treatments to (groups of) individuals. Unfortunately, however, scientists are faced with a severe knowledge gap: no scientist can keep up, using traditional means, with the vast amount of experimental data and especially its massive associated literature that is being (and has been) generated in the life sciences. Moreover, much knowledge is hidden in the literature: it has been shown that entirely new knowledge has been available for discovery in the literature, often for many years, but that the vastness of the literature has prevented researchers from achieving the required level of information retrieval, which is the first step in linking and synthesising it into new, previously unsuspected knowledge.
The main target of information finding is the MEDLINE resource, which currently contains some 25 million abstracts: this is seemingly large but is nevertheless a fraction of the information and hidden knowledge contained in the associated full-text scientific articles. The project was designed to help scientists overcome this knowledge gap by developing automatic means to filter information and to synthesise new knowledge from the scientific literature. As a direct link between one or more proteins and a physiological or pathophysiological process is not always described explicitly in a text, we must hunt for indirect evidence. This involves looking for indications of biological processes that are associated with proteins. When writing, biologists essentially describe 'events' such as phosphorylation that are involved in higher-order bioprocesses such as angiogenesis. By identifying and extracting such events, and the particular biological entities involved (proteins, diseases), we can collect many fragments of information about bioprocesses from many thousands of texts. These fragments can then be used to find new knowledge by establishing associations among them. To achieve such extraction of fragments for knowledge finding, powerful semantic text mining techniques are required that can handle the special languages of biologists and that can achieve appropriate levels of abstraction far beyond mere word search. This project customised the generic tools of the National Centre for Text Mining (NaCTeM) and carried out research to find the best ways of extracting events concerning biological processes from the literature. AstraZeneca was closely involved, both in informing the research and in providing practical domain expertise, requirements, data and concrete evaluation scenarios.
The result of this programme was a text mining service for academic researchers, offered by NaCTeM, supporting them in their task of discovering protein-bioprocess associations from the literature.
First Year Of Impact 2011
Sector Healthcare,Manufacturing, including Industrial Biotechnology,Pharmaceuticals and Medical Biotechnology
Impact Types Economic

 
Description Copyright and Licensing in relation to Text and Data Mining
Geographic Reach Multiple continents/international 
Policy Influence Type Contribution to a national consultation/review
Impact The National Centre for Text Mining played a leading role in advising on policy and on the development of UK legislation regarding a copyright exception in relation to text mining. Contributions included talks at events at the Houses of Parliament, the European Parliament and the London School of Economics, and participation in consultations by the IPO and the EC (on the wider issue of copyright and licensing in the EU). Advice was also given on numerous occasions at the request of the IPO during development of the legislation, which came into force on 1st June 2014. It is somewhat too early to ascertain the full impact. However, the legislation has already led to major initiatives such as Europe PubMed Central being able to lawfully text mine full papers, as well as to increased levels of text mining within bodies such as the British Library and within institutional repositories. It has also increased the scope and expected impact of research projects, as these can for the first time tackle large-scale text mining of lawfully subscribed full-text articles in addition to open access material.
URL http://www.jisc.ac.uk/sites/default/files/value-text-mining.pdf
 
Description Automated Measurement and Analysis of Open Source Software
Amount € 540,000 (EUR)
Funding ID 318736 
Organisation European Commission 
Department Seventh Framework Programme (FP7)
Sector Public
Country European Union (EU)
Start 10/2012 
End 04/2015
 
Description Big Science Mechanism
Amount £678,153 (GBP)
Funding ID W911NF-14-1-0333 
Organisation Defense Advanced Research Projects Agency (DARPA) 
Sector Public
Country United States
Start 11/2014 
End 05/2017
 
Description Digging into Data Challenge
Amount £99,000 (GBP)
Funding ID N/A 
Organisation Jisc 
Sector Public
Country United Kingdom
Start 04/2014 
End 07/2015
 
Description Digging into Data Challenge
Amount £99,000 (GBP)
Funding ID N/A 
Organisation Jisc 
Sector Public
Country United Kingdom
Start 01/2012 
End 12/2013
 
Description EuropePubMedCentral
Amount £680,000 (GBP)
Funding ID N/A 
Organisation Wellcome Trust 
Department KEMRI-Wellcome Trust Research Programme
Sector Academic/University
Country Kenya
Start 03/2008 
End 12/2015
 
Title Argo for Biodiversity 
Description Argo is an interoperable infrastructure for building and running text-analysis solutions. It facilitates the development of custom text mining workflows from a selection of text mining components. We have augmented Argo to include biodiversity text mining tools. 
Type Of Material Improvements to research infrastructure 
Year Produced 2017 
Provided To Others? Yes  
Impact Supports the curation of databases and user collaboration, includes numerous processing components (including third-party ones), and allows the creation of text mining workflows. Includes text mining tools for biodiversity. 
URL http://argo.nactem.ac.uk
 
Title EventMine 
Description EventMine is a machine learning-based pipeline system, which extracts events from documents that already contain named entity annotations (e.g., genes/proteins, etc.). Given appropriate training data, it can be trained to extract many different types and structures of events. 
Type Of Material Improvements to research infrastructure 
Year Produced 2012 
Provided To Others? Yes  
Impact Community shared tasks; other research teams improved results; customised to different domains and application areas; part of the Argo text mining platform http://argo.nactem.ac.uk 
URL http://www.nactem.ac.uk/EventMine/
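
To illustrate the kind of structure an event extraction pipeline of this sort produces, the sketch below models an event as a typed trigger word with role-labelled arguments, where an argument may be an entity or another event. The class names, role labels, type names and example sentence are assumptions made for illustration; they are not EventMine's actual data model or API.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union


@dataclass(frozen=True)
class Entity:
    """A text span with a type, e.g. a protein mention or an event trigger."""
    id: str
    type: str
    start: int
    end: int
    text: str


@dataclass
class Event:
    """A typed trigger plus role-labelled arguments (entities or nested events)."""
    id: str
    type: str
    trigger: Entity
    args: List[Tuple[str, Union[Entity, "Event"]]] = field(default_factory=list)


def mention(text: str, word: str, type_: str, id_: str) -> Entity:
    """Build an Entity for the first occurrence of `word` in `text`."""
    start = text.index(word)
    return Entity(id_, type_, start, start + len(word), word)


def participants(event: Event) -> List[Entity]:
    """Collect all argument entities reachable from an event, including nested ones."""
    found: List[Entity] = []
    for _role, arg in event.args:
        if isinstance(arg, Event):
            found.extend(participants(arg))
        else:
            found.append(arg)
    return found


# Hypothetical example sentence with entity annotations assumed to be given.
sentence = "VEGF induces phosphorylation of Akt"
vegf = mention(sentence, "VEGF", "Protein", "T1")
akt = mention(sentence, "Akt", "Protein", "T2")
phos_trigger = mention(sentence, "phosphorylation", "Phosphorylation", "T3")
reg_trigger = mention(sentence, "induces", "Positive_regulation", "T4")

# A simple event, then a regulation event that takes the first event as its theme.
phosphorylation = Event("E1", "Phosphorylation", phos_trigger, [("Theme", akt)])
regulation = Event(
    "E2", "Positive_regulation", reg_trigger,
    [("Cause", vegf), ("Theme", phosphorylation)],
)

print([e.text for e in participants(regulation)])  # ['VEGF', 'Akt']
```

Allowing events as arguments of other events is what lets a single structure capture statements such as "X induces the phosphorylation of Y".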
 
Title Anatomical entity mention recognition AnatEM 
Description The extended Anatomical Entity Mention corpus (AnatEM) consists of 1212 documents (approx. 250,000 words) manually annotated to identify over 13,000 mentions of anatomical entities. Each annotation is assigned one of 12 granularity-based types such as Cellular component, Tissue and Organ, defined with reference to the Common Anatomy Reference Ontology. The corpus builds in part on two previously introduced resources, AnEM and MLEE. The corpus annotations were created using the brat annotation tool. 
Type Of Material Database/Collection of data 
Year Produced 2015 
Provided To Others? Yes  
Impact Embedded in Europe PubMed Central. Includes lexical resources, AnatomyTagger and UIMA components. 
URL http://nactem.ac.uk/anatomytagger/
 
Title Anatomy annotated corpora 
Description Multi-Level Event Extraction (MLEE) corpus - abstracts of publications on angiogenesis, annotated with entity mentions and events across multiple levels of biological organization from the molecular to the organ system level. Over 8,000 entities with fine-grained types and over 6,000 structured events are annotated. 
Type Of Material Database/Collection of data 
Year Produced 2015 
Provided To Others? Yes  
Impact The corpus annotation was created with reference to a previously introduced annotation created by subdomain experts to identify spans of text that express statements relevant to their interests. To create the MLEE corpus, we have established ontological foundations for the annotation with reference to community-standard OBO Foundry resources such as the Gene Ontology (GO) and the Common Anatomy Reference Ontology (CARO), revising existing span annotations accordingly to identify over 8,000 entities with fine-grained types and introducing structured annotation for over 6,000 events. 
URL http://www.nactem.ac.uk/MLEE/
 
Title BioCause 
Description Causality lies at the heart of biomedical knowledge, such as diagnosis, pathology or systems biology. Automatic causality recognition can thus greatly reduce the human workload by suggesting possible causal connections and aiding in the curation of pathway models. A biomedical text corpus annotated with such relations is hence crucial for developing and evaluating biomedical text mining. BioCause is a collection of open-access full-text biomedical journal articles belonging to the subdomain of infectious diseases. These documents have been pre-annotated with named entity and event information in the context of a previous shared task, BioNLP 2011 ST ID. The BioNLP 2011 ST ID corpus consists of 19 full-text documents that have been manually annotated with biomedical entities and events. The annotations provide classified, structured representations of relationships between biomedical terms, and as such, the corpus constitutes a valuable resource for the training of IE systems. 
Type Of Material Database/Collection of data 
Year Produced 2015 
Provided To Others? Yes  
Impact Improved search and information extraction for biomedical text mining. 
URL http://www.nactem.ac.uk/biocause/
 
Title BioNLP Shared Task Resources 2013 
Description The BioNLP Shared Task (BioNLP-ST) series represents a community-wide trend in text-mining for biology toward fine-grained information extraction (IE). The Pathway Curation (PC) task is a main task of the BioNLP Shared Task 2013. The PC task aims to evaluate the applicability of event extraction systems to support the curation, evaluation and maintenance of biomolecular pathway models and to encourage the further development of methods for these tasks. The Cancer Genetics (CG) task is an information extraction task organized as part of the BioNLP Shared Task 2013. The CG task aims to advance the automatic extraction of information from statements on the biological processes relating to the development and progression of cancer. 
Type Of Material Database/Collection of data 
Year Produced 2013 
Provided To Others? Yes  
Impact The BioNLP Shared Task series has been instrumental in encouraging the development of methods and resources for the automatic extraction of bio-processes from text, but efforts within this framework have been almost exclusively focused on molecular and sub-cellular level entities and events. To be relevant to cancer biology, event extraction technology must be generalized to address physical entities and processes at higher levels of biological organization, such as cell proliferation, apoptosis, blood vessel development, and organ growth. The CG task aims to advance the development of such event extraction methods and the capacity for automatic analysis of texts on cancer biology. Despite more than a decade of work in biomedical text mining on tasks under headings such as "automatic pathway extraction", natural language processing and information extraction methods have not been widely embraced by biomedical pathway curation communities. Until recently, biomedical domain IE efforts concentrated on simple representations (e.g. physical entity pairs) that were not sufficiently expressive to address pathway curation, and most work also involved different semantics from those applied in curation efforts. We believe that the structured event representation applied in BioNLP Shared Task main tasks offers many opportunities to make a significant contribution to practical pathway curation efforts. The PC task is proposed as a step toward realizing these opportunities. To ensure that the task and its data are relevant to the needs of pathway curation efforts, the PC task defines its extraction targets and their semantics with reference to physical entity and reaction types applied in pathway model standardization efforts and relevant ontologies such as the Systems Biology Ontology (SBO). 
Further, the corpus texts are selected on the basis of relevance to a selection of pathway models from Panther Pathway DB and BioModels, covering both signaling and metabolic pathways. The texts involve both PubMed publication abstracts and PMC Open Access full-text paper extracts. 
URL http://2013.bionlp-st.org
 
Title Metaknowledge corpus 
Description A corpus of 1000 MEDLINE abstracts manually annotated with events (based on the GENIA ontology) and enriched with scientific discourse information. 
Type Of Material Database/Collection of data 
Year Produced 2011 
Provided To Others? Yes  
Impact Annotation of scientific discourse attracted interest from publishers. Improved search in EuropePubMedCentral system. 
URL http://www.nactem.ac.uk/meta-knowledge/
 
Description PathText 
Organisation The Systems Biology Institute
Country Japan 
Sector Charity/Non Profit 
PI Contribution Providing text mining infrastructure to systems biologists
Collaborator Contribution Supplied pathway editor for our text mining platform
Impact Workshops, training events, tutorials, software, publications. Members of the Garuda alliance http://www.garuda-alliance.org/alliancemembers
Start Year 2010
 
Title Acromine Disambiguation 
Description Automatically disambiguates acronyms into their expanded long forms from text. 
Type Of Technology Webtool/Application 
Year Produced 2010 
Impact Improved search services by refining query expansion 
URL http://www.nactem.ac.uk/software/acromine_disambiguation/
 
Title Anatomy Tagger 
Description An open-source entity mention tagger for anatomical entities, based on the AnatEM anatomical entity mention corpus, and related open data resources. More information: Sampo Pyysalo and Sophia Ananiadou (2013). Anatomical Entity Mention Recognition at Literature Scale. Bioinformatics. 
Type Of Technology Software 
Year Produced 2014 
Open Source License? Yes  
Impact Embedded in Europe PubMedCentral search tools 
URL http://nactem.ac.uk/anatomytagger/
 
Title Argo - collaborative text mining workbench 
Description Argo is a workbench for building and running text-analysis solutions. It facilitates the development of custom workflows from a selection of elementary analytics. 
Type Of Technology Webtool/Application 
Year Produced 2012 
Impact Curation of databases and pathways through workflow design. The web interface allows the user to create complex processing workflows composed of processing components and multiple branching and merging points. User-interactive components, such as the Manual Annotation Editor, make the processing of workflows pause and wait for input from the user. Other features include processing components, remote processing and user collaboration. Top-performing system in the BioCreative IV user interactive task. 
URL http://argo.nactem.ac.uk
 
Title EventMine 
Description EventMine is a machine learning-based pipeline system, which extracts events from documents that already contain named entity annotations (e.g., genes/proteins, etc.). Given appropriate training data, it can be trained to extract many different types and structures of events. 
Type Of Technology Webtool/Application 
Year Produced 2012 
Impact EventMine has been trained on a number of different corpora, and corresponding web services are available. EventMine outperformed other systems on a number of community shared tasks (BioNLP 2011 and 2013). It is adaptable to any domain. 
URL http://www.nactem.ac.uk/EventMine/
 
Title FACTA+ 
Description A text mining service for mining direct and indirect associations 
Type Of Technology Webtool/Application 
Year Produced 2009 
Impact Used for hypothesis generation for clinical and biological applications. Linked with pathway curation. Cited in New Scientist http://www.nactem.ac.uk/newsitem.php?item=272 
URL http://www.nactem.ac.uk/facta/
 
Title PathText 
Description A novel method for associating pathway model reactions with relevant publications. Our approach extracts the reactions directly from the models and then turns them into queries for three text mining-based MEDLINE literature search systems. These queries are executed, and the resulting documents are combined and ranked according to their relevance to the reactions of interest. We manually annotate document-reaction pairs with the relevance of the document to the reaction and use this annotation to study several ranking methods, using various heuristic and machine-learning approaches. 
Type Of Technology Webtool/Application 
Year Produced 2013 
Impact Our evaluation shows that the annotated document-reaction pairs can be used to create a rule-based document ranking system, and that machine learning can be used to rank documents by their relevance to pathway reactions. We find that a Support Vector Machine-based system outperforms several baselines and matches the performance of the rule-based system. The successful query extraction and ranking methods are used to update our existing pathway search system, PathText. 
URL http://www.nactem.ac.uk/pathtext2/
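
The step of combining result lists from several search systems into one ranking can be sketched as follows. This minimal example uses reciprocal rank fusion, a generic technique chosen here for brevity; it is not the heuristic or Support Vector Machine-based ranking described above. The system names, document identifiers and the smoothing constant `k` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: Dict[str, List[str]], k: int = 60) -> List[str]:
    """Merge ranked lists: each document scores the sum of 1 / (k + rank) over all lists."""
    scores: Dict[str, float] = defaultdict(float)
    for ranked_ids in rankings.values():
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties are broken by document id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results returned by three search systems for one reaction query.
results = {
    "system_a": ["doc3", "doc1", "doc2"],
    "system_b": ["doc1", "doc3"],
    "system_c": ["doc1", "doc4"],
}

fused = reciprocal_rank_fusion(results)
print(fused)  # ['doc1', 'doc3', 'doc4', 'doc2']
```

Documents returned near the top by several systems rise to the top of the fused list, which is the basic intuition behind combining multiple search systems before any learned re-ranking is applied.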
 
Title Species Disambiguation System for Biological Named Entities 
Description This tool automatically associates concepts to entity mentions in biomedical text (e.g., MEDLINE abstracts). A considerable amount of research was put into lexical disambiguation of the biomedical names. This is because a string of words often refers to different meanings depending on the context, hence causing ambiguity. A more sensible way to organise information is by concepts, where a concept has unambiguous meaning and can be associated with a unique identifier. We carry out organism disambiguation by automatically identifying the species-indicating words (e.g., human) and biomedical named entities (e.g., protein P53) in text, and then judging whether the species-entity relations are positive, where a positive relation means that an entity belongs to the organism indicated by the species-indicating word. 
Type Of Technology Webtool/Application 
Year Produced 2008 
Impact This tool tackled one major source of ambiguity in entity mentions: model organisms. Model organisms are species studied to understand particular biological phenomena. Biological experiments are often conducted on one species, with the expectation that the discoveries will provide insight into the workings of others, including humans, which are more difficult to study directly. From viruses, prokaryotes, to plants and animals, there are dozens of organisms commonly used in biological studies, such as E. coli, C. elegans, Drosophila, Homo sapiens, and hundreds more are frequently mentioned in biological research papers. Given an article, it is often essential for readers to understand what organisms the biomedical entities (e.g., proteins) belong to, and on what organisms the experiments were carried out. 
URL http://www.nactem.ac.uk/deca_details/
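
The idea of pairing entity mentions with species-indicating words can be illustrated with a deliberately simple heuristic: assign each entity the species whose indicator word is nearest in the text. This is only a sketch under stated assumptions, not the tool's actual method for judging species-entity relations; the lexicon, the tokenisation and the single-token entity restriction are all hypothetical simplifications.

```python
import re
from typing import Dict, List, Optional

# Hypothetical toy lexicon mapping species-indicating words to organism names.
SPECIES_WORDS: Dict[str, str] = {
    "human": "Homo sapiens",
    "mouse": "Mus musculus",
    "murine": "Mus musculus",
    "yeast": "Saccharomyces cerevisiae",
}


def tokenize(text: str) -> List[str]:
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"\w+", text.lower())


def assign_species(text: str, entity: str) -> Optional[str]:
    """Return the species whose indicator word is closest to a mention of `entity`.

    Assumes `entity` is a single token; returns None if either the entity or
    any species-indicating word is absent from the text.
    """
    tokens = tokenize(text)
    entity_positions = [i for i, tok in enumerate(tokens) if tok == entity.lower()]
    species_positions = [
        (i, SPECIES_WORDS[tok]) for i, tok in enumerate(tokens) if tok in SPECIES_WORDS
    ]
    if not entity_positions or not species_positions:
        return None
    # Choose the (distance, species) pair with the smallest token distance.
    _distance, species = min(
        ((abs(i - j), sp) for i in entity_positions for j, sp in species_positions),
        key=lambda pair: pair[0],
    )
    return species


text = "Human p53 binds MDM2, whereas the murine Trp53 protein was studied in vivo."
print(assign_species(text, "p53"))    # Homo sapiens
print(assign_species(text, "Trp53"))  # Mus musculus
```

A proximity rule like this fails whenever the relevant species word is far from the entity or absent from the sentence, which is one reason the relation between each species word and entity has to be judged rather than assumed.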
 
Title brat: annotation visualization and editing 
Description Intuitive visualization and editing of text annotations is important for communicating the "meaning" of annotations and for reducing the effort of creating new annotations. brat is a web-based tool for annotation visualization and editing. brat supports a rich set of fully configurable annotation primitives: typed text spans (e.g. entity mention); binary relations (e.g. coreference); n-ary associations (e.g. events); attributes/meta-knowledge (e.g. Negation, Speculation, etc.); and free-form text "notes". 
Type Of Technology Software 
Year Produced 2012 
Open Source License? Yes  
Impact Widely used by the text mining community as an annotation tool (88 citations since 2012). 
URL http://www.nactem.ac.uk/brat-annotation/
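
The annotation primitives listed above are stored by brat in a tab-separated standoff format, with one annotation per line and an ID prefix indicating its kind (T for text spans, E for events, A for attributes). The sketch below reads a small subset of that format. It assumes contiguous spans only and ignores relations and notes; the annotation type names and the example text are illustrative, since brat types are configured per project.

```python
from typing import Dict, List, Tuple

Span = Tuple[str, int, int, str]                  # (type, start, end, text)
EventAnn = Tuple[str, str, List[Tuple[str, str]]]  # (type, trigger id, [(role, arg id)])


def parse_standoff(ann_text: str):
    """Parse text-bound (T), event (E) and attribute (A/M) lines of brat standoff."""
    spans: Dict[str, Span] = {}
    events: Dict[str, EventAnn] = {}
    attributes: List[Tuple[str, str]] = []        # (attribute name, target id)
    for line in ann_text.splitlines():
        if not line.strip():
            continue
        ann_id, body = line.split("\t", 1)
        if ann_id.startswith("T"):
            # Assumes a single contiguous span: "Type start end<TAB>text".
            head, text = body.split("\t", 1)
            type_, start, end = head.split(" ")
            spans[ann_id] = (type_, int(start), int(end), text)
        elif ann_id.startswith("E"):
            # "Type:TriggerID Role:ArgID Role:ArgID ..."
            parts = body.split()
            type_, trigger_id = parts[0].split(":")
            args = [tuple(part.split(":")) for part in parts[1:]]
            events[ann_id] = (type_, trigger_id, args)
        elif ann_id.startswith(("A", "M")):
            fields = body.split()
            attributes.append((fields[0], fields[1]))
    return spans, events, attributes


TEXT = "VEGF induces phosphorylation of Akt"
ANN = (
    "T1\tProtein 0 4\tVEGF\n"
    "T2\tProtein 32 35\tAkt\n"
    "T3\tPhosphorylation 13 28\tphosphorylation\n"
    "T4\tPositive_regulation 5 12\tinduces\n"
    "E1\tPhosphorylation:T3 Theme:T2\n"
    "E2\tPositive_regulation:T4 Cause:T1 Theme:E1\n"
    "A1\tSpeculation E2\n"
)

spans, events, attributes = parse_standoff(ANN)
print(events["E2"])
```

Because the annotations reference character offsets rather than modifying the text, the same document can carry several independent annotation layers.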
 
Description Keynote at META-NET launch event- Strategic Research Agenda for Multilingual Europe 2020 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? Yes
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Invited talk; generated interest in the Argo infrastructure.

Sophia Ananiadou was invited to be one of the META-NET executive board members. NaCTeM is the hub for text mining in the UK.
Year(s) Of Engagement Activity 2013
URL http://www.meta-net.eu/events/meta-net-ga-2013/programme
 
Description Licences for Europe 
Form Of Engagement Activity A press release, press conference or response to a media enquiry/interview
Part Of Official Scheme? Yes
Geographic Reach International
Primary Audience Policymakers/politicians
Results and Impact Influenced decision making about licences, the role of text mining and changes to copyright legislation

Contributed to UK legislation change
Year(s) Of Engagement Activity 2013
URL http://ec.europa.eu/research/innovation-union/pdf/TDM-report_from_the_expert_group-042014.pdf