Tools for the text mining-based visualisation of the provenance of biochemical networks

Lead Research Organisation: University of Manchester
Department Name: Chemistry

Abstract

Systems biology is concerned with the modelling, visualisation and analysis of biochemical networks in which, for instance, metabolites are 'linked' by arrows representing the enzymes which turn one molecule into another or which are modified by particular substances. However, these diagrams are divorced from the scientific evidence on which they are based, which is represented by the scientific literature (and increasingly by online databases). The historical scientific literature is huge and is growing at an enormous rate (several thousand papers per week), so no one can possibly read it all. One solution is to use computers to 'read' these papers and present to the user only those which carry relevant information. Aspects of this subject are variously known as Natural Language Processing and Text Mining. Text Mining goes through papers, extracts the relevant pieces of information from each one, and presents them to the biological reader. A particular problem is biologists' use of multiple names for the same thing. Text Mining can assist here since it is able to find all the variations of the same name and link them with the relevant text and databases. Text Mining can also find the types of relationship between these names, and this is the basis by which computers can discover and display scientific evidence. The Text Mining system will produce and index such evidence for specific problems, and this will be stored in an appropriately structured database. The aim of the project is therefore to develop and deploy the necessary Text Mining tools and to use them to display to the user the different relationships and the literature on which they are based. This will be done by encoding the interactions as arrows of various colours that link to a dynamic website of relevant literature, giving a direct linkage between the systems biology diagrams and the evidence for them.

Technical Summary

Systems biology is concerned with the modelling, visualisation and analysis of biochemical networks in which, for instance, metabolites are 'linked' by arrows representing the enzymes which turn one molecule into another or which are modified by particular substances. SBML provides a computer-readable 'standard' for describing such biochemical or signalling networks. However, these diagrams (and thus the SBML models) are divorced from the scientific evidence on which they are based, represented by the scientific literature (and increasingly by online databases). In order to overcome the problems of reading the burgeoning scientific literature, we shall deploy Text Mining (TM). TM involves named entity recognition (i.e. semantic annotation of enzymes, metabolites, etc.) and information extraction (i.e. relationship extraction between named entities). An important part of this proposal is to find solutions for the terminology problem in systems biology by developing techniques for recognising synonymous terms. Based on our efficient parsing techniques, we shall extract relationships between entities; these will form the basis on which we shall discover, index, store and display the scientific evidence for such linkages. The selection of the most pertinent relationships will be performed using our preferred methods of advanced machine learning (Support Vector Machines and Genetic Programming). The overall aim of the project is thus to develop and deploy the necessary TM tools and to use them to display the different relationships to the user together with the literature from which they have been extracted. The different types (and strength) of evidence for these interactions will then be visualised directly and linked to a dynamic website of the literature. This will give users a direct linkage between the systems biology diagrams encoded in (an advanced form of) SBML and the scientific evidence for them. Where available, linkages to kinetic data will also be made.
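
The pipeline outlined in the Technical Summary (synonym recognition, relationship extraction, and coloured arrows linked to the literature) can be illustrated with a minimal Python sketch. It is not the project's implementation: a dictionary lookup stands in for named entity recognition, a trigger-word match stands in for parsing-based relationship extraction, and the synonym table, trigger words, colour mapping and evidence URL are all hypothetical. Edge direction simply follows the order in which entities appear in the sentence, which is also an assumption.

```python
import re
from itertools import combinations

# Hypothetical synonym table: surface form -> canonical name (illustrative only).
SYNONYMS = {
    "glucose": "glucose",
    "d-glucose": "glucose",
    "dextrose": "glucose",
    "hexokinase": "hexokinase",
    "hk": "hexokinase",
    "glucose-6-phosphate": "glucose-6-phosphate",
    "g6p": "glucose-6-phosphate",
}

# Hypothetical trigger words -> relationship type, and type -> arrow colour.
TRIGGERS = {"phosphorylates": "catalysis", "inhibits": "inhibition"}
COLOURS = {"catalysis": "blue", "inhibition": "red"}


def tokens_of(sentence):
    """Lower-case word tokens; hyphens are kept so 'D-glucose' stays whole."""
    return re.findall(r"[A-Za-z0-9\-]+", sentence.lower())


def find_entities(sentence):
    """Return canonical names of known entities, in order of first appearance."""
    found = []
    for tok in tokens_of(sentence):
        canon = SYNONYMS.get(tok)
        if canon and canon not in found:
            found.append(canon)
    return found


def extract_relations(doc_id, sentence):
    """Pair entities that co-occur in a sentence containing a trigger word."""
    toks = set(tokens_of(sentence))
    rel_types = [rel for trig, rel in TRIGGERS.items() if trig in toks]
    if not rel_types:
        return []
    return [
        {"source": a, "target": b, "type": rel_types[0], "evidence": doc_id}
        for a, b in combinations(find_entities(sentence), 2)
    ]


def to_dot(relations, base_url="https://example.org/evidence/"):
    """Emit Graphviz DOT text: one coloured arrow per relation, linked to its source."""
    lines = ["digraph evidence {"]
    for r in relations:
        lines.append(
            f'  "{r["source"]}" -> "{r["target"]}" '
            f'[color="{COLOURS[r["type"]]}", URL="{base_url}{r["evidence"]}"];'
        )
    lines.append("}")
    return "\n".join(lines)
```

For the sentence "HK phosphorylates dextrose." the sketch normalises both synonyms and yields a single catalysis edge from hexokinase to glucose, carrying the identifier of the document it came from.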

Publications


Kocbek S (2011) AGRA: analysis of gene ranking algorithms. in Bioinformatics (Oxford, England)

Pyysalo S (2014) Anatomical entity mention recognition at literature scale. in Bioinformatics (Oxford, England)

Okazaki N (2010) Building a high-quality sense inventory for improved abbreviation disambiguation. in Bioinformatics (Oxford, England)

Pyysalo S (2012) Event extraction across multiple levels of biological organization. in Bioinformatics (Oxford, England)

Ananiadou S (2010) Event extraction for systems biology by text mining the literature. in Trends in biotechnology

Sasaki Y (2011) Extracting secondary bio-event arguments with extraction constraints. in Computational Intelligence

 
Description Systems biology is concerned with the modelling, visualisation and analysis of biochemical networks in which, for instance, metabolites are 'linked' by arrows representing the enzymes which turn one molecule into another or which are modified by particular substances. However, these diagrams are divorced from the scientific evidence on which they are based, which is represented by the scientific literature (and increasingly by online databases).

The historical scientific literature is huge and is growing at an enormous rate (several thousand papers per week), so no one can possibly read it all. One solution is to use computers to 'read' these papers and present to the user only those which carry relevant information. Aspects of this subject are variously known as Natural Language Processing and Text Mining. Our contribution was to link text with pathways by going through papers, extracting the relevant pieces of information from each paper, and presenting them to the biological reader.
Exploitation Route The outputs of this research have had an impact on the way systems biologists carry out pathway reconstruction. The reconstruction of signalling and metabolic pathways in turn benefits pharma and systems medicine.
Sectors Chemicals,Digital/Communication/Information Technologies (including Software),Pharmaceuticals and Medical Biotechnology

URL http://www.nactem.ac.uk/facta/
 
Description Findings have been used by the pharmaceutical industry to support drug discovery.
First Year Of Impact 2010
Sector Chemicals,Digital/Communication/Information Technologies (including Software),Manufacturing, including Industrial Biotechnology,Pharmaceuticals and Medical Biotechnology
Impact Types Economic

 
Description Big Science Mechanism
Amount £678,153 (GBP)
Funding ID W911NF-14-1-0333 
Organisation Defense Advanced Research Projects Agency (DARPA) 
Sector Public
Country United States
Start 11/2014 
End 05/2017
 
Description EuropePubMedCentral
Amount £680,000 (GBP)
Funding ID N/A 
Organisation Wellcome Trust 
Department KEMRI-Wellcome Trust Research Programme
Sector Academic/University
Country Kenya
Start 03/2008 
End 12/2015
 
Description METANET4U
Amount € 350,000 (EUR)
Funding ID ICT PSP 270893 
Organisation European Commission 
Sector Public
Country European Union (EU)
Start 01/2011 
End 02/2013
 
Title AcroMine 
Description Automatically recognises and expands biomedical acronyms. 
Type Of Material Improvements to research infrastructure 
Year Produced 2006 
Provided To Others? Yes  
Impact Improved search for other text mining services such as KLEIO http://www.nactem.ac.uk/Kleio/ and Europe PubMed Central Evidence Finder http://labs.europepmc.org/evf 
URL http://www.nactem.ac.uk/software/acromine/
 
Title BioNLP Shared Task Resources 
Description The BioNLP Shared Task series represents a community-wide move in bio-textmining toward fine-grained information extraction (IE). Manually annotated data, in which all annotations are bound to specific expressions in text, are provided for training, development and evaluation of extraction methods, together with tools for detailed evaluation of system outputs; all are publicly available. 
Type Of Material Database/Collection of data 
Year Produced 2014 
Provided To Others? Yes  
Impact The task setup and data have since served as the basis of numerous studies and published event extraction systems and datasets. 
URL https://sites.google.com/site/bionlpst/
 
Description KISTI Pathway 
Organisation Korea Institute of Science and Technology Information (KISTI)
Country Korea, Republic of 
Sector Academic/University 
PI Contribution The construction of detailed, machine-readable models of biomolecular pathways is a major goal of systems biology, and hundreds of models capturing the physical entities and reactions involved in various pathways are already available from repositories such as the BioModels Database and the PANTHER Pathway repository. We support biologists by providing biomedical text mining systems that are increasingly capable of creating rich structured representations of information automatically extracted from the literature. Such text mining systems open many opportunities for supporting the curation, validation, and updating of pathway models. Building on the PathText text mining integration technology for pathways, text mining systems such as MEDIE, and event extraction tools such as EventMine, we are developing methods for identifying literature relevant to specific reactions in pathway models and for automatically analysing documents to extract event structures that capture the full semantics of pathway reactions.
Collaborator Contribution Joint proposal of the BioNLP 2013 shared task; biologists from KISTI annotated reactions in a variety of signalling and metabolic pathways.
Impact Organisation of shared tasks, with resources made available to the community: http://2013.bionlp-st.org
Start Year 2012
 
Title Acromine 
Description Automatically discovers acronyms from biomedical text and expands them into their long forms 
IP Reference  
Protection Copyrighted (e.g. software)
Year Protection Granted 2006
Licensed Yes
Impact joint projects with Elsevier
 
Title Accelerated annotation tool 
Description The ACELA (ACcElerated Annotation) tool aims to reduce the human effort required to produce a gold standard corpus of named entity (NE) annotations. The process of annotation is similar to active learning, in that it is performed as an iterative and interactive process between the human annotator and a machine-learned NE tagger. 
Type Of Technology Webtool/Application 
Year Produced 2008 
Impact The aim of the tool is to ensure that all NEs of a given type are annotated in a given corpus with minimum effort from the human annotator. Only those sentences that are most likely to contain NEs of the target type (according to the predictions of the tagger) are displayed for the human to annotate, which means that it is not necessary to read through all the sentences in the corpus that do not contain relevant entities. At each iteration of the process, the NE tagger is re-trained on all available sentences that have been human annotated, meaning that it makes increasingly accurate predictions about which sentences contain named entities. The tool also estimates the proportion of entities in the corpus that have been annotated by the human (coverage), and the annotation process stops when this figure is close to 100%. 
URL http://www.nactem.ac.uk/acela/
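
The iterative process described for ACELA can be sketched roughly as follows. This is only an illustration of the loop's shape, not the tool's code: the toy tagger merely remembers tokens it has seen annotated, the coverage formula (entities annotated so far divided by that number plus the tagger's predicted count for unseen sentences) is an assumption, and all names are hypothetical.

```python
def tokenize(sentence):
    """Whitespace tokenisation, lower-cased (a deliberate simplification)."""
    return sentence.lower().split()


class ToyTagger:
    """Stand-in for a machine-learned NE tagger: remembers tokens seen as entities."""

    def __init__(self, seed_terms):
        self.known = set(seed_terms)

    def score(self, sentence):
        """Predicted number of entity tokens in the sentence."""
        return sum(tok in self.known for tok in tokenize(sentence))

    def retrain(self, annotated):
        """Re-train on every human-annotated sentence collected so far."""
        for entities in annotated.values():
            self.known.update(entities)


def annotate_corpus(corpus, oracle, seed_terms, batch_size=1, target_coverage=0.99):
    """Show the annotator only the most promising sentences until coverage is high.

    `oracle` plays the human annotator: it maps a sentence to its entity tokens.
    Returns the annotations and the final (estimated) coverage.
    """
    tagger = ToyTagger(seed_terms)
    annotated = {}  # sentence -> entity tokens supplied by the annotator
    while True:
        remaining = [s for s in corpus if s not in annotated]
        done = sum(len(ents) for ents in annotated.values())
        expected_left = sum(tagger.score(s) for s in remaining)
        total = done + expected_left
        # Assumed estimator; with nothing found and nothing predicted it reports 1.0.
        coverage = done / total if total else 1.0
        if coverage >= target_coverage or not remaining:
            return annotated, coverage
        # Present the sentences the tagger considers most likely to hold entities.
        batch = sorted(remaining, key=tagger.score, reverse=True)[:batch_size]
        for sentence in batch:
            annotated[sentence] = oracle(sentence)
        tagger.retrain(annotated)
```

In this sketch, sentences for which the tagger predicts no entities are never shown to the annotator once the estimated coverage reaches the target, which mirrors the effort saving described above.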
 
Title Acromine 
Description Acromine is an abbreviation dictionary automatically constructed from the whole MEDLINE as of April, 2009. Acromine identifies abbreviation definitions by assuming a word sequence co-occurring frequently with a parenthetical expression to be a potential expanded form. Applied to the whole MEDLINE (9,635,599 abstracts), the implemented system extracted 68,007 abbreviation candidates and recognized 467,402 expanded forms. The current Acromine achieves 99% precision and 82-95% recall on our evaluation corpus that roughly emulates the whole MEDLINE. 
Type Of Technology Webtool/Application 
Year Produced 2006 
Impact Improves search by including acronyms in query expansion; included in Europe PubMedCentral search system EvidenceFinder 
URL http://www.nactem.ac.uk/software/acromine/
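
The idea stated in the Acromine description, that a word sequence co-occurring frequently with a parenthetical expression is a potential expanded form, can be illustrated with a small sketch. It is a simplified stand-in rather than Acromine's actual scoring: the regular expression, the five-word window, the frequency threshold and the "longest frequent sequence" rule are all assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict

# Assumed pattern for a parenthetical abbreviation such as "(TSH)".
PAREN = re.compile(r"\(([A-Za-z][A-Za-z0-9\-]{1,9})\)")


def candidate_long_forms(texts, max_words=5):
    """Count the word sequences that immediately precede each '(ABBR)'."""
    counts = defaultdict(Counter)
    for text in texts:
        for match in PAREN.finditer(text):
            abbr = match.group(1)
            preceding = re.findall(r"[A-Za-z0-9\-]+", text[:match.start()].lower())
            for n in range(1, min(max_words, len(preceding)) + 1):
                counts[abbr][" ".join(preceding[-n:])] += 1
    return counts


def best_long_form(counts, abbr, min_freq=2):
    """Pick the longest preceding sequence seen at least `min_freq` times."""
    frequent = [
        (len(long_form.split()), freq, long_form)
        for long_form, freq in counts[abbr].items()
        if freq >= min_freq
    ]
    return max(frequent)[2] if frequent else None


if __name__ == "__main__":
    abstracts = [
        "The hormone thyroid stimulating hormone (TSH) was measured.",
        "Serum thyroid stimulating hormone (TSH) levels were elevated.",
        "Thyroid stimulating hormone (TSH) is secreted by the pituitary.",
    ]
    print(best_long_form(candidate_long_forms(abstracts), "TSH"))
```

Longer sequences such as "serum thyroid stimulating hormone" occur only once in the three example abstracts, so the frequency threshold filters them out and the shared sequence is selected.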
 
Title FACTA+ 
Description A text mining service for mining direct and indirect associations 
Type Of Technology Webtool/Application 
Year Produced 2009 
Impact Used for hypothesis generation for clinical and biological applications. Linked with pathway curation. Cited in New Scientist http://www.nactem.ac.uk/newsitem.php?item=272 
URL http://www.nactem.ac.uk/facta/
 
Title U-Compare 
Description Interoperable text mining environment for comparing and evaluating text mining tools and for quickly creating web services 
Type Of Technology Software 
Year Produced 2010 
Open Source License? Yes  
Impact Applied to areas beyond biomedicine, as part of the language network of excellence METANET and METASHARE. NaCTeM became the UK hub of the network of excellence METANET because of U-Compare 
URL http://nactem.ac.uk/ucompare/
 
Description Invited talk at the Royal Society of Chemistry 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? Yes
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Invited talk about text mining methods for mining chemical information

networking activities with pharma
Year(s) Of Engagement Activity 2008
URL http://www.rsc.org/events/detail/2773?CFID=3274168&CFTOKEN=dea17cccbb272dda-817E0913-DF41-7AF5-16045...