Tools for the text mining-based visualisation of the provenance of biochemical networks
Lead Research Organisation:
University of Manchester
Department Name: Chemistry
Abstract
Systems biology is concerned with the modelling, visualisation and analysis of biochemical networks in which, for instance, metabolites are 'linked' by arrows representing the enzymes which turn one molecule into another or which are modified by particular substances. However, these diagrams are divorced from the scientific evidence on which they are based, which is represented by the scientific literature (and increasingly by online databases). The historical scientific literature is huge, and is increasing at an enormous rate (several thousand papers per week), so no one can possibly read it all. One solution is to use computers to 'read' these papers and present to the user only those which carry relevant information. Aspects of this subject are variously known as Natural Language Processing and Text Mining. Text Mining goes through papers, extracts the relevant pieces of information from each paper, and presents them to the biological reader. A particular problem is the use by biologists of multiple names for the same thing. Text Mining can assist here since it is able to find all the variations of the same name and link them with the relevant text and databases. Text Mining can also find the TYPES of relationship between these names, and this is the basis by which computers can discover and display scientific evidence. The Text Mining system will produce and index such evidence, for specific problems, and this will be stored in an appropriately structured database. The aim of the project is therefore to develop and deploy the necessary Text Mining tools and to use them to display the different relationships to the user, together with the literature on which they are based. This will be done by encoding the interactions using arrows of various colours that link to a dynamic website of relevant literature, giving a direct linkage between the systems biology diagrams and the evidence for them.
Technical Summary
Systems biology is concerned with the modelling, visualisation and analysis of biochemical networks in which, for instance, metabolites are 'linked' by arrows representing the enzymes which turn one molecule into another or which are modified by particular substances. SBML provides a computer-readable 'standard' for describing such biochemical or signalling networks. However, these diagrams (and thus the SBML models) are divorced from the scientific evidence on which they are based, represented by the scientific literature (and increasingly by online databases). In order to overcome the problems of reading the burgeoning scientific literature, we shall deploy Text Mining (TM). TM involves named entity recognition (i.e. semantic annotation of enzymes, metabolites, etc.) and information extraction (i.e. relationship extraction between named entities). An important part of this proposal is to find solutions for the terminology problem in systems biology, by developing techniques for recognising synonymous terms. Based on our efficient parsing techniques, we shall extract relationships between entities that will form the basis by which we can discover, index, store and display the scientific evidence for such linkages. The selection of the most pertinent relationships will be performed using our preferred methods of advanced machine learning (Support Vector Machines and Genetic Programming). The overall aim of the project is thus to develop and deploy the necessary TM tools and to use them to display the different relationships to the user together with the literature from which they have been extracted. The different types (and strength) of evidence for these interactions will then be visualised directly and linked to a dynamic website of the literature. This will thus give users a direct linkage between the systems biology diagrams encoded in (an advanced form of) SBML and the scientific evidence for them. Where available, linkages to kinetic data will also be made.
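The two TM steps named above (named entity recognition with synonym handling, then relationship extraction linked back to the literature) can be illustrated with a minimal Python sketch. It is not the project's implementation: the synonym table, the function names `find_entities` and `index_evidence`, and the use of plain sentence-level co-occurrence in place of parsing and machine-learned selection are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical synonym dictionary: surface form (lower case) -> canonical name.
SYNONYMS = {
    "glucose": "D-glucose",
    "dextrose": "D-glucose",
    "hexokinase": "hexokinase",
    "hk": "hexokinase",
    "glucose-6-phosphate": "glucose 6-phosphate",
    "g6p": "glucose 6-phosphate",
}

def find_entities(sentence):
    """Return the set of canonical entity names mentioned in a sentence."""
    tokens = re.findall(r"[A-Za-z0-9\-]+", sentence.lower())
    return {SYNONYMS[t] for t in tokens if t in SYNONYMS}

def index_evidence(documents):
    """Map each entity pair to the (doc_id, sentence) items that mention both.

    Co-occurrence in one sentence stands in for real relationship extraction.
    """
    evidence = defaultdict(list)
    for doc_id, text in documents.items():
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            entities = sorted(find_entities(sentence))
            for i, first in enumerate(entities):
                for second in entities[i + 1:]:
                    evidence[(first, second)].append((doc_id, sentence))
    return evidence
```

Calling `index_evidence` on a dictionary of document identifiers and texts returns, for each pair of canonical names, the sentences that could be shown as evidence behind the corresponding arrow in a pathway diagram.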
Publications

Ananiadou S
(2010)
Event extraction for systems biology by text mining the literature.
in Trends in biotechnology

Demner-Fushman D
(2008)
Themes in biomedical natural language processing: BioNLP08
in BMC Bioinformatics

Hull D
(2008)
Defrosting the digital library: bibliographic tools for the next generation web.
in PLoS computational biology

Kano Y
(2010)
Text mining meets workflow: linking U-Compare with Taverna.
in Bioinformatics (Oxford, England)

Kano Y
(2009)
U-Compare: share and compare text mining tools with UIMA.
in Bioinformatics (Oxford, England)

Kemper B
(2010)
PathText: a text mining integrator for biological pathway visualizations.
in Bioinformatics (Oxford, England)

Kocbek S
(2011)
AGRA: analysis of gene ranking algorithms.
in Bioinformatics (Oxford, England)

Kolluru B
(2011)
Automatic extraction of microorganisms and their habitats from free text using text mining workflows.
in Journal of integrative bioinformatics

Okazaki N
(2010)
Building a high-quality sense inventory for improved abbreviation disambiguation.
in Bioinformatics (Oxford, England)

Pyysalo S
(2012)
Event extraction across multiple levels of biological organization.
in Bioinformatics (Oxford, England)
Description | Systems biology is concerned with the modelling, visualisation and analysis of biochemical networks in which, for instance, metabolites are 'linked' by arrows representing the enzymes which turn one molecule into another or which are modified by particular substances. However, these diagrams are divorced from the scientific evidence on which they are based, which is represented by the scientific literature (and increasingly by online databases). The historical scientific literature is huge, and is increasing at an enormous rate (several thousand papers per week), so no one can possibly read it all. One solution is to use computers to 'read' these papers and present to the user only those which carry relevant information. Aspects of this subject are variously known as Natural Language Processing and Text Mining. Our contribution was to link text with pathways by going through papers, extracting the relevant pieces of information from each paper, and presenting them to the biological reader. |
Exploitation Route | The outputs of this research have influenced the way systems biologists carry out pathway reconstruction. The reconstruction of signalling and metabolic pathways in turn benefits pharma and systems medicine. |
Sectors | Chemicals,Digital/Communication/Information Technologies (including Software),Pharmaceuticals and Medical Biotechnology |
URL | http://www.nactem.ac.uk/facta/ |
Description | Findings have been used by Pharma industry to support drug discovery |
First Year Of Impact | 2010 |
Sector | Chemicals,Digital/Communication/Information Technologies (including Software),Manufacturing, including Industrial Biotechnology,Pharmaceuticals and Medical Biotechnology |
Impact Types | Economic |
Description | Big Science Mechanism |
Amount | £678,153 (GBP) |
Funding ID | W911NF-14-1-0333 |
Organisation | Defense Advanced Research Projects Agency (DARPA) |
Sector | Public |
Country | United States |
Start | 11/2014 |
End | 05/2017 |
Description | EuropePubMedCentral |
Amount | £680,000 (GBP) |
Funding ID | N/A |
Organisation | Wellcome Trust |
Department | KEMRI-Wellcome Trust Research Programme |
Sector | Academic/University |
Country | Kenya |
Start | 03/2008 |
End | 12/2015 |
Description | METANET4U |
Amount | € 350,000 (EUR) |
Funding ID | ICT PSP 270893 |
Organisation | European Commission |
Sector | Public |
Country | European Union (EU) |
Start | 01/2011 |
End | 02/2013 |
Title | AcroMine |
Description | Automatically recognises and expands biomedical acronyms. |
Type Of Material | Improvements to research infrastructure |
Year Produced | 2006 |
Provided To Others? | Yes |
Impact | Improved search for other text mining services such as KLEIO http://www.nactem.ac.uk/Kleio/ and Europe PubMed Central Evidence Finder http://labs.europepmc.org/evf |
URL | http://www.nactem.ac.uk/software/acromine/ |
Title | BioNLP Shared Task Resources |
Description | The BioNLP Shared Task series represents a community-wide move in bio-textmining toward fine-grained information extraction (IE). Manually annotated data, in which all annotations are bound to specific expressions in text, are provided for training, development and evaluation of extraction methods, and tools for detailed evaluation of system outputs are publicly available. |
Type Of Material | Database/Collection of data |
Year Produced | 2014 |
Provided To Others? | Yes |
Impact | The task setup and data have since served as the basis of numerous studies and published event extraction systems and datasets. |
URL | https://sites.google.com/site/bionlpst/ |
Description | KISTI Pathway |
Organisation | Korea Institute of Science and Technology Information (KISTI) |
Country | Korea, Republic of |
Sector | Academic/University |
PI Contribution | The construction of detailed, machine-readable models of biomolecular pathways is a major goal of systems biology, and hundreds of models capturing the physical entities and reactions involved in various pathways are already available from repositories such as the BioModels Database and the PANTHER Pathway repository. We support biologists by providing biomedical text mining systems that are increasingly capable of creating rich structured representations of information automatically extracted from the literature. Such text mining systems open many opportunities for supporting the curation, validation, and updating of pathway models. Building on the PathText text mining integration technology for pathways, text mining systems such as MEDIE and event extraction tools such as EventMine, we are developing methods for identifying literature relevant to specific reactions in pathway models and for automatically analysing documents to extract event structures that capture the full semantics of pathway reactions. |
Collaborator Contribution | Joint proposal of the BioNLP 2013 shared task; biologists from KISTI annotated reactions in a variety of signalling and metabolic pathways. |
Impact | Organisation of shared tasks, with resources made available to the community (http://2013.bionlp-st.org) |
Start Year | 2012 |
Title | Acromine |
Description | Automatically discovers acronyms from biomedical text and expands them into their long forms |
IP Reference | |
Protection | Copyrighted (e.g. software) |
Year Protection Granted | 2006 |
Licensed | Yes |
Impact | joint projects with Elsevier |
Title | Accelerated annotation tool |
Description | The ACELA (ACcElerated Annotation) tool aims to reduce the human effort required to produce a gold standard corpus of named entity (NE) annotations. The process of annotation is similar to active learning, in that it is performed as an iterative and interactive process between the human annotator and a machine-learned NE tagger. |
Type Of Technology | Webtool/Application |
Year Produced | 2008 |
Impact | The aim of the tool is to ensure that all NEs of a given type are annotated in a given corpus with minimum effort from the human annotator. Only those sentences that are most likely to contain NEs of the target type (according to the predictions of the tagger) are displayed for the human to annotate, which means that it is not necessary to read through all the sentences in the corpus that do not contain relevant entities. At each iteration of the process, the NE tagger is re-trained on all available sentences that have been annotated by the human, so that it makes increasingly accurate predictions about which sentences contain named entities. The tool also estimates the proportion of entities in the corpus that have been annotated by the human (coverage), and the annotation process stops when the figure is close to 100%. |
URL | http://www.nactem.ac.uk/acela/ |
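The iterative loop described for ACELA can be sketched as follows. This is a toy illustration under stated assumptions, not the tool's code: the "tagger" merely remembers tokens that the annotator has marked, the `oracle` callable stands in for the human annotator, and the loop stops when no remaining sentence scores above a threshold, which is a simplification of the coverage estimate that ACELA itself uses. All function names are hypothetical.

```python
def train(annotated):
    """Toy 'tagger': remember every token a human has marked as an entity."""
    return {token for _, entities in annotated for token in entities}

def score(sentence, known):
    """Fraction of tokens in the sentence that the tagger predicts are entities."""
    tokens = sentence.lower().split()
    return sum(t in known for t in tokens) / len(tokens) if tokens else 0.0

def annotate_corpus(corpus, oracle, seed, threshold=0.0):
    """Repeatedly show the most promising unannotated sentence to the annotator."""
    annotated = []                 # (sentence, entities marked by the human)
    remaining = list(corpus)
    known = set(seed)
    while remaining:
        best = max(remaining, key=lambda s: score(s, known))
        if score(best, known) <= threshold:
            break                  # tagger predicts no further entities: stop
        remaining.remove(best)
        annotated.append((best, oracle(best)))
        # Re-train on everything annotated so far.
        known = set(seed) | train(annotated)
    return annotated, remaining
```

With a seed entity list and an annotator function, `annotate_corpus` returns the sentences that were shown to the human and those that were never displayed, which is where the saving in annotation effort comes from.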
Title | Acromine |
Description | Acromine is an abbreviation dictionary automatically constructed from the whole MEDLINE as of April, 2009. Acromine identifies abbreviation definitions by assuming a word sequence co-occurring frequently with a parenthetical expression to be a potential expanded form. Applied to the whole MEDLINE (9,635,599 abstracts), the implemented system extracted 68,007 abbreviation candidates and recognized 467,402 expanded forms. The current Acromine achieves 99% precision and 82-95% recall on our evaluation corpus that roughly emulates the whole MEDLINE. |
Type Of Technology | Webtool/Application |
Year Produced | 2006 |
Impact | Improves search by including acronyms in query expansion; included in Europe PubMedCentral search system EvidenceFinder |
URL | http://www.nactem.ac.uk/software/acromine/ |
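The idea behind Acromine's description (a word sequence that frequently co-occurs with a parenthetical expression is a potential expanded form) can be sketched in a few lines of Python. This is a rough approximation rather than Acromine's algorithm: the function names, the `min_count` cut-off and the first-letter filter in `starts_like` are assumptions added here to keep the example small.

```python
import re
from collections import Counter, defaultdict

# A short token in parentheses, e.g. "(HR)" or "(TNF)".
PAREN = re.compile(r"\(([A-Za-z][\w-]{1,9})\)")

def candidates(text, max_words=6):
    """Yield (abbreviation, candidate long form) pairs for each parenthetical."""
    for match in PAREN.finditer(text):
        words = re.findall(r"[\w-]+", text[:match.start()])[-max_words:]
        # Every suffix of the preceding words is a potential expanded form.
        for n in range(1, len(words) + 1):
            yield match.group(1), " ".join(words[-n:]).lower()

def starts_like(abbr, long_form):
    """Assumed filter (not from the source): first letters must agree."""
    return long_form[0] == abbr[0].lower()

def build_dictionary(abstracts, min_count=2):
    """Keep, per abbreviation, the most frequent (then longest) plausible form."""
    counts = defaultdict(Counter)
    for text in abstracts:
        for abbr, long_form in set(candidates(text)):  # count once per abstract
            counts[abbr][long_form] += 1
    dictionary = {}
    for abbr, forms in counts.items():
        frequent = [f for f, c in forms.items()
                    if c >= min_count and starts_like(abbr, f)]
        if frequent:
            dictionary[abbr] = max(frequent, key=lambda f: (forms[f], len(f)))
    return dictionary
```

Passing a list of abstracts to `build_dictionary` yields a mapping from abbreviations to their most frequently co-occurring long forms, which is the kind of resource that can then be used for query expansion in search.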
Title | FACTA+ |
Description | A text mining service for mining direct and indirect associations |
Type Of Technology | Webtool/Application |
Year Produced | 2009 |
Impact | Used for hypothesis generation for clinical and biological applications. Linked with pathway curation. Cited in New Scientist http://www.nactem.ac.uk/newsitem.php?item=272 |
URL | http://www.nactem.ac.uk/facta/ |
Title | U-Compare |
Description | Interoperable text mining environment for comparing and evaluating text mining tools and for quickly creating web services |
Type Of Technology | Software |
Year Produced | 2010 |
Open Source License? | Yes |
Impact | Applied to areas beyond biomedicine, as part of the language network of excellence METANET and METASHARE. NaCTeM became the UK hub of the network of excellence METANET because of U-Compare |
URL | http://nactem.ac.uk/ucompare/ |
Description | Invited talk at the Royal Society of Chemistry |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | Yes |
Geographic Reach | National |
Primary Audience | Professional Practitioners |
Results and Impact | Invited talk about text mining methods for mining chemical information; networking activities with pharma |
Year(s) Of Engagement Activity | 2008 |
URL | http://www.rsc.org/events/detail/2773?CFID=3274168&CFTOKEN=dea17cccbb272dda-817E0913-DF41-7AF5-16045... |