Tools for the text mining-based visualisation of the provenance of biochemical networks

Lead Research Organisation: University of Manchester

Department Name: Chemistry

Abstract

Systems biology is concerned with the modelling, visualisation and analysis of biochemical networks in which, for instance, metabolites are 'linked' by arrows representing the enzymes which turn one molecule into another or which are modified by particular substances. However, these diagrams are divorced from the scientific evidence on which they are based, which is represented by the scientific literature (and increasingly by online databases). Moreover, the historical scientific literature is huge and is increasing at an enormous rate (several thousand papers per week), so no one can possibly read it all. One solution is to use computers to 'read' these papers and present to the user only those which carry relevant information. Aspects of this subject are variously known as Natural Language Processing and Text Mining. Text Mining goes through papers, extracts the relevant pieces of information from each paper, and presents them to the biological reader. A particular problem is the use by biologists of multiple names for the same thing. Text Mining can assist here since it is able to find all the variations of the same name and link them with the relevant text and databases. Text Mining can also find the TYPES of relationship between these names, and this is the basis by which computers can discover and display scientific evidence. The Text Mining System will produce and index such evidence, for specific problems, and this will be stored in an appropriately structured database. The aim of the project is therefore to develop and deploy the necessary Text Mining tools and to use them to display to the user the different relationships and the literature on which they are based. This will be done by encoding the interactions as arrows of various colours that link to a dynamic website of relevant literature, thus giving a direct linkage between the systems biology diagrams and the evidence for them.

Technical Summary

Systems biology is concerned with the modelling, visualisation and analysis of biochemical networks in which, for instance, metabolites are 'linked' by arrows representing the enzymes which turn one molecule into another or which are modified by particular substances. SBML provides a computer-readable 'standard' for describing such biochemical or signalling networks. However, these diagrams (and thus the SBML models) are divorced from the scientific evidence on which they are based, represented by the scientific literature (and increasingly by online databases). In order to overcome the problems of reading the burgeoning scientific literature, we shall deploy Text Mining (TM). TM involves named entity recognition (i.e. semantic annotation of enzymes, metabolites, etc.) and information extraction (i.e. relationship extraction between named entities). An important part of this proposal is to find solutions for the terminology problem in systems biology, by developing techniques for recognising synonymous terms. Based on our efficient parsing techniques, we shall extract relationships between entities that will form the basis by which we can discover, index, store and display the scientific evidence for such linkages. The selection of the most pertinent relationships will be performed using our preferred methods of advanced machine learning (Support Vector Machines and Genetic Programming). The overall aim of the project is thus to develop and deploy the necessary TM tools and to use them to display the different relationships to the user together with the literature from which they have been extracted. The different types (and strength) of evidence for these interactions will then be visualised directly and linked to a dynamic website of the literature. This will thus give users a direct linkage between the systems biology diagrams encoded in (an advanced form of) SBML and the scientific evidence for them. Where available, linkages to kinetic data will also be made.
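
Illustrative sketch (not part of the original record): the pipeline described above, in which synonyms are normalised, relationships between named entities are extracted, and each relationship is indexed with the literature that supports it, might look roughly like the following in Python. The project itself proposed parsing and machine learning (Support Vector Machines and Genetic Programming); this sketch swaps in a hand-written synonym table and a single regular expression purely for illustration. The synonym table, the two relation verbs and the document identifiers are all hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical synonym table: surface form (lower case) -> canonical name.
SYNONYMS = {
    "hexokinase": "hexokinase",
    "hk": "hexokinase",
    "glucose": "glucose",
    "dextrose": "glucose",
    "g6p": "glucose 6-phosphate",
}

# Toy pattern standing in for a real relation extractor:
# "<entity> <verb> <entity>" with two hard-coded relation verbs.
RELATION = re.compile(
    r"([\w-]+)\s+(phosphorylates|inhibits)\s+([\w-]+)", re.IGNORECASE
)


def extract_evidence(docs):
    """Map (subject, relation, object) to the ids of documents supporting it."""
    evidence = defaultdict(list)
    for doc_id, text in docs.items():
        for subj, verb, obj in RELATION.findall(text):
            subj_name = SYNONYMS.get(subj.lower())
            obj_name = SYNONYMS.get(obj.lower())
            # Keep a relation only when both ends are known entities.
            if subj_name and obj_name:
                evidence[(subj_name, verb.lower(), obj_name)].append(doc_id)
    return dict(evidence)


# Hypothetical mini-corpus keyed by made-up document ids.
docs = {
    "doc1": "Hexokinase phosphorylates glucose.",
    "doc2": "HK phosphorylates dextrose in yeast.",
    "doc3": "G6P inhibits hexokinase.",
}
evidence = extract_evidence(docs)
```

Because the two differently worded sentences are normalised to the same canonical names, they end up as two pieces of evidence for one network edge, which is the kind of linkage between a diagram arrow and its supporting literature that the summary describes.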

Funded Value:

£549,457

Funded Period:

Jan 07 - Jun 10

Funder:

BBSRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

BB/E004431/1

Principal Investigator:

Sophia Ananiadou

Research Topic:

Unclassified

Organisations

People	ORCID iD
Sophia Ananiadou (Principal Investigator)	http://orcid.org/0000-0002-4097-9191
Junichi Tsujii (Co-Investigator)
Pedro Mendes (Co-Investigator)
Steve Pettifer (Co-Investigator)

Publications

Tsuruoka Y (2008) Accelerating the annotation of sparse named entities by dynamic sentence selection. in BMC bioinformatics

Kocbek S (2011) AGRA: analysis of gene ranking algorithms. in Bioinformatics (Oxford, England)

Pyysalo S (2014) Anatomical entity mention recognition at literature scale. in Bioinformatics (Oxford, England)

Kolluru B (2011) Automatic extraction of microorganisms and their habitats from free text using text mining workflows. in Journal of integrative bioinformatics

Okazaki N (2010) Building a high-quality sense inventory for improved abbreviation disambiguation. in Bioinformatics (Oxford, England)

Thompson P (2009) Construction of an annotated corpus to support biomedical information extraction. in BMC bioinformatics

Hull D (2008) Defrosting the digital library: bibliographic tools for the next generation web. in PLoS computational biology

Pyysalo S (2012) Event extraction across multiple levels of biological organization. in Bioinformatics (Oxford, England)

Ananiadou S (2010) Event extraction for systems biology by text mining the literature. in Trends in biotechnology

Sasaki Y (2011) Extracting secondary bio-event arguments with extraction constraints. in Computational Intelligence

Key Findings
Impact Summary
Further Funding
Research Databases and Models
Research Tools and Methods
Collaboration
Intellectual Property
Software and Technical Products
Engagement Activities


Description	Systems biology is concerned with the modelling, visualisation and analysis of biochemical networks in which, for instance, metabolites are 'linked' by arrows representing the enzymes which turn one molecule into another or which are modified by particular substances. However, these diagrams are divorced from the scientific evidence on which they are based, which is represented by the scientific literature (and increasingly by online databases). Moreover, the historical scientific literature is huge and is increasing at an enormous rate (several thousand papers per week), so no one can possibly read it all. One solution is to use computers to 'read' these papers and present to the user only those which carry relevant information. Aspects of this subject are variously known as Natural Language Processing and Text Mining. Our contribution was to link text with pathways by going through papers, extracting the relevant pieces of information from each paper, and presenting them to the biological reader.
Exploitation Route	The outputs of this research have had an impact on the way systems biologists carry out pathway reconstruction. The reconstruction of signalling and metabolic pathways in turn benefits pharma and systems medicine.
Sectors	Chemicals,Digital/Communication/Information Technologies (including Software),Pharmaceuticals and Medical Biotechnology
URL	http://www.nactem.ac.uk/facta/


Description	Findings have been used by the pharmaceutical industry to support drug discovery.
First Year Of Impact	2010
Sector	Chemicals,Digital/Communication/Information Technologies (including Software),Manufacturing, including Industrial Biotechnology,Pharmaceuticals and Medical Biotechnology
Impact Types	Economic


Description	Big Science Mechanism
Amount	£678,153 (GBP)
Funding ID	W911NF-14-1-0333
Organisation	Defense Advanced Research Projects Agency (DARPA)
Sector	Public
Country	United States
Start	11/2014
End	05/2017


Description	EuropePubMedCentral
Amount	£680,000 (GBP)
Funding ID	N/A
Organisation	Wellcome Trust
Department	KEMRI-Wellcome Trust Research Programme
Sector	Academic/University
Country	Kenya
Start	03/2008
End	12/2015


Description	METANET4U
Amount	€ 350,000 (EUR)
Funding ID	ICT PSP 270893
Organisation	European Commission
Sector	Public
Country	European Union (EU)
Start	01/2011
End	02/2013


Title	AcroMine
Description	Automatically recognises and expands biomedical acronyms.
Type Of Material	Improvements to research infrastructure
Year Produced	2006
Provided To Others?	Yes
Impact	Improved search for other text mining services such as KLEIO http://www.nactem.ac.uk/Kleio/ and Europe PubMed Central Evidence Finder http://labs.europepmc.org/evf
URL	http://www.nactem.ac.uk/software/acromine/


Title	BioNLP Shared Task Resources
Description	The BioNLP Shared Task series represents a community-wide move in bio-textmining toward fine-grained information extraction (IE). Manually annotated data, in which all annotations are bound to specific expressions in text, are provided for training, development and evaluation of extraction methods, and tools for detailed evaluation of system outputs are publicly available.
Type Of Material	Database/Collection of data
Year Produced	2014
Provided To Others?	Yes
Impact	The task setup and data have since served as the basis of numerous studies and published event extraction systems and datasets.
URL	https://sites.google.com/site/bionlpst/


Description	KISTI Pathway
Organisation	Korea Institute of Science and Technology Information (KISTI)
Country	Korea, Republic of
Sector	Academic/University
PI Contribution	The construction of detailed, machine-readable models of biomolecular pathways is a major goal of systems biology, and hundreds of models capturing the physical entities and reactions involved in various pathways are already available from repositories such as the BioModels Database and the PANTHER Pathway repository. We support biologists by providing biomedical text mining systems, which are increasingly capable of creating rich structured representations of information automatically extracted from the literature. Such text mining systems open many opportunities for supporting the curation, validation, and updating of pathway models. Building on the PathText text mining integration technology for pathways, text mining systems such as MEDIE, and event extraction tools such as EventMine, we are developing methods for identifying literature relevant to specific reactions in pathway models and for automatically analysing documents to extract event structures that capture the full semantics of pathway reactions.
Collaborator Contribution	Joint proposal of the BioNLP 2013 shared task; biologists from KISTI annotated reactions in a variety of signalling and metabolic pathways.
Impact	Organisation of shared tasks, with resources made available to the community (http://2013.bionlp-st.org)
Start Year	2012


Title	Acromine
Description	Automatically discovers acronyms from biomedical text and expands them into their long forms
IP Reference
Protection	Copyrighted (e.g. software)
Year Protection Granted	2006
Licensed	Yes
Impact	joint projects with Elsevier


Title	Accelerated annotation tool
Description	The ACELA (ACcElerated Annotation) tool aims to reduce the human effort required to produce a gold standard corpus of named entity (NE) annotations. The process of annotation is similar to active learning, in that it is performed as an iterative and interactive process between the human annotator and a machine-learned NE tagger.
Type Of Technology	Webtool/Application
Year Produced	2008
Impact	The aim of the tool is to ensure that all NEs of a given type are annotated in a given corpus with minimum effort from the human annotator. Only those sentences that are most likely to contain NEs of the target type (according to the predictions of the tagger) are displayed for the human to annotate, which means that it is not necessary to read through all the sentences in the corpus that do not contain relevant entities. At each iteration of the process, the NE tagger is re-trained on all available sentences that have been human annotated, meaning that it makes increasingly accurate predictions about which sentences contain named entities. The tool also estimates the proportion of entities in the corpus that have been annotated by the human (coverage), and the annotation process stops when this figure is close to 100%.
URL	http://www.nactem.ac.uk/acela/
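
Illustrative sketch (not part of the original record): the iterative process described for the accelerated annotation tool can be outlined in Python as below. This is a minimal sketch under stated assumptions, not the tool's implementation. The "tagger" here is a toy per-token frequency estimate rather than a machine-learned NE tagger; the `prior` probability for unseen tokens, the `target` coverage threshold, the coverage formula and the `ask_human` callback (which stands in for the human annotator) are all hypothetical choices made for the example.

```python
from collections import Counter


def annotate_with_selection(corpus, ask_human, prior=0.1, target=0.9):
    """Annotate the most promising sentence first; stop at estimated coverage."""
    seen, as_entity = Counter(), Counter()
    pool = list(range(len(corpus)))  # indices of sentences not yet annotated
    found = annotated = 0
    coverage = 0.0

    def p_token(tok):
        # Toy tagger: fraction of annotated occurrences that were entities.
        return as_entity[tok] / seen[tok] if seen[tok] else prior

    def p_contains(idx):
        # Probability that the sentence holds at least one entity.
        p_none = 1.0
        for tok in corpus[idx]:
            p_none *= 1.0 - p_token(tok)
        return 1.0 - p_none

    while pool:
        best = max(pool, key=p_contains)  # ties go to the earliest sentence
        pool.remove(best)
        entities = ask_human(corpus[best])  # the human annotates one sentence
        for tok in corpus[best]:  # "re-train" the toy tagger
            seen[tok] += 1
            if tok in entities:
                as_entity[tok] += 1
                found += 1
        annotated += 1
        # Coverage: mentions found vs. found plus those expected in the pool.
        expected_left = sum(p_token(t) for i in pool for t in corpus[i])
        total = found + expected_left
        coverage = found / total if total else 1.0
        if coverage >= target:
            break
    return {"annotated": annotated, "found": found, "coverage": coverage}


# Hypothetical six-sentence corpus; GOLD simulates what a human would mark.
GOLD = {"p53", "mdm2"}
corpus = [
    ["p53", "binds", "mdm2"],
    ["the", "cells", "were", "washed"],
    ["mdm2", "inhibits", "p53"],
    ["samples", "were", "stored"],
    ["p53", "is", "mutated"],
    ["the", "buffer", "was", "added"],
]
stats = annotate_with_selection(corpus, lambda sent: {t for t in sent if t in GOLD})
```

On this toy corpus the loop stops before every sentence has been shown to the annotator, which is the effort saving the description refers to. A real system would replace the frequency table with a trained tagger and would need a more careful coverage estimate.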


Title	Acromine
Description	Acromine is an abbreviation dictionary automatically constructed from the whole MEDLINE as of April, 2009. Acromine identifies abbreviation definitions by assuming a word sequence co-occurring frequently with a parenthetical expression to be a potential expanded form. Applied to the whole MEDLINE (9,635,599 abstracts), the implemented system extracted 68,007 abbreviation candidates and recognized 467,402 expanded forms. The current Acromine achieves 99% precision and 82-95% recall on our evaluation corpus that roughly emulates the whole MEDLINE.
Type Of Technology	Webtool/Application
Year Produced	2006
Impact	Improves search by including acronyms in query expansion; included in Europe PubMedCentral search system EvidenceFinder
URL	http://www.nactem.ac.uk/software/acromine/
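
Illustrative sketch (not part of the original record): the idea that a word sequence co-occurring frequently with a parenthetical expression is a likely expanded form can be shown with a few lines of Python. This is a simplified sketch, not Acromine's code. The tokenisation, the `max_words` limit and the scoring function (frequency of a candidate, discounted by how consistently it is preceded by the same extra word) are assumptions made for illustration, and the example sentences are invented.

```python
import re
from collections import Counter, defaultdict

PAREN = re.compile(r"\(([A-Za-z][A-Za-z0-9-]{1,9})\)")  # e.g. "(TNF)"
WORD = re.compile(r"[A-Za-z0-9-]+")


def collect_candidates(texts, max_words=4):
    """Count the word sequences that directly precede each '(ABBR)'."""
    counts = defaultdict(Counter)
    for text in texts:
        for match in PAREN.finditer(text):
            left = [w.lower() for w in WORD.findall(text[: match.start()])]
            # One extra word is counted so that the longest candidates
            # can also be checked for longer variants.
            for n in range(1, min(len(left), max_words + 1) + 1):
                counts[match.group(1)][tuple(left[-n:])] += 1
    return counts


def score(cand, freq):
    """Frequency of cand, discounted when one longer form dominates it."""
    longer = [f for t, f in freq.items()
              if len(t) == len(cand) + 1 and t[1:] == cand]
    if not longer:
        return float(freq[cand])
    return freq[cand] - sum(f * f for f in longer) / sum(longer)


def best_expansion(abbr, counts, max_words=4):
    """Return (expanded form, score) for abbr, or None if it was never seen."""
    freq = counts[abbr]
    scored = [(score(c, freq), c) for c in freq if len(c) <= max_words]
    if not scored:
        return None
    top_score, top = max(scored)
    return " ".join(top), top_score


# Invented example sentences.
texts = [
    "The role of tumor necrosis factor (TNF) in inflammation.",
    "Levels of tumor necrosis factor (TNF) were raised.",
    "Serum tumor necrosis factor (TNF) was measured.",
    "We found that tumor necrosis factor (TNF) binds.",
]
counts = collect_candidates(texts)
result = best_expansion("TNF", counts)
```

Shorter sequences such as "necrosis factor" score low because they are always preceded by the same word, while over-long sequences such as "of tumor necrosis factor" are simply less frequent, so the full expanded form wins without any letter matching against the abbreviation.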


Title	FACTA+
Description	A text mining service for mining direct and indirect associations
Type Of Technology	Webtool/Application
Year Produced	2009
Impact	Used for hypothesis generation for clinical and biological applications. Linked with pathway curation. Cited in New Scientist http://www.nactem.ac.uk/newsitem.php?item=272
URL	http://www.nactem.ac.uk/facta/


Title	U-Compare
Description	Interoperable text mining environment for comparing and evaluating text mining tools and for quickly creating web services
Type Of Technology	Software
Year Produced	2010
Open Source License?	Yes
Impact	Applied to areas beyond biomedicine, as part of the language network of excellence METANET and METASHARE. NaCTeM became the UK hub of the network of excellence METANET because of U-Compare
URL	http://nactem.ac.uk/ucompare/


Description	Invited talk at the Royal Society of Chemistry
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	Yes
Geographic Reach	National
Primary Audience	Professional Practitioners
Results and Impact	Invited talk about text mining methods for mining chemical information; networking activities with pharma
Year(s) Of Engagement Activity	2008
URL	http://www.rsc.org/events/detail/2773?CFID=3274168&CFTOKEN=dea17cccbb272dda-817E0913-DF41-7AF5-16045...
