PheneBank: automatic extraction and validation of a database of human phenotype-disease associations in the scientific literature

Lead Research Organisation: University of Cambridge

Department Name: Linguistics

Abstract

Free-text scientific literature has the potential to be an incredibly valuable source of data for uncovering the often hidden relationships between genes, diseases and phenotypes. Phenotypic descriptions cover abnormalities in anatomical structures, processes and behaviours, for example 'growth delay' and 'body weight loss'. Such descriptions form the basis for determining the existence and treatment of a disease but, because of their inherent complexity, have previously received less attention from the text mining community.
In recent years, significant effort has been spent by a small number of expert curators to create coding systems for phenotypes (called "ontologies"), such as the Human Phenotype Ontology (HP) [1] and the Mammalian Phenotype Ontology (MP). The PheneBank project proposes to support and speed up curation using terms discovered directly from the literature and to automatically integrate them with such standard ontologies.
There are three major challenges we seek to address: (1) knowledge brokering: to develop state of the art text mining approaches to identify phenotypic descriptions in scientific texts; (2) knowledge management: to create a structured resource of phenotype terms used in scientific texts and link them to existing coding systems; and (3) adding insight to evidence: to work with domain experts to utilize statistical association algorithms to identify meaningful phenotype-disease / phenotype-gene profiles. The disease profiles will be evaluated against hand curated standards in human disease databases (e.g. Online Mendelian Inheritance in Man and OrphaNet) with a focus on rare diseases. Mined data will be provided in a machine understandable database - a definitive output of the project - to support clinicians and scientists.
At the technological level the project will pioneer new methods for text mining that exploit machine learning (ML). Scientific texts remain a challenging area for a variety of reasons: descriptive naming, high levels of ambiguity and out-of-vocabulary words, use of complex sentence structures and an evolving vocabulary. Current techniques in term recognition employ ML in pipelines to search for continuous sequences of words that represent genes, proteins, cells, etc. State of the art models include conditional random fields using feature sets based on dictionaries as well as the local and topical context where the term is located. However, phenotype descriptions are often represented by discontinuous sequences, such as 'growth in the patient was delayed'. One key aspect not previously addressed is the capture of such non-canonical terms. This requires a different paradigm based on grammatical parsing algorithms that capture structural relations as well as joint learning techniques that can leverage large numbers of features simultaneously and optimise these across the diverse contexts in which phenotypes are mentioned.
The project also seeks to harness texts for extracting statistically significant associations between phenotypes, diseases and genes. Earlier approaches have suffered from not providing deep semantic descriptions of the relations they tried to target. This means that association scores merge notions of genetic, pharmacological, and epidemiological relations etc. without distinction. Our parsing-based approach is an attempt to overcome this issue by discovering more precise relationships. The approach follows ground breaking work at the Wellcome Trust Sanger Institute (WTSI), including terminology alignment of phenotypes using pairwise scoring of the conceptual elements that make up the phenotype.
An exciting aspect of this project is inter-disciplinary collaboration across stakeholders to build a resource of phenotype-disease profiles: (a) computer scientists from the Universities of Cambridge, Colorado and Manchester; (b) bioinformaticians and life scientists from the WTSI, McGill University and EMBL-EBI, and (c) clinicians from the NIHR Bioresource.

Technical Summary

PheneBank will exploit state of the art text mining (TM) together with existing ontological resources to collect fragmented biological results about the phenotypic profiles of human diseases and integrate them into a machine understandable semantic database and tool set for use by the clinical and scientific communities in their workflows. TM will be used to automatically extract phenotype, disease and gene terms from the full text scientific literature and harmonise these to ontological resources. Importantly, we will exploit grammatical parsing (the BLLIP parser) for recovering discontinuous terms and then a de-compositional approach in which the elements of the complex phenotype terms are individually linked to ontologies (e.g. Gene Ontology for processes, UBERON for structures). Harmonisation will explore learning-to-rank techniques to optimize concept selection from an array of tools such as cTAKES/MetaMap. Following this, typed relations between terms will be filtered using a range of machine learning classifiers on grammatical parse trees. Domain adaptation techniques will be tested for both term and relation extraction. Validation will follow standard protocols for text mining, including construction of a gold standard annotated corpus of 5000 sentences sampled from cited literature with a focus on rare human diseases. Phenotypic profiles will be built from strongly associated and related terms (e.g. phenotype-disease) by exploring a range of measures (e.g. Jaccard index, Information Content). Validation will use existing expert curated profiles in the OMIM and OrphaNet databases. Additionally we will explore the cross-species harmonisation of the human phenotype terms to closely associated mouse phenotypes. This will take place through conceptual mappings of PheneBank terms to the HPO and MP. The semantic database will include phenotypes and associated links for querying, navigation and download in a variety of formats (OWL/RDF/JSON).
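As an illustration of one of the association measures named above, the Jaccard index between a phenotype and a disease can be computed over the sets of documents that mention each term. The following is a minimal sketch only: the document identifiers, the term "disease X", and the helper name rank_phenotypes are hypothetical and are not taken from the PheneBank software or data.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard index |A & B| / |A | B|; returns 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical toy data: each term maps to the IDs of abstracts mentioning it.
mentions: Dict[str, Set[str]] = {
    "growth delay": {"d1", "d2", "d3", "d4"},
    "body weight loss": {"d5", "d6"},
    "disease X": {"d2", "d3", "d4", "d5"},
}


def rank_phenotypes(disease: str, phenotypes: List[str]) -> List[Tuple[str, float]]:
    """Score each phenotype against the disease and sort by descending score."""
    scored = [(p, jaccard(mentions[p], mentions[disease])) for p in phenotypes]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


profile = rank_phenotypes("disease X", ["growth delay", "body weight loss"])
for phenotype, score in profile:
    print(f"{phenotype}: {score:.2f}")
```

In this toy profile 'growth delay' shares three of five documents with the disease and therefore ranks above 'body weight loss', which shares one of five. A real profile would need a significance test or a threshold on top of the raw score, which this sketch does not attempt.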

Planned Impact

The PheneBank project aims to revolutionise how health experts leverage phenotypic evidence from the literature for their own clinical and scientific studies. This will be done using automatic text mining (TM) techniques that encode the scientific literature according to a machine understandable semantic representation. PheneBank goes beyond traditional TM by discovering structured associations across the literature as well as cross-species mappings between human and mouse phenotypes. This is highly relevant to a range of experts across life science domains. Semantic integration of phenotypes between the scientific literature and ontological coding systems brings us onto a path for full integration with electronic patient records.
Who will benefit from this research?
1. Life scientists and clinicians from a variety of disciplines will benefit from a novel database of evidence about phenotype associations. They will be able to more accurately, thoroughly and efficiently access phenotypic evidence from the scientific literature for use in their own workflows.
2. Bioinformaticians and database curators involved in knowledge discovery and data integration will benefit from data and tools that they can incorporate into their own workflows. This is highly relevant to knowledge integration efforts in initiatives such as ELIXIR.
3. Researchers and engineers in human language technologies, e-Science and information retrieval will benefit from new techniques, tools and data.
How will they benefit?
1. PheneBank will be of particular benefit to life scientists and clinicians investigating rare diseases through our partnership with Prof. Willem Ouwehand (NIHR Bioresource for Rare Diseases) and advisory group member Prof. Paul Lasko (McGill University and Chair of the International Rare Disease Research Consortium). In particular we expect that mined phenotype profiles will contribute to the understanding, diagnosis and treatment of rare diseases by bringing together previously hidden evidence from across the literature using term harmonisation techniques. Building on the project partner's network, we will engage directly with the rare disease community, providing a pathway to deploying the tools and data.
2. There will be multiple opportunities throughout the project to engage with bioinformaticians and database curators. To take just three examples: (1) Partnership with Dr. Peter Robinson, leader of the Human Phenotype Ontology, will allow data dissemination to one of the most widely used coding systems and engagement with his rare disease partner network. PheneBank outputs will supplement scarce human expertise using high throughput text processing, bridge traditional disciplines and be of a measurable quality. (2) Collaboration with Dr. Skarnes at the Wellcome Trust Sanger Institute (WTSI) will facilitate collaboration with UK groups involved in the informatics of rare diseases such as the Deciphering Developmental Disorders project. (3) Working with Dr. Jo McEntyre at Europe PMC provides an opportunity to make our annotations available to the scientific research and database curation community and to engage with curators and industrial groups (e.g. through the EMBL-EBI Industry Programme).
3. The PheneBank project pioneers new methods for Natural Language Processing (NLP) and Machine Learning (ML) on scientific literature. We propose to develop a novel combination of ML approaches on maximally rich NLP features in order to understand the meaning of disjoint scientific terms, harmonise them to clinical ontology standards and explore a range of association measures against human curation standards. Researchers and engineers will benefit from tools, novel data sets (e.g. CRAFT and the Europe PMC corpus) and techniques released through publication.
A stakeholder workshop will be organised in the second year of the project at a major biomedical informatics conference such as ISMB/ECCB to raise awareness of the project and invite feedback.

Funded Value:

£464,013

Funded Period:

Nov 15 - Nov 18

Funder:

MRC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

MR/M025160/1

Principal Investigator:

Nigel Collier

Health Category:

Unclassified

Organisations

People	ORCID iD
Nigel Collier (Principal Investigator)
Bill Skarnes (Co-Investigator)
Anna Korhonen (Co-Investigator)
Damian Smedley (Researcher)

Publications

Camacho-Collados J. (2017) SemEval-2017 Task 2: Multilingual and Cross-lingual Semantic Word Similarity in Proceedings of the Annual Meeting of the Association for Computational Linguistics

Gritta M (2018) A Pragmatic Guide to Geoparsing Evaluation

Gritta M (2019) A pragmatic guide to geoparsing evaluation

Gritta M (2020) A pragmatic guide to geoparsing evaluation: Toponyms, Named Entity Recognition and pragmatics. in Language resources and evaluation

Gritta M (2019) A Pragmatic Guide to Geoparsing Evaluation

Gritta M (2017) Vancouver Welcomes You! Minimalist Location Metonymy Resolution

Le H (2016) Sieve-based coreference resolution enhances semi-supervised learning model for chemical-induced disease relation extraction.

Le H (2018) Large-scale Exploration of Neural Relation Classification Architectures

Le HQ (2016) Sieve-based coreference resolution enhances semi-supervised learning model for chemical-induced disease relation extraction. in Database : the journal of biological databases and curation

Further Funding
Research Databases and Models
Research Tools and Methods
Collaboration
Engagement Activities


Description	EPI-AI: Automated Understanding and Alerting of Disease Outbreaks from Global News Media
Amount	£491,373 (GBP)
Funding ID	ES/T012277/1
Organisation	Economic and Social Research Council
Sector	Public
Country	United Kingdom
Start	02/2020
End	01/2023


Description	Health Data Research National Text Analytics project
Amount	£6,000,000 (GBP)
Organisation	Health Data Research UK
Sector	Private
Country	United Kingdom
Start	01/2020
End	02/2021


Title	PheneBank named entity recognition software
Description	PheneBank aims at automatic extraction and validation of a database of human phenotype-disease associations in the scientific literature. This software package provides code, data, and models for the following two purposes: (1) named entity recognition of phenotypes and other entities of biological interest; (2) harmonisation of entities of interest to standard vocabularies including SNOMED CT and the Human Phenotype Ontology.
Type Of Material	Improvements to research infrastructure
Year Produced	2018
Provided To Others?	Yes
Impact	Our team has developed, tested and used this software method to semantically annotate the whole of Medline and the Open Access collection of PubMed.
URL	https://github.com/pilehvar/phenebank


Title	PheneBank database
Description	Free-text scientific literature has the potential to be an incredibly valuable source of data for uncovering the often hidden relationships between genes, diseases and phenotypes. Phenotypic descriptions cover abnormalities in anatomical structures, processes and behaviours, for example 'growth delay' and 'body weight loss'. Such descriptions form the basis for determining the existence and treatment of a disease but, because of their inherent complexity, have previously received less attention from the text mining community. In recent years, significant effort has been spent by a small number of expert curators to create coding systems for phenotypes (called "ontologies"), such as the Human Phenotype Ontology (HP) and the Mammalian Phenotype Ontology (MP). The PheneBank project proposes to support and speed up curation using terms discovered directly from the literature and to automatically integrate them with such standard ontologies. The online database at phenebank.org presents our results from harnessing texts to extract statistically significant associations between mentions of phenotypes and diseases in the scientific literature. Three options are available: (1) Users can submit their own scientific texts and have them semantically annotated for entities and ontology identifiers; (2) Users can browse the database of 24 million Medline abstracts that have been semantically annotated in PheneBank; (3) Users can search for relations between diseases and phenotypes that have been discovered from the scientific literature.
Type Of Material	Database/Collection of data
Year Produced	2018
Provided To Others?	Yes
Impact	The database will be publicized in 2019 through a journal publication that we are currently writing.
URL	http://phenebank.org/


Title	Research data for entire Medline annotated automatically with phenotypes, diseases, chemicals and gene entities
Description	As an output of the PheneBank project, we release the set of 24 million MEDLINE abstracts annotated with 9 classes of entity: Phenotype, Disease, Anatomy, Cell, Cell_line, GPR, Gene_variant, Molecule, and Pathway. The entities have been mapped to five major ontologies: SNOMED, HPO, MeSH, PRO, and FMA. Note that the computational model was improved during the latter half of 2018 and re-run to produce an improved version (available from the same DOI).
Type Of Material	Database/Collection of data
Year Produced	2018
Provided To Others?	Yes
Impact	The database is currently being used to mine rare disease/phenotype associations as part of the PheneBank project. We have made it freely available to further the research investigations of other groups.
URL	https://zenodo.org/record/1167696#.Wp0zb3zLiUk


Title	Research data supporting "Vancouver Welcomes You! Minimalist Location Metonymy Resolution"
Description	Complete supporting/replication data and code for the ACL Publication. The paper was published in August 2017 at www.acl2017.org
Type Of Material	Database/Collection of data
Year Produced	2017
Provided To Others?	Yes


Title	Research data supporting "What's missing in geographical parsing?"
Description	Full code and data required for replication and experimentation.
Type Of Material	Database/Collection of data
Year Produced	2017
Provided To Others?	Yes


Description	Healtex: UK Healthcare Text Analytics Research Network
Organisation	University of Manchester
Department	Health E-Research Centre
Country	United Kingdom
Sector	Academic/University
PI Contribution	Healtex is an EPSRC-sponsored (EP/N027280/1) UK multi-disciplinary research network that aims to explore the barriers to effectively utilising healthcare narrative text data, road-map research efforts and principles for sharing text data and text analytics methods between academia, NHS and industry. It is funded as part of the EPSRC Healthcare Technologies Grand Challenges theme. I am co-leading a challenge stream on 'Data-driven text mining and NLP'.
Collaborator Contribution	The HealTex network opens up dialogue between technologists in NLP/text mining and the potential user community in the NHS and industry. As such it is a valuable avenue to impact for the work taking place in the EPSRC SIPHS project. I am co-leading the HealTex network's 'data-driven text mining and NLP' challenge stream and aim to use this to promote dialogue and uptake around the SIPHS project themes.
Impact	Invited talk at HealTex launch event
Start Year	2016


Description	HealTex launch event
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Professional Practitioners
Results and Impact	Approximately 100 clinicians, technologists and members of the business community attended the opening launch event for the EPSRC UK HealTex network where I gave an invited talk entitled 'NLP capabilities and challenges in the health arena'. The talk resulted in follow up requests from colleagues for further information and participation (e.g. in social media coding for veterinary medical insights).
Year(s) Of Engagement Activity	2016
URL	http://healtex.org/event/healtex-launch/


Description	Invited talk at Big Data in Medicine, Cancer Research UK
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	Regional
Primary Audience	Professional Practitioners
Results and Impact	Approximately 500 clinicians, life scientists and others attended my invited talk at the Big Data in Medicine Workshop held at Cancer Research UK Cambridge Institute. The title of the talk was "Undiscovered scientific knowledge from large unstructured collections in an era of Big Data". The talk prompted discussions afterwards and a contact from an industrial group seeking talks on collaboration.
Year(s) Of Engagement Activity	2015
URL	http://www.bigdata.cam.ac.uk/events/events-archive/big-data-in-medicine-exemplars-and-opportunities-...


Description	Invited talk at King's College London
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Professional Practitioners
Results and Impact	An invited talk to the King's College London SGDP Research Centre entitled "Automated Coding of Biomedical Texts". The talk led to discussions concerning digital phenotyping and its role within ongoing projects nationally.
Year(s) Of Engagement Activity	2018


Description	Invited talk at the European Bioinformatics Institute
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	Local
Primary Audience	Professional Practitioners
Results and Impact	Approximately 60 life scientists, database curators, bioinformaticians and software engineers attended my talk on "Natural language processing for semantic interoperability in unstructured big data".
Year(s) Of Engagement Activity	2015


Description	Organised and attended the Phenotype Day workshop (ISMB 2016, Orlando)
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Approximately 50 life scientists, clinicians, bioinformaticians and computer scientists attended the Phenotype Day workshop at ISMB 2016. We discussed the representation, acquisition, discovery and interoperability of clinical phenotype data, including in new reporting media such as patient forums. In addition to a call for research papers, the workshop included invited talks from keynote speakers Wendy Chapman (University of Utah) and Zhiyong Lu (National Center for Biotechnology Information, US). A special issue of the workshop proceedings was released in the Journal of Biomedical Semantics.
Year(s) Of Engagement Activity	2016
URL	https://jbiomedsem.biomedcentral.com/articles/10.1186/s13326-016-0108-7


Description	Talk at the Alan Turing Institute
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	Regional
Primary Audience	Professional Practitioners
Results and Impact	Ten researchers in the field of Natural Language Processing attended a talk I gave at the Alan Turing Institute in London entitled "Entity Linking using Heterogenous Health Text Data". The talk led to further discussions concerning the alignment of textual and knowledge graph spaces as well as ongoing discussions concerning future funding applications.
Year(s) Of Engagement Activity	2018


Description	Talk to the Cambridge University Science Society
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	Regional
Primary Audience	Undergraduate students
Results and Impact	I was invited by the Cambridge University Science Society to give a talk about my research on supporting health research using Natural Language Processing. The talk was attended by about 60 undergraduate students, postdocs and senior scientists. The talk sparked questions and discussions afterwards about how NLP could support integration of evidence in biomedical informatics.
Year(s) Of Engagement Activity	2020
URL	http://talks.cam.ac.uk/talk/index/137884