PheneBank: automatic extraction and validation of a database of human phenotype-disease associations in the scientific literature

Lead Research Organisation: University of Cambridge
Department Name: Linguistics

Abstract

Free text scientific literature has the potential to be an incredibly valuable source of data for uncovering the often hidden relationships between genes, diseases and phenotypes. Phenotypic descriptions cover abnormalities in anatomical structures, processes and behaviours, for example 'growth delay' and 'body weight loss'. Such descriptions form the basis for determining the existence and treatment of a disease but, because of their inherent complexity, have previously received less attention from the text mining community.
In recent years, significant effort has been spent by a small number of expert curators to create coding systems for phenotypes (called "ontologies"), such as the Human Phenotype Ontology (HP) [1] and the Mammalian Phenotype Ontology (MP). The PheneBank project proposes to support and speed up curation using terms discovered directly from the literature and to automatically integrate them with such standard ontologies.
There are three major challenges we seek to address: (1) knowledge brokering: to develop state of the art text mining approaches to identify phenotypic descriptions in scientific texts; (2) knowledge management: to create a structured resource of phenotype terms used in scientific texts and link them to existing coding systems; and (3) adding insight to evidence: to work with domain experts to utilize statistical association algorithms to identify meaningful phenotype-disease / phenotype-gene profiles. The disease profiles will be evaluated against hand curated standards in human disease databases (e.g. Online Mendelian Inheritance in Man and OrphaNet) with a focus on rare diseases. Mined data will be provided in a machine understandable database - a definitive output of the project - to support clinicians and scientists.
At the technological level the project will pioneer new methods for text mining that exploit machine learning (ML). Scientific texts remain a challenging area for a variety of reasons: descriptive naming, high levels of ambiguity/out of vocabulary words, use of complex sentence structures and an evolving vocabulary. Current techniques in term recognition employ ML in pipelines to search for continuous sequences of words that represent genes, proteins and cells etc. State of the art models include conditional random fields using feature sets based on dictionaries as well as the local and topical context where the term is located. However, phenotype descriptions are often represented by discontinuous sequences, such as 'growth in the patient was delayed'. One key aspect not previously addressed is in the capture of such non-canonical terms. This requires a different paradigm based on grammatical parsing algorithms that capture structural relations as well as joint learning techniques that can leverage large numbers of features simultaneously and optimise these across the diverse contexts in which phenotypes are mentioned.
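As a toy illustration of why contiguous matching misses the example above, the sketch below compares an exact word-sequence lookup with a simple gapped match. The gapped match is only a crude window-based stand-in and is not the grammatical parsing approach the project proposes; the function names, the word-form sets and the `max_gap` parameter are all hypothetical.

```python
PUNCT = ".,;:'\"()"


def tokenize(text):
    """Lowercase whitespace tokenizer that strips surrounding punctuation."""
    tokens = (tok.strip(PUNCT).lower() for tok in text.split())
    return [tok for tok in tokens if tok]


def contiguous_match(tokens, term):
    """True if the words of `term` appear as one continuous sequence."""
    words = term.split()
    n = len(words)
    return any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))


def gapped_match(tokens, first_forms, second_forms, max_gap=5):
    """True if a token from `first_forms` is followed by a token from
    `second_forms` with at most `max_gap` tokens in between.

    Assumption: a fixed window replaces real syntactic analysis here.
    """
    for i, tok in enumerate(tokens):
        if tok in first_forms:
            window = tokens[i + 1:i + 2 + max_gap]
            if any(t in second_forms for t in window):
                return True
    return False


# Hypothetical word forms for the phenotype 'growth delay'.
FIRST = {"growth"}
SECOND = {"delay", "delayed"}

canonical = tokenize("Growth delay was observed.")
discontinuous = tokenize("Growth in the patient was delayed.")

found_canonical = contiguous_match(canonical, "growth delay")
missed_by_contiguous = not contiguous_match(discontinuous, "growth delay")
found_by_gapped = gapped_match(discontinuous, FIRST, SECOND)
```

A window of this kind over-generates on longer sentences, which is one reason to prefer structural relations from a parser over raw token distance.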
The project also seeks to harness texts for extracting statistically significant associations between phenotypes, diseases and genes. Earlier approaches have suffered from not providing deep semantic descriptions of the relations they tried to target. This means that association scores merge notions of genetic, pharmacological, and epidemiological relations etc. without distinction. Our parsing-based approach is an attempt to overcome this issue by discovering more precise relationships. The approach follows ground breaking work at the Wellcome Trust Sanger Institute (WTSI), including terminology alignment of phenotypes using pairwise scoring of the conceptual elements that make up the phenotype.
An exciting aspect of this project is inter-disciplinary collaboration across stakeholders to build a resource of phenotype-disease profiles: (a) computer scientists from the Universities of Cambridge, Colorado and Manchester; (b) bioinformaticians and life scientists from the WTSI, McGill University and EMBL-EBI, and (c) clinicians from the NIHR Bioresource.

Technical Summary

PheneBank will exploit state of the art text mining (TM) together with existing ontological resources to collect fragmented biological results about the phenotypic profiles of human diseases and integrate these into a machine understandable semantic database and tool set for use by the clinical and scientific communities in their workflows. TM will be used to automatically extract phenotype, disease and gene terms from the full text scientific literature and harmonise these to ontological resources. Importantly, we will exploit grammatical parsing (the BLLIP parser) for recovering discontinuous terms and then a de-compositional approach in which the elements of the complex phenotype terms are individually linked to ontologies (e.g. Gene Ontology for processes, UBERON for structures). Harmonisation will explore learn-to-rank techniques to optimize concept selection from an array of tools such as cTAKES/MetaMap. Following this, typed relations between terms will be filtered using a range of machine learning classifiers on grammatical parse trees. Domain adaptation techniques will be tested for both term and relation extraction. Validation will follow standard protocols for text mining, including construction of a gold standard annotated corpus of 5000 sentences sampled from cited literature with a focus on rare human diseases. Phenotypic profiles will be built from strongly associated and related terms (e.g. phenotype-disease) by exploring a range of measures (e.g. Jaccard index, Information Content). Validation will use existing expert curated profiles in the OMIM and OrphaNet databases. Additionally, we will explore the cross-species harmonisation of the human phenotype terms to closely associated mouse phenotypes. This will take place through conceptual mappings of PheneBank terms to the HPO and MP. The semantic database will include phenotypes and associated links for querying, navigation and download in a variety of formats (OWL/RDF/JSON).
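The two association measures named above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: terms are represented by the sets of document IDs that mention them, information content is estimated from document frequency, and the term labels, document IDs and corpus size are invented toy values.

```python
import math


def jaccard(docs_a, docs_b):
    """Jaccard index of two sets of document IDs: |A & B| / |A | B|."""
    union = docs_a | docs_b
    if not union:
        return 0.0
    return len(docs_a & docs_b) / len(union)


def information_content(term_docs, total_docs):
    """Information content -log2 p(term), with p(term) estimated as the
    fraction of documents that mention the term (an assumption)."""
    if not term_docs or total_docs <= 0:
        raise ValueError("term must occur in a non-empty corpus")
    return -math.log2(len(term_docs) / total_docs)


# Hypothetical toy data: term -> IDs of documents mentioning it.
mentions = {
    "phenotype:growth delay": {1, 2, 3, 4},
    "disease:A": {2, 3, 4, 5},
    "disease:B": {6},
}
TOTAL_DOCS = 8  # hypothetical corpus size

phenotype_docs = mentions["phenotype:growth delay"]
score_a = jaccard(phenotype_docs, mentions["disease:A"])   # 3 shared of 5 total
score_b = jaccard(phenotype_docs, mentions["disease:B"])   # no shared documents
ic_common = information_content(phenotype_docs, TOTAL_DOCS)
ic_rare = information_content(mentions["disease:B"], TOTAL_DOCS)
```

In this toy corpus the rarer term receives the higher information content, which is why such a measure can be used to weight specific terms above generic ones when building a profile.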

Planned Impact

The PheneBank project aims to revolutionise how health experts leverage phenotypic evidence from the literature for their own clinical and scientific studies. This will be done using automatic text mining (TM) techniques that encode the scientific literature according to a machine understandable semantic representation. PheneBank goes beyond traditional TM by discovering structured associations across the literature as well as cross-species mappings between human and mouse phenotypes. This is highly relevant to a range of experts across life science domains. Semantic integration of phenotypes between the scientific literature and ontological coding systems brings us onto a path for full integration with electronic patient records.
Who will benefit from this research?
1. Life scientists and clinicians from a variety of disciplines will benefit from a novel database of evidence about phenotype associations. They will be able to more accurately, thoroughly and efficiently access phenotypic evidence from the scientific literature for use in their own workflows.
2. Bioinformaticians and database curators involved in knowledge discovery and data integration will benefit from data and tools that they can incorporate into their own workflows. This is highly relevant to knowledge integration efforts in initiatives such as ELIXIR.
3. Researchers and engineers in human language technologies, e-Science and information retrieval will benefit from new techniques, tools and data.
How will they benefit?
1. PheneBank will be of particular benefit to life scientists and clinicians investigating rare diseases through our partnership with Prof. Willem Ouwehand (NIHR Bioresource for Rare Diseases) and advisory group member Prof. Paul Lasko (McGill University and Chair of the International Rare Disease Research Consortium). In particular we expect that mined phenotype profiles will contribute to the understanding, diagnosis and treatment of rare diseases by bringing together previously hidden evidence from across the literature using term harmonisation techniques. Building on the project partner's network, we will engage directly with the rare disease community, providing a pathway to deploying the tools and data.
2. There will be multiple opportunities throughout the project to engage with bioinformaticians and database curators. To take just three examples: (1) Partnership with Dr. Peter Robinson, leader of the Human Phenotype Ontology, will allow data dissemination to one of the most widely used coding systems and engagement with his rare disease partner network. PheneBank outputs will supplement scarce human expertise using high throughput text processing, bridge traditional disciplines and be of a measurable quality. (2) Collaboration with Dr. Skarnes at the Wellcome Trust Sanger Institute (WTSI) will facilitate collaboration with UK groups involved in the informatics of rare diseases such as the Deciphering Developmental Disorders project. (3) Working with Dr. Jo McEntyre at Europe PMC provides an opportunity to make our annotations available to the scientific research and database curation community and to engage with curators and industrial groups (e.g. through the EMBL-EBI Industry Programme).
3. The PheneBank project pioneers new methods for Natural Language Processing (NLP) and Machine Learning (ML) on scientific literature. We propose to develop a novel combination of ML approaches on maximally rich NLP features in order to understand the meaning of disjoint scientific terms, harmonise them to clinical ontology standards and explore a range of association measures against human curation standards. Researchers and engineers will benefit from tools, novel data sets (e.g. CRAFT and the Europe PMC corpus) and techniques released through publication.
A stakeholder workshop will be organised in the second year of the project at a major biomedical informatics conference such as ISMB/ECCB to raise awareness of the project and invite feedback.
 
Description EPI-AI: Automated Understanding and Alerting of Disease Outbreaks from Global News Media
Amount £491,373 (GBP)
Funding ID ES/T012277/1 
Organisation Economic and Social Research Council 
Sector Public
Country United Kingdom
Start 02/2020 
End 01/2023
 
Description Health Data Research National Text Analytics project
Amount £6,000,000 (GBP)
Organisation Health Data Research UK 
Sector Private
Country United Kingdom
Start 01/2020 
End 02/2021
 
Title PheneBank named entity recognition software 
Description PheneBank aims at automatic extraction and validation of a database of human phenotype-disease associations in the scientific literature. This software package provides code, data, and models for the following two purposes: (1) named entity recognition of phenotypes and other entities of biological interest; (2) harmonisation of entities of interest to standard vocabularies including SNOMED CT and the Human Phenotype Ontology. 
Type Of Material Improvements to research infrastructure 
Year Produced 2018 
Provided To Others? Yes  
Impact Our team has developed, tested and used this software method to semantically annotate the whole of Medline and the Open Access collection of PubMed. 
URL https://github.com/pilehvar/phenebank
 
Title PheneBank database 
Description Free text scientific literature has the potential to be an incredibly valuable source of data for uncovering the often hidden relationships between genes, diseases and phenotypes. Phenotypic descriptions cover abnormalities in anatomical structures, processes and behaviours, for example 'growth delay' and 'body weight loss'. Such descriptions form the basis for determining the existence and treatment of a disease but, because of their inherent complexity, have previously received less attention from the text mining community. In recent years, significant effort has been spent by a small number of expert curators to create coding systems for phenotypes (called "ontologies"), such as the Human Phenotype Ontology (HP) and the Mammalian Phenotype Ontology (MP). The PheneBank project proposes to support and speed up curation using terms discovered directly from the literature and to automatically integrate them with such standard ontologies. The online database at phenebank.org presents our results from harnessing texts to extract statistically significant associations between mentions of phenotypes and diseases in the scientific literature. Three options are available: (1) Users can submit their own scientific texts and have them semantically annotated for entities and ontology identifiers; (2) Users can browse the database of 24 million Medline abstracts that have been semantically annotated in PheneBank; (3) Users can search for relations between diseases and phenotypes that have been discovered from the scientific literature. 
Type Of Material Database/Collection of data 
Year Produced 2018 
Provided To Others? Yes  
Impact The database will be publicized in 2019 through a journal publication that we are currently writing. 
URL http://phenebank.org/
 
Title Research data for entire Medline annotated automatically with phenotypes, diseases, chemicals and gene entities 
Description As an output of the PheneBank project, we release the set of 24 million MEDLINE abstracts annotated with 9 classes of entity: Phenotype, Disease, Anatomy, Cell, Cell_line, GPR, Gene_variant, Molecule, and Pathway. The entities have been mapped to five major ontologies: SNOMED, HPO, MeSH, PRO, and FMA. Note that the computational model was improved during the latter half of 2018 and re-run to produce an improved version (available from the same DOI). 
Type Of Material Database/Collection of data 
Year Produced 2018 
Provided To Others? Yes  
Impact The database is currently being used to mine rare disease/phenotype associations as part of the PheneBank project. We have made it freely available to further the research investigations of other groups. 
URL https://zenodo.org/record/1167696#.Wp0zb3zLiUk
 
Title Research data supporting "Vancouver Welcomes You! Minimalist Location Metonymy Resolution" 
Description Complete supporting/replication data and code for the ACL Publication. The paper was published in August 2017 at www.acl2017.org 
Type Of Material Database/Collection of data 
Year Produced 2017 
Provided To Others? Yes  
 
Title Research data supporting "What's missing in geographical parsing?" 
Description Full code and data required for replication and experimentation. 
Type Of Material Database/Collection of data 
Year Produced 2017 
Provided To Others? Yes  
 
Description Healtex: UK Healthcare Text Analytics Research Network 
Organisation University of Manchester
Department Health E-Research Centre
Country United Kingdom 
Sector Academic/University 
PI Contribution Healtex is an EPSRC-sponsored (EP/N027280/1) UK multi-disciplinary research network that aims to explore the barriers to effectively utilising healthcare narrative text data, road-map research efforts and principles for sharing text data and text analytics methods between academia, NHS and industry. It is funded as part of the EPSRC Healthcare Technologies Grand Challenges theme. I am co-leading a challenge stream on 'Data-driven text mining and NLP'.
Collaborator Contribution The HealTex network opens up dialogue between technologists in NLP/text mining and the potential user community in the NHS and industry. As such it is a valuable avenue to impact for the work taking place in the EPSRC SIPHS project. I am co-leading the HealTex network's 'data-driven text mining and NLP' challenge stream and aim to use this to promote dialogue and uptake around the SIPHS project themes.
Impact Invited talk at HealTex launch event
Start Year 2016
 
Description HealTex launch event 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Approximately 100 clinicians, technologists and members of the business community attended the opening launch event for the EPSRC UK HealTex network where I gave an invited talk entitled 'NLP capabilities and challenges in the health arena'. The talk resulted in follow up requests from colleagues for further information and participation (e.g. in social media coding for veterinary medical insights).
Year(s) Of Engagement Activity 2016
URL http://healtex.org/event/healtex-launch/
 
Description Invited talk at Big Data in Medicine, Cancer Research UK 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Professional Practitioners
Results and Impact Approximately 500 clinicians, life scientists and others attended my invited talk at the Big Data in Medicine Workshop held at Cancer Research UK Cambridge Institute. The title of the talk was "Undiscovered scientific knowledge from large unstructured collections in an era of Big Data". The talk prompted discussions afterwards and a contact from an industrial group seeking talks on collaboration.
Year(s) Of Engagement Activity 2015
URL http://www.bigdata.cam.ac.uk/events/events-archive/big-data-in-medicine-exemplars-and-opportunities-...
 
Description Invited talk at King's College London 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact An invited talk at the SGDP Research Centre, King's College London, entitled "Automated Coding of Biomedical Texts". The talk led to discussions concerning digital phenotyping and its role within ongoing projects nationally.
Year(s) Of Engagement Activity 2018
 
Description Invited talk at the European Bioinformatics Institute 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Professional Practitioners
Results and Impact Approximately 60 life scientists, database curators, bioinformaticians and software engineers attended my talk on "Natural language processing for semantic interoperability in unstructured big data".
Year(s) Of Engagement Activity 2015
 
Description Organised and attended the Phenotype Day workshop (ISMB 2016, Orlando) 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Approximately 50 life scientists, clinicians, bioinformaticians and computer scientists attended the Phenotype Day workshop at ISMB 2016. We discussed the representation, acquisition, discovery and interoperability of clinical phenotype data, including in new reporting media such as patient forums. In addition to a call for research papers, the workshop included invited talks from keynote speakers Wendy Chapman (University of Utah) and Zhiyong Lu (National Center for Biotechnology Information, US). A special issue of the workshop proceedings was released in the Journal of Biomedical Semantics.
Year(s) Of Engagement Activity 2016
URL https://jbiomedsem.biomedcentral.com/articles/10.1186/s13326-016-0108-7
 
Description Talk at the Alan Turing Institute 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Professional Practitioners
Results and Impact Ten researchers in the field of Natural Language Processing attended a talk I gave at the Alan Turing Institute in London entitled "Entity Linking using Heterogenous Health Text Data". The talk led to further discussions concerning the alignment of textual and knowledge graph spaces as well as ongoing discussions concerning future funding applications.
Year(s) Of Engagement Activity 2018
 
Description Talk to the Cambridge University Science Society 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Undergraduate students
Results and Impact I was invited by the Cambridge University Science Society to give a talk about my research on supporting health research using Natural Language Processing. The talk was attended by about 60 undergraduate students, postdocs and senior scientists. The talk sparked questions and discussions afterwards about how NLP could support integration of evidence in biomedical informatics.
Year(s) Of Engagement Activity 2020
URL http://talks.cam.ac.uk/talk/index/137884