SIPHS: Semantic interpretation of personal health messages for generating public health summaries

Lead Research Organisation: University of Cambridge
Department Name: English and Applied Linguistics

Abstract

Open online data such as microblogs and discussion board messages have the potential to be an incredibly valuable source of information about health in populations. Such data has been rapidly growing, is low cost, real-time and seems likely to cover a significant proportion of the demographic. To take two examples, PatientsLikeMe has enjoyed 10% growth and now has over 200,000 users covering over 1500 health conditions; the generic Twitter service is expanding at a rate of 30% annually with over 200 million active users. Going beyond simple keyword search and harnessing this data for public health represents both an opportunity and a challenge to natural language processing (NLP). This fellowship proposal is about helping health experts leverage social media for their own clinical and scientific studies through automatic techniques that encode messages according to a machine understandable semantic representation. There are three major challenges this project seeks to address: (1) knowledge brokering: to develop algorithms to identify and code the informal descriptions of conditions, treatments, medications, behaviours and attitudes to standard ontologies such as the UMLS; (2) knowledge management: to create a structured resource of patient vocabulary used in blog texts and link it to existing coding systems; and (3) adding insight to evidence: to work with domain experts to utilize the coded information to automatically generate meaningful summaries for follow up investigation. At the technological level the fellowship seeks to pioneer new methods for NLP and machine learning (ML). Social media remains a challenging area for NLP for a variety of reasons: short de-contextualised messages, high levels of ambiguity/out of vocabulary words, use of slang and an evolving vocabulary, as well as inherent bias towards sensational topics. 
The fellowship seeks to harness the progress made so far in NLP for social media analysis in the commercial domain and develop it further to provide meaningful public health evidence. One key aspect not previously addressed is the clinical coding of patient messages. Although knowledge brokering systems exist for clinical and scientific texts (e.g. the NLM's MetaMap), their performance on social media messages has been poor. The fellowship will utilise the rich availability of ontological resources in biomedicine together with ML on annotated message data to disambiguate informal language. Research will also aim to understand the communicative function of messages, for example whether the message reports direct experience or is related to news, humour or marketing. If these problems are successfully overcome an important barrier to data integration with other types of clinical data will be removed. The advantage of providing health coding for social media reports is its potential for studying very large-scale cohorts and also for real-time early alerting of aberrations. In the fellowship I will research the potential for multi-variate time series alerting from semantically coded features, working with domain experts to evaluate across a range of metrics (e.g. sensitivity, timeliness, false alerting rates). A variety of approaches will be explored to generate real-time risk summaries across social media sources. Two real-world applications have been chosen to take this forwards: early alerting for adverse drug reactions (ADRs) and infectious disease surveillance (IDS). Project outcomes will include fundamental technologies as well as open source algorithms, data sets and ontologies. An exciting aspect of this fellowship is inter-disciplinary collaboration across stakeholders at all levels: scientists, public health experts and industry. Finally, participation will be opened up to the international community through the release of open source data. 
Colleagues working on social media technologies will be invited to participate in discussions with users at a new challenge evaluation workshop.
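
The alerting metrics named in the abstract (sensitivity, timeliness, false alerting rates) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function name evaluate_alerts, the fixed detection window, the toy day indices, and the definition of the false alerting rate as the share of alerts that match no known event. It is not the project's evaluation protocol.

```python
def evaluate_alerts(event_days, alert_days, window):
    """Score alerts against known event onsets (illustrative definitions).

    Assumption: an alert detects an event if it is raised on the onset day
    or within `window` days after it.
    """
    alerts = sorted(alert_days)
    delays = []          # days between onset and first matching alert
    matched = set()      # alerts that fall inside some event window
    for onset in event_days:
        hits = [a for a in alerts if onset <= a <= onset + window]
        if hits:
            delays.append(hits[0] - onset)
            matched.update(hits)

    # Sensitivity: fraction of known events that triggered an alert.
    sensitivity = len(delays) / len(event_days) if event_days else 0.0
    # False alerting rate (assumed definition): fraction of unmatched alerts.
    unmatched = [a for a in alerts if a not in matched]
    false_alert_rate = len(unmatched) / len(alerts) if alerts else 0.0
    # Timeliness: mean delay in days over the detected events.
    mean_delay = sum(delays) / len(delays) if delays else None
    return sensitivity, false_alert_rate, mean_delay


# Hypothetical toy data: three event onsets, four alerts, 7-day window.
sensitivity, false_alert_rate, mean_delay = evaluate_alerts(
    event_days=[10, 40, 70], alert_days=[12, 25, 41, 43], window=7
)
```

In this toy run two of the three events are detected, one of the four alerts is unmatched, and the mean detection delay is 1.5 days. Other definitions (for example false alerts per unit time) would be equally reasonable.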

Planned Impact

The SIPHS project aims to revolutionise how health experts leverage personal health evidence for their own clinical and scientific studies through automatic techniques that encode social media messages according to a machine understandable semantic representation. SIPHS will deliver state of the art knowledge extraction solutions for evidence relating to human diseases. This is highly relevant to a range of experts across domains such as public health, pharmacology and molecular biology.

Who will benefit from this research?
1. Public health experts performing infectious disease surveillance (IDS), situation awareness and risk assessment functions will benefit from becoming more efficient and having access to earlier warnings and greater coverage about health threats such as pandemic influenza and chemical, biological, radiological or nuclear (CBRN) terrorist attacks;
2. Researchers and engineers in human language technologies, e-Science and information retrieval will benefit from software tools and data sets that can reliably encode social media messages for clinically important concepts;
3. The pharmaceutical industry and those involved in biotechnology and drug discovery will benefit from having access to a new and extensive database of evidence about adverse drug reactions and potentially novel therapeutic properties for licensed drugs;
4. Life scientists and clinicians involved in translational studies will benefit from having a novel database of evidence about phenotype associations to drugs and human diseases that links to the existing scientific and clinical data infrastructure through networks. As noted in Section 2(b) I reiterate that SIPHS is highly relevant to initiatives such as ELIXIR which coordinates and links European biomedical resources;
5. The public will benefit from having improved technologies for early detection of health threats and improved understanding about those technologies through the PI's outreach activities, e.g. a public blog, participation in the Cambridge Science Festival, press releases and a Wikipedia page.

How will they benefit?
1. Building on Dr. Collier's existing global public health network, the PI will continue to work directly with public health experts at Public Health England, the CORDS network and at the WHO to deploy the proposed technologies and database. The innovative techniques advocated in this proposal extend proven high throughput techniques developed by the PI which successfully detected A(H1N1). The techniques supplement scarce human expertise, bring in evidence beyond national boundaries and cover segments of the population who may not interact with traditional sensor networks (e.g. patients who may not visit a GP). The novel techniques will be measured against existing human surveillance network standards;
2. The fellowship pioneers new methods for Natural Language Processing (NLP) and Machine Learning (ML) on social media. We propose to develop a novel combination of supervised and semi-supervised approaches on maximally rich NLP features in order to understand the context of personal health messages, ground layman's terms to clinical standards and provide timely alert summaries. Researchers and engineers will benefit from tools, data sets and techniques;
3. The technology in this proposal will help the pharmaceutical industry in the monitoring of patient reports for ADRs as required by EU and national regulations and to reveal novel therapeutics;
4. The database developed through the SIPHS project will generate high visibility in the life science and clinical communities. The integration of the different data resources and the automatic analysis of social media will lead to benefits for the research community and the general public. If the problem of message coding in personal health messages is successfully overcome an important barrier to data integration - for example with data from clinical trials or electronic patient records - will be removed.
 
Description Participation in the Korea-UK Spring Health Forum, hosted by the British Embassy Seoul and South Korean Health Ministry
Geographic Reach Multiple continents/international 
Policy Influence Type Implementation circular/rapid advice/letter to e.g. Ministry of Health
 
Description Steering committee membership for the Patient Experience Data project (PI: Caroline Sanders, University of Manchester) NIHR Health Services and Delivery Research Programme
Geographic Reach National 
Policy Influence Type Participation in an advisory committee
 
Description MRC Methodology Panel
Amount £464,014 (GBP)
Funding ID MR/M025160/1 
Organisation Medical Research Council (MRC) 
Sector Academic/University
Country United Kingdom
Start 12/2015 
End 11/2018
 
Title Software for Mapping Text to Knowledge Graph Entities using Multi-Sense LSTMs 
Description Code and resources for the EMNLP 2018 paper "Mapping Text to Knowledge Graph Entities using Multi-Sense LSTMs" [1] can be found at the following repository: https://bitbucket.org/dimkart/ms-lstm The model efficiently maps unrestricted text to knowledge graph entities using the following process: (1) The KB graph is extended with textual features weighted by their importance with respect to the entity nodes. (2) A synthetic "corpus" of biased random walks is created and used as input to the skipgram model. This generates an enhanced KB space to be used as the target for the text-to-entity mapping process. (3) The transformation from text to entities/concepts is achieved via a supervised multi-sense compositional model, which generates a point in the KB space for every input text. (4) The model is an LSTM equipped with an attentional mechanism that dynamically disambiguates the embeddings of the input words given the surrounding context. Reference: [1] D. Kartsaklis, M.T. Pilehvar, N. Collier (2018). Mapping Text to Knowledge Graph Entities using Multi-Sense LSTMs, in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), Brussels, Belgium. 
Type Of Material Improvements to research infrastructure 
Year Produced 2018 
Provided To Others? Yes  
Impact The software tool addresses the problem of mapping natural language text to knowledge base entities. The mapping process is approached as a composition of a phrase or a sentence into a point in a multi-dimensional entity space obtained from a knowledge graph. The compositional model is an LSTM equipped with a dynamic disambiguation mechanism on the input word embeddings (a Multi-Sense LSTM), addressing polysemy issues. Further, the knowledge base space is prepared by collecting random walks from a graph enhanced with textual features, which act as a set of semantic bridges between text and knowledge base entities. These ideas have been demonstrated in our EMNLP 2018 paper available at https://www.repository.cam.ac.uk/handle/1810/287907. 
URL https://github.com/cambridgeltl/SIPHS/blob/master/Kartsaklis_etal_EMNLP_2018_code.md
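
Steps (1) and (2) of the process described in this entry can be sketched as follows. This is a minimal illustration and not the released code: the toy graph, the node names, the edge weights and the helper names biased_walk and build_corpus are all hypothetical, and the bias is assumed to be a simple weight-proportional choice of the next node. The resulting walks are the kind of synthetic "corpus" that would then be passed to a skipgram model.

```python
import random

# Hypothetical toy graph. Entity nodes ("C:...") are linked to each other
# and to textual-feature nodes ("w:..."); each edge carries an assumed
# importance weight.
GRAPH = {
    "C:headache": {"C:migraine": 1.0, "w:head": 2.0, "w:pain": 1.5},
    "C:migraine": {"C:headache": 1.0, "w:pain": 0.5, "w:aura": 2.0},
    "w:head": {"C:headache": 2.0},
    "w:pain": {"C:headache": 1.5, "C:migraine": 0.5},
    "w:aura": {"C:migraine": 2.0},
}


def biased_walk(graph, start, length, rng):
    """Walk `length` nodes, picking each next node in proportion to edge weight."""
    walk = [start]
    while len(walk) < length:
        neighbours = graph[walk[-1]]
        nodes = list(neighbours)
        weights = [neighbours[n] for n in nodes]
        walk.append(rng.choices(nodes, weights=weights, k=1)[0])
    return walk


def build_corpus(graph, walks_per_node, length, seed=0):
    """Collect several walks from every node; each walk acts as one 'sentence'."""
    rng = random.Random(seed)
    return [
        biased_walk(graph, node, length, rng)
        for node in graph
        for _ in range(walks_per_node)
    ]


corpus = build_corpus(GRAPH, walks_per_node=3, length=6)
```

Because entity and word nodes appear in the same walks, a skipgram model trained on such a corpus places them in one shared space, which is what lets text be mapped to entities.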
 
Title ACL 2016 Data 
Description Data and supplementary information for the paper entitled 'Normalising Medical Concepts in Social Media Texts by Learning Semantic Representation' published at ACL 2016: the 54th Annual Meeting of the Association for Computational Linguistics - August 7-12, 2016 - Berlin, Germany. The database contains a list of social media phrases and their encodings in SNOMED-CT. 
Type Of Material Database/Collection of data 
Year Produced 2016 
Provided To Others? Yes  
Impact Results are published in the ACL 2016 paper cited in the above description. The impact is an improvement in performance for automatically encoding free text phrases with biomedical concepts using deep neural networks. 
URL https://zenodo.org/record/55013#.WH9TK302U50
 
Title EMNLP 2015 Data 
Description Data and supplementary information for the paper entitled "Adapting Phrase-based Machine Translation to Normalise Medical Terms in Social Media Messages" to be published at EMNLP 2015: Conference on Empirical Methods in Natural Language Processing - September 17-21, 2015 - Lisboa, Portugal. The database contains a list of social media phrases and their encodings in SNOMED-CT. 
Type Of Material Database/Collection of data 
Year Produced 2015 
Provided To Others? Yes  
Impact Since the data was just released there have not been any results external to the paper in which the results were reported. 
URL https://zenodo.org/record/27354
 
Title Research data supporting "Vancouver Welcomes You! Minimalist Location Metonymy Resolution" 
Description Complete supporting/replication data and code for the ACL Publication. The paper was published in August 2017 at www.acl2017.org 
Type Of Material Database/Collection of data 
Year Produced 2017 
Provided To Others? Yes  
 
Title Research data supporting "What's missing in geographical parsing?" 
Description Full code and data required for replication and experimentation. 
Type Of Material Database/Collection of data 
Year Produced 2017 
Provided To Others? Yes  
 
Description Healtex: UK Healthcare Text Analytics Research Network 
Organisation University of Manchester
Department Health E-Research Centre
Country United Kingdom 
Sector Academic/University 
PI Contribution Healtex is an EPSRC-sponsored (EP/N027280/1) UK multi-disciplinary research network that aims to explore the barriers to effectively utilising healthcare narrative text data, road-map research efforts and principles for sharing text data and text analytics methods between academia, NHS and industry. It is funded as part of the EPSRC Healthcare Technologies Grand Challenges theme. I am co-leading a challenge stream on 'Data-driven text mining and NLP'.
Collaborator Contribution The HealTex network opens up dialogue between technologists in NLP/text mining and the potential user community in the NHS and industry. As such it is a valuable avenue to impact for the work taking place in the EPSRC SIPHS project. I am co-leading the HealTex network's 'data-driven text mining and NLP' challenge stream and aim to use this to promote dialogue and uptake around the SIPHS project themes.
Impact Invited talk at HealTex launch event
Start Year 2016
 
Description Cambridge Language Sciences Symposium 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Postgraduate students
Results and Impact Approximately 250 linguists, engineers, scientists, students and members of the business community attended my invited talk at the Cambridge Language Sciences Annual Symposium on "Natural Language Processing and Online Health Reports (or OMG U Got Flu?)" A lively discussion followed along with requests from colleagues for further information.
Year(s) Of Engagement Activity 2016
URL http://sms.cam.ac.uk/media/2393150
 
Description Cambridge University Festival of Ideas 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Public/other audiences
Results and Impact The Cambridge University Festival of Ideas is an annual outreach activity to showcase research being done in the University to the general public (aged 12+). This year I gave an invited talk on 'Rumours, Diseases and Drugs: Tackling Textual Data for Knowledge Discovery in Health' outlining the work I am doing in the SIPHS project. Additionally students from my lab provided demonstrations of technologies associated with the project. The response was overwhelmingly positive and follow up questionnaires show that the audience felt informed.
Year(s) Of Engagement Activity 2016
URL http://www.festivalofideas.cam.ac.uk/events/language-detectives
 
Description Cambridge University Linguistics Society 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Undergraduate students
Results and Impact I gave an invited talk to approximately 50 linguists who are members of the Cambridge University Linguistics Society on 'Rumours, Diseases and Drugs: Tackling Textual Data for Knowledge Discovery in Health'. There followed a lively series of questions about the merits of social media versus other forms of evidence and the linguistic issues involved in understanding this genre.
Year(s) Of Engagement Activity 2016
URL http://camlingsoc.soc.srcf.net/events/event/rumours-diseases-and-drugs-tackling-textual-data-for-kno...
 
Description HealTex launch event 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach National
Primary Audience Professional Practitioners
Results and Impact Approximately 100 clinicians, technologists and members of the business community attended the opening launch event for the EPSRC UK HealTex network where I gave an invited talk entitled 'NLP capabilities and challenges in the health arena'. The talk resulted in follow up requests from colleagues for further information and participation (e.g. in social media coding for veterinary medical insights).
Year(s) Of Engagement Activity 2016
URL http://healtex.org/event/healtex-launch/
 
Description Invited expert at the Epidemic Intelligence from Open Sources (EIOS) initiative, World Health Organization, Health Emergencies Programme, Geneva. 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Third sector organisations
Results and Impact I was invited to join along with approximately 100 public health experts attending the launch of the World Health Organisation's three day meeting in Geneva to advance efforts for the early detection, verification and assessment of health risks. The meeting saw presentations, facilitated discussions and collaborative planning for epidemic intelligence from open sources including social media and news media. The meeting had three specific objectives: (1) Understand the current landscape and trajectory for some of the currently available epidemic intelligence tools; (2) Document and prioritise requirements for enhancing the early detection, verification, assessment and communication of health risks; (3) Draft action plans for the collaborative development and implementation of solutions to prioritised requirements.
Year(s) Of Engagement Activity 2018
 
Description Invited talk at Big Data in Medicine, Cancer Research UK 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Professional Practitioners
Results and Impact Approximately 500 clinicians, life scientists and others attended my invited talk at the Big Data in Medicine Workshop held at Cancer Research UK Cambridge Institute. The title of the talk was "Undiscovered scientific knowledge from large unstructured collections in an era of Big Data". The talk prompted discussions afterwards and a contact from an industrial group seeking talks on collaboration.
Year(s) Of Engagement Activity 2015
URL http://www.bigdata.cam.ac.uk/events/events-archive/big-data-in-medicine-exemplars-and-opportunities-...
 
Description Invited talk at LOUHI 2016 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Postgraduate students
Results and Impact 25 international researchers in the field of language technology and health attended my invited talk at LOUHI 2016 which was collocated with EMNLP 2016 in Austin, Texas. The talk sparked questions about the technological difficulties of coding social media using deep learning, and also about the ethical considerations for re-use of social media data for health.
Year(s) Of Engagement Activity 2016
URL https://louhi.limsi.fr/2016/
 
Description Invited talk at the 2017 Korea-UK Spring Health Forum, Seoul National University Hospital 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact The 2017 Korea-UK Health Forum was organized by the Seoul National University Hospital (Prof. Kyong Soo Park) and the British Embassy in Seoul (Mr Gareth Davies) with support from the UK Department for Business Innovation and Skills, the Medical Research Council and the Korea Health Industry Development Institute. The meeting is part of a regular series of bilateral exchanges to promote collaboration between the medical research communities in the UK and Korea. As part of the workshop I gave a talk on 'Natural Language Processing for Mining Online Health Reports' which introduced the fundamental capabilities, techniques and challenges of NLP for tasks such as adverse drug reaction profiling, influenza surveillance and the study of psychological well being.
Year(s) Of Engagement Activity 2017
 
Description Invited talk at the European Bioinformatics Institute 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Professional Practitioners
Results and Impact Approximately 60 life scientists, database curators, bioinformaticians and software engineers attended my talk on "Natural language processing for semantic interoperability in unstructured big data".
Year(s) Of Engagement Activity 2015
 
Description Invited talk at the National Institute of Informatics in Tokyo 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact I gave an invited talk to staff and postgraduate students from the National Institute of Informatics in Tokyo on the topic of 'Natural Language Processing for Mining Online Health Reports'. The talk covered the capabilities, technologies and limitations of NLP for use in monitoring health in social media.
Year(s) Of Engagement Activity 2017
 
Description Invited talk at the PublicHealth@Cambridge Network Showcase 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Postgraduate students
Results and Impact Approximately 120 public health professionals, researchers and students attended my invited talk at the Cambridge PublicHealth Showcase on "Knowledge support for protecting and improving health through text-data mining". A lively panel discussion followed along with requests from colleagues for further information.
Year(s) Of Engagement Activity 2015
URL http://www.publichealth.cam.ac.uk/publichealthcambridge-2015-showcase/
 
Description Invited talk at the University of Warwick 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach Local
Primary Audience Undergraduate students
Results and Impact Approximately 110 computer scientists attended my talk on "Exploiting NLP for Digital Disease Informatics". The talk prompted a lively discussion afterwards and students reported interest in developing their own related projects.
Year(s) Of Engagement Activity 2015
URL http://www2.warwick.ac.uk/fac/sci/dcs/events/departmentseminars/past/
 
Description Invited talk to the International Society of Pharmacovigilance (ISoP) Annual Meeting 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Industry/Business
Results and Impact I gave an invited talk at a pre-conference ISoP course on Pharmacovigilance and social media organised by Simon Maskell, University of Liverpool, Danushka Bollegala, University of Liverpool and Phil Tregunno, MHRA. My talk aimed to provide the necessary knowledge for industry practitioners and regulators to understand the capabilities and limitations of natural language processing for social media monitoring in the domain of pharmacovigilance.
Year(s) Of Engagement Activity 2017
URL http://isop2017liverpool.org/pre-conference-courses/
 
Description Organised the Social Media Mining for Health Applications Workshop and Shared Task 2017 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact The workshop aimed to bring together experts from across disciplines to better understand and explore how knowledge contained in social media can be utilized for health-related tasks. Furthermore it aimed to (a) release annotated data to the biomedical informatics research community to develop data-driven systems; (b) enable the bench-marking and comparison of systems; and (c) enable those interested to work in this domain in the future to collaborate and discuss ideas.
Year(s) Of Engagement Activity 2017
URL https://healthlanguageprocessing.org/sharedtask2/
 
Description Organised and attended the BioMedical Linked Annotation Hackathon 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact The BioMedical Linked Annotation Hackathon (BLAH) aimed to bring together a community of practice around semantic annotation of open biomedical text data. Approximately 15 people attended the hackathon with another 30 people attending the workshop.
Year(s) Of Engagement Activity 2015
URL http://1.linkedannotation.org/
 
Description Organised and attended the Mining Online Health Reports workshop (MOHRS 2017) 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Approximately 40 international researchers from academia and industry as well as a representative from the Wellcome Trust attended the workshop held as part of the project's outreach activities. We discussed the state of the art in text mining technology, applications and the ethics around discovering health information from social media messages. The workshop sparked several lively debates on these issues, most notably the ethics. A report on the workshop will be published later this year along with a special issue of research papers in the Journal of Medical Internet Research. A working summary of the workshop is as follows: At MOHRS there was consensus and discussion on a number of points: (1) NLP/IR/ML technology has the potential to enhance health signal reporting and pull in novel data; (2) mining health reports on the theme of well-being and mental health is a growing area of research importance to our community; (3) using mined data for online intervention strategies is just now being proposed and explored but it is early days and without appropriate considerations for online patient communities we can expect push back; (4) we discussed the challenge of ethics for using online social media data and agreed that for some online health communities a 'social license' approach to match research goals with users' intent would be useful and where this is not the case time should be given by the researchers to understand online authors' motivations and expectations. More generally we agreed that as a community of practice it would be fruitful to explore the creation of working guidelines on the use of social media reports for health; (5) in terms of NLP technologies we agreed that whilst there is clear evidence of traditional (e.g. n-gram) modeling being effective there is interest and scope for the increased exploration of new technologies such as deep learning, e.g. for automated coding of social media messages to formal ontologies. 
One of our conclusions was that there is strong support for increased opportunities for the health, technology and ethics/legal communities to meet and hold discussions on health and social media.
Year(s) Of Engagement Activity 2017
URL https://sites.google.com/site/mohrs2017/home
 
Description Organised and attended the Phenotype Day workshop (ISMB 2015, Dublin) 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Approximately 50 life scientists, clinicians, bioinformaticians and computer scientists attended the Phenotype Day workshop at ISMB 2015. We discussed the representation, acquisition, discovery and interoperability of clinical phenotype data including in new reporting media such as patient forums.
Year(s) Of Engagement Activity 2015
URL http://phenoday2015.bio-lark.org/
 
Description Presented to Policy Fellow's annual forum (CSaP) 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? Yes
Geographic Reach National
Primary Audience Policymakers/politicians
Results and Impact Talk sparked questions and discussions afterwards.

My talk was probably the first chance the policy leaders had to hear about the use of digital media monitoring for public health and global disease surveillance. As such it informed about the use of 'big data' and 'data science' for these tasks and raised the profile of the technology among policy leaders in the UK government.
Year(s) Of Engagement Activity 2015
URL http://www.csap.cam.ac.uk/news/article-using-computers-understand-language-diseases/
 
Description Senior program committee member and attendee at the Social Media Mining for Health Applications (SMM4H) Workshop & Shared Task 2018 at EMNLP 2018 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact This workshop aims to provide a forum for the natural language processing community to present and discuss advances specific to social media use in the particularly challenging area of health applications, following on the success of a session and accompanying Workshop on the topic that was hosted at the Pacific Symposium in Biocomputing (PSB) in 2016 and the AMIA Annual Conference in 2017. The workshop seeks to attract researchers interested in automatic methods for the collection, extraction, representation, analysis, and validation of social media data for health informatics. It serves as a unique forum to discuss novel approaches to text and data mining methods that are applicable to social media data and may prove invaluable for health monitoring and surveillance.
Year(s) Of Engagement Activity 2018
URL https://healthlanguageprocessing.org/smm4h18/