SIPHS: Semantic interpretation of personal health messages for generating public health summaries

Lead Research Organisation: University of Cambridge

Department Name: English and Applied Linguistics

Abstract

Open online data such as microblogs and discussion board messages have the potential to be an incredibly valuable source of information about health in populations. Such data has been rapidly growing, is low cost, real-time and seems likely to cover a significant proportion of the demographic. To take two examples, PatientsLikeMe has enjoyed 10% growth and now has over 200,000 users covering over 1500 health conditions; the generic Twitter service is expanding at a rate of 30% annually with over 200 million active users. Going beyond simple keyword search and harnessing this data for public health represents both an opportunity and a challenge to natural language processing (NLP). This fellowship proposal is about helping health experts leverage social media for their own clinical and scientific studies through automatic techniques that encode messages according to a machine understandable semantic representation. There are three major challenges this project seeks to address: (1) knowledge brokering: to develop algorithms to identify and code the informal descriptions of conditions, treatments, medications, behaviours and attitudes to standard ontologies such as the UMLS; (2) knowledge management: to create a structured resource of patient vocabulary used in blog texts and link it to existing coding systems; and (3) adding insight to evidence: to work with domain experts to utilize the coded information to automatically generate meaningful summaries for follow up investigation. At the technological level the fellowship seeks to pioneer new methods for NLP and machine learning (ML). Social media remains a challenging area for NLP for a variety of reasons: short de-contextualised messages, high levels of ambiguity/out of vocabulary words, use of slang and an evolving vocabulary, as well as inherent bias towards sensational topics. 
The fellowship seeks to harness the progress made so far in NLP for social media analysis in the commercial domain and develop it further to provide meaningful public health evidence. One key aspect not previously addressed is the clinical coding of patient messages. Although knowledge brokering systems exist for clinical and scientific texts (e.g. the NLM's MetaMap), their performance on social media messages has been poor. The fellowship will utilise the rich availability of ontological resources in biomedicine together with ML on annotated message data to disambiguate informal language. Research will also aim to understand the communicative function of messages, for example whether the message reports direct experience or is related to news, humour or marketing. If these problems are successfully overcome, an important barrier to data integration with other types of clinical data will be removed. The advantage of providing health coding for social media reports is its potential for studying very large-scale cohorts and for real-time early alerting of aberrations. In the fellowship I will research the potential for multi-variate time series alerting from semantically coded features, working with domain experts to evaluate across a range of metrics (e.g. sensitivity, timeliness, false alerting rates). A variety of approaches will be explored to generate real-time risk summaries across social media sources. Two real-world applications have been chosen to take this forwards: early alerting for adverse drug reactions (ADRs) and infectious disease surveillance (IDS). Project outcomes will include fundamental technologies as well as open source algorithms, data sets and ontologies. An exciting aspect of this fellowship is inter-disciplinary collaboration across stakeholders at all levels: scientists, public health experts and industry. Finally, participation will be opened up to the international community through the release of open source data.
Colleagues working on social media technologies will be invited to participate in discussions with users at a new challenge evaluation workshop.
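
The evaluation metrics named above (sensitivity, timeliness, false alerting rates) can be illustrated with a minimal Python sketch. The definitions used here are one common convention and are an assumption, not necessarily those adopted in the fellowship; the function name `evaluate_alerts`, the `window` parameter and the toy day indices are all hypothetical.

```python
from typing import Dict, List, Optional


def evaluate_alerts(alert_days: List[int], outbreak_days: List[int],
                    n_days: int, window: int = 7) -> Dict[str, Optional[float]]:
    """Score a stream of alerts against known outbreak onset days.

    Assumed definitions (illustrative only):
      - an outbreak counts as detected if an alert fires on or within
        `window` days after its onset;
      - timeliness is the mean delay (in days) of the first such alert;
      - an alert is false if it falls inside no outbreak's window;
      - the false alert rate is false alerts per monitored day.
    """
    delays = []
    for onset in outbreak_days:
        hits = [a - onset for a in alert_days if 0 <= a - onset <= window]
        if hits:
            delays.append(min(hits))

    false_alerts = [
        a for a in alert_days
        if not any(0 <= a - onset <= window for onset in outbreak_days)
    ]

    return {
        "sensitivity": len(delays) / len(outbreak_days) if outbreak_days else None,
        "timeliness": sum(delays) / len(delays) if delays else None,
        "false_alert_rate": len(false_alerts) / n_days if n_days else None,
    }


# Toy example: two outbreaks (days 10 and 30), three alerts over 50 days.
scores = evaluate_alerts(alert_days=[3, 12, 40], outbreak_days=[10, 30], n_days=50)
print(scores)
```

In this toy run only the first outbreak is caught (two days after onset), and the alerts on days 3 and 40 count as false. Changing `window` shifts the trade-off between sensitivity and the false alert rate, which is the kind of comparison domain experts would be asked to judge.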

Planned Impact

The SIPHS project aims to revolutionise how health experts leverage personal health evidence for their own clinical and scientific studies through automatic techniques that encode social media messages according to a machine understandable semantic representation. SIPHS will deliver state of the art knowledge extraction solutions for evidence relating to human diseases. This is highly relevant to a range of experts across domains such as public health, pharmacology and molecular biology.

Who will benefit from this research?
1. Public health experts performing infectious disease surveillance (IDS), situation awareness and risk assessment functions will benefit from becoming more efficient and from having access to earlier warnings and greater coverage of health threats such as pandemic influenza and chemical, biological, radiological or nuclear (CBRN) terrorist attacks;
2. Researchers and engineers in human language technologies, e-Science and information retrieval will benefit from software tools and data sets that can reliably encode social media messages for clinically important concepts;
3. The pharmaceutical industry and those involved in biotechnology and drug discovery will benefit from having access to a new and extensive database of evidence about adverse drug reactions and potentially novel therapeutic properties for licensed drugs;
4. Life scientists and clinicians involved in translational studies will benefit from having a novel database of evidence about phenotype associations to drugs and human diseases that links to the existing scientific and clinical data infrastructure through networks. As noted in Section 2(b) I reiterate that SIPHS is highly relevant to initiatives such as ELIXIR which coordinates and links European biomedical resources;
5. The public will benefit from having improved technologies for early detection of health threats and improved understanding about those technologies through the PI's outreach activities, e.g. a public blog, participation in the Cambridge Science Festival, press releases and a Wikipedia page.

How will they benefit?
1. Building on Dr. Collier's existing global public health network, the PI will continue to work directly with public health experts at Public Health England, the CORDS network and the WHO to deploy the proposed technologies and database. The innovative techniques advocated in this proposal extend proven high-throughput techniques developed by the PI which successfully detected A(H1N1). The techniques supplement scarce human expertise, bring in evidence beyond national boundaries and cover segments of the population who may not interact with traditional sensor networks (e.g. patients who may not visit a GP). The novel techniques will be measured against existing human surveillance network standards;
2. The fellowship pioneers new methods for Natural Language Processing (NLP) and Machine Learning (ML) on social media. We propose to develop a novel combination of supervised and semi-supervised approaches on maximally rich NLP features in order to understand the context of personal health messages, ground layman's terms to clinical standards and provide timely alert summaries. Researchers and engineers will benefit from tools, data sets and techniques;
3. The technology in this proposal will help the pharmaceutical industry in the monitoring of patient reports for ADRs as required by EU and national regulations and to reveal novel therapeutics;
4. The database developed through the SIPHS project will generate high visibility in the life science and clinical communities. The integration of the different data resources and the automatic analysis of social media will lead to benefits for the research community and the general public. If the problem of message coding in personal health messages is successfully overcome, an important barrier to data integration - for example with data from clinical trials or electronic patient records - will be removed.

Funded Value:

£971,954

Funded Period:

Feb 15 - Feb 20

Funder:

EPSRC

Project Status:

Closed

Project Category:

Fellowship

Project Reference:

EP/M005089/1

Principal Investigator:

Nigel Collier

Research Subject:

Info. & commun. Technol. (35%)

Linguistics (50%)

Tools, technologies & methods (15%)

Research Topic:

Artificial Intelligence (20%)

Bioinformatics (15%)

Comput./Corpus Linguistics (50%)

Information & Knowledge Mgmt (15%)

Organisations

People	ORCID iD
Nigel Collier (Principal Investigator / Fellow)

Publications


Alvaro N (2015) Crowdsourcing Twitter annotations to identify first-hand experiences of prescription drug use. in Journal of biomedical informatics

Alvaro N (2017) TwiMed: Twitter and PubMed Comparable Corpus of Drugs, Diseases, Symptoms, and Their Relations. in JMIR public health and surveillance

Alvaro N (2017) TwiMed: Twitter and PubMed Comparable Corpus of Drugs, Diseases, Symptoms, and Their Relations.

Basaldella M (2020) COMETA: A Corpus for Medical Entity Linking in the Social Media

Basaldella M (2019) BioReddit: Word Embeddings for User-Generated Biomedical NLP

Basaldella M. (2019) BioReddit: Word embeddings for user-generated biomedical NLP in LOUHI@EMNLP 2019 - 10th International Workshop on Health Text Mining and Information Analysis, Proceedings

Camacho-Collados J (2017) SemEval-2017 Task 2: Multilingual and Cross-lingual Semantic Word Similarity

Can D C (2019) A richer-but-smarter shortest dependency path with attentive augmentation for relation extraction

Can D.-C. (2019) A richer-but-smarter shortest dependency path with attentive augmentation for relation extraction in NAACL HLT 2019 - 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the Conference

Collier N (2017) WSDM 2017 Workshop on Mining Online Health Reports

Conforti C (2018) Towards Automatic Fake News Detection: Cross-Level Stance Detection in News Articles

Gritta M (2020) A pragmatic guide to geoparsing evaluation: Toponyms, Named Entity Recognition and pragmatics. in Language resources and evaluation

Gritta M (2019) A pragmatic guide to geoparsing evaluation

Gritta M (2017) What's missing in geographical parsing?

Gritta M (2019) A Pragmatic Guide to Geoparsing Evaluation

Gritta M (2017) Vancouver Welcomes You! Minimalist Location Metonymy Resolution

Gritta M (2018) Which Melbourne? Augmenting Geocoding with Maps

Gritta M (2018) What's missing in geographical parsing? in Language resources and evaluation

Gritta, M. (2018) Which Melbourne? Augmenting Geocoding with Maps

Gritta, M. (2017) Vancouver Welcomes You! Minimalist Location Metonymy Resolution

Kartsaklis D (2018) Mapping Text to Knowledge Graph Entities using Multi-Sense LSTMs

Kartsaklis D. (2018) Mapping text to knowledge graph entities using multi-sense LSTMs in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018

Le H (2016) Sieve-based coreference resolution enhances semi-supervised learning model for chemical-induced disease relation extraction.

Le H (2016) Sieve-based coreference resolution enhances semi-supervised learning model for chemical-induced disease relation extraction in Database

Key Findings
Impact Summary
Policy Influence
Further Funding
Research Databases and Models
Research Tools and Methods
Collaboration
Software and Technical Products
Engagement Activities


Description	During the course of the Fellowship several research objectives were tackled: (a) To use a range of Natural Language Processing (NLP) methods to identify entities of clinical interest. This objective was explored in a number of published works: (i) with Dr Nut Limsopatham (Cambridge University 2015-2017) who developed a neural network model for identifying entities in Twitter messages; (ii) with collaborators Dr Nestor Alvaro and Prof. Yusuke Miyao (National Institute of Informatics, Japan) who in 2017 developed and made available a corpus of Twitter messages annotated with drugs, diseases and symptoms; and (iii) with Dr Marco Basaldella (Cambridge University 2018-present) who developed more powerful distributed semantic representations for entity recognition using Reddit data. (b) To explore a range of machine learning algorithms for linking entities in text to clinical standard vocabularies/ontologies. This goal was at the heart of the SIPHS study and again was explored in a number of works: (i) with Dr Nut Limsopatham we examined a number of baseline approaches to this task including using conventional supervised machine learning such as support vector machines as well as more technically advanced methods involving distributed semantic representations in combination with neural networks; (ii) a number of studies were published with Dr Milan Gritta (University of Cambridge 2016 to 2019) on the special case of identifying and linking geographic entities in free text to the GeoNames database; this is particularly important when trying to understand health events such as disease outbreaks that might be reported in the social media; (iii) with Dr Dimitri Kartsaklis (University of Cambridge 2017 to 2018), we published a state-of-the-art approach for identifying clinical entities in free text (Kartsaklis, Pilehvar and Collier 2018), again based on encoding both the free text and the clinical ontology in a distributed semantic representation and mapping 
between them. This new technique was also designed to handle the problem of words having multiple meanings depending on context. (c) To provide a human gold standard data set for evaluation and validation of (a) and (b). All publications and supporting data from the studies we report are publicly available in repositories such as the University of Cambridge Apollo, the European Commission's Zenodo or GitHub. A few of these are highlighted within each technical section. Of particular relevance to (a) are the TwiMed data set (Alvaro, Miyao and Collier 2017), the geographical entity data set (Gritta et al. 2017) and the TwADR-L data set (Limsopatham and Collier 2016). Additionally, in collaboration with Dr Taher Pilehvar, we produced a data set for the intrinsic evaluation of distributed semantic representations for rare words of clinical interest, published within the Cambridge Rare Word Dataset (Pilehvar, Kartsaklis, Prokhorov and Collier 2018). Software and data sets are all fully referenced from the SIPHS Project Web pages at www.siphs.org. (d) To use the automated techniques in (a) and (b) to support human expert construction of an openly available consumer health ontology that will provide coding for informal layman's clinical terms (e.g. symptoms, drugs, diseases) and links to standard clinical vocabularies/ontologies. This research was conducted in several stages: first, we obtained approval to license a selection of health messages from the patient forum provider HealthUnlocked, which were then hand annotated and used to train the entity and linking models from (a) and (b). These models were then applied to Reddit health forum data and used to suggest candidate consumer health vocabulary terms, which were then evaluated by experts. (e) To deploy and maintain an online system for adding insight to evidence through (i) the clinical encoding of personal health messages and (ii) a database of encoded personal health messages.
The final expert-filtered terms from (d) became the large lexical database of SIPHS Consumer Health Vocabulary terms, each one expressing a concept in the SNOMED CT nomenclature. The fully searchable database of 5000 terms and 30,000 concordances can be found at www.siphs.org. The SIPHS Web portal includes a working demonstration version of the entity recogniser from (a), which we are currently extending to integrate entity linking. Software for both entity recognition and linking is available as noted above. Besides technical publications, the outcomes of the project have been communicated to a variety of stakeholder communities: the public health community (e.g. World Health Organisation, Health Emergencies Programme, Geneva 2018), the pharmaceutical industry (e.g. International Society of Pharmacovigilance, 2017), the public (e.g. Festival of Ideas, University of Cambridge 2016), policy leaders (e.g. Centre for Science and Policy Leader's Meeting, 2015) and students (e.g. Alan Turing Institute 2018).
Exploitation Route	The outcomes of this funding will be taken forwards in a number of ways. These include (i) using the techniques discovered in this research to underpin a new Digital Disease Detection system called EPI-AI that will be used for disease alerting from the news media (ESRC Canada-UK AI Initiative); (ii) continuing to develop and extend the SIPHS Consumer Health Vocabulary database; (iii) encouraging research participation in consumer health vocabulary entity linking through the release of a new challenge data set based on (e); and (iv) using the data and software resources from SIPHS to encourage a new generation of research students to take up the challenges that our research has shown, e.g. in automated knowledge representation of ontologies and in geo-coding social media texts.
Sectors	Digital/Communication/Information Technologies (including Software) Healthcare Pharmaceuticals and Medical Biotechnology
URL	http://www.siphs.org


Description	The fellowship has catalyzed significant strides in public health surveillance by tapping into the wealth of data available through social media. Looking back from 2024, we outline here the project's academic impacts, global health contributions, and future directions. Foundational Academic Contributions and Recognition: Our research has led to significant advancements in natural language processing (NLP) and machine learning (ML) for health informatics, evidenced by publications that were cited both within and outside the AI community. For example, the papers "Self-alignment pretraining for biomedical entity representations" and "COMETA: A corpus for medical entity linking in social media" have garnered considerable attention, with 223 and 75 citations respectively. These works have not only contributed foundational knowledge to the field but have also facilitated further research in biomedical NLP, entity linking, and health information extraction across various applications. Moreover, the organization of the Mining Online Health Reports (MOHRS) workshop created a vital platform for interdisciplinary dialogue and collaboration. This forum attracted leading scholars and practitioners to discuss the latest advancements, ethical considerations, and practical applications of mining health-related information from online sources. Through keynote talks, paper presentations, and panel discussions, the workshop underscored the importance of integrating diverse perspectives for the advancement of public health surveillance technologies. Impact on Global Health Initiatives: The fellowship's reach extends into global health policy and practice, as demonstrated by Prof. Collier's involvement with the World Health Organization's Epidemic Intelligence from Open Sources (EIOS) initiative. This collaboration showcases the direct application of our research in enhancing global health surveillance and early detection systems. Prof. Collier's role as a technical expert and Co-PI on the ESRC-funded EPI-AI project further bridges the gap between academic research and real-world health crisis detection. These efforts highlight the SIPHS project's pivotal role in developing responsible AI technologies for the next generation of disease detection, in partnership with international health organizations. Driving Future Innovations: Building upon the SIPHS outputs, the EPI-AI project exemplifies the transition from foundational research to innovative applications. This initiative, rooted in the principles of responsible AI, sets a new precedent for the development of health surveillance tools that are ethical, effective, and globally applicable. Collaborations with entities like the WHO, Public Health England, and the Public Health Agency of Canada illustrate the project's significant influence on shaping the methodologies and technologies at the forefront of global disease surveillance.
First Year Of Impact	2015
Sector	Healthcare
Impact Types	Societal Policy & public services


Description	Participation in the Korea-UK Spring Health Forum, hosted by the British Embassy Seoul and South Korean Health Ministry
Geographic Reach	Multiple continents/international
Policy Influence Type	Implementation circular/rapid advice/letter to e.g. Ministry of Health


Description	Steering committee membership for the Patient Experience Data project (PI: Caroline Sanders, University of Manchester) NIHR Health Services and Delivery Research Programme
Geographic Reach	National
Policy Influence Type	Participation in a guidance/advisory committee


Description	EPI-AI: Automated Understanding and Alerting of Disease Outbreaks from Global News Media
Amount	£491,373 (GBP)
Funding ID	ES/T012277/1
Organisation	Economic and Social Research Council
Sector	Public
Country	United Kingdom
Start	02/2020
End	01/2023


Description	MRC Methodology Panel
Amount	£464,014 (GBP)
Funding ID	MR/M025160/1
Organisation	Medical Research Council (MRC)
Sector	Public
Country	United Kingdom
Start	12/2015
End	11/2018


Title	Software for Mapping Text to Knowledge Graph Entities using Multi-Sense LSTMs
Description	Code and resources for the EMNLP 2018 paper "Mapping Text to Knowledge Graph Entities using Multi-Sense LSTMs" [1] can be found at the following repository: https://bitbucket.org/dimkart/ms-lstm The model efficiently maps unrestricted text to knowledge graph entities using the following process: (1) The KB graph is extended with textual features weighted by their importance with respect to the entity nodes. (2) A synthetic "corpus" of biased random walks is created and used as input to the skipgram model. This generates an enhanced KB space to be used as the target for the text-to-entity mapping process. (3) The transformation from text to entities/concepts is achieved via a supervised multi-sense compositional model, which generates a point in the KB space for every input text. (4) The model is an LSTM equipped with an attentional mechanism that dynamically disambiguates the embeddings of the input words given the surrounding context. Reference: [1] D. Kartsaklis, M.T. Pilehvar, N. Collier (2018). Mapping Text to Knowledge Graph Entities using Multi-Sense LSTMs, in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), Brussels, Belgium.
Type Of Material	Improvements to research infrastructure
Year Produced	2018
Provided To Others?	Yes
Impact	The software tool addresses the problem of mapping natural language text to knowledge base entities. The mapping process is approached as a composition of a phrase or a sentence into a point in a multi-dimensional entity space obtained from a knowledge graph. The compositional model is an LSTM equipped with a dynamic disambiguation mechanism on the input word embeddings (a Multi-Sense LSTM), addressing polysemy issues. Further, the knowledge base space is prepared by collecting random walks from a graph enhanced with textual features, which act as a set of semantic bridges between text and knowledge base entities. These ideas have been demonstrated in our EMNLP 2018 paper available at https://www.repository.cam.ac.uk/handle/1810/287907.
URL	https://github.com/cambridgeltl/SIPHS/blob/master/Kartsaklis_etal_EMNLP_2018_code.md
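
Steps (1) and (2) of the process described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the authors' implementation (which is in the repository linked above). The toy graph, node names and edge weights are hypothetical, and biasing each step in proportion to edge weight is an assumption about what "biased" means here. Training the skipgram model on the resulting walks is not shown.

```python
import random
from typing import Dict, List, Tuple

# Toy knowledge graph extended with textual feature nodes (step 1).
# Node names and weights are hypothetical; each weight stands in for the
# importance of a textual feature with respect to an entity node.
Graph = Dict[str, List[Tuple[str, float]]]

GRAPH: Graph = {
    "entity:headache": [("entity:pain", 1.0), ("word:migraine", 0.8), ("word:head", 0.5)],
    "entity:pain": [("entity:headache", 1.0), ("word:ache", 0.7)],
    "word:migraine": [("entity:headache", 0.8)],
    "word:head": [("entity:headache", 0.5)],
    "word:ache": [("entity:pain", 0.7)],
}


def biased_walk(graph: Graph, start: str, length: int, rng: random.Random) -> List[str]:
    """Walk up to `length` nodes from `start`, choosing each next node
    with probability proportional to the edge weight."""
    walk = [start]
    while len(walk) < length:
        neighbours = graph.get(walk[-1], [])
        if not neighbours:
            break  # dead end: stop the walk early
        nodes, weights = zip(*neighbours)
        walk.append(rng.choices(nodes, weights=weights, k=1)[0])
    return walk


def build_corpus(graph: Graph, walks_per_node: int, length: int, seed: int = 0) -> List[List[str]]:
    """Create the synthetic 'corpus' (step 2): each walk is one sentence of node tokens."""
    rng = random.Random(seed)
    return [
        biased_walk(graph, node, length, rng)
        for node in graph
        for _ in range(walks_per_node)
    ]


corpus = build_corpus(GRAPH, walks_per_node=2, length=5)
print(len(corpus), "walks; first walk:", corpus[0])
```

Because entity nodes and word nodes appear in the same walks, a skipgram model trained on such a corpus would place them in one shared space, which is what lets text be mapped onto entities in step (3).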


Title	ACL 2016 Data
Description	Data and supplementary information for the paper entitled 'Normalising Medical Concepts in Social Media Texts by Learning Semantic Representation' published at ACL 2016: the 54th Annual Meeting of the Association for Computational Linguistics - August 7-12, 2016 - Berlin, Germany. The database contains a list of social media phrases and their encodings in SNOMED-CT.
Type Of Material	Database/Collection of data
Year Produced	2016
Provided To Others?	Yes
Impact	Results are published in the ACL 2016 paper cited in the above description. The impact is an improvement in performance for automatically encoding free text phrases with biomedical concepts using deep neural networks.
URL	https://zenodo.org/record/55013#.WH9TK302U50


Title	COMETA: A Corpus for Medical Entity Linking in the Social Media
Description	SIPHS Consumer Health Vocabulary (SIPHS-CHV) is a dataset of layman medical terminology. SIPHS-CHV has been collected by analysing four years of content in 68 health-themed subreddits and annotating the most frequent terms with their corresponding SNOMED-CT entities. Each term is assigned two annotations: a General SNOMED-CT identifier and a Specific one, denoting respectively the literal and contextual meaning of the term. COMETA is built over SIPHS-CHV, and provides four different biomedical Entity Linking scenarios for training and evaluation of machine learning algorithms, based on two different sampling strategies (stratified and zero-shot) and on the General and Specific annotations.
Type Of Material	Database/Collection of data
Year Produced	2020
Provided To Others?	Yes
Impact	The data set has been requested so far by over 14 teams working on biomedical named entity linking for use in their own experimental work.
URL	https://www.siphs.org/corpus
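
The difference between the two sampling strategies mentioned in the description can be sketched as follows. This is a hedged illustration of one reading of "stratified" and "zero-shot", not the procedure used to build COMETA. The records, the function names and the concept identifiers (`C1` to `C4`) are hypothetical placeholders, not real SNOMED-CT codes.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical records: (layman term, concept identifier).
Record = Tuple[str, str]
RECORDS: List[Record] = [
    ("tummy ache", "C1"), ("belly pain", "C1"), ("stomach hurts", "C1"),
    ("head pounding", "C2"), ("splitting headache", "C2"),
    ("can't sleep", "C3"), ("up all night", "C3"),
    ("runny nose", "C4"), ("snotty", "C4"),
]


def group_by_concept(records: List[Record]) -> Dict[str, List[Record]]:
    """Collect the terms annotated with each concept."""
    groups: Dict[str, List[Record]] = defaultdict(list)
    for record in records:
        groups[record[1]].append(record)
    return groups


def zero_shot_split(records: List[Record], test_fraction: float = 0.5,
                    seed: int = 0) -> Tuple[List[Record], List[Record]]:
    """Hold out whole concepts: no test concept is ever seen in training."""
    concepts = sorted(group_by_concept(records))
    random.Random(seed).shuffle(concepts)
    n_test = max(1, int(len(concepts) * test_fraction))
    test_concepts = set(concepts[:n_test])
    train = [r for r in records if r[1] not in test_concepts]
    test = [r for r in records if r[1] in test_concepts]
    return train, test


def stratified_split(records: List[Record],
                     seed: int = 0) -> Tuple[List[Record], List[Record]]:
    """Hold out one term per concept: every test concept also appears in training.

    Assumes each concept has at least two terms, as in the toy data above.
    """
    rng = random.Random(seed)
    train: List[Record] = []
    test: List[Record] = []
    for _, group in sorted(group_by_concept(records).items()):
        shuffled = group[:]
        rng.shuffle(shuffled)
        test.append(shuffled[0])
        train.extend(shuffled[1:])
    return train, test


zs_train, zs_test = zero_shot_split(RECORDS)
st_train, st_test = stratified_split(RECORDS)
print("zero-shot test concepts:", sorted({c for _, c in zs_test}))
print("stratified test concepts:", sorted({c for _, c in st_test}))
```

A zero-shot split tests whether a linker can generalise to concepts it has never been trained on, whereas a stratified split tests new surface forms of concepts it has already seen.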


Title	EMNLP 2015 Data
Description	Data and supplementary information for the paper entitled "Adapting Phrase-based Machine Translation to Normalise Medical Terms in Social Media Messages" to be published at EMNLP 2015: Conference on Empirical Methods in Natural Language Processing - September 17-21, 2015 - Lisboa, Portugal. The database contains a list of social media phrases and their encodings in SNOMED-CT.
Type Of Material	Database/Collection of data
Year Produced	2015
Provided To Others?	Yes
Impact	Since the data was just released there have not been any results external to the paper in which the results were reported.
URL	https://zenodo.org/record/27354


Title	Research data supporting "Vancouver Welcomes You! Minimalist Location Metonymy Resolution"
Description	Complete supporting/replication data and code for the ACL Publication. The paper was published in August 2017 at www.acl2017.org
Type Of Material	Database/Collection of data
Year Produced	2017
Provided To Others?	Yes


Title	Research data supporting "What's missing in geographical parsing?"
Description	Full code and data required for replication and experimentation.
Type Of Material	Database/Collection of data
Year Produced	2017
Provided To Others?	Yes


Title	Research data supporting "Which Melbourne? Augmenting Geocoding with Maps"
Description	Please unzip the files and read the README file for more instructions. Also visit my GitHub account for more information (milangritta)
Type Of Material	Database/Collection of data
Year Produced	2018
Provided To Others?	Yes
URL	https://www.repository.cam.ac.uk/handle/1810/277772


Description	Healtex: UK Healthcare Text Analytics Research Network
Organisation	University of Manchester
Department	Health E-Research Centre
Country	United Kingdom
Sector	Academic/University
PI Contribution	Healtex is an EPSRC-sponsored (EP/N027280/1) UK multi-disciplinary research network that aims to explore the barriers to effectively utilising healthcare narrative text data, road-map research efforts and principles for sharing text data and text analytics methods between academia, NHS and industry. It is funded as part of the EPSRC Healthcare Technologies Grand Challenges theme. I am co-leading a challenge stream on 'Data-driven text mining and NLP'.
Collaborator Contribution	The HealTex network opens up dialogue between technologists in NLP/text mining and the potential user community in the NHS and industry. As such it is a valuable avenue to impact for the work taking place in the EPSRC SIPHS project. I am co-leading the HealTex network's 'data-driven text mining and NLP' challenge stream and aim to use this to promote dialogue and uptake around the SIPHS project themes.
Impact	Invited talk at HealTex launch event
Start Year	2016


Title	Software supporting 'A Pragmatic Guide to Geoparsing Evaluation'
Description	Code and data for the NCRF++ model described in the paper. For more information, download the file to view the README files within.
Type Of Technology	Software
Year Produced	2019
URL	https://www.repository.cam.ac.uk/handle/1810/293888


Description	Cambridge Language Sciences Symposium
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	Regional
Primary Audience	Postgraduate students
Results and Impact	Approximately 250 linguists, engineers, scientists, students and members of the business community attended my invited talk at the Cambridge Language Sciences Annual Symposium on "Natural Language Processing and Online Health Reports (or OMG U Got Flu?)" A lively discussion followed along with requests from colleagues for further information.
Year(s) Of Engagement Activity	2016
URL	http://sms.cam.ac.uk/media/2393150


Description	Cambridge University Festival of Ideas
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	Regional
Primary Audience	Public/other audiences
Results and Impact	The Cambridge University Festival of Ideas is an annual outreach activity to showcase research being done in the University to the general public (aged 12+). This year I gave an invited talk on 'Rumours, Diseases and Drugs: Tackling Textual Data for Knowledge Discovery in Health' outlining the work I am doing in the SIPHS project. Additionally students from my lab provided demonstrations of technologies associated with the project. The response was overwhelmingly positive and follow up questionnaires show that the audience felt informed.
Year(s) Of Engagement Activity	2016
URL	http://www.festivalofideas.cam.ac.uk/events/language-detectives


Description	Cambridge University Linguistics Society
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	Local
Primary Audience	Undergraduate students
Results and Impact	I gave an invited talk to approximately 50 linguists who are members of the Cambridge University Linguistics Society on 'Rumours, Diseases and Drugs: Tackling Textual Data for Knowledge Discovery in Health'. There followed a lively series of questions about the merits of social media versus other forms of evidence and the linguistic issues involved in understanding this genre.
Year(s) Of Engagement Activity	2016
URL	http://camlingsoc.soc.srcf.net/events/event/rumours-diseases-and-drugs-tackling-textual-data-for-kno...


Description	HealTex launch event
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	National
Primary Audience	Professional Practitioners
Results and Impact	Approximately 100 clinicians, technologists and members of the business community attended the opening launch event for the EPSRC UK HealTex network where I gave an invited talk entitled 'NLP capabilities and challenges in the health arena'. The talk resulted in follow up requests from colleagues for further information and participation (e.g. in social media coding for veterinary medical insights).
Year(s) Of Engagement Activity	2016
URL	http://healtex.org/event/healtex-launch/


Description	Invited expert at the Epidemic Intelligence from Open Sources (EIOS) initiative, World Health Organization, Health Emergencies Programme, Geneva.
Form Of Engagement Activity	A formal working group, expert panel or dialogue
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Third sector organisations
Results and Impact	I was invited to join approximately 100 public health experts at the launch of the World Health Organization's three-day meeting in Geneva to advance efforts for the early detection, verification and assessment of health risks. The meeting saw presentations, facilitated discussions and collaborative planning for epidemic intelligence from open sources including social media and news media. The meeting had three specific objectives: (1) Understand the current landscape and trajectory for some of the currently available epidemic intelligence tools; (2) Document and prioritise requirements for enhancing the early detection, verification, assessment and communication of health risks; (3) Draft action plans for the collaborative development and implementation of solutions to prioritised requirements.
Year(s) Of Engagement Activity	2018


Description	Invited talk at Big Data in Medicine, Cancer Research UK
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	Regional
Primary Audience	Professional Practitioners
Results and Impact	Approximately 500 clinicians, life scientists and others attended my invited talk at the Big Data in Medicine Workshop held at Cancer Research UK Cambridge Institute. The title of the talk was "Undiscovered scientific knowledge from large unstructured collections in an era of Big Data". The talk prompted discussions afterwards and a contact from an industrial group seeking talks on collaboration.
Year(s) Of Engagement Activity	2015
URL	http://www.bigdata.cam.ac.uk/events/events-archive/big-data-in-medicine-exemplars-and-opportunities-...


Description	Invited talk at LOUHI 2016
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Postgraduate students
Results and Impact	25 international researchers in the field of language technology and health attended my invited talk at LOUHI 2016, which was co-located with EMNLP 2016 in Austin, Texas. The talk sparked questions about the technological difficulties of coding social media using deep learning, and also about the ethical considerations for re-use of social media data for health.
Year(s) Of Engagement Activity	2016
URL	https://louhi.limsi.fr/2016/


Description	Invited talk at the 2017 Korea-UK Spring Health Forum, Seoul National University Hospital
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	The 2017 Korea-UK Health Forum was organized by the Seoul National University Hospital (Prof. Kyong Soo Park) and the British Embassy in Seoul (Mr Gareth Davies) with support from the UK Department for Business Innovation and Skills, the Medical Research Council and the Korea Health Industry Development Institute. The meeting is part of a regular series of bilateral exchanges to promote collaboration between the medical research communities in the UK and Korea. As part of the workshop I gave a talk on 'Natural Language Processing for Mining Online Health Reports' which introduced the fundamental capabilities, techniques and challenges of NLP for tasks such as adverse drug reaction profiling, influenza surveillance and the study of psychological well being.
Year(s) Of Engagement Activity	2017


Description	Invited talk at the 27th Conference on Intelligent Systems for Molecular Biology (ISMB), Basel, Switzerland
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Industry/Business
Results and Impact	I was invited to a special session of ISMB (Text Mining for Biology and Healthcare) in 2019 to give a talk on NLP and social media titled "Pushing natural language processing and social media: towards automated understanding of layman's language". The session was organised by industrial scientists from the pharmaceutical industry and attended by approximately 70 scientists from industry and academia.
Year(s) Of Engagement Activity	2019
URL	https://www.iscb.org/ismbeccb2019-program/special-sessions#sst01


Description	Invited talk at the European Bioinformatics Institute
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	Local
Primary Audience	Professional Practitioners
Results and Impact	Approximately 60 life scientists, database curators, bioinformaticians and software engineers attended my talk on "Natural language processing for semantic interoperability in unstructured big data".
Year(s) Of Engagement Activity	2015


Description	Invited talk at the National Institute of Informatics in Tokyo
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	I gave an invited talk to staff and postgraduate students from the National Institute of Informatics in Tokyo on the topic of 'Natural Language Processing for Mining Online Health Reports'. The talk covered the capabilities, technologies and limitations of NLP for use in monitoring health in social media.
Year(s) Of Engagement Activity	2017


Description	Invited talk at the PublicHealth@Cambridge Network Showcase
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	Local
Primary Audience	Postgraduate students
Results and Impact	Approximately 120 public health professionals, researchers and students attended my invited talk at the Cambridge PublicHealth Showcase on "Knowledge support for protecting and improving health through text-data mining". A lively panel discussion followed along with requests from colleagues for further information.
Year(s) Of Engagement Activity	2015
URL	http://www.publichealth.cam.ac.uk/publichealthcambridge-2015-showcase/


Description	Invited talk at the University of Warwick
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	Local
Primary Audience	Undergraduate students
Results and Impact	Approximately 110 computer scientists attended my talk on "Exploiting NLP for Digital Disease Informatics". The talk prompted a lively discussion afterwards and students reported interest in developing their own related projects.
Year(s) Of Engagement Activity	2015
URL	http://www2.warwick.ac.uk/fac/sci/dcs/events/departmentseminars/past/


Description	Invited talk to the International Society of Pharmacovigilance (ISoP) Annual Meeting
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Industry/Business
Results and Impact	I gave an invited talk at a pre-conference ISoP course on Pharmacovigilance and social media organised by Simon Maskell, University of Liverpool, Danushka Bollegala, University of Liverpool and Phil Tregunno, MHRA. My talk aimed to provide the necessary knowledge for industry practitioners and regulators to understand the capabilities and limitations of natural language processing for social media monitoring in the domain of pharmacovigilance.
Year(s) Of Engagement Activity	2017
URL	http://isop2017liverpool.org/pre-conference-courses/


Description	Organised the Social Media Mining for Health Applications Workshop and Shared Task 2017
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	The workshop aimed to bring together experts from across disciplines to better understand and explore how knowledge contained in social media can be utilized for health-related tasks. Furthermore it aimed to (a) release annotated data to the biomedical informatics research community to develop data-driven systems; (b) enable the bench-marking and comparison of systems; and (c) enable those interested to work in this domain in the future to collaborate and discuss ideas.
Year(s) Of Engagement Activity	2017
URL	https://healthlanguageprocessing.org/sharedtask2/


Description	Organised and attended the BioMedical Linked Annotation Hackathon
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	The BioMedical Linked Annotation Hackathon (BLAH) aimed to bring together a community of practice around semantic annotation of open biomedical text data. Approximately 15 people attended the hackathon with another 30 people attending the workshop.
Year(s) Of Engagement Activity	2015
URL	http://1.linkedannotation.org/


Description	Organised and attended the Mining Online Health Reports workshop (MOHRS 2017)
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Approximately 40 international researchers from academia and industry as well as a representative from the Wellcome Trust attended the workshop held as part of the project's outreach activities. We discussed the state of the art in text mining technology, applications and the ethics around discovering health information from social media messages. The workshop sparked several lively debates on these issues, most notably the ethics. A report on the workshop will be published later this year along with a special issue of research papers in the Journal of Medical Internet Research. A working summary of the workshop is as follows. At MOHRS there was discussion and consensus on a number of points: (1) NLP/IR/ML technology has the potential to enhance health signal reporting and pull in novel data; (2) mining health reports on the theme of well-being and mental health is a growing area of research importance to our community; (3) using mined data for online intervention strategies is only now being proposed and explored, but it is early days and without appropriate consideration for online patient communities we can expect push-back; (4) we discussed the challenge of ethics for using online social media data and agreed that for some online health communities a 'social license' approach to match research goals with users' intent would be useful, and where this is not the case researchers should take time to understand online authors' motivations and expectations. More generally we agreed that as a community of practice it would be fruitful to explore the creation of working guidelines on the use of social media reports for health; (5) in terms of NLP technologies we agreed that whilst there is clear evidence of traditional (e.g. n-gram) modeling being effective, there is interest and scope for the increased exploration of new technologies such as deep learning, e.g. for automated coding of social media messages to formal ontologies.
One of our conclusions was that there is strong support for increased opportunities for the health, technology and ethics/legal communities to meet and hold discussions on health and social media.
Year(s) Of Engagement Activity	2017
URL	https://sites.google.com/site/mohrs2017/home


Description	Organised and attended the Phenotype Day workshop (ISMB 2015, Dublin)
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Approximately 50 life scientists, clinicians, bioinformaticians and computer scientists attended the Phenotype Day workshop at ISMB 2015. We discussed the representation, acquisition, discovery and interoperability of clinical phenotype data including in new reporting media such as patient forums.
Year(s) Of Engagement Activity	2015
URL	http://phenoday2015.bio-lark.org/


Description	Presented to Policy Fellows' annual forum (CSaP)
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	Yes
Geographic Reach	National
Primary Audience	Policymakers/politicians
Results and Impact	Talk sparked questions and discussions afterwards. My talk was probably the first chance the policy leaders had to hear about the use of digital media monitoring for public health and global disease surveillance. As such it informed about the use of 'big data' and 'data science' for these tasks and raised the profile of the technology among policy leaders in the UK government.
Year(s) Of Engagement Activity	2015
URL	http://www.csap.cam.ac.uk/news/article-using-computers-understand-language-diseases/


Description	Senior program committee member and attendee at the Social Media Mining for Health Applications (SMM4H) Workshop & Shared Task 2018 at EMNLP 2018
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	This workshop aims to provide a forum for the natural language processing community to present and discuss advances specific to social media use in the particularly challenging area of health applications, following on from the success of a session and accompanying workshop on the topic that were hosted at the Pacific Symposium on Biocomputing (PSB) in 2016 and the AMIA Annual Conference in 2017. The workshop seeks to attract researchers interested in automatic methods for the collection, extraction, representation, analysis, and validation of social media data for health informatics. It serves as a unique forum to discuss novel approaches to text and data mining methods that are applicable to social media data and may prove invaluable for health monitoring and surveillance.
Year(s) Of Engagement Activity	2018
URL	https://healthlanguageprocessing.org/smm4h18/


Description	Talk to the Cambridge University Science Society
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	Regional
Primary Audience	Undergraduate students
Results and Impact	I was invited by the Cambridge University Science Society to give a talk about my research on supporting health research using Natural Language Processing. The talk was attended by about 60 undergraduate students, postdocs and senior scientists. The talk sparked questions and discussions afterwards about how NLP could support integration of evidence in biomedical informatics.
Year(s) Of Engagement Activity	2020
URL	http://talks.cam.ac.uk/talk/index/137884
