SIPHS: Semantic interpretation of personal health messages for generating public health summaries
Lead Research Organisation:
University of Cambridge
Department Name: English and Applied Linguistics
Abstract
Open online data such as microblogs and discussion board messages have the potential to be an incredibly valuable source of information about health in populations. Such data has been rapidly growing, is low cost, real-time and seems likely to cover a significant proportion of the demographic. To take two examples, PatientsLikeMe has enjoyed 10% growth and now has over 200,000 users covering over 1500 health conditions; the generic Twitter service is expanding at a rate of 30% annually with over 200 million active users. Going beyond simple keyword search and harnessing this data for public health represents both an opportunity and a challenge to natural language processing (NLP). This fellowship proposal is about helping health experts leverage social media for their own clinical and scientific studies through automatic techniques that encode messages according to a machine understandable semantic representation. There are three major challenges this project seeks to address: (1) knowledge brokering: to develop algorithms to identify and code the informal descriptions of conditions, treatments, medications, behaviours and attitudes to standard ontologies such as the UMLS; (2) knowledge management: to create a structured resource of patient vocabulary used in blog texts and link it to existing coding systems; and (3) adding insight to evidence: to work with domain experts to utilize the coded information to automatically generate meaningful summaries for follow up investigation. At the technological level the fellowship seeks to pioneer new methods for NLP and machine learning (ML). Social media remains a challenging area for NLP for a variety of reasons: short de-contextualised messages, high levels of ambiguity/out of vocabulary words, use of slang and an evolving vocabulary, as well as inherent bias towards sensational topics. 
The fellowship seeks to harness the progress made so far in NLP for social media analysis in the commercial domain and develop it further to provide meaningful public health evidence. One key aspect not previously addressed is the clinical coding of patient messages. Although knowledge brokering systems exist for clinical and scientific texts (e.g. the NLM's MetaMap), their performance on social media messages has been poor. The fellowship will utilise the rich availability of ontological resources in biomedicine together with ML on annotated message data to disambiguate informal language. Research will also aim to understand the communicative function of messages, for example whether the message reports direct experience or is related to news, humour or marketing. If these problems are successfully overcome, an important barrier to data integration with other types of clinical data will be removed. The advantage of providing health coding for social media reports is its potential for studying very large-scale cohorts and for real-time early alerting of aberrations. In the fellowship I will research the potential for multivariate time series alerting from semantically coded features, working with domain experts to evaluate across a range of metrics (e.g. sensitivity, timeliness, false alerting rates). A variety of approaches will be explored to generate real-time risk summaries across social media sources. Two real-world applications have been chosen to take this forwards: early alerting for adverse drug reactions (ADRs) and infectious disease surveillance (IDS). Project outcomes will include fundamental technologies as well as open source algorithms, data sets and an ontology. An exciting aspect of this fellowship is inter-disciplinary collaboration across stakeholders at all levels: scientists, public health experts and industry. Finally, participation will be opened up to the international community through the release of open source data.
Colleagues working on social media technologies will be invited to participate in discussions with users at a new challenge evaluation workshop.
Planned Impact
The SIPHS project aims to revolutionise how health experts leverage personal health evidence for their own clinical and scientific studies through automatic techniques that encode social media messages according to a machine understandable semantic representation. SIPHS will deliver state of the art knowledge extraction solutions for evidence relating to human diseases. This is highly relevant to a range of experts across domains such as public health, pharmacology and molecular biology.
Who will benefit from this research?
1. Public health experts performing infectious disease surveillance (IDS), situation awareness and risk assessment functions will benefit from becoming more efficient and having access to earlier warnings and greater coverage about health threats such as pandemic influenza and chemical/biological/radiological/nuclear (CBRN) terrorist attacks;
2. Researchers and engineers in human language technologies, e-Science and information retrieval will benefit from software tools and data sets that can reliably encode social media messages for clinically important concepts;
3. The pharmaceutical industry and those involved in biotechnology and drug discovery will benefit from having access to a new and extensive database of evidence about adverse drug reactions and potentially novel therapeutic properties for licensed drugs;
4. Life scientists and clinicians involved in translational studies will benefit from having a novel database of evidence about phenotype associations to drugs and human diseases that links to the existing scientific and clinical data infrastructure through networks. As noted in Section 2(b) I reiterate that SIPHS is highly relevant to initiatives such as ELIXIR which coordinates and links European biomedical resources;
5. The public will benefit from having improved technologies for early detection of health threats and improved understanding about those technologies through the PI's outreach activities, e.g. a public blog, participation in the Cambridge Science Festival, press releases and a Wikipedia page.
How will they benefit?
1. Building on Dr. Collier's existing global public health network, the PI will continue to work directly with public health experts at Public Health England, the CORDS network and at the WHO to deploy the proposed technologies and database. The innovative techniques advocated in this proposal extend proven high throughput techniques developed by the PI which successfully detected A(H1N1). The techniques supplement scarce human expertise, bring in evidence beyond national boundaries and cover segments of the population who may not interact with traditional sensor networks (e.g. patients who may not visit a GP). The novel techniques will be measured against existing human surveillance network standards;
2. The fellowship pioneers new methods for Natural Language Processing (NLP) and Machine Learning (ML) on social media. We propose to develop a novel combination of supervised and semi-supervised approaches on maximally rich NLP features in order to understand the context of personal health messages, ground layman's terms to clinical standards and provide timely alert summaries. Researchers and engineers will benefit from tools, data sets and techniques;
3. The technology in this proposal will help the pharmaceutical industry in the monitoring of patient reports for ADRs as required by EU and national regulations and to reveal novel therapeutics;
4. The database developed through the SIPHS project will generate high visibility in the life science and clinical communities. The integration of the different data resources and the automatic analysis of social media will lead to benefits for the research community and the general public. If the problem of message coding in personal health messages is successfully overcome, an important barrier to data integration - for example with data from clinical trials or electronic patient records - will be removed.
Organisations
- University of Cambridge (Fellow, Lead Research Organisation)
- University of Manchester (Collaboration)
- Linguamatics (United Kingdom) (Project Partner)
- University of Zurich (Project Partner)
- European Bioinformatics Institute (Project Partner)
- University College London (Project Partner)
- University of California, San Diego (Project Partner)
- Public Health England (Project Partner)
- University of Utah (Project Partner)
- Connecting Orgs for Reg Disease Surv (Project Partner)
People |
ORCID iD |
Nigel Collier (Principal Investigator / Fellow) |
Publications
Alvaro N
(2015)
Crowdsourcing Twitter annotations to identify first-hand experiences of prescription drug use.
in Journal of biomedical informatics
Alvaro N
(2017)
TwiMed: Twitter and PubMed Comparable Corpus of Drugs, Diseases, Symptoms, and Their Relations.
in JMIR public health and surveillance
Basaldella M
(2020)
COMETA: A Corpus for Medical Entity Linking in the Social Media
Basaldella M.
(2019)
BioReddit: Word embeddings for user-generated biomedical NLP
in LOUHI@EMNLP 2019 - 10th International Workshop on Health Text Mining and Information Analysis, Proceedings
Camacho-Collados J
(2017)
SemEval-2017 Task 2: Multilingual and Cross-lingual Semantic Word Similarity
Can D.-C.
(2019)
A richer-but-smarter shortest dependency path with attentive augmentation for relation extraction
in NAACL HLT 2019 - 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the Conference
Collier N
(2017)
WSDM 2017 Workshop on Mining Online Health Reports
Gritta M
(2020)
A pragmatic guide to geoparsing evaluation: Toponyms, Named Entity Recognition and pragmatics.
in Language resources and evaluation
Gritta M
(2019)
A pragmatic guide to geoparsing evaluation
Gritta M
(2017)
What's missing in geographical parsing?
Gritta M
(2017)
Vancouver Welcomes You! Minimalist Location Metonymy Resolution
Gritta M
(2018)
Which Melbourne? Augmenting Geocoding with Maps
Gritta M
(2018)
What's missing in geographical parsing?
in Language resources and evaluation
Kartsaklis D.
(2018)
Mapping text to knowledge graph entities using multi-sense LSTMs
in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018
Description | During the course of the Fellowship several research objectives were tackled: (a) To use a range of Natural Language Processing (NLP) methods to identify entities of clinical interest. This objective was explored in a number of published works: (i) with Dr Nut Limsopatham (Cambridge University 2015-2017) who developed a neural network model for identifying entities in Twitter messages; (ii) with collaborators Dr Nestor Alvaro and Prof. Yusuke Miyao (National Institute of Informatics, Japan) who in 2017 developed and made available a corpus of Twitter messages annotated with drugs, diseases and symptoms; and (iii) with Dr Marco Basaldella (Cambridge University 2018-present) who developed more powerful distributed semantic representations for entity recognition using Reddit data. (b) To explore a range of machine learning algorithms for linking entities in text to clinical standard vocabularies/ontologies. This goal was at the heart of the SIPHS study and again was explored in a number of works: (i) with Dr Nut Limsopatham we examined a number of baseline approaches to this task including using conventional supervised machine learning such as support vector machines as well as more technically advanced methods involving distributed semantic representations in combination with neural networks; (ii) a number of studies were published with Dr Milan Gritta (University of Cambridge 2016 to 2019) on the special case of identifying and linking geographic entities in free text to the GeoNames database; this is particularly important when trying to understand health events such as disease outbreaks that might be reported in the social media; (iii) with Dr Dimitri Kartsaklis (University of Cambridge 2017 to 2018), we published a state-of-the-art approach for identifying clinical entities in free text (Kartsaklis, Pilehvar and Collier 2018), again based on encoding both the free text and the clinical ontology in a distributed semantic representation and mapping 
between them. This new technique was also designed to handle the problem of words having multiple meanings depending on context. (c) To provide a human gold standard data set for evaluation and validation of (a) and (b). All publications and supporting data from the studies we report are publicly available in repositories such as the University of Cambridge Apollo, or the European Commission's Zenodo or through GitHub. A few of these are highlighted within each technical section. Of particular relevance to (a) are the TwiMed data set (Alvaro, Miyao and Collier 2017), the geographical entity data set (Gritta et al. 2017) and the TwADR-L data set (Limsopatham and Collier 2016). Additionally, in collaboration with Dr Taher Pilehvar we produced a data set for the intrinsic evaluation of distributed semantic representations for rare words of clinical interest, published within the Cambridge Rare Word Dataset (Pilehvar, Kartsaklis, Prokhorov and Collier 2018). Software and data sets are all fully referenced from the SIPHS Project Web pages at www.siphs.org. (d) To use the automated techniques in (a) and (b) to support human expert construction of an openly available consumer health ontology that will provide coding for informal layman's clinical terms (e.g. symptoms, drugs, diseases) and links to standard clinical vocabularies/ontologies. This research was conducted in several stages: firstly we obtained approval to license a selection of health messages from patient forum provider HealthUnlocked, which were then hand annotated and used to train entity and linking models from (a) and (b). These models were then applied to Reddit health forum data and used to suggest candidate consumer health vocabulary terms, which were then evaluated by experts. (e) To deploy and maintain an online system for adding insight to evidence by (i) the clinical encoding of personal health messages and (ii) a database of encoded personal health messages.
The final expert-filtered terms from (d) became the large lexical database of SIPHS Consumer Health Vocabulary terms - each one expressing a concept in the SNOMED CT nomenclature. The fully searchable database of 5000 terms and 30,000 concordances can be found at www.siphs.org. The SIPHS Web portal includes a working demonstration version of the entity recogniser (a) which we are currently in the process of extending to integrate entity linking. Software for both entity recognition and linking is available as noted above. Besides technical publications, the outcomes of the project have been communicated to a variety of stakeholder communities, for example the Public Health Community (e.g. World Health Organisation, Health Emergencies programme, Geneva 2018), the Pharmaceutical Industry (e.g. International Society of Pharmacovigilance, 2017), the Public (e.g. Festival of Ideas, University of Cambridge 2016), Policy Leaders (e.g. Centre for Science and Policy Leader's Meeting, 2015) and Students (e.g. Alan Turing Institute 2018). |
Exploitation Route | The outcomes of this funding will be taken forwards in a number of ways. These include (i) using the techniques discovered in this research to underpin a new Digital Disease Detection system called EPI-AI that will be used for disease alerting from the news media (ESRC Canada-UK AI Initiative); (ii) continuing to develop and extend the SIPHS Consumer Health Vocabulary database; (iii) encouraging research participation in consumer health vocabulary entity linking through the release of a new challenge data set based on (e); and (iv) using the data and software resources from SIPHS to encourage a new generation of research students to take up the challenges that our research has shown, e.g. in automated knowledge representation of ontologies and in geo-coding social media texts. |
Sectors | Digital/Communication/Information Technologies (including Software); Healthcare; Pharmaceuticals and Medical Biotechnology |
URL | http://www.siphs.org |
Description | The fellowship has catalyzed significant strides in public health surveillance by tapping into the wealth of data available through social media. Looking back from 2024, we outline here the project's academic impacts, global health contributions, and future directions.
Foundational Academic Contributions and Recognition
Our research has led to significant advancements in natural language processing (NLP) and machine learning (ML) for health informatics, evidenced by publications that were cited both within and outside the AI community. For example, the papers "Self-alignment pretraining for biomedical entity representations" and "COMETA: A corpus for medical entity linking in social media" have garnered considerable attention, with 223 and 75 citations respectively. These works have not only contributed foundational knowledge to the field but have also facilitated further research in biomedical NLP, entity linking, and health information extraction across various applications. Moreover, the organization of the Mining Online Health Reports (MOHRS) workshop created a vital platform for interdisciplinary dialogue and collaboration. This forum attracted leading scholars and practitioners to discuss the latest advancements, ethical considerations, and practical applications of mining health-related information from online sources. Through keynote talks, paper presentations, and panel discussions, the workshop underscored the importance of integrating diverse perspectives for the advancement of public health surveillance technologies.
Impact on Global Health Initiatives
The fellowship's reach extends into global health policy and practice, as demonstrated by Prof. Collier's involvement with the World Health Organization's Epidemic Intelligence from Open Sources (EIOS) initiative. This collaboration showcases the direct application of our research in enhancing global health surveillance and early detection systems. Prof. Collier's role as a technical expert and Co-PI on the ESRC-funded EPI-AI project further bridges the gap between academic research and real-world health crisis detection. These efforts highlight the SIPHS project's pivotal role in developing responsible AI technologies for the next generation of disease detection, in partnership with international health organizations.
Driving Future Innovations
Building upon the SIPHS outputs, the EPI-AI project exemplifies the transition from foundational research to innovative applications. This initiative, rooted in the principles of responsible AI, sets a new precedent for the development of health surveillance tools that are ethical, effective, and globally applicable. Collaborations with entities like the WHO, Public Health England, and the Public Health Agency of Canada illustrate the project's significant influence on shaping the methodologies and technologies at the forefront of global disease surveillance. |
First Year Of Impact | 2015 |
Sector | Healthcare |
Impact Types | Societal; Policy & public services |
Description | Participation in the Korea-UK Spring Health Forum, hosted by the British Embassy Seoul and South Korean Health Ministry |
Geographic Reach | Multiple continents/international |
Policy Influence Type | Implementation circular/rapid advice/letter to e.g. Ministry of Health |
Description | Steering committee membership for the Patient Experience Data project (PI: Caroline Sanders, University of Manchester) NIHR Health Services and Delivery Research Programme |
Geographic Reach | National |
Policy Influence Type | Participation in a guidance/advisory committee |
Description | EPI-AI: Automated Understanding and Alerting of Disease Outbreaks from Global News Media |
Amount | £491,373 (GBP) |
Funding ID | ES/T012277/1 |
Organisation | Economic and Social Research Council |
Sector | Public |
Country | United Kingdom |
Start | 02/2020 |
End | 01/2023 |
Description | MRC Methodology Panel |
Amount | £464,014 (GBP) |
Funding ID | MR/M025160/1 |
Organisation | Medical Research Council (MRC) |
Sector | Public |
Country | United Kingdom |
Start | 12/2015 |
End | 11/2018 |
Title | Software for Mapping Text to Knowledge Graph Entities using Multi-Sense LSTMs |
Description | Code and resources for the EMNLP 2018 paper "Mapping Text to Knowledge Graph Entities using Multi-Sense LSTMs" [1] can be found at the following repository: https://bitbucket.org/dimkart/ms-lstm The model efficiently maps unrestricted text to knowledge graph entities using the following process: (1) The KB graph is extended with textual features weighted by their importance with respect to the entity nodes. (2) A synthetic "corpus" of biased random walks is created and used as input to the skipgram model. This generates an enhanced KB space to be used as the target for the text-to-entity mapping process. (3) The transformation from text to entities/concepts is achieved via a supervised multi-sense compositional model, which generates a point in the KB space for every input text. (4) The model is an LSTM equipped with an attentional mechanism that dynamically disambiguates the embeddings of the input words given the surrounding context. Reference: [1] D. Kartsaklis, M.T. Pilehvar, N. Collier (2018). Mapping Text to Knowledge Graph Entities using Multi-Sense LSTMs, in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), Brussels, Belgium. |
Type Of Material | Improvements to research infrastructure |
Year Produced | 2018 |
Provided To Others? | Yes |
Impact | The software tool addresses the problem of mapping natural language text to knowledge base entities. The mapping process is approached as a composition of a phrase or a sentence into a point in a multi-dimensional entity space obtained from a knowledge graph. The compositional model is an LSTM equipped with a dynamic disambiguation mechanism on the input word embeddings (a Multi-Sense LSTM), addressing polysemy issues. Further, the knowledge base space is prepared by collecting random walks from a graph enhanced with textual features, which act as a set of semantic bridges between text and knowledge base entities. These ideas have been demonstrated in our EMNLP 2018 paper available at https://www.repository.cam.ac.uk/handle/1810/287907. |
URL | https://github.com/cambridgeltl/SIPHS/blob/master/Kartsaklis_etal_EMNLP_2018_code.md |
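As an illustration only, steps (1) and (2) of the process described in this record can be sketched in a few lines of Python. This is not the released ms-lstm code: the function names, the toy concept identifiers (such as "C:headache") and the feature weights are all hypothetical, and the walk bias is assumed here to be a simple weighted choice over neighbouring nodes. The resulting walks would then be passed to a skip-gram implementation, which is omitted.

```python
import random


def build_graph(entity_edges, text_features):
    """Undirected weighted graph: entity-entity edges get weight 1.0,
    entity-word edges carry the given importance weight (assumed values)."""
    graph = {}

    def add_edge(u, v, weight):
        graph.setdefault(u, {})[v] = weight
        graph.setdefault(v, {})[u] = weight

    for u, v in entity_edges:
        add_edge(u, v, 1.0)
    for entity, words in text_features.items():
        for word, weight in words.items():
            add_edge(entity, word, weight)
    return graph


def random_walk(graph, start, length, rng):
    """One walk of `length` nodes; the next node is drawn in proportion
    to the weight of the edge leading to it."""
    walk = [start]
    while len(walk) < length:
        neighbours = graph[walk[-1]]
        nodes = list(neighbours)
        weights = [neighbours[n] for n in nodes]
        walk.append(rng.choices(nodes, weights=weights, k=1)[0])
    return walk


def make_corpus(graph, walks_per_node, length, seed=0):
    """Synthetic 'corpus': each walk plays the role of a sentence."""
    rng = random.Random(seed)
    return [
        random_walk(graph, node, length, rng)
        for node in sorted(graph)
        for _ in range(walks_per_node)
    ]


# Toy knowledge base with hypothetical concept identifiers and weights.
EDGES = [("C:headache", "C:pain"), ("C:migraine", "C:headache")]
TEXT_FEATURES = {
    "C:headache": {"head": 0.9, "ache": 0.8},
    "C:migraine": {"throbbing": 0.7, "head": 0.3},
}
GRAPH = build_graph(EDGES, TEXT_FEATURES)
CORPUS = make_corpus(GRAPH, walks_per_node=2, length=5)
```

Because word nodes and entity nodes appear in the same walks, a skip-gram model trained on such a corpus would place words and entities in one shared space, which is the property the text-to-entity mapping in step (3) relies on.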
Title | ACL 2016 Data |
Description | Data and supplementary information for the paper entitled 'Normalising Medical Concepts in Social Media Texts by Learning Semantic Representation' published at ACL 2016: the 54th Annual Meeting of the Association for Computational Linguistics - August 7-12, 2016 - Berlin, Germany. The database contains a list of social media phrases and their encodings in SNOMED-CT. |
Type Of Material | Database/Collection of data |
Year Produced | 2016 |
Provided To Others? | Yes |
Impact | Results are published in the ACL 2016 paper cited in the above description. The impact is an improvement in performance for automatically encoding free text phrases with biomedical concepts using deep neural networks. |
URL | https://zenodo.org/record/55013#.WH9TK302U50 |
Title | COMETA: A Corpus for Medical Entity Linking in the Social Media |
Description | SIPHS Consumer Health Vocabulary (SIPHS-CHV) is a dataset of layman medical terminology. SIPHS-CHV has been collected by analysing four years of content in 68 health-themed subreddits and annotating the most frequent terms with their corresponding SNOMED-CT entities. Each term is assigned two annotations: a General SNOMED-CT identifier and a Specific one, denoting respectively the literal and contextual meaning of the term. COMETA is built over SIPHS-CHV, and provides four different biomedical Entity Linking scenarios for training and evaluation of machine learning algorithms, based on two different sampling strategies (stratified and zero-shot) and on the General and Specific annotations of SIPHS-CHV. |
Type Of Material | Database/Collection of data |
Year Produced | 2020 |
Provided To Others? | Yes |
Impact | The data set has been requested so far by over 14 teams working on biomedical named entity linking for use in their own experimental work. |
URL | https://www.siphs.org/corpus |
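The two sampling strategies named in this record can be illustrated with a small sketch. It assumes, as a working definition not taken from this record, that a zero-shot split keeps every test concept out of the training data, while a stratified split ensures every test concept also appears in training. The function names, the term list and the concept identifiers ("C1" to "C4") are hypothetical and are not real SNOMED-CT codes or the COMETA release scripts.

```python
import random
from collections import defaultdict


def group_by_concept(pairs):
    """Map each concept identifier to its (term, concept) pairs."""
    groups = defaultdict(list)
    for term, concept in pairs:
        groups[concept].append((term, concept))
    return groups


def zero_shot_split(pairs, test_fraction=0.25, seed=0):
    """Whole concepts go to test, so no test concept is seen in training."""
    groups = group_by_concept(pairs)
    concepts = sorted(groups)
    random.Random(seed).shuffle(concepts)
    n_test = max(1, round(len(concepts) * test_fraction))
    test = [p for c in concepts[:n_test] for p in groups[c]]
    train = [p for c in concepts[n_test:] for p in groups[c]]
    return train, test


def stratified_split(pairs, seed=0):
    """Each concept with at least two terms gives one term to test and
    keeps the rest in training; single-term concepts stay in training."""
    rng = random.Random(seed)
    train, test = [], []
    for _, items in sorted(group_by_concept(pairs).items()):
        items = list(items)
        rng.shuffle(items)
        if len(items) > 1:
            test.append(items.pop())
        train.extend(items)
    return train, test


# Toy layman terms paired with hypothetical concept identifiers.
PAIRS = [
    ("tummy ache", "C1"), ("belly pain", "C1"), ("stomach hurts", "C1"),
    ("throwing up", "C2"), ("puking", "C2"),
    ("pins and needles", "C3"), ("tingly feeling", "C3"),
    ("runny nose", "C4"),
]
ZS_TRAIN, ZS_TEST = zero_shot_split(PAIRS)
ST_TRAIN, ST_TEST = stratified_split(PAIRS)
```

Under these assumptions the zero-shot setting tests whether a linker generalises to unseen concepts, whereas the stratified setting tests unseen surface forms of concepts already encountered in training.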
Title | EMNLP 2015 Data |
Description | Data and supplementary information for the paper entitled "Adapting Phrase-based Machine Translation to Normalise Medical Terms in Social Media Messages" to be published at EMNLP 2015: Conference on Empirical Methods in Natural Language Processing - September 17-21, 2015 - Lisboa, Portugal. The database contains a list of social media phrases and their encodings in SNOMED-CT. |
Type Of Material | Database/Collection of data |
Year Produced | 2015 |
Provided To Others? | Yes |
Impact | Since the data was just released there have not been any results external to the paper in which the results were reported. |
URL | https://zenodo.org/record/27354 |
Title | Research data supporting "Vancouver Welcomes You! Minimalist Location Metonymy Resolution" |
Description | Complete supporting/replication data and code for the ACL Publication. The paper was published in August 2017 at www.acl2017.org |
Type Of Material | Database/Collection of data |
Year Produced | 2017 |
Provided To Others? | Yes |
Title | Research data supporting "What's missing in geographical parsing?" |
Description | Full code and data required for replication and experimentation. |
Type Of Material | Database/Collection of data |
Year Produced | 2017 |
Provided To Others? | Yes |
Title | Research data supporting "Which Melbourne? Augmenting Geocoding with Maps" |
Description | Please unzip the files and read the README file for more instructions. Also visit my GitHub account for more information (milangritta) |
Type Of Material | Database/Collection of data |
Year Produced | 2018 |
Provided To Others? | Yes |
URL | https://www.repository.cam.ac.uk/handle/1810/277772 |
Description | Healtex: UK Healthcare Text Analytics Research Network |
Organisation | University of Manchester |
Department | Health E-Research Centre |
Country | United Kingdom |
Sector | Academic/University |
PI Contribution | Healtex is an EPSRC-sponsored (EP/N027280/1) UK multi-disciplinary research network that aims to explore the barriers to effectively utilising healthcare narrative text data, road-map research efforts and principles for sharing text data and text analytics methods between academia, NHS and industry. It is funded as part of the EPSRC Healthcare Technologies Grand Challenges theme. I am co-leading a challenge stream on 'Data-driven text mining and NLP'. |
Collaborator Contribution | The HealTex network opens up dialogue between technologists in NLP/text mining and the potential user community in the NHS and industry. As such it is a valuable avenue to impact for the work taking place in the EPSRC SIPHS project. I am co-leading the HealTex network's 'data-driven text mining and NLP' challenge stream and aim to use this to promote dialogue and uptake around the SIPHS project themes. |
Impact | Invited talk at HealTex launch event |
Start Year | 2016 |
Title | Software supporting 'A Pragmatic Guide to Geoparsing Evaluation' |
Description | Code and data for the NCRF++ model described in the paper. For more information, download the file to view the README files within. |
Type Of Technology | Software |
Year Produced | 2019 |
URL | https://www.repository.cam.ac.uk/handle/1810/293888 |
Description | Cambridge Language Sciences Symposium |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | Regional |
Primary Audience | Postgraduate students |
Results and Impact | Approximately 250 linguists, engineers, scientists, students and members of the business community attended my invited talk at the Cambridge Language Sciences Annual Symposium on "Natural Language Processing and Online Health Reports (or OMG U Got Flu?)". A lively discussion followed along with requests from colleagues for further information. |
Year(s) Of Engagement Activity | 2016 |
URL | http://sms.cam.ac.uk/media/2393150 |
Description | Cambridge University Festival of Ideas |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | Regional |
Primary Audience | Public/other audiences |
Results and Impact | The Cambridge University Festival of Ideas is an annual outreach activity to showcase research being done in the University to the general public (aged 12+). This year I gave an invited talk on 'Rumours, Diseases and Drugs: Tackling Textual Data for Knowledge Discovery in Health' outlining the work I am doing in the SIPHS project. Additionally, students from my lab provided demonstrations of technologies associated with the project. The response was overwhelmingly positive and follow-up questionnaires show that the audience felt informed. |
Year(s) Of Engagement Activity | 2016 |
URL | http://www.festivalofideas.cam.ac.uk/events/language-detectives |
Description | Cambridge University Linguistics Society |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | Local |
Primary Audience | Undergraduate students |
Results and Impact | I gave an invited talk to approximately 50 linguists who are members of the Cambridge University Linguistics Society on 'Rumours, Diseases and Drugs: Tackling Textual Data for Knowledge Discovery in Health'. There followed a lively series of questions about the merits of social media versus other forms of evidence and the linguistic issues involved in understanding this genre. |
Year(s) Of Engagement Activity | 2016 |
URL | http://camlingsoc.soc.srcf.net/events/event/rumours-diseases-and-drugs-tackling-textual-data-for-kno... |
Description | HealTex launch event |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | National |
Primary Audience | Professional Practitioners |
Results and Impact | Approximately 100 clinicians, technologists and members of the business community attended the launch event for the EPSRC UK HealTex network where I gave an invited talk entitled 'NLP capabilities and challenges in the health arena'. The talk resulted in follow-up requests from colleagues for further information and participation (e.g. in social media coding for veterinary medical insights). |
Year(s) Of Engagement Activity | 2016 |
URL | http://healtex.org/event/healtex-launch/ |
Description | Invited expert at the Epidemic Intelligence from Open Sources (EIOS) initiative, World Health Organization, Health Emergencies Programme, Geneva. |
Form Of Engagement Activity | A formal working group, expert panel or dialogue |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Third sector organisations |
Results and Impact | I was invited to join approximately 100 public health experts at the launch of the World Health Organization's three-day meeting in Geneva to advance efforts for the early detection, verification and assessment of health risks. The meeting saw presentations, facilitated discussions and collaborative planning for epidemic intelligence from open sources including social media and news media. The meeting had three specific objectives: (1) Understand the current landscape and trajectory for some of the currently available epidemic intelligence tools; (2) Document and prioritise requirements for enhancing the early detection, verification, assessment and communication of health risks; (3) Draft action plans for the collaborative development and implementation of solutions to prioritised requirements. |
Year(s) Of Engagement Activity | 2018 |
Description | Invited talk at Big Data in Medicine, Cancer Research UK |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | Regional |
Primary Audience | Professional Practitioners |
Results and Impact | Approximately 500 clinicians, life scientists and others attended my invited talk at the Big Data in Medicine Workshop held at Cancer Research UK Cambridge Institute. The title of the talk was "Undiscovered scientific knowledge from large unstructured collections in an era of Big Data". The talk prompted discussions afterwards and a contact from an industrial group seeking talks on collaboration. |
Year(s) Of Engagement Activity | 2015 |
URL | http://www.bigdata.cam.ac.uk/events/events-archive/big-data-in-medicine-exemplars-and-opportunities-... |
Description | Invited talk at LOUHI 2016 |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Postgraduate students |
Results and Impact | 25 international researchers in the field of language technology and health attended my invited talk at LOUHI 2016, which was collocated with EMNLP 2016 in Austin, Texas. The talk sparked questions about the technological difficulties of coding social media using deep learning, and also about the ethical considerations for re-use of social media data for health. |
Year(s) Of Engagement Activity | 2016 |
URL | https://louhi.limsi.fr/2016/ |
Description | Invited talk at the 2017 Korea-UK Spring Health Forum, Seoul National University Hospital |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | The 2017 Korea-UK Health Forum was organized by the Seoul National University Hospital (Prof. Kyong Soo Park) and the British Embassy in Seoul (Mr Gareth Davies) with support from the UK Department for Business, Innovation and Skills, the Medical Research Council and the Korea Health Industry Development Institute. The meeting is part of a regular series of bilateral exchanges to promote collaboration between the medical research communities in the UK and Korea. As part of the workshop I gave a talk on 'Natural Language Processing for Mining Online Health Reports' which introduced the fundamental capabilities, techniques and challenges of NLP for tasks such as adverse drug reaction profiling, influenza surveillance and the study of psychological well-being. |
Year(s) Of Engagement Activity | 2017 |
Description | Invited talk at the 27th Conference on Intelligent Systems for Molecular Biology (ISMB), Basel, Switzerland |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Industry/Business |
Results and Impact | I was invited to a special session of ISMB (Text Mining for Biology and Healthcare) in 2019 to give a talk on the subject of NLP and social media titled "Pushing natural language processing and social media: towards automated understanding of layman's language". The session was organised by industrial scientists from the pharmaceutical industry and attended by approximately 70 scientists from industry and academia. |
Year(s) Of Engagement Activity | 2019 |
URL | https://www.iscb.org/ismbeccb2019-program/special-sessions#sst01 |
Description | Invited talk at the European Bioinformatics Institute |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | Local |
Primary Audience | Professional Practitioners |
Results and Impact | Approximately 60 life scientists, database curators, bioinformaticians and software engineers attended my talk on "Natural language processing for semantic interoperability in unstructured big data". |
Year(s) Of Engagement Activity | 2015 |
Description | Invited talk at the National Institute of Informatics in Tokyo |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | I gave an invited talk to staff and postgraduate students from the National Institute of Informatics in Tokyo on the topic of 'Natural Language Processing for Mining Online Health Reports'. The talk covered the capabilities, technologies and limitations of NLP for use in monitoring health in social media. |
Year(s) Of Engagement Activity | 2017 |
Description | Invited talk at the PublicHealth@Cambridge Network Showcase |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | Local |
Primary Audience | Postgraduate students |
Results and Impact | Approximately 120 public health professionals, researchers and students attended my invited talk at the Cambridge PublicHealth Showcase on "Knowledge support for protecting and improving health through text-data mining". A lively panel discussion followed along with requests from colleagues for further information. |
Year(s) Of Engagement Activity | 2015 |
URL | http://www.publichealth.cam.ac.uk/publichealthcambridge-2015-showcase/ |
Description | Invited talk at the University of Warwick |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | Local |
Primary Audience | Undergraduate students |
Results and Impact | Approximately 110 computer scientists attended my talk on "Exploiting NLP for Digital Disease Informatics". The talk prompted a lively discussion afterwards and students reported interest in developing their own related projects. |
Year(s) Of Engagement Activity | 2015 |
URL | http://www2.warwick.ac.uk/fac/sci/dcs/events/departmentseminars/past/ |
Description | Invited talk to the International Society of Pharmacovigilance (ISoP) Annual Meeting |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Industry/Business |
Results and Impact | I gave an invited talk at a pre-conference ISoP course on pharmacovigilance and social media organised by Simon Maskell, University of Liverpool, Danushka Bollegala, University of Liverpool and Phil Tregunno, MHRA. My talk aimed to provide the necessary knowledge for industry practitioners and regulators to understand the capabilities and limitations of natural language processing for social media monitoring in the domain of pharmacovigilance. |
Year(s) Of Engagement Activity | 2017 |
URL | http://isop2017liverpool.org/pre-conference-courses/ |
Description | Organised the Social Media Mining for Health Applications Workshop and Shared Task 2017 |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | The workshop aimed to bring together experts from across disciplines to better understand and explore how knowledge contained in social media can be utilized for health-related tasks. Furthermore it aimed to (a) release annotated data to the biomedical informatics research community to develop data-driven systems; (b) enable the benchmarking and comparison of systems; and (c) enable those interested in working in this domain in the future to collaborate and discuss ideas. |
Year(s) Of Engagement Activity | 2017 |
URL | https://healthlanguageprocessing.org/sharedtask2/ |
Description | Organised and attended the BioMedical Linked Annotation Hackathon |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | The BioMedical Linked Annotation Hackathon (BLAH) aimed to bring together a community of practice around semantic annotation of open biomedical text data. Approximately 15 people attended the hackathon with another 30 people attending the workshop. |
Year(s) Of Engagement Activity | 2015 |
URL | http://1.linkedannotation.org/ |
Description | Organised and attended the Mining Online Health Reports workshop (MOHRS 2017) |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Approximately 40 international researchers from academia and industry as well as a representative from the Wellcome Trust attended the workshop held as part of the project's outreach activities. We discussed the state of the art in text mining technology, applications and the ethics around discovering health information from social media messages. The workshop sparked several lively debates on these issues, most notably the ethics. A report on the workshop will be published later this year along with a special issue of research papers in the Journal of Medical Internet Research. A working summary of the workshop is as follows: At MOHRS there was consensus and discussion on a number of points: (1) NLP/IR/ML technology has the potential to enhance health signal reporting and pull in novel data; (2) mining health reports on the theme of well-being and mental health is a growing area of research importance to our community; (3) using mined data for online intervention strategies is only now being proposed and explored, but it is early days and without appropriate consideration for online patient communities we can expect pushback; (4) we discussed the challenge of ethics for using online social media data and agreed that for some online health communities a 'social license' approach to match research goals with users' intent would be useful and where this is not the case time should be given by the researchers to understand online authors' motivations and expectations. More generally we agreed that as a community of practice it would be fruitful to explore the creation of working guidelines on the use of social media reports for health; (5) in terms of NLP technologies we agreed that whilst there is clear evidence of traditional (e.g. n-gram) modeling being effective there is interest and scope for the increased exploration of new technologies such as deep learning, e.g. for automated coding of social media messages to formal ontologies. 
One of our conclusions was that there is strong support for increased opportunities for the health, technology and ethics/legal communities to meet and hold discussions on health and social media. |
Year(s) Of Engagement Activity | 2017 |
URL | https://sites.google.com/site/mohrs2017/home |
Description | Organised and attended the Phenotype Day workshop (ISMB 2015, Dublin) |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Approximately 50 life scientists, clinicians, bioinformaticians and computer scientists attended the Phenotype Day workshop at ISMB 2015. We discussed the representation, acquisition, discovery and interoperability of clinical phenotype data including in new reporting media such as patient forums. |
Year(s) Of Engagement Activity | 2015 |
URL | http://phenoday2015.bio-lark.org/ |
Description | Presented to the Policy Fellows' annual forum (CSaP) |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | Yes |
Geographic Reach | National |
Primary Audience | Policymakers/politicians |
Results and Impact | The talk sparked questions and discussions afterwards. My talk was probably the first chance the policy leaders had to hear about the use of digital media monitoring for public health and global disease surveillance. As such it informed them about the use of 'big data' and 'data science' for these tasks and raised the profile of the technology among policy leaders in the UK government. |
Year(s) Of Engagement Activity | 2015 |
URL | http://www.csap.cam.ac.uk/news/article-using-computers-understand-language-diseases/ |
Description | Senior program committee member and attendee at the Social Media Mining for Health Applications (SMM4H) Workshop & Shared Task 2018 at EMNLP 2018 |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | This workshop aims to provide a forum for the natural language processing community to present and discuss advances specific to social media use in the particularly challenging area of health applications, following on the success of a session and accompanying workshop on the topic that was hosted at the Pacific Symposium on Biocomputing (PSB) in 2016 and the AMIA Annual Conference in 2017. The workshop seeks to attract researchers interested in automatic methods for the collection, extraction, representation, analysis, and validation of social media data for health informatics. It serves as a unique forum to discuss novel approaches to text and data mining methods that are applicable to social media data and may prove invaluable for health monitoring and surveillance. |
Year(s) Of Engagement Activity | 2018 |
URL | https://healthlanguageprocessing.org/smm4h18/ |
Description | Talk to the Cambridge University Science Society |
Form Of Engagement Activity | A talk or presentation |
Part Of Official Scheme? | No |
Geographic Reach | Regional |
Primary Audience | Undergraduate students |
Results and Impact | I was invited by the Cambridge University Science Society to give a talk about my research on supporting health research using Natural Language Processing. The talk was attended by about 60 undergraduate students, postdocs and senior scientists. The talk sparked questions and discussions afterwards about how NLP could support integration of evidence in biomedical informatics. |
Year(s) Of Engagement Activity | 2020 |
URL | http://talks.cam.ac.uk/talk/index/137884 |