EPI-AI: Automated Understanding and Alerting of Disease Outbreaks from Global News Media

Lead Research Organisation: University of Cambridge
Department Name: Modern & Medieval Languages

Abstract

Disease outbreaks, such as Zika, Ebola and SARS epidemics, are of the greatest importance to the international community and the UK/Canadian governments. Public health organisations need data as early as possible in an outbreak to respond rapidly and prevent human suffering. Traditional bio-surveillance relies on human laboratory networks, but these data are often unavailable in real-time, patchy in geographic coverage, and tuned to specific diseases. Digital disease surveillance (DDS) using Web-based news data overcomes some of these limitations, providing a critical supplement to traditional networks. However, current DDS systems rely to a large extent on manual screening of Web data for events of interest: a skilled and labour-intensive process given the volume, multilingualism, velocity and potential bias of news sources.
Research has shown that there is significant potential to automate DDS. Natural Language Processing (NLP) has been in use since the early 2000s to efficiently detect and track health threats from outbreak news reports. For example, the Canadian GPHIN system, which detected the first evidence of SARS, uses a combination of NLP and human experts to sift through over 20K online news reports each day in nine languages. However, traditional automated approaches are insensitive to context that can help experts to interpret risk factors and fail to take account of possible data biases.
Our goal in the EPI-AI project is to achieve a step-change in real-time automated DDS. Previous work has tended to take a siloed approach, focusing on Natural Language Processing methods or spatial analysis with little attention to the equality issues that arise from biases in the data. We will use an interdisciplinary approach, combining expertise from three disciplines - computer science, epidemiology, and bioethics - to develop novel machine learning and statistical models adapted to the complex data and objectives of global epidemic surveillance.
Benefits that we see include: (i) improved geographic precision and coverage; (ii) improved ability to understand the topical focus of a report; (iii) automated normalisation of risk factors to a standard terminology for integration of evidence across systems; (iv) automated spatio-temporal analysis of reports to update global risk maps and trigger alerts; and (v) provision of contextual information on potential media bias to support interpretation of alerts.
This fundamentally interdisciplinary research will be closely aligned with key Canadian, UK and global public health stakeholders.

Planned Impact

The development of novel neural and statistical machine learning (ML) methods for the extraction of structured event data from news media, the understanding of bias in news data, and the integration of event data with baseline data for the purpose of timely risk assessment will have an important scientific and technological impact. This investigation is highly relevant to several domains such as public health, computer science, life science and medicine.
1. Public health experts performing infectious disease alerting, situation awareness and risk assessment will benefit from greater efficiency and from access to earlier warnings and greater coverage of health threats, e.g. pandemic influenza, Zika, Ebola and Marburg. In addition to using modern ML techniques to dramatically improve Natural Language Processing (NLP), the proposed technology will also integrate these methods into an advanced spatial analysis framework to support public health analysts in early alerting, tracking and risk assessment. The techniques supplement scarce human expertise (by replacing manual search and de-duplication), bring in evidence beyond national boundaries, cover segments of the population who may not interact with traditional clinical surveillance networks (e.g. patients who may not visit a GP), and incorporate media bias. With respect to risk assessment, the project will provide valuable data to calibrate the parameters of disease transmission models which are often hampered by insufficient data.
2. Life scientists and clinicians involved in translational studies will benefit from having a novel database of epidemiological evidence about infectious diseases and their associations to risk factors such as symptoms, locations and population descriptors with links to existing scientific data infrastructure through standard ontological codes. The database will include region-specific information about potential sources of media bias that influence reporting.
3. Industry involved in AI technologies and e-science will benefit from open source software tools and publications explaining state-of-the-art ML techniques for text mining and risk alerting that takes account of bias reduction. AI has huge potential in healthcare as a means to support patients and clinicians in making decisions and to reduce administrative costs. The techniques pioneered in this proposal for bias reduction, NLP/Machine Translation and risk analysis are highly relevant to a wide community involved in the development and use of AI in clinical practice, e.g. technology companies involved in R&D on electronic patient records, the pharmaceutical industry looking for online evidence to repurpose drugs, and online patient support networks.
4. Decision-makers and the public will benefit from having improved AI technologies for early detection of health threats and improved understanding of their benefits and limitations. The main benefit is the potential for the research results to make the world safer from the threat of emerging and re-emerging epidemics by strengthening our global capacity to detect and control such threats rapidly, before they cause extensive human suffering. Public understanding of the project will be aided through the openly available EPI-AI portal, conference publications and demonstrations.

Publications

 
Description Independent review of the Global Public Health Intelligence Network (GPHIN)
Geographic Reach National 
Policy Influence Type Contribution to a national consultation/review
Impact No hard outcomes yet, but from a process perspective, the GPHIN system is currently being upgraded and we are contributing to that effort.
URL https://www.canada.ca/en/public-health/corporate/mandate/about-agency/external-advisory-bodies/list/...
 
Title MedLAMA knowledge probing benchmark based on UMLS 
Description Knowledge probing is crucial for understanding the knowledge transfer mechanism behind the pre-trained language models (PLMs). Despite the growing progress of probing knowledge for PLMs in the general domain, specialised areas such as the biomedical domain are vastly under-explored. To facilitate this, we release a well-curated biomedical knowledge probing benchmark, MedLAMA, constructed based on the Unified Medical Language System (UMLS) Metathesaurus. We test a wide spectrum of state-of-the-art PLMs and probing approaches on our benchmark, reaching at most 3% of acc@10. While highlighting various sources of domain-specific challenges that amount to this underwhelming performance, we illustrate that the underlying PLMs have a higher potential for probing tasks. To achieve this, we propose Contrastive-Probe, a novel self-supervised contrastive probing approach, that adjusts the underlying PLMs without using any probing data. While Contrastive-Probe pushes the acc@10 to 28%, the performance gap still remains notable. Our human expert evaluation suggests that the probing performance of our Contrastive-Probe is still under-estimated as UMLS still does not include the full spectrum of factual knowledge. We hope MedLAMA and Contrastive-Probe facilitate further developments of more suited probing techniques for this domain. 
Type Of Material Improvements to research infrastructure 
Year Produced 2021 
Provided To Others? Yes  
Impact A new benchmark data set for understanding how pre-trained language models understand biomedical concepts and their relationships to each other. The benchmark is helping us to understand the types of biomedical relationships, e.g. between diseases and symptoms, that can be handled by the latest state of the art language models. 
URL https://github.com/cambridgeltl/medlama
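 
Note on the acc@10 metric: the MedLAMA description above reports results as acc@10. The following is a minimal sketch of how such a top-k accuracy is commonly computed, assuming a query counts as correct when any of its top k ranked predictions appears in the set of accepted answers. The function name `acc_at_k` and the toy queries are illustrative assumptions and are not taken from the MedLAMA repository.

```python
from typing import Sequence, Set


def acc_at_k(
    ranked_predictions: Sequence[Sequence[str]],
    gold_answers: Sequence[Set[str]],
    k: int = 10,
) -> float:
    """Fraction of queries with at least one gold answer in the top k predictions.

    Assumed definition for illustration; check the benchmark's own
    evaluation script for the exact scoring rules.
    """
    if len(ranked_predictions) != len(gold_answers):
        raise ValueError("Each query needs exactly one set of gold answers.")
    if not ranked_predictions:
        return 0.0
    hits = sum(
        1
        for preds, gold in zip(ranked_predictions, gold_answers)
        if any(p in gold for p in preds[:k])
    )
    return hits / len(ranked_predictions)


# Toy data (hypothetical): three probing queries with ranked model outputs.
toy_predictions = [
    ["fever", "cough"],
    ["rash", "nausea"],
    ["headache"],
]
toy_gold = [{"cough"}, {"vomiting"}, {"headache"}]

top1 = acc_at_k(toy_predictions, toy_gold, k=1)
top2 = acc_at_k(toy_predictions, toy_gold, k=2)
```

In this toy example only the third query is correct at k=1, while the first query also becomes correct at k=2, which shows why acc@10 is a more lenient measure than acc@1.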
 
Title SapBERT: Self-alignment pretraining for BERT 
Description This repo holds code, data, and pretrained weights for (1) the SapBERT model presented in our NAACL 2021 paper: Self-Alignment Pretraining for Biomedical Entity Representations; (2) the cross-lingual SapBERT and a cross-lingual biomedical entity linking benchmark (XL-BEL) proposed in our ACL 2021 paper: Learning Domain-Specialised Representations for Cross-Lingual Biomedical Entity Linking. Despite the widespread success of self-supervised learning via masked language models (MLM), accurately capturing fine-grained semantic relationships in the biomedical domain remains a challenge. This is of paramount importance for entity-level tasks such as entity linking where the ability to model entity relations (especially synonymy) is pivotal. To address this challenge, we propose SAPBERT, a pretraining scheme that self-aligns the representation space of biomedical entities. We design a scalable metric learning framework that can leverage UMLS, a massive collection of biomedical ontologies with 4M+ concepts. In contrast with previous pipeline-based hybrid systems, SAPBERT offers an elegant one-model-for-all solution to the problem of medical entity linking (MEL), achieving a new state-of-the-art (SOTA) on six MEL benchmarking datasets. 
Type Of Material Improvements to research infrastructure 
Year Produced 2021 
Provided To Others? Yes  
Impact In early experiments SapBERT has enabled us to raise performance in biomedical named entity linking to a level consistent with supervised learning, without the need for large amounts of labelled training data. This is a great advance and potentially reduces the need for expensive human annotation in this task. 
URL https://github.com/cambridgeltl/sapbert
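 
Note on entity linking with aligned representations: once entity names are embedded in a space where synonyms lie close together, as the SapBERT description above aims for, a mention can be linked by nearest-neighbour search over concept embeddings. The sketch below illustrates only that retrieval step, under loud assumptions: the three-dimensional vectors are hand-made stand-ins for model embeddings, the concept identifiers are placeholders rather than real UMLS codes, and `link_mention` is a hypothetical helper, not part of the SapBERT code base.

```python
import math
from typing import Dict, List, Tuple


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def link_mention(
    mention_vec: List[float], concept_vecs: Dict[str, List[float]]
) -> Tuple[str, float]:
    """Return the concept ID whose embedding is most similar to the mention."""
    return max(
        ((cid, cosine(mention_vec, vec)) for cid, vec in concept_vecs.items()),
        key=lambda pair: pair[1],
    )


# Hypothetical toy embeddings; a real system would obtain these from an encoder.
concept_vecs = {
    "CONCEPT_FEVER": [0.9, 0.1, 0.0],
    "CONCEPT_RASH": [0.0, 0.2, 0.9],
}
mention_vec = [0.8, 0.2, 0.1]  # stand-in for the embedding of a mention such as "high temperature"

predicted_id, score = link_mention(mention_vec, concept_vecs)
```

With these toy vectors the mention is linked to the placeholder concept `CONCEPT_FEVER`. At the scale of a full ontology, an approximate nearest-neighbour index would normally replace the exhaustive loop.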
 
Title BAND: Biomedical Alert News Dataset 
Description Infectious disease outbreaks continue to pose a significant threat to human health and well-being. To improve disease surveillance and understanding of disease spread, several surveillance systems have been developed to monitor daily news alerts and social media. However, existing systems lack thorough epidemiological analysis in relation to corresponding alerts or news, largely due to the scarcity of well-annotated reports data. To address this gap, we introduce the Biomedical Alert News Dataset (BAND), which includes 1,508 samples from existing reported news articles, open emails, and alerts, as well as 30 epidemiology-related questions. These questions necessitate the model's expert reasoning abilities, thereby offering valuable insights into the outbreak of the disease. The BAND dataset brings new challenges to the NLP world, requiring better disguise capability of the content and the ability to infer important information. We provide several benchmark tasks, including Named Entity Recognition (NER), Question Answering (QA), and Event Extraction (EE), to show how existing models are capable of handling these tasks in the epidemiology domain. To the best of our knowledge, the BAND corpus is the largest corpus of well-annotated biomedical outbreak alert news with elaborately designed questions, making it a valuable resource for epidemiologists and NLP researchers alike. 
Type Of Material Database/Collection of data 
Year Produced 2023 
Provided To Others? Yes  
Impact The dataset has just been published so it is too early to tell of any notable impacts. 
URL https://github.com/fuzihaofzh/BAND
 
Title biocaster.org 
Description BioCaster is a fully automated real-time media monitoring system based on Natural Language Processing (NLP) technology. Early detection and tracking of infectious disease outbreaks involves having access to information from a variety of sources. Increasingly this means monitoring many thousands of Internet news feeds simultaneously. However, three difficulties exist in finding information using traditional search methods. Firstly, the massive volume of dynamically changing unstructured news data makes it extremely difficult for governments and public health workers to obtain a clear picture of the outbreak. Secondly, the initial reports of an outbreak are contained in only a few news articles, which will usually be overlooked using simple keyword indexing methods. Thirdly, the initial reports of an infectious disease will usually appear in local non-English news media. In order to capture outbreak information in the most timely manner it is therefore crucial for computer systems to have an understanding of several languages. As part of the EPI-AI project we have partnered with SDL (now part of RWS) to use their Machine Translation Edge technology to overcome the language barrier in 10 languages: Arabic, Chinese, French, Indonesian, Farsi, Korean, Portuguese, Spanish, Russian, and Swahili. The second generation of BioCaster has two major components: a web/database server (built on Elasticsearch and Kibana) and a backend cluster computer (Rocks) equipped with hybrid symbolic-neural NLP technology which continuously scans hundreds of RSS newsfeeds from local and national news providers. Since the NLP system has detailed knowledge of important concepts such as diseases, pathogens, phenotypes, people, places and drugs, we can semantically index relevant parts of news articles, giving users quicker and highly precise access to information. 
Type Of Technology Webtool/Application 
Year Produced 2021 
Impact The BioCaster portal of infectious disease news is publicly available for education and scientific research purposes. 
URL http://biocaster.org/
 
Description Invited presentation at the WHO EIOS Global Technical Meeting 
Form Of Engagement Activity A talk or presentation
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Dr David Buckeridge went to Egypt to attend the WHO EIOS Global Technical Meeting where he gave a presentation on our work entitled "Connecting Information to Improve Detection in Event-Based Surveillance". The talk sparked questions and a discussion afterwards about automated event-based epidemic surveillance.
Year(s) Of Engagement Activity 2022
 
Description Visit to WHO Epidemic Intelligence Hub in Berlin 
Form Of Engagement Activity A formal working group, expert panel or dialogue
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Third sector organisations
Results and Impact The EPI-AI team from Cambridge and McGill Universities met with the WHO's Epidemic Intelligence from Open Sources team, and a team from the Robert Koch Institute, at the WHO Hub in Berlin from October 19th to 20th 2023. The meeting agenda included presentations on EPI-AI activities from all members, covering text analytics, aberration detection and bias. This was followed by presentations from WHO and RKI, and on the second day we had an open discussion about directions for future collaborations.
Year(s) Of Engagement Activity 2023