EPI-AI: Automated Understanding and Alerting of Disease Outbreaks from Global News Media

Lead Research Organisation: University of Cambridge

Department Name: Modern & Medieval Languages

Abstract

Disease outbreaks, such as Zika, Ebola and SARS epidemics, are of the greatest importance to the international community and the UK/Canadian governments. Public health organisations need data as early as possible in an outbreak to respond rapidly and prevent human suffering. Traditional bio-surveillance relies on human laboratory networks, but these data are often unavailable in real-time, patchy in geographic coverage, and tuned to specific diseases. Digital disease surveillance (DDS) using Web-based news data overcomes some of these limitations, providing a critical supplement to traditional networks. However, current DDS systems rely to a large extent on manual screening of Web data for events of interest: a skilled and labour-intensive process given the volume, multilingualism, velocity and potential bias of news sources.
Research has shown that there is significant potential to automate DDS. Natural Language Processing (NLP) has been in use since the early 2000s to efficiently detect and track health threats from outbreak news reports. For example, the Canadian GPHIN system, which detected the first evidence of SARS, uses a combination of NLP and human experts to sift through over 20K online news reports each day in nine languages. However, traditional automated approaches are insensitive to context that can help experts to interpret risk factors and fail to take account of possible data biases.
Our goal in the EPI-AI project is to achieve a step-change in real-time automated DDS. Previous work has tended to take a siloed approach, focusing on Natural Language Processing methods or spatial analysis with little attention to the equality issues that arise from biases in the data. We will use an interdisciplinary approach, combining expertise from three disciplines - computer science, epidemiology, and bioethics - to develop novel machine learning and statistical models adapted to the complex data and objectives of global epidemic surveillance.
Benefits that we see include: (i) improved geographic precision and coverage; (ii) improved ability to understand the topical focus of a report; (iii) automated normalisation of risk factors to a standard terminology for integration of evidence across systems; (iv) automated spatio-temporal analysis of reports to update global risk maps and trigger alerts; and (v) provision of contextual information on potential media bias to support interpretation of alerts.
This fundamentally interdisciplinary research will be closely aligned with key Canadian, UK and global public health stakeholders.

Planned Impact

The development of novel neural and statistical machine learning (ML) methods for the extraction of structured event data from news media, the understanding of bias in news data, and the integration of event data with baseline data for the purpose of timely risk assessment will have an important scientific and technological impact. This investigation is highly relevant to several domains, such as public health, computer science, life science and medicine.
1. Public health experts performing infectious disease alerting, situation awareness and risk assessment will benefit from greater efficiency and from access to earlier warnings and wider coverage of health threats, e.g. pandemic influenza, Zika, Ebola and Marburg. In addition to using modern ML techniques to dramatically improve Natural Language Processing (NLP), the proposed technology will also integrate these methods into an advanced spatial analysis framework to support public health analysts in early alerting, tracking and risk assessment. The techniques supplement scarce human expertise (by replacing manual search and de-duplication), bring in evidence beyond national boundaries, cover segments of the population who may not interact with traditional clinical surveillance networks (e.g. patients who may not visit a GP), and take account of media bias. With respect to risk assessment, the project will provide valuable data to calibrate the parameters of disease transmission models, which are often hampered by insufficient data.
2. Life scientists and clinicians involved in translational studies will benefit from having a novel database of epidemiological evidence about infectious diseases and their associations to risk factors such as symptoms, locations and population descriptors with links to existing scientific data infrastructure through standard ontological codes. The database will include region-specific information about potential sources of media bias that influence reporting.
3. Industry involved in AI technologies and e-science will benefit from open source software tools and publications explaining state-of-the-art ML techniques for text mining and risk alerting that takes account of bias reduction. AI has huge potential in healthcare as a means to support patients and clinicians in making decisions and to reduce administrative costs. The techniques pioneered in this proposal for bias reduction, NLP/Machine Translation and risk analysis are highly relevant to a wide community involved in the development and use of AI in clinical practice, e.g. technology companies involved in R&D on electronic patient records, the pharmaceutical industry looking for online evidence to repurpose drugs, and online patient support networks.
4. Decision-makers and the public will benefit from having improved AI technologies for early detection of health threats and improved understanding of their benefits and limitations. The main benefit is the potential for the research results to make the world safer from the threat of emerging and re-emerging epidemics by strengthening our global capacity to detect and control such threats rapidly, before they cause extensive human suffering. Public understanding of the project will be aided through the openly available EPI-AI portal, conference publications and demonstrations.

Funded Value:

£491,373

Funded Period:

Feb 20 - Dec 24

Funder:

FIC

Project Status:

Closed

Project Category:

Research Grant

Project Reference:

ES/T012277/1

Principal Investigator:

Nigel Collier

Research Subject:

Info. & commun. Technol. (32%)

Sociology (32%)

Tools, technologies & methods (32%)

Research Topic:

Artificial Intelligence (32%)

Bioinformatics (32%)

Stratification (32%)

Organisations

People	ORCID iD
Nigel Collier (Principal Investigator)
Nicholas King (Co-Investigator)
David Buckeridge (Co-Investigator)	http://orcid.org/0000-0003-1817-5047

Publications

Anya OKHMATOVSKAIA (2022) A conceptual framework for representing events under public health surveillance

Ficek A (2022) How to tackle an emerging topic? Combining strong and weak labels for Covid news NER

Fu Z (2023) On the Effectiveness of Parameter-Efficient Fine-Tuning in Proceedings of the AAAI Conference on Artificial Intelligence

Fu Z (2024) BAND: Biomedical Alert News Dataset in Proceedings of the AAAI Conference on Artificial Intelligence

Fu Z (2023) Biomedical Named Entity Recognition via Dictionary-based Synonym Generalization

Hu T (2024) An Individualized News Affective Response Dataset

Li Y (2022) Improving Word Translation via Two-Stage Contrastive Learning

Liu F (2021) Self-Alignment Pretraining for Biomedical Entity Representations

Liu F (2021) Fast, Effective, and Self-Supervised: Transforming Masked Language Models into Universal Lexical and Sentence Encoders

Liu F (2021) Learning Domain-Specialised Representations for Cross-Lingual Biomedical Entity Linking

Key Findings
Policy Influence
Research Databases and Models
Research Tools and Methods
Software and Technical Products
Engagement Activities


Description	Digital disease surveillance (DDS) uses Web-based data to overcome some of the limitations of traditional bio-surveillance systems, providing a critical supplement to traditional human networks. Our goal in EPI-AI is to achieve a step-change in real-time DDS using techniques that enable us to process data from online news streams. We have adopted an interdisciplinary approach, combining expertise from computer science, epidemiology, and bioethics to develop novel machine learning and statistical models adapted to the structurally complex data required for global epidemic surveillance. Drawing on this combined expertise, we have identified areas of potential equity bias in existing DDS systems. Building on that analysis, the project has produced a first-of-its-kind large-scale data set that the DDS community can use both for training systems and as a reference standard for evaluation. As part of our efforts to process epidemiological data, we developed improved machine learning methods, including ways to represent terminology, integrate information from knowledge graphs, classify news reports, and understand the limitations of Large Language Models in extracting event information.
Exploitation Route	We believe that the BAND corpus, together with our findings about the use of LLMs in epidemiological monitoring, will be a valuable enabler for the digital disease surveillance community. A benchmark is hugely valuable, especially one that has considered the need to balance events equitably across the world. We are still learning much about the capabilities of LLMs in this space, and much useful work can be taken forward in the area of 'constitutional AI', where LLMs apply organizational policies and guidelines to the task of epidemic detection.
Sectors	Healthcare, Security and Diplomacy


Description	Independent review of the Global Public Health Intelligence Network (GPHIN)
Geographic Reach	National
Policy Influence Type	Contribution to a national consultation/review
Impact	No hard outcomes yet, but from a process perspective, the GPHIN system is currently being upgraded and we are contributing to that effort.
URL	https://www.canada.ca/en/public-health/corporate/mandate/about-agency/external-advisory-bodies/list/...


Title	BAND: biomedical alert news dataset
Description	To improve disease surveillance and understanding of disease spread, several surveillance systems have been developed to monitor daily news alerts and social media. However, existing systems lack thorough epidemiological analysis in relation to corresponding alerts or news, largely due to the scarcity of well-annotated reports data. To address this gap, we introduce the Biomedical Alert News Dataset (BAND), which includes 1,508 samples from existing reported news articles, open emails, and alerts, as well as 30 epidemiology-related questions. These questions necessitate the model's expert reasoning abilities, thereby offering valuable insights into the outbreak of the disease. The BAND dataset brings new challenges to the NLP world, requiring better inference capability of the content and the ability to infer important information. We provide several benchmark tasks, including Named Entity Recognition (NER), Question Answering (QA), and Event Extraction (EE), to demonstrate existing models' capabilities and limitations in handling epidemiology-specific tasks.
Type Of Material	Improvements to research infrastructure
Year Produced	2024
Provided To Others?	Yes
Impact	To the best of our knowledge, the BAND corpus is the largest corpus of well-annotated biomedical outbreak alert news with elaborately designed questions, making it a valuable resource for epidemiologists and NLP researchers alike.
URL	https://github.com/fuzihaofzh/BAND


Title	MedLAMA knowledge probing benchmark based on UMLS
Description	Knowledge probing is crucial for understanding the knowledge transfer mechanism behind the pre-trained language models (PLMs). Despite the growing progress of probing knowledge for PLMs in the general domain, specialised areas such as the biomedical domain are vastly under-explored. To facilitate this, we release a well-curated biomedical knowledge probing benchmark, MedLAMA, constructed based on the Unified Medical Language System (UMLS) Metathesaurus. We test a wide spectrum of state-of-the-art PLMs and probing approaches on our benchmark, reaching at most 3% of acc@10. While highlighting various sources of domain-specific challenges that amount to this underwhelming performance, we illustrate that the underlying PLMs have a higher potential for probing tasks. To achieve this, we propose Contrastive-Probe, a novel self-supervised contrastive probing approach, that adjusts the underlying PLMs without using any probing data. While Contrastive-Probe pushes the acc@10 to 28%, the performance gap still remains notable. Our human expert evaluation suggests that the probing performance of our Contrastive-Probe is still under-estimated as UMLS still does not include the full spectrum of factual knowledge. We hope MedLAMA and Contrastive-Probe facilitate further developments of more suited probing techniques for this domain.
Type Of Material	Improvements to research infrastructure
Year Produced	2021
Provided To Others?	Yes
Impact	A new benchmark data set for understanding how pre-trained language models understand biomedical concepts and their relationships to each other. The benchmark is helping us to understand the types of biomedical relationships, e.g. between diseases and symptoms, that can be handled by the latest state of the art language models.
URL	https://github.com/cambridgeltl/medlama
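
The acc@10 figures quoted in the MedLAMA description can be read as a top-k accuracy. The following is a minimal sketch of that reading, assuming a query counts as correct when any of its top k ranked predictions appears in its set of gold answers; the function name, argument names and toy data are hypothetical and are not taken from the MedLAMA code base, which may define the metric differently.

```python
from typing import Sequence, Set


def acc_at_k(ranked_predictions: Sequence[Sequence[str]],
             gold_answers: Sequence[Set[str]],
             k: int = 10) -> float:
    """Fraction of queries with at least one gold answer in the top k.

    Assumption: this hit-in-top-k definition is an illustration only,
    not the benchmark's official scoring script.
    """
    if len(ranked_predictions) != len(gold_answers):
        raise ValueError("one prediction list is required per query")
    if not gold_answers:
        return 0.0
    hits = 0
    for predictions, gold in zip(ranked_predictions, gold_answers):
        # Only the k highest-ranked predictions are considered.
        if any(candidate in gold for candidate in predictions[:k]):
            hits += 1
    return hits / len(gold_answers)


# Toy, made-up probing outputs for two queries (not real benchmark data).
toy_predictions = [
    ["fever", "cough", "rash"],
    ["aspirin", "ibuprofen"],
]
toy_gold = [{"rash"}, {"paracetamol"}]

score_top3 = acc_at_k(toy_predictions, toy_gold, k=3)
score_top1 = acc_at_k(toy_predictions, toy_gold, k=1)
```

With the toy data, only the first query has a gold answer among its top three predictions, so the top-3 score is half of the queries and the top-1 score is zero.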


Title	SapBERT: Self-alignment pretraining for BERT
Description	This repo holds code, data, and pretrained weights for (1) the SapBERT model presented in our NAACL 2021 paper: Self-Alignment Pretraining for Biomedical Entity Representations; (2) the cross-lingual SapBERT and a cross-lingual biomedical entity linking benchmark (XL-BEL) proposed in our ACL 2021 paper: Learning Domain-Specialised Representations for Cross-Lingual Biomedical Entity Linking. Despite the widespread success of self-supervised learning via masked language models (MLM), accurately capturing fine-grained semantic relationships in the biomedical domain remains a challenge. This is of paramount importance for entity-level tasks such as entity linking where the ability to model entity relations (especially synonymy) is pivotal. To address this challenge, we propose SapBERT, a pretraining scheme that self-aligns the representation space of biomedical entities. We design a scalable metric learning framework that can leverage UMLS, a massive collection of biomedical ontologies with 4M+ concepts. In contrast with previous pipeline-based hybrid systems, SapBERT offers an elegant one-model-for-all solution to the problem of medical entity linking (MEL), achieving a new state-of-the-art (SOTA) on six MEL benchmarking datasets.
Type Of Material	Improvements to research infrastructure
Year Produced	2021
Provided To Others?	Yes
Impact	In early experiments, SapBERT has enabled us to raise performance in biomedical named entity linking to a level consistent with supervised learning without the need for large amounts of labelled training data. This is a great advance and potentially reduces the need for expensive human annotation in this task.
URL	https://github.com/cambridgeltl/sapbert


Title	BAND: Biomedical Alert News Dataset
Description	Infectious disease outbreaks continue to pose a significant threat to human health and well-being. To improve disease surveillance and understanding of disease spread, several surveillance systems have been developed to monitor daily news alerts and social media. However, existing systems lack thorough epidemiological analysis in relation to corresponding alerts or news, largely due to the scarcity of well-annotated reports data. To address this gap, we introduce the Biomedical Alert News Dataset (BAND), which includes 1,508 samples from existing reported news articles, open emails, and alerts, as well as 30 epidemiology-related questions. These questions necessitate the model's expert reasoning abilities, thereby offering valuable insights into the outbreak of the disease. The BAND dataset brings new challenges to the NLP world, requiring better inference capability of the content and the ability to infer important information. We provide several benchmark tasks, including Named Entity Recognition (NER), Question Answering (QA), and Event Extraction (EE), to show how existing models are capable of handling these tasks in the epidemiology domain. To the best of our knowledge, the BAND corpus is the largest corpus of well-annotated biomedical outbreak alert news with elaborately designed questions, making it a valuable resource for epidemiologists and NLP researchers alike.
Type Of Material	Database/Collection of data
Year Produced	2023
Provided To Others?	Yes
Impact	The dataset has just been published so it is too early to tell of any notable impacts.
URL	https://github.com/fuzihaofzh/BAND


Title	biocaster.org
Description	BioCaster is a fully automated real-time media monitoring system based on Natural Language Processing (NLP) technology. Early detection and tracking of infectious disease outbreaks involves having access to information from a variety of sources. Increasingly this means monitoring many thousands of Internet news feeds simultaneously. However, three difficulties exist in finding information using traditional search methods. Firstly, the massive volume of dynamically changing unstructured news data makes it extremely difficult for governments and public health workers to obtain a clear picture of the outbreak. Secondly, the initial reports of an outbreak are contained in only a few news articles, which will usually be overlooked using simple keyword indexing methods. Thirdly, the initial reports of an infectious disease will usually appear in local non-English news media. In order to capture outbreak information in the most timely manner it is therefore crucial for computer systems to have an understanding of several languages. As part of the EPI-AI project we have partnered with SDL (now part of RWS) to use their Machine Translation Edge technology to overcome the language barrier in 10 languages: Arabic, Chinese, French, Indonesian, Farsi, Korean, Portuguese, Spanish, Russian, and Swahili. The second generation of BioCaster has two major components: a web/database server (built on Elasticsearch and Kibana) and a backend cluster computer (Rocks) equipped with hybrid symbolic-neural NLP technology which continuously scans hundreds of RSS newsfeeds from local and national news providers. Because the NLP system has detailed knowledge of important concepts such as diseases, pathogens, phenotypes, people, places and drugs, we can semantically index relevant parts of news articles, giving users quicker and highly precise access to information.
Type Of Technology	Webtool/Application
Year Produced	2021
Impact	The BioCaster portal of infectious disease news is publicly available for education and scientific research purposes.
URL	http://biocaster.org/


Description	Invited presentation at the WHO EIOS Global Technical Meeting
Form Of Engagement Activity	A talk or presentation
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	Dr David Buckeridge went to Egypt to attend the WHO EIOS Global Technical Meeting where he gave a presentation on our work entitled "Connecting Information to Improve Detection in Event-Based Surveillance". The talk sparked questions and a discussion afterwards about automated event-based epidemic surveillance.
Year(s) Of Engagement Activity	2022


Description	Visit to WHO Epidemic Intelligence Hub in Berlin
Form Of Engagement Activity	A formal working group, expert panel or dialogue
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Third sector organisations
Results and Impact	The EPI-AI team from Cambridge and McGill Universities met with the WHO's Epidemic Intelligence from Open Sources team, and a team from the Robert Koch Institute, at the WHO Hub in Berlin from October 19th to 20th 2023. The meeting agenda included presentations on EPI-AI activities from all members, covering text analytics, aberration detection and bias. This was followed by presentations from WHO and RKI, and on the second day we had an open discussion about directions for future collaborations.
Year(s) Of Engagement Activity	2023
