Panda Alert Research Proposal

Lead Research Organisation: University of Cambridge
Department Name: Linguistics

Abstract

The Panda Alert System aims to improve the state of the art in early disease outbreak
detection by incorporating into its language models linguistic features that were
not previously known to indicate a risk to human health. The project will
develop a fully functioning, real-time alerting and mapping system.

This kind of research project requires substantial use of software
engineering methodologies combined with novel research into unsupervised or
semi-supervised NLP models. We propose a layered, modular architecture
detachable from the core NLP engine, which should demonstrate a general
detection capability that transfers easily to new domains.

The most likely approach would be a semi-supervised, bootstrapping model, which
learns from a small amount of training data to generalise over the unknown domain.
Harnessing the latest NLP machine learning methods such as neural networks (deep
learning), the model analyses news reports to flag any risks to public health. All tools
and data sets from this research will be open sourced and available for download
from public repositories such as GitHub or Google Code.
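
The bootstrapping idea can be illustrated with a minimal self-training sketch. It is only an illustration under stated assumptions: a simple token-count scorer stands in for the neural model the proposal has in mind, and the function names (`train`, `predict`, `bootstrap`), the labels and the confidence threshold are all hypothetical rather than part of the project.

```python
from collections import Counter


def train(labelled):
    """Count how often each token occurs under each label."""
    counts = {}
    for text, label in labelled:
        counts.setdefault(label, Counter()).update(text.lower().split())
    return counts


def predict(counts, text):
    """Return (label, confidence); confidence is the winner's share of token votes."""
    tokens = text.lower().split()
    votes = {label: sum(c[t] for t in tokens) for label, c in counts.items()}
    total = sum(votes.values())
    if total == 0:
        return None, 0.0  # no known token: the model abstains
    label = max(votes, key=votes.get)
    return label, votes[label] / total


def bootstrap(seed, unlabelled, threshold=0.8, max_rounds=5):
    """Self-training: repeatedly add confidently labelled items to the training set."""
    labelled, pool = list(seed), list(unlabelled)
    for _ in range(max_rounds):
        counts = train(labelled)
        confident, remaining = [], []
        for text in pool:
            label, conf = predict(counts, text)
            if label is not None and conf >= threshold:
                confident.append((text, label))
            else:
                remaining.append(text)
        if not confident:
            break  # nothing new was learnt in this round
        labelled.extend(confident)
        pool = remaining
    return train(labelled), labelled, pool


# Hypothetical seed data: one example per class.
seed = [
    ("cholera outbreak reported in city", "risk"),
    ("football team wins cup final", "no_risk"),
]
unlabelled = [
    "new cholera cases reported",
    "team celebrates cup victory",
    "measles cases rise",
]
model, labelled, pool = bootstrap(seed, unlabelled)
```

In this toy run, "measles cases rise" shares no token with the seed data and is only labelled in the second round, after "cases" has been learnt from a confidently labelled item. That is the generalisation step the bootstrapping approach relies on.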

For the core NLP engine to be reusable in future, it must be domain-agnostic so
that it can be extended and adapted via a clear programming interface to new event
detection tasks. The real-time analysis system will have to handle hundreds of
thousands of news articles and social media items per day in multiple languages.
The research part of the project will extend the state of the art in NLP and AI
to establish novel methods of detecting general discourse patterns, analysing
statements for known and unknown features that may indicate the fact-bearing
target knowledge. The new technique should be capable of identifying and
reporting other target knowledge/facts from casual online user/news activity
with minimal or no training.
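
One way such a programming interface could look is sketched below. This is an assumption for illustration only, not the project's design: the names `Alert`, `Detector`, `KeywordDetector` and `Engine` are hypothetical, and a keyword scorer stands in for a real detection model.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Alert:
    domain: str
    text: str
    score: float


class Detector(Protocol):
    """Hypothetical plug-in contract: anything with a domain and a score method."""
    domain: str

    def score(self, text: str) -> float: ...


class KeywordDetector:
    """Stand-in detector: scores a text by its share of domain keywords."""

    def __init__(self, domain, keywords):
        self.domain = domain
        self.keywords = {k.lower() for k in keywords}

    def score(self, text):
        tokens = text.lower().split()
        if not tokens:
            return 0.0
        return sum(1 for t in tokens if t in self.keywords) / len(tokens)


class Engine:
    """Domain-agnostic core: it knows nothing about diseases or floods."""

    def __init__(self):
        self._detectors = []

    def register(self, detector):
        self._detectors.append(detector)

    def process(self, texts, threshold=0.2):
        alerts = []
        for text in texts:
            for detector in self._detectors:
                s = detector.score(text)
                if s >= threshold:
                    alerts.append(Alert(detector.domain, text, s))
        return alerts


engine = Engine()
engine.register(KeywordDetector("disease", ["outbreak", "cholera"]))
engine.register(KeywordDetector("flood", ["flood", "flooding"]))
alerts = engine.process([
    "cholera outbreak confirmed",
    "river flood warning issued",
    "local election results",
])
```

Adding a new event type here means registering another detector, and the engine itself stays unchanged. That is the property the proposal asks of the core.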

In year one, I shall focus mainly on toponym resolution and geoparsing. These
are NLP techniques for identifying place names in text and resolving them to
geographical coordinates. This task is easy for humans to perform; for a
machine, however, telling which London (UK, Canada, etc.) was mentioned in a
text is a non-trivial job. I aim to improve on the existing baselines by
researching a novel method of geoparsing, and to publish the findings next year
at a respected conference.
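
The "which London?" problem can be made concrete with a small baseline sketch. It is not the novel method proposed here. It is a simple heuristic that picks the candidate whose context words overlap most with the text and falls back to population on a tie. The gazetteer entries, the rounded coordinates and population figures, and the function name `resolve` are illustrative assumptions.

```python
# Hypothetical two-entry gazetteer; coordinates and populations are rough,
# illustrative values, not authoritative data.
GAZETTEER = {
    "london": [
        {
            "name": "London, UK",
            "lat": 51.51,
            "lon": -0.13,
            "population": 8_900_000,
            "context": {"uk", "england", "britain", "thames"},
        },
        {
            "name": "London, Ontario, Canada",
            "lat": 42.98,
            "lon": -81.25,
            "population": 400_000,
            "context": {"ontario", "canada", "toronto"},
        },
    ],
}


def resolve(toponym, text):
    """Pick the candidate with most context-word overlap; break ties by population."""
    candidates = GAZETTEER.get(toponym.lower(), [])
    if not candidates:
        return None  # place name not in the gazetteer
    tokens = {t.strip(".,;:!?()\"'").lower() for t in text.split()}
    return max(
        candidates,
        key=lambda c: (len(c["context"] & tokens), c["population"]),
    )


canada = resolve("London", "Flu cases rise in London, Ontario, say Canada health officials.")
default = resolve("London", "Flu cases rise in London.")
```

With the Canadian context words present, the sketch resolves to London, Ontario. With no context at all, it falls back to the most populous candidate, a common baseline behaviour that the research aims to improve on.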

Publications


Studentship Projects

Project Reference  Relationship  Related To    Start       End         Student Name
NE/M009009/1                                   05/10/2015  31/12/2022
1649558            Studentship   NE/M009009/1  01/10/2015  30/09/2018  Milan Gritta
 
Description We discovered novel ways to increase the effectiveness of geographic text analysis. This is particularly relevant for event monitoring such as disease outbreaks. To monitor breaking news events accurately, we need to be able to deploy the latest techniques in Natural Language Processing and Artificial Intelligence. The aim of the thesis was to show how these novel methods can significantly improve existing approaches to disease monitoring for the benefit of public health.
Exploitation Route The audience will come from a mixture of technical and policy research backgrounds. The findings show a path to more accurate information extraction for disease monitoring, or for any other kind of event monitoring. The thesis can be consulted for ways to bring existing public health monitoring systems up to date with the latest techniques in artificial intelligence and computational linguistics.
Sectors Agriculture, Food and Drink; Communities and Social Services/Policy; Education; Environment; Healthcare; Government, Democracy and Justice

 
Description The findings of this thesis are expected to be used by international public health agencies (JRC Europe, PHAC Canada) that maintain automatic disease monitoring systems based on Natural Language Processing technology. This involves mostly technical advice and material for the development and maintenance of such technology. The aim is to increase the capability and effectiveness of NLP monitoring systems for the benefit of public health.
First Year Of Impact 2019
Sector Environment; Healthcare; Government, Democracy and Justice
Impact Types Societal; Economic; Policy & public services

 
Description Technical Research for Disease Monitoring
Geographic Reach Multiple continents/international 
Policy Influence Type Influenced training of practitioners or researchers
 
Title GitHub Resources 
Description NLP/AI Resources for Geoparsing and beyond. Descriptions in the repository. 
Type Of Material Improvements to research infrastructure 
Year Produced 2016 
Provided To Others? Yes  
Impact Gives others access to state-of-the-art tools and resources for geographic text analysis.
URL https://github.com/milangritta
 
Title My GitHub Page 
Description This is where I store most of the code and resources generated by my research, including links to further resources.
Type Of Material Data handling & control 
Year Produced 2015 
Provided To Others? Yes  
Impact Allows anyone to see what I'm researching, try the code, download the data and replicate experiments. 
URL https://github.com/milangritta
 
Title Research data supporting "Vancouver Welcomes You! Minimalist Location Metonymy Resolution" 
Description Complete supporting/replication data and code for the ACL publication. The paper was published in August 2017 at ACL 2017 (www.acl2017.org).
Type Of Material Database/Collection of data 
Year Produced 2017 
Provided To Others? Yes  
 
Title Research data supporting "What's missing in geographical parsing?" 
Description Full code and data required for replication and experimentation. 
Type Of Material Database/Collection of data 
Year Produced 2017 
Provided To Others? Yes  
 
Title Research data supporting "Which Melbourne? Augmenting Geocoding with Maps" 
Description Please unzip the files and read the README file for more instructions. Also visit my GitHub account for more information (milangritta) 
Type Of Material Database/Collection of data 
Year Produced 2018 
Provided To Others? Yes  
 
Title Software supporting 'A Pragmatic Guide to Geoparsing Evaluation' 
Description Code and data for the NCRF++ model described in the paper. For more information, download the file to view the README files within. 
Type Of Technology Software 
Year Produced 2019 
 
Description Joint Research Centre Visit 
Form Of Engagement Activity Participation in an open day or visit at my research institution
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Visited the Joint Research Centre at the European Commission's Science Hub in Ispra, Italy. The purpose was to share research with the European Media Monitor research group and to gather experience and observe "Science in Action". I learnt how to create a case study for my PhD thesis.
Year(s) Of Engagement Activity 2018