Unparameterised multi-modal data, high order signatures, and the mathematics of data science

Lead Research Organisation: University of Oxford
Department Name: Mathematical Institute

Abstract

Our ancestors communicated by scratching on the walls of caves, took navigational decisions by looking at the stars and made medical diagnoses simply by listening to patients. A great deal of information is captured in these simple data streams; our ability to capture, process, and decide actions based on information pervades all aspects of human life.

Today, we face the same challenges, but the information is far more voluminous and the expectations for the outcomes far higher. When we write with a finger on an iPhone, when our voice is recorded for doctors to assess our mood, when video is analysed for abnormal actions, or when telescopes look deep into the galaxies for black holes, stars and planets, technically sophisticated systems translate streams of sequential data into processed and recognised patterns that can be acted upon.

Our relatively new ability to offload data analysis onto massive digital systems is transforming our world. However, huge challenges remain. Groundbreaking mathematical innovation is rapidly expanding our depth of understanding in one area. This project aims to build on successful pilot collaborations to create tools that genuinely merge this new maths with existing data science, and then to apply them to exemplar challenges to produce a more effective abstraction of the "capture, process, and decide" process. The evidence is now overwhelming that dimension reduction and high order methods can capture sequential data very effectively. The maths underpinning this provided the crucial step that resulted in the extension of Newton's calculus beyond Itô's theory to rough paths; its mathematical articulation, the signature of a stream, has significantly enhanced deep learning methods to deliver online handwriting recognition with state-of-the-art accuracy.

This project has the goal of developing and embedding the abstract mathematics around rough paths and complex streamed data into a few of the richest challenges involved in the "capture, process, and decide" task. The investigators and the world-leading project partners are connected by the shared challenge of improving this task with complex datasets of importance in four contexts:
* Health
* Human interfaces
* Human Actions
* Observing the Universe
The specific base challenges we start from are:
1) Use face and speech data, together with self-reported mood data, to better detect when an intervention to support someone with mental illness is or is not working.
2) When a person writes (in Chinese) with their finger on a sat-nav device or mobile phone, transcribe this signal into digital characters accurately and economically, and recognise who wrote it.
3) By observing evolving images in video data, develop tools that can classify human actions.
4) Develop measurement instruments and nonlinear processing techniques for astronomical data that improve detection sensitivity for transients and make new observations, e.g. of planets orbiting stars.

The technical challenges are deeply interconnected. This project is a near unique opportunity to bring these together to produce a validated common methodology, and to create substantial cross-fertilization. One recent example of how this can happen is worth highlighting. In 2013, Ben Graham (then University of Warwick, now Facebook) used the signature to quantify strokes from Chinese hand-written characters parsimoniously and efficiently. The capture stage is subtle and has appreciably improved the accuracy of the recognition process; the China-based partners on this project subsequently created an app which has been downloaded millions of times.

While the handwriting context for rough paths is very well defined and successful, understanding motion of people in videos is at a successful but early stage! The contexts are clearly related, and link through faces with the mental health challenge, and through occlusion with transients in astronomy. It is all joined up!

Planned Impact

The approach to generating scientific impact in this project is, at its heart, based on creating an effective eco-structure for incredibly talented ECRs to innovate in an incredibly exciting context where progress translates into immediate outcomes - in academia and in industry. One technical outcome from this process should be new systematic approaches to efficiently learning to translate short term (and potentially transient) information into longer term interpretation.

It is immediately clear that such innovation would directly address the applied challenges. It could allow us to find better feedback on how people are coping with mental challenges, and perhaps even lead to new opportunities for diagnosis in primary care; it could allow tools for identifying danger for the public through collaborations with the health and safety community; it could allow us to identify a new range of astronomical features. At the same time, the project would consolidate mathematical research in rough path theory into a completely new more applied direction.

This project is an outstanding opportunity to build bridges between high quality, fundamental mathematics and data science. It can pull three top mathematics departments together around an exciting theme and through the momentum of cooperation. It will change the careers and culture of the (already excellent) younger researchers. The researchers will have to bridge between mathematical innovation and the exemplar contexts. They will be given support to do this, but it will be non-negotiable: there will be no room for silos or comfort zones. The PDRAs will spend considerable time (at least 25%) with the team leading the applied challenge, and will come out of this project exceptionally technically skilled in blending mathematical ideas with applied problems in data science.

The gains from the project will be sustained through creating this cohort of ECRs who are comfortable in two strategically important disciplines. It will leverage existing strength to help keep the UK mathematical world leading in a way that is deeply productive for the subject's links with applications.

The Alan Turing Institute will be used as a conduit to transfer ideas between the researchers, between the project and other data scientists, and between the project and mission-critical business activity. It will also provide an environment for the software engineer. The project will create broadly usable software. One ambition is to develop a widely used package for scikit-learn, built out from the current Python extension esig, that can be integrated into the broader machine learning environment.

Quantifying this impact is hard, but one can for example note that the SKA has been projected to cost 2 billion euros. The other areas have similar potential.

Publications

Allan A (2020) Pathwise stochastic control with applications to robust filtering in Annals of Applied Probability

Cohen S (2019) Switching cost models as hypothesis tests in Economics Letters

Cohen S (2020) Uncertainty and filtering of hidden Markov models in discrete time in Probability, Uncertainty and Quantitative Risk

Cohen S (2021) Detecting and Repairing Arbitrage in Traded Option Prices in Applied Mathematical Finance

Foster J (2020) An Optimal Polynomial Approximation of Brownian Motion in SIAM Journal on Numerical Analysis

Kalsi J (2020) Optimal Execution with Rough Path Signatures in SIAM Journal on Financial Mathematics

 
Description We have used the path signature to provide predictive features for identifying people who are subsequently diagnosed with Alzheimer's disease. Features are derived from time-ordered measurements of the size of the whole brain, the ventricles and the hippocampus. We found two nonlinear interactions which are predictive in both cases. The first interaction is the change of hippocampal volume with time, and the second is the change of hippocampal volume relative to the volume of the whole brain. We are not claiming to be the first to find these predictors (hippocampal and brain volume changes are well known in Alzheimer's disease), but we have demonstrated the power of the path signature in their identification and analysis without using manual feature selection.
Exploitation Route Sequential data is becoming increasingly available as monitoring technology is applied, and we have shown that the path signature method is a useful tool for processing medical data. We hope to see more applications of the path signature to sequential data from patient monitoring.

The take-up of these methods continues steadily. For example, they formed part of the winning team effort in the international PhysioNet 2019 competition to develop machine learning tools that identify sepsis early using the streams of data that intensive care units hold about their patients.
Sectors Healthcare
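
The following is a minimal illustrative sketch, not the project's own code, of how low-order signature terms could be computed from time-ordered measurements such as those described above. It treats the measurements as a piecewise-linear path and accumulates the first two signature levels with Chen's identity. The function names and the sample numbers are hypothetical and chosen only for illustration; in practice a dedicated package such as esig would be used.

```python
def signature_level2(path):
    """Return the level-1 and level-2 signature terms of a piecewise-linear path.

    `path` is a list of points, each a list of d floats, in time order.
    Level 1 is the total increment of each channel; level 2 holds the
    iterated integrals S[i][j] of channel i against channel j.
    """
    d = len(path[0])
    s1 = [0.0] * d
    s2 = [[0.0] * d for _ in range(d)]
    for prev, curr in zip(path, path[1:]):
        delta = [c - p for p, c in zip(prev, curr)]
        # Chen's identity for appending one straight segment:
        # S2 <- S2 + S1 (x) delta + (delta (x) delta) / 2
        for i in range(d):
            for j in range(d):
                s2[i][j] += s1[i] * delta[j] + 0.5 * delta[i] * delta[j]
        for i in range(d):
            s1[i] += delta[i]
    return s1, s2


def levy_area(s2, i, j):
    """Signed area between channels i and j (antisymmetric part of level 2)."""
    return 0.5 * (s2[i][j] - s2[j][i])


# Hypothetical, made-up measurements: (time in years, hippocampal volume,
# whole-brain volume), both volumes in arbitrary normalised units.
example_path = [
    [0.0, 1.00, 1.00],
    [1.0, 0.98, 0.99],
    [2.0, 0.95, 0.99],
    [3.0, 0.91, 0.98],
]
level1, level2 = signature_level2(example_path)
print("total increments:", level1)
print("time vs hippocampus area:", levy_area(level2, 0, 1))
print("hippocampus vs whole-brain area:", levy_area(level2, 1, 2))
```

The cross terms between time and hippocampal volume, and between hippocampal volume and whole-brain volume, are the kind of nonlinear interaction features referred to in the description; which terms are predictive is an empirical question that this sketch does not address.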

 
Description They have been used in the financial industry and also in the development of commercial Chinese handwriting recognition apps for mobile phones.
First Year Of Impact 2018
Sector Construction, Digital/Communication/Information Technologies (including Software), Financial Services, and Management Consultancy, Security and Diplomacy
Impact Types Societal, Economic

 
Description Chinese handwriting and action recognition
Organisation South China University of Technology
Department School of Electronic and Information Engineering
Country China 
Sector Academic/University 
PI Contribution We actively collaborate on research in action interpretation, gesture recognition and online Chinese handwriting recognition.
Collaborator Contribution They bring profound domain knowledge and are able to execute research projects on the engineering side. Without it our mathematical knowledge would have little worth.
Impact This collaboration is multi-disciplinary, spanning mathematics, computer vision and engineering. It has produced several research papers, some in leading venues (e.g. TPAMI). Much of the work came out of a Turing-sponsored week of intensive start-up effort that pulled together a small group of Turing and international scientists.
Start Year 2014
 
Description Clinical psychiatry and mood processing 
Organisation Alan Turing Institute
Country United Kingdom 
Sector Academic/University 
PI Contribution This project consolidates our research theme that mood can be quantified and provides an indicator of wider diagnosis, and allows us to explore alternative media - speech and transcribed speech. The primary contribution so far is to establish mood as a stream that can be used for medical inference.
Collaborator Contribution Access to data, a good computing environment, and the opportunity to interact with appropriate domain experts.
Impact This collaboration is multi-disciplinary, involving researchers from the areas of Mathematics, Psychiatry, Natural Language Processing and Speech Processing.
Start Year 2018
 
Description Clinical psychiatry and mood processing 
Organisation University of Oxford
Department Department of Psychiatry
Country United Kingdom 
Sector Academic/University 
PI Contribution This project consolidates our research theme that mood can be quantified and provides an indicator of wider diagnosis, and allows us to explore alternative media - speech and transcribed speech. The primary contribution so far is to establish mood as a stream that can be used for medical inference.
Collaborator Contribution Access to data, a good computing environment, and the opportunity to interact with appropriate domain experts.
Impact This collaboration is multi-disciplinary, involving researchers from the areas of Mathematics, Psychiatry, Natural Language Processing and Speech Processing.
Start Year 2018
 
Description DataSig and the Alan Turing Institute
Organisation Alan Turing Institute
Country United Kingdom 
Sector Academic/University 
PI Contribution The Alan Turing Institute is a project partner to the EPSRC programme grant and the DataSig programme. It has allowed and nurtured several sub-partnerships that enable interchange between mathematical insight and very practical problems, stimulating both sides of the equation. They are in their early days and will undoubtedly generate several full-blown reports as they mature. The external funding stimulated by this relationship currently supports two postdocs and two software engineers part time and involves Data-centric Engineering, Defence and Security, HSBC, and Health and Safety. Other funding supports the DataSig partnership on mental health challenges. Our contribution is to bring, and further develop, novel expertise in understanding complex multimodal streamed data and to apply it to novel and externally interesting challenges.
Collaborator Contribution These projects are currently in early stages. They bring challenges, resource for postdocs and computing for our related science etc. as well as placing their own staff to work on their specific challenges so that they get a strong return. The expectation is that their results will form published examples demonstrating and benchmarking our research while also forming a pathway for knowledge exchange.
Impact These are still in development. The research is multidisciplinary: Mathematics, Data Science, Medicine, Civil Engineering, ...
Start Year 2019