Emerging Geometries for Statistical Science: Articulating the Vision

Lead Research Organisation: The Open University
Department Name: Faculty of Sci, Tech, Eng & Maths (STEM)

Abstract

The long term vision of this proposed research is of statistical science enhanced by emerging geometries, driven by the needs of science, industry and government. Examples of ultimate impact include unique conspicuous benefits for experimental scientists, product development teams and policy-makers. The fundamental driver for this vision is that, given a statistical problem, an appropriate geometry can inform a novel, enhanced methodology for it. Colloquially: 'use the right tool for the job'.

Statistics, with its procedures for reasoning under uncertainty, is deeply embedded across science, industry and government. Because a picture is worth a thousand words, and because conclusions should be invariant to irrelevant choices, many of its methods are based on geometry.

The resulting invariant insights come at a price - that of finding a match between, on the one hand, underlying geometric axioms and, on the other, statistical conditions appropriate to a given applied context. Whereas global Euclidean geometry matches many contexts very well, increasingly, advances and challenges in science and elsewhere are throwing up important problems which demand that alternatives be used. A variety of geometries - affine, convex, differential, algebraic - have been emerging to meet these challenges.

To ensure maximal impact and provide the appropriate context in which to focus the advances to be made in theoretical and methodological development, this project targets three generic statistical problems where such alternative geometries are required. These problems present some of the most exacting challenges to statistical methodology while offering vast potential in application:
(1) dealing with model uncertainty,
(2) estimating mixtures and
(3) analysing high dimensional low sample size data.
Each was central to a recent cutting-edge event hosted, respectively, by the Royal Society, the International Centre for Mathematical Sciences and the Isaac Newton Institute, their identified fields of application including: theoretical physics, cosmology, biology, economics, health, image analysis, microarray analysis, finance, document classification, astronomy and atmospheric science, as well as the media, government and business.

Rooted in two new research areas - invariant coordinate selection and computational information geometry - this ambitious programme will bring together and extend emerging geometries for these important generic statistical problems. Developing the necessary underlying theory, it will provide novel, geometrically-enhanced, methodologies as tools for practical application. Pursuing potentially transformative blue sky lines of enquiry, it will enlarge both research areas leading to further new methodologies. In concert with cognate research communities, it will widely articulate the overall vision announced above.

Ultimately, this work will have a very broad impact. The following specific pathways to this end have been identified, embedded statisticians facilitating pathways 2 to 4:
1. Cognate research communities will be stimulated by advances in mathematical and computational statistics, fundamental theory underpinning new methodologies.
2. Science can ultimately benefit from more efficient theory-practice iteration.
3. The economy can ultimately benefit from faster, better product development.
4. Society can ultimately benefit from more robust policy-making.
5. With their project-enhanced transferable skills, the two PDRAs will be ideal recruits to many areas of science, industry or government, as well as to higher posts in academia.

Planned Impact

As new theory and methods are developed for application in generic problems of statistical science, there will be diverse benefits and beneficiaries. Cognate communities will be built up and statisticians-at-large informed to maximise impact. Impact on a range of Academic Beneficiaries (AB) is described in the eponymous summary; we focus on others here.

NB: ICS abbreviates 'Invariant Coordinate Selection', and CIG 'Computational Information Geometry'.

IS.1 Industry
The same theory-practice iterative procedure of science outlined by Box (see AB.3) describes many instances of product development, with a working model linking features of product design to product performance. Since product development requires quality control, the multivariate outlier detection capabilities of ICS offer further benefit. Industrial applications within the Thales group will be a particular focus, in collaboration with Ampère medallist Frédéric Barbaresco.

IS.2 Government
As in any time-limited decision context, policy-makers have to come to conclusions based on the data available, good practice dictating that due allowance be made for its sampling variability. Despite recent progress, what is not yet the norm is to fully allow for the effects of model uncertainty, which can be appreciable. Box again writes: 'while [diagnostic] checks are always necessary, they may not be sufficient, because some discrepancies may on the one hand be potentially disastrous and on the other be not easily detectable'. Again, even if detectable, important discrepancies may not be foreseen. The interpretability of the space of all empirically-supported important model perturbations provided by CIG gives it the potential to become a powerful tool for the policy-maker. In particular, it can help inform which contingencies policies most need to be robust against. This is especially important in safety-critical contexts. Policy issues in public health will be a particular focus, in collaboration with Bradford Hill medallist Paddy Farrington.

IS.3 Articulating the Vision
The main modes of engagement ensuring that the beneficiaries can access the potential of this research are by publication in leading journals, continued dialogue with cognate communities, and presentation at major conferences in statistics and related areas. Each mode has a successful track record, gathering momentum and well-targeted future plans, as detailed in the Case for Support. In addition to developing the WOGAS series of workshops, and initiating a complementary ICS series, events for cognate specialists will be offered at major international events, such as GSI'13.
A further way to ensure that beneficiaries can access the outcomes of this research is through the availability of relevant software. The development of commercial quality software is premature at present, but distribution of research quality codes will form part of the standard publication process, enabling and encouraging early adoption of the project's new methodologies.
Communication of research results at international meetings will be undertaken by the PDRAs, supported by the PI. This is seen as an outstanding way for them to be trained and to establish themselves in the Statistics Community, building their careers in Statistical Methodology.

IS.4 Summary
1. Cognate research communities will be informed and stimulated by advances in mathematical and computational statistics, fundamental theory underpinning new methodologies.
2. Science can ultimately benefit from more efficient theory-practice iteration.
3. The economy can ultimately benefit from faster, better product development.
4. Society can ultimately benefit from more robust policy-making.
5. With their project-enhanced transferable skills, the two PDRAs will be ideal recruits to corresponding areas of science, industry or government, as well as to higher posts in academia.
 
Description [1] One part of the project introduced a new method (FFOBI: functional fourth-order blind identification) able to recover hidden signals in data that take the form of functions. Fourth-order blind identification was previously limited to the classical multivariate case. The extension is particularly useful in modern research where, amid the current data deluge, data increasingly come in functional form.
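
As a rough illustration of the classical multivariate starting point (not the project's functional extension, whose algorithms are not reproduced here), fourth-order blind identification can be sketched as follows: whiten the data, form a fourth-order moment matrix, and rotate to its eigenvectors. The function name `fobi`, the three simulated sources and the mixing matrix `A` are assumptions made for this sketch; the approach relies on the sources having distinct kurtoses.

```python
import numpy as np

def fobi(X):
    """Classical multivariate FOBI (illustrative sketch).

    Returns the estimated sources and the unmixing matrix.
    """
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    # Step 1: whiten with the symmetric inverse square root of the covariance.
    evals, evecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    W = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z = Xc @ W
    # Step 2: fourth-order moment matrix E[ ||z||^2 z z^T ].
    r2 = (Z ** 2).sum(axis=1)
    K = (Z * r2[:, None]).T @ Z / n
    # Step 3: its eigenvectors give the remaining rotation.
    _, U = np.linalg.eigh(K)
    return Z @ U, U.T @ W

# Hypothetical example: three independent sources with different kurtoses.
rng = np.random.default_rng(1)
n = 20000
sources = np.column_stack([
    rng.uniform(-1.0, 1.0, n),   # light tails
    rng.standard_normal(n),      # Gaussian
    rng.laplace(size=n),         # heavy tails
])
A = rng.standard_normal((3, 3)) + 2.0 * np.eye(3)  # assumed mixing matrix
X = sources @ A.T

estimated, unmixing = fobi(X)
# Absolute correlations between estimated components (rows) and true sources (columns).
C = np.abs(np.corrcoef(estimated.T, sources.T)[:3, 3:])
matches = C.argmax(axis=0)
```

Each true source should correlate strongly with exactly one estimated component, up to sign and ordering, which FOBI cannot determine.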

[2] Another side of the project deals with discriminant analysis -- the task of allocating individuals to one of several known groups. Classically, this is done in terms of a discriminant rule, comparing measurements made on these individuals with those of other, previously obtained, individuals whose group membership is known. Knowing group membership for such 'training data' is crucial to the classical approach, optimal forms of which are available. Recently, new techniques (called Invariant Coordinate Selection, ICS) have been developed that, remarkably, do NOT have this requirement. They are based on the joint analysis of two dispersion measures, called scatter matrices, whose choice is left to the analyst in general. Our recent project provides a thorough analysis on the importance of that choice, by providing, in a parametrised set of such scatter matrices, the pair that will 'classify best' in the sense of most closely reproducing optimal classical discriminant analysis results. These new techniques will allow us to perform classification in a wider array of contexts, guiding the practitioner's choice while providing a deeper understanding of the performance of ICS.
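
The joint analysis of two scatter matrices can be sketched in a few lines. The example below is only an illustration under stated assumptions: it uses the covariance matrix and a fourth-moment scatter matrix as the pair (one common choice, not necessarily the pair the project identifies as best), and the helper names `cov4` and `ics`, the simulated two-group data and the transformation `A` are all hypothetical. For a balanced two-group mixture, the direction with the smallest generalised eigenvalue is expected to align with Fisher's discriminant direction, even though the labels are never used by `ics`.

```python
import numpy as np

def cov4(X):
    """Fourth-moment scatter: covariance reweighted by squared Mahalanobis distance."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(Xc, rowvar=False))
    r2 = np.einsum("ij,jk,ik->i", Xc, S_inv, Xc)
    return (Xc * r2[:, None]).T @ Xc / (n * (p + 2))

def ics(S1, S2):
    """Solve S2 b = lambda S1 b; eigenvalues descending, directions as columns."""
    L = np.linalg.cholesky(S1)            # S1 = L L^T
    L_inv = np.linalg.inv(L)
    M = L_inv @ S2 @ L_inv.T
    vals, vecs = np.linalg.eigh((M + M.T) / 2)
    order = np.argsort(vals)[::-1]
    return vals[order], L_inv.T @ vecs[:, order]

# Hypothetical data: two balanced groups separated along one latent direction,
# then hidden by an arbitrary affine transformation A.
rng = np.random.default_rng(0)
n, p = 5000, 4
labels = rng.integers(0, 2, n)
Z = rng.standard_normal((n, p))
Z[:, 0] += 3.0 * (2 * labels - 1)
A = rng.standard_normal((p, p)) + 3.0 * np.eye(p)
X = Z @ A.T

# ICS never sees the labels.
vals, B = ics(np.cov(X, rowvar=False), cov4(X))
ics_direction = B[:, -1]   # smallest eigenvalue: bimodal (low-kurtosis) direction

# Fisher's direction, computed WITH the labels, for comparison only.
X0, X1 = X[labels == 0], X[labels == 1]
Sw = (np.cov(X0, rowvar=False) * (len(X0) - 1)
      + np.cov(X1, rowvar=False) * (len(X1) - 1)) / (n - 2)
fisher_direction = np.linalg.solve(Sw, X1.mean(axis=0) - X0.mean(axis=0))

# Agreement is measured on the projected scores, which is affine invariant.
agreement = abs(np.corrcoef(X @ ics_direction, X @ fisher_direction)[0, 1])
```

With unbalanced groups the informative direction may instead sit at the largest eigenvalue, which is one reason the choice of scatter pair and the inspection of both ends of the spectrum matter in practice.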

[3] A third side of the project concerns Computational Information Geometry (CIG) -- a recent area in which differential and other types of geometry are applied to provide important new contributions to pressing practical problems in statistics. A substantial focus here has been the development and exploitation of novel goodness-of-fit testing procedures in the large, sparse, discrete data context which typifies so much of current quantitative enquiry.
Exploitation Route With reference to the three key findings highlighted above:

[1] The FFOBI method and its algorithms (which are available freely upon simple request) can be implemented by any researcher or analyst seeking to recover information from functional data. This is particularly useful in, for example, signal processing, analysis of weather patterns, health patterns, etc.

[2] Software for estimation of the discrimination subspace based on ICS is also available upon request.

[3] An edited volume of papers on CIG for image and signal processing was published by Springer.
Sectors Aerospace, Defence and Marine; Agriculture, Food and Drink; Digital/Communication/Information Technologies (including Software); Environment; Healthcare; Transport; Other

URL https://sites.google.com/view/egssatv
 
Description [1] The Computational Information Geometry (CIG) workshop, hosted by the International Centre for Mathematical Sciences (Edinburgh), focussed on the application and development of information geometric methods in the analysis, classification and retrieval of images and signals, particularly for medical applications. This area of work has developed rapidly over recent years, propelled by major theoretical developments in information geometry, efficient data and image acquisition (particularly in biological/medical contexts and image recognition) and the desire to process and interpret large databases of digital information. Further details of its many applications can be found, for example, at the website http://www.icms.org.uk/workshops/infogeom and in the resulting Springer volume edited by Nielsen, Critchley and Dodson. [2] Additionally, our very latest findings in CIG -- and their applications across statistics and machine learning -- have been strategically communicated via a wholly novel graduate course at an internationally-leading research university (Waterloo, Canada).
First Year Of Impact 2016
Sector Digital/Communication/Information Technologies (including Software), Education, Healthcare
Impact Types Societal, Economic, Policy & public services

 
Description Paul Marriott's graduate course STAT 946 Topics in Probability and Statistics (Information Geometry) at University of Waterloo
Geographic Reach North America 
Policy Influence Type Influenced training of practitioners or researchers
Impact Graduate course...
 
Description International Centre for Mathematical Sciences (ICMS) workshop proposal
Amount £20,500 (GBP)
Organisation International Centre for Mathematical Sciences (ICMS) 
Sector Academic/University
Country United Kingdom
Start 09/2015 
End 09/2015
 
Title Computational information geometry for image and signal processing 
Description This area has developed rapidly over recent years, propelled by the major theoretical developments in information geometry, efficient data and image acquisition and the desire to process and interpret large databases of digital information. 
Type Of Material Data analysis technique 
Provided To Others? No  
Impact Application and development of information geometric methods in the analysis, classification and retrieval of images and signals. 
 
Title Functional independent component analysis: an extension of fourth-order blind identification 
Description We have extended Independent Component Analysis, and in particular Fourth-Order Blind Identification, to functional data. 
Type Of Material Data analysis technique 
Provided To Others? No  
Impact Our new methodology is shown to uncover particular structures that are missed by classical PCA. 
 
Title Recovery of Fisher's linear discriminant subspace by invariant coordinate selection methods 
Description Using any pair of scatter matrices, invariant coordinate selection can recover the Fisher linear discriminant subspace without knowing group membership. 
Type Of Material Data analysis technique 
Provided To Others? No  
Impact We discuss the impact of the choice of such a pair in terms of asymptotic accuracy of recovery, quantifying the asymptotic loss of information due to not knowing group membership.
 
Title CRAN package - FFOBI - coming 
Description Coming: CRAN R-package for functional FOBI 
Type Of Technology Software 
Year Produced 2017 
Open Source License? Yes  
Impact Functional FOBI software made available to all R users 
 
Title R code - On the geometric interplay between goodness-of-fit and estimation: illustrative examples. 
Description R code to produce figures from paper Anaya et al (2016): On the geometric interplay between goodness-of-fit and estimation: illustrative examples. 
Type Of Technology Software 
Year Produced 2016 
Open Source License? Yes  
Impact Makes it possible to reproduce the figures and rerun the simulations.
URL https://drive.google.com/open?id=0B7QXhwyY8-8tVFNPRFEyVmZYWkk