Distance measures and whitening procedures for high dimensional data

Lead Research Organisation: Cardiff University
Department Name: Sch of Mathematics

Abstract

Large, complex data sources with many variables and mixed data types present a new challenge for anomaly detection as part of the statistical production process. The simple parametric models used for outlier detection in survey data are no longer suitable: the model assumptions they require would become prohibitively complex, they do not process large data sets efficiently, and they do not allow for mixed variable types.

Anomaly detection in statistical production is key to ensuring the quality of statistics, and the challenge has not yet been fully addressed in official statistics. Working with ONS, the UK's national statistics institute, would offer the student access to sensitive, record-level data that are not usually easily available to researchers. Although some record-level survey data are available to academic researchers, non-survey data not collected by ONS are not generally accessible, and where they are, the environments are not usually suitable for big data processing. This project therefore offers the student the novel opportunity not only to work on datasets not usually available to academia, but also to do so in a state-of-the-art distributed processing environment.

Datasets that the student would work on may include HMRC's turnover and expenditure data from value added tax returns and HMRC payroll data. ONS is exploring the potential to use these in the compilation of headline economic statistics, including gross domestic product (GDP). A robust understanding of these new datasets is crucial in ensuring the quality of market-moving statistics.

This PhD is designed to develop novel mathematics that bridges linear algebra, statistics and optimization, and to introduce new techniques for anomaly detection. Linear algebra has long been applied in a wide variety of areas in multivariate statistics, but the last decade has generated a number of new settings in which such techniques are being applied. Examples include the developments in compressed sensing and matrix completion, work pioneered by prominent mathematicians such as Candès (Candès & Tao, 2010), Donoho (Donoho, 2006), Tao (Candès & Tao, 2007) and Tsybakov (Rohde & Tsybakov, 2011). The escalation of 'big data' has given rise to more considered thought on how optimization can inform statistical procedure as the dimensions of the problem grow. A modern trend has been to formulate statistical problems as (approximate) convex optimization problems, for which existing routines can solve problems of huge dimension fairly quickly (Boyd & Vandenberghe, 2004). An interesting question is how close the solution of the approximate convex optimization problem is to the solution of the original statistical problem. This PhD is set in the context outlined above and tackles the problem of anomaly detection.
