Sparse, rank-reduced and general smooth modelling

Lead Research Organisation: University of Bristol

Department Name: Mathematics

Abstract

Smooth regression models are useful when some variable of interest is related to a number of predictor variables in a complex manner, and we want to understand that relationship. In many cases the complexity of the dependence between the variables means that it is impractical to follow the traditional statistical approach of writing down a simple statistical model describing the relationship, in which only a few unknown parameters are to be estimated. Instead the statistical model is specified in terms of unknown smooth functions of predictors, for example 'log blood pressure is given by a smooth function of age plus a smooth function of weight and height plus a smooth function of hours of exercise per week'. The statistical challenge is then to estimate the smooth functions. Given decades of work on the theory and computation of these smooth models, their use is now widespread and almost as routine as that of traditional regression models. However, there remain several practical obstacles to their use, in exactly the complex data situations in which they should be most appealing.

1. Current methods allow either the effective modelling of short range spatial, temporal or spatio-temporal correlation, via sparse computational methods, OR the modelling of complex relationships involving many variables, via reduced rank methods, but not both. However, complex models with short range residual correlation are exactly where such smooth models are most practically appealing.

2. In the reduced rank setting, which allows feasible computation with highly complex models, the most reliable and efficient computational methods are so far restricted to situations where the variable of interest comes from the exponential family of distributions (normal, Poisson, binomial etc). But given the proven wide utility of such methods, there would also be many applications for similarly reliable methods for models where the variable of interest follows a very different distribution to those in the exponential family (for example it might be the waiting time to an event, or the occurrence of an event at a spatial location).

3. Increasingly, researchers and companies are seeking to analyze very large datasets, for which analysis is simply infeasible with current smooth modelling technology.

This project aims to address these challenges, thereby massively increasing the practical scope and utility of this class of models. In particular the project will seek to find novel ways to hybridize the sparse and reduced rank approaches to smooth modelling to resolve issue 1; to build on experience with the exponential family methods to develop reliable and efficient methods for variables from a much more general class of distributions, to resolve 2; and to develop novel and efficient algorithms for handling large and complex models that can be readily parallelized on cheap standard computer hardware, to address 3. The methods developed will be implemented in free open source software, building on the PI's successful mgcv package for generalized additive modelling, in the R statistical computing environment. The methods will also be disseminated via a textbook, short courses and the provision of web resources.
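
As an illustration only, the blood pressure example in the abstract could be written as the additive model below. The notation is an assumption made here for clarity and is not taken from the project text: $i$ indexes individuals, $f_1$, $f_2$ and $f_3$ denote the unknown smooth functions to be estimated, and $\epsilon_i$ is a generic error term.

```latex
\log(\mathrm{bp}_i) = f_1(\mathrm{age}_i)
                    + f_2(\mathrm{weight}_i, \mathrm{height}_i)
                    + f_3(\mathrm{exercise}_i)
                    + \epsilon_i
```

Here $f_2$ is a smooth function of two predictors jointly, matching the 'smooth function of weight and height' in the example.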

Planned Impact

High quality statistical methods and software provide part of the essential infrastructure for large parts of science and business. This proposal is designed to strengthen this infrastructure in a very direct way, by developing methods aimed at the problems that most users of current smooth regression methods would most like to be solved: that is dealing with high frequency autocorrelation in complex smooth regression models, dealing reliably with models well beyond the exponential family, and dealing computationally with complex high rank models of very large data sets. By providing free open-source software implementing the methods the project will deliver the methods directly to where they will achieve maximum impact.

The ultimate societal and economic impact of the work will be achieved via the use of the developed methods in science, industry, business, public health and environmental management, in particular. For example generalized additive models are currently used quite widely in fisheries management as part of the information generating process that leads to quota setting and policy, but here problems of un-modelled residual spatial autocorrelation and/or very large datasets (e.g. from commercial catch data) are endemic.

In addition to the provision and maintenance of high quality software, the project will foster this impact via short courses, a textbook targeted at statistical practitioners outside academic statistics, the production of web resources, and continued electronic interaction with software users.

In addition, direct impact on the business of Electricite De France (EDF) is expected via collaboration on short term electricity demand modelling as part of this project. Of our collaboration to date, Yannig Goude of EDF writes that it "has clearly a concrete impact on our work at EDF" and goes on to list 3 specific areas:
1. The methods are used to discover and investigate new effects and properties of the electrical load on the French national electricity grid. A number of such effects have subsequently been incorporated in the parametric models currently used for operational forecasting.
2. The methods have been successfully employed in pilot studies on EDF subsidiary companies, and are currently being further developed for operational forecasting purposes for these companies.
3. The methods have been used operationally on the French national grid as a tool to help operators when special meteorological events happen (extreme temperatures or temperature variations, for example). In these cases the mgcv GAM based models capture the electricity grid load dynamics better than the current operational models, and are used to correct the operational models.

Funded Value:

£248,630

Funded Period:

Dec 15 - Jan 18

Funder:

EPSRC

Project Status:

Closed

Project Category:

Fellowship

Project Reference:

EP/K005251/2

Principal Investigator:

Simon Wood

Research Subject:

Mathematical sciences (100%)

Research Topic:

Statistics & Appl. Probability (100%)

Organisations

People	ORCID iD
Simon Wood (Principal Investigator / Fellow)

Publications

Wood SN (2017) Generalized Additive Models: An Introduction with R (2nd Edition)

Wood SN (2017) A generalized Fellner-Schall method for smoothing parameter optimization with application to Tweedie location, scale and shape models in Biometrics

Wood S (2017) Smoothing Parameter and Model Selection for General Smooth Models in Journal of the American Statistical Association

Wood S (2017) Generalized Additive Models for Gigadata: Modeling the U.K. Black Smoke Network Daily Data in Journal of the American Statistical Association

Wood S (2016) P-splines with derivative based penalties and tensor product smoothing of unevenly distributed data in Statistics and Computing

Wood S (2015) Generalized Additive Models for Large Data Sets in Journal of the Royal Statistical Society Series C: Applied Statistics

Wood S (2016) Just Another Gibbs Additive Modeler: Interfacing JAGS and mgcv in Journal of Statistical Software

Wood S (2015) Core Statistics

Wieling M (2016) Investigating dialectal differences using articulography in Journal of Phonetics

Pya N (2016) A note on basis dimension selection in generalized additive modelling in arXiv e-prints

Key Findings
Impact Summary
Further Funding
Collaboration
Software and Technical Products
Engagement Activities


Key Findings

Description	Methods have been developed for reliable estimation of smooth regression models when the responses are of a much richer range of types than was previously possible, for example multivariate, categorical, survival time, heavy tailed and other data types. These approaches have also allowed the development of much improved quantile regression methods. Methods have also been developed for working with much larger data sets and models than was previously possible, by exploiting modern computer hardware. These methods have been released in generally usable form as part of the widely used R statistical software. A new class of methods for smoothing parameter estimation, suitable for use with highly sparse model representations, has also been developed.
Exploitation Route	All the developed methods are published as journal articles and have been released as high quality software. The methods are already in use by researchers around the world, working in energy forecasting, forestry and fisheries management, finance, environmental science, quantitative psychology, air pollution economics and more.
Sectors	Agriculture, Food and Drink; Education; Energy; Environment; Financial Services, and Management Consultancy; Healthcare; Leisure Activities, including Sports, Recreation and Tourism; Government, Democracy and Justice; Pharmaceuticals and Medical Biotechnology


Impact Summary

Description	Big data methods are being used by air pollution epidemiologists and EDF for prediction. The new methods in software are being used by a range of data analysts outside academia. I know this from email contact, but don't have follow-through details on the eventual impact for these. Another example is that the Farmers Business Network in the USA uses the new big-model big-data methods to help improve farm profitability in 7000 farms.
Sector	Agriculture, Food and Drink; Education; Energy; Environment; Financial Services, and Management Consultancy; Healthcare; Leisure Activities, including Sports, Recreation and Tourism; Government, Democracy and Justice; Pharmaceuticals and Medical Biotechnology
Impact Types	Economic


Further Funding

Description	Industrial collaboration with EDF research
Amount	€ 150,000 (EUR)
Organisation	EDF Energy
Sector	Private
Country	United Kingdom
Start	07/2015
End	07/2018


Collaboration

Description	Methods underpinning electricity demand modelling
Organisation	EDF Energy
Department	EDF Innovation and Research
Country	France
Sector	Private
PI Contribution	Developing large data GAM methods
Collaborator Contribution	Problem setting, data provision, discussion and ideas.
Impact	Wood, Goude and Shaw (2014) Generalized additive models for large datasets. Journal of the Royal Statistical Society (C) online early.
Start Year	2009


Software and Technical Products

Title	mgcv 1.8-16
Description	Recommended R package for additive smooth models and extensions. This version adds further big data methods and additional model classes, plus a new smoothing parameter estimation method that offers one route to infill-free sparse computation.
Type Of Technology	Software
Year Produced	2016
Open Source License?	Yes
Impact	The software has quite wide uptake amongst academic and non-academic statisticians. From email and short course contact it is clear that the new methods are being used, which is unsurprising as they were driven by users' scientific needs.
URL	https://cran.r-project.org/web/packages/mgcv/index.html


Engagement Activities

Description	Short course on smooth regression modelling in Sydney, Australia
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	A short course (2 days) on smooth regression modelling, consisting of a mixture of theory presentations and computer based practicals.
Year(s) Of Engagement Activity	2016


Description	Short course on smooth regression modelling, Groningen
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	A short course (2 days) on the theory and practice of smooth regression modelling, including new methods developed in the fellowship. Mixture of theory and hands-on computer labs.
Year(s) Of Engagement Activity	2016


Description	Tutorial on quantile additive models at the University of Tuebingen
Form Of Engagement Activity	Participation in an activity, workshop or similar
Part Of Official Scheme?	No
Geographic Reach	International
Primary Audience	Professional Practitioners
Results and Impact	A tutorial (3 hours) on quantile additive model methods developed as part of the project (presented by Matteo Fasiolo, PDRA on the project).
Year(s) Of Engagement Activity	2017
