Sparse, rank-reduced and general smooth modelling

Lead Research Organisation: University of Bristol
Department Name: Mathematics

Abstract

Smooth regression models are useful when some variable of interest is related to a number of predictor variables in a complex manner, and we want to understand that relationship. In many cases the complexity of the dependence between the variables makes it impractical to follow the traditional statistical approach of writing down a simple statistical model describing the relationship, in which only a few unknown parameters are to be estimated. Instead the statistical model is specified in terms of unknown smooth functions of predictors, for example "log blood pressure is given by a smooth function of age plus a smooth function of weight and height plus a smooth function of hours of exercise per week". The statistical challenge is then to estimate the smooth functions. Given decades of work on the theory and computation of these smooth models, their use is now widespread and almost as routine as that of traditional regression models. However, there remain several practical obstacles to their use, in exactly the complex data situations in which they should be most appealing.
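
The blood pressure example can be written as an additive model. This is an illustrative sketch only: the function names f_1, f_2, f_3, the observation index i and the error term are notation assumed here, not taken from the project description, and the additive error on the log scale is likewise an assumption.

```latex
\log(\mathrm{bp}_i) = f_1(\mathrm{age}_i)
                    + f_2(\mathrm{weight}_i, \mathrm{height}_i)
                    + f_3(\mathrm{exercise}_i)
                    + \epsilon_i
```

Each f_j is an unknown smooth function to be estimated from the data, with f_2 a smooth function of weight and height jointly.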

1. Current methods allow either the effective modelling of short range spatial, temporal or spatio-temporal correlation, via sparse computational methods, OR the modelling of complex relationships involving many variables, via reduced rank methods, but not both. However, complex models with short range residual correlation are exactly where such smooth models are most practically appealing.

2. In the reduced rank setting, which allows feasible computation with highly complex models, the most reliable and efficient computational methods are so far restricted to situations where the variable of interest comes from the exponential family of distributions (normal, Poisson, binomial etc). But given the proven wide utility of such methods, there would also be many applications for similarly reliable methods for models where the variable of interest follows a very different distribution to those in the exponential family (for example it might be the waiting time to an event, or the occurrence of an event at a spatial location).

3. Increasingly researchers and companies are seeking to analyze very large datasets, for which analysis is simply infeasible with current smooth modelling technology.

This project aims to address these challenges, thereby massively increasing the practical scope and utility of this class of models. In particular the project will seek to find novel ways to hybridize the sparse and reduced rank approaches to smooth modelling to resolve issue 1; to build on experience with the exponential family methods to develop reliable and efficient methods for variables from a much more general class of distributions, to resolve 2; and to develop novel and efficient algorithms for handling large and complex models that can be readily parallelized on cheap standard computer hardware, to address 3. The methods developed will be implemented in free open source software, building on the PI's successful mgcv package for generalized additive modelling, in the R statistical computing environment. The methods will also be disseminated via a textbook, short courses and the provision of web resources.

Planned Impact

High quality statistical methods and software provide part of the essential infrastructure for large parts of science and business. This proposal is designed to strengthen this infrastructure in a very direct way, by developing methods aimed at the problems that most users of current smooth regression methods would most like to be solved: that is, dealing with high frequency autocorrelation in complex smooth regression models, dealing reliably with models well beyond the exponential family, and dealing computationally with complex high rank models of very large data sets. By providing free open-source software implementing the methods, the project will deliver the methods directly to where they will achieve maximum impact.

The ultimate societal and economic impact of the work will be achieved via the use of the developed methods in science, industry, business, public health and environmental management, in particular. For example generalized additive models are currently used quite widely in fisheries management as part of the information generating process that leads to quota setting and policy, but here problems of un-modelled residual spatial autocorrelation and/or very large datasets (e.g. from commercial catch data) are endemic.

In addition to the provision and maintenance of high quality software, the project will foster this impact via short courses, a textbook targeted at statistical practitioners outside academic statistics, the production of web resources, and continued electronic interaction with software users.

In addition direct impact on Electricite De France's business is expected via collaboration on short term electricity demand modelling as part of this project. Of our collaboration to date, Yannig Goude of EDF writes that it "has clearly a concrete impact on our work at EDF." and goes on to list 3 specific areas.
1. The methods are used to discover and investigate new effects and properties of the electrical load on the French national electricity grid. A number of such effects have subsequently been incorporated in the parametric models currently used for operational forecasting.
2. The methods have been successfully employed in pilot studies on EDF subsidiary companies, and are currently being further developed for operational forecasting purposes for these companies.
3. The methods have been used operationally on the French national grid as a tool to help operators when special meteorological events happen (extreme temperatures or temperature variations, for example). In these cases the mgcv GAM based models capture the electricity grid load dynamics better than the current operational models, and are used to correct the operational models.

 
Description Methods have been developed for reliable estimation of smooth regression models when the responses are of a much richer range of types than was previously possible, for example multivariate, categorical, survival time, heavy tailed and other data types. These approaches have also allowed the development of much improved quantile regression methods. Methods have also been developed for working with much larger data sets and models than was previously possible, by exploiting modern computer hardware. These methods have been released in generally usable form as part of the widely used R statistical software. A new class of methods for smoothing parameter estimation suitable for use with highly sparse model representations has been developed.
Exploitation Route All the developed methods are published as journal articles and have been released as high quality software. The methods are already in use by researchers around the world, working in energy forecasting, forestry and fisheries management, finance, environmental science, quantitative psychology, air pollution economics and more.
Sectors Agriculture, Food and Drink; Education; Energy; Environment; Financial Services, and Management Consultancy; Healthcare; Leisure Activities, including Sports, Recreation and Tourism; Government, Democracy and Justice; Pharmaceuticals and Medical Biotechnology

 
Description Big data methods are being used by air pollution epidemiologists and EDF for prediction. The new methods in software are being used by a range of data analysts outside academia. I know this from email contact, but don't have follow through details on the eventual impact for these. Another example is that the Farmers Business Network in the USA uses the new big-model big-data methods to help improve farm profitability in 7000 farms.
Sector Agriculture, Food and Drink; Education; Energy; Environment; Financial Services, and Management Consultancy; Healthcare; Leisure Activities, including Sports, Recreation and Tourism; Government, Democracy and Justice; Pharmaceuticals and Medical Biotechnology
Impact Types Economic

 
Description Industrial collaboration with EDF research
Amount € 150,000 (EUR)
Organisation EDF Energy 
Sector Private
Country United Kingdom
Start 07/2015 
End 07/2018
 
Description Methods underpinning electricity demand modelling 
Organisation EDF Energy
Department EDF Innovation and Research
Country France 
Sector Private 
PI Contribution Developing large data GAM methods
Collaborator Contribution Problem setting, data provision, discussion and ideas.
Impact Wood, Goude and Shaw (2014) Generalized additive models for large datasets. Journal of the Royal Statistical Society (C) online early.
Start Year 2009
 
Title mgcv 1.8-16 
Description Recommended R package for additive smooth models and extensions. This version adds further big data methods and additional model classes plus a new smoothing parameter estimation method that offers one route to infill free sparse computation. 
Type Of Technology Software 
Year Produced 2016 
Open Source License? Yes  
Impact The software has quite wide uptake amongst academic and non-academic statisticians. From email and short course contact it is clear that the new methods are being used, which is unsurprising as they were driven by users' scientific needs.
URL https://cran.r-project.org/web/packages/mgcv/index.html
 
Description Short course on smooth regression modelling in Sydney, Australia 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact A short course (2 days) on smooth regression modelling, consisting of a mixture of theory presentations and computer based practicals.
Year(s) Of Engagement Activity 2016
 
Description Short course on smooth regression modelling, Groningen 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact A short course (2 days) on the theory and practice of smooth regression modelling, including new methods developed in the fellowship. Mixture of theory and hands-on computer labs.
Year(s) Of Engagement Activity 2016
 
Description Tutorial on quantile additive models at the University of Tuebingen
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact A tutorial (3 hours) on quantile additive model methods developed as part of the project (presented by Matteo Fasiolo, PDRA on project)
Year(s) Of Engagement Activity 2017