Sparse, rank-reduced and general smooth modelling

Lead Research Organisation: University of Bath
Department Name: Mathematical Sciences

Abstract

Smooth regression models are useful when some variable of interest is related to a number of predictor variables in a complex manner, and we want to understand that relationship. In many cases the complexity of the dependence between the variables means that it is impractical to follow the traditional statistical approach of writing down a simple statistical model describing the relationship, in which only a few unknown parameters are to be estimated. Instead the statistical model is specified in terms of unknown smooth functions of predictors, for example 'log blood pressure is given by a smooth function of age plus a smooth function of weight and height plus a smooth function of hours of exercise per week'. The statistical challenge is then to estimate the smooth functions. Given decades of work on the theory and computation of these smooth models, their use is now widespread and almost as routine as that of traditional regression models. However, there remain several practical obstacles to their use, in exactly the complex data situations in which they should be most appealing.

1. Current methods allow either the effective modelling of short range spatial, temporal or spatio-temporal correlation, via sparse computational methods, OR the modelling of complex relationships involving many variables, via reduced rank methods, but not both. However, complex models with short range residual correlation are exactly where such smooth models are most practically appealing.

2. In the reduced rank setting, which allows feasible computation with highly complex models, the most reliable and efficient computational methods are so far restricted to situations where the variable of interest comes from the exponential family of distributions (normal, Poisson, binomial etc). But given the proven wide utility of such methods, there would also be many applications for similarly reliable methods for models where the variable of interest follows a very different distribution to those in the exponential family (for example it might be the waiting time to an event, or the occurrence of an event at a spatial location).

3. Increasingly, researchers and companies are seeking to analyze very large datasets, for which model fitting is simply infeasible with current smooth modelling technology.

This project aims to address these challenges, thereby massively increasing the practical scope and utility of this class of models. In particular the project will seek to find novel ways to hybridize the sparse and reduced rank approaches to smooth modelling to resolve issue 1; to build on experience with the exponential family methods to develop reliable and efficient methods for variables from a much more general class of distributions, to resolve issue 2; and to develop novel and efficient algorithms for handling large and complex models that can be readily parallelized on cheap standard computer hardware, to address issue 3. The methods developed will be implemented in free open source software, building on the PI's successful mgcv package for generalized additive modelling, in the R statistical computing environment. The methods will also be disseminated via a textbook, short courses and the provision of web resources.
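
The blood pressure example given in the abstract can be written out as a model equation. This is only an illustrative sketch: the function names f_1, f_2, f_3, the observation index i and the residual term \epsilon_i are notation assumed here and do not appear in the original text.

```latex
\log(\mathrm{bp}_i) = f_1(\mathrm{age}_i)
                    + f_2(\mathrm{weight}_i, \mathrm{height}_i)
                    + f_3(\mathrm{exercise}_i)
                    + \epsilon_i
```

Here each f_j is an unknown smooth function to be estimated from the data, rather than a fixed parametric form with a few unknown coefficients.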

Planned Impact

High quality statistical methods and software provide part of the essential infrastructure for large parts of science and business. This proposal is designed to strengthen this infrastructure in a very direct way, by developing methods aimed at the problems that most users of current smooth regression methods would most like to be solved: that is dealing with high frequency autocorrelation in complex smooth regression models, dealing reliably with models well beyond the exponential family, and dealing computationally with complex high rank models of very large data sets. By providing free open-source software implementing the methods the project will deliver the methods directly to where they will achieve maximum impact.

The ultimate societal and economic impact of the work will be achieved via the use of the developed methods in science, industry, business, public health and environmental management, in particular. For example, generalized additive models are currently used quite widely in fisheries management as part of the information generating process that leads to quota setting and policy, but here problems of un-modelled residual spatial autocorrelation and/or very large datasets (e.g. from commercial catch data) are endemic.

In addition to the provision and maintenance of high quality software, the project will foster this impact via short courses, a text book targeted at statistical practitioners outside academic statistics, the production of web resources, and continued electronic interaction with software users.

In addition direct impact on Electricite De France's business is expected via collaboration on short term electricity demand modelling as part of this project. Of our collaboration to date, Yannig Goude of EDF writes that it "has clearly a concrete impact on our work at EDF." and goes on to list 3 specific areas.
1. The methods are used to discover and investigate new effects and properties of the electrical load on the French national electricity grid. A number of such effects have subsequently been incorporated in the parametric models currently used for operational forecasting.
2. The methods have been successfully employed in pilot studies on EDF subsidiary companies, and are currently being further developed for operational forecasting purposes for these companies.
3. The methods have been used operationally on the French national grid as a tool to help operators when special meteorological events happen (extreme temperatures or temperature variations, for example). In these cases the mgcv GAM based models capture the electricity grid load dynamics better than the current operational models, and are used to correct the operational models.

Publications

Fasiolo M (2019) Scalable Visualization Methods for Modern Generalized Additive Models in Journal of Computational and Graphical Statistics

Marra G (2017) A Simultaneous Equation Approach to Estimating HIV Prevalence With Nonignorable Missing Responses in Journal of the American Statistical Association

Pya N (2014) Shape constrained additive models in Statistics and Computing

Wieling M (2016) Investigating dialectal differences using articulography in Journal of Phonetics

Wood S (2015) Core Statistics

Wood S (2015) Generalized additive models for large data sets in Journal of the Royal Statistical Society: Series C (Applied Statistics)

Related Projects

Project Reference  Relationship  Related To    Start       End         Award Value
EP/K005251/1                                   01/02/2013  30/11/2015  £599,282
EP/K005251/2       Transfer      EP/K005251/1  01/12/2015  31/01/2018  £248,630
 
Description New methods for flexible regression modelling, particularly for very large models of very large datasets and for complicated data structures. Example applications have been in the space-time modelling of daily air pollution data in the UK, of farm productivity in the USA and of energy forecasting across France.
Exploitation Route The findings are implemented in open source software (see above) and are therefore directly used by others for the modelling and analysis of data in agriculture, fisheries, energy forecasting, economics, medicine, epidemiology, environmental science, ecology and other areas. The methods also underpin other researchers' applied and computational work on functional data analysis and smooth modelling, and the big model/data methods are likely to be taken up by researchers and used in other modelling frameworks.
Sectors Aerospace, Defence and Marine; Agriculture, Food and Drink; Digital/Communication/Information Technologies (including Software); Education; Energy; Environment; Financial Services, and Management Consultancy; Healthcare; Government, Democracy and Justice; Pharmaceuticals and Medical Biotechnology; Other

URL https://cran.r-project.org/web/packages/mgcv/index.html
 
Description The developed methods are released as a recommended package distributed with the R statistical software. The methods are quite widely used in environmental science, resource management, epidemiology, medical research, finance and economics, for example. One particular example is the energy company EDF's use of the methods for energy use prediction. Another is the Farmers Business Network in the USA, which uses the methods to help optimize farm production on 7000 farms.
Sector Agriculture, Food and Drink; Energy; Environment; Other
Impact Types Societal, Economic

 
Description EPSRC CDT
Amount £3,771,473 (GBP)
Funding ID EP/L015684/1 
Organisation Engineering and Physical Sciences Research Council (EPSRC) 
Sector Public
Country United Kingdom
Start 05/2014 
End 11/2022
 
Description Industrial collaboration with EDF research
Amount € 150,000 (EUR)
Organisation EDF Energy 
Sector Private
Country United Kingdom
Start 07/2015 
End 07/2018
 
Description Methods underpinning electricity demand modelling 
Organisation EDF Energy
Department EDF Innovation and Research
Country France 
Sector Private 
PI Contribution Developing large data GAM methods
Collaborator Contribution Problem setting, data provision, discussion and ideas.
Impact Wood, Goude and Shaw (2014) Generalized additive models for large datasets. Journal of the Royal Statistical Society (C) online early.
Start Year 2009
 
Title mgcv 1.8-0 
Description Major upgrade of the R (see cran.r-project.org) software package for generalized additive modelling, providing new methods for generalized additive modelling beyond simple exponential family distributions. Based on new statistical computing methods, the software now provides ordered categorical, beta, negative binomial, Tweedie, zero inflated Poisson and scaled t distributions, as well as Cox proportional hazards additive models, multivariate normal additive models and scale location additive models. Parallel computation of the leading order computations via OpenMP has also been implemented. 
Type Of Technology Software 
Year Produced 2014 
Open Source License? Yes  
Impact This software package is one of a handful of 'recommended' packages supplied with the base distribution of R. Previous versions have been used in a wide range of applications, particularly in ecology and natural resource management, as well as medicine, epidemiology and economics. For example, the energy company EDF use the methods for electricity load forecasting. In addition, the package is currently used by 102 other software packages for R. For example, the underlying fitting methods are sufficiently general that they can be efficiently leveraged for functional data analysis, as in the 'refund' package. 
URL http://cran.r-project.org/web/packages/mgcv/
 
Title mgcv 1.8-12 
Description Recommended R package for generalized additive modelling. Major upgrades are: 1. Addition of scalable parallel fitting methods, allowing models with 10^8 observations or more and 10^4 coefficients or more to be estimated on relatively modest workstations or servers in minutes to hours, rather than days to weeks. 2. Interface with JAGS for Bayesian stochastic simulation with GAMs. 3. Addition of Gaussian process smoothers and B-spline smoothers with derivative penalties of arbitrary order. 
Type Of Technology Software 
Year Produced 2015 
Open Source License? Yes  
Impact Used for energy load prediction problems by EDF (France). Collaboration with EDF to improve methods for this purpose. 
URL https://cran.r-project.org/web/packages/mgcv/index.html
 
Title mgcv 1.8-16 
Description Recommended R package for additive smooth models and extensions. This version adds further big data methods and additional model classes, plus a new smoothing parameter estimation method that offers one route to infill-free sparse computation. 
Type Of Technology Software 
Year Produced 2016 
Open Source License? Yes  
Impact The software has quite wide uptake amongst academic and non-academic statisticians. From email and short course contact it is clear that the new methods are being used, which is unsurprising as they were driven by users' scientific needs. 
URL https://cran.r-project.org/web/packages/mgcv/index.html
 
Description Broadcast radio interview and round table discussion on big data for Radio Silencio (Spain) 
Form Of Engagement Activity A broadcast e.g. TV/radio/film/podcast (other than news/press)
Part Of Official Scheme? No
Geographic Reach Regional
Primary Audience Media (as a channel to the public)
Results and Impact As part of a big data workshop in Barcelona I was interviewed for a Catalan radio science program...
https://drive.google.com/open?id=0B4Kqe49544LbYVFnMXFMVmJfRUE
... and took part in a round table discussion on big data issues...
https://www.dropbox.com/s/py54qtaohljv2ll/ROUND_TABLE_large.m4v?dl=0
- these have been broadcast, but I have no idea what the audience reached is.
Year(s) Of Engagement Activity 2015
 
Description GAM 3 day course, ASA Alaska 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Short course on smooth modelling for Alaska chapter of American Statistical Association. Participants work primarily in natural resource management, and will apply the models in this context.
Year(s) Of Engagement Activity 2015
URL http://community.amstat.org/alaskachapter/meetings/2015/annualmeeting2015
 
Description GAM course in Zurich R course series 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Two day course on smoothing, additive models etc., as part of the Zurich professional development course programme. 10 participants from Switzerland, Italy and the UK, from universities and industry.
Year(s) Of Engagement Activity 2016
URL http://www.zhrcourses.uzh.ch/programm/gen-additive-models_en.html
 
Description Smoothing short course University of Graz 
Form Of Engagement Activity Participation in an activity, workshop or similar
Part Of Official Scheme? No
Geographic Reach International
Primary Audience Professional Practitioners
Results and Impact Two day course on smoothing and additive models at the University of Graz, Austria. A mixture of academic, consulting and industry statisticians and postgraduate students.
Year(s) Of Engagement Activity 2015