Sparse, rank-reduced and general smooth modelling
Lead Research Organisation:
University of Bath
Department Name: Mathematical Sciences
Abstract
Smooth regression models are useful when some variable of interest is related to a number of predictor variables in a complex manner, and we want to understand that relationship. In many cases the complexity of the dependence between the variables means that it is impractical to follow the traditional statistical approach of writing down a simple statistical model describing the relationship, in which only a few unknown parameters are to be estimated. Instead, the statistical model is specified in terms of unknown smooth functions of predictors, for example 'log blood pressure is given by a smooth function of age plus a smooth function of weight and height plus a smooth function of hours of exercise per week'. The statistical challenge is then to estimate the smooth functions. Given decades of work on the theory and computation of these smooth models, their use is now widespread and almost as routine as that of traditional regression models. However, there remain several practical obstacles to their use, in exactly the complex data situations in which they should be most appealing.
1. Current methods allow either the effective modelling of short range spatial, temporal or spatio-temporal correlation, via sparse computational methods, OR the modelling of complex relationships involving many variables, via reduced rank methods, but not both. However, complex models with short range residual correlation are exactly where such smooth models are most practically appealing.
2. In the reduced rank setting, which allows feasible computation with highly complex models, the most reliable and efficient computational methods are so far restricted to situations where the variable of interest comes from the exponential family of distributions (normal, Poisson, binomial etc.). But given the proven wide utility of such methods, there would also be many applications for similarly reliable methods for models where the variable of interest follows a very different distribution to those in the exponential family (for example it might be the waiting time to an event, or the occurrence of an event at a spatial location).
3. Increasingly, researchers and companies are seeking to analyze very large datasets, for which fitting is simply infeasible with current smooth modelling technology.
This project aims to address these challenges, thereby massively increasing the practical scope and utility of this class of models. In particular the project will seek to find novel ways to hybridize the sparse and reduced rank approaches to smooth modelling to resolve issue 1; to build on experience with the exponential family methods to develop reliable and efficient methods for variables from a much more general class of distributions, to resolve 2; and to develop novel and efficient algorithms for handling large and complex models that can be readily parallelized on cheap standard computer hardware, to address 3. The methods developed will be implemented in free open source software, building on the PI's successful mgcv package for generalized additive modelling, in the R statistical computing environment. The methods will also be disseminated via a textbook, short courses and the provision of web resources.
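To make the idea of a model built from unknown smooth functions concrete, the following is a minimal sketch in Python with NumPy of a reduced rank, penalized additive fit. It is an illustration only and not the method used in mgcv: each smooth is represented by a small cubic truncated-power basis with a ridge penalty on the knot coefficients, the smoothing parameter `lam` is fixed by hand rather than estimated, and the names `truncated_power_basis` and `fit_additive`, the simulated covariates and the simulated "true" functions are all assumptions made for the example.

```python
import numpy as np


def truncated_power_basis(x, knots):
    """Cubic truncated-power basis: x, x^2, x^3, then (x - k)_+^3 for each knot."""
    x = np.asarray(x, dtype=float)
    poly = np.column_stack([x, x**2, x**3])
    trunc = np.maximum(x[:, None] - knots[None, :], 0.0) ** 3
    return np.hstack([poly, trunc])


def fit_additive(y, covariates, n_knots=8, lam=1e-4):
    """Penalized least squares fit of y = a + f_1(x_1) + f_2(x_2) + ... + noise.

    Each f_j gets a low-rank basis (3 + n_knots columns), so the number of
    coefficients stays small however many observations there are.
    """
    y = np.asarray(y, dtype=float)
    blocks = [np.ones((len(y), 1))]   # intercept column, unpenalized
    penalty = [0.0]
    for x in covariates:
        x = np.asarray(x, dtype=float)
        # Interior knots placed at quantiles of the covariate.
        knots = np.quantile(x, np.linspace(0.0, 1.0, n_knots + 2)[1:-1])
        blocks.append(truncated_power_basis(x, knots))
        # Polynomial part unpenalized; knot coefficients shrunk towards zero.
        penalty += [0.0] * 3 + [lam] * n_knots
    X = np.hstack(blocks)
    # Minimise ||y - X b||^2 + sum_i penalty_i * b_i^2 by augmenting the
    # least squares problem, which avoids forming X'X explicitly.
    root_pen = np.diag(np.sqrt(np.array(penalty)))
    A = np.vstack([X, root_pen])
    b = np.concatenate([y, np.zeros(X.shape[1])])
    beta = np.linalg.lstsq(A, b, rcond=None)[0]
    return X @ beta, beta


# Simulated data only: two covariates rescaled to [0, 1] and invented smooth effects.
rng = np.random.default_rng(0)
age = rng.uniform(0.0, 1.0, 400)
exercise = rng.uniform(0.0, 1.0, 400)
truth = np.sin(2 * np.pi * age) + exercise**2
y = truth + rng.normal(0.0, 0.1, 400)

fitted, beta = fit_additive(y, [age, exercise])
rmse = np.sqrt(np.mean((fitted - truth) ** 2))
print(f"coefficients: {beta.size}, RMSE against the simulated truth: {rmse:.3f}")
```

The point of the sketch is the structure rather than the particular basis: the response is modelled as a sum of smooth terms, each term has far fewer coefficients than there are observations, and a penalty controls how wiggly each term can be. In practice the smoothing parameters would be estimated from the data rather than fixed.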
Planned Impact
High quality statistical methods and software provide part of the essential infrastructure for large parts of science and business. This proposal is designed to strengthen this infrastructure in a very direct way, by developing methods aimed at the problems that most users of current smooth regression methods would most like to be solved: that is, dealing with high frequency autocorrelation in complex smooth regression models, dealing reliably with models well beyond the exponential family, and dealing computationally with complex high rank models of very large data sets. By providing free open-source software implementing the methods, the project will deliver them directly to where they will achieve maximum impact.
The ultimate societal and economic impact of the work will be achieved via the use of the developed methods, particularly in science, industry, business, public health and environmental management. For example, generalized additive models are currently used quite widely in fisheries management as part of the information generating process that leads to quota setting and policy, but here problems of un-modelled residual spatial autocorrelation and/or very large datasets (e.g. from commercial catch data) are endemic.
In addition to the provision and maintenance of high quality software, the project will foster this impact via short courses, a textbook targeted at statistical practitioners outside academic statistics, the production of web resources, and continued electronic interaction with software users.
In addition, direct impact on the business of Électricité de France (EDF) is expected via collaboration on short term electricity demand modelling as part of this project. Of our collaboration to date, Yannig Goude of EDF writes that it "has clearly a concrete impact on our work at EDF" and goes on to list three specific areas:
1. The methods are used to discover and investigate new effects and properties of the electrical load on the French national electricity grid. A number of such effects have subsequently been incorporated in the parametric models currently used for operational forecasting.
2. The methods have been successfully employed in pilot studies on EDF subsidiary companies, and are currently being further developed for operational forecasting purposes for these companies.
3. The methods have been used operationally on the French national grid as a tool to help operators when special meteorological events happen (extreme temperatures or temperature variations, for example). In these cases the mgcv GAM-based models capture the electricity grid load dynamics better than the current operational models, and are used to correct the operational models.
People |
ORCID iD |
Simon Wood (Principal Investigator / Fellow) |
Publications
Wood S
(2013)
A simple test for random effects in regression models
in Biometrika
Pya N
(2014)
Shape constrained additive models
in Statistics and Computing
Wood S
(2015)
Generalized Additive Models for Large Data Sets
in Journal of the Royal Statistical Society Series C: Applied Statistics
Wood S
(2015)
Core Statistics
Pya Natalya
(2016)
A note on basis dimension selection in generalized additive modelling
in arXiv e-prints
Pya N
(2016)
Incorporating shape constraints in generalized additive modelling of the height-diameter relationship for Norway spruce
in Forest Ecosystems
Wood S
(2016)
P-splines with derivative based penalties and tensor product smoothing of unevenly distributed data
in Statistics and Computing
Fasiolo M
(2016)
A Comparison of Inferential Methods for Highly Nonlinear State Space Models in Ecology and Epidemiology
in Statistical Science
Description | New methods for flexible regression modelling, particularly for very large models of very large datasets and for complicated data structures. Example applications have been in the space-time modelling of daily air pollution data in the UK, of farm productivity in the USA and of energy forecasting across France. |
Exploitation Route | The findings are implemented in open source software (see above) and are therefore directly used by others for the modelling and analysis of data in agriculture, fisheries, energy forecasting, economics, medicine, epidemiology, environmental science, ecology and other areas. The methods also underpin other researchers' applied and computational work on functional data analysis and smooth modelling, and the big model/data methods are likely to be taken up by researchers and used in other modelling frameworks. |
Sectors | Aerospace, Defence and Marine; Agriculture, Food and Drink; Digital/Communication/Information Technologies (including Software); Education; Energy; Environment; Financial Services, and Management Consultancy; Healthcare; Government, Democracy and Justice; Pharmaceuticals and Medical Biotechnology; Other |
URL | https://cran.r-project.org/web/packages/mgcv/index.html |
Description | The developed methods are released as a recommended package distributed with the R statistical software. The methods are quite widely used in environmental science, resource management, epidemiology, medical research, finance and economics, for example. One particular example is the energy company EDF's use of the methods for energy use prediction. Another is the Farmers Business Network in the USA, which uses the methods to help optimize farm production on 7000 farms. |
Sector | Agriculture, Food and Drink; Energy; Environment; Other |
Impact Types | Societal; Economic |
Description | EPSRC CDT |
Amount | £3,771,473 (GBP) |
Funding ID | EP/L015684/1 |
Organisation | Engineering and Physical Sciences Research Council (EPSRC) |
Sector | Public |
Country | United Kingdom |
Start | 04/2014 |
End | 11/2022 |
Description | Industrial collaboration with EDF research |
Amount | € 150,000 (EUR) |
Organisation | EDF Energy |
Sector | Private |
Country | United Kingdom |
Start | 06/2015 |
End | 07/2018 |
Description | Methods underpinning electricity demand modelling |
Organisation | EDF Energy |
Department | EDF Innovation and Research |
Country | France |
Sector | Private |
PI Contribution | Developing large data GAM methods |
Collaborator Contribution | Problem setting, data provision, discussion and ideas. |
Impact | Wood, Goude and Shaw (2014) Generalized additive models for large datasets. Journal of the Royal Statistical Society (C) online early. |
Start Year | 2009 |
Title | mgcv 1.8-0 |
Description | Major upgrade of the R (see cran.r-project.org) software package for generalized additive modelling, providing new methods for generalized additive modelling beyond simple exponential family distributions. Based on new statistical computing methods, the software now provides ordered categorical, beta, negative binomial, Tweedie, zero-inflated Poisson and scaled t distributions, as well as Cox proportional hazards additive models, multivariate normal additive models and scale location additive models. Parallel computation of the leading order computations via OpenMP has also been implemented. |
Type Of Technology | Software |
Year Produced | 2014 |
Open Source License? | Yes |
Impact | This software package is one of a handful of 'recommended' packages supplied with the base distribution of R. Previous versions have been widely used across a range of applications, particularly in ecology and natural resource management, as well as medicine, epidemiology and economics. For example, the energy company EDF uses the methods for electricity load forecasting. In addition, the package is currently used by 102 other software packages for R. For example, the underlying fitting methods are sufficiently general that they can be efficiently leveraged for functional data analysis, as in the 'refund' package. |
URL | http://cran.r-project.org/web/packages/mgcv/ |
Title | mgcv 1.8-12 |
Description | Recommended R package for generalized additive modelling. Major upgrades are: 1. Addition of scalable parallel fitting methods, allowing models with 10^8 observations or more and 10^4 coefficients or more to be estimated on relatively modest workstations or servers in minutes to hours, rather than days to weeks. 2. Interface with JAGS for Bayesian stochastic simulation with GAMs. 3. Addition of Gaussian process smoothers and B-spline smoothers with derivative penalties of arbitrary order. |
Type Of Technology | Software |
Year Produced | 2015 |
Open Source License? | Yes |
Impact | Used for energy load prediction problems by EDF (France). Collaboration with EDF to improve methods for this purpose. |
URL | https://cran.r-project.org/web/packages/mgcv/index.html |
Title | mgcv 1.8-16 |
Description | Recommended R package for additive smooth models and extensions. This version adds further big data methods and additional model classes, plus a new smoothing parameter estimation method that offers one route to infill-free sparse computation. |
Type Of Technology | Software |
Year Produced | 2016 |
Open Source License? | Yes |
Impact | The software has quite wide uptake amongst academic and non-academic statisticians. From email and short course contact it is clear that the new methods are being used, which is unsurprising as they were driven by users' scientific needs. |
URL | https://cran.r-project.org/web/packages/mgcv/index.html |
Description | Broadcast radio interview and round table discussion on big data for Radio Silencio (Spain) |
Form Of Engagement Activity | A broadcast e.g. TV/radio/film/podcast (other than news/press) |
Part Of Official Scheme? | No |
Geographic Reach | Regional |
Primary Audience | Media (as a channel to the public) |
Results and Impact | As part of a big data workshop in Barcelona I was interviewed for a Catalan radio science program... https://drive.google.com/open?id=0B4Kqe49544LbYVFnMXFMVmJfRUE ... and took part in a round table discussion on big data issues... https://www.dropbox.com/s/py54qtaohljv2ll/ROUND_TABLE_large.m4v?dl=0 - these have been broadcast, but I do not know how large an audience they reached. |
Year(s) Of Engagement Activity | 2015 |
Description | GAM 3 day course, ASA Alaska |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Short course on smooth modelling for Alaska chapter of American Statistical Association. Participants work primarily in natural resource management, and will apply the models in this context. |
Year(s) Of Engagement Activity | 2015 |
URL | http://community.amstat.org/alaskachapter/meetings/2015/annualmeeting2015 |
Description | GAM course in Zurich R course series |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Two-day course on smoothing, additive models etc., as part of the Zurich professional development course programme. 10 participants from Switzerland, Italy and the UK, from universities and industry. |
Year(s) Of Engagement Activity | 2016 |
URL | http://www.zhrcourses.uzh.ch/programm/gen-additive-models_en.html |
Description | Smoothing short course University of Graz |
Form Of Engagement Activity | Participation in an activity, workshop or similar |
Part Of Official Scheme? | No |
Geographic Reach | International |
Primary Audience | Professional Practitioners |
Results and Impact | Two day course on smoothing and additive models at the University of Graz, Austria. A mixture of academic, consulting and industry statisticians and postgraduate students. |
Year(s) Of Engagement Activity | 2015 |