Missing data in Generalized Additive Models

Lead Research Organisation: University of Edinburgh
Department Name: Sch of Mathematics


Generalized additive models (GAMs) are generalized linear models in which the linear predictor is constructed from unknown smooth functions of predictor variables. As with all regression models, there are many practical applications of GAMs in which many predictor variable measurements are missing, so that the response observations with a full set of predictor variable measurements are only a small proportion of those with less complete measurements. In linear model settings a number of approaches have been proposed to deal with this missing predictor problem [2], but in the nonlinear GAM setting it is unclear which approach is best. This project will investigate the existing alternatives in the GAM context, with the aim of also developing new methods better suited to the GAM structure. There is the potential to incorporate the developed methods into the R default GAM modelling package 'mgcv' [3].
Missing predictor variable problems are ubiquitous in applied statistics. For example, medical or epidemiological data rarely have every covariate of interest recorded for every subject, while environmental and pollution data are usually missing some covariates for some sample locations. When large numbers of covariates are of interest, the number of 'complete data' cases can become quite small, so that analysis based only on complete cases becomes unreliable, even if the missingness is random enough to permit meaningful inference in principle. This is a substantial problem in applied statistics. A common approach to the issue is to attempt to 'fill in' the missing data with some sort of imputation procedure [3], while allowing for the associated uncertainty in subsequent analysis. But, implicitly or explicitly, the approaches for doing this often target rather simple regression model structures, rather than the flexible function-based regression offered by GAMs.
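To make the shrinking pool of complete cases concrete, the following is a minimal sketch under an illustrative assumption that is not part of the project description: each covariate is missing independently with the same probability. The function names and parameter values are hypothetical. Under that assumption the expected fraction of complete cases is (1 - p)^k for k covariates, and a small simulation can be checked against it.

```python
import random


def expected_complete_fraction(n_covariates, p_missing):
    """Expected fraction of rows with no missing covariate.

    Assumes (for illustration only) that each covariate is missing
    independently with probability p_missing.
    """
    return (1.0 - p_missing) ** n_covariates


def simulated_complete_fraction(n_rows, n_covariates, p_missing, seed=0):
    """Simulate independent missingness and count the complete rows."""
    rng = random.Random(seed)
    complete = 0
    for _ in range(n_rows):
        # A row is complete only if none of its covariates is missing.
        if all(rng.random() >= p_missing for _ in range(n_covariates)):
            complete += 1
    return complete / n_rows


if __name__ == "__main__":
    # Hypothetical settings: 10% missingness per covariate.
    for k in (2, 5, 10, 20):
        exact = expected_complete_fraction(k, 0.1)
        approx = simulated_complete_fraction(20000, k, 0.1)
        print(f"{k:2d} covariates: expected {exact:.3f}, simulated {approx:.3f}")
```

With 10% missingness per covariate and 10 covariates, only about 35% of rows are expected to be complete, which illustrates why complete-case analysis discards so much information as the number of covariates grows.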
The aim of this project is to investigate the relative merits of existing missing covariate methods, such as those covered in [3], in the context of GAMs, and to try to develop new methods targeting the GAM structure. In principle Bayesian stochastic simulation methods are particularly well suited to this missing data problem, but with the catch that the feasible direct simulation methods can be prohibitively expensive in many large data settings where the methods might be most useful. The project will therefore also examine Bayesian approaches, with the particular aim of looking for efficient computational methods for this setting.
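A generic Bayesian formulation helps show why simulation suits the problem and why it can be costly. The notation below is assumed for illustration and is not taken from the project: $y$ is the response, $x_{\mathrm{obs}}$ and $x_{\mathrm{mis}}$ are the observed and missing covariate values, $\theta$ collects the GAM coefficients and smoothing parameters, and $\phi$ parameterises an assumed model for the covariates. Assuming the missingness mechanism is ignorable, the joint posterior can be sketched as:

```latex
p(\theta, \phi, x_{\mathrm{mis}} \mid y, x_{\mathrm{obs}})
  \;\propto\;
  p(y \mid x_{\mathrm{obs}}, x_{\mathrm{mis}}, \theta)\,
  p(x_{\mathrm{obs}}, x_{\mathrm{mis}} \mid \phi)\,
  p(\theta)\, p(\phi)
```

Every missing covariate value is treated as an additional unknown to be sampled alongside the model parameters, so the dimension of the simulation grows with the amount of missing data.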



Studentship Projects

Project Reference  Relationship  Related To    Start       End         Student Name
EP/V520251/1                                   30/09/2020  31/10/2025
2588137            Studentship   EP/V520251/1  31/08/2020  30/07/2022  Zhendong Lin