Learning Highly Structured Sparse Latent Variable Models

Lead Research Organisation: University College London
Department Name: Statistical Science

Abstract

Technological advances have brought the ability to collect and analyse patterns in high-dimensional databases. One particular type of analysis concerns problems where the recorded variables indirectly measure hidden factors that explain away observed associations. For instance, the 2009 National NHS Staff Survey, taken by over one hundred thousand staff members, contained several questions on job satisfaction. It is only natural that the patterns of observed answers are the result of some common hidden factors that remain unrecorded. In particular, such answers could arguably be grouped by factors such as perceptions of the quality of work practice, support from colleagues and so on, that are only indirectly measured.

In practice, when making sense of a high-dimensional data source, it is useful to reduce the observations to a small number of common factors. Since records are affected by sources of variability that are unrelated to the actual factors (think of someone having a bad day, or even typing in wrong information by mistake), removing such artefacts is also part of the statistical problem. A model that estimates such transformations is said to perform "dimensionality reduction" and "smoothing".

There are a variety of methods for accomplishing such tasks. At one end of the spectrum are models that assume the data match some very simple patterns, such as bell curves and pre-determined factors. Others are very powerful, allowing for flexible patterns and even an infinite number of factors that are inferred from data under some very mild assumptions. The proposed work tries to bridge these extremes: the shortcomings of the very flexible models are subtle but important. In particular, they can be very sensitive to changes in the data, meaning that very different conclusions about the hidden factors might be reached if a slightly different set of observations is provided. There are also computational concerns: calculating the desired estimates usually requires an iterative process, one that needs some initial guess at these estimates. So, even for a fixed dataset, results can vary considerably if such an initial guess is not carefully chosen. Our motivation is that if one does have these concerns, one might as well take the trouble of incorporating domain knowledge. The upshot: we do not aim to be fully general, and instead target applications where some reasonable domain knowledge exists. In particular, we focus on problems where the hidden targets of interest are pre-specified, but infinitely many others might exist. While we map our data to a fixed space of hidden variables, we provide an approach that is robust to the presence of an unbounded number of other, implicit, common factors. The proposed models are adaptive: they account for possible extra variability between the given hidden factors that would be missed by simpler models, and at the same time they are designed to be less sensitive both to initial conditions and to small changes in the dataset.

Planned Impact

Institutions and research groups with high-dimensional data often find themselves with records that are either noisy measurements of some unobserved factors of interest, or heavily confounded by hidden common causes. In this case, estimating a low-dimensional representation of the data is useful for data summarisation and visualisation; as a smoothing device to estimate relationships between unobservable factors; and as an alternative representation for imputing missing values and providing features that can be used for prediction tasks and clustering.

In particular, applications that fit the assumptions of our proposal well are those where large surveys are collected, be it in government (the NHS, the Home Office and other departments), industry (marketing and employee surveys) or health services (questionnaires given to patients and staff). Another source of applications comes from natural scientists who have theories on how the data were generated (say, postulating hidden functional modules of the cell that cause gene expression levels) and want to understand the consequences of their assumptions. Moreover, in financial domains, a substantial number of variables correspond to entities such as assets (relatively easy to categorise as belonging to particular sectors of the economy) and market indicators (designed to measure theoretical market factors).

Essentially, we plan to change how data analysis is practised in domains where variables follow a natural and relatively simple partition, but one that is arguably not perfect and can be improved by allowing for residual associations due to infinitely many other factors. This natural partition is often implied by the way data collection was designed. For instance, a company interested in market segmentation may find it more useful (say, in a predictive sense, or by providing insights that translate more directly into policy making) to generate latent embeddings of its customers according to a pre-determined set of factors used in the very design of the questionnaire, as opposed to (say) having a potentially unbounded number of factors generated by a non-parametric model. However, since the analyst cannot anticipate the potentially infinite number of other implicit factors that explain the associations in the data, the model has to be robust to these other possible factors. The resulting pre-determined latent variables retain their interpretability, but can now fit the data better.

It is certainly the case that several applications do not fit this format, and there is definitely no shortage of very important domains where non-parametric models for latent structure should be the approach of choice. In any case, it is only sensible that we provide a choice: often, sophisticated statistical methods for dimensionality reduction are ignored by practitioners because they promise a level of automation and flexibility that simply is not there. The burden of providing information to the model is simply postponed to the stage of trying to make sense of the resulting factors. Where applicable, the philosophy of this proposal is to fit the data well while making the statistical model respect the goals of the analysis, rather than the other way around.

Quantifying the hidden factors that explain observed associations is a hard task, and one of the ultimate goals of science and data analysis in general. The proposal tackles the problem using advanced computational tools while attending to practical needs that are not explicitly addressed by the state of the art. The resulting work aims to show that a new level of results can be achieved if the science behind such applications is exploited to its full potential.

Publications

 
Description The act of data collection usually implies the simultaneous measurement of many aspects of a social or natural phenomenon. For instance, surveys are usually implemented through questionnaires that probe respondents (such as clients of a company, patients in a medical study or staff members in an organisation) from a variety of perspectives (such as questions on job satisfaction, relationships with co-workers, welfare and so on). It is not uncommon that, due to the very structure in which such data collection is organised, one should expect a few latent traits to explain the association among the recorded variables. However, there is a degree to which a model for such data should include other latent traits that were not foreseen by the data scientist modelling the phenomenon of interest. Misspecification of latent traits can result in a bad fit of a postulated model to the data, while a statistical approach that does not exploit the background knowledge of the problem can produce uninterpretable outcomes and be exceedingly computationally demanding.

Our first key finding is that some background knowledge about the main expected latent traits can go a long way. We developed an approach that starts with a given partition of the measurements according to what should be the main latent traits that explain the overall data. For instance, the data scientist can group the questions in a questionnaire according to the main aspect they are intended to measure (so that questions about satisfaction with job duties can be grouped in a different set from those about satisfaction with financial compensation). Some models can infer a single number summarising the relationship of each group with respect to the others, but this can be far too restrictive and waste some of the information in the data. Our method searches for potential new traits by identifying which associations implied by a partially built model do not match those supported by the empirical observations. It then iteratively "patches" a candidate structure with residual associations that correspond to new latent traits of very limited scope, hence retaining most of the interpretability of the original main traits. We show that this simple "partition-and-patch" recipe for breaking apart a large system of measurements can provide better-fitting models, while at the same time respecting the background knowledge of the data scientist and deviating from it only through very local modifications to the original specification.
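To make the flavour of the "partition-and-patch" recipe concrete, here is a minimal numpy sketch on synthetic data. It is an illustration only, under assumptions not in the record: a linear-Gaussian surrogate model, group means as crude factor scores, and a single patched pair; the project's actual estimators (which handle discrete data and iterate the search) are more sophisticated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 9 observed variables in 3 pre-specified groups, plus one
# "unforeseen" trait linking variables 2 and 5 across groups.
n, groups = 500, [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
F = rng.normal(size=(n, 3))                      # main latent traits
X = np.zeros((n, 9))
for g, idx in enumerate(groups):
    X[:, idx] = F[:, [g]] + 0.5 * rng.normal(size=(n, len(idx)))
extra = rng.normal(size=(n, 1))                  # hidden residual trait
X[:, [2, 5]] += 0.8 * extra

def residual_correlations(X, groups):
    """Strip each group's main trait (proxied here by the group mean of
    standardised items) and return correlations of what is left over."""
    Z = (X - X.mean(0)) / X.std(0)
    R = Z.copy()
    for idx in groups:
        f = Z[:, idx].mean(1, keepdims=True)     # crude factor score
        beta = (R[:, idx] * f).mean(0) / (f * f).mean()
        R[:, idx] -= f * beta                    # remove the main trait
    return np.corrcoef(R, rowvar=False)

# "Patch" step: flag the strongest leftover association as a candidate
# new latent trait of very limited scope.
C = residual_correlations(X, groups)
np.fill_diagonal(C, 0.0)
i, j = np.unravel_index(np.abs(C).argmax(), C.shape)
print(f"candidate local trait over variables ({i}, {j}), "
      f"residual correlation = {C[i, j]:.2f}")
```

On this synthetic example the flagged pair should be (2, 5), the pair driven by the trait the initial partition missed; a full method would refit the model with the new local trait and repeat.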
Our second finding is that we can avoid making assumptions about the nature of the individual measurements and instead model their relationships directly. This means that we can avoid assigning numerical meanings to graded answers in a questionnaire (for instance, which numbers should one assign to answers that vary from "Strongly agree" to "Strongly disagree"?). We developed advanced algorithms that work by directly comparing the ranks of the answers instead of their actual values, which reduces the number of assumptions required to draw conclusions in such cases.

Our third finding concerns the nature of some probabilistic models that are built by modelling a large system of variables in pieces. Each piece consists of a distribution over the values of a small number of variables, not unlike the idea of starting from a partition of the variables. The whole system has recently been defined in the literature as the product of such pieces, instead of as a seemingly complex system of latent traits. However, this poses complicated computational statistics problems: although the system as a whole is always well-defined, the product of pieces needs to be translated into a function that measures the probability of a data point. Although it is possible to do some sort of piecewise fitting, as we had originally planned, we realised we could do much more. We found that this translation process can also be cast as operations in a highly structured latent variable system, but one with a very different computational structure. This unifies two schools of model construction in statistics, and allows them to exchange ideas for data-fitting algorithms and languages for expressing more flexible probability distributions.
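The "translation" problem in the third finding can be illustrated with a toy product-of-pieces model. In the sketch below, each piece is a bivariate Gaussian CDF over a pair of variables, their product acts as a joint CDF, and the probability of an observed binary pattern is recovered by an alternating (inclusion-exclusion) sum over the corners of the corresponding rectangle. This is a hedged sketch of the general idea only, not the project's algorithm; the chain of pairs, the Gaussian pieces, the zero thresholds and the truncation of infinity at ±8 are all assumptions made here for illustration.

```python
import itertools
import numpy as np
from scipy.stats import multivariate_normal

# Pieces: bivariate Gaussian CDFs over overlapping pairs of variables.
pairs = [(0, 1), (1, 2), (2, 3)]
rho = 0.6
piece = multivariate_normal(mean=[0.0, 0.0],
                            cov=[[1.0, rho], [rho, 1.0]])

def G(u):
    """The whole system: a product of small pieces, behaving as a CDF."""
    return np.prod([piece.cdf([u[i], u[j]]) for i, j in pairs])

def prob(x, big=8.0):
    """Translate the CDF product into P(X = x) for a binary pattern x,
    treating X_i = 1 as 'latent coordinate i exceeds 0'. The point
    probability is an alternating sum of G over rectangle corners."""
    lows = [0.0 if xi == 1 else -big for xi in x]    # -big stands in for -inf
    highs = [big if xi == 1 else 0.0 for xi in x]    # +big stands in for +inf
    total = 0.0
    for corner in itertools.product(*zip(lows, highs)):
        sign = (-1) ** sum(c == lo for c, lo in zip(corner, lows))
        total += sign * G(corner)
    return total

# Sanity check: the probabilities of all 2^4 patterns should sum to ~1.
ps = [prob(x) for x in itertools.product([0, 1], repeat=4)]
print(f"sum over all patterns = {sum(ps):.4f}")
```

Even in this tiny example, evaluating one data point requires a sum over 2^4 corners, which hints at why recasting the translation as operations in a structured latent variable system is computationally interesting.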
Exploitation Route A main application, as mentioned above, is the modelling of social data, used as a tool to understand the validity of the measurements (as indicated by the residual structure identified by the approach, and by the new latent traits that can cast light on how appropriate the measurements were to begin with). This is of potential use not only to social scientists, but also in business contexts, where private surveys and marketing research can benefit from validating the measurement of the latent traits that summarise the results of a survey. Moreover, in industry at large, there is potential for exploiting the product-of-distributions framework in predictive modelling, given further research on how to adapt such models to streaming data setups and possibly network data.

From the methodological perspective, the tools developed here can provide a framework for other statistical approaches to model building that combine main effects with residual ones. Although other methods exist that combine low-rank structures with sparse matrices, it is not clear how those methods generalise to non-linear cases and non-Gaussian distributions. To tackle other non-Gaussian models besides the discrete models we have experimented with so far, it might be necessary to think carefully about which candidate structures to examine in order to simplify the problem, and this "partition-and-patch" setup might be an appropriate framework. Moreover, the findings on products of distributions have many potential extensions, for instance to problems where data points are associated through some temporal, spatial or network structure.

From the perspective of applications to other scientific domains, there are ongoing efforts to apply the technology developed here to molecular biology domains, where the variables correspond to measurements of metabolites (including gene expression measurements) that come in modular groups - but where the group structure alone cannot provide a good fit to the data without taking residual associations into account. Similarly, in survey analysis, if a latent variable model is needed, then one might exploit the possibility of fixing imperfections in the model through the adaptive model-building procedure developed here, without sacrificing much interpretability. These steps also help in understanding and assessing the validity of the measurements collected with respect to the target latent traits.
Sectors Communities and Social Services/Policy, Digital/Communication/Information Technologies (including Software), Education

 
Description NIPS 2013 Travel Award
Amount £480 (GBP)
Organisation Neural Information Processing Systems Foundation 
Sector Charity/Non Profit
Country United States
Start 12/2013 
End 12/2013
 
Description Travel support to attend the 12th Brazilian Meeting on Bayesian Statistics
Amount £800 (GBP)
Organisation International Society for Bayesian Analysis (ISBA) 
Department Brazilian Chapter
Sector Charity/Non Profit
Country Brazil
Start 03/2014 
End 03/2014