Sublinear Algorithms for Approximating Probability Distributions

Lead Research Organisation: University of Edinburgh
Department Name: Sch of Informatics

Abstract

The goal of this proposal is to advance a research program of developing
sublinear-time algorithms for estimating a wide range of natural and
important classes of probability distributions.

We live in an era of "big data," where the amount of data that can be brought to bear
on questions of biology, climate, economics, etc., is vast and expanding rapidly.
Much of this raw data consists of example points without corresponding labels.
The challenge of how to make sense of this unlabeled data has immediate relevance
and has rapidly become a bottleneck in scientific understanding across many disciplines.

An important class of big data is most naturally modeled as samples from a probability
distribution over a very large domain. The challenge of big data is that the sizes
of the domains of these distributions are immense, so that algorithms whose running time
scales with the domain size become unacceptably slow. Scaling a computational framework up
to deal comfortably with ever-larger data thus presents a series of algorithmic challenges.

This prompts the basic question: Given samples from some unknown distribution, what can we infer?
While this question has been studied for several decades by different communities of researchers,
both the number of samples and the running time required for such estimation tasks
are not yet well understood, even for some surprisingly simple types of discrete distributions.

The proposed research focuses on sublinear-time algorithms, that is,
algorithms whose running time is significantly smaller than the size of the domain of the underlying distributions.
In this project we will develop sublinear-time algorithms for estimating various classes
of discrete distributions over very large domains.
Specific problems we will address include:
(1) Developing sublinear algorithms to estimate probability distributions that satisfy various
natural types of "shape restrictions" on the underlying probability density function
(see the first sketch below).

(2) Developing sublinear algorithms for estimating complex distributions that result
from the aggregation of many independent simple sources of randomness (see the second sketch below).
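
To make problem (1) concrete, here is a minimal, purely illustrative sketch of one standard approach to shape-restricted estimation (an assumption on our part, not a description of the proposal's algorithms): for a distribution assumed to be monotone (non-increasing) over a huge domain {0, ..., n-1}, a Birgé-style partition into O(log(n)/eps) geometrically growing intervals yields a piecewise-constant hypothesis. All function names and parameter choices below are hypothetical.

```python
import numpy as np

def birge_intervals(n, eps):
    """Oblivious partition of {0, ..., n-1} into O(log(n)/eps) intervals of
    geometrically growing length (a standard tool for monotone distributions)."""
    intervals, start, length = [], 0, 1.0
    while start < n:
        end = min(start + max(1, int(length)), n)
        intervals.append((start, end))
        start = end
        length *= (1.0 + eps)
    return intervals

def learn_monotone(samples, n, eps):
    """Piecewise-constant hypothesis: empirical mass of each interval, spread
    uniformly inside it.  The representation uses O(log(n)/eps) numbers,
    independent of the domain size n."""
    intervals = birge_intervals(n, eps)
    edges = np.array([a for a, _ in intervals] + [n])
    counts, _ = np.histogram(samples, bins=edges)
    masses = counts / len(samples)
    return intervals, masses

if __name__ == "__main__":
    n = 10**6                                   # huge domain
    rng = np.random.default_rng(0)
    # toy monotone source: a truncated geometric distribution over {0, ..., n-1}
    samples = np.minimum(rng.geometric(1e-4, size=5000) - 1, n - 1)
    intervals, masses = learn_monotone(samples, n, eps=0.1)
    print(len(intervals), "intervals; mass of first interval:", masses[0])
```

The hypothesis is returned in this compressed form, so the representation size, and the work done after drawing the samples, grows with log(n) rather than with the domain size n.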

We believe that highly efficient algorithms for these estimation tasks
may play an important role in the next generation of large-scale machine learning applications.
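
Problem (2) covers, for example, sums of many independent Bernoulli random variables (Poisson binomial distributions). As a hedged illustration of why such aggregates admit very succinct estimates, the toy sketch below fits a single Binomial(k, q) to samples from such a sum by matching the empirical mean and variance; the fitted hypothesis consists of just two numbers, no matter how many sources contribute. This moment-matching heuristic is an illustrative stand-in, not one of the refined algorithms the project targets, and the names below are hypothetical.

```python
import numpy as np

def fit_binomial_by_moments(samples):
    """Toy moment matching: fit Binomial(k, q) to samples from an unknown sum of
    independent Bernoullis.  For a Binomial(k, q), mean = k*q and
    variance = k*q*(1-q), so q = 1 - var/mean and k = mean/q."""
    mu, var = np.mean(samples), np.var(samples)
    if mu <= 0:
        return 1, 0.0
    q = min(max(1.0 - var / mu, 1e-9), 1.0)
    k = max(int(round(mu / q)), 1)
    return k, q

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    biases = rng.uniform(0.0, 0.2, size=1000)           # hidden Bernoulli biases
    samples = (rng.random((5000, biases.size)) < biases).sum(axis=1)
    k, q = fit_binomial_by_moments(samples)
    print("fitted Binomial(k, q):", k, round(q, 3), "| true mean:", round(biases.sum(), 1))
```

Because a Poisson binomial distribution has variance at most its mean, the fitted q always lands in (0, 1], so the sketch is well defined whenever the samples are not all zero.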

Planned Impact

The proposed research focuses on a very active and practically relevant
research area: designing provably efficient algorithms
for a paradigmatic class of unsupervised learning problems.
The high-level intellectual contribution of the proposed research is to develop
new inference algorithms for natural classes of distributions,
and to study new learning models and problems.
We propose to settle some fundamental questions in the area and,
through these results, ultimately to construct a theory unifying different aspects of the field.
Since the proposed research concerns fundamental algorithmic questions in unsupervised learning, the results are likely
to be of broad interest to researchers in algorithms and complexity, as well as in related fields such as probability, statistics,
and machine learning; we expect the insights that emerge to benefit researchers in those fields as well.
Hence, research in this area has the potential to strengthen existing connections between computational
learning theory, property testing and related fields such as statistics and applied probability.

In addition, this project will strengthen U.K. collaborations and will make a strong contribution to the broad aims of improving research quality, promoting mobility, and increasing the attractiveness of U.K. universities to overseas students, international employers, and graduates. Publishing the results in high-quality scientific conferences and journals, together with oral and poster presentations at international conferences, will raise the visibility and competitiveness of U.K. research in this area, with enhanced international interest and new investment in research and innovation as direct consequences.

From a practical perspective, a focus on unsupervised estimation problems is currently both timely and well-motivated.
As is well known, many fields are now experiencing a "data deluge": from biology to sociology, incredible amounts
of empirical data are becoming available at a scale and pace that would have been unimaginable not long ago.
Since this raw data is frequently unlabeled, consisting simply of example points (DNA sequences, smartphone user locations, sensor readings, etc.) without corresponding labels, effective algorithms for learning from such unsupervised data are increasingly important. The PI believes that provably correct, highly efficient algorithms for distribution estimation may play an important role as a
"computational substrate" for the next generation of large-scale machine learning applications based on unsupervised data.

The mid-term non-academic beneficiaries of this research will be high-tech companies that need to process
massive amounts of data (e.g., IBM, Google, Yahoo). Randomness and probability distributions are
an inherent feature of many of the mathematical models used in business optimization solutions (with our project
partner, IBM, being a notable example). Efficient algorithms for inference tasks such as the ones considered
in the current proposal can potentially lead to resource-efficient algorithms for real-world problems.
In the long run we believe that the benefits will shift to the customers of such companies
(the end users), who will make more efficient, effective use of their computing resources.
We plan to interact with industry from the beginning of the project, in particular
through our project partner, the IBM T.J. Watson Research Center.
 
Description The research focused on a very active area: designing provably efficient algorithms for a paradigmatic class of unsupervised learning problems.
Its key contribution was to develop
new inference algorithms for natural structured probabilistic models. We settled some fundamental questions in the area and ultimately constructed a theory unifying different aspects of the field.
Exploitation Route Since the proposed research concerns fundamental algorithmic questions in unsupervised learning, the results are likely
to be of broad interest to researchers in machine learning and algorithmic statistics.

Efficient algorithms for inference tasks such as the ones considered in our proposal can potentially lead
to resource-efficient algorithms for real-world problems. Our collaboration with IBM (project partner) has laid the foundation for potential engagement with the company's systems-building work.
Sectors Digital/Communication/Information Technologies (including Software)