Sublinear Algorithms for Approximating Probability Distributions
Lead Research Organisation:
University of Edinburgh
Department Name: Sch of Informatics
Abstract
The goal of this proposal is to advance a research program that develops
sublinear-time algorithms for estimating a wide range of natural and
important classes of probability distributions.
We live in an era of "big data," where the amount of data that can be brought to bear
on questions of biology, climate, economics, and other fields is vast and expanding rapidly.
Much of this raw data consists of example points without corresponding labels.
The challenge of making sense of this unlabeled data has immediate relevance
and has rapidly become a bottleneck in scientific understanding across many disciplines.
An important class of big data is most naturally modeled as samples from a probability
distribution over a very large domain. The difficulty is that the sizes
of these domains are immense, typically resulting in unacceptably
slow algorithms. Scaling a computational framework to handle
ever-larger data presents a series of algorithmic challenges.
This prompts the basic question: Given samples from some unknown distribution, what can we infer?
While this question has been studied for several decades by different research communities,
both the number of samples and running time required for such estimation tasks
are not yet well understood, even for some surprisingly simple types of discrete distributions.
The proposed research focuses on sublinear-time algorithms, that is,
algorithms that run in time significantly smaller than the size of the domain of the underlying distribution.
In this project we will develop sublinear-time algorithms for estimating various classes
of discrete distributions over very large domains.
Specific problems we will address include:
(1) Developing sublinear algorithms to estimate probability distributions that satisfy various
natural types of "shape restrictions" on the underlying probability density function.
(2) Developing sublinear algorithms for estimating complex distributions that result
from the aggregation of many independent simple sources of randomness.
We believe that highly efficient algorithms for these estimation tasks
may play an important role for the next generation of large-scale machine learning applications.
Planned Impact
The proposed research focuses on a very active and practically relevant
research area of designing provably efficient algorithms
for a paradigmatic class of unsupervised learning problems.
The high-level intellectual contribution of the proposed research is to develop
new inference algorithms for natural classes of distributions,
and to study new learning models and problems.
We propose to settle some fundamental questions in the area and
through these achievements to ultimately construct a theory unifying different aspects of the field.
Since the proposed research concerns fundamental algorithmic questions in unsupervised learning, the results are likely
to be of broad interest to researchers in algorithms and complexity, as well as in related fields such as probability, statistics,
and machine learning; we expect the insights that emerge from the proposed research to benefit researchers in those fields as well.
Hence, research in this area has the potential to strengthen existing connections between computational
learning theory, property testing and related fields such as statistics and applied probability.
In addition, this project will strengthen U.K. collaborations and contribute to the broad aims of improving quality, promoting mobility, and increasing the attractiveness of U.K. universities to overseas students, international employers, and graduates. Publishing the results in high-quality scientific conferences and journals, together with oral and poster presentations at international conferences, will attract increased interest in the U.K. and strengthen its competitiveness, with new investment in research and innovation as a direct consequence.
From a practical perspective, a focus on unsupervised estimation problems is currently both timely and well-motivated.
As is well known, many fields are now experiencing a "data deluge": from biology to sociology, incredible amounts
of empirical data are becoming available at a scale and pace that would have been unimaginable not long ago.
Since this raw data is frequently unlabeled, consisting simply of example points (DNA sequences, smartphone user locations, sensor readings, etc.), it is increasingly important to have effective algorithms for learning from unsupervised data. The PI believes that provably correct, highly efficient algorithms for distribution estimation may play an important role as a
"computational substrate" for the next generation of large-scale machine learning applications based on unsupervised data.
The mid-term non-academic beneficiaries of this research will be high-tech companies that need to process
massive amounts of data (e.g., IBM, Google, Yahoo). Randomness and probability distributions are
an inherent feature of many of the mathematical models used in business optimization solutions (with our project
partner, IBM, being a notable example). Efficient algorithms for inference tasks such as the ones considered
in the current proposal can potentially yield resource-efficient algorithms for real-world problems.
In the long run we believe that the benefits will shift to the customers of such companies
(the end users), who will make more efficient, effective use of their computing resources.
We plan to interact with industry from the beginning of the project, in particular
through our project partner, the IBM T.J. Watson Research Center.
People | Ilias Diakonikolas (Principal Investigator) |
Publications
Acharya J
(2016)
Fast Algorithms for Segmented Regression
Acharya J
(2015)
Sample-Optimal Density Estimation in Nearly-Linear Time
Canonne C
(2015)
Testing Shape Restrictions of Discrete Distributions
Canonne C
(2016)
Testing Shape Restrictions of Discrete Distributions
Canonne C
(2017)
Fourier-Based Testing for Families of Distributions
Chen X
(2022)
On the Complexity of Optimal Lottery Pricing and Randomized Mechanisms for a Unit-Demand Buyer
in SIAM Journal on Computing
De A
(2015)
Learning from satisfying assignments
Diakonikolas I
(2019)
Robust Estimators in High-Dimensions Without the Computational Intractability
in SIAM Journal on Computing
Diakonikolas I
(2015)
Optimal Algorithms and Lower Bounds for Testing Closeness of Structured Distributions
Diakonikolas I
(2014)
Testing Identity of Structured Distributions
Diakonikolas I
(2016)
A New Approach for Testing Properties of Discrete Distributions
Diakonikolas I
(2015)
Differentially Private Learning of Structured Discrete Distributions
Diakonikolas I
(2016)
Efficient Robust Proper Learning of Log-concave Distributions
Diakonikolas I
(2016)
Robust Estimators in High Dimensions without the Computational Intractability
Diakonikolas I
(2015)
Properly Learning Poisson Binomial Distributions in Almost Polynomial Time
Diakonikolas I
(2015)
Testing Identity of Structured Distributions
Diakonikolas I.
(2016)
Properly Learning Poisson Binomial Distributions in Almost Polynomial Time
Diakonikolas I.
(2016)
Optimal learning via the Fourier transform for sums of independent integer random variables
in Journal of Machine Learning Research
Diakonikolas I.
(2015)
Differentially private learning of structured discrete distributions
in Advances in Neural Information Processing Systems
Description | The proposed research focused on a very active research area of designing provably efficient algorithms for a paradigmatic class of unsupervised learning problems. The key contribution of the proposed research was to develop new inference algorithms for natural structured probabilistic models. We settled some fundamental questions in the area and ultimately constructed a theory unifying different aspects of the field. |
Exploitation Route | Since the proposed research concerns fundamental algorithmic questions in unsupervised learning, the results are likely to be of broad interest to researchers in machine learning and algorithmic statistics. Efficient algorithms for inference tasks such as the ones considered in our proposal can potentially yield resource-efficient algorithms for real-world problems. Our collaboration with IBM (our project partner) has laid a foundation for further engagement with the company's systems-building work. |
Sectors | Digital/Communication/Information Technologies (including Software) |