PrivInfer - Programming Languages for Differential Privacy: Conditioning and Inference

Lead Research Organisation: University of Dundee
Department Name: Computing

Abstract

An enormous amount of data about individuals is collected every day.
These data could be very valuable for scientific and medical
research or for business targeting. Unfortunately, privacy concerns
restrict the way this huge amount of information can be used and
released. Several techniques have been proposed with the aim of
making the data anonymous. However, these techniques lose their
effectiveness when attackers can exploit additional knowledge.

Differential privacy is a promising approach to the privacy-preserving
release of data: it offers a strong guaranteed bound on the increase
in harm that a user incurs as a result of participating in a
differentially private data analysis, even under worst-case
assumptions.
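
The abstract does not spell out the guarantee formally. For reference, the standard definition of epsilon-differential privacy from the literature can be written as follows; the symbols M (the randomized analysis), D and D' (two databases differing in one individual's data), S (a set of possible outputs) and epsilon (the privacy parameter) are notation introduced here, not taken from the project text.

```latex
% Standard definition, stated with notation assumed for this note:
% M is a randomized mechanism, D and D' are adjacent databases
% (they differ in the data of a single individual).
\[
  \Pr[\,M(D) \in S\,] \;\le\; e^{\varepsilon} \cdot \Pr[\,M(D') \in S\,]
  \qquad \text{for all adjacent } D, D' \text{ and all measurable } S .
\]
```

A smaller epsilon means that the output distribution changes less when one individual's data is added or removed, which is the sense in which the increase in harm is bounded.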

A standard way to ensure differential privacy is by adding some
statistical noise to the result of a data analysis. Differentially
private mechanisms have been proposed for a wide range of interesting
problems like statistical analysis, combinatorial optimization,
machine learning, distributed computations, etc. Moreover, several
programming language verification tools have been proposed with the
goal of assisting a programmer in checking whether a given program is
differentially private or not.
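
As an illustration of the noise-adding approach described above, here is a minimal sketch of the Laplace mechanism applied to a counting query. It is not code from the project: the function names `laplace_noise` and `private_count`, the example data, and the choice of a counting query (which has sensitivity 1) are assumptions made for this example.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) noise.

    The difference of two independent exponential variables with
    mean `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Return a noisy count of the records satisfying `predicate`.

    Adding or removing one record changes a count by at most 1
    (sensitivity 1), so Laplace noise with scale 1/epsilon gives
    epsilon-differential privacy for this query.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    # Hypothetical data: ages of five individuals.
    ages = [23, 35, 41, 58, 62]
    rng = random.Random()
    print(private_count(ages, lambda age: age >= 40, epsilon=0.5, rng=rng))
```

Each call returns a different value centred on the true count of 3; a smaller epsilon gives stronger privacy at the cost of noisier answers.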

These tools have proved successful in checking differentially
private programs that use standard mechanisms. However, they offer only
limited support for reasoning about differential privacy when it is
obtained using non-standard mechanisms. One limitation comes from the
simplified probabilistic models that are built into those tools. In
particular, these simplified models provide no support (or only very
limited support) for reasoning about explicit conditional
distributions and probabilistic inference. From the verification
point of view, dealing with explicit conditional distributions is
difficult because it requires finding a manageable representation, in
the internal logic of the verification tool, of events and probability
measures. Moreover, it requires a set of primitives to handle them
efficiently.
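
To make "explicit conditional distributions" and "primitives to handle them" concrete, the following toy sketch represents a finite probability distribution as a dictionary and provides a `condition` primitive. It is only an illustration under simplifying assumptions (finite outcome spaces, events given as Python predicates); the names `uniform`, `probability` and `condition` are hypothetical, and a verification tool would need a symbolic representation rather than an enumerated one.

```python
from fractions import Fraction
from typing import Callable, Dict, Hashable, Iterable

# A finite distribution: each outcome mapped to its exact probability.
Dist = Dict[Hashable, Fraction]
Event = Callable[[Hashable], bool]


def uniform(outcomes: Iterable[Hashable]) -> Dist:
    """Uniform distribution over distinct outcomes."""
    items = list(outcomes)
    p = Fraction(1, len(items))
    return {o: p for o in items}


def probability(dist: Dist, event: Event) -> Fraction:
    """Total probability mass of the outcomes in `event`."""
    return sum((p for o, p in dist.items() if event(o)), Fraction(0))


def condition(dist: Dist, event: Event) -> Dist:
    """Conditional distribution given `event`: restrict, then renormalize."""
    mass = probability(dist, event)
    if mass == 0:
        raise ValueError("cannot condition on an event of probability zero")
    return {o: p / mass for o, p in dist.items() if event(o)}


# Two fair dice, conditioned on the observation that they sum to 8.
dice = uniform((a, b) for a in range(1, 7) for b in range(1, 7))
posterior = condition(dice, lambda o: o[0] + o[1] == 8)

if __name__ == "__main__":
    # Probability that the first die shows 6, given that the sum is 8.
    print(probability(posterior, lambda o: o[0] == 6))
```

Even in this small setting the primitive has to deal with events of probability zero, which hints at why continuous distributions and symbolic reasoning make the general problem harder.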


In this project we aim to overcome these limitations by extending
the scope of verification tools for differential privacy to support
explicit reasoning about conditional distributions and probabilistic
inference. Support for conditional distributions and probabilistic
inference is crucial for reasoning about machine learning
algorithms, which are essential tools for efficient and
accurate analysis of massive collections of data. The goal of
the project is therefore to provide a novel programming language technology
for enhancing privacy-preserving data analysis based on machine learning.

Planned Impact

KNOWLEDGE
This proposal aims to design innovative programming language techniques for differentially private data analysis. The project's direct impact on knowledge will be on the theory and practice of differential privacy and programming languages. Owing to its multidisciplinary nature, the project will also influence the work of several academic communities, such as those working on differential privacy, probabilistic programming languages, verification, and data analysis. More generally, the project will help improve understanding of how viable technological tools for private data analysis are as a solution to societal privacy problems.

PEOPLE
This project will be an important opportunity for the PI to develop new skills and further consolidate his scientific leadership and international standing in differential privacy and programming languages. Moreover, the project will foster the exchange of knowledge between the PI and internationally recognized researchers in the UK and overseas. This will also lay the basis for further cooperation between the partners, with the potential to build an EU network of researchers around the theme of differential privacy.
The project will also contribute to the development of new skills for the hired RA. Moreover, the focused multidisciplinary setting of this project will help the RA to significantly enhance his or her career.
Finally, the project will boost the research environment at the School of Computing in Dundee, in particular by increasing collaboration between the Computer Vision and Machine Learning group and the Theory of Computing group. The direct recipients of this boost will be the PI's externally funded PhD, MSc, and Honours students.

ECONOMY
A large part of the interest in big data analysis comes from the important economic opportunities that these huge collections of data offer. Our project does not aim directly at producing techniques for improving the understanding of the data; instead, it aims to provide tools that make it possible to release and share information about the data without risk to individuals' privacy. As such, the research agenda set by this project can benefit several aspects of the UK economy and industry. More specifically, the technology developed in this project will have direct industrial applications in the context of the Tabular project developed at Microsoft Research in Cambridge. Tabular aims to bring probabilistic machine learning techniques to Microsoft Excel's end users and has enormous potential for information extraction. We do not expect the results of our work to be integrated into Microsoft Excel in the short term; nevertheless, we expect the interaction with the Tabular team to be fruitful for identifying the industrial needs that our research agenda can fulfill.

SOCIETY
Developing tools for understanding the information contained in the data that are collected every day is of the greatest importance for society. The main focus of this project is on the design and development of privacy-preserving data analyses useful for this goal. Data analyses that do not damage the privacy of individuals would help remove one of the biggest barriers to sharing and releasing the information contained in the data. Several recent national and international projects aim to make important collections of data available for research purposes, with the goal of improving quality of life. Two examples are the "National database of pupils" and the "NHS safe havens" project. Access to these collections is limited to trusted researchers, but the discussion on their public access is open. In the long term, we expect that the foundational contributions resulting from the project will provide technology useful for removing the barriers to the public release of useful data.
