Nuit Blanche: Generalization in Adaptive Data Analysis and Holdout Reuse

Thursday, October 15, 2015

Generalization in Adaptive Data Analysis and Holdout Reuse - part 2 -

I have updated the previous entry with the following text, but it might be new to some, hence the reposting. Here is the relevant additional information:

Andy has a blog post on the use of this holdout reuse technique, "Holdout reuse", with attendant Reddit comments.
John Mount wrote on the Win-Vector blog about "Using differential privacy to reuse training data".
Nina Zumel, A Simpler Explanation of Differential Privacy
Moritz's slides
Reddit comments on the Google Blog entry.

Also, in the comment section, what looks like one of the authors mentioned the following:

Note that the NIPS 2015 paper relies on results from two papers by other authors that came after the STOC 2015 paper:
Bassily et al: http://arxiv.org/abs/1503.04843
Nissim and Stemmer: http://arxiv.org/abs/1504.05800

Those two papers provide tighter generalization guarantees than the STOC 2015 paper. The NIPS 2015 and Science papers rely on those intermediate results.

h/t Charles and Giuseppe and WinVector on Twitter

The two papers are:

More General Queries and Less Generalization Error in Adaptive Data Analysis
Raef Bassily, Adam Smith, Thomas Steinke, Jonathan Ullman

Adaptivity is an important feature of data analysis---typically the choice of questions asked about a dataset depends on previous interactions with the same dataset. However, generalization error is typically bounded in a non-adaptive model, where all questions are specified before the dataset is drawn. Recent work by Dwork et al. (STOC '15) and Hardt and Ullman (FOCS '14) initiated the formal study of this problem, and gave the first upper and lower bounds on the achievable generalization error for adaptive data analysis.
Specifically, suppose there is an unknown distribution P and a set of n independent samples x is drawn from P. We seek an algorithm that, given x as input, "accurately" answers a sequence of adaptively chosen "queries" about the unknown distribution P. How many samples n must we draw from the distribution, as a function of the type of queries, the number of queries, and the desired level of accuracy?
In this work we make two new contributions towards resolving this question:
* We give upper bounds on the number of samples n that are needed to answer statistical queries that improve over the bounds of Dwork et al.
* We prove the first upper bounds on the number of samples required to answer more general families of queries. These include arbitrary low-sensitivity queries and the important class of convex risk minimization queries.
As in Dwork et al., our algorithms are based on a connection between differential privacy and generalization error, but we feel that our analysis is simpler and more modular, which may be useful for studying these questions in the future.
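
To make the adaptivity problem described in this abstract concrete, here is a minimal simulation sketch. It is not taken from either paper; the setup (random ±1 features, labels independent of the features, and the names `draw`, `mean_query` and `run`) is an assumption chosen purely for illustration. Every query has true answer 0 under the distribution, yet a query built from the answers to earlier queries on the same sample looks strongly non-zero on that sample.

```python
import random

def draw(n, d, rng):
    """Draw n i.i.d. points: d uniform +/-1 features and an independent +/-1 label."""
    xs = [[rng.choice((-1, 1)) for _ in range(d)] for _ in range(n)]
    ys = [rng.choice((-1, 1)) for _ in range(n)]
    return xs, ys

def mean_query(xs, ys, q):
    """Empirical answer to a statistical query q(x, y) with values in [-1, 1]."""
    return sum(q(x, y) for x, y in zip(xs, ys)) / len(ys)

def run(n=200, d=400, n_fresh=2000, seed=0):
    rng = random.Random(seed)
    xs, ys = draw(n, d, rng)

    # Round 1: d non-adaptive queries q_j(x, y) = x_j * y (true answer: 0).
    answers = [mean_query(xs, ys, lambda x, y, j=j: x[j] * y) for j in range(d)]
    signs = [1 if a >= 0 else -1 for a in answers]

    # Round 2: one adaptive query that depends on the round-1 answers.
    # It asks how well a sign-weighted majority vote predicts the label.
    def adaptive(x, y):
        vote = sum(s * xj for s, xj in zip(signs, x))
        return y * (1 if vote >= 0 else -1)

    on_sample = mean_query(xs, ys, adaptive)   # reuses the same sample
    fx, fy = draw(n_fresh, d, rng)             # fresh data from the same P
    on_fresh = mean_query(fx, fy, adaptive)
    return on_sample, on_fresh

if __name__ == "__main__":
    on_sample, on_fresh = run()
    print(f"adaptive query on the reused sample: {on_sample:+.3f}")
    print(f"same query on fresh data:            {on_fresh:+.3f}")
```

The gap between the two printed numbers is the generalization error that the sample-complexity bounds above aim to control.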

On the Generalization Properties of Differential Privacy
Kobbi Nissim, Uri Stemmer

A new line of work, started with Dwork et al., studies the task of answering statistical queries using a sample and relates the problem to the concept of differential privacy. By the Hoeffding bound, a sample of size O(log k/α^2) suffices to answer k non-adaptive queries within error α, where the answers are computed by evaluating the statistical queries on the sample. This argument fails when the queries are chosen adaptively (and can hence depend on the sample). Dwork et al. showed that if the answers are computed with (ϵ,δ)-differential privacy then O(ϵ) accuracy is guaranteed with probability 1−O(δ^ϵ). Using the Private Multiplicative Weights mechanism, they concluded that the sample size can still grow polylogarithmically with k.
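
As a worked illustration of the non-adaptive Hoeffding argument in the paragraph above, the sketch below computes an explicit sample size. It assumes queries with values in [0, 1], the two-sided Hoeffding inequality, and a union bound over the k queries; the failure probability `beta` and the function name `hoeffding_sample_size` are introduced here for illustration and do not come from the abstract.

```python
import math

def hoeffding_sample_size(k: int, alpha: float, beta: float = 0.05) -> int:
    """Smallest n with 2 * k * exp(-2 * n * alpha**2) <= beta.

    For k fixed (non-adaptive) queries with values in [0, 1], this n makes
    every empirical answer lie within alpha of its true value with
    probability at least 1 - beta.
    """
    if k < 1 or not (0 < alpha < 1) or not (0 < beta < 1):
        raise ValueError("need k >= 1, 0 < alpha < 1 and 0 < beta < 1")
    return math.ceil(math.log(2 * k / beta) / (2 * alpha ** 2))

if __name__ == "__main__":
    # n grows only logarithmically in the number of queries k.
    for k in (1, 10, 1_000, 1_000_000):
        print(k, hoeffding_sample_size(k, alpha=0.1))
```

The guarantee holds only when the k queries are fixed before the sample is seen, which is exactly the assumption that adaptive analysis breaks.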
Very recently, Bassily et al. presented an improved bound and showed that (a variant of) the private multiplicative weights algorithm can answer k adaptively chosen statistical queries using sample complexity that grows logarithmically in k. However, their results no longer hold for every differentially private algorithm, and require modifying the private multiplicative weights algorithm in order to obtain their high probability bounds.
We greatly simplify the results of Dwork et al. and improve on the bound by showing that differential privacy guarantees O(ϵ) accuracy with probability 1−O(δ log(1/ϵ)/ϵ). It would be tempting to guess that an (ϵ,δ)-differentially private computation should guarantee O(ϵ) accuracy with probability 1−O(δ). However, we show that this is not the case, and that our bound is tight (up to logarithmic factors).
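
To get a feel for the gap between the failure probability stated in this abstract and the tempting guess of O(δ), the short sketch below evaluates δ log(1/ϵ)/ϵ for a few values. It assumes the constant hidden in the O(·) is 1 and that the logarithm is natural; both are illustrative assumptions, as is the function name `failure_bound`.

```python
import math

def failure_bound(eps: float, delta: float) -> float:
    """delta * log(1/eps) / eps, with the hidden O(.) constant taken as 1."""
    if not (0 < eps < 1) or not (0 < delta < 1):
        raise ValueError("need 0 < eps < 1 and 0 < delta < 1")
    return delta * math.log(1 / eps) / eps

if __name__ == "__main__":
    delta = 1e-6
    for eps in (0.5, 0.1, 0.01):
        bound = failure_bound(eps, delta)
        # The ratio bound / delta is the price paid over the naive guess.
        print(f"eps={eps}: bound={bound:.3e}, ratio to delta={bound / delta:.1f}")
```

Under these assumptions, the ratio to δ is log(1/ϵ)/ϵ, so the smaller the target accuracy ϵ, the smaller δ has to be for the same confidence.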