Thursday, October 15, 2015

Generalization in Adaptive Data Analysis and Holdout Reuse

[Updated with Andy's holdout-reuse and John Mount's blog entries at the very end of this post, h/t Charles, Giuseppe and WinVector on Twitter]

Over the years, people like Andrew Gelman have been leading the good fight of pointing out how some studies, particularly in the social sciences and other areas, are marred by issues stemming in part from "p-value hacking", i.e., the ability of poorly designed experiments and inference methods to pick an irrelevant set of explanatory variables (see also Freedman's paradox).

Within the generic context of compressive sensing, this is related, in part, to the underdetermined nature of systems of equations for which an infinite number of solutions exist. Compressive sensing generally focuses on the ability to find a solution that is simple enough (i.e., sparse) out of this infinite set of possibilities; noiseless phase transitions are a way to gauge our ability to identify those sparse solutions given a certain amount of sampling. In noisy situations, those phase transitions change and, sometimes, recoverability of the solution is no longer guaranteed. The inability to find a sparse solution could also rear its ugly head in an adversarial setting and/or in some adaptive compressive sensing situations.
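To make the underdetermined setting concrete, here is a toy sketch (plain NumPy; the problem sizes, the random sensing matrix, the regularization weight and the choice of iterative soft-thresholding are all my own illustrative choices, not tied to any particular paper) that recovers a 5-sparse vector of 200 unknowns from only 50 measurements by looking for the simple solution among the infinitely many that fit the data:

import numpy as np

# Toy underdetermined system: m measurements, n unknowns, k-sparse signal.
# All sizes and the regularization weight are illustrative choices.
rng = np.random.default_rng(0)
m, n, k = 50, 200, 5

A = rng.standard_normal((m, n)) / np.sqrt(m)   # random sensing matrix
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
y = A @ x_true                                 # noiseless measurements

# ISTA: gradient step on ||Ax - y||^2 followed by soft-thresholding,
# which biases the iterates toward sparse solutions.
lam = 0.01
step = 1.0 / np.linalg.norm(A, 2) ** 2
x = np.zeros(n)
for _ in range(2000):
    x = x - step * A.T @ (A @ x - y)
    x = np.sign(x) * np.maximum(np.abs(x) - step * lam, 0.0)

print("relative error:", np.linalg.norm(x - x_true) / np.linalg.norm(x_true))

Whether such a recovery succeeds or fails for a given sparsity/sampling pair is precisely what the phase transitions mentioned above map out.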

In Machine Learning, which really extends traditional statistical analysis, the issue is becoming central when models are evaluated on a test set. In effect, once you have used the test set, your next iteration embeds that information. Kaggle and Kaggle-like competitions have been set up to deal with this issue: a public leaderboard tells you how your model fits an unknown but reused dataset, while a second dataset is used only once (or secretly) to assess the real standing of the algorithms with respect to the others. The recent "scandal" in Machine Learning is linked to the ability of one team to reuse the test set more often than the rest of the community. But deep down, one wonders how often is too often. This is why any clever way to reuse the test set is becoming a very interesting proposition.
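To get a feel for why "how often is too often" matters, here is a small synthetic experiment in the spirit of the one in the paper below (pure NumPy; the sizes, the correlation-sign selection rule and the naive linear classifier are my own illustrative choices). The data are pure noise, yet selecting features by consulting the holdout set makes the reused holdout accuracy look well above chance, while a fresh test set shows no signal:

import numpy as np

# Toy adaptive-analysis experiment: the data are pure noise, so no feature
# can genuinely predict the labels. Sizes and the selection rule are illustrative.
rng = np.random.default_rng(1)
n_train, n_hold, n_fresh, d = 2000, 2000, 2000, 1000

def make_data(n):
    X = rng.standard_normal((n, d))
    y = rng.choice([-1.0, 1.0], size=n)      # labels independent of X
    return X, y

Xtr, ytr = make_data(n_train)
Xho, yho = make_data(n_hold)
Xfr, yfr = make_data(n_fresh)

def corr(X, y):
    return X.T @ y / len(y)

# "Adaptive" analyst: keep features whose correlation with the labels has the
# same sign on both the training set and the (repeatedly reused) holdout set.
ctr, cho = corr(Xtr, ytr), corr(Xho, yho)
selected = np.sign(ctr) == np.sign(cho)
w = np.sign(ctr) * selected                  # naive linear classifier

def accuracy(X, y, w):
    return np.mean(np.sign(X @ w) == y)

print("holdout accuracy (reused):", accuracy(Xho, yho, w))   # looks well above 0.5
print("fresh test accuracy:      ", accuracy(Xfr, yfr, w))   # ~ 0.5, i.e. no signal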



To get more insight into this issue and how it may be solved, you want to read both of these blog entries and their attendant comments:


and, more importantly, the attendant NIPS submission related to the first blog entry:

Generalization in Adaptive Data Analysis and Holdout Reuse by Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, Aaron Roth

Overfitting is the bane of data analysts, even when data are plentiful. Formal approaches to understanding this problem focus on statistical inference and generalization of individual analysis procedures. Yet the practice of data analysis is an inherently interactive and adaptive process: new analyses and hypotheses are proposed after seeing the results of previous ones, parameters are tuned on the basis of obtained results, and datasets are shared and reused. An investigation of this gap has recently been initiated by the authors in (Dwork et al., 2014), where we focused on the problem of estimating expectations of adaptively chosen functions.
In this paper, we give a simple and practical method for reusing a holdout (or testing) set to validate the accuracy of hypotheses produced by a learning algorithm operating on a training set. Reusing a holdout set adaptively multiple times can easily lead to overfitting to the holdout set itself. We give an algorithm that enables the validation of a large number of adaptively chosen hypotheses, while provably avoiding overfitting. We illustrate the advantages of our algorithm over the standard use of the holdout set via a simple synthetic experiment.
We also formalize and address the general problem of data reuse in adaptive data analysis. We show how the differential-privacy based approach given in (Dwork et al., 2014) is applicable much more broadly to adaptive data analysis. We then show that a simple approach based on description length can also be used to give guarantees of statistical validity in adaptive settings. Finally, we demonstrate that these incomparable approaches can be unified via the notion of approximate max-information that we introduce.
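For a concrete feel of the holdout-reuse mechanism, here is a minimal sketch in the spirit of the paper's Thresholdout procedure: the holdout estimate is released (with added noise) only when it disagrees noticeably with the training estimate. The class name, the noise scales and the absence of a query budget are my own simplifications, not the authors' reference implementation:

import numpy as np

rng = np.random.default_rng(2)

class Thresholdout:
    """Sketch of a Thresholdout-style reusable holdout.

    For each query phi (a function mapping one sample to a number in [0, 1]),
    the training-set mean is returned unless it differs from the holdout mean
    by more than a noisy threshold, in which case a noised holdout mean is
    returned instead. The noise scales below are illustrative choices.
    """

    def __init__(self, train, holdout, threshold=0.04, sigma=0.01):
        self.train, self.holdout = train, holdout
        self.threshold, self.sigma = threshold, sigma

    def query(self, phi):
        mean_train = np.mean([phi(x) for x in self.train])
        mean_hold = np.mean([phi(x) for x in self.holdout])
        # Noisy comparison: only consult the holdout when the gap is large.
        gap = abs(mean_train - mean_hold)
        noisy_threshold = self.threshold + rng.laplace(scale=2 * self.sigma)
        if gap > noisy_threshold:
            return mean_hold + rng.laplace(scale=self.sigma)
        return mean_train

The full procedure in the paper also tracks a budget of such "overfitting" events and refreshes its noise accordingly; see the paper for the actual algorithm and guarantees.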
Also, from the comment section:
 Note that the NIPS 2015 paper relies on results from two papers by other authors that came after the STOC 2015 paper:
Bassily et al: http://arxiv.org/abs/1503.04843
Nissim and Stemmer: http://arxiv.org/abs/1504.05800

Those two papers provide tighter generalization guarantees than the STOC 2015 paper. The NIPS 2015 and Science papers rely on those intermediate results.

 
 
h/t Charles on Twitter

h/t Giuseppe on Twitter
Join the CompressiveSensing subreddit or the Google+ Community or the Facebook page and post there !
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.
