In a previous entry, I mentioned the connection between how L1 and maximum entropy considerations. As we now know, in Compressed Sensing, the reconstruction of the signal from incoherent measurements involves a Linear Programming (L1) step featured in the Basis Pursuit approach. We also know that there is a bayesian approach to the reconstruction which obviously involves Laplace priors (because a maxent approach to the problem involving the L1 norm point to Laplace distribution as ideal priors).
In The Price of Privacy and the Limits of LP Decoding by Cynthia Dwork, Frank McSherry and Kunal Talwar show the following:
Our principal result is the discovery of a sharp threshhold ρ∗ ≈ 0.239, so that if ρ < ρ∗ and A is a random m × n encoding matrix of independently chosen standard Gaussians, where m = O(n), then with overwhelming probability over choice of A, for all x ∈ Rn, LP decoding corrects ρm arbitrary errors in the encoding Ax, while decoding can be made to fail if the error rate exceeds ρ∗.
and then in the conclusion
Comparing the results of section 7 with those of section 4, we note that while near-perfect decoding is information theoretically possible for error rates up to a half, LP decoding fails at much lower error rates. It is thus natural to look for other efficient decoding procedures for the case of adversarial error.
We already somehow knew this from previous investigations (David Donoho and Victoria Stodden in Breakdown Point of Model When the Number of Variables Exceeds the Observations ) but we now have a number: 24 percent. Interestingly, Rick Chartrand in Exact reconstruction of sparse signals via nonconvex minimization makes the case that since L1 might not be optimal and that looking at Lp with p less than 1 might lead to better results
. The idea being that asymptotically, we want p to go to zero in order to fulfill the real L0 requirement. From the abstract:
We show that by replacing the L1 norm with the Lp norm with p < 1, exact reconstruction is possible with substantially fewer measurements. We give a theorem in this direction, and many numerical examples, both in one complex dimension, and larger scale examples in two real dimensions.
Another paper by Chartrand,  Nonconvex compressed sensing and error correction shows similar results. His computations seem to go very fast compared to BP/LP so this is noteworthy (we could always go for L0 directly but since it is combinatorial, the time spent on the computation is just too enormous). In light of the reconstruction error shown in that article, one cannot but recall the statement made by Aleks Jakulin and Andrew Gelman in this comments section on using log(1+d^2) norm.
As an log(1+d^2) ball does not have (thank you Yaroslav) the same shape as the Lp norm ball for p less than 1, how can we reconcile the findings that Aleks seems to find Cauchy/t-distributions are doing a good job in sparse decomposition (better than L-1)?
For other reasons, I'd like to think that Cauchy distributions are indeed grounded in serious theoretical work (besides the observation that they are not sensitive to outliers). We already know that random projections exist when modeling the primary visual cortex. We may eventually figure that some type of Autism is related to an equivalent phase change between L0 and L1 or L_log(1+d^2). There is a nice parallel between the metabolic constraints on the cortex when doing sparse coding and CPU requirements to do L0 or L1 or Lp. 
[1] Loss function semantics.
[2] Rice Compressed sensing page.
[3] Object recognition in the cortex
I don't see how t-distribution can favor sparse representations. log(1+d^2) has the same contours as d^2, so MAP estimation with T-distribution or Gaussian should follow the same regularization path.
ReplyDeleteOne thing that intrigues me is how important is the original encoding of features for L1 to work. Suppose you take your data X to AX where A is some dense matrix. If a good solution was sparse in the original representation, it's no longer sparse in the new representation. Will L1 regularization still work better than L2?
Yaroslav,
ReplyDeleteyou are absolutely right. There is a good reason I needed a break. Let me make a huge correction in the post. It stills remains that empirically, the statistics folks seem to find that T-distribution yield sparser representation, and as you point out it does not make sense in light of what we currently know about L1 regularization doing better than L2.
Thanks for pointing out this gross error.
I am not sure I fully understand the second question .
Igor.
The second question is how L1 compares to L2 when the signal is not sparse in the basis we are considering. Will L1 work better than no regularization at all? Will L2 work better than L1?
ReplyDeleteI don't know. L1 is enjoying a nice ride because of the sparse property. If the decomposition is not sparse (exactly or approximately), then solving an underdetermined system will produce infinitely many solutions using either L2 or L1.
ReplyDeleteSome people could say that if your decomposition is not sparse, then maybe you are solving the wrong problem.
Does that answer your question ?
Igor.
Well, I was thinking more from the classification point of view - people say that L1 is good when only a few features are relevant, so I was wondering how it compares to L2 when all the features are relevant. Or in other words, whether there are practical situations when L2 regularization is better in the sense of producing lower prediction error.
ReplyDeleteI suspect that Alex/Gelman are using t-distribution for Bayesian Model Averaging, and not sparse decomposition. Since t-distribution is flatter than a Gaussian, doing BMA with it would result in smaller variance.
I'd be interested to see if anyone else got good results for doing Lp regularization for p<1 with gradient descent. The problem I see is that the norm of the gradient is infinite whenever one of the coefficients is 0. If you initialize your parameters at 0, the gradient descent will be stuck at 0. If you initialize one of the parameters close to 0, then that parameter should go to 0 fast. In other words it seems like gradient descent with such regularization would be overly sensitive to initial conditions
Yaroslav,
ReplyDeleteThis L1/L0 discovery is really about introducing the concept of prior knowledge on the solution (something that was just hoped for L2). When all the features are important, L2 might as well be used, I would venture.
Now for Lp with p less than 1, maybe you should ask Rick Chartrand directly ( http://math.lanl.gov/~rick/ ) on the difficulties he may have had ?
Igor.