Thursday, October 15, 2009

CS: RIP vs NSP, Clarifications by Jared Tanner and Remi Gribonval

Yesterday's post got some nice feedback. Two things:
  • Thank you to Jared Tanner and Remi Gribonval for being kind enough to take some time off to answer yesterday's question.
  • I suck: I have covered some of these issues/papers before but still cannot seem to get a good grasp on this issue.
Here are these answers:

Jared Tanner was the first one to respond to yesterday's post:

Dear Igor,

I just read your Oct. 14th posting on Nuit Blanche where the following question was raised by Giuseppe Paleologo:

"do you know how much weaker is the nullspace property vs. restricted isometry? Are there studies on this?"

and your comment that "I am sure a specialist could clear that up for all of us by sending me a short E-mail." Here are a few take-away "bullets", followed by a longer discussion.

a) The RIP is a generic tool that can be used for many algorithms, whenever sparsity is the simplicity being exploited.

b) The NSP is only applicable to sparse approximation questions where there is an "objective", such as l1 minimization.

c) For l1 minimization, the NSP is a weaker assumption than the RIP, see "Compressed Sensing: How sharp is the RIP?" by Blanchard, Cartis, and Tanner.

d) l1 minimization is the only setting where NSP and the RIP have been used to try to answer the same question, and is the only setting where it is fair to compare them. NSP is equivalent to l1 minimization recovery, and for this reason RIP implies the NSP, but not the other way around.

e) Both methods can be used to imply stability.

f) Many matrix ensembles are proven to have bounded RIP constants (Gaussian, Fourier, etc.). Many matrix ensembles are proven to have the NSP; see "Counting the faces of randomly-projected hypercubes and orthants, with applications" by Donoho and Tanner, to appear in Discrete and Computational Geometry.

Now for the longer discussion:

First of all, what is the NSP? Let's focus on the case of \min \|x\|_1 subject to Ax=b, where there is an x_0 that is k-sparse with Ax_0=b and A is of size n by N. (Note the ordering k < n < N.) l1 minimization recovers x_0 if and only if there is no vector \nu in the null space of A such that x_0+\nu is interior to (or intersects the boundary of) the l1 ball of radius \|x_0\|_1. (See the image above depicting this, with the blue object being the l1 ball, the yellow circle being x_0, and the red line being the null space of A.) The NSP asks whether this holds for every x_0 that is k-sparse, effectively moving x_0 to each of the 2^k {N \choose k} faces of the l1 ball associated with the supports and sign patterns of k-sparse vectors. That the null space of A, shifted to these faces of the l1 ball, never goes interior probably seems a very strict requirement, but it often holds. In fact, this notion isn't new; it is referred to as "neighborliness" in the convex polytope community. David L. Donoho analyzed this question in detail in 2005, precisely characterizing the values of (k,n,N) for which this does and does not hold; see "Neighborly Polytopes and Sparse Solutions of Underdetermined Linear Equations."
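To make the null space picture concrete, here is a minimal Python sketch (the Gaussian ensemble and all sizes are illustrative assumptions, not taken from the email). It samples random null space vectors of A and measures how much l1 mass can concentrate on k coordinates; random sampling can expose an NSP violation but can never certify the NSP, since an exact check is combinatorial:

```python
# A minimal numerical illustration of the NSP (illustrative sizes).
# Sampling can expose NSP violations but cannot certify the NSP:
# an exact check would require all (N choose k) supports.
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(0)
n, N, k = 40, 80, 8
A = rng.standard_normal((n, N))

V = null_space(A)                      # orthonormal basis of null(A), shape (N, N - n)
worst = 0.0
for _ in range(10_000):
    nu = V @ rng.standard_normal(V.shape[1])
    s = np.sort(np.abs(nu))[::-1]      # worst-case support S = the k largest entries
    worst = max(worst, s[:k].sum() / np.abs(nu).sum())

# NSP of order k requires ||nu_S||_1 < ||nu_{S^c}||_1, i.e. the ratio
# below stays < 1/2, for every nu in null(A) and every |S| = k.
print(f"worst sampled ||nu_S||_1 / ||nu||_1 = {worst:.3f} (NSP needs < 0.5)")
```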

How does this compare with the RIP? First of all, the RIP has nothing to do with l1 minimization, but is clearly a natural property to desire in sparse approximation. This is both the strength and the weakness of the RIP. It has been used to prove convergence results for many algorithms, including many for which the null space property would give no information. However, because it is a general tool, RIP-based conditions are generally very restrictive. There is only one venue where it is appropriate to directly compare the null space property and the RIP: the recovery results for l1 minimization, where both can be used. Not surprisingly, the null space conditions derived by Donoho are much weaker (that is, less restrictive) than the associated RIP-based conditions.

This raises a closely related point: how does one read RIP-based results? When is it true that RIP constants of order 2k are less than 0.45? (See Foucart and Lai, ACHA 2009, pages 395-407 for this l1 recovery condition.) How much more restrictive is it to require RIP constants of order 3k to be less than, say, 0.2? (See Subspace Pursuit by Dai and Milenkovic.) Some answer this question by saying that matrix class zzz has bounded RIP constants, so this is satisfied when n > O(k\log(N/k)). Unfortunately, this type of answer does not allow one to distinguish between the above two conditions, and essentially says that all state-of-the-art CS algorithms behave at the optimal order. Although true, this might not be helpful to a practitioner who wants to know which algorithm is best for them, and how large n must be. To shed some light on this, Jeffrey B. Blanchard, Coralia Cartis, and I derived improved bounds on the RIP constants for the Gaussian ensemble, and showed how these bounds can be used to make it more transparent when the above-mentioned conditions are satisfied, allowing one to take the values (k,n,N) and easily verify whether the bound is satisfied. (Note these bounds require large problem sizes to be effective.) Andrew Thompson joined us to repeat this process for many of the greedy algorithms; see "Phase transitions for greedy sparse approximation algorithms" at: http://www.maths.ed.ac.uk/~tanner/publist.html
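As a companion to this point, here is a hedged sketch of how one might probe RIP constants empirically (again with purely illustrative sizes; this is not the Blanchard-Cartis-Tanner bound machinery). Monte Carlo over random supports only yields a lower bound on \delta_k, since the true constant is a maximum over all supports, which is one reason derived bounds are needed:

```python
# A rough Monte Carlo probe of RIP constants (illustrative sizes).
# Sampling random supports only gives a LOWER bound on delta_k, since
# the true constant is a maximum over all (N choose k) supports.
import numpy as np

rng = np.random.default_rng(1)
n, N, k = 100, 400, 10
A = rng.standard_normal((n, N)) / np.sqrt(n)   # columns have unit norm in expectation

delta = 0.0
for _ in range(2000):
    S = rng.choice(N, size=k, replace=False)
    sv = np.linalg.svd(A[:, S], compute_uv=False)
    # RIP on support S: (1 - delta)||x||^2 <= ||A_S x||^2 <= (1 + delta)||x||^2
    delta = max(delta, abs(sv[0] ** 2 - 1), abs(sv[-1] ** 2 - 1))

print(f"Monte Carlo lower bound on delta_{k}: {delta:.3f}")
# Compare with, e.g., the Foucart--Lai condition delta_{2k} < 0.45 for l1 recovery.
```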

Lastly, the NSP and RIP discussed above prove a "strong equivalence": that the matrix can be used for compressed sensing of any k-sparse vector. There is a less restrictive version of the NSP which proves what Donoho termed "weak equivalence". Weak equivalence ensures that the matrix can be used for compressed sensing of all but a vanishingly small fraction of k-sparse vectors. Note, this is what one observes in practice when selecting a k-sparse vector at random and testing an algorithm's performance. (For a discussion of weak equivalence see "Observed universality of phase transitions in high-dimensional geometry, with implications for modern data analysis and signal processing" by Donoho and Tanner.) Efforts are underway to derive weak versions of the RIP; see "Construction of a Large Class of Deterministic Sensing Matrices".
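Weak equivalence is exactly what the standard empirical experiment measures: draw k-sparse vectors at random and record the l1 recovery rate. A minimal sketch of that experiment (sizes, the Gaussian ensemble, and the solver are illustrative assumptions; the LP reformulation of min \|x\|_1 is the standard one):

```python
# The standard weak-equivalence experiment: empirical l1 recovery rate
# over random k-sparse vectors, via the usual LP reformulation
#   min 1'u  s.t.  Ax = b,  -u <= x <= u.
# Sizes and the Gaussian ensemble are illustrative assumptions.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(2)
n, N, k, trials = 40, 80, 10, 50
A = rng.standard_normal((n, N))

successes = 0
for _ in range(trials):
    x0 = np.zeros(N)
    x0[rng.choice(N, size=k, replace=False)] = rng.standard_normal(k)
    b = A @ x0

    # variables z = [x; u]: minimize sum(u)
    c = np.concatenate([np.zeros(N), np.ones(N)])
    I = np.eye(N)
    A_ub = np.block([[I, -I], [-I, -I]])            # x - u <= 0 and -x - u <= 0
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(2 * N),
                  A_eq=np.hstack([A, np.zeros((n, N))]), b_eq=b,
                  bounds=[(None, None)] * N + [(0, None)] * N, method="highs")
    if res.success and np.linalg.norm(res.x[:N] - x0) < 1e-5:
        successes += 1

print(f"empirical l1 recovery rate at k={k}: {successes}/{trials}")
```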

Hope that answers a few questions....

Then, Remi Gribonval sent the following:

Dear Igor,

Regarding the RIP vs NSP discussion, I think our recent IEEE Trans. Inf. Theory paper with Mike Davies pretty much shows how much sharper the NSP is with respect to the RIP. The paper can be found here, and our preprint was discussed in a previous post on Nuit Blanche. We also have a conference paper at SAMPTA'09 which considers stable versions of the NSP, seemingly identical to the "truncated NSP" of Wotao Yin and Yilun Wang mentioned in a recent post. This paper is available here.

One way to summarize the results is the following:
  • Any matrix satisfying the RIP with \delta_{2m} < 0.414 (slightly better constants are known) must satisfy the NSP of order m (stated formally below), and L1 minimization therefore recovers all m-sparse vectors.
  • There exist matrices with RIP constant \delta_{2m} arbitrarily close to 1/\sqrt{2} which fail to satisfy the NSP of order m, and for which L1 will therefore fail to recover some m-sparse vectors.
  • Yet there are matrices with RIP constant arbitrarily close to 1 which satisfy the NSP of order m, and for which L1 is therefore successful.
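For reference, the NSP of order m invoked above is usually stated as follows: for every nonzero \nu in the null space of A and every index set S with |S| \le m, \|\nu_S\|_1 < \|\nu_{S^c}\|_1. The RIP constant \delta_{2m} is the smallest \delta such that (1-\delta)\|x\|_2^2 \le \|Ax\|_2^2 \le (1+\delta)\|x\|_2^2 for all 2m-sparse x.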
In this respect,
  • The RIP is a sufficient condition to guarantee that the NSP (or, even better, its stable version) is satisfied. The RIP (implicitly, with a small RIP constant) is not necessary to guarantee the recovery of sparse vectors (and the stable approximate recovery of compressible vectors).
  • However, the RIP seems necessary to guarantee robustness to noise.
I guess that similar results would hold for the RIP-1 property and other sufficient conditions which differ from the NSP, which is a necessary and sufficient condition. Another line of thought is that of Jared Tanner and David Donoho, where they consider "weak recovery thresholds" for specific families of random matrices; that is to say, they allow an exponentially small fraction of m-sparse vectors that may not be recovered. Then the NSP no longer characterizes the threshold, but the RIP is even less accurate.

I hope that this contributes to clarifying these questions....

@article{Davies:2008ab,
Author = {Davies, M. E. and Gribonval, R{\'e}mi},
Journal = {IEEE Trans. Inform. Theory},
Month = may,
Number = {5},
Pages = {2203--2214},
Title = {Restricted Isometry Constants where $\ell^p$ sparse recovery can fail for $0 < p \leq 1$},
Volume = {55},
Year = {2009}}

@inproceedings{Davies:2009aa,
Address = {Marseille, France},
Author = {Davies, M. E. and Gribonval, R{\'e}mi},
Booktitle = {Proc. SAMPTA'09 (Sampling Theory and Applications)},
Month = may,
Title = {On Lp minimisation, instance optimality, and restricted isometry constants for sparse approximation},
Year = {2009}}

Thank you Jared and Remi !

1 comment:

Anonymous said...

I have to say that it's quite amazing that a tweet I shot to Igor cascaded into two detailed replies from none other than Jared Tanner and Remi Gribonval. Thanks to Jared for the nice geometric intuition provided for the NSP. I still have to digest his comment, and his recent paper with Blanchard and Cartis.

Briefly, I should mention that the question of RIP vs NSP is interesting to me because in statistical applications where A is observational, RIP is not obviously satisfied. However, even when it is not satisfied, l1 minimization or lasso perform well, so possibly there is a different explanation for their success. In this respect, I am especially interested in stable recovery under the two assumptions.

I have been reading some papers and presentations by Remi, and have been thinking about his statement: "however the RIP seems necessary to guarantee robustness to noise" (which appears in the recent IEEE paper). In most papers this seems to be the case, because stability obviously depends on the metric properties of the encoder, which the RIP captures. The (lone?) exception is a recent paper by Yin Zhang (http://www.caam.rice.edu/~zhang/reports/tr0811_revised.pdf), which characterizes stability not via the RIP or the canonical NSP, but still in terms of projections on the null space. The quantity of interest in that analysis is the value of \|x\|_1/\|x\|_2 for x \in N(A), which is different from the NSP.
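A quick way to poke at that quantity numerically (a sketch with assumed sizes; random sampling only upper-bounds the minimum of the ratio over the null space, which is what the stability analysis depends on):

```python
# A quick look at Zhang's null-space quantity ||x||_1 / ||x||_2 for
# x in N(A) (illustrative sizes). Random sampling only upper-bounds
# the minimum of this ratio over the null space.
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(3)
n, N = 40, 80
V = null_space(rng.standard_normal((n, N)))   # orthonormal basis of N(A)

ratios = [np.abs(x).sum() / np.linalg.norm(x)
          for x in (V @ rng.standard_normal((V.shape[1], 10_000))).T]

# Large values mean null-space vectors are "spread out" (far from sparse);
# the absolute maximum of the ratio for any vector in R^N is sqrt(N).
print(f"sampled min of ||x||_1/||x||_2 over N(A): {min(ratios):.2f}  (sqrt(N) = {N**0.5:.2f})")
```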

Thanks a lot to Igor, Jared and Remi for their posts!

-Giuseppe Paleologo
