Wednesday, August 08, 2012

We live in exciting times !

You probably recall last week's entries on
Go read them I'll wait. Zhilin sent me the following:

Dear Igor, 
I read your post "  More intra-block correlation and sparse sensing matrices " which presents Phil's email. After he introduced his several nice papers, in the end of the email Phil said "we get performance that meets or exceeds that of Bayesian or variational techniques, but with orders-of-magnitude reduction in complexity". I believe Phil was saying that AMP algorithms meet or exceed the performance of Bayesian or variational techniques in the scenarios/applications given in their papers.
There are several scenarios where SBL has obvious advantages over other algorithms. One situation is when the sensing matrix is coherent (i.e., columns are largely correlated). This situation is often encountered in source localization and tracking (e.g. direction-of-arrival estimation, brain source localization, radar detection), biomedical data analysis (e.g. neuroimaging data pattern recognition), and computer vision (e.g. face/expression recognition). SBL algorithms generally have much better performance than other algorithms in these applications. David Wipf also has a NIPS 2011 paper, titled : "Sparse Estimation with Structured Dictionaries", which mathematically proved why SBL is better than Lasso in this situation (Note that the so-called dictionary matrix in this paper is indeed the sensing matrix in a standard CS experiment. I hope readers won't be confused by these names).
There is a source localization demo using a sensing matrix from real-world data, which can be downloaded at: In this demo I compared three MMV algorithms developed in our lab. Anyone can use his/her favorite algorithm in this demo to see the performance.
When signals are structured but non-sparse or less-sparse, SBL algorithms also have good performance, as shown in my wireless telemonitoring paper, " Low Energy Wireless Body-Area Networks for Fetal ECG Telemonitoring via the Framework of Block Sparse Bayesian Learning". Also, my BSBL paper, titled " Extension of SBL Algorithms for the Recovery of Block Sparse Signals with Intra-Block Correlation", shows that in a noise-free situation, after exploiting block structure and intra-block correlation, BSBL algorithms can exactly recover sparse signals with K non-zero elements from only K measurements with high success rates (>0.99). Thus, SBL is helpful in the applications when signals are not very sparse.
SBL algorithms are often criticized for slow speed. But I think the situation will change. One reason is that SBL algorithms can be transformed into iterative reweighted algorithms. For example, in the above BSBL paper, BSBL-L1 can be implemented by iteratively running group Lasso 3-4 times or less. Since every year there are a number of more efficient Lasso-type algorithms proposed, it is expected that BSBL-L1 can be faster and faster. Another reason is there are several methods to speed up SBL algorithms, such as bounded-optimization technique used in my BSBL-BO algorithm, and the fixed-point method, which was used in my CVPR 2012 paper, titled "Sparse Bayesian Multi-Task Learning for Predicting Cognitive Outcomes from Neuroimaging Measures in Alzheimer's Disease". And we also have several much faster algorithms and will be submitted soon.
In the end, I want to say it is hard to say AMP outperforms SBL, or SBL outperforms AMP. AMP and SBL, as two benchmarks, have advantages in different applications/scenarios. We all enjoy seeing both AMP and SBL are evolving for [the] better.


[ for reference: Zhilin's SBL solver is here. Phil Schinter's DCS-AMP solver is here. ]

Thanks  Zhilin. I am pretty sure the attentive reader drew similar conclusions. But I think eventually this type of exchange shows that there is something at play here. Whether or not we have the choice for the measurement matrix is indeed an issue. To me, the other issue is really trying to figure out why some of these results stand: Is this due to the choice of the measurement matrix ? or a constraint on the unknown vector such as sparsity? or because of a specific distribution for that vector ? or even some deeper knowledge of the structure of that vector ? I can safely say that this is an exciting time as I don't think we have a clear view of these subjects and boundaries. 

My other take away from this debate is that it is all the more important for the theoretically minded to jump in. Indeed, it would be nice for the hardware makers to get a feedback from the theoretical folks on how much they should care on the structure of the measurement matrix and derive some new hardware or if some sorts of insightful knowledge will make the detail of the hardware somewhat irrelevant.

As it turns out Larry Wasserman just wrote the following entry on his blog:  RIP RIP (Restricted Isometry Property, Rest In Peace), and I think there is a parallel. Go read it I'll wait.....In his indictment of RIP for regression purposes, he also writes:

Compressed sensing is different. In these applications, the regression {Y_i = \beta^T X_i + \epsilon} refers to data coming from some device. We actually design and build the device in such a way that the linear assumption is true. Moreover, part if the device design is that we get to make up the {X_i}‘s (not nature). In particular, we can choose the design matrix. Thus, we can force the RIP to hold. In this engineering applications, linearity and RIP are not just convenient mathematical assumptions. They are design features. They are real.

I then commented on his blog:
Within Compressed Sensing, RIP is not a condition for sparse recovery that is in fashion anymore. In particular since about 2008 and the work of Donoho and Tanner [1], we have empirical proofs that in general this condition is too stringent. The generic Donoho-Tanner phase transition generally shows that l_1 recovery of sparse vectors admit less sparse vectors than what RIP would require.
In a certain sense, the “coup de grace” was given to that argument recently with a phase transition even beyond the traditional Donoho-Tanner PT (see [2]) and initially featured in the work of Krzakala et al [3]. All this can be easily pictured in this beautiful hand crafted drawing I made for the occasion: :)
The work of Krzakala et al [3] show that by choosing the right design matrix, one can attain very interesting limits. Right, within the community, there is an on-going discussion as to whether we ought to put some constraint on the design matrix or push further the constraint on the vector to be recovered. On the one hand, it means, we need to design different hardware, on the other, it means we need to dig further in the structured sparsity world that the ML folks are investigating. An example of that discussion can be found in [4] and [5].
But let me play the devil’s advocate for RIP, at least within compressive sensing. RIP was for a long while the only mathematically rigorous reasoning at hand for people to show that their empirical finding held even if in most of these papers, the authors acknowledged that their design matrices did not fit the RIP condition. To a certain extent, I can bet you could not even get a paper out unless the first few paragraphs talked about the solid reason as to why sparse recovery was possible (thanks to RIP). We have also seen many authors coming up with different conditions paralleling RIP, so it was a useful tool for a while.
There is also another aspect as to why RIP is interesting: It is a similar tool that is currently used to investigate low rank type of algorithms. Here again, I am sure that most phase transition discovered there will show that RIP is too stringent, but again it may be the only theoretical tool around grounded in some theory.
[1] Observed universality of phase transitions in high-dimensional geometry, with implications for modern data analysis and signal processing by Jared Tanner and David L. Donoho 
[2] Compressive Sensing: The Big Picture, 
[3] Statistical physics-based reconstruction in compressed sensing by Florent Krzakala, Marc Mézard, François Sausset, Yifan Sun, Lenka Zdeborová, 

Because here again, this new phase transition can be attained through the construction of dedicated matrices ( Krzakala et al ) or by choosing a specific type of distributions as shown by Phil Schniter. Here again we are torn between the simple hardware question: should I design a hardware / sensor around a special set of  "magic matrices" or should I strive for a representation of the signal so that it has a certain distribution (like Bernouilli)

To which Larry kindly replied with:

Thanks for the comment Igor.
Perhaps I focused too much on RIP. My real point was that ANY conditions needed for recovery are unrealistic for vanilla regression problems. To put it another way, recovery is (usually) not an interesting problem in ordinary regression. (As a pedantic example, take two covariates to be perfectly correlated. You can’t separate the two but you don’t need to if you focus on predictive accuracy.) In other words, regression and compressed sensing are very different problems (despite the obvious overlap). But the difference often gets ignored in some statistics papers.
Thanks for the discussion and references
I could see some problem in blind deconvolution where the problematic might overlap in a more meaningful way but this is a good point from the regression standpoint. 

To come back to compressive sensing and the current not-so-well understood relationship between measurement matrices, structured sparsity and solvers, much like Curiosity on the surface of Mars, we live  in exciting times.

Photo: NASA/JPL/Caltech

No comments: