Thursday, October 03, 2013

The Tale of Three Breakthroughs: Matrix Sketches, Face Recognition and an AMP based LASSO

Your world won't be the same after reading today's entry.

I recently read three papers that are bringing home the idea that sometimes, we are overoptimizing elements of an idea when in fact, the end process ought to be simpler than it really is made out to be, While at the same time bringing home the fact that some breakthroughs are happening right in front of your eye. Let me explain what I mean by that:

Edo liberty uses streaming algorithms to perform matrix sketching [1] (we mentioned it here). This is wild ! If you recall streaming algorithms are the sort of algorithms you use when you just don't have the time, space and energy to store data, one pass through, that's it, you can't do more, because there is still more data in the pipeline. It reminds me of the optimizing the optimizer of Pablo Sprechmann, Alex Bronstein, Guillermo Sapiro featured recently in [2,3] (and mentioned in [4]) . Yes, we have been comfortable with slow algorithms and maybe over designing them, but the applied maths people need to provide bounds for the "simpler" algorithms because we are getting to the point where we just don't have the time ("We generated more data than there are stars in the Universe"). And this is what Edo does in [1].

The second paper is about face recognition [5] (Generic Image Classification Approaches Excel on Face Recognition), there  Fumin Shen and Chunhua Shen study different database of faces and find that sparse coding with no dictionary learning is actually (much) better than more elaborate, should I , say over thought, algorithms and that the path to research ought to be on how to make face in the right format before feeding them onto recognition algorithms. And this is just what Yi Ma and others (see It's stunning and quite amazingly rich and no ... it's not your father's signal processing)
 and others do. Matrix factorization to the people I say. And then there is the peculiar but very noteworthy finding: faces are not exceptional, algorithms using patches from non face related images work as well as patches from face databases. Wow, just wow.

The third paper is by Ali MousaviArian MalekiRichard Baraniuk [6] and about how to connect AMP and LASSO. Here again, this is in my view major. If you recall we are coming out of a period where L1 reconstruction algorithms [4] were slow yet becoming faster. In fact for a period they were 10 times faster every year. Then out of the blue physicists [7,9] and statisticians and applied mathematicians [8] showed us that an algorithm allowing the computation of the states of spin in condensed matter physics could solve our problems much faster with even stricter bounds. An eye opening of sorts that, in my mind, paved the way to make the structured sparsity approach on a firmer ground theoretically speaking. We are not talking about k = O(klog(k/N)) any longer but m = k+1. Not only that, but we are also talking about a few matrix-vector multiply, i.e. a very speedy solver. So all this thanks to either pertinently well designed measurement matrices, or through an assumption of structured sparsity. To put it simply and pragmatically AMP simply blew our world in terms of both speed and sparsity bounds in compressive sensing and eventually hardware design.. We've been made aware recently of some connection between bounds on LASSO and AMP [8] but this time Ali MousaviArian MalekiRichard Baraniuk show us the connection between the regularization term in the LASSO that is a mainstay of statistics and sparsity and how these results enable the design of an AMP based LASSO algorithm. In short, this is opening the door to very large scale statistical work. Wow just wow. Go read it at [6].


A sketch of a matrix A is another matrix B which is significantly smaller than A, but still approximates it well. Finding such sketches efficiently is an important building block in modern algorithms for approximating, for example, the PCA of massive matrices. This task is made more challenging in the streaming model, where each row of the input matrix can be processed only once and storage is severely limited.
In this paper, we adapt a well known streaming algorithm for approximating item frequencies to the matrix sketching setting. The algorithm receives n rows of a large matrix
A ∈ R n×m one after the other, in a streaming fashion. It maintains a sketch B ∈ R
ℓ×m containing only ℓ ≪ n rows but still guarantees that A
T A ≈ BTB. More accurately,
∀x,kxk = 1 0 ≤ kAxk2 − kBxk2 ≤ 2kAk 2 f/ℓ
BT B ≺ AT A and kAT A − BTBk ≤ 2kAk2f/ℓ .
This algorithm’s error decays proportionally to 1/ℓ using O(mℓ) space. In comparison, random-projection, hashing or sampling based algorithms produce convergence bounds proportional to 1/ √ℓ. Sketch updates per row in A require amortized O(mℓ) operations and the algorithm is perfectly parallelizable. Our experiments corroborate the algorithm’s scalability and improved convergence rate. The presented algorithm also stands out in that it is deterministic, simple to implement, and elementary to prove.

[3] Optimizing to Optimize by Guillermo Sapiro
[4] Sunday Morning Insight: Faster Than a Blink of an Eye
The main finding of this work is that the standard image classification pipeline, which consists of dictionary learning, feature encoding, spatial pyramid pooling and linear classification, outperforms all state-of-the-art face recognition methods on the tested benchmark datasets (we have tested on AR, Extended Yale B, the challenging FERET, and LFW-a datasets). This surprising and prominent result suggests that those advances in generic image classification can be directly applied to improve face recognition systems. In other words, face recognition may not need to be viewed as a separate object classification problem. 
While recently a large body of residual based face recognition methods focus on developing complex dictionary learning algorithms, in this work we show that a dictionary of randomly extracted patches (even from non-face images) can achieve very promising results using the image classification pipeline. That means, the choice of dictionary learning methods may not be important. Instead, we find that learning multiple dictionaries using different low-level image features often improve the final classification accuracy. Our proposed face recognition approach offers the best reported results on the widely-used face recognition benchmark datasets. In particular, on the challenging FERET and LFW-a datasets, we improve the best reported accuracies in the literature by about 20% and 30% respectively.
This paper concerns the performance of the LASSO (also knows as basis pursuit denoising) for recovering sparse signals from undersampled, randomized, noisy measurements. We consider the recovery of the signal $x_o \in \mathbb{R}^N$ from $n$ random and noisy linear observations $y= Ax_o + w$, where $A$ is the measurement matrix and $w$ is the noise. The LASSO estimate is given by the solution to the optimization problem $x_o$ with $\hat{x}_{\lambda} = \arg \min_x \frac{1}{2} \|y-Ax\|_2^2 + \lambda \|x\|_1$. Despite major progress in the theoretical analysis of the LASSO solution, little is known about its behavior as a function of the regularization parameter $\lambda$. In this paper we study two questions in the asymptotic setting (i.e., where $N \rightarrow \infty$, $n \rightarrow \infty$ while the ratio $n/N$ converges to a fixed number in $(0,1)$): (i) How does the size of the active set $\|\hat{x}_\lambda\|_0/N$ behave as a function of $\lambda$, and (ii) How does the mean square error $\|\hat{x}_{\lambda} - x_o\|_2^2/N$ behave as a function of $\lambda$? We then employ these results in a new, reliable algorithm for solving LASSO based on approximate message passing (AMP).
[7] Maximin Analysis of Message Passing Algorithms for Recovering Block Sparse Signals by Armeen Taeb, Arian Maleki, Christoph Studer, Richard Baraniuk.
[9] Probabilistic Reconstruction in Compressed Sensing: Algorithms, Phase Diagrams, and Threshold Achieving Matrices by Florent Krzakala, Marc Mézard,François Sausset, Yifan Sun, Lenka Zdeborová


Anonymous said...

Wasn't the equivalence of AMP and LASSO proved in this 2010 paper:

Bayati and Montanari, "LASSO Risk of Gaussian Matrices," IEEE Transactions on Information Theory

Anonymous said...

I agree with the previous anonymous comment. In fact, I believe that any careful reader of the 2009 PNAS paper by Donoho et al. expected this sort of result.

Igor said...

Yes to the both of you. Those important papers are previous breakthroughs but do we agree that determining an actual parameter thresholding policy is important because it provides a very large potential user base with a clear implementation detail ? (it's an open question).

PS: I will write an entry and mention this in this entry so that there is no confusion. Feel to answer my question above and I will include it in the next entry as well.