Monday, January 06, 2014

Frugal Streaming and Sparse Recovery with Very Sparse Compressed Counting

For some processes, the foremost issue is that you can only see the data once. The data stream is so large that nothing can be stored, yet we want to know something about it. It's a little bit like the brain: memory probably has a very small footprint and slow processes are at play when it is involved, yet it is able to perform a fantastically large number of tasks. One generally wonders if there is some algorithmic equivalent to these "simple" natural models. In that category, I am biased toward Count-Min sketches, Compressed Counting, streaming, ... Heavy hitters, for instance, have a direct connection to compressive sensing. Frugal streaming [1] is one of the newer instances. It is featured in a recent blog entry, Sketch of the Day: Frugal Streaming by Matt Curcio, where he talks about:

  • One “bit” – Frugal-1U
  • Finding Quantiles With Two “bits” – Frugal-2U
He also has a script you can try in the browser; check the comments as well.
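The core of Frugal-1U is almost absurdly simple. Here is a minimal Python sketch of the algorithm from [1] (my own rendering, not the authors' code): a single integer estimate drifts up or down on each stream item, with the drift probabilities biased so that it settles near the requested quantile.

```python
import random

def frugal_1u(stream, q=0.5, seed=None):
    """Frugal-1U: estimate the q-quantile of a stream with one counter."""
    rnd = random.Random(seed)
    m = 0  # the single unit of memory: the current estimate
    for s in stream:
        r = rnd.random()
        if s > m and r > 1 - q:
            m += 1  # step up: fires with probability q when s > m
        elif s < m and r > q:
            m -= 1  # step down: fires with probability 1 - q when s < m
    return m

# Example: the estimate drifts toward the true median (~100) of the stream.
print(frugal_1u((random.randint(1, 200) for _ in range(100000))))
```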

For a bird's-eye view of some of these techniques, you might want to read this excellent entry on Probabilistic Data Structures for Web Analytics and Data Mining, which features several aspects of counting:
  • Cardinality Estimation: Linear Counting
  • Cardinality Estimation: Loglog Counting
  • Frequency Estimation: Count-Min Sketch
  • Frequency Estimation: Count-Mean-Min Sketch
  • Heavy Hitters: Count-Min Sketch
  • Heavy Hitters: Stream-Summary
  • Range Query: Array of Count-Min Sketches
  • Membership Query: Bloom Filter
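Since the Count-Min sketch appears three times in that list, a minimal self-contained Python sketch may help fix ideas (an illustration of the data structure, not the code from that entry): d salted hash rows of width w, where a point query returns the minimum over the rows, a one-sided overestimate of the true count.

```python
import random

class CountMinSketch:
    """Minimal Count-Min sketch: `depth` hash rows of width `width`."""

    def __init__(self, width=1000, depth=5, seed=0):
        rng = random.Random(seed)
        self.width, self.depth = width, depth
        self.tables = [[0] * width for _ in range(depth)]
        # Independent random salts emulate a family of hash functions.
        self.salts = [rng.getrandbits(64) for _ in range(depth)]

    def _index(self, item, row):
        return hash((self.salts[row], item)) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.tables[row][self._index(item, row)] += count

    def query(self, item):
        # The true count never exceeds any row's cell, so the minimum
        # over the rows is a one-sided overestimate.
        return min(self.tables[row][self._index(item, row)]
                   for row in range(self.depth))

cms = CountMinSketch()
for word in ["a", "b", "a", "c", "a"]:
    cms.add(word)
print(cms.query("a"))  # 3 (exact here; an overestimate under collisions)
```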
There is also what is called Compressed Counting:

Sparse Recovery with Very Sparse Compressed Counting by Ping Li, Cun-Hui Zhang, Tong Zhang

Compressed sensing (sparse signal recovery) often encounters nonnegative data (e.g., images). Recently we developed the methodology of using (dense) Compressed Counting for recovering nonnegative K-sparse signals. In this paper, we adopt very sparse Compressed Counting for nonnegative signal recovery. Our design matrix is sampled from a maximally-skewed p-stable distribution (0 < p < 1), and we sparsify the design matrix so that on average (1-g)-fraction of the entries become zero. The idea is related to very sparse stable random projections (Li et al 2006 and Li 2007), the prior work for estimating summary statistics of the data.
In our theoretical analysis, we show that, when p goes to 0, it suffices to use M = K/(1-exp(-gK)) log N measurements, so that all coordinates can be recovered in one scan of the coordinates. If g = 1 (i.e., dense design), then M = K log N. If g = 1/K or 2/K (i.e., very sparse design), then M = 1.58 K log N or M = 1.16 K log N. This means the design matrix can be indeed very sparse at only a minor inflation of the sample complexity.
Interestingly, as p goes to 1, the required number of measurements is essentially M = 2.7 K log N, provided g = 1/K. It turns out that this result is a general worst-case bound.
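As a quick sanity check of the constants in the abstract (my own illustration, not the authors' code): the factor K/(1-exp(-gK)) evaluates to 1.58 K at g = 1/K and 1.16 K at g = 2/K. The snippet below also draws an example of such a sparsified design matrix, using SciPy's levy_stable for the maximally-skewed p-stable entries; the paper's exact parameterization may differ.

```python
import numpy as np
from scipy.stats import levy_stable

def measurements(K, N, g):
    # The bound quoted above (p -> 0): M = K / (1 - exp(-g K)) * log N
    return K / (1.0 - np.exp(-g * K)) * np.log(N)

K, N = 10, 1000
for g in (1.0, 1.0 / K, 2.0 / K):
    print("g = %.2f -> M = %.2f * K log N"
          % (g, measurements(K, N, g) / (K * np.log(N))))
# g = 1.00 -> 1.00,  g = 0.10 -> 1.58,  g = 0.20 -> 1.16 (times K log N)

# An illustrative sparsified design matrix: maximally-skewed (beta = 1)
# p-stable entries, each kept with probability g and zeroed otherwise,
# so on average a (1-g)-fraction of the entries are zero.
p, g = 0.3, 1.0 / K
M = int(np.ceil(measurements(K, N, g)))
rng = np.random.default_rng(0)
dense = levy_stable.rvs(p, 1.0, size=(M, N), random_state=rng)
Phi = dense * (rng.random((M, N)) < g)
```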
I note this footnote: "2 This report does not include comparisons with the SMP algorithm [1, 5], as we can not run the code from [...] at the moment. We will provide the comparisons after we are able to execute the code. We thank the communications with the author of [1, 5]."

That SMP reference is, to my knowledge, the first one using sparse measurement matrices for images (positive unknowns). I sure look forward to seeing how SMP compares to this new approach. The SMP wiki was featured here on Nuit Blanche back in 2008. Sparse measurement matrices are stunning now in 2014; they were already stunning in 2008.

