Tuesday, August 21, 2012

Streaming Data Mining Tutorial slides (and more)

Jelani Nelson.and Edo Liberty just released an important tutorial they gave at KDD 12 on the state of the art and practical algorithms used in mining streaming data,  it is entitled: Streaming Data Mining  I personally marvel at the development of these deep algorithms which, because of the large data streams constraints, get to redefine what it means to do seemingly simple functions such as counting in the Big Data world. Here are some slides that got my interest, but the 111 pages are worth the read:  

of related interest we have the recent Improved Concentration Bounds for Count-Sketch by Gregory T. Minton, Eric Price. The abstract reads: 

We present a refined analysis of the classic Count-Sketch streaming heavy hitters algorithm [CCF02]. Count-Sketch uses O(k log n) linear measurements of a vector x in R^n to give an estimate x' of x. The standard analysis shows that this estimate x' satisfies ||x'-x||_infty^2 < ||x_tail||_2^2 / k, where x_tail is the vector containing all but the largest k coordinates of x. Our main result is that most of the coordinates of x' have substantially less error than this upper bound; namely, for any c < O(log n), we show that each coordinate i satisfies
(x'_i - x_i)^2 < (c/log n) ||x_tail||_2^2/k with probability 1-2^{-c}, as long as the hash functions are fully independent. This subsumes the previous bound. Our improved point estimates also give better results for l_2 recovery of exactly k-sparse estimates x^* when x is drawn from a distribution with suitable decay, such as a power law.
Our proof shows that any random variable with positive real Fourier transform and finite variance concentrates around 0 at least as well as a Gaussian. This result, which may be of independent interest, gives good concentration even when the noise does not converge to a Gaussian.

For additional reference:

Already earlier this month, we saw this subject area in different meetings and blog entries:

Let us first recall that in Coding, Complexity, and Sparsity Workshop SlidesPiotr Indyk presented a Tutorial on compressive sensing/streaming algorithms

The count-min sketch and related algorithms have a dedicated page. Of related interest is the excellent Probabilistic Data Structures for Web Analytics and Data Mining blog entry by Ilya Katsov

More recent blog entries by Andrew McGregor, Suresh Venkatasubramanian and Anna Gilbert summarized  the recent Dortmund meeting:
Of interest are the slides in the following meetings:

Petros Drineas talked about Randomized Algorithms in Data Mining: a Linear Algebraic Approach

We recently also covered similar work on Nuit Blanche on the subject of Randomization and Linear Algebra:

No comments: