Jelani Nelson and Edo Liberty just released an important tutorial they gave at KDD 2012 on the state of the art in practical algorithms for mining streaming data; it is entitled: Streaming Data Mining. I personally marvel at the development of these deep algorithms which, because of the constraints of large data streams, get to redefine what it means to perform seemingly simple functions, such as counting, in the Big Data world. Here are some slides that caught my interest, but the 111 pages are worth the read:
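To make "counting under streaming constraints" concrete, here is a minimal Morris approximate counter in Python, which tracks a count of n using only an exponent of about log log n bits. This is my own illustrative sketch, not code from the tutorial:

```python
import random

class MorrisCounter:
    """Approximate counter (Morris, 1978): store only an exponent x,
    and estimate the true count as 2^x - 1."""

    def __init__(self):
        self.x = 0

    def increment(self):
        # Raise the exponent with probability 2^-x, so that
        # E[2^x - 1] equals the true number of increments.
        if random.random() < 2.0 ** (-self.x):
            self.x += 1

    def estimate(self):
        return 2 ** self.x - 1


random.seed(0)
c = MorrisCounter()
for _ in range(100_000):
    c.increment()
print(c.estimate())  # an unbiased (if noisy) estimate of 100,000
```

The estimate has constant relative variance; in practice one averages several independent counters to sharpen it.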
Of related interest, we have the recent Improved Concentration Bounds for Count-Sketch by Gregory T. Minton and Eric Price. The abstract reads:
We present a refined analysis of the classic Count-Sketch streaming heavy hitters algorithm [CCF02]. Count-Sketch uses O(k log n) linear measurements of a vector x in R^n to give an estimate x' of x. The standard analysis shows that this estimate x' satisfies ||x'-x||_infty^2 < ||x_tail||_2^2 / k, where x_tail is the vector containing all but the largest k coordinates of x. Our main result is that most of the coordinates of x' have substantially less error than this upper bound; namely, for any c < O(log n), we show that each coordinate i satisfies (x'_i - x_i)^2 < (c/log n) ||x_tail||_2^2 / k with probability 1-2^{-c}, as long as the hash functions are fully independent. This subsumes the previous bound. Our improved point estimates also give better results for l_2 recovery of exactly k-sparse estimates x^* when x is drawn from a distribution with suitable decay, such as a power law. Our proof shows that any random variable with positive real Fourier transform and finite variance concentrates around 0 at least as well as a Gaussian. This result, which may be of independent interest, gives good concentration even when the noise does not converge to a Gaussian.
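To make the object under analysis concrete, here is a minimal Count-Sketch in Python: d rows of w counters, each item hashed per row to a bucket and a random sign, with the point estimate taken as the median over rows. This is an illustrative sketch of the [CCF02] data structure, not the paper's code:

```python
import random

class CountSketch:
    """Minimal Count-Sketch: estimate x_i as the median over d rows
    of s(i) * row[h(i)], where h is a bucket hash and s a sign hash."""

    def __init__(self, d=5, w=256, seed=0):
        rng = random.Random(seed)
        self.d, self.w = d, w
        self.rows = [[0] * w for _ in range(d)]
        # Per-row random salts standing in for independent hash functions.
        self.salts = [rng.getrandbits(64) for _ in range(d)]

    def _hash(self, item, r):
        h = hash((self.salts[r], item))
        return (h >> 1) % self.w, (1 if h & 1 else -1)

    def update(self, item, delta=1):
        for r in range(self.d):
            b, s = self._hash(item, r)
            self.rows[r][b] += s * delta

    def estimate(self, item):
        vals = sorted(s * self.rows[r][b]
                      for r in range(self.d)
                      for b, s in [self._hash(item, r)])
        return vals[len(vals) // 2]


cs = CountSketch()
for _ in range(1000):
    cs.update("heavy")          # one heavy hitter
for i in range(100):
    cs.update("noise-%d" % i)   # light tail of the stream
print(cs.estimate("heavy"))     # close to 1000
```

The random signs make collisions cancel in expectation, which is exactly what the improved analysis above exploits coordinate by coordinate.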
For additional reference:
Earlier this month, we already saw this subject area covered in several meetings and blog entries:
Let us first recall that in Coding, Complexity, and Sparsity Workshop Slides, Piotr Indyk presented a tutorial on compressive sensing/streaming algorithms.
The count-min sketch and related algorithms have a dedicated page. Of related interest is the excellent Probabilistic Data Structures for Web Analytics and Data Mining blog entry by Ilya Katsov.
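For contrast with Count-Sketch, here is a minimal Count-Min sketch in Python, which uses no signs and takes the minimum over rows, so it only ever over-counts. Again an illustrative sketch, not code from the linked page:

```python
import random

class CountMinSketch:
    """Minimal Count-Min sketch: each of d rows hashes the item to one
    of w counters; the estimate is the minimum over rows (never an
    under-estimate for non-negative streams)."""

    def __init__(self, d=4, w=512, seed=0):
        rng = random.Random(seed)
        self.d, self.w = d, w
        self.rows = [[0] * w for _ in range(d)]
        self.salts = [rng.getrandbits(64) for _ in range(d)]

    def _bucket(self, item, r):
        return hash((self.salts[r], item)) % self.w

    def update(self, item, delta=1):
        for r in range(self.d):
            self.rows[r][self._bucket(item, r)] += delta

    def estimate(self, item):
        return min(self.rows[r][self._bucket(item, r)]
                   for r in range(self.d))


cms = CountMinSketch()
for _ in range(500):
    cms.update("heavy")
for i in range(200):
    cms.update("noise-%d" % i)
print(cms.estimate("heavy"))  # at least 500, with small additive over-count
```

The one-sided error is what makes Count-Min so convenient for web-analytics style frequency queries.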
More recent blog entries by Andrew McGregor, Suresh Venkatasubramanian and Anna Gilbert summarized the recent Dortmund meeting:
Of interest are the slides in the following meetings:
- The MMDS 2012 Slides are out! Workshop on Algorithms for Modern Massive Data Sets
- Sublinear Algorithms 2011 Presentation Slides.
and videos and slides from the UC Berkeley Conference: "From Data to Knowledge: Machine-Learning with Real-time and Streaming Applications", with all the videos here: Part 1, Part 2, Part 3, Part 4, Part 5.
In particular, in Part 3 of the videos, Petros Drineas talked about Randomized Algorithms in Data Mining: a Linear Algebraic Approach.
We recently also covered similar work on Nuit Blanche on the subject of Randomization and Linear Algebra:
- Fast approximation of matrix coherence and statistical leverage, by Michael Mahoney, Petros Drineas, Malik Magdon-Ismail, and David Woodruff
- Parallel implementation of a fast randomized algorithm for the decomposition of low rank matrices, by Andrew Lucas, Mark Stalzer, and John Feo.
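The core trick behind these fast randomized low-rank decompositions can be sketched in a few lines of NumPy: multiply the matrix by a random Gaussian test matrix and orthonormalize the result to capture its range. This is a hedged illustration of the general randomized range-finder idea, not the code from the paper above:

```python
import numpy as np

def randomized_range_finder(A, k, p=5, seed=0):
    """Return an orthonormal basis Q approximately spanning the range
    of A, using a random Gaussian sketch with k + p columns
    (p is a small oversampling parameter)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    omega = rng.standard_normal((n, k + p))  # random test matrix
    Q, _ = np.linalg.qr(A @ omega)           # orthonormalize the sketch
    return Q

# Usage: a matrix of exact rank 10 is recovered essentially perfectly.
rng = np.random.default_rng(1)
A = rng.standard_normal((200, 10)) @ rng.standard_normal((10, 150))
Q = randomized_range_finder(A, k=10)
err = np.linalg.norm(A - Q @ (Q.T @ A))
print(err)  # near machine precision: Q captures the range of A
```

From Q one then gets a small factorization (e.g. an SVD of Q.T @ A) at a fraction of the cost of a full decomposition, which is what makes these methods attractive for massive data sets.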
Liked this entry? Subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on LinkedIn.