Nuit Blanche: Dimensionality Reduction for k-Means Clustering and Low Rank Approximation

Thursday, July 30, 2015

Dimensionality Reduction for k-Means Clustering and Low Rank Approximation

Random Projections as rank-k projection-cost preserving sketches (Th12) ! From the paper:

We start by noting that both problems are special cases of a general constrained k-rank approximation problem [DFK+04], which also includes problems related to sparse and nonnegative PCA[ PDK13, YZ13, APD14]. Then, following the coreset definitions of [FSS13], we introduce the concept of a projection-cost preserving sketch, an approximation where the sum of squared distances of A’s columns from any k-dimensional subspace (plus a fixed constant independent of the subspace) is multiplicatively close to that of A. This ensures that the cost of any k-rank projection of A is well approximated by A and thus, we can solve the general constrained k-rank approximation problem approximately for A using A.Next, we give several simple and efficient approaches for obtaining projection-cost preserving sketches with (1 + ǫ) relative error. All of these techniques simply require computing an SVD, multiplying by a random projection, random sampling, or some combination of the three.

Without further ado:

Dimensionality Reduction for k-Means Clustering and Low Rank Approximation by Michael B. Cohen, Sam Elder, Cameron Musco, Christopher Musco, Madalina Persu

We show how to approximate a data matrix A with a much smaller sketch A~ that can be used to solve a general class of constrained k-rank approximation problems to within (1+ϵ) error. Importantly, this class of problems includes k-means clustering and unconstrained low rank approximation (i.e. principal component analysis). By reducing data points to just O(k) dimensions, our methods generically accelerate any exact, approximate, or heuristic algorithm for these ubiquitous problems.
For k-means dimensionality reduction, we provide (1+ϵ) relative error results for many common sketching techniques, including random row projection, column selection, and approximate SVD. For approximate principal component analysis, we give a simple alternative to known algorithms that has applications in the streaming setting. Additionally, we extend recent work on column-based matrix reconstruction, giving column subsets that not only `cover' a good subspace for $\bv{A}$, but can be used directly to compute this subspace.
Finally, for k-means clustering, we show how to achieve a (9+ϵ) approximation by Johnson-Lindenstrauss projecting data points to just O(logk/ϵ2) dimensions. This gives the first result that leverages the specific structure of k-means to achieve dimension independent of input size and sublinear in k.