Tuesday, March 28, 2017

Preconditioned Data Sparsification for Big Data with Applications to PCA and K-means - implementation -

Stephen just sent me the following:

Hi Igor,
Hope all is going well and that springtime is coming to Paris.
My student Farhad Pourkamali-Anaraki and I have a recent paper that may be interesting for the blog. It was just published in the IEEE Transactions Information Theory journal (https://doi.org/10.1109/TIT.2017.2672725) but we also have a free version at https://arxiv.org/abs/1511.00152, as well as open source code at https://github.com/stephenbeckr/SparsifiedKMeans. We discuss several applications in the paper, and we think that applying our method to K-means leads to one of the fastest big-data K-means algorithms.

Yes, Stephen, springtime is coming to Paris ! Thanks for the heads-up.

We analyze a compression scheme for large data sets that randomly keeps a small percentage of the components of each data sample. The benefit is that the output is a sparse matrix and therefore subsequent processing, such as PCA or K-means, is significantly faster, especially in a distributed-data setting. Furthermore, the sampling is single-pass and applicable to streaming data. The sampling mechanism is a variant of previous methods proposed in the literature combined with a randomized preconditioning to smooth the data. We provide guarantees for PCA in terms of the covariance matrix, and guarantees for K-means in terms of the error in the center estimators at a given step. We present numerical evidence to show both that our bounds are nearly tight and that our algorithms provide a real benefit when applied to standard test data sets, as well as providing certain benefits over related sampling approaches.

The implementation is here: https://github.com/stephenbeckr/SparsifiedKMeans

No comments: