Stephen just sent me the following:
Hope all is going well and that springtime is coming to Paris.
My student Farhad Pourkamali-Anaraki and I have a recent paper that may be interesting for the blog. It was just published in the IEEE Transactions Information Theory journal (https://doi.org/10.1109/TIT.2017.2672725) but we also have a free version at https://arxiv.org/abs/1511.00152, as well as open source code at https://github.com/stephenbeckr/SparsifiedKMeans. We discuss several applications in the paper, and we think that applying our method to K-means leads to one of the fastest big-data K-means algorithms.
Preconditioned Data Sparsification for Big Data with Applications to PCA and K-means by Farhad Pourkamali-Anaraki, Stephen Becker
We analyze a compression scheme for large data sets that randomly keeps a small percentage of the components of each data sample. The benefit is that the output is a sparse matrix and therefore subsequent processing, such as PCA or K-means, is significantly faster, especially in a distributed-data setting. Furthermore, the sampling is single-pass and applicable to streaming data. The sampling mechanism is a variant of previous methods proposed in the literature combined with a randomized preconditioning to smooth the data. We provide guarantees for PCA in terms of the covariance matrix, and guarantees for K-means in terms of the error in the center estimators at a given step. We present numerical evidence to show both that our bounds are nearly tight and that our algorithms provide a real benefit when applied to standard test data sets, as well as providing certain benefits over related sampling approaches.
The implementation is here: https://github.com/stephenbeckr/SparsifiedKMeans
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.