Nuit Blanche: SketchMLbox: A MATLAB toolbox for large-scale mixture learning - implementation - ( Compressive K-means , Sketching for Large-Scale Learning of Mixture Models )

Wednesday, December 28, 2016


Here is a new sketching toolbox: SketchMLbox:

SketchMLbox

A MATLAB toolbox for large-scale mixture learning

Purpose:
The SketchMLbox is a MATLAB toolbox for fitting mixture models to large databases using sketching techniques.
The database is first compressed into a single vector called a sketch, then a mixture model (e.g. a Gaussian Mixture Model) is estimated from this sketch using greedy algorithms typical of sparse recovery.
The size of the sketch does not depend on the number of elements in the database, but rather on the complexity of the problem at hand [2,3]. Its computation can be massively parallelized and distributed over several units. It can also be maintained in an online setting at low cost.
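
To make the "compress, then merge" idea concrete, here is a minimal Python sketch. It is not the toolbox's MATLAB code, and every function name in it is hypothetical. It assumes the sketch is an average of complex exponentials exp(-i w_j^T x) over the data, with frequencies w_j drawn from a plain Gaussian (a simplification; the papers discuss better-suited frequency distributions). The point is only that partial sums computed on separate chunks can be added together, which is what makes distributed and online computation cheap.

```python
import cmath
import random

def draw_frequencies(m, d, scale=1.0, seed=0):
    # Assumption: i.i.d. Gaussian frequencies, one d-dimensional vector per sketch entry.
    rng = random.Random(seed)
    return [[rng.gauss(0.0, scale) for _ in range(d)] for _ in range(m)]

def sketch_sum(points, freqs):
    # Unnormalized sketch: for each frequency w_j, the sum over points of exp(-i <w_j, x>).
    sums = [0j] * len(freqs)
    for x in points:
        for j, w in enumerate(freqs):
            phase = sum(wk * xk for wk, xk in zip(w, x))
            sums[j] += cmath.exp(-1j * phase)
    return sums

def merge(partial_a, partial_b):
    # A partial sketch is (sums, count); merging adds both parts.
    (sums_a, n_a), (sums_b, n_b) = partial_a, partial_b
    return [a + b for a, b in zip(sums_a, sums_b)], n_a + n_b

def finalize(partial):
    # Normalize by the number of points seen so far.
    sums, n = partial
    return [s / n for s in sums]

rng = random.Random(1)
data = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(1000)]
freqs = draw_frequencies(m=20, d=2)

# Sketch computed in one pass over all the data.
full = finalize((sketch_sum(data, freqs), len(data)))

# Same sketch computed on two chunks (e.g. two machines), then merged.
part1 = (sketch_sum(data[:400], freqs), 400)
part2 = (sketch_sum(data[400:], freqs), 600)
merged = finalize(merge(part1, part2))

max_gap = max(abs(a - b) for a, b in zip(full, merged))
print(len(full), max_gap)
```

The sketch has 20 entries whether the dataset holds a thousand points or a billion, and the merged result matches the single-pass one up to floating-point rounding.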

Mixtures of Diracs ("K-means") and Gaussian Mixture Models with diagonal covariance are currently available; the toolbox is structured so that new mixture models can be easily implemented.

Details can be found in the following papers:
[1] Keriven N., Bourrier A., Gribonval R., Pérez P., "Sketching for Large-Scale Learning of Mixture Models", ICASSP 2016.
[2] Keriven N., Bourrier A., Gribonval R., Pérez P., "Sketching for Large-Scale Learning of Mixture Models", arXiv:1606.02838 (extended version).
[3] Keriven N., Tremblay N., Traonmilin Y., Gribonval R., "Compressive K-means", arXiv:1610.08738.

The attendant papers are:

Compressive K-means by Nicolas Keriven, Nicolas Tremblay, Yann Traonmilin, Rémi Gribonval

The Lloyd-Max algorithm is a classical approach to perform K-means clustering. Unfortunately, its cost becomes prohibitive as the training dataset grows large. We propose a compressive version of K-means (CKM), that estimates cluster centers from a sketch, i.e. from a drastically compressed representation of the training dataset. We demonstrate empirically that CKM performs similarly to Lloyd-Max, for a sketch size proportional to the number of centroids times the ambient dimension, and independent of the size of the original dataset. Given the sketch, the computational complexity of CKM is also independent of the size of the dataset. Unlike Lloyd-Max which requires several replicates, we further demonstrate that CKM is almost insensitive to initialization. For a large dataset of 10^7 data points, we show that CKM can run two orders of magnitude faster than five replicates of Lloyd-Max, with similar clustering performance on artificial data. Finally, CKM achieves lower classification errors on handwritten digits classification.
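
As a rough illustration of what "estimating cluster centers from a sketch" means, the Python snippet below evaluates a sketch-matching cost: the squared distance between the data sketch and the sketch of a candidate mixture of Diracs. This is only the kind of objective such a method tries to minimize, not the CKM solver itself, and all names are hypothetical. It assumes a sketch made of averaged complex exponentials with Gaussian frequencies; the factor 5 in the sketch size is an arbitrary illustrative choice.

```python
import cmath
import random

def empirical_sketch(points, freqs):
    # Average of exp(-i <w, x>) over the dataset, one entry per frequency w.
    n = len(points)
    return [sum(cmath.exp(-1j * sum(wk * xk for wk, xk in zip(w, x)))
                for x in points) / n
            for w in freqs]

def dirac_mixture_sketch(centroids, weights, freqs):
    # Sketch of a weighted sum of Diracs located at the centroids.
    return [sum(a * cmath.exp(-1j * sum(wk * ck for wk, ck in zip(w, c)))
                for a, c in zip(weights, centroids))
            for w in freqs]

def sketch_mismatch(z, centroids, weights, freqs):
    # Squared Euclidean distance between the data sketch and the model sketch.
    model = dirac_mixture_sketch(centroids, weights, freqs)
    return sum(abs(a - b) ** 2 for a, b in zip(z, model))

rng = random.Random(0)
K, d = 2, 2
true_centroids = [[-2.0, 0.0], [2.0, 1.0]]
# Tight clusters around the true centroids, alternating between the two.
points = [[c + rng.gauss(0, 0.05) for c in true_centroids[i % K]]
          for i in range(2000)]

m = 5 * K * d  # sketch size proportional to K*d (the factor 5 is an assumption)
freqs = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(m)]
z = empirical_sketch(points, freqs)  # after this line the raw points are no longer needed

good = sketch_mismatch(z, true_centroids, [0.5, 0.5], freqs)
bad = sketch_mismatch(z, [[0.0, 0.0], [0.5, -1.0]], [0.5, 0.5], freqs)
print(good, bad)
```

Once `z` is computed, evaluating a candidate set of centroids costs the same whatever the size of the original dataset, which is the property the abstract emphasizes.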

Sketching for Large-Scale Learning of Mixture Models by Nicolas Keriven, Anthony Bourrier, Rémi Gribonval, Patrick Pérez

Learning parameters from voluminous data can be prohibitive in terms of memory and computational requirements. We propose a "compressive learning" framework where we estimate model parameters from a sketch of the training data. This sketch is a collection of generalized moments of the underlying probability distribution of the data. It can be computed in a single pass on the training set, and is easily computable on streams or distributed datasets. The proposed framework shares similarities with compressive sensing, which aims at drastically reducing the dimension of high-dimensional signals while preserving the ability to reconstruct them. To perform the estimation task, we derive an iterative algorithm analogous to sparse reconstruction algorithms in the context of linear inverse problems. We exemplify our framework with the compressive estimation of a Gaussian Mixture Model (GMM), providing heuristics on the choice of the sketching procedure and theoretical guarantees of reconstruction. We experimentally show on synthetic data that the proposed algorithm yields results comparable to the classical Expectation-Maximization (EM) technique while requiring significantly less memory and fewer computations when the number of database elements is large. We further demonstrate the potential of the approach on real large-scale data (over 10^8 training samples) for the task of model-based speaker verification. Finally, we draw some connections between the proposed framework and approximate Hilbert space embedding of probability distributions using random features. We show that the proposed sketching operator can be seen as an innovative method to design translation-invariant kernels adapted to the analysis of GMMs. We also use this theoretical framework to derive information preservation guarantees, in the spirit of infinite-dimensional compressive sensing.
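
One reason Gaussians pair well with this kind of sketch: if the generalized moments are taken to be samples of the characteristic function (an assumption of this example, as is the exp(-i w^T x) sign convention), the sketch of a Gaussian has a closed form, so a candidate model can be compared with the data sketch without revisiting the data. The Python snippet below checks this numerically for a single diagonal-covariance Gaussian; function names are hypothetical and this is not the toolbox's code.

```python
import cmath
import math
import random

def gaussian_sketch(mu, var, freqs):
    # Closed form of E[exp(-i <w, x>)] for x ~ N(mu, diag(var)):
    # exp(-i <w, mu>) * exp(-0.5 * sum_k w_k^2 * var_k).
    out = []
    for w in freqs:
        phase = sum(wk * mk for wk, mk in zip(w, mu))
        damping = math.exp(-0.5 * sum(wk * wk * vk for wk, vk in zip(w, var)))
        out.append(damping * cmath.exp(-1j * phase))
    return out

def empirical_sketch(points, freqs):
    # Average of exp(-i <w, x>) over the samples, one entry per frequency w.
    n = len(points)
    return [sum(cmath.exp(-1j * sum(wk * xk for wk, xk in zip(w, x)))
                for x in points) / n
            for w in freqs]

rng = random.Random(0)
mu, var = [1.0, -0.5], [0.5, 2.0]
points = [[rng.gauss(m, math.sqrt(v)) for m, v in zip(mu, var)]
          for _ in range(5000)]
freqs = [[rng.gauss(0, 1) for _ in range(2)] for _ in range(10)]

closed = gaussian_sketch(mu, var, freqs)
empirical = empirical_sketch(points, freqs)
max_err = max(abs(a - b) for a, b in zip(closed, empirical))
print(max_err)
```

Because the sketch is linear in the distribution, the sketch of a GMM is the weighted sum of its components' sketches, so the same closed form covers mixtures.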

h/t Ravi
