To achieve our goal, we propose to use an objective akin to (11), where the masks are now random variables independent from the samples. In other words, we want to combine ideas of online dictionary learning with random subsampling, in a principled manner. This leads us to consider an infinite stream of samples (M_t x_t)_{t ≥ 0}, where the signals x_t are i.i.d. samples from the data distribution (that is, a column of X selected at random), and M_t “selects” a random subset of observed entries in X. This setting can accommodate missing entries, never selected by the mask, and only requires loading a subset of x_t at each iteration.
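The sampling scheme in the excerpt above can be illustrated with a small NumPy sketch. This is not the authors' MODL code: the function name `masked_stream`, the `subsample_ratio` parameter, and the choice of a uniformly drawn fixed-size row subset for the mask are assumptions made for illustration only. It shows only how a column and an independent mask could be drawn at each iteration, so that a fraction of each sample is loaded.

```python
import numpy as np

def masked_stream(X, subsample_ratio, n_iter, seed=0):
    """Yield (column index, kept row indices, kept values) for n_iter steps.

    Hypothetical helper: the mask M_t is modelled as a uniformly random
    subset of rows, drawn independently of the sampled column x_t.
    """
    rng = np.random.default_rng(seed)
    n_rows, n_cols = X.shape
    # Number of entries kept per sample (assumed fixed, at least 1).
    n_keep = max(1, int(n_rows / subsample_ratio))
    for _ in range(n_iter):
        j = rng.integers(n_cols)  # x_t: a column of X selected at random
        rows = np.sort(rng.choice(n_rows, size=n_keep, replace=False))  # M_t
        # Only the masked entries of the column are read.
        yield j, rows, X[rows, j]

X = np.arange(20.0).reshape(5, 4)
samples = list(masked_stream(X, subsample_ratio=2, n_iter=3))
```

Each element of `samples` carries only `n_keep` entries of the chosen column, which is the property the excerpt relies on to reduce the per-iteration cost. Handling of genuinely missing entries is left out of this sketch.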
Dictionary Learning for Massive Matrix Factorization by Arthur Mensch, Julien Mairal, Bertrand Thirion, Gaël Varoquaux
Sparse matrix factorization is a popular tool to obtain interpretable data decompositions, which are also effective to perform data completion or denoising. Its applicability to large datasets has been addressed with online and randomized methods, which reduce the complexity in one of the matrix dimensions, but not in both of them. In this paper, we tackle very large matrices in both dimensions. We propose a new factorization method that scales gracefully to terabyte-scale datasets that could not be processed by previous algorithms in a reasonable amount of time. We demonstrate the efficiency of our approach on massive functional Magnetic Resonance Imaging (fMRI) data, and on matrix completion problems for recommender systems, where we obtain significant speed-ups compared to state-of-the-art coordinate descent methods.
MODL, or Masked Online Dictionary Learning, is available on GitHub.