Thursday, September 22, 2011

Matrix Factorization This Week

What are the news in Matrix Factorization this week ? well,

Here is some of the disconnect I am beginning to see. On the one hand you have people pushing for a new variety of matrix factorizations based on sparsity, rank, structured sparsity and so on. On the other, many implementations in python, R or other languages are lagging and many implementers (I did not say all) seem to think that the term "matrix factorization" is equivalent to NMF. Maybe the next GSOC projects, in whatever language, should pay more attention to this shift and to the need to make a traceable connection between implementation and actual algorithms. In the case of sparse PCA, some of the PCA algos don't even seem to talk about the same beast.

Anyway, here are the different papers that showed up on my radar screen this week: Let us first start with a poster on an algorithm that aims at speeding up the .... NMF :-)

Statistical Optimization of Non-Negative Matrix Factorization by Anoop Korattikara, Levi Boyles, Max Welling, Jingu Kim, Haesun Park.

We also have:

Low-rank matrix recovery via iteratively reweighted least squares minimization with Massimo Fornasier and Holger Rauhut, Rachel Ward. The abstract reads:
We present and analyze an e fficient implementation of an iteratively reweighted least squares algorithm for recovering a matrix from a small number of linear measurements. The algorithm is designed for the simultaneous promotion of both a minimal nuclear norm and an approximatively low-rank solution. Under the assumption that the linear measurements ful fill a suitable generalization of the Null Space Property known in the context of compressed sensing, the algorithm is guaranteed to recover iteratively any matrix with an error of the order of the best k-rank approximation. In certain relevant cases, for instance for the matrix completion problem, our version of this algorithm can take advantage of the Woodbury matrix identity, which allows to expedite the solution of the least squares problems required at each iteration. We present numerical experiments which con rm the robustness of the algorithm for the solution of matrix completion problems, and demonstrate its competitiveness with respect to other techniques proposed recently in the literature.

The  implementation is here. It will be listed on the Matrix Factorization Jungle page.
Multi-label Learning via Structured Decomposition and Group Sparsity by Tianyi Zhou and Dacheng Tao. The abstract reads:
In multi-label learning, each sample is associated with several labels. Existing works indicate that exploring correlations between labels improve the prediction performance. However, embedding the label correlations into the training process significantly increases the problem size. Moreover, the mapping of the label structure in the feature space is not clear. In this paper, we propose a novel multi-label learning method Structured Decomposition + Group Sparsity (SDGS)''. In SDGS, we learn a feature subspace for each label from the structured decomposition of the training data, and predict the labels of a new sample from its group sparse representation on the multi-subspace obtained from the structured decomposition. In particular, in the training stage, we decompose the data matrix $X\in R^{n\times p}$ as $X=\sum_{i=1}^kL^i+S$, wherein the rows of $L^i$ associated with samples that belong to label $i$ are nonzero and consist a low-rank matrix, while the other rows are all-zeros, the residual $S$ is a sparse matrix. The row space of $L_i$ is the feature subspace corresponding to label $i$. This decomposition can be efficiently obtained via randomized optimization. In the prediction stage, we estimate the group sparse representation of a new sample on the multi-subspace via group \emph{lasso}. The nonzero representation coefficients tend to concentrate on the subspaces of labels that the sample belongs to, and thus an effective prediction can be obtained. We evaluate SDGS on several real datasets and compare it with popular methods. Results verify the effectiveness and efficiency of SDGS.
I am sure the code will be available soon.

Divide-and-Conquer Matrix Factorization by Lester MackeyAmeet TalwalkarMichael I. Jordan. The abstract reads:
This work introduces SUBMF, a parallel divide-and-conquer framework for noisy matrix factorization. SUBMF divides a large-scale matrix factorization task into smaller subproblems, solves each subproblem in parallel using an arbitrary base matrix factorization algorithm, and combines the subproblem solutions using techniques from randomized matrix approximation. Our experiments with collaborative ﬁltering, video background modeling, and simulated data demonstrate the near-linear to super-linear speed-ups attainable with this approach. Moreover, our analysis shows that SUBMF enjoys high probability recovery guarantees comparable to those of its base algorithm.

The SubMF site is here with the attendant code. It will be listed on the Matrix Factorization Jungle page.

Model Order Selection for Boolean Matrix Factorization by Pauli MiettinenJilles Vreeken. The abstract reads:
Matrix factorizations—where a given data matrix is approximated by a product of two or more factor matrices—are powerful data mining tools. Among other tasks, matrix factorizations are often used to separate global structure from noise. This, however, requires solving the ‘model order selection problem’ of determining where ﬁne-grained structure stops, and noise starts, i.e., what is the proper size of the factor matrices. Boolean matrix factorization (BMF)—where data, factors, and matrix product are Boolean—has received increased attention from the data mining community in recent years. The technique has desirable properties, such as high interpretability and natural sparsity. But so far no method for selecting the correct model order for BMF has been available. In this paper we propose to use the Minimum Description Length (MDL) principle for this task. Besides solving the problem, this well-founded approach has numerous beneﬁts, e.g., it is automatic, does not require a likelihood function, is fast, and, as experiments show, is highly accurate. We formulate the description length function for BMF in general— making it applicable for any BMF algorithm. We extend an existing algorithm for BMF to use MDL to identify the best Boolean matrix factorization, analyze the complexity of the problem, and perform an extensive experimental evaluation to study its behavior

The BMF code is here. It will also be listed on the Matrix Factorization Jungle page.

Nonnegative matrix factorization (NMF) is now a common tool for audio source separation. When learning NMF on large audio databases, one major drawback is that the complexity in time is O(F KN) when updating the dictionary (where (F, N) is the dimension of the input power
spectrograms, and K the number of basis spectra), thus forbidding its application on signals longer than an hour. We provide an online algorithm with a complexity of O(F K) in time and memory for updates in the dictionary. We show on audio simulations that the online approach is faster for short audio signals and allows to analyze audio signals of several hours.

We propose an unsupervised inference procedure for audio source separation. Components in nonnegative matrix factorization (NMF) are grouped automatically in audio sources via a penalized maximum likelihood approach. The penalty term we introduce favors sparsity at the group level, and is motivated by the assumption that the local amplitude of the sources are independent. Our algorithm extends multiplicative updates for NMF ; moreover we propose a test statistic to tune hyperparameters in our model, and illustrate its adequacy on synthetic data. Results on real audio tracks show that our sparsity prior allows to identify audio sources without knowledge on their spectral properties.

Task-Driven Dictionary Learning by Julien MairalFrancis Bach , Jean Ponce. The abstract reads:
Modeling data with linear combinations of a few elements from a learned dictionary has been the focus of much recent research in machine learning, neuroscience and signal processing. For signals such as natural images that admit such sparse representations, it is now well established that these models are well suited to restoration tasks. In this context, learning the dictionary amounts to solving a large-scale matrix factorization problem, which can be done efficiently with classical optimization tools. The same approach has also been used for learning features from data for other purposes, e.g., image classification, but tuning the dictionary in a supervised way for these tasks has proven to be more difficult. In this paper, we present a general formulation for supervised dictionary learning adapted to a wide variety of tasks, and present an efficient algorithm for solving the corresponding optimization problem. Experiments on handwritten digit classification, digital art identification, nonlinear inverse image problems, and compressed sensing demonstrate that our approach is effective in large-scale settings, and is well suited to supervised and semi-supervised classification, as well as regression tasks for data that admit sparse representations.

Distributed User Profiling via Spectral Methods by Dan-Cristian Tomozei, Laurent Massoulié. The abstract reads:
User profiling is a useful primitive for constructing personalised services, such as content recommendation. In the present paper we investigate the feasibility of user profiling in a distributed setting, with no central authority and only local information exchanges between users. We compute a profile vector for each user (i.e., a low-dimensional vector that characterises her taste) via spectral transformation of observed user-produced ratings for items. Our two main contributions follow: i) We consider a low-rank probabilistic model of user taste. More specifically, we consider that users and items are partitioned in a constant number of classes, such that users and items within the same class are statistically identical. We prove that without prior knowledge of the compositions of the classes, based solely on few random observed ratings (namely $O(N\log N)$ such ratings for $N$ users), we can predict user preference with high probability for unrated items by running a local vote among users with similar profile vectors. In addition, we provide empirical evaluations characterising the way in which spectral profiling performance depends on the dimension of the profile space. Such evaluations are performed on a data set of real user ratings provided by Netflix. ii) We develop distributed algorithms which provably achieve an embedding of users into a low-dimensional space, based on spectral transformation. These involve simple message passing among users, and provably converge to the desired embedding. Our method essentially relies on a novel combination of gossiping and the algorithm proposed by Oja and Karhunen.

We consider a class of learning problems regularized by a structured sparsity-inducing norm de-
ﬁned as the sum of ℓ2- or ℓ∞-norms over groups of variables. Whereas much effort has been put in developing fast optimization techniques when the groups are disjoint or embedded in a hierarchy, we address here the case of general overlapping groups. To this end, we present two different strategies: On the one hand, we show that the proximal operator associated with a sum of ℓ∞-norms can be computed exactly in polynomial time by solving a quadratic min-cost ﬂow problem, allowing the use of accelerated proximal gradient methods. On the other hand, we use proximal splitting techniques, and address an equivalent formulation with non-overlapping groups, but in higher dimension and with additional constraints. We propose efﬁcient and scalable algorithms exploiting these two strategies, which are signiﬁcantly faster than alternative approaches. We illustrate these methods with several problems such as CUR matrix factorization, multi-task learning of tree-structured dictionaries, background subtraction in video sequences, image denoising with wavelets, and topographic dictionary learning of natural image patches

Abstract: Joint analysis of data from multiple sources has the potential to improve our understanding of the underlying structures in complex data sets. For instance, in restaurant recommendation systems, recommendations can be based on rating histories of customers. In addition to rating histories, customers' social networks (e.g., Facebook friendships) and restaurant categories information (e.g., Thai or Italian) can also be used to make better recommendations. The task of fusing data, however, is challenging since data sets can be incomplete and heterogeneous, i.e., data consist of both matrices, e.g., the person by person social network matrix or the restaurant by category matrix, and higher-order tensors, e.g., the "ratings" tensor of the form restaurant by meal by person.
In this paper, we are particularly interested in fusing data sets with the goal of capturing their underlying latent structures. We formulate this problem as a coupled matrix and tensor factorization (CMTF) problem where heterogeneous data sets are modeled by fitting outer-product models to higher-order tensors and matrices in a coupled manner. Unlike traditional approaches solving this problem using alternating algorithms, we propose an all-at-once optimization approach called CMTF-OPT (CMTF-OPTimization), which is a gradient-based optimization approach for joint analysis of matrices and higher-order tensors. We also extend the algorithm to handle coupled incomplete data sets. Using numerical experiments, we demonstrate that the proposed all-at-once approach is more accurate than the alternating least squares approach.

Making Tensor Factorizations Robust to Non-Gaussian Noise. by E. C. Chi and Tammy Kolda. The abstract reads:

Abstract: Tensors are multi-way arrays, and the CANDECOMP/PARAFAC (CP) tensor factorization has found application in many different domains. The CP model is typically fit using a least squares objective function, which is a maximum likelihood estimate under the assumption of independent and identically distributed (i.i.d.) Gaussian noise. We demonstrate that this loss function can be highly sensitive to non-Gaussian noise. Therefore, we propose a loss function based on the 1-norm because it can accommodate both Gaussian and grossly non-Gaussian perturbations. We also present an alternating majorization-minimization (MM) algorithm for fitting a CP model using our proposed loss function (CPAL1) and compare its performance to the workhorse algorithm for fitting CP models, CP alternating least squares (CPALS).

Credit Thierry Legault, Video of a pass of UARS satellite 8/9 days before atmospheric reentry, at an altitude of only 250km, taken from the ground with a 14" telescope. More info on http://legault.perso.sfr.fr/uars_110915.html Liked this entry ? subscribe to the Nuit Blanche feed, there's more where that came from