## Thursday, September 15, 2011

### Matrix Factorization This Week

As I was reading John Langford's blog, I came across a reference to an implementation of a matrix factorization module within Vowpal Wabbit by Jake Hofman. I asked Jake about the specifics of the matrix factorization since as we all know, with the advent of new regularizers and the jungle that comes with it, we may need to be a little bit more accurate in stating what algorithm and the type of regularization is implemented. Here is what Jake sent me.

hi igor
...the matrix factorization that's currently in VW is quite simple:stochastic gradient descent with L2 regularization. the main departure from "vanilla" approaches is that features are hashed into the weight vector, as is done in all of VW, to keep a fixed memory footprint.i don't have anything formal written at the moment, but the specifics that you're probably interested in are in the mf_inline_train function(lines 120 to 163 here): https://github.com/JohnLangford/vowpal_wabbit/blob/master/gd_mf.cc  we'll likely try to improve performance in the future, but as of right now it's relatively straightforward and performing reasonably well.the approach is similar to that in "collaborative filtering on abudget", which might be a useful reference: http://jmlr.csail.mit.edu/proceedings/papers/v9/karatzoglou10a/karatzoglou10a.pdf you can find a wiki page on example usage here: https://github.com/JohnLangford/vowpal_wabbit/wiki/Matrix-factorization-example let me know if i can answer any other questions.
regards,
jake
Thanks Jake.

Here is a list of papers and code I have noticed over the past week. It ranges from application to theoretical investigation of a replacement of the trace norm regularizer, a subject I'll cover soon since we have seen more than one replacement since that entry.

When data is sampled from an unknown subspace, principal component analysis (PCA) provides an e ective way to estimate the subspace and hence reduce the dimension of the data. At the heart of PCA is the Eckart-Young-Mirsky theorem, which characterizes the best rank k approximation of a matrix.In this paper, we prove a generalization of the Eckart-Young-Mirsky theorem under all unitarily invariant norms. Using this result, we obtain closed-form solutions for a set of rank/norm regularized problems, and derive closed-form solutions for a general class of subspace clustering problems (where data is modelled by unions of unknown subspaces). From these results we obtain new theoretical insights and promising experimental results.

Collaborative Filtering with the Trace Norm: Learning, Bounding, and Transducing by Ohad Shamir and Shai Shalev-Shwartz. The abstract reads:
Trace-norm regularization is a widely-used and successful approach for collaborative fi ltering and matrix completion. However, its theoretical understanding is surprisingly weak, and despite previous attempts, there are no distribution-free, non-trivial learning guarantees currently known. In this paper, we bridge this gap by providing such guarantees, under mild assumptions which correspond to collaborative filtering as performed in practice. In fact, we claim that previous di fficulties partially stemmed from a mismatch between the standard learning-theoretic modeling of collaborative fi ltering, and its practical application. Our results also shed some light on the issue of collaborative fi ltering with bounded models, which enforce predictions to lie within a certain range. In particular, we provide experimental and theoretical evidence that such models lead to a modest yet signi ficant improvement.

We consider the problem of approximately reconstructing a partially-observed, approximately low-rank matrix. This problem has received much attention lately, mostly using the trace-norm as a surrogate to the rank. Here we study low-rank matrix reconstruction using both the trace-norm, as well as the less-studied max-norm, and present reconstruction guarantees based on existing analysis on the Rademacher complexity of the unit balls of these norms. We show how these are superior in several ways to recently published guarantees based on specialized analysis.

Feature-Based Matrix Factorization by Tianqi Chen, Zhao Zheng, Qiuxia Lu, Weinan Zhang, Yong Yu. The abstract reads:
Recommendation system has been used more and more frequently in many applications recent years. With the increasing information available, not only in quantities but also in types, how to leverage these rich information to build a better recommendation system becomes a natural problem. Most traditional approaches try to design a specific model for each scenario, which demands great efforts in developing and modifying models. In this technical report, we describe our implementation of feature-based matrix factorization. This model is an abstract of many variants of matrix factorization models, and new types of information can be utilized by simply defining new features, without modifying any line of code.

ECG Compression via Matrix Completion by Luisa F. Polania, Rafael E. Carrillo, Manuel Blanco-Velasco and Kenneth E. Barner. The abstract reads:
Matrix completion is a new paradigm in signal processing that seeks to recover a low-rank matrix based on small number of observations. We propose an ECG compression scheme that organizes the ECG data in a low-rank matrix via a period normalization stage. By using matrix completion, we can recover the ECG data matrix from a few number of entries, and therefore achieve high compression ratios comparable to the ones obtained by existing compression techniques. The proposed scheme offers a low-complexity encoder and good tolerance to quantization noise, which allows a signiﬁcant reduction in the number of bits per sample and, yet achieve good quality in the reconstruction.

Nonconvex proximal splitting: batch and incremental algorithms by Suvrit Sra. The abstract reads:
Within the unmanageably large class of nonconvex optimization, we consider the rich subclass of nonsmooth problems that have \emph{composite} objectives---this already includes the extensively studied convex, composite objective problems as a special case. For this subclass, we introduce a powerful, new framework that permits asymptotically \emph{non-vanishing} perturbations. In particular, we develop perturbation-based batch and incremental (online like) nonconvex proximal splitting algorithms. To our knowledge, this is the first time that such perturbation-based nonconvex splitting algorithms are being proposed and analyzed. While the main contribution of the paper is the theoretical framework, we complement our results by presenting some empirical results on matrix factorization.
Linearized Alternating Direction Method with Adaptive Penalty for Low-Rank Representation by Zhouchen Lin, Risheng Liu, Zhixun Su. The abstract reads:
Low-rank representation (LRR) is an effective method for subspace clustering and has found wide applications in computer vision and machine learning. The existing LRR solver is based on the alternating direction method (ADM). It suffers from $O(n^3)$ computation complexity due to the matrix-matrix multiplications and matrix inversions, even if partial SVD is used. Moreover, introducing auxiliary variables also slows down the convergence. Such a heavy computation load prevents LRR from large scale applications. In this paper, we generalize ADM by linearizing the quadratic penalty term and allowing the penalty to change adaptively. We also propose a novel rule to update the penalty such that the convergence is fast. With our linearized ADM with adaptive penalty (LADMAP) method, it is unnecessary to introduce auxiliary variables and invert matrices. The matrix-matrix multiplications are further alleviated by using the skinny SVD representation technique. As a result, we arrive at an algorithm for LRR with complexity $O(rn^2)$, where $r$ is the rank of the representation matrix. Numerical experiments verify that for LRR our LADMAP method is much faster than state-of-the-art algorithms. Although we only present the results on LRR, LADMAP actually can be applied to solving more general convex programs.

The rank function rank( ) is neither continuous nor convex which brings much diﬃculty to the solution of rank minimization problems. In this paper, we provide a uniﬁed framework to construct the approximation functions of rank( ), and study their favorable properties. Particularly, with two families of approximation functions, we propose a convex relaxation method for the rank minimization problems with positive semideﬁnite cone constraints, and illustrate its application by computing the nearest lowrank correlation matrix. Numerical comparisons with the convex relaxation method in [17] indicate that our method tends to yield a better local optimal solution.

From that paper I note:

In [10, 26], Fazel et al. showed that the convex envelope of rank( ) on the unit ball in the operator norm is the nuclear norm. Motivated by this result, the convex relaxation methods based on the nuclear norm minimization are extensively studied, and many eﬀective algorithms have been proposed for them (see, e.g., [18, 34, 19, 7, 22]). However, the nuclear norm has a drastic derivation from the rank function since the former is convex whereas the latter is concave. This means that the nuclear norm is not a high quality approximation to rank( ). In fact, as pointed out in [26, 27], the nuclear norm relaxation exhibits a phase transition where for suﬃciently small rank the method always succeeds, but in the complement of the region it may fail even never succeed.

In this paper, noting that rank( ) is a composition of the one-side sign function and the singular value function, we give a uniﬁed framework to construct its approximation functions, and prove that rank( ) can be approximated to any prescribed accuracy when the approximation parameter is smaller than a given threshold. Although these approximation functions are not diﬀerentiable, they are semismooth and directional diﬀerentiable everywhere in R^m×n

It looks to me a little bit like the equivalent of the SL0 code that does so well in many compressive sensing reconstruction cases.

Exact recovery from contaminated visual data plays an important role in various tasks. By assuming the observed data matrix as the addition of a low-rank matrix and a sparse matrix, theoretic guarantee exists under mild conditions for exact data recovery. Practically matrix nuclear norm is adopted as a convex surrogate of the non-convex matrix rank function to encourage low-rank property and serves as the major component of recently-proposed Robust Principal Component Analysis (R-PCA). Recent endeavors have focused on enhancing the scalability of RPCA to large-scale datasets, especially mitigating the computational burden of frequent large-scale Singular Value Decomposition (SVD) inherent with the nuclear norm optimization. In our proposed scheme, the nuclear norm of an auxiliary matrix is minimized instead, which is related to the original low-rank matrix by random projection. By design, the modiﬁed optimization entails SVD on matrices of much smaller scale, as compared to the original optimization problem. Theoretic analysis well justiﬁes the proposed scheme, along with greatly reduced optimization complexity. Both qualitative and quantitative studies are provided on various computer vision benchmarks to validate its effectiveness, including facial shadow removal, surveillance background modeling and large-scale image tag transduction. It is also highlighted that the proposed solution can serve as a general principal to accelerate many other nuclear norm oriented problems in numerous tasks

In this work, we address the following matrix recovery problem: suppose we are given a set of data points containing two parts, one part consists of samples drawn from a union of multiple subspaces and the other part consists of outliers. We do not know which data points are outliers, or how many outliers there are. The rank and number of the subspaces are unknown either. Can we detect the outliers and segment the samples into their right subspaces, efficiently and exactly?
We utilize a so-called Low-Rank Representation (LRR) method to solve the above problem, and prove that under mild technical conditions, any solution to LRR exactly recover the row space of the samples and detect the outliers as well. Since the subspace membership is provably determined by the row space, this further implies that LRR can perform exact subspace segmentation and outlier detection, in an exact and efficient way.
The code for the LRR example can be found here.

Dictionary Learning of Convolved Signals by Daniele Barchiesi and Mark D. Plumbley. The abstract reads:
Assuming that a set of source signals is sparsely representable in a given dictionary, we show how their sparse recovery fails whenever we can only measure a convolved observation of them. Starting from this motivation, we develop a block coordinate descent method which aims to learn a convolved dictionary and provide a sparse representation of the observed signals with small residual norm. We compare the proposed approach to the K-S V D dictionary learning
algorithm and show through numerical experiment on synthetic signals that, provided some conditions on the problem data, our technique converges in a ﬁxed number of iterations to a sparse representation with smaller residual norm

Using the `1-norm to regularize the estimation of the parameter vector of a linear model leads to an unstable estimator when covariates are highly correlated. In this paper, we introduce a new penalty function which takes into account the correlation of the design matrix to stabilize the estimation. This norm, called the trace Lasso, uses the trace norm, which is a convex surrogate of the rank, of the selected covariates as the criterion of model complexity. We analyze the properties of our norm, describe an optimization algorithm based on reweighted least-squares, and illustrate the behavior of this norm on synthetic data, showing that it is more adapted to strong correlations than competing methods such as the elastic net.

Large-Scale Sparse Principal Component Analysis with Application to Text Data by Youwei Zhang and Laurent El Ghaoui. The abstract reads:
Sparse PCA provides a linear combination of small number of features that maxi- mizes variance across data. Although Sparse PCA has apparent advantages com- pared to PCA, such as better interpretability, it is generally thought to be computationally much more expensive. In this paper, we demonstrate the surprising fact that sparse PCA can be easier than PCA in practice, and that it can be reliably applied to very large data sets. This comes from a rigorous feature elimination pre-processing result, coupled with the favorable fact that features in real-life data typically have exponentially decreasing variances, which allows for many features to be eliminated. We introduce a fast block coordinate ascent algorithm with much better computational complexity than the existing first-order ones. We provide ex- perimental results obtained on text corpora involving millions of documents and hundreds of thousands of features. These results illustrate how Sparse PCA can help organize a large corpus of text data in a user-interpretable way, providing an attractive alternative approach to topic models.

HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent by Feng NiuBenjamin Recht, Christopher ReStephen Wright. The abstract reads:
Abstract: Stochastic Gradient Descent (SGD) is a popular algorithm that can achieve state-of-the-art performance on a variety of machine learning tasks. Several researchers have recently proposed schemes to parallelize SGD, but all require performance-destroying memory locking and synchronization. This work aims to show using novel theoretical analysis, algorithms, and implementation that SGD can be implemented *without any locking*. We present an update scheme called HOGWILD! which allows processors access to shared memory with the possibility of overwriting each other's work. We show that when the associated optimization problem is sparse, meaning most gradient updates only modify small parts of the decision variable, then HOGWILD! achieves a nearly optimal rate of convergence. We demonstrate experimentally that HOGWILD! outperforms alternative schemes that use locking by an order of magnitude.
The HOGWILD code is here.

Liked this entry ? subscribe to the Nuit Blanche feed, there's more where that came from
All entries on Compressive Sensing are here. All entries on Matrix Factorization are here.

Image Credit: NASA/JPL/Space Science Institute, N00175604.jpg was taken on September 13, 2011 and received on Earth September 14, 2011. The camera was pointing toward ENCELADUS at approximately 335,723 kilometers away, and the image was taken using the RED and CL2 filters.