## Thursday, May 10, 2012

### Matrix Factorization This Week

We have a pretty large number of papers this week, mostly because I have not kept up with the subject over the past month. We should also have another Cable and Igor's adventure in Matrix Factorization tomorrow. In the meantime, enjoy!

Here is Robust PCA as implemented using GoDec; you can find more on Tianyi's YouTube channel.

The following presentation, Finding Structure in Big Data by Ankur Moitra, advertises the results of the paper below. It is noteworthy that it took about 10 years to get to an empirical reason as to why and how NMF works, yet the technique is implemented by everybody and their cousins and is itself synonymous with matrix factorization in the literature. Here is the paper: Computing a Nonnegative Matrix Factorization -- Provably by Sanjeev Arora, Rong Ge, Ravi Kannan, Ankur Moitra. The abstract reads:
In the Nonnegative Matrix Factorization (NMF) problem we are given an $n \times m$ nonnegative matrix $M$ and an integer $r > 0$. Our goal is to express $M$ as $A W$ where $A$ and $W$ are nonnegative matrices of size $n \times r$ and $r \times m$ respectively. In some applications, it makes sense to ask instead for the product $AW$ to approximate $M$ -- i.e. (approximately) minimize $\|M - AW\|_F$ where $\|\cdot\|_F$ denotes the Frobenius norm; we refer to this as Approximate NMF. This problem has a rich history spanning quantum mechanics, probability theory, data analysis, polyhedral combinatorics, communication complexity, demography, chemometrics, etc. In the past decade NMF has become enormously popular in machine learning, where $A$ and $W$ are computed using a variety of local search heuristics. Vavasis proved that this problem is NP-complete. We initiate a study of when this problem is solvable in polynomial time:
1. We give a polynomial-time algorithm for exact and approximate NMF for every constant $r$. Indeed NMF is most interesting in applications precisely when $r$ is small.
2. We complement this with a hardness result, that if exact NMF can be solved in time $(nm)^{o(r)}$, 3-SAT has a sub-exponential time algorithm. This rules out substantial improvements to the above algorithm.
3. We give an algorithm that runs in time polynomial in $n$, $m$ and $r$ under the separability condition identified by Donoho and Stodden in 2003. The algorithm may be practical since it is simple and noise tolerant (under benign assumptions). Separability is believed to hold in many practical settings.
To the best of our knowledge, this last result is the first example of a polynomial-time algorithm that provably works under a non-trivial condition on the input and we believe that this will be an interesting and important direction for future work.
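
The abstract mentions that, in practice, $A$ and $W$ are computed with local search heuristics. As an illustration of what such a heuristic looks like, here is a minimal NumPy sketch of the classical multiplicative-update scheme for the Frobenius objective. It is not the provable algorithm of the paper; the function name `nmf_multiplicative`, the random initialization, the iteration count and the small `eps` guard are all assumptions made for this example.

```python
import numpy as np

def nmf_multiplicative(M, r, iters=200, eps=1e-12, seed=0):
    """Local-search heuristic: approximate nonnegative M (n x m) by A @ W, A, W >= 0."""
    rng = np.random.default_rng(seed)
    n, m = M.shape
    # Strictly positive random start (an assumption; any positive start works).
    A = rng.random((n, r)) + 0.1
    W = rng.random((r, m)) + 0.1
    errors = []
    for _ in range(iters):
        # Multiplicative updates keep every entry nonnegative.
        W *= (A.T @ M) / (A.T @ A @ W + eps)
        A *= (M @ W.T) / (A @ W @ W.T + eps)
        errors.append(np.linalg.norm(M - A @ W))  # Frobenius norm of the residual
    return A, W, errors

# Synthetic nonnegative matrix with an exact rank-4 nonnegative factorization.
rng = np.random.default_rng(1)
M = rng.random((20, 4)) @ rng.random((4, 30))
A, W, errors = nmf_multiplicative(M, r=4)
print("first / last residual:", errors[0], errors[-1])
```

The residual does not increase from one iteration to the next, but nothing guarantees that the iteration reaches a global minimizer, which is the gap the paper addresses.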

Recovering a low-rank matrix from some of its linear measurements is a popular problem in many areas of science and engineering. One special case of it is the matrix completion problem, where we need to reconstruct a low-rank matrix from incomplete samples of its entries. A lot of efficient algorithms have been proposed to solve this problem and they perform well when Gaussian noise with a small variance is added to the given data. But they cannot deal with the sparse random-valued noise in the measurements. In this paper, we propose a robust method for recovering the low-rank matrix with adaptive outlier pursuit when part of the measurements are damaged by outliers. This method will detect the positions where the data is completely ruined and recover the matrix using correct measurements. Numerical experiments show the accuracy of noise detection and high performance of matrix completion for our algorithms compared with other algorithms.

Normalized Iterative Hard Thresholding for Matrix Completion by Jared Tanner and Ke Wei. The abstract reads:
Matrices of low rank can be uniquely determined from fewer linear measurements, or entries, than the total number of entries in the matrix. Moreover, there is a growing literature of computationally efficient algorithms which can recover a low rank matrix from such limited information, typically referred to as matrix completion. We introduce a particularly simple yet highly efficient alternating projection algorithm which uses an adaptive stepsize calculated to be exact for a restricted subspace. This method is proven to have near optimal order recovery guarantees, and is observed to have average case performance superior in some respects to other matrix completion algorithms. In particular, this proposed algorithm is able to recover matrices from extremely close to the minimum number of measurements necessary.
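
To give a feel for the kind of iteration this abstract describes, here is a minimal NumPy sketch of a normalized iterative hard thresholding loop for matrix completion: a gradient step on the observed entries followed by a projection onto rank-$r$ matrices, with a stepsize that is exact when the residual is restricted to the column space of the current iterate. The function names (`truncate`, `niht_complete`), the initialization, the iteration count and the synthetic test matrix are assumptions for this example, not the authors' code.

```python
import numpy as np

def truncate(Z, r):
    """Best rank-r approximation of Z, plus its top-r left singular vectors."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r], U[:, :r]

def niht_complete(M_obs, mask, r, iters=300):
    """Rank-r completion from the entries of M_obs where mask is True."""
    # Start from the rank-r truncation of the zero-filled observations.
    X, U = truncate(M_obs * mask, r)
    for _ in range(iters):
        G = mask * (M_obs - X)        # residual on the observed entries only
        PG = U @ (U.T @ G)            # residual projected onto the current column space
        denom = np.sum((mask * PG) ** 2)
        if denom == 0.0:              # nothing left to correct in that subspace
            break
        alpha = np.sum(PG ** 2) / denom   # exact line search along PG
        X, U = truncate(X + alpha * G, r) # hard threshold back to rank r
    return X

# Synthetic rank-2 matrix with about 60% of its entries observed.
rng = np.random.default_rng(0)
M = rng.standard_normal((40, 2)) @ rng.standard_normal((2, 40))
mask = rng.random(M.shape) < 0.6
X_hat = niht_complete(M, mask, r=2)   # only the masked entries of M are used
rel_err = np.linalg.norm(X_hat - M) / np.linalg.norm(M)
print("relative error:", rel_err)
```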

We consider the problem of recovering a target matrix that is a superposition of low-rank and sparse components, from a small set of linear measurements. This problem arises in compressed sensing of structured high-dimensional signals such as videos and hyperspectral images, as well as in the analysis of transformation invariant low-rank recovery. We analyze the performance of the natural convex heuristic for solving this problem, under the assumption that measurements are chosen uniformly at random. We prove that this heuristic exactly recovers low-rank and sparse terms, provided the number of observations exceeds the number of intrinsic degrees of freedom of the component signals by a polylogarithmic factor. Our analysis introduces several ideas that may be of independent interest for the more general problem of compressed sensing and decomposing superpositions of multiple structured signals.

Respiration-correlated CBCT, commonly called 4DCBCT, provides respiratory phase-resolved CBCT images. In many clinical applications, it is preferable to reconstruct true 4DCBCT with the 4th dimension being time, i.e., each CBCT image is reconstructed based on the corresponding instantaneous projection. We propose in this work a novel algorithm for the reconstruction of this truly time-resolved CBCT, called cine-CBCT, by effectively utilizing the underlying temporal coherence, such as periodicity or repetition, in those cine-CBCT images. Assuming each column of the matrix $\bm{U}$ represents a CBCT image to be reconstructed and the total number of columns is the same as the number of projections, the central idea of our algorithm is that the rank of $\bm{U}$ is much smaller than the number of projections and we can use a matrix factorization form $\bm{U}=\bm{L}\bm{R}$ for $\bm{U}$. The number of columns for the matrix $\bm{L}$ constrains the rank of $\bm{U}$ and hence implicitly imposes a temporal coherence condition among all the images in cine-CBCT. The desired image properties in $\bm{L}$ and the periodicity of the breathing pattern are achieved by penalizing the sparsity of the tight wavelet frame transform of $\bm{L}$ and that of the Fourier transform of $\bm{R}$, respectively. A split Bregman method is used to solve the problem. In this paper we focus on presenting this new algorithm and showing the proof of principle using simulation studies on an NCAT phantom.

Robust Nonnegative Matrix Factorization via $L_1$ Norm Regularization by Bin Shen, Luo Si, Rongrong Ji, Baodi Liu. The abstract reads:
Nonnegative Matrix Factorization (NMF) is a widely used technique in many applications such as face recognition, motion segmentation, etc. It approximates the nonnegative data in an original high dimensional space with a linear representation in a low dimensional space by using the product of two nonnegative matrices. In many applications data are often partially corrupted with large additive noise. When the positions of noise are known, some existing variants of NMF can be applied by treating these corrupted entries as missing values. However, the positions are often unknown in many real world applications, which prevents the usage of traditional NMF or other existing variants of NMF. This paper proposes a Robust Nonnegative Matrix Factorization (RobustNMF) algorithm that explicitly models the partial corruption as large additive noise without requiring the information of positions of noise. In practice, large additive noise can be used to model outliers. In particular, the proposed method jointly approximates the clean data matrix with the product of two nonnegative matrices and estimates the positions and values of outliers/noise. An efficient iterative optimization algorithm with a solid theoretical justification has been proposed to learn the desired matrix factorization. Experimental results demonstrate the advantages of the proposed algorithm.

Learning Topic Models - Going beyond SVD by Sanjeev Arora, Rong Ge, Ankur Moitra. The abstract reads:
Topic Modeling is an approach used for automatic comprehension and classification of data in a variety of settings, and perhaps the canonical application is in uncovering thematic structure in a corpus of documents. A number of foundational works both in machine learning and in theory have suggested a probabilistic model for documents, whereby documents arise as a convex combination of (i.e. distribution on) a small number of topic vectors, each topic vector being a distribution on words (i.e. a vector of word-frequencies). Similar models have since been used in a variety of application areas; the Latent Dirichlet Allocation or LDA model of Blei et al. is especially popular.
Theoretical studies of topic modeling focus on learning the model's parameters assuming the data is actually generated from it. Existing approaches for the most part rely on Singular Value Decomposition (SVD), and consequently have one of two limitations: these works need to either assume that each document contains only one topic, or else can only recover the span of the topic vectors instead of the topic vectors themselves.
This paper formally justifies Nonnegative Matrix Factorization (NMF) as a main tool in this context, which is an analog of SVD where all vectors are nonnegative. Using this tool we give the first polynomial-time algorithm for learning topic models without the above two limitations. The algorithm uses a fairly mild assumption about the underlying topic matrix called separability, which is usually found to hold in real-life data. A compelling feature of our algorithm is that it generalizes to models that incorporate topic-topic correlations, such as the Correlated Topic Model and the Pachinko Allocation Model.
We hope that this paper will motivate further theoretical results that use NMF as a replacement for SVD - just as NMF has come to replace SVD in many applications.

Recovery of Low-Rank Plus Compressed Sparse Matrices with Application to Unveiling Traffic Anomalies by Morteza Mardani, Gonzalo Mateos, Georgios B. Giannakis. The abstract reads:
Given the superposition of a low-rank matrix plus the product of a known fat compression matrix times a sparse matrix, the goal of this paper is to establish deterministic conditions under which exact recovery of the low-rank and sparse components becomes possible. This fundamental identifiability issue arises with traffic anomaly detection in backbone networks, and subsumes compressed sensing as well as the timely low-rank plus sparse matrix recovery tasks encountered in matrix decomposition problems. Leveraging the ability of $\ell_1$- and nuclear norms to recover sparse and low-rank matrices, a convex program is formulated to estimate the unknowns. Analysis and simulations confirm that the said convex program can recover the unknowns for sufficiently low-rank and sparse enough components, along with a compression matrix possessing an isometry property when restricted to operate on sparse vectors. When the low-rank, sparse, and compression matrices are drawn from certain random ensembles, it is established that exact recovery is possible with high probability. First-order algorithms are developed to solve the nonsmooth convex optimization problem with provable iteration complexity guarantees. Insightful tests with synthetic and real network data corroborate the effectiveness of the novel approach in unveiling traffic anomalies across flows and time, and its ability to outperform existing alternatives.
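
For readers who want to see the shape of the convex program this abstract refers to, it can be written as follows. The symbols are assumptions chosen for this note: $Y$ is the observed matrix, $X$ the low-rank component, $A$ the sparse component, $R$ the known fat compression matrix, $\|\cdot\|_*$ the nuclear norm, $\|\cdot\|_1$ the entrywise $\ell_1$-norm and $\lambda > 0$ a weight balancing the two terms.

$$
\min_{X,\,A} \; \|X\|_* + \lambda \|A\|_1 \quad \text{subject to} \quad Y = X + R A
$$

Taking $R$ to be the identity gives the usual low-rank plus sparse decomposition, while dropping $X$ leaves a compressed sensing problem for $A$, which is the sense in which the formulation subsumes both.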

Statistical Mechanics of Dictionary Learning by Ayaka Sakata, Yoshiyuki Kabashima. The abstract reads:
Finding a basis matrix (dictionary) by which objective signals are represented sparsely is of major relevance in various scientific and technological fields. We consider a problem to learn a dictionary from a set of training signals. We employ techniques of statistical mechanics of disordered systems to evaluate the size of the training set necessary to typically succeed in the dictionary learning. The results indicate that the necessary size is much smaller than previously estimated, which theoretically supports and/or encourages the use of dictionary learning in practical situations.

We consider the decision problem asking whether a partial rational symmetric matrix with an all-ones diagonal can be completed to a full positive semidefinite matrix of rank at most $k$. We show that this problem is $\mathrm{NP}$-hard for any fixed integer $k\ge 2$. Equivalently, for $k\ge 2$, it is $\mathrm{NP}$-hard to test membership in the rank constrained elliptope $\mathcal{E}_k(G)$, i.e., the set of all partial matrices with off-diagonal entries specified at the edges of $G$, that can be completed to a positive semidefinite matrix of rank at most $k$. Additionally, we show that deciding membership in the convex hull of $\mathcal{E}_k(G)$ is also $\mathrm{NP}$-hard for any fixed integer $k\ge 2$.

Risk estimation for matrix recovery with spectral regularization by Charles-Alban Deledalle, Samuel Vaiter, Gabriel Peyré, Jalal Fadili, Charles Dossal. The abstract reads:
In this paper, we develop an approach to recursively estimate the quadratic risk for matrix recovery problems regularized with spectral functions. Toward this end, in the spirit of the SURE theory, a key step is to compute the (weak) derivative and divergence of a solution with respect to the observations. As such a solution is not available in closed form, but rather through a proximal splitting algorithm, we propose to recursively compute the divergence from the sequence of iterates. A second challenge that we unlocked is the computation of the (weak) derivative of the proximity operator of a spectral function. To show the potential applicability of our approach, we exemplify it on a matrix completion problem to objectively and automatically select the regularization parameter.
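
As a reminder of why the divergence matters here, recall the classical SURE identity in the simplest denoising setting, which is a special case and not the general matrix recovery setting of the paper. Assume $y = x_0 + w$ with $w \sim \mathcal{N}(0, \sigma^2 I_n)$ and a weakly differentiable estimator $\hat{x}(y)$ (all of these symbols are assumptions of this note). Then

$$
\mathrm{SURE}(y) = \|y - \hat{x}(y)\|^2 - n\sigma^2 + 2\sigma^2 \operatorname{div} \hat{x}(y),
\qquad
\mathbb{E}\big[\mathrm{SURE}(y)\big] = \mathbb{E}\,\|\hat{x}(y) - x_0\|^2,
$$

where $\operatorname{div} \hat{x}(y) = \sum_{i=1}^{n} \partial \hat{x}_i(y) / \partial y_i$ is the trace of the Jacobian of the estimator. The right-hand quantity depends on the unknown $x_0$, while the left-hand one can be computed from the data once the divergence is available, which is why the paper needs the derivative of the iterates.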

This paper proposes a new algorithm for linear system identification from noisy measurements. The proposed algorithm balances a data fidelity term with a norm induced by the set of single pole filters. We pose a convex optimization problem that approximately solves the atomic norm minimization problem and identifies the unknown system from noisy linear measurements. This problem can be solved efficiently with standard, freely available software. We provide rigorous statistical guarantees that explicitly bound the estimation error (in the $H_2$-norm) in terms of the stability radius, the Hankel singular values of the true system and the number of measurements. These results in turn yield complexity bounds and asymptotic consistency. We provide numerical experiments demonstrating the efficacy of our method for estimating linear systems from a variety of linear measurements.

On Traffic Matrix Completion in the Internet by Gonca Gursun, Mark Crovella. The abstract reads:
The ability of an ISP to infer traffic volumes that are not directly measurable can be useful for research, engineering, and business intelligence. Previous work has shown that traffic matrix completion is possible, but there is as yet no clear understanding of which ASes are likely to be able to perform TM completion, and which traffic flows can be inferred. In this paper we investigate the relationship between the AS-level topology of the Internet and the ability of an individual AS to perform traffic matrix completion. We first frame the questions through abstract analysis of idealized topologies, and then use actual routing measurements and topologies to study the ability of real ASes to infer traffic flows. Our first set of results identifies which ASes are best positioned to perform TM completion. We show, surprisingly, that TM completion ability is not particularly characteristic of ASes in the ‘core,’ nor does it help for an AS to have many peering links. Rather, the most important factor enabling an AS to perform TM completion is the number of direct customers it has. Our second set of results focuses on which flows can be inferred. We show that topologically close flows are easier to infer, and that flows passing through customers are particularly well suited for inference.

Improved matrix algorithms via the Subsampled Randomized Hadamard Transform by Christos Boutsidis, Alex Gittens. The abstract reads:
The SRHT low-rank matrix approximation algorithm, which is based upon randomized dimension reduction via the Subsampled Randomized Hadamard Transform, is the fastest known low-rank matrix approximation technique. Novel Frobenius and spectral norm error bounds are provided which improve upon previous efforts to provide quality-of-approximation guarantees for this method. In particular, a much sharpened spectral norm error bound is obtained. Similarly, the SRHT least-squares algorithm solves regressions problems quickly via dimension reduction and the Subsampled Randomized Hadamard Transform. We also provide a novel analysis of this approximation algorithm and show improved quality-of-approximation guarantees. Our main theorems are a consequence of results on approximate matrix computations involving SRHT matrices that may themselves be of independent interest.
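
For readers unfamiliar with the transform, here is a small NumPy sketch of an SRHT of the form $\sqrt{n/\ell}\, R H D$ (random signs $D$, normalized Walsh-Hadamard matrix $H$, row sampler $R$), used to build a low-rank approximation by sketching the row space. The function names (`fwht`, `srht`, `srht_low_rank`), the choice of sampling rows without replacement and the test sizes are assumptions for this example; the error bounds of the paper are not reproduced here.

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform along axis 0 (length a power of 2)."""
    x = np.array(x, dtype=float)      # work on a float copy
    n = x.shape[0]
    assert n > 0 and (n & (n - 1)) == 0, "length must be a power of 2"
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b        # butterfly: sums
            x[i + h:i + 2 * h] = a - b  # butterfly: differences
        h *= 2
    return x / np.sqrt(n)

def srht(B, ell, rng):
    """Apply an ell x n SRHT to B (n x c): sqrt(n / ell) * R H D B."""
    n = B.shape[0]
    signs = rng.choice([-1.0, 1.0], size=n)          # diagonal of D
    rows = rng.choice(n, size=ell, replace=False)    # rows kept by R
    HDB = fwht(signs[:, None] * B)
    return np.sqrt(n / ell) * HDB[rows]

def srht_low_rank(A, ell, rng):
    """Project A onto the range of the sketch Y = A Theta^T."""
    Y = srht(A.T, ell, rng).T        # m x ell sketch of the column space of A
    Q, _ = np.linalg.qr(Y)           # orthonormal basis for range(Y)
    return Q @ (Q.T @ A)

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 3)) @ rng.standard_normal((3, 64))  # exactly rank 3
A_hat = srht_low_rank(A, ell=10, rng=rng)
rel_err = np.linalg.norm(A - A_hat) / np.linalg.norm(A)
x = rng.standard_normal((64, 1))
print("relative error:", rel_err)
```

The point of the Hadamard step is that `fwht` costs on the order of $n \log n$ operations per column instead of the $n \ell$ of a dense Gaussian sketch, while still mixing the coordinates before subsampling.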

A new graph parameter related to bounded rank positive semidefinite matrix completions by Monique Laurent, Antonios Varvitsiotis. The abstract reads:
The Gram dimension $\mathrm{gd}(G)$ of a graph $G$ is the smallest integer $k\ge 1$ such that any partial real symmetric matrix, whose entries are specified on the diagonal and at the off-diagonal positions corresponding to edges of $G$, can be completed to a positive semidefinite matrix of rank at most $k$ (assuming a positive semidefinite completion exists). For any fixed $k$ the class of graphs satisfying $\mathrm{gd}(G) \le k$ is minor closed, hence it can be characterized by a finite list of forbidden minors. We show that the only minimal forbidden minor is $K_{k+1}$ for $k\le 3$ and that there are two minimal forbidden minors: $K_5$ and $K_{2,2,2}$ for $k=4$. We also show some close connections to Euclidean realizations of graphs and to the graph parameter $\nu^=(G)$ of \cite{H03}. In particular, our characterization of the graphs with $\mathrm{gd}(G)\le 4$ implies the forbidden minor characterization of the 3-realizable graphs of Belk and Connelly \cite{Belk,BC} and of the graphs with $\nu^=(G) \le 4$ of van der Holst \cite{H03}.

Predicting user affinity to items is an important problem in applications like content optimization, computational advertising, and many more. While bilinear random effect models (matrix factorization) provide state-of-the-art performance when minimizing RMSE through a Gaussian response model on explicit ratings data, applying it to imbalanced binary response data presents additional challenges that we carefully study in this paper. Data in many applications usually consist of users' implicit response that are often binary -- clicking an item or not; the goal is to predict click rates, which is often combined with other measures to calculate utilities to rank items at runtime of the recommender systems. Because of the implicit nature, such data are usually much larger than explicit rating data and often have an imbalanced distribution with a small fraction of click events, making accurate click rate prediction difficult. In this paper, we address two problems. First, we show previous techniques to estimate bilinear random effect models with binary data are less accurate compared to our new approach based on adaptive rejection sampling, especially for imbalanced response. Second, we develop a parallel bilinear random effect model fitting framework using Map-Reduce paradigm that scales to massive datasets. Our parallel algorithm is based on a "divide and conquer" strategy coupled with an ensemble approach. Through experiments on the benchmark MovieLens data, a small Yahoo! Front Page data set, and a large Yahoo! Front Page data set that contains 8M users and 1B binary observations, we show that careful handling of binary response as well as identifiability issues are needed to achieve good performance for click rate prediction, and that the proposed adaptive rejection sampler and the partitioning as well as ensemble techniques significantly improve model performance.

Given a set of possibly corrupted and incomplete linear measurements, we leverage low-dimensional models to best explain the data for provable solution quality in inversion. A non-exhaustive list of examples includes sparse vector and low-rank matrix approximation. Most of the well-known low dimensional models are inherently non-convex. However, recent approaches prefer convex surrogates that "relax" the problem in order to establish solution uniqueness and stability. In this paper, we tackle the linear inverse problems revolving around low-rank matrices by preserving their non-convex structure. To this end, we present and analyze a new set of sparse and low-rank recovery algorithms within the class of hard thresholding methods. We provide strategies on how to set up these algorithms via basic "ingredients" for different configurations to achieve complexity vs. accuracy tradeoffs. Moreover, we propose acceleration schemes by utilizing memory-based techniques and randomized, $\epsilon$-approximate, low-rank projections to speed-up the convergence as well as decrease the computational costs in the recovery process. For all these cases, we present theoretical analysis that guarantees convergence under mild problem conditions. Simulation results demonstrate notable performance improvements compared to state-of-the-art algorithms both in terms of data reconstruction and computational complexity.

An important property of the Kalman filter is that the underlying Riccati flow is a contraction for the natural metric of the cone of symmetric positive definite matrices. The present paper studies the geometry of a low-rank version of the Kalman filter. The underlying Riccati flow evolves on the manifold of fixed rank symmetric positive semidefinite matrices. Contraction properties of the low-rank flow are studied by means of a suitable metric recently introduced by the authors.

Given a limited number of entries from the superposition of a low-rank matrix plus the product of a known fat compression matrix times a sparse matrix, recovery of the low-rank and sparse components is a fundamental task subsuming compressed sensing, matrix completion, and principal components pursuit. This paper develops algorithms for distributed sparsity-regularized rank minimization over networks, when the nuclear- and $\ell_1$-norm are used as surrogates to the rank and nonzero entry counts of the sought matrices, respectively. While nuclear-norm minimization has well-documented merits when centralized processing is viable, non-separability of the singular-value sum challenges its distributed minimization. To overcome this limitation, an alternative characterization of the nuclear norm is adopted which leads to a separable, yet non-convex cost minimized via the alternating-direction method of multipliers. The novel distributed iterations entail reduced-complexity per-node tasks, and affordable message passing among single-hop neighbors. Interestingly, upon convergence the distributed (non-convex) estimator provably attains the global optimum of its centralized counterpart, regardless of initialization. Several application domains are outlined to highlight the generality and impact of the proposed framework. These include unveiling traffic anomalies in backbone networks, predicting networkwide path latencies, and mapping the RF ambiance using wireless cognitive radios. Simulations with synthetic and real network data corroborate the convergence of the novel distributed algorithm, and its centralized performance guarantees.
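
The "alternative characterization of the nuclear norm" mentioned in this abstract is, as far as I can tell, the well-known identity $\|X\|_* = \min_{X = P Q^T} \tfrac{1}{2}\left(\|P\|_F^2 + \|Q\|_F^2\right)$, which trades the non-separable sum of singular values for a sum of squared entries of the factors. Here is a short NumPy check of that identity on a random matrix; the variable names and the test matrix are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(0)
r = 3
X = rng.standard_normal((8, r)) @ rng.standard_normal((r, 6))  # rank-3 matrix

U, s, Vt = np.linalg.svd(X, full_matrices=False)
nuclear = s.sum()                      # nuclear norm = sum of singular values

# Balanced factorization built from the SVD attains the minimum.
P = U[:, :r] * np.sqrt(s[:r])
Q = Vt[:r].T * np.sqrt(s[:r])
balanced = 0.5 * (np.sum(P ** 2) + np.sum(Q ** 2))

# Any other factorization X = P2 @ Q2.T can only give a larger value.
G = np.eye(r) + np.triu(rng.standard_normal((r, r)), 1)  # invertible (unit triangular)
P2 = P @ G
Q2 = Q @ np.linalg.inv(G).T
other = 0.5 * (np.sum(P2 ** 2) + np.sum(Q2 ** 2))

print(nuclear, balanced, other)
```

Because the factor-based cost splits into a sum over rows of $P$ and $Q$, it can be distributed across nodes, at the price of a non-convex problem.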

Frequency domain subspace identification is an effective means of obtaining a low order model from frequency domain data. In the noisy data case using a singular value decomposition to determine the observable subspace has several problems: an incorrect weighting of the data in the singular values; difficulties in determining the appropriate rank; and a loss of the Hankel structure in the low order approximation. A nuclear norm (sum of the singular values) minimization based method, using spectral constraints, is presented here and shown to be an effective technique for overcoming these problems.
