On b-bit min-wise hashing for large-scale regression and classification with sparse data by Rajen D. Shah, Nicolai Meinshausen
Large-scale regression problems where both the number of variables,
p, and the number of observations, n, may be large and in the order of millions or more, are becoming increasingly more common. Typically the data are sparse: only a fraction of a percent of the entries in the design matrix are non-zero. Nevertheless, often the only computationally feasible approach is to perform dimension reduction to obtain a new design matrix with far fewer columns, and then work with this compressed data. b-bit min-wise hashing (Li and Konig, 2011) is a promising dimension reduction scheme for sparse matrices. In this work we study the prediction error of procedures which perform regression in the new lower-dimensional space after applying the method. For both linear and logistic models we show that the average prediction error vanishes asymptotically as long as q∥β∗∥22/n→0, where qis the average number of non-zero entries in each row of the design matrix and β∗is the coefficient of the linear predictor.
We also show that ordinary least squares or ridge regression applied to the reduced data in a sense amounts to a non-parametric regression and can in fact allow us fit more flexible models. We obtain non-asymptotic prediction error bounds for interaction models and for models where an unknown row normalisation must be applied before the signal is linear in the predictors.
Detecting Relative Anomaly by Richard Neuberg, Yixin Shi
System states that are anomalous from the perspective of a domain expert occur frequently in some anomaly detection problems. The performance of commonly used unsupervised anomaly detection methods may suffer in that setting, because they use frequency as a proxy for anomaly. We propose a novel concept for anomaly detection, called relative anomaly detection. It is tailored to be robust towards anomalies that occur frequently, by taking into account their location relative to the most typical observations. The approaches we develop are computationally feasible even for large data sets, and they allow real-time detection. We illustrate using data sets of potential scraping attempts and Wi-Fi channel utilization, both from Google, Inc.
Scalable Exemplar Clustering and Facility Location via Augmented Block Coordinate Descent with Column Generation by Ian En-Hsu Yen, Dmitry Malioutov, Abhishek Kumar
Interpolative Butterfly Factorization by Yingzhou Li, Haizhao YangIn recent years exemplar clustering has become a popular tool for applications in document and video summarization, active learning, and clustering with general similarity, where cluster centroids are required to be a subset of the data samples rather than their linear combinations. The problem is also well-known as facility location in the operations research literature. While the problem has well-developed convex relaxation with approximation and recovery guarantees, it has number of variables grows quadratically with the number of samples. Therefore, state-of-the-art methods can hardly handle more thansamples (i.e.
104 108variables). In this work, we propose an Augmented-Lagrangian with Block Coordinate Descent (AL-BCD) algorithm that utilizes problem structure to obtain closed-form solution for each block sub-problem, and exploits low-rank representation of the dissimilarity matrix to search active columns without computing the entire matrix. Experiments show our approach to be orders of magnitude faster than existing approaches and can handle problems of up to 106samples. We also demonstrate successful applications of the algorithm on world-scale facility location, document summarization and active learning.
This paper introduces the interpolative butterfly factorization for nearly optimal implementation of several transforms in harmonic analysis, when their explicit formulas satisfy certain analytic properties and the matrix representations of these transforms satisfy a complementary low-rank property. A preliminary interpolative butterfly factorization is constructed based on interpolative low-rank approximations of the complementary low-rank matrix. A novel sweeping matrix compression technique further compresses the preliminary interpolative butterfly factorization via a sequence of structure-preserving low-rank approximations. The sweeping procedure propagates the low-rank property among neighboring matrix factors to compress dense submatrices in the preliminary butterfly factorization to obtain an optimal one in the butterfly scheme. For an
N×Nmatrix, it takes O(NlogN)operations and complexity to construct the factorization as a product of O(logN)sparse matrices, each with O(N)nonzero entries. Hence, it can be applied rapidly in O(NlogN)operations. Numerical results are provided to demonstrate the effectiveness of this algorithm.
Approximating matrices and convex bodies through Kadison-Singer by Omer Friedland, Pierre Youssef
We exploit the recent solution of the Kadison-Singer problem to show that any
n×mmatrix Acan be approximated in operator norm by a submatrix with a number of columns of order the stable rank of A. This improves on existing results by removing an extra logarithmic factor in the size of the extracted matrix. We develop a sort of tensorization technique which allows to deal with a special kind of constraint approximation motivated by problems in convex geometry. As an application, we show that any convex body in Rnis arbitrary close to another one having O(n)contact points. This fills the gap left in the literature after the results of Rudelson and Srivastava and completely answers the problem. As a consequence of this, we show that the method developed by Gu\'edon, Gordon and Meyer to establish the isomorphic Dvoretzky theorem yields to the best known result once we inject our improvement of Rudelson's theorem.
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.