From Data to Knowledge - 301 - Petros Drineas
Petros Drineas: "Randomized Algorithms in Data Mining: a Linear Algebraic Approach". A video from the UC Berkeley Conference: From Data to Knowledge: Machine-Learning with Real-time and Streaming Applications (May 7-11, 2012).
Abstract Petros Drineas (Computer Science Dept., Rensselaer Polytechnic Institute) The introduction of randomization in the design and analysis of algorithms for matrix computations (such as matrix multiplication, least-squares regression, the Singular Value Decomposition (SVD), etc.) over the last decade provided a new paradigm and a complementary perspective to traditional numerical linear algebra approaches. These novel approaches were motivated by technological developments in many areas of scientific research that permit the automatic generation of large data sets, which are often modeled as matrices. In this talk we will outline how such approaches can be used to approximately solve problems ranging from matrix multiplication and the SVD of matrices to least-squares problems and systems of linear equations. Applications of the proposed algorithms to data analysis will also be discussed.
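To make the randomized paradigm concrete, here is a minimal sketch (illustrative, not taken from the talk) of approximate matrix multiplication by column sampling: sample column-row index pairs with probability proportional to the product of their norms, and rescale each sampled outer product so the estimator is unbiased. All dimensions and parameter values below are arbitrary choices for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, c = 50, 1000, 300  # c = number of sampled column-row pairs out of n

A = rng.standard_normal((m, n))
B = A.T  # approximate the Gram matrix A @ A.T as a simple example

# Sample index i with probability proportional to ||A[:, i]|| * ||B[i, :]||,
# the variance-minimizing choice in the randomized matrix-multiplication
# literature.
w = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
probs = w / w.sum()

idx = rng.choice(n, size=c, p=probs)
# Rescaling by 1/(c * p_i) makes the estimator unbiased: E[approx] = A @ B.
approx = sum(np.outer(A[:, i], B[i, :]) / (c * probs[i]) for i in idx)

rel_err = np.linalg.norm(approx - A @ B, "fro") / np.linalg.norm(A @ B, "fro")
```

With only 300 of the 1000 column-row pairs touched, the Frobenius-norm relative error is typically well below one; the error shrinks at roughly a 1/sqrt(c) rate as the sample size c grows.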
From Data to Knowledge - 302 - Alexandr Andoni
Alexandr Andoni: "Sublinear algorithms via precision sampling". A video from the UC Berkeley Conference: From Data to Knowledge: Machine-Learning with Real-time and Streaming Applications (May 7-11, 2012).
Abstract Alexandr Andoni (Microsoft Research) Suppose we want to estimate a sum of bounded reals a_1+a_2+...+a_n with access to only some "limited information" about a_i's. A classical setting is where we estimate the entire sum by knowing only a random subset of a_i's. Naturally, there is a trade-off between the size of the subset and the resulting approximation. Motivated by applications where this trade-off is not good enough, we introduce Precision Sampling, which is an estimation technique that uses a more general kind of "limited information" about a_i's. Instead of obtaining a subset of a_i's as above, here we obtain a rough estimate for each of the n quantities, up to various prescribed degrees of "precision" (approximation). The trade-off then becomes between the precision of the estimates of a_i's and the resulting approximation to the total sum. We show one can obtain a trade-off that is qualitatively better in the precision sampling setting than in the aforementioned (vanilla sampling) setting. Our resulting tool leads to new sublinear algorithms. In one instance, it gives a simplified algorithm for a class of streaming problems. In another instance, the tool plays an instrumental role in a query-efficient algorithm for estimating edit distance. Joint work with Robi Krauthgamer (Weizmann Inst.) and Krzysztof Onak (CMU).
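For concreteness, the vanilla-sampling baseline that the abstract contrasts against can be sketched as follows (this illustrates only the classical setting, not the precision-sampling technique itself; the sizes below are arbitrary): read a uniform random subset of k of the n terms and scale the partial sum by n/k, giving an unbiased estimate of the total.

```python
import random

random.seed(0)
n, k = 100_000, 1_000
a = [random.random() for _ in range(n)]  # bounded reals in [0, 1]
true_sum = sum(a)

# Vanilla sampling: access only a random subset of k terms, then scale
# the partial sum by n/k.  The estimator is unbiased, and its standard
# deviation governs the subset-size vs. accuracy trade-off.
sample = random.sample(a, k)
estimate = (n / k) * sum(sample)

rel_err = abs(estimate - true_sum) / true_sum
```

Precision sampling replaces the all-or-nothing access pattern above (a term is either read exactly or not at all) with a rough estimate of every term at an individually prescribed precision, which is what yields the qualitatively better trade-off claimed in the abstract.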
From Data to Knowledge - 303 - Simon Wilson
Simon Wilson: "Sequential Bayesian parameter estimation in state space models". A video from the UC Berkeley Conference: From Data to Knowledge: Machine-Learning with Real-time and Streaming Applications (May 7-11, 2012).
Abstract Simon Wilson (Trinity College Dublin, CS & Stats) The goal of this work is to propose and evaluate a set of real-time methods for Bayesian estimation of the parameters of a state space model. While most state space inference procedures concentrate on the prediction and smoothing tasks, model parameter estimation can also play an important role in some applications. This work starts with the observation that any solution that provides prediction and smoothing densities can be used to derive the posterior distribution of the parameters of a state space model. We evaluate approaches based on Laplace approximations and various generalisations of the Kalman filter, along with corrections that improve these approximations. They are evaluated with non-linear and non-Gaussian state space models. It is shown that, when the parameter space is of modest dimension, a real-time Bayesian parameter estimation method can be successfully implemented.
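As background for the filtering machinery that such parameter-estimation methods build on, here is a minimal sketch of the standard Kalman filter for a one-dimensional linear-Gaussian state space model. The model, parameter values, and variable names are illustrative assumptions, not the talk's; the talk's generalisations target the non-linear, non-Gaussian case.

```python
import random
random.seed(1)

# Assumed linear-Gaussian state space model (illustrative only):
#   x_t = phi * x_{t-1} + w_t,  w_t ~ N(0, q)   (state equation)
#   y_t = x_t + v_t,            v_t ~ N(0, r)   (observation equation)
phi, q, r = 0.9, 0.5, 1.0

# Simulate a trajectory of observations.
T, x = 200, 0.0
ys = []
for _ in range(T):
    x = phi * x + random.gauss(0.0, q ** 0.5)
    ys.append(x + random.gauss(0.0, r ** 0.5))

# Kalman filter: propagate the Gaussian filtering density N(m, P)
# through alternating predict and update steps.
m, P = 0.0, 1.0
for y in ys:
    # Predict step: push the current belief through the state equation.
    m_pred = phi * m
    P_pred = phi * phi * P + q
    # Update step: condition on the new observation y.
    K = P_pred / (P_pred + r)          # Kalman gain
    m = m_pred + K * (y - m_pred)
    P = (1.0 - K) * P_pred
```

The filter's prediction densities are exactly the ingredient the abstract points to: evaluating them at the observed data gives the likelihood of candidate parameter values (here phi, q, r), from which a posterior over the parameters can be built sequentially.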
From Data to Knowledge - 304 - Alekh Agarwal
Alekh Agarwal: "Oracle inequalities for large scale model selection". A video from the UC Berkeley Conference: From Data to Knowledge: Machine-Learning with Real-time and Streaming Applications (May 7-11, 2012).
Abstract Alekh Agarwal (UC Berkeley, EECS) In many large-scale, high-dimensional prediction problems, performance is limited by computational resources rather than sample size. In this setting, we consider the problem of model selection under computational constraints: given a particular computational budget, is it better to gather more data and estimate a simpler model, or gather less data and estimate a more complex model? We introduce general procedures for model selection under computational constraints. In contrast to classical oracle inequalities, which show that a model selection scheme gives a near-optimal trade-off between approximation error and estimation error for a given sample size, we give computational oracle inequalities, which show that our methods give a near-optimal trade-off for a given amount of computation, that is, devoting all of our computational budget to the best model would not have led to a significant performance improvement. Joint work with Peter Bartlett, John Duchi and Clement Levrard.
From Data to Knowledge - 305 - Alexis Bondu
Alexis Bondu: "A Supervised Approach for Change Detection in Data Streams". A video from the UC Berkeley Conference: From Data to Knowledge: Machine-Learning with Real-time and Streaming Applications (May 7-11, 2012).
Abstract Alexis Bondu (Electricite de France R&D) In recent years, the amount of data to process has increased in many application areas such as network monitoring, web click and sensor data analysis. Data stream mining answers the challenge of massive data processing: this paradigm allows pieces of data to be treated on the fly, avoiding exhaustive data storage. The detection of changes in a data stream's distribution is an important issue with a wide range of applications. In this talk, the change detection problem is turned into a supervised learning task. We chose to exploit the supervised discretization method "MODL" given its interesting properties. Our approach compares favorably with an alternative method on artificial data streams, and is applied to real data.
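The general reduction of change detection to supervised learning can be sketched as follows. This is a generic illustration, not the authors' method: a trivial threshold classifier stands in for MODL, and the windowing scheme and sample sizes are assumptions. Label a reference window of the stream 0 and the current window 1, train a classifier to separate them, and treat above-chance held-out accuracy as evidence that the distribution has changed.

```python
import random
random.seed(0)

def change_score(ref, cur):
    # Supervised reduction: reference window gets label 0, current
    # window gets label 1.  A classifier that beats chance on held-out
    # data is evidence that the stream's distribution has changed.
    h = len(ref) // 2
    m0 = sum(ref[:h]) / h          # training mean of the reference window
    m1 = sum(cur[:h]) / h          # training mean of the current window
    thr = (m0 + m1) / 2            # midpoint-of-means threshold classifier
    pred1 = (lambda x: x >= thr) if m1 >= m0 else (lambda x: x < thr)
    correct = (sum(not pred1(x) for x in ref[h:]) +
               sum(pred1(x) for x in cur[h:]))
    return correct / (len(ref) - h + len(cur) - h)  # held-out accuracy

# Identically distributed windows: accuracy should hover around chance.
no_change = change_score([random.gauss(0, 1) for _ in range(400)],
                         [random.gauss(0, 1) for _ in range(400)])
# Shifted current window: the classifier separates the two windows.
with_change = change_score([random.gauss(0, 1) for _ in range(400)],
                           [random.gauss(1.5, 1) for _ in range(400)])
```

In this sketch the held-out accuracy acts as the change statistic: near 0.5 when the windows come from the same distribution, and markedly higher after the shift. MODL's supervised discretization plays the classifier-and-scoring role far more robustly in the authors' approach.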