Thursday, May 10, 2012

Part 1: Videos of the UC Berkeley Conference: "From Data to Knowledge: Machine-Learning with Real-time and Streaming Applications"

Here is the first series of videos from the UC Berkeley Conference: From Data to Knowledge: Machine-Learning with Real-time and Streaming Applications (May 7-11, 2012). Congratulations to the local organizing committee: Joshua Bloom, Damian Eads, Berian James, Peter Nugent, John Rice, Joseph Richards and Dan Starr for making the meeting happen and putting it all on video for others to learn from, in near real time! The titles of the talks are linked to the presentation slides. The full program, which ends tomorrow, is here. All the videos are here: Part 1, Part 2, Part 3, Part 4, Part 5.




From Data to Knowledge - 101 - Welcome and Introduction


From Data to Knowledge - 103 - Damian Eads  
 Damian Eads: "BigRF: New Software for Learning Random Forests". A video from the  UC Berkeley Conference: From Data to Knowledge: Machine-Learning with Real-time and Streaming Applications (May 7-11, 2012).

From Data to Knowledge - 104 - Yan Liu
Abstract Yan Liu (University of Southern California, CS) In many applications of time series models, such as climate analysis or social media analysis, we are often interested in extreme events, such as heat waves, wind gusts, and bursts of topics. These time series data usually exhibit a heavy-tailed distribution rather than a normal distribution. This poses great challenges to existing approaches due to the significantly different assumptions on the data distributions and the lack of sufficient past data on extreme events. In this talk, we propose the Sparse-GEV model, a latent state model based on the theory of extreme value modeling to automatically learn sparse temporal dependence and make predictions. Our model is theoretically significant because it is among the first models to learn sparse temporal dependencies among multivariate extreme value time series. We demonstrate the superior performance of our algorithm compared with state-of-the-art methods, including Granger causality, the copula approach, and transfer entropy, on one synthetic dataset, one climate dataset and two Twitter datasets.
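As a side note for readers unfamiliar with extreme value theory: the "GEV" in Sparse-GEV refers to the generalized extreme value distribution, which block maxima of a series converge to. Here is a minimal Python sketch (my illustration, not the Sparse-GEV model itself) contrasting a GEV fit with a normal fit on heavy-tailed data:

```python
# Minimal illustration (not the Sparse-GEV model): fit a generalized
# extreme value (GEV) distribution to block maxima of a heavy-tailed
# series, and compare tail probabilities with a normal fit.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Heavy-tailed synthetic series: ten "years" of daily Student-t values.
series = stats.t.rvs(df=3, size=365 * 10, random_state=rng)

# Block maxima: the yearly maxima of the series.
block_maxima = series.reshape(10, 365).max(axis=1)

# Extreme value theory says block maxima follow a GEV, not a normal.
shape, loc, scale = stats.genextreme.fit(block_maxima)

# Probability of exceeding a high threshold under each model.
threshold = block_maxima.max()
p_gev = stats.genextreme.sf(threshold, shape, loc=loc, scale=scale)
mu, sigma = block_maxima.mean(), block_maxima.std(ddof=1)
p_norm = stats.norm.sf(threshold, loc=mu, scale=sigma)
print(f"P(exceed) GEV: {p_gev:.4f}  normal: {p_norm:.4f}")
```

The normal fit typically assigns far too little probability to the tail, which is exactly the failure mode the abstract points to.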


From Data to Knowledge - 105 - Maya Gupta
 Maya Gupta: "Making The Call: Robust Classification of Streaming Signals". A video from the UC Berkeley Conference: From Data to Knowledge: Machine-Learning with Real-time and Streaming Applications (May 7-11, 2012).
Abstract Maya Gupta (University of Washington, EE) We consider the problem of robustly classifying a streaming signal as soon as possible. We give optimal and sub-optimal practical rules for linear and quadratic discriminants, and show practical probabilistic guarantees that the classification decision will be as good as if one waited, and can be implemented with quadratic programming. We demonstrate the guarantees with experiments for SVM's and QDA.
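The general flavor of the problem, deciding on a class before the whole signal has arrived, can be illustrated with a sequential log-likelihood-ratio rule. This is only an SPRT-style sketch under Gaussian assumptions, not the paper's quadratic-programming formulation; the function name and margin parameter are my own:

```python
# A sketch of early classification of a streaming signal: accumulate
# per-sample log-likelihood ratios between two Gaussian class templates
# and stop as soon as the evidence clears a confidence margin.
import numpy as np

def classify_early(stream, mean0, mean1, sigma, margin=5.0):
    """Return (label, number of samples used). Hypothetical helper."""
    llr = 0.0
    for t, x in enumerate(stream, start=1):
        # Log-likelihood ratio of sample x under class 1 vs class 0.
        llr += ((x - mean0[t - 1]) ** 2 - (x - mean1[t - 1]) ** 2) / (2 * sigma**2)
        if abs(llr) > margin:          # confident enough to call it now
            return (1 if llr > 0 else 0), t
    return (1 if llr > 0 else 0), len(mean0)  # used the full signal

# Toy usage: a class-1 signal observed in noise.
rng = np.random.default_rng(1)
T = 100
mean0, mean1 = np.zeros(T), np.linspace(0, 2, T)
signal = mean1 + rng.normal(0, 1.0, T)
label, decided_at = classify_early(signal, mean0, mean1, sigma=1.0)
print(f"decided class {label} after {decided_at}/{T} samples")
```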



From Data to Knowledge - 106 - Sotiria Lampoudi
 Sotiria Lampoudi: "Bounds estimation from timeseries". A video from the
UC Berkeley Conference: From Data to Knowledge: Machine-Learning with Real-time and Streaming Applications (May 7-11, 2012).

Abstract Sotiria Lampoudi (UC Santa Barbara, CS) QBETS (http://spinner.cs.ucsb.edu/batchq/) is a service for estimating bounds on the wait times of jobs submitted to batch-queued supercomputers. At its core, QBETS uses a non-parametric binomial statistical method, wrapped in a time series clustering method with change-point detection. We are in the process of re-engineering QBETS to work on generic, as opposed to domain-specific, time series, and are continually improving the underlying methodology.
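The binomial method at the heart of QBETS is simple enough to sketch: to bound the q-th quantile of the wait-time distribution with confidence c, pick the smallest order statistic of the historical sample that works under the binomial distribution. A minimal sketch (my simplification of the published method, not the QBETS code):

```python
# Non-parametric upper bound on a quantile via binomial order statistics.
import numpy as np
from scipy.stats import binom

def quantile_upper_bound(samples, q=0.95, confidence=0.95):
    """Return a value exceeding the true q-quantile with probability
    >= confidence, using only the empirical order statistics."""
    x = np.sort(np.asarray(samples))
    n = len(x)
    # The k-th order statistic exceeds the true q-quantile whenever
    # fewer than k samples fall below it, i.e. with probability
    # Binom(n, q).cdf(k - 1); take the smallest k that suffices.
    for k in range(1, n + 1):
        if binom.cdf(k - 1, n, q) >= confidence:
            return x[k - 1]
    return None  # too few samples for this (q, confidence) pair

# Toy usage with synthetic wait times (seconds).
waits = np.random.default_rng(2).exponential(scale=600.0, size=200)
print(f"95% bound on the 0.95 quantile: {quantile_upper_bound(waits):.0f} s")
```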


From Data to Knowledge - 107 - Alex Szalay
Abstract Alex Szalay (Dept. of Physics and Astronomy, Johns Hopkins University) Scientific computing is increasingly revolving around massive amounts of data. This new, data-centric computing requires a new look at computing architectures and strategies. We will look at how various randomized and streaming algorithms exhibit much better scaling behavior than the traditional "optimal" algorithms. We will also discuss how existing hardware can be used to build systems that are better suited to high-performance streaming.
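The claim that randomized and streaming algorithms scale better than traditional batch ones is easy to make concrete. One canonical example (my illustration, not from the talk) is reservoir sampling, which draws a uniform random sample from a stream of unknown length in a single pass and O(k) memory:

```python
# Reservoir sampling: a randomized streaming algorithm that keeps a
# uniform sample of size k without ever storing the whole dataset.
import random

def reservoir_sample(stream, k, seed=0):
    """Uniform random sample of size k from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Item i survives with probability k/(i+1); this keeps every
            # element equally likely to end up in the final reservoir.
            j = rng.randrange(i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir

# Works on data far too large to hold in memory, e.g. a generator:
print(reservoir_sample((x * x for x in range(10_000_000)), k=5))
```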



From Data to Knowledge - 108 - Jeff Scargle
Abstract Jeff Scargle (NASA Ames, Planetary Systems, Astrobiology and Space Science) Bayesian Blocks finds the best-fitting piecewise constant model of time series data. This algorithm uses dynamic programming to find the exact global optimum over all possible partitions of N data points (an exponentially large search space!) in time of order N^2. Recent elaborations of the basic method include: treatment of gaps and exposure variations in general, calibration of the parameter in the prior on the number of blocks, multivariate analysis including mixed data modes, solution of the empty block problem, piecewise linear and piecewise exponential models, and partitioning data on the circle. All of these features can be implemented in a real-time mode, in which the algorithm triggers on the first change-point found in the data stream, as well as in the standard retrospective mode.
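Since the abstract gives the algorithmic core (an O(N^2) dynamic program over all partitions), here is a compact Python sketch of the event-data version, following the published recurrence; a full, well-tested implementation is available as astropy.stats.bayesian_blocks. The ncp_prior default below is an arbitrary placeholder:

```python
# Bayesian Blocks for event times: dynamic programming over all
# partitions, O(N^2) despite the exponentially large search space.
import numpy as np

def bayesian_blocks(t, ncp_prior=4.0):
    """Return the left edges of the optimal piecewise-constant blocks."""
    t = np.sort(np.asarray(t, dtype=float))
    n = len(t)
    # Cell edges: midpoints between events, plus the data boundaries.
    edges = np.concatenate([[t[0]], 0.5 * (t[1:] + t[:-1]), [t[-1]]])
    best = np.zeros(n)       # best[i]: optimal fitness of the first i+1 cells
    last = np.zeros(n, int)  # last[i]: start cell of the final block
    for r in range(n):
        # Fitness N*(log N - log T) of a final block spanning cells k..r,
        # for every candidate start k, minus the prior penalty per block.
        widths = edges[r + 1] - edges[:r + 1]
        counts = np.arange(r + 1, 0, -1)   # events in the block from cell k
        fit = counts * (np.log(counts) - np.log(widths)) - ncp_prior
        fit[1:] += best[:r]                # add the best prefix solution
        last[r] = np.argmax(fit)
        best[r] = fit[last[r]]
    # Walk the change-points back from the end.
    cps, i = [], n
    while i > 0:
        cps.append(last[i - 1])
        i = last[i - 1]
    return edges[np.array(cps[::-1])]

# Toy usage: a rate change halfway through the observation.
rng = np.random.default_rng(6)
events = np.concatenate([rng.uniform(0, 50, 100), rng.uniform(50, 100, 400)])
print(bayesian_blocks(events))
```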



From Data to Knowledge - 109 - Tamas Budavari  
Abstract Tamas Budavari (Johns Hopkins University, Physics & Astronomy) Observational astronomy in the time-domain era faces several new challenges. One of them is the optimal use of multiple detections in a sequence of repeated observations. The work presented here focuses on faint sources at the detection threshold, and seeks an incremental strategy for separating real objects from random artifacts in ongoing surveys, where one does not have all the observations readily available.
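One way to picture an incremental strategy of this kind (a hedged sketch of the general idea, not the paper's actual method) is sequential Bayesian updating: each new epoch adds evidence to the posterior odds that a marginal detection is a real source rather than an artifact. The source_flux parameter and the 100:1 acceptance threshold below are my own illustrative choices:

```python
# Incremental source-vs-artifact decision via accumulated log-odds.
import numpy as np
from scipy.stats import norm

def update_log_odds(log_odds, flux, err, source_flux=1.0):
    """Add one epoch's evidence: likelihood of the measured flux under
    an assumed real source versus a zero-flux noise artifact."""
    log_like_source = norm.logpdf(flux, loc=source_flux, scale=err)
    log_like_noise = norm.logpdf(flux, loc=0.0, scale=err)
    return log_odds + log_like_source - log_like_noise

# Toy usage: a faint source, below single-epoch significance.
rng = np.random.default_rng(3)
log_odds = 0.0  # even prior odds
for epoch in range(12):
    flux = rng.normal(0.6, 0.5)
    log_odds = update_log_odds(log_odds, flux, err=0.5, source_flux=0.6)
    if log_odds > np.log(100):   # 100:1 odds -> accept as real
        print(f"accepted as a real source after {epoch + 1} epochs")
        break
else:
    print(f"still ambiguous after 12 epochs: log-odds {log_odds:.1f}")
```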



From Data to Knowledge - 110 - Ashish Mahabal  
Abstract Ashish Mahabal (Caltech, Astronomy) For rare and time-constrained astronomical phenomena like transients, decisions have to be made quickly, before the object fades and while more observations are still possible. The available data on which to base these decisions are almost always fragmentary: (1) those coming from a small number of flux and positional observations during discovery, and (2) some from archival observations at different wavelengths. The archival data are especially sparse because of (a) pointed observations and hence incomplete coverage of the sky, and (b) different flux levels reached by different sets of observations. The resulting datasets for transients have a large number of columns that are mostly empty, and the priors that can be constructed from them are far from ideal. Using examples from the Catalina Real-Time Transient Survey (CRTS), we explore Bayesian methodology to boost differences between the classes for learning to separate them. This technique should see wide use in time-domain astronomy, but it should also find applicability in other settings involving sparse matrices.
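To make the "mostly empty columns" issue concrete, here is a toy sketch (my illustration, not the CRTS pipeline) of a Gaussian naive Bayes classifier that simply skips missing features when scoring each object, so that sparsity degrades the evidence rather than breaking the classifier:

```python
# Naive Bayes over a sparse feature matrix: NaN features are skipped.
import numpy as np

def fit_per_class_stats(X, y):
    """Per-class mean/std of each feature, ignoring NaN entries."""
    stats = {}
    for c in np.unique(y):
        Xc = X[y == c]
        stats[c] = (np.nanmean(Xc, axis=0), np.nanstd(Xc, axis=0) + 1e-9)
    return stats

def predict(x, stats, log_priors):
    """Score only the features this object actually has."""
    scores = {}
    for c, (mu, sd) in stats.items():
        ok = ~np.isnan(x)
        z = (x[ok] - mu[ok]) / sd[ok]
        scores[c] = log_priors[c] - 0.5 * np.sum(z**2 + np.log(2 * np.pi * sd[ok]**2))
    return max(scores, key=scores.get)

# Toy usage: 70% of the feature matrix is missing.
rng = np.random.default_rng(4)
X = rng.normal(size=(200, 6))
y = (X[:, 0] + X[:, 3] > 0).astype(int)
X[rng.random(X.shape) < 0.7] = np.nan
stats = fit_per_class_stats(X, y)
print(predict(X[0], stats, {0: np.log(0.5), 1: np.log(0.5)}))
```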



From Data to Knowledge - 111 - Josh Bloom  
 Joshua Bloom: "Classification of Astronomical Time Series in the Synoptic Survey Era". A video from the UC Berkeley Conference: From Data to Knowledge: Machine-Learning with Real-time and Streaming Applications (May 7-11, 2012).
Abstract: Joshua Bloom (UC Berkeley, Astronomy) We have entered the Synoptic Survey Era of observational astronomy, where data rates are quickly reaching several terabytes per night. A growing army of telescopes monitors, nightly, the brightnesses of millions, and soon upwards of a billion, objects. Real-time analysis of these data is critical to determine which objects and events require timely observations with expensive follow-up resources. To maximize the scientific returns from these massive projects, sophisticated machine-learning tools must be used. Our group has been on the cutting edge of methodological and algorithmic development for time-domain astronomical data analysis. I will describe several problems in which we have made great strides, including real-time ML classification of transient events, photometric supernova typing, and probabilistic classification of variable stars from long-baseline time series. We address a multitude of statistical issues, and in this talk I will describe our use of manifold learning for feature extraction in time series, active learning to overcome sample-selection biases, and semi-supervised learning to detect anomalies in data streams. I will describe the newly released Machine-learned ASAS Classification Catalog (MACC, www.bigmacc.info) and discuss the future of astronomical source catalogs.
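The overall shape of such a pipeline, turning each irregularly sampled light curve into a feature vector and then training a classifier, can be sketched in a few lines. The features below are toy stand-ins, not the group's actual feature set, and the random forest is just one reasonable classifier choice:

```python
# Toy light-curve classification: simple features + a random forest.
import numpy as np
from scipy.signal import lombscargle
from sklearn.ensemble import RandomForestClassifier

def light_curve_features(times, mags):
    """A few toy features: amplitude, spread, skew, and the frequency of
    the Lomb-Scargle periodogram peak (which handles uneven sampling)."""
    freqs = np.linspace(0.01, 10.0, 2000)   # angular frequencies
    power = lombscargle(times, mags - mags.mean(), freqs)
    skew = np.mean(((mags - mags.mean()) / mags.std()) ** 3)
    return [np.ptp(mags), mags.std(), skew, freqs[np.argmax(power)]]

# Toy training set: noisy sinusoids (periodic) vs. pure noise.
rng = np.random.default_rng(5)
X, y = [], []
for label in (0, 1):
    for _ in range(50):
        t = np.sort(rng.uniform(0, 100, 80))   # uneven time sampling
        f = rng.uniform(0.5, 2.0)
        m = (np.sin(f * t) if label else 0) + rng.normal(0, 0.3, t.size)
        X.append(light_curve_features(t, m))
        y.append(label)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
print("training accuracy:", clf.score(X, y))
```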



Liked this entry? Subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on LinkedIn.
