From Data to Knowledge - 401 - S. Muthukrishnan
S. Muthukrishnan: "Modern Algorithmic Tools for Analyzing Data Streams". A video from the UC Berkeley Conference: From Data to Knowledge: Machine-Learning with Real-time and Streaming Applications (May 7-11, 2012).
Abstract S. Muthukrishnan (Dept. of Computer Science, Rutgers University) We now have a second generation of algorithmic tools for analyzing massive online data streams that go beyond the initial tools for summarizing a single stream in small space. The new tools deal with distributed online data, stochastic models, dynamic graph and matrix objects, and others; they optimize communication, the number of parallel rounds, and privacy, among other things. I will provide an overview of these tools.
From Data to Knowledge - 402 - Hua Ouyang
Hua Ouyang: "Stochastic Smoothing for Nonsmooth Minimizations: Accelerating SGD by Exploiting Structure". A video from the UC Berkeley Conference: From Data to Knowledge: Machine-Learning with Real-time and Streaming Applications (May 7-11, 2012).
Abstract Hua Ouyang (Georgia Institute of Technology) In this work we consider the stochastic minimization of nonsmooth convex loss functions, a central problem in machine learning. We propose a novel algorithm called Accelerated Nonsmooth Stochastic Gradient Descent (ANSGD), which exploits the structure of common nonsmooth loss functions to achieve optimal convergence rates for a class of problems including SVMs. It is the first stochastic algorithm that achieves the optimal O(1/t) rate for minimizing nonsmooth loss functions. The fast rates are confirmed by empirical comparisons, in which ANSGD significantly outperforms previous subgradient descent algorithms including SGD.
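The abstract does not spell out ANSGD's update rule, but the core idea it builds on, replacing a nonsmooth loss with a smooth surrogate so that (accelerated) stochastic gradient steps apply, can be illustrated with a minimal sketch. The smoothing parameter mu, the toy data, and the 1/t step schedule here are illustrative choices, not the paper's algorithm:

```python
import random

def smoothed_hinge_grad(z, mu=0.1):
    # derivative w.r.t. the margin z = y * <w, x> of a smoothed hinge loss:
    # quadratic near the kink at z = 1, linear below, zero above
    if z >= 1.0:
        return 0.0
    if z <= 1.0 - mu:
        return -1.0
    return -(1.0 - z) / mu

# toy linearly separable data: label is the sign of the first feature
data = [((1.0, 0.5), 1), ((2.0, -0.3), 1), ((-1.5, 0.2), -1), ((-0.7, -1.0), -1)]

random.seed(0)
w = [0.0, 0.0]
for t in range(1, 2001):
    x, y = random.choice(data)                 # one stochastic sample per step
    z = y * (w[0] * x[0] + w[1] * x[1])
    g = smoothed_hinge_grad(z)
    eta = 1.0 / t                              # decreasing step size (illustrative)
    w[0] -= eta * g * y * x[0]
    w[1] -= eta * g * y * x[1]
```

With a plain subgradient of the hinge loss the rate degrades near the kink; smoothing is what lets accelerated schemes recover faster rates, which is the structural idea the talk exploits.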
From Data to Knowledge - 403 - S.V.N. Vishwanathan
S.V.N. Vishwanathan: "Training Linear Classifiers via Dual Cached Loops". A video from the UC Berkeley Conference: From Data to Knowledge: Machine-Learning with Real-time and Streaming Applications (May 7-11, 2012).
Abstract S.V.N. Vishwanathan (Purdue University, Stats & CS) StreamSVM is a solver for linear SVMs that exploits the different speeds of computing on the CPU and accessing data from disk. StreamSVM works by performing coordinate updates on the dual, thus avoiding the need to rebalance frequently visited examples. Further, we trade off file I/O with data expansion on the fly by generating features on demand, thereby significantly increasing throughput. Experiments show that StreamSVM outperforms other linear SVM solvers by orders of magnitude while also producing more accurate solutions.
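The "coordinate updates on the dual" the abstract refers to are the standard dual coordinate descent updates for a linear SVM: each pass touches one example's dual variable and maintains the primal weight vector incrementally. A minimal sketch on toy data (the caching and on-the-fly feature expansion that make StreamSVM fast are not shown):

```python
# toy data, labels in {-1, +1}
X = [(1.0, 0.5), (2.0, -0.3), (-1.5, 0.2), (-0.7, -1.0)]
y = [1, 1, -1, -1]
C = 1.0                                        # box constraint on each alpha_i

w = [0.0, 0.0]
alpha = [0.0] * len(X)
Q = [sum(v * v for v in x) for x in X]         # diagonal of the Gram matrix

for epoch in range(20):
    for i, (x, yi) in enumerate(zip(X, y)):
        G = yi * (w[0] * x[0] + w[1] * x[1]) - 1.0          # dual gradient
        new_a = min(max(alpha[i] - G / Q[i], 0.0), C)       # clipped 1-D step
        d = new_a - alpha[i]
        if d != 0.0:                           # keep w = sum_i alpha_i y_i x_i
            alpha[i] = new_a
            w[0] += d * yi * x[0]
            w[1] += d * yi * x[1]
```

Because each update needs only one example plus the current w, the example order can be chosen to match what is cheap to read from disk or cache, which is the degree of freedom StreamSVM exploits.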
From Data to Knowledge - 404 - John Duchi
John Duchi: "Randomized Smoothing for (Parallel) Stochastic Optimization". A video from the UC Berkeley Conference: From Data to Knowledge: Machine-Learning with Real-time and Streaming Applications (May 7-11, 2012).
Abstract John Duchi (UC Berkeley, CS) By combining randomized smoothing techniques with accelerated gradient methods, we obtain convergence rates for stochastic optimization procedures, both in expectation and with high probability, that have optimal dependence on the variance of the gradient estimates. To the best of our knowledge, these are the first variance-based convergence guarantees for non-smooth optimization. A combination of our techniques with recent work on decentralized optimization yields order-optimal parallel stochastic optimization algorithms. We give applications of our results to several statistical machine learning problems, providing experimental results demonstrating the effectiveness of our algorithms.
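Randomized smoothing replaces a nonsmooth objective f with f_mu(x) = E[f(x + mu*Z)], which is smooth, and estimates its gradient by averaging subgradients of f at Gaussian-perturbed points. A minimal one-dimensional sketch on f(x) = |x| (the sample count, mu, and step size are illustrative; the paper couples such estimates with accelerated methods):

```python
import random, math

def smoothed_grad(x, mu=0.1, m=20):
    # Monte Carlo gradient of f_mu(x) = E[f(x + mu*Z)] for f(x) = |x|,
    # averaging the subgradient sign(.) at m Gaussian-perturbed points
    return sum(math.copysign(1.0, x + mu * random.gauss(0, 1))
               for _ in range(m)) / m

random.seed(1)
x = 3.0
for t in range(300):
    x -= 0.05 * smoothed_grad(x)               # plain (unaccelerated) descent
```

Averaging over m perturbations reduces the variance of the gradient estimate, and it is this variance that the convergence rates in the talk depend on, which is also what makes the scheme easy to parallelize: the m evaluations are independent.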
From Data to Knowledge - 405 - Borja Balle
Borja Balle: "Learning Markovian Models from Time-Evolving Data Streams". A video from the UC Berkeley Conference: From Data to Knowledge: Machine-Learning with Real-time and Streaming Applications (May 7-11, 2012).
Abstract Borja Balle (Universitat Politecnica de Catalunya, CS) Markovian models with hidden state are widely-used formalisms for modeling sequential phenomena. Learnability of these models has been well studied when the sample is given in batch mode, and algorithms with PAC-like learning guarantees exist for specific classes of models such as Probabilistic Deterministic Finite Automata (PDFA). Our work focuses on PDFA and gives, to the best of our knowledge, the first algorithm for inferring models in this class under the stringent data stream scenario: unlike existing methods, our algorithm works incrementally and in one pass, uses memory sublinear in the stream length, processes input items in amortized constant time, and can react to drift by revising the model and forgetting past irrelevant information. We provide rigorous guarantees for all of the above, as well as an evaluation on realistic synthetic data. Our algorithm makes key use of several old and new sketching techniques. In particular, we develop a new sketch for implementing bootstrapped statistical tests in a streaming setting, which may be of independent interest. Experimentally, we observe that this sketch yields order-of-magnitude reductions in the number of examples required for performing some crucial statistical tests in our algorithm.
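The abstract's new bootstrap sketch is not described in detail, but the classical sketches it builds on are standard. As one representative example, a Count-Min sketch maintains approximate symbol frequencies in memory sublinear in the stream length, with estimates that never undercount:

```python
import random

class CountMinSketch:
    # classic Count-Min sketch: fixed-size table of depth x width counters,
    # one hash function (here, a salted hash) per row
    def __init__(self, width=64, depth=4, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.salts = [rng.random() for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, salt, item):
        return hash((salt, item)) % self.width

    def add(self, item):
        for salt, row in zip(self.salts, self.table):
            row[self._index(salt, item)] += 1

    def estimate(self, item):
        # min over rows: an overestimate only when hash collisions occur
        return min(row[self._index(salt, item)]
                   for salt, row in zip(self.salts, self.table))

cms = CountMinSketch()
stream = ["a"] * 100 + ["b"] * 10 + ["c"]
for sym in stream:
    cms.add(sym)
```

Structures of this kind are what allow a PDFA learner to compare empirical distributions of suffixes across candidate states without storing the stream itself.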
From Data to Knowledge - 406 - Ziv Bar-Joseph
Ziv Bar-Joseph: "Data integration for modeling dynamic biological systems". A video from the UC Berkeley Conference: From Data to Knowledge: Machine-Learning with Real-time and Streaming Applications (May 7-11, 2012).
Abstract Ziv Bar-Joseph (Computer Science Dept., Carnegie Mellon University) Several recent technological advances are transforming molecular biology into a data-intensive field. Several different types of data, each providing a different point of view of cellular activity, can now be measured on a large scale. These include sequencing data, mRNA and microRNA expression data and various types of interaction datasets. While valuable, these datasets also raise several computational challenges, and machine learning methods have been playing an ever-increasing role in addressing them. In this talk I will discuss some challenges associated with the analysis of sequencing data from next generation machines that often generate tens of millions of (often noisy) short reads. I will then discuss how to integrate time series data from these next generation RNA-Sequencing studies with (mostly static) protein-DNA interaction data for modeling dynamic regulatory networks using an Input-Output Hidden Markov model (IOHMM). These network models lead to testable temporal hypotheses identifying both new regulators and their time of activation. Our models can be extended to integrate other types of biological interactions including protein interactions (for modeling signaling networks) and microRNA data. I will discuss the application and experimental validation of predictions made by our methods and implications for predicting interventions in various diseases.
From Data to Knowledge - 407 - Jason Pell
Jason Pell: "Digital Normalization of Metagenomic DNA Sequence Data". A video from the UC Berkeley Conference: From Data to Knowledge: Machine-Learning with Real-time and Streaming Applications (May 7-11, 2012).
Abstract Jason Pell (Michigan State University, CSE) The cost of DNA sequencing is falling rapidly, which allows data to be generated more quickly than it can be processed using current algorithms and techniques. Digital Normalization (diginorm) is a streaming data reduction algorithm that can significantly reduce the total amount of sequence data without reducing the amount of information in the data set, and also removes much of the data containing sequencing errors. Diginorm is especially well suited for metagenomic data, where the relative abundance of different organisms can differ by several orders of magnitude, and sampling deeply to cover the low-abundance organisms will result in extremely high coverage of high-abundance organisms. Diginorm requires a fixed amount of memory and can be parallelized for faster processing.
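The streaming logic behind diginorm is simple to sketch: estimate each read's coverage from the counts of its k-mers seen so far, keep the read only if the median count is below a cutoff, and update the counts only for kept reads. A minimal illustration, with a plain `Counter` standing in for the fixed-memory counting sketch the real tool uses, and with toy values of k and the cutoff:

```python
from collections import Counter

def kmers(read, k=4):
    # all overlapping substrings of length k
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def digital_normalization(reads, k=4, cutoff=3):
    counts = Counter()
    kept = []
    for read in reads:
        kms = kmers(read, k)
        median = sorted(counts[km] for km in kms)[len(kms) // 2]
        if median < cutoff:          # read is not yet well covered: keep it
            kept.append(read)
            counts.update(kms)       # counts grow only for retained reads
    return kept

# 10 identical high-coverage reads plus one rare read
reads = ["ACGTACGTAC"] * 10 + ["TTTTGGGGCC"]
kept = digital_normalization(reads)
```

Because a read full of sequencing errors contributes k-mers that rarely recur, error-dominated data also tends to stay rare in the counts, which is why normalization discards redundant high-coverage reads while retaining low-abundance organisms.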
From Data to Knowledge - 408 - David Kale
David Kale: "Unsupervised Pattern Discovery in Sparsely Sampled Clinical Time Series". A video from the UC Berkeley Conference: From Data to Knowledge: Machine-Learning with Real-time and Streaming Applications (May 7-11, 2012).
Abstract David Kale (Children's Hospital LA, Virtual Pediatric Intensive Care Unit) Time series data recorded in real electronic health record (EHR) systems, such as those in pediatric intensive care units (PICUs), presents a challenging set of problems that appear to differ from those in other streaming data domains. In addition to the complexity of the underlying dynamic system (the human body, affected by various disease processes), there is a requirement that observations be manually verified by clinical staff, resulting in data that is sparse, irregularly sampled, incomplete, and potentially biased in subtle ways. Through a particular research question (e.g., searching for similar episodes within a historical database), we will explore the opportunities and challenges presented by such data, increasingly prevalent in commercial EHRs. We will present some results obtained using a simple model and discuss active research directions we are pursuing with collaborators such as Benjamin Marlin from University of Massachusetts Amherst, Christian Shelton at UC Riverside, and Yan Liu at USC. On a more practical note, we will mention how researchers can get started in this exciting new area, including how they can obtain data.
From Data to Knowledge - 409 - Fabien Scalzo
Fabien Scalzo: "Pattern Recognition Methods For Efficient Alarm-based Patient Monitoring". A video from the UC Berkeley Conference: From Data to Knowledge: Machine-Learning with Real-time and Streaming Applications (May 7-11, 2012).
Abstract Fabien Scalzo (UCLA, Neurosurgery) Bedside monitors are omnipresent in intensive care units (ICU) of modern hospitals. They are designed to continuously analyze clinically relevant signals and attract the attention of the critical care nurse by emitting an alarm sound. Although alarm-based patient monitoring systems are life-saving tools, it has also been recognized that most of the alarms produced are false (up to 85%). False alarms are a major issue that causes alarm fatigue, waste of human resources, and increased patient risks. As currently implemented in bedside monitors, alarms are triggered by manually adjusted thresholds. This talk describes our latest work on intelligent alarm detection systems based on the dynamics of the signal and the waveform patterns observed prior to threshold crossing. Experimental evaluation is performed on a comprehensive dataset of 4791 manually labelled intracranial pressure alarm episodes extracted from 154 neurosurgical patients. In addition to this pilot study, we introduce the concept of SuperAlarms, which abstracts specific temporal co-occurrences of alarms. SuperAlarms are higher-level indicators of ongoing patient deterioration and hence predictive of in-hospital emergency codes.
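To make the threshold-crossing setup concrete, here is a toy sketch of the two pieces involved: detecting where a signal crosses an alarm threshold, and applying a screening rule to the surrounding samples. The duration-based rule below is purely illustrative, not the classifier from the talk, which learns from waveform patterns preceding the crossing:

```python
def crossing_events(signal, threshold):
    # indices where the signal first rises to or above the alarm threshold
    return [i for i in range(1, len(signal))
            if signal[i] >= threshold > signal[i - 1]]

def likely_false_alarm(signal, i, threshold, min_duration=3):
    # illustrative heuristic: flag a crossing as a likely artifact if the
    # signal drops back below threshold after fewer than min_duration samples
    run = 0
    j = i
    while j < len(signal) and signal[j] >= threshold:
        run += 1
        j += 1
    return run < min_duration

# toy intracranial-pressure trace: one transient spike, one sustained episode
icp = [10, 11, 25, 11, 10, 12, 24, 25, 26, 25, 24, 12]
events = crossing_events(icp, 20)
flags = [likely_false_alarm(icp, i, 20) for i in events]
```

In practice the screening step would be a learned model over features of the pre-crossing window rather than a hand-set duration rule, but the pipeline shape is the same.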