Nuit Blanche: Part 2: Videos of the UC Berkeley Conference: "From Data to Knowledge: Machine-Learning with Real-time and Streaming Applications"

Thursday, May 10, 2012

Here is the second series of videos from the UC Berkeley Conference: From Data to Knowledge: Machine-Learning with Real-time and Streaming Applications (May 7-11, 2012). Congratulations to the local organizing committee: Joshua Bloom, Damian Eads, Berian James, Peter Nugent, John Rice, Joseph Richards and Dan Starr for making the meeting happen and for putting it all on video, in near real time, for others to learn from. All the videos are here: Part 1, Part 2, Part 3, Part 4, Part 5.

From Data to Knowledge - 201 - David Bader

David Bader: "Opportunities and Challenges in Massive Data-Intensive Computing". A video from the UC Berkeley Conference: From Data to Knowledge: Machine-Learning with Real-time and Streaming Applications (May 7-11, 2012).

Abstract David Bader (College of Computing, Georgia Tech) Emerging real-world graph problems include detecting community structure in large social networks, improving the resilience of the electric power grid, and detecting and preventing disease in human populations. Unlike traditional applications in computational science and engineering, solving these problems at scale often raises new challenges because of sparsity and the lack of locality in the data, the need for additional research on scalable algorithms and development of frameworks for solving these problems on high performance computers, and the need for improved models that also capture the noise and bias inherent in the torrential data streams. In this talk, the speaker will discuss the opportunities and challenges in massive data-intensive computing for applications in computational biology, genomics, and security. The explosion of real-world graph data poses a substantial challenge: How can we analyze constantly changing streaming graphs with billions of vertices? Our approach leverages fine-grained parallelism, lightweight synchronization, and shared memory, to scale to massive graphs.

From Data to Knowledge - 202 - Jeff Hawkins

Jeff Hawkins: "Modeling data streams using sparse distributed representations". A video from the UC Berkeley Conference: From Data to Knowledge: Machine-Learning with Real-time and Streaming Applications (May 7-11, 2012).

Abstract Jeff Hawkins (Numenta Inc) Sparse distributed representations appear to be the means by which brains encode information. They have several advantageous properties including the ability to encode semantic meaning. We have created a distributed memory system for learning sequences of sparse distributed representations. In addition we have created a means of encoding structured and unstructured data into sparse distributed representations. The resulting memory system learns in an on-line fashion, making it suitable for high velocity data streams. We are currently applying it to commercially valuable data streams for prediction, classification, and anomaly detection. In this talk I will describe this distributed memory system and illustrate how it can be used to build models and make predictions from data streams.
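To make the idea of "encoding semantic meaning" a bit more concrete, here is a minimal sketch of a scalar encoder, assuming a representation in which a small contiguous run of bits is active and similarity is measured by the number of shared active bits. The function names and all parameters (`size`, `active`, the value range) are illustrative assumptions, not Numenta's actual implementation.

```python
def encode_scalar(value, min_val=0.0, max_val=100.0, size=64, active=8):
    """Encode a number as a sparse set of active bit indices.

    Hypothetical encoder: `active` contiguous bits out of `size` are on,
    and the position of the run depends on the value.
    """
    value = min(max(value, min_val), max_val)      # clip to the supported range
    buckets = size - active + 1                    # number of possible start positions
    fraction = (value - min_val) / (max_val - min_val)
    start = int(round(fraction * (buckets - 1)))
    return frozenset(range(start, start + active))


def overlap(a, b):
    """Similarity of two representations: count of shared active bits."""
    return len(a & b)


a = encode_scalar(50)
b = encode_scalar(52)   # close to 50, so many bits are shared
c = encode_scalar(90)   # far from 50, so no bits are shared
print(overlap(a, b), overlap(a, c))
```

In this sketch, nearby values share most of their active bits while distant values share none, which is one simple way a sparse code can carry similarity information.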

From Data to Knowledge - 203 - Steve Plimpton

Steve Plimpton: "PHISH framework for streaming graph algorithms". A video from the UC Berkeley Conference: From Data to Knowledge: Machine-Learning with Real-time and Streaming Applications (May 7-11, 2012).

Abstract Steve Plimpton (Sandia National Labs, Computational Sciences) We've developed a small, portable C++ library called PHISH (Parallel Harness for Informatic Stream Hashing) which orchestrates the passing of datums between independent processes (minnows) as they compute on a stream of data in parallel. The library has a C API and a Python wrapper, so minnows can be written in any language (C, C++, Fortran, Python). PHISH can be run on top of message-passing (MPI) or sockets (zeroMQ), which means a streaming computation can be launched on a multi-core desktop, a traditional supercomputer, or a geographically dispersed network. We've been using PHISH to experiment with streaming MapReduce operations and graph algorithms, such as connected component finding and sub-graph isomorphism matching. I'll give a few PHISHy details, describe the streaming graph algorithms at a high level, and highlight MPI vs socket performance.
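As an illustration of what "connected component finding" on a stream can look like, here is a single-process sketch that uses a union-find structure updated one edge at a time. It does not use the PHISH API and is not the algorithm from the talk; the class and method names are made up for this example, and it only handles edge insertions.

```python
class StreamingComponents:
    """Incremental connected components over an insert-only edge stream."""

    def __init__(self):
        self.parent = {}   # vertex -> parent vertex in the union-find forest
        self.size = {}     # root vertex -> number of vertices in its component
        self.count = 0     # current number of components

    def _find(self, v):
        # Register a vertex the first time it is seen.
        if v not in self.parent:
            self.parent[v] = v
            self.size[v] = 1
            self.count += 1
        root = v
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every vertex on the path at the root.
        while self.parent[v] != root:
            self.parent[v], v = root, self.parent[v]
        return root

    def add_edge(self, u, v):
        ru, rv = self._find(u), self._find(v)
        if ru == rv:
            return
        # Union by size: attach the smaller tree under the larger one.
        if self.size[ru] < self.size[rv]:
            ru, rv = rv, ru
        self.parent[rv] = ru
        self.size[ru] += self.size[rv]
        self.count -= 1

    def connected(self, u, v):
        return self._find(u) == self._find(v)


cc = StreamingComponents()
for edge in [(1, 2), (3, 4), (2, 3), (5, 6)]:   # edges arrive one at a time
    cc.add_edge(*edge)
print(cc.count, cc.connected(1, 4), cc.connected(1, 5))
```

Each edge is processed once and then discarded, so memory grows with the number of vertices rather than the number of edges.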

From Data to Knowledge - 204 - Suresh Venkatasubramanian

Suresh Venkatasubramanian: "Protocols for Distributed Learning". A video from the UC Berkeley Conference: From Data to Knowledge: Machine-Learning with Real-time and Streaming Applications (May 7-11, 2012).

Abstract Suresh Venkatasubramanian (University of Utah, CS) We consider the problem of learning classifiers for labeled data that has been distributed across several nodes. Our goal is to find a single classifier, with small approximation error, across all datasets while minimizing the communication between nodes. This setting models real-world communication bottlenecks in the processing of massive distributed datasets. We present several very general sampling-based solutions as well as some two-way protocols which have a provable exponential speed-up over any one-way protocol. We focus on core problems for noiseless data distributed across two or more nodes. The techniques we introduce are reminiscent of active learning, but rather than actively probing labels, nodes actively communicate with each other, each node simultaneously learning the important data from another node.

From Data to Knowledge - 205 - Xavier Amatriain

Xavier Amatriain: "Netflix Recommendations: Beyond the 5 Stars". A video from the UC Berkeley Conference: From Data to Knowledge: Machine-Learning with Real-time and Streaming Applications (May 7-11, 2012).

Abstract Xavier Amatriain (Netflix) Netflix is known for pushing the envelope of recommendation technologies. In particular, the Netflix Prize put a focus on using explicit user feedback to predict ratings. Nowadays Netflix has moved into the streaming world and this has spurred numerous changes in the way people use the service. Instead of spending time deciding what to add to a DVD queue to watch later, people now access the service and watch whatever appeals to them at that moment. In this talk I will give a detailed overview of the different techniques we used to personalize Netflix. I will explain why we now consider that "everything is a recommendation", to the point that more than 75% of the things users select come from some sort of recommendation. I will describe how we deal with the different data sources and models and how we innovate by using an offline-online cycle that connects our machine learning experiments with the results of our A/B tests.

From Data to Knowledge - 206 - Joao Gama

Joao Gama: "Challenges on Mining Evolving Data Streams". A video from the UC Berkeley Conference: From Data to Knowledge: Machine-Learning with Real-time and Streaming Applications (May 7-11, 2012).

Abstract Joao Gama (Lab. of A.I. and Decision Support, Economics at Univ. of Porto, Portugal) The computational model of data streams imposes new challenges and opens new research opportunities in the design of data mining algorithms. Data is abundant, being continuously generated from time-changing processes with unknown dynamics. Evolving, time-changing data requires that learning algorithms be able to monitor the evolution of the learning process. Monitoring the learning process enables self-diagnosis that is not only reactive, after a failure has occurred, but also predictive, before the failure. These aspects require monitoring the evolution of the learning process, taking into account the available resources. Diagnosis is a significant and useful characteristic, and requires the ability to reason and learn about the learning process itself. In this talk we present a one-pass classification algorithm capable of self-diagnosis. It detects and reacts to changes in the process generating the data, identifies contexts using drift detection, characterizes contexts using meta-learning, and selects the most appropriate base model for the incoming data using unlabeled examples.
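The abstract does not say which drift detector is used. As one possible illustration, the sketch below monitors a classifier's running error rate and flags a change when it rises well above the best level seen so far, a rule in the spirit of the Drift Detection Method published earlier by Gama and co-authors. The class name, the thresholds, and the use of a strict comparison are assumptions made for this example.

```python
import math


class DriftDetector:
    """Error-rate drift detector (illustrative, DDM-style rule)."""

    def __init__(self, min_samples=30, warn=2.0, drift=3.0):
        self.min_samples = min_samples   # samples to see before testing
        self.warn = warn                 # warning threshold, in std deviations
        self.drift = drift               # drift threshold, in std deviations
        self.reset()

    def reset(self):
        self.n = 0
        self.errors = 0
        self.p_min = float("inf")        # error rate at the best point so far
        self.s_min = float("inf")        # its standard deviation

    def update(self, error):
        """Feed one prediction outcome (True means misclassified)."""
        self.n += 1
        self.errors += int(error)
        p = self.errors / self.n
        s = math.sqrt(p * (1.0 - p) / self.n)
        if self.n < self.min_samples:
            return "stable"
        if p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = p, s
        if p + s > self.p_min + self.drift * self.s_min:
            self.reset()                 # start learning the new context
            return "drift"
        if p + s > self.p_min + self.warn * self.s_min:
            return "warning"
        return "stable"


# Synthetic stream: 20% errors for 100 samples, then every prediction is wrong.
stream = [i % 5 == 0 for i in range(1, 101)] + [True] * 50
detector = DriftDetector()
states = [detector.update(e) for e in stream]
first_drift = states.index("drift") + 1
print("drift flagged at sample", first_drift)
```

In a full system, a "drift" signal would trigger the context identification and model selection steps that the abstract describes; here it only resets the detector's statistics.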

From Data to Knowledge - 207 - Georges Hebrail

Georges Hebrail: "Analog Method for Collaborative Very-Short-Term Forecasting of Power Generation from Photovoltaic Systems". A video from the UC Berkeley Conference: From Data to Knowledge: Machine-Learning with Real-time and Streaming Applications (May 7-11, 2012).

Abstract Georges Hebrail (Electricite de France R&D) There is an increasing interest in exploiting renewable energy resources, such as solar power harnessed by photovoltaic (PV) systems. Forecasts of PV production are important tools for the management and adoption of this energy. We propose a method to generate very short term (up to a few hours) forecasts of power generation from individual PV systems. The method is based on a search for analogs, with an additional filter that uses information from nearby PV systems. While exchanging a very small amount of information between sites and keeping local measurements private, collaboration between sites manages to enrich local predictions. The method is tested with data from 11 PV systems and accuracy is compared to two reference models: a linear regression model and a corrected persistence model. Results show that the method is adequate for predictions at very short term, with some improvement in results when collaboration among sites is employed.
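The general idea of an analog search can be sketched as follows: find the past windows of a series that look most like the most recent window, and forecast from what followed them. This is a simplified single-site version with made-up function names and parameters; it leaves out the collaborative filter based on nearby sites, and the persistence baseline shown is the plain (uncorrected) variant.

```python
def analog_forecast(history, window=3, horizon=1, k=2):
    """Forecast `horizon` steps ahead from the k most similar past windows."""
    recent = history[-window:]
    candidates = []
    # Only consider past windows whose outcome `horizon` steps later is known.
    for start in range(len(history) - window - horizon + 1):
        past = history[start:start + window]
        dist = sum((a - b) ** 2 for a, b in zip(past, recent)) ** 0.5
        outcome = history[start + window + horizon - 1]
        candidates.append((dist, outcome))
    if not candidates:
        raise ValueError("history is too short for this window and horizon")
    candidates.sort(key=lambda c: c[0])
    nearest = candidates[:k]
    return sum(outcome for _, outcome in nearest) / len(nearest)


def persistence_forecast(history):
    """Baseline: assume the next value equals the last observed one."""
    return history[-1]


# Toy periodic "power" series; the latest window [0, 1, 2] was seen twice before.
power = [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2]
print(analog_forecast(power), persistence_forecast(power))
```

On this toy series the analog forecast picks up the repeating pattern, while the persistence baseline simply repeats the last value.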

From Data to Knowledge - 208 - Marc Berkowitz

Marc Berkowitz: "A Dynamically Scalable Cloud Platform for the Real-Time Detection of Seismic Events". A video from the UC Berkeley Conference: From Data to Knowledge: Machine-Learning with Real-time and Streaming Applications (May 7-11, 2012).

Abstract Marc Berkowitz (SQLstream Inc.) UCSD seismologists have deployed an application based on SQLstream that detects significant events in data collected from a large grid of seismic sensors. A large-scale data infrastructure (the OOI/CI) provides raw signal data over an AMQP message bus. SQLstream monitors this data in real-time, applying heuristic algorithms that look for patterns indicating earthquakes. The detected events are streamed in real-time back onto the AMQP message bus for visualization and further processing. The detection algorithms are coded in streaming SQL, with Java plugins for added time-series operations. The live system executes across multiple servers in a cloud environment. The number of servers is adjusted automatically, based on the current demand, by a control program in Python. SQLstream is a scalable, distributed platform for real-time monitoring and analysis of event streams. Applications are built using continuous SQL queries that process high speed, high volume data streams over moving time windows. Unlike traditional data management platforms that must store data before processing, SQLstream processes streams of arriving event data on the fly. Thus alerts and other analytics are generated continuously, without having to store the data.
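To give a feel for processing a stream "over moving time windows" without storing it, here is a small Python sketch of a sliding-window detector. It is not streaming SQL and not the detection heuristic used in the UCSD application; the function name, the window length, and the threshold on mean absolute amplitude are assumptions chosen for illustration.

```python
from collections import deque


def detect_events(samples, window_seconds=10.0, threshold=5.0):
    """Yield (timestamp, window_mean) whenever the moving average of
    absolute amplitude over the last `window_seconds` exceeds `threshold`.

    `samples` is any iterable of (timestamp, amplitude) pairs in time order.
    """
    window = deque()   # holds only the samples inside the current window
    total = 0.0
    for ts, amplitude in samples:
        window.append((ts, abs(amplitude)))
        total += abs(amplitude)
        # Evict samples that have fallen out of the time window.
        while window[0][0] <= ts - window_seconds:
            _, old = window.popleft()
            total -= old
        mean = total / len(window)
        if mean > threshold:
            yield ts, mean


# Quiet background for 20 seconds, then a burst of large amplitudes.
signal = [(t, 1.0) for t in range(20)] + [(t, 20.0) for t in range(20, 25)]
alerts = list(detect_events(signal))
print([ts for ts, _ in alerts])
```

Because the function is a generator that keeps only the current window in memory, alerts are produced as samples arrive rather than after the data has been stored.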

From Data to Knowledge - 209 - Frederic Py

Frederic Py: "In-situ robotic classification of Coastal Ocean features". A video from the UC Berkeley Conference: From Data to Knowledge: Machine-Learning with Real-time and Streaming Applications (May 7-11, 2012).

From Data to Knowledge - 210 - Venu Govindaraju

Venu Govindaraju: "Making Sense of all Things Handwritten from Postal Addresses to Tablet Notes". A video from the UC Berkeley Conference: From Data to Knowledge: Machine-Learning with Real-time and Streaming Applications (May 7-11, 2012).

Abstract Venu Govindaraju (SUNY Buffalo, CS) The handwritten address interpretation system pioneered in our lab at UB is widely regarded as one of the key success stories in AI. It integrated the document processing steps of binarization, segmentation, recognition, and combination of classifiers with carefully handcrafted rules. Advances in machine learning (ML) in the past decade, made possible by the abundance of training data, storage, and processing power, have facilitated the development of principled approaches for many of the same modules. In this talk, we will describe the ML adaptation of some of the modules originally deployed by the handwritten address interpretation system. We will present an MRF-based method for discriminating handwritten and machine-printed matter. The early success of document recognition systems in the handwritten domain hinged on constraining the size of the lexicons. Therefore, we will also detail an interactive model that made 'dynamic' use of the lexicon in building adaptive classifiers. Fusion of recognizers will then be investigated by statistical modeling of the dependencies in score vectors. We will conclude by presenting our recent foray into search applications, in addition to demonstrating the scalability of our methods in making sense of handwritten notes on tablets.