Nuit Blanche: Part 5: From Data to Knowledge: Machine-Learning with Real-time and Streaming Applications

Saturday, May 12, 2012


Here is the fifth series of videos from the UC Berkeley Conference: From Data to Knowledge: Machine-Learning with Real-time and Streaming Applications (May 7-11, 2012). Congratulations to the local organizing committee: Joshua Bloom, Damian Eads, Berian James, Peter Nugent, John Rice, Joseph Richards and Dan Starr for making the meeting happen and putting it all on video, in near real time, for others to learn from. All the videos are here: Part 1, Part 2, Part 3, Part 4, Part 5.

From Data to Knowledge - 501 - Michael Franklin

Michael Franklin: "Continuous Analytics: Data Stream Query Processing in Practice". A video from the UC Berkeley Conference: From Data to Knowledge: Machine-Learning with Real-time and Streaming Applications (May 7-11, 2012).

Abstract Michael Franklin (Computer Science Dept., UC Berkeley) Data stream query processing is a key technique for providing low latency answers to Big Data questions. The basic idea is to provide database-style query processing over data on-the-fly as they arrive at the system, in contrast to the store-first, query-later approach followed by traditional database systems. Work in this area was originally motivated by "real-time" data-intensive scenarios such as sensor networks, financial trading applications, and network security. Stream query processing caught the imagination of the research community due to the new applications it could enable as well as the large number of traditional database assumptions that needed to be rethought and the new opportunities for optimization this mode of execution provided. Lately, stream processing has been moving from the research lab into the real world through efforts at start-up companies, traditional database vendors, and open source projects. Not surprisingly, the practical uses and advantages of the technology are turning out to be different than many had originally expected. In this talk, I'll survey the state of the art in stream query processing and related technologies such as event processing, discuss some of the implications for data-intensive system architectures, and provide my views on the future role of this technology from both a research and a commercial perspective. In particular, I'll describe the notion of Continuous Analytics, which leverages Stream Query Processing techniques to solve some of the inherent bottlenecks that have existed in database systems since their inception.
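To make the contrast with store-first, query-later concrete, here is a minimal sketch of a continuous query: a sliding-window average that is updated as each tuple arrives instead of being computed over a stored table. The class name, the window length and the sample stream are all hypothetical, chosen for illustration; nothing here is taken from the talk or from any particular stream engine.

```python
from collections import deque


class SlidingWindowAverage:
    """Continuous query: average of the values seen in the last `window` time units."""

    def __init__(self, window):
        self.window = window
        self.events = deque()  # (timestamp, value) pairs, oldest first
        self.total = 0.0       # running sum of the values currently in the window

    def on_event(self, timestamp, value):
        # Admit the new tuple and update the running sum.
        self.events.append((timestamp, value))
        self.total += value
        # Evict tuples that have fallen out of the window.
        while self.events[0][0] <= timestamp - self.window:
            _, expired = self.events.popleft()
            self.total -= expired
        # The answer is available immediately, without scanning stored data.
        return self.total / len(self.events)


# Hypothetical stream of (timestamp, value) pairs, in arrival order.
stream = [(0, 4.0), (3, 6.0), (9, 8.0), (12, 2.0), (25, 10.0)]
query = SlidingWindowAverage(window=10)
results = [query.on_event(t, v) for t, v in stream]
print(results)
```

The state kept per query is bounded by the window contents, which is what lets the answer be refreshed on every arrival at low latency.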

From Data to Knowledge - 502 - Indrajit Roy

Indrajit Roy: "Using R for Large Scale Incremental Processing".

Abstract Indrajit Roy (HP Labs) It is cumbersome to write complex machine learning and graph algorithms in existing data-parallel models like MapReduce. Many of these algorithms are based on linear algebra and are, by nature, iterative and incremental computations, neither of which is efficiently supported by current frameworks. We argue that array-based languages, like R, are ideal to express these algorithms, and we should extend these languages for processing in a cloud. In this talk we present the challenges and abstractions to extend R and run it on hundreds of machines. Early results show that many computations can be expressed in a few lines of code and are an order of magnitude faster than processing in Hadoop.
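As an illustration of what "iterative and based on linear algebra" means here, the sketch below runs PageRank as repeated matrix-vector products until the ranks stop changing. It is written in plain Python rather than R, it is not the system described in the talk, and the graph, damping factor and tolerance are assumptions picked for the example. It also assumes every node has at least one outgoing link.

```python
def mat_vec(matrix, vector):
    """Dense matrix-vector product on plain Python lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def pagerank(links, damping=0.85, tol=1e-10, max_iter=1000, start=None):
    """Power iteration; links[j] lists the nodes that node j points to."""
    n = len(links)
    # Column-stochastic transition matrix: M[i][j] = probability of moving j -> i.
    M = [[0.0] * n for _ in range(n)]
    for j, targets in enumerate(links):
        for i in targets:
            M[i][j] = 1.0 / len(targets)
    # Start from a supplied vector (warm start) or from the uniform vector.
    rank = list(start) if start else [1.0 / n] * n
    for iteration in range(1, max_iter + 1):
        spread = mat_vec(M, rank)
        new_rank = [(1.0 - damping) / n + damping * s for s in spread]
        delta = sum(abs(a - b) for a, b in zip(new_rank, rank))
        rank = new_rank
        if delta < tol:
            break
    return rank, iteration


# Hypothetical three-node graph: 0 -> {1, 2}, 1 -> {2}, 2 -> {0}.
links = [[1, 2], [2], [0]]
rank, cold_iters = pagerank(links)
# Restarting from the previous answer converges at once on an unchanged graph.
_, warm_iters = pagerank(links, start=rank)
print(rank, cold_iters, warm_iters)
```

The `start` argument hints at the incremental side: when the input changes only slightly, resuming from the previous result usually needs far fewer passes than starting over, and a batch framework that recomputes from scratch cannot exploit that.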

From Data to Knowledge - 504 - Pedro Bizarro

Pedro Bizarro: "Real-time fraud detection and business activity monitoring".

Abstract Pedro Bizarro (FeedZai Inc) In this industry talk we will present real use cases of fraud detection and business activity monitoring using real-time engines and so-called "big data" solutions that use Cassandra and/or Hadoop.

From Data to Knowledge - 505 - Dovi Poznanski

Dovi Poznanski: "Real or Bogus (2.0): Finding needles in an astronomical haystack".

Abstract Dovi Poznanski (Tel Aviv University, Astrophysics) In time-domain astronomy one surveys the sky for transient and variable sources. New images are compared via subtraction to previous images of the same part of the sky to detect change or new sources, a noisy process that yields many artifacts - "bogus" detections. The Palomar Transient Factory produces about 1 million candidates every night with a real-to-bogus ratio around 1 in 1000, creating a needle-in-a-haystack problem. A first-generation classifier is performing well; however, after a few years of operation we have the ability to rethink the process and optimize it. In this talk I will present this effort to build a probabilistic classification framework that can identify the interesting transient candidates among the myriad bogus subtractions. This supervised classifier is carefully trained and tested on a large training set. The lessons learned and algorithms developed will be of utmost importance for upcoming surveys such as LSST.
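A little arithmetic shows why this base rate is so demanding. The sketch below takes the two figures from the abstract (about 1 million candidates per night, roughly 1 in 1000 of them real) and combines them with a true-positive rate and a false-positive rate that are purely hypothetical, not numbers reported for any PTF classifier. It also assumes "1 in 1000" means the fraction of candidates that are real.

```python
def expected_counts(candidates, real_fraction, true_positive_rate, false_positive_rate):
    """Expected nightly outcome of a real/bogus classifier at a given operating point."""
    real = candidates * real_fraction
    bogus = candidates - real
    kept_real = real * true_positive_rate      # real sources correctly kept
    kept_bogus = bogus * false_positive_rate   # artifacts wrongly kept
    precision = kept_real / (kept_real + kept_bogus)
    return kept_real, kept_bogus, precision


# From the abstract: about 1 million candidates, about 1 in 1000 real.
# Assumed for illustration only: 90% of real sources kept, 1% of bogus ones kept.
kept_real, kept_bogus, precision = expected_counts(
    candidates=1_000_000,
    real_fraction=1 / 1000,
    true_positive_rate=0.90,
    false_positive_rate=0.01,
)
print(f"real kept: {kept_real:.0f}, bogus kept: {kept_bogus:.0f}, precision: {precision:.1%}")
```

Under these assumed rates, bogus detections that slip through outnumber the real ones by about ten to one, so fewer than one in ten flagged candidates is real. That is why the false-positive rate has to be pushed well below the 1 in 1000 base rate before the output becomes manageable.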

From Data to Knowledge - 506 - Olfa Nasraoui

Olfa Nasraoui: "Addressing Two BigData Challenges: Tracking and Validating Evolving Clusters in Data Streams and Mining Multi-source Heterogeneous Data".

Abstract Olfa Nasraoui (Knowledge Discovery & Web Mining Lab, Computer Engineering & Computer Science Dept., University of Louisville) Two of the most difficult problems that arise under the Big Data challenge are (1) the problem of discovering, tracking and validating clusters in an evolving data stream, and (2) knowledge discovery in multimodal or heterogeneous data sets. For the first problem, I will present a novel framework, called "Stream Dashboard", that wraps a tracking and validation layer around a cluster discovery core. While the core component discovers clusters in a single pass over streaming noisy data, the tracking and validation layer detects and characterizes the evolution pattern of the clusters. In addition to evolution tracking, this layer also computes a concise and adaptive summary of the discovered knowledge. For the second problem, an overview of some leading approaches will be presented. Then novel Matrix Factorization and Semi-Supervised learning-based approaches will be presented for unsupervised learning in multimodal or heterogeneous data sets that arise in many real-life situations where data is collected from multiple repositories, instruments or sensors. In particular, we target data of multiple modalities or mixed data types including numerical, categorical, transactional, text, image, and social attributes.

From Data to Knowledge - 507 - Philipp Kranen

Philipp Kranen: "The ClusTree: Indexing Micro-Clusters for Anytime Stream Mining".

Abstract Philipp Kranen (RWTH Aachen University, Germany, CS) Clustering streaming data requires algorithms that can update clustering results for the incoming data. As data is constantly arriving, time for processing is limited. Clustering has to be performed in a single pass over the incoming data and within the possibly varying inter-arrival times of the stream. Likewise, memory is limited, making it impossible to store all data. For clustering, we are faced with the challenge of maintaining a current result that can be presented to the user at any given time. This talk presents the ClusTree algorithm, which automatically adapts to the speed of the data stream. It makes best use of the time available under the current constraints to provide a clustering of the objects seen up to that point. The approach incorporates the age of the objects to reflect the greater importance of more recent data. For efficient and effective handling, a compact and self-adaptive index structure is employed to maintain stream summaries. Additionally, solutions are discussed for handling very fast streams through aggregation mechanisms, for improving the clustering result on slower streams as long as time permits, and for explicit noise handling in anytime stream clustering. The talk concludes with a view on the evaluation of stream clustering algorithms using the MOA framework and the CMM measure. The talk is based on a 2011 KAIS Journal article and contains material from further research by the author including papers from ICDM 2010 and KDD 2011. This work has been supported by the UMIC Research Centre, RWTH Aachen University, Germany.
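The idea of a stream summary that weights recent data more heavily can be sketched with a single micro-cluster holding a decayed count, linear sum and squared sum. The abstract does not spell out the decay function, so the exponential fading `2 ** (-decay_rate * elapsed)` used here is an assumption, as are the class and parameter names. The sketch also leaves out the index structure and the anytime insertion logic that make up the actual ClusTree.

```python
class DecayedMicroCluster:
    """Constant-size summary of a stream region, with older points fading out."""

    def __init__(self, dims, decay_rate):
        self.decay_rate = decay_rate      # assumed: weight halves every 1/decay_rate time units
        self.weight = 0.0                 # decayed number of points
        self.linear_sum = [0.0] * dims    # decayed sum of coordinates
        self.squared_sum = [0.0] * dims   # decayed sum of squared coordinates
        self.last_update = 0.0

    def _fade(self, now):
        # Age the whole summary by the time elapsed since the last update.
        factor = 2.0 ** (-self.decay_rate * (now - self.last_update))
        self.weight *= factor
        self.linear_sum = [x * factor for x in self.linear_sum]
        self.squared_sum = [x * factor for x in self.squared_sum]
        self.last_update = now

    def insert(self, point, now):
        # Single pass: the point is absorbed into the summary and never stored.
        self._fade(now)
        self.weight += 1.0
        for d, x in enumerate(point):
            self.linear_sum[d] += x
            self.squared_sum[d] += x * x

    def mean(self):
        return [s / self.weight for s in self.linear_sum]


# Hypothetical one-dimensional stream: a point at 0.0, then a point at 10.0 one time unit later.
cluster = DecayedMicroCluster(dims=1, decay_rate=1.0)
cluster.insert([0.0], now=0.0)
cluster.insert([10.0], now=1.0)
print(cluster.weight, cluster.mean())
```

Because the older point has lost half its weight by the time the second one arrives, the mean sits closer to 10.0 than the plain average of 5.0 would. Memory use stays fixed no matter how many points have been seen.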