Nuit Blanche: Unsupervised Feature Selection on Data Streams / Streaming Anomaly Detection Using Randomized Matrix Sketching

Monday, August 31, 2015

Unsupervised Feature Selection on Data Streams / Streaming Anomaly Detection Using Randomized Matrix Sketching

Today, we see the use of streaming algorithms to figure out anomaly detection or unsupervised feature selection.

Streaming Anomaly Detection Using Randomized Matrix Sketching by Hao Huang, Shiva Kasiviswanathan.

Data is continuously being generated from sources such as machines, network traffic, application logs, etc. Timely and accurate detection of anomalies in massive data streams have important applications such as in preventing machine failures, intrusion detection, and dynamic load balancing. In this paper, we introduce a novel anomaly detection algorithm, which can detect anomalies in a streaming fashion by making only one pass over the data while utilizing limited storage. The algorithm uses ideas from matrix sketching and randomized low-rank matrix approximations to maintain, in a streaming model, a set of few orthogonal vectors that form a good approximate basis for the data. Using this constructed orthogonal basis, anomalies in new incoming data are detected based on a simple reconstruction error test. We theoretically prove that our algorithm compares favorably with an offline approach based on global singular value decomposition updates. The experimental results show the effectiveness and efficiency of our approach over other popular fast anomaly detection methods.

Unsupervised Feature Selection on Data Streams by Hao Huang, Shinjae Yoo, Shiva Kasiviswanathan

Massive data streams are continuously being generated from sources such as social media, broadcast news, etc., and typically these datapoints lie in high-dimensional spaces (such as the vocabulary space of a language). Timely and accurate feature subset selection in these massive data streams has important applications in model interpretation, computational/storage cost reduction, and generalization enhancement. In this paper, we introduce a novel unsupervised feature selection approach on data streams that selects important features by making only one pass over the data while utilizing limited storage. The proposed algorithm uses ideas from matrix sketching to e ciently maintain a low-rank approximation of the observed data and applies regularized regression on this approximation to identify the important features. We theoretically prove that our algorithm is close to an expensive offline approach based on global singular value decompositions. The experimental results on a variety of text and image datasets demonstrate the excellent ability of our approach to identify important features even in presence of concept drifts and also its efficiency over other popular scalable feature selection algorithms.

Previously: