Nuit Blanche: Distributed Outlier Detection using Compressive Sensing

Friday, June 27, 2014

Distributed Outlier Detection using Compressive Sensing

[1] Data Stream Algorithms Notes from a series of lectures by S. Muthu Muthukrishnan

[2] Some Algorithmic Problems and Results in Compressed Sensing by S. Muthukrishnan and attendant papers that cited this paper.

[3] S. Muthu MuthuKrishnan, Data Stream Algorithms: Developments and Implications for ML and Sam Bessalah, Stream Mining via Abstract Algebra (ppt version).

I would probably add the magic measurement ensembles of Krzakala et al since they could be used as streaming measurement matrices and would remove the O notation from the number of measurements needed to find outliers.

Distributed Outlier Detection using Compressive Sensing by Ying Yan, Jiaxing Zhang, Bojun Huang, Jiaqi Mu, Zheng Zhang, and Thomas Moscibroda

In this paper, we study the computation of statistics over data that is stored across a distributed set of nodes. In this setting, a data vector of size N is distributed into L nodes such that each node holds a partial data in such a way that the global data vector is the sum of the locally stored data vectors. The goal is that each node transmits information to some global aggregator node, which can then compute the desired statistical function. For this setting, there are well-known lower bounds that show that even for simple aggregation functions such as the max-value, the total amount of communication required is at least linear in the size of the data N in the worst-case.

In this paper, we show that these lower bounds can be beaten if we assume that the underlying data has sparsity properties. Specifically, we devise a new algorithm for the distributed outlier detection problem that is based on compressive-sensing and exploits the sparse structure of the underlying data. We show both empirically on real web-scale production data as well as theoretically that such use of compressive sensing-based techniques can result in substantial improvements for distributed computing problems. Specifically, we prove under some conjecture that the algorithm can succeed in recovering outliers using only O(sc · logN) constant size messages, where c is a small constant and s is the sparsity of the underlying data.

Join the CompressiveSensing subreddit or the Google+ Community and post there !