Thursday, July 11, 2013

Homomorphic Sketches: Shrinking Big Data without Sacrificing Structure

As mentioned earlier in A Quick Panorama of Sensing from Direct Imaging to Machine Learning,  homomorphic encryption looks a lot like manifold signal processing or functional compressed sensing. In other words, take an object, compress it, then perform operations on the compressed version of the data in a manner similar as if you were dealing with the uncompressed data. Andrew McGregor and colleagues explore this in the context of texts. 

Homomorphic Fingerprints under Misalignments: Sketching Edit and Shift Distances by Alexandr AndoniAssaf GoldbergerAndrew McGregor, Ely Porat
Fingerprinting is a widely-used technique for efficiently verifying that two files are identical. More generally, linear sketching is a form of lossy compression (based on random projections) that also enables the “dissimilarity” of nonidentical files to be estimated. Many sketches have been proposed for dissimilarity measures that decompose coordinatewise such as the Hamming distance between alphanumeric strings, or the Euclidean distance between vectors. However, virtually nothing is known on sketches that would accommodate alignment errors. With such errors, Hamming or Euclidean distances are rendered useless: a small misalignment may result in a file that looks very dissimilar to the original file according such measures. In this paper, we present the first linear sketch that is robust to a small number of alignment errors. Specifically, the sketch can be used to determine whether two files are within a small Hamming distance of being a cyclic shift of each other. Furthermore, the sketch is homomorphic with respect to rotations: it is possible to construct the sketch of a cyclic shift of a file given only the sketch of the original file. The relevant dissimilarity measure, known as the shift distance, arises in the context of embedding edit distance and our result addressed an open problem [26, Question 13] with a rather surprising outcome. Our sketch projects a length n file into D(n)·polylogn dimensions where D(n) n is the number of divisors of n. The striking fact is that this is near-optimal, i.e., the D(n) dependence is inherent to a problem that is ostensibly about lossy compression. In contrast, we then show that any sketch for estimating the edit distance between two files, even when small, requires sketches whose size is nearly linear in n. This lower bound addresses a long-standing open problem on the low distortion embeddings of edit distance [36, Question 2.15], [24], for the case of linear embeddings.
That work seems to parallel the work of Mallat et al Scattering Transform work for images where a nonlinear transform is used to find out if different images are translate of each other.

Towards a Theory of Homomorphic Compression by Andrew McGregor
Abstract. In this talk we survey recent progress on designing homomorphic fingerprints. Fingerprinting is a classic randomized technique for efficiently verifying whether two files are equal. We will discuss two extensions of this basic functionality: a) verifying whether two text files are cyclic shifts of one another and b) when the files correspond to “address books”, verifying whether the resulting social network is connected. Underlying both results is the idea of homomorphic lossy compression, i.e., lossy data compression that supports a range of operations on the compressed data that correspond directly to operations on the original data.
Andrew will give a talk on this subject at SPARC 2013. From the meeting's page:

Andrew McGregor 
Homomorphic Sketches: Shrinking Big Data without Sacrificing Structure

Abstract Fingerprinting is a widely-used technique for efficiently verifying that two large files are identical. More generally, sketching is a form of lossy compression based on random linear projections that, for example, could be used to estimate the Hamming distance between the files. Both techniques are particularly useful for monitoring data streams or minimizing communication in distributed and parallel computing. Many applications have been studied extensively including sparse signal recovery, locality sensitive hashing, and linear regression. Our goal is to extend these popular techniques and design random projections that preserve combinatorial and group-theoretic structure in large data sets. In this talk, we present recent results on analyzing spectral graph properties and fingerprinting misaligned text. Both results rely on developing homomorphic sketches that support operations on the compressed data that correspond to operations on the uncompressed data. Applications include the first sublinear space algorithms for processing dynamic graph streams. Includes joint work with Kook Jin Ahn, Alexandr Andoni, Assaf Goldberger, Sudipto Guha, and Ely Porat.


No comments: