Changing various non-text documents into vectors that have the characteristics of vector texts using thresholded random projections is the goal of today's paper. From the paper:
Although proving that our random mapping scheme works is involved, the scheme is remarkably simple. Our corpus X is a finite collection of vectors in R^d, normalized to have unit l_2 norm. To transform each vector in X, multiply each vector by a random matrix, then threshold each element.
Random mappings designed for commercial search engines by Roger Donaldson, Arijit Gupta, Yaniv Plan, Thomas Reimer
We give a practical random mapping that takes any set of documents represented as vectors in Euclidean space and then maps them to a sparse subset of the Hamming cube while retaining ordering of inter-vector inner products. Once represented in the sparse space, it is natural to index documents using commercial text-based search engines which are specialized to take advantage of this sparse and discrete structure for large-scale document retrieval. We give a theoretical analysis of the mapping scheme, characterizing exact asymptotic behavior and also giving non-asymptotic bounds which we verify through numerical simulations. We balance the theoretical treatment with several practical considerations; these allow substantial speed up of the method. We further illustrate the use of this method on search over two real data sets: a corpus of images represented by their color histograms, and a corpus of daily stock market index values.
Python codes used to generate results of that paper, including running example searches using the Whoosh search engine for Python, is here at: https://gitlab.com/dgpr-sparse-search/code
Liked this entry ? subscribe to Nuit Blanche's feed, there's more where that came from. You can also subscribe to Nuit Blanche by Email, explore the Big Picture in Compressive Sensing or the Matrix Factorization Jungle and join the conversations on compressive sensing, advanced matrix factorization and calibration issues on Linkedin.