Today's paper aims to turn various non-text documents into vectors with the characteristics of text vectors, using thresholded random projections. From the paper:
Although proving that our random mapping scheme works is involved, the scheme itself is remarkably simple. Our corpus X is a finite collection of vectors in R^d, normalized to have unit l_2 norm. To transform each vector in X, we multiply it by a random matrix, then threshold each element.
Random mappings designed for commercial search engines, by Roger Donaldson, Arijit Gupta, Yaniv Plan, Thomas Reimer
We give a practical random mapping that takes any set of documents represented as vectors in Euclidean space and then maps them to a sparse subset of the Hamming cube while retaining the ordering of inter-vector inner products. Once represented in the sparse space, it is natural to index documents using commercial text-based search engines which are specialized to take advantage of this sparse and discrete structure for large-scale document retrieval. We give a theoretical analysis of the mapping scheme, characterizing exact asymptotic behavior and also giving non-asymptotic bounds which we verify through numerical simulations. We balance the theoretical treatment with several practical considerations; these allow a substantial speed-up of the method. We further illustrate the use of this method on search over two real data sets: a corpus of images represented by their color histograms, and a corpus of daily stock market index values.
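As a concrete illustration of the quoted scheme, here is a minimal NumPy sketch: normalize the corpus vectors, multiply by a random matrix, and threshold each entry. The Gaussian projection matrix, the one-sided threshold, and the names threshold_map, m, and theta are my assumptions for illustration, not necessarily the paper's exact construction or notation.

```python
import numpy as np

def threshold_map(X, m, theta, seed=0):
    """Map the unit-norm rows of X (n x d) to binary codes in {0,1}^m:
    multiply by a random matrix, then set entries above theta to 1.
    (Gaussian entries and a one-sided threshold are assumptions here.)"""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((X.shape[1], m))  # random projection matrix
    return (X @ A > theta).astype(np.uint8)   # larger theta -> sparser codes

# Toy corpus: rows normalized to unit l_2 norm, as in the paper.
X = np.array([[1.0, 0.1], [0.9, 0.2], [0.0, 1.0]])
X /= np.linalg.norm(X, axis=1, keepdims=True)

B = threshold_map(X, m=256, theta=1.0)
print(B.sum(axis=1))             # nonzeros per code: the codes are sparse
print(B[0] @ B[1], B[0] @ B[2])  # similar pair overlaps more than dissimilar one
```

With these (illustrative) parameters the first two rows, which have a large inner product, share many more nonzero coordinates than the nearly orthogonal pair, which is the ordering-preservation property the abstract describes.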
The Python code used to generate the results of the paper, including example searches run with the Whoosh search engine for Python, is available at: https://gitlab.com/dgpr-sparse-search/code
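The linked repository presumably contains the authors' actual search setup. As a rough sketch of how such sparse binary codes can be fed to a text engine like Whoosh, one can encode each nonzero coordinate as a token and let the engine rank by token overlap; the token format, field names, and toy codes below are hypothetical, not taken from the repository:

```python
import os
from whoosh.fields import Schema, ID, KEYWORD
from whoosh.index import create_in
from whoosh.qparser import QueryParser, OrGroup

def tokens(code):
    # Hypothetical encoding: one token per nonzero coordinate, e.g. "t2".
    return " ".join(f"t{i}" for i, b in enumerate(code) if b)

# Toy binary codes standing in for the output of the thresholded mapping.
codes = {
    "doc0": [1, 0, 1, 1, 0, 0],
    "doc1": [1, 0, 1, 0, 0, 1],  # overlaps doc0 in two coordinates
    "doc2": [0, 1, 0, 0, 1, 0],  # disjoint from doc0
}

os.makedirs("bincode_index", exist_ok=True)
ix = create_in("bincode_index", Schema(docid=ID(stored=True),
                                       code=KEYWORD(scorable=True)))
writer = ix.writer()
for name, code in codes.items():
    writer.add_document(docid=name, code=tokens(code))
writer.commit()

# Query with doc0's code, OR-ing its tokens; token overlap drives the ranking.
with ix.searcher() as searcher:
    q = QueryParser("code", ix.schema, group=OrGroup).parse(tokens(codes["doc0"]))
    for hit in searcher.search(q):
        print(hit["docid"], round(hit.score, 3))
```

Querying with doc0's own code returns doc0 first, then doc1 (two shared tokens), while doc2, which shares no tokens with the query, does not match at all.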