Monday, November 17, 2014

Neural Word Embeddings as Implicit Matrix Factorization

Recently, at the Paris Machine Learning meetup, Charles Ollion gave a brief presentation on Word2Vec. It got me wondering how Word2Vec connects to other known approaches. An upcoming NIPS 2014 paper by Omer Levy and Yoav Goldberg sheds some light on the subject. Excerpted from the paper:

...The training method (as implemented in the word2vec software package) is highly popular, but not well understood. While it is clear that the training objective follows the distributional hypothesis – by trying to maximize the dot-product between the vectors of frequently occurring word-context pairs, and minimize it for random word-context pairs – very little is known about the quantity being optimized by the algorithm, or the reason it is expected to produce good word representations...
We analyze skip-gram with negative-sampling (SGNS), a word embedding method introduced by Mikolov et al., and show that it is implicitly factorizing a word-context matrix, whose cells are the pointwise mutual information (PMI) of the respective word and context pairs, shifted by a global constant. We find that another embedding method, NCE, is implicitly factorizing a similar matrix, where each cell is the (shifted) log conditional probability of a word given its context. We show that using a sparse Shifted Positive PMI word-context matrix to represent words improves results on two word similarity tasks and one of two analogy tasks. When dense low-dimensional vectors are preferred, exact factorization with SVD can achieve solutions that are at least as good as SGNS’s solutions for word similarity tasks. On analogy questions SGNS remains superior to SVD. We conjecture that this stems from the weighted nature of SGNS’s factorization.
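
To make the abstract a bit more concrete, here is a minimal sketch of the two explicit alternatives it mentions: building a Shifted Positive PMI matrix, max(PMI(w, c) - log k, 0), from word-context co-occurrence counts, and factorizing it with a truncated SVD. The function names (`sppmi_matrix`, `svd_embeddings`), the toy counts, and the choice k = 2 are my own illustrative assumptions, not taken from the paper's code, and the symmetric square-root weighting of the singular values is just one possible way to split the factorization.

```python
import numpy as np


def sppmi_matrix(counts, k=5):
    """Shifted Positive PMI: max(PMI(w, c) - log k, 0) for each cell.

    `counts` is a (words x contexts) co-occurrence count matrix and `k`
    plays the role of the number of negative samples in SGNS.
    """
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    word_totals = counts.sum(axis=1, keepdims=True)     # shape (W, 1)
    context_totals = counts.sum(axis=0, keepdims=True)  # shape (1, C)
    expected = word_totals * context_totals / total     # counts under independence

    # Zero counts give log(0) = -inf (or nan for empty rows/columns);
    # those cells are mapped to 0 below, which keeps the matrix sparse.
    with np.errstate(divide="ignore", invalid="ignore"):
        shifted = np.log(counts / expected) - np.log(k)
    return np.where(np.isfinite(shifted), np.maximum(shifted, 0.0), 0.0)


def svd_embeddings(matrix, dim):
    """Dense word vectors from a truncated SVD of `matrix`.

    Assumption: singular values are split symmetrically (U * sqrt(S)).
    """
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :dim] * np.sqrt(s[:dim])


# Hypothetical toy data: 3 words x 3 contexts.
toy_counts = np.array([
    [10, 0, 2],
    [0, 8, 1],
    [3, 1, 6],
])

toy_sppmi = sppmi_matrix(toy_counts, k=2)
toy_vectors = svd_embeddings(toy_sppmi, dim=2)
print(toy_sppmi)
print(toy_vectors)
```

On a real corpus the count matrix would be large and sparse, so a sparse representation and a truncated sparse SVD would be the more practical choice; the dense NumPy version above is only meant to show the arithmetic.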

Also relevant: word2vec Explained: Deriving Mikolov et al.’s Negative-Sampling Word-Embedding Method. Yoav Goldberg and Omer Levy. arXiv 2014.
