Turn words into vectors

Computers are better at comparing numbers instead of texts. A good numerical representation for words thus greatly facilitates linguistic analysis. In this post I will summarize the different approaches to obtaining numerical representations of words.

These numerical representations are commonly known as vector space models, or in recent years as word embeddings. Some of the following information may be captured by a word embedding:

semantic similarity between words
syntactic closeness between words
analogy between pairs of words

To make the discussion concrete, let us assume that our text \(\mathbf w\) contains \(N_T\) words with vocabulary \(V\), i.e., there are \(N_V\) unique words in this text. The text \(\mathbf w\) is also known as corpus. The goal is to find a vector \(u_w\in\mathbb R^d\) for each word \(w\in V\). Note \(N_T>N_V\) due to repeated words and we would like to have \(d\le N_V\) or even \(d\ll N_V\). For practical applications, both \(N_T\) and \(N_V\) can be large (think of \(N_V\) as thousands and \(N_T\) as millions). According to Wikipedia, a native English speaker knows about 10,000 words and mastering 3000 words is necessary for a non-native speaker. According to Steven Pinker, a typical high school graduate has a vocabulary of 60,000 words.

One-hot vectors

The most straightforward word representation is the so-called one-hot representation. Here all words live in the \(N_V\) dimensional space and every one of them is a unit vector along the axes. These one-hot vectors are not particularly useful by themselves since all semantic information is eliminated. The only information left is that the vocabularies are different from each other. In terms of cosine similarity, they are not similar to each other at all.

Distributional vectors

A better representation built on the one-hot vectors is the distributional vector representation. Here the idea is that a word is characterized by the words near it, just like a person is somewhat characterized by his or her friends. It is known as the distributional hypothsis. The neighborhood of a word is called its context and is defined as a window near the word. Research shows that smaller context window (4-word window instead of 100-word window) improves word similarity performance.

To analyze the context, one can either do statistics of the word co-occurrence in the context (latent semantic analysis) or use neural network to train the context (neural probabilistic language model). The former is known as count-based (or frequency-based) model and the latter predictive model.

Count-based model

A simple count-based model can be derived from analyzing the so-called word-context matrix, also known as co-occurrence matrix. The rows of this matrix are words to be represented by vectors. The columns of this matrix are the words that serve as features. For example, one can use the top 10% of the words that occur most frequently as column words (context words). For simplicity let’s assume both rows and columns correspond to the vocabulary \(V\), i.e., the co-occurrence matrix is \(N_V\times N_V\). The matrix entries correspond to the counts of word co-occurrence, i.e., the number of times the column word appears in the context of the row word. Although this model completely disregards word ordering, the row vectors still encode some semantic similarities between words into their spatial locations. T.K. Landauer estimated in 2002 that 80% of the meaning of English text comes from word choice and the remaining 20% comes from word ordering.

To reduce the dimensionality of this representation, one can then apply methods such as principal component analysis (PCA) or non-negative matrix factorization (NMF). It has been shown that dimension reduction actually improves the linguistic task performance.

Usually, some weighting needs to be performed on the co-occurrence matrix elements before the dimension reduction to account for the word statistics. For example, one can normalize the row vectors such that the entries denote conditional probability of the column word appearance given the row word. Furthermore, if certain words occur very frequently in the text, the row vector would then be dominated by the corresponding entries. In most cases, these high-frequency words do not convey much information (recall in information theory surprising event has high information content). To weight down these high-frequency words, one can use term frequency-inverse document frequency method (tf-idf) or pointwise mutual information method (PMI). Both of them are empirical and have many variations. Here I only describe one called positive PMI (PPMI).

Let’s assume the co-occurrence matrix is \(F\) and the counts are \(f_{ij}\). Then the PMI matrix can be defined from the three probability distributions

\[\begin{align} p_{ij} \simeq& \frac{f_{ij}}{\sum_{m,n=1}^{N_V} f_{mn}} \\ p_{i*} \simeq& \frac{\sum_{j=1}^{N_V} f_{ij}}{\sum_{m,n=1}^{N_V} f_{mn}} \\ p_{*j} \simeq& \frac{\sum_{i=1}^{N_V} f_{ij}}{\sum_{m,n=1}^{N_V} f_{mn}} \\ \text{PMI}_{ij} = & \log\left(\frac{p_{ij}}{p_{i*}p_{*j}}\right) \end{align}\]

If the row word and column word are statistically independent of each other, then PMI is 0. If PMI is positive, it suggests that the co-occurrence is more than just chance. If PMI is negative, it suggests that the two words are unrelated.

Given these observations, the PPMI is defined as

\[\text{PPMI}_{ij} = \begin{cases} \text{PMI}_{ij}, \quad \text{if PMI}_{ij} > 0 \\ 0, \qquad {\text otherwise} \end{cases}\]

Predictive model

To understand the predictive models, we need to know a little bit about probabilistic language model. Indeed, even before neural network based methods were invented for word embedding, probabilistic language model had been used as alternative of vector space model to quantify semantic similarity.

The probabilistic language model

The probabilistic language model aims to assign probabilities to sentences (sequences of words), i.e., \(P(\mathbf w)\equiv P(w_1, w_2,\ldots, w_n)\). According to the chain rule of probability

\[P(w_1, w_2, \ldots, w_t) = \Pi_{i=1}^t P(w_i|w_1, w_2, \ldots, w_{i-1})\]

Thus the purpose is equally served if we can assign all the conditional probabilities. Here \(t\) is the number of words in a sentence. Note in this setup, it is very natual to take word ordering into account.

Given a text with many sentences, one can calculate all the conditional probabilities. However, as the sentence gets longer, there are too many combinations of the words (\(N_V^t\) in theory) thus the computational burden becomes huge. Furthermore the text might not be long enough to provide good statistics of these conditional probabilities with long sentences. As a result, such brutal force approach is almost never used.

One workaround is the N-gram model. It was proposed in the 80’s and it assumes that words in the far past do not affect the probability of current word. If only the last \(N-1\) words need to be considered, the corresponding model is called N-gram model. For example, in the bigram model, it is assumed that

\[P(w_i|w_1, w_2, \ldots, w_{i-1}) \simeq P(w_i|w_{i-1}).\]

Thus in the N-gram model, one gains some knowledge of what word is more likely to come next. Also, these conditional probabilities can be considered as features of a word.

The bigram model is closely related to the co-occurrence matrix. It can be considered as the special case where the context window is one word to the right.

Neural probabilistic language model

The more recent trend is to use neural network to model \(P(\mathbf w)\). As a byproduct, word embedding can be generated. Here the transformation from one-hot vectors to word embedding vectors can be explicitly parametrized as unkowns. The word embedding vectors are then linked to the conditional probabilities \(P(w_i|w_1, w_2, \ldots)\) using a neural network layer, as seen in Fig. 1. As one scans through the text, the transformation from one-hot vectors to word embedding vectors can thus be fitted out.

Figure 1. Diagram of the neuroal probabilistic language model.

Such a procedure is computationally very expensive given a large text body since it is expensive to compute \(P(w_i|w_1, w_2, \ldots)\) unless the context is made small (small N for N-gram).

Mikolov’s word2vec algorithm avoids the calculation of these conditional probabilities. Instead, the output of the neural network is a binary classification, i.e., the probability of the real target word showing up should beat the probability of some imaginary target words showing up. This trick immediately allows efficient training of text bodies with hundreds of billions words and with close to million vocabulary size.

It turns out that the word2vec embedding not only performs well on semantic similarity, it also encodes word analogy information. The famous example is

\[u_{\text king} -u_{\text man} +u_{\text woman}\simeq u_{\text queen}.\]

Summary

So far there are not many papers directly compare word embeddings from count-based models or predictive models. In Baroni et.al.’s 2014 paper, it is found that the predictive models provide better embeddings for many linguistic tasks, but the count based models also have good performance overall.

There are also discussions that there are connections between the count-based model and the neural probabilistic language model. The interested reader can look into a nicely-written paper GloVe by Pennington et.al..

Resources:

Neural net language models by Yoshua Bengio
RADIM ŘEHŮŘEK’s word2vec tutorial
gensim word2vec page
TensorFlow’s word2vec tutorial
paper by Peter Turney and Patrick Pantel
python nltk book
Sebastian Ruder’s post
Christopher Olah’s post
offconvex post

Notes:

To use word2vec in gensim, make sure Cython is installed. It helps with speed.