Vector Representations of Words: TensorFlow Web Tutorial

MM 5/24/2018


The word2vec model by Mikolov et al. [3, 4] is presented.

This model is used for learning vector representations of words, i.e., word embeddings.

Abstract

The motivation for word vectors is described. The intuition behind the model and how it is trained are examined. The implementation of the model in TensorFlow is shown.

The minimalistic implementation is written in:

tensorflow/examples/tutorials/word2vec/word2vec_basic.py

This basic example contains the code needed to download some data, train on it briefly, and visualize the result.

The advanced implementation is written in

models/tutorials/embedding/word2vec.py

which is a more serious implementation that showcases more advanced TensorFlow principles, such as how to efficiently use threads to move data into a text model and how to checkpoint during training.

Motivation

Image and audio processing systems work with rich, high-dimensional datasets encoded as vectors of the individual raw pixel intensities for image data, or e.g. power spectral density coefficients for audio data. For tasks like object or speech recognition, all the information required to successfully perform the task is encoded in the raw data. However, natural language processing systems traditionally treat words as discrete atomic symbols, and therefore 'cat' may be represented as Id537 and 'dog' as Id143. These encodings are arbitrary and provide no useful information to the system regarding the relationships that may exist between the individual symbols. This means that the model can leverage little of what it has learned about 'cats' when it is processing data about 'dogs' (such as that they are both animals, four-legged, pets, etc.).

Representing words as unique, discrete ids furthermore leads to data sparsity, and usually means that more data is needed in order to successfully train statistical models.

Using vector representations can overcome some of these obstacles.

Fig.1 Data in audio, images and text.

Vector space models (VSMs) represent (embed) words in a continuous vector space where semantically similar words are mapped to nearby points ('are embedded nearby each other').

VSMs have a long, rich history in NLP, but all methods depend in some way or another on the distributional hypothesis [1], which states that words that appear in the same contexts share semantic meaning. The different approaches that leverage this principle can be divided into two categories: count-based methods (e.g. Latent Semantic Analysis) and predictive methods (e.g. neural probabilistic language models).

The distinction is elaborated in detail in [2]. In a nutshell: count-based methods compute the statistics of how often some word co-occurs with its neighbor words in a large text corpus, and then map these count statistics down to a small, dense vector for each word. Predictive models directly try to predict a word from its neighbors in terms of learned small, dense embedding vectors (considered parameters of the model).

Word2vec is a particularly computationally efficient predictive model for learning word embeddings from raw text. It comes in two flavors, the continuous bag-of-words model (CBOW) and the skip-gram model (sections 3.1 and 3.2 in [3]). Algorithmically, these models are similar, except that CBOW predicts target words (e.g. 'mat') from source context words ('the cat sits on the'), while the skip-gram does the inverse and predicts source context words from the target words. This inversion might seem like an arbitrary choice, but statistically it has the effect that CBOW smoothes over a lot of the distributional information by treating an entire context as one observation. For the most part, this turns out to be a useful thing for smaller datasets. However, skip-gram treats each context-target pair as a new observation, and this tends to do better when we have larger datasets. This tutorial focuses on the skip-gram model.

Scaling up with Noise-Contrastive Training

Neural probabilistic language models are traditionally trained using the maximum likelihood (ML) principle to maximize the probability of the next word $w_t$ (for 'target') given the previous words $h$ (for 'history') in terms of a softmax function,

$$P(w_t \mid h) = \textrm{softmax}(\textrm{score}(w_t, h)) = \frac{\exp\{\textrm{score}(w_t, h)\}}{\sum_{w' \textrm{ in vocab}} \exp\{\textrm{score}(w', h)\}}$$

where $\textrm{score}(w_t, h)$ computes the compatibility of word $w_t$ with the context $h$ (a dot product is commonly used). This model is trained by maximizing its log-likelihood on the training set, i.e., by maximizing

$$L_{\textrm{ML}} = \log P(w_t \mid h) = \textrm{score}(w_t, h) - \log\left(\sum_{w' \textrm{ in vocab}} \exp\{\textrm{score}(w', h)\}\right)$$

This yields a properly normalized probabilistic model for language modeling.

However, this is very expensive, because the scores for all other $V$ words $w'$ in the current context $h$ must be computed and normalized to obtain each probability at every training step.
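To make the cost concrete, the following NumPy sketch (an illustration only, not part of the tutorial code; the sizes and vectors are made up) computes the normalized log-probability of one target word. Note that the denominator requires a score for every one of the $V$ vocabulary words at every step.

import numpy as np

# Toy illustration of why the full softmax is costly.
V, d = 10000, 128                   # vocabulary size and embedding dimensionality
rng = np.random.RandomState(0)
word_vectors = rng.randn(V, d)      # one vector per vocabulary word
h = rng.randn(d)                    # representation of the context ('history')
target = 42                         # index of the target word w_t

scores = word_vectors @ h           # score(w', h) for every w': O(V * d) work per step

# log P(w_t | h) needs the normalizer over all V scores (computed stably here).
log_Z = np.log(np.sum(np.exp(scores - scores.max()))) + scores.max()
log_p_target = scores[target] - log_Z
print(log_p_target)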

Fig.2 Word2vec CBOW model trained with a softmax classifier versus a noise classifier.

Fig.2 contrasts the CBOW model trained with a full softmax classifier against the same model trained with a noise classifier. For feature learning in word2vec, a full probabilistic model is not needed: the model is instead trained with a binary classification objective (logistic regression) that discriminates the real target word $w_t$ from $k$ imaginary (noise) words $\tilde{w}$ drawn in the same context.

Mathematically, the objective (for each example) is to maximize

$$L_{\textrm{NEG}} = \log Q_\theta(D=1 \mid w_t, h) + k \, \mathbb{E}_{\tilde{w} \sim P_{\textrm{noise}}}\left[\log Q_\theta(D=0 \mid \tilde{w}, h)\right]$$

where $Q_\theta(D=1 \mid w, h)$ is the binary logistic regression probability, under the model, of seeing the word $w$ in the context $h$ in the dataset $D$, calculated in terms of the learned embedding vectors $\theta$. In practice, the expectation is approximated by drawing $k$ contrastive words from the noise distribution (i.e., a Monte Carlo average is computed).

The objective is maximized when the model assigns high probabilities to the real words and low probabilities to noise words. Technically, this is called Negative Sampling [4], and there is good mathematical motivation for using this loss function: the updates it proposes approximate the updates of the softmax function in the limit. It is also computationally appealing because computing the loss function scales only with the number of selected noise words ($k$), and not with all words in the vocabulary ($V$). This makes it much faster to train. This tutorial adopts the very similar noise-contrastive estimation (NCE) loss, for which TensorFlow provides the helper function tf.nn.nce_loss().
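As a sanity check of the formula, the following NumPy sketch (illustrative only; the vectors are random stand-ins for learned embeddings) evaluates $L_{\textrm{NEG}}$ for one (target, context) pair with $k$ noise words. The work scales with $k$, not with the vocabulary size $V$.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def l_neg(target_vec, context_vec, noise_vecs):
    """Negative-sampling objective for one example: log Q(D=1 | w_t, h)
    plus the sum over noise words of log Q(D=0 | w~, h), where Q is a
    logistic regression on the dot product of the two vectors."""
    positive = np.log(sigmoid(np.dot(target_vec, context_vec)))
    negative = sum(np.log(sigmoid(-np.dot(n, context_vec))) for n in noise_vecs)
    return positive + negative

rng = np.random.RandomState(0)
d, k = 128, 5                        # embedding size and number of noise words
target = rng.randn(d) * 0.1          # embedding of the real target word
context = rng.randn(d) * 0.1         # embedding of the context
noise = rng.randn(k, d) * 0.1        # k words drawn from P_noise (random here)
print(l_neg(target, context, noise)) # cost scales with k, not with V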

The Skip-gram Model

Consider the dataset

the quick brown fox jumped over the lazy dog

The vanilla definition of 'context' is the window of words to the left and to the right of a target word. Using a window size of 1, the dataset can be expressed as

([the,brown],quick), ([quick,fox],brown), ([brown,jumped],fox), ...

of ([context],target) pairs.

The skip-gram model inverts contexts and targets and tries to predict each context word from its target word, so the task becomes predicting 'the' and 'brown' from 'quick'.

Therefore, the dataset becomes

(quick,the), (quick,brown), (brown,quick), (brown,fox), ...

of (input, output) pairs.
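A minimal sketch of generating such pairs (this is not the tutorial's generate_batch function, which additionally batches over a continuous stream of integerized data, but it illustrates the same idea):

def skipgram_pairs(words, window=1):
    """Turn a token list into (input, output) pairs: each target word
    predicts every context word within 'window' positions of it."""
    pairs = []
    for i, target in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                pairs.append((target, words[j]))
    return pairs

sentence = "the quick brown fox jumped over the lazy dog".split()
print(skipgram_pairs(sentence)[:5])
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'),
#  ('brown', 'quick'), ('brown', 'fox')]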

The objective function is defined over the entire dataset and optimized with stochastic gradient descent (SGD) using one example at a time (or a minibatch of batch_size examples, where typically 16 <= batch_size <= 512).

Consider the first training case, where the goal is to predict 'the' from 'quick'. num_noise noisy (contrastive) examples are selected by drawing from some noise distribution, typically the unigram distribution $P(w)$.

For simplicity, let's say num_noise = 1 and 'sheep' is selected as the noisy example.

The loss for this pair of observed and noisy examples is computed, i.e., the objective at time step $t$ becomes

$$L_{\textrm{NEG}}^{(t)} = \log Q_\theta(D=1 \mid \textrm{the}, \textrm{quick}) + \log\left(Q_\theta(D=0 \mid \textrm{sheep}, \textrm{quick})\right)$$

The goal is to make an update to the embedding parameters $\theta$ to improve (maximize) this objective function.

The gradient of the loss with respect to the embedding parameters $\theta$, i.e. $\partial L_{\textrm{NEG}} / \partial \theta$, can be computed automatically with TensorFlow.
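A minimal TensorFlow 1.x sketch of this automatic differentiation, using a toy embedding matrix and a single positive term of the objective (the names and sizes are made up for illustration):

import tensorflow as tf

emb = tf.Variable(tf.random_uniform([5, 3], -1.0, 1.0))    # toy embedding matrix (theta)
target = tf.nn.embedding_lookup(emb, [0])                  # embedding of word id 0
context = tf.nn.embedding_lookup(emb, [1])                 # embedding of word id 1
toy_loss = -tf.log(tf.sigmoid(tf.reduce_sum(target * context)))  # -log Q_theta(D=1|...)

grad = tf.gradients(toy_loss, [emb])[0]   # dL/dtheta, derived automatically

with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())
  print(sess.run(grad))                   # a sparse gradient touching only rows 0 and 1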

When this learning process is repeated over the entire training set, the embedding vectors are gradually moved around until the model is successful at discriminating real words from noise words.

Fig.3 Illustrative learned word vectors.

The learned vectors can be visualized by projecting them down to 2 dimensions using, for instance, the t-SNE dimensionality reduction technique [5]. It is apparent that the vectors capture some general, useful, semantic information about words and their relationships to one another. Certain directions in the induced vector space specialize towards certain semantic relationships, e.g., male-female, verb tense, and even country-capital relationships between words, as illustrated in Fig.3.

This explains why these vectors are useful as features for many canonical NLP prediction tasks, such as part-of-speech tagging or named entity recognition.

Building the Graph

The embedding matrix is initialized as a random matrix:

embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))

The noise-contrastive estimation loss is defined in terms of a logistic regression model. For this, the weights and biases for each word in the vocabulary need to be defined (these are also called the output weights, as opposed to the input embeddings).

nce_weights = tf.Variable(
  tf.truncated_normal([vocabulary_size, embedding_size],
                      stddev=1.0 / math.sqrt(embedding_size)))
nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

Let us suppose the text corpus has been integerized with a vocabulary, so that each word is represented as an integer. See tensorflow/examples/tutorials/word2vec/word2vec_basic.py for the details of this step.
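A minimal sketch of such an integerization, in the spirit of what the basic example does (the function name and vocabulary size here are made up): the most frequent words get ids, and everything else maps to an out-of-vocabulary token.

import collections

def build_vocab(words, vocabulary_size):
    """Keep the (vocabulary_size - 1) most frequent words; all other words
    map to the out-of-vocabulary id 0 ('UNK')."""
    word_to_id = {'UNK': 0}
    for word, _ in collections.Counter(words).most_common(vocabulary_size - 1):
        word_to_id[word] = len(word_to_id)
    data = [word_to_id.get(word, 0) for word in words]
    return data, word_to_id

corpus = "the quick brown fox jumped over the lazy dog".split()
data, word_to_id = build_vocab(corpus, vocabulary_size=6)
print(data)   # integer ids, with 0 wherever a word fell outside the vocabulary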

The skip-gram model takes two inputs. One is a batch full of integers representing the source context words; the other is for the target words. Placeholder nodes are created for these inputs, and the data is fed in later.

# Placeholders for inputs
train_inputs = tf.placeholder(tf.int32, shape=[batch_size])
train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])

The vector for each of the source words in the batch needs to be looked up. TensorFlow has handy helpers for this.

embed = tf.nn.embedding_lookup(embeddings, train_inputs)
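For intuition, this lookup simply gathers one row of the embedding matrix per input word id, so embed has shape [batch_size, embedding_size]; for a single (non-partitioned) embedding tensor it behaves like the following sketch.

# Roughly equivalent for a single embedding tensor: select row train_inputs[i] for each i.
embed_equivalent = tf.gather(embeddings, train_inputs)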

Now that the embedding for each word is available, the target word can be predicted using the noise-contrastive training objective.

# Compute the NCE loss, using a sample of the negative labels each time.
loss = tf.reduce_mean(
  tf.nn.nce_loss(weights=nce_weights,
                 biases=nce_biases,
                 labels=train_labels,
                 inputs=embed,
                 num_sampled=num_sampled,
                 num_classes=vocabulary_size))

Now that the loss node is available, the nodes required to compute the gradients and update the parameters must be added. Stochastic gradient descent is applied, and TensorFlow has handy helpers for this as well.

# We use the SGD optimizer.
optimizer = tf.train.GradientDescentOptimizer(learning_rate=1.0).minimize(loss)

Training the Model

A feed_dict can be used to push data into the placeholders, and tf.Session.run is called with this new data in a loop to train the model.

for inputs, labels in generate_batch(...):
  feed_dict = {train_inputs: inputs, train_labels: labels}
  _, cur_loss = session.run([optimizer, loss], feed_dict=feed_dict)
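This loop assumes that a session has been created and the variables initialized beforehand; a minimal sketch of that setup (the structure in word2vec_basic.py differs slightly):

# Create a session and initialize the variables defined above
# (embeddings, nce_weights, nce_biases) before running the training loop.
session = tf.Session()
session.run(tf.global_variables_initializer())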

The full example code can be examined in

tensorflow/examples/tutorials/word2vec/word2vec_basic.py.

Visualizing the Learned Embeddings

After training has finished, the learned embeddings can be visualized using t-SNE.

Fig.4 Learned embeddings visualized with t-SNE.

As expected, words that are similar end up clustering nearby each other.
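A minimal sketch of producing such a plot, assuming a trained embedding matrix final_embeddings and a reverse_dictionary mapping integer ids back to words (names borrowed from the basic example; treat them as placeholders here):

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Project the first few hundred embedding vectors down to 2-D and label each point.
plot_only = 500
tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
low_dim_embs = tsne.fit_transform(final_embeddings[:plot_only, :])

plt.figure(figsize=(18, 18))
for i, (x, y) in enumerate(low_dim_embs):
    plt.scatter(x, y)
    plt.annotate(reverse_dictionary[i], xy=(x, y), xytext=(5, 2),
                 textcoords='offset points')
plt.savefig('tsne.png')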

models/tutorials/embedding/word2vec.py can be examined for a more heavyweight implementation of word2vec that showcases more of the advanced features of TensorFlow.

Evaluating Embeddings: Analogical Reasoning

Embeddings are useful for a wide variety of prediction tasks in NLP.

An example is predicting syntactic and semantic relationships such as 'king is to queen as father is to ?'.

This is called analogical reasoning, and the evaluation dataset can be downloaded from [6].

build_eval_graph() and eval() are used to implement the evaluation.
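The underlying idea is vector arithmetic plus a nearest-neighbour search in the embedding space; a NumPy sketch (illustrative only, not the tutorial's implementation, with word_to_id, id_to_word and embeddings assumed to come from a trained model):

import numpy as np

def analogy(a, b, c, word_to_id, id_to_word, embeddings, topk=4):
    """Answer 'a is to b as c is to ?': return the words whose normalized
    embeddings are closest (by cosine similarity) to v_b - v_a + v_c."""
    norm = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    query = norm[word_to_id[b]] - norm[word_to_id[a]] + norm[word_to_id[c]]
    sims = norm @ query
    for w in (a, b, c):                   # never return the query words themselves
        sims[word_to_id[w]] = -np.inf
    return [id_to_word[i] for i in np.argsort(-sims)[:topk]]

# Hypothetical usage with a trained model:
# print(analogy('king', 'queen', 'father', word_to_id, reverse_dictionary, final_embeddings))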

The choice of hyperparameters can strongly influence the accuracy on the task. To achieve state-of-the-art performance on this task requires training over a large dataset, carefully tuning the hyperparameters and making use of tricks like subsampling the data.

Optimizing the Implementation

The vanilla implementation showcases the flexibility of TensorFlow.

For example, changing the training objective can be realized by swapping out the call to tf.nn.nce_loss() for an off-the-shelf alternative such as tf.nn.sampled_softmax_loss(), as sketched below.
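For instance, the swap might look like the following sketch, reusing the variables defined when building the graph above:

# Alternative training objective: sampled softmax instead of NCE.
loss = tf.reduce_mean(
  tf.nn.sampled_softmax_loss(weights=nce_weights,
                             biases=nce_biases,
                             labels=train_labels,
                             inputs=embed,
                             num_sampled=num_sampled,
                             num_classes=vocabulary_size))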

A new loss function can also be written manually in TensorFlow.

If the model is seriously bottlenecked on input data, a custom data reader can be implemented, as described in [7]. For the case of skip-gram modeling, models/tutorials/embedding/word2vec.py serves as an example.

Writing custom TensorFlow Ops for further performance is described in [8]; models/tutorials/embedding/word2vec_optimized.py serves as an example.

Conclusion

The word2vec model is a computationally efficient model for learning word embeddings.

TensorFlow affords the flexibility to experiment with the training objective and to optimize the implementation further.


References

[0] https://www.tensorflow.org/tutorials/word2vec
[1] https://en.wikipedia.org/wiki/Distributional_semantics#Distributional_Hypothesis
[2] http://clic.cimec.unitn.it/marco/publications/acl2014/baroni-etal-countpredict-acl2014.pdf
[3] https://arxiv.org/pdf/1301.3781.pdf
[4] https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf
[5] https://lvdmaaten.github.io/tsne/
[6] http://download.tensorflow.org/data/questions-words.txt
[7] https://www.tensorflow.org/extend/new_data_formats
[8] https://www.tensorflow.org/extend/adding_an_op
