Working with CBOW Embeddings
Jeff, 04/15/2018
In this recipe we implement the CBOW (continuous bag of words) method of word2vec, which is closely related to the skip-gram method.
In CBOW, a single target word is predicted from a surrounding window of context words.
In the prior skip-gram example, each combination of context word and target was treated as a separate pair of inputs and outputs. In CBOW, the embeddings of the surrounding window are added together to form a single embedding, which is then used to predict the target word.
Figure: A depiction of how the CBOW embedding data is created from a window on an example sentence (window size = 1 on each side).
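To make this data layout concrete, the following is a minimal sketch (not part of the recipe's code) of how CBOW (context, target) pairs would be formed from a toy sentence with a window size of 1 on each side:
//**************************************************************
# Illustrative only: build CBOW (context, target) pairs from a toy sentence
sentence = ['the', 'cat', 'sat', 'on', 'the', 'mat']
window_size = 1
pairs = []
for ix in range(window_size, len(sentence) - window_size):
    context = sentence[ix - window_size:ix] + sentence[ix + 1:ix + 1 + window_size]
    target = sentence[ix]
    pairs.append((context, target))
# pairs -> [(['the', 'sat'], 'cat'), (['cat', 'on'], 'sat'), ...]
//**************************************************************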
The code is similar to that of the skip-gram recipe; the main changes are in how we create the combined embedding and how we generate the batch data from the sentences.
All the major helper functions have been moved to a separate file, called "text_helpers.py", in the same directory.
This file holds the data loading, text normalization, dictionary creation, and batch generation functions.
1. Loading the Libraries
We load the necessary libraries, including our own "text_helpers.py".
The "tensorflow" library is imported as "tf".
other libraries " ......" are imported in the same way.
We then start a graph session:
//********************************************
import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np
import random
import os
import pickle
import string
import requests
import collections
import io
import tarfile
import urllib.request
import text_helpers
from nltk.corpus import stopwords
sess = tf.Session()
//*************************************************************
2. Ensuring the Saving Folder Exists
We want to make sure that our temporary data and parameter-saving folder exists before we start saving to it. Use the following code:
//*************************************************
# Make a saving directory if it doesn't exist
data_folder_name = 'temp'
if not os.path.exists(data_folder_name):
    os.makedirs(data_folder_name)
//**********************************************************************
3. Declaring the Model Parameters
We declare the parameters of our model, which are similar to those of the skip-gram method in the prior recipe:
//**************************************************************
# Declare model parameters
batch_size = 500
embedding_size = 200
vocabulary_size = 2000
generations = 50000
model_learning_rate = 0.001
num_sampled = int(batch_size/2)
window_size = 3
# Add checkpoints to training
save_embeddings_every = 5000
print_valid_every = 5000
print_loss_every = 100
# Declare stop words
stops = stopwords.words('english')
# We pick some test words. We are expecting synonyms to appear
valid_words = ['love', 'hate', 'happy', 'sad', 'man', 'woman']
//**************************************************************
4. Keeping Reviews with Three or More Words
We have moved the data loading and text normalization functions to the separate file that we imported at the start, so we can call them now. We also keep only reviews that have three or more words in them. Use the following code:
//**************************************************************
texts, target = text_helpers.load_movie_data(data_folder_name)
texts = text_helpers.normalize_text(texts, stops)
# Texts must contain at least 3 words
target = [target[ix] for ix, x in enumerate(texts) if len(x.split()) > 2]
texts = [x for x in texts if len(x.split()) > 2]
//***************************************************************************************
5. Building the Vocabulary and Reverse Dictionaries
Now we create our vocabulary dictionary, which will help us look up words. We also need a reverse dictionary that looks up words from indices, for when we want to print out the nearest words to our validation set:
//**************************************************************
word_dictionary = text_helpers.build_dictionary(texts, vocabulary_size)
word_dictionary_rev = dict(zip(word_dictionary.values(), word_dictionary.keys()))
text_data = text_helpers.text_to_numbers(texts, word_dictionary)
# Get validation word keys
valid_examples = [word_dictionary[x] for x in valid_words]
//**************************************************************
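As a quick illustration (hypothetical output, assuming 'love' is in the vocabulary, which it must be for the validation lookup above to succeed), the two dictionaries map in opposite directions:
//**************************************************************
# Round-trip through the two dictionaries (illustrative)
ix = word_dictionary['love']        # word -> index
print(word_dictionary_rev[ix])      # index -> word, prints 'love'
//**************************************************************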
6. Initializing the Word Embeddings
Next we initialize the word embeddings that we want to fit and declare the model data placeholders. Use the following code:
//*********************************************************
embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
# Create data/target placeholders
x_inputs = tf.placeholder(tf.int32, shape=[batch_size, 2*window_size])
y_target = tf.placeholder(tf.int32, shape=[batch_size, 1])
valid_dataset = tf.constant(valid_examples, dtype=tf.int32)
//*******************************************************************************
7. Looping Over and Adding the Window Embeddings
We can now specify how we want to handle the word embeddings. Since the CBOW model adds up the embeddings of the context window, we create a loop and add up all of the embeddings in the window:
//*************************************************
# Lookup the word embeddings and
# Add together window embeddings:
embed = tf.zeros([batch_size, embedding_size])
for element in range(2*window_size):
    embed += tf.nn.embedding_lookup(embeddings, x_inputs[:, element])
//********************************************************************
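The same sum can also be written without a Python loop; the following is an equivalent sketch (not in the original recipe) that looks up all window embeddings at once and sums them over the window axis:
//********************************************************************
# Loop-free alternative (illustrative): embedding_lookup on the full
# [batch_size, 2*window_size] index matrix returns a
# [batch_size, 2*window_size, embedding_size] tensor; summing over
# axis 1 gives the same [batch_size, embedding_size] result.
embed = tf.reduce_sum(tf.nn.embedding_lookup(embeddings, x_inputs), axis=1)
//********************************************************************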
8. Using the NCE Loss Function
We use TensorFlow's built-in noise-contrastive estimation (NCE) loss function because our categorical output is too sparse for the softmax to converge, as follows:
//******************************************************
# NCE loss parameters
nce_weights = tf.Variable(tf.truncated_normal([vocabulary_size, embedding_size], stddev=1.0 / np.sqrt(embedding_size)))
nce_biases = tf.Variable(tf.zeros([vocabulary_size]))
# Declare loss function (NCE)
loss = tf.reduce_mean(tf.nn.nce_loss(nce_weights, nce_biases, embed, y_target, num_sampled, vocabulary_size))
//***************************************************************************
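Note that the positional argument order of tf.nn.nce_loss changed in TensorFlow 1.0 (labels now come before inputs), so if you are running a newer 1.x release, passing the arguments by keyword is safer:
//***************************************************************************
# Keyword-argument form of the same NCE loss (robust across TF 1.x releases)
loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weights,
                                     biases=nce_biases,
                                     labels=y_target,
                                     inputs=embed,
                                     num_sampled=num_sampled,
                                     num_classes=vocabulary_size))
//***************************************************************************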
9. Printing the Nearest Words
Just as in the skip-gram recipe, we will use cosine similarity to print off the nearest words to our validation word dataset, to get an idea of how our embeddings are working. Use the following code:
//***********************************************************
norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
normalized_embeddings = embeddings / norm
valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
similarity = tf.matmul(valid_embeddings, normalized_embeddings, transpose_b=True)
//***********************************************************
10. Saving the Embeddings
To save our embeddings, we must load the TensorFlow train.Saver method. This method defaults to saving the whole graph, but we can give it an argument to save only the embedding variable, and we can also give it a specific name. Here we give it the same name as the variable name in our graph:
//**********************************************************
saver = tf.train.Saver({"embeddings": embeddings})
//*********************************************************************************
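Later, after training, the saved variable can be loaded back into any session whose graph defines an embeddings variable of the same shape; a minimal restore sketch (not part of the recipe) looks like this:
//**********************************************************
# Hypothetical restore of the saved embeddings (path matches the
# checkpoint name used later in this recipe)
ckpt_path = os.path.join(data_folder_name, 'cbow_movie_embeddings.ckpt')
saver.restore(sess, ckpt_path)
//**********************************************************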
11. Declaring the Optimizer and Initializing Model Variables
We now declare an optimizer function and initialize our model variables. Use the following code:
//*****************************************************************
optimizer = tf.train.GradientDescentOptimizer(learning_rate=model_learning_rate).minimize(loss)
init = tf.initialize_all_variables()
sess.run(init)
//*******************************************************************************************
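If you are on TensorFlow 1.0 or later, tf.initialize_all_variables() is deprecated; the equivalent call is:
//*****************************************************************
# TF 1.0+ replacement for the deprecated initialize_all_variables()
init = tf.global_variables_initializer()
sess.run(init)
//*****************************************************************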
12. Training Loop and Saving the Embeddings/Dictionary
Finally, we can loop over our training steps, print out the loss, and save the embeddings and dictionary at the intervals we specified:
//**************************************************************
loss_vec = []
loss_x_vec = []
for i in range(generations):
    batch_inputs, batch_labels = text_helpers.generate_batch_data(text_data, batch_size, window_size, method='cbow')
    feed_dict = {x_inputs : batch_inputs, y_target : batch_labels}
    # Run the train step
    sess.run(optimizer, feed_dict=feed_dict)
    # Return the loss
    if (i+1) % print_loss_every == 0:
        loss_val = sess.run(loss, feed_dict=feed_dict)
        loss_vec.append(loss_val)
        loss_x_vec.append(i+1)
        print('Loss at step {} : {}'.format(i+1, loss_val))
    # Validation: Print some random words and top 5 related words
    if (i+1) % print_valid_every == 0:
        sim = sess.run(similarity, feed_dict=feed_dict)
        for j in range(len(valid_words)):
            valid_word = word_dictionary_rev[valid_examples[j]]
            top_k = 5  # number of nearest neighbors
            nearest = (-sim[j, :]).argsort()[1:top_k+1]
            log_str = "Nearest to {}:".format(valid_word)
            for k in range(top_k):
                close_word = word_dictionary_rev[nearest[k]]
                log_str = '{} {},'.format(log_str, close_word)
            print(log_str)
    # Save dictionary + embeddings
    if (i+1) % save_embeddings_every == 0:
        # Save vocabulary dictionary
        with open(os.path.join(data_folder_name, 'movie_vocab.pkl'), 'wb') as f:
            pickle.dump(word_dictionary, f)
        # Save embeddings
        model_checkpoint_path = os.path.join(os.getcwd(), data_folder_name, 'cbow_movie_embeddings.ckpt')
        save_path = saver.save(sess, model_checkpoint_path)
        print('Model saved in file: {}'.format(save_path))
//************************************************************
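Although matplotlib is imported above, the recipe's listing does not plot anything; a small optional sketch for visualizing the recorded loss values might look like this:
//**************************************************************
# Optional: plot the training loss recorded every print_loss_every generations
plt.plot(loss_x_vec, loss_vec, 'k-')
plt.title('Training Loss per Generation')
plt.xlabel('Generation')
plt.ylabel('Loss')
plt.show()
//**************************************************************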
13. Output
This results in the following output:
//************************************************
Loss at step 100 : 62.04829025268555
Loss at step 200 : 33.182334899902344
...
Loss at step 49900 : 1.6794960498809814
Loss at step 50000 : 1.5071022510528564
Nearest to love: clarity, cult, cliched, literary, memory,
Nearest to hate: bringing, gifted, almost, next, wish,
Nearest to happy: ensemble, fall, courage, uneven, girls,
Nearest to sad: santa, devoid, biopic, genuinely, becomes,
Nearest to man: project, stands, none, soul, away,
Nearest to woman: crush, even, x, team, ensemble,
Model saved in file: .../temp/cbow_movie_embeddings.ckpt
//************************************************
14. Adding the CBOW Method to the Batch Generator
All but one of the functions in the text_helpers.py file come directly from the prior recipe. We make a slight addition to the generate_batch_data() function by adding a 'cbow' method, as follows:
//*******************************************************
elif method=='cbow':
    batch_and_labels = [(x[:y] + x[(y+1):], x[y]) for x,y in zip(window_sequences, label_indices)]
    # Only keep windows with consistent 2*window_size
    batch_and_labels = [(x,y) for x,y in batch_and_labels if len(x)==2*window_size]
    batch, labels = [list(x) for x in zip(*batch_and_labels)]
//*******************************************************
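As a quick sanity check (not part of the original listing, and assuming the parameters and helpers defined earlier), you can generate one CBOW batch and inspect its shape; each input row should contain 2*window_size context indices for a single target word:
//*******************************************************
# Hypothetical sanity check: generate one CBOW batch and inspect shapes
batch_inputs, batch_labels = text_helpers.generate_batch_data(text_data, batch_size, window_size, method='cbow')
print('Inputs shape:', np.array(batch_inputs).shape)   # each row: 2*window_size context indices
print('Labels shape:', np.array(batch_labels).shape)   # one target word index per example
//*******************************************************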