Working with CBOW Embeddings

Jeff 04/15/2018


In this recipe we implement the CBOW (Continuous Bag of Words) method of word2vec. It is very similar to the skip-gram method.

In CBOW, a single target word is predicted from a surrounding window of context words.

In the prior skip-gram example, each combination of window word and target was treated as its own pair of input and output. In CBOW, the embeddings of the surrounding window are added together to form a single embedding that is used to predict the target word.

Figure: A depiction of how the CBOW training data is created out of a window on an example sentence (window size = 1 on each side).
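To make the contrast with skip-gram concrete, here is a small plain-Python sketch (the sentence, window size, and pairing direction are illustrative only): skip-gram produces one pair per (center word, neighbouring word) combination, while CBOW keeps the whole window together as the input for a single target.

//**************************************************************

# Illustrative only: how skip-gram vs. CBOW group context and target words
sentence = ['the', 'cat', 'sat', 'on', 'the', 'mat']
window = 1  # one word of context on each side

skip_gram_pairs = []  # one (center word, neighbouring word) pair per combination
cbow_pairs = []       # one (all neighbouring words, center word) example per position
for i, center in enumerate(sentence):
    context = sentence[max(i - window, 0):i] + sentence[i + 1:i + 1 + window]
    skip_gram_pairs.extend([(center, c) for c in context])
    cbow_pairs.append((context, center))

print(skip_gram_pairs[:4])  # [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
print(cbow_pairs[1])        # (['the', 'sat'], 'cat')

//**************************************************************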

The code is very similar to that of the skip-gram recipe; the main changes are in how we create the embeddings and how we generate the data from the sentences.

All the major helper functions have been moved to a separate file, called "text_helpers.py", in the same directory. This file holds the data loading, text normalization, dictionary creation, and batch generation functions.
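For orientation, the text normalization that helper performs is roughly the following. This is only a minimal sketch, assuming the helper lowercases the text, strips punctuation and numbers, removes stop words, and trims extra whitespace; it is not the exact code from the prior recipe.

//**************************************************************

import string

# Minimal sketch of a text-normalization helper (illustrative, not the
# exact text_helpers.normalize_text implementation)
def normalize_text_sketch(texts, stops):
    texts = [x.lower() for x in texts]                                              # lowercase
    texts = [''.join(c for c in x if c not in string.punctuation) for x in texts]   # remove punctuation
    texts = [''.join(c for c in x if c not in '0123456789') for x in texts]         # remove numbers
    texts = [' '.join(w for w in x.split() if w not in stops) for x in texts]       # remove stop words
    texts = [' '.join(x.split()) for x in texts]                                    # trim extra whitespace
    return texts

//**************************************************************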



1. Loading the libraries

We start by loading the necessary libraries (tensorflow is imported as tf), including the text_helpers.py script described above.


We then start a graph session:

//********************************************

import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np
import random
import os
import pickle
import string
import requests
import collections
import io
import tarfile
import urllib.request
import text_helpers
from nltk.corpus import stopwords

sess = tf.Session()

//*************************************************************

2. Making sure the parameter saving folder exists

We want to make sure that our temporary data and parameter saving folder exists before we start saving to it. Use the following code:

//*************************************************

# Make a saving directory if it doesn't exist
data_folder_name = 'temp'
if not os.path.exists(data_folder_name):
    os.makedirs(data_folder_name)

//**********************************************************************

3. Declaring the model parameters

We declare the parameters of our model, which will be similar to the skip-gram method in the prior recipe:

//**************************************************************

# Declare model parameters
batch_size = 500
embedding_size = 200
vocabulary_size = 2000
generations = 50000
model_learning_rate = 0.001
num_sampled = int(batch_size/2)
window_size = 3

# Add checkpoints to training
save_embeddings_every = 5000
print_valid_every = 5000
print_loss_every = 100

# Declare stop words
stops = stopwords.words('english')

# We pick some test words. We are expecting synonyms to appear
valid_words = ['love', 'hate', 'happy', 'sad', 'man', 'woman']

//**************************************************************

4. Keeping reviews with three or more words

We have moved the data loading and text normalization functions to the separate file that we imported at the start, so we can call them from there now. We also only want reviews that contain three or more words. Use the following code:

//**************************************************************

texts, target = text_helpers.load_movie_data(data_folder_name)
texts = text_helpers.normalize_text(texts, stops)

# Texts must contain at least 3 words
target = [target[ix] for ix, x in enumerate(texts) if len(x.split()) > 2]
texts = [x for x in texts if len(x.split()) > 2]

//***************************************************************************************
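As a quick sanity check, the same filter applied to a toy list (made-up data) keeps only the reviews with more than two words and keeps the labels aligned with them:

//**************************************************************

# Illustrative data only
toy_texts = ['good film', 'a truly great movie', 'bad']
toy_target = [1, 1, 0]
toy_target = [toy_target[ix] for ix, x in enumerate(toy_texts) if len(x.split()) > 2]
toy_texts = [x for x in toy_texts if len(x.split()) > 2]
print(toy_texts)   # ['a truly great movie']
print(toy_target)  # [1]

//**************************************************************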

5. Building the vocabulary and reverse dictionary

Now we create our vocabulary dictionary, which will help us look up words. We also need a reverse dictionary that looks up words from indices, for when we want to print out the nearest words to our validation set:

//**************************************************************

word_dictionary = text_helpers.build_dictionary(texts, vocabulary_size)
word_dictionary_rev = dict(zip(word_dictionary.values(), word_dictionary.keys()))
text_data = text_helpers.text_to_numbers(texts, word_dictionary)

# Get validation word keys
valid_examples = [word_dictionary[x] for x in valid_words]

//**************************************************************
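The dict(zip(...)) trick above simply swaps keys and values; on a toy dictionary (illustrative indices) it behaves like this:

//**************************************************************

# Illustrative dictionary only
toy_dictionary = {'movie': 1, 'love': 2, 'great': 3}
toy_dictionary_rev = dict(zip(toy_dictionary.values(), toy_dictionary.keys()))
print(toy_dictionary_rev[2])  # 'love'

//**************************************************************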

6. Initializing the word embeddings

Next we initialize the word embeddings that we want to fit and declare the model data placeholders. Use the following code:

//*********************************************************

embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))

# Create data/target placeholders
x_inputs = tf.placeholder(tf.int32, shape=[batch_size, 2*window_size])
y_target = tf.placeholder(tf.int32, shape=[batch_size, 1])
valid_dataset = tf.constant(valid_examples, dtype=tf.int32)

//*******************************************************************************
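With window_size = 3, each row of x_inputs holds the six surrounding word indices and the corresponding row of y_target holds the single center-word index. A small NumPy sketch (indices are made up) shows the expected shapes:

//**************************************************************

import numpy as np

# Illustrative example of one CBOW training row (indices are made up)
example_inputs = np.array([[14, 7, 92, 3, 48, 11]])  # 3 word indices before and 3 after the target
example_target = np.array([[25]])                    # the target (center) word index
print(example_inputs.shape)  # (1, 6) -> matches [batch_size, 2*window_size] with window_size = 3
print(example_target.shape)  # (1, 1) -> matches [batch_size, 1]

//**************************************************************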

7. Looping over and adding the window embeddings

We can now define how we want to deal with the word embeddings. Since the CBOW model adds up the embeddings of the context window, we create a loop and add up all of the embeddings in the window:

//*************************************************

# Lookup the word embeddings and
# add together the window embeddings:
embed = tf.zeros([batch_size, embedding_size])
for element in range(2*window_size):
    embed += tf.nn.embedding_lookup(embeddings, x_inputs[:, element])

//********************************************************************
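An equivalent, loop-free way to express the same sum (a sketch that is behaviourally the same as the loop above) is to look up all the context columns at once and reduce over the window axis:

//**************************************************************

# Equivalent vectorized form of the window sum (sketch):
# embedding_lookup on the full [batch_size, 2*window_size] index tensor returns a
# [batch_size, 2*window_size, embedding_size] tensor; summing over axis 1 gives the
# same [batch_size, embedding_size] result as the loop.
embed_vectorized = tf.reduce_sum(tf.nn.embedding_lookup(embeddings, x_inputs), axis=1)

//**************************************************************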

8. Using the NCE loss function

We use the NCE loss function that TensorFlow has built in, because our categorical output is too sparse for the softmax to converge, as follows:

//******************************************************

# NCE loss parameters
nce_weights = tf.Variable(tf.truncated_normal([vocabulary_size, embedding_size],
                                              stddev=1.0 / np.sqrt(embedding_size)))
nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

# Declare loss function (NCE)
loss = tf.reduce_mean(tf.nn.nce_loss(nce_weights, nce_biases, embed, y_target,
                                     num_sampled, vocabulary_size))

//***************************************************************************
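Note that the positional argument order of tf.nn.nce_loss changed between early TensorFlow releases and TensorFlow 1.x (labels now come before inputs). If the call above raises a shape or type error on your version, passing the arguments by keyword is a drop-in replacement that avoids the ambiguity:

//**************************************************************

# Keyword-argument form of the same NCE loss (order-independent across TF 1.x versions)
loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weights,
                                     biases=nce_biases,
                                     labels=y_target,
                                     inputs=embed,
                                     num_sampled=num_sampled,
                                     num_classes=vocabulary_size))

//**************************************************************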

9. Printing the nearest words

Just like in the skip-gram recipe, we will use cosine similarity to print off the nearest words to our validation word dataset, to get an idea of how our embeddings are working. Use the following code:

//***********************************************************

norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
normalized_embeddings = embeddings / norm
valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
similarity = tf.matmul(valid_embeddings, normalized_embeddings, transpose_b=True)

//***********************************************************
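On newer TensorFlow 1.x releases, reduce_sum may warn that the keep_dims argument is deprecated in favour of keepdims; if so, the equivalent spelling is:

//**************************************************************

# Newer-spelling equivalent of the norm computation above
norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keepdims=True))

//**************************************************************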

10. Saving the embeddings

To save our embeddings, we must load the TensorFlow train.Saver method. This method defaults to saving the whole graph, but we can give it an argument to save only the embedding variable, and we can also give it a specific name. Here we give it the same name as the variable name in our graph:

//**********************************************************

saver = tf.train.Saver({"embeddings": embeddings})

//*********************************************************************************

11. Declaring the optimizer and initializing the model variables

We now declare an optimizer function and initialize our model variables. Use the following code:

//*****************************************************************

optimizer = tf.train.GradientDescentOptimizer(learning_rate=model_learning_rate).minimize(loss)
init = tf.initialize_all_variables()
sess.run(init)

//*******************************************************************************************
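tf.initialize_all_variables() is deprecated from TensorFlow 0.12 onward in favour of tf.global_variables_initializer(); on newer 1.x releases the equivalent initialization is:

//**************************************************************

# Equivalent initialization on newer TensorFlow 1.x releases
init = tf.global_variables_initializer()
sess.run(init)

//**************************************************************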

12. Looping over the training steps and saving the embeddings and dictionary

Finally, we can loop across our training steps, printing out the loss and saving the embeddings and dictionary at the intervals we specified:

//**************************************************************

loss_vec = []
loss_x_vec = []
for i in range(generations):
    batch_inputs, batch_labels = text_helpers.generate_batch_data(text_data, batch_size,
                                                                  window_size, method='cbow')
    feed_dict = {x_inputs: batch_inputs, y_target: batch_labels}

    # Run the train step
    sess.run(optimizer, feed_dict=feed_dict)

    # Return the loss
    if (i+1) % print_loss_every == 0:
        loss_val = sess.run(loss, feed_dict=feed_dict)
        loss_vec.append(loss_val)
        loss_x_vec.append(i+1)
        print('Loss at step {} : {}'.format(i+1, loss_val))

    # Validation: Print some random words and top 5 related words
    if (i+1) % print_valid_every == 0:
        sim = sess.run(similarity, feed_dict=feed_dict)
        for j in range(len(valid_words)):
            valid_word = word_dictionary_rev[valid_examples[j]]
            top_k = 5  # number of nearest neighbors
            nearest = (-sim[j, :]).argsort()[1:top_k+1]
            log_str = "Nearest to {}:".format(valid_word)
            for k in range(top_k):
                close_word = word_dictionary_rev[nearest[k]]
                log_str = '{} {},'.format(log_str, close_word)
            print(log_str)

    # Save dictionary + embeddings
    if (i+1) % save_embeddings_every == 0:
        # Save vocabulary dictionary
        with open(os.path.join(data_folder_name, 'movie_vocab.pkl'), 'wb') as f:
            pickle.dump(word_dictionary, f)

        # Save embeddings
        model_checkpoint_path = os.path.join(os.getcwd(), data_folder_name,
                                             'cbow_movie_embeddings.ckpt')
        save_path = saver.save(sess, model_checkpoint_path)
        print('Model saved in file: {}'.format(save_path))

//************************************************************
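Although the recipe does not show a plot, matplotlib is imported above, and a natural optional follow-up is to visualize the loss values collected in loss_vec (a sketch):

//**************************************************************

# Plot the training loss we recorded every print_loss_every steps (optional sketch)
plt.plot(loss_x_vec, loss_vec, 'k-')
plt.title('Training Loss per Generation')
plt.xlabel('Generation')
plt.ylabel('NCE Loss')
plt.show()

//**************************************************************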

13. Output

This results in the following output:

//************************************************

Loss at step 100 : 62.04829025268555
Loss at step 200 : 33.182334899902344
Loss at step 49900 : 1.6794960498809814
Loss at step 50000 : 1.5071022510528564
Nearest to love: clarity, cult, cliched, literary, memory,
Nearest to hate: bringing, gifted, almost, next, wish,
Nearest to happy: ensemble, fall, courage, uneven, girls,
Nearest to sad: santa, devoid, biopic, genuinely, becomes,
Nearest to man: project, stands, none, soul, away,
Nearest to woman: crush, even, x, team, ensemble,
Model saved in file: .../temp/cbow_movie_embeddings.ckpt

//************************************************

14. Adding the 'cbow' method to the batch generator

All but one of the functions in the text_helpers.py file come directly from the prior recipe. We make a slight addition to the generate_batch_data() function by adding a 'cbow' method, as follows:

//*******************************************************

elif method == 'cbow':
    batch_and_labels = [(x[:y] + x[(y+1):], x[y]) for x, y in zip(window_sequences, label_indices)]

    # Only keep windows with consistent 2*window_size
    batch_and_labels = [(x, y) for x, y in batch_and_labels if len(x) == 2*window_size]
    batch, labels = [list(x) for x in zip(*batch_and_labels)]

//*******************************************************
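To see what this branch produces, here is a tiny stand-alone example; the window_sequences and label_indices values are illustrative, mimicking one full window of word indices with the label position in the middle:

//**************************************************************

# Illustrative input: one window of 2*window_size + 1 word indices,
# with label_indices pointing at the center position (window_size = 3 here)
window_size = 3
window_sequences = [[4, 8, 15, 16, 23, 42, 7]]
label_indices = [3]  # the center position within the window

batch_and_labels = [(x[:y] + x[(y+1):], x[y]) for x, y in zip(window_sequences, label_indices)]
batch_and_labels = [(x, y) for x, y in batch_and_labels if len(x) == 2*window_size]
batch, labels = [list(x) for x in zip(*batch_and_labels)]
print(batch)   # [[4, 8, 15, 23, 42, 7]] -> the 6 surrounding indices
print(labels)  # [16]                    -> the center (target) index

//**************************************************************

The length filter drops windows at the very beginning and end of each review, where a full 2*window_size context is not available.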
