Working with bag of words
We start by showing how to work with a bag of words embedding in TensorFlow. This is the mapping we introduced in the introduction, and here we show how to use this type of embedding for spam prediction.
To illustrate how to use bag of words with a text dataset, we will use a spam-ham phone text database from the UCI machine learning data repository (https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection). This is a collection of phone text messages that are either spam or not spam (ham). We will download this data, store it for future use, and then proceed with the bag of words method to predict whether a text is spam.
The model that operates on the bag of words will be a logistic model with no hidden layers. We will use stochastic training, with a batch size of one, and compute the accuracy on a held-out test set at the end.
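As a refresher, the following minimal sketch shows what a bag of words count vector looks like; the toy vocabulary and sentence are made up for illustration and are not from the spam data. The recipe below builds this same kind of vector with TensorFlow operations.
import numpy as np
# Toy vocabulary and sentence, for illustration only
vocab = ['call', 'free', 'now', 'meeting', 'tomorrow']
word_to_index = {word: ix for ix, word in enumerate(vocab)}
sentence = 'call now free free'
bag = np.zeros(len(vocab))
for word in sentence.split():
    bag[word_to_index[word]] += 1
print(bag)  # [1. 2. 1. 0. 0.] -> one count per vocabulary word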
How to do it...
For this example, we will start by getting the data, normalizing and splitting the text, running it through an embedding function, and training the logistic function to predict spam.
1. The first task will be to import the necessary libraries. Besides the usual libraries, we will need a zip file library to unzip the data that we retrieve from the UCI machine learning website:
import tensorflow as tf
import matplotlib.pyplot as plt
import os
import numpy as np
import csv
import string
import requests
import io
from zipfile import ZipFile
from tensorflow.contrib import learn
sess = tf.Session()
2. Instead of downloading the text data every time the script is run, we will save it and check whether the file has been saved before. This prevents us from repeatedly downloading the data if we want to change the script parameters. After downloading, we will extract the input and target data and change the target to be 1 for spam and 0 for ham:
# Create a temp directory for the data if it does not exist
if not os.path.exists('temp'):
    os.makedirs('temp')
save_file_name = os.path.join('temp','temp_spam_data.csv')
if os.path.isfile(save_file_name):
    text_data = []
    with open(save_file_name, 'r') as temp_output_file:
        reader = csv.reader(temp_output_file)
        for row in reader:
            text_data.append(row)
else:
    zip_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip'
    r = requests.get(zip_url)
    z = ZipFile(io.BytesIO(r.content))
    file = z.read('SMSSpamCollection')
    # Format Data
    text_data = file.decode()
    text_data = text_data.encode('ascii',errors='ignore')
    text_data = text_data.decode().split('\n')
    text_data = [x.split('\t') for x in text_data if len(x)>=1]
    # And write to csv
    with open(save_file_name, 'w') as temp_output_file:
        writer = csv.writer(temp_output_file)
        writer.writerows(text_data)
texts = [x[1] for x in text_data]
target = [x[0] for x in text_data]
# Relabel 'spam' as 1, 'ham' as 0
target = [1 if x=='spam' else 0 for x in target]
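As an optional sanity check (not part of the original recipe), we can print how many messages we loaded and how many of them are labeled as spam:
# Optional sanity check: message counts
print('Number of messages: {}'.format(len(texts)))
print('Number labeled as spam: {}'.format(sum(target)))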
3. To reduce the potential vocabulary size, we normalize the text. To do this, we remove the influence of capitalization and numbers in the text. Use the following code:
# Convert to lower case
texts = [x.lower() for x in texts]
# Remove punctuation
texts = [''.join(c for c in x if c not in string.punctuation) for x in texts]
# Remove numbers
texts = [''.join(c for c in x if c not in '0123456789') for x in texts]
# Trim extra whitespace
texts = [' '.join(x.split()) for x in texts]
4. We must also determine the maximum sentence size. To do this, we look at a histogram of text lengths in the data set. We see that a good cut-off might be around 25 words. Use the following code:
# Plot histogram of text lengths
text_lengths = [len(x.split()) for x in texts]
text_lengths = [x for x in text_lengths if x < 50]
plt.hist(text_lengths, bins=25)
plt.title('Histogram of # of Words in Texts')
sentence_size = 25
min_word_freq = 3
Figure: Histogram of the number of words in the texts.
5. TensorFlow has a built-in processing tool for determining vocabulary embeddings, called VocabularyProcessor(), under the learn.preprocessing library:
vocab_processor = learn.preprocessing.VocabularyProcessor(sentence_size, min_frequency=min_word_freq)
vocab_processor.fit_transform(texts)
embedding_size = len(vocab_processor.vocabulary_)
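To get a feel for what the vocabulary processor produces, we can transform a single made-up sentence; each text becomes a fixed-length array of sentence_size word indices. This peek is illustrative only and is not required by the recipe:
# Peek at the processor output for one made-up sentence (illustrative only)
example = list(vocab_processor.transform(['free entry in a weekly competition']))[0]
print(example.shape)  # (25,)
print(example)        # word indices; 0 marks padding or words below min_word_freq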
6. Now we will split the data into a train and test set:
train_indices = np.random.choice(len(texts), round(len(texts)*0.8), replace=False)
test_indices = np.array(list(set(range(len(texts))) - set(train_indices)))
texts_train = [x for ix, x in enumerate(texts) if ix in train_indices]
texts_test = [x for ix, x in enumerate(texts) if ix in test_indices]
target_train = [x for ix, x in enumerate(target) if ix in train_indices]
target_test = [x for ix, x in enumerate(target) if ix in test_indices]
7. Next, we declare the embedding matrix for the words. Sentence words will be translated into indices, and these indices will be translated into one-hot-encoded vectors that we create with an identity matrix whose size equals our word embedding size. We will use this matrix to look up the sparse vector for each word and add these together to form the sparse sentence vector. Use the following code:
identity_mat = tf.diag(tf.ones(shape=[embedding_size]))
8. Since we will end up doing logistic regression to predict the probability of spam, we need to declare our logistic regression variables. Then we declare our data placeholders as well. It is important to note that the x_data input placeholder should be of integer type because it will be used to look up the row indices of our identity matrix, and TensorFlow requires that lookup to be an integer:
A = tf.Variable(tf.random_normal(shape=[embedding_size,1]))
b = tf.Variable(tf.random_normal(shape=[1,1]))
# Initialize placeholders
x_data = tf.placeholder(shape=[sentence_size], dtype=tf.int32)
y_target = tf.placeholder(shape=[1, 1], dtype=tf.float32)
9. Now we use TensorFlow's embedding lookup function, which maps the indices of the words in a sentence to the one-hot-encoded rows of our identity matrix. Once we have that matrix, we create the sentence vector by summing up these word vectors. Use the following code:
x_embed = tf.nn.embedding_lookup(identity_mat, x_data)
x_col_sums = tf.reduce_sum(x_embed, 0)
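To convince yourself that this lookup-and-sum really produces per-word counts, here is a small NumPy-only sketch of the same computation with made-up indices. Note that the zero-padding indices also land in position 0 of the sentence vector:
# NumPy-only sketch of the lookup-and-sum above (indices are made up)
demo_identity = np.eye(6)                      # one-hot row per word index
demo_sentence = np.array([2, 5, 5, 0, 0])      # word indices, 0 = padding
demo_counts = demo_identity[demo_sentence].sum(axis=0)
print(demo_counts)  # [2. 0. 1. 0. 0. 2.] -> word 5 appears twice; index 0 collects the padding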
10. Now that we have fixed-length sentence vectors for each sentence, we want to perform logistic regression. To do this, we need to declare the actual model operations. Since we are doing this one data point at a time (stochastic training), we expand the dimensions of our input and perform the linear part of the model on it. Remember that TensorFlow has a loss function that includes the sigmoid function, so we do not need to include it in our output here:
x_col_sums_2D = tf.expand_dims(x_col_sums, 0)
model_output = tf.add(tf.matmul(x_col_sums_2D, A), b)
11. We now declare the loss function, prediction operation, and optimization function for training the model. Use the following code:
loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(model_output, y_target))
# Prediction operation
prediction = tf.sigmoid(model_output)
# Declare optimizer
my_opt = tf.train.GradientDescentOptimizer(0.001)
train_step = my_opt.minimize(loss)
12. Next, we initialize our graph variables before we start the training loop:
init = tf.initialize_all_variables()
sess.run(init)
13. Now we start iterating over the sentences. The vocab_processor.fit_transform() function is a generator that yields one transformed sentence at a time, and we use this to our advantage to do stochastic training on our logistic model. To get a better idea of the accuracy trend, we keep a trailing average of the accuracy over the past 50 training steps. If we only plotted the accuracy of the current observation, we would see either 1 or 0, depending on whether that training data point was predicted correctly. Use the following code:
print('Starting Training Over {} Sentences.'.format(len(texts_train)))
loss_vec = []
train_acc_all = []
train_acc_avg = []
for ix, t in enumerate(vocab_processor.fit_transform(texts_train)):
    y_data = [[target_train[ix]]]
    sess.run(train_step, feed_dict={x_data: t, y_target: y_data})
    temp_loss = sess.run(loss, feed_dict={x_data: t, y_target: y_data})
    loss_vec.append(temp_loss)
    if (ix+1)%10==0:
        print('Training Observation #' + str(ix+1) + ': Loss = ' + str(temp_loss))
    # Keep trailing average of past 50 observations accuracy
    # Get prediction of single observation
    [[temp_pred]] = sess.run(prediction, feed_dict={x_data:t, y_target:y_data})
    # Get True/False if prediction is accurate
    train_acc_temp = target_train[ix]==np.round(temp_pred)
    train_acc_all.append(train_acc_temp)
    if len(train_acc_all) >= 50:
        train_acc_avg.append(np.mean(train_acc_all[-50:]))
14. This results in the following output:
Starting Training Over 4459 Sentences.
Training Observation #10: Loss = 5.45322
Training Observation #20: Loss = 3.58226
Training Observation #30: Loss = 0.0
...
Training Observation #4430: Loss = 1.84636
Training Observation #4440: Loss = 1.46626e-05
Training Observation #4450: Loss = 0.045941
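As an optional extra (not part of the recipe's reported output), the vectors we recorded during training can be plotted to visualize the run, for example:
# Optional: visualize the recorded loss and trailing-average accuracy
plt.figure()
plt.plot(loss_vec, 'k-')
plt.title('Training Loss per Observation')
plt.xlabel('Observation')
plt.ylabel('Loss')
plt.figure()
plt.plot(train_acc_avg, 'b-')
plt.title('Trailing Average (50 Observations) Training Accuracy')
plt.xlabel('Observation')
plt.ylabel('Accuracy')
plt.show()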
15. To get the test set accuracy, we repeat the preceding process on the test texts, running only the prediction operation and not the training operation:
print('Getting Test Set Accuracy For {} Sentences.'.format(len(texts_test)))
test_acc_all = []
for ix, t in enumerate(vocab_processor.fit_transform(texts_test)):
    y_data = [[target_test[ix]]]
    if (ix+1)%50==0:
        print('Test Observation #' + str(ix+1))
    # Get prediction of single observation
    [[temp_pred]] = sess.run(prediction, feed_dict={x_data:t, y_target:y_data})
    # Get True/False if prediction is accurate
    test_acc_temp = target_test[ix]==np.round(temp_pred)
    test_acc_all.append(test_acc_temp)
print('\nOverall Test Accuracy: {}'.format(np.mean(test_acc_all)))
This results in the following output:
Getting Test Set Accuracy For 1115 Sentences.
Test Observation #10
Test Observation #20
Test Observation #30
...
Test Observation #1000
Test Observation #1050
Test Observation #1100
Overall Test Accuracy: 0.8035874439461883
How it works...
For this example, we worked with the spam-ham text data from the UCI machine learning repository. We used TensorFlow's vocabulary processing functions to create a standardized vocabulary and built sentence vectors that are the sum of each text's word vectors. We used these sentence vectors in logistic regression and obtained a model with roughly 80% accuracy at predicting whether a text is spam.
There's more...
It is worth mentioning the motivation for limiting the sentence (or text) size. In this example, we limited the text size to 25 words. This is common practice with bag of words because it limits the effect of text length on the prediction. You can imagine that if we find a word, meeting for example, that is predictive of a text being ham (not spam), then a spam message might get through by including many occurrences of that word at the end.
In fact, this is a common problem with imbalanced target data. Imbalanced data might occur
in this situation, since spam may be hard to find and ham may be easy to find. Because of this
fact, the vocabulary we create might be heavily skewed toward words represented in the ham part of our data (more ham means more words are represented in ham than in spam). If we
allow unlimited lengths of texts, then spammers might take advantage of this and create very
long texts, which have a higher probability of triggering non-spam word factors in our logistic
model.
In the next section, we attempt to tackle this problem in a better way by using the frequency of
word occurrence to determine the values of the word embeddings.