Working with bag of words
We start by showing how to work with a bag of words embedding in TensorFlow. This is the mapping we introduced in the introduction, and here we show how to use this type of embedding for spam prediction.
To illustrate how to use bag of words with a text dataset, we will use a spam-ham phone text database from the UCI machine learning data repository (https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection). This is a collection of phone text messages that are either spam or not spam (ham). We will download this data, store it for future use, and then proceed with the bag of words method to predict whether a text is spam.
The model that operates on the bag of words will be a logistic model with no hidden layers. We will use stochastic training, with a batch size of one, and compute the accuracy on a held-out test set at the end.
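As a refresher, the following minimal sketch shows what a bag of words count vector looks like; the toy vocabulary and sentence are made up for illustration and are not from the spam data. The recipe below builds this same kind of vector with TensorFlow operations.
import numpy as np
# Toy vocabulary and sentence, for illustration only
vocab = ['call', 'free', 'now', 'meeting', 'tomorrow']
word_to_index = {word: ix for ix, word in enumerate(vocab)}
sentence = 'call now free free'
bag = np.zeros(len(vocab))
for word in sentence.split():
    bag[word_to_index[word]] += 1
print(bag)  # [1. 2. 1. 0. 0.] -> one count per vocabulary word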
How to do it...
For this example, we will start by getting the data, normalizing and splitting the text, running it through an embedding function, and training the logistic function to predict spam.
1. The first task will be to import the necessary libraries. Besides the usual libraries, we will need a zip file library to unzip the data that we retrieve from the UCI machine learning website:
import tensorflow as tf
import matplotlib.pyplot as plt
import os
import numpy as np
import csv
import string
import requests
import io
from zipfile import ZipFile
from tensorflow.contrib import learn
sess = tf.Session()
2. Instead of downloading the text data every time the script is run, we will save it and check whether the file has been saved before. This prevents us from repeatedly downloading the data if we want to change the script parameters. After downloading, we will extract the input and target data and change the target to be 1 for spam and 0 for ham:
# Create a temp directory for the data if it does not exist
if not os.path.exists('temp'):
    os.makedirs('temp')
save_file_name = os.path.join('temp','temp_spam_data.csv')
if os.path.isfile(save_file_name):
    text_data = []
    with open(save_file_name, 'r') as temp_output_file:
        reader = csv.reader(temp_output_file)
        for row in reader:
            text_data.append(row)
else:
    zip_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip'
    r = requests.get(zip_url)
    z = ZipFile(io.BytesIO(r.content))
    file = z.read('SMSSpamCollection')
    # Format Data
    text_data = file.decode()
    text_data = text_data.encode('ascii',errors='ignore')
    text_data = text_data.decode().split('\n')
    text_data = [x.split('\t') for x in text_data if len(x)>=1]
    # And write to csv
    with open(save_file_name, 'w') as temp_output_file:
        writer = csv.writer(temp_output_file)
        writer.writerows(text_data)
texts = [x[1] for x in text_data]
target = [x[0] for x in text_data]
# Relabel 'spam' as 1, 'ham' as 0
target = [1 if x=='spam' else 0 for x in target]
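As an optional sanity check (not part of the original recipe), we can print how many messages we loaded and how many of them are labeled as spam:
# Optional sanity check: message counts
print('Number of messages: {}'.format(len(texts)))
print('Number labeled as spam: {}'.format(sum(target)))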
3. To reduce the potential vocabulary size, we normalize the text. To do this, we remove the influence of capitalization and numbers in the text. Use the following code:
# Convert to lower case
texts = [x.lower() for x in texts]
# Remove punctuation
texts = [''.join(c for c in x if c not in string.punctuation) for x in texts]
# Remove numbers
texts = [''.join(c for c in x if c not in '0123456789') for x in texts]
# Trim extra whitespace
texts = [' '.join(x.split()) for x in texts]
4. We must also determine the maximum sentence size. To do this, we look at a histogram of text lengths in the data set. We see that a good cut-off might be around 25 words. Use the following code:
# Plot histogram of text lengths
text_lengths = [len(x.split()) for x in texts]
text_lengths = [x for x in text_lengths if x < 50]
plt.hist(text_lengths, bins=25)
plt.title('Histogram of # of Words in Texts')
sentence_size = 25
min_word_freq = 3
Figure: Histogram of the number of words in the texts.
5. TensorFlow has a built-in processing tool for determining vocabulary embeddings, called VocabularyProcessor(), under the learn.preprocessing library:
vocab_processor = learn.preprocessing.VocabularyProcessor(sentence_size, min_frequency=min_word_freq)
vocab_processor.fit_transform(texts)
embedding_size = len(vocab_processor.vocabulary_)
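To get a feel for what the vocabulary processor produces, we can transform a single made-up sentence; each text becomes a fixed-length array of sentence_size word indices. This peek is illustrative only and is not required by the recipe:
# Peek at the processor output for one made-up sentence (illustrative only)
example = list(vocab_processor.transform(['free entry in a weekly competition']))[0]
print(example.shape)  # (25,)
print(example)        # word indices; 0 marks padding or words below min_word_freq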
6. Now we will split the data into a train and test set:
train_indices = np.random.choice(len(texts), round(len(texts)*0.8), replace=False)
test_indices = np.array(list(set(range(len(texts))) - set(train_indices)))
texts_train = [x for ix, x in enumerate(texts) if ix in train_indices]
texts_test = [x for ix, x in enumerate(texts) if ix in test_indices]
target_train = [x for ix, x in enumerate(target) if ix in train_indices]
target_test = [x for ix, x in enumerate(target) if ix in test_indices]
7. Next, we declare the embedding matrix for the words. Sentence words will be translated into indices, and these indices will be translated into one-hot-encoded vectors that we create with an identity matrix whose size equals our word embedding size. We will use this matrix to look up the sparse vector for each word and add these together to form the sparse sentence vector. Use the following code:
identity_mat = tf.diag(tf.ones(shape=[embedding_size]))
8. Since we will end up doing logistic regression to predict the probability of spam, we need to declare our logistic regression variables. Then we declare our data placeholders as well. It is important to note that the x_data input placeholder should be of integer type because it will be used to look up the row indices of our identity matrix, and TensorFlow requires that lookup to be an integer:
A = tf.Variable(tf.random_normal(shape=[embedding_size,1]))
b = tf.Variable(tf.random_normal(shape=[1,1]))
# Initialize placeholders
x_data = tf.placeholder(shape=[sentence_size], dtype=tf.int32)
y_target = tf.placeholder(shape=[1, 1], dtype=tf.float32)
9. Now we use TensorFlow's embedding lookup function, which maps the indices of the words in a sentence to the one-hot-encoded rows of our identity matrix. Once we have that matrix, we create the sentence vector by summing up these word vectors. Use the following code:
x_embed = tf.nn.embedding_lookup(identity_mat, x_data)
x_col_sums = tf.reduce_sum(x_embed, 0)
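To convince yourself that this lookup-and-sum really produces per-word counts, here is a small NumPy-only sketch of the same computation with made-up indices. Note that the zero-padding indices also land in position 0 of the sentence vector:
# NumPy-only sketch of the lookup-and-sum above (indices are made up)
demo_identity = np.eye(6)                      # one-hot row per word index
demo_sentence = np.array([2, 5, 5, 0, 0])      # word indices, 0 = padding
demo_counts = demo_identity[demo_sentence].sum(axis=0)
print(demo_counts)  # [2. 0. 1. 0. 0. 2.] -> word 5 appears twice; index 0 collects the padding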
10. Now that we have fixed-length sentence vectors for each sentence, we want to perform logistic regression. To do this, we need to declare the actual model operations. Since we are doing this one data point at a time (stochastic training), we expand the dimensions of our input and perform the linear part of the model on it. Remember that TensorFlow has a loss function that includes the sigmoid function, so we do not need to include it in our output here:
x_col_sums_2D = tf.expand_dims(x_col_sums, 0)
model_output = tf.add(tf.matmul(x_col_sums_2D, A), b)
11. We now declare the loss function, prediction operation, and optimization function for training the model. Use the following code:
loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(model_output, y_target))
# Prediction operation
prediction = tf.sigmoid(model_output)
# Declare optimizer
my_opt = tf.train.GradientDescentOptimizer(0.001)
train_step = my_opt.minimize(loss)
12. Next, we initialize our graph variables before we start the training loop:
init = tf.initialize_all_variables()
sess.run(init)
13. Now we start iterating over the sentences. The vocab_processor.fit_transform() function is a generator that yields one transformed sentence at a time, and we use this to our advantage to do stochastic training on our logistic model. To get a better idea of the accuracy trend, we keep a trailing average of the accuracy over the past 50 training steps. If we only plotted the accuracy of the current observation, we would see either 1 or 0, depending on whether that training data point was predicted correctly. Use the following code:
print('Starting Training Over {} Sentences.'.format(len(texts_train)))
loss_vec = []
train_acc_all = []
train_acc_avg = []
for ix, t in enumerate(vocab_processor.fit_transform(texts_train)):
    y_data = [[target_train[ix]]]
    sess.run(train_step, feed_dict={x_data: t, y_target: y_data})
    temp_loss = sess.run(loss, feed_dict={x_data: t, y_target: y_data})
    loss_vec.append(temp_loss)
    if (ix+1)%10==0:
        print('Training Observation #' + str(ix+1) + ': Loss = ' + str(temp_loss))
    # Keep trailing average of past 50 observations accuracy
    # Get prediction of single observation
    [[temp_pred]] = sess.run(prediction, feed_dict={x_data:t, y_target:y_data})
    # Get True/False if prediction is accurate
    train_acc_temp = target_train[ix]==np.round(temp_pred)
    train_acc_all.append(train_acc_temp)
    if len(train_acc_all) >= 50:
        train_acc_avg.append(np.mean(train_acc_all[-50:]))
14. This results in the following output:
Starting Training Over 4459 Sentences.
Training Observation #10: Loss = 5.45322
Training Observation #20: Loss = 3.58226
Training Observation #30: Loss = 0.0
...
Training Observation #4430: Loss = 1.84636
Training Observation #4440: Loss = 1.46626e-05
Training Observation #4450: Loss = 0.045941
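As an optional extra (not part of the recipe's reported output), the vectors we recorded during training can be plotted to visualize the run, for example:
# Optional: visualize the recorded loss and trailing-average accuracy
plt.figure()
plt.plot(loss_vec, 'k-')
plt.title('Training Loss per Observation')
plt.xlabel('Observation')
plt.ylabel('Loss')
plt.figure()
plt.plot(train_acc_avg, 'b-')
plt.title('Trailing Average (50 Observations) Training Accuracy')
plt.xlabel('Observation')
plt.ylabel('Accuracy')
plt.show()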
15. To get the test set accuracy, we repeat the preceding process on the test texts, running only the prediction operation and not the training operation:
print('Getting Test Set Accuracy For {} Sentences.'.format(len(texts_test)))
test_acc_all = []
for ix, t in enumerate(vocab_processor.fit_transform(texts_test)):
    y_data = [[target_test[ix]]]
    if (ix+1)%50==0:
        print('Test Observation #' + str(ix+1))
    # Get prediction of single observation
    [[temp_pred]] = sess.run(prediction, feed_dict={x_data:t, y_target:y_data})
    # Get True/False if prediction is accurate
    test_acc_temp = target_test[ix]==np.round(temp_pred)
    test_acc_all.append(test_acc_temp)
print('\nOverall Test Accuracy: {}'.format(np.mean(test_acc_all)))
This results in the following output:
Getting Test Set Accuracy For 1115 Sentences.
Test Observation #10
Test Observation #20
Test Observation #30
...
Test Observation #1000
Test Observation #1050
Test Observation #1100
Overall Test Accuracy: 0.8035874439461883
How it works...
For this example, we worked with the spam-ham text data from the UCI machine learning repository. We used TensorFlow's vocabulary processing functions to create a standardized vocabulary and built sentence vectors that are the sum of each text's word vectors. We used these sentence vectors in logistic regression and obtained a model with roughly 80% accuracy at predicting whether a text is spam.
There's more...
It is worth mentioning the motivation for limiting the sentence (or text) size. In this example, we limited the text size to 25 words. This is common practice with bag of words because it limits the effect of text length on the prediction. You can imagine that if we find a word, meeting for example, that is predictive of a text being ham (not spam), then a spam message might get through by including many occurrences of that word at the end.
In fact, this is a common problem with imbalanced target data. Imbalanced data might occur
in this situation, since spam may be hard to find and ham may be easy to find. Because of this
fact, the vocabulary we create might be heavily skewed toward words represented in the ham part of our data (more ham means more words are represented in ham than in spam). If we
allow unlimited lengths of texts, then spammers might take advantage of this and create very
long texts, which have a higher probability of triggering non-spam word factors in our logistic
model.
In the next section, we attempt to tackle this problem in a better way by using the frequency of
word occurrence to determine the values of the word embeddings.