Implementing TF-IDF
Since we can choose the embedding for each word, we might decide to change the weighting on certain words. One such strategy is to upweight useful words and downweight overly common or very rare words. The embedding we explore in this recipe is an attempt to achieve this.
Getting ready
TF-IDF is an acronym that stands for Term Frequency – Inverse Document Frequency. This term is essentially the product of the term frequency and the inverse document frequency for each word.
In the prior recipe, we introduced the bag-of-words methodology, which assigned a value of one for every occurrence of a word in a sentence. This is probably not ideal, as each category of sentence (spam and ham, in the prior recipe's example) most likely contains "the", "and", and other common words with similar frequency, whereas words such as "viagra" and "sale" should probably carry more weight in figuring out whether or not a text is spam.
We first want to take into consideration the term frequency. Here we consider the frequency with which a word occurs in an individual entry. The purpose of this part (TF) is to find terms that appear to be important within each entry. But words such as "the" and "and" may appear very frequently in every entry. We want to downweight the importance of these words, so we can imagine that multiplying the term frequency (TF) by the inverse of the whole-document frequency might help find important words. But since a collection of texts (a corpus) may be quite large, it is common to take the logarithm of the inverse document frequency.
This leaves us with the following formula for TF-IDF, for each word in each document entry:

w_tfidf = w_tf * log(1 / w_df)

Here, w_tf is the word frequency by document (how often the word occurs in an individual entry), and w_df is the document frequency of that word across all documents (the fraction of documents that contain it). We can imagine that high values of TF-IDF might indicate words that are very important in determining what a document is about.
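To make the formula concrete, here is a tiny hand computation on a made-up three-sentence corpus; the corpus, the word choice, and the helper variables are purely illustrative and are not part of the recipe. Note also that scikit-learn's TfidfVectorizer, which we use below, applies a smoothed variant of this formula:
import math
# Toy corpus, purely for illustration
docs = ['buy cheap viagra now',
        'call me when you get home',
        'cheap viagra for sale']
word = 'viagra'
tf = docs[0].split().count(word)           # how often the word occurs in the first entry
df = sum(word in d.split() for d in docs)  # number of documents containing the word
idf = math.log(len(docs) / df)             # log of the inverse document frequency
print(tf * idf)                            # TF-IDF value of 'viagra' for the first entry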
Creating the TF-IDF vectors requires us to load all the text into memory and count the occurrence of each word before we can start training our model.
Because of this, it is not fully implemented in TensorFlow, so we will use scikit-learn to create our TF-IDF embedding, but use TensorFlow to fit the logistic model.
How to do it...
1. We start by loading the necessary libraries; this time we also load scikit-learn's TF-IDF preprocessing module for our texts. Use the following code:
import tensorflow as tf
import matplotlib.pyplot as plt
import csv
import numpy as np
import os
import string
import requests
import io
import nltk
from zipfile import ZipFile
from sklearn.feature_extraction.text import TfidfVectorizer
2. We start a graph session and declare our batch size and the maximum feature size for our vocabulary:
sess = tf.Session()
batch_size = 200
max_features = 1000
3. Next we load the data, either from the Web or from our temp data folder if we have saved it before. Use the following code:
save_file_name = os.path.join('temp','temp_spam_data.csv')
if os.path.isfile(save_file_name):
    text_data = []
    with open(save_file_name, 'r') as temp_output_file:
        reader = csv.reader(temp_output_file)
        for row in reader:
            text_data.append(row)
else:
    zip_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip'
    r = requests.get(zip_url)
    z = ZipFile(io.BytesIO(r.content))
    file = z.read('SMSSpamCollection')
    # Format Data
    text_data = file.decode()
    text_data = text_data.encode('ascii',errors='ignore')
    text_data = text_data.decode().split('\n')
    text_data = [x.split('\t') for x in text_data if len(x)>=1]
    # And write to csv
    with open(save_file_name, 'w') as temp_output_file:
        writer = csv.writer(temp_output_file)
        writer.writerows(text_data)

texts = [x[1] for x in text_data]
target = [x[0] for x in text_data]
# Relabel 'spam' as 1, 'ham' as 0
target = [1. if x=='spam' else 0. for x in target]
4. Just like in the prior recipe, we will decrease our vocabulary size by converting everything to lowercase, removing punctuation, and getting rid of numbers:
# Lower case
texts = [x.lower() for x in texts]
# Remove punctuation
texts = [''.join(c for c in x if c not in string.punctuation) for x in texts]
# Remove numbers
texts = [''.join(c for c in x if c not in '0123456789') for x in texts]
# Trim extra whitespace
texts = [' '.join(x.split()) for x in texts]
5. In order to use scikit-learn's TF-IDF processing functions, we have to tell it how to tokenize our sentences, that is, how to break a sentence up into its constituent words. A good tokenizer is already built for us in the nltk package:
def tokenizer(text):
    # Note: nltk's word tokenizer may require nltk.download('punkt') on first use
    words = nltk.word_tokenize(text)
    return words

# Create TF-IDF of texts
tfidf = TfidfVectorizer(tokenizer=tokenizer, stop_words='english', max_features=max_features)
sparse_tfidf_texts = tfidf.fit_transform(texts)
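As an optional aside (not part of the original recipe), you can sanity-check the fitted vectorizer at this point by looking at the shape of the sparse matrix and a few entries of its learned vocabulary:
# Optional sanity check of the TF-IDF vectorizer (illustrative only)
print(sparse_tfidf_texts.shape)             # (number of texts, max_features)
print(list(tfidf.vocabulary_.items())[:5])  # a few (word, column index) pairs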
6. Next we break up our dataset into a train set and a test set. Use the following code:
train_indices = np.random.choice(sparse_tfidf_texts.shape[0], round(0.8*sparse_tfidf_texts.shape[0]), replace=False)
test_indices = np.array(list(set(range(sparse_tfidf_texts.shape[0])) - set(train_indices)))
texts_train = sparse_tfidf_texts[train_indices]
texts_test = sparse_tfidf_texts[test_indices]
target_train = np.array([x for ix, x in enumerate(target) if ix in train_indices])
target_test = np.array([x for ix, x in enumerate(target) if ix in test_indices])
7. Now we can declare our model variables for logistic regression and our data placeholders:
A = tf.Variable(tf.random_normal(shape=[max_features,1]))
b = tf.Variable(tf.random_normal(shape=[1,1]))
# Initialize placeholders
x_data = tf.placeholder(shape=[None, max_features], dtype=tf.float32)
y_target = tf.placeholder(shape=[None, 1], dtype=tf.float32)
8. We can now declare the model operations and the loss function. Remember that the sigmoid part of the logistic regression is inside our loss function. Use the following code:
model_output = tf.add(tf.matmul(x_data, A), b)
loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=model_output, labels=y_target))
9. We add a prediction and accuracy function to the graph so that we can see the accuracy of the train and test sets as our model is training:
prediction = tf.round(tf.sigmoid(model_output))
predictions_correct = tf.cast(tf.equal(prediction, y_target), tf.float32)
accuracy = tf.reduce_mean(predictions_correct)
10. We declare an optimizer and initialize our graph variables next:
my_opt = tf.train.GradientDescentOptimizer(0.0025)
train_step = my_opt.minimize(loss)
# Initialize Variables
init = tf.global_variables_initializer()
sess.run(init)
11. We now train our model over 10,000 generations, recording the train/test loss and accuracy every 100 generations and printing the status every 500 generations. Use the following code:
train_loss = []
test_loss = []
train_acc = []
test_acc = []
i_data = []
for i in range(10000):
    rand_index = np.random.choice(texts_train.shape[0], size=batch_size)
    rand_x = texts_train[rand_index].todense()
    rand_y = np.transpose([target_train[rand_index]])
    sess.run(train_step, feed_dict={x_data: rand_x, y_target: rand_y})
    # Only record loss and accuracy every 100 generations
    if (i+1)%100==0:
        i_data.append(i+1)
        train_loss_temp = sess.run(loss, feed_dict={x_data: rand_x, y_target: rand_y})
        train_loss.append(train_loss_temp)
        test_loss_temp = sess.run(loss, feed_dict={x_data: texts_test.todense(), y_target: np.transpose([target_test])})
        test_loss.append(test_loss_temp)
        train_acc_temp = sess.run(accuracy, feed_dict={x_data: rand_x, y_target: rand_y})
        train_acc.append(train_acc_temp)
        test_acc_temp = sess.run(accuracy, feed_dict={x_data: texts_test.todense(), y_target: np.transpose([target_test])})
        test_acc.append(test_acc_temp)
    if (i+1)%500==0:
        acc_and_loss = [i+1, train_loss_temp, test_loss_temp, train_acc_temp, test_acc_temp]
        acc_and_loss = [np.round(x,2) for x in acc_and_loss]
        print('Generation # {}. Train Loss (Test Loss): {:.2f} ({:.2f}). Train Acc (Test Acc): {:.2f} ({:.2f})'.format(*acc_and_loss))
12. This results in the following output:
Generation # 500. Train Loss (Test Loss): 0.69 (0.73). Train Acc (Test Acc): 0.62 (0.57)
Generation # 1000. Train Loss (Test Loss): 0.62 (0.63). Train Acc (Test Acc): 0.68 (0.66)
...
Generation # 9500. Train Loss (Test Loss): 0.39 (0.45). Train Acc (Test Acc): 0.89 (0.85)
Generation # 10000. Train Loss (Test Loss): 0.48 (0.45). Train Acc (Test Acc): 0.84 (0.85)
13. Finally, we plot the loss and accuracy for both the train and test sets; a minimal matplotlib sketch, using the lists we recorded during training, might look like the following:
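# Plot loss over time (line styles and titles are illustrative)
plt.plot(i_data, train_loss, 'k-', label='Train Loss')
plt.plot(i_data, test_loss, 'r--', label='Test Loss', linewidth=4)
plt.title('Cross Entropy Loss per Generation')
plt.xlabel('Generation')
plt.ylabel('Cross Entropy Loss')
plt.legend(loc='upper right')
plt.show()

# Plot train and test accuracy
plt.plot(i_data, train_acc, 'k-', label='Train Set Accuracy')
plt.plot(i_data, test_acc, 'r--', label='Test Set Accuracy', linewidth=4)
plt.title('Train and Test Accuracy')
plt.xlabel('Generation')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.show()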
Figure 2: Cross entropy loss for our logistic spam model built off of TF-IDF values.
Figure 3: Train and test set accuracy for the logistic spam model built off TF-IDF values.