Implementing TF-IDF
Since we can choose the embedding for each word, we might decide to change the weighting on certain words. One such strategy is to upweight useful words and downweight overly common or very rare words. The embedding we explore in this recipe is an attempt to achieve this.
Getting ready
TF-IDF is an acronym that stands for Term Frequency – Inverse Document Frequency. This term is essentially the product of the term frequency and the inverse document frequency for each word.
In the prior recipe, we introduced the bag-of-words methodology, which assigned a value of one for every occurrence of a word in a sentence. This is probably not ideal, as each category of sentence (spam and ham, in the prior recipe's example) most likely contains "the", "and", and other common words with similar frequency, whereas words such as "viagra" and "sale" should probably carry more weight in figuring out whether or not a text is spam.
We first want to take into consideration the term frequency. Here we consider the frequency with which a word occurs in an individual entry. The purpose of this part (TF) is to find terms that appear to be important within each entry. But words such as "the" and "and" may appear very frequently in every entry. We want to downweight the importance of these words, so we can imagine that multiplying the term frequency (TF) by the inverse of the whole-document frequency might help find important words. But since a collection of texts (a corpus) may be quite large, it is common to take the logarithm of the inverse document frequency.
This leaves us with the following formula for TF-IDF, for each word in each document entry:

w_tfidf = w_tf * log(1 / w_df)

Here, w_tf is the word frequency by document (how often the word occurs in an individual entry), and w_df is the document frequency of that word across all documents (the fraction of documents that contain it). We can imagine that high values of TF-IDF might indicate words that are very important in determining what a document is about.
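To make the formula concrete, here is a tiny hand computation on a made-up three-sentence corpus; the corpus, the word choice, and the helper variables are purely illustrative and are not part of the recipe. Note also that scikit-learn's TfidfVectorizer, which we use below, applies a smoothed variant of this formula:
import math
# Toy corpus, purely for illustration
docs = ['buy cheap viagra now',
        'call me when you get home',
        'cheap viagra for sale']
word = 'viagra'
tf = docs[0].split().count(word)           # how often the word occurs in the first entry
df = sum(word in d.split() for d in docs)  # number of documents containing the word
idf = math.log(len(docs) / df)             # log of the inverse document frequency
print(tf * idf)                            # TF-IDF value of 'viagra' for the first entry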
Creating the TF-IDF vectors requires us to load all the text into memory and count the occurrence of each word before we can start training our model.
Because of this, it is not fully implemented in TensorFlow, so we will use scikit-learn to create our TF-IDF embedding, but use TensorFlow to fit the logistic model.
How to do it...
1. We start by loading the necessary libraries; this time we also load scikit-learn's TF-IDF preprocessing module for our texts. Use the following code:
import tensorflow as tf
import matplotlib.pyplot as plt
import csv
import numpy as np
import os
import string
import requests
import io
import nltk
from zipfile import ZipFile
from sklearn.feature_extraction.text import TfidfVectorizer
2. We start a graph session and declare our batch size and the maximum feature size for our vocabulary:
sess = tf.Session()
batch_size = 200
max_features = 1000
3. Next we load the data, either from the Web or from our temp data folder if we have saved it before. Use the following code:
save_file_name = os.path.join('temp','temp_spam_data.csv')
if os.path.isfile(save_file_name):
    text_data = []
    with open(save_file_name, 'r') as temp_output_file:
        reader = csv.reader(temp_output_file)
        for row in reader:
            text_data.append(row)
else:
    zip_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip'
    r = requests.get(zip_url)
    z = ZipFile(io.BytesIO(r.content))
    file = z.read('SMSSpamCollection')
    # Format Data
    text_data = file.decode()
    text_data = text_data.encode('ascii',errors='ignore')
    text_data = text_data.decode().split('\n')
    text_data = [x.split('\t') for x in text_data if len(x)>=1]
    # And write to csv
    with open(save_file_name, 'w') as temp_output_file:
        writer = csv.writer(temp_output_file)
        writer.writerows(text_data)

texts = [x[1] for x in text_data]
target = [x[0] for x in text_data]
# Relabel 'spam' as 1, 'ham' as 0
target = [1. if x=='spam' else 0. for x in target]
4. Just like in the prior recipe, we will decrease our vocabulary size by converting everything to lowercase, removing punctuation, and getting rid of numbers:
# Lower case
texts = [x.lower() for x in texts]
# Remove punctuation
texts = [''.join(c for c in x if c not in string.punctuation) for x in texts]
# Remove numbers
texts = [''.join(c for c in x if c not in '0123456789') for x in texts]
# Trim extra whitespace
texts = [' '.join(x.split()) for x in texts]
5. In order to use scikit-learn's TF-IDF processing functions, we have to tell it how to tokenize our sentences, that is, how to break a sentence up into its constituent words. A good tokenizer is already built for us in the nltk package:
def tokenizer(text):
    # Note: nltk's word tokenizer may require nltk.download('punkt') on first use
    words = nltk.word_tokenize(text)
    return words

# Create TF-IDF of texts
tfidf = TfidfVectorizer(tokenizer=tokenizer, stop_words='english', max_features=max_features)
sparse_tfidf_texts = tfidf.fit_transform(texts)
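As an optional aside (not part of the original recipe), you can sanity-check the fitted vectorizer at this point by looking at the shape of the sparse matrix and a few entries of its learned vocabulary:
# Optional sanity check of the TF-IDF vectorizer (illustrative only)
print(sparse_tfidf_texts.shape)             # (number of texts, max_features)
print(list(tfidf.vocabulary_.items())[:5])  # a few (word, column index) pairs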
6. Next we break up our dataset into a train set and a test set. Use the following code:
train_indices = np.random.choice(sparse_tfidf_texts.shape[0], round(0.8*sparse_tfidf_texts.shape[0]), replace=False)
test_indices = np.array(list(set(range(sparse_tfidf_texts.shape[0])) - set(train_indices)))
texts_train = sparse_tfidf_texts[train_indices]
texts_test = sparse_tfidf_texts[test_indices]
target_train = np.array([x for ix, x in enumerate(target) if ix in train_indices])
target_test = np.array([x for ix, x in enumerate(target) if ix in test_indices])
7. Now we can declare our model variables for logistic regression and our data placeholders:
A = tf.Variable(tf.random_normal(shape=[max_features,1]))
b = tf.Variable(tf.random_normal(shape=[1,1]))
# Initialize placeholders
x_data = tf.placeholder(shape=[None, max_features], dtype=tf.float32)
y_target = tf.placeholder(shape=[None, 1], dtype=tf.float32)
8. We can now declare the model operations and the loss function. Remember that the sigmoid part of the logistic regression is inside our loss function. Use the following code:
model_output = tf.add(tf.matmul(x_data, A), b)
loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=model_output, labels=y_target))
9. We add a prediction and accuracy function to the graph so that we can see the accuracy of the train and test sets as our model is training:
prediction = tf.round(tf.sigmoid(model_output))
predictions_correct = tf.cast(tf.equal(prediction, y_target), tf.float32)
accuracy = tf.reduce_mean(predictions_correct)
10. We declare an optimizer and initialize our graph variables next:
my_opt = tf.train.GradientDescentOptimizer(0.0025)
train_step = my_opt.minimize(loss)
# Initialize Variables
init = tf.global_variables_initializer()
sess.run(init)
11. We now train our model over 10,000 generations, recording the train/test loss and accuracy every 100 generations and printing the status every 500 generations. Use the following code:
train_loss = []
test_loss = []
train_acc = []
test_acc = []
i_data = []
for i in range(10000):
    rand_index = np.random.choice(texts_train.shape[0], size=batch_size)
    rand_x = texts_train[rand_index].todense()
    rand_y = np.transpose([target_train[rand_index]])
    sess.run(train_step, feed_dict={x_data: rand_x, y_target: rand_y})
    # Only record loss and accuracy every 100 generations
    if (i+1)%100==0:
        i_data.append(i+1)
        train_loss_temp = sess.run(loss, feed_dict={x_data: rand_x, y_target: rand_y})
        train_loss.append(train_loss_temp)
        test_loss_temp = sess.run(loss, feed_dict={x_data: texts_test.todense(), y_target: np.transpose([target_test])})
        test_loss.append(test_loss_temp)
        train_acc_temp = sess.run(accuracy, feed_dict={x_data: rand_x, y_target: rand_y})
        train_acc.append(train_acc_temp)
        test_acc_temp = sess.run(accuracy, feed_dict={x_data: texts_test.todense(), y_target: np.transpose([target_test])})
        test_acc.append(test_acc_temp)
    if (i+1)%500==0:
        acc_and_loss = [i+1, train_loss_temp, test_loss_temp, train_acc_temp, test_acc_temp]
        acc_and_loss = [np.round(x,2) for x in acc_and_loss]
        print('Generation # {}. Train Loss (Test Loss): {:.2f} ({:.2f}). Train Acc (Test Acc): {:.2f} ({:.2f})'.format(*acc_and_loss))
12. This results in the following output:
Generation # 500. Train Loss (Test Loss): 0.69 (0.73). Train Acc (Test Acc): 0.62 (0.57)
Generation # 1000. Train Loss (Test Loss): 0.62 (0.63). Train Acc (Test Acc): 0.68 (0.66)
...
Generation # 9500. Train Loss (Test Loss): 0.39 (0.45). Train Acc (Test Acc): 0.89 (0.85)
Generation # 10000. Train Loss (Test Loss): 0.48 (0.45). Train Acc (Test Acc): 0.84 (0.85)
13. Finally, we plot the loss and accuracy for both the train and test sets; a minimal matplotlib sketch, using the lists we recorded during training, might look like the following:
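# Plot loss over time (line styles and titles are illustrative)
plt.plot(i_data, train_loss, 'k-', label='Train Loss')
plt.plot(i_data, test_loss, 'r--', label='Test Loss', linewidth=4)
plt.title('Cross Entropy Loss per Generation')
plt.xlabel('Generation')
plt.ylabel('Cross Entropy Loss')
plt.legend(loc='upper right')
plt.show()

# Plot train and test accuracy
plt.plot(i_data, train_acc, 'k-', label='Train Set Accuracy')
plt.plot(i_data, test_acc, 'r--', label='Test Set Accuracy', linewidth=4)
plt.title('Train and Test Accuracy')
plt.xlabel('Generation')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.show()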
Figure 2: Cross entropy loss for our logistic spam model built off of TF-IDF values.
Figure 3: Train and test set accuracy for the logistic spam model built off TF-IDF values.