Natural Language Processing
MM 0418/2018
Word embeddings and bag-of-words methods are introduced first.
Advanced word embeddings such as word2vec and doc2vec are then implemented.
The topics covered are: bag of words, implementing TF-IDF, skip-gram embeddings, CBOW embeddings, making predictions with word2vec, and using doc2vec for sentiment analysis.
TF-IDF stands for Term Frequency-Inverse Document Frequency (also written as Text Frequency-Inverse Document Frequency, the form used in these notes). It is essentially the product of the text frequency and the inverse document frequency for each word.
All the code for this chapter can be found online at https://github.com/nfmcclure/tensorflow_cookbook.
Introduction
There are many ways to convert text into numbers.
In the simplest approach, each word is assigned a number in the order in which it appears in the word sequence.
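As a rough illustration of this simplest conversion, the sketch below assigns each new word the next integer in order of first appearance. The helper names (build_vocab, encode), the example sentences, and the choice to reserve 0 for unknown words are assumptions made for illustration only.

```python
def build_vocab(sentences):
    """Map each word to an integer in the order it is first observed."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocab:
                # Index 0 is reserved for unknown words (an assumption of this sketch).
                vocab[word] = len(vocab) + 1
    return vocab


def encode(sentence, vocab):
    """Replace every word with its index; unseen words become 0."""
    return [vocab.get(word, 0) for word in sentence.lower().split()]


sentences = [
    "tensorflow makes machine learning easy",
    "machine learning is fun",
]
vocab = build_vocab(sentences)
print(encode(sentences[0], vocab))  # [1, 2, 3, 4, 5]
print(encode(sentences[1], vocab))  # [3, 4, 6, 7]
```

Note that the resulting numbers carry no meaning beyond identity: word 4 is not "larger" than word 2, which is one reason the methods that follow use count vectors or learned embeddings instead.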
Working with bag of words
Word embedding can be applied to spam prediction.
A spam-ham phone text database from the UCI machine learning data repository (https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection) is used in this example.
The spam-ham phone text database is a collection of phone text messages that are spam or not-spam (ham).
After downloading the spam-ham phone text database, we predict whether a text is spam or not with the bag-of-words method.
The bag-of-words model will be a logistic model with no hidden layers.
Stochastic training with a batch size of one is used, and the accuracy on a held-out test set is evaluated at the end.
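The setup described above can be sketched in plain Python, without TensorFlow, to show the moving parts: a count vector per text, a logistic model with no hidden layers, and updates on one example at a time. The tiny train and test sets, the helper names, and the learning rate and epoch count are made-up assumptions, not the UCI data or the settings used in the chapter.

```python
import math
import random
import re


def normalize(text):
    """Lowercase, drop everything except letters and whitespace, then split."""
    return re.sub(r"[^a-z\s]", "", text.lower()).split()


def bag_of_words(tokens, vocab):
    """Count vector: add one for every occurrence of a known word."""
    vec = [0.0] * len(vocab)
    for tok in tokens:
        if tok in vocab:
            vec[vocab[tok]] += 1.0
    return vec


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logit(x, w, b):
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_logistic(xs, ys, epochs=200, lr=0.1, seed=0):
    """Logistic model with no hidden layers, updated one example at a time."""
    rng = random.Random(seed)
    w, b = [0.0] * len(xs[0]), 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:  # batch size of one
            err = sigmoid(logit(xs[i], w, b)) - ys[i]  # gradient of the log loss
            w = [wj - lr * err * xj for wj, xj in zip(w, xs[i])]
            b -= lr * err
    return w, b


# Toy data invented for this sketch: 1 = spam, 0 = ham.
train = [
    ("Win a free prize now", 1),
    ("Free cash, call now!", 1),
    ("Are we still meeting for lunch", 0),
    ("See you at home tonight", 0),
]
test = [("free prize call now", 1), ("lunch at home", 0)]

# Build the vocabulary from the training texts only.
vocab = {}
for text, _ in train:
    for tok in normalize(text):
        vocab.setdefault(tok, len(vocab))

xs = [bag_of_words(normalize(text), vocab) for text, _ in train]
ys = [label for _, label in train]
w, b = train_logistic(xs, ys)

predictions = [
    1 if sigmoid(logit(bag_of_words(normalize(text), vocab), w, b)) >= 0.5 else 0
    for text, _ in test
]
accuracy = sum(p == label for p, (_, label) in zip(predictions, test)) / len(test)
print(predictions, accuracy)
```

The vocabulary is built from the training texts only, so words that appear solely in the held-out texts are ignored at prediction time.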
For this example, the overall flow is: getting the data -> normalizing and splitting the text -> running it through an embedding function -> training the logistic function to predict spam.
Implementing TF-IDF
Since we can choose the embedding for each word, we might decide to change the weighting on certain words.
One such strategy is to upweight useful words and downweight overly common or too rare words.
The bag of words methodology assigned a value of one for every occurrence of a word in a sentence.
This is probably not ideal, as each category of sentence (spam and ham in the prior recipe example) most likely has the same frequency of "the", "and", and other common words, whereas words such as "viagra" and "sale" probably should have increased importance in figuring out whether or not the text is spam.
We first take the word frequency into consideration, that is, the frequency with which a word occurs in an individual entry.
The purpose of this part (TF) is to find terms that appear to be important in each entry.
But words such as "the" and "and" may appear very frequently in every entry.
We want to down-weight the importance of these words, so we can imagine that multiplying the above text frequency (TF) by the inverse of the whole document frequency might help find important words.
But since a collection of texts (a corpus) may be quite large, it is common to take the logarithm of the inverse document frequency.
This leaves us with the following formula for TF-IDF for each word in each document entry:
w_{tf-idf} = w_{tf} \cdot \log(1 / w_{df})
Here w_{tf} is the word frequency by document, and w_{df} is the total frequency of such words across all documents.
We can imagine that high values of TF-IDF might indicate words that are very important to determine what a document is about.
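To make the weighting concrete, here is a minimal sketch of the product of text frequency and log inverse document frequency. It assumes that the text frequency is a word's count divided by the length of its entry, and that the document frequency is the fraction of entries containing the word, so that log(1 / df) is never negative. The function name tf_idf and the three example entries are invented for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: score} dict per entry, with score = tf * log(1 / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Number of entries in which each word appears at least once.
    doc_counts = Counter()
    for tokens in tokenized:
        doc_counts.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            # tf = count / entry length; log(1 / df) = log(n_docs / doc_count)
            word: (count / len(tokens)) * math.log(n_docs / doc_counts[word])
            for word, count in counts.items()
        })
    return scores


docs = ["the sale ends today", "the meeting is today", "the sale is on"]
scores = tf_idf(docs)
for word, score in sorted(scores[0].items(), key=lambda kv: -kv[1]):
    print(f"{word}: {score:.3f}")
```

In this toy corpus "the" occurs in every entry, so its inverse document frequency term is log(1) = 0 and its score vanishes, while "ends", which occurs in only one entry, receives the highest score in the first entry.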
Creating the TF-IDF vectors requires us to load all the text into memory and count the occurrence of each word before we can start training our model.
Because of this, it is not implemented fully in TensorFlow, so we will use scikit-learn for creating our TF-IDF embedding, but use TensorFlow to fit the logistic model.
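A minimal sketch of the scikit-learn half of this split is shown below, assuming scikit-learn is installed. The four example texts and the parameter values (stop_words="english", max_features=1000) are assumptions chosen for illustration, not the settings used in the chapter.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy texts invented for this sketch; the real input would be the SMS messages.
texts = [
    "Win a free prize now",
    "Free cash, call now!",
    "Are we still meeting for lunch",
    "See you at home tonight",
]

# fit_transform learns the vocabulary and document frequencies over the whole
# corpus, which is why all the text must be available before training starts.
vectorizer = TfidfVectorizer(stop_words="english", max_features=1000)
tfidf = vectorizer.fit_transform(texts)  # sparse matrix: one row per text

print(tfidf.shape)
print(sorted(vectorizer.vocabulary_))  # the words that were kept as features
```

Each row of the resulting matrix could then be fed to the logistic model in place of the bag-of-words counts. By default TfidfVectorizer smooths the inverse document frequency and L2-normalizes each row, so its values will not match the plain formula above exactly.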