RNN TensorFlow Tutorial

MM 0523/2018


Introduction

RNNs and LSTMs are briefly described in [1].

Language Modeling

This tutorial presents an RNN applied to the task of language modeling.

The goal of the problem is to fit a probabilistic model which assigns probabilities to sentences.

It does so by predicting the next word in a text given the history of previous words.

Penn Tree Bank (PTB) dataset

The Penn Tree Bank (PTB) dataset is a popular benchmark for measuring the quality of language models. The PTB dataset is small and relatively fast to train on, and it is the dataset used in this example.

Language modeling is key to speech recognition, machine translation, and image captioning. See [2] for more background.

The results from the paper "Recurrent Neural Network Regularization" [3] are reproduced in this tutorial.


Tutorial Files

The tutorial references the following files from

models/tutorials/rnn/ptb in the TensorFlow models repo [6]: https://github.com/tensorflow/models

File            Purpose
ptb_word_lm.py  Train a language model on the PTB dataset
reader.py       Read the dataset

Download and Prepare the Data

The data required for this tutorial is in the data/ directory of the PTB dataset from Tomas Mikolov's webpage.

The dataset is already pre-processed and contains 10,000 different words overall, including the end-of-sentence marker and a special symbol (<unk>) for rare words.

In reader.py, each word is converted into a unique integer identifier for the neural network.
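
As a rough illustration of that conversion, here is a minimal sketch (not the actual reader.py code; build_vocab is a made-up name) in which more frequent words get smaller IDs and each word is then replaced by its ID:

import collections

def build_vocab(words):
    # Count word frequencies and give smaller IDs to more frequent words
    # (ties broken alphabetically), similar in spirit to reader.py.
    counter = collections.Counter(words)
    sorted_words = sorted(counter, key=lambda w: (-counter[w], w))
    return {word: i for i, word in enumerate(sorted_words)}

words = "the fox jumped over the lazy dog <eos>".split()
word_to_id = build_vocab(words)
word_ids = [word_to_id[w] for w in words]  # [0, 3, 4, 6, 0, 5, 2, 1]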


The Model

LSTM

The core of the model consists of an LSTM cell that processes one word at a time and computes probabilities of the possible values for the next word in the sentence. The memory state of the network is initialized with a vector of zeros and gets updated after reading each word. For computational reasons, the data is processed in mini-batches of size batch_size.

Note that current_batch_of_words does not correspond to a "sentence" of words.

Every word in a batch should correspond to a time t. TensorFlow will automatically sum the gradients of each batch.

For example:

 t=0  t=1    t=2  t=3     t=4
[The, brown, fox, is,     quick]
[The, red,   fox, jumped, high]

words_in_dataset[0] = [The, The]
words_in_dataset[1] = [brown, red]
words_in_dataset[2] = [fox, fox]
words_in_dataset[3] = [is, jumped]
words_in_dataset[4] = [quick, high]
batch_size = 2, time_steps = 5
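
As a side note, the following NumPy sketch (illustrative only, not the actual batching code in reader.py) shows how a flat stream of word IDs could be cut into [batch_size, num_steps] input/target batches, where the targets are simply the inputs shifted by one word:

import numpy as np

def make_batches(word_ids, batch_size, num_steps):
    # Illustrative only: trim the ID stream so it divides evenly into rows,
    # reshape to [batch_size, batch_len], then slice windows of num_steps.
    data = np.array(word_ids, dtype=np.int32)
    batch_len = len(data) // batch_size
    data = data[:batch_size * batch_len].reshape(batch_size, batch_len)
    for i in range((batch_len - 1) // num_steps):
        x = data[:, i * num_steps:(i + 1) * num_steps]
        # Targets are the inputs shifted one word to the right.
        y = data[:, i * num_steps + 1:(i + 1) * num_steps + 1]
        yield x, y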

The basic pseudocode is as follows:

words_in_dataset = tf.placeholder(tf.float32, [time_steps, batch_size, num_features])
lstm = tf.contrib.rnn.BasicLSTMCell(lstm_size)
# Initial state of the LSTM memory.
hidden_state = tf.zeros([batch_size, lstm.state_size])
current_state = tf.zeros([batch_size, lstm.state_size])
state = hidden_state, current_state
probabilities = []
loss = 0.0
for current_batch_of_words in words_in_dataset:
    # The value of state is updated after processing each batch of words.
    output, state = lstm(current_batch_of_words, state)

    # The LSTM output can be used to make next word predictions
    logits = tf.matmul(output, softmax_w) + softmax_b
    probabilities.append(tf.nn.softmax(logits))
    loss += loss_function(probabilities, target_words)

Truncated Backpropagation

In order to make the learning process tractable, it is common practice to create an "unrolled" version of the network, which contains a fixed number (num_steps) of LSTM inputs and outputs. The model is then trained on this finite approximation of the RNN. This can be implemented by feeding inputs of length num_steps at a time and performing a backward pass after each such input block.

Following is a simplified block of code for creating a graph which performs truncated backpropagation:

# Placeholder for the inputs in a given iteration.
words = tf.placeholder(tf.int32, [batch_size, num_steps])

lstm = tf.contrib.rnn.BasicLSTMCell(lstm_size)
# Initial state of the LSTM memory.
initial_state = state = tf.zeros([batch_size, lstm.state_size])

for i in range(num_steps):
    # The value of state is updated after processing each batch of words.
    output, state = lstm(words[:, i], state)

    # The rest of the code.
    # ...

final_state = state

And this is how to implement an iteration over the whole dataset:

# A numpy array holding the state of LSTM after each batch of words.
numpy_state = initial_state.eval()
total_loss = 0.0
for current_batch_of_words in words_in_dataset:
    numpy_state, current_loss = session.run([final_state, loss],
        # Initialize the LSTM state from the previous iteration.
        feed_dict={initial_state: numpy_state, words: current_batch_of_words})
    total_loss += current_loss

Inputs

The word IDs will be embedded into a dense representation before being fed to the LSTM [4].

This allows the model to efficiently represent the knowledge about particular words.

# embedding_matrix is a tensor of shape [vocabulary_size, embedding_size]
word_embeddings = tf.nn.embedding_lookup(embedding_matrix, word_ids)

The embedding matrix will be initialized randomly and the model will learn to differentiate the meaning of words.
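
For concreteness, here is a hedged sketch of how such an embedding matrix might be created as a trainable variable (the sizes and placeholder shapes below are illustrative, not the tutorial's exact configuration):

import tensorflow as tf

vocabulary_size = 10000  # PTB vocabulary size
embedding_size = 200     # illustrative

# A trainable embedding matrix, initialized randomly and learned jointly
# with the rest of the model.
embedding_matrix = tf.get_variable(
    "embedding", [vocabulary_size, embedding_size], dtype=tf.float32)

# word_ids: [batch_size, num_steps] integer IDs;
# word_embeddings: [batch_size, num_steps, embedding_size].
word_ids = tf.placeholder(tf.int32, [None, None])
word_embeddings = tf.nn.embedding_lookup(embedding_matrix, word_ids)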

Loss Function

The average negative log probability (cross entropy) of the target words should be minimized:

L = -\frac{1}{N} \sum_{i=1}^{N} \ln p_{\textrm{target},i}

In the tutorial code, this loss is computed with the function sequence_loss_by_example.

The typical measure is average per-word perplexity, e^{L} = e^{-\frac{1}{N} \sum_{i=1}^{N} \ln p_{\textrm{target},i}}.

Perplexity is monitored throughout the training process.
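
The tutorial code relies on sequence_loss_by_example for this; as a minimal, hedged sketch, the same loss and perplexity can also be written directly with TensorFlow's cross-entropy op (shapes and placeholder names below are illustrative):

import tensorflow as tf

# logits:  [batch_size, num_steps, vocabulary_size] unnormalized scores
# targets: [batch_size, num_steps] integer IDs of the next words
logits = tf.placeholder(tf.float32, [None, None, 10000])
targets = tf.placeholder(tf.int32, [None, None])

# Average negative log probability (cross entropy) of the target words.
cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=targets, logits=logits)
loss = tf.reduce_mean(cross_entropy)

# Average per-word perplexity is the exponential of the average loss.
perplexity = tf.exp(loss)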

Stacking Multiple LSTMs

Multiple LSTM layers are stacked to process the data. The output of the first layer becomes the input of the second, and so on.

The class MultiRNNCell makes the implementation seamless:

def lstm_cell():
  return tf.contrib.rnn.BasicLSTMCell(lstm_size)
stacked_lstm = tf.contrib.rnn.MultiRNNCell(
    [lstm_cell() for _ in range(number_of_layers)])

initial_state = state = stacked_lstm.zero_state(batch_size, tf.float32)
for i in range(num_steps):
    # The value of state is updated after processing each batch of words.
    output, state = stacked_lstm(words[:, i], state)

    # The rest of the code.
    # ...

final_state = state
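
Note that lstm_cell is defined as a function that returns a fresh BasicLSTMCell on each call, so every layer in the stack gets its own cell instance with its own parameters rather than sharing a single cell object.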

Run the Code

Download the PTB dataset first, then extract it underneath your home directory as follows:

tar xvfz simple-examples.tgz -C $HOME

(Note: you may need to use other tools to unpack a tar file on Windows OS [5].)

Clone the TensorFlow models repo [6] from GitHub. Run the following commands:

cd models/tutorials/rnn/ptb
python ptb_word_lm.py --data_path=$HOME/simple-examples/data/ --model=small

There are 3 supported model configurations in the tutorial code: "small", "medium" and "large". The difference between them is in the size of the LSTMs and the set of hyperparameters used for training.

The larger the model, the better results it should get. The small model should be able to reach perplexity below 120 on the test set and the large one below 80, though it might take several hours to train.

Further Investigation

There are several tricks to make the model better:

  • a decreasing learning-rate schedule
  • dropout between the LSTM layers

Users can study the code and modify it to improve the model; a rough sketch of both tricks follows.
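
The sketch below uses the same tf.contrib.rnn API as the snippets above; the keep probability, learning-rate values, and decay schedule are illustrative assumptions, not the tutorial's exact hyperparameters:

import tensorflow as tf

lstm_size = 200          # illustrative
number_of_layers = 2     # illustrative
keep_prob = 0.5          # illustrative dropout keep probability

def lstm_cell():
    return tf.contrib.rnn.BasicLSTMCell(lstm_size)

# Dropout between LSTM layers: wrap each cell so its output is dropped out
# before being fed to the next layer.
stacked_lstm = tf.contrib.rnn.MultiRNNCell(
    [tf.contrib.rnn.DropoutWrapper(lstm_cell(), output_keep_prob=keep_prob)
     for _ in range(number_of_layers)])

# Decreasing learning-rate schedule: keep the learning rate in a non-trainable
# variable and shrink it as training progresses.
learning_rate = tf.Variable(1.0, trainable=False)
new_lr = tf.placeholder(tf.float32, [])
lr_update = tf.assign(learning_rate, new_lr)
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
# During training, e.g. once per epoch:
#   session.run(lr_update, feed_dict={new_lr: current_lr * lr_decay})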

[0] https://www.tensorflow.org/tutorials/recurrent
[1] https://colah.github.io/posts/2015-08-Understanding-LSTMs/
[2] https://karpathy.github.io/2015/05/21/rnn-effectiveness/
[3] Recurrent Neural Network Regularization, https://arxiv.org/pdf/1409.2329.pdf
[4] https://www.tensorflow.org/tutorials/word2vec
[5] https://wiki.haskell.org/How_to_unpack_a_tar_file_in_Windows
[6] https://github.com/tensorflow/models
