Training - How to build an NMT system
Let's first dive into the heart of building an NMT model with concrete code snippets, through which we will explain Figure 2 in more detail. We defer data preparation and the full code to later. This part refers to the file model.py.
At the bottom layer, the encoder and decoder RNNs receive as input the following: first, the source sentence, then a boundary marker "<s>" which indicates the transition from the encoding to the decoding mode, and the target sentence. For _training_, we will feed the system with the following tensors, which are in time-major format and contain word indices:
- encoder_inputs [max_encoder_time, batch_size]: source input words.
- decoder_inputs [max_decoder_time, batch_size]: target input words.
- decoder_outputs [max_decoder_time, batch_size]: target output words; these are the decoder inputs shifted to the left by one time step with an end-of-sentence tag appended on the right (a toy example follows below).
Here, for efficiency, we train with multiple sentences (batch_size) at once. Testing is slightly different, so we will discuss it later.
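As a toy illustration (with hypothetical word indices, not taken from the tutorial's data), the target sentence "Je suis étudiant" from Figure 2 would give, with batch_size = 1:

# Hypothetical vocabulary ids: <s>=1, </s>=2, Je=4, suis=5, étudiant=6
decoder_inputs  = [[1], [4], [5], [6]]   # "<s> Je suis étudiant", shape [max_decoder_time, batch_size]
decoder_outputs = [[4], [5], [6], [2]]   # "Je suis étudiant </s>": inputs shifted left, end tag appended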
Embedding
Given the categorical nature of words, the model must first look up the source and target embeddings to retrieve the corresponding word representations. For this _embedding_ layer to work, a vocabulary is first chosen for each language. Usually, a vocabulary size V is selected, and only the V most frequent words are treated as unique. All other words are converted to an "unknown" token and all get the same embedding. The embedding weights, one set per language, are usually learned during training.
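As a rough sketch of that vocabulary step (not part of model.py; build_vocab and its arguments are illustrative only), the mapping from words to the integer ids used below might look like this:

import collections

def build_vocab(sentences, vocab_size):
    # Count word frequencies and keep the (vocab_size - 1) most frequent words;
    # every other word maps to the shared "<unk>" id 0.
    # A real vocabulary would also reserve ids for "<s>" and "</s>".
    counter = collections.Counter(w for s in sentences for w in s.split())
    words = ["<unk>"] + [w for w, _ in counter.most_common(vocab_size - 1)]
    return {word: idx for idx, word in enumerate(words)}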
# Embedding
embedding_encoder = variable_scope.get_variable(
    "embedding_encoder", [src_vocab_size, embedding_size], ...)
# Look up embedding:
#   encoder_inputs: [max_time, batch_size]
#   encoder_emb_inp: [max_time, batch_size, embedding_size]
encoder_emb_inp = embedding_ops.embedding_lookup(
    embedding_encoder, encoder_inputs)
Similarly, we can build _embedding_decoder_ and _decoder_emb_inp_. Note that one can choose to initialize embedding weights with pretrained word representations such as word2vec or GloVe vectors. In general, given a large amount of training data, we can learn these embeddings from scratch.
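A minimal sketch of such a pretrained initialization (assuming the vectors have already been loaded into a NumPy array pretrained_emb of shape [src_vocab_size, embedding_size]; the file name is hypothetical):

import numpy as np

pretrained_emb = np.load("glove_src.npy")  # hypothetical pretrained matrix
embedding_encoder = tf.get_variable(
    "embedding_encoder",
    initializer=tf.constant(pretrained_emb, dtype=tf.float32),
    trainable=True)  # keep fine-tuning the vectors during training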
Encoder
Once retrieved, the word embeddings are then fed as input into the main network, which consists of two multi-layer RNNs - an encoder for the source language and a decoder for the target language. These two RNNs, in principle, can share the same weights; however, in practice, we often use two different sets of RNN parameters (such models do a better job when fitting large training datasets). The encoder RNN uses zero vectors as its starting states and is built as follows:
# Build RNN cell
encoder_cell = tf.nn.rnn_cell.BasicLSTMCell(num_units)
# Run Dynamic RNN
#   encoder_outputs: [max_time, batch_size, num_units]
#   encoder_state: [batch_size, num_units]
encoder_outputs, encoder_state = tf.nn.dynamic_rnn(
    encoder_cell, encoder_emb_inp,
    sequence_length=source_sequence_length, time_major=True)
Note that sentences have different lengths; to avoid wasting computation, we tell dynamic_rnn the exact source sentence lengths through _source_sequence_length_. Since our input is time-major, we set time_major=True. Here, we build only a single-layer LSTM, _encoder_cell_. We'll describe how to build multi-layer LSTMs, add dropout, and use attention in a later section.
Decoder
The _decoder_ also needs to have access to the source information, and one simple way to achieve that is to initialize it with the last hidden state of the encoder, _encoder_state_. In Figure 2, we pass the hidden state at the source word "student" to the decoder side.
# Build RNN cell
decoder_cell = tf.nn.rnn_cell.BasicLSTMCell(num_units)
# Helper
helper = tf.contrib.seq2seq.TrainingHelper(
    decoder_emb_inp, decoder_lengths, time_major=True)
# Decoder
decoder = tf.contrib.seq2seq.BasicDecoder(
    decoder_cell, helper, encoder_state,
    output_layer=projection_layer)
# Dynamic decoding
outputs, _ = tf.contrib.seq2seq.dynamic_decode(decoder, ...)
logits = outputs.rnn_output
Here, the core part of this code is the _BasicDecoder_ object, _decoder_, which receives _decoder_cell_ (similar to _encoder_cell_), a _helper_, and the previous _encoder_state_ as inputs. By separating out decoders and helpers, we can reuse different codebases, e.g., _TrainingHelper_ can be substituted with _GreedyEmbeddingHelper_ to do greedy decoding. See more in helper.py.
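As a sketch of that substitution at inference time (tgt_sos_id and tgt_eos_id are assumed to be the vocabulary ids of "<s>" and "</s>"; they are not defined in the snippets above):

# Greedy decoding helper: feed back the argmax of the previous step
inference_helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(
    embedding_decoder,
    tf.fill([batch_size], tgt_sos_id),  # start every sentence with "<s>"
    tgt_eos_id)                         # stop once "</s>" is emitted
inference_decoder = tf.contrib.seq2seq.BasicDecoder(
    decoder_cell, inference_helper, encoder_state,
    output_layer=projection_layer)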
Lastly, we haven't mentioned _projection_layer_, which is a dense matrix to turn the top hidden states into logit vectors of dimension V. We illustrate this process at the top of Figure 2.
# from tensorflow.python.layers import core as layers_core
projection_layer = layers_core.Dense(
    tgt_vocab_size, use_bias=False)
Loss
Given the logits above, we are now ready to compute our training loss:
crossent = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=decoder_outputs, logits=logits)
train_loss = (tf.reduce_sum(crossent * target_weights) / batch_size)
Here, _target_weights_ is a zero-one matrix of the same size as _decoder_outputs_. It masks padding positions outside of the target sequence lengths with values of 0.
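One way such a mask could be built (a sketch, assuming decoder_lengths holds the true target length of each sentence and max_decoder_time is the padded length):

target_weights = tf.sequence_mask(
    decoder_lengths, max_decoder_time, dtype=logits.dtype)
# sequence_mask returns [batch_size, max_decoder_time]; transpose to the
# time-major layout used by decoder_outputs.
target_weights = tf.transpose(target_weights)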
Important note: It's worth pointing out that we divide the loss by batch_size, so our hyperparameters are "invariant" to batch_size. Some people divide the loss by (batch_size * num_time_steps), which plays down the errors made on short sentences. More subtly, our hyperparameters (applied to the former way) can't be used for the latter way. For example, if both approaches use SGD with a learning rate of 1.0, the latter approach effectively uses a much smaller learning rate of 1 / num_time_steps.
Gradient computation & optimization
We have now defined the forward pass of our NMT model. Computing the backpropagation pass is just a matter of a few lines of code:
# Calculate and clip gradients
params = tf.trainable_variables()
gradients = tf.gradients(train_loss, params)
clipped_gradients, _ = tf.clip_by_global_norm(
    gradients, max_gradient_norm)
One of the important steps in training RNNs is gradient clipping. Here, we clip by the global norm. The max value, _max_gradient_norm_, is often set to a value like 5 or 1. The last step is selecting the optimizer. The Adam optimizer is a common choice. We also select a learning rate. The value of _learning_rate_ is usually in the range 0.0001 to 0.001 and can be decreased as training progresses.
# Optimization
optimizer = tf.train.AdamOptimizer(learning_rate)
update_step = optimizer.apply_gradients(
    zip(clipped_gradients, params))
In our own experiments, we use standard SGD (tf.train.GradientDescentOptimizer) with a decreasing learning rate schedule, which yields better performance. See the benchmarks.
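A minimal sketch of that alternative (decay_steps and decay_factor are assumed hyperparameters, not values prescribed by this tutorial):

global_step = tf.Variable(0, trainable=False)
learning_rate = tf.train.exponential_decay(
    1.0, global_step, decay_steps, decay_factor, staircase=True)
optimizer = tf.train.GradientDescentOptimizer(learning_rate)
update_step = optimizer.apply_gradients(
    zip(clipped_gradients, params), global_step=global_step)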