Training an RNN Language Model Using TensorFlow
Language Model
One of the main jobs of a language model is to predict the most likely next word given the words that came before it. For example, given "The fat cat sat on the", we judge "mat" to be a more likely next word than "hat", because a cat is more likely to sit on a mat than on a hat.
In natural language processing, this can be described with a statistical probability model. Take "The fat cat sat on the mat" as an example. We can estimate from counts the probability P("The") of the first word "The", the conditional probability P("fat"|"The") that "fat" follows "The", and from them the joint probability of "The fat" occurring together:
P("The fat") = P("The")P("fat"|"The")
This joint probability is the plausibility of "The fat", i.e., whether the phrase conforms to natural language; put plainly, whether it sounds like something a person would actually say. Likewise, by the chain rule, the joint probability of "The fat cat" can be computed:
P("The fat cat")=P("The")P("fat|The")P("cat"|"The fat")
Therefore, the plausibility of the whole sentence "The fat cat sat on the mat" can be derived in the same way; the plausibility of the sentence is simply its probability. Formally:
P(S) = P(w1, w2, ..., wn) = P(w1) · P(w2|w1) · P(w3|w1, w2) ··· P(wn|w1, w2, w3, ..., wn−1)
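To make the chain rule concrete, here is a tiny sketch, using made-up conditional probabilities purely for illustration, of how a sentence probability is assembled from its factors:

# Toy illustration of the chain rule; the probabilities below are invented,
# a real language model would estimate them from a corpus.
cond_probs = {
    ('The',): 0.05,               # P("The")
    ('The', 'fat'): 0.01,         # P("fat" | "The")
    ('The', 'fat', 'cat'): 0.20,  # P("cat" | "The fat")
}

def sentence_prob(sentence, cond_probs):
    """Multiply P(w1) * P(w2|w1) * ... * P(wn|w1..wn-1)."""
    p = 1.0
    for i in range(len(sentence)):
        p *= cond_probs[tuple(sentence[:i + 1])]
    return p

print(sentence_prob(['The', 'fat', 'cat'], cond_probs))  # ≈ 1e-4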
Data Preparation
The official TensorFlow tutorial uses the PTB dataset prepared by Mikolov. Download and extract it:
http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz
A sample of the data is shown below. Uncommon words have been replaced with the <unk> token, and numbers with N:
we 're talking about years ago before anyone heard of asbestos having any questionable properties
there is no asbestos in our products now
neither <unk> nor the researchers who studied the workers were aware of any research on smokers of the kent cigarettes
we have no useful information on whether users are at risk said james a. <unk> of boston 's <unk> cancer institute
the total of N deaths from malignant <unk> lung cancer and <unk> was far higher than expected the researchers said
Read the data from the file, replace newlines with <eos>, and split into a list of words:
def _read_words(filename):
    with open(filename, 'r', encoding='utf-8') as f:
        return f.read().replace('\n', '<eos>').split()
f = _read_words('simple-examples/data/ptb.train.txt')
print(f[:20])
We get:
['aer', 'banknote', 'berlitz', 'calloway', 'centrust', 'cluett', 'fromstein', 'gitano', 'guterman', 'hydro-quebec', 'ipo', 'kia', 'memotec', 'mlx', 'nahb', 'punts', 'rake', 'regatta', 'rubens', 'sim']
Build the vocabulary, with mappings between words and ids:
from collections import Counter

def _build_vocab(filename):
    data = _read_words(filename)
    counter = Counter(data)
    count_pairs = sorted(counter.items(), key=lambda x: -x[1])

    words, _ = list(zip(*count_pairs))
    word_to_id = dict(zip(words, range(len(words))))
    return words, word_to_id
words, words_to_id = _build_vocab('simple-examples/data/ptb.train.txt')
print(words[:10])
print(list(map(lambda x: words_to_id[x], words[:10])))
Output:
('the', '<unk>', '<eos>', 'N', 'of', 'to', 'a', 'in', 'and', "'s")
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Convert an entire file to its id representation:
def _file_to_word_ids(filename, word_to_id):
    data = _read_words(filename)
    return [word_to_id[x] for x in data if x in word_to_id]
words_in_file = _file_to_word_ids('simple-examples/data/ptb.train.txt', words_to_id)
print(words_in_file[:20])
The vocabulary is sorted by word frequency; because the first sentence is not made up of common English words, its ids fall near the end of the vocabulary.
[9980, 9988, 9981, 9989, 9970, 9998, 9971, 9979, 9992, 9997, 9982, 9972, 9993, 9991, 9978, 9983, 9974, 9986, 9999, 9990]
Convert a sentence from a list of ids back into words:
def to_words(sentence, words):
    return list(map(lambda x: words[x], sentence))
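As a quick sanity check (hypothetical usage, not part of the original walkthrough), converting the first few ids of the training file back to words should recover the tokens shown earlier:

print(to_words(words_in_file[:5], words))
# ['aer', 'banknote', 'berlitz', 'calloway', 'centrust']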
Putting the above functions together:
import os

def ptb_raw_data(data_path=None):
    train_path = os.path.join(data_path, 'ptb.train.txt')
    valid_path = os.path.join(data_path, 'ptb.valid.txt')
    test_path = os.path.join(data_path, 'ptb.test.txt')

    words, word_to_id = _build_vocab(train_path)
    train_data = _file_to_word_ids(train_path, word_to_id)
    valid_data = _file_to_word_ids(valid_path, word_to_id)
    test_data = _file_to_word_ids(test_path, word_to_id)
    return train_data, valid_data, test_data, words, word_to_id
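A quick check of the vocabulary size (hypothetical usage; the PTB vocabulary is capped at 10,000 words):

_, _, _, words, word_to_id = ptb_raw_data('simple-examples/data')
print(len(words))  # 10000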
The code so far is fairly similar to the official example. The processing from here on differs substantially from the official code and mainly follows the approach of the Keras text-processing examples:
import numpy as np

def ptb_producer(raw_data, batch_size=64, num_steps=20, stride=1):
    data_len = len(raw_data)

    sentences = []
    next_words = []
    for i in range(0, data_len - num_steps, stride):
        sentences.append(raw_data[i:(i + num_steps)])
        next_words.append(raw_data[i + num_steps])

    sentences = np.array(sentences)
    next_words = np.array(next_words)

    batch_len = len(sentences) // batch_size
    x = np.reshape(sentences[:(batch_len * batch_size)],
                   [batch_len, batch_size, -1])
    y = np.reshape(next_words[:(batch_len * batch_size)],
                   [batch_len, batch_size])
    return x, y
Parameters:
• raw_data: the data produced by the ptb_raw_data() function
• batch_size: the network is trained with stochastic gradient descent, so the data is emitted in batches; this is the size of each batch
• num_steps: the length of each sentence, corresponding to the n described earlier; in recurrent neural networks this is also called the number of time steps
• stride: the step between consecutive samples, which determines how much data is generated
What the code does:
This function turns a flat list of ids into batched data of shape [batch_len, batch_size, num_steps].
First, each window of num_steps words becomes one sentence, the input x, and the single word that follows those num_steps words becomes its prediction target y. This turns the raw data into batch_len * batch_size pairs of x and y, much like a classification problem of predicting y from x.
To suit stochastic gradient descent, the pairs then have to be grouped into small batches, and one batch at a time is fed to TensorFlow to update the weights; the data therefore ends up with the shape [batch_len, batch_size, num_steps].
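A minimal toy sketch, using a made-up id list instead of the PTB data, illustrates the resulting shapes:

toy_data = list(range(10))  # 10 token ids: 0..9
x_toy, y_toy = ptb_producer(toy_data, batch_size=2, num_steps=3, stride=1)
# 7 (sentence, next-word) pairs are produced; only 6 of them fill 3 full batches
# of size 2, so the last pair is dropped.
print(x_toy.shape)  # (3, 2, 3) -> [batch_len, batch_size, num_steps]
print(y_toy.shape)  # (3, 2)    -> [batch_len, batch_size]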
Print some of the data:
train_data, valid_data, test_data, words, word_to_id = ptb_raw_data('simple-examples/data')
x_train, y_train = ptb_producer(train_data)
print(x_train.shape)
print(y_train.shape)
Output:
(14524, 64, 20)
(14524, 64)
As shown, we get 14524 batches of data, and the training inputs of each batch have shape [64, 20].
print(' '.join(to_words(x_train[100, 3], words)))
The 3rd sentence of the 100th batch reads:
despite steady sales growth <eos> magna recently cut its quarterly dividend in half and the company 's class a shares
print(words[np.argmax(y_train[100, 3])])
Its next word is:
the
Building the Model
Configuration
class LMConfig(object):
    """Configuration for the language model"""
    batch_size = 64        # size of each batch of data
    num_steps = 20         # length of each sentence
    stride = 3             # stride used when sampling the data
    # vocab_size is filled in later from the data (len(words))

    embedding_dim = 64     # dimension of the word embeddings
    hidden_dim = 128       # dimension of the RNN hidden layer
    num_layers = 2         # number of RNN layers
    rnn_model = 'gru'      # cell type used by PTBModel: 'lstm' or 'gru'

    learning_rate = 0.05   # learning rate
    dropout = 0.2          # dropout rate applied after each layer
Reading the Input
This lets the model read the data batch by batch.
class PTBInput(object):
    """Read the data in batches"""
    def __init__(self, config, data):
        self.batch_size = config.batch_size
        self.num_steps = config.num_steps
        self.vocab_size = config.vocab_size  # vocabulary size

        self.input_data, self.targets = ptb_producer(data,
            self.batch_size, self.num_steps)

        self.batch_len = self.input_data.shape[0]  # total number of batches
        self.cur_batch = 0  # index of the current batch

    def next_batch(self):
        """Return the next batch"""
        x = self.input_data[self.cur_batch]
        y = self.targets[self.cur_batch]

        # convert y to one-hot encoding
        y_ = np.zeros((y.shape[0], self.vocab_size), dtype=np.bool)
        for i in range(y.shape[0]):
            y_[i][y[i]] = 1

        # wrap around to the first batch after the last one
        self.cur_batch = (self.cur_batch + 1) % self.batch_len

        return x, y_
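A small usage sketch, assuming vocab_size has already been filled in from the data, as run_epoch does later:

config = LMConfig()
config.vocab_size = len(words)  # must be set before building PTBInput
input_train = PTBInput(config, train_data)
x_batch, y_batch = input_train.next_batch()
print(x_batch.shape)  # (64, 20) -> [batch_size, num_steps]
print(y_batch.shape)  # (64, 10000) -> one-hot targets over the vocabulary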
Model
import tensorflow as tf

class PTBModel(object):
    def __init__(self, config, is_training=True):
        self.num_steps = config.num_steps
        self.vocab_size = config.vocab_size

        self.embedding_dim = config.embedding_dim
        self.hidden_dim = config.hidden_dim
        self.num_layers = config.num_layers
        self.rnn_model = config.rnn_model

        self.learning_rate = config.learning_rate
        self.dropout = config.dropout

        self.placeholders()  # input placeholders
        self.rnn()           # build the RNN model
        self.cost()          # cost function
        self.optimize()      # optimizer
        self.error()         # error rate

    def placeholders(self):
        """Placeholders for the input data"""
        self._inputs = tf.placeholder(tf.int32, [None, self.num_steps])
        self._targets = tf.placeholder(tf.int32, [None, self.vocab_size])

    def input_embedding(self):
        """Map the inputs to their word-vector representation"""
        with tf.device("/cpu:0"):
            embedding = tf.get_variable(
                "embedding", [self.vocab_size,
                    self.embedding_dim], dtype=tf.float32)
            _inputs = tf.nn.embedding_lookup(embedding, self._inputs)
        return _inputs
    def rnn(self):
        """Build the RNN model"""
        def lstm_cell():  # basic LSTM cell
            return tf.contrib.rnn.BasicLSTMCell(self.hidden_dim,
                state_is_tuple=True)

        def gru_cell():   # GRU cell, faster to train
            return tf.contrib.rnn.GRUCell(self.hidden_dim)

        def dropout_cell():  # add dropout after each cell
            if self.rnn_model == 'lstm':
                cell = lstm_cell()
            else:
                cell = gru_cell()
            return tf.contrib.rnn.DropoutWrapper(cell,
                output_keep_prob=1.0 - self.dropout)  # keep probability = 1 - dropout rate

        cells = [dropout_cell() for _ in range(self.num_layers)]
        cell = tf.contrib.rnn.MultiRNNCell(cells, state_is_tuple=True)  # multi-layer RNN

        _inputs = self.input_embedding()
        _outputs, _ = tf.nn.dynamic_rnn(cell=cell,
            inputs=_inputs, dtype=tf.float32)

        # _outputs has shape [batch_size, num_steps, hidden_dim]
        last = _outputs[:, -1, :]  # only the last output is needed

        # dense + softmax layers classify over the vocabulary, giving word probabilities
        logits = tf.layers.dense(inputs=last, units=self.vocab_size)
        prediction = tf.nn.softmax(logits)

        self._logits = logits
        self._pred = prediction
    def cost(self):
        """Cross-entropy cost function"""
        cross_entropy = tf.nn.softmax_cross_entropy_with_logits(
            logits=self._logits, labels=self._targets)
        cost = tf.reduce_mean(cross_entropy)
        self.cost = cost

    def optimize(self):
        """Use the Adam optimizer"""
        optimizer = tf.train.AdamOptimizer(learning_rate=self.learning_rate)
        self.optim = optimizer.minimize(self.cost)

    def error(self):
        """Compute the error rate"""
        mistakes = tf.not_equal(
            tf.argmax(self._targets, 1), tf.argmax(self._pred, 1))
        self.errors = tf.reduce_mean(tf.cast(mistakes, tf.float32))
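Before writing the training loop, a hypothetical smoke test of the graph construction could look like this (10000 matches the PTB vocabulary built earlier):

config = LMConfig()
config.vocab_size = 10000  # normally set from len(words)
model = PTBModel(config)
print(model._pred)  # Tensor of shape (?, 10000): a distribution over the vocabulary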
Training
def run_epoch(num_epochs=10):
    config = LMConfig()  # load the configuration

    # load the raw data; only the training set is needed here
    train_data, _, _, words, word_to_id = \
        ptb_raw_data('simple-examples/data')
    config.vocab_size = len(words)

    # split the data into batches
    input_train = PTBInput(config, train_data)
    batch_len = input_train.batch_len

    # build the model
    model = PTBModel(config)

    # create the session and initialize the variables
    sess = tf.Session()
    sess.run(tf.global_variables_initializer())

    print('Start training...')
    for epoch in range(num_epochs):  # training epochs
        for i in range(batch_len):   # number of batches per epoch
            x_batch, y_batch = input_train.next_batch()
            # fetch one batch of data and run the optimizer
            feed_dict = {model._inputs: x_batch, model._targets: y_batch}
            sess.run(model.optim, feed_dict=feed_dict)

            # every 500 batches, print intermediate results
            if i % 500 == 0:
                cost = sess.run(model.cost, feed_dict=feed_dict)

                msg = "Epoch: {0:>3}, batch: {1:>6}, Loss: {2:>6.3}"
                print(msg.format(epoch + 1, i + 1, cost))

                # print some of the predictions
                pred = sess.run(model._pred, feed_dict=feed_dict)
                word_ids = sess.run(tf.argmax(pred, 1))
                print('Predicted:', ' '.join(words[w] for w in word_ids))

                true_ids = np.argmax(y_batch, 1)
                print('True:', ' '.join(words[w] for w in true_ids))

    print('Finish training...')
    sess.close()
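A hypothetical entry point to launch the training:

if __name__ == '__main__':
    run_epoch(num_epochs=10)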
It takes quite a few rounds of training before the model produces reasonably sensible results.
[0] http://gaussic.github.io/2017/08/24/tensorflow-language-model/