Step 2
Goal: Build the dictionary and replace rare words with the UNK token.
Full code for Step 2 (Python):
```python
import collections

vocabulary_size = 50000

def build_dataset(words):
    count = [['UNK', -1]]
    count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
    dictionary = dict()
    for word, _ in count:
        dictionary[word] = len(dictionary)
    data = list()
    unk_count = 0
    for word in words:
        if word in dictionary:
            index = dictionary[word]
        else:
            index = 0  # dictionary['UNK']
            unk_count += 1
        data.append(index)
    count[0][1] = unk_count
    reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return data, count, dictionary, reverse_dictionary

data, count, dictionary, reverse_dictionary = build_dataset(words)
del words  # Hint to reduce memory.
print('Most common words (+UNK)', count[:5])
print('Sample data', data[:10], [reverse_dictionary[i] for i in data[:10]])
data_index = 0
```
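To see concretely what `build_dataset` produces, here is a compact, self-contained sketch of the same logic on a hypothetical toy corpus, with `vocabulary_size` shrunk to 3 so that rare words actually get replaced by UNK:

```python
import collections

# Hypothetical toy corpus; vocabulary_size is shrunk to 3 for illustration.
words = ['the', 'the', 'the', 'cat', 'cat', 'sat', 'on']
vocabulary_size = 3

# Same logic as build_dataset, written compactly.
count = [['UNK', -1]]
count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
dictionary = {word: i for i, (word, _) in enumerate(count)}
data = [dictionary.get(word, 0) for word in words]   # 0 == dictionary['UNK']
count[0][1] = data.count(0)                          # number of UNK replacements

print(count)  # [['UNK', 2], ('the', 3), ('cat', 2)]
print(data)   # [1, 1, 1, 2, 2, 0, 0]
```

The rare words `sat` and `on` fall outside the top `vocabulary_size - 1` and are both encoded as index 0 (UNK), which is exactly the replacement described in the goal above.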
Description:
- Define the vocabulary size as 50000.
```python
vocabulary_size = 50000
```
Define the function `build_dataset()`.
Initialize `count` as a list whose first entry is `['UNK', -1]`, then extend it with the `vocabulary_size - 1` most common words in `words` (together with their counts), so `count` holds `vocabulary_size` entries in total.
```python
def build_dataset(words):
    count = [['UNK', -1]]
    count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
```
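For reference, `collections.Counter(...).most_common(n)` returns the `n` highest-count `(word, count)` pairs in descending order of count; a minimal sketch with a hypothetical list:

```python
import collections

# Hypothetical word list: 'a' appears 3 times, 'b' twice, 'c' once.
words = ['a', 'b', 'a', 'c', 'a', 'b']
print(collections.Counter(words).most_common(2))  # [('a', 3), ('b', 2)]
```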
Initialize an empty dictionary.
```python
dictionary = dict()
```
Build a second mapping from each word to its integer index (its frequency rank, with `UNK` at index 0).
```python
for word, _ in count:
    dictionary[word] = len(dictionary)
```
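Since `count` is ordered from most to least frequent (with `UNK` first), `len(dictionary)` at insertion time equals the word's frequency rank; a small sketch with hypothetical counts:

```python
# Hypothetical count list, already ordered by frequency with UNK first.
count = [['UNK', -1], ('the', 3), ('cat', 2)]
dictionary = dict()
for word, _ in count:
    dictionary[word] = len(dictionary)  # 0 for UNK, then 1, 2, ... by rank
print(dictionary)  # {'UNK': 0, 'the': 1, 'cat': 2}
```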
Initialize an empty list `data` and set `unk_count` to 0.
```python
data = list()
unk_count = 0
```