Step 2
Goal: Build the dictionary and replace rare words with the UNK token.
Full code for Step 2 (Python):
```python
import collections

vocabulary_size = 50000

def build_dataset(words):
    count = [['UNK', -1]]
    count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
    dictionary = dict()
    for word, _ in count:
        dictionary[word] = len(dictionary)
    data = list()
    unk_count = 0
    for word in words:
        if word in dictionary:
            index = dictionary[word]
        else:
            index = 0  # dictionary['UNK']
            unk_count += 1
        data.append(index)
    count[0][1] = unk_count
    reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return data, count, dictionary, reverse_dictionary

data, count, dictionary, reverse_dictionary = build_dataset(words)
del words  # Hint to reduce memory.
print('Most common words (+UNK)', count[:5])
print('Sample data', data[:10], [reverse_dictionary[i] for i in data[:10]])
data_index = 0
```
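To see concretely what `build_dataset` produces, here is a compact, self-contained sketch of the same logic on a hypothetical toy corpus, with `vocabulary_size` shrunk to 3 so that rare words actually get replaced by UNK:

```python
import collections

# Hypothetical toy corpus; vocabulary_size is shrunk to 3 for illustration.
words = ['the', 'the', 'the', 'cat', 'cat', 'sat', 'on']
vocabulary_size = 3

# Same logic as build_dataset, written compactly.
count = [['UNK', -1]]
count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
dictionary = {word: i for i, (word, _) in enumerate(count)}
data = [dictionary.get(word, 0) for word in words]   # 0 == dictionary['UNK']
count[0][1] = data.count(0)                          # number of UNK replacements

print(count)  # [['UNK', 2], ('the', 3), ('cat', 2)]
print(data)   # [1, 1, 1, 2, 2, 0, 0]
```

The rare words `sat` and `on` fall outside the top `vocabulary_size - 1` and are both encoded as index 0 (UNK), which is exactly the replacement described in the goal above.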
Description:
- Define the vocabulary size as 50000.
```python
vocabulary_size = 50000
```
Define the function `build_dataset()`.
Initialize `count` as a list whose first entry is `['UNK', -1]`, then extend it with the `vocabulary_size - 1` most common words in `words` (together with their counts), so `count` holds `vocabulary_size` entries in total.
```python
def build_dataset(words):
    count = [['UNK', -1]]
    count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
```
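For reference, `collections.Counter(...).most_common(n)` returns the `n` highest-count `(word, count)` pairs in descending order of count; a minimal sketch with a hypothetical list:

```python
import collections

# Hypothetical word list: 'a' appears 3 times, 'b' twice, 'c' once.
words = ['a', 'b', 'a', 'c', 'a', 'b']
print(collections.Counter(words).most_common(2))  # [('a', 3), ('b', 2)]
```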
Initialize an empty dictionary.
```python
dictionary = dict()
```
Build a second mapping from each word to its integer index (its frequency rank, with `UNK` at index 0).
```python
for word, _ in count:
    dictionary[word] = len(dictionary)
```
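Since `count` is ordered from most to least frequent (with `UNK` first), `len(dictionary)` at insertion time equals the word's frequency rank; a small sketch with hypothetical counts:

```python
# Hypothetical count list, already ordered by frequency with UNK first.
count = [['UNK', -1], ('the', 3), ('cat', 2)]
dictionary = dict()
for word, _ in count:
    dictionary[word] = len(dictionary)  # 0 for UNK, then 1, 2, ... by rank
print(dictionary)  # {'UNK': 0, 'the': 1, 'cat': 2}
```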
Initialize an empty list `data` and set `unk_count` to 0.
```python
data = list()
unk_count = 0
```