Skip-gram Model
Architecture of the skip-gram model.
The figure shows the architecture of the skip-gram model. A word is fed into the input layer, and the model predicts its context words, which serve as the target words at the output layer.
The output of the hidden layer is
or represented component-wise as
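In standard notation, the hidden-layer output can be sketched as follows. The symbols are assumptions chosen by analogy with $\overline{\overline{W}}'$: $\mathbf{x}$ is the one-hot encoding of the input word $w_k$, $\overline{\overline{W}}$ is the $V \times N$ input-to-hidden weight matrix with entries $w_{ki}$, and $N$ is the number of hidden units.

```latex
% Assumed notation: x is the one-hot input vector, \overline{\overline{W}} is V x N.
\mathbf{h} = \overline{\overline{W}}^{\top} \mathbf{x}

% Component-wise: only the k-th entry of x is nonzero, so h copies row k of the matrix.
h_i = \sum_{k'=1}^{V} w_{k'i} \, x_{k'} = w_{ki}, \qquad i = 1, 2, \cdots, N
```

Because $\mathbf{x}$ is one-hot, the hidden layer performs no real computation; it simply looks up the row of the weight matrix that belongs to the input word.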
On the output layer, instead of outputting one multinomial distribution, the model outputs $C$ multinomial distributions, each computed using the same hidden-to-output weight matrix $\overline{\overline{W}}'$. The input of the $j$-th neuron on the $m$-th panel in the output layer is obtained as
where the $u_{j,m}$ are the same on all panels, since the panels share the same weights. The probability of the $j$-th word ($j = 1, 2, \cdots, V$) on the $m$-th panel is
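In the usual formulation, assumed here, the score is $u_{j,m} = \sum_i w'_{ij} h_i$, with $h$ the hidden-layer output and $w'_{ij}$ the entries of $\overline{\overline{W}}'$, and the probability is the softmax $y_{j,m} = \exp(u_{j,m}) / \sum_{j'=1}^{V} \exp(u_{j',m})$. A minimal NumPy sketch of this forward pass follows; the names `skipgram_forward`, `W_in`, and `W_out` are hypothetical and the weights are random placeholders.

```python
import numpy as np

def softmax(u):
    """Numerically stable softmax over a 1-D score vector."""
    z = u - np.max(u)          # shifting by the max does not change the result
    e = np.exp(z)
    return e / e.sum()

def skipgram_forward(k, W_in, W_out, C):
    """Forward pass for the input word with index k.

    W_in  : (V, N) input-to-hidden weights (assumed name)
    W_out : (N, V) hidden-to-output weights, shared by all C panels
    Returns h with shape (N,), and u, y with shape (C, V).
    """
    h = W_in[k]                                # one-hot input selects row k
    u_single = h @ W_out                       # scores u_j for all V words
    u = np.tile(u_single, (C, 1))              # every panel gets the same scores
    y = np.tile(softmax(u_single), (C, 1))     # and hence the same distribution
    return h, u, y

rng = np.random.default_rng(0)
V, N, C = 6, 3, 4                              # toy vocabulary, hidden, context sizes
W_in = rng.normal(size=(V, N))
W_out = rng.normal(size=(N, V))
h, u, y = skipgram_forward(2, W_in, W_out, C)
```

Each row of `y` is one panel's distribution over the vocabulary. The rows are identical because the panels share weights; they differ only in which target word each one is compared against during training.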
The goal is to maximize the probability of the context words given the input word.
The loss function is defined as
where the subscript $j_{o,m}$ denotes the index of the $m$-th target context word in $C_x(w_k)$.
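Under the usual assumption that the $C$ context words are conditionally independent given the input word $w_k$, the loss is the negative log-likelihood of the target words. A sketch in this notation, where the symbol $y_{j,m}$ for the softmax probability on the $m$-th panel is an assumption:

```latex
% Negative log-likelihood of the C target context words (y is assumed notation).
E = -\log \prod_{m=1}^{C} y_{j_{o,m},\,m}
  = -\sum_{m=1}^{C} u_{j_{o,m},\,m}
    + \sum_{m=1}^{C} \log \sum_{j'=1}^{V} \exp\left(u_{j',m}\right)

% Since the scores are identical on every panel, the last term reduces to
% C \cdot \log \sum_{j'=1}^{V} \exp(u_{j'}).
```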
The derivative of $E$ with respect to $u_{j,m}$ is
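For a softmax output combined with a negative log-likelihood loss, the derivative takes the well-known form $\partial E / \partial u_{j,m} = y_{j,m} - t_{j,m}$, where $t_{j,m}$ equals 1 if $j = j_{o,m}$ and 0 otherwise (the symbols $y$ and $t$ are notation assumed here). The sketch below compares that expression with a finite-difference estimate, treating the per-panel scores as independent variables. All function names are hypothetical, and the scores and targets are arbitrary toy values.

```python
import numpy as np

def panel_softmax(u):
    """Row-wise softmax: one distribution per panel. u has shape (C, V)."""
    z = u - u.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def loss(u, targets):
    """E = -sum over panels m of log y[m, targets[m]]."""
    y = panel_softmax(u)
    return -np.log(y[np.arange(len(targets)), targets]).sum()

def grad_u(u, targets):
    """Analytic gradient dE/du = y - t, with t the one-hot targets."""
    y = panel_softmax(u)
    t = np.zeros_like(y)
    t[np.arange(len(targets)), targets] = 1.0
    return y - t

def numeric_grad(u, targets, eps=1e-6):
    """Central finite-difference estimate of dE/du, one entry at a time."""
    g = np.zeros_like(u)
    for idx in np.ndindex(*u.shape):
        up, dn = u.copy(), u.copy()
        up[idx] += eps
        dn[idx] -= eps
        g[idx] = (loss(up, targets) - loss(dn, targets)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
C, V = 3, 5
u = np.tile(rng.normal(size=V), (C, 1))   # shared scores, repeated per panel
targets = np.array([1, 4, 0])             # toy indices j_{o,m} of the context words
analytic = grad_u(u, targets)
numeric = numeric_grad(u, targets)
max_abs_diff = np.max(np.abs(analytic - numeric))
```

If the formula holds, `max_abs_diff` should be close to zero. Each row of `analytic` also sums to zero, because both the softmax output and the one-hot target sum to one.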