Skip-gram Model
Figure: Architecture of the skip-gram model.
The figure shows the architecture of the skip-gram model. A word is fed to the input layer to predict its context words, which are set as the target words at the output layer.
The output of the hidden layer is
$$\bar{h} = \bar{v}_{w_k} \tag{1}$$
or represented component-wise as
$$h_i = v_{w_k,i} = w_{ki} \tag{2}$$
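To make eqs. (1)–(2) concrete: the hidden layer performs no computation beyond copying the input word's row of the input-to-hidden weight matrix. A minimal NumPy sketch, where the sizes $V$, $N$ and all variable names are illustrative rather than from the source:

```python
import numpy as np

# Illustrative sizes: vocabulary V, embedding dimension N (hypothetical values).
V, N = 10, 5
rng = np.random.default_rng(0)
W = rng.normal(size=(V, N))   # input-to-hidden weights; row i is the embedding v_{w_i}

k = 3                         # index of the input word w_k (hypothetical)
h = W[k]                      # eqs. (1)-(2): the hidden layer is just row k of W
```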
On the output layer, instead of outputting one multinomial distribution, the model outputs $C$ multinomial distributions, each computed using the same hidden-to-output weight matrix $\bar{\bar{W}}'$. The input to the $j$-th neuron on the $m$-th panel of the output layer is
$$u_{j,m} = u_j = \bar{v}'_{w_j} \cdot \bar{h} = \sum_{i=1}^{N} w'_{ij} h_i = \sum_{i=1}^{N} w'_{ij} w_{ki} \tag{3}$$
where $u_{j,m}$ is the same on all panels, since the panels share the same weights. The probability of the $j$-th word ($j = 1, 2, \cdots, V$) on the $m$-th panel is
$$y_{j,m} = p(w_{j,m} \mid w_k) = \frac{e^{u_{j,m}}}{\sum_{j'=1}^{V} e^{u_{j'}}} = \frac{e^{\bar{v}'_{w_j} \cdot \bar{h}}}{\sum_{j'=1}^{V} e^{\bar{v}'_{w_{j'}} \cdot \bar{h}}} \tag{4}$$
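Continuing the sketch above (reusing `rng`, `h`, and `N`), eqs. (3)–(4) amount to a matrix-vector product followed by a softmax over the vocabulary; the max-shift below is a standard numerical-stability trick, not part of the derivation:

```python
W_prime = rng.normal(size=(N, V))  # hidden-to-output weights; column j is v'_{w_j}

u = W_prime.T @ h                  # eq. (3): u_j = v'_{w_j} . h, identical on every panel
y = np.exp(u - u.max())            # eq. (4): softmax numerator (shifted for stability)
y /= y.sum()                       # y[j] = p(w_j | w_k), shared by all C panels
```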
The goal is to maximize the probability of the context words given the input word.
The loss function is defined as
$$E = -\ln p(w_{j_{o,1}}, w_{j_{o,2}}, \cdots, w_{j_{o,C}} \mid w_k) = -\ln \prod_{m=1}^{C} p(w_{j_{o,m}} \mid w_k) = -\ln \prod_{m=1}^{C} y_{j_{o,m},m} = -\ln \prod_{m=1}^{C} \frac{e^{\bar{v}'_{w_{j_{o,m}}} \cdot \bar{h}}}{\sum_{j'=1}^{V} e^{\bar{v}'_{w_{j'}} \cdot \bar{h}}} \tag{5}$$
where the subscript $j_{o,m}$ indicates that the word is the $m$-th target context word in $C_x(w_k)$.
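Because every panel shares the same softmax output $y$, eq. (5) reduces to summing the negative log-probabilities of the $C$ target words. A sketch continuing the code above, with hypothetical context indices:

```python
C = 3                          # number of context words per input word (hypothetical)
context = [1, 4, 7]            # hypothetical target indices j_{o,m} for m = 1..C
E = -np.log(y[context]).sum()  # eq. (5): one -ln y_{j_{o,m}} term per panel
```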
The derivative of $E$ with respect to $u_{j,m}$ is
$$\frac{\partial E}{\partial u_{j,m}} = y_{j,m} - \delta_{j,j_{o,m}} \doteq e_{j,m} \tag{6}$$
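In code, eq. (6) says that each panel's error vector is the shared softmax output minus a one-hot vector at that panel's target word. Continuing the sketch:

```python
e = np.tile(y, (C, 1))         # C panels, each starting from the shared softmax y
for m, jo in enumerate(context):
    e[m, jo] -= 1.0            # eq. (6): e_{j,m} = y_{j,m} - delta_{j, j_{o,m}}
```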