CBOW Model with One-word Context

The CBOW model predicts a word from its context words, which are the input of the CBOW model [1].

Define a vocabulary containing $V$ words as

$$ {\mathcal V} = \{ d_1, d_2, \cdots, d_k, \cdots, d_V\} $$

where $d_k$ is the $k$-th word in the vocabulary. The training corpus ${\mathcal C}$ consists of $N_a$ articles, ${\mathcal C} = \{ \textrm{article}_1, \textrm{article}_2, \cdots, \textrm{article}_{N_a} \}$.

Fig.1. Schematic of a word sequence and an article.

Fig.1 shows the schematic of a word sequence and an article. A word sequence is formed by concatenating the words in the vocabulary ${\mathcal V}$. Each article is formed by concatenating $n$ word sequences.

When the context consists of a single word, the CBOW model reduces to a bigram model.

Fig.2. Data flow of the CBOW model with one-word context. $\bar{x}^k$ is the one-hot encoded vector of the word $d_k$ and is the input to the NN for the CBOW model with one-word context. $\bar{y}$ is the output of the NN. The input word vector representation $\bar{w}_k$ and the output word vector representation $\bar{w}_j'$ are two kinds of word vector representations.

Fig.2 shows the data flow of the CBOW model with one-word context. The word $d_k$ is one-hot encoded into $\bar{x}^k$, and $\bar{x}^k$ is input to the neural network (NN) for the CBOW model with one-word context, expanded as

$$ \bar{x}^k = [x_1^k, x_2^k, \cdots, x_{k-1}^k, x_k^k, x_{k+1}^k, \cdots, x_V^k]^t \pod{\text{1}} $$

where $x_n^k = 0$ for $n \neq k$ and $x_k^k = 1$; $t$ stands for the transpose operation.

Fig.3. Schematic of one-hot encoding for the $k$-th word: $d_k \to \bar{x}^k = [x_1^k, x_2^k,\cdots, x^k_k,\cdots, x_V^k]^t=[0,\cdots, 1, \cdots, 0]^t$.

Fig.3 shows the schematic of one-hot encoding for the $k$-th word $d_k$.
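As a minimal sketch of this encoding in Python with NumPy (the toy vocabulary is taken from the example sentence "媽媽,我愛您"; the helper name is illustrative):

```python
import numpy as np

# Toy vocabulary of V = 5 words from the example sentence "媽媽,我愛您"
vocabulary = ["媽", ",", "我", "愛", "您"]

def one_hot(word, vocabulary):
    """Return the one-hot vector x^k for the k-th word d_k of the vocabulary."""
    x = np.zeros(len(vocabulary))
    x[vocabulary.index(word)] = 1.0
    return x

print(one_hot("我", vocabulary))   # [0. 0. 1. 0. 0.]
```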

The output $\bar{y} = [y_1, \cdots, y_j, \cdots, y_V]^t$ has size $V$; $y_j = p(d_j| \bar{x}^k)$ is the probability that the next word is $d_j$ given the one-hot encoded vector $\bar{x}^k$, with the property that $\displaystyle \sum_{j=1}^V y_j=1$.

The NN is trained by feeding the articles in the training corpus ${\mathcal C}$ to the NN word by word, each time with a given context word $d_k$ and its corresponding target word $d_{j_t}$, where $j_t$ is the index of the target word.

Fig.4. Schematic of the target word $d_{j_t}$ under the CBOW model with one-word context $d_k$. (a) $d_k = d_{17}$, $d_{j_t} = d_{17}$. (b) $d_k = d_{21}$, $d_{j_t} = d_{77}$.

Fig.4 shows the schematic of the target word under the CBOW model with one-word context. The target word is the word that follows the given context word in the word sequence.
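A sketch of how (context, target) training pairs could be generated from a word sequence under this rule (the helper and the character-level tokenization of "媽媽,我愛您" are illustrative assumptions):

```python
def context_target_pairs(word_sequence):
    """One-word-context CBOW (bigram) pairs: the target word is simply
    the word that follows the context word in the word sequence."""
    return [(word_sequence[i], word_sequence[i + 1])
            for i in range(len(word_sequence) - 1)]

# The example sentence "媽媽,我愛您", tokenized character by character
print(context_target_pairs(["媽", "媽", ",", "我", "愛", "您"]))
# [('媽', '媽'), ('媽', ','), (',', '我'), ('我', '愛'), ('愛', '您')]
```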

Fig.5. Overall training flow-chart of the example "媽媽,我愛您".

Fig.5 shows the overall training flow-chart of the example "媽媽,我愛您".

Fig.6. Schematic of the output $\bar{y}$ of a well-trained NN given $\bar{x}^k$ in a testing example. (a) input "媽", (b) input "我".

Fig.6 shows the schematic of the output $\bar{y}$ of a well-trained NN given $\bar{x}^k$ in a testing example for the input words "媽" and "我". When testing a well-trained NN with the input word "媽", one finds that the probabilities of "媽" and "," might be higher than those of most other words in the vocabulary. When the input is "我", the probability of "愛" might be higher than that of most other words in the vocabulary.

The input word vector representation $\bar{w}_k$ and the output word vector representation $\bar{w}_j'$ are two kinds of word vector representations.

Fig.7. Architecture of the NN for the CBOW model with one context word. $\bar{w}_k$ is the input word vector representation of $d_k$, and $\bar{w}_j'$ is the output word vector representation of $d_j$. Both $\bar{w}_k$ and $\bar{w}_j'$ are of dimension $D$, $D \ll V$.

Fig.7 shows the architecture of the NN for the CBOW model with one context word. $\bar{\bar{W}}$ and $\bar{\bar{W}}'$ are the input-to-hidden and hidden-to-output weight matrices. $\bar{w}_k$ is the input word vector representation of $d_k$, and $\bar{w}_j'$ is the output word vector representation of $d_j$. The numbers of neurons in the input layer and in the output layer are both chosen to be the vocabulary size $V$, and the hidden layer size is $D$. Usually $D \ll V$; for example, $V = 8000$ and $D = 60$ or $100$.
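A sketch of how these two weight matrices might be set up (the small uniform initialization is an assumption; the text does not specify an initialization scheme):

```python
import numpy as np

V = 8000   # vocabulary size (example value from the text)
D = 100    # hidden-layer size, D << V

rng = np.random.default_rng(0)
W_in  = rng.uniform(-0.5 / D, 0.5 / D, size=(V, D))   # input-to-hidden,  V x D
W_out = rng.uniform(-0.5 / D, 0.5 / D, size=(D, V))   # hidden-to-output, D x V

# Row k of W_in is the input word vector w_k of d_k;
# column j of W_out is the output word vector w'_j of d_j.
```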

Forward Propagation

The input-to-hidden weight between neuron $k$ in the input layer and neuron $i$ in the hidden layer is denoted as $w_{ki}$, forming a $V \times D$ weight matrix as

$$ \bar{\bar{W}} = \left[\begin{matrix} w_{11} & w_{12} & \cdots & \cdots & w_{1D} \\ w_{21} & w_{22} & \cdots& \cdots & w_{2D} \\ \vdots & \cdots & \ddots& \cdots & \vdots \\ w_{k1} & \cdots & w_{ki}& \cdots & w_{kD} \\ \vdots & \cdots & \ddots& \cdots & \vdots \\ w_{V1} & w_{V2} & \cdots & \cdots & w_{VD} \end{matrix}\right] \pod{\text{2}} $$

where the $k$-th row of $\bar{\bar{W}}$ contains the weights which connect neuron $k$ in the input layer to all neurons in the hidden layer, as shown in Fig.6. Define the transpose of the $k$-th row of $\bar{\bar{W}}$ as the input word vector representation $\bar{w}_k$, namely,

$$ \bar{w}_{k} \doteq \left[ \begin{matrix} w_{k1}, \cdots, w_{ki}, \cdots, w_{kD} \end{matrix} \right]^t $$

which is the $D$-dimensional vector representation of the input word $d_k$.

The hidden-layer output is obtained as

$$ \bar{h} = \bar{\bar{W}}^{t}\bar{x}^{k} =\left[\begin{matrix} w_{11} & w_{21} & \cdots & w_{k1} & \cdots & w_{V1} \\ w_{12} & w_{22} & \cdots & \cdots & \cdots & w_{V2} \\ \vdots & \cdots & \cdots & w_{ki}& \cdots & \vdots \\ \vdots & \cdots & \ddots& \cdots & \vdots & \vdots \\ w_{1D} & w_{2D} & \cdots & w_{kD} & \cdots & w_{VD} \end{matrix}\right] \left[\begin{matrix} 0 \\ \vdots \\ 0 \\ x_k^k = 1 \\ 0 \\ \vdots \\ 0 \end{matrix}\right] = \left[\begin{matrix} w_{k1} \\ \vdots \\ w_{ki} \\ \vdots \\ w_{kD} \end{matrix}\right] = \bar{w}_{k} \pod{\text{3}} $$
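Equation (3) says that multiplying $\bar{\bar{W}}^t$ by a one-hot vector just selects the $k$-th row of $\bar{\bar{W}}$, so the hidden layer can be computed as a table lookup. A small sketch of this equivalence (sizes below are made up):

```python
import numpy as np

V, D = 10, 4
rng = np.random.default_rng(0)
W = rng.normal(size=(V, D))        # input-to-hidden matrix (V x D)

k = 3                              # index of the context word d_k
x_k = np.zeros(V)
x_k[k] = 1.0                       # one-hot vector x^k

h = W.T @ x_k                      # full matrix-vector product, as in (3)
assert np.allclose(h, W[k])        # ...equals a lookup of row k, i.e. the vector w_k
```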

The hidden-to-output weights are denoted as $w_{ij}'$, which connect neuron $i$ in the hidden layer to neuron $j$ in the output layer and form a $D \times V$ weight matrix $\bar{\bar{W}}'$ as

$$ \bar{\bar{W}}' = \left[ \begin{matrix} w_{11}' & w_{12}' & \cdots & w_{1j}' & \cdots & w_{1V}' \\ w_{21}' & w_{22}' & \cdots & w_{2j}'& \cdots & w_{2V}' \\ \vdots & \cdots & \ddots & \cdots & \cdots & \vdots \\ w_{i1}' & \cdots & \cdots & w_{ij}' & \cdots & w_{iV}' \\ \vdots & \cdots & \ddots & \cdots & \cdots & \vdots \\ w_{D1}' & w_{D2}' & \cdots & w_{Dj}' & \cdots & w_{DV}' \end{matrix} \right] \pod{\text{4}} $$

where the $j$-th column contains the weights which connect all neurons in the hidden layer to the $j$-th neuron in the output layer, as shown in Fig.2. Define the $j$-th column of $\bar{\bar{W}}'$ as the output word vector representation $\bar{w}_j'$, namely,

$$ \bar{w}_j' \doteq [w_{1j}', w_{2j}', \cdots, w_{ij}', \cdots, w_{Dj}']^t \pod{\text{5}} $$

Note that the output vector $\bar{w}_k'$ is another $D$-dimensional vector representation of the word $d_k$. By substituting (5) into (4), we can represent $\bar{\bar{W}}'$ as

$$ \bar{\bar{W}}' = \left[ \bar{w}'_1, \cdots, \bar{w}'_j, \cdots, \bar{w}'_V \right] \pod{\text{6}} $$

The vector $\bar{h}$ in (3) is weighted by $\bar{\bar{W}}'$ to obtain the input of the output layer as

$$ \bar{u} = {\bar{\bar{W}}'}^t \cdot \bar{h} = \left[ \begin{matrix} w_{11}' & w_{21}' & \cdots & w_{i1}' & \cdots & w_{D1}' \\ w_{12}' & w_{22}' & \cdots & \vdots & \cdots & w_{D2}' \\ \vdots & \vdots & \ddots & w_{ij}' & \cdots & \vdots \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ w_{1V}' & w_{2V}' & \cdots & w_{iV}' & \cdots & w_{DV}' \end{matrix} \right] \left[ \begin{matrix} w_{k1} \\ \vdots \\ w_{ki} \\ \vdots \\ w_{kD} \end{matrix} \right] = \left[ \bar{w}'_1, \cdots, \bar{w}'_j, \cdots, \bar{w}'_V \right]^t \bar{w}_k $$

which can be represented as

$$ \bar{u} = \left[ \bar{w}'_1, \cdots, \bar{w}'_j, \cdots, \bar{w}'_V \right]^t \bar{w}_k = \left[ \begin{matrix} u_1 \\ \vdots \\ u_j \\ \vdots \\ u_V \end{matrix} \right] = \left[ \begin{matrix} {\bar{w}_1'}^t \bar{w}_k \\ \vdots \\ {\bar{w}_j'}^t \bar{w}_k \\ \vdots \\ {\bar{w}_V'}^t \bar{w}_k \end{matrix} \right] = \left[ \begin{matrix} \bar{w}_1' \cdot \bar{w}_k \\ \vdots \\ \bar{w}_j' \cdot \bar{w}_k \\ \vdots \\ \bar{w}_V' \cdot \bar{w}_k \end{matrix} \right] \pod{\text{7}} $$

The output $y_j$ of the $j$-th neuron in the output layer is the probability that the next word is $d_j$ given the one-hot encoded vector $\bar{x}^k$:

$$ y_j = p(d_j | \bar{x}^k)= \frac{e^{u_j}}{\displaystyle \sum_{j = 1}^V e^{u_j}} = \frac{e^{\bar{w}'_j \cdot \bar{w}_k}}{\displaystyle \sum_{j = 1}^V e^{\bar{w}'_j \cdot \bar{w}_k}} \pod{\text{8}} $$
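A sketch of the forward pass (3), (7), (8) in NumPy; the shift by the maximum of $\bar{u}$ is a standard numerical-stabilization detail added here, not part of the text:

```python
import numpy as np

def forward(W_in, W_out, k):
    """Return y (all p(d_j | x^k)), the hidden layer h, and the scores u."""
    h = W_in[k]                        # hidden layer = input word vector w_k, eq. (3)
    u = W_out.T @ h                    # u_j = w'_j . w_k, eq. (7)
    u_shifted = u - u.max()            # stabilization only; the softmax is unchanged
    y = np.exp(u_shifted) / np.exp(u_shifted).sum()   # softmax, eq. (8)
    return y, h, u

V, D = 10, 4
rng = np.random.default_rng(0)
y, h, u = forward(rng.normal(size=(V, D)), rng.normal(size=(D, V)), k=3)
assert np.isclose(y.sum(), 1.0)        # the outputs form a probability distribution
```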

The training objective is to maximize the probability $y_{j_t}$ of observing the target word $d_{j_t}$ given $\bar{x}^k$.

The loss function is defined as

$$ L = -\ln p(d_{j_t} | \bar{x}^k) \pod{\text{9}} $$

Note that maximizing $y_{j_t} = p(d_{j_t} | \bar{x}^k)$ is equivalent to minimizing $L$.
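As a small numerical illustration of (9) (the output distribution below is made up):

```python
import numpy as np

def loss(y, j_t):
    """L = -ln y_{j_t} = -ln p(d_{j_t} | x^k), eq. (9)."""
    return -np.log(y[j_t])

# Hypothetical output distribution over V = 4 words, target index j_t = 2
y = np.array([0.1, 0.2, 0.6, 0.1])
print(loss(y, j_t=2))   # ~0.51; the closer y_{j_t} is to 1, the smaller the loss
```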

Backward Propagation

By using (8), the loss function is expressed as

$$ L = -\ln y_{j_t} = - u_{j_t} + \ln \left(\sum_{j=1}^V e^{u_j} \right) \pod{\text{10}} $$

The partial derivative of $L$ with respect to $u_j$ is

$$ \frac{\partial L}{\partial u_j} = y_j - \hat{y}_j \pod{\text{11}} $$

where we define the desired output $\hat{y}_j$ as

$$ \hat{y}_j = \left\{ \begin{matrix} 0, & j \neq j_t \\ 1, & j = j_t \end{matrix} \right. $$

The supporting derivation of (11) is

$$ \begin{aligned} & \frac{\partial L}{\partial u_{j_t}} = -1 + \frac{e^{u_{j_t}} }{\displaystyle \sum_{j=1}^V e^{u_j}} = y_{j_t} - 1 \\ & \frac{\partial L}{\partial u_j} = \frac{\partial L}{\partial y_{j_t}} \frac{\partial y_{j_t}}{\partial u_j} = - \frac{1}{y_{j_t}}\frac{\partial y_{j_t}}{\partial u_j}= -\frac{1}{y_{j_t}} \frac{\partial \frac{e^{u_{j_t} }}{\sum_{j=1}^V e^{u_j}}}{\partial u_j} =-\frac{1}{y_{j_t}} e^{u_{j_t}} \frac{\partial \displaystyle \left( \sum_{j=1}^V e^{u_j} \right)^{-1}}{ \partial u_j} \\ &= -\frac{1}{y_{j_t}} e^{u_{j_t}} \left[-\left( \sum_{j=1}^V e^{u_j} \right)^{-2} \right] \frac{ \partial \left( \sum_{j=1}^V e^{u_j} \right) }{\partial u_j} \\ &= \frac{1}{y_{j_t}} \frac{e^{u_{j_t}} }{ \sum_{j=1}^V e^{u_j} } \frac{e^{u_j} }{ \sum_{j=1}^V e^{u_j} } = \frac{y_{j_t}}{y_{j_t}} y_j = y_j, \ j \neq j_t \end{aligned} $$

Thus, we can define the error between the NN output and the desired output as

$$ \bar{e} = \bar{y} - \hat{y} \ \mathrm{or} \ e_j = y_j - \hat{y}_j \pod{\text{12}} $$

where $\hat{y} = [\hat{y}_1, \cdots, \hat{y}_j, \cdots, \hat{y}_V]^t$.
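The identity (11) can also be checked numerically. The following sketch compares the analytic gradient $\bar{y} - \hat{y}$ with a central-difference estimate of $\partial L / \partial u_j$ (the sizes and random scores are made up):

```python
import numpy as np

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

V, j_t = 6, 2
rng = np.random.default_rng(0)
u = rng.normal(size=V)
y_hat = np.zeros(V)
y_hat[j_t] = 1.0                                    # desired output

L = lambda u: -np.log(softmax(u)[j_t])              # loss as a function of u, eq. (10)
analytic = softmax(u) - y_hat                       # eq. (11): dL/du_j = y_j - y_hat_j

eps = 1e-6
numeric = np.array([(L(u + eps * np.eye(V)[j]) - L(u - eps * np.eye(V)[j])) / (2 * eps)
                    for j in range(V)])
assert np.allclose(analytic, numeric, atol=1e-6)    # the two gradients agree
```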

Fig.8. Overview of updating the weights of the NN for the CBOW model.

Fig.8 shows an overview of updating the weights of the NN for the CBOW model.

The derivative of $L$ with respect to the hidden-to-output weight $w_{ij}'$ is

$$ \frac{\partial L}{\partial w'_{ij}} = \frac{\partial L}{\partial u_j} \frac{\partial u_j}{\partial w'_{ij}} = e_j w_{ki} \pod{\text{13}} $$

The supporting derivation of (13) is

$$ \begin{aligned} u_j = \bar{w}_j' \cdot \bar{w}_k & = \sum_{i = 1}^D w_{ij}' w_{ki} \\ \frac{\partial u_j}{\partial w'_{ij}} & = w_{ki} \end{aligned} $$

By using stochastic gradient descent, we obtain the update equation for the hidden-to-output weights $w_{ij}'$ as

$$ {w_{ij}'}^{(new)} = {w_{ij}'}^{(old)} - \eta \frac{\partial L}{\partial w'_{ij}} = {w_{ij}'}^{(old)} - \eta e_j w_{ki}^{(old)} \pod{\text{14}} $$

or equivalently

$$ {\bar{w}_j'}^{(new)} = {\bar{w}_j'}^{(old)} - \eta e_j \bar{w}_k^{(old)} \pod{\text{15}} $$

where $\eta > 0$ is the learning rate.

For $j \neq j_t$, $e_j > 0$, which is called overestimating and can be seen from (12); subtracting a scaled $\bar{w}_k^{(old)}$ from ${\bar{w}_j'}^{(old)}$ in (15) increases the angle between ${\bar{w}_j'}^{(new)}$ and $\bar{w}_k^{(old)}$. For $j = j_t$, $e_{j_t} < 0$, which is called underestimating; adding a scaled $\bar{w}_k^{(old)}$ to ${\bar{w}_{j_t}'}^{(old)}$ in (15) decreases the angle between ${\bar{w}_{j_t}'}^{(new)}$ and $\bar{w}_k^{(old)}$. If $y_{j_t}$ is close to 1, the error is close to 0 and $\bar{w}'_{j_t}$ is nearly unchanged.
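A vectorized sketch of the update (15) applied to all output word vectors at once (the function name and learning-rate value are illustrative):

```python
import numpy as np

def update_output_vectors(W_out, h, e, eta=0.025):
    """Eq. (15): w'_j <- w'_j - eta * e_j * w_k for every j, in one step.
    h is the hidden layer (= w_k) and e = y - y_hat is the error vector.
    Columns with e_j > 0 move away from w_k; the target column (e_{j_t} < 0)
    moves toward w_k."""
    return W_out - eta * np.outer(h, e)   # (D x V) minus outer product (D x V)
```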

Next, we find the update equation for the input-to-hidden weights $\bar{\bar{W}}$. The derivative of $L$ with respect to the output $h_i$ of the hidden layer is

$$ \frac{\partial L}{\partial h_i} = \sum_{j = 1}^V \frac{\partial L}{\partial u_j} \frac{\partial u_j}{\partial h_i} = \sum_{j = 1}^V e_j w'_{ij} \pod{\text{16}} $$

The supporting derivation of (16) is

$$ h_i = w_{ki} $$

and

$$ \frac{\partial u_j}{\partial h_i} = \frac{\partial u_j}{\partial w_{ki}} = w_{ij}' $$

The derivative of $L$ with respect to the input-to-hidden weight $w_{ki}$ is

$$ \frac{\partial L}{\partial w_{ki}} = \frac{\partial L}{\partial h_i} \frac{\partial h_i}{\partial w_{ki}} = \frac{\partial L}{\partial h_i} = {\displaystyle \sum_{j = 1}^V} e_j w'_{ij} \pod{\text{17}} $$

The update equation for the input-to-hidden weights $w_{ki}$ is

$$ w_{ki}^{(new)} = w_{ki}^{(old)} - \eta \frac{\partial L}{\partial w_{ki}} = w_{ki}^{(old)} - \eta \sum_{j = 1}^V e_j {w_{ij}'}^{(old)} \pod{\text{18}} $$

or

$$ \bar{w}_k^{(new)} = \bar{w}_k^{(old)} - \eta \sum_{j = 1}^V e_j {\bar{w}_j'}^{(old)} \pod{\text{19}} $$

The input word vector representation $\bar{w}_k$ is updated by adding a sum of scaled output word vector representations $\bar{w}_j'$. For $j \neq j_t$ ($e_j > 0$), the contribution of $\bar{w}_j'$ pushes $\bar{w}_k$ farther away from $\bar{w}_j'$. For $j = j_t$ ($e_{j_t} < 0$), the contribution of $\bar{w}_{j_t}'$ moves $\bar{w}_k$ closer to $\bar{w}_{j_t}'$. If the contributions of all the output word vectors are nearly zero, the input word vector remains nearly unchanged.
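A sketch of the update (19); since the context is a single word $d_k$, only row $k$ of $\bar{\bar{W}}$ changes (the function name and learning rate are illustrative):

```python
import numpy as np

def update_input_vector(W_in, W_out, k, e, eta=0.025):
    """Eq. (19): w_k <- w_k - eta * sum_j e_j * w'_j.
    W_out @ e computes sum_j e_j * w'_j because w'_j is column j of W_out."""
    W_in = W_in.copy()
    W_in[k] -= eta * (W_out @ e)
    return W_in
```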

Fig.9. Flow-chart to train distributed vector representations of the words in the vocabulary ${\mathcal V}$ by a training corpus ${\mathcal C}$.

Fig.9 shows a flow-chart for training distributed vector representations of the words in the vocabulary ${\mathcal V}$ from a training corpus ${\mathcal C}$.
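Putting the forward pass and the two update rules together, the following is a minimal end-to-end training sketch in the spirit of Fig.9; the toy corpus, hyperparameters, and function name are illustrative assumptions, not a reference implementation:

```python
import numpy as np

def train_cbow_one_word(corpus, vocabulary, D=10, eta=0.05, epochs=500, seed=0):
    """Train input/output word vectors with the one-word-context CBOW updates."""
    V = len(vocabulary)
    index = {w: k for k, w in enumerate(vocabulary)}
    rng = np.random.default_rng(seed)
    W_in  = rng.uniform(-0.5 / D, 0.5 / D, (V, D))   # rows are input word vectors w_k
    W_out = rng.uniform(-0.5 / D, 0.5 / D, (D, V))   # columns are output word vectors w'_j

    for _ in range(epochs):
        for sequence in corpus:
            for context, target in zip(sequence, sequence[1:]):
                k, j_t = index[context], index[target]
                h = W_in[k]                              # eq. (3)
                u = W_out.T @ h                          # eq. (7)
                y = np.exp(u - u.max()); y /= y.sum()    # eq. (8)
                e = y.copy(); e[j_t] -= 1.0              # eq. (12)
                grad_in = W_out @ e                      # sum_j e_j w'_j with the old W_out
                W_out -= eta * np.outer(h, e)            # eq. (15)
                W_in[k] -= eta * grad_in                 # eq. (19)
    return W_in, W_out

vocabulary = ["媽", ",", "我", "愛", "您"]
corpus = [["媽", "媽", ",", "我", "愛", "您"]]      # the example "媽媽,我愛您"
W_in, W_out = train_cbow_one_word(corpus, vocabulary)
```

After such a training run, feeding the one-hot vector of "我" through the network should tend to assign a relatively high probability to "愛", consistent with the testing example of Fig.6.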

[0] X. Rong, word2vec Parameter Learning Explained, arXiv:1411.2738, 2014.

[1] T. Mikolov, K. Chen, G. Corrado, and J. Dean, Efficient Estimation of Word Representations in Vector Space, arXiv:1301.3781, 2013.
