CBOW Model with Multi-word Context

The CBOW model predicts a target word from its context words [1].

Consider a vocabulary containing $V$ words, which can be expressed as

$$ {\mathcal V} = \{ d_1, d_2, \cdots, d_k, \cdots, d_V \} $$

where $d_k$ is the $k$-th word in the vocabulary. The training corpus ${\mathcal C}$ consists of $N_a$ articles:

$$ {\mathcal C} = \{ \textrm{article}_1, \textrm{article}_2, \cdots, \textrm{article}_{N_a} \} $$

Each article is composed of words from the vocabulary ${\mathcal V}$. For example,

$$ \textrm{article}_1 = d_1 \ d_5 \ d_{18} \ d_{56} \ d_2 \ \cdots $$
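
As a purely hypothetical illustration of this representation, the sketch below stores a toy vocabulary and one article as sequences of word indices; the words and sizes are made up for the example and are not from the text.

```python
# Toy illustration: a vocabulary and a corpus of articles stored as word indices.
vocabulary = ["the", "cat", "sat", "on", "mat", "dog"]      # V = 6 (hypothetical words)
word_to_index = {word: k for k, word in enumerate(vocabulary)}

# Each article is a sequence of vocabulary indices.
article_1 = [word_to_index[w] for w in ["the", "cat", "sat", "on", "the", "mat"]]
corpus = [article_1]                                        # N_a = 1 article in this toy corpus
print(article_1)                                            # [0, 1, 2, 3, 0, 4]
```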

The $C$-word context of a target word $d_{j_o}$ in an article is defined as

$$ C_x (d_{j_o}) = \left\{ d_{c_m} \mid d_{c_m} \textrm{ is a word in the context of } d_{j_o},\ m = 1, 2, \cdots, C \right\} \pod{\text{1}} $$

where each subscript $c_m$ is an integer between $1$ and $V$. For example, in $\textrm{article}_1$, the first 4-word context of $d_{18}$ is

$$ C_x (d_{18}) = \{ d_{c_1}, d_{c_2}, d_{c_3}, d_{c_4} \} = \{ d_1, d_5, d_{56}, d_2 \} $$
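
A minimal sketch of how such a $C$-word context could be collected from a tokenized article is shown below; the function name and the symmetric-window convention are assumptions, not taken from the text.

```python
def context_indices(article, position, C):
    """Collect the C-word context of the word at `position` in a tokenized article,
    taking C // 2 words on each side (a symmetric window is assumed here; the text
    only fixes the context size C, not how the window is split)."""
    half = C // 2
    return article[max(0, position - half):position] + article[position + 1:position + 1 + half]

# Mirroring the example above, with article_1 = d_1 d_5 d_18 d_56 d_2 ...
article_1 = [1, 5, 18, 56, 2]            # the article as a sequence of word indices
print(context_indices(article_1, 2, 4))  # [1, 5, 56, 2], i.e. {d_1, d_5, d_56, d_2}
```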

Fig. 1. Data flow of the CBOW model with $C$-word context. $\bar{x}^{c_1}, \cdots, \bar{x}^{c_C}$ are the one-hot encoded vectors of the words $d_{c_1}, \cdots, d_{c_C}$ and are input to the NN simultaneously in the CBOW model with $C$-word context. $\bar{y}$ is the output of the NN. The input vectors $\bar{v}_{c_1}, \cdots, \bar{v}_{c_C}$ and the output vector $\bar{v}_j'$ are two kinds of word vector representations.

Fig. 1 shows the data flow of the CBOW model with $C$-word context. The words in the context $C_x(d_{j_o})$ are all one-hot encoded into $\bar{x}^{c_1}, \cdots, \bar{x}^{c_C}$, which are input to the neural network (NN) of the CBOW model. The output $\bar{y} = [y_1, \cdots, y_j, \cdots, y_V]^t$ has size $V$, and $y_j$ is the probability that the target word is $d_j$ given the one-hot encoded vectors $\bar{x}^{c_m}$, $m = 1, \cdots, C$. The input vectors $\bar{v}_{c_m}$, $m = 1, \cdots, C$, and the output vector $\bar{v}_j'$ are two kinds of word vector representations and will be elaborated later. The NN is trained by feeding the articles in the training corpus ${\mathcal C}$ to the NN word by word.
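
A minimal NumPy sketch of the one-hot encoding step; the variable names and toy sizes are assumptions made for illustration.

```python
import numpy as np

V = 6                                  # hypothetical vocabulary size
context = [0, 2, 3, 5]                 # toy indices c_1, ..., c_C of the context words

def one_hot(index, V):
    """One-hot encode a 0-based vocabulary index into a length-V vector."""
    x = np.zeros(V)
    x[index] = 1.0
    return x

X = [one_hot(c, V) for c in context]   # the input vectors x^{c_1}, ..., x^{c_C}
```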

Fig. 2. Architecture of the NN for the CBOW model with $C$ context words of the target word $d_{j_o}$.

Fig. 2 shows the architecture of the NN for the CBOW model with $C$ context words of the target word $d_{j_o}$. A softmax function is still applied at the end of the output layer.

The hidden layer output is calculated as

$$ \bar{h} = \frac{1}{C} \bar{\bar{W}}^t \cdot \left( \bar{x}^{c_1} + \bar{x}^{c_2} + \cdots + \bar{x}^{c_C} \right) = \frac{1}{C} \left( \bar{v}_{c_1} + \bar{v}_{c_2} + \cdots + \bar{v}_{c_C} \right) \pod{\text{2}} $$

or represented component-wise as

$$ h_i = \frac{1}{C} \sum_{m = 1}^C w_{c_m i}, \quad i = 1, 2, \cdots, N \pod{\text{3}} $$

where $w_{c_m i}$ is the $(c_m, i)$ element of $\bar{\bar{W}}$, i.e., the $i$-th component of $\bar{v}_{c_m}$, and $N$ is the dimension of the hidden layer.
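
Because each one-hot input simply selects one row of $\bar{\bar{W}}$, Eq. (2) reduces to averaging rows of the weight matrix. A minimal NumPy sketch (the sizes, seed, and variable names are hypothetical):

```python
import numpy as np

V, N, C = 6, 3, 4                        # hypothetical vocabulary, hidden-layer, and context sizes
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(V, N))   # input weight matrix; row k is the input vector of word d_k
context = [0, 2, 3, 5]                   # toy indices c_1, ..., c_C of the context words

# Eq. (2): since each one-hot x^{c_m} selects row c_m of W, the hidden output
# is simply the average of the input vectors of the C context words.
h = W[context].mean(axis=0)              # shape (N,)
```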

The output of the neural network at the $j$-th neuron is the probability of the word $d_j$ given the context of $d_{j_o}$, namely,

$$ y_j = p(d_j \mid C_x(d_{j_o})) = \frac{e^{\bar{v}_j' \cdot \bar{h}}}{\displaystyle \sum_{k = 1}^V e^{\bar{v}_k' \cdot \bar{h}}} \pod{\text{4}} $$
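
A minimal NumPy sketch of Eq. (4), assuming a hypothetical output weight matrix whose columns play the role of the output vectors $\bar{v}_j'$ (names, sizes, and values are made up):

```python
import numpy as np

V, N = 6, 3                                  # hypothetical vocabulary and hidden-layer sizes
rng = np.random.default_rng(1)
W_out = rng.normal(scale=0.1, size=(N, V))   # output weight matrix; column j is v_j'
h = rng.normal(size=N)                       # hidden output from Eq. (2) (stand-in values)

# Eq. (4): softmax over the scores v_j' . h of all V words.
scores = h @ W_out                           # shape (V,)
y = np.exp(scores - scores.max())            # subtract the max for numerical stability
y /= y.sum()                                 # y[j] = p(d_j | context)
```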

The loss function is defined as

$$ E = -\ln p(d_{j_o} \mid C_x(d_{j_o})) \pod{\text{5}} $$
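
A one-line sketch of Eq. (5) on a toy softmax output (the probabilities and the target index are made up for illustration):

```python
import numpy as np

y = np.array([0.1, 0.05, 0.6, 0.1, 0.1, 0.05])  # softmax output from Eq. (4), toy values
j_o = 2                                         # index of the actual target word

# Eq. (5): negative log-likelihood of the target word given its context.
E = -np.log(y[j_o])
print(E)                                        # ~0.51
```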

[0] X. Rong, "word2vec Parameter Learning Explained," arXiv:1411.2738, 2014.

[1] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient Estimation of Word Representations in Vector Space," arXiv:1301.3781, 2013.
