Training CWS

The max-margin criterion is used to train the model.

The max-margin method generally outperforms both likelihood-based and perceptron-based methods.

For a given character sequence $x^{(i)}$, denote the correct segmented sentence for $x^{(i)}$ as $y^{(i)}$.

A structured margin loss $\Delta(y^{(i)},\hat{y})$ is defined for predicting a segmented sentence $\hat{y}$:

$$\Delta(y^{(i)},\hat{y})=\sum_{t=1}^{m}\mu\,\mathbf{1}\{y_t^{(i)}\neq\hat{y}_t\}$$

where $m$ is the length of sequence $x^{(i)}$ and $\mu$ is the discount parameter. The margin loss can be computed by counting the number of incorrectly segmented characters and multiplying that count by a fixed discount parameter for smoothing.

Therefore, the loss is proportional to the number of incorrectly segmented characters.

Given a training set $\Omega$, the regularized objective function is the loss function $J(\theta)$ including an $\ell_2$-norm term:

$$J(\theta)=\frac{1}{|\Omega|}\sum_{(x^{(i)},y^{(i)})\in\Omega}\ell_i(\theta)+\frac{\lambda}{2}\Vert\theta\Vert_2^2$$

where

$$\ell_i(\theta)=\max_{\hat{y}\in\mathrm{GEN}(x^{(i)})}\left(s(\hat{y},\theta)+\Delta(y^{(i)},\hat{y})-s(y^{(i)},\theta)\right)$$

where the function $s(\cdot)$ is the sentence score, defined as

$$s(y_{[1:n]},\theta)=\sum_{t=1}^{n}\left(\bar{u}\cdot\bar{y}_t+\bar{p}_t\cdot\bar{y}_t\right)$$

and $\theta$ is the set of model parameters.
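The hinge loss $\ell_i$ can be illustrated as follows, under the simplifying assumptions that tags are one-hot vectors, that $\mathrm{GEN}(x^{(i)})$ is given as an explicit list of candidate tag sequences, and that $\bar{u}$ and the per-position scores $\bar{p}_t$ are precomputed; all names and values here are hypothetical:

```python
import numpy as np

def sentence_score(tags, u, P):
    """s(y, theta): at each position t, the tag score u . y_t plus the
    per-position network score p_t . y_t, summed over the sentence."""
    return sum(u @ y + p @ y for y, p in zip(tags, P))

def hinge_loss(y_true, candidates, u, P, mu=0.2):
    """l_i(theta): the margin-augmented score of the best candidate
    minus the score of the gold segmentation (a sketch, not the
    efficient max over GEN used in practice)."""
    gold = sentence_score(y_true, u, P)
    best = max(
        sentence_score(y_hat, u, P)
        + mu * sum(not np.array_equal(a, b) for a, b in zip(y_true, y_hat))
        for y_hat in candidates
    )
    return best - gold
```

If the gold sequence is among the candidates, the loss is non-negative, and it is zero only when the gold sequence beats every other candidate by at least its margin.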

Because of the hinge loss, the objective function is not differentiable, so a subgradient method is used to compute a gradient-like direction.

The diagonal variant of AdaGrad is used with minibatches to minimize the objective.

The update for the $i$-th parameter at time step $t$ is

$$\theta_{t,i}=\theta_{t-1,i}-\frac{\alpha}{\sqrt{\sum_{\tau=1}^{t} g_{\tau,i}^2}}\,g_{t,i}$$

where $\alpha$ is the initial learning rate and $g_{\tau,i}\in\mathbb{R}^{|\theta_i|}$ is the subgradient at time step $\tau$ for parameter $\theta_i$.
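The diagonal AdaGrad update above can be sketched as a single step that maintains a running sum of squared subgradients per parameter. The small `eps` term is an added assumption to avoid division by zero on the first step; the formula above omits it:

```python
import numpy as np

def adagrad_step(theta, hist, grad, alpha=0.01, eps=1e-8):
    """One diagonal-AdaGrad update: each parameter's step size is the
    initial rate alpha divided by the root of its accumulated squared
    (sub)gradients. Returns the updated parameters and accumulator."""
    hist = hist + grad ** 2                      # accumulate g_{tau,i}^2
    theta = theta - alpha / np.sqrt(hist + eps) * grad
    return theta, hist
```

Parameters that receive frequent large subgradients thus get their effective learning rate shrunk, while rarely-updated parameters keep steps close to $\alpha$.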
