Training CWS

The max-margin criterion is used to train the model.

The max-margin method generally outperforms both likelihood-based and perceptron-based methods.

For a given character sequence $x^{(i)}$, denote the correct segmented sentence for $x^{(i)}$ as $y^{(i)}$.

A structured margin loss $\Delta(y^{(i)},\hat{y})$ is defined for predicting a segmented sentence $\hat{y}$:

$$\Delta(y^{(i)},\hat{y})=\sum_{t=1}^{m}\mu\,\mathbf{1}\{y_t^{(i)}\neq\hat{y}_t\}$$

where $m$ is the length of sequence $x^{(i)}$ and $\mu$ is the discount parameter. The margin loss can be computed by counting the number of incorrectly segmented characters and multiplying that count by a fixed discount parameter for smoothing.

Therefore, the loss is proportional to the number of incorrectly segmented characters.

Given a training set $\Omega$, the regularized objective function is the loss function $J(\theta)$ including an $\ell_2$-norm term:

$$J(\theta)=\frac{1}{|\Omega|}\sum_{(x^{(i)},y^{(i)})\in\Omega}\ell_i(\theta)+\frac{\lambda}{2}\Vert\theta\Vert_2^2$$

where

$$\ell_i(\theta)=\max_{\hat{y}\in\mathrm{GEN}(x^{(i)})}\left(s(\hat{y},\theta)+\Delta(y^{(i)},\hat{y})-s(y^{(i)},\theta)\right)$$

where the function $s(\cdot)$ is the sentence score, defined as

$$s(y_{[1:n]},\theta)=\sum_{t=1}^{n}\left(\bar{u}\cdot\bar{y}_t+\bar{p}_t\cdot\bar{y}_t\right)$$

and $\theta$ is the set of model parameters.
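The hinge loss $\ell_i$ can be illustrated as follows, under the simplifying assumptions that tags are one-hot vectors, that $\mathrm{GEN}(x^{(i)})$ is given as an explicit list of candidate tag sequences, and that $\bar{u}$ and the per-position scores $\bar{p}_t$ are precomputed; all names and values here are hypothetical:

```python
import numpy as np

def sentence_score(tags, u, P):
    """s(y, theta): at each position t, the tag score u . y_t plus the
    per-position network score p_t . y_t, summed over the sentence."""
    return sum(u @ y + p @ y for y, p in zip(tags, P))

def hinge_loss(y_true, candidates, u, P, mu=0.2):
    """l_i(theta): the margin-augmented score of the best candidate
    minus the score of the gold segmentation (a sketch, not the
    efficient max over GEN used in practice)."""
    gold = sentence_score(y_true, u, P)
    best = max(
        sentence_score(y_hat, u, P)
        + mu * sum(not np.array_equal(a, b) for a, b in zip(y_true, y_hat))
        for y_hat in candidates
    )
    return best - gold
```

If the gold sequence is among the candidates, the loss is non-negative, and it is zero only when the gold sequence beats every other candidate by at least its margin.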

Because of the hinge loss, the objective function is not differentiable, so a subgradient method is used to compute a gradient-like direction.

The diagonal variant of AdaGrad is used with minibatches to minimize the objective.

The update for the $i$-th parameter at time step $t$ is

$$\theta_{t,i}=\theta_{t-1,i}-\frac{\alpha}{\sqrt{\sum_{\tau=1}^{t} g_{\tau,i}^2}}\,g_{t,i}$$

where $\alpha$ is the initial learning rate and $g_{\tau,i}\in\mathbb{R}^{|\theta_i|}$ is the subgradient at time step $\tau$ for parameter $\theta_i$.
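The diagonal AdaGrad update above can be sketched as a single step that maintains a running sum of squared subgradients per parameter. The small `eps` term is an added assumption to avoid division by zero on the first step; the formula above omits it:

```python
import numpy as np

def adagrad_step(theta, hist, grad, alpha=0.01, eps=1e-8):
    """One diagonal-AdaGrad update: each parameter's step size is the
    initial rate alpha divided by the root of its accumulated squared
    (sub)gradients. Returns the updated parameters and accumulator."""
    hist = hist + grad ** 2                      # accumulate g_{tau,i}^2
    theta = theta - alpha / np.sqrt(hist + eps) * grad
    return theta, hist
```

Parameters that receive frequent large subgradients thus get their effective learning rate shrunk, while rarely-updated parameters keep steps close to $\alpha$.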
