Training CWS
The max-margin criterion is used to train the model.
The margin method generally outperforms both likelihood and perceptron methods.
For a given character sequence $x^{(i)}$, denote the correct segmented sentence for $x^{(i)}$ as $y^{(i)}$.
A structured margin loss $\Delta(y^{(i)}, \hat{y})$ is defined for predicting a segmented sentence $\hat{y}$:
$$\Delta(y^{(i)}, \hat{y}) = \sum_{t=1}^{m} \mu \cdot \mathbf{1}\{\, y^{(i)}_t \neq \hat{y}_t \,\}$$
where $m$ is the length of the sequence $x^{(i)}$ and $\mu$ is a discount parameter. The margin loss counts the number of incorrectly segmented characters and multiplies that count by a fixed discount parameter for smoothing; the loss is therefore proportional to the number of incorrectly segmented characters.
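As a minimal sketch, the margin loss can be computed by comparing per-character segmentation tags; the BMES tag set and the value $\mu = 0.2$ here are illustrative assumptions, not values from the text:

```python
def margin_loss(y_gold, y_pred, mu=0.2):
    """Structured margin loss: mu times the number of characters
    whose segmentation tag is wrong. mu = 0.2 is an illustrative
    discount value, not taken from the text."""
    assert len(y_gold) == len(y_pred)
    return mu * sum(g != p for g, p in zip(y_gold, y_pred))

# Two of five characters receive the wrong tag: loss = 2 * 0.2 = 0.4
print(margin_loss(list("BMESS"), list("BMEBE")))
```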
Given a training set $\Omega$, the regularized objective function is the loss function $J(\theta)$ including an $\ell_2$-norm term:
$$J(\theta) = \frac{1}{|\Omega|} \sum_{(x^{(i)}, y^{(i)}) \in \Omega} \ell_i(\theta) + \frac{\lambda}{2} \|\theta\|_2^2$$
where
$$\ell_i(\theta) = \max_{\hat{y} \in \mathrm{GEN}(x^{(i)})} \left( s(\hat{y}, \theta) + \Delta(y^{(i)}, \hat{y}) - s(y^{(i)}, \theta) \right)$$
where the function $s(\cdot)$ is the sentence score, defined as
$$s(y_{[1:n]}, \theta) = \sum_{t=1}^{n} \left( \bar{u} \cdot \bar{y}_t + \bar{p}_t \cdot \bar{y}_t \right)$$
and $\theta$ is the set of model parameters.
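A sketch of the score and hinge loss follows, under the assumptions that $\bar{y}_t$ is a one-hot tag indicator, $\bar{p}_t$ is the network's tag-score vector at position $t$, and $\bar{u}$ is a shared tag-score vector (these roles are assumptions), with GEN given as an explicit candidate list:

```python
import numpy as np

def sentence_score(y, p, u):
    """s(y, theta) = sum_t (u . y_t + p_t . y_t).
    y: (n, K) one-hot tag indicators, p: (n, K) per-position
    tag scores, u: (K,) shared tag-score vector."""
    return float(np.sum(y * (p + u)))

def hinge_loss(y_gold, candidates, p, u, mu=0.2):
    """l_i(theta): max over candidate segmentations of score plus
    margin loss, minus the gold score. Nonnegative whenever the
    gold segmentation is among the candidates."""
    gold = sentence_score(y_gold, p, u)
    delta = lambda y: mu * np.sum(y.argmax(1) != y_gold.argmax(1))
    return max(sentence_score(y, p, u) + delta(y) for y in candidates) - gold
```

When the network scores the gold tags highest, the max is attained at the gold candidate and the loss is zero; otherwise the margin term pushes the gold score to beat the best wrong candidate by at least $\Delta$.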
Because of the hinge loss, the objective function is not differentiable, so a subgradient method is used to compute a gradient-like direction.
The diagonal variant of AdaGrad is used with minibatches to minimize the objective.
The update for the $i$-th parameter at time step $t$ is
$$\theta_{t,i} = \theta_{t-1,i} - \frac{\alpha}{\sqrt{\sum_{\tau=1}^{t} g_{\tau,i}^2}}\; g_{t,i}$$
where $\alpha$ is the initial learning rate and $g_{\tau,i} \in \mathbb{R}^{|\theta_i|}$ is the subgradient at time step $\tau$ for parameter $\theta_i$.
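The diagonal AdaGrad step above can be sketched as follows; the small `eps` term is a common practical addition for numerical stability that does not appear in the formula, and `alpha = 0.1` is illustrative:

```python
import numpy as np

def adagrad_step(theta, g, hist, alpha=0.1, eps=1e-8):
    """One diagonal AdaGrad update: each coordinate's step size is
    alpha divided by the root of its accumulated squared
    subgradients (eps guards against division by zero)."""
    hist = hist + g ** 2
    theta = theta - alpha * g / (np.sqrt(hist) + eps)
    return theta, hist

# Minimize f(theta) = theta^2, whose gradient is 2 * theta.
theta, hist = np.array([1.0]), np.zeros(1)
for _ in range(50):
    theta, hist = adagrad_step(theta, 2 * theta, hist)
print(float(abs(theta[0])))  # shrinks toward 0
```

Because the accumulated squared subgradients only grow, each coordinate's effective learning rate decays over time, with frequently updated parameters decaying fastest.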