Overview of Word Segmentation Learning for Chinese

MM Chiou 0630/2018

The Chinese word segmentation (CWS) problem is formulated as finding a mapping from an input character sequence $c_{\textrm{sequence}}$ to a word sequence $y$, where the output sentence $y^*$ satisfies

$$y^* = \arg\max_{y \in \textrm{GEN}} \sum_{i=1}^{n} \textrm{score}(y_i \mid y_1, \cdots, y_{i-1})$$

where $n$ is the number of word candidates in $y$.

$\textrm{GEN}(x)$ denotes the set of possible segmentations of an input character sequence $x$.
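As an illustration of GEN, here is a minimal Python sketch (not from [0]; the function name and the cap on word length are assumptions) that enumerates every segmentation of a character sequence:

```python
from typing import Iterator, List

def gen_segmentations(chars: str, max_word_len: int = 4) -> Iterator[List[str]]:
    """Enumerate GEN(x): every way to split `chars` into words of
    length 1..max_word_len (the cap is a hypothetical choice)."""
    if not chars:
        yield []
        return
    for length in range(1, min(max_word_len, len(chars)) + 1):
        head = chars[:length]
        for rest in gen_segmentations(chars[length:], max_word_len):
            yield [head] + rest

# Even a 4-character input already yields 8 candidate segmentations,
# which is why exhaustive scoring quickly becomes impractical.
print(sum(1 for _ in gen_segmentations("ABCD")))  # 8
```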

The scoring function of this model is sensitive to the complete content of the partially segmented sentence.

Fig.1 Framework of this CWS model.

Fig.1 shows the framework of this CWS model.

A neural network scoring model is designed to evaluate the likelihood of a segmented sentence.

A decoder is developed to find the segmented sentence with the highest score.

A max-margin method is utilized to perform training by comparing the structured difference between the decoder output and the gold segmentation.
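The max-margin objective can be sketched as a structured hinge loss. The snippet below is only a schematic sketch: the margin being proportional to the number of wrong segmentation decisions and the discount `eta` are assumptions, not the exact formulation of [0].

```python
import torch

def max_margin_loss(score_gold: torch.Tensor,
                    score_pred: torch.Tensor,
                    num_wrong_words: int,
                    eta: float = 0.2) -> torch.Tensor:
    """Structured hinge loss: the gold segmentation should outscore the
    best decoder output by a margin that grows with the structural error."""
    margin = eta * num_wrong_words            # structured margin (assumed form)
    violation = score_pred + margin - score_gold
    return torch.clamp(violation, min=0.0)    # zero loss once the margin holds
```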

Neural Network Scoring Model

The score for a segmented sentence is computed by first mapping it into a sequence of word candidate vectors; the scoring model then takes this vector sequence as input and scores each word candidate from two perspectives:

(1) how likely the word candidate itself is to be recognized as a legal word;

(2) how reasonable it is for the word candidate to immediately follow the previous segmentation history.

After that, the word candidate is appended to the segmentation history, updating the state of the scoring system for subsequent judgements.

Fig.2 Architecture of neural network scoring model.

Fig.2 illustrates the entire scoring neural network.

$c_i$ denotes the $i$-th input character, i.e., $c_{\textrm{sequence}} = [c_1, c_2, \cdots]$.

$y_j$ denotes the learned representation of the $j$-th word candidate.

$p_k$ denotes the prediction for the $(k+1)$-th word candidate.

$u$ is the trainable parameter vector for scoring the likelihood of individual word candidates.

Word Score

Character Embedding

The scores are decided at the word level, but using word embeddings directly leads to a notable issue: rare words and out-of-vocabulary words are poorly estimated.

In addition, the character-level information inside an $n$-gram can be helpful for judging whether it is a true word.

Therefore, a lookup table of character embeddings is used as the bottom layer.

Formally, we have a character dictionary $D_c$ of size $|D_c|$.

Then each character $c_k \in D_c$ is represented as a real-valued vector (character embedding) $\bar{c}_k \in R^d$, where $d$ is the dimensionality of the vector space.

The character embeddings are then stacked into an embedding matrix $\bar{\bar{M}}_c \in R^{d \times |D_c|}$.

For a character $c_k \in D_c$, its character embedding $\bar{c}_k \in R^d$ is retrieved by the embedding layer according to its index $k$.
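A minimal embedding-layer sketch in PyTorch; the dictionary size, dimensionality, and character indexes are placeholders:

```python
import torch
import torch.nn as nn

d = 50             # embedding dimensionality (placeholder)
vocab_size = 5000  # |D_c|, size of the character dictionary (placeholder)

# Lookup table with weight of shape (|D_c|, d), i.e. the transpose of M_c above;
# row k holds the embedding of character c_k.
char_embedding = nn.Embedding(vocab_size, d)

# Retrieve embeddings for a character n-gram by the characters' indexes.
char_ids = torch.tensor([17, 42, 133])   # hypothetical indexes in D_c
char_vectors = char_embedding(char_ids)  # shape: (3, d)
```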

Gated Combination Neural Network (GCNN)

Fig.3 Gated combination neural network

Fig.3 shows the gated combination neural network (GCNN).

In order to obtain the word representation $\bar{w}$ from its characters $c_1, c_2, \cdots, c_L$, the character vectors $\bar{c}_1, \bar{c}_2, \cdots, \bar{c}_L$ are integrated into the word representation using a weight matrix $\bar{\bar{W}}^{(L)}$ that is shared across all words of the same length $L$, followed by a non-linear function $g(\cdot)$:

$$\bar{w} = g \left( \bar{\bar{W}}^{(L)} \left[ \begin{matrix} \bar{c}_1 \\ \vdots \\ \bar{c}_L \end{matrix} \right] \right)$$

where $\bar{c}_1, \bar{c}_2, \cdots, \bar{c}_L$ are $d$-dimensional character vector representations.

The corresponding word vector $\bar{w}$ is $d$-dimensional as well.

$\bar{\bar{W}}^{(L)} \in R^{d \times Ld}$, and $g$ is a non-linear function as mentioned above.
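A direct sketch of this ungated composition in PyTorch for one fixed word length $L$; the dimensions and the choice $g = \tanh$ are assumptions:

```python
import torch
import torch.nn as nn

d, L = 50, 2                           # dimensionality and word length (placeholders)
W_L = nn.Linear(L * d, d, bias=False)  # W^(L), shared by all words of length L

def compose_word(char_vectors: torch.Tensor) -> torch.Tensor:
    """char_vectors: (L, d) stacked character embeddings -> (d,) word vector."""
    concat = char_vectors.reshape(L * d)  # [c_1; ...; c_L]
    return torch.tanh(W_L(concat))        # g chosen as tanh here
```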

This simple mechanism, however, cannot sufficiently model the complicated combination features.

Since a gated structure in a neural network can be useful for feature extraction [1], a gated combination neural network (GCNN) containing two types of gates, reset gates and update gates, was proposed in [0] for character compositionality.

Intuitively, the reset gates decide which parts of the character vectors should be mixed, while the update gates decide what to preserve when combining the character information.

Concretely, for words of length $L$, the word vector $\bar{w} \in R^d$ is computed as

$$\bar{w} = \bar{z}_N \odot \hat{w} + \sum_{i=1}^{L} \bar{z}_i \odot \bar{c}_i$$

where $\bar{z}_N$ and $\bar{z}_i$ ($1 \leq i \leq L$) are update gates for the new activation $\hat{w}$ and the governed characters respectively, and $\odot$ indicates element-wise multiplication.

The new activation $\hat{w}$ is computed as

$$\hat{w} = \textrm{tanh} \left( \bar{\bar{W}}^{(L)} \left[ \begin{matrix} \bar{r}_1 \odot \bar{c}_1 \\ \vdots \\ \bar{r}_L \odot \bar{c}_L \end{matrix} \right] \right)$$

where $\bar{\bar{W}}^{(L)} \in R^{d \times Ld}$ and $\bar{r}_i \in R^d$ ($1 \leq i \leq L$) are the reset gates for the governed characters, which can be formalized as

$$\left[ \begin{matrix} \bar{r}_1 \\ \vdots \\ \bar{r}_L \end{matrix} \right] = \sigma \left( \bar{\bar{R}}^{(L)} \left[ \begin{matrix} \bar{c}_1 \\ \vdots \\ \bar{c}_L \end{matrix} \right] \right)$$

where $\bar{\bar{R}}^{(L)} \in R^{Ld \times Ld}$ is the coefficient matrix of the reset gates and $\sigma$ denotes the sigmoid function.

The update gates can be formalized as

$$\left[ \begin{matrix} \bar{z}_N \\ \bar{z}_1 \\ \vdots \\ \bar{z}_L \end{matrix} \right] = \textrm{exp} \left( \bar{\bar{U}}^{(L)} \left[ \begin{matrix} \hat{w} \\ \bar{c}_1 \\ \vdots \\ \bar{c}_L \end{matrix} \right] \right) \odot \left[ \begin{matrix} 1/\bar{Z} \\ 1/\bar{Z} \\ \vdots \\ 1/\bar{Z} \end{matrix} \right]$$

where $\bar{\bar{U}}^{(L)} \in R^{(L+1)d \times (L+1)d}$ is the coefficient matrix of the update gates, and $\bar{Z} \in R^d$ is the normalization vector with

$$Z_k = \sum_{i=0}^{L} \left[ \textrm{exp} \left( \bar{\bar{U}}^{(L)} \left[ \begin{matrix} \hat{w} \\ \bar{c}_1 \\ \vdots \\ \bar{c}_L \end{matrix} \right] \right) \right]_{d \times i + k}$$

where $k = 0, 1, \cdots, d-1$.

According to the normalization condition, the update gates are constrained by

$$\bar{z}_N + \sum_{i=1}^{L} \bar{z}_i = 1$$

The gated mechanism is capable of capturing both character and character interaction characteristics to give an efficient word representation.
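A PyTorch sketch of a GCNN module for one word length, following the equations above; the initialization, the absence of bias terms, and the use of a per-dimension softmax over the $L+1$ gate groups to satisfy the normalization constraint are assumptions:

```python
import torch
import torch.nn as nn

class GCNN(nn.Module):
    """Gated combination network sketch for words of length L."""

    def __init__(self, d: int, L: int):
        super().__init__()
        self.d, self.L = d, L
        self.W = nn.Linear(L * d, d, bias=False)                  # W^(L)
        self.R = nn.Linear(L * d, L * d, bias=False)              # R^(L): reset gates
        self.U = nn.Linear((L + 1) * d, (L + 1) * d, bias=False)  # U^(L): update gates

    def forward(self, chars: torch.Tensor) -> torch.Tensor:
        """chars: (L, d) character embeddings -> (d,) word vector."""
        d, L = self.d, self.L
        r = torch.sigmoid(self.R(chars.reshape(L * d))).reshape(L, d)  # reset gates
        w_hat = torch.tanh(self.W((r * chars).reshape(L * d)))         # new activation
        stacked = torch.cat([w_hat.unsqueeze(0), chars], dim=0)        # [w_hat; c_1..c_L]
        gates = self.U(stacked.reshape((L + 1) * d)).reshape(L + 1, d)
        z = torch.softmax(gates, dim=0)     # per-dimension: z_N + sum_i z_i = 1
        return (z * stacked).sum(dim=0)     # w = z_N*w_hat + sum_i z_i*c_i
```

For example, `GCNN(d=50, L=2)(char_vectors)` would produce the vector for a two-character word; in practice one such weight set is kept per word length $L$.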

Word Score

Denote the learned vector representations for a segmented sentence as $\bar{\bar{y}} = [\bar{y}_1, \bar{y}_2, \cdots, \bar{y}_n]$, where $n$ is the number of word candidates in the sentence.

The word score is computed as the dot product of the vector $\bar{y}_i$ ($1 \leq i \leq n$) and a trainable parameter vector $\bar{u} \in R^d$:

$$\textrm{Word Score}(\bar{y}_i) = \bar{u} \cdot \bar{y}_i$$

It indicates how likely a word candidate by itself is to be a true word.
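A one-line sketch of the word score in PyTorch, under the same placeholder dimensionality as the earlier snippets (the variable names are assumptions):

```python
import torch

d = 50                                   # word-vector dimensionality (placeholder)
u = torch.randn(d, requires_grad=True)   # trainable parameter vector u

def word_score(y_i: torch.Tensor) -> torch.Tensor:
    """Dot product between a word-candidate vector of shape (d,) and u."""
    return torch.dot(u, y_i)
```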

Link Score

Fig.4 Link scores (dashed lines)

Fig.4 shows a schematic of the link scores.

An LSTM system is used to capture the coherence in a segmented sentence.

The LSTM is an effective tool for sequence modeling tasks, using its hidden states to preserve history information.

The LSTM introduces a memory cell to preserve states over long periods of time, and controls the update of the hidden state and memory cell by three types of gates: the input gate, forget gate, and output gate.

Concretely, each step of the LSTM takes inputs $\bar{x}_t$, $\bar{h}_{t-1}$, $\bar{c}_{t-1}$ and produces $\bar{h}_t$, $\bar{c}_t$ as

$$\bar{i}_t = \sigma ( \bar{\bar{W}}^i \bar{x}_t + \bar{\bar{U}}^i \bar{h}_{t-1} + \bar{b}^i)$$

$$\bar{f}_t = \sigma ( \bar{\bar{W}}^f \bar{x}_t + \bar{\bar{U}}^f \bar{h}_{t-1} + \bar{b}^f)$$

$$\bar{o}_t = \sigma ( \bar{\bar{W}}^o \bar{x}_t + \bar{\bar{U}}^o \bar{h}_{t-1} + \bar{b}^o)$$

$$\hat{c}_t = \textrm{tanh} ( \bar{\bar{W}}^c \bar{x}_t + \bar{\bar{U}}^c \bar{h}_{t-1} + \bar{b}^c)$$

$$\bar{c}_t = \bar{f}_t \odot \bar{c}_{t-1} + \bar{i}_t \odot \hat{c}_t$$

$$\bar{h}_t = \bar{o}_t \odot \textrm{tanh}(\bar{c}_t)$$

where $\sigma$ is the element-wise sigmoid function and $\odot$ is element-wise multiplication.

$\bar{i}_t$ is the input gate vector at time $t$.

$\bar{f}_t$ is the forget gate vector at time $t$.

$\bar{o}_t$ is the output gate vector at time $t$.

$\bar{c}_t$ is the memory cell vector at time $t$.

All of these have the same size as the hidden state vector $\bar{h}_t \in R^H$.

LSTMs have been shown to outperform RNNs on many NLP tasks, notably language modeling [2].

In this work, an LSTM is used to chain the word candidates in a left-to-right, incremental manner.

At time step $t$, a prediction $\bar{p}_{t+1} \in R^d$ about the next word $\bar{y}_{t+1}$ is made based on the hidden state $\bar{h}_t$,

$$\bar{p}_{t+1} = \textrm{tanh} ( \bar{\bar{W}}^p \bar{h}_t + \bar{b}^p )$$

The link score for the next word $\bar{y}_{t+1}$ is computed as

$$\textrm{Link Score}(\bar{y}_{t+1}) = \bar{p}_{t+1} \cdot \bar{y}_{t+1}$$

Due to the LSTM structure, the prediction vector p¯t+1\bar{p}_{t+1} carries information detected from the entire segmentation history, including previous segmentation decisions.

In this way, this model gains the ability of sequence-level discrimination rather than local optimization.
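A sketch of the link-score computation with a PyTorch `LSTMCell`; the hidden size, the use of a single `nn.Linear` for $\bar{\bar{W}}^p$, and the helper names are assumptions:

```python
import torch
import torch.nn as nn

d, H = 50, 100            # word-vector and hidden-state sizes (placeholders)
lstm = nn.LSTMCell(d, H)  # consumes one word-candidate vector per step
W_p = nn.Linear(H, d)     # maps hidden state h_t to prediction p_{t+1}

def step(y_t, h_prev, c_prev):
    """Append word candidate y_t (shape (1, d)) to the segmentation history,
    returning the prediction for the next word and the new LSTM state."""
    h_t, c_t = lstm(y_t, (h_prev, c_prev))
    p_next = torch.tanh(W_p(h_t))   # prediction for the next word
    return p_next, h_t, c_t

def link_score(p_next, y_next):
    """Dot product between the prediction and the next word-candidate vector."""
    return (p_next * y_next).sum()
```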

Sentence Score

The sentence score for a segmented sentence $y$ with $n$ word candidates is computed by summing up the word scores and link scores:

$$s(y_{[1:n]}, \theta) = \sum_{t=1}^{n} (\bar{u} \cdot \bar{y}_t + \bar{p}_t \cdot \bar{y}_t)$$

where $\theta$ is the parameter set.
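Putting the pieces together, a minimal sketch of the sentence score for a list of word-candidate vectors, assuming the hypothetical `word_score`, `step`, `link_score`, `W_p`, and `H` definitions from the snippets above are in scope:

```python
import torch

def sentence_score(word_vectors):
    """word_vectors: list of (1, d) tensors for the word candidates y_1..y_n."""
    h = torch.zeros(1, H)       # initial hidden state h_0
    c = torch.zeros(1, H)       # initial memory cell c_0
    p = torch.tanh(W_p(h))      # prediction p_1 from the empty history
    total = torch.zeros(())
    for y in word_vectors:
        total = total + word_score(y.squeeze(0)) + link_score(p, y)
        p, h, c = step(y, h, c)  # append y to the history, predict the next word
    return total
```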

Decoding

The total number of possible segmented sentences grows exponentially with the length of the character sequence, which makes it impractical to compute the scores of every possible segmentation.

In order to get exact inference, most sequence-labeling systems address this problem with a Viterbi search, which takes advantage of the fact that their tag interactions exist only within adjacent characters (the Markov assumption).

However, since this model is intended to capture the complete history of segmentation decisions, such dynamic programming algorithms cannot be adopted here.

Beam Search Algorithm

A beam-search algorithm with dynamic-programming motivations was proposed in [0]; it is efficient and practical to use.

The main idea is that any segmentation of the first $i$ characters can be separated into two parts: the first part, denoted as $y$, consists of the characters with indexes from $0$ to $j$; the remaining part is the word composed of $c[j+1:i]$.

The influence from the previous segmentation $y$ can be represented as a triple $(y.\textrm{score}, y.\bar{h}, y.\bar{c})$, where $y.\textrm{score}$ indicates the current score, $y.\bar{h}$ the current hidden state vector, and $y.\bar{c}$ the current memory cell vector.

Beam search ensures that the total time for segmenting a sentence of $n$ characters is $w \times k \times n$, where $w$ is the maximum word length and $k$ is the maximum beam size.

Input: model parameters $\theta$, beam size $k$, maximum word length $w$, input character sequence $c[1:n]$.

Output: approx. $k$ best segmentations

1: $\pi[0] \leftarrow \{ (\textrm{score}=0, \bar{h} = \bar{h}_0, \bar{c} = \bar{c}_0) \}$

2: ...
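Based on the description above, a hedged Python sketch of the beam search. The pruning strategy, the data layout, and the `embed_word` helper (which would combine the character embeddings and a composition network such as the GCNN into a `(1, d)` word vector) are assumptions; `word_score`, `link_score`, `step`, `W_p`, and `H` are the hypothetical helpers from the earlier snippets.

```python
import torch

def beam_search(chars, k=8, w=4):
    """chars: list of character ids. Returns approx. k best segmentations.
    Each beam item is (score, words, h, c, p): accumulated score, the words
    segmented so far, the LSTM state, and the prediction for the next word."""
    h0, c0 = torch.zeros(1, H), torch.zeros(1, H)
    p0 = torch.tanh(W_p(h0))                  # prediction from the empty history
    # pi[i] holds the k best partial segmentations of the first i characters.
    pi = [[] for _ in range(len(chars) + 1)]
    pi[0] = [(torch.zeros(()), [], h0, c0, p0)]
    for i in range(1, len(chars) + 1):
        candidates = []
        for j in range(max(0, i - w), i):     # candidate word c[j+1:i], length <= w
            word = chars[j:i]
            y = embed_word(word)              # hypothetical helper -> (1, d) vector
            for score, words, h, c, p in pi[j]:
                new_score = score + word_score(y.squeeze(0)) + link_score(p, y)
                p2, h2, c2 = step(y, h, c)
                candidates.append((new_score, words + [word], h2, c2, p2))
        candidates.sort(key=lambda item: item[0].item(), reverse=True)
        pi[i] = candidates[:k]                # beam pruning
    return pi[len(chars)]
```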

[0]

Deng Cai and Hai Zhao. 2016. Neural word segmentation learning for Chinese. arXiv, cs.CL.

[1]

Xinchi Chen, Xipeng Qiu, Chenxi Zhu, and Xuanjing Huang. 2015a. Gated recursive neural network for Chinese word segmentation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pages 1744–1753.

[2]

Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. 2012. LSTM neural networks for language modeling. In 13th Annual Conference of the International Speech Communication Association.
