Overview of Word Segmentation Learning for Chinese

MM Chiou 0630/2018

The Chinese word segmentation (CWS) problem is formulated as finding a mapping from an input character sequence $c_{\textrm{sequence}}$ to a word sequence $y$, where the output sentence $y^*$ satisfies

$$y^* = \arg\max_{y \in \textrm{GEN}} \sum_{i=1}^{n} \textrm{score}(y_i \mid y_1, \cdots, y_{i-1})$$

where $n$ is the number of word candidates in $y$.

$\textrm{GEN}(x)$ denotes the set of possible segmentations of an input character sequence $x$.
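As an illustration of GEN, here is a minimal Python sketch (not from [0]; the function name and the cap on word length are assumptions) that enumerates every segmentation of a character sequence:

```python
from typing import Iterator, List

def gen_segmentations(chars: str, max_word_len: int = 4) -> Iterator[List[str]]:
    """Enumerate GEN(x): every way to split `chars` into words of
    length 1..max_word_len (the cap is a hypothetical choice)."""
    if not chars:
        yield []
        return
    for length in range(1, min(max_word_len, len(chars)) + 1):
        head = chars[:length]
        for rest in gen_segmentations(chars[length:], max_word_len):
            yield [head] + rest

# Even a 4-character input already yields 8 candidate segmentations,
# which is why exhaustive scoring quickly becomes impractical.
print(sum(1 for _ in gen_segmentations("ABCD")))  # 8
```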

The scoring function of this model is sensitive to the complete content of the partially segmented sentence.

Fig.1 Framework of this CWS model.

Fig.1 shows the framework of this CWS model.

A neural network scoring model is designed to evaluate the likelihood of a segmented sentence.

A decoder is developed to find the segmented sentence with the highest score.

A max-margin method is utilized to perform training by comparing the structured difference between the decoder output and the gold segmentation.
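The max-margin objective can be sketched as a structured hinge loss. The snippet below is only a schematic sketch: the margin being proportional to the number of wrong segmentation decisions and the discount `eta` are assumptions, not the exact formulation of [0].

```python
import torch

def max_margin_loss(score_gold: torch.Tensor,
                    score_pred: torch.Tensor,
                    num_wrong_words: int,
                    eta: float = 0.2) -> torch.Tensor:
    """Structured hinge loss: the gold segmentation should outscore the
    best decoder output by a margin that grows with the structural error."""
    margin = eta * num_wrong_words            # structured margin (assumed form)
    violation = score_pred + margin - score_gold
    return torch.clamp(violation, min=0.0)    # zero loss once the margin holds
```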

Neural Network Scoring Model

The score for a segmented sentence is computed by first mapping it into a sequence of word candidate vectors; the scoring model then takes this vector sequence as input and scores each word candidate from two perspectives:

(1) how likely the word candidate itself is to be recognized as a legal word;

(2) how reasonable it is for the word candidate to immediately follow the previous segmentation history.

After that, the word candidate is appended to the segmentation history, updating the state of the scoring system for subsequent judgements.

Fig.2 Architecture of neural network scoring model.

Fig.2 illustrates the entire scoring neural network.

$c_i$ denotes the $i$-th input character, i.e., $c_{\textrm{sequence}} = [c_1, c_2, \cdots]$.

$y_j$ denotes the learned representation of the $j$-th word candidate.

$p_k$ denotes the prediction for the $(k+1)$-th word candidate.

$u$ is the trainable parameter vector for scoring the likelihood of individual word candidates.

Word Score

Character Embedding

The scores are decided at the word level, but using word embeddings directly leads to a notable issue: rare words and out-of-vocabulary words are poorly estimated.

In addition, the character-level information inside an $n$-gram can be helpful for judging whether it is a true word.

Therefore, a lookup table of character embeddings is used as the bottom layer.

Formally, we have a character dictionary $D_c$ of size $|D_c|$.

Then each character $c_k \in D_c$ is represented as a real-valued vector (character embedding) $\bar{c}_k \in R^d$, where $d$ is the dimensionality of the vector space.

The character embeddings are then stacked into an embedding matrix $\bar{\bar{M}}_c \in R^{d \times |D_c|}$.

For a character $c_k \in D_c$, its character embedding $\bar{c}_k \in R^d$ is retrieved by the embedding layer according to its index $k$.
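A minimal embedding-layer sketch in PyTorch; the dictionary size, dimensionality, and character indexes are placeholders:

```python
import torch
import torch.nn as nn

d = 50             # embedding dimensionality (placeholder)
vocab_size = 5000  # |D_c|, size of the character dictionary (placeholder)

# Lookup table with weight of shape (|D_c|, d), i.e. the transpose of M_c above;
# row k holds the embedding of character c_k.
char_embedding = nn.Embedding(vocab_size, d)

# Retrieve embeddings for a character n-gram by the characters' indexes.
char_ids = torch.tensor([17, 42, 133])   # hypothetical indexes in D_c
char_vectors = char_embedding(char_ids)  # shape: (3, d)
```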

Gated Combination Neural Network (GCNN)

Fig.3 Gated combination neural network

Fig.3 shows the gated combination neural network (GCNN).

In order to obtain the word representation $\bar{w}$ from its characters $c_1, c_2, \cdots, c_L$, the character vectors $\bar{c}_1, \bar{c}_2, \cdots, \bar{c}_L$ are integrated into the word representation using a weight matrix $\bar{\bar{W}}^{(L)}$ that is shared across all words of the same length $L$, followed by a non-linear function $g(\cdot)$:

$$\bar{w} = g \left( \bar{\bar{W}}^{(L)} \left[ \begin{matrix} \bar{c}_1 \\ \vdots \\ \bar{c}_L \end{matrix} \right] \right)$$

where $\bar{c}_1, \bar{c}_2, \cdots, \bar{c}_L$ are $d$-dimensional character vector representations.

The corresponding word vector $\bar{w}$ is $d$-dimensional as well.

$\bar{\bar{W}}^{(L)} \in R^{d \times Ld}$, and $g$ is a non-linear function as mentioned above.
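A direct sketch of this ungated composition in PyTorch for one fixed word length $L$; the dimensions and the choice $g = \tanh$ are assumptions:

```python
import torch
import torch.nn as nn

d, L = 50, 2                           # dimensionality and word length (placeholders)
W_L = nn.Linear(L * d, d, bias=False)  # W^(L), shared by all words of length L

def compose_word(char_vectors: torch.Tensor) -> torch.Tensor:
    """char_vectors: (L, d) stacked character embeddings -> (d,) word vector."""
    concat = char_vectors.reshape(L * d)  # [c_1; ...; c_L]
    return torch.tanh(W_L(concat))        # g chosen as tanh here
```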

This simple mechanism, however, cannot sufficiently model the complicated combination features.

Since a gated structure in a neural network can be useful for feature extraction [1], a gated combination neural network (GCNN) containing two types of gates, reset gates and update gates, was proposed in [0] for character compositionality.

Intuitively, the reset gates decide which parts of the character vectors should be mixed, while the update gates decide what to preserve when combining the character information.

Concretely, for words of length $L$, the word vector $\bar{w} \in R^d$ is computed as

$$\bar{w} = \bar{z}_N \odot \hat{w} + \sum_{i=1}^{L} \bar{z}_i \odot \bar{c}_i$$

where $\bar{z}_N$ and $\bar{z}_i$ ($1 \leq i \leq L$) are update gates for the new activation $\hat{w}$ and the governed characters respectively, and $\odot$ indicates element-wise multiplication.

The new activation $\hat{w}$ is computed as

$$\hat{w} = \textrm{tanh} \left( \bar{\bar{W}}^{(L)} \left[ \begin{matrix} \bar{r}_1 \odot \bar{c}_1 \\ \vdots \\ \bar{r}_L \odot \bar{c}_L \end{matrix} \right] \right)$$

where $\bar{\bar{W}}^{(L)} \in R^{d \times Ld}$ and $\bar{r}_i \in R^d$ ($1 \leq i \leq L$) are the reset gates for the governed characters, which can be formalized as

$$\left[ \begin{matrix} \bar{r}_1 \\ \vdots \\ \bar{r}_L \end{matrix} \right] = \sigma \left( \bar{\bar{R}}^{(L)} \left[ \begin{matrix} \bar{c}_1 \\ \vdots \\ \bar{c}_L \end{matrix} \right] \right)$$

where $\bar{\bar{R}}^{(L)} \in R^{Ld \times Ld}$ is the coefficient matrix of the reset gates and $\sigma$ denotes the sigmoid function.

The update gates can be formalized as

$$\left[ \begin{matrix} \bar{z}_N \\ \bar{z}_1 \\ \vdots \\ \bar{z}_L \end{matrix} \right] = \textrm{exp} \left( \bar{\bar{U}}^{(L)} \left[ \begin{matrix} \hat{w} \\ \bar{c}_1 \\ \vdots \\ \bar{c}_L \end{matrix} \right] \right) \odot \left[ \begin{matrix} 1/\bar{Z} \\ 1/\bar{Z} \\ \vdots \\ 1/\bar{Z} \end{matrix} \right]$$

where $\bar{\bar{U}}^{(L)} \in R^{(L+1)d \times (L+1)d}$ is the coefficient matrix of the update gates, and $\bar{Z} \in R^d$ is the normalization vector with

$$Z_k = \sum_{i=0}^{L} \left[ \textrm{exp} \left( \bar{\bar{U}}^{(L)} \left[ \begin{matrix} \hat{w} \\ \bar{c}_1 \\ \vdots \\ \bar{c}_L \end{matrix} \right] \right) \right]_{d \times i + k}$$

where $k = 0, 1, \cdots, d-1$.

According to the normalization condition, the update gates are constrained by

$$\bar{z}_N + \sum_{i=1}^{L} \bar{z}_i = 1$$

The gated mechanism is capable of capturing both character and character interaction characteristics to give an efficient word representation.
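A PyTorch sketch of a GCNN module for one word length, following the equations above; the initialization, the absence of bias terms, and the use of a per-dimension softmax over the $L+1$ gate groups to satisfy the normalization constraint are assumptions:

```python
import torch
import torch.nn as nn

class GCNN(nn.Module):
    """Gated combination network sketch for words of length L."""

    def __init__(self, d: int, L: int):
        super().__init__()
        self.d, self.L = d, L
        self.W = nn.Linear(L * d, d, bias=False)                  # W^(L)
        self.R = nn.Linear(L * d, L * d, bias=False)              # R^(L): reset gates
        self.U = nn.Linear((L + 1) * d, (L + 1) * d, bias=False)  # U^(L): update gates

    def forward(self, chars: torch.Tensor) -> torch.Tensor:
        """chars: (L, d) character embeddings -> (d,) word vector."""
        d, L = self.d, self.L
        r = torch.sigmoid(self.R(chars.reshape(L * d))).reshape(L, d)  # reset gates
        w_hat = torch.tanh(self.W((r * chars).reshape(L * d)))         # new activation
        stacked = torch.cat([w_hat.unsqueeze(0), chars], dim=0)        # [w_hat; c_1..c_L]
        gates = self.U(stacked.reshape((L + 1) * d)).reshape(L + 1, d)
        z = torch.softmax(gates, dim=0)     # per-dimension: z_N + sum_i z_i = 1
        return (z * stacked).sum(dim=0)     # w = z_N*w_hat + sum_i z_i*c_i
```

For example, `GCNN(d=50, L=2)(char_vectors)` would produce the vector for a two-character word; in practice one such weight set is kept per word length $L$.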

Word Score

Denote the learned vector representations for a segmented sentence as $\bar{\bar{y}} = [\bar{y}_1, \bar{y}_2, \cdots, \bar{y}_n]$, where $n$ is the number of word candidates in the sentence.

The word score is computed as the dot product of the vector $\bar{y}_i$ ($1 \leq i \leq n$) and a trainable parameter vector $\bar{u} \in R^d$:

$$\textrm{Word Score}(\bar{y}_i) = \bar{u} \cdot \bar{y}_i$$

It indicates how likely a word candidate by itself is to be a true word.
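A one-line sketch of the word score in PyTorch, under the same placeholder dimensionality as the earlier snippets (the variable names are assumptions):

```python
import torch

d = 50                                   # word-vector dimensionality (placeholder)
u = torch.randn(d, requires_grad=True)   # trainable parameter vector u

def word_score(y_i: torch.Tensor) -> torch.Tensor:
    """Dot product between a word-candidate vector of shape (d,) and u."""
    return torch.dot(u, y_i)
```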

Link Score

Fig.4 Link scores (dashed lines)

Fig.4 shows a schematic of the link scores.

An LSTM system is used to capture the coherence in a segmented sentence.

The LSTM is an effective tool for sequence modeling tasks, using its hidden states to preserve history information.

The LSTM introduces a memory cell to preserve states over long periods of time, and controls the update of the hidden state and memory cell by three types of gates: the input gate, forget gate, and output gate.

Concretely, each step of the LSTM takes inputs $\bar{x}_t$, $\bar{h}_{t-1}$, $\bar{c}_{t-1}$ and produces $\bar{h}_t$, $\bar{c}_t$ as

$$\bar{i}_t = \sigma ( \bar{\bar{W}}^i \bar{x}_t + \bar{\bar{U}}^i \bar{h}_{t-1} + \bar{b}^i)$$

$$\bar{f}_t = \sigma ( \bar{\bar{W}}^f \bar{x}_t + \bar{\bar{U}}^f \bar{h}_{t-1} + \bar{b}^f)$$

$$\bar{o}_t = \sigma ( \bar{\bar{W}}^o \bar{x}_t + \bar{\bar{U}}^o \bar{h}_{t-1} + \bar{b}^o)$$

$$\hat{c}_t = \textrm{tanh} ( \bar{\bar{W}}^c \bar{x}_t + \bar{\bar{U}}^c \bar{h}_{t-1} + \bar{b}^c)$$

$$\bar{c}_t = \bar{f}_t \odot \bar{c}_{t-1} + \bar{i}_t \odot \hat{c}_t$$

$$\bar{h}_t = \bar{o}_t \odot \textrm{tanh}(\bar{c}_t)$$

where $\sigma$ is the element-wise sigmoid function and $\odot$ is element-wise multiplication.

$\bar{i}_t$ is the input gate vector at time $t$.

$\bar{f}_t$ is the forget gate vector at time $t$.

$\bar{o}_t$ is the output gate vector at time $t$.

$\bar{c}_t$ is the memory cell vector at time $t$.

All of these have the same size as the hidden state vector $\bar{h}_t \in R^H$.

LSTMs have been shown to outperform RNNs on many NLP tasks, notably language modeling [2].

In this work, an LSTM is used to chain the word candidates in a left-to-right, incremental manner.

At time step $t$, a prediction $\bar{p}_{t+1} \in R^d$ about the next word $\bar{y}_{t+1}$ is made based on the hidden state $\bar{h}_t$,

$$\bar{p}_{t+1} = \textrm{tanh} ( \bar{\bar{W}}^p \bar{h}_t + \bar{b}^p )$$

The link score for the next word $\bar{y}_{t+1}$ is computed as

$$\textrm{Link Score}(\bar{y}_{t+1}) = \bar{p}_{t+1} \cdot \bar{y}_{t+1}$$

Due to the LSTM structure, the prediction vector p¯t+1\bar{p}_{t+1} carries information detected from the entire segmentation history, including previous segmentation decisions.

In this way, this model gains the ability of sequence-level discrimination rather than local optimization.
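A sketch of the link-score computation with a PyTorch `LSTMCell`; the hidden size, the use of a single `nn.Linear` for $\bar{\bar{W}}^p$, and the helper names are assumptions:

```python
import torch
import torch.nn as nn

d, H = 50, 100            # word-vector and hidden-state sizes (placeholders)
lstm = nn.LSTMCell(d, H)  # consumes one word-candidate vector per step
W_p = nn.Linear(H, d)     # maps hidden state h_t to prediction p_{t+1}

def step(y_t, h_prev, c_prev):
    """Append word candidate y_t (shape (1, d)) to the segmentation history,
    returning the prediction for the next word and the new LSTM state."""
    h_t, c_t = lstm(y_t, (h_prev, c_prev))
    p_next = torch.tanh(W_p(h_t))   # prediction for the next word
    return p_next, h_t, c_t

def link_score(p_next, y_next):
    """Dot product between the prediction and the next word-candidate vector."""
    return (p_next * y_next).sum()
```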

Sentence Score

The sentence score for a segmented sentence $y$ with $n$ word candidates is computed by summing up the word scores and link scores:

$$s(y_{[1:n]}, \theta) = \sum_{t=1}^{n} (\bar{u} \cdot \bar{y}_t + \bar{p}_t \cdot \bar{y}_t)$$

where $\theta$ is the parameter set.
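Putting the pieces together, a minimal sketch of the sentence score for a list of word-candidate vectors, assuming the hypothetical `word_score`, `step`, `link_score`, `W_p`, and `H` definitions from the snippets above are in scope:

```python
import torch

def sentence_score(word_vectors):
    """word_vectors: list of (1, d) tensors for the word candidates y_1..y_n."""
    h = torch.zeros(1, H)       # initial hidden state h_0
    c = torch.zeros(1, H)       # initial memory cell c_0
    p = torch.tanh(W_p(h))      # prediction p_1 from the empty history
    total = torch.zeros(())
    for y in word_vectors:
        total = total + word_score(y.squeeze(0)) + link_score(p, y)
        p, h, c = step(y, h, c)  # append y to the history, predict the next word
    return total
```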

Decoding

The total number of possible segmented sentences grows exponentially with the length of the character sequence, which makes it impractical to compute the scores of every possible segmentation.

In order to get exact inference, most sequence-labeling systems address this problem with a Viterbi search, which takes advantage of the fact that their tag interactions exist only within adjacent characters (the Markov assumption).

However, since this model is intended to capture the complete history of segmentation decisions, such dynamic programming algorithms cannot be adopted here.

Beam Search Algorithm

A beam-search algorithm with dynamic-programming motivations was proposed in [0]; it is efficient and practical to use.

The main idea is that any segmentation of the first $i$ characters can be separated into two parts: the first part, denoted as $y$, consists of the characters with indexes from $0$ to $j$; the remaining part is the word composed of $c[j+1:i]$.

The influence from the previous segmentation $y$ can be represented as a triple $(y.\textrm{score}, y.\bar{h}, y.\bar{c})$, where $y.\textrm{score}$ indicates the current score, $y.\bar{h}$ the current hidden state vector, and $y.\bar{c}$ the current memory cell vector.

Beam search ensures that the total time for segmenting a sentence of $n$ characters is $w \times k \times n$, where $w$ is the maximum word length and $k$ is the maximum beam size.

Input: model parameters $\theta$, beam size $k$, maximum word length $w$, input character sequence $c[1:n]$.

Output: approx. $k$ best segmentations

1: $\pi[0] \leftarrow \{ (\textrm{score}=0, \bar{h} = \bar{h}_0, \bar{c} = \bar{c}_0) \}$

2: ...
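Based on the description above, a hedged Python sketch of the beam search. The pruning strategy, the data layout, and the `embed_word` helper (which would combine the character embeddings and a composition network such as the GCNN into a `(1, d)` word vector) are assumptions; `word_score`, `link_score`, `step`, `W_p`, and `H` are the hypothetical helpers from the earlier snippets.

```python
import torch

def beam_search(chars, k=8, w=4):
    """chars: list of character ids. Returns approx. k best segmentations.
    Each beam item is (score, words, h, c, p): accumulated score, the words
    segmented so far, the LSTM state, and the prediction for the next word."""
    h0, c0 = torch.zeros(1, H), torch.zeros(1, H)
    p0 = torch.tanh(W_p(h0))                  # prediction from the empty history
    # pi[i] holds the k best partial segmentations of the first i characters.
    pi = [[] for _ in range(len(chars) + 1)]
    pi[0] = [(torch.zeros(()), [], h0, c0, p0)]
    for i in range(1, len(chars) + 1):
        candidates = []
        for j in range(max(0, i - w), i):     # candidate word c[j+1:i], length <= w
            word = chars[j:i]
            y = embed_word(word)              # hypothetical helper -> (1, d) vector
            for score, words, h, c, p in pi[j]:
                new_score = score + word_score(y.squeeze(0)) + link_score(p, y)
                p2, h2, c2 = step(y, h, c)
                candidates.append((new_score, words + [word], h2, c2, p2))
        candidates.sort(key=lambda item: item[0].item(), reverse=True)
        pi[i] = candidates[:k]                # beam pruning
    return pi[len(chars)]
```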

[0]

Deng Cai and Hai Zhao. 2016. Neural word segmentation learning for Chinese. arXiv, cs.CL.

[1]

Xinchi Chen, Xipeng Qiu, Chenxi Zhu, and Xuanjing Huang. 2015a. Gated recursive neural network for Chinese word segmentation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pages 1744–1753.

[2]

Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. 2012. LSTM neural networks for language modeling. In 13th Annual Conference of the International Speech Communication Association.
