Overview of Neural Word Segmentation Learning for Chinese
MM Chiou, 06/30/2018
The Chinese word segmentation (CWS) problem is formulated as finding a mapping from an input character sequence x = c_1 c_2 ... c_n to a word sequence y = w_1 w_2 ... w_m, and the output sentence y* satisfies

    y* = argmax_{y ∈ GEN(x)} score(y)

where m is the number of word candidates in y, and GEN(x) denotes the set of possible segmentations for an input character sequence x.
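For intuition, the candidate set GEN(x) can be enumerated explicitly for short inputs. The sketch below is purely illustrative (the function name and the bound on word length are assumptions, not part of the model); it shows why exhaustive scoring is only feasible for tiny sequences:

```python
def gen_segmentations(chars, max_word_len=4):
    """Enumerate all segmentations of `chars` into words of bounded length."""
    if not chars:
        return [[]]
    segs = []
    for k in range(1, min(max_word_len, len(chars)) + 1):
        head = chars[:k]
        for rest in gen_segmentations(chars[k:], max_word_len):
            segs.append([head] + rest)
    return segs

# A 3-character sequence has 2**(3-1) = 4 segmentations when a word
# may span the whole input.
print(gen_segmentations("abc"))
# → [['a', 'b', 'c'], ['a', 'bc'], ['ab', 'c'], ['abc']]
```

The exponential growth of this set is what later motivates beam-search decoding.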
The scoring function of this model is sensitive to the complete content of the partially segmented sentence.
Fig.1 framework of this CWS model.
Fig.1 shows the framework of this CWS model.
A neural network scoring model is designed to evaluate the likelihood of a segmented sentence.
A decoder is developed to find the segmented sentence with the highest score.
A max-margin method is utilized for training, by comparing the structural difference between the decoder output and the gold segmentation.
Neural Network Scoring Model
The score for a segmented sentence y is computed by first mapping y into a sequence of word candidate vectors; the scoring model then takes the word candidate vector sequence as input and scores each word candidate from two perspectives:
(1) how likely the word candidate itself is to be recognized as a legal word;
(2) how reasonable the link is for the word candidate to immediately follow the previous segmentation history.
After that, the word candidate is appended to the segmentation history, updating the state of the scoring system for subsequent judgements.
Fig.2 Architecture of neural network scoring model.
Fig.2 illustrates the entire scoring neural network.
c_i denotes the i-th input character, i.e., x = c_1 c_2 ... c_n.
w_j denotes the learned representation of the j-th word candidate.
p_j denotes the prediction for the (j+1)-th word candidate.
u is the trainable parameter vector for scoring the likelihood of individual word candidates.
Word Score
Character Embedding
The scores are decided at the word level, and using word embeddings directly leads to a notable issue: rare words and out-of-vocabulary words are poorly estimated. In addition, the character-level information inside an n-gram can be helpful for judging whether it is a true word.
Therefore, a lookup table of character embeddings is used as the bottom layer.
Formally, we have a character dictionary C of size |C|. Each character c ∈ C is then represented as a real-valued vector (character embedding) v_c ∈ R^d, where d is the dimensionality of the vector space. The character embeddings are then stacked into an embedding matrix M ∈ R^{d×|C|}. For a character c ∈ C, its character embedding v_c is retrieved by the embedding layer as the corresponding column of M, according to its index k.
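The lookup described above amounts to indexing one column of the embedding matrix. A minimal numpy sketch, with illustrative dimensions and a toy character dictionary (both are assumptions for the example, not values from the paper):

```python
import numpy as np

d, vocab_size = 4, 6                        # illustrative d and |C|
rng = np.random.default_rng(0)
M = rng.standard_normal((d, vocab_size))    # embedding matrix, one column per character

# Toy character dictionary mapping each character to its index k.
char_to_index = {ch: k for k, ch in enumerate("abcdef")}

def embed(ch):
    """Retrieve the character embedding as the k-th column of M."""
    return M[:, char_to_index[ch]]

v = embed("c")
assert v.shape == (d,)
```

In a trained system M is learned jointly with the rest of the network rather than sampled randomly.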
Gated Combination Neural Network (GCNN)
Fig.3 Gated combination neural network
Fig.3 shows the gated combination neural network (GCNN).
In order to obtain the representation w of a word through its characters c_1, c_2, ..., c_L, the character vectors are integrated into their word representation using a weight matrix W^(L) that is shared across all words with the same length L, followed by a non-linear function φ:

    w = φ(W^(L) [c_1; c_2; ...; c_L])

where c_1, ..., c_L are d-dimensional character vector representations, so W^(L) ∈ R^{d×(Ld)} and the corresponding word vector w will be d-dimensional as well, and φ is a non-linear function (e.g., tanh) as mentioned above.
However, this simple mechanism cannot sufficiently model the complicated combination features. Since gated structures in neural networks are useful for hybrid feature extraction [1], a gated combination neural network (GCNN) was proposed in [0] for character compositionality; it contains two types of gates, reset gates and update gates.
Intuitively, the reset gates decide which parts of the character vectors should be mixed, while the update gates decide what to preserve when combining the character information.
Concretely, for words of length L, the word vector w is computed as

    w = z_N ⊙ ŵ + Σ_{i=1}^{L} z_i ⊙ c_i

where z_N and z_i (1 ≤ i ≤ L) are update gates for the new activation ŵ and the governed characters c_i respectively, and ⊙ indicates element-wise multiplication.
The new activation ŵ is computed as

    ŵ = tanh(W^(L) [r_1 ⊙ c_1; r_2 ⊙ c_2; ...; r_L ⊙ c_L])

where W^(L) ∈ R^{d×(Ld)} and r_i (1 ≤ i ≤ L) are the reset gates for the governed characters c_i, respectively, which can be formalized as

    [r_1; r_2; ...; r_L] = σ(R^(L) [c_1; c_2; ...; c_L])

where R^(L) ∈ R^{(Ld)×(Ld)} is the coefficient matrix of the reset gates and σ denotes the sigmoid function.
The update gates can be formalized as

    [z_N; z_1; ...; z_L] = (1/Z) ⊙ exp(U^(L) [ŵ; c_1; ...; c_L])

where U^(L) ∈ R^{((L+1)d)×((L+1)d)} is the coefficient matrix of the update gates, and Z ∈ R^d is the normalization vector,

    Z_k = Σ_{i=0}^{L} [exp(U^(L) [ŵ; c_1; ...; c_L])]_{d·i + k}

where 1 ≤ k ≤ d. According to the normalization condition, the update gates are constrained by

    z_N + Σ_{i=1}^{L} z_i = 1
The gated mechanism is capable of capturing both the characters themselves and their interactions, giving an effective word representation.
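A minimal numpy sketch of the gated combination for a word of length L, with randomly initialized matrices standing in for the trained parameters W^(L), R^(L), U^(L) (the function name and random initialization are assumptions for illustration only):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gcnn_compose(chars, W, R, U):
    """Combine L character vectors (each of dim d) into one word vector.

    chars: list of L arrays of shape (d,)
    W: (d, L*d)            composition matrix
    R: (L*d, L*d)          reset-gate coefficient matrix
    U: ((L+1)*d, (L+1)*d)  update-gate coefficient matrix
    """
    L, d = len(chars), chars[0].shape[0]
    c = np.concatenate(chars)                    # [c_1; ...; c_L]
    r = sigmoid(R @ c)                           # reset gates [r_1; ...; r_L]
    w_hat = np.tanh(W @ (r * c))                 # new activation
    u = np.exp(U @ np.concatenate([w_hat, c]))   # unnormalized update gates
    gates = u.reshape(L + 1, d)
    gates /= gates.sum(axis=0, keepdims=True)    # z_N + sum_i z_i = 1, per dimension
    z_N, zs = gates[0], gates[1:]
    return z_N * w_hat + sum(z * ci for z, ci in zip(zs, chars))

d, L = 3, 2
rng = np.random.default_rng(1)
w = gcnn_compose([rng.standard_normal(d) for _ in range(L)],
                 rng.standard_normal((d, L * d)),
                 rng.standard_normal((L * d, L * d)),
                 rng.standard_normal(((L + 1) * d, (L + 1) * d)))
assert w.shape == (d,)
```

The per-dimension normalization makes the output a convex combination of ŵ and the character vectors in every dimension.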
Word Score
Denote by w_1, w_2, ..., w_m the learned vector representations for a segmented sentence y, where m is the number of word candidates in the sentence. The word score is computed as the dot product of the vector w_i (1 ≤ i ≤ m) and a trainable parameter vector u:

    WS(w_i) = u · w_i

It indicates how likely a word candidate by itself is to be a true word.
Link Score
Fig.4 Link scores (dashed lines)
Fig.4 shows a schematic of the link scores.
An LSTM system is used to capture the coherence in a segmented sentence.
The LSTM is an effective tool for sequence modeling tasks, using its hidden states to preserve history information.
The LSTM introduces a memory cell to preserve states over long periods of time, and controls the updates of the hidden state and memory cell with three types of gates: the input gate, the forget gate, and the output gate.
Concretely, each step of the LSTM takes input x_t, h_{t-1}, c_{t-1} and produces h_t, c_t as

    i_t = σ(W_i x_t + U_i h_{t-1} + b_i)
    f_t = σ(W_f x_t + U_f h_{t-1} + b_f)
    o_t = σ(W_o x_t + U_o h_{t-1} + b_o)
    c̃_t = tanh(W_c x_t + U_c h_{t-1} + b_c)
    c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
    h_t = o_t ⊙ tanh(c_t)

where σ is the element-wise sigmoid function and ⊙ is element-wise multiplication.
i_t is the input gate vector at time t.
f_t is the forget gate vector at time t.
o_t is the output gate vector at time t.
c_t is the memory cell activation vector at time t.
All of them have the same size as the hidden state vector h_t.
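The step above can be sketched directly in numpy. This is a generic single-step LSTM under randomly initialized parameters, not the paper's trained model; the parameter packing is an assumption chosen for brevity:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step: gates from x_t and h_{t-1}, then cell/hidden update."""
    Wi, Ui, bi, Wf, Uf, bf, Wo, Uo, bo, Wc, Uc, bc = params
    i_t = sigmoid(Wi @ x_t + Ui @ h_prev + bi)      # input gate
    f_t = sigmoid(Wf @ x_t + Uf @ h_prev + bf)      # forget gate
    o_t = sigmoid(Wo @ x_t + Uo @ h_prev + bo)      # output gate
    c_tilde = np.tanh(Wc @ x_t + Uc @ h_prev + bc)  # candidate cell activation
    c_t = f_t * c_prev + i_t * c_tilde              # memory cell update
    h_t = o_t * np.tanh(c_t)                        # new hidden state
    return h_t, c_t

d_in, d_h = 3, 4
rng = np.random.default_rng(2)
params = [rng.standard_normal(s) for s in
          [(d_h, d_in), (d_h, d_h), (d_h,)] * 4]    # Wi,Ui,bi, Wf,Uf,bf, Wo,Uo,bo, Wc,Uc,bc
h, c = lstm_step(rng.standard_normal(d_in), np.zeros(d_h), np.zeros(d_h), params)
assert h.shape == (d_h,) and c.shape == (d_h,)
```

In the segmentation model, x_t is the t-th word candidate vector and h_t feeds the next-word prediction.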
LSTMs have been shown to outperform RNNs on many NLP tasks, notably language modeling [2].
In this work, the LSTM is used to chain the word candidates in a left-to-right, incremental manner. At time step t, a prediction about the next word is made based on the hidden state h_t:

    p_t = tanh(W_p h_t + b_p)

The link score for the next word w_{t+1} is computed as

    LinkS(w_{t+1}) = p_t · w_{t+1}
Due to the LSTM structure, the prediction vector carries information detected from the entire segmentation history, including previous segmentation decisions.
In this way, the model gains the ability to perform sequence-level discrimination rather than local optimization.
Sentence Score
The sentence score for a segmented sentence y with m word candidates is computed by summing up the word scores and link scores:

    score(y; θ) = Σ_{t=1}^{m} (u · w_t + p_{t-1} · w_t)

where θ is the parameter set.
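The sum above can be sketched as follows. The `predict` callback stands in for the LSTM's next-word prediction p_{t-1} (its name and the toy mean-of-history implementation are assumptions for illustration):

```python
import numpy as np

def sentence_score(word_vecs, u, predict):
    """score(y) = sum_t ( u . w_t  +  p_{t-1} . w_t )

    word_vecs: word-candidate vectors w_1 .. w_m
    u:         trainable word-score parameter vector
    predict:   maps the segmentation history to the prediction vector p_{t-1}
    """
    total = 0.0
    history = []
    for w in word_vecs:
        p = predict(history)      # prediction from history (the LSTM in the paper)
        total += float(u @ w)     # word score
        total += float(p @ w)     # link score
        history.append(w)         # extend the segmentation history
    return total

d = 3
rng = np.random.default_rng(3)
words = [rng.standard_normal(d) for _ in range(4)]
u = rng.standard_normal(d)
score = sentence_score(
    words, u,
    predict=lambda hist: np.mean(hist, axis=0) if hist else np.zeros(d))
assert isinstance(score, float)
```

Note that each word contributes both scores, so a candidate that looks like a valid word but fits the history poorly can still be penalized.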
Decoding
The total number of possible segmented sentences grows exponentially with the length of the character sequence, which makes it impractical to compute the scores of every possible segmentation.
In order to get exact inference, most sequence-labeling systems address this problem with a Viterbi search, which takes advantage of the fact that tag interactions exist only within adjacent characters (the Markov assumption).
However, since this model is intended to capture the complete history of segmentation decisions, such dynamic programming algorithms cannot be adopted in this situation.
Beam Search Algorithm
A beam-search algorithm with dynamic-programming motivations was proposed in [0]; it is practical and efficient.
The main idea is that any segmentation of the first i characters can be separated into two parts: the first part consists of the characters with indexes from 0 to j, denoted as y_{0:j}, and the rest is the word composed of the characters c_{j+1} ... c_i.
The influence of the previous segmentation can be represented as a triple (score, h, c), where score indicates the current score, h the current hidden state vector, and c the current memory cell vector.
Beam search ensures that the total time for segmenting a sentence of n characters is O(w·k·n), where w is the maximum word length and k is the maximum beam size.
Input: model parameters θ, beam size k, maximum word length w, input character sequence c_1 c_2 ... c_n.
Output: approximately best segmented sentence y*.
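A hedged sketch of the beam-search decoder described above. The scoring model is abstracted as `score_word(state, word) -> (delta, new_state)`, a stand-in for the word score plus link score (GCNN + LSTM) of the paper; that name and signature, and the toy lexicon scorer, are illustrative assumptions:

```python
from typing import Callable, List, Tuple

def beam_search(chars: str,
                score_word: Callable[[tuple, str], Tuple[float, tuple]],
                beam_size: int = 4,
                max_word_len: int = 4) -> List[str]:
    n = len(chars)
    # beams[i] holds the best partial segmentations of chars[:i],
    # each as (accumulated score, opaque history state, word list).
    beams = [[] for _ in range(n + 1)]
    beams[0] = [(0.0, (), [])]
    for i in range(1, n + 1):
        candidates = []
        # Extend every surviving segmentation of chars[:j] by the word chars[j:i].
        for j in range(max(0, i - max_word_len), i):
            word = chars[j:i]
            for score, state, words in beams[j]:
                delta, new_state = score_word(state, word)
                candidates.append((score + delta, new_state, words + [word]))
        candidates.sort(key=lambda t: t[0], reverse=True)
        beams[i] = candidates[:beam_size]       # prune to the beam size
    return beams[n][0][2]

# Toy scorer: +1 for words in a small lexicon, -1 otherwise.
lexicon = {"ab", "c", "abc"}
def toy_score(state, word):
    return (1.0 if word in lexicon else -1.0), state

print(beam_search("abc", toy_score))  # → ['ab', 'c']
```

Each position keeps at most k triples, giving the O(w·k·n) bound stated above.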
[0]
Deng Cai and Hai Zhao. 2016. Neural word segmentation learning for Chinese. arXiv preprint, cs.CL.
[1]
Xinchi Chen, Xipeng Qiu, Chenxi Zhu, and Xuanjing Huang. 2015. Gated recursive neural network for Chinese word segmentation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pages 1744–1753.
[2]
Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. 2012. LSTM neural networks for language modeling. In 13th Annual Conference of the International Speech Communication Association.