QACNN Algorithm

MM 0605/2018


In a question answering task, a reading paragraph $P_{\textrm{raw}}$, a query $Q_{\textrm{raw}}$, and several answer choices $C_{\textrm{raw}}$ are given.

The goal of the model is to choose the correct answer from the multiple choices based on the information in $P_{\textrm{raw}}$ and $Q_{\textrm{raw}}$.

Fig. 1 Overview of the QACNN model. $\bar{y}$ denotes the probabilities of choices A, B, C, and D.

Fig. 1 shows the pipeline overview of QACNN.

The embedding layer transforms $P_{\textrm{raw}}$, $Q_{\textrm{raw}}$, and $C_{\textrm{raw}}$ into word embeddings $P$, $Q$, and $C$.

The compare layer generates the paragraph-query similarity map $PQ$ and the paragraph-choice similarity map $PC$.

QACNN consists of a two-stage CNN architecture.

The first stage projects word-level features into sentence-level features, and the second stage projects sentence-level features into paragraph-level features.

The outputs of the first stage are the paragraph's sentence features based on the query, $\bar{\bar{r}}^{PQ}$, and on the choice, $\bar{\bar{r}}^{PC}$:

$\bar{\bar{r}}^{PQ}=[\bar{r}^{PQ}_1,\cdots, \bar{r}^{PQ}_n, \cdots,\bar{r}^{PQ}_N]$ and $\bar{\bar{r}}^{PC}=[\bar{r}^{PC}_1,\cdots, \bar{r}^{PC}_n, \cdots,\bar{r}^{PC}_N]$.

$n$ is the sentence index, and $N$ is the total number of sentences in the paragraph.

The output of the second stage is the answer feature $\bar{r}$.

Each choice has its corresponding answer feature.

The answer features of all choices, $\bar{r}_A$, $\bar{r}_B$, $\bar{r}_C$, $\bar{r}_D$, are the inputs to the prediction layer, and the outputs denote the probabilities of choices A, B, C, and D, respectively:

$\bar{y}=[p_A,p_B,p_C,p_D]$.

Similarity Mapping Layer

The similarity mapping layer is composed of two parts: an embedding layer and a compare layer.

Given a paragraph $P_{\textrm{raw}}$ (with $N$ sentences), a query $Q_{\textrm{raw}}$, and a choice $C_{\textrm{raw}}$, the embedding layer transforms every word in $P_{\textrm{raw}}$, $Q_{\textrm{raw}}$, and $C_{\textrm{raw}}$ into word embeddings:
$P =\{\bar{p}_n^i \}_{i=1,n=1}^{I,N}$: $N$ is the total number of sentences in the paragraph, and $I$ is the total number of words in each sentence.

$Q =\{\bar{q}^j \}_{j=1}^{J}$: the query is treated as a single sentence.

$C = \{ \bar{c}^k \}_{k=1}^K$: the choice is treated as a single sentence.

All sentences in the paragraph are padded to the same length.

$J$ is the total number of words in the query sentence, and $K$ is the total number of words in the choice sentence.

$\bar{p}_n^i$, $\bar{q}^j$, and $\bar{c}^k$ are word embeddings.

Word embeddings can be obtained with any embedding technique, such as an RNN, a sequence-to-sequence model, or Word2Vec.

The pre-trained GloVe word vectors are adopted as the embeddings without further modification [3].
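
As an illustration, here is a minimal sketch of how pre-trained GloVe vectors could be loaded and used to embed a tokenized sentence. The file path, embedding dimension, tokenization, and the zero-vector fallback for unknown words are assumptions for the example, not details specified above.

```python
import numpy as np

def load_glove(path, dim=300):
    """Load pretrained GloVe vectors from a whitespace-separated text file
    (one token followed by `dim` floats per line) into a {token: vector} dict."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def embed_sentence(tokens, vectors, dim=300):
    """Map a tokenized sentence to a (len(tokens), dim) matrix; unknown words
    fall back to a zero vector here (a choice made for this sketch)."""
    return np.stack([vectors.get(t, np.zeros(dim, dtype=np.float32)) for t in tokens])

# Hypothetical usage (the file name is a placeholder):
# glove = load_glove("glove.6B.300d.txt")
# P_n = embed_sentence("he walked into the room".split(), glove)
```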

The compare layer compares each paragraph sentence $P_n$ to $Q$ and $C$ separately at the word level.


Fig. 2 Similarity map between paragraph $P$ and query $Q$. $I$ denotes the length of each sentence $P_n$, and $J$ denotes the length of the query $Q$.

Fig. 2 shows the similarity map between paragraph $P$ and query $Q$. $I$ denotes the length of each sentence $P_n$, and $J$ denotes the length of the query $Q$:

$P_{n}Q = \{ \cos \left(\bar{p}_n^i,\bar{q}^j \right) \}_{i=1,j=1}^{I,J}$

Fig. 3 Similarity map between paragraph $P$ and choice $C$. $K$ denotes the length of the choice $C$.

Fig. 3 shows the similarity map between paragraph $P$ and choice $C$. $K$ denotes the total number of words in the choice sentence $C$.

Each word in every paragraph sentence is compared to each word in the query and in the choice:

$P_{n}C=\{ \cos (\bar{p}_n^i,\bar{c}^k) \}_{i=1,k=1}^{I,K}$

Two similarity maps, the paragraph-query similarity map ($PQ$) and the paragraph-choice similarity map ($PC$), are created as

$PQ=[P_{1}Q,P_{2}Q, \cdots ,P_{N}Q] \in R^{N \times J \times I}$

$PC=[P_{1}C,P_{2}C, \cdots ,P_{N}C] \in R^{N \times K \times I}$
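
The similarity maps above are plain word-by-word cosine similarities, so they can be computed directly from the embedded paragraph, query, and choice. A minimal sketch, assuming the paragraph has already been padded to $N$ sentences of $I$ words each:

```python
import numpy as np

def cosine_similarity_map(S, T, eps=1e-8):
    """Word-level cosine similarity between two embedded sentences.
    S: (len_s, dim), T: (len_t, dim) -> (len_t, len_s) map, matching the
    R^{J x I} / R^{K x I} layout used above."""
    S_norm = S / (np.linalg.norm(S, axis=1, keepdims=True) + eps)
    T_norm = T / (np.linalg.norm(T, axis=1, keepdims=True) + eps)
    return T_norm @ S_norm.T

def build_similarity_maps(P, Q, C):
    """P: (N, I, dim) padded paragraph, Q: (J, dim) query, C: (K, dim) choice.
    Returns PQ in R^{N x J x I} and PC in R^{N x K x I}."""
    PQ = np.stack([cosine_similarity_map(P_n, Q) for P_n in P])
    PC = np.stack([cosine_similarity_map(P_n, C) for P_n in P])
    return PQ, PC
```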

QACNN Layer

QACNN contains a two-stage CNN combined with a query-based attention mechanism.

An attention convolutional matching layer is used to integrate the two similarity maps.

The QACNN layer is used to learn location relationship patterns.

Each stage comprises two major parts: the attention map and the output representation.

Attention Map of First Stage

Fig. 4-0 First-stage CNN attention part, from $P_nQ$ to the word-level attention map $\bar{a}_n$.

Fig. 4-0 shows the derivation of the word-level attention map $\bar{a}_n$ from $P_nQ$.

$P_n Q \in R^{J \times I}$ is the $n$-th sentence slice in $PQ$, and a CNN is applied to $P_nQ$ with the convolution kernel $\bar{\bar{\bar{W}}}_1^A \in R^{J \times l \times d}$. $d$ and $l$ represent the width of the kernel and the number of kernels, respectively.

The superscript $A$ denotes the attention map, and the subscript $1$ denotes the first-stage CNN.

The generated feature $\bar{\bar{q}}_n^A \in R^{l \times(I-d+1)}$ is expressed as

$\bar{\bar{q}}_n^A =\textrm{sigmoid}( \bar{\bar{\bar{W}}}_1^A * P_n Q + \bar{b}_1^A)$

where $\bar{b}_1^A \in R^l$ is the bias.

The kernel $\bar{\bar{\bar{W}}}_1^A$ learns the query's syntactic structure and assigns a weight to each location in the paragraph. The sigmoid function is chosen as the activation function in this stage.

Max pooling is performed on $\bar{\bar{q}}_n^A$ vertically, with a pooling kernel of size $l$, to find the largest weight among the different kernels at each location; this generates the word-level attention map $\bar{a}_n \in R^{I-d+1}$ for each sentence.

$\bar{a}_n = \textrm{max pool}(\bar{\bar{q}}_n^A)$, where $\bar{\bar{q}}_n^A \in R^{l \times (I-d+1)}$ and $\bar{a}_n \in R^{I-d+1}$.

Fig. 4 First-stage CNN attention part. Each paragraph sentence is used to generate its corresponding word-level attention map $\bar{a}_n$, $n=1,2,\cdots,N$.

Fig. 4 shows the architecture of the attention map in the first-stage CNN; all sentences in the paragraph are used.
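
One way to realize this attention part is a 1-D convolution over the paragraph-word axis of $P_nQ$, with the $J$ query positions as input channels, followed by a sigmoid and a max over the $l$ kernels. The PyTorch sketch below follows the shapes in the text; the hyper-parameter values for $l$ and $d$ are placeholders, not values given here.

```python
import torch
import torch.nn as nn

class FirstStageAttention(nn.Module):
    """Sketch of the first-stage attention map: conv over P_nQ, sigmoid,
    then a max over the l kernels at each of the I - d + 1 locations."""
    def __init__(self, J, l=128, d=3):
        super().__init__()
        # Kernel W_1^A: spans the whole query axis (J input channels),
        # slides over the paragraph-word axis with width d, and has l filters.
        self.conv = nn.Conv1d(in_channels=J, out_channels=l, kernel_size=d)

    def forward(self, PnQ):
        # PnQ: (batch, J, I)
        q_A = torch.sigmoid(self.conv(PnQ))  # (batch, l, I - d + 1)
        a_n, _ = q_A.max(dim=1)              # max over the l kernels -> (batch, I - d + 1)
        return a_n
```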

Output Representation of First Stage

The paragraph's sentence features based on the query ($\bar{\bar{r}}^{PQ}$) and on the choice ($\bar{\bar{r}}^{PC}$) are derived in the first stage.

Fig. 5-0 Derivation of the paragraph sentence feature based on the query, $\bar{r}_n^{PQ}$.

Fig. 5-0 shows the derivation of the paragraph sentence feature based on the query, $\bar{r}_n^{PQ}$.

Fig. 5 All of the paragraph's sentence features based on the query, $\bar{\bar{r}}^{PQ}=[\bar{r}_1^{PQ},\cdots,\bar{r}_n^{PQ},\cdots,\bar{r}_N^{PQ}]$.

Fig. 5 shows all of the paragraph's sentence features based on the query, $\bar{\bar{r}}^{PQ}=[\bar{r}_1^{PQ},\cdots,\bar{r}_n^{PQ},\cdots,\bar{r}_N^{PQ}]$.

Fig. 6-0 Derivation of the paragraph sentence feature based on the choice, $\bar{r}_n^{PC}$.

Fig. 6-0 shows the derivation of the paragraph sentence feature based on the choice, $\bar{r}_n^{PC}$.


Fig. 6 All of the paragraph's sentence features based on the choice, $\bar{\bar{r}}^{PC}=[\bar{r}_1^{PC},\cdots,\bar{r}_n^{PC},\cdots,\bar{r}_N^{PC}]$.

Fig. 6 shows all of the paragraph's sentence features based on the choice, $\bar{\bar{r}}^{PC}=[\bar{r}_1^{PC},\cdots,\bar{r}_n^{PC},\cdots,\bar{r}_N^{PC}]$.

$\bar{\bar{r}}^{PQ}$ and $\bar{\bar{r}}^{PC}$ are the two output representation parts of the first-stage CNN architecture.

Kernels $\bar{\bar{\bar{W}}}_1^R \in R^{l \times K \times d}$ and bias $\bar{b}_1^R \in R^l$ are applied to $P_n Q$ to acquire the query-based sentence features.

$\bar{\bar{q}}_n^R = \textrm{ReLU}(\bar{\bar{\bar{W}}}_1^R \otimes P_n Q + \bar{b}_1^R) \in R^{l \times ( I-d+1)}$

Identical kernels $\bar{\bar{\bar{W}}}_1^R \in R^{l \times K \times d}$ and bias $\bar{b}_1^R \in R^l$ are applied to $P_n C$ to aggregate location relationship patterns and acquire the choice-based sentence features. The superscript $R$ denotes the output representation.

$\bar{\bar{c}}_n^R = \textrm{ReLU}(\bar{\bar{\bar{W}}}_1^R \otimes P_n C + \bar{b}_1^R)= \left[ \begin{matrix} (\bar{c}_{n,1}^R)^t \\ \vdots \\ (\bar{c}_{n,l}^R)^t \end{matrix} \right] \in R^{l \times ( I-d+1)}$

$\bar{\bar{c}}_n^R$ is multiplied by the word-level attention map $\bar{a}_n \in R^{I-d+1}$ along the first dimension:

$\bar{\bar{c}}_n^R=\left[ \begin{matrix} (\bar{c}_{n,1}^R)^t \odot \bar{a}_n \\ \vdots \\ (\bar{c}_{n,l}^R)^t \odot \bar{a}_n \end{matrix} \right]$

Max pooling is applied to $\bar{\bar{q}}_n^R \in R^{l \times ( I-d+1)}$ and $\bar{\bar{c}}_n^R \in R^{l \times ( I-d+1)}$ horizontally with a kernel of size $(I-d+1)$ to get the query-based sentence features $\bar{r}_n^{PQ} \in R^l$ and the choice-based sentence features $\bar{r}_n^{PC} \in R^l$.

$\bar{r}_n^{PQ} = \textrm{max pool}(\bar{\bar{q}}_n^R)$

$\bar{r}_n^{PC} = \textrm{max pool}(\bar{\bar{c}}_n^R)$
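
A corresponding sketch of the first-stage output representation is given below: one shared 1-D convolution with ReLU is applied to both $P_nQ$ and $P_nC$, the choice branch is weighted by $\bar{a}_n$, and both branches are max-pooled horizontally. The sketch assumes the query and choice are padded to the same length so the shared kernel fits both maps; $l$ and $d$ are again placeholder hyper-parameters.

```python
import torch
import torch.nn as nn

class FirstStageRepresentation(nn.Module):
    """Sketch of the first-stage output representation (kernels W_1^R, bias b_1^R)."""
    def __init__(self, J, l=128, d=3):
        super().__init__()
        # One convolution shared between the paragraph-query and paragraph-choice maps.
        self.conv = nn.Conv1d(in_channels=J, out_channels=l, kernel_size=d)

    def forward(self, PnQ, PnC, a_n):
        # PnQ, PnC: (batch, J, I); a_n: (batch, I - d + 1) from the attention part.
        q_R = torch.relu(self.conv(PnQ))  # (batch, l, I - d + 1)
        c_R = torch.relu(self.conv(PnC))  # (batch, l, I - d + 1)
        c_R = c_R * a_n.unsqueeze(1)      # weight every kernel row by the attention map
        r_PQ, _ = q_R.max(dim=2)          # horizontal max pool -> (batch, l)
        r_PC, _ = c_R.max(dim=2)          # horizontal max pool -> (batch, l)
        return r_PQ, r_PC
```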

Attention Map of Second Stage

Fig. 7-0 Derivation of the sentence-level attention map $\bar{a}$ from $\bar{\bar{r}}^{PQ}$.

Fig. 7-0 shows the derivation of the sentence-level attention map $\bar{a}$ from the query-based sentence features $\bar{\bar{r}}^{PQ}$.

Fig. 7 Derivation of the sentence-level attention map $\bar{a}$ from $\bar{\bar{r}}^{PQ}$.

Fig. 7 shows the derivation of the sentence-level attention map $\bar{a}$ from the query-based sentence features $\bar{\bar{r}}^{PQ}$.

$\bar{\bar{r}}^{PQ}=[\bar{r}_1^{PQ}, \bar{r}_2^{PQ}, \cdots, \bar{r}_N^{PQ}]$ is the input of the second-stage CNN; it is further processed by a CNN with kernel $\bar{\bar{\bar{W}}}_2^A \in R^{l \times d \times l}$, which generates the intermediate features $\bar{\bar{q}}^A \in R^{l \times (N-d+1)}$.

$\bar{\bar{q}}^A=\textrm{sigmoid}(\bar{\bar{\bar{W}}}_2^A* \bar{\bar{r}}^{PQ} + \bar{b}_2^A)$

Max pooling is applied to $\bar{\bar{q}}^A$ with a kernel of size $l$, and the sentence-level attention map $\bar{a} \in R^{N-d+1}$ is obtained.

$\bar{a}=\textrm{max pool}(\bar{\bar{q}}^A)$
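
The second-stage attention part mirrors the first stage, now over the sentence axis. A minimal sketch, where the $l$ sentence-feature dimensions act as input channels and $l$, $d$ remain placeholder hyper-parameters:

```python
import torch
import torch.nn as nn

class SecondStageAttention(nn.Module):
    """Sketch of the second-stage attention map (kernel W_2^A, bias b_2^A)."""
    def __init__(self, l=128, d=3):
        super().__init__()
        self.conv = nn.Conv1d(in_channels=l, out_channels=l, kernel_size=d)

    def forward(self, r_PQ):
        # r_PQ: (batch, l, N) -- the N query-based sentence features stacked column-wise.
        q_A = torch.sigmoid(self.conv(r_PQ))  # (batch, l, N - d + 1)
        a, _ = q_A.max(dim=1)                 # max over the l kernels -> (batch, N - d + 1)
        return a
```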

Output Representation of Second Stage

Fig. 8-0 Derivation of the answer feature $\bar{r}$ from $\bar{\bar{r}}^{PC}$.

Fig. 8-0 shows the derivation of the answer feature $\bar{r}$ from $\bar{\bar{r}}^{PC}$.

Fig. 8 Derivation of the answer feature $\bar{r}$ from the choice-based sentence features $\bar{\bar{r}}^{PC}$.

Fig. 8 shows the output representation part of the second stage.

The output representation part of the second stage has two inputs: the sentence-level attention map $\bar{a}$ and the sentence-level features $\bar{\bar{r}}^{PC}=[\bar{r}_1^{PC},\bar{r}_2^{PC},\cdots,\bar{r}_N^{PC}]$.

The feature $\bar{\bar{c}}^R$ and the answer feature $\bar{r}$ are obtained as

$\bar{r}=\{ \max (\bar{c}_t^R \odot \bar{a})\}_{t=1}^l$

where

$\bar{\bar{c}}^R = \textrm{ReLU}(\bar{\bar{\bar{W}}}_2^R*\bar{\bar{r}}^{PC} + \bar{b}_2^R)$

and $\bar{\bar{\bar{W}}}_2^R \in R^{l \times l \times d}$, $\bar{b}_2^R \in R^l$, and $\bar{\bar{c}}^R \in R^{l \times (N-d+1)}$. The output representation of a given choice, $\bar{r} \in R^l$, is the final output of the QACNN layer.
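
The second-stage output representation can be sketched the same way: a 1-D convolution with ReLU over $\bar{\bar{r}}^{PC}$, weighting by the sentence-level attention map, and a max over positions per kernel, which yields the answer feature $\bar{r} \in R^l$ for one choice.

```python
import torch
import torch.nn as nn

class SecondStageRepresentation(nn.Module):
    """Sketch of the second-stage output representation (kernel W_2^R, bias b_2^R)."""
    def __init__(self, l=128, d=3):
        super().__init__()
        self.conv = nn.Conv1d(in_channels=l, out_channels=l, kernel_size=d)

    def forward(self, r_PC, a):
        # r_PC: (batch, l, N); a: (batch, N - d + 1) from the second-stage attention part.
        c_R = torch.relu(self.conv(r_PC))  # (batch, l, N - d + 1)
        c_R = c_R * a.unsqueeze(1)         # weight each kernel row by the attention map
        r, _ = c_R.max(dim=2)              # answer feature -> (batch, l)
        return r
```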

Fig. 9 Overall data flow of QACNN.

Prediction Layer

The prediction layer is the final part of QACNN. $\bar{r}_A \in R^l$ represents the final output representation of choice A.

$\bar{r}_A, \bar{r}_B, \bar{r}_C, \bar{r}_D$ are passed to two fully-connected layers, which compute the probability of each choice using softmax:

$\bar{\bar{R}} =[\bar{r}_A, \bar{r}_B, \bar{r}_C, \bar{r}_D]$.

The output vector gives the probability of each choice:

$\bar{y} = \left[ \begin{matrix} p_A \\ p_B \\ \vdots \\ p_{M_c} \end{matrix} \right] = \textrm{softmax}(\bar{\bar{W}}^o(\textrm{tanh}( \bar{\bar{W}}^p \bar{\bar{R}} + \bar{b}^p )) + \bar{b}^o)$.

where $\bar{\bar{W}}^p \in R^{l \times l}$, $\bar{b}^p \in R^l$, $\bar{\bar{W}}^o \in R^{l \times M_c}$, and $\bar{b}^o \in R^{M_c}$.

Two fully-connected layers are applied.

$\bar{\bar{W}}^p$ and $\bar{\bar{W}}^o$ are the weight matrices of the first and second fully-connected layers, respectively.

$\bar{b}^p$ and $\bar{b}^o$ are the biases of the first and second fully-connected layers, respectively.

$M_c$ is the number of choices.

The hyperbolic tangent function can be expressed as

$\tanh x = \frac{e^x-e^{-x}}{e^x + e^{-x}}$.
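
As a worked illustration, the sketch below applies one reading of the prediction-layer equation: the first fully-connected layer is shared across choices, column $m$ of $\bar{\bar{W}}^o$ scores choice $m$, and softmax normalizes over the $M_c$ choices. The exact way $\bar{\bar{W}}^o$ acts on the hidden matrix is an assumption of this sketch, as are the random weights used for the demonstration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def prediction_layer(R, W_p, b_p, W_o, b_o):
    """R: (l, M_c) stacked choice features, W_p: (l, l), b_p: (l,),
    W_o: (l, M_c), b_o: (M_c,). Returns the probability of each choice."""
    hidden = np.tanh(W_p @ R + b_p[:, None])   # (l, M_c)
    scores = (W_o * hidden).sum(axis=0) + b_o  # (M_c,), one score per choice
    return softmax(scores)

# Hypothetical usage with l = 128 features and M_c = 4 choices.
l, M_c = 128, 4
rng = np.random.default_rng(0)
R = rng.normal(size=(l, M_c))
y = prediction_layer(R, rng.normal(size=(l, l)), np.zeros(l),
                     rng.normal(size=(l, M_c)), np.zeros(M_c))
print(y, y.sum())  # probabilities over the four choices, summing to 1
```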


[0] T.-C. Liu, Y.-H. Wu, and H.-Y. Lee, "Query-based Attention CNN for Text Similarity Map," 2017.

[1] M. Tapaswi, Y. Zhu, and R. Stiefelhagen, "MovieQA: Understanding Stories in Movies through Question-Answering," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[2] S. Wang and J. Jiang, "A Compare-Aggregate Model for Matching Text Sequences," arXiv:1611.01747, 2016.

[3] J. Pennington, R. Socher, and C. Manning, "GloVe: Global Vectors for Word Representation," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532-1543.
