Experiment on AMRNN

MM 05/06/2018


Experiment Setup

Dataset Collection: The collected TOEFL dataset included 963 examples in total (717 for training, 124 for validation, 122 for testing). Each example included a story, a question and four choices.

Besides the audio recording of each story, the manual transcription of each story is also available. The pydub library was used to segment each full audio recording into utterances. Each audio recording contains 57.9 utterances on average. A story contains on average 657.7 words, a question 12.01 words, and a choice 10.35 words.
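The segmentation step can be sketched as follows. This is a minimal energy-threshold illustration of silence-based splitting on raw sample amplitudes; the actual experiment used pydub (whose silence utilities do this more robustly on real audio files), and the frame size and thresholds here are illustrative assumptions:

```python
# Minimal sketch of silence-based utterance segmentation on a list of sample
# amplitudes. Frame size and thresholds are illustrative assumptions; pydub's
# silence-based splitting implements the same idea on real audio.

def segment_utterances(samples, frame_size=1600, silence_thresh=100, min_silence_frames=3):
    """Split `samples` into utterances at runs of low-energy frames."""
    # Mark each frame as silent or voiced by its mean absolute amplitude.
    frames = [samples[i:i + frame_size] for i in range(0, len(samples), frame_size)]
    silent = [sum(abs(s) for s in f) / max(len(f), 1) < silence_thresh for f in frames]

    utterances, start, silence_run = [], None, 0
    for i, is_silent in enumerate(silent):
        if not is_silent:
            if start is None:
                start = i          # a voiced frame opens a new utterance
            silence_run = 0
        elif start is not None:
            silence_run += 1
            if silence_run >= min_silence_frames:  # a long pause closes it
                utterances.append((start * frame_size, (i - silence_run + 1) * frame_size))
                start, silence_run = None, 0
    if start is not None:          # flush a trailing utterance
        utterances.append((start * frame_size, len(samples)))
    return utterances              # list of (start_sample, end_sample) spans
```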

Speech Recognition: the CMU Sphinx speech recognizer was used to transcribe the audio stories. The recognition word error rate (WER) was 34.32%.
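For reference, the WER reported above is the word-level edit distance between the ASR output and the manual reference transcription, divided by the reference length. A minimal sketch:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[-1][-1] / len(ref)
```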

Pre-processing: A pre-trained 300-dimensional GloVe vector model was used to obtain the vector representation of each word. Each utterance in the stories, each question and each choice can then be represented as a fixed-length vector by summing the vectors of all its component words. Before training, the utterances in the story whose vector representations were far (in cosine distance) from the question's were pruned. The percentage of pruned utterances was determined by the performance of the model on the development set. The vector representations of utterances, questions and choices were only used in this pre-processing stage and in the baseline approaches.
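The bag-of-words representation and the pruning step can be sketched as follows; `glove` here stands for any word-to-vector lookup, and `keep_ratio` plays the role of the dev-set-tuned pruning fraction (both names are illustrative):

```python
import math

def sentence_vector(words, glove):
    """Fixed-length representation: sum of the component word vectors."""
    dim = len(next(iter(glove.values())))
    vec = [0.0] * dim
    for w in words:
        if w in glove:  # out-of-vocabulary words are simply skipped
            vec = [v + g for v, g in zip(vec, glove[w])]
    return vec

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def prune_story(utterance_vecs, question_vec, keep_ratio):
    """Keep the fraction of utterances closest (in cosine) to the question;
    the fraction is tuned on the development set, as in the experiment."""
    ranked = sorted(utterance_vecs,
                    key=lambda u: cosine_similarity(u, question_vec),
                    reverse=True)
    return ranked[:max(1, int(len(ranked) * keep_ratio))]
```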

Baselines

The proposed model is compared with the commonly used simple baselines in [1] and with the memory network [2].

Choice length: the most naive baselines select a choice based only on the number of words in it, without listening to the story or looking at the question. This includes: (i) selecting the longest choice, (ii) selecting the shortest choice, or (iii) selecting the choice whose length differs most from the other choices.
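A minimal sketch of the three length-only variants; the tie-breaking and the exact "most different" definition (deviation from the mean length of the other choices) are assumptions, since the text does not specify them:

```python
def choice_length_baseline(choices, mode):
    """Pick a choice index using only word counts (no story, no question)."""
    lengths = [len(c.split()) for c in choices]
    if mode == "longest":
        return lengths.index(max(lengths))
    if mode == "shortest":
        return lengths.index(min(lengths))
    # "different": the choice whose length deviates most from the mean
    # length of the other choices (assumed definition).
    def deviation(i):
        others = [l for j, l in enumerate(lengths) if j != i]
        return abs(lengths[i] - sum(others) / len(others))
    return max(range(len(choices)), key=deviation)
```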

Within-Choices Similarity: With the vector representations of the choices from the pre-processing stage, the cosine distances among the four choices can be computed, and the choice that is (i) the most similar to or (ii) the most different from the others is selected.
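A sketch of this baseline, scoring each choice by its average cosine similarity to the other choices and then taking the maximum or minimum:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def within_choices_baseline(choice_vecs, mode="similar"):
    """Score each choice by its average cosine similarity to the others."""
    def avg_sim(i):
        return sum(cosine(choice_vecs[i], choice_vecs[j])
                   for j in range(len(choice_vecs)) if j != i) / (len(choice_vecs) - 1)
    idx = range(len(choice_vecs))
    return max(idx, key=avg_sim) if mode == "similar" else min(idx, key=avg_sim)
```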

Question and Choice Similarity: With the vector representations of the choices and the question from the pre-processing stage, the choice with the highest cosine similarity to the question is selected.

Sliding Window: This baseline tries to find the window of utterances in the story with the maximum similarity to the question. The similarity between a window of utterances and the question is the average cosine similarity, in their GloVe vector representations, between the utterances in the window and the question.

After obtaining the window with the largest cosine similarity to the question, the confidence score of each choice is the average cosine similarity between the utterances in the window and the choice. The choice with the highest score is selected as the answer.
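The two steps above (finding the best window, then scoring the choices against it) can be sketched as:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def sliding_window_answer(utt_vecs, question_vec, choice_vecs, window=5):
    """Find the window of utterances most similar to the question, then score
    each choice by its average similarity to the utterances in that window."""
    best_start, best_score = 0, float("-inf")
    for s in range(max(1, len(utt_vecs) - window + 1)):
        win = utt_vecs[s:s + window]
        score = sum(cosine(u, question_vec) for u in win) / len(win)
        if score > best_score:
            best_start, best_score = s, score
    win = utt_vecs[best_start:best_start + window]
    def confidence(c):
        return sum(cosine(u, c) for u in win) / len(win)
    return max(range(len(choice_vecs)), key=lambda i: confidence(choice_vecs[i]))
```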

Memory Network: The memory network [2] was implemented with some modifications for this TOEFL task, to find out whether a memory network is able to deal with it. The original memory network does not have an embedding module for the choices, so the module for the question was used to embed the choices as well. Besides, in order to have the memory network select the answer out of the four choices, instead of outputting a word as in its original version, the cosine similarity between the output of the last hop and each choice is computed, and the closest choice is selected as the answer. All the parameters of the embedding layers are shared in the memory network to avoid overfitting; without this modification, very poor results were obtained on the testing set. The embedding size of the memory network was set to 128, stochastic gradient descent was used with an initial learning rate of 0.01, and the batch size was 40. The number of hops was tuned from 1 to 3 on the development set.
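The modified answer-selection step, which replaces the original word-output layer, can be sketched as follows (the choice embeddings are assumed to come from the question-embedding module, as described above):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def select_answer(last_hop_output, choice_embeddings):
    """Pick the choice whose embedding is closest (in cosine similarity)
    to the output of the memory network's last hop."""
    sims = [cosine(last_hop_output, c) for c in choice_embeddings]
    return max(range(len(sims)), key=sims.__getitem__)
```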

Results

Accuracy, defined as the number of questions answered correctly divided by the total number of questions, is used as the evaluation metric.

Table 1 shows the results.

model                                                    manual                    ASR
(a) choice length: longest / shortest / most different   22.95% / 35.25% / 30.33%
(b) within-choices similarity: similar / different       36.07% / 27.87%
(c) question and choice similarity                       24.59%
(d) sliding window                                       33.61%                    31.15%
(e) memory network                                       39.17%                    39.17%
(f) AMRNN model: word-level / sentence-level attention   49.16% / 51.67%           48.33% / 46.67%

Table 1: Accuracy of the different models. Baselines (a)-(c) use only the text of the choices and questions, so their results do not depend on the story transcriptions.

The model was trained on the manual transcriptions of the stories, and tested on the testing set with both manual transcriptions (column labelled "manual") and ASR transcriptions (column labelled "ASR").

Choice length: Part (a) shows the performance of the three variants that select the answer with the longest, shortest or most different length, ranging from 23% to 35%.

Within choices similarity: Part (b) shows the performance of the two variants that select the choice most similar to or most different from the others. The accuracies are 36.07% and 27.87%, respectively.

Question and choice similarity: In part (c), selecting the choice most similar to the question yielded only 24.59%, very close to random guessing (25% with four choices).

Sliding window: Part (d), the sliding window, is the first baseline that considers the transcriptions of the stories. Window sizes of {1, 2, 3, 5, 10, 15, 20, 30} were tried, and the best window size found on the development set was 5. This implies that the information useful for answering a question probably lies within about 5 sentences. The accuracies of 33.61% and 31.15% without and with ASR errors, respectively, show how the ASR errors affected the results; the task here is too difficult for this approach to achieve good results.

Memory Network: The results of the memory network in part (e) show that this task is relatively difficult for it, even though memory networks have been successful on some other tasks. However, its 39.17% accuracy was clearly better than all the approaches mentioned above, and it is interesting that this result was independent of the ASR errors; the reason is under investigation. Without the shared embedding layer, the accuracy dropped to 31%.

AMRNN Model: The results of the AMRNN model are listed in part (f), for word-level and sentence-level attention respectively.

Without ASR errors, the AMRNN model with sentence-level attention gave an accuracy as high as 51.67%, and word-level attention was slightly lower. It is interesting that without ASR errors, sentence-level attention is about 2.5% higher than word-level attention, very possibly because taking in the information of a whole sentence is more useful than listening carefully to every word, especially for the conceptual and high-level questions in this task. Paying too much attention to every single word may be somewhat noisy.

On the other hand, the 34.32% ASR WER affected the model with sentence-level attention more than the model with word-level attention. This is very possibly because incorrectly recognized words can seriously change the meaning of a whole sentence. With word-level attention, however, when a word is incorrectly recognized, the model may be able to attend to other correctly recognized words to compensate for the ASR errors and still arrive at the correct answer.

Analysis of a Typical Example

Fig. 4 shows the visualization of the attention weights obtained for a typical example story in the testing set, with the AMRNN model using word-level or sentence-level attention on manual or ASR transcriptions respectively. The darker the color, the higher the weight. Only the small part of the story where the models' responses made the clearest difference is shown. This story was mainly about the thick clouds and some mysteries of Venus. The question for this story is "What is a possible origin of Venus' clouds?" and the correct choice is "Gases released as a result of volcanic activity." In the manual transcription case (left half of Fig. 4), both models, with word-level or sentence-level attention, answered the question correctly and focused on the words/sentences informative for the question. The sentence-level model successfully captured the sentence including "...volcanic eruptions often emits gases.", while the word-level model captured some important keywords like "volcanic eruptions" and "emit gases". In the ASR case (right half of Fig. 4), however, the ASR errors misled both models to put some attention on irrelevant words/sentences such as "In other area, you got canyons..."; the word-level model focused on some irrelevant words like "canyons" and "rift malaise", but still captured some of the important correct words like "volcanic" or "eruptions" and answered correctly. From the darkness of the color, we can observe that the problem caused by the ASR errors was more serious for sentence-level attention when capturing the key concepts needed for the question. This may explain why in part (f) of Table 1 the degradation caused by ASR errors was smaller for the word-level model than for the sentence-level model.

Conclusion

A new task based on the TOEFL corpus was created.

TOEFL is an English examination in which the English learner is asked to listen to a story of up to 5 minutes and then answer some corresponding questions. The learner needs deduction, logic and summarization to answer the questions. The AMRNN model is able to deal with this challenging task.

On manual transcriptions, the proposed model achieved 51.67% accuracy, while the memory network reached 39.17%. Even on ASR transcriptions with a WER of 34.32%, the AMRNN model still achieved 48.33% accuracy. Although sentence-level attention achieved the best results on the manual transcriptions, word-level attention outperformed sentence-level attention when there were ASR errors.

[1]

M. Tapaswi, Y. Zhu, R. Stiefelhagen, A. Torralba, R. Urtasun, and S. Fidler, "MovieQA: Understanding stories in movies through question-answering," arXiv preprint arXiv:1512.02902, 2015.

[2]

S. Sukhbaatar, J. Weston, R. Fergus et al., "End-to-end memory networks," in Advances in Neural Information Processing Systems, 2015, pp. 2431–2439.
