Datasets
We used MovieQA (Tapaswi et al., 2016) as the source MCQA dataset, and the TOEFL listening comprehension test (Tseng et al., 2016) and MCTest (Richardson et al., 2013) as two separate target datasets. Examples from the three datasets are shown in Table 1.
Table 1: Examples of the story-question-choices triplets from the MovieQA, TOEFL listening comprehension test, and MCTest datasets. S, Q, and C denote the story, question, and one of the answer choices, respectively. For MovieQA, each question comes with five answer choices; for TOEFL and MCTest, each question comes with only four answer choices. The correct answer is marked in bold.
MovieQA is a dataset that aims to evaluate automatic story comprehension from both video and text. The dataset provides multiple sources of information, such as plot synopses, scripts, subtitles, and video clips, that can be used to infer answers. We used only the plot synopses, so our setting is the same as pure textual MCQA. The dataset contains 9848/1958 train/dev examples; each question comes with a set of five possible answer choices, only one of which is correct.
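To make the shared data layout concrete, the sketch below models one story-question-choices triplet as a simple Python structure. The field names and the evaluation helper are illustrative assumptions, not the official MovieQA schema or evaluation script.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MCQAExample:
    """One story-question-choices triplet (field names are illustrative)."""
    story: str           # e.g., a MovieQA plot synopsis used as the passage
    question: str
    choices: List[str]   # five choices for MovieQA; four for TOEFL/MCTest
    answer_idx: int      # index of the single correct choice

def accuracy(examples: List[MCQAExample], predictions: List[int]) -> float:
    """Fraction of questions whose predicted choice index is correct."""
    correct = sum(p == ex.answer_idx for ex, p in zip(examples, predictions))
    return correct / len(examples)
```

Under this representation, the source and target datasets differ only in the number of answer choices and the origin of the story text, which is what makes transfer between them natural.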
The TOEFL listening comprehension test is a very challenging MCQA dataset that contains 717/124/122 train/dev/test examples. It aims to test the knowledge and skills of academic English for global English learners whose native languages are not English. There are only four answer choices for each question. The stories in this dataset are in audio form. Each story comes with two transcripts: a manual and an ASR transcription, where the latter is obtained by running the CMU Sphinx recognizer (Walker et al., 2004) on the original audio files. We use TOEFL-manual and TOEFL-ASR to denote the two versions, respectively. We highlight that the questions in this dataset are not easy: most of the answers cannot be found by simply matching the question and the choices without understanding the story. For example, there are questions about the gist of the story or the conclusion of the conversation.
MCTest is a collection of 660 elementary-level children's stories, each paired with four questions. Each question comes with a set of four answer choices. There are two variants of this dataset: MC160 and MC500. The former contains 280/120/240 train/dev/test examples, while the latter contains 1200/200/600 train/dev/test examples and is considered more difficult.
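Since each MCTest story is paired with four questions, the split sizes above are consistent with the story counts of the two variants; a quick sanity check (pure arithmetic, no external data):

```python
# MCTest provides four questions per story, so question counts
# should equal 4x the number of stories in each variant.
assert 280 + 120 + 240 == 160 * 4   # MC160: 640 questions
assert 1200 + 200 + 600 == 500 * 4  # MC500: 2000 questions
assert 160 + 500 == 660             # total stories in MCTest
```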
The two chosen target datasets are challenging because the stories and questions are complicated, and only small training sets are available. It is therefore difficult to train statistical models on their training sets alone: the small size limits the number of parameters in the models and prevents learning complex language concepts simultaneously with the capacity to answer questions. We demonstrate that we can effectively overcome these difficulties via transfer learning in Section 5.