LIBRISPEECH: AN ASR CORPUS BASED ON PUBLIC DOMAIN AUDIO BOOKS

MM Chiou


LibriSpeech is a corpus of read English speech.

The LibriSpeech corpus is suitable for training and evaluating speech recognition systems.

The LibriSpeech corpus is derived from audiobooks that are part of the LibriVox project, and contains 1000 hours of speech sampled at 16 kHz.

The LibriSpeech corpus is available for download, along with separately prepared language-model training data and pre-built language models.
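As a rough illustration of how one might obtain the data, the sketch below downloads and unpacks a small subset from the OpenSLR mirror and lists its audio files. The exact archive name (dev-clean.tar.gz), the mirror path, and the extracted directory layout are assumptions about the download page, not details given in this summary.

```python
# Minimal sketch: fetch and inspect a small LibriSpeech subset.
# Assumption: the OpenSLR mirror hosts the archive at the URL below
# (resource 12, "dev-clean" subset); adjust if the layout differs.
import tarfile
import urllib.request
from pathlib import Path

URL = "https://www.openslr.org/resources/12/dev-clean.tar.gz"  # assumed archive name
archive = Path("dev-clean.tar.gz")

if not archive.exists():
    urllib.request.urlretrieve(URL, archive)  # a few hundred MB

with tarfile.open(archive) as tar:
    tar.extractall(".")  # assumed to unpack into ./LibriSpeech/dev-clean/...

# Each utterance is a 16 kHz FLAC file; transcripts sit alongside the audio.
flac_files = sorted(Path("LibriSpeech/dev-clean").rglob("*.flac"))
print(f"{len(flac_files)} FLAC utterances found")
for f in flac_files[:3]:
    print(f)
```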

Acoustic models trained on LibriSpeech yield lower error rates on the Wall Street Journal (WSJ) test sets than models trained on WSJ itself.

Kaldi scripts for building speech recognition systems on the corpus have been released as well.

Introduction

The rapid increase in the amount of multimedia content on the internet makes it feasible to automatically collect data for the purpose of training statistical models.

When the source data is organized into well-curated, machine-readable collections, training statistical models becomes easier still.

The LibriVox project is a volunteer effort.

LibriVox is responsible for the creation of approximately 8,000 public domain audio books.

The majority of the LibriVox recordings are in English.

Most of the recordings are based on texts from Project Gutenberg, also in the public domain.

The volunteer-supported speech-gathering effort VoxForge, on which the acoustic models for alignment were trained, contains a certain amount of LibriVox audio, but the dataset is much smaller than LibriSpeech, with around 100 hours of English speech, and suffers from major gender and per-speaker duration imbalances.

The LibriSpeech corpus is a read speech data set based on LibriVox's audio books.

The LibriSpeech corpus is available under the permissive CC BY 4.0 license, and there are example scripts in the open source Kaldi ASR toolkit [1] that demonstrate how high-quality acoustic models can be trained on this data.

Section 2 presents the long audio alignment procedure used in the creation of this corpus.

Section 3 describes the structure of the corpus.

In section 4, the process used to build the language models is described.

Section 5 presents experimental results for models trained on this dataset, using both the LibriSpeech dev and test sets and the Wall Street Journal (WSJ) test sets.

Conclusion

In order to produce a corpus of English read speech suitable for training speech recognition systems, read speech from English audio books is aligned with and segmented according to the corresponding book text, and segments with noisy transcripts are filtered out.
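To make the filtering idea concrete, here is a minimal sketch, not the paper's actual procedure: score each candidate segment's first-pass recognition hypothesis against the corresponding book text with a word error rate, and keep only segments below a threshold. The 10% threshold and the helper names are illustrative assumptions.

```python
# Minimal sketch of transcript-quality filtering (not the paper's exact procedure):
# keep a segment only if a first-pass hypothesis agrees closely with the book text.

def word_error_rate(ref: list[str], hyp: list[str]) -> float:
    """Levenshtein distance over words, normalized by reference length."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

def keep_segment(book_text: str, hypothesis: str, max_wer: float = 0.1) -> bool:
    """Discard segments whose decoded hypothesis disagrees too much with the text.
    The 0.1 threshold is illustrative, not the value used to build the corpus."""
    return word_error_rate(book_text.split(), hypothesis.split()) <= max_wer

# Example: a clean segment is kept, a noisy one is dropped.
print(keep_segment("he walked into the room", "he walked into the room"))   # True
print(keep_segment("he walked into the room", "she ran out of the house"))  # False
```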

Models trained on this corpus have been demonstrated to perform better on the standard Wall Street Journal (WSJ) test sets than models built on WSJ itself; the larger size of the corpus (1000 hours, versus the 82 hours of WSJ's si-284 data) outweighs the audio mismatch.

The corpus is released online at http://openslr.org/12/.

Scripts for these data have been introduced into the Kaldi speech recognition toolkit, so users can replicate these results.

[0]

V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “LibriSpeech: an ASR corpus based on public domain audio books,” in Proc. ICASSP, 2015.

https://www.danielpovey.com/files/2015_icassp_librispeech.pdf

[1]

D. Povey, A. Ghoshal, et al., “The Kaldi Speech Recognition Toolkit,” in Proc. ASRU, 2011.
