Open Speech and Language Resources (OpenSLR)
OpenSLR is a site devoted to hosting speech and language resources, such as training corpora for speech recognition, and software related to speech recognition.
OpenSLR intends to be a convenient place for anyone to put data for public download.
OpenSLR aims to mirror software available elsewhere, in order to provide a failover location.
OpenSLR mirrors some software that is used in the Kaldi scripts.
OpenSLR aims to provide a central, hassle-free place for others to put their speech resources.
If you want to download things from this site, please download them one at a time, and please don't use any fancy software; just download things from your browser or use 'wget'.
OpenSLR has a firewall rule that drops connections from hosts with more than 5 simultaneous connections, and certain types of download software may trigger this rule.
Aside from the main site openslr.org, OpenSLR also has a mirror in China, available at cn-mirror.openslr.org.
The mirror server is made available by Surfing Technology.
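The one-at-a-time download request above can be sketched in Python as follows. This is a minimal illustration, not a tool provided by the site: the function name `download_sequentially` and its `fetch` parameter are assumptions made for this example. Each file is fetched over a single connection, and the next download starts only after the previous one has finished, which stays well under the 5-connection limit.

```python
import os
import urllib.request
from urllib.parse import urlparse


def download_sequentially(urls, dest_dir, fetch=urllib.request.urlretrieve):
    """Download each URL in turn, using one connection at a time.

    `fetch` is a hypothetical hook (defaulting to urllib's urlretrieve)
    so that the download step can be replaced, for example in tests.
    """
    os.makedirs(dest_dir, exist_ok=True)
    saved = []
    for url in urls:
        # Use the last path component of the URL as the local file name.
        filename = os.path.basename(urlparse(url).path)
        target = os.path.join(dest_dir, filename)
        # This call blocks until the file is complete, so downloads
        # never overlap and never open parallel connections.
        fetch(url, target)
        saved.append(target)
    return saved
```

Here `urls` would be the download links copied from the listings on this site, passed in the order they should be fetched.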
LibriSpeech ASR corpus
Identifier: SLR12
Summary: Large-scale (1000 hours) corpus of read English speech
Category: Speech
License: CC BY 4.0
Downloads (use a mirror closer to you):
dev-clean.tar.gz [337M] (development set, "clean" speech) Mirrors: [China]
dev-other.tar.gz [314M] (development set, "other", more challenging, speech) Mirrors: [China]
test-clean.tar.gz [346M] (test set, "clean" speech) Mirrors: [China]
test-other.tar.gz [328M] (test set, "other" speech) Mirrors: [China]
train-clean-100.tar.gz [6.3G] (training set of 100 hours "clean" speech) Mirrors: [China]
train-clean-360.tar.gz [23G] (training set of 360 hours "clean" speech) Mirrors: [China]
train-other-500.tar.gz [30G] (training set of 500 hours "other" speech) Mirrors: [China]
intro-disclaimers.tar.gz [695M] (extracted LibriVox announcements for some of the speakers) Mirrors: [China]
original-mp3.tar.gz [87G] (LibriVox mp3 files, from which the corpus audio was extracted) Mirrors: [China]
original-books.tar.gz [297M] (Project Gutenberg texts, against which the audio in the corpus was aligned) Mirrors: [China]
raw-metadata.tar.gz [33M] (Some extra meta-data produced during the creation of the corpus) Mirrors: [China]
md5sum.txt [600 bytes] (MD5 checksums for the archive files) Mirrors: [China]
About this resource:
LibriSpeech is a corpus of approximately 1000 hours of 16 kHz read English speech, prepared by Vassil Panayotov with the assistance of Daniel Povey. The data is derived from read audiobooks from the LibriVox project, and has been carefully segmented and aligned.
Acoustic models trained on this data set are available at kaldi-asr.org, and language models suitable for evaluation can be found at http://www.openslr.org/11/.
For more information, see the paper "LibriSpeech: an ASR corpus based on public domain audio books", Vassil Panayotov, Guoguo Chen, Daniel Povey and Sanjeev Khudanpur, ICASSP 2015 (submitted).
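The md5sum.txt file in the download list can be used to check that the archives arrived intact. The sketch below assumes, without confirmation from this page, that md5sum.txt follows the usual `md5sum` layout of one "<hex digest> <file name>" pair per line; the function names `md5_of_file` and `verify_checksums` are made up for this example.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(checksum_file, data_dir):
    """Compare files in data_dir against a checksum listing.

    Assumes each line looks like "<hex digest>  <file name>".
    Returns {name: True/False}, or None for files not found locally.
    """
    results = {}
    with open(checksum_file, "r", encoding="utf-8") as handle:
        for line in handle:
            parts = line.strip().split(None, 1)
            if len(parts) != 2:
                continue  # skip blank or malformed lines
            expected, name = parts
            name = name.lstrip("*")  # md5sum marks binary mode with '*'
            path = os.path.join(data_dir, name)
            if os.path.exists(path):
                results[name] = md5_of_file(path) == expected.lower()
            else:
                results[name] = None
    return results
```

Reading in chunks keeps memory use flat, which matters for the larger archives in the list. A result of `False` for any file suggests downloading that archive again.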
LibriSpeech language models, vocabulary and G2P models
Identifier: SLR11
Summary: Language modelling resources, for use with the LibriSpeech ASR corpus
Category: Text
License: Public domain
Downloads (use a mirror closer to you):
librispeech-lm-corpus.tgz [1.8G] (14500 public domain books, used as training material for the LibriSpeech LM) Mirrors: [China]
librispeech-lm-norm.txt.gz [1.5G] (Normalized LM training text) Mirrors: [China]
librispeech-vocab.txt [1.7M] (200K word vocabulary for the LM) Mirrors: [China]
librispeech-lexicon.txt [5.6M] (Pronunciations, some of which are auto-generated by G2P, for all words in the vocabulary) Mirrors: [China]
3-gram.arpa.gz [759M] (3-gram ARPA LM, not pruned) Mirrors: [China]
3-gram.pruned.1e-7.arpa.gz [34M] (3-gram ARPA LM, pruned with threshold 1e-7) Mirrors: [China]
3-gram.pruned.3e-7.arpa.gz [13M] (3-gram ARPA LM, pruned with threshold 3e-7) Mirrors: [China]
4-gram.arpa.gz [1.3G] (4-gram ARPA LM, usually used for rescoring) Mirrors: [China]
g2p-model-5 [20M] (Fifth-order Sequitur G2P model) Mirrors: [China]
About this resource:
Language modeling resources to be used in conjunction with the (soon-to-be-released) LibriSpeech ASR corpus.
This corpus and these resources were prepared by Vassil Panayotov with the assistance of Daniel Povey and Sanjeev Khudanpur. We hope to finalize this and release the corpus here by the ICASSP deadline (early October 2014).
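As a quick sanity check on one of the ARPA language models listed above, the n-gram counts can be read from the file header without loading the whole model. The sketch below relies on the standard ARPA layout, in which a `\data\` line is followed by lines of the form `ngram N=COUNT`; the function name `read_arpa_ngram_counts` is made up for this example, and the code has not been run against the actual files.

```python
import gzip


def read_arpa_ngram_counts(path):
    """Return {order: count} from the header of an ARPA LM file.

    Handles plain and gzip-compressed files; stops reading as soon as
    the header ends, so large models are not read in full.
    """
    counts = {}
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        in_header = False
        for line in handle:
            line = line.strip()
            if line == "\\data\\":
                in_header = True
            elif in_header and line.startswith("ngram "):
                # Header entries look like "ngram 3=12345".
                order, count = line[len("ngram "):].split("=")
                counts[int(order)] = int(count)
            elif in_header and line.startswith("\\"):
                break  # next section (e.g. "\1-grams:") begins
    return counts
```

Comparing the counts of the pruned and unpruned 3-gram models is one simple way to see how much a given pruning threshold removed.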