Resource	Name	Category	Summary

SLR1	Yesno	Speech	Sixty recordings of one individual saying yes or no in Hebrew; each recording is eight words long.

SLR2	OpenFST	Software	A mirror of the OpenFst toolkit

SLR3	sph2pipe	Software	A mirror of the sph2pipe software

SLR4	sctk	Software	A mirror of the sctk scoring software

SLR5	MSU Switchboard transcripts	Text	A mirror of the Mississippi State transcripts and lexicon for Switchboard.

SLR6	Vystadial	Speech	English and Czech data, mirrored from the Vystadial project

SLR7	TED-LIUM	Speech	English speech recognition training corpus from TED talks, created by Laboratoire d’Informatique de l’Université du Maine (LIUM) (mirrored here)

SLR8	Sprakbanken	Text	Danish pronunciation dictionary generated using eSpeak

SLR9	The AMI pack	Text	Some auxiliary non-speech data used to build AMI systems with Kaldi

SLR10	SRE Data	Misc	Various files from SRE data that NIST used to host online

SLR11	LibriSpeech language models, vocabulary and G2P models	Text	Language modelling resources, for use with the LibriSpeech ASR corpus

SLR12	LibriSpeech ASR corpus	Speech	Large-scale (1000 hours) corpus of read English speech

SLR13	RWCP Sound Scene Database	Speech + Software	A database of recordings of real-world sounds and measured room impulse responses

SLR14	BEEP Dictionary	Text	Phonemic transcriptions of over 250,000 English words. (British English pronunciations)

SLR15	SRE Speaker List	Misc	A list linking speakers across NIST SRE corpora

SLR16	The AMI Corpus	Speech	Acoustic speech data and meta-data from The AMI corpus.

SLR17	MUSAN	Audio	A corpus of music, speech, and noise

SLR18	THCHS-30	Speech	A Free Chinese Speech Corpus Released by CSLT@Tsinghua University

SLR19	TED-LIUMv2	Audio	TED-LIUM corpus release 2, English speech recognition training corpus from TED talks, created by Laboratoire d’Informatique de l’Université du Maine (LIUM) (mirrored here)

SLR20	Aachen Impulse Response Database	Audio	Aachen Impulse Response database (AIR): a database of room impulse responses (mirrored here)

SLR21	Spanish Word list	Text	A list of words in Spanish with frequency derived from a large corpus (Spanish Gigaword).

SLR22	THUYG-20	Speech	A free Uyghur speech database Released by CSLT@Tsinghua University & Xinjiang University

SLR23	NIST LRE 2007 Key	Misc	A file containing metadata for the utterances in the LRE 2007 evaluation

SLR24	Iban	Speech	Iban language text and speech corpora for ASR

SLR25	ALFFA (African Languages in the Field: speech Fundamentals and Automation)	Speech	Amharic, Swahili and Wolof data, mirrored from the ALFFA git repository

SLR26	Simulated Room Impulse Response Database	Audio	A database of simulated room impulse responses

SLR27	Cantab-TEDLIUM Release 1.1 (February 2015)	Text	Cantab Research Language models for the TEDLIUM database

SLR28	Room Impulse Response and Noise Database	Audio	A database of simulated and real room impulse responses, isotropic and point-source noises. All audio files in this data set use a 16 kHz sampling rate and 16-bit precision.

SLR29	Sprakbanken_Swe	Text	Swedish pronunciation dictionary

SLR30	Sinhala TTS	Speech	Sinhalese multi-speaker TTS corpora

SLR31	Mini LibriSpeech ASR corpus	Speech	Subset of LibriSpeech corpus for purpose of regression testing

SLR32	High quality TTS data for four South African languages (af, st, tn, xh)	Speech	Multi-speaker TTS data for four South African languages, Afrikaans, Sesotho, Setswana and isiXhosa.

SLR33	Aishell	Speech	Mandarin data, provided by Beijing Shell Shell Technology Co., Ltd

SLR34	Santiago Spanish Lexicon	Text	A pronouncing dictionary for the Spanish language.

SLR35	Large Javanese ASR training data set	Speech	Javanese ASR training data set containing ~185K utterances.

SLR36	Large Sundanese ASR training data set	Speech	Sundanese ASR training data set containing ~220K utterances.

SLR37	High quality TTS data for Bengali languages	Speech	Multi-speaker TTS data for Bangladesh Bengali (bn-BD) and Indian Bengali (bn-IN).

SLR38	Free ST Chinese Mandarin Corpus	Speech	A free Chinese Mandarin corpus by Surfingtech (www.surfing.ai), containing 102,600 utterances from 855 speakers.

SLR39	Heroico	Speech	Spanish data, mirrored from the LDC

SLR40	Zeroth-Korean	Speech Corpus for Automatic Speech Recognition	Korean Open-source Speech Corpus for Speech Recognition by Zeroth Project (https://github.com/goodatlas/zeroth)

SLR41	High quality TTS data for Javanese.	Speech	Multi-speaker TTS data for Javanese (jv-ID)

SLR42	High quality TTS data for Khmer.	Speech	Multi-speaker TTS data for Khmer (km-KH)

SLR43	High quality TTS data for Nepali.	Speech	Multi-speaker TTS data for Nepali (ne-NP)

SLR44	High quality TTS data for Sundanese.	Speech	Multi-speaker TTS data for Sundanese (su-ID)

SLR45	Free ST American English Corpus	Speech	A free American English corpus by Surfingtech (www.surfing.ai), containing utterances from 10 speakers, each with about 350 utterances.

SLR46	Tunisian_MSA	Speech	Tunisian Modern Standard Arabic

SLR47	Primewords Chinese Corpus Set 1	Speech	Chinese Mandarin corpus released by Shanghai Primewords Co. Ltd. (www.primewords.cn), containing 100 hours of speech data.

SLR48	MADCAT Arabic data splits	Other	Unofficial data splits (dev/train/test) for the MADCAT Arabic LDC corpus

SLR49	VoxCeleb Data	Misc	Various files for the VoxCeleb datasets

SLR50	MADCAT Chinese data splits	Other	Unofficial data splits (dev/train/test) for the MADCAT Chinese LDC corpus

SLR51	TED-LIUM Release 3	Speech	TED-LIUM corpus release 3

SLR52	Large Sinhala ASR training data set	Speech	Sinhala ASR training data set containing ~185K utterances.

SLR53	Large Bengali ASR training data set	Speech	Bengali ASR training data set containing ~196K utterances.

SLR54	Large Nepali ASR training data set	Speech	Nepali ASR training data set containing ~157K utterances.

SLR55	CLMAD	Text	A Chinese Language Model Adaptation Dataset (CLMAD).

SLR56	IAM Aachen splits	Other	Aachen data splits (train/test/val) for the IAM dataset.

[0]

http://www.openslr.org/resources.php
