Data for the TensorFlow Speech Recognition Challenge
There are four data files: link_to_gcp_credits_form.txt, sample_submission.7z, test.7z and train.7z.
File descriptions:
train.7z
train.7z contains a few informational files and a folder of audio files. The audio folder contains subfolders of 1-second clips of voice commands, with the folder name being the label of the audio clip.
The training data contains more labels than the ones you need to predict. The labels you will need to predict in Test are yes, no, up, down, left, right, on, off, stop, go. Everything else should be considered either unknown or silence.
The folder _background_noise_ contains longer clips of "silence" that you can break up and use as training input.
The files contained in the training audio are not uniquely named across labels, but they are unique if you include the label folder. For example, 00f0204f_nohash_0.wav is found in 14 folders, but that file is a different speech command in each folder.
The files are named so that the first element is the subject ID of the person who gave the voice command, and the last element indicates repeated commands. Repeated commands occur when the subject repeats the same word multiple times. Subject IDs are not provided for the test data, and you can assume that the majority of the commands in the test data were from subjects not seen in training.
You can expect some inconsistencies in the properties of the training data (e.g., the length of the audio).
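As a minimal sketch of the label mapping described above (the mapping rules come from this description; the function itself is illustrative):

```python
# The ten target words keep their own labels, clips from
# _background_noise_ count as "silence", and every other word
# maps to "unknown".
TARGET_WORDS = {"yes", "no", "up", "down", "left", "right",
                "on", "off", "stop", "go"}

def competition_label(folder_name):
    if folder_name == "_background_noise_":
        return "silence"
    return folder_name if folder_name in TARGET_WORDS else "unknown"

print(competition_label("go"))      # -> go
print(competition_label("marvin"))  # -> unknown
```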
test.7z
test.7z contains an audio folder with 150,000+ files in the format clip_000044442.wav. The task is to predict the correct label. Not all of the files are evaluated for the leaderboard score.
sample_submission.csv
sample_submission.csv is a sample submission file in the correct format.
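As a rough sketch, assuming the sample file's two columns are fname and label, a submission could be written like this (the prediction dictionary is a stand-in for a real model's output):

```python
import csv

# Hypothetical predictions: test clip file name -> predicted label.
predictions = {"clip_000044442.wav": "silence"}

# Write a submission file in the assumed sample_submission.csv layout.
with open("submission.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["fname", "label"])
    for fname, label in sorted(predictions.items()):
        writer.writerow([fname, label])
```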
link_to_gcp_credits_form.txt
link_to_gcp_credits_form.txt provides the URL to request $500 in GCP credits, provided to the first 500 requestors. Please see this clarification on credit qualification [5].
Speech Commands Data Set v0.01
This is a set of one-second .wav audio files, each containing a single spoken English word.
These words are from a small set of commands, and are spoken by a variety of different speakers. The audio files are organized into folders based on the word they contain, and this data set is designed to help train simple machine learning models.
It's licensed under the Creative Commons BY 4.0 license. You can see the license file in this folder for full details. Its original location was at
http://download.tensorflow.org/data/speech_commands_v0.01.tar.gz.
History
This is version 0.01 of the data set, containing 64,727 audio files, released on August 3rd 2017.
Collection
The audio files were collected using crowdsourcing, see
https://github.com/petewarden/extract_loudest_section
for some of the open source audio collection code we used (and please consider contributing to enlarge this data set).
The goal was to gather examples of people speaking single-word commands, rather than conversational sentences, so they were prompted for individual words over the course of a five minute session.
Twenty core command words were recorded, with most speakers saying each of them five times. The core words are "yes", "no", "up", "down", "left", "right", "on", "off", "stop", "go", "zero", "one", "two", "three", "four", "five", "six", "seven", "eight", and "nine".
To help distinguish unrecognized words, there are also ten auxiliary words, which most speakers only said once. These include "bed", "bird", "cat", "dog", "happy", "house", "marvin", "sheila", "tree", and "wow".
Organization
The files are organized into folders, with each directory name labeling the word that is spoken in all the contained audio files. No details were kept of any of the participants' age, gender, or location, and random IDs were assigned to each individual. These IDs are stable though, and encoded in each file name as the first part before the underscore. If a participant contributed multiple utterances of the same word, these are distinguished by the number at the end of the file name. For example, the file path
happy/3cfc6b3a_nohash_2.wav indicates that the word spoken was "happy", the speaker's ID was "3cfc6b3a", and this is the third utterance of that word by this speaker in the data set.
The 'nohash' section is to ensure that all the utterances by a single speaker are sorted into the same training partition, to keep similar repetitions from giving unrealistically optimistic evaluation scores.
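As an illustration of this naming convention, a small hypothetical helper can recover the label, speaker ID, and utterance index from a clip path:

```python
import os
import re

# Parses paths of the form <label>/<speaker_id>_nohash_<index>.wav,
# as described above. This helper is illustrative only.
def parse_clip_path(path):
    label = os.path.basename(os.path.dirname(path))
    match = re.match(r"([0-9a-f]+)_nohash_(\d+)\.wav$",
                     os.path.basename(path))
    speaker_id, utterance_index = match.group(1), int(match.group(2))
    return label, speaker_id, utterance_index

print(parse_clip_path("happy/3cfc6b3a_nohash_2.wav"))
# -> ('happy', '3cfc6b3a', 2)
```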
Partitioning
The audio clips have not been separated into training, test, and validation sets explicitly, but by convention a hashing function is used to stably assign each file to a set:
```python
MAX_NUM_WAVS_PER_CLASS = 2**27 - 1  # ~134M

def which_set(filename, validation_percentage, testing_percentage):
    """Determines which data partition the file should belong to.

    We want to keep files in the same training, validation, or testing
    sets even if new ones are added over time. This makes it less likely
    that testing samples will accidentally be reused in training when
    long runs are restarted, for example. To keep this stability, a hash
    of the filename is taken and used to determine which set it should
    belong to. This determination only depends on the name and the set
    proportions, so it won't change as other files are added.
    """
    ...
```
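For reference, here is a self-contained sketch that completes the elided body following the scheme described above (hash the file name with the _nohash_ suffix stripped, then map the hash to a stable percentage); minor details may differ from the exact TensorFlow implementation:

```python
import hashlib
import os
import re

MAX_NUM_WAVS_PER_CLASS = 2**27 - 1  # ~134M

def which_set(filename, validation_percentage, testing_percentage):
    """Returns 'training', 'validation', or 'testing' for a file."""
    base_name = os.path.basename(filename)
    # Ignore anything after '_nohash_' so that all utterances from the
    # same speaker land in the same partition.
    hash_name = re.sub(r"_nohash_.*$", "", base_name)
    # Hash the name and map it to a stable percentage in [0, 100].
    name_hash = hashlib.sha1(hash_name.encode("utf-8")).hexdigest()
    percentage_hash = ((int(name_hash, 16) % (MAX_NUM_WAVS_PER_CLASS + 1)) *
                       (100.0 / MAX_NUM_WAVS_PER_CLASS))
    if percentage_hash < validation_percentage:
        return "validation"
    elif percentage_hash < (validation_percentage + testing_percentage):
        return "testing"
    return "training"

# Example: assign a clip under a 10% validation / 10% testing split.
print(which_set("happy/3cfc6b3a_nohash_2.wav", 10, 10))
```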
Processing
The original audio files were collected in uncontrolled locations by people around the world. We requested that they do the recording in a closed room for privacy reasons, but did not stipulate any quality requirements. This was by design, since we wanted examples of the sort of speech data that we're likely to encounter in consumer and robotics applications, where we do not have much control over the recording equipment or environment. The data was captured in a variety of formats, for example Ogg Vorbis encoding for the web app, and then converted to a 16-bit little-endian PCM-encoded WAVE file at a 16000 sample rate. The audio was then trimmed to a one-second length to align most utterances, using the extract_loudest_section tool. The audio files were then screened for silence or incorrect words, and arranged into folders by label.
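For example, the resulting format can be verified with Python's standard wave module (the clip path below is illustrative, not a real file name):

```python
import wave

# Minimal check of the format described above: 16-bit PCM samples at a
# 16000 sample rate, roughly one second long.
with wave.open("train/audio/yes/0a7c2a8d_nohash_0.wav", "rb") as w:
    assert w.getsampwidth() == 2       # 16-bit samples
    assert w.getframerate() == 16000   # 16,000 samples per second
    duration = w.getnframes() / w.getframerate()
    print(f"{w.getnframes()} frames, {duration:.2f} s")
```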
Background Noise
To help train networks to cope with noisy environments, it can be helpful to mix in realistic background audio. The _background_noise_ folder contains a set of longer audio clips that are either recordings or mathematical simulations of noise. For more details, see the _background_noise_/README.md.
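As a minimal sketch of that mixing step, assuming numpy and scipy are available and using illustrative file names, one could take a random one-second slice of a noise clip and add it at reduced volume to a command clip:

```python
import numpy as np
from scipy.io import wavfile

# Take a random one-second slice of a background noise recording and
# add it, attenuated, to a command clip. Both file names are examples;
# any clips from the folders described above would work.
noise_rate, noise = wavfile.read("train/audio/_background_noise_/white_noise.wav")
voice_rate, voice = wavfile.read("train/audio/yes/0a7c2a8d_nohash_0.wav")
assert noise_rate == voice_rate == 16000

start = np.random.randint(0, len(noise) - len(voice))
segment = noise[start:start + len(voice)].astype(np.float32)
mixed = voice.astype(np.float32) + 0.1 * segment  # mix at 10% volume
mixed = np.clip(mixed, -32768, 32767).astype(np.int16)
wavfile.write("yes_with_noise.wav", voice_rate, mixed)
```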
Citations
If you use the Speech Commands dataset in your work, please cite it as:
APA-style citation: "Warden P. Speech Commands: A public dataset for single-word speech recognition, 2017. Available from http://download.tensorflow.org/data/speech_commands_v0.01.tar.gz".
BibTeX: @article{speechcommands, title={Speech Commands: A public dataset for single-word speech recognition.}, author={Warden, Pete}, journal={Dataset available from http://download.tensorflow.org/data/speech_commands_v0.01.tar.gz}, year={2017} }
Credits
Massive thanks are due to everyone who donated recordings to this data set, I'm very grateful. I also couldn't have put this together without the help and support of Billy Rutledge, Rajat Monga, Raziel Alvarez, Brad Krueger, Barbara Petit, Gursheesh Kour, and all the AIY and TensorFlow teams.
[0] https://www.kaggle.com/c/tensorflow-speech-recognition-challenge
[5] https://www.kaggle.com/c/tensorflow-speech-recognition-challenge/discussion/44449