Example Data Preparation in the Kaldi for Dummies Tutorial
Audio Data
The example data set of 100 files for setting up an ASR system is described as follows.
The file format is WAV, and each file contains one sentence/utterance.
Each sentence/utterance contains 3 spoken digits (0, 1, 2, ..., 9) recorded in English.
Each of these audio files is named in a recognizable way (e.g. 1_5_6.wav, which means that the spoken sentence is one, five, six), and placed in a recognizable folder representing a particular speaker during a particular recording session.
There are 10 different speakers. Note that an ASR system must be trained and tested on different speakers; the more speakers there are, the better the performance.
The 100 *.wav files are placed in 10 folders related to particular speakers.
If the recordings of the same person are in two different quality/noise environments, they can be put in separate folders.
The example data set can be adjusted to your particular case.
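The naming convention described above can be checked mechanically before any Kaldi files are built. The following is a minimal sketch, assuming every file is named with exactly three single digits separated by underscores; the pattern WAV_NAME and the function parse_digits are illustrative names, not part of Kaldi.

```python
import re

# Assumed convention from this section: <digit>_<digit>_<digit>.wav
WAV_NAME = re.compile(r"^([0-9])_([0-9])_([0-9])\.wav$")


def parse_digits(wav_name):
    """Return the three digits encoded in a name like '1_5_6.wav', or None."""
    match = WAV_NAME.match(wav_name)
    return match.groups() if match else None


print(parse_digits("1_5_6.wav"))  # ('1', '5', '6')
print(parse_digits("1_5.wav"))    # None
```

Files for which parse_digits returns None do not follow the convention and are worth renaming before continuing.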
Task
Go to the kaldi-trunk/egs/digits directory and create a digits_audio folder.
In kaldi-trunk/egs/digits/digits_audio create two folders: train and test.
Select one speaker of your choice to represent the testing data set.
Use this speaker's 'speakerID' as the name of a new folder in the kaldi-trunk/egs/digits/digits_audio/test directory.
Then put all the audio files related to that person there.
Put the remaining 9 speakers into the train folder; this will be the training data set.
Create a subfolder for each speaker there as well.
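The folder layout from this task can also be produced with a short script. This is only a sketch under assumptions: the recordings are assumed to sit in a source folder with one subfolder per speaker, and the function name split_speakers and the example folder name 'recordings' are hypothetical.

```python
import shutil
from pathlib import Path


def split_speakers(source_dir, audio_dir, test_speaker):
    """Copy each speaker folder into audio_dir/test or audio_dir/train."""
    for speaker_dir in sorted(p for p in source_dir.iterdir() if p.is_dir()):
        # The chosen speaker goes to 'test', everybody else to 'train'.
        subset = "test" if speaker_dir.name == test_speaker else "train"
        target = audio_dir / subset / speaker_dir.name
        target.mkdir(parents=True, exist_ok=True)
        for wav in sorted(speaker_dir.glob("*.wav")):
            shutil.copy2(wav, target / wav.name)


# Hypothetical usage, with 'recordings' holding one folder per speaker:
# split_speakers(Path("recordings"),
#                Path("kaldi-trunk/egs/digits/digits_audio"),
#                test_speaker="josh")
```

The files are copied rather than moved, so the original recordings stay untouched if the split needs to be redone with a different test speaker.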
Acoustic Data
Some text files need to be created to allow Kaldi to communicate with the audio data.
Consider these files mandatory.
Each file in this section is a plain text file containing some number of strings, each on its own line.
These strings need to be sorted.
Kaldi scripts utils/validate_data_dir.sh and utils/fix_data_dir.sh can be used to check and fix data.
The utils directory is attached to the project in the Tools attachment section.
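Before running the validation script, the sorting requirement can be checked quickly in Python. The sketch below assumes that the expected order is the byte-wise (C locale) order of the first field on each line, which is my understanding of the usual Kaldi convention; utils/validate_data_dir.sh remains the authoritative check. The function names are illustrative.

```python
def sort_key(line):
    """Byte-wise key built from the first whitespace-separated field."""
    return line.split(maxsplit=1)[0].encode("utf-8")


def is_sorted(lines):
    """Return True if the lines are already ordered by their first field."""
    keys = [sort_key(line) for line in lines]
    return keys == sorted(keys)


def sort_lines(lines):
    """Return a new list ordered by the first field."""
    return sorted(lines, key=sort_key)


lines = ["july_1_2_5 july", "dad_4_4_2 dad"]
print(is_sorted(lines))   # False
print(sort_lines(lines))  # ['dad_4_4_2 dad', 'july_1_2_5 july']
```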
Task
In the kaldi-trunk/egs/digits directory, create a folder 'data'.
Then create 'test' and 'train' subfolders in the 'data' folder.
Create the following files in each subfolder.
The files in the 'test' and 'train' subfolders have the same names, but they describe different data sets.
a) spk2gender
This file gives the gender of each speaker.
Kaldi assumes that 'speakerID' is a unique name for each speaker. In this case it is also a 'recordingID', because every speaker has only one audio data folder from one recording session.
In this example, there are 5 female and 5 male speakers. f=female, m=male.
Pattern: <speakerID> <gender>
cristine f
dad m
josh m
july f
# and so on ...
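The spk2gender lines can be generated from a small speaker-to-gender mapping. This is a minimal sketch; the mapping has to be written by hand because gender cannot be derived from the file names, and the function name spk2gender_lines is an illustrative assumption.

```python
def spk2gender_lines(genders):
    """Build sorted '<speakerID> <gender>' lines from a dict."""
    lines = []
    for speaker in sorted(genders):
        gender = genders[speaker]
        if gender not in ("f", "m"):
            raise ValueError(f"unexpected gender {gender!r} for {speaker}")
        lines.append(f"{speaker} {gender}")
    return lines


# The four speakers listed above; extend the dict with the remaining ones.
genders = {"july": "f", "dad": "m", "cristine": "f", "josh": "m"}
print("\n".join(spk2gender_lines(genders)))
```

Writing the result to data/train/spk2gender or data/test/spk2gender is then a matter of joining the lines with newlines.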
b) wav.scp
This file connects every utterance (a sentence said by one person during a particular recording session) with the audio file related to this utterance.
In the naming approach used here, 'utteranceID' is the 'speakerID' joined with the three digits from the file name (e.g. dad_4_4_2).
Pattern: <utteranceID> <full_path_to_audio_file>
dad_4_4_2 /home/{user}/kaldi-trunk/egs/digits/digits_audio/train/dad/4_4_2.wav
july_1_2_5 /home/{user}/kaldi-trunk/egs/digits/digits_audio/train/july/1_2_5.wav
july_6_8_3 /home/{user}/kaldi-trunk/egs/digits/digits_audio/train/july/6_8_3.wav
# and so on ...
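The wav.scp lines follow directly from the folder layout, so they can be generated rather than typed. The sketch below assumes the digits_audio/train (or test) layout with one subfolder per speaker; the function name wav_scp_lines is hypothetical.

```python
from pathlib import Path


def wav_scp_lines(subset_dir):
    """Build sorted '<utteranceID> <full_path>' lines for one subset folder."""
    lines = []
    for wav in subset_dir.glob("*/*.wav"):
        speaker = wav.parent.name           # folder name is the speakerID
        utt_id = f"{speaker}_{wav.stem}"    # e.g. 'dad' + '_' + '4_4_2'
        lines.append(f"{utt_id} {wav.resolve()}")
    return sorted(lines)


# Hypothetical usage:
# lines = wav_scp_lines(Path("kaldi-trunk/egs/digits/digits_audio/train"))
```

Using resolve() yields absolute paths, which matches the full-path form shown in the listing above.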
c) text
This file contains every utterance matched with its text transcription.
Pattern: <utteranceID> <text_transcription>
dad_4_4_2 four four two
july_1_2_5 one two five
july_6_8_3 six eight three
# and so on ...
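Because the digits are part of each utteranceID, the transcription can be derived from the ID itself. This is a sketch under the assumption that every ID ends in exactly three underscore-separated digits; DIGIT_WORDS and text_line are illustrative names.

```python
# Index i holds the spoken word for digit i.
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def text_line(utt_id):
    """Turn 'july_1_2_5' into 'july_1_2_5 one two five'."""
    # Split from the right so a speakerID containing '_' stays intact.
    _speaker, *digits = utt_id.rsplit("_", 3)
    words = " ".join(DIGIT_WORDS[int(d)] for d in digits)
    return f"{utt_id} {words}"


print(text_line("july_1_2_5"))  # july_1_2_5 one two five
```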
d) utt2spk
This file tells the ASR system which utterance belongs to which speaker.
Pattern: <utteranceID> <speakerID>
dad_4_4_2 dad
july_1_2_5 july
july_6_8_3 july
# and so on ...
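Since each utteranceID starts with the speakerID, utt2spk can be derived from the first column of wav.scp. The following is a minimal sketch assuming the ID scheme of this tutorial (speakerID followed by three digits); the function name utt2spk_lines is hypothetical.

```python
def utt2spk_lines(wav_scp_lines):
    """Build sorted '<utteranceID> <speakerID>' lines from wav.scp lines."""
    lines = []
    for entry in wav_scp_lines:
        utt_id = entry.split(maxsplit=1)[0]
        # Drop the trailing three digits to recover the speakerID.
        speaker = utt_id.rsplit("_", 3)[0]
        lines.append(f"{utt_id} {speaker}")
    return sorted(lines)


wav_scp = [
    "july_1_2_5 /home/{user}/kaldi-trunk/egs/digits/digits_audio/train/july/1_2_5.wav",
    "dad_4_4_2 /home/{user}/kaldi-trunk/egs/digits/digits_audio/train/dad/4_4_2.wav",
]
print(utt2spk_lines(wav_scp))  # ['dad_4_4_2 dad', 'july_1_2_5 july']
```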
e) corpus.txt
This file lives in a slightly different location.
In kaldi-trunk/egs/digits/data create another folder, 'local'.
In kaldi-trunk/egs/digits/data/local create a file corpus.txt, which should contain every single utterance transcription that can occur in the ASR system.
In this example, it will be 100 lines from 100 audio files.
Pattern: <text_transcription>
one two five
six eight three
four four two
# and so on ...
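corpus.txt holds the same transcriptions as the 'text' files, only without the utteranceID column. A small sketch of that relationship, assuming the lines of the train and test 'text' files have already been read into one list; the function name corpus_lines is illustrative.

```python
def corpus_lines(text_lines):
    """Strip the leading utteranceID from each '<utteranceID> <words>' line."""
    return [line.split(maxsplit=1)[1] for line in text_lines]


text = [
    "july_1_2_5 one two five",
    "july_6_8_3 six eight three",
    "dad_4_4_2 four four two",
]
print(corpus_lines(text))
# ['one two five', 'six eight three', 'four four two']
```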
Language Data
This section covers the language modelling files, which are also mandatory.
Each file is described precisely in [1], where the syntax details can be studied.
More examples can be found in the other egs scripts.
Task
In the kaldi-trunk/egs/digits/data/local directory, create a folder 'dict'.
In kaldi-trunk/egs/digits/data/local/dict create the following files:
a) lexicon.txt
This file contains every word from your dictionary with its 'phone transcription' taken from /egs/voxforge/.
Pattern: <word> <phone 1> <phone 2> ...
!SIL sil
<UNK> spn
eight ey t
five f ay v
four f ao r
nine n ay n
one hh w ah n
one w ah n
seven s eh v ah n
six s ih k s
three th r iy
two t uw
zero z ih r ow
zero z iy r ow
b) nonsilence_phones.txt
This file lists the nonsilence phones that are present in the lexicon.
Pattern: <phone>
ah
ao
ay
eh
ey
f
hh
ih
iy
k
n
ow
r
s
t
th
uw
w
v
z
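The phone list above can be cross-checked against lexicon.txt, since every nonsilence phone should appear in at least one pronunciation. The sketch below assumes the lexicon lines have been read into a list and that 'sil' and 'spn' are the only silence phones, as in this example; the function name nonsilence_phones is hypothetical.

```python
def nonsilence_phones(lexicon_lines, silence=("sil", "spn")):
    """Collect every phone used in the lexicon except the silence phones."""
    phones = set()
    for line in lexicon_lines:
        _word, *pronunciation = line.split()
        phones.update(p for p in pronunciation if p not in silence)
    return sorted(phones)


lexicon = ["!SIL sil", "<UNK> spn", "eight ey t", "two t uw"]
print(nonsilence_phones(lexicon))  # ['ey', 't', 'uw']
```

Run on the full lexicon.txt shown earlier, this should yield the same 20 phones as the list above, in alphabetical order.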
c) silence_phones.txt
This file lists silence phones.
Pattern: <phone>
sil
spn
d) optional_silence.txt
This file lists optional silence phones.
Pattern: <phone>
sil
[0]
http://kaldi-asr.org/doc/kaldi_for_dummies.html
[1]