Kaldi Experiments

Experimental results on the Resource Management (RM) corpus and on the Wall Street Journal (WSJ) corpus are reported.

The reported results correspond to version 1.0 of Kaldi; the scripts for these experiments may be found in egs/rm/s1 and egs/wsj/s1.

Comparison with previously published results

Table I: Basic triphone system on Resource Management: %WERs

Table I shows the results of a context-dependent triphone system with mixture-of-Gaussian densities; the HTK baseline numbers are taken from [1], and the systems use essentially the same algorithms.

The features are MFCCs with per-speaker cepstral mean subtraction.
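Per-speaker cepstral mean subtraction can be sketched as follows. This is an illustrative NumPy implementation, not Kaldi's actual code; the function name and the data layout (one MFCC matrix per utterance, grouped by speaker) are assumptions made for the example.

```python
import numpy as np

def cepstral_mean_subtraction(feats_by_speaker):
    """Per-speaker CMS sketch (illustrative, not Kaldi's API).

    `feats_by_speaker` maps a speaker ID to a list of
    (num_frames, num_ceps) MFCC matrices, one per utterance.
    """
    normalized = {}
    for spk, utts in feats_by_speaker.items():
        # Pool all frames from this speaker to estimate one mean vector,
        # then subtract it from every utterance of that speaker.
        mean = np.concatenate(utts, axis=0).mean(axis=0)
        normalized[spk] = [u - mean for u in utts]
    return normalized
```

After normalization, the frames of each speaker have zero mean in every cepstral dimension, which removes a per-speaker (and per-channel) bias from the features.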

The language model is the word-pair bigram language model supplied with the RM corpus.

The WERs are essentially the same.

Decoding time was about 0.13 x RT, measured on an Intel Xeon CPU at 2.27 GHz.
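A real-time factor below 1 means decoding is faster than the audio's duration; the arithmetic is simply:

```python
def decode_seconds(audio_seconds, rt_factor=0.13):
    """Wall-clock decoding time implied by a real-time (RT) factor."""
    return audio_seconds * rt_factor

# At 0.13 x RT, a 10-minute (600 s) recording decodes in about 78 s.
```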

The system identifier for the Kaldi results is tri3c.

Table II: Basic triphone system, WSJ, 20k open vocabulary, bigram LM, SI-284 training: %WERs

Table II shows similar results for the Wall Street Journal system, this time without cepstral mean subtraction.

The WSJ corpus comes with bigram and trigram language models.

The baseline results are reported in [2] (referred to as Bell Labs).

The HTK system was gender-dependent (a gender-independent baseline was not reported), so the HTK results are slightly better. The Kaldi decoding time was about 0.5 x RT.

Other experiments

Some results on both WSJ test sets (Nov'92 and Nov'93) are reported using systems trained on the SI-84 part of the training data, demonstrating different features supported by Kaldi.

Results on the RM task are averaged over six test sets: the four mentioned in Table I together with Mar'87 and Oct'87.

The best result for a conventional GMM system is achieved by a speaker adaptive training (SAT) system that splices 9 frames (4 on each side of the current frame) and uses LDA to project down to 40 dimensions, together with MLLT.
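The splice-then-project step described above can be sketched as follows. The projection matrix here is random, purely for illustration; in the actual system it would be estimated with LDA and refined with MLLT, and the function name is an assumption for this example.

```python
import numpy as np

def splice_frames(feats, left=4, right=4):
    """Stack each frame with its 4 left and 4 right neighbours.

    Edge frames are handled by repeating the first/last frame.
    Input: (num_frames, dim); output: (num_frames, dim * (left + right + 1)).
    """
    n, _ = feats.shape
    padded = np.pad(feats, ((left, right), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + n] for i in range(left + right + 1)])

rng = np.random.default_rng(0)
mfcc = rng.standard_normal((100, 13))       # 100 frames of 13-dim MFCCs
spliced = splice_frames(mfcc)               # -> (100, 117): 9 frames x 13 dims
projection = rng.standard_normal((117, 40)) # stand-in for the LDA+MLLT matrix
projected = spliced @ projection            # -> (100, 40)
```

The spliced vector captures local temporal context, and the learned 117-to-40 projection keeps the directions most useful for discriminating between states.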

Better performance is achieved, on average, with an SGMM system trained on the same features, using speaker vectors and fMLLR adaptation.

The last line, with the best results, includes the exponential transform in the features.

[0]

The Kaldi Speech Recognition Toolkit

Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukáš Burget, Ondřej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlíček, Yanmin Qian, Petr Schwarz, Jan Silovský, Georg Stemmer, Karel Veselý

[1]

D. Povey and P. C. Woodland, "Frame discrimination training for HMMs for large vocabulary speech recognition," in Proc. IEEE ICASSP, vol. 1, 1999, pp. 333–336.

[2]

W. Reichl and W. Chou, "Robust decision tree state tying for continuous speech recognition," IEEE Transactions on Speech and Audio Processing, vol. 8, no. 5, pp. 555–566, September 2000.
