The Kaldi Speech Recognition Toolkit

According to legend, Kaldi was the Ethiopian goatherd who discovered the coffee plant.

The Kaldi speech recognition toolkit is a free, open-source toolkit for speech recognition research.

Kaldi provides a speech recognition system based on finite-state transducers (using the freely available OpenFst), together with detailed documentation and scripts for building complete recognition systems.

Kaldi is written in C++, and the core library supports modeling of arbitrary phonetic-context sizes, acoustic modeling with subspace Gaussian mixture models (SGMM) as well as standard Gaussian mixture models, together with all commonly used linear and affine transforms.

Kaldi is released under the Apache License v2.0, which is highly non-restrictive, making it suitable for a wide community of users.

Introduction

Kaldi is an open-source toolkit for speech recognition written in C++ and licensed under the Apache License v2.0.

The goal of Kaldi is to provide code that is easy to modify and extend.

Kaldi is available on SourceForge (http://kaldi.sf.net/).

The tools compile on commonly used Unix-like systems and Microsoft Windows.

Researchers working on automatic speech recognition (ASR) have several potential choices of open-source toolkits for building a recognition system. Notable among these are HTK (written in C) [1], Julius (written in C) [2], Sphinx-4 (written in Java) [3], and the RWTH ASR toolkit (written in C++) [4].

Kaldi was developed to provide a finite-state transducer (FST) based framework, extensive linear algebra support, and a non-restrictive license.

Kaldi's features include:

  1. Integration with finite-state transducers

The OpenFst toolkit [5] is used as a library.

  2. Extensive linear algebra support

A matrix library that wraps standard BLAS and LAPACK routines is included.

  3. Extensible design

The algorithms are designed to be as generic as possible.

For instance, the Kaldi decoders work with an interface that provides a score for a particular frame and FST input symbol.

Thus, a decoder can work from any suitable source of scores.

  4. Open license

The code is licensed under Apache v2.0, which is one of the least restrictive licenses available.

  5. Complete recipes

Complete recipes for building speech recognition systems are available.

These recipes work from widely available databases such as those provided by the Linguistic Data Consortium.

  6. Thorough testing

The goal is for all or nearly all the code to have corresponding test routines.
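
The extensible-design point in the list above, a decoder that depends only on an interface returning a score for a given frame and FST input symbol, can be illustrated with a minimal C++ sketch. The class and function names here (ScoreSource, TableScoreSource, BestSymbolPerFrame) are hypothetical and chosen for illustration; they are not Kaldi's actual API, and the "decoder" is a toy that picks the best symbol per frame rather than searching an FST.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical interface (illustrative names, not Kaldi's API):
// all the decoder needs is a score for a (frame, input symbol) pair.
class ScoreSource {
 public:
  virtual ~ScoreSource() = default;
  // Log-likelihood of input symbol `symbol` at frame `frame`.
  virtual float LogLikelihood(int frame, int symbol) const = 0;
  virtual int NumFrames() const = 0;
  virtual int NumSymbols() const = 0;
};

// One possible source of scores: a precomputed table indexed by
// [frame][symbol]. A GMM or any other model could implement the
// same interface without the decoder changing.
class TableScoreSource : public ScoreSource {
 public:
  explicit TableScoreSource(std::vector<std::vector<float>> scores)
      : scores_(std::move(scores)) {}

  float LogLikelihood(int frame, int symbol) const override {
    return scores_.at(frame).at(symbol);  // bounds-checked lookup
  }
  int NumFrames() const override { return static_cast<int>(scores_.size()); }
  int NumSymbols() const override {
    return scores_.empty() ? 0 : static_cast<int>(scores_[0].size());
  }

 private:
  std::vector<std::vector<float>> scores_;
};

// Toy stand-in for a decoder: returns the best-scoring symbol for each
// frame. It only touches the ScoreSource interface, which is the point
// of the sketch; a real decoder would search a decoding graph instead.
std::vector<int> BestSymbolPerFrame(const ScoreSource& source) {
  assert(source.NumSymbols() > 0);
  std::vector<int> best;
  for (int t = 0; t < source.NumFrames(); ++t) {
    int arg = 0;
    for (int s = 1; s < source.NumSymbols(); ++s) {
      if (source.LogLikelihood(t, s) > source.LogLikelihood(t, arg)) arg = s;
    }
    best.push_back(arg);
  }
  return best;
}
```

Because BestSymbolPerFrame takes a const ScoreSource&, swapping the table for a different model only requires a new subclass; the decoding routine itself stays unchanged.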

The main intended use for Kaldi is acoustic modeling research; thus, the closest competitors are HTK and the RWTH ASR toolkit (RASR).

Kaldi's chief advantages over HTK are modern, flexible, cleanly structured code and better WFST and math support.

The Kaldi license terms are also more open than those of either HTK or RASR.

The paper is organized as follows:

The structure of the code and design choices are described in section II.

The individual components of a speech recognition system that the Kaldi toolkit supports (feature extraction, acoustic modeling, phonetic decision trees, language modeling and decoders) are described in sections III, IV, V, VI and VIII, respectively.

Some benchmarking results are provided in section IX.

[0] Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukáš Burget, Ondřej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlíček, Yanmin Qian, Petr Schwarz, Jan Silovský, Georg Stemmer, and Karel Veselý, “The Kaldi Speech Recognition Toolkit.”

[1] S. Young, G. Evermann, M. Gales, T. Hain, D. Kershaw, X. Liu, G. Moore, J. Odell, D. Ollason, D. Povey, V. Valtchev, and P. Woodland, The HTK Book (for version 3.4). Cambridge University Engineering Department, 2009.

[2] A. Lee, T. Kawahara, and K. Shikano, “Julius – an open source real-time large vocabulary recognition engine,” in EUROSPEECH, 2001, pp. 1691–1694.

[3] W. Walker, P. Lamere, P. Kwok, B. Raj, R. Singh, E. Gouvea, P. Wolf, and J. Woelfel, “Sphinx-4: A flexible open source framework for speech recognition,” Sun Microsystems Inc., Technical Report SML1 TR2004-0811, 2004.

[4] D. Rybach, C. Gollan, G. Heigold, B. Hoffmeister, J. Lööf, R. Schlüter, and H. Ney, “The RWTH Aachen University Open Source Speech Recognition System,” in INTERSPEECH, 2009, pp. 2111–2114.

[5] C. Allauzen, M. Riley, J. Schalkwyk, W. Skut, and M. Mohri, “OpenFst: a general and efficient weighted finite-state transducer library,” in Proc. CIAA, 2007.
