Feature Extraction in the Kaldi Toolkit
The feature extraction and waveform-reading code aims to create standard MFCC and PLP features, setting reasonable defaults but leaving available the options that people are most likely to want to tweak (for example, the number of mel bins, the minimum and maximum frequency cutoffs, and so on).
Commonly used feature extraction approaches are supported, including VTLN, cepstral mean and variance normalization, LDA, STC/MLLT, HLDA, and so on.
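As a non-authoritative sketch of how these options are exposed in code (class and option names follow feat/feature-mfcc.h in recent Kaldi releases; exact signatures may vary between versions), MFCC extraction can be configured roughly as follows:

    // Sketch: configuring and running MFCC extraction in C++.
    // Names follow feat/feature-mfcc.h; treat exact signatures as approximate.
    #include "feat/feature-mfcc.h"
    #include "matrix/kaldi-matrix.h"

    using namespace kaldi;

    Matrix<BaseFloat> ExtractMfcc(const VectorBase<BaseFloat> &waveform) {
      MfccOptions opts;                  // reasonable defaults are set here
      opts.mel_opts.num_bins = 23;       // number of mel bins
      opts.mel_opts.low_freq = 20.0;     // minimum frequency cutoff (Hz)
      opts.mel_opts.high_freq = 7800.0;  // maximum frequency cutoff (Hz)
      opts.num_ceps = 13;                // number of cepstral coefficients

      Mfcc mfcc(opts);
      Matrix<BaseFloat> features;
      mfcc.Compute(waveform, 1.0 /* vtln_warp */, &features);
      return features;
    }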
Acoustic Modeling
Diagonal GMMs and subspace Gaussian mixture models (SGMMs) are the two conventional acoustic models supported by the Kaldi toolkit.
Kaldi is also extensible to new kinds of acoustic model.
Gaussian Mixture Models
GMMs with diagonal and full covariance structures are supported.
Rather than representing individual Gaussian densities separately, a GMM class is parameterized by the natural parameters, i.e., the means times the inverse covariances and the inverse covariances themselves.
The GMM classes also store the constant term of the likelihood computation, which collects all the terms that do not depend on the data vector.
Such an implementation allows log-likelihoods to be computed efficiently with a simple dot product.
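To make the dot-product formulation concrete, the per-component log-likelihood of a diagonal-covariance Gaussian is the stored constant plus a dot product of the natural parameters with the data vector and its element-wise square. The following simplified sketch illustrates the computation; it is an illustration of the idea, not the actual Kaldi DiagGmm code:

    // Sketch: per-component log-likelihoods for a diagonal GMM stored in
    // natural parameters. 'means_invvars' holds mu/sigma^2, 'inv_vars'
    // holds 1/sigma^2, and 'gconsts' absorbs the mixture weight and all
    // other data-independent terms.
    #include <vector>

    std::vector<double> ComponentLogLikes(
        const std::vector<double> &gconsts,                     // [M]
        const std::vector<std::vector<double>> &means_invvars,  // [M][D]
        const std::vector<std::vector<double>> &inv_vars,       // [M][D]
        const std::vector<double> &x) {                         // [D]
      const size_t M = gconsts.size(), D = x.size();
      std::vector<double> loglikes(M);
      for (size_t m = 0; m < M; ++m) {
        double ll = gconsts[m];
        for (size_t d = 0; d < D; ++d)
          ll += means_invvars[m][d] * x[d] - 0.5 * inv_vars[m][d] * x[d] * x[d];
        loglikes[m] = ll;  // dot product of natural parameters with (x, x^2)
      }
      return loglikes;
    }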
GMM-based acoustic model
The acoustic model class AmDiagGmm represents a collection of DiagGmm objects, indexed by 'pdf-ids' that correspond to context-dependent HMM states. This class does not represent any HMM structure, but just a collection of densities (i.e., GMMs). Separate classes represent the HMM structure, principally the topology and transition-modeling code and the code responsible for compiling decoding graphs, which provide the mapping between HMM states and the pdf indices of the acoustic model class.
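As an illustrative sketch of how these pieces fit together (method names follow hmm/transition-model.h and gmm/am-diag-gmm.h, but exact signatures should be treated as approximate), a likelihood lookup might map a transition-id to a pdf-id and then evaluate the corresponding GMM:

    // Sketch: from a transition-id to a pdf-id to a GMM log-likelihood.
    #include "gmm/am-diag-gmm.h"
    #include "hmm/transition-model.h"
    #include "matrix/kaldi-vector.h"

    using namespace kaldi;

    BaseFloat LogLikeForTransition(const TransitionModel &trans_model,
                                   const AmDiagGmm &am_gmm,
                                   int32 trans_id,
                                   const VectorBase<BaseFloat> &frame) {
      int32 pdf_id = trans_model.TransitionIdToPdf(trans_id);  // HMM state -> density index
      return am_gmm.LogLikelihood(pdf_id, frame);              // evaluate the GMM for that pdf-id
    }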
Speaker adaptation and other linear transforms like maximum likelihood linear transform (MLLT) or semi-tied covariance (STC) are implemented by separate classes.
HMM topology
In Kaldi, it is possible to separately specify the HMM topology for each context-independent phone.
The topology format allows non-emitting states, and allows the user to pre-specify the tying of the p.d.f.'s in different HMM states.
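For illustration, a typical entry in a Kaldi 'topo' file specifying a three-state left-to-right topology might look as follows; the last state is non-emitting, and the <PdfClass> tag controls how p.d.f.'s may be tied across states (the phone list and transition probabilities here are placeholders):

    <Topology>
    <TopologyEntry>
    <ForPhones>
    1 2 3 4 5 6 7 8
    </ForPhones>
    <State> 0 <PdfClass> 0 <Transition> 0 0.75 <Transition> 1 0.25 </State>
    <State> 1 <PdfClass> 1 <Transition> 1 0.75 <Transition> 2 0.25 </State>
    <State> 2 <PdfClass> 2 <Transition> 2 0.75 <Transition> 3 0.25 </State>
    <State> 3 </State>
    </TopologyEntry>
    </Topology>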
Speaker Adaptation
Both model-space adaptation using maximum likelihood linear regression (MLLR) and feature-space adaptation using feature-space MLLR (fMLLR), also known as constrained MLLR, are supported.
For both MLLR and fMLLR, multiple transforms can be estimated using a regression tree.
When a single fMLLR transform is needed, it can be used as an additional processing step in the feature pipeline.
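Conceptually, the estimated fMLLR transform W = [A b] is applied to every frame as y = Ax + b. The following simplified sketch shows this feature-pipeline step; it illustrates the idea rather than Kaldi's own transform-application code:

    // Sketch: applying a single fMLLR transform W = [A | b] to one frame,
    // i.e. y = A x + b, as an extra step in the feature pipeline.
    #include <vector>

    std::vector<double> ApplyFmllr(
        const std::vector<std::vector<double>> &W,  // D x (D+1): [A | b]
        const std::vector<double> &x) {             // D-dimensional frame
      const size_t D = x.size();
      std::vector<double> y(D, 0.0);
      for (size_t i = 0; i < D; ++i) {
        for (size_t j = 0; j < D; ++j)
          y[i] += W[i][j] * x[j];                   // A x
        y[i] += W[i][D];                            // + b
      }
      return y;
    }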
The toolkit also supports speaker normalization using a linear approximation to VTLN, conventional feature-level VTLN, or a more generic approach to gender normalization.
Both fMLLR and VTLN can be used for speaker adaptive training (SAT) of the acoustic models.
Subspace Gaussian Mixture Models
For subspace Gaussian mixture models (SGMMs), the toolkit provides an implementation of the approach described in [1].
There is a single class AmSgmm that represents a whole collection of pdf's; unlike the GMM case, there is no class that represents a single pdf of the SGMM.
Separate classes handle model estimation and speaker adaptation using fMLLR.
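For context, the basic SGMM computation described in [1] has the form below; because the mean projections M_i, weight projections w_i, and covariances \Sigma_i are shared globally across all states, it is natural for a single AmSgmm object to hold the whole collection of p.d.f.'s:

    p(x \mid j) = \sum_{m=1}^{M_j} c_{jm} \sum_{i=1}^{I} w_{jmi} \, \mathcal{N}(x;\, \mu_{jmi}, \Sigma_i),
    \qquad \mu_{jmi} = M_i v_{jm},
    \qquad w_{jmi} = \frac{\exp(w_i^T v_{jm})}{\sum_{i'=1}^{I} \exp(w_{i'}^T v_{jm})},

where v_{jm} is the low-dimensional state (sub-state) vector.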
[0] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlíček, Y. Qian, P. Schwarz, J. Silovský, G. Stemmer, and K. Veselý, "The Kaldi Speech Recognition Toolkit," in Proc. IEEE ASRU, 2011.
[1] D. Povey, L. Burget et al., "The subspace Gaussian mixture model - A structured model for speech recognition," Computer Speech & Language, vol. 25, no. 2, pp. 404-439, 2011.