Creating Decoding Graphs

Kaldi uses Weighted Finite State Transducers (WFSTs) for its training and decoding algorithms.

In the conventional recipe [1], the input symbols on the decoding graph correspond to context-dependent HMM states (in the Kaldi toolkit, these are numeric identifiers called "pdf-ids").

Because different phones are allowed to share the same pdf-ids, several problems accompany this approach: the FSTs cannot always be determinized, and the Viterbi path through an FST does not carry enough information to work out the phone sequence or to train the transition probabilities.

In order to fix these problems, Kaldi puts a slightly more fine-grained integer identifier, called a transition-id, on the input side of the FSTs; it encodes the pdf-id, the phone it is a member of, and the arc (transition) within the topology specification for that phone.

There is a one-to-one mapping between the transition-ids and the transition-probability parameters in the model.
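As a rough sketch of the idea (hypothetical code, not Kaldi's actual TransitionModel API), a transition-id can be viewed as an index into a table of (phone, arc, pdf-id) tuples, so that all three pieces of information can be recovered from it:

#include <cstdio>
#include <vector>

// Hypothetical sketch: a transition-id indexes a table of tuples, so the
// pdf-id, the phone, and the arc within the phone's topology can all be
// recovered from it. Kaldi's real TransitionModel is more involved.
struct TransitionInfo {
  int phone;    // the phone this transition belongs to
  int arc;      // arc (transition) within the phone's topology entry
  int pdf_id;   // index of the context-dependent state (pdf)
};

int main() {
  // Toy table: two phones sharing pdf-id 7 on one of their arcs.
  std::vector<TransitionInfo> table = {
      {/*phone=*/1, /*arc=*/0, /*pdf_id=*/7},
      {/*phone=*/1, /*arc=*/1, /*pdf_id=*/8},
      {/*phone=*/2, /*arc=*/0, /*pdf_id=*/7},  // same pdf-id, different phone
  };
  int transition_id = 2;  // a transition-id is just an index into the table
  const TransitionInfo &t = table[transition_id];
  std::printf("transition-id %d -> phone %d, arc %d, pdf-id %d\n",
              transition_id, t.phone, t.arc, t.pdf_id);
  return 0;
}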

Kaldi makes these transition identifiers as fine-grained as possible without increasing the size of the decoding graph.

Kaldi's decoding-graph construction process is based on the recipe described in [1].

However, there are a number of differences.

One difference relates to the way Kaldi handles "weight-pushing", the operation that is supposed to ensure that the FST is stochastic.

"Stochastic" means that the weights in the FST sum to one in the appropriate sense, for each state (like a properly normalized HMM).

Weight pushing may fail or may lead to bad pruning behavior if the FST representing the grammar or language model (G) is not stochastic, e.g. for backoff language models.

Kaldi's approach is to avoid weight-pushing altogether, and instead to ensure that each stage of graph creation "preserves stochasticity" in an appropriate sense.

Informally, what this means is that the "non-sum-to-one-ness" (the failure to sum to one) will never get worse than what was originally present in G.
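As a concrete illustration of what stochasticity means here, the sketch below (assuming OpenFst is available; this is not Kaldi's actual code) interprets tropical arc weights as negative log-probabilities and measures, per state, how far the outgoing probability mass is from summing to one:

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <fst/fstlib.h>

// Sketch: checks how far each state of an FST is from being stochastic,
// interpreting tropical weights as negative log-probabilities. For each
// state, sums exp(-weight) over outgoing arcs (plus the final weight) and
// returns the worst deviation from 1.0.
double MaxStochasticityDeviation(const fst::StdVectorFst &f) {
  double worst = 0.0;
  for (fst::StateIterator<fst::StdVectorFst> siter(f); !siter.Done();
       siter.Next()) {
    int s = siter.Value();
    double sum = 0.0;
    for (fst::ArcIterator<fst::StdVectorFst> aiter(f, s); !aiter.Done();
         aiter.Next())
      sum += std::exp(-aiter.Value().weight.Value());
    if (f.Final(s) != fst::TropicalWeight::Zero())  // state is final
      sum += std::exp(-f.Final(s).Value());
    worst = std::max(worst, std::fabs(sum - 1.0));
  }
  return worst;
}

int main() {
  fst::StdVectorFst f;
  f.AddState();  // state 0
  f.AddState();  // state 1
  f.SetStart(0);
  // Two arcs out of state 0, each with probability 0.5 (weight -log(0.5)).
  f.AddArc(0, fst::StdArc(1, 1, -std::log(0.5), 1));
  f.AddArc(0, fst::StdArc(2, 2, -std::log(0.5), 1));
  f.SetFinal(1, fst::TropicalWeight::One());  // stop with probability 1
  std::printf("max deviation from stochasticity: %g\n",
              MaxStochasticityDeviation(f));
  return 0;
}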

Decoders

Kaldi provides several decoders, ranging from simple to highly optimized.

On-the-fly language model rescoring and lattice generation will be added.

"Decoders" means a C++ class that implements the core decoding algorithm.

The decoders do not require a particular type of acoustic model.

The decoders need an object satisfying a simple interface with a function that provides some kind of acoustic model score for a particular (input-symbol, frame) pair.

class DecodableInterface {
 public:
  // Returns the acoustic score for the given frame and input-symbol index.
  virtual float LogLikelihood(int frame, int index) = 0;
  virtual bool IsLastFrame(int frame) = 0;
  virtual int NumIndices() = 0;
  virtual ~DecodableInterface() {}
};
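For example, a minimal decodable object might serve precomputed log-likelihoods from a matrix. This is an illustrative sketch, not one of Kaldi's actual Decodable implementations, and it assumes zero-based indices for simplicity:

#include <utility>
#include <vector>

// Illustrative sketch: a decodable object backed by a precomputed matrix
// of per-frame log-likelihoods (rows = frames, columns = indices).
class MatrixDecodable : public DecodableInterface {
 public:
  explicit MatrixDecodable(std::vector<std::vector<float>> loglikes)
      : loglikes_(std::move(loglikes)) {}

  float LogLikelihood(int frame, int index) override {
    return loglikes_[frame][index];
  }
  bool IsLastFrame(int frame) override {
    return frame == static_cast<int>(loglikes_.size()) - 1;
  }
  int NumIndices() override {
    return loglikes_.empty() ? 0 : static_cast<int>(loglikes_[0].size());
  }

 private:
  std::vector<std::vector<float>> loglikes_;
};

A decoder can then query LogLikelihood(frame, index) for each arc it expands without knowing how the scores were produced, which is what makes the decoders independent of the acoustic-model type.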

Command-line decoding programs are simple, do just one pass of decoding, and are all specialized for one decoder and one acoustic-model type.

Multi-pass decoding is implemented at the script level.

[0]

D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlíček, Y. Qian, P. Schwarz, J. Silovský, G. Stemmer, and K. Veselý, "The Kaldi Speech Recognition Toolkit," in Proc. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2011.

[1]

M. Mohri, F. Pereira, and M. Riley, "Weighted finite-state transducers in speech recognition," Computer Speech and Language, vol. 20, no. 1, pp. 69-88, 2002.
