'KALDI'

Decoders used in the Kaldi toolkit

In the Kaldi toolkit there is no single "canonical" decoder, or a fixed interface that decoders must satisfy.

There are currently three decoders available: SimpleDecoder, FasterDecoder, and KaldiDecoder. By "decoder" we mean the internal code of the decoder; there are command-line programs that wrap these decoders so that they can decode particular types of model (e.g. GMMs), or with particular special conditions (e.g. multi-class fMLLR). Examples of command-line programs that decode are gmm-decode-simple, gmm-decode-faster, gmm-decode-kaldi, and gmm-decode-faster-fmllr. We have avoided creating a single command-line program that can do every possible kind of decoding, as this could quickly become hard to modify and debug.

The Decodable interface

In order to minimize the interaction between the decoder and the acoustic modeling code, we have created a base class (DecodableInterface) which mediates between the decoder and the acoustic modeling code. The DecodableInterface object can be viewed as a wrapper for the pair (acoustic model, feature file). This might seem a slightly unnatural object. However, there is a good reason for its existence. The interaction between the acoustic model and the features can be quite complex (think about adaptation with multiple transforms), and by taking this out of the decoder we substantially simplify what the decoder has to know. The DecodableInterface object can be thought of as a matrix of size (number of frames) by (number of nonzero input labels on the graph).

The basic operation of a decoder is to "decode this object of type DecodableInterface".

The DecodableInterface object has only three functions:

  virtual BaseFloat LogLikelihood(int32 frame, int32 index);
  virtual bool IsLastFrame(int32 frame);
  virtual int32 NumIndices();

The function LogLikelihood() returns the log-likelihood for this frame and index; the index would normally be the (one-based) transition-id, see Integer identifiers used by TransitionModel. The frame is a zero-based quantity. The most normal DecodableInterface object will just look up the appropriate feature vector (using the index "frame"), work out the pdf-id corresponding to that transition-id, and return the corresponding acoustic log-likelihood. Acoustic probability scales are also applied by the DecodableInterface object, but they are not part of its generic interface because the interface represents the minimum that the decoder "needs to know", and it does not need to know about the probability scales.

In order to make all the core decoding code applicable to data that is obtained in real time, we don't put a NumFames() function in the interface, but just an IsLastFrame() function that roughly means "are we done yet?".

SimpleDecoder: the simplest possible decoder

As an illustration of a "prototypical" decoder, consider the class SimpleDecoder. This very simple decoder has been included mostly for reference and for debugging more highly optimized decoders.

Interface of SimpleDecoder

The constructor of SimpleDecoder takes the FST to decode with, and a decoding beam:

  SimpleDecoder(const fst::Fst<fst::StdArc> &fst, BaseFloat beam);

Decoding an utterance is accomplished by the following function:

  void Decode(DecodableInterface &decodable);

Here is an example code fragment where we construct a Decodable object and decode it:

  DecodableAmDiagGmmScaled gmm_decodable(am_gmm, trans_model, features,
                                         acoustic_scale);
  decoder.Decode(gmm_decodable);

The type DecodableAmDiagGmmScaled is a very simple object that, given a transition-id, works out from trans_model (type: TransitionModel) the appropriate pdf-id, gets the corresponding row from the features (type: Matrix<BaseFloat>), works out the likelihood from am_gmm (type: AmDiagGmm), and scales it by acoustic_scale (type: float).

After calling this, we can get the traceback with the following call:

  bool GetOutput(bool is_final, fst::MutableFst<fst::StdArc> *fst_out);

It returns a boolean success value. If we call this function with is_final==true, it only considers tokens that are at a final state; the normal procedure is to first call it with is_final==true, and if this returns false (no tokens reached a final state), call it with is_final==false as a backup strategy. On success, this function puts its output into the FST object "fst_out". It creates a linear acceptor whose input and output labels are whatever labels were on the FST (typically transition-ids and words, respectively), and whose weights contain the acoustic, language model and transition weights.

How SimpleDecoder works

This decoder stores tracebacks at the token level that are garbage collected. The token is of type SimpleDecoder::Token, which has the following member variables:

  class Token {
   public:
    Arc arc_;
    Token *prev_;
    int32 ref_count_;
    Weight weight_;
    ...

The member of type Arc (this is a typedef to fst::StdArc) is a copy of the arc in the original FST, except it has the acoustic likelihood contribution added in. It contains the input and output labels, the weight and the next state (in the FST). The "prev_" member is the traceback; the "ref_count_" is used in the garbage collection algorithm; the "Weight" is a typedef to fst::StdArc::Weight but essentially it just stores a floating-point value which represents the accumulated cost up to this point.

Class SimpleDecoder contains just four data members, declared as follows:

  unordered_map<StateId, Token*> cur_toks_;
  unordered_map<StateId, Token*> prev_toks_;
  const fst::Fst<fst::StdArc> &fst_;
  BaseFloat beam_;

The last two of these (the FST and the beam) are constant during decoding. The members "cur_toks_" and "prev_toks_" store the currently active tokens for the current and previous frame respectively. The central loop of the Decode() function is as follows:

for(int32 frame = 0; !decodable.IsLastFrame(frame-1); frame++) {
  ClearToks(prev_toks_);
  std::swap(cur_toks_, prev_toks_);
  ProcessEmitting(decodable, frame);
  ProcessNonemitting();
  PruneToks(cur_toks_, beam_);
}

These statements are all self-explanatory except for ProcessEmitting() and ProcessNonemitting(). The ProcessEmitting() function propagates tokens from prev_toks_ (i.e. the previous frame) to cur_toks_ (i.e. the current frame). It only considers emitting arcs (i.e. arcs with nonzero input label). For each token (say "tok") in prev_toks_, it looks at the state associated with the token (in tok->arc_.nextstate), and for each arc out of that state that is emitting, it creates a new token with a traceback to "tok" and with an "arc_" field coped from that arc, except with the associated weight updated to include the acoustic contribution. The "weight_" field, representing the accumulated cost up to this point, will be the sum (the product, in the semiring interpretation) of tok->weight_ and the weight of the recently added arc. Each time we attempt to add a new token to "cur_toks_", we have to make sure there is no existing token associated with the same FST state. If there is, we keep only the best.

The function ProcessNonemitting() deals only with cur_toks_ and not with prev_toks_; it propagates nonemitting arcs, i.e. arcs with zero/<eps> as the input label/symbol. The newly created tokens will point back to other tokens in cur_toks_. The weights on the arcs will just be the weights from the FST. ProcessNonemitting() may have to process chains of epsilons. It uses a queue to store states that need to be processed.

After decoding, the function GetOutput(), discussed above, will trace back from the most likely token at the final state (taking into account its final probability, if is_final==true), and produce a linear FST with one arc for each arc in the traceback sequence. There may be more of these than the number of frames, since there are separate tokens created for nonemitting arcs.

FasterDecoder: a more optimized decoder

The decoder FasterDecoder has almost exactly the same interface as SimpleDecoder. The only important new configuration value is "max-active", which controls the maximum number of states that can be active at one time. Apart from enforcing the max-active states, the only major difference is a data-structure related one. We replace the type std::unordered_map<StateId, Token*> with a new type HashList<StateId, Token*>, where HashList is our own templated type created for this purpose. HashList stores a singly-linked-list structure whose elements are also accessible via a hash table, and it offers the capability to free up the hash table for a new list structure while giving sequential access to the old list structure. This is so that we can use the hash table to access what in SimpleDecoder was cur_toks_, while still having access to what in SimpleDecoder was prev_toks_, in the form of a list.

The main pruning step FasterDecoder takes place in ProcessEmitting. Conceptually what is happening is that we take the tokens in what in SimpleDecoder was prev_toks_, and just before ProcessEmitting we prune using the beam and specified maximum number of active states (whichever is tighter). The way this is actually implemented is that we call a function GetCutoff(), which returns a weight cutoff value "weight_cutoff" that corresponds to the tighter of these two criteria; this cutoff value applies to the tokens in prev_toks_. Then when we go through prev_toks_ (this variable does not exist in FasterDecoder, but conceptually), we only process those tokens better than the cutoff.

The code in FasterDecoder as it relates to cutoffs is a little more complicated than just having the one pruning step. The basic observation is this: it's pointless to create a very large number of tokens if you are only going to ignore most of them later. So the situation in ProcessEmitting is: we have "weight_cutoff" but wouldn't it be nice if we knew what the value of "weight_cutoff" on the next frame was going to be? Call this "next_weight_cutoff". Then, whenever we process arcs that have the current frame's acoustic likelihoods, we could just avoid creating the token if the likelihood is worse than "next_weight_cutoff". In order to know the next weight cutoff we have to know two things. We have to know the best token's weight on the next frame, and we have to know the effective beam width on the next frame. The effective beam width may differ from "beam" if the "max_active" constraint is limiting, and we use the heuristic that the effective beam width does not change very much from frame to frame. We attempt to estimate the best token's weight on the next frame by propagating the currently best token (later on, if we find even better tokens on the next frame we will update this estimate). We get a rough upper bound on the effective beam width on the next frame by using the variable "adaptive_beam". This is always set to the smaller of "beam" (the specified maximum beam width), or the effective beam width as determined by max_active, plus beam_delta (default value: 0.5). When we say it is a "rough upper bound" we mean that it will usually be greater than or equal to the effective beam width on the next frame. The pruning value we use when creating new tokens equals our current estimate of the next frame's best token, plus "adaptive_beam". With finite "beam_delta", it is possible for the pruning to be stricter than dictated by the "beam" and "max_active" parameters alone, although at the value 0.5 we do not believe this happens very often.

KaldiDecoder: an even more optimized decoder

The class KaldiDecoder should provide a fast and general purpose decoder. It is templated on the type of the DecodableInterface to work with any acoustic model and is templated on the type of FST, to work with any type of semi-ring. The design of the KaldiDecoder was partly inspired by a very fast, FST-based decoder written by Ondrej Glembek, but was rewritten to be flexible and extendible for other types of recognition networks, acoustic models and semi-rings. In spite of the generalizations, the goal is to not sacrify much of the original speed.

Interface of KaldiDecoder

The constructor of KaldiDecoder takes just KaldiDecoderOptions as argument:

 template<class Decodable, class Fst>
 KaldiDecoder<Decodable, Fst>::KaldiDecoder(KaldiDecoderOptions opts);

KaldiDecoderOptions is a structure that registers decoder options like pruning beamwidth to the command-line parser and passes their values to the decoder. The type of FST (and arc-type) that is used as recognition network and the type of DecodableInterface are passed as template options.

Decoding an utterance is accomplished by the following function:

 fst::VectorFst<fst::StdArc>* Decode(const Fst &fst, Decodable &decodable);

We do not only pass the features of the utterance (in Decodable), but also the recognition network as the parameter "fst". Each utterance can be decoded by a different network. KaldiDecoder requires non-reordered graphs - i.e. it is expected, that each arc leaving a state has the same transition-id (pdf-id) assigned to it. Since in this case all leaving arcs evaluate a common acoustic pdf, the evaluation of the acoustic model can be done state based and all at once, which in the case of the previous decoder lead to a significant speed-up.

The Decode function returns the traceback as an FST object. To save memory during decoding, not the whole traceback is saved - new arcs are only added for each encountered word. If successfull (otherwise NULL), the FST is a linear acceptor whose output labels are the (word) labels that were on the FST and whose weights contain the acoustic, language model and transition weights - accumulated since the last word label. Normally only tokens reaching a final state are considered. However, if no token reached a final state, a warning is uttered and the path of the best active token in the last frame is returned instead.

How KaldiDecoder works

The most important data members of class KaldiDecoder are:

 StateOrder<StateId> *state_order_;
 WordLinkStore wl_store_;
 TokenStore tokens_, tokens_next_;
 TokenSet<StateId> *active_tokens_;

The function of "active_tokens_" very much corresponds to "prev_toks_" and "cur_toks_" in SimpleDecoder - it holds the active tokens during decoding. Here, TokenSet contains two unordered set of Tokens

  • one for consecutive access for the current frame
  • and another allowing random access (by node number) for the next frame

Token is almost the same data structure as WordLink, which is used to store the traceback. The WordLinks are handled by WordLinkStore, which implements an own allocator and garbage collection. In the same way, tokens are handled by TokenStore. We use three different Stores, since it might speed up due to caching.

StateOrder is a multi-functional data structure: It is used to dynamically determine the topological order of network states, which might not be known in advance, if we build the recognition network dynamically, and which is necessary when evaluating chains of non-emitting arcs. After determining the topological state order, StateOrder is used as a topologically sorted priority queue of states.

The central part of the Decode() function is as follows:

 InitDecoding(fst, decodable);
 frame_index_ = -1;
 do {
   frame_index_++;
   ProcessNonEmitting();
   EvaluateAndPrune();
   ProcessEmitting();
 } while (!(p_decodable_->IsLastFrame(frame_index_)));
 FinalizeDecoding();

InitDecoding() initializes data structures used for decoding and inserts a first token (in the start state). It contains all tokens in non-emitting nodes to be forwarded in the next frame, that resulted from forwarding the active tokens from the last frame. From the start state, we explore the network to find all non-emitting nodes and store them on them queue (see below VisitNode).

ProcessNonEmitting() works analog to SimpleDecoder - it forwards tokens through non-emitting arcs, but it considers the order given by the priority queue in "state_order_", so that no arc will be processed twice.

EvaluateAndPrune() evaluates the scores of the acoustic model for all pdf-ids in the active tokens for this frame. As already mentioned, since all outgoing arcs share the same pdf-id, it can be evaluated once per state and the score is added to all tokens, before forwarding them in the next step. "EvaluateAndPrune()" takes also care of pruning: the best active token before and after the acoustic evaluation is found, and the new "beam_threshold_" (best token score plus beamwidth) is computed, on whose basis the following functions ("ProcessEmitting" and "ProcessNonEmitting") prune new tokens. The number of active tokens is also pruned by "max_active_tokens", which may result in a tighter "beam_threshold_" (adaptive beam).

ProcessEmitting() works analog to SimpleDecoder - it forwards tokens through emitting arcs, using the already computed acoustic scores. On the fly each newly discovered node is explored using the function VisitNode() by recursively following the arcs (depth first). "VisitNode()" finds the topological state order for evaluating the non-emitting arcs and at the same time builds the priority queue. A second version of this function, VerifyNode(), has the same functionality, but at the same checks that the recognition network has the properties required by the decoder (e.g. no epsilon loops). After exploration, new tokens are dispatched - Tokens to be forwarded on non-emitting arcs are put to the priority queue, all other are put on member list of "active_tokens_".

If the last frame has been processed, FinalizeDecoding() forwards the tokens through non-emitting arcs, using slightly modified versions of "ProcessNonEmitting()" and "ProcessEmitting()" and at the same time the final state weights are evaluated. The path of the best successfull token is back-tracked and returned as output FST.

BiglmDecoder: decoding with large language models.

There are two basic ways in Kaldi to use large language models (i.e. language models larger than a few million arcs, for which it would be difficult to successfully build the decoding graph). One way is to generate a lattice using a small LM, and to rescore this lattice with a large LM (see Lattice generating decoders below, and ref lattices). The other way is to use a "biglm" decoder, e.g. BiglmFasterDecoder. The basic idea is to create the decoding graph HCLG with a small grammar, and compose dynamically with the difference between a large grammar and the small grammar. Note that while we use the word "grammar" for compatibility with the standard notation, we have in mind a statistical language model. Imagine that the small grammar is G (an FST), and the large one is G'. The basic idea is to search, in decoding time, the graph formed by the triple composition $ HCLG \circ G^- \circ G' $, where $G^T-$ is like $G$ but with its scores reversed. We'll give the high-level idea of how we do this first. We construct an on-demand composed FST, call it F, which is $F = G^- \circ G' $. Then while decoding, we construct on-demand the FST $HCLG \circ F$. The problem with this is that we would always take the worst-scoring path through G, e.g. improperly take the backoff arc, which would make the subtraction of the original FST scores incorrect.

The way we solve the problem above is to use some knowledge about the structure of $G$ and $G'$ (we assume they are are ARPA-style language models), and treat them as epsilon-free, deterministic FSTs. That is: when searching for an arc from a particular state with a particular input label, if we find that input label we take it (and return just that arc), otherwise we follow the epsilon transition, and recursively look for an arc with that label. In terms of the external interface of the FST, it looks like there was an arc from the original state (i.e. it looks like a language model FST that has been subjected to epsilon removal). We created a special interface for this type of FST, which we call fst::DeterministicOnDemandFst; it has a new function GetArc(), which finds the arc with a particular input label, if it exists (by assumption, there cannot be more than one). Both G and G' are of type fst::DeterministicOnDemandFst, and so is their composition. This means that the decoder doesn't have to implement a generic composition algorithm; instead, whenever it crosses an arc in HCLG, it has only to update the language-model state (a state-identifier in F). The decoding algorithm is almost exactly the same as for the baseline, except the state-space (a hash index that we use) is not just the state in HCLG, but a pair of (state in HCLG, state in F). There is no subtantial extra work introduced by this, but this decoder is still a bit slower (e.g. nearly twice as slow in a typical setup) versus a decoder with the same beam, without the "biglm" part. The reason seems to be that with the biglm decoder, more states are in the beam (because a state in HCLG may now have more ``copies'', corresponding to different different histories with distinct langauage model states in HCLG. However, with the same beam, the biglm decoder does give better accuracy than lattice rescoring of lattices produced with a small grammar. The reason, we believe, is better pruning: the ``biglm'' decoder does the Viterbi beam pruning with closer-to-optimal language model scores. Of course, it is still not as good pruning as we would get by using a HCLG compiled with the big grammar, because the biglm decoder only updates the ``good'' language model score every time it crosses a word.

Lattice generating decoders

There are lattice-generating versions of some of the decoders described above. There is LatticeFasterDecoder, LatticeSimpleDecoder, and LatticeBiglmFasterDecoder. See lattice for more details on lattice generation.

 All Classes Namespaces Files Functions Variables Typedefs Enumerations Enumerator Friends Defines