6.8.2 KaldiDecoder

6.8.2.1 Outline

KaldiDecoder is acoustic model decoder software developed for HARK. It is built on libraries from Kaldi (footnote 1), a deep learning speech recognition toolkit. Up to HARK version 2.2.0, JuliusMFT (a large vocabulary speech recognition decoder based on Julius) was the core of HARK’s speech recognition. In response to recent trends in speech recognition software design, we provide a Kaldi-based decoder, KaldiDecoder, beginning with HARK version 2.3.0.

Compared with other standard Kaldi-based decoders, KaldiDecoder offers the following features:

  1. Connectivity with HARK modules (same handling as JuliusMFT)

    • Compatible with both MSLS and MFCC feature data input via the network (mfcnet)

    • Supports the addition of source location info (SrcInfo)

    • Supports the recognition of simultaneous speech (mutual exclusion)

    The connections with HARK can be made through SpeechRecognitionClient (or through SpeechRecognitionSMNClient), exactly as with JuliusMFT.

  2. Compatibility with JuliusMFT

    • Compatible with JuliusMFT output emulation (in both module-mode and standard output formats)

    KaldiDecoder replicates JuliusMFT output as closely as possible, so that modifications to an existing JuliusMFT-based sound system (for example, its demonstration system or sound scoring system) should be minimal.

  3. Additional Kaldi functionality

    • Implementation of online decoding for nnet1 models

    Kaldi’s standard decoder provides only offline nnet1 model decoding.

  4. Functions to be implemented

    • Missing feature recognition (except for the already implemented mfcnet-masked data structure recognition)

    • nnet2 and nnet3 models

The features and functions described in items 1, 2, and 3 above have been implemented without any change to Kaldi.

The following sections explain how to install and use KaldiDecoder, with a step-by-step procedure for connecting to HARK in FlowDesigner.

6.8.2.2 Start up and setting

KaldiDecoder is started as follows, assuming for example a settings file named kaldi.conf.

  > kaldidecoder --config=kaldi.conf (for Ubuntu OS)
  > kaldidecoder.exe --config=kaldi.conf (for Windows OS)

After KaldiDecoder has started in online mode, HARK establishes the socket connection when a network containing SpeechRecognitionClient (or SpeechRecognitionSMNClient) is started. Speech recognition is enabled once the IP address and port number of that module are set correctly.

The abovementioned kaldi.conf is a text file that describes settings for KaldiDecoder. The settings file basically consists of argument options that begin with “--”; the same arguments can also be given directly as KaldiDecoder command-line options at startup. Text that follows # is treated as a comment. To check the options available in KaldiDecoder, execute the following command:

  > kaldidecoder --help (for Ubuntu OS)
  > kaldidecoder.exe --help (for Windows OS)
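
The settings-file format described above (options beginning with “--”, comments after #) can be illustrated with a small reader. This is only a sketch for inspecting a settings file from a script; it is not KaldiDecoder’s own parser, and the function name `parse_settings` as well as the exact treatment of whitespace and valueless options are assumptions.

```python
def parse_settings(text):
    """Return {option_name: value} from settings-file text.

    Assumptions: everything after '#' on a line is a comment, and each
    remaining non-empty line has the form --name=value or --name.
    """
    options = {}
    for raw_line in text.splitlines():
        # Drop the comment part, then surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        if not line.startswith("--"):
            raise ValueError(f"not an option line: {raw_line!r}")
        name, sep, value = line[2:].partition("=")
        # Options given without '=' are stored as True.
        options[name] = value if sep else True
    return options


example = """\
# hypothetical kaldi.conf fragment
--port-mfcnet=5530   # mfcnet input port
--port-result=10500
--enable-debug
"""
settings = parse_settings(example)
print(settings)
```

All values are kept as strings, so a caller that needs the port as a number has to convert it explicitly.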

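The minimum required settings for using nnet1 models in KaldiDecoder are the following seven items, each of which is marked as mandatory in the option descriptions below:

  --filename-words=word list file name
  --filename-align-lexicon=lexicon file name
  --filename-feature-transform=FeatureTransform file name
  --filename-nnet=nnet file name
  --filename-mdl=mdl file name
  --filename-class-frame-counts=class-frame-counts file name
  --filename-fst=FST file name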

The offline decoding mode requires an additional setting to specify the list of features to be evaluated:

  --filename-features-list=features list file name

If the above setting is excluded, KaldiDecoder will automatically execute in the online decoding mode (mfcnet input mode).
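
As a rough illustration of these rules, the sketch below writes out a settings file from a dictionary, checks that the seven options marked as mandatory in the option descriptions of this section are present, and reports which decoding mode would result. The helper names and all file paths are hypothetical placeholders; only the option names come from this section.

```python
# The seven options this section marks as mandatory for nnet1 decoding.
REQUIRED_NNET1_OPTIONS = (
    "filename-words",
    "filename-align-lexicon",
    "filename-feature-transform",
    "filename-nnet",
    "filename-mdl",
    "filename-class-frame-counts",
    "filename-fst",
)


def render_settings(options):
    """Return settings-file text, one --name=value line per option."""
    missing = [name for name in REQUIRED_NNET1_OPTIONS if name not in options]
    if missing:
        raise ValueError("missing required options: " + ", ".join(missing))
    return "".join(f"--{name}={value}\n" for name, value in options.items())


def decoding_mode(options):
    """Offline if a features list is given, otherwise online (mfcnet input)."""
    return "offline" if "filename-features-list" in options else "online"


# Hypothetical paths, for illustration only.
options = {
    "filename-words": "model/words.txt",
    "filename-align-lexicon": "model/align_lexicon.int",
    "filename-feature-transform": "model/final.feature_transform",
    "filename-nnet": "model/final.nnet",
    "filename-mdl": "model/final.mdl",
    "filename-class-frame-counts": "model/ali_train_pdf.counts",
    "filename-fst": "model/HCLG.fst",
}
print(render_settings(options), end="")
print(decoding_mode(options))
```

Adding a `filename-features-list` entry to the dictionary switches the reported mode to offline, mirroring the behavior described above.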

To change the mfcnet input port or the result output port used in online decoding mode, set the following options (the default port numbers are shown):

  --port-mfcnet=5530
  --port-result=10500

  1. Basic Configurations (Input Files and Modes Settings)

    • --nnet-type=model type number

      This is to set the nnet model type in Kaldi. The default setting value is “1”, and it is the only working setting in the current version. Two more model types will be made available in a future version, and you will be able to change the value as follows:

      Set the value to “1” when using an nnet1 model created using Karel Vesely’s method. Change it to “2” if you use an nnet2 model created using Daniel Povey’s method. Likewise, set it to “3” when using an nnet3 model created using Daniel Povey’s method.

    • --filename-words=word list file name

      This setting specifies the word list file. You can set it as a path relative to the current directory or as an absolute path. The file format of the word list file is the same as that of the words.txt file included in the lexicon directory used in Kaldi for training/validation or in the language model used for evaluation. This setting is mandatory for the output of recognition results.

    • --filename-phones=phoneme list file name

      This is to set the phoneme list file. You can set it as a path relative to the current directory or as an absolute path. The file format of the phoneme list file is the same as the phones.txt file included in the lexicon directory used in Kaldi for training/validation or in the language model used for evaluation. This setting is optional, as it is only necessary when you conduct phoneme alignment.

    • --filename-align-lexicon=lexicon file name

      This is to set the lexicon file. You can set it as a path relative to the current directory or as an absolute path. The file format of the lexicon file is the same as that of the align_lexicon.int file included in the lexicon directory used in Kaldi for training/validation or in the language model used for evaluation. If you create a lang directory using prepare_lang.sh, the lexicon file will be output to lang/phones/align_lexicon.int. This setting is mandatory for the output of recognition results.

    • --filename-feature-transform=FeatureTransform file name

      This is to set the FeatureTransform file, which is output to the path of the trained DNN as exp/tri*dnn*/final.feature_transform. This setting is mandatory when you use KaldiDecoder in nnet1 model decoding, as it is separate from the acoustic model.

    • --filename-nnet=nnet file name

      This is to set the nnet file, which is output to the path of the trained DNN as exp/tri*dnn*/final.nnet (note: this path is a symbolic link). This setting is mandatory when you use KaldiDecoder in nnet1 model decoding, as it is separate from the acoustic model.

    • --filename-mdl=mdl file name

      This is to set the mdl file, which is output to the path of the trained DNN as exp/tri*dnn*/final.mdl (note: this path is a symbolic link). This setting is mandatory, as the acoustic model is required for the output of recognition results.

    • --filename-class-frame-counts=class-frame-counts file name

      This is to set the class-frame-counts file, which is output to the path of the trained DNN as exp/tri*dnn*/ali_train_pdf.counts. This setting is mandatory when you use KaldiDecoder in nnet1 model decoding, as it is separate from the acoustic model.

    • --filename-fst=FST file name

      This is to set the FST file. If you create the graph directory using mkgraph.sh, the FST file will be output to graph*/HCLG.fst. This setting is mandatory, as it is required for the output of recognition results.

    • --filename-features-list=features list file name

      When using KaldiDecoder in offline decoding mode, you must specify a text file listing the paths to the feature file(s) to be evaluated (note: this is the name of the list, not of the feature file(s)). If this option is not set, KaldiDecoder runs in the online decoding mode.

    • --port-mfcnet=port number

      This is to set the port number that receives the acoustic features and masks transmitted via the network by SpeechRecognitionClient (or SpeechRecognitionSMNClient). It is similar to the “-adport” input port setting for mfcnet input mode in JuliusMFT. The same default port number as in JuliusMFT, 5530, will be used if none is provided. This option is valid only in the online decoding mode.

    • --port-result=port number

      This is to set the connection port that transmits the recognition results through the network. It is similar to the “-module” output port setting for module mode in JuliusMFT. The same default port number as in JuliusMFT, 10500, will be used if none is provided.

    • --lm-name

      This is to set the language model name. When this setting is active, the LMNAME attribute is provided in module mode output. It has no default value.

  2. Decoder Tuning Configuration (Weighting and Pruning)

    The items described in this section are options inherited from the features implemented in the decoder (classes) in Kaldi.

    • --acoustic-scale=acoustic scale value

      This is to set the acoustic scale value. The default setting value is “0.5”, which is generally the inverse of the best LM weight obtained in scoring.
      Reference: https://sourceforge.net/p/kaldi/discussion/1355348/thread/924c555b/

    • --max-active

      This is to set the decoder’s maximum number of active states. Increasing the value provides more accurate results but also significantly slows decoding. The default setting value is “2147483647” (the maximum value of the int32 type). Based on our experience, we recommend setting it between 2000 and 5000 to obtain better performance.

    • --min-active

      This is to set the decoder’s minimum number of active states. The default setting value is “200”.

    • --beam=beam width

      This is to set the decoding beam width. Increasing the value provides more accurate results but also considerably slows decoding. The default setting value is “16”. For details, refer to the quotation in Table 6.89.

    • --beam-delta=beam delta

      This is to set the decoding beam delta. This parameter is obscure and relates to a speed-up in the way in which the “max-active” constraint is applied. Increasing the value provides more accurate results. The default setting value is “0.5”. For details, refer to the quotation in Table 6.89.

    • --delta=delta

      This is the tolerance value used in determinization, which is set to “0.000976562” by default. For details, refer to the quotation in Table 6.89.

    • --hash-ratio

      This is a ratio value that controls the hash behavior in the decoding process. The default setting value is “2”.

    • --prune-interval=frame count

      This is to set the frame interval at which tokens are to be pruned. The default setting value is “25”.

    • --splice=splice count

      This is to set the DNN input splice count. It is the time context around the current frame. The default setting value is “3”, which means that there are 3 frames both before and after the current frame.
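
To make the effect of the splice count concrete, the following sketch builds the spliced DNN input for each frame: with a splice count of 3, every frame is concatenated with its 3 preceding and 3 following frames, so the input dimension becomes (2 × 3 + 1) times the feature dimension. The function name `splice_frames` is hypothetical, and filling the context beyond the utterance boundaries by repeating the first or last frame is an assumption of this sketch.

```python
def splice_frames(frames, splice=3):
    """Concatenate each frame with `splice` frames of left and right context.

    Assumption: context frames that would fall outside the utterance are
    filled by repeating the first or the last frame.
    """
    last = len(frames) - 1
    spliced = []
    for t in range(len(frames)):
        window = []
        for offset in range(-splice, splice + 1):
            # Clamp the index so that it stays inside the utterance.
            index = min(max(t + offset, 0), last)
            window.extend(frames[index])
        spliced.append(window)
    return spliced


# Five frames of a 2-dimensional feature, for illustration.
frames = [[float(t), float(t) + 0.5] for t in range(5)]
spliced = splice_frames(frames, splice=3)
print(len(spliced), len(spliced[0]))
```

The number of frames is unchanged; only the per-frame dimension grows, here from 2 to 14.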

    The text in Table 6.89 is quoted from the Kaldi website (http://www.danielpovey.com/kaldi-docs/decoders.html).

    Table 6.89: Quotation from “FasterDecoder: a more optimized decoder”

    
      The code in FasterDecoder as it relates to cutoffs is a little more complicated than just having the 
    one pruning step. The basic observation is this: it's pointless to create a very large number of tokens 
    if you are only going to ignore most of them later. So the situation in ProcessEmitting is: we have  
    "weight_cutoff" but wouldn't it be nice if we knew what the value of "weight_cutoff" on the next frame 
    was going to be? Call this "next_weight_cutoff". Then, whenever we process arcs that have the current 
    frame's acoustic likelihoods, we could just avoid creating the token if the likelihood is worse than 
    "next_weight_cutoff". In order to know the next weight cutoff we have to know two things. We have to 
    know the best token's weight on the next frame, and we have to know the effective beam width on the 
    next frame. The effective beam width may differ from "beam" if the "max_active" constraint is limiting, 
    and we use the heuristic that the effective beam width does not change very much from frame to frame. 
    We attempt to estimate the best token's weight on the next frame by propagating the currently best 
    token (later on, if we find even better tokens on the next frame we will update this estimate). We get 
    a rough upper bound on the effective beam width on the next frame by using the variable "adaptive_beam". 
    This is always set to the smaller of "beam" (the specified maximum beam width), or the effective beam 
    width as determined by max_active, plus beam_delta (default value: 0.5). When we say it is a 
    "rough upper bound" we mean that it will usually be greater than or equal to the effective beam width 
    on the next frame. The pruning value we use when creating new tokens equals our current estimate of the 
    next frame's best token, plus "adaptive_beam". With finite "beam_delta", it is possible for the pruning 
    to be stricter than dictated by the "beam" and "max_active" parameters alone, although at the value 0.5 
    we do not believe this happens very often.
        

    Povey, Daniel:
    Citing Sources: [http://www.danielpovey.com/kaldi-docs/decoders.html#decoders_faster]: para. 3: [December 6, 2016]
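
The interaction of “beam”, “max-active”, and “beam-delta” described in the quotation can be sketched as follows. This is a simplified illustration written from the wording of the quotation, not code taken from Kaldi: the function name `get_cutoff` is hypothetical, the “min-active” constraint is ignored, and costs are assumed to be negated log-likelihoods (smaller is better).

```python
def get_cutoff(costs, beam=16.0, max_active=2147483647, beam_delta=0.5):
    """Return (cutoff, adaptive_beam) for the token costs of one frame."""
    best = min(costs)
    beam_cutoff = best + beam
    if len(costs) <= max_active:
        # max_active is not limiting: the plain beam applies.
        return beam_cutoff, beam
    # Cost of the first token that falls outside the max_active best ones.
    max_active_cutoff = sorted(costs)[max_active]
    if max_active_cutoff >= beam_cutoff:
        # The beam is already stricter than max_active.
        return beam_cutoff, beam
    # Effective beam width as determined by max_active.
    effective_beam = max_active_cutoff - best
    # "The smaller of beam, or the effective beam width plus beam_delta".
    adaptive_beam = min(beam, effective_beam + beam_delta)
    return max_active_cutoff, adaptive_beam


costs = [10.0, 11.0, 12.0, 13.0, 30.0]
print(get_cutoff(costs, beam=16.0, max_active=3))
print(get_cutoff(costs, beam=16.0))
```

In the terms of the quotation, new tokens on the next frame would then be pruned against the current estimate of the next frame’s best token cost plus the returned `adaptive_beam`.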
        

  3. Lattice Configuration

    The items described in this section are options that have been inherited from the features implemented in the decoder (classes) in Kaldi.

    • --determinize-lattice

      This is to determinize the lattices. It keeps only the best probability density function (p.d.f.) sequence for each word sequence.

    • --lattice-beam=beam width

      This is to set the beam width in lattice generation. Increasing the value gives deeper lattices but also significantly slows decoding. The default setting value is “10”.

    • --max-mem=maximum memory allocation size

      This is to set the maximum approximate size of memory allocated when determinizing the lattices. However, the actual usage may be higher than the specified value because the allocation may occur more than once.

    • --minimize

      When this option is given, the lattices are minimized after determinization.

    • --phone-determinize

      When this option is given, an initial pass of determinization is performed on both phonemes and words. See also “--word-determinize”.

    • --word-determinize

      When this option is given, a second pass of determinization is performed on words only. See also “--phone-determinize”.

  4. Others

    • --config=configuration file name

      This setting specifies the config file. It can be specified repeatedly.

    • --enable-debug

      This option enables the debugging output. The default setting is “disabled”.

    • --help

      This is to display the help menu. When this option is given, all other options are ignored.

    • --print-args

      When this option is enabled, the command line arguments are sent to the standard output. The default setting is “enabled”. Set it as “--print-args=false” to disable it.

    • --verbose=log level

      This is to set the detail level of the log information. Increasing the value gives more detailed log output. The default setting value is “0”.

The functionality called “module mode” in the original Julius or JuliusMFT is also available in KaldiDecoder; selecting the online decoding mode enables it automatically. In addition, unlike in Julius or JuliusMFT, the standard output is not deactivated: both standard and socket (network) outputs can be used in KaldiDecoder’s online decoding mode.


6.8.2.3 Detailed description

6.8.2.3.1 mfcnet communication specification

 

In order to use mfcnet as an acoustic input source, the argument “--filename-features-list” must not be given when starting up KaldiDecoder, as mentioned above. In this case, KaldiDecoder acts as a TCP/IP communication server, starting up in the listening state and waiting for input. The HARK modules SpeechRecognitionClient and SpeechRecognitionSMNClient act as clients that transmit acoustic features and Missing Feature Masks to KaldiDecoder. The client connects to KaldiDecoder for every utterance and closes the connection immediately after the transmission is complete. The transmitted data must be little endian (note that this is not the network byte order). Concretely, communication for one utterance is performed as follows.

  1. Socket connection

    The client opens the socket and connects to the mfcnet communication port of KaldiDecoder.

  2. Communication initialization (data transmitted once at the beginning)

    Immediately after the socket connection, the client transmits information on the sound source whose features are about to be sent, as shown in Table 6.90. The sound source information is expressed in a SourceInfo structure (Table 6.91) and consists of a sound source ID, the sound source direction, and the transmission start time. The time is stored in a timeval structure as defined in <sys/time.h> and represents the time elapsed since the starting time point (January 1, 1970 00:00:00) in the system time zone.

  3. Data transmission (data transmitted at every frame)

    Acoustic features and Missing Feature Masks are transmitted. The features of one utterance are transmitted frame by frame, as shown in Table 6.92, until the end of the speech section. KaldiDecoder assumes internally that the feature vectors and the mask vectors have the same number of dimensions.

  4. Connection end (data transmitted once at the end)

    After the features for one sound source have been transmitted, data indicating completion (Table 6.93) are transmitted. KaldiDecoder keeps receiving data for the current sound source until either the data indicating completion are received or the socket connection is severed, and then returns to the listening state to receive the next sound source. It is therefore possible to resume and continue data reception in an environment with a relatively unstable connection.

  5. Socket disconnection

    After the ending process, the socket is closed. If the socket is closed without the ending process, KaldiDecoder performs exception handling, so the output of the recognition results may be delayed. Likewise, any data transmitted after the ending process are ignored, regardless of whether the socket is open or closed.

    Table 6.90: Data to be transmitted only once at the beginning (acoustic source information)

    Size [byte]   Type         Data to be transmitted
    4             int          28 (= sizeof(SourceInfo))
    28            SourceInfo   Sound source information of the features that are going to be transmitted


    Table 6.91: SourceInfo structure

    Member variable name   Type      Description
    source_id              int       Sound source ID
    azimuth                float     Horizontal direction [deg]
    elevation              float     Vertical direction [deg]
    time                   timeval   Time (standardized to the 64-bit processor layout; 16 bytes long)


    Table 6.92: Data to be transmitted for every frame (features, masks data and dimensions information)

    Size [byte]   Type        Data to be transmitted
    4             int         N1 = (dimension number of feature vector) × sizeof(float)
    N1            float[N1]   Feature vector (float array)
    4             int         N2 = (dimension number of mask vector) × sizeof(float)
    N2            float[N2]   Mask vector (float array)


    Table 6.93: Data to be transmitted only once at the end (data to indicate completion)

    Size [byte]   Type   Data to be transmitted
    4             int    0
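
The byte layout of Tables 6.90 to 6.93 can be sketched with Python’s struct module. This is a minimal sketch under stated assumptions, not the implementation used by SpeechRecognitionClient: the function names are hypothetical, and the 16-byte timeval is assumed to consist of two 64-bit integers (seconds, then microseconds), which gives the 28-byte SourceInfo stated in Table 6.90.

```python
import socket
import struct

# SourceInfo: int source_id, float azimuth, float elevation, 16-byte timeval.
# "<" selects little endian without padding, as the specification requires.
SOURCE_INFO = struct.Struct("<iffqq")  # 4 + 4 + 4 + 8 + 8 = 28 bytes

# Table 6.93: a single int with value 0 marks the end of the utterance.
END_OF_UTTERANCE = struct.pack("<i", 0)


def pack_source_info(source_id, azimuth, elevation, sec, usec):
    """Table 6.90: the size of SourceInfo followed by the structure itself."""
    body = SOURCE_INFO.pack(source_id, azimuth, elevation, sec, usec)
    return struct.pack("<i", len(body)) + body


def pack_frame(feature, mask):
    """Table 6.92: byte count and float array, first features, then mask."""
    if len(feature) != len(mask):
        raise ValueError("feature and mask must have the same dimension")
    out = b""
    for vector in (feature, mask):
        payload = struct.pack(f"<{len(vector)}f", *vector)
        out += struct.pack("<i", len(payload)) + payload
    return out


def build_utterance(source, frames):
    """Concatenate initialization, all frames, and the end marker."""
    parts = [pack_source_info(*source)]
    parts += [pack_frame(feature, mask) for feature, mask in frames]
    parts.append(END_OF_UTTERANCE)
    return b"".join(parts)


def send_utterance(host, port, source, frames):
    """Open one connection per utterance and close it after sending."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall(build_utterance(source, frames))


source = (0, 0.0, 16.7, 1466144473, 169637)
frames = [([0.1, 0.2, 0.3], [1.0, 1.0, 0.0])] * 2
data = build_utterance(source, frames)
print(len(data))
```

With a decoder listening locally on the default mfcnet port, `send_utterance("localhost", 5530, source, frames)` would transmit the same bytes over the network.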


6.8.2.3.2 Module mode communication specification

 

In online decoding mode, KaldiDecoder operates similarly to the module mode in Julius. In module mode, KaldiDecoder works as a TCP/IP communication server and provides recognition results to clients such as jcontrol. The character encoding for Japanese text depends on that of the language model used. As in Julius, an XML-like format is used for data representation, and a “.” (period) is transmitted to mark the end of every message. As an additional feature, KaldiDecoder can also output results in the standard XML format without the “.” (period) mark. The meaning of the most common tags transmitted by KaldiDecoder is as follows.

Compared with the original Julius, the following two changes were made in KaldiDecoder.


6.8.2.3.3 Example output of KaldiDecoder 

 

  1. Example output of standard output mode

    
     source_id = 0, azimuth = 0.000000, elevation = 16.700001, sec = 1466144473, usec =
      169637
     ### Recognition: 2nd pass (RL heuristic best-first)
     STAT: 00
     sentence1: ORDER PLEASE
     wseq1: ORDER PLEASE
     phseq1: ao ao ao r r r r r d d d d er er er er er er er p p p l l l iy iy iy iy iy 
      iy iy z z z
     cmscore1: 1.000 1.000
     score1: 260.002472 ( AM: 274.768372, LM: -14.765888 )
    
    

  2. Example output of socket output mode (module mode)

    
     <SOURCEINFO SOURCEID="0" AZIMUTH="0.000000" ELEVATION="16.700001" SEC="1466144473"
     USEC="169637"/>
     .
     <STARTRECOG SOURCEID="0"/>
     .
     <ENDRECOG SOURCEID="0"/>
     .
     <RECOGOUT SOURCEID="0">
       <SHYPO RANK="1" SCORE="260.002472" AMSCORE="274.768372" LMSCORE="-14.765888">
         <WHYPO WORD="ORDER" CLASSID="ORDER" PHONE="" CM="1.000"/>
         <WHYPO WORD="PLEASE" CLASSID="PLEASE" PHONE="" CM="1.000"/>
       </SHYPO>
     </RECOGOUT>
     .
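
A client that receives the module-mode output shown above has to split the stream at the lines containing only “.” and then interpret each message. The sketch below does this for the example text; the function names are hypothetical, and parsing each message with a standard XML parser is an assumption that holds for this example but may fail for XML-like messages that are not well-formed XML.

```python
import xml.etree.ElementTree as ET


def split_messages(stream_text):
    """Split module-mode output into messages terminated by a '.' line."""
    messages, current = [], []
    for line in stream_text.splitlines():
        stripped = line.strip()
        if stripped == ".":
            if current:
                messages.append("\n".join(current))
            current = []
        elif stripped:
            current.append(stripped)
    return messages


def recognized_words(message):
    """Return the WORD attributes of a RECOGOUT message, else None."""
    root = ET.fromstring(message)
    if root.tag != "RECOGOUT":
        return None
    return [whypo.get("WORD") for whypo in root.iter("WHYPO")]


example = """
 <SOURCEINFO SOURCEID="0" AZIMUTH="0.000000" ELEVATION="16.700001" SEC="1466144473"
 USEC="169637"/>
 .
 <STARTRECOG SOURCEID="0"/>
 .
 <ENDRECOG SOURCEID="0"/>
 .
 <RECOGOUT SOURCEID="0">
   <SHYPO RANK="1" SCORE="260.002472" AMSCORE="274.768372" LMSCORE="-14.765888">
     <WHYPO WORD="ORDER" CLASSID="ORDER" PHONE="" CM="1.000"/>
     <WHYPO WORD="PLEASE" CLASSID="PLEASE" PHONE="" CM="1.000"/>
   </SHYPO>
 </RECOGOUT>
 .
"""
messages = split_messages(example)
words = [w for w in map(recognized_words, messages) if w is not None]
print(len(messages), words)
```

In a real client the text would arrive in chunks from the result port (10500 by default), so the splitter would need to buffer incomplete messages between reads.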
    
    

6.8.2.4 Notice


6.8.2.5 Installation method


Footnotes

  1. http://kaldi-asr.org/