HARK Document Version 3.0.0. (Revision: 9272) : KaldiDecoder

6.8.2 KaldiDecoder

6.8.2.1 Outline

KaldiDecoder is an acoustic model decoder developed for HARK. It is built on libraries from Kaldi, a deep learning speech recognition toolkit. Up to HARK version 2.2.0, JuliusMFT (a large vocabulary speech recognition decoder based on Julius) was the core of HARK’s speech recognition. In response to recent trends in speech recognition software design, we have provided a Kaldi-based decoder, KaldiDecoder, beginning with HARK version 2.3.0.

Compared with other standard Kaldi-based decoders, this decoder provides the following features:

1. Connectivity with HARK modules (same handling as JuliusMFT)
   - Compatible with both MSLS and MFCC feature data input via the network (mfcnet)
   - Supports the addition of source location info (SrcInfo)
   - Supports the recognition of simultaneous speech (mutual exclusion)
   The connections with HARK can be made through SpeechRecognitionClient (or through SpeechRecognitionSMNClient), identical to JuliusMFT.
2. Compatibility with JuliusMFT
   - Compatible with JuliusMFT output emulation (in both module-mode and standard output formats)
   KaldiDecoder replicates JuliusMFT output as closely as possible, such that modification to the JuliusMFT-based sound system (in its demonstration system or sound scoring system) should be minimal.
3. Additional Kaldi functionality
   - Implementation of online decoding for nnet1 models
   Kaldi’s standard decoder provides only offline nnet1 model decoding.
4. Functions to be implemented
   - Missing feature recognition (except for the already implemented mfcnet-masked data structure recognition)

The features and functions described in numbers 1, 2, and 3 above have been implemented without any change to Kaldi .

The following section explains the method of installing and using KaldiDecoder , with a step-by-step procedure for connecting to HARK in FlowDesigner.

6.8.2.2 Start up and setting

KaldiDecoder is executed as follows, assuming that a settings file named kaldi.conf is used.

  > kaldidecoder --config=kaldi.conf (for Ubuntu OS)
  > kaldidecoder.exe --config=kaldi.conf (for Windows OS)

After starting KaldiDecoder in online mode, start a HARK network that contains SpeechRecognitionClient (or SpeechRecognitionSMNClient) with the IP address and port number set correctly. The module then opens the socket connection to KaldiDecoder, which enables the speech recognition.

The abovementioned kaldi.conf is a text file that describes settings for KaldiDecoder . The content of the setting file consists basically of argument options that begin with “--”, and the user can also specify arguments directly as KaldiDecoder options when starting. Moreover, descriptions that come after # are treated as comments. Confirm the options used for KaldiDecoder by executing the following command:

  > kaldidecoder --help (for Ubuntu OS)
  > kaldidecoder.exe --help (for Windows OS)

The minimum required settings for using nnet1 models in KaldiDecoder are the following seven items:

--filename-words=<YOUR_PATH>/words.txt
--filename-align-lexicon=<YOUR_PATH>/align_lexicon.int
--filename-feature-transform=<YOUR_PATH>/final.feature_transform
--filename-nnet=<YOUR_PATH>/final.nnet
--filename-mdl=<YOUR_PATH>/final.mdl
--filename-class-frame-counts=<YOUR_PATH>/ali_train_pdf.counts
--filename-fst=<YOUR_PATH>/HCLG.fst

The minimum required settings for using nnet3 models in KaldiDecoder are the following four items:

--filename-words=<YOUR_PATH>/words.txt
--filename-align-lexicon=<YOUR_PATH>/align_lexicon.int
--filename-mdl=<YOUR_PATH>/final.mdl
--filename-fst=<YOUR_PATH>/HCLG.fst

In the case of the chain model, in addition to the setting of nnet3, the following setting is necessary:

--frame-subsampling-factor=3

If iVector was used in nnet3/chain model learning, the following setting is additionally required:

--ivector-extraction-config=<YOUR_PATH>/ivector_extractor.conf
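The required-option lists above can be checked mechanically before starting the decoder. The following is a minimal sketch, not part of KaldiDecoder: the option names are taken from the lists above, while the parsing rules (one “--name=value” per line, “#” starting a comment anywhere on a line) and the helper names parse_config and missing_options are assumptions made for illustration.

```python
# Sketch only: option names come from the lists above; the parsing rules are assumptions.
REQUIRED = {
    "1": [
        "filename-words", "filename-align-lexicon", "filename-feature-transform",
        "filename-nnet", "filename-mdl", "filename-class-frame-counts", "filename-fst",
    ],
    "3": ["filename-words", "filename-align-lexicon", "filename-mdl", "filename-fst"],
}


def parse_config(text):
    """Return {option: value} for lines of the form --name=value."""
    options = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # assumed: '#' starts a comment
        if not line.startswith("--"):
            continue
        name, _, value = line[2:].partition("=")
        options[name.strip()] = value.strip()
    return options


def missing_options(options):
    """List the required options that are absent for the configured model type."""
    nnet_type = options.get("nnet-type", "1")  # the documented default is 1
    if nnet_type not in REQUIRED:
        raise ValueError(f"unexpected --nnet-type value: {nnet_type}")
    return [name for name in REQUIRED[nnet_type] if name not in options]


example = """
# nnet3 model (paths are placeholders)
--nnet-type=3
--filename-words=model/words.txt
--filename-mdl=model/final.mdl   # acoustic model
--filename-fst=model/HCLG.fst
"""
print(missing_options(parse_config(example)))  # ['filename-align-lexicon']
```

The check covers only the minimum file options. The chain model and iVector settings described above would need to be added by hand if they apply.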

The offline decoding mode requires an additional setting that specifies the list of features to be evaluated.

Note: From HARK version 2.5.0, parallel decoding is also available in offline decoding, so it is recommended to limit the number of simultaneously active decoder instances with the “--max-tasks=<Number of the Cores>” option. To guarantee that the recognition results are output in the same order as the features list, as in HARK version 2.4.0 and earlier, specify 1 as shown below.

--filename-features-list=<YOUR_PATH>/features_list.txt
--max-tasks=1

If the above setting is not given, KaldiDecoder will automatically execute in the online decoding mode (mfcnet input mode).
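As a rough illustration of the offline settings, the sketch below assembles a command line from the options described above. The function name offline_command and its ordered parameter are hypothetical, and using os.cpu_count() as the number of cores is an assumption. Only the option names and the executable name come from this section.

```python
import os


def offline_command(config_path, features_list_path, ordered=True):
    """Build an argument list for offline decoding.

    ordered=True keeps results in features-list order (--max-tasks=1);
    otherwise the task count is limited to the number of cores.
    """
    max_tasks = 1 if ordered else (os.cpu_count() or 1)
    return [
        "kaldidecoder",
        f"--config={config_path}",
        f"--filename-features-list={features_list_path}",
        f"--max-tasks={max_tasks}",
    ]


print(" ".join(offline_command("kaldi.conf", "features_list.txt")))
```

The resulting list could be passed to subprocess.run(..., check=True) on a machine where kaldidecoder is installed.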

Modify the online decoding mode mfcnet input port or result output port by changing the following options (default port number values are shown):

--port-mfcnet=5530
--port-result=10500

Basic Configurations (Input Files and Modes Settings)
- --nnet-type=model type number
  This is to set the nnet model type in Kaldi . The default setting value is “1”. Depending on the model you have, you can change the settings as follows:
  Set the value to “1” when using a nnet1 model created using Karel Vesely’s method. Change it to “3”, if you use a nnet3/chain model created using Daniel Povey’s method.
- --filename-words=word list file name
  This setting specifies the word list file. You can set it as a path relative to the current directory or as an absolute path. The file format of the word list file is the same as that of the words.txt file included in the lexicon directory used in Kaldi for training/validation or in the language model used for evaluation. This setting is mandatory for the output of recognition results.
- --filename-phones=phoneme list file name
  This is to set the phoneme list file. You can set it as a path relative to the current directory or as an absolute path. The file format of the phoneme list file is the same as the phones.txt file included in the lexicon directory used in Kaldi for training/validation or in the language model used for evaluation. This setting is optional, as it is only necessary when you conduct phoneme alignment.
- --filename-align-lexicon=lexicon file name
  This is to set the lexicon file. You can set it as a path relative to the current directory or as an absolute path. The file format of the lexicon file is the same as that of the align_lexicon.int file included in the lexicon directory used in Kaldi for training/validation or in the language model used for evaluation. If you create a lang directory using prepare_lang.sh , the lexicon file will be output to lang/phones/align_lexicon.int . This setting is mandatory for the output of recognition results.
- --filename-feature-transform=FeatureTransform file name
  This is to set the FeatureTransform file, which is output to the path of the trained DNN as
  exp/<model_dir>/final.feature_transform . This setting is mandatory when you use KaldiDecoder in nnet1 model decoding, as it is separate from the acoustic model.
- --filename-nnet=nnet file name
  This is to set the nnet file, which is output to the path of the trained DNN as exp/<model_dir>/final.nnet (note: this path is a symbolic link). This setting is mandatory when you use KaldiDecoder in nnet1 model decoding, as it is separate from the acoustic model.
- --filename-mdl=mdl file name
  This is to set the mdl file, which is output to the path of the trained DNN as exp/<model_dir>/final.mdl (note: this path is a symbolic link). This setting is mandatory, as the acoustic model is required for the output of recognition results.
- --filename-class-frame-counts=class-frame-counts file name
  This is to set the class-frame-counts file, which is output to the path of the trained DNN as
  exp/<model_dir>/ali_train_pdf.counts . This setting is mandatory when you use KaldiDecoder in nnet1 model decoding, as it is separate from the acoustic model.
- --filename-fst=FST file name
  This is to set the FST file. If you create the graph directory using mkgraph.sh , the FST file will be output to graph*/HCLG.fst . This setting is mandatory, as it is required for the output of recognition results.
- --filename-features-list=features list file name
  When using KaldiDecoder in offline decoding mode, it is required to specify a text list containing paths to the feature file(s) to be evaluated (note: it is the name of the list, not of the feature file(s)). If this option is not set, KaldiDecoder runs in the online decoding mode.
- --ivector-extraction-config=ivector extraction config file name
  This is to set the ivector extraction config file, which is output to the path of the trained DNN as
  exp/<model_dir>/ivector*/conf/ivector_extractor.conf . This setting is mandatory when you use iVectors in nnet3/chain model learning.
- --frame-subsampling-factor=frame subsampling factor
  This is to set the frame subsampling factor, which is output to the path of the trained DNN as
  exp/<model_dir>/frame-subsampling-factor (Only the integer value written in this file is necessary) . This setting is mandatory when you use a chain model. It must be set as follows.
  --frame-subsampling-factor=3
- --frames-per-chunk=frames per chunk
  This is to set the number of frames in each chunk that is separately evaluated by the neural net. If the --frame-subsampling-factor option is used, the value is measured before any subsampling (i.e., it counts input frames). For this option, see the steps/nnet3/decode.sh or steps/nnet3/decode_looped.sh scripts used by Kaldi’s decode recipe. The default setting value is 20, but it is set to 50 in the above decoding recipe.
- --port-mfcnet=port number
  This is to set the port number that receives the acoustic features and masks transmitted via network by SpeechRecognitionClient (or SpeechRecognitionSMNClient ). It is similar to the “-adport” input port setting for mfcnet input mode in JuliusMFT . The same default port number as in JuliusMFT , 5530, will be used if none is provided. This option is valid only in the online decoding mode.
- --port-result=port number
  This is to set the connection port that transmits the recognition results through the network. It is similar to the “-module” output port setting for module mode in JuliusMFT . The same default port number as in JuliusMFT , 10500, will be used if none is provided.
- --host-mfcnet=host name
  This option sets the IP address or host name of the server that listens to the port configured with the --port-mfcnet option. The default setting value is “localhost” .
- --host-result=host name
  This option sets the IP address or host name of the server that listens to the port configured with the --port-result option. The default setting value is “localhost” .
- --lm-name
  This is to set the language model name. When this setting is active, the LMNAME attribute is provided in module mode output. It has no default value.

Decoder Tuning Configuration (Weighting and Pruning)

The items described in this section are options inherited from the features implemented in the decoder (classes) in Kaldi .

--acoustic-scale=acoustic scale value
This is to set the scaling factor for acoustic log-likelihoods. The default setting value is “0.1”, which is generally the inverse of the best LM weight obtained in scoring.
Reference: https://sourceforge.net/p/kaldi/discussion/1355348/thread/924c555b/
However, for chain models, “1.0” is the best setting.
For details, please refer to http://kaldi-asr.org/doc/chain.html#chain_decoding .
--max-active
This is to set the decoder’s maximum number of active states. Although increasing the value provides more accurate results, it also significantly affects the decoding speed. The default setting value is “2147483647” (maximum value of an int32 type). We empirically recommend setting it between 2000 and 5000 to obtain better performance.
--min-active
This is to set the decoder’s minimum number of active states. The default setting value is “200”.
--beam=beam width
This is to set the decoding beam width. Although increasing the value provides more accurate results, it also considerably affects the decoding speed. The default setting value is “16.0”. $(>0.0)$ For details, refer to the quotation in Table 6.151.
--beam-delta=beam delta
This is to set the decoding beam delta. This parameter is obscure and relates to a speed-up in the way in which the “max-active” constraint is applied. Increasing the value provides more accurate results. The default setting value is “0.5”. $(>0.0)$ For details, refer to the quotation in Table 6.151.
--delta=delta
This is the tolerance value used in determinization, which is set as “0.000976562” by default. For details, refer to the quotation in Table 6.151.
--hash-ratio
This is a ratio value to control the hash behavior in decoding process. The default setting value is “2.0”. $(>=1.0)$
--prune-interval=frame count
This is to set the frame interval at which tokens are to be pruned. The default setting value is “25”. $(>0)$
--splice=splice count
This is to set the DNN input splice count. It is the number of frames around the current frame. The default setting value is “3”, which means that there are 3 frames both before and after the current frame.

The passage in Table 6.151 is quoted from the Kaldi website (http://www.danielpovey.com/kaldi-docs/decoders.html).

Table 6.151: Quotation from “FasterDecoder: a more optimized decoder”

The code in FasterDecoder as it relates to cutoffs is a little more complicated than just having the
one pruning step. The basic observation is this: it's pointless to create a very large number of tokens
if you are only going to ignore most of them later. So the situation in ProcessEmitting is: we have
"weight_cutoff" but wouldn't it be nice if we knew what the value of "weight_cutoff" on the next frame
was going to be? Call this "next_weight_cutoff". Then, whenever we process arcs that have the current
frame's acoustic likelihoods, we could just avoid creating the token if the likelihood is worse than
"next_weight_cutoff". In order to know the next weight cutoff we have to know two things. We have to
know the best token's weight on the next frame, and we have to know the effective beam width on the
next frame. The effective beam width may differ from "beam" if the "max_active" constraint is limiting,
and we use the heuristic that the effective beam width does not change very much from frame to frame.
We attempt to estimate the best token's weight on the next frame by propagating the currently best
token (later on, if we find even better tokens on the next frame we will update this estimate). We get
a rough upper bound on the effective beam width on the next frame by using the variable "adaptive_beam".
This is always set to the smaller of "beam" (the specified maximum beam width), or the effective beam
width as determined by max_active, plus beam_delta (default value: 0.5). When we say it is a
"rough upper bound" we mean that it will usually be greater than or equal to the effective beam width
on the next frame. The pruning value we use when creating new tokens equals our current estimate of the
next frame's best token, plus "adaptive_beam". With finite "beam_delta", it is possible for the pruning
to be stricter than dictated by the "beam" and "max_active" parameters alone, although at the value 0.5
we do not believe this happens very often.

Povey, Daniel:
Citing Sources: [http://www.danielpovey.com/kaldi-docs/decoders.html#decoders_faster]: para. 3: [December 6, 2016]
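The relation between “beam”, “max_active” and “beam_delta” in the quotation can be illustrated numerically. The following is a simplified sketch written from the quoted description only. It is not Kaldi's code. It assumes that token weights are costs (lower is better), and the function names get_cutoff and next_weight_cutoff are chosen here for illustration.

```python
def get_cutoff(costs, beam, max_active, beam_delta):
    """Return (weight_cutoff, adaptive_beam) for the tokens of one frame.

    costs: token costs on the current frame (assumed: lower is better).
    """
    best = min(costs)
    beam_cutoff = best + beam
    if len(costs) > max_active:
        # Cost of the first token that falls outside the max_active best ones.
        max_active_cutoff = sorted(costs)[max_active]
        if max_active_cutoff < beam_cutoff:
            # max_active is the tighter constraint, so the effective beam shrinks.
            effective_beam = max_active_cutoff - best
            return max_active_cutoff, min(beam, effective_beam + beam_delta)
    return beam_cutoff, beam


def next_weight_cutoff(next_best_estimate, adaptive_beam):
    """Pruning value used when creating tokens for the next frame."""
    return next_best_estimate + adaptive_beam


costs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
print(get_cutoff(costs, beam=16.0, max_active=3, beam_delta=0.5))    # (3.0, 3.5)
print(get_cutoff(costs, beam=16.0, max_active=100, beam_delta=0.5))  # (16.0, 16.0)
```

In the first call only three tokens may stay active, so the effective beam is 3.0 and the adaptive beam becomes 3.5. In the second call max_active is not limiting and the adaptive beam equals “beam”.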

Lattice Configuration
The items described in this section are options that have been inherited from the features implemented in the decoder (classes) in Kaldi .
- --determinize-lattice
  This is to determinize the lattices. It keeps only the best probability distribution function (p.d.f.) sequence for each word sequence.
- --lattice-beam=beam width
  This is to set the beam width in lattice generation. Increasing the value gives deeper lattices, which also significantly affects the decoding speed. The default setting value is “10”.
- --max-mem=maximum memory allocation size
  This is to set the maximum approximate size of memory allocated when determinizing the lattices. However, the actual usage may be higher than the specified value because the allocation may occur more than once.
- --minimize
  When this option is given, minimize the lattices after determinization.
- --phone-determinize
  When this option is given, do an initial pass of determinization on both phonemes and words. See also the article on “--word-determinize”.
- --word-determinize
  When this option is given, do a second pass of determinization on words only. See also the article on “--phone-determinize”.
Others
- --config=configuration file name
  This setting specifies the config file, which can be specified repeatedly.
- --enable-debug
  This option enables the debugging output. The default setting is “disabled”.
- --help
  This is to display the help menu. When this option is given, all other options are ignored.
- --print-args
  When this option is enabled, the command line arguments are sent to the standard output. The default setting is “enabled”. Set it as “--print-args=false” to disable it.
- --verbose=log level
  This is to set detail level of log information. Increasing the value gives more detailed log output. The default setting value is “0”.

The functionality called “module mode” in the original Julius or JuliusMFT is also available in KaldiDecoder . Selecting the online decoding mode automatically enables it. In addition, the standard output is not deactivated as in Julius or JuliusMFT ; both standard and socket (network) outputs can be used in KaldiDecoder ’s online decoding mode.

6.8.2.3 Detailed description

6.8.2.3.1 mfcnet communication specification

In order to use mfcnet as an acoustic input source, the argument “--filename-features-list” must not be given when starting up KaldiDecoder as mentioned in the options description. In this case, KaldiDecoder acts as a TCP/IP communications server, starting up in the listening state and waiting for input. Moreover, the HARK modules SpeechRecognitionClient and SpeechRecognitionSMNClient work as a client to transmit acoustic features and Missing Feature Mask to KaldiDecoder . The client connects to KaldiDecoder for every utterance and closes the connection immediately after the transmission is complete. The data to be transmitted must be little endian (note that it is not a network byte order). Concretely, communication is performed as follows for one utterance.

Socket connection
The client opens the socket and connects to the mfcnet communication port in KaldiDecoder .
Communication initialization (data transmitted once at the beginning)
The client transmits information on the sound source that is going to be transmitted, as shown in Table 6.152, immediately after the socket connection. The sound source information is expressed in a SourceInfo structure (Table 6.153) and has a sound source ID, sound source direction and time of transmission start. The time is expressed in a timeval structure defined in <sys/time.h> and is the time elapsed since the starting time point (January 1, 1970 00:00:00) in the system time zone.
Data transmission (data transmitted at every frame)
Acoustic features and Missing Feature Mask are transmitted. Features of one utterance are transmitted as frames, shown in Table 6.154, repeatedly until the end of the speech section. It is assumed inside the KaldiDecoder that the dimension number of feature vectors and mask vectors are the same.
Connection end (data transmitted once at the end)
After transmitting features for one sound source, data (Table 6.155) that indicate completion is transmitted. KaldiDecoder will return to the listening state to receive the next sound source data until either the data indicating completion is received or its socket connection is severed. It is therefore possible to resume and continue data reception in an environment with a relatively unstable connection.

Socket disconnection

After the ending process, the sockets are closed. If they close without the ending process, it executes exception tasks; thus, the output of the recognition results may be delayed. Likewise, any data transmitted after the ending process is ignored regardless of the sockets being open or closed.

Table 6.152: Data to be transmitted only once at the beginning (acoustic source information)

Size[byte]	Type	Data to be transmitted
4	`int`	28 (= sizeof(SourceInfo))
28	SourceInfo	Sound source information of features that are going to be transmitted

Table 6.153: SourceInfo structure

Member variable name	Type	Description
source_id	`int`	Sound source ID
azimuth	`float`	Horizontal direction [deg]
elevation	`float`	Vertical direction [deg]
time	timeval	Time (standardized to 64 bit processor and 16 bytes long)

Table 6.154: Data to be transmitted for every frame (features, masks data and dimensions information)

Size[byte]	Type	Data to be transmitted
4	`int`	N1=(dimension number of feature vector) $\times$ sizeof(`float` )
N1	`float` [N1]	feature vector (float array)
4	`int`	N2=(dimension number of mask vector) $\times$ sizeof(`float` )
N2	`float` [N2]	mask vector (float array)

Table 6.155: Data to be transmitted only once at the end (data to indicate completion)

Size[byte]	Type	Data to be transmitted
4	`int`	0
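The byte layouts in Tables 6.152 to 6.155 can be sketched with Python's struct module. This is an illustrative client-side sketch, not the implementation used by SpeechRecognitionClient. It assumes that the 16-byte timeval consists of two 64-bit signed integers (seconds, then microseconds) and that the 28-byte SourceInfo is stored without padding. The function names are hypothetical, and the default host and port are the documented defaults.

```python
import socket
import struct

# Assumed SourceInfo layout: int32 source_id, float32 azimuth, float32 elevation,
# then timeval as two int64 values. "<" means little endian without padding.
SOURCE_INFO = struct.Struct("<iffqq")  # 4 + 4 + 4 + 8 + 8 = 28 bytes


def pack_header(source_id, azimuth, elevation, sec, usec):
    """Table 6.152: the size of SourceInfo followed by SourceInfo itself."""
    body = SOURCE_INFO.pack(source_id, azimuth, elevation, sec, usec)
    return struct.pack("<i", len(body)) + body


def pack_frame(features, mask):
    """Table 6.154: byte count and float array for features, then for the mask."""
    if len(features) != len(mask):
        raise ValueError("feature and mask vectors must have the same dimension")

    def block(values):
        data = struct.pack(f"<{len(values)}f", *values)
        return struct.pack("<i", len(data)) + data

    return block(features) + block(mask)


END_MARKER = struct.pack("<i", 0)  # Table 6.155: a single int with value 0


def send_utterance(source, frames, host="localhost", port=5530):
    """Send one utterance: header, every (features, mask) frame, end marker."""
    with socket.create_connection((host, port)) as sock:
        sock.sendall(pack_header(*source))
        for features, mask in frames:
            sock.sendall(pack_frame(features, mask))
        sock.sendall(END_MARKER)


header = pack_header(0, 0.0, 16.7, 1466144473, 169637)
print(len(header), len(pack_frame([0.1, 0.2, 0.3], [1.0, 1.0, 1.0])))  # 32 32
```

send_utterance opens one connection per utterance and closes it after the end marker, which matches the sequence described above. It is not called here because it needs a running KaldiDecoder.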

6.8.2.3.2 Module mode communication specification

When setting the online decoding, KaldiDecoder operates similarly to the module mode in Julius . In the module mode, KaldiDecoder works as a TCP/IP communication server and provides recognition results to clients such as jcontrol. The character encoding for Japanese text depends on that of the language model used. An XML-like format is used just like in Julius for data representation, and a “.” (period) is transmitted to indicate the data completion for each and every message. As an additional feature of KaldiDecoder , it can also output results in the standard XML format without the “.” (period) mark. The meaning of the most common tags transmitted by KaldiDecoder is as follows.

INPUT tag
This tag represents information related to inputs and has STATUS and TIME as attributes. The values for STATUS are LISTEN, STARTREC or ENDREC. LISTEN indicates that KaldiDecoder is ready to receive speech. STARTREC indicates that the reception of features has started. ENDREC indicates that the last feature of the sound source being received has arrived. TIME indicates the time at that instant.
SOURCEINFO tag
This tag represents information related to sound sources and is an original tag of KaldiDecoder . It has ID, AZIMUTH, ELEVATION, SEC and USEC as attributes. The SOURCEINFO tag is transmitted when starting the recognition process. Its ID indicates a sound source ID given by HARK (not the speaker ID but numbers uniformly given to each sound source). AZIMUTH and ELEVATION indicate horizontal and vertical direction (degrees), respectively, seen from the microphone array coordinate system for the first frame of the sound source. SEC and USEC indicate the time of the first frame of the sound source. SEC indicates seconds and USEC indicates the microseconds fraction.
RECOGOUT tag
This tag represents recognition results, and its sub-element is either a gradual output or the final output. For gradual output, it has the PHYPO tag as a sub-element, and for the final output, it has the SHYPO tag as a sub-element. In the case of the final output, only SHYPO tags for the number of candidates specified in the parameters are output.
PHYPO tag
This tag represents gradual candidates and it has vectors of WHYPO tags for candidate words as sub-elements. It has PASS, SCORE, FRAME and TIME as attributes. PASS indicates the number of decoding passes and is always 1. SCORE indicates the accumulated score of this candidate. FRAME indicates the number of frames that have been processed in order to output this candidate. TIME indicates time (sec) at that instant.
SHYPO tag
This tag represents a sentence hypothesis and it has vectors of WHYPO tags for candidate words as sub-elements. It has PASS, RANK, SCORE, AMSCORE and LMSCORE as attributes. PASS indicates the number of decoding passes and, when available, is always set to 1. RANK indicates the rank order of a hypothesis. SCORE indicates the logarithmic likelihood of this hypothesis, AMSCORE indicates a logarithmic acoustic likelihood and LMSCORE indicates a logarithmic language probability.
WHYPO tag
This tag represents word hypotheses and has WORD, CLASSID, PHONE and CM as attributes. WORD indicates notations, CLASSID indicates the word that is the key in a statistical language model, PHONE indicates phoneme sequences and CM indicates the confidence for the word. Word confidence is included only to maintain compatibility with the Julius-based decoder output, and its value, fixed at 1.0, is irrelevant to the actual performance.
SYSINFO tag
This tag represents the status of the system and it has PROCESS as an attribute. When PROCESS is EXIT, it indicates normal termination. When PROCESS is ERREXIT, it indicates abnormal termination. When PROCESS is ACTIVE, it indicates that speech recognition can be performed. When PROCESS is SLEEP, it indicates that speech recognition is halted.
Whether or not these tags and attributes are output depends on the arguments set when starting KaldiDecoder . The SOURCEINFO tag is always output, and the others are the same as those of the original Julius and therefore users are recommended to refer to Argument Help of the original Julius .

When comparing to the original Julius , two changes were made to KaldiDecoder .

Addition of items related to the SOURCEINFO tag for information on source localization as described above, and also the embedding of sound source ID (SOURCEID) to the following tags: STARTRECOG, ENDRECOG, INPUTPARAM, GMM, RECOGOUT, REJECTED, RECOGFAIL, GRAPHOUT, SOURCEINFO
Changes were made to the format of the module mode in order to reduce the delay caused by mutual exclusion when processing simultaneous utterances. Concretely, mutual exclusion used to be performed utterance-wise, but now the output is divided so that the exclusion control can be performed tag-wise. Also, modifications were made to the output of the following one-time tags.
<< Tags separated by start-tag / end-tag >>
- <RECOGOUT> ... </RECOGOUT>
- <GRAPHOUT> ... </GRAPHOUT>
- <GRAMINFO> ... </GRAMINFO>
- <RECOGPROCESS> ... </RECOGPROCESS>
<< One-line tags that are internally split and output multiple times >>
- <RECOGFAIL ... />
- <REJECTED ... />
- <SR ... />

6.8.2.3.3 Example output of KaldiDecoder

Example output of standard output mode

 source_id = 0, azimuth = 0.000000, elevation = 16.700001, sec = 1466144473, usec = 169637
 ### Recognition: 2nd pass (RL heuristic best-first)
 STAT: 00
 sentence1: ORDER PLEASE
 wseq1: ORDER PLEASE
 phseq1: ao ao ao r r r r r d d d d er er er er er er er p p p l l l iy iy iy iy iy iy iy z z z
 cmscore1: 1.000 1.000
 score1: 260.002472 ( AM: 274.768372, LM: -14.765888 )

Example output of socket output mode (module mode)

 <SOURCEINFO SOURCEID="0" AZIMUTH="0.000000" ELEVATION="16.700001" SEC="1466144473" USEC="169637"/>
 .
 <STARTRECOG SOURCEID="0"/>
 .
 <ENDRECOG SOURCEID="0"/>
 .
 <RECOGOUT SOURCEID="0">
   <SHYPO RANK="1" SCORE="260.002472" AMSCORE="274.768372" LMSCORE="-14.765888">
     <WHYPO WORD="ORDER" CLASSID="ORDER" PHONE="" CM="1.000"/>
     <WHYPO WORD="PLEASE" CLASSID="PLEASE" PHONE="" CM="1.000"/>
   </SHYPO>
 </RECOGOUT>
 .
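A client that reads the module-mode output could split the stream at the “.” lines and parse each message. The sketch below assumes that every message is well-formed XML, which may not hold if a word contains characters such as “<”. The function names split_messages and recognition_results are hypothetical, and the sample text is the example output above.

```python
import xml.etree.ElementTree as ET


def split_messages(text):
    """Split module-mode output into messages terminated by a line holding '.'."""
    messages, current = [], []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped == ".":
            if current:
                messages.append("\n".join(current))
                current = []
        else:
            current.append(line)
    return messages


def recognition_results(text):
    """Return (SOURCEID, words of the first SHYPO) for every RECOGOUT message."""
    results = []
    for message in split_messages(text):
        root = ET.fromstring(message)  # assumes well-formed XML
        if root.tag != "RECOGOUT":
            continue
        best = root.find("SHYPO")
        words = [] if best is None else [w.get("WORD") for w in best.findall("WHYPO")]
        results.append((root.get("SOURCEID"), words))
    return results


sample = """<SOURCEINFO SOURCEID="0" AZIMUTH="0.000000" ELEVATION="16.700001" SEC="1466144473" USEC="169637"/>
.
<STARTRECOG SOURCEID="0"/>
.
<ENDRECOG SOURCEID="0"/>
.
<RECOGOUT SOURCEID="0">
  <SHYPO RANK="1" SCORE="260.002472" AMSCORE="274.768372" LMSCORE="-14.765888">
    <WHYPO WORD="ORDER" CLASSID="ORDER" PHONE="" CM="1.000"/>
    <WHYPO WORD="PLEASE" CLASSID="PLEASE" PHONE="" CM="1.000"/>
  </SHYPO>
</RECOGOUT>
.
"""
print(recognition_results(sample))  # [('0', ['ORDER', 'PLEASE'])]
```

Because the output is divided tag-wise for simultaneous utterances, the SOURCEID attribute is what ties each message to its sound source.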

6.8.2.4 Notice

Restraint of the PHONE tag
Although JuliusMFT supports PHONE tag output for each WORD, the same feature is not implemented in KaldiDecoder because, owing to the structure of Kaldi, it would cause performance degradation. Therefore, in the socket output mode, the phoneme output for each WHYPO tag is not supported. In the standard output mode, only output with no pipes ("|") between words is supported.
Known issue in the Windows version
There were confirmed issues of corrupted characters when outputting recognition results in a multi-byte encoding to the standard output on Windows. The default character set for Ubuntu terminal’s standard output is UTF-8, so the issue occurs when using the same language model on Windows. In other words, a mismatch between the character sets used in the operating system console and language model causes this problem. To avoid this, start KaldiDecoder with the output redirection "> filename", and open the output recognition text file with an appropriate text editor. The reason for this restriction is that, in Julius , it was possible to convert the character set at the output time using the “iconv” library or the internal implementation “libjcode” as needed; however, Kaldi does not have a character set conversion feature, and it has not been implemented in KaldiDecoder .

6.8.2.5 Installation method

Using apt
If the HARK apt repository is registered, installation can be done as follows.
> sudo apt install kaldidecoder-hark

Installing from source

Since KaldiDecoder uses libraries from Kaldi , Kaldi must be built in advance. However, Kaldi libraries are not packaged in Ubuntu, so it has to be compiled from source by executing the following commands.

> sudo apt update
> sudo apt install git automake autoconf libtool cmake cmake-extras build-essential
> sudo apt install libopenblas-base libopenblas-dev
> cd ~/
> mkdir <YOUR_DIR>
> cd <YOUR_DIR>
> git clone https://github.com/kaldi-asr/kaldi.git
> cd kaldi
> git checkout <COMMIT_ID>
> cd tools
> make
> cd ../src
> ./configure --mathlib=OPENBLAS --openblas-root=/usr
> make clean -j <CORES>
> make depend -j <CORES>
> make -j <CORES>
> cd ../
> wget http://archive.hark.jp/harkrepos/dists/<DISTRO>/non-free/source/kaldidecoder-hark_<HARK_VER>.tar.xz
> tar -Jxvf kaldidecoder-hark_<HARK_VER>.tar.xz
> cd kaldidecoder3
> mkdir build
> cd build
> cmake .. -DCMAKE_BUILD_TYPE=None -DOPENBLAS_ROOT_DIR:STRING=/usr -DCMAKE_VERBOSE_MAKEFILE:BOOL=TRUE
> make
> sudo make install

** <YOUR_DIR>  : Your work directory                     --  e.g.) kaldi_build
** <COMMIT_ID> : Git commit ID of Kaldi version to build --  e.g.) 4571f47f84
                 Please read the KaldiDecoder's README to check
                 which Kaldi commit ID it was based on...
** <CORES>     : How many cores do you have              --  e.g.) 4
** <DISTRO>    : Ubuntu distribution                     --  e.g.) xenial
** <HARK_VER>  : HARK version                            --  e.g.) 2.5.0-openblas

Since it is installed in /usr/local/bin by default, the “-DCMAKE_INSTALL_PREFIX” must be set as follows in order to install in /usr/bin like the package version.

> cd kaldidecoder3
> mkdir build
> cd build
> cmake .. -DCMAKE_INSTALL_PREFIX=/usr -DCMAKE_BUILD_TYPE=None -DOPENBLAS_ROOT_DIR:STRING=/usr -DCMAKE_VERBOSE_MAKEFILE:BOOL=TRUE
> make
> sudo make install

If the output of “kaldidecoder --help” is as shown below, the KaldiDecoder installation was successful.

> kaldidecoder --help
usage: If you requests need use ONLINE decoding with nnet1 model.
       (ONLINE mode is default)
...

With the above, installation is complete.

For the installation method on Windows OS, please refer to Section 3.2.