Problem
This is the first time I am trying to recognize speech with HARK.
Solution
Speech recognition with HARK consists of two main processes.
Feature extraction from an audio signal with HARK
Speech recognition with JuliusMFT
If you are performing speech recognition for the first time, it is better to modify the sample networks of speech recognition, as shown in the Appendix.
Feature extraction:
MSLS and MFCC features are supported by HARK. As an example, we explain how to extract audio features consisting of MSLS, $\Delta$ MSLS, and $\Delta$ power, or of MFCC, $\Delta$ MFCC, and $\Delta$ power.
Figures 2.12 and 2.13 show network files that extract MSLS and MFCC features, respectively. The PreEmphasis , MelFilterBank , Delta , and FeatureRemover nodes are used, together with either the MSLSExtraction or the MFCCExtraction node. The SpeechRecognitionClient node sends the extracted features to JuliusMFT over a socket connection. Recognition is performed independently for each sound source.
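As a rough illustration of what the Delta node computes (this is a hedged sketch, not HARK's implementation, which may use a different window), delta features are commonly obtained by a linear regression over neighboring frames:

```python
# Hypothetical sketch of delta-feature computation. Uses the common
# regression formula d_t = sum_k k*(c_{t+k} - c_{t-k}) / (2 * sum_k k^2),
# with a window half-width K and edge frames clamped.

def delta(frames, K=2):
    """frames: list of equal-length feature vectors; returns delta vectors."""
    n = len(frames)
    dim = len(frames[0])
    denom = 2 * sum(k * k for k in range(1, K + 1))
    out = []
    for t in range(n):
        d = [0.0] * dim
        for k in range(1, K + 1):
            prev = frames[max(t - k, 0)]       # clamp at sequence edges
            nxt = frames[min(t + k, n - 1)]
            for i in range(dim):
                d[i] += k * (nxt[i] - prev[i])
        out.append([v / denom for v in d])
    return out
```

For a feature that increases linearly over time, the interior delta values equal the per-frame slope, which matches the intuition that delta features capture the local rate of change.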
To save features, use the SaveFeatures or SaveHTKFeatures node.
Speech recognition with JuliusMFT:
JuliusMFT , which is based on Julius, is used to recognize the extracted features. If this is your first time using Julius, see the Julius web page and learn its basic usage.
Use the “mfcnet” input option when you want to receive features from HARK over a socket connection. The following is an example:
\begin{verbatim}
-input mfcnet
-plugindir /usr/lib/julius_plugin
-notypecheck
-h hmmdefs
-hlist triphones
-gram sample
-v sample.dict
\end{verbatim}
The first three lines are necessary to receive features from HARK:
Line 1 selects the socket-connection input,
Line 2 specifies the plugin directory that enables the socket connection, and Line 3 disables type checking, which is needed for MSLS features.
The “-plugindir” option must be set correctly according to your environment.
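These options can also be kept in a configuration file and passed to JuliusMFT at startup with Julius's -C option; a sketch, where the file name hark.jconf and the executable name julius_mft are assumptions for your environment:

```
# hark.jconf -- the options from the example above, one per line
-input mfcnet
-plugindir /usr/lib/julius_plugin
-notypecheck
-h hmmdefs
-hlist triphones
-gram sample
-v sample.dict
```

JuliusMFT would then be started (e.g. julius_mft -C hark.jconf) before running the HARK network, so that the SpeechRecognitionClient node can connect to it.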
Discussion
The simplest method consists of:
Read a monaural sound using the AudioStreamFromMic node
Connect the output of the AudioStreamFromMic node to the input of the PreEmphasis node, as shown in Figure 2.12
If you want to recognize separated sound from the GHDSS node, connect the output of the GHDSS node to the Synthesize node in Figure 2.12 or 2.13.
See Also
Since using JuliusMFT is almost the same as using Julius, the Julius manual may be useful. If you want to learn more about the features or models used by JuliusMFT , see Feature extraction or Acoustic and Language Models.
If you perform sound source localization and/or sound source separation, see the recipes entitled Sound recording fails, Sound source localization fails, Sound source separation fails, and Speech recognition fails.