2.4 Learning speech recognition

Problem

This is the first time I am trying to recognize speech with HARK.

Solution

Speech recognition with HARK consists of two main processes:

1. Feature extraction from an audio signal with HARK
2. Speech recognition with JuliusMFT

If you are performing speech recognition for the first time, it is easiest to start from the sample speech recognition networks shown in the Appendix and modify them.

Feature extraction

HARK supports MSLS and MFCC features. As an example, we will explain how to extract audio features consisting of MSLS, $\Delta$ MSLS, and $\Delta$ power, or of MFCC, $\Delta$ MFCC, and $\Delta$ power.

$\includegraphics[width=\textwidth ]{fig/recipes/LearningHARK-recog-msls.png}$

Figure 2.12: MSLS

$\includegraphics[width=\textwidth ]{fig/recipes/LearningHARK-recog-mfcc.png}$

Figure 2.13: MFCC

Figures 2.12 and 2.13 show network files that extract MSLS and MFCC features, respectively. Both use the PreEmphasis, MelFilterBank, Delta, and FeatureRemover nodes, together with either the MSLSExtraction or the MFCCExtraction node. The SpeechRecognitionClient node sends the extracted features to JuliusMFT over a socket connection. Speech recognition is performed for each sound source.
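To make the role of the PreEmphasis and Delta steps more concrete, the following is a minimal Python sketch of a first-order pre-emphasis filter and of regression-based $\Delta$ features. It is an illustration only and not HARK's implementation: the function names, the coefficient 0.97, the window of two frames on each side, and the edge handling (repeating the first and last frame) are all assumptions made for this example, and the actual nodes may use different parameters.

```python
from typing import List


def pre_emphasis(signal: List[float], coeff: float = 0.97) -> List[float]:
    """First-order high-pass filter: y[n] = x[n] - coeff * x[n-1].

    coeff = 0.97 is a common textbook value, assumed here for illustration.
    """
    if not signal:
        return []
    out = [signal[0]]
    for n in range(1, len(signal)):
        out.append(signal[n] - coeff * signal[n - 1])
    return out


def delta(frames: List[List[float]], window: int = 2) -> List[List[float]]:
    """Regression-based delta over +/- `window` frames.

    d[t] = sum_k k * (c[t+k] - c[t-k]) / (2 * sum_k k^2), k = 1..window.
    Frames beyond the edges are replaced by the first or last frame
    (an assumption of this sketch).
    """
    num_frames = len(frames)
    if num_frames == 0:
        return []
    norm = 2.0 * sum(k * k for k in range(1, window + 1))
    result = []
    for t in range(num_frames):
        row = []
        for d in range(len(frames[t])):
            acc = 0.0
            for k in range(1, window + 1):
                ahead = frames[min(t + k, num_frames - 1)][d]
                behind = frames[max(t - k, 0)][d]
                acc += k * (ahead - behind)
            row.append(acc / norm)
        result.append(row)
    return result


def append_delta(frames: List[List[float]]) -> List[List[float]]:
    """Concatenate each static feature vector with its delta vector."""
    return [static + dyn for static, dyn in zip(frames, delta(frames))]
```

For a feature that increases by 1 per frame, `delta` returns 1.0 in the middle of the sequence and smaller values near the edges, where the repeated boundary frames flatten the slope.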

To save features, use the SaveFeatures or SaveHTKFeatures node.
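Saved features can be inspected offline. The sketch below reads a file in the standard HTK parameter format (a 12-byte big-endian header followed by big-endian 32-bit floats). It assumes that the file written by SaveHTKFeatures is uncompressed and follows this layout; the function name `read_htk` is hypothetical and not part of HARK.

```python
import struct
from typing import List, Tuple

# HTK header: nSamples (int32), sampPeriod (int32, 100 ns units),
# sampSize (int16, bytes per frame), parmKind (int16), all big-endian.
HTK_HEADER = struct.Struct(">iihh")


def read_htk(path: str) -> Tuple[List[List[float]], int, int]:
    """Return (frames, sample_period, parm_kind) from an uncompressed HTK file."""
    with open(path, "rb") as f:
        header = f.read(HTK_HEADER.size)
        if len(header) != HTK_HEADER.size:
            raise ValueError("file too short for an HTK header")
        n_samples, samp_period, samp_size, parm_kind = HTK_HEADER.unpack(header)
        dim = samp_size // 4  # 4 bytes per float32 coefficient
        frames = []
        for _ in range(n_samples):
            raw = f.read(samp_size)
            if len(raw) != samp_size:
                raise ValueError("file is truncated")
            frames.append(list(struct.unpack(">%df" % dim, raw)))
    return frames, samp_period, parm_kind
```

Checking the number of frames and the vector dimension this way is a quick test that the feature dimension matches what the acoustic model expects.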

Speech recognition

JuliusMFT, which is based on Julius, is used to recognize the extracted features. If this is the first time you are using Julius, see the Julius web page and learn its basic usage.

To receive features from HARK over a socket connection, specify “mfcnet” as the input format. The following is an example:

-input mfcnet
-plugindir /usr/lib/julius_plugin
-notypecheck
-h hmmdefs
-hlist triphones
-gram sample
-v sample.dict

The first three lines are required to receive features from HARK.
Line 1 (-input mfcnet) makes JuliusMFT receive features over the socket connection.
Line 2 (-plugindir) specifies the directory of the plugin that enables the socket connection.
Line 3 (-notypecheck) skips the feature type check, which is needed for MSLS features.
The “-plugindir” option must be set correctly according to your environment.
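Because a wrong “-plugindir” is an easy mistake to make, a small check before launching JuliusMFT can help. The following Python sketch parses a configuration like the one above and reports missing HARK-related options. The helper names `parse_jconf` and `check_hark_jconf` are hypothetical, and the parser is deliberately simplified: it assumes that `#` starts a comment and treats any non-numeric token beginning with `-` as an option name.

```python
import os
import shlex
from typing import Callable, Dict, List


def _is_number(token: str) -> bool:
    try:
        float(token)
        return True
    except ValueError:
        return False


def parse_jconf(text: str) -> Dict[str, List[str]]:
    """Map each option name to the list of arguments that follow it."""
    options: Dict[str, List[str]] = {}
    current = None
    for line in text.splitlines():
        line = line.split("#", 1)[0]  # drop comments (simplified handling)
        for token in shlex.split(line):
            if token.startswith("-") and not _is_number(token):
                current = token
                options[current] = []
            elif current is not None:
                options[current].append(token)
    return options


def check_hark_jconf(
    options: Dict[str, List[str]],
    dir_exists: Callable[[str], bool] = os.path.isdir,
) -> List[str]:
    """Return a list of problems; an empty list means the check passed."""
    problems = []
    if options.get("-input") != ["mfcnet"]:
        problems.append("-input should be mfcnet to receive features from HARK")
    plugindir = options.get("-plugindir")
    if not plugindir:
        problems.append("-plugindir is missing")
    else:
        # Julius accepts a colon-separated list of plugin directories.
        for path in plugindir[0].split(":"):
            if not dir_exists(path):
                problems.append("plugin directory not found: " + path)
    if "-notypecheck" not in options:
        problems.append("-notypecheck is missing (needed for MSLS features)")
    return problems
```

A typical use is to read the configuration file, call `check_hark_jconf(parse_jconf(text))`, and print any reported problems before starting JuliusMFT.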

Discussion

The simplest method consists of the following steps:

1. Read monaural sound using the AudioStreamFromMic node.
2. Connect the output of the AudioStreamFromMic node to the input of the PreEmphasis node, as shown in Figure 2.12.

If you want to recognize separated sound from the GHDSS node, connect the output of the GHDSS node to the Synthesize node in Figure 2.12 or 2.13.