2.4 Learning speech recognition

Problem

This is the first time I am trying to recognize speech with HARK.

Solution

Speech recognition with HARK consists of two main processes.

  1. Feature extraction from an audio signal with HARK

  2. Speech recognition with JuliusMFT 

If you are performing speech recognition for the first time, it is better to modify the sample networks of speech recognition, as shown in the Appendix.

Feature extraction


MSLS and MFCC features are supported by HARK. As an example, we explain how to extract a feature vector consisting of MSLS, $\Delta $ MSLS, and $\Delta $ power, or of MFCC, $\Delta $ MFCC, and $\Delta $ power.
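The $\Delta $ (delta) features are typically computed by linear regression over neighboring frames, as in HTK and Julius. The following is a minimal sketch of that standard regression formula; the window width and edge handling used by HARK's Delta node are assumptions here, so check the node documentation before relying on them.

```python
import numpy as np

def delta(features, width=2):
    """Compute delta coefficients by linear regression over +/-width frames.

    features: (T, D) array of static features (e.g. MSLS per frame).
    Edge frames are padded by repeating the first/last frame.
    """
    T = features.shape[0]
    padded = np.concatenate([np.repeat(features[:1], width, axis=0),
                             features,
                             np.repeat(features[-1:], width, axis=0)])
    denom = 2 * sum(k * k for k in range(1, width + 1))
    out = np.zeros_like(features, dtype=float)
    for k in range(1, width + 1):
        # regression numerator: k * (c[t+k] - c[t-k]), summed over k
        out += k * (padded[width + k:width + k + T] - padded[width - k:width - k + T])
    return out / denom
```

For a linearly increasing feature the delta is its slope, and for a constant feature it is zero, which is a quick sanity check for any delta implementation.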

\includegraphics[width=\textwidth ]{fig/recipes/LearningHARK-recog-msls.png}
Figure 2.12: MSLS
\includegraphics[width=\textwidth ]{fig/recipes/LearningHARK-recog-mfcc.png}
Figure 2.13: MFCC

Figures 2.12 and 2.13 show network files that extract MSLS and MFCC features, respectively. The PreEmphasis, MelFilterBank, Delta, and FeatureRemover nodes are used, together with either the MSLSExtraction or the MFCCExtraction node. The SpeechRecognitionClient node sends the extracted features to JuliusMFT over a socket connection. Recognition is performed for each sound source.

To save features, use the SaveFeatures or SaveHTKFeatures node.

Speech Recognition


JuliusMFT, which is based on Julius, recognizes the extracted features. If this is your first time using Julius, see the Julius web page to learn its basic usage.

Use the “mfcnet” option as the input format when you want to receive features from HARK over a socket connection. The following is an example:

-input mfcnet
-plugindir /usr/lib/julius_plugin
-notypecheck
-h hmmdefs
-hlist triphones
-gram sample
-v sample.dict

The first three lines are required to receive features from HARK:
  Line 1 receives features over the socket connection,
  Line 2 loads the plugin that enables the socket connection, and
  Line 3 disables parameter type checking, which is required for MSLS features.
The “-plugindir” option must be set correctly according to your environment.
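SpeechRecognitionClient sends each feature frame to JuliusMFT over TCP. The actual mfcnet wire format is defined by the Julius plugin; as an illustrative sketch only, assume a simple length-prefixed little-endian layout. The packing functions below are hypothetical, meant to show the idea of framing float vectors for a socket; consult the plugin source for the real header fields (e.g. source IDs).

```python
import struct

def pack_frame(vec):
    """Pack one feature frame as int32 byte length + float32 values (little-endian).

    Hypothetical layout: the real mfcnet header carries additional fields,
    so verify against the julius_plugin source before relying on this.
    """
    payload = struct.pack("<%df" % len(vec), *vec)
    return struct.pack("<i", len(payload)) + payload

def unpack_frame(buf):
    """Inverse of pack_frame: recover the float vector from one packet."""
    (n,) = struct.unpack_from("<i", buf, 0)
    return list(struct.unpack_from("<%df" % (n // 4), buf, 4))
```

Length-prefixed framing like this is why the feature dimension configured in HARK must match what JuliusMFT expects; a mismatch silently corrupts frame boundaries.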

Discussion

The simplest configuration is the one shown in Figure 2.12 or 2.13, which recognizes speech directly from the input audio signal.

If you want to recognize separated sound from the GHDSS node, connect the output of the GHDSS node to the Synthesize node in Figure 2.12 or 2.13.

See Also

Since using JuliusMFT is almost the same as using Julius, the Julius manual may be useful. If you want to learn more about the features or models used with JuliusMFT, see the recipes Feature extraction and Acoustic and Language Models.

If you perform sound source localization and/or sound source separation, see the recipes entitled Sound recording fails, Sound source localization fails, Sound source separation fails, and Speech recognition fails.