2.4 Learning speech recognition


This is the first time I am trying to recognize speech with HARK.


Speech recognition with HARK consists of two main processes.

  1. Feature extraction from an audio signal with HARK

  2. Speech recognition with JuliusMFT 

The files and parameter settings necessary for this process are complex, so instead of creating everything from scratch, it is better to start from the sample networks and apply the appropriate changes directly, as shown in the Appendix.

2.4.1 Feature extraction

This section describes how to create a network that extracts the speech features supported by HARK, namely MSLS and MFCC. Specifically, it covers the creation of networks that extract the commonly used feature sets MSLS, $\Delta $ MSLS, and $\Delta $ power, or MFCC, $\Delta $ MFCC, and $\Delta $ power.
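The $\Delta $ features are time derivatives of the static features. As background, the following Python sketch illustrates the standard linear-regression delta computation over a window of neighboring frames; the window width and the toy input are assumptions for illustration, and HARK's Delta node may use different parameters internally.

```python
import numpy as np

def delta(features, theta=2):
    """Linear-regression delta over a window of +/- theta frames.

    features: (T, D) array of static features (e.g. MSLS or MFCC).
    Edge frames are handled by repeating the first/last frame.
    """
    T, D = features.shape
    # Pad by repeating edge frames so every t has theta neighbors on each side.
    padded = np.concatenate([np.repeat(features[:1], theta, axis=0),
                             features,
                             np.repeat(features[-1:], theta, axis=0)])
    denom = 2 * sum(tau * tau for tau in range(1, theta + 1))
    out = np.zeros_like(features)
    for t in range(T):
        p = t + theta  # index of frame t inside the padded array
        out[t] = sum(tau * (padded[p + tau] - padded[p - tau])
                     for tau in range(1, theta + 1)) / denom
    return out

# For a linear ramp, the delta of every interior frame equals the slope.
ramp = np.arange(10.0).reshape(-1, 1) * 0.5   # static feature with slope 0.5
d = delta(ramp)
```

The same regression form (with its own window width) underlies the delta coefficients used by most speech front ends.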

\includegraphics[width=\textwidth ]{fig/recipes/LearningHARK-recog-msls.png}
Figure 2.13: MSLS
\includegraphics[width=\textwidth ]{fig/recipes/LearningHARK-recog-mfcc.png}
Figure 2.14: MFCC

Figures 2.13 and 2.14 show network files that extract MSLS and MFCC features, respectively. The PreEmphasis, MelFilterBank, Delta, and FeatureRemover nodes, together with either MSLSExtraction or MFCCExtraction, are used to extract the features. The SpeechRecognitionClient node sends the extracted features to JuliusMFT over a socket connection. In addition to the features, the sound source localization result is also required so that each sound source can be recognized.
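SpeechRecognitionClient thus streams one feature vector per frame to JuliusMFT over TCP. The exact mfcnet wire format is not reproduced here; the Python sketch below only illustrates the general idea of socket-based feature transfer, and its framing (a length prefix followed by 32-bit floats) is an assumption for illustration, not the real protocol.

```python
import socket
import struct

# Toy framing (an assumption, NOT the real mfcnet protocol):
# each message is a little-endian uint32 dimension, then that many float32 values.

def send_feature(sock, vec):
    payload = struct.pack("<I", len(vec)) + struct.pack(f"<{len(vec)}f", *vec)
    sock.sendall(payload)

def recv_exact(sock, n):
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed")
        buf += chunk
    return buf

def recv_feature(sock):
    (dim,) = struct.unpack("<I", recv_exact(sock, 4))
    return list(struct.unpack(f"<{dim}f", recv_exact(sock, 4 * dim)))

client, server = socket.socketpair()      # stands in for HARK -> JuliusMFT
send_feature(client, [0.5, -1.25, 3.0])   # e.g. one (tiny) feature frame
frame = recv_feature(server)
client.close(); server.close()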

It is possible to check whether the feature extraction succeeds by saving the features to a file using the SaveFeatures or SaveHTKFeatures node.
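A saved feature file can then be inspected offline. The sketch below assumes the dump contains raw little-endian 32-bit floats with no header (check the SaveFeatures documentation for the exact format of your version); the feature dimension and the filename are assumptions you must adapt to your network.

```python
import numpy as np
import tempfile, os

DIM = 27  # assumed feature dimension; set this to your network's value

# Create a stand-in dump: 5 frames of DIM raw little-endian float32 values.
frames = np.arange(5 * DIM, dtype="<f4").reshape(5, DIM)
path = os.path.join(tempfile.mkdtemp(), "source_0.spec")  # illustrative name
frames.tofile(path)

# Read the dump back and reshape it into (num_frames, DIM) for inspection.
data = np.fromfile(path, dtype="<f4")
assert data.size % DIM == 0, "file size is not a multiple of the feature dimension"
loaded = data.reshape(-1, DIM)
```

If the reshaped array has the expected number of rows and plausible values, the extraction network is producing features.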

2.4.2 Speech Recognition

JuliusMFT , which is based on the speech recognition engine Julius, is used for speech recognition. Users who have no experience with Julius should see the Julius website to learn its basic usage.

It is necessary to set the input format in the configuration file to “mfcnet” in order to receive and recognize the features extracted by HARK over a socket connection. Below is an example of the settings:

-input mfcnet 
-plugindir /usr/lib/julius_plugin 
-h hmmdefs 
-hlist triphones 
-gram sample 
-v sample.dict

The first three lines above are necessary to receive features from HARK:
  Line 1 tells JuliusMFT to receive features over a socket connection,
  Line 2 sets the installation path of the plugin that enables socket connections, and
  Line 3 lets JuliusMFT check the type of the MSLS features extracted with HARK.
The “-plugindir” option must be set according to your environment.


To recognize separated sound, the simplest method is to connect the output of the GHDSS node (in the frequency domain) to the Synthesize node, as shown in Figure 2.13 or 2.14.

See Also

Since the usage of JuliusMFT is mostly the same as that of Julius, the Julius website can be used as a reference. To learn more about the features and models used in JuliusMFT , see the chapters Feature extraction and Acoustic and Language Models.

For troubleshooting sound source localization and/or sound source separation, see the following recipes: Sound recording fails, Sound source localization fails, Sound source separation fails, and Speech recognition fails.