2.4 Learning speech recognition

Problem

This is the first time I am trying to recognize speech with HARK.

Solution

Speech recognition with HARK consists of two main processes.

Feature extraction from an audio signal with HARK
Speech recognition with JuliusMFT

Because the files and parameter settings required for these processes are complex, it is better to start from the sample networks and apply the necessary changes to them directly than to create everything from scratch, as shown in the samples Appendix.

2.4.1 Feature extraction

This section describes how to create a network that extracts from speech the features supported by HARK, which are MSLS and MFCC. Specifically, it covers networks that extract the commonly used feature sets: MSLS, $\Delta$ MSLS, and $\Delta$ power, or MFCC, $\Delta$ MFCC, and $\Delta$ power.

$\includegraphics[width=\textwidth ]{fig/recipes/LearningHARK-recog-msls.png}$

Figure 2.13: MSLS

$\includegraphics[width=\textwidth ]{fig/recipes/LearningHARK-recog-mfcc.png}$

Figure 2.14: MFCC

Figures 2.13 and 2.14 show network files that extract MSLS and MFCC features, respectively. The features are extracted with the PreEmphasis, MelFilterBank, Delta, and FeatureRemover nodes, together with either the MSLSExtraction or the MFCCExtraction node. The SpeechRecognitionClient node sends the extracted features to JuliusMFT over a socket connection. In addition to the features, the sound source localization result is also required as input so that each sound source can be recognized separately.
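To illustrate what the $\Delta$ features represent, the following is a minimal Python sketch of a generic regression-based delta computation over a sequence of feature frames. It is not taken from HARK: the function name `delta`, the window size of 2 frames, and the edge handling (repeating the first and last frame) are all assumptions made for illustration, and the Delta node may use different settings.

```python
def delta(frames, window=2):
    """Generic regression-based delta of a feature sequence.

    frames: list of feature vectors (one list of floats per frame).
    window: number of frames on each side used in the regression
            (assumed value, not taken from HARK).
    """
    n = len(frames)
    # Normalization term of the linear regression over +/- window frames.
    denom = 2 * sum(k * k for k in range(1, window + 1))
    out = []
    for t in range(n):
        vec = []
        for d in range(len(frames[t])):
            acc = 0.0
            for k in range(1, window + 1):
                # Clamp indices so the edges reuse the first/last frame.
                nxt = frames[min(t + k, n - 1)][d]
                prv = frames[max(t - k, 0)][d]
                acc += k * (nxt - prv)
            vec.append(acc / denom)
        out.append(vec)
    return out
```

For a feature that rises by a constant amount per frame, the interior delta values equal that amount, which is a quick sanity check on the time derivative interpretation.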

It is possible to check whether the feature extraction succeeded by saving the features to a file using the SaveFeatures or SaveHTKFeatures node.
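As one way to inspect a saved file, the sketch below reads a file in the standard HTK parameter format: a 12-byte header (number of frames, sample period in 100 ns units, bytes per frame, parameter kind) followed by 4-byte floats. It assumes that the file written by SaveHTKFeatures follows this layout with big-endian byte order and uncompressed float data; the function name `read_htk` is hypothetical.

```python
import struct


def read_htk(path):
    """Read an uncompressed HTK-format feature file.

    Assumes big-endian byte order and 4-byte float values, as in the
    standard HTK parameter file layout.
    Returns (header, frames), where frames is a list of float lists.
    """
    with open(path, "rb") as f:
        raw_header = f.read(12)
        if len(raw_header) != 12:
            raise ValueError("file too short for an HTK header")
        n_samples, samp_period, samp_size, parm_kind = struct.unpack(
            ">iihh", raw_header
        )
        dim = samp_size // 4  # number of float values per frame
        frames = []
        for _ in range(n_samples):
            raw = f.read(samp_size)
            if len(raw) != samp_size:
                raise ValueError("truncated frame data")
            frames.append(list(struct.unpack(">%df" % dim, raw)))
    header = {
        "n_samples": n_samples,
        "samp_period": samp_period,
        "samp_size": samp_size,
        "parm_kind": parm_kind,
        "dim": dim,
    }
    return header, frames

# Example (hypothetical file name):
#   header, frames = read_htk("features.htk")
#   print(header["n_samples"], header["dim"])
```

If the number of frames and the dimension match what the network is expected to produce, and the values are finite, the extraction is probably working.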

2.4.2 Speech Recognition

JuliusMFT, which is based on the speech recognition engine Julius, is used for speech recognition. Users who have no experience with Julius should see the Julius website to learn its basic usage.

To receive and recognize the features extracted by HARK over a socket connection, the input format in the settings file must be set to “mfcnet”. Below is an example of the settings:

\begin{verbatim}
-input mfcnet
-plugindir /usr/lib/julius_plugin
-notypecheck
-h hmmdefs
-hlist triphones
-gram sample
-v sample.dict
\end{verbatim}

The first three lines above are necessary to receive features from HARK.
  Line 1 receives features from the socket connection,
  Line 2 sets the installation path of the plugin that enables the socket connection,
  Line 3 disables the feature type check in JuliusMFT so that the MSLS features extracted with HARK are accepted.
The “-plugindir” option must be set correctly according to the environment.
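Because a missing option among these three is an easy mistake to make, a small check such as the following Python sketch can help before starting JuliusMFT. It is only an illustration: the function name `check_jconf` is hypothetical, and the parser assumes one option per line with `#` starting a comment, which covers the example above but not every possible settings file.

```python
def check_jconf(text):
    """Return a list of problems that would prevent receiving HARK features.

    Simplified parsing (an assumption): one option per line, and anything
    after '#' is treated as a comment.
    """
    options = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line.startswith("-"):
            continue
        parts = line.split(None, 1)
        options[parts[0]] = parts[1].strip() if len(parts) > 1 else ""

    problems = []
    if options.get("-input") != "mfcnet":
        problems.append("-input must be set to mfcnet")
    if not options.get("-plugindir"):
        problems.append("-plugindir is missing or has no path")
    if "-notypecheck" not in options:
        problems.append("-notypecheck is missing")
    return problems


sample = """\
-input mfcnet
-plugindir /usr/lib/julius_plugin
-notypecheck
-h hmmdefs
-hlist triphones
-gram sample
-v sample.dict
"""
problems = check_jconf(sample)
```

An empty list means the three options are present; the check does not verify that the plugin directory actually exists on the machine.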

Discussion

The simplest method consists of:

Read monaural sound using the AudioStreamFromMic node (or the AudioStreamFromWave node when reading from a file)
Connect the output of that node to the input of the PreEmphasis node (in time domain), as shown in Figure 2.13

To recognize separated sound, connect the output of the GHDSS node (in frequency domain) to the Synthesize node as shown in Figure 2.13 or 2.14.