2.2 Learning sound localization

Problem

I want to perform source localization in HARK but I don’t know where to start.

Solution

(1) Source localization of an audio file

$\includegraphics[width=.5\linewidth ]{fig/recipes/LearningHARK_002_01_1}$

(2.5.a) MAIN Subnetwork

$\includegraphics[width=.8\linewidth ]{fig/recipes/LearningHARK_002_01_2}$

(2.5.b) Iterator Subnetwork

Figure 2.5: HARK network file for sound source localization using a .wav file

Fig. 2.5 shows an example of a HARK network file for sound source localization using a .wav file input. The .wav file contains multi-channel signals recorded by a microphone array. The network localizes the sound sources and displays their locations.
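Before running a network on a recording, it can help to confirm that the .wav file really is multi-channel. The following is a minimal sketch using Python's standard wave module, not part of HARK. The helper name describe_wav and the synthetic 8-channel, 16 kHz demo file are assumptions made for illustration; replace demo_path with the path of your own recording. Note that the wave module reads only uncompressed PCM files.

```python
import os
import tempfile
import wave


def describe_wav(path):
    """Return (channels, sample_rate, samples_per_channel) of a PCM .wav file."""
    with wave.open(path, "rb") as w:
        return w.getnchannels(), w.getframerate(), w.getnframes()


# Build a small synthetic file so the sketch runs on its own.
# 8 channels, 16-bit samples, 16 kHz, 160 samples per channel (all hypothetical).
demo_path = os.path.join(tempfile.mkdtemp(), "demo_8ch.wav")
with wave.open(demo_path, "wb") as w:
    w.setnchannels(8)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(b"\x00" * (8 * 2 * 160))

channels, rate, n_samples = describe_wav(demo_path)
print(channels, rate, n_samples)

# A microphone-array recording should report more than one channel.
is_multichannel = channels > 1
```

If the reported channel count is 1, the file is not a microphone-array recording and cannot be used for localization.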

For the node property settings in the network file, see Section 6.2 in the HARK document.

Sample HARK network files that include sound source localization are provided in Samples.

For a first simple test, download the “HARK Automatic Speech Recognition Pack” that matches the version of HARK you are using or contains the model you want to use. Unzip the downloaded file and run the following command in the unzipped directory.

harkmw sep_rec_offline.n 2SPK-jp.wav

The localization result will then be displayed, as shown in Fig. 2.6. If the window appears and the localization result is displayed, the network is working correctly.

$\includegraphics[width=90mm]{fig/recipes/LearningHARK_002_02_1.eps}$

Figure 2.6: Snapshot of the sound source localization result using sep_rec_offline.n

(2) Real-time sound source localization from a microphone

Fig. 2.7 shows an example of a HARK network file for real-time sound source localization using a microphone array.

$\includegraphics[width=.3\linewidth ]{fig/recipes/LearningHARK_002_03_1}$

(2.7.a) MAIN Subnetwork

$\includegraphics[width=.8\linewidth ]{fig/recipes/LearningHARK_002_03_2}$

(2.7.b) Iterator Subnetwork

Figure 2.7: HARK network file for sound source localization using a microphone array

Here, AudioStreamFromWave in Fig. 2.5 is replaced by AudioStreamFromMic. By properly setting the parameters of AudioStreamFromMic, sound sources can be localized in real time using a microphone array. For the settings of these parameters, see Section 6.2 in the HARK document. If the network file works properly, the localization result is displayed as in Fig. 2.6. If it does not work properly, read the recipe “Sound recording fails” or “Sound source localization fails”.

(3) Sound source localization with suppression of constant noise

The sound source localization shown in Fig. 2.5 and Fig. 2.7 cannot determine which sound sources are the desired ones. If there are several high-power noise sources in the environment, LocalizeMUSIC may localize only the noise. In the worst case, it cannot localize speech at all, which drastically degrades the performance of automatic speech recognition.

This is especially true for automatic speech recognition with a robot-embedded microphone array, where high-power noise sources such as the robot's motors and fans degrade the performance of the entire system.

To solve this problem, HARK supports a pre-measured noise suppression function in sound source localization. Enabling this function takes two steps:

(3-1) Generation of pre-measured noise files for localization
(3-2) Sound source localization with these noise files

The next two sections explain (3-1) and (3-2), respectively.

(3-1) Generation of pre-measured noise files for localization

$\includegraphics[width=.6\linewidth ]{fig/recipes/LearningHARK_002_04_1}$

(2.8.a) MAIN Subnetwork

$\includegraphics[width=.8\linewidth ]{fig/recipes/LearningHARK_002_04_2}$

(2.8.b) Iterator Subnetwork

Figure 2.8: HARK network file for generating the noise files for sound source localization

Fig. 2.8 shows an example of a HARK network file for generating pre-measured noise files for sound source localization. For the parameter settings of the HARK nodes, see Section 6.2 in the HARK document. The Iterator (LOOP0) subnetwork in Fig. 2.8 has 3 Constant nodes, an IterCount node, a Smaller node, and an Equal node. The parameter settings for those nodes are:

node_Constant_1
- VALUE
  int type. VALUE = 200.
  This is the number of frames, counted from the first frame, that are used to generate the noise files.
node_Constant_2
- VALUE
  string type. VALUE = NOISEr.dat.
  File name for the real part of the noise file.
node_Constant_3
- VALUE
  string type. VALUE = NOISEi.dat.
  File name for the imaginary part of the noise file.
node_IterCount_1
- No parameter
  This outputs the index of the HARK processing frame.
node_Smaller_1
- No parameter
  This determines whether the index of the HARK processing frame is smaller than a specific number.
node_Equal_1
- No parameter
  This determines whether the index of the HARK processing frame is equal to a specific number.

Here, the VALUE of node_Constant_1 is set to 200, so MAX_SUM_COUNT in CMMakerFromFFTwithFlag should be set to a value greater than 200.
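The role of the IterCount, Smaller, and Equal nodes can be sketched in a few lines of Python. This is an illustration of the control logic only, not HARK code. It assumes that the Smaller output gates the accumulation of frames, that the Equal output triggers the saving of the noise files, and that frame indices start at 0; the function name frame_flags and the 300-frame recording are hypothetical.

```python
NUM_NOISE_FRAMES = 200  # VALUE of node_Constant_1


def frame_flags(frame_index, limit=NUM_NOISE_FRAMES):
    """Return (accumulate, save) flags for one processing frame.

    accumulate mirrors the Smaller node: frame_index < limit.
    save mirrors the Equal node: frame_index == limit.
    """
    accumulate = frame_index < limit
    save = frame_index == limit
    return accumulate, save


# Walk through a hypothetical 300-frame recording (0-based indices assumed).
accumulated = sum(1 for i in range(300) if frame_flags(i)[0])
save_frames = [i for i in range(300) if frame_flags(i)[1]]
print(accumulated, save_frames)  # prints: 200 [200]
```

Under these assumptions, frames 0 to 199 are accumulated and the files are written once, at frame 200.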

This network file takes as input a .wav file containing only noise. Depending on the VALUE of node_Constant_1, it generates the noise files from a certain number of frames.

When the network file is executed, two files, NOISEr.dat and NOISEi.dat, are generated in the current working directory. These two files are used for sound source localization with the noise suppression function.

In this example, the first 200 frames are used to generate the noise files. By using a condition other than that of the Smaller node, different frames can be selected for generation.
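To give an intuition for what the real-part and imaginary-part files hold, the sketch below averages a complex correlation matrix over a few noise frames for a single frequency bin and then splits it into its real and imaginary parts. It is a conceptual illustration in plain Python, not HARK's implementation, and it does not reproduce the format of NOISEr.dat or NOISEi.dat. The function names and the two-channel input values are assumptions.

```python
def correlation_matrix(frames):
    """Average x * x^H over frames for one frequency bin.

    Each frame is a list of complex spectral values, one per channel.
    """
    n_ch = len(frames[0])
    acc = [[0j] * n_ch for _ in range(n_ch)]
    for x in frames:
        for i in range(n_ch):
            for j in range(n_ch):
                acc[i][j] += x[i] * x[j].conjugate()
    n = len(frames)
    return [[value / n for value in row] for row in acc]


def split_parts(matrix):
    """Split a complex matrix into separate real and imaginary matrices."""
    real = [[value.real for value in row] for row in matrix]
    imag = [[value.imag for value in row] for row in matrix]
    return real, imag


# Two hypothetical noise frames from a two-channel array.
noise_frames = [[1 + 1j, 1 - 1j], [2 + 0j, 0 + 2j]]
K = correlation_matrix(noise_frames)
K_real, K_imag = split_parts(K)
print(K_real)
print(K_imag)
```

The resulting matrix is Hermitian, so its real part is symmetric and its imaginary part is antisymmetric.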

(3-2) Sound source localization with the noise files

$\includegraphics[width=.5\linewidth ]{fig/recipes/LearningHARK_002_05_1}$

(2.9.a) MAIN Subnetwork

$\includegraphics[width=.9\linewidth ]{fig/recipes/LearningHARK_002_05_2}$

(2.9.b) Iterator Subnetwork

Figure 2.9: HARK network file for sound source localization with pre-measured noise suppression

Fig. 2.9 shows an example of a HARK network file for sound source localization using the noise files generated in (3-1), NOISEr.dat and NOISEi.dat. For the parameter settings of the HARK nodes, see Section 6.2 of the HARK document. The Iterator (LOOP0) subnetwork in Fig. 2.9 has 3 Constant nodes, and the parameter settings for those nodes are:

node_Constant_1
- VALUE
  string type. VALUE = NOISEr.dat.
  File name for the real part of the loaded noise file.
node_Constant_2
- VALUE
  string type. VALUE = NOISEi.dat.
  File name for the imaginary part of the loaded noise file.
node_Constant_3
- VALUE
  int type. VALUE = 0.
  This flag controls whether the noise information is updated every frame. If 0, the noise files are loaded only at the first frame.
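The effect of the node_Constant_3 flag described above can be sketched as follows. This is only an illustration of the stated behavior, not HARK code; the function name should_load_noise, the parameter name update_flag, and the 0-based frame index are assumptions.

```python
def should_load_noise(frame_index, update_flag):
    """Return True when the noise files would be (re)loaded at this frame.

    The first frame always loads; later frames load only if the flag is non-zero.
    """
    return frame_index == 0 or update_flag != 0


# Frames at which loading happens over five hypothetical frames.
loads_with_flag_0 = [i for i in range(5) if should_load_noise(i, 0)]
loads_with_flag_1 = [i for i in range(5) if should_load_noise(i, 1)]
print(loads_with_flag_0)  # prints: [0]
print(loads_with_flag_1)  # prints: [0, 1, 2, 3, 4]
```

With VALUE = 0, as in this recipe, the noise information is read once and reused for all subsequent frames.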

CMLoad reads the noise files, NOISEr.dat and NOISEi.dat, and whitens the noise in sound source localization. To enable the noise suppression function, set MUSIC_ALGORITHM in LocalizeMUSIC to GEVD or GSVD. The details of the algorithm for the noise suppression are described in Section 6.2 of the HARK document.
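As a rough guide to what whitening means here, the standard eigenvalue problem of MUSIC can be compared with its generalized form. The notation below is an assumption made for this sketch: R(\omega) denotes the correlation matrix of the input signal at frequency \omega, K(\omega) the noise correlation matrix loaded from the noise files, and \lambda_m, e_m the m-th eigenvalue and eigenvector. See the HARK document for the exact formulation used by LocalizeMUSIC.

```latex
% Standard MUSIC: eigenvalue decomposition of the input correlation matrix
\[
  R(\omega)\, e_m(\omega) = \lambda_m(\omega)\, e_m(\omega)
\]

% GEVD variant: generalized eigenvalue decomposition with the noise correlation matrix
\[
  R(\omega)\, e_m(\omega) = \lambda_m(\omega)\, K(\omega)\, e_m(\omega)
\]
```

If K(\omega) is the identity matrix, the second equation reduces to the first. If the input contains only the pre-measured noise, so that R(\omega) = K(\omega), every generalized eigenvalue equals 1 and no direction stands out, which is the sense in which the noise is whitened.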

When the HARK network file is executed, the sound source localization results will be displayed, similar to Fig. 2.6. Compared with localization without noise suppression, the results are noticeably more focused on speech.

Discussion

For full details of the algorithm and noise suppression in LocalizeMUSIC, see Section 6.2 in the HARK document. To increase accuracy, read the recipe in Chapter 8 or the descriptions of the LocalizeMUSIC and SourceTracker nodes in the HARK document, and tune their parameters.