2.3 Learning sound separation


I want to separate a sound signal coming from a specific location.


To perform sound source separation in HARK, you need a network file and either a transfer function in HGTF binary format (used by the separation module GHDSS) or a microphone position file (HARK text format).

The following four items are required to create a network file for sound source separation:

Audio signal acquisition:

The AudioStreamFromMic or AudioStreamFromWave node is used to input the audio signal.

Source location:

The ConstantLocalization, LoadSourceLocation, or LocalizeMUSIC node is used to specify the location of the sound to be separated. Use the ConstantLocalization node for simple separation from a fixed direction, or the LocalizeMUSIC node to separate while performing localization online.

Sound source separation:

GHDSS is the node used for sound source separation. Its inputs are the source locations and the audio signal; its output is the separated signals. The GHDSS node requires a transfer function, which is either loaded from an HGTF binary format file or calculated from a microphone position file.

Saving the separated signal:

Since the separated signal is in the frequency domain, use the Synthesize node to convert it back to the time domain before saving it with the SaveRawPCM or SaveWavePCM node.
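As a rough intuition for the separation step above: GHDSS is an adaptive algorithm (geometric constraints plus higher-order decorrelation), so the sketch below is not GHDSS itself. It only illustrates the underlying idea of frequency-domain separation with steering vectors, using a simple fixed delay-and-sum beamformer. All function names are illustrative, not HARK API.

```python
# Conceptual sketch only: a fixed delay-and-sum beamformer that emphasizes
# sound arriving from a given direction. GHDSS is a more sophisticated,
# adaptive method; this merely shows how a source location and steering
# vectors combine with multichannel spectra in the frequency domain.
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def steering_vector(mic_positions, direction, freq):
    """Phase terms for a plane wave from a unit direction vector."""
    delays = mic_positions @ direction / SPEED_OF_SOUND  # seconds per mic
    return np.exp(-2j * np.pi * freq * delays)

def delay_and_sum(spectra, mic_positions, direction, freqs):
    """spectra: (n_mics, n_bins) complex STFT frame -> (n_bins,) output frame."""
    out = np.zeros(spectra.shape[1], dtype=complex)
    for k, f in enumerate(freqs):
        a = steering_vector(mic_positions, direction, f)
        out[k] = np.conj(a) @ spectra[:, k] / len(a)  # align phases, average
    return out
```

Sound from the steered direction adds coherently, while sound from other directions is attenuated; GHDSS improves on this by adapting the separation matrix over time.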

Post-processing of the separated signal can also be performed in HARK. It is optional, but may be needed depending on the environment or the intended use of the separated signal.


Noise suppression can be applied to the separated signal by using the PowerCalcForMap, HRLE, CalcSpecSubGain, EstimateLeak, CalcSpecAddPower, and SpectralGainFilter nodes.
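The idea behind this post-processing chain can be sketched as follows. This is not HARK's exact algorithm: HRLE estimates the noise level with a histogram-based method, while the stand-in below uses a crude per-bin minimum as the noise floor. It only shows the general pattern of estimating a noise spectrum, deriving a per-bin gain, and applying it to the separated spectrum.

```python
# Simplified spectral-gain noise suppression (illustrative stand-in for the
# HRLE -> CalcSpecSubGain -> SpectralGainFilter chain; not HARK's algorithm).
import numpy as np

def suppress_noise(power_frames, floor=0.1):
    """power_frames: (n_frames, n_bins) power spectra of the separated signal."""
    noise = power_frames.min(axis=0)              # crude noise-floor estimate
    gains = np.clip(1.0 - noise / np.maximum(power_frames, 1e-12), floor, 1.0)
    return power_frames * gains                   # attenuated power spectra
```

Bins dominated by the stationary noise floor are pushed down toward the gain floor, while bins with strong signal energy are left almost untouched.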

Figures 2.10, 2.11, and 2.12 show sample networks. Fig. 2.11 shows separation without post-processing, while Fig. 2.12 shows separation with post-processing. Sound source separation can be performed by connecting a GHDSS node in the network file created in the Learning sound localization recipe, then setting the HGTF or the microphone position file in its parameters. To listen to the separated signal, which is in the frequency domain, use the Synthesize node to convert it back to the time domain before saving it with the SaveRawPCM or SaveWavePCM node.

Figure 2.10: MAIN
\includegraphics[width=\textwidth ]{fig/recipes/LearningHARK-separation-ghdss.png}
Figure 2.11: MAIN_LOOP (without post-processing)
\includegraphics[width=\textwidth ]{fig/recipes/LearningHARK-separation-hrle.png}
Figure 2.12: MAIN_LOOP (with post-processing)

When these networks are executed, simultaneous speech from two speakers is separated and saved in sep_0.wav, sep_1.wav, ...
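The Synthesize step used in these networks converts frequency-domain frames back into a waveform. Conceptually this is an inverse FFT with overlap-add, as in the sketch below. The frame and shift sizes match HARK's usual defaults (LENGTH 512, ADVANCE 160), but the function itself is illustrative; the real Synthesize node also applies a synthesis window.

```python
# Sketch of frequency-domain -> time-domain resynthesis by overlap-add,
# the operation the Synthesize node performs conceptually. Illustrative
# only: HARK's implementation additionally windows each frame.
import numpy as np

def overlap_add(frames_spectrum, frame_len=512, shift=160):
    """frames_spectrum: (n_frames, frame_len//2 + 1) one-sided spectra."""
    n_frames = frames_spectrum.shape[0]
    signal = np.zeros(shift * (n_frames - 1) + frame_len)
    for i, spec in enumerate(frames_spectrum):
        frame = np.fft.irfft(spec, n=frame_len)   # back to time domain
        signal[i * shift : i * shift + frame_len] += frame
    return signal
```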


Offline / Online

For online separation, replace the AudioStreamFromWave node with the AudioStreamFromMic node.

Specific direction / Estimated direction

By connecting the output of the LocalizeMUSIC node to the INPUT_SOURCES input of the GHDSS node, the estimated direction of the sound source is used for separation. When the ConstantLocalization node is connected instead, only sound from the specified fixed direction is separated. In addition, when the output of LocalizeMUSIC is stored with the SaveSourceLocation node, it can later be loaded with the LoadSourceLocation node.

Measurement-based / Calculation-based transfer function

The parameters should be set according to the transfer function to be used. To use a measurement-based transfer function, set TF_CONJ to “DATABASE” and TF_CONJ_FILENAME to the file name of the transfer function. To use a transfer function calculated from the microphone positions, set TF_CONJ to “CALC” and MIC_FILENAME to the file name of the microphone position file. To use a device other than Kinect, or to change the geometry of the microphone array, different localization and separation transfer functions are necessary.

Each parameter in the sample has already been tuned in advance, so separation performance may deteriorate if the environment differs. To learn more about parameter tuning, see the recipe Sound Source Separation.

See Also

To learn more about transfer function and microphone position files, see the file format chapter of the HARK document. If there are problems with the separation process, see the recipe “Sound source separation fails.” Since sound source separation runs after source localization, it is important to confirm that the preceding sound recording and source localization steps work correctly. For sound recording and source localization, the recipes “Learning sound recording”, “Learning source localization”, “Sound recording fails”, and “Sound source localization fails” may be helpful.