HARK Document Version 3.4.0. (Revision: 9509) : ML

6.3.8 ML

6.3.8.1 Outline of the node

Performs sound source separation by maximum likelihood (ML) estimation. Assuming that the input signal is the sum of a single target sound source and Gaussian noise, the algorithm obtains the separation matrix that maximizes the likelihood function. The node requires the transfer functions from the sound source to the microphones, the period information of the sound source (the result of speech period detection), and a correlation matrix of the known noise.

Node inputs are:

Multi-channel complex spectrum of mixed sound,
Direction of localized sound sources,
A correlation matrix of known noise.

The node outputs the complex spectrum of each separated sound.

6.3.8.2 Necessary file

Table 6.55: Necessary files for ML

Corresponding parameter name	Description
TF_CONJ_FILENAME	Transfer function of a microphone array

6.3.8.3 Usage

When to use

This node uses a microphone array to separate the sound arriving from a given sound source direction. The sound source direction can be either a value estimated by sound source localization or a constant value.

Typical connection

Figure 6.71 shows a connection example of the ML node. The node has three inputs:

INPUT_FRAMES takes a multi-channel complex spectrum containing the mixture of sounds, produced by, for example, MultiFFT,
INPUT_SOURCES takes the results of sound source localization, produced by, for example, LocalizeMUSIC or ConstantLocalization,
INPUT_NOISE_CM takes a correlation matrix of known noise, produced by, for example, CMLoad.

The output is the separated signals.

$\includegraphics[width=.8\textwidth ]{fig/modules/ML.png}$

Figure 6.71: Network Example using ML

6.3.8.4 Input-output and property of the node

Input

INPUT_FRAMES: Matrix<complex<float> > type. Multi-channel complex spectra, one per microphone input waveform. The rows correspond to the channels and the columns correspond to the frequency bins.
INPUT_SOURCES: Vector<ObjectRef> type. A Vector of Source type objects in which sound source localization results are stored. Typically, this input takes the output of a SourceIntervalExtender node connected to SourceTracker.
INPUT_NOISE_CM: Matrix<complex<float> > type. A correlation matrix for each frequency bin. Each row corresponds to one frequency bin ( $NFFT / 2 + 1$ rows) and holds the elements of the $M \times M$ complex square correlation matrix for that bin ( $M * M$ columns).
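The layout of INPUT_NOISE_CM described above can be illustrated with a small NumPy sketch that turns the flattened $(NFFT/2+1) \times (M * M)$ array back into one $M \times M$ matrix per frequency bin. The function name `unpack_noise_cm` is hypothetical, and the row-major ordering of the $M * M$ columns is an assumption; check it against the file written by the node that produced the matrix.

```python
import numpy as np

def unpack_noise_cm(flat_cm, n_mics):
    """Reshape a (bins, M*M) array into (bins, M, M).

    Assumption: the M*M columns of each row are stored in row-major order.
    This is an illustrative helper, not part of the HARK API.
    """
    flat_cm = np.asarray(flat_cm)
    n_bins, n_cols = flat_cm.shape
    if n_cols != n_mics * n_mics:
        raise ValueError("expected %d columns, got %d" % (n_mics * n_mics, n_cols))
    return flat_cm.reshape(n_bins, n_mics, n_mics)

# Example: LENGTH = 512 gives 257 bins; with 8 microphones each row has 64 columns.
flat = np.zeros((512 // 2 + 1, 8 * 8), dtype=np.complex64)
per_bin = unpack_noise_cm(flat, n_mics=8)
```

Here `per_bin[k]` is the correlation matrix used for the $k$-th frequency bin.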

Output

OUTPUT: Map<int, ObjectRef> type. Each entry is a pair of a sound source ID and the complex spectrum of the corresponding separated sound (Vector<complex<float> > type). One entry is output for each sound source.

Parameter

LENGTH: int type. Analysis frame length [samples], which must be equal to the value used at a preceding node (e.g. AudioStreamFromMic or MultiFFT). The default is 512.
ADVANCE: int type. Shift length of a frame [samples], which must be equal to the value used at a preceding node (e.g. AudioStreamFromMic or MultiFFT). The default is 160.
SAMPLING_RATE: int type. Sampling frequency of the input waveform [Hz]. The default is 16000.
LOWER_BOUND_FREQUENCY: int type. The minimum frequency [Hz] used for separation processing. Frequencies below this value are not processed, and the output spectrum is zero for them. Specify a value in the range from 0 to half of the sampling frequency. The default is 0.
UPPER_BOUND_FREQUENCY: int type. The maximum frequency [Hz] used for separation processing. Frequencies above this value are not processed, and the output spectrum is zero for them. LOWER_BOUND_FREQUENCY $<$ UPPER_BOUND_FREQUENCY must hold. The default is 8000.
TF_CONJ_FILENAME: string type. The name of the file in which the transfer function database of your microphone array is saved. Refer to Section 5.3.1 for details of the file format.
REG_FACTOR: float type. The coefficient $\alpha$ in equation (81). The default value is 0.0001.
ENABLE_DEBUG: bool type. The default value is false. Setting the value to true prints the separation status to the standard output.
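As a rough illustration of how LENGTH, SAMPLING_RATE, and the two frequency bounds interact, the following sketch computes which of the LENGTH / 2 + 1 frequency bins lie inside the processed band. The function name `active_bins` is hypothetical, and treating both bounds as inclusive is an assumption based on the wording above (frequencies strictly below or above the bounds are skipped); the node's exact rounding may differ.

```python
import numpy as np

def active_bins(length, sampling_rate, lower_hz, upper_hz):
    """Boolean mask over the length // 2 + 1 frequency bins.

    True marks bins whose center frequency lies in [lower_hz, upper_hz].
    Illustrative only; inclusive bounds are an assumption.
    """
    if not lower_hz < upper_hz:
        raise ValueError("LOWER_BOUND_FREQUENCY must be below UPPER_BOUND_FREQUENCY")
    # Center frequency of bin k is k * sampling_rate / length.
    freqs = np.arange(length // 2 + 1) * sampling_rate / length
    return (freqs >= lower_hz) & (freqs <= upper_hz)

# With the default parameters every bin is processed.
default_mask = active_bins(512, 16000, 0, 8000)
# Raising the lower bound to 300 Hz skips the bins below it.
band_mask = active_bins(512, 16000, 300, 8000)
```

Bins where the mask is False would carry a zero output spectrum, as described for LOWER_BOUND_FREQUENCY and UPPER_BOUND_FREQUENCY.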

Table 6.56: Parameter list of ML

Parameter name	Type	Default value	Unit	Description
LENGTH	`int`	512	[pt]	Analysis frame length.
ADVANCE	`int`	160	[pt]	Shift length of frame.
SAMPLING_RATE	`int`	16000	[Hz]	Sampling frequency.
LOWER_BOUND_FREQUENCY	`int`	0	[Hz]	The minimum frequency value used for separation processing.
UPPER_BOUND_FREQUENCY	`int`	8000	[Hz]	The maximum frequency value used for separation processing.
TF_CONJ_FILENAME	`string`			File name of transfer function database of your microphone array.
REG_FACTOR	`float`	0.0001		The coefficient. See equation (81).
ENABLE_DEBUG	`bool`	`false`		Enables or disables output of the separation status to the standard output.

6.3.8.5 Details of the node

Technical details: Please refer to the reference in Section 6.3.8.6 for details.

Brief explanation of sound source separation: Table 6.57 shows the notation of variables used in sound source separation problems. Since source separation is performed frame by frame in the frequency domain, all variables are complex-valued. The separation is performed independently for each of the $K$ frequency bins ( $1 \leq k \leq K$ ), so the bin index $k$ is omitted from the notation. Let $N$ , $M$ , and $f$ denote the number of sound sources, the number of microphones, and the frame index, respectively.

Table 6.57: Notation of variables

Variables	Description
$\boldsymbol {S}(f) = \left[S_1(f), \dots , S_ N(f)\right]^ T$	Complex spectrum of target sound sources at the $f$ -th frame.
$\boldsymbol {X}(f) = \left[X_1(f), \dots , X_ M(f)\right]^ T$	Complex spectrum of a microphone observation at the $f$ -th frame, which corresponds to INPUT_FRAMES.
$\boldsymbol {N}(f) = \left[N_1(f), \dots , N_ M(f)\right]^ T$	Complex spectrum of added noise.
$\boldsymbol {H} = \left[ \boldsymbol {H}_1, \dots , \boldsymbol {H}_ N \right] \in \mathbb {C}^{M \times N}$	Transfer function matrix. Its $(m, n)$ element is the transfer function from the $n$ -th sound source ( $1 \leq n \leq N$ ) to the $m$ -th microphone ( $1 \leq m \leq M$ ).
$\boldsymbol {K}(f) \in \mathbb {C}^{M \times M}$	Correlation matrix of known noise.
$\boldsymbol {W}(f) = \left[ \boldsymbol {W}_1, \dots , \boldsymbol {W}_ M \right] \in \mathbb {C}^{N \times M}$	Separation matrix at the $f$ -th frame.
$\boldsymbol {Y}(f) = \left[Y_1(f), \dots , Y_ N(f)\right]^ T$	Complex spectrum of separated signals.

Use the following linear model for the signal processing:

$\displaystyle \boldsymbol {X}(f) = \boldsymbol {H}\boldsymbol {S}(f) + \boldsymbol {N}(f)$ (78)

The purpose of the separation is to estimate $\boldsymbol {W}(f)$ based on the following equation:

$\displaystyle \boldsymbol {Y}(f) = \boldsymbol {W}(f)\boldsymbol {X}(f)$ (79)

so that $\boldsymbol {Y}(f)$ is close to $\boldsymbol {S}(f)$ .
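Because the bin index $k$ is omitted from the notation, it may help to see the separation step written out for a whole frame. The sketch below applies one $N \times M$ separation matrix per frequency bin to an observation stored like INPUT_FRAMES (rows are channels, columns are frequency bins). The function name `separate_frame` and the `(K, N, M)` storage layout of the separation matrices are assumptions made for illustration, not the node's internal representation.

```python
import numpy as np

def separate_frame(W, X):
    """Apply Y = W X independently in every frequency bin.

    W: array of shape (K, N, M), one N x M separation matrix per bin
       (hypothetical layout).
    X: array of shape (M, K), rows are channels, columns are bins.
    Returns Y with shape (N, K).
    """
    W = np.asarray(W)
    X = np.asarray(X)
    # Y[n, k] = sum over m of W[k, n, m] * X[m, k]
    return np.einsum("knm,mk->nk", W, X)

# Example: 3 microphones, 5 bins, identity separation matrices leave X unchanged.
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((3, 5)) + 1j * rng.standard_normal((3, 5))
W_demo = np.tile(np.eye(3, dtype=complex), (5, 1, 1))
Y_demo = separate_frame(W_demo, X_demo)
```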

The separation matrix $W_{\textrm{ML}}$ based on the maximum likelihood method is expressed by the following equation.

$\displaystyle W_{\textrm{ML}}(f) = \frac{\tilde{\boldsymbol {K}}^{-1}(f)\boldsymbol {H}}{\boldsymbol {H}^{H}\tilde{\boldsymbol {K}}^{-1}(f)\boldsymbol {H}}$ (80)

$\tilde{\boldsymbol {K}}(f)$ is given by the following equation.

$\displaystyle \tilde{\boldsymbol {K}}(f) = \boldsymbol {K}(f) + ||\boldsymbol {K}(f)||_{\textrm{F}}\alpha \boldsymbol {I}$ (81)

$||\boldsymbol {K}(f)||_{\textrm{F}}$ is the Frobenius norm of the known noise correlation matrix $\boldsymbol {K}(f)$ , $\alpha$ is REG_FACTOR, and $\boldsymbol {I}$ is the identity matrix.
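Equations (80) and (81) can be sketched for a single frequency bin as follows. This is a minimal NumPy illustration, not the node's implementation: the function name `ml_weights` is hypothetical, and $\boldsymbol {H}$ is assumed to be the single column vector `h` of transfer functions for the one target source, so that the denominator of (80) is a scalar. The sketch also assumes that the separated signal is obtained as the inner product $\boldsymbol {w}^{H}\boldsymbol {X}(f)$ with the resulting weight vector.

```python
import numpy as np

def ml_weights(K, h, reg_factor=1e-4):
    """Weight vector of eq. (80) for one frequency bin.

    K: (M, M) known noise correlation matrix.
    h: (M,) transfer function vector of the target source (assumed single source).
    reg_factor: the coefficient alpha of eq. (81), i.e. REG_FACTOR.
    """
    K = np.asarray(K, dtype=complex)
    h = np.asarray(h, dtype=complex)
    M = K.shape[0]
    # Eq. (81): add alpha * ||K||_F to the diagonal of K.
    K_tilde = K + np.linalg.norm(K, "fro") * reg_factor * np.eye(M)
    # Solve K_tilde v = h rather than forming the inverse explicitly.
    v = np.linalg.solve(K_tilde, h)
    # Eq. (80): divide by the scalar h^H K_tilde^{-1} h.
    return v / (h.conj() @ v)

# Example with a random Hermitian positive semidefinite noise matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
K_demo = A @ A.conj().T
h_demo = rng.standard_normal(4) + 1j * rng.standard_normal(4)
w_demo = ml_weights(K_demo, h_demo)
# Response of the weights toward the target direction, w^H h.
target_gain = np.vdot(w_demo, h_demo)
```

Under these assumptions the weights pass the target direction with unit gain ( $\boldsymbol {w}^{H}\boldsymbol {h} = 1$ ). The added diagonal term in (81) keeps the linear solve well conditioned even when $\boldsymbol {K}(f)$ is close to singular.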

Troubleshooting: In general, follow the troubleshooting guidance for the GHDSS node.

6.3.8.6 Reference

F. Asano: "Array signal processing for acoustics: Localization, tracking and separation of sound sources", The Acoustical Society of Japan, 2011.