6.3.8 ML

6.3.8.1 Outline of the node

Performs sound source separation by maximum likelihood (ML) estimation. The algorithm assumes that the input signal is the sum of a single target sound source and Gaussian noise, and obtains the separation matrix that maximizes the likelihood function under that assumption. The node requires the transfer functions from the sound source to the microphones, the source period information (the detected speech periods), and a correlation matrix of the known noise.

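Node inputs are:

  1. a multi-channel complex spectrum of the mixed sound,

  2. the results of sound source localization,

  3. a correlation matrix of the known noise.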

Node outputs are a set of complex spectra, one for each separated sound.

6.3.8.2 Necessary file

Table 6.51: Necessary files for ML

| Corresponding parameter name | Description |
|---|---|
| TF_CONJ_FILENAME | Transfer function of a microphone array |

6.3.8.3 Usage

When to use

Use this node to separate the sound arriving from a given sound source direction using a microphone array. The sound source direction can be either a value estimated by sound source localization or a constant value.

Typical connection

Figure 6.64 shows a connection example for the ML node. The node has three inputs:

  1. INPUT_FRAMES takes a multi-channel complex spectrum containing the mixture of sounds, produced for example by MultiFFT,

  2. INPUT_SOURCES takes the results of sound source localization, produced for example by LocalizeMUSIC or ConstantLocalization,

  3. INPUT_NOISE_CM takes a correlation matrix of the known noise, produced for example by CMLoad.

The output is the separated signals.

Figure 6.64: Network example using ML

6.3.8.4 Input-output and property of the node

Input

INPUT_FRAMES

: Matrix<complex<float> > type. Multi-channel complex spectra of the input waveform from each microphone. Rows correspond to channels and columns correspond to frequency bins.

INPUT_SOURCES

: Vector<ObjectRef> type. A Vector of Source type objects in which the sound source localization results are stored. Typically, this input takes the output of a SourceIntervalExtender node connected to SourceTracker.

INPUT_NOISE_CM

: Matrix<complex<float> > type. Correlation matrices of the known noise, one for each frequency bin. Each correlation matrix is an $M$-th order complex square matrix, and there are $NFFT / 2 + 1$ of them. In the Matrix<complex<float> >, each row corresponds to one frequency bin ($NFFT / 2 + 1$ rows) and holds the elements of that bin's complex correlation matrix ($M * M$ columns).
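The layout described for INPUT_NOISE_CM can be unpacked as in the following minimal Python/NumPy sketch. It assumes that the $M * M$ entries of each row are stored in row-major order, which the description above does not state. `split_noise_cm` is a hypothetical helper for illustration and is not part of HARK.

```python
import numpy as np

def split_noise_cm(noise_cm, n_mics):
    """Turn a (n_bins, M*M) array into n_bins separate (M, M) matrices.

    Assumption: each row stores its M x M matrix in row-major order.
    """
    n_bins, n_cols = noise_cm.shape
    if n_cols != n_mics * n_mics:
        raise ValueError("expected %d columns, got %d" % (n_mics * n_mics, n_cols))
    return noise_cm.reshape(n_bins, n_mics, n_mics)

NFFT = 512  # illustrative value, matching the default LENGTH
M = 8       # illustrative number of microphones

# Build a flattened input that holds an identity matrix in every frequency bin.
flat = np.zeros((NFFT // 2 + 1, M * M), dtype=np.complex64)
flat[:] = np.eye(M, dtype=np.complex64).reshape(-1)

per_bin = split_noise_cm(flat, M)
print(per_bin.shape)  # (257, 8, 8)
```

If the actual storage order were column-major, each recovered matrix would be the transpose of the intended one, so the ordering is worth checking against a known file.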

Output

OUTPUT

: Map<int, ObjectRef> type. Each entry pairs the sound source ID of a separated sound with the 1-channel complex spectrum of that sound (Vector<complex<float> > type).

Parameter

LENGTH

: int type. Analysis frame length [samples], which must be equal to the value set in the preceding node (e.g. AudioStreamFromMic or the MultiFFT node). The default value is 512 [samples].

ADVANCE

: int type. Shift length of a frame [samples], which must be equal to the value set in the preceding node (e.g. AudioStreamFromMic or the MultiFFT node). The default value is 160 [samples].

SAMPLING_RATE

: int type. Sampling frequency of the input waveform [Hz]. The default value is 16000[Hz].

LOWER_BOUND_FREQUENCY

: int type. The minimum frequency value used for separation processing. For frequencies below this value, no processing is performed and the output spectrum is 0. Specify a value between 0 and half of the sampling frequency.

UPPER_BOUND_FREQUENCY

: int type. The maximum frequency value used for separation processing. For frequencies above this value, no processing is performed and the output spectrum is 0. The UPPER_BOUND_FREQUENCY must be greater than the LOWER_BOUND_FREQUENCY.

TF_CONJ_FILENAME

: string type. The name of the file in which the transfer function database of your microphone array is saved. See Section 5.3.1 for details of the file format.

REG_FACTOR

: float type. The regularization coefficient $\alpha$ in equation (76). The default value is 0.0001.

ENABLE_DEBUG

: bool type. The default value is false. Setting the value to true outputs the separation status to the standard output.
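As an illustration of LOWER_BOUND_FREQUENCY and UPPER_BOUND_FREQUENCY, the sketch below lists the frequency bins that would be processed for a given band. It assumes that bin $k$ sits at $k$ * SAMPLING_RATE / LENGTH Hz and that both bounds are inclusive. The node's exact rounding is not documented here, and `active_bins` is a hypothetical helper, not a HARK function.

```python
def active_bins(length, sampling_rate, lower_hz, upper_hz):
    """Return the indices of the frequency bins inside [lower_hz, upper_hz].

    Assumption: bin k is centred at k * sampling_rate / length Hz,
    and both bounds are treated as inclusive.
    """
    if not 0 <= lower_hz < upper_hz <= sampling_rate / 2:
        raise ValueError("need 0 <= lower < upper <= sampling_rate / 2")
    n_bins = length // 2 + 1             # number of bins for a real-valued frame
    hz_per_bin = sampling_rate / length  # 31.25 Hz for 512 samples at 16 kHz
    return [k for k in range(n_bins)
            if lower_hz <= k * hz_per_bin <= upper_hz]

# Example band of 300 Hz to 3400 Hz with the default LENGTH and SAMPLING_RATE.
bins = active_bins(512, 16000, 300, 3400)
print(bins[0], bins[-1], len(bins))  # 10 108 99

# With the default bounds (0 Hz to 8000 Hz) every bin is processed.
print(len(active_bins(512, 16000, 0, 8000)))  # 257
```

Bins outside the returned list correspond to the frequencies for which the output spectrum is 0.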

Table 6.52: Parameter list of ML

| Parameter name | Type | Default value | Unit | Description |
|---|---|---|---|---|
| LENGTH | int | 512 | [pt] | Analysis frame length. |
| ADVANCE | int | 160 | [pt] | Shift length of frame. |
| SAMPLING_RATE | int | 16000 | [Hz] | Sampling frequency. |
| LOWER_BOUND_FREQUENCY | int | 0 | [Hz] | The minimum frequency value used for separation processing. |
| UPPER_BOUND_FREQUENCY | int | 8000 | [Hz] | The maximum frequency value used for separation processing. |
| TF_CONJ_FILENAME | string | | | File name of transfer function database of your microphone array. |
| REG_FACTOR | float | 0.0001 | | The coefficient. See the equation (76). |
| ENABLE_DEBUG | bool | false | | Enable or disable output of the separation status to standard output. |

6.3.8.5 Details of the node

Technical details: Please refer to the following reference for the details.

Brief explanation of sound source separation: Table 6.53 shows the notation of the variables used in the sound source separation problem. Since the separation is performed frame by frame in the frequency domain, all variables are complex-valued. The separation is performed independently for each of the $K$ frequency bins ($1 \leq k \leq K$), so we omit $k$ from the notation. Let $N$, $M$, and $f$ denote the number of sound sources, the number of microphones, and the frame index, respectively.

Table 6.53: Notation of variables

| Variables | Description |
|---|---|
| $\boldsymbol{S}(f) = \left[S_1(f), \dots , S_N(f)\right]^T$ | Complex spectrum of target sound sources at the $f$-th frame. |
| $\boldsymbol{X}(f) = \left[X_1(f), \dots , X_M(f)\right]^T$ | Complex spectrum of a microphone observation at the $f$-th frame, which corresponds to INPUT_FRAMES. |
| $\boldsymbol{N}(f) = \left[N_1(f), \dots , N_M(f)\right]^T$ | Complex spectrum of added noise. |
| $\boldsymbol{H} = \left[ \boldsymbol{H}_1, \dots , \boldsymbol{H}_N \right] \in \mathbb{C}^{M \times N}$ | Transfer function matrix from the $n$-th sound source ($1 \leq n \leq N$) to the $m$-th microphone ($1 \leq m \leq M$). |
| $\boldsymbol{K}(f) \in \mathbb{C}^{M \times M}$ | Correlation matrix of known noise. |
| $\boldsymbol{W}(f) = \left[ \boldsymbol{W}_1, \dots , \boldsymbol{W}_M \right] \in \mathbb{C}^{N \times M}$ | Separation matrix at the $f$-th frame. |
| $\boldsymbol{Y}(f) = \left[Y_1(f), \dots , Y_N(f)\right]^T$ | Complex spectrum of separated signals. |

Use the following linear model for the signal processing:

  $\displaystyle \boldsymbol{X}(f) = \boldsymbol{H}\boldsymbol{S}(f) + \boldsymbol{N}(f)$   (73)

The purpose of the separation is to estimate $\boldsymbol {W}(f)$ based on the following equation:

  $\displaystyle \boldsymbol{Y}(f) = \boldsymbol{W}(f)\boldsymbol{X}(f)$   (74)

so that $\boldsymbol{Y}(f)$ approximates $\boldsymbol{S}(f)$.

The separation matrix $\boldsymbol{W}_{\textrm{ML}}$ based on the maximum likelihood method is given by the following equation.

  $\displaystyle \boldsymbol{W}_{\textrm{ML}}(f) = \frac{\tilde{\boldsymbol{K}}^{-1}(f)\boldsymbol{H}}{\boldsymbol{H}^{H}\tilde{\boldsymbol{K}}^{-1}(f)\boldsymbol{H}}$   (75)

$\tilde{\boldsymbol{K}}(f)$ is defined as follows.

  $\displaystyle \tilde{\boldsymbol{K}}(f) = \boldsymbol{K}(f) + ||\boldsymbol{K}(f)||_{\textrm{F}}\alpha \boldsymbol{I}$   (76)

$||\boldsymbol {K}(f)||_{\textrm{F}}$ is the Frobenius norm of the known noise correlation matrix $\boldsymbol {K}(f)$, $\alpha $ is the REG_FACTOR, and $\boldsymbol {I}$ is an identity matrix.
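Equations (75) and (76) can be tried out numerically for a single frequency bin with the following minimal NumPy sketch. It is not the node's implementation. It assumes a single target source, so that $\boldsymbol{H}$ reduces to one column `h` of length $M$, and it assumes that the separated signal is obtained by applying the Hermitian transpose of the vector in equation (75) to the observation. The function name `ml_separation_vector` and all test data are hypothetical.

```python
import numpy as np

def ml_separation_vector(K, h, reg_factor=1e-4):
    """Separation vector for one frequency bin, following eqs. (75) and (76).

    K: (M, M) correlation matrix of the known noise.
    h: (M,) transfer function vector of the single target source (assumption).
    """
    M = K.shape[0]
    # Eq. (76): add the identity scaled by the Frobenius norm and REG_FACTOR.
    K_tilde = K + np.linalg.norm(K, "fro") * reg_factor * np.eye(M)
    # Solve K_tilde z = h instead of forming the inverse explicitly.
    Kinv_h = np.linalg.solve(K_tilde, h)
    # Eq. (75): normalise by h^H K_tilde^{-1} h.
    return Kinv_h / (h.conj() @ Kinv_h)

rng = np.random.default_rng(0)
M = 4
h = rng.standard_normal(M) + 1j * rng.standard_normal(M)
A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
K = A @ A.conj().T / M  # synthetic Hermitian noise correlation matrix

w = ml_separation_vector(K, h)

# Noise-free observation of one source sample s, i.e. eq. (73) with N(f) = 0.
s = 0.7 - 0.2j
x = h * s
y = w.conj() @ x  # assumed application of the separation vector: y = w^H x
print(np.isclose(y, s))
```

Under these assumptions the normalisation in equation (75) leaves the target component unchanged, so the noise-free observation is recovered as the source sample, while REG_FACTOR keeps $\tilde{\boldsymbol{K}}(f)$ invertible when $\boldsymbol{K}(f)$ is close to singular.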

Troubleshooting: See the troubleshooting notes for the GHDSS node.

6.3.8.6 Reference