6.3.8 ML

6.3.8.1 Outline of the node

This node performs sound source separation based on maximum likelihood (ML) estimation. The algorithm obtains a separation matrix that maximizes the likelihood function under the assumption that the input signal is the sum of a single target sound source and Gaussian noise. It requires the transfer functions from the sound source to the microphones, the period information of the sound source (speech period detection results), and a correlation matrix of the known noise.

Node inputs are:

  1. multi-channel complex spectra of the microphone observations,

  2. sound source localization results (directions of the target sources),

  3. a correlation matrix of the known noise.

Node outputs are a set of complex spectra of the separated sounds.

6.3.8.2 Necessary file

Table 6.55: Necessary files for ML

Corresponding parameter name    Description
TF_CONJ_FILENAME                Transfer function of a microphone array

6.3.8.3 Usage

When to use

This node performs sound source separation for sound sources in specified directions using a microphone array. The sound source direction can be either a value estimated by sound source localization or a constant value.

Typical connection

Figure 6.71 shows a connection example of the ML node. The node has three inputs as follows:

  1. INPUT_FRAMES takes multi-channel complex spectra containing a mixture of sounds, produced by, for example, MultiFFT ,

  2. INPUT_SOURCES takes the results of sound source localization, produced by, for example, LocalizeMUSIC or ConstantLocalization ,

  3. INPUT_NOISE_CM takes a correlation matrix of known noise, produced by, for example, CMLoad .

The output is the separated signals.

Figure 6.71: Network example using ML (image: fig/modules/ML.png)

6.3.8.4 Input-output and property of the node

Input

INPUT_FRAMES

: Matrix<complex<float> >  type. Multi-channel complex spectra corresponding to the input waveform from each microphone; the rows correspond to the channels and the columns correspond to the frequency bins.

INPUT_SOURCES

: Vector<ObjectRef>  type. A Vector of Source -type objects in which the sound source localization results are stored. Typically, this takes the output of SourceIntervalExtender connected to SourceTracker .

INPUT_NOISE_CM

: Matrix<complex<float> >  type. A correlation matrix for each frequency bin. The rows correspond to the frequency bins ($NFFT / 2 + 1$ rows), and each row stores the $M \times M$ complex correlation matrix flattened into $M * M$ columns, where $M$ is the number of microphones.
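As a layout illustration only (not HARK code), the following NumPy sketch shows how one frame of INPUT_FRAMES and INPUT_NOISE_CM can be pictured, and how one row of the noise correlation matrix maps back to an $M \times M$ matrix; the microphone count $M = 8$ is an arbitrary assumption:

    import numpy as np

    M = 8                    # number of microphones (assumed for illustration)
    NFFT = 512               # FFT length, giving NFFT/2 + 1 = 257 frequency bins
    NBINS = NFFT // 2 + 1

    # INPUT_FRAMES: rows are channels, columns are frequency bins
    frames = np.zeros((M, NBINS), dtype=complex)

    # INPUT_NOISE_CM: one row per frequency bin, each row a flattened M x M matrix
    noise_cm = np.zeros((NBINS, M * M), dtype=complex)

    k = 100                              # an arbitrary frequency bin
    K_k = noise_cm[k].reshape(M, M)      # M x M noise correlation matrix at bin k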

Output

OUTPUT

: Map<int, ObjectRef>  type. A pair of the sound source ID and the complex spectrum of the separated sound (Vector<complex<float> >  type). One pair is output for each sound source.
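Conceptually (a Python-style analogy, not the actual HARK data type), the output for one frame resembles a map from source ID to separated spectrum:

    import numpy as np

    # {source ID: complex spectrum of the separated sound}, one entry per active source
    output = {
        3: np.zeros(257, dtype=complex),
        7: np.zeros(257, dtype=complex),
    }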

Parameter

LENGTH

: int  type. Analysis frame length [samples], which must be equal to the value used in the preceding nodes (e.g. AudioStreamFromMic or MultiFFT ). The default is 512.

ADVANCE

: int  type. Shift length of a frame [samples], which must be equal to the value used in the preceding nodes (e.g. AudioStreamFromMic or MultiFFT ). The default is 160.

SAMPLING_RATE

: int  type. Sampling frequency of the input waveform [Hz]. The default is 16000.
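For example, with the default ADVANCE = 160 and SAMPLING_RATE = 16000 Hz, consecutive frames are spaced 160 / 16000 = 10 ms apart, and the default LENGTH of 512 samples corresponds to a 32 ms analysis window.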

LOWER_BOUND_FREQUENCY

: int  type. The minimum frequency used in the separation processing. Frequencies below this value are not processed, and the corresponding output spectrum is set to zero. Specify a value between 0 and half of the sampling frequency.

UPPER_BOUND_FREQUENCY

: int  type. The maximum frequency used in the separation processing. Frequencies above this value are not processed, and the corresponding output spectrum is set to zero. LOWER_BOUND_FREQUENCY $<$ UPPER_BOUND_FREQUENCY must hold.
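The bounds are given in Hz; as a rough sketch of the Hz-to-bin mapping (the exact rounding used inside the node is not specified here), the corresponding frequency-bin index can be computed as follows:

    # Map a frequency in Hz to a frequency-bin index (approximate)
    def freq_to_bin(freq_hz, sampling_rate=16000, nfft=512):
        return int(freq_hz * nfft / sampling_rate)

    lower_bin = freq_to_bin(0)       # LOWER_BOUND_FREQUENCY -> bin 0
    upper_bin = freq_to_bin(8000)    # UPPER_BOUND_FREQUENCY (half the sampling rate) -> bin 256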

TF_CONJ_FILENAME

: string  type. The name of the file in which the transfer function database of your microphone array is saved. Refer to Section 5.3.1 for the details of the file format.

REG_FACTOR

: float  type. The regularization coefficient $\alpha$ used in equation (81). The default value is 0.0001.

ENABLE_DEBUG

: bool  type. The default value is false. Setting the value to true outputs the separation status to the standard output.

Table 6.56: Parameter list of ML

Parameter name          Type    Default value  Unit   Description
LENGTH                  int     512            [pt]   Analysis frame length.
ADVANCE                 int     160            [pt]   Shift length of frame.
SAMPLING_RATE           int     16000          [Hz]   Sampling frequency.
LOWER_BOUND_FREQUENCY   int     0              [Hz]   Minimum frequency used for separation processing.
UPPER_BOUND_FREQUENCY   int     8000           [Hz]   Maximum frequency used for separation processing.
TF_CONJ_FILENAME        string                        File name of the transfer function database of the microphone array.
REG_FACTOR              float   0.0001                Regularization coefficient (see equation (81)).
ENABLE_DEBUG            bool    false                 Enable or disable output of the separation status to standard output.

6.3.8.5 Details of the node

Technical details: Please refer to the following reference for the details.

Brief explanation of sound source separation: Table 6.57 shows the notation of the variables used in the sound source separation problem. Since the separation is performed frame-by-frame in the frequency domain, all variables are complex-valued. The separation is performed independently for all $K$ frequency bins ($1 \leq k \leq K$); the bin index $k$ is omitted from the notation below. Let $N$, $M$, and $f$ denote the number of sound sources, the number of microphones, and the frame index, respectively.

Table 6.57: Notation of variables

$\boldsymbol{S}(f) = \left[S_1(f), \dots, S_N(f)\right]^T$ : Complex spectrum of the target sound sources at the $f$-th frame.

$\boldsymbol{X}(f) = \left[X_1(f), \dots, X_M(f)\right]^T$ : Complex spectrum of the microphone observation at the $f$-th frame, which corresponds to INPUT_FRAMES.

$\boldsymbol{N}(f) = \left[N_1(f), \dots, N_M(f)\right]^T$ : Complex spectrum of the additive noise.

$\boldsymbol{H} = \left[\boldsymbol{H}_1, \dots, \boldsymbol{H}_N\right] \in \mathbb{C}^{M \times N}$ : Transfer function matrix from the $n$-th sound source ($1 \leq n \leq N$) to the $m$-th microphone ($1 \leq m \leq M$).

$\boldsymbol{K}(f) \in \mathbb{C}^{M \times M}$ : Correlation matrix of the known noise.

$\boldsymbol{W}(f) = \left[\boldsymbol{W}_1, \dots, \boldsymbol{W}_M\right] \in \mathbb{C}^{N \times M}$ : Separation matrix at the $f$-th frame.

$\boldsymbol{Y}(f) = \left[Y_1(f), \dots, Y_N(f)\right]^T$ : Complex spectrum of the separated signals.

Use the following linear model for the signal processing:

  $\boldsymbol{X}(f) = \boldsymbol{H}\boldsymbol{S}(f) + \boldsymbol{N}(f)$   (78)

The purpose of the separation is to estimate $\boldsymbol {W}(f)$ based on the following equation:

  $\boldsymbol{Y}(f) = \boldsymbol{W}(f)\boldsymbol{X}(f)$   (79)

so that $\boldsymbol{Y}(f)$ approaches $\boldsymbol{S}(f)$.
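For the single-target case assumed by this node ($N = 1$), the model and the separation reduce to $\boldsymbol{X}(f) = \boldsymbol{H}_1 S_1(f) + \boldsymbol{N}(f)$ and $Y_1(f) = \boldsymbol{W}(f)\boldsymbol{X}(f)$, i.e. a single $1 \times M$ separation vector is estimated per frequency bin.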

The separation matrix $W_{\textrm{ML}}$ based on the maximum likelihood method is expressed by the following equation.

  $W_{\textrm{ML}}(f) = \frac{\tilde{\boldsymbol{K}}^{-1}(f)\boldsymbol{H}}{\boldsymbol{H}^{H}\tilde{\boldsymbol{K}}^{-1}(f)\boldsymbol{H}}$   (80)

$\tilde{\boldsymbol{K}}(f)$ can be expressed as below.

  $\tilde{\boldsymbol{K}}(f) = \boldsymbol{K}(f) + ||\boldsymbol{K}(f)||_{\textrm{F}}\,\alpha\,\boldsymbol{I}$   (81)

$||\boldsymbol{K}(f)||_{\textrm{F}}$ is the Frobenius norm of the known noise correlation matrix $\boldsymbol{K}(f)$, $\alpha$ is the REG_FACTOR, and $\boldsymbol{I}$ is an identity matrix.
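As a minimal numerical sketch of equations (80) and (81) (NumPy, not HARK source code; the conjugation convention when applying the result in equation (79) is an assumption):

    import numpy as np

    def ml_separation_row(H, K, alpha=1e-4):
        """Separation row vector for a single target at one frequency bin.

        H     : (M,) complex transfer function vector of the target source
        K     : (M, M) known-noise correlation matrix at this bin (from INPUT_NOISE_CM)
        alpha : regularization coefficient (REG_FACTOR)
        """
        M = H.shape[0]
        K_tilde = K + alpha * np.linalg.norm(K, 'fro') * np.eye(M)   # eq. (81)
        Kinv_H = np.linalg.solve(K_tilde, H)                         # regularized K^{-1} H
        w = Kinv_H / (H.conj() @ Kinv_H)                             # eq. (80)
        return w.conj()            # returned as a row so that Y = W @ X as in eq. (79)

    # Usage at one frequency bin k:
    # X = frames[:, k]                      # (M,) observed complex spectrum
    # Y = ml_separation_row(H, K_k) @ X     # separated target spectrum (complex scalar)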

Troubleshooting: Basically, follow the troubleshooting instructions for the GHDSS node.

6.3.8.6 Reference