The GHDSS node performs sound source separation based on the GHDSS (Geometric High-order Decorrelation-based Source Separation) algorithm. The GHDSS algorithm uses a microphone array and performs the following two processes.
Higher-order decorrelation between sound source signals,
Directivity formation towards the sound source direction.
For directivity formation, the positional relation of the microphones given beforehand is used as a geometric constraint. The GHDSS algorithm implemented in the current version of HARK uses the transfer function of the microphone array as the positional relation of the microphones. Node inputs are the multichannel complex spectrum of the sound mixture and data concerning sound source directions. Node outputs are a set of complex spectra, one for each separated sound.
Changes from HARK 2.0 to HARK 2.1
Using Zip Format
In HARK 2.1, GHDSS uses Zip format for TF_CONJ_FILENAME, INITW_FILENAME and EXPORT_W_FILENAME.
Changing the parameter used to calculate the error when updating W
In HARK 2.1, GHDSS measures the error as the distance from the localization result of the previous frame. The unit is [mm].
| Corresponding parameter name | Description |
|---|---|
| TF_CONJ_FILENAME | Transfer function of microphone array |
| INITW_FILENAME | Initial value of separation matrix |
When to use
Given a sound source direction, the node separates a sound source originating from the direction with a microphone array. As a sound source direction, either a value estimated by sound source localization or a constant value may be used.
Typical connection
Figure 6.56 shows a connection example of the GHDSS node. The node has two inputs as follows:
INPUT_FRAMES takes a multichannel complex spectrum containing the mixture of sounds,
INPUT_SOURCES takes the results of sound source localization.
To recognize the output, that is, a separated sound, it may be given to MelFilterBank to convert it to speech features for speech recognition. To improve the performance of automatic speech recognition, it may instead be given to one of the following:
The PostFilter node, to suppress the inter-channel leakage and diffuse noise caused by the source separation processing (shown in the upper right part of Fig. 6.56),
PowerCalcForMap, HRLE, and SpectralGainFilter in cascade, to suppress the inter-channel leakage and diffuse noise caused by source separation processing (this is easier to tune than PostFilter), or
PowerCalcForMap, MelFilterBank and MFMGeneration in cascade, to generate missing feature masks so that the separated speech is recognized by a missing-feature-theory based automatic speech recognition system (shown in the lower right part of Fig. 6.56).
| Parameter name | Type | Default value | Unit | Description |
|---|---|---|---|---|
| LENGTH | int | 512 | [pt] | Analysis frame length. |
| ADVANCE | int | 160 | [pt] | Shift length of frame. |
| SAMPLING_RATE | int | 16000 | [Hz] | Sampling frequency. |
| LOWER_BOUND_FREQUENCY | | 0 | [Hz] | The minimum frequency used for separation processing. |
| UPPER_BOUND_FREQUENCY | | 8000 | [Hz] | The maximum frequency used for separation processing. |
| TF_CONJ_FILENAME | string | | | File name of the transfer function database of your microphone array. |
| INITW_FILENAME | string | | | Name of the file that describes the initial value of the separation matrix. |
| SS_METHOD | string | ADAPTIVE | | The step-size calculation method based on higher-order decorrelation. Select FIX, LC_MYU or ADAPTIVE. FIX indicates a fixed value, LC_MYU indicates a value linked to the step size based on geometric constraints, and ADAPTIVE indicates automatic regulation. |
| SS_METHOD==FIX | | | | The following is valid when FIX is chosen for SS_METHOD. |
| SS_MYU | float | 0.001 | | The step size based on higher-order decorrelation used when updating a separation matrix. |
| SS_SCAL | float | 1.0 | | The scale factor in the higher-order correlation matrix computation. |
| NOISE_FLOOR | float | 0.0 | | The threshold value (upper limit) of the amplitude for judging the input signal as noise. |
| LC_CONST | string | FULL | | Determines the geometric constraints. Select DIAG or FULL. With DIAG, the geometric constraints contain only the direct sound part. With FULL, the whole geometric constraints are used. |
| LC_METHOD | string | ADAPTIVE | | The step-size calculation method based on geometric constraints. Select FIX or ADAPTIVE. FIX indicates a fixed value and ADAPTIVE indicates automatic regulation. |
| LC_METHOD==FIX | | | | The following is valid when FIX is chosen for LC_METHOD. |
| LC_MYU | float | 0.001 | | The step size based on geometric constraints used when updating a separation matrix. |
| UPDATE_METHOD_TF_CONJ | string | POS | | Designates the method to update transfer functions. Select POS or ID. |
| UPDATE_METHOD_W | string | ID | | Designates the method to update separation matrices. Select ID, POS or ID_POS. |
| UPDATE_ACCEPT_DISTANCE | float | 300.0 | [mm] | The distance threshold for judging a sound source as identical to another in separation processing. |
| EXPORT_W | bool | false | | Designates whether separation matrices are written to a file. |
| EXPORT_W==true | | | | The following is valid when EXPORT_W is true. |
| EXPORT_W_FILENAME | string | | | The name of the file to which the separation matrix is written. |
| UPDATE | | STEP | | The method to update separation matrices. Select STEP or TOTAL. In STEP, separation matrices are updated based on the geometric constraints after an update based on higher-order decorrelation. In TOTAL, separation matrices are updated based on the geometric constraints and higher-order decorrelation at the same time. |
Input
INPUT_FRAMES: Matrix<complex<float> > type. Multichannel complex spectra. Rows correspond to channels, i.e., complex spectra of waveforms input from microphones, and columns correspond to frequency bins.
INPUT_SOURCES: Vector<ObjectRef> type. A Vector of Source type objects in which source localization results are stored. It is typically connected to the output of the SourceTracker node or the SourceIntervalExtender node.
Output
Map<int, ObjectRef> type. Each entry pairs the sound source ID of a separated sound with the 1-channel complex spectrum of that separated sound (Vector<complex<float> > type).
Parameter
LENGTH: int type. Analysis frame length, which must equal the value used at the preceding stage (e.g. AudioStreamFromMic or the MultiFFT node).
ADVANCE: int type. Shift length of a frame, which must equal the value used at the preceding stage (e.g. AudioStreamFromMic or the MultiFFT node).
SAMPLING_RATE: int type. Sampling frequency of the input waveform.
LOWER_BOUND_FREQUENCY: The minimum frequency used in GHDSS processing. Frequencies below this value are not processed, and the output spectrum is zero there. Designate a value in the range from 0 to half of the sampling frequency.
UPPER_BOUND_FREQUENCY: The maximum frequency used in GHDSS processing. Frequencies above this value are not processed, and the output spectrum is zero there. LOWER_BOUND_FREQUENCY $<$ UPPER_BOUND_FREQUENCY must hold.
TF_CONJ_FILENAME: string type. The name of the file in which the transfer function database of your microphone array is saved. Refer to Section 5.3.1 for details of the file format.
INITW_FILENAME: string type. The name of the file in which the initial value of the separation matrix is described. Initializing with a separation matrix that has already converged in a preliminary computation allows precise separation from the beginning. The file given here must be prepared beforehand by running with EXPORT_W set to true. For its format, see Section 5.3.2.
SS_METHOD: string type. Select the step-size calculation method based on higher-order decorrelation. To fix the step size at a designated value, select FIX. To link it to the step size based on geometric constraints, select LC_MYU. To regulate it automatically, select ADAPTIVE.
When FIX is chosen: set SS_MYU.
SS_MYU: float type. The default value is 0.001. Designate the step size used when updating a separation matrix based on higher-order decorrelation. Setting this value and LC_MYU to zero and passing a separation matrix of the delay-and-sum beamformer type as INITW_FILENAME yields processing equivalent to delay-and-sum beamforming.
SS_SCAL: float type. The default value is 1.0. Designate the scale factor of the hyperbolic tangent function (tanh) used in the calculation of the higher-order correlation matrix. The value must be a positive real number. The smaller the value, the weaker the nonlinearity, which brings the calculation closer to an ordinary correlation matrix calculation.
NOISE_FLOOR: float type. The default value is 0. Designate the threshold value (upper limit) of the amplitude for judging the input signal as noise. When the amplitude of the input signal is equal to or less than this value, the frame is judged as a noise section and the separation matrix is not updated. When noise is large and the separation matrix becomes unstable and does not converge, designate a positive real number.
LC_CONST: string type. Select the form of the geometric constraints. Select DIAG to use only the diagonal parts (direct sound parts) of the geometric constraints. Select FULL to use the whole geometric constraints. Since nulls (blind spots) are formed automatically by higher-order decorrelation, highly precise separation can be achieved with DIAG. The default is FULL.
LC_METHOD: string type. Select the step-size calculation method based on the geometric constraints. To fix the step size at a designated value, select FIX. To regulate it automatically, select ADAPTIVE.
When FIX is chosen: set LC_MYU.
LC_MYU: float type. The default value is 0.001. Designate the step size used when updating a separation matrix based on the geometric constraints. Setting this value and SS_MYU to zero and passing a separation matrix of the delay-and-sum beamformer type as INITW_FILENAME yields processing equivalent to delay-and-sum beamforming.
UPDATE_METHOD_TF_CONJ: string type. Select ID or POS. The default value is POS. Designates whether the complex conjugate TF_CONJ of the transfer function is updated based on the ID given to each sound source (ID) or on the source position (POS).
UPDATE_METHOD_W: string type. Select ID, POS or ID_POS. The default value is ID. When the source position information changes, the separation matrix must be recalculated, and this parameter designates how such a change is detected. A separation matrix is kept for a given period of time together with its sound source ID and sound source direction. Even if a sound stops once, when a newly detected sound is judged to come from the same direction, separation resumes from the saved separation matrix. This parameter sets the criterion for deciding whether the saved separation matrix continues to be updated in that case. When ID is selected, the sound source is judged to be the same by its sound source ID. When POS is selected, it is judged by comparing the sound source directions. When ID_POS is selected, the sound source IDs are compared first, and if they differ, the sound source directions are compared as a further check.
UPDATE_ACCEPT_DISTANCE: float type. The default value is 300.0. The unit is [mm]. The allowable distance error for judging that a sound comes from the same direction when POS or ID_POS is selected for UPDATE_METHOD_TF_CONJ or UPDATE_METHOD_W.
EXPORT_W: bool type. The default value is false. Designates whether the separation matrices updated by GHDSS are output. When true, specify EXPORT_W_FILENAME.
EXPORT_W_FILENAME: string type. This parameter is valid when EXPORT_W is set to true. Designate the name of the file into which the separation matrix will be output. For its format, see Section 5.3.2.
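As an illustration of the band limit set by LOWER_BOUND_FREQUENCY and UPPER_BOUND_FREQUENCY, the following minimal Python sketch marks which frequency bins would be processed and zeroes the rest. It assumes a one-sided spectrum of LENGTH/2 + 1 bins with bin $k$ located at $k \cdot$ SAMPLING_RATE / LENGTH [Hz]. This bin layout and the helper names `active_bin_mask` and `zero_outside_band` are assumptions made for illustration and are not taken from the HARK implementation.

```python
import numpy as np

def active_bin_mask(length=512, sampling_rate=16000,
                    lower_hz=0, upper_hz=8000):
    """Boolean mask over the one-sided FFT bins that lie inside the band."""
    if not 0 <= lower_hz < upper_hz <= sampling_rate / 2:
        raise ValueError("need 0 <= lower < upper <= sampling_rate / 2")
    # Assumed bin layout: bin k sits at k * sampling_rate / length Hz.
    bin_hz = np.arange(length // 2 + 1) * sampling_rate / length
    return (bin_hz >= lower_hz) & (bin_hz <= upper_hz)

def zero_outside_band(spectrum, mask):
    """Keep the bins inside the band and set all other bins to zero."""
    return np.where(mask, spectrum, 0.0)

# Example: restrict processing to 500 Hz .. 4000 Hz.
mask = active_bin_mask(lower_hz=500, upper_hz=4000)
banded = zero_outside_band(np.ones(257, dtype=complex), mask)
```

With the default values (0 Hz to 8000 Hz at a 16 kHz sampling rate), every bin is inside the band.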
Formulation of sound source separation: Table 6.49 shows the symbols used to formulate the sound source separation problem. The meaning of the indices is given in Table 6.1. Since the calculation is performed in the frequency domain, the symbols generally denote complex numbers in the frequency domain. Parameters other than transfer functions generally vary with time, but when a calculation stays within the same time frame, the time index $f$ is omitted. Moreover, the following calculation is described for one frequency bin $k_i$. In practice, the calculation is performed for each of the $K$ frequency bins $k_0, \dots, k_{K-1}$.
| Parameter | Description |
|---|---|
| $\boldsymbol{S}(k_i) = \left[S_1(k_i), \dots, S_N(k_i)\right]^T$ | The sound source complex spectrum corresponding to the frequency bin $k_i$. |
| $\boldsymbol{X}(k_i) = \left[X_1(k_i), \dots, X_M(k_i)\right]^T$ | The vector of microphone observation complex spectra, which corresponds to INPUT_FRAMES. |
| $\boldsymbol{N}(k_i) = \left[N_1(k_i), \dots, N_M(k_i)\right]^T$ | The additive noise that acts on each microphone. |
| $\boldsymbol{H}(k_i) = \left[H_{m,n}(k_i)\right]$ | The transfer function matrix including reflection and diffraction ($M \times N$). |
| $\boldsymbol{H}_D(k_i) = \left[H_{Dm,n}(k_i)\right]$ | The transfer function matrix of direct sound ($M \times N$). |
| $\boldsymbol{W}(k_i) = \left[W_{n,m}(k_i)\right]$ | The separation matrix ($N \times M$). |
| $\boldsymbol{Y}(k_i) = \left[Y_1(k_i), \dots, Y_N(k_i)\right]^T$ | The separated sound complex spectrum. |
| $\mu_{SS}$ | The step size used when updating a separation matrix based on higher-order decorrelation, which corresponds to SS_MYU. |
| $\mu_{LC}$ | The step size used when updating a separation matrix based on geometric constraints, which corresponds to LC_MYU. |
The sound that is emitted from $N$ sound sources is affected by the transfer function $\boldsymbol {H}(k_ i)$ in space and observed through $M$ microphones as expressed by Equation (48).
$\displaystyle \boldsymbol{X}(k_i) = \boldsymbol{H}(k_i)\boldsymbol{S}(k_i) + \boldsymbol{N}(k_i)$ (48)
The transfer function $\boldsymbol{H}(k_i)$ generally varies with the shape of the room and the positional relation between microphones and sound sources, and is therefore difficult to estimate. However, if acoustic reflection and diffraction are ignored and the relative positions of the microphones and the sound source are known, the transfer function limited to the direct sound, $\boldsymbol{H}_D(k_i)$, is calculated as expressed in Equation (49).
$\displaystyle H_{Dm,n}(k_i) = \exp\left(-j 2\pi l_i r_{m,n}\right)$ (49)
$\displaystyle l_i = \frac{2\pi \omega_i}{c}$ (50)
Here, $c$ indicates the speed of sound and $l_i$ is the wave number corresponding to the frequency $\omega_i$ of the frequency bin $k_i$. Moreover, $r_{m,n}$ indicates the difference between the distance from the microphone $m$ to the sound source $n$ and the distance from the reference point of the coordinate system (e.g. the origin) to the sound source $n$. In other words, $\boldsymbol{H}_D(k_i)$ is defined as the phase difference caused by the difference in arrival time from the sound source to each microphone.
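The geometric definition of $\boldsymbol{H}_D(k_i)$ can be sketched in Python as follows. This is a minimal illustration rather than the HARK implementation: it uses the standard free-field delay phase $2\pi f r_{m,n}/c$ with $f$ in Hz and a negative exponent, which is an assumption about the sign and scaling conventions of Equations (49) and (50). The function name `direct_tf`, positions given in meters, and the speed of sound of 343 m/s are likewise assumptions.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value

def direct_tf(mic_pos, src_pos, freq_hz, ref=(0.0, 0.0, 0.0),
              c=SPEED_OF_SOUND):
    """Direct-sound transfer matrix H_D (M x N) for one frequency bin."""
    mic_pos = np.asarray(mic_pos, dtype=float)   # (M, 3) microphone positions
    src_pos = np.asarray(src_pos, dtype=float)   # (N, 3) source positions
    # Distance from each microphone to each source, shape (M, N).
    d_mic = np.linalg.norm(mic_pos[:, None, :] - src_pos[None, :, :], axis=2)
    # Distance from the reference point to each source, shape (N,).
    d_ref = np.linalg.norm(src_pos - np.asarray(ref, dtype=float), axis=1)
    r = d_mic - d_ref[None, :]                   # path difference r_{m,n}
    # Pure phase term: a longer path gives a more negative phase.
    return np.exp(-2j * np.pi * freq_hz * r / c)

# Two microphones on the x axis, one source 1 m away on the same axis.
H_D = direct_tf([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0]], [[1.0, 0.0, 0.0]], 1000.0)
```

Every entry has unit magnitude, and a microphone placed at the reference point has zero phase, because only the arrival-time difference is modeled.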
The vector of complex spectra of the separated sounds, $\boldsymbol{Y}(k_i)$, is obtained from the following equation.
$\displaystyle \boldsymbol{Y}(k_i) = \boldsymbol{W}(k_i)\boldsymbol{X}(k_i)$ (51)
The GHDSS algorithm estimates the separation matrix $\boldsymbol{W}(k_i)$ so that $\boldsymbol{Y}(k_i)$ approaches $\boldsymbol{S}(k_i)$.
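Applying Equation (51) to every frequency bin can be sketched as below. The array layout follows the input and output descriptions of this node: the input spectrum has one row per channel and one column per frequency bin, and the result maps each sound source ID to a one-channel spectrum. Storing the separation matrices as a $(K, N, M)$ array and the function name `separate` are assumptions made for this sketch.

```python
import numpy as np

def separate(W, X, source_ids):
    """Apply Y(k) = W(k) X(k) in every frequency bin.

    W: (K, N, M) separation matrices, one per frequency bin.
    X: (M, K) input spectra (rows: channels, columns: frequency bins).
    source_ids: N integer sound source IDs.
    Returns a dict {source ID: (K,) separated spectrum}.
    """
    # For each bin k: Y[n, k] = sum over m of W[k, n, m] * X[m, k].
    Y = np.einsum("knm,mk->nk", W, X)
    return {sid: Y[n] for n, sid in enumerate(source_ids)}

# Three bins, two microphones, two sources; identity matrices pass X through.
X = np.array([[1 + 1j, 2, 3], [4, 5j, 6]])
W = np.stack([np.eye(2, dtype=complex)] * 3)
separated = separate(W, X, [7, 9])
```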
The algorithm assumes that the following information is already known.
The number of sound sources $N$
Source position (The LocalizeMUSIC node estimates source location in HARK)
Microphone position
Transfer function of the direct sound component $\boldsymbol {H}_ D(k_ i)$ (measurement or approximation by Equation (49))
The following information is unknown.
Actual transfer function at the time of an observation $\boldsymbol {H}(k_ i)$
Observation noise $\boldsymbol {N}(k_ i)$
GHDSS estimates $\boldsymbol {W}(k_ i)$ so that the following conditions are satisfied.
Higher-order decorrelation of the separated signals
In other words, the off-diagonal components of the higher-order correlation matrix $\boldsymbol{R}^{\phi(y)y}(k_i) = E[\phi(\boldsymbol{Y}(k_i)) \boldsymbol{Y}^H(k_i)]$ of the separated sound $\boldsymbol{Y}(k_i)$ are driven to 0. Here, the operators $^H$, $E[]$ and $\phi()$ indicate the Hermitian transpose, the time-average operator and a nonlinear function, respectively. This node uses the hyperbolic tangent function defined as follows.
$\displaystyle \phi(\boldsymbol{Y}) = [\phi(Y_1), \phi(Y_2), \dots, \phi(Y_N)]^T$ (52)
$\displaystyle \phi(Y_k) = \tanh(\sigma |Y_k|) \exp(j\angle(Y_k))$ (53)
Here, $\sigma$ indicates a scaling factor (corresponds to SS_SCAL).
The direct sound component is separated without distortion (geometric constraints)
The product of the separation matrix $\boldsymbol{W}(k_i)$ and the transfer function of the direct sound $\boldsymbol{H}_D(k_i)$ is made a unit matrix ($\boldsymbol{W}(k_i)\boldsymbol{H}_D(k_i) = \boldsymbol{I}$).
The evaluation function that combines the two conditions above is as follows. For simplicity, the frequency bin $k_i$ is omitted.
$\displaystyle J(\boldsymbol{W}) = \alpha J_1(\boldsymbol{W}) + \beta J_2(\boldsymbol{W})$ (54)
$\displaystyle J_1(\boldsymbol{W}) = \sum_{i \neq j} \left| R^{\phi(y)y}_{i,j} \right|^2$ (55)
$\displaystyle J_2(\boldsymbol{W}) = \left\| \boldsymbol{W}\boldsymbol{H}_D - \boldsymbol{I} \right\|^2$ (56)
Here, $\alpha$ and $\beta$ are weighting factors. Moreover, the norm of a matrix is defined as $\| \boldsymbol{M} \|^2 = \mathrm{tr}(\boldsymbol{MM}^H) = \sum_{i,j} |m_{i,j}|^2$.
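The two terms of the evaluation function can be sketched in Python as follows. The sketch assumes that the tanh of Equation (53) acts on the magnitude $|Y_k|$ and that the expectation in $\boldsymbol{R}^{\phi(y)y}$ is replaced by a one-frame estimate $\phi(\boldsymbol{Y})\boldsymbol{Y}^H$. The function names `phi`, `cost_decorrelation` and `cost_geometric` are hypothetical.

```python
import numpy as np

def phi(Y, sigma=1.0):
    """Nonlinearity: tanh(sigma * |Y_k|) * exp(j * angle(Y_k)), elementwise."""
    Y = np.asarray(Y, dtype=complex)
    return np.tanh(sigma * np.abs(Y)) * np.exp(1j * np.angle(Y))

def cost_decorrelation(Y, sigma=1.0):
    """J1: squared magnitudes of the off-diagonal entries of phi(Y) Y^H."""
    R = np.outer(phi(Y, sigma), np.conj(Y))   # one-frame estimate, (N, N)
    off_diag = R - np.diag(np.diag(R))
    return float(np.sum(np.abs(off_diag) ** 2))

def cost_geometric(W, H_D):
    """J2: squared Frobenius norm of W H_D - I."""
    E = W @ H_D - np.eye(W.shape[0])
    return float(np.sum(np.abs(E) ** 2))

# Four microphones, two sources, orthogonal unit-modulus columns.
H_D = np.array([[1, 1], [1, 1j], [1, -1], [1, -1j]])
j2_at_solution = cost_geometric(H_D.conj().T / 4, H_D)   # W H_D = I here
j1_single_source = cost_decorrelation(np.array([1 + 1j, 0]))
```

Both example values are zero: the first because $\boldsymbol{W}\boldsymbol{H}_D = \boldsymbol{I}$ holds exactly, the second because only one output is active and no cross term exists. For small `sigma`, `phi` is close to a linear scaling of its input, which matches the description of SS_SCAL.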
An update equation of the separation matrix to minimize Equation (54) is obtained by the gradient method that uses the complex gradient calculation $\frac{\partial }{\partial \boldsymbol {W}^*}$.
$\displaystyle \boldsymbol{W}(k_i, f+1) = \boldsymbol{W}(k_i, f) - \mu \frac{\partial J}{\partial \boldsymbol{W}^*}(\boldsymbol{W}(k_i, f))$ (57)
Here, $\mu$ indicates a step size regulating the amount by which the separation matrix is updated. Usually, obtaining the complex gradient on the right-hand side of Equation (57) requires values from multiple frames to calculate expectations such as $\boldsymbol{R}^{xx} = E[\boldsymbol{XX}^H]$ and $\boldsymbol{R}^{yy} = E[\boldsymbol{YY}^H]$. The GHDSS node does not calculate these autocorrelation matrices. Instead, it uses Equation (58), which requires only one frame.
$\displaystyle \boldsymbol{W}(k_i, f+1) = \boldsymbol{W}(k_i, f) - \left[ \mu_{SS} \frac{\partial J_1}{\partial \boldsymbol{W}^*}(\boldsymbol{W}(k_i, f)) + \mu_{LC} \frac{\partial J_2}{\partial \boldsymbol{W}^*}(\boldsymbol{W}(k_i, f)) \right]$ (58)
$\displaystyle \frac{\partial J_1}{\partial \boldsymbol{W}^*}(\boldsymbol{W}) = \left( \phi(\boldsymbol{Y})\boldsymbol{Y}^H - \mathrm{diag}[\phi(\boldsymbol{Y})\boldsymbol{Y}^H] \right) \tilde{\phi}(\boldsymbol{W}\boldsymbol{X})\boldsymbol{X}^H$ (59)
$\displaystyle \frac{\partial J_2}{\partial \boldsymbol{W}^*}(\boldsymbol{W}) = 2\left( \boldsymbol{W}\boldsymbol{H}_D - \boldsymbol{I} \right)\boldsymbol{H}_D^H$ (60)
Here, $\tilde{\phi}$ is a partial derivative of $\phi$ and is defined as follows.
$\displaystyle \tilde{\phi}(\boldsymbol{Y}) = [\tilde{\phi}(Y_1), \tilde{\phi}(Y_2), \dots, \tilde{\phi}(Y_N)]^T$ (61)
$\displaystyle \tilde{\phi}(Y_k) = \phi(Y_k) + Y_k \frac{\partial \phi(Y_k)}{\partial Y_k}$ (62)
Moreover, $\mu_{SS} = \mu \alpha$ and $\mu_{LC} = \mu \beta$ are the step sizes based on higher-order decorrelation and geometric constraints, respectively. When the step sizes are regulated automatically, they are calculated by the following equations.
$\displaystyle \mu_{SS} = \frac{J_1(\boldsymbol{W})}{2 \left\| \frac{\partial J_1}{\partial \boldsymbol{W}}(\boldsymbol{W}) \right\|^2}$ (63)
$\displaystyle \mu_{LC} = \frac{J_2(\boldsymbol{W})}{2 \left\| \frac{\partial J_2}{\partial \boldsymbol{W}}(\boldsymbol{W}) \right\|^2}$ (64)
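The geometric-constraint part of the update, with either a fixed or an automatically regulated step size, can be sketched as follows. This is a minimal illustration under stated assumptions: only the $J_2$ term of Equation (58) is applied, the gradient norm in Equation (64) is evaluated with the gradient of Equation (60), and the small constant `eps` guarding against division by zero is an addition of this sketch. All function names are hypothetical.

```python
import numpy as np

def cost_geometric(W, H_D):
    """J2: squared Frobenius norm of W H_D - I."""
    E = W @ H_D - np.eye(W.shape[0])
    return float(np.sum(np.abs(E) ** 2))

def grad_geometric(W, H_D):
    """Gradient of J2 as in Equation (60): 2 (W H_D - I) H_D^H."""
    E = W @ H_D - np.eye(W.shape[0])
    return 2.0 * E @ H_D.conj().T

def adaptive_mu_lc(W, H_D, eps=1e-12):
    """Step size as in Equation (64): J2 / (2 * ||grad J2||^2)."""
    G = grad_geometric(W, H_D)
    return cost_geometric(W, H_D) / (2.0 * np.sum(np.abs(G) ** 2) + eps)

def lc_step(W, H_D, mu_lc=None):
    """One geometric-constraint update; mu_lc=None selects the adaptive step."""
    if mu_lc is None:
        mu_lc = adaptive_mu_lc(W, H_D)
    return W - mu_lc * grad_geometric(W, H_D)

H_D = np.array([[1, 1], [1, 1j], [1, -1], [1, -1j]])   # 4 mics, 2 sources
W0 = np.zeros((2, 4), dtype=complex)
W_adaptive = lc_step(W0, H_D)           # corresponds to LC_METHOD = ADAPTIVE
W_fixed = lc_step(W0, H_D, mu_lc=0.001) # corresponds to LC_METHOD = FIX
```

For this example, whose transfer function columns are orthogonal, one adaptive step from a zero matrix reduces $J_2$ to 9/16 of its previous value, while the fixed step of 0.001 reduces it only slightly.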
The indices $(k_i, f)$ of each parameter in Equations (59) and (60) are omitted above. The initial value of the separation matrix is obtained as follows.
$\displaystyle \boldsymbol{W}(k_i) = \boldsymbol{H}_D^H(k_i) / M$ (65)
Here, $M$ indicates the number of microphones.
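The initialization of Equation (65) can be sketched in Python as below. The function name `init_separation_matrix` and the example phases are assumptions for illustration. Because each entry of $\boldsymbol{H}_D$ has unit magnitude, the diagonal of $\boldsymbol{W}\boldsymbol{H}_D$ equals 1, so a single source passes through undistorted, as in a delay-and-sum beamformer.

```python
import numpy as np

def init_separation_matrix(H_D):
    """Initial separation matrix W = H_D^H / M (N x M)."""
    M = H_D.shape[0]                 # number of microphones
    return H_D.conj().T / M

# Phase-only transfer functions for 3 microphones and 2 sources.
phases = np.array([[0.0, 0.3], [0.5, 1.1], [1.0, 2.0]])
H_D = np.exp(1j * phases)
W_init = init_separation_matrix(H_D)

# A single active source n = 0 with spectrum value s is recovered at output 0.
s = 2 - 1j
X = H_D[:, 0] * s
Y = W_init @ X
```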
Processing flow
The main processing for time frame $f$ in the GHDSS node is shown in Figure 6.57. The processing consists of the following steps:
Acquiring a transfer function (direct sound)
Estimating the separation matrix $\boldsymbol {W}$
Performing sound source separation in accordance with Equation (51)
Writing of a separation matrix (When EXPORT_W is set to true)
Acquiring a transfer function: At the first frame, the database specified by TF_CONJ_FILENAME is searched for the transfer function closest to the localization result.
Processing from the second frame onward is as follows.
UPDATE_METHOD_TF_CONJ decides whether the transfer function selected in the previous frame is reused in the current frame. If it is not reused, a new transfer function is loaded from the file.
UPDATE_METHOD_TF_CONJ is ID
The acquired ID is compared with the ID of the previous frame.
Same: Carry over (reuse the previous transfer function)
Different: Read (load a new transfer function)
UPDATE_METHOD_TF_CONJ is POS
The acquired direction is compared with the sound source direction of the previous frame.
Error is less than UPDATE_ACCEPT_DISTANCE: Carry over (reuse the previous transfer function)
Error is more than UPDATE_ACCEPT_DISTANCE: Read (load a new transfer function)
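The selection of a transfer function for UPDATE_METHOD_TF_CONJ = POS can be sketched as follows. The sketch assumes that the database is available as a list of 3-D positions in millimeters and that "closest" means the smallest Euclidean distance. The in-memory layout and the function names `nearest_tf_index` and `select_tf_pos` are hypothetical and are unrelated to the actual file format described in Section 5.3.1. A distance exactly equal to the threshold is treated as a change here, which the text above leaves unspecified.

```python
import numpy as np

def nearest_tf_index(db_positions_mm, source_position_mm):
    """Index of the database entry closest to the localization result."""
    db = np.asarray(db_positions_mm, dtype=float)       # (P, 3)
    src = np.asarray(source_position_mm, dtype=float)   # (3,)
    return int(np.argmin(np.linalg.norm(db - src, axis=1)))

def select_tf_pos(db_positions_mm, cur_pos_mm, prev_pos_mm=None,
                  prev_index=None, accept_mm=300.0):
    """Return (index, reused) for the current frame."""
    if prev_pos_mm is not None and prev_index is not None:
        moved = np.linalg.norm(np.asarray(cur_pos_mm, dtype=float)
                               - np.asarray(prev_pos_mm, dtype=float))
        if moved < accept_mm:
            return prev_index, True      # carry over the previous entry
    # First frame, or the source moved too far: search the database again.
    return nearest_tf_index(db_positions_mm, cur_pos_mm), False

db = [[1000, 0, 0], [0, 1000, 0], [-1000, 0, 0]]
first = select_tf_pos(db, [900, 100, 0])
small_move = select_tf_pos(db, [800, 200, 0], [900, 100, 0], first[0])
large_move = select_tf_pos(db, [0, 1000, 0], [800, 200, 0], small_move[0])
```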
Estimating the separation matrix: The initial value of the separation matrix depends on whether the user designates a value for the parameter INITW_FILENAME.
When the parameter INITW_FILENAME is not designated, the separation matrix $\boldsymbol{W}$ is calculated from the transfer function $\boldsymbol{H}_D$.
When the parameter INITW_FILENAME is designated, the designated separation matrix file is searched for the data closest to the direction of the input source localization result.
Processing from the second frame onward is as follows.
The flow for estimating the separation matrix is shown in Figure 6.58. Here, either the separation matrix of the previous frame is updated based on Equation (58), or an initial value of the separation matrix is derived from the transfer function based on Equation (65).
When the source localization information of the previous frame shows that a sound source has disappeared, the separation matrix is reinitialized.
When the number of sound sources does not change, the processing branches according to the value of UPDATE_METHOD_W. The sound source ID and localization direction of the previous frame are compared with those of the current frame to determine whether the separation matrix continues to be used or is initialized.
UPDATE_METHOD_W is ID
Compare with the ID of the previous frame
Same: Update $\boldsymbol{W}$
Different: Initialize $\boldsymbol{W}$
UPDATE_METHOD_W is POS
Compare with the localization direction of the previous frame
Error is less than UPDATE_ACCEPT_DISTANCE: Update $\boldsymbol{W}$
Error is more than UPDATE_ACCEPT_DISTANCE: Initialize $\boldsymbol{W}$
UPDATE_METHOD_W is ID_POS
Compare with the ID of the previous frame
Same: Update $\boldsymbol{W}$
Localization directions are compared when IDs are different
Error is less than UPDATE_ACCEPT_DISTANCE: Update $\boldsymbol{W}$
Error is more than UPDATE_ACCEPT_DISTANCE: Initialize $\boldsymbol{W}$
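The three UPDATE_METHOD_W branches above can be condensed into one decision function, sketched here in Python. The `Source` class is a hypothetical stand-in for HARK's Source object, holding only an ID and a position in millimeters, and `keep_separation_matrix` is an illustrative name. A distance exactly equal to UPDATE_ACCEPT_DISTANCE is treated as a change, which the text leaves unspecified.

```python
import math
from dataclasses import dataclass

@dataclass
class Source:
    """Hypothetical stand-in for a localization result."""
    id: int
    pos: tuple  # (x, y, z) in mm

def keep_separation_matrix(prev, cur, method="ID", accept_mm=300.0):
    """True: continue updating W. False: re-initialize W."""
    same_id = prev.id == cur.id
    close = math.dist(prev.pos, cur.pos) < accept_mm
    if method == "ID":
        return same_id
    if method == "POS":
        return close
    if method == "ID_POS":
        # Positions are consulted only when the IDs differ.
        return same_id or close
    raise ValueError(f"unknown UPDATE_METHOD_W: {method}")

prev = Source(0, (1000.0, 0.0, 0.0))
same_id_far = Source(0, (0.0, 1000.0, 0.0))
new_id_near = Source(1, (1000.0, 100.0, 0.0))
new_id_far = Source(1, (0.0, 1000.0, 0.0))
```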
Writing of a separation matrix (when EXPORT_W is set to true): When EXPORT_W is set to true, a converged separation matrix is output to the file designated by EXPORT_W_FILENAME.
When multiple sound sources are detected, all of their separation matrices are output to one file. A separation matrix is written to the file when its sound source disappears.
When writing to the file, whether to overwrite an existing entry or to add the sound source as a new entry is determined by comparing its localization direction with those of the sound sources already saved.
Disappearance of sound source
Compare with the localization direction of the sound sources already saved
Error is less than UPDATE_ACCEPT_DISTANCE: Overwrite and save $\boldsymbol{W}$
Error is more than UPDATE_ACCEPT_DISTANCE: Save $\boldsymbol{W}$ as an additional entry
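The overwrite-or-append rule for exported separation matrices can be sketched as follows. The in-memory list of (position, matrix) pairs and the function name `store_separation_matrix` are hypothetical and do not reflect the file format of Section 5.3.2. Overwriting the first saved entry within the threshold and keeping its stored position are assumptions of this sketch.

```python
import math

def store_separation_matrix(saved, pos_mm, W, accept_mm=300.0):
    """Overwrite a nearby saved entry, or append a new one.

    saved: list of (position, W) pairs, modified in place and returned.
    """
    for i, (saved_pos, _) in enumerate(saved):
        if math.dist(saved_pos, pos_mm) < accept_mm:
            saved[i] = (saved_pos, W)    # same direction: overwrite
            return saved
    saved.append((pos_mm, W))            # new direction: add an entry
    return saved

saved = []
store_separation_matrix(saved, (1000.0, 0.0, 0.0), "W_a")
store_separation_matrix(saved, (1000.0, 100.0, 0.0), "W_b")  # 100 mm away
store_separation_matrix(saved, (0.0, 1000.0, 0.0), "W_c")    # far away
```

After these three calls the list holds two entries: the first was overwritten with "W_b", and "W_c" was added as a new entry.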