Overview¶

HARK-Binaural+ includes binaural signal processing nodes for robot audition software HARK. You can also smoothly connect HARK-Binaural+ and other HARK packages such as HARK-FD or HARK-SSS packages.

Getting Started¶

1. Add HARK repository and install Basic HARK Packages. See HARK installation instructions for details.

2. Install HARK-Binaural+

sudo apt-get install hark-binaural+


Binaural Signal Processing Nodes¶

This section describes following 5 signal processing nodes for binaural robot audition:

1. BinauralMultisourceLocalization node
2. BinauralMultisourceTracker node
3. SourceSeparation node
4. SpeechEnhanvement node
5. VoiceActivityDetection node

BinauralMultisourceLocalization Node¶

Outline of the node¶

This node identify the location of sound sources in direction with two microphones.

Typical connection¶

The type of the input is multi-channel (2-ch) audio spectrum and that of the output is a list of localized directions of sound sources. Typical connection of this node is depicted as follows:

Input-output and property of the node¶

Input¶

AUDIO_SPECTRUM Matrix<complex<float> >
Windowed spectrum data. A row index is channel, and a column index is frequency.

Output¶

ESTIMATED_DOA Matrix<complex<float> >
Estimated directions of multi-sources.

Parameters¶

Parameters of this node are listed as follows:

Parameter name Type Default value Unit Description
SOUND_DIFFRACTION_COMPENSATION string FREE_SPACE   Compensation for the diffraction of sound waves with multi-path interference caused by contours of a spherical robot head.
GCCPHAT_THRESHOLD float 0.25   Threshold to avoid estimating DOAs of noise and reverberated sources.
MICS_DISTANCE float 17.4 cm Distance between two microphone.
NOISE_DURATION float 1.0 second Time duration to be regarded as “noise” from the first frame.
ADVANCE int 160 sample The length in sample between a frame and a previous frame.
SAMPLING_RATE int 16000 Hz Sampling rate.

Detail of the node¶

This node localizes sound sources based on generalized cross-correlation method weighted by the phase transform (GCC-PHAT) [1]. Although the basic GCC-PHAT algorithm assumes the number of sound sources is one, we maintain multiple sound sources by using dynamic K-means clustering [2] that is implemented in BinauralMultisourceTracker node. The localization performance is improved by signal-to-noise ratio (SNR)-weighting [3].

Let $$\hat{\Theta}_{mel} = \{\hat{\theta}_1, \hat{\theta}_2, \hat{\theta}_3, \cdots\}$$ be directions of sound sources localized, the binaural sound source localization is conducted following process:

$$\hat{\Theta}_{mel} = \mathop{\rm argmax}_{\theta} \frac{1}{F} \sum^F_{f=1} \frac{SNR_{inst}[f,n]}{1 + SNR_{inst}[f,n]} \cdot \frac{X_l[f,n]X_r^{\ast}[f,n]}{\left|X_l[f,n]X_r^{\ast}[f,n]\right|} \exp{\left( j2 \pi \frac{f}{F} fs \tau_{multi}(\theta) \right)},$$

$$SNR_{inst}\left[f,n\right]=\frac{\left|X_l\left[f,n\right]X_r^{\ast}\left[f,n\right]\right|-E\left[\left|N_l\left[f,n\right]N_r^{\ast}\left[f,n\right]\right|\right]} {E\left[\left|N_l\left[f,n\right]N_r^{\ast}\left[f,n\right]\right|\right]}$$, and

$$\tau_{multi}(\theta)=\frac{d_{lr}}{2v} \left( \frac{\theta}{180}\pi + {\rm sin}\left( \frac{\theta}{180}\pi \right) \right) - \frac{d_{lr}}{2v}\left({\rm sgn}(\theta)\pi - \frac{2\theta}{180}\pi \right) \cdot \left|\beta_{multi} {\rm sin}\left( \frac{\theta}{180}\pi\right)\right|$$

where $$X\left[f, n\right]$$ represents an input audio signal at frequency bin $$f$$ and time frame $$n$$.

References¶

 [1] Knapp and G. C. Carter, “The Generalized Correlation Method for Estimation of Time Delay,” IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. 24, no. 4, pp. 320-327, 1976.
 [2] Kim and H. G. Okuno, “Improved Binaural Sound Localization and Tracking for Unknown Time-Varying Number of Speakers,” Advanced Robotics, vol. 27, no. 15, pp. 1161-1173, July 2013.
 [3] Kim, K. Nakadai, and H. G. Okuno, “Improved Sound Source Localization in Horizontal Plane for Binaural Robot Audition,” Applied Intelligence, Springer, Vol., No., accepted on March 25, 2014.

BinauralMultisourceTracker Node¶

Outline of the node¶

This node tracks the sound source locations estimated by BinauralMultisourceTracker node.

Typical connection¶

See Typical connection of the BinauralMultisourceLocalization node. To maintain multiple sound sources, this node is connected from the BinauralMultisourceLocalization node, and clusters the output of the localization node based on dynamic K-means clustering

Input-output and property of the node¶

Input¶

ESTIMATED_DOA Vector<ObjectRef>
Estimated directions of multisource.

Output¶

TRACKED_DOA
Tracked directions of multisource.

Parameters¶

Parameters of this node are listed as follows:

Parameter name Type Default value Unit Description
CLUSTERING_DURATION float 0.25 second Time duration to cluster direction estimations.
ADVANCE int 160 sample The length in sample between a frame and a previous frame.
SAMPLING_RATE int 16000 Hz Sampling rate.

SourceSeparation Node¶

Outline of the node¶

This node conducts blind sound source separation based on independent vector analysis.

Typical connection¶

The type of both the input and output of SourceSeparation node is multi-channel (2-ch) audio spectrum. Typical connection of this node is depicted as follows:

Input-output and property of the node¶

Input¶

INPUT_AUDIO_SPECTRUM Matrix<complex<float> >
Windowed spectrum data. A row index is channel, and a column index is frequency.

Output¶

OUTPUT_AUDIO_SPECTRUM Matrix<complex<float> >
Windowed and speech-enhanced spectrum data . A row index is channel, and a column index is frequency.

Parameters¶

Parameters of this node are listed as follows:

Parameter name Type Default value Unit Description
FFT_LENGTH int 512 sample Analysis frame length.
ITERATION_METHOD string FastIVA   Iteration method.
MAX_ITERATION int 700   Processing limitation: maximum number of iterations.
NUMBER_OF_SOURCE_TO_BE_SEPARATED int 2   Number of sound sources to be separated.
SEPARATION_TIME_LENGTH float 5.0 second Separation window length.
ADVANCE int 160 sample The length in sample between a frame and a previous frame.
SAMPLING_RATE int 16000 Hz Sampling rate.

Details of the node¶

This module conducts recovery of the original sound signals from the combined sound signal by using independent vector analysis (IVA) [4] or Fast independent vector analysis (Fast-IVA) [5]. In the case of IVA, the objective function is Kullback-Leibler (KL) divergence:

$$C={\rm constant}- \sum^F_f {\rm log}\left|{\rm det} W_{mkf}\right| - \sum^M_m E\left[{\rm log}P \left( \hat{S}_1, \cdots ,\hat{S}_M \right)\right]$$

where $$\hat{S}_m (m = 1, \cdots, M)$$ and $$W_{mkf}$$ represent the input signal of m-th microphone and the separation matrix of IVA, respectively. The lerning algorithm of IVA is based on natural gradient-descent method:

$$W^{new}_{mkf}=W^{old}_{mkf} + \eta \sum^K_k \left( I_{mk} - E \left[ \frac{\hat{S}_{kf}}{\sqrt{\sum^F_f \left| \hat{S}_{kf} \right|^2}} \hat{S}_{kf}^{\ast} \right] \right) W^{old}_{mkf}$$

where $$\eta$$ is learning rate (set at 0.1)

In the case of Fast-IVA, following modified objective function based KL divergence on is used:

$$C=-\sum^M_m E\left[{\rm log}P \left( \hat{S}_1, \cdots ,\hat{S}_M \right)\right] - \sum^M_m \beta\left[W^T_{mkf}W^{new}_{mkf}-1\right]$$,

where $$\beta$$ is Langrangian multiplier. The learning algorithm, on the other hand, is based on newton method with fixed point iteration:

$$W^{new}_{mkf}= E\left[\frac{1}{\sqrt{\sum^F_f \left|\hat{S}_{kf}\right|^2}} - \frac{\hat{S}^2_{kf}}{\left( \sqrt{\sum^F_f \left|\hat{S}_{kf}\right|^2}\right) ^3}\right] W^{old}_{mkf} -E\left[\frac{\hat{S}_{kf}}{\sqrt{\sum^F_f \left|\hat{S}_{kf}\right|^2}} X_{kf}\right]$$

References¶

 [4] Kim, H. T. Attias, S. Lee, and T. Lee, “Blind Source Separation Exploiting Higher-Order Frequency Dependencies,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 1, pp. 70–79, January 2007.
 [5] Lee, T. Kim, and T. Lee, “Fast fixed-point independent vector analysis algorithms for convolutive blind source separation” Signal Processing, vol. 87, no. 8, pp. 1859–1871, August 2007.

SpeechEnhancement Node¶

Outline of the node¶

This node improves the quality of the sound signal including a speech degraded by noise.

Typical connection¶

The type of both the input and output of this node is multi-channel audio spectrum. Typical connection of this node is depicted as follows:

Input-output and property of the node¶

Input¶

NOISY_SPEECH_SPECTRUM Matrix<complex<float> >
Windowed spectrum data. A row index is channel, and a column index is frequency.

Output¶

SPEECH_EHANCED_SPECTRUM Matrix<complex<float> >
Windowed and speech-enhanced spectrum data . A row index is channel, and a column index is frequency.

Parameters¶

Parameters of this node are listed as follows:

Parameter name Type Default value Unit Description
NOISE_REDUCTION_METHOD string MINIMUM_MEAN_SQUARE_ERROR   Noise reduction methods
HARMONIC_REGENERATION string USE   Harmonic regeneration after noise reduction
NOISE_PERIOD float 1.0 second Time to be regarded as “noise” from the first frame.
ADVANCE int 160 sample The length in sample between a frame and a previous frame.
SAMPLING_RATE int 16000 Hz Sampling rate

Detail of the node¶

This module supports noise reduction and harmonic regeneration by conducting spectral subtruction [6], wiener filtering [7], minimum mean squre error (MMSE) [8], two step wiener filtering (TSNR), and harmonic regeneration (HRNN) methods [9]. Let $$X\left[f, n\right]$$ be an input audio signal at frequency bin $$f$$ and time frame $$n$$, each method is conducted following processing.

Spectral subtraction: $$\hat{S}\left[f,n\right]=\left( \left|X\left[f,n\right]\right|^2 - \hat{\gamma}_N \right)$$

Wiener filtering: $$G_{Wiener}\left[f,n\right] = \hat{SNR}_{Decision-Directed}\left[f,n\right]/ \left( \hat{SNR}_{Decision-Directed}\left[f,n\right]+1 \right)$$

MMSE: $$G_{MMSE}\left[f,n\right] = gamma\left(1.5\right)\left(V_k\right)^{0.5} / \hat{SNR}_{instantaneous}\left[f,n\right]\cdot {\rm exp}(-V_k/2)\cdot (1+V_k) {\rm Bessel_0}\left(V_k /2\right) + V_k {\rm Bessel}_1\left(V_k /2\right)$$,

where $$V_k = G_{Wiener}\left[n\right] \cdot \hat{SNR}_{instantaneous}\left[n\right]$$

TSNR: $$G_{TSNR}\left[f,n\right] = \hat{SNR}_{TSNR}\left[f,n\right]/ \left(\hat{SNR}_{TSNR}\left[f,n\right] + 1\right)$$ , where $$\hat{SNR}_{TSNR}\left[f,n\right] = {\left|G_{Decision-Directed}\left[f,n\right] X\left[f,n\right]\right|}^2 / \hat{\gamma}_N$$

References¶

 [6] Lim and A.V. Oppenheim, “Enhancement and Bandwidth Compression of Noisy Speech,” Proceedings of the IEEE, vol. 67, no. 12, pp. 1586-1604, December 197.
 [7] MacAulay and M. L. Malpass, “Speech enhancement using a soft-decision noise suppression filter,” IEEE Transactions on Acoustics, Speech, Signal Processing, vol. ASSP-28, no. 2, pp. 137-145, April 1980.
 [8] Ephraim and D. Malah, “Speech enhancement using a minimum mean square error short-time spectral amplitude estimator,” IEEE Transactions on Acoustics, Speech, Signal Processing, vol. 32, no. 6, pp. 1109-1121, December. 1984.
 [9] Plapous, C. Marro, and P. Scalart, “Improved Signal-to-Noise Ratio Estimation for Speech Enhancemen”, IEEE Transactions on Audio, Speech & Language Processing, pp.2098-2108, 2006.

VoiceActivityDetection Node¶

Outline of the node¶

This node delimits the a speech-present period.

Typical connection¶

This node is connected with the VoiceActivityDetection node. Typical connection of this node is depicted as follows:

Input-output and property of the node¶

Input¶

AUDIO_SPECTRUM Matrixd<complex<float> >
Windowed spectrum data. A row index is channel, and a column index is frequency.

Output¶

Decision of speech-present frame

Parameters¶

Parameters of this node are listed as follows:

Parameter name Type Default value Unit Description
VAD_NOISE_DURATION float 3.0 second Time duration to be regarded as “noise” from the first frame
VAD_THRESHOLD float 50.0   Threshold for voice activity decision.
ADVANCE int 160 sample The length in sample between a frame and a previous frame.
SAMPLING_RATE int 16000 Hz Sampling rate.

Detail of the node¶

This node estimates the voice activity by using log likelifood ratio of speech and noise variances of the zero-mean Gaussian statistical model [10]. Let $$X_{l/r}\left[f, n\right]$$ be an input audio signal at frequency bin $$f$$ and time frame $$n$$, this method regards speech-present when following equation is satisfied:

$$\frac{1}{F} \sum^{F}_{f=1} \gamma \left[f,n\right]- {\rm log} \gamma \left[f,n\right] -1 > \eta_{VAD}$$,

$$\lambda_N \left[f\right] = E\left|N_l\left[f\right] \cdot N_r\left[f\right]^{\ast}\right|$$,

$$\gamma\left[f,n\right]=\left|X_l \left[f,n\right] \cdot X_r\left[f,n\right]^{\ast}\right|/ \lambda_N\left[f\right]$$,

where $$N\left[f\right]$$ and $$\eta_{VAD}$$ represent the variance of a estimated noise and threshold parameter, respectively.

References¶

 [10] Sohn, N. S. Kim, and W. Sung, “A statistical model-based voice activity detection,” Signal Processing Letters, vol. 6, no. 1, pp. 1-3, January 1999.

Visualization Nodes¶

This section describes following 4 visualization nodes for the previous 5 nodes.

1. SSLVisualization node
2. SpectrumVisualization node
4. WaveVisualization node

SSLvisualization Node¶

Outline of the node¶

This node visualizes the estimated sound source locations.

Typical connection¶

See Typical connection of the BinauralMultisourceLocalization node. This node can be connected with BinauralMultisourceTracker node. The following figure shows an example of the visualization.

Input-output and property of the node¶

Input¶

INPUT_AUDIO_SIGNAL any
Input should be Matrix<float> or Map<int,ObjectRef>. In case of Matrix<float>, the rows should be channel indices and the columns should be frequency indices. In case of Map<int,ObjectRef>, the key is a source ID and the value is a vector of audio signals (Vector<float>).
INPUT_TRACKED_DOA Vector<ObjectRef>
Estimated/tracked directions of multisource.

Output¶

OUTPUT_AUDIO_SIGNAL any
Same as INPUT_AUDIO_SIGNAL
OUTPUT_TRACKED_DOA Vector<ObjectRef>
Same as INPUT_TRACKED_DOA

Parameters¶

Parameters of this node are listed as follows:

Parameter name Type Default value Unit Description
WINDOW_NAME string Visualization of multisource directions   Window name of the time-azimuth map.
VISUALIZATION_TIME_LENGTH float 10.0 second Visualization window length to show at the same time.
ADVANCE int 160 sample The length in sample between a frame and a previous frame.
SAMPLING_RATE int 16000 Hz Sampling rate.

SpectrumVisualization Node¶

Outline of the node¶

This node visualizes the audio spectrum.

Typical connection¶

See Typical connection of the Source Separation node. Following figure shows the visualization result of spectrum on the input signal.

Input-output and property of the node¶

Input¶

AUDIO_SPECTRUM Matrixd<complex<float> >
Windowed spectrum data. A row index is channel, and a column index is frequency.

Output¶

AUDIO_SPECTRUM Matrix<complex<float> >
Same as INPUT_AUDIO_SIGNAL

Parameters¶

Parameters of this node are listed as follows:

Parameter name Type Default value Unit Description
WINDOW_NAME string Visualization of audio spectrum   Visualization of audio spectrum.
VISUALIZATION_TIME_LENGTH float 10.0 second Visualization window length to show at the same time.
ADVANCE int 160 sample The length in sample between a frame and a previous frame.
SAMPLING_RATE int 16000 Hz Sampling rate.

Outline of the node¶

This node visualizes the detected speech segments.

Typical connection¶

See Typical connection of the VoiceActivityDetection node. Following figure shows the visualization result of VAD overlayed on the input signal.

Input-output and property of the node¶

Input¶

INPUT_AUDIO_SIGNAL Matrix<complex<float> >
Input should be Matrix<float> of Map<int,ObjectRef>. In case of Matrix<float>, the rows should be channel indices and the columns should be frequency indices. In case of Map<int,ObjectRef>, the key is a source ID and the value is a vector of audio signals (Vector<float>).
Decision of speech-present frame.

Output¶

OUTPUT_AUDIO_SIGNAL any
Same as OUTPUT_AUDIO_SIGNAL.

Parameters¶

Parameters of this node are listed as follows:

Parameter name Type Default value Unit Description
WINDOW_NAME string Visualization of detected speech segments   Window name of the time-azimuth map.
VISUALIZATION_TIME_LENGTH float 10.0 second Visualization window length to show at the same time.
ADVANCE int 160 sample The length in sample between a frame and a previous frame.
SAMPLING_RATE int 16000 Hz Sampling rate.

WaveVisualization Node¶

Outline of the node¶

This node visualize the signal wave.

Typical connection¶

The type of both the input and output of SourceSeparation node is multi-channel (2-ch) audio spectrum. Typical connection of this node is depicted as follows:

The following figure shows the example of the result.

Input-output and property of the node¶

Input¶

INPUT_AUDIO_SIGNAL any
Input should be Matrix<float> or Map<int,ObjectRef>. In case of Matrix<float>, the rows should be channel indices and the columns should be frequency indices. In case of Map<int,ObjectRef>, the key is a source ID and the value is a vector of audio signals (Vector<float>).

Output¶

OUTPUT_AUDIO_SIGNAL any
Same as input.

Parameter list of

Parameter name Type Default value Unit Description
WINDOW_NAME string Visualization of detected speech segments   Window name of the time-azimuth map
VISUALIZATION_TIME_LENGTH float 10.0   Visualization window length to show at the same time
ADVANCE int 160   The length in sample between a frame and a previous frame
SAMPLING_RATE int 16000 Hz Sampling rate