1.2 Design philosophy of HARK

The design philosophy of the robot audition software HARK is summarized as follows.

Provision of total functions, from input through sound source localization, source separation and speech recognition: Guarantees total performance across the whole chain, including input from microphones installed on a robot, sound source localization, source separation, noise suppression and automatic speech recognition,
Correspondence to robot shape: Supports the microphone layout required by a user and incorporates it into signal processing,
Correspondence to multichannel A/D systems: Supports various multichannel A/D systems depending on price range / function,
Provision of optimal sound processing modules and advice: Each signal processing algorithm is effective only under certain premises, so multiple algorithms have been developed for the same function, and optimal modules are provided on the basis of usage experience,
Real-time processing: Essential for performing interactions and behaviors through sounds.

We have continuously released and updated HARK under the above design concepts. User feedback sent to the HARK Forum ¹ plays a major role in improvements and bug fixes.

$\includegraphics[width=0.95\linewidth ]{fig/Intro/Stacks}$

Figure 1.1: Relation between the robot audition software HARK and the middleware FlowDesigner and OS

HARK uses HARK middleware as its middleware, except for the speech recognition part (Julius, Kaldi) and support tools. (Up to version 2.5, FlowDesigner [2] was used.)

$\includegraphics[width=.95\linewidth ]{fig/Intro/HARKDesignerNetwork}$

Figure 1.2: Module construction for typical robot audition with HARK

1.2.1 Middleware HARK Designer

In robot audition, sound sources are typically separated based on sound source localization data, and speech recognition is performed on the separated speech. Each processing stage becomes more flexible when it is composed of multiple modules, so that an algorithm can be partially substituted. It is therefore essential to introduce middleware that allows efficient integration of the modules. However, as the number of integrated modules increases, the total overhead of the module connections increases and real-time performance is lost. It is difficult to address this problem with a common framework such as CORBA (Common Object Request Broker Architecture), which requires data serialization at each module connection. Moreover, the modules of HARK process the same acoustic data in the same time frame. If each module copied the acoustic data in memory every time, processing would be inefficient in terms of both speed and memory.

We therefore developed HARK middleware, a data-flow-oriented middleware. HARK middleware is an improved version of FlowDesigner [2], which was used as the middleware until then, but it has been reimplemented from scratch. Like FlowDesigner, HARK middleware is faster and lighter than general-purpose module integration frameworks such as ROS and CORBA.

FlowDesigner is a framework that realizes high-speed, lightweight module integration on the premise of use within a single computer ² ³, whereas HARK middleware is also implemented with distributed processing across multiple computers in mind. FlowDesigner is implemented in C++ only, whereas HARK middleware is implemented with a combination of C++ and Python. The overhead is larger than that of FlowDesigner, but the implementation is designed to keep it to a minimum. Regarding module implementation, HARK middleware is fully compatible with FlowDesigner: modules are realized as C++ classes that inherit from a common superclass (with HARK-Python, modules can also be written in Python). For this reason, the interfaces between modules are naturally unified. Since module connections are realized by calling a specific method of each class (a function call), the overhead is small. Since data are passed by reference and by pointer, the above-mentioned acoustic data are processed at high speed with few resources. In other words, HARK middleware maintains both a high data transfer rate between modules and module reusability.
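
The connection style described above can be illustrated with a minimal sketch. It is not HARK's actual class hierarchy: the names `Node`, `AudioSource`, `PassThrough`, `Power` and the method `calculate` are hypothetical and chosen only for illustration. The point is that a connection is an ordinary method call on the upstream module and that the acoustic frame is handed over by reference, so no copy or serialization takes place.

```python
class Node:
    """Hypothetical common superclass; HARK's real base class differs."""

    def __init__(self, upstream=None):
        self.upstream = upstream

    def calculate(self, frame_index):
        raise NotImplementedError


class AudioSource(Node):
    def __init__(self, frames):
        super().__init__()
        self.frames = frames  # one sample buffer per time frame

    def calculate(self, frame_index):
        # Return the buffer itself: no copy, no serialization.
        return self.frames[frame_index]


class PassThrough(Node):
    def calculate(self, frame_index):
        # A module connection is just a method call on the upstream node.
        return self.upstream.calculate(frame_index)


class Power(Node):
    def calculate(self, frame_index):
        frame = self.upstream.calculate(frame_index)
        return sum(s * s for s in frame) / len(frame)


frames = [[0.0, 0.5, -0.5, 1.0], [1.0, 1.0, 1.0, 1.0]]
source = AudioSource(frames)
tap = PassThrough(source)
power = Power(tap)

# The downstream module sees the very same object as the source holds.
same_buffer = tap.calculate(0) is frames[0]
frame_power = power.calculate(1)
```

Because every module exposes the same method, any module can be swapped for another with the same input and output types without touching its neighbors.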

$\includegraphics[width=.7\linewidth ]{fig/Intro/GHDSSProperty3}$

Figure 1.3: Attribute setting screen of GHDSS

$\includegraphics[width=.5\textwidth ]{fig/Intro/ASIMO-Robovie-ears.eps}$

Figure 1.4: Three types of ear for robot (microphone layout).

Figure 1.2 shows a HARK middleware network for typical robot audition with HARK. Multichannel acoustic signals are acquired from input files, and sound source localization and source separation are performed. Acoustic features are extracted from the separated sound, missing feature masks (MFM) are generated, and these are sent to speech recognition (ASR). The attributes of each module can be set on the attribute setting screen (Figure 1.3 shows an example of the attribute setting screen of GHDSS). Table 1.1 shows the HARK modules and external tools that are currently provided. The following sections outline each module together with its design strategy.

1.2.2 Input device

Multiple microphones (a microphone array) are mounted as the ears of a robot for processing in HARK. Figure 1.4 shows examples of how ears are installed on robots. Each of these examples is equipped with an eight-channel microphone array, though microphone arrays with an arbitrary number of channels can be used in HARK. The following multichannel A/D conversion devices are supported by HARK.

System in Frontier, Inc., the RASP series
ALSA-based A/D conversion devices (e.g. RME Hammerfall DSP series, Multiface AE)
Microsoft Kinect
Sony PS-EYE
Dev-Audio Microcone

These A/D systems have different numbers of input channels. Any number of channels can be used by changing internal parameters in HARK, although the processing speed may fall when the number of channels is large. Both 16-bit and 24-bit samples are supported. Further, the sampling rate assumed for HARK is 16 kHz, so a downsampling module can be used for data sampled at 48 kHz. Note that the Tokyo Electron Device Ltd. TD-BD-16ADUSB (USB interface) is no longer supported, since the Linux kernel it supports is too old.
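
Converting 48 kHz data to 16 kHz means keeping every third sample after removing content above the new Nyquist frequency of 8 kHz. The following is a minimal single-channel sketch of that idea, not the implementation of HARK's MultiDownSampler. The filter length (63 taps), the 7 kHz cutoff and the Hamming window are assumptions chosen for illustration.

```python
import math


def lowpass_taps(num_taps, cutoff_ratio):
    """Hamming-windowed sinc FIR; cutoff_ratio = cutoff / input sampling rate."""
    mid = (num_taps - 1) / 2.0
    taps = []
    for n in range(num_taps):
        x = n - mid
        if x == 0:
            ideal = 2 * cutoff_ratio
        else:
            ideal = math.sin(2 * math.pi * cutoff_ratio * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    gain = sum(taps)
    return [t / gain for t in taps]  # normalize to unity gain at DC


def downsample(samples, factor, taps):
    """Low-pass filter and keep one output sample per `factor` inputs."""
    out = []
    for start in range(0, len(samples) - len(taps) + 1, factor):
        block = samples[start:start + len(taps)]
        out.append(sum(t * s for t, s in zip(taps, block)))
    return out


def tone(freq_hz, rate_hz, count):
    return [math.sin(2 * math.pi * freq_hz * n / rate_hz) for n in range(count)]


def rms(values):
    return math.sqrt(sum(v * v for v in values) / len(values))


taps = lowpass_taps(num_taps=63, cutoff_ratio=7000 / 48000)
low = downsample(tone(1000, 48000, 4800), 3, taps)    # inside the passband
high = downsample(tone(12000, 48000, 4800), 3, taps)  # would alias at 16 kHz
```

The 1 kHz tone passes almost unchanged, while the 12 kHz tone, which cannot be represented at 16 kHz, is strongly attenuated before decimation instead of being aliased into the speech band.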

Low-priced pin microphones are sufficient, though a preamplifier is recommended to compensate for lack of gain. One such preamplifier, the OctaMic II, is available from RME.

Table 1.1: Nodes and Tools provided by HARK 3.2.0

Function	Category name	Module name	Description
Voice input output	AudioIO	AudioStreamFromMic	Acquire sound from microphone
		AudioStreamFromWave	Acquire sound from file
		SaveRawPCM	Save sound in file
		SaveWavePCM	Save sound in wav-formatted file
		SaveWavePCM2	Save sound in wav-formatted file
		HarkDataStreamSender	Socket-based data communication
Sound source localization / tracking	Localization	ConstantLocalization	Output constant localized value
		DisplayLocalization	Display localization result
		LocalizeMUSIC	Localize sound source
		LoadSourceLocation	Load localization information from file
		NormalizeMUSIC	Normalize the MUSIC spectrum from LocalizeMUSIC
		SaveSourceLocation	Save source location information in file
		SourceIntervalExtender	Extend forward the tracking result
		SourceTracker	Source tracking
		SourceTrackerPF	Source tracking
		CSP	Estimate sound source direction by CSP
		LocalizeBFDS	Estimate sound source direction by BFDS
		CMLoad	Load a Correlation Matrix (CM) file
		CMSave	Save a CM file
		CMChannelSelector	Channel selection for CM
		CMMakerFromFFT	Create a CM
		CMMakerFromFFTwithFlag	Create a CM
		CMDivideEachElement	Division of each element of CM
		CMMultiplyEachElement	Multiplication of each element of CM
		CMConjEachElement	Conjugate of CM
		CMInverseMatrix	Inverse of CM
		CMMultiplyMatrix	Multiplication of CM
		CMIdentityMatrix	Output identity CM
Sound source separation	Separation	BGNEstimator	Estimate background noise
		BeamForming	Sound source separation
		CalcSpecSubGain	Subtract noise spectrum and estimate optimum gain
		CalcSpecAddPower	Add power spectrum
		EstimateLeak	Estimate inter-channel leak noise
		GHDSS	Separate sound source by GHDSS
		ML	Separate sound source by ML
		MSNR	Separate sound source by MSNR
		MVDR	Separate sound source by MVDR
		HRLE	Estimate noise spectrum
		PostFilter	Perform post-filtering after sound source separation
		SemiBlindICA	Separate sound source by ICA with prior information
		SpectralGainFilter	Estimate voice spectrum
Feature extraction	FeatureExtraction	Delta	Calculate $\Delta$ term
		FeatureRemover	Remove term
		MelFilterBank	Perform mel-scale filter bank processing
		MFCCExtraction	Extract MFCC
		MSLSExtraction	Extract MSLS
		PreEmphasis	Perform pre-emphasis
		SaveFeatures	Save features
		SaveHTKFeatures	Save features in the HTK form
		SpectralMeanNormalization	Normalize spectrum mean
		SpectralMeanNormalizationIncremental	Normalize spectrum mean
Missing Feature Mask	MFM	DeltaMask	Calculate $\Delta$ mask term
		DeltaPowerMask	Calculate $\Delta$ power mask term
		MFMGeneration	Generate MFM
Communication with ASR	ASRIF	SpeechRecognitionClient	Send feature to ASR
		SpeechRecognitionSMNClient	Same as above, with the feature SMN
Others	MISC	ChannelSelector	Select channel
		CombineSource	Combine the localization result
		DataLogger	Generate log output of data
		HarkParamsDynReconf	Dynamic parameter tuning via network
		LoadMapFrames	Load data saved by SaveMapFrames
		LoadMatrixFrames	Load data saved by SaveMatrixFrames
		LoadVectorFrames	Load data saved by SaveVectorFrames
		MapIDOffset	Shift the key of `Map<int, ObjectRef>`
		MapMatrixValueOverwrite	Overwrite `Matrix<ObjectRef>` of `Map<int, ObjectRef>`
		MapOperator	Mathematical operations on the `ObjectRef`
		MapSelectorBySource	Select separation result by `Source`
		MapToMap	Convert the `ObjectRef` type of `Map<int, ObjectRef>`
		MapToMatrix	Convert `Map<int, ObjectRef>` $\rightarrow$ `Matrix`
		MapToVector	Convert `Map<int, ObjectRef>` $\rightarrow$ `Vector`
		MapVectorValueOverwrite	Overwrite `Vector<ObjectRef>`
		MatrixToMap	Convert `Matrix` $\rightarrow$ `Map`
		MatrixToMatrix	Convert a data type between `Matrix<float>` and `Matrix<complex<float> >`
		MatrixToVector	Convert `Matrix` $\rightarrow$ `Vector`
		MatrixValueOverwrite	Overwrite the element of `Matrix<ObjectRef>`
		MultiGain	Calculate gain of multiple-channel
		MultiDownSampler	Perform downsampling
		MultiFFT	Perform multichannel FFT
		PowerCalcForMap	Calculate power of Map input
		PowerCalcForMatrix	Calculate power of matrix input
		ResizeMapMatrixValues	Change the size of `Matrix<ObjectRef>`
		ResizeMapVectorValues	Change the size of `Vector<ObjectRef>`
		SaveMapFrames	Save data of frames in the `Map` type
		SaveMatrixFrames	Save data of frames in the `Matrix` type
		SaveVectorFrames	Save data of frames in the `Vector<ObjectRef>` type
		SegmentAudioStreamByID	Select audio stream segment by ID
		SourceSelectorByDirection	Select sound source by direction
		SourceSelectorByID	Select sound source by ID
		SourceSelectorBySourceInfo	Select sound source by information(ID, power, direction)
		SourceTransformer	Edit values for information(ID, power, direction)
		Synthesize	Convert waveform
		TextConcatenate	Concatenate strings
		TextConverter	Convert data type to JSON text
		VADZC	Voice Activity Detection by ZC
		VectorToMap	Convert `Vector` $\rightarrow$ `Map<int, ObjectRef>`
		VectorToMatrix	Convert `Vector` $\rightarrow$ `Matrix`
		VectorToVector	Convert a data type between `Vector<float>` and `Vector<complex<float> >`
		VectorValueOverwrite	Overwrite the element of `Vector<ObjectRef>`
		WhiteNoiseAdder	Add white noise
Function	Category	Tool name	Description
Data generation	External tool	harktool4	Visualize data / Generate setting file
		wios	Sound recording tool for transfer function measurements

1.2.3 Sound source localization

The MUltiple SIgnal Classification (MUSIC) method, which has shown the best performance in past experience, is employed for sound source localization with microphone arrays. MUSIC localizes sound sources based on the impulse responses (transfer functions) between each source position and each microphone. Impulse responses can be obtained by actual measurement or by calculation from the geometric positions of the microphones. In HARK 0.1.7, the beamformer of ManyEars was also available for microphone array localization. This module localizes in a 2D polar coordinate space (“2D” in the sense that it recognizes direction information within a 3D polar coordinate space). It has been reported that its localization error is about 1.4$^\circ$ when the sound source is within 5 m of the microphone array and the sound sources are more than 20$^\circ$ apart. However, the ManyEars module was originally designed for 48 kHz sampling rather than the 16 kHz sampling used in HARK, and it assumes that the microphones are arranged in free space when impulse responses are simulated from the microphone layout, so the influence of the robot body cannot be taken into account. In addition, the localization accuracy of adaptive beamformers such as MUSIC is higher than that of common beamformers. For these reasons, HARK 1.0.0 supports only the MUSIC method. HARK 1.1 added GEVD-MUSIC and GSVD-MUSIC, which are extended versions of MUSIC. With these extensions, a known high-power noise such as robot ego-noise can be suppressed or whitened, and desired sounds can be localized in the presence of this noise. In HARK 1.2, the algorithm was extended to localize sound sources in three dimensions.
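
The principle of MUSIC can be sketched for the simplest case: one sound source, one frequency bin, and transfer functions calculated from microphone geometry under a free-field, far-field assumption. This is an illustrative sketch only, not the LocalizeMUSIC implementation. The array geometry, the 1 kHz bin, the synthetic correlation matrix and all function names are assumptions. With a single source, the signal subspace is spanned by the principal eigenvector of the correlation matrix, and the MUSIC spectrum peaks where the steering vector has almost no energy outside that subspace.

```python
import cmath
import math

SOUND_SPEED = 343.0  # m/s; assumed value for air


def steering_vector(mics, azimuth_deg, freq_hz):
    """Free-field, far-field transfer function for a planar array."""
    az = math.radians(azimuth_deg)
    ux, uy = math.cos(az), math.sin(az)
    return [cmath.exp(2j * math.pi * freq_hz * (x * ux + y * uy) / SOUND_SPEED)
            for x, y in mics]


def correlation_matrix(a, noise_power):
    """Synthetic R = a a^H + noise_power * I for one unit-power source."""
    n = len(a)
    return [[a[i] * a[j].conjugate() + (noise_power if i == j else 0.0)
             for j in range(n)] for i in range(n)]


def principal_eigenvector(R, iterations=100):
    """Power iteration; enough for a single dominant eigenvalue."""
    n = len(R)
    v = [complex(1.0, 0.1 * i) for i in range(n)]
    for _ in range(iterations):
        w = [sum(R[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(abs(x) ** 2 for x in w))
        v = [x / norm for x in w]
    return v


def music_spectrum(R, mics, freq_hz, candidates_deg):
    e = principal_eigenvector(R)  # spans the signal subspace (one source)
    spectrum = []
    for az in candidates_deg:
        a = steering_vector(mics, az, freq_hz)
        total = sum(abs(x) ** 2 for x in a)
        in_signal = abs(sum(ei.conjugate() * ai for ei, ai in zip(e, a))) ** 2
        # total - in_signal is the energy of a in the noise subspace.
        spectrum.append(total / max(total - in_signal, 1e-12))
    return spectrum


# Eight microphones on a circle of radius 0.1 m (illustrative layout).
mics = [(0.1 * math.cos(2 * math.pi * k / 8), 0.1 * math.sin(2 * math.pi * k / 8))
        for k in range(8)]
candidates = list(range(0, 360, 5))
R = correlation_matrix(steering_vector(mics, 60, 1000.0), noise_power=0.01)
spectrum = music_spectrum(R, mics, 1000.0, candidates)
estimated = candidates[spectrum.index(max(spectrum))]
```

On a real robot the steering vectors would come from measured transfer functions rather than the free-field formula, which is exactly the limitation of geometry-only simulation noted above.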

1.2.4 Sound source separation

For sound source separation, HARK 1.0.0 employs Geometric High-order Decorrelation-based Source Separation (GHDSS) [8], which is known from past usage experience to have the highest total performance in various acoustic environments, together with PostFilter and the noise estimation method Histogram-based Recursive Level Estimation (HRLE). At present, the best performance and stability in various acoustic environments are obtained by the combination of GHDSS and HRLE. Until now, various methods such as beamformers (delay-and-sum type, adaptive type), Independent Component Analysis (ICA) and Geometric Source Separation (GSS) have been developed and evaluated. The sound source separation methods employed for HARK are summarized as follows:

Delay-and-sum beamformer employed for HARK 0.1.7,
Combination of ManyEars Geometric Source Separation (GSS) and PostFilter [4], which was supported as an external module with HARK 0.1.7,
Combination of GSS and PostFilter as an original design [5], employed for the HARK 1.0.0 prerelease,
Combination of GHDSS and HRLE employed for HARK 1.0.0 [6, 8].
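
The first method in this list, the delay-type beamformer (commonly called delay-and-sum), is the simplest of the four and can be sketched in a few lines for one frequency bin. The sketch below is not HARK code. It assumes a free-field, far-field linear array, and the spacing, frequency, directions and source values are illustrative. Each channel's phase delay toward the target direction is undone, and the channels are then averaged.

```python
import cmath
import math

SOUND_SPEED = 343.0  # m/s; assumed value for air


def steering_vector(mic_x, azimuth_deg, freq_hz):
    """Far-field steering vector for microphones on the x axis (free field)."""
    u = math.cos(math.radians(azimuth_deg))
    return [cmath.exp(2j * math.pi * freq_hz * x * u / SOUND_SPEED) for x in mic_x]


def delay_and_sum(observation, steering):
    """y = a^H x / M: undo each channel's phase delay, then average."""
    return sum(a.conjugate() * x for a, x in zip(steering, observation)) / len(steering)


mic_x = [0.05 * k for k in range(8)]  # eight microphones, 5 cm apart
freq = 2000.0
target = steering_vector(mic_x, 60, freq)
interferer = steering_vector(mic_x, 120, freq)

s_target, s_interf = 1.0 + 0.5j, 0.8 - 0.2j
observation = [s_target * t + s_interf * i for t, i in zip(target, interferer)]

output = delay_and_sum(observation, target)
# Fraction of the interferer's amplitude that leaks into the output.
leak = abs(output - s_target) / abs(s_interf)
```

The target passes with unit gain, but part of the interferer still leaks through. The leak depends only on the array geometry, not on the signals, which is one reason adaptive methods are preferred in the later releases.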

The GSS of ManyEars used for HARK 0.1.7 uses the transfer functions from a sound source to each microphone as a geometric constraint and separates the signal coming from a given sound source direction. The transfer functions are calculated from the relation between microphone positions and sound source positions. Calculating transfer functions in this way caused performance degradation when the transfer function changed with the shape of the robot even though the microphone layout was the same. GSS was redesigned for the HARK 1.0.0 prerelease and extended so that measured transfer functions can be used as the geometric constraint. Further, modifications such as adaptive step-size control were made to accelerate convergence of the separation matrix. It also became possible to configure a delay-and-sum beamformer by changing the attribute settings of GSS. Accordingly, the delay-and-sum beamformer DSBeamformer, which had been employed for HARK 0.1.7, was removed.

Most sound source separation methods other than ICA require the direction of the sound source to be separated as a parameter. If localization information is not provided, separation itself cannot be executed. A robot’s steady noise has a comparatively strong character as a directional sound source, so it can be removed if it is localized. In practice, however, such noise often fails to be localized, and there were cases in which separation performance for steady noise was degraded as a result. A function that continuously specifies noise sources in specific directions was added to GSS and GHDSS in the HARK 1.0.0 prerelease, which makes it possible to keep separating sound sources that cannot be localized.

In general, there is a limit to the separation performance of methods based on linear processing, such as GSS and GHDSS, so nonlinear processing called a post-filter is essential for improving the quality of separated sounds. The post-filter of ManyEars was redesigned, and a post-filter with a considerably reduced number of parameters is employed for the HARK 1.0.0 prerelease and the final version. The post-filter can be a “good knife” if used properly, but it is difficult to make full use of it, and users may suffer adverse effects if it is used wrongly. PostFilter still has a number of parameters that must be set, and it is difficult to set them properly. Furthermore, the post-filter performs nonlinear processing based on a probabilistic model, so nonlinear spectral distortion occurs in the separated sounds and the speech recognition rate for separated sounds does not easily improve.

The steady noise estimation method HRLE (Histogram-based Recursive Level Estimation), which is well suited to GHDSS, is employed for HARK 1.0.0. Separated sounds of improved quality are obtained by combining HRLE with EstimateLeak, which estimates inter-channel leak energy and was developed through a close examination of the GHDSS separation algorithm.
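
The general idea behind a histogram-based recursive level estimate can be sketched as follows. This is a simplified illustration, not HARK's HRLE node. The class name, the bin layout, the forgetting factor and the percentile are all assumptions. Power values for one frequency bin are accumulated in a histogram over dB levels with exponential forgetting, and the level at a chosen point of the cumulative histogram is taken as the steady noise level. Because speech is intermittent, the lower part of the distribution is dominated by noise.

```python
import math


class HistogramLevelEstimator:
    """Simplified sketch; all parameter names and defaults are illustrative."""

    def __init__(self, min_db=-100.0, step_db=1.0, num_bins=200,
                 percentile=0.5, forget=0.99):
        self.min_db = min_db
        self.step_db = step_db
        self.num_bins = num_bins
        self.percentile = percentile
        self.forget = forget
        self.hist = [0.0] * num_bins

    def update(self, power):
        """Add one power observation and return the estimated noise power."""
        level_db = 10 * math.log10(max(power, 1e-20))
        idx = int((level_db - self.min_db) / self.step_db)
        idx = min(max(idx, 0), self.num_bins - 1)
        # Recursive update: old observations decay, the new one is added.
        self.hist = [h * self.forget for h in self.hist]
        self.hist[idx] += 1.0 - self.forget
        # Walk the cumulative histogram up to the requested percentile.
        target = self.percentile * sum(self.hist)
        cumulative = 0.0
        for i, h in enumerate(self.hist):
            cumulative += h
            if cumulative >= target:
                return 10 ** ((self.min_db + self.step_db * i) / 10)
        return 10 ** ((self.min_db + self.step_db * (self.num_bins - 1)) / 10)


estimator = HistogramLevelEstimator()
noise = 0.0
for t in range(500):
    # Steady noise at -40 dB with a 0 dB "speech" frame every fifth frame.
    power = 1.0 if t % 5 == 0 else 1e-4
    noise = estimator.update(power)
noise_db = 10 * math.log10(noise)
```

The estimate stays near the noise floor despite the loud frames, and only two quantities (the forgetting factor and the percentile) need tuning, which reflects the motivation given above for reducing the number of parameters.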

1.2.5 MFT-ASR: Speech recognition based on MFT

$\includegraphics[width=.9\linewidth ]{fig/Intro/MFT-concept}$

Figure 1.5: Schematic diagram of speech recognition using Missing Feature Theory

The spectral distortion caused by factors such as sound mixture or separation goes beyond what is assumed in the conventional speech recognition community. To deal with it, sound source separation and speech recognition must be connected more closely. HARK addresses this with speech recognition based on the missing feature theory (MFT-ASR) [4]. The concept of MFT-ASR is shown in Figure 1.5. The black and red lines in the figure indicate the time variation of the acoustic features of a separated sound and that of the acoustic model for the corresponding speech used by the ASR system, respectively. Because of distortion, the acoustic features of a separated sound differ greatly from those of the model at some points (Figure 1.5(a)). In MFT-ASR, the influence of the distortion is ignored by masking the distorted points with a Missing Feature Mask (MFM) (Figure 1.5(b)). An MFM is a map of reliability over time that corresponds to the acoustic features of a separated sound; a binary mask (also called a Hard Mask) is usually used, while masks with continuous values from 0 to 1 are called Soft Masks. In HARK, the MFM is derived from the steady noise obtained from the post-filter and from inter-channel energy. MFT-ASR, like common speech recognition, is based on a Hidden Markov Model (HMM). The parts related to the acoustic scores calculated from the HMM (mainly the output probability calculation) are modified so that the MFM can be used. HARK uses the multiband version of Julius developed by the Furui Laboratory at Tokyo Institute of Technology, reinterpreted as MFT-ASR [13]. HARK 1.0.0 uses the plug-in feature of Julius 4, and the main part of MFT-ASR serves as a Julius plug-in. Implementing MFT-ASR as a plug-in allows Julius to be updated without modifying MFT-ASR. Moreover, MFT-ASR works as a server / daemon independent of FlowDesigner: it receives the acoustic features and their MFMs transmitted via socket communication by a speech recognition client of HARK and outputs recognition results for them.
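
One common way to let a mask enter the output probability calculation is to weight each feature dimension's log-likelihood by its mask value, so that a dimension with mask 0 contributes nothing. The sketch below shows this for a single diagonal-covariance Gaussian. It is an illustrative formulation under that assumption, not necessarily the exact computation used in the Julius-based MFT-ASR, and the function name and numbers are hypothetical.

```python
import math


def masked_log_likelihood(features, mask, mean, variance):
    """Mask-weighted log-likelihood under a diagonal Gaussian.

    mask[i] = 1 keeps dimension i, mask[i] = 0 ignores it (hard mask);
    values in between act as a soft mask.
    """
    total = 0.0
    for x, m, mu, var in zip(features, mask, mean, variance):
        log_gauss = -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)
        total += m * log_gauss
    return total


mean = [1.0, 2.0, 3.0]
variance = [0.5, 0.5, 0.5]
clean = [1.1, 2.1, 2.9]
distorted = [1.1, 9.0, 2.9]   # second dimension corrupted by separation
hard_mask = [1.0, 0.0, 1.0]   # mark the corrupted dimension as unreliable

unmasked = masked_log_likelihood(distorted, [1.0, 1.0, 1.0], mean, variance)
masked = masked_log_likelihood(distorted, hard_mask, mean, variance)
reference = masked_log_likelihood(clean, hard_mask, mean, variance)
```

With the mask applied, the distorted frame scores exactly as the clean frame does on the reliable dimensions, whereas the unmasked score is dominated by the single corrupted dimension.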

1.2.6 Acoustic feature extraction and noise adaptation of acoustic model

In order to improve the effectiveness of MFT by confining spectral distortion to specific acoustic features, the Mel Scale Log Spectrum (MSLS) is used for acoustic features. The Mel-Frequency Cepstrum Coefficient (MFCC), which is generally used for speech recognition, is also available in HARK. However, distortion spreads over all features in MFCC, so it is not well suited to MFT. When simultaneous speech is infrequent, speech recognition with MFCC achieves better performance in some cases. HARK 1.0.0 provides a new module for using the $\Delta$ power term with MSLS features. The effectiveness of the $\Delta$ power term for MFCC features has already been reported. It has also been confirmed that the 27-dimensional MSLS feature, consisting of the 13-dimensional MSLS, the 13-dimensional $\Delta$ MSLS and $\Delta$ power, performs better than the 24-dimensional MSLS and $\Delta$ MSLS (48 dimensions in total) used for HARK 0.1.7. In HARK, the influence of the distortion caused by the aforementioned nonlinear separation is reduced by adding a small amount of white noise. An acoustic model is constructed by multi-condition training with clean speech and speech with white noise added. Speech recognition is then performed after adding the same amount of white noise to the separated speech. In this way, highly precise recognition is realized for one speaker’s speech even when the S/N is around -3 dB [6].
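
Adding "the same amount" of white noise at training time and at recognition time requires a reproducible way of fixing the noise level. A minimal sketch is shown below, assuming the level is specified as a signal-to-noise ratio relative to the input signal. The text does not state how the amount is specified, so the 20 dB value, the function names and the 440 Hz test tone are illustrative assumptions and not taken from HARK's WhiteNoiseAdder.

```python
import math
import random


def add_white_noise(samples, snr_db, seed=0):
    """Add Gaussian white noise scaled to the requested SNR in dB."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in samples]
    signal_power = sum(s * s for s in samples) / len(samples)
    noise_power = sum(n * n for n in noise) / len(noise)
    # Scale so that signal_power / (scale^2 * noise_power) = 10^(snr_db / 10).
    scale = math.sqrt(signal_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(samples, noise)]


def measured_snr_db(clean, noisy):
    signal_power = sum(c * c for c in clean) / len(clean)
    noise_power = sum((y - c) ** 2 for c, y in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(signal_power / noise_power)


# One second of a 440 Hz tone at 16 kHz stands in for separated speech.
clean = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
noisy = add_white_noise(clean, snr_db=20.0)
snr = measured_snr_db(clean, noisy)
```

Applying the same function with the same target level to the training data and to the separated speech keeps the noise condition seen by the acoustic model consistent between training and recognition.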

Footnotes

¹ https://wp.hark.jp/forums/
² Connections across computers can be realized by creating a module for network connection, in the same way as the connection with speech recognition in HARK.
³ The original version of FlowDesigner and the function-improved version, FlowDesigner 0.9.0, are available at http://flowdesigner.sourceforge.net/ and https://www.hark.jp/, respectively.