# 1.2 Design philosophy of HARK

The design philosophy of the robot audition software HARK is summarized as follows.

1. Provision of end-to-end functions, from input to sound source localization, source separation, and speech recognition: guarantees total performance across input from microphones installed on a robot, sound source localization, source separation, noise suppression, and automatic speech recognition.

2. Support for robot shape: accommodates the microphone layout required by the user and incorporates it into signal processing.

3. Support for multichannel A/D systems: supports various multichannel A/D systems across price ranges and feature sets.

4. Provision of optimal sound processing modules and advice: each signal processing algorithm is effective only under certain premises, so multiple algorithms have been developed for the same function, and the most suitable modules are provided based on usage experience.

5. Real-time processing: essential for interaction and behavior through sound.

We have continuously released and updated HARK under these design concepts. User feedback sent to the HARK Forum (see footnote 1) plays a major role in improvements and bug fixes.

HARK uses HARK middleware as its middleware, except for the speech recognition part (Julius, Kaldi) and the support tools. (Up to version 2.5, FlowDesigner [2] was used.)

### 1.2.1 Middleware HARK Designer

In robot audition, sound sources are typically separated based on sound source localization data, and speech recognition is performed on the separated speech. Building each processing stage from multiple modules makes it flexible, because an algorithm can be partially replaced. Middleware that allows efficient integration between modules is therefore essential. However, as the number of integrated modules increases, the total overhead of the module connections increases and real-time performance is lost. This problem is difficult to address with a common framework such as CORBA (Common Object Request Broker Architecture), which requires data serialization at each module connection. Moreover, the modules of HARK process the same acoustic data in the same time frame. If each module made its own memory copy of the acoustic data, processing would be inefficient in both speed and memory usage.

We therefore developed HARK middleware, a data-flow-oriented middleware. HARK middleware is an improved successor to FlowDesigner [2], which served as the middleware until then, but its internals were reimplemented from scratch. Like FlowDesigner, HARK middleware is faster and lighter than general-purpose module integration frameworks such as ROS and CORBA.

FlowDesigner is a framework that realizes high-speed, lightweight module integration on the premise of use within a single computer (see footnotes 2 and 3), whereas HARK middleware is also implemented with distributed processing across multiple computers in mind. FlowDesigner is implemented in C++ only, whereas HARK middleware is implemented in a combination of C++ and Python. The overhead is higher than that of FlowDesigner, but the implementation is designed to minimize it. Module implementation is fully compatible with FlowDesigner: modules are realized as C++ classes that inherit from common superclasses (with HARK-Python, modules can also be written in Python). For this reason, the interfaces between modules are naturally unified. Since module connections are realized by calling a specific method of each class (a function call), the overhead is small. Since data are passed by reference and by pointer, the above-mentioned acoustic data are processed at high speed with few resources. In other words, HARK middleware maintains both a high data transfer rate between modules and module reusability.
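The connection style described above can be illustrated with a minimal sketch. It is written in Python for brevity and does not reproduce the HARK middleware API: the class names `Node`, `Gain`, and `PeakMeter` and the function `run_network` are hypothetical. The point is only that modules share a common superclass, that a connection is an ordinary method call, and that the same frame object is handed on without copying or serialization.

```python
from abc import ABC, abstractmethod


class Node(ABC):
    """Hypothetical common superclass (not the real HARK middleware API)."""

    @abstractmethod
    def process(self, frame):
        """Consume one frame and return the object handed to the next node."""


class Gain(Node):
    def __init__(self, factor):
        self.factor = factor

    def process(self, frame):
        # Modify the shared buffer in place instead of building a copy.
        for channel in frame:
            for i, sample in enumerate(channel):
                channel[i] = sample * self.factor
        return frame


class PeakMeter(Node):
    def __init__(self):
        self.peak = 0.0

    def process(self, frame):
        # Read-only consumer: inspects the same buffer and passes it on.
        self.peak = max(abs(s) for channel in frame for s in channel)
        return frame


def run_network(nodes, frame):
    # A connection is just a method call; the frame object is never
    # serialized or copied between nodes.
    for node in nodes:
        frame = node.process(frame)
    return frame


frame = [[0.1, -0.2], [0.3, 0.4]]  # 2 channels x 2 samples
meter = PeakMeter()
out = run_network([Gain(2.0), meter], frame)
print(out is frame, meter.peak)
```

Because every node receives a reference to the same object, the cost of a connection does not grow with the size of the acoustic frame, which is the property the text attributes to pass-by-reference connections.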

Figure 1.2 shows a HARK middleware network for typical robot audition with HARK. Multichannel acoustic signals are acquired from input files, and sound source localization and source separation are performed. Acoustic features are extracted from the separated sound, missing feature masks (MFM) are generated, and these are sent to speech recognition (ASR). The attributes of each module can be set on the attribute setting screen (Figure 1.3 shows an example of the attribute setting screen of GHDSS). Table 1.1 shows the HARK modules and external tools that are currently provided. The following sections outline each module together with its design strategy.

### 1.2.2 Input device

In HARK, multiple microphones (a microphone array) are mounted on a robot as its ears. Figure 1.4 shows examples of how the ears are installed. Each example is equipped with an eight-channel microphone array, although microphone arrays with an arbitrary number of channels can be used in HARK. HARK supports the following multichannel A/D conversion devices.

• System In Frontier, Inc., RASP series

• ALSA-based A/D conversion devices (e.g., RME Hammerfall DSP series, Multiface AE)

• Microsoft Kinect

• Sony PS-EYE

• Dev-Audio Microcone

These A/D systems have different numbers of input channels. HARK can handle any number of channels by changing its internal parameters, although processing speed may drop when the number of channels is large. Both 16-bit and 24-bit samples are supported. The sampling rate assumed by HARK is 16 kHz, so a downsampling module can be used for data sampled at 48 kHz. Note that the Tokyo Electron Device Ltd. TD-BD-16ADUSB (USB interface) is no longer supported, because the Linux kernel it supports is too old.
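The rate conversion mentioned above, from 48 kHz to 16 kHz, is a reduction by a factor of 3. The following is a minimal sketch of that idea for one channel, assuming simple block averaging as the low-pass step. The function name `downsample` is hypothetical, and the filter is deliberately crude; it is not the filter used by HARK's downsampling module.

```python
def downsample(samples, factor=3):
    """Average each block of `factor` samples and keep one value per block.

    Block averaging acts as a crude low-pass filter before decimation.
    A production resampler would use a properly designed filter instead.
    """
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    n_out = len(samples) // factor
    return [
        sum(samples[i * factor:(i + 1) * factor]) / factor
        for i in range(n_out)
    ]


one_second_48k = [1.0] * 48000  # one second of a constant signal at 48 kHz
one_second_16k = downsample(one_second_48k, factor=48000 // 16000)
print(len(one_second_16k))
```

For multichannel data, the same function would be applied to each channel independently.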

Low-priced pin microphones are sufficient, although a preamplifier is recommended to compensate for insufficient gain. One such preamplifier, the OctaMic II, is available from RME.

Table 1.1: Nodes and Tools provided by HARK 3.2.0

| Function | Category name | Description of module |
|---|---|---|
| Voice input/output | AudioIO | Acquire sound from microphone |
| | | Acquire sound from file |
| | | Save sound in file |
| | | Save sound in wav-formatted file |
| | | Save sound in wav-formatted file |
| | | Socket-based data communication |
| Sound source localization / tracking | Localization | Output constant localized value |
| | | Display localization result |
| | | Localize sound source |
| | | Load localization information from file |
| | | Normalize the MUSIC spectrum from LocalizeMUSIC |
| | | Save source location information in file |
| | | Extend forward the tracking result |
| | | Source tracking |
| | | Source tracking |
| | | Estimate sound source direction by CSP |
| | | Estimate sound source direction by BFDS |
| | | Load a Correlation Matrix (CM) file |
| | | Save a CM file |
| | | Channel selection for CM |
| | | Create a CM |
| | | Create a CM |
| | | Division of each element of CM |
| | | Multiplication of each element of CM |
| | | Conjugate of CM |
| | | Inverse of CM |
| | | Multiplication of CM |
| | | Output identity CM |
| Sound source separation | Separation | Estimate background noise |
| | | Sound source separation |
| | | Subtract noise spectrum and estimate optimum gain |
| | | Add power spectrum |
| | | Estimate inter-channel leak noise |
| | | Separate sound source by GHDSS |
| | | Separate sound source by ML |
| | | Separate sound source by MSNR |
| | | Separate sound source by MVDR |
| | | Estimate noise spectrum |
| | | Perform post-filtering after sound source separation |
| | | Separate sound source by ICA with prior information |
| | | Estimate voice spectrum |
| Feature extraction | FeatureExtraction | Calculate $\Delta$ term |
| | | Remove term |
| | | Perform mel-scale filter bank processing |
| | | Extract MFCC |
| | | Extract MSLS |
| | | Perform pre-emphasis |
| | | Save features |
| | | Save features in the HTK form |
| | | Normalize spectrum mean |
| | | Normalize spectrum mean |
| Missing Feature Mask | MFM | Calculate $\Delta$ mask term |
| | | Calculate $\Delta$ power mask term |
| | | Generate MFM |
| Communication with ASR | ASRIF | Send feature to ASR |
| | | Same as above, with the feature SMN |
| Others | MISC | Select channel |
| | | Combine the localization result |
| | | Generate log output of data |
| | | Dynamic parameter tuning via network |
| | | Load data saved by SaveMapFrames |
| | | Load data saved by SaveMatrixFrames |
| | | Load data saved by SaveVectorFrames |
| | | Shift the key of Map |
| | | Overwrite Matrix of Map |
| | | Mathematical operations on the ObjectRef |
| | | Select separation result by Source |
| | | Convert the ObjectRef type of Map |
| | | Convert Map to Matrix |
| | | Convert Map to Vector |
| | | Overwrite Vector |
| | | Convert Matrix to Map |
| | | Convert a data type between Matrix types |
| | | Convert Matrix to Vector |
| | | Overwrite the element of Matrix |
| | | Calculate gain of multiple channels |
| | | Perform downsampling |
| | | Perform multichannel FFT |
| | | Calculate power of Map input |
| | | Calculate power of Matrix input |
| | | Change the size of Matrix |
| | | Change the size of Vector |
| | | Save data of frames in the Map type |
| | | Save data of frames in the Matrix type |
| | | Save data of frames in the Vector type |
| | | Select audio stream segment by ID |
| | | Select sound source by direction |
| | | Select sound source by ID |
| | | Select sound source by information (ID, power, direction) |
| | | Edit values for information (ID, power, direction) |
| | | Convert waveform |
| | | Concatenate strings |
| | | Convert data type to JSON text |
| | | Voice Activity Detection by ZC |
| | | Convert Vector to Map |
| | | Convert Vector to Matrix |
| | | Convert a data type between Vector types |
| | | Overwrite the element of Vector |
| | | Add white noise |

| Function | Category | Tool name | Description |
|---|---|---|---|
| Data generation | External tool | harktool4 | Visualize data / Generate setting file |
| | | wios | Sound recording tool for transfer function measurements |

### 1.2.3 Sound source localization

The MUltiple SIgnal Classification (MUSIC) method, which has shown the best performance in past experience, is employed for microphone arrays. MUSIC localizes sound sources based on the impulse responses (transfer functions) between candidate source positions and each microphone. Impulse responses can be obtained by actual measurement or by calculation from the geometric positions of the microphones. In HARK 0.1.7, the beamformer of ManyEars [3] was also available for microphone arrays. This module localizes in a 2D polar coordinate space (called “2D” in the sense that direction information can be recognized within a 3D polar coordinate space). It has been reported that the localization error is about 1.4$^\circ$ when the sound source is within 5 m of the microphone array and sound sources are more than 20$^\circ$ apart. However, ManyEars as a whole was designed for 48 kHz sampling rather than the 16 kHz used in HARK, and it assumes that the microphones are arranged in free space when impulse responses are simulated from the microphone layout. The influence of the robot body therefore cannot be taken into account. In addition, adaptive beamformers such as MUSIC localize sound sources more accurately than common beamformers. For these reasons, HARK 1.0.0 supports only the MUSIC method. HARK 1.1 added GEVD-MUSIC and GSVD-MUSIC, which are extended versions of MUSIC. These extensions can suppress or whiten known high-power noise such as robot ego-noise, so that desired sounds can be localized under that noise. In HARK 1.2, the algorithm was extended to localize sound sources in three dimensions.
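For reference, the textbook narrowband form of the MUSIC spatial spectrum is given below. It is a standard formulation and not necessarily the exact expression implemented in HARK. All symbols are assumptions introduced here: $M$ is the number of microphones, $N_s$ the assumed number of sources, $\mathbf{x}(\omega)$ the multichannel spectrum at frequency $\omega$, $\mathbf{a}(\theta,\omega)$ the transfer function (steering) vector for direction $\theta$, and $\mathbf{e}_m(\omega)$ the eigenvectors of the correlation matrix $\mathbf{R}(\omega)$ sorted by descending eigenvalue.

$$
\mathbf{R}(\omega) = E\left[\mathbf{x}(\omega)\,\mathbf{x}^{H}(\omega)\right] = \sum_{m=1}^{M} \lambda_m(\omega)\,\mathbf{e}_m(\omega)\,\mathbf{e}_m^{H}(\omega), \qquad \lambda_1(\omega) \ge \cdots \ge \lambda_M(\omega)
$$

$$
P(\theta,\omega) = \frac{\mathbf{a}^{H}(\theta,\omega)\,\mathbf{a}(\theta,\omega)}{\sum_{m=N_s+1}^{M} \left|\mathbf{a}^{H}(\theta,\omega)\,\mathbf{e}_m(\omega)\right|^{2}}
$$

The denominator becomes small when $\mathbf{a}(\theta,\omega)$ is nearly orthogonal to the noise subspace spanned by $\mathbf{e}_{N_s+1},\dots,\mathbf{e}_M$, so $P(\theta,\omega)$ shows sharp peaks in the source directions. In the generalized-eigenvalue variant, the eigenvectors are instead obtained from the following problem, where $\mathbf{K}(\omega)$ is assumed to be the correlation matrix of the known noise; this is what whitens that noise.

$$
\mathbf{R}(\omega)\,\mathbf{e}_m(\omega) = \lambda_m(\omega)\,\mathbf{K}(\omega)\,\mathbf{e}_m(\omega)
$$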

### 1.2.4 Sound source separation

For sound source separation, HARK 1.0.0 employs Geometric High-order Decorrelation-based Source Separation (GHDSS) [8], which past usage experience has shown to have the highest total performance across various acoustic environments, together with PostFilter and the noise estimation method Histogram-based Recursive Level Estimation (HRLE). At present, the combination of GHDSS and HRLE gives the best performance and stability in various acoustic environments. Various methods have been developed and evaluated so far, including beamformers (delay-and-sum type and adaptive type), Independent Component Analysis (ICA), and Geometric Source Separation (GSS). The sound source separation methods employed in HARK are summarized as follows:

1. Delay-and-sum beamformer, employed in HARK 0.1.7,

2. Combination of ManyEars Geometric Source Separation (GSS) and PostFilter [4], supported as an external module in HARK 0.1.7,

3. Combination of GSS and PostFilter in an original design [5], employed in the HARK 1.0.0 prerelease,

4. Combination of GHDSS and HRLE, employed in HARK 1.0.0 [6, 8].
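The simplest method in the list, the delay-type beamformer commonly known as delay-and-sum, can be sketched in a few lines. This is a generic illustration and not the HARK implementation: the function name `delay_and_sum` is hypothetical, and the delays are assumed to be whole numbers of samples, whereas a practical implementation would derive fractional delays from transfer functions.

```python
def delay_and_sum(channels, delays):
    """Align each channel by its integer sample delay, then average.

    channels: list of per-microphone sample lists
    delays:   arrival delay of the target source at each microphone, in samples
    """
    if len(channels) != len(delays):
        raise ValueError("one delay per channel is required")
    n = min(len(ch) - d for ch, d in zip(channels, delays))
    return [
        sum(ch[t + d] for ch, d in zip(channels, delays)) / len(channels)
        for t in range(n)
    ]


source = [0.0, 1.0, 0.0, -1.0, 0.0, 0.5]
# Hypothetical 3-microphone recording: the source reaches the microphones
# 0, 1, and 2 samples late, respectively.
channels = [[0.0] * d + source + [0.0] * (2 - d) for d in (0, 1, 2)]

steered = delay_and_sum(channels, [0, 1, 2])     # steered at the source
unsteered = delay_and_sum(channels, [0, 0, 0])   # steered elsewhere
print(steered)
print(unsteered)
```

When the delays match the source direction, the channels add coherently and the source is recovered; with mismatched delays, the same samples are spread out and attenuated.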

### 1.2.5 MFT-ASR: Speech recognition based on MFT

The spectral distortion caused by factors such as sound mixture and separation exceeds what the conventional speech recognition community assumes. To deal with it, sound source separation and speech recognition must be connected more closely. HARK handles this with speech recognition based on the missing feature theory (MFT-ASR) [4]. The concept of MFT-ASR is shown in Figure 1.5. The black and red lines in the figure indicate the time variation of the acoustic features of a separated sound and that of the acoustic model for the corresponding speech used by the ASR system, respectively. Because of distortion, the acoustic features of a separated sound differ greatly from those of the model at some points (Figure 1.5(a)). In MFT-ASR, the influence of the distortion is ignored by masking the distorted points with a Missing Feature Mask (MFM) (Figure 1.5(b)). An MFM is a reliability map over time that corresponds to the acoustic features of a separated sound. A binary mask (also called a hard mask) is usually used; masks with continuous values from 0 to 1 are called soft masks. In HARK, the MFM is derived from the stationary noise obtained from the post-filter and from the inter-channel energy. MFT-ASR, like common speech recognition, is based on a Hidden Markov Model (HMM). The parts related to the acoustic scores calculated from the HMM (mainly the output probability calculation) are modified so that the MFM can be used. HARK uses the multiband version of Julius developed at the Furui Laboratory of Tokyo Institute of Technology, reinterpreted as MFT-ASR [13]. HARK 1.0.0 uses the plug-in feature of the Julius 4 series, and the main part of MFT-ASR is implemented as a Julius plug-in. Implementing MFT-ASR as a plug-in allows Julius to be updated without modifying MFT-ASR. Moreover, MFT-ASR works as a server/daemon independent of FlowDesigner: it receives acoustic features and their MFM from a speech recognition client of HARK via socket communication and outputs the recognition results for them.
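The effect of a mask on the output probability calculation can be illustrated with a minimal sketch. It assumes a single diagonal-covariance Gaussian and weights each dimension's log-likelihood by its mask value; the actual Julius plug-in uses its own acoustic models and score computation, which are not reproduced here. The function name `masked_log_likelihood` and all numeric values are hypothetical.

```python
import math


def masked_log_likelihood(feature, mask, mean, var):
    """Diagonal-Gaussian log-likelihood with per-dimension reliability weights.

    mask values: 1.0 keeps a dimension, 0.0 ignores it (hard mask);
    values in between act as a soft mask.
    """
    if not (len(feature) == len(mask) == len(mean) == len(var)):
        raise ValueError("all inputs must have the same dimension")
    total = 0.0
    for x, m, mu, v in zip(feature, mask, mean, var):
        log_p = -0.5 * (math.log(2.0 * math.pi * v) + (x - mu) ** 2 / v)
        total += m * log_p
    return total


mean = [0.0, 1.0, 2.0]
var = [1.0, 1.0, 1.0]
clean = [0.1, 0.9, 2.1]
distorted = [0.1, 0.9, 9.0]    # third dimension damaged by separation
hard_mask = [1.0, 1.0, 0.0]    # marks the third dimension as unreliable
no_mask = [1.0, 1.0, 1.0]

print(masked_log_likelihood(distorted, no_mask, mean, var))
print(masked_log_likelihood(distorted, hard_mask, mean, var))
```

Without the mask, the damaged dimension dominates the score; with the hard mask, the distorted and clean features receive the same score because the unreliable dimension no longer contributes.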

### 1.2.6 Acoustic feature extraction and noise adaptation of acoustic model

To improve the effectiveness of MFT by confining spectral distortion to specific acoustic features, the Mel Scale Log Spectrum (MSLS) [4] is used as the acoustic feature. The Mel-Frequency Cepstrum Coefficient (MFCC), which is generally used for speech recognition, is also available in HARK. However, in MFCC distortion spreads across all feature dimensions, so it does not work well with MFT. When simultaneous speech is infrequent, speech recognition with MFCC achieves better performance in some cases. HARK 1.0.0 provides a new module for using the $\Delta$ power term with MSLS features [6]. The effectiveness of the $\Delta$ power term for MFCC features has already been reported. It has also been confirmed that the 27-dimensional MSLS feature, which consists of 13-dimensional MSLS, 13-dimensional $\Delta$ MSLS, and $\Delta$ power, performs better than the 24-dimensional MSLS and $\Delta$ MSLS (48 dimensions in total) used in HARK 0.1.7. In HARK, the influence of the distortion caused by the aforementioned non-linear separation is reduced by adding a small amount of white noise. An acoustic model is built by multi-condition training on clean speech and on speech with white noise added. Speech recognition is then performed with the same amount of white noise added to the separated speech to be recognized. In this way, highly accurate recognition is achieved for one speaker's speech even when the S/N ratio is around -3 dB [6].
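The white noise addition described above can be sketched as follows. The text does not state the noise level used in HARK, so the 40 dB signal-to-noise ratio here is an arbitrary illustrative value, and the function name `add_white_noise` is hypothetical. The same function would be applied with the same level to the training speech and to the separated speech at recognition time.

```python
import math
import random


def add_white_noise(signal, snr_db, rng):
    """Add Gaussian white noise so that signal power / noise power is snr_db."""
    if not signal:
        raise ValueError("signal must not be empty")
    signal_power = sum(s * s for s in signal) / len(signal)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in signal]


rng = random.Random(0)  # fixed seed so the example is repeatable
# One second of a 440 Hz tone at 16 kHz stands in for separated speech.
speech = [math.sin(2.0 * math.pi * 440.0 * t / 16000.0) for t in range(16000)]
noisy = add_white_noise(speech, snr_db=40.0, rng=rng)

residual = [n - s for n, s in zip(noisy, speech)]
measured_snr_db = 10.0 * math.log10(
    sum(s * s for s in speech) / sum(e * e for e in residual)
)
print(round(measured_snr_db, 1))
```

Adding the same noise floor in training and in recognition makes the conditions match, which is the purpose of the multi-condition training described in the text.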

Footnotes

1. https://wp.hark.jp/forums/
2. Connections across computers can be realized by creating a module for network connection, as is done for the connection to speech recognition in HARK.
3. The original version of FlowDesigner and function-improved version of FlowDesigner 0.9.0 are available at http://flowdesigner.sourceforge.net/ and https://www.hark.jp/, respectively.