# 1.2 Design philosophy of HARK

The design philosophy of the robot audition software HARK is summarized as follows.

1. Provision of complete functionality, from input to sound source localization, source separation, and speech recognition: HARK guarantees end-to-end performance, covering input from microphones installed on a robot, sound source localization, source separation, noise suppression, and automatic speech recognition,

2. Adaptation to the robot shape: HARK accommodates whatever microphone layout a user requires and incorporates it into the signal processing,

3. Support for multichannel A/D systems: HARK supports a variety of multichannel A/D systems across price ranges and feature sets,

4. Provision of optimal sound processing modules and advice: every signal processing algorithm rests on premises under which it is effective, so multiple algorithms have been developed for the same function, and the optimal modules are recommended on the basis of accumulated usage experience,

5. Real-time processing: essential for realizing interaction and behavior through sound.

We have continuously released and updated HARK under the above design concepts as follows:

| Time | Version | Note |
|------|---------|------|
| Apr. 2008 | HARK 0.1.7 | Released as open-source software |
| Nov. 2009 | HARK 1.0.0 pre-release | Many improvements, bug fixes, documentation |
| Oct. 2010 | HARK 1.0.0 | Bug fixes, release of support tools |
| Feb. 2012 | HARK 1.1 | Functionality improvements, 64-bit support, ROS support, bug fixes |
| Mar. 2013 | HARK 1.2 | 3D localization, Windows support, English speech recognition, bug fixes |

User feedback sent to the hark-support mailing list has played a major role in these improvements and bug fixes.

As shown in Figure 1.1, HARK uses FlowDesigner [2] as middleware, except for the speech recognition part (Julius) and the support tools. As Figure 1.1 indicates, only Linux is supported. One reason is that the ALSA (Advanced Linux Sound Architecture) API is used to support multiple multichannel A/D systems. A PortAudio version of HARK is also being developed, since PortAudio has recently become available on Windows.

### 1.2.1 Middleware FlowDesigner

In robot audition, sound sources are typically separated on the basis of sound source localization results, and speech recognition is performed on the separated speech. Composing each stage of multiple modules makes the processing flexible, since an algorithm can be partially substituted, so it is essential to introduce middleware that integrates the modules efficiently. However, as the number of integrated modules increases, the total overhead of the module connections grows and real-time performance is lost. It is difficult to address this problem with a general-purpose framework such as CORBA (Common Object Request Broker Architecture), which requires data serialization at every module connection. Indeed, the modules of HARK all process the same acoustic data in the same time frame; if each module obtained the acoustic data by a memory copy every time, processing would be inefficient in both speed and memory usage.

We have therefore adopted FlowDesigner [2], a data-flow-oriented GUI development environment, as middleware that addresses this problem. Processing in FlowDesigner is faster and lighter than in general-purpose integration frameworks such as CORBA. FlowDesigner is free (LGPL/GPL) middleware that realizes high-speed, lightweight module integration, on the premise that it runs on a single computer.

In FlowDesigner, each module is implemented as a C++ class. Since these classes inherit from common superclasses, the interfaces between modules are naturally standardized. Module connections are realized by calling a specific method of each class (an ordinary function call), so the overhead is small. Since data are passed by reference or pointer, the acoustic data mentioned above are processed at high speed with few resources. In other words, FlowDesigner maintains both a high data transfer rate between modules and module reusability. We publish a version of FlowDesigner from which bugs such as memory leaks have been removed and whose usability (mainly the attribute settings) has been improved on the basis of past experience.
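This module structure can be illustrated with a schematic sketch. The class and method names below (`BufferedNode`, `addInput`, `calculate`, and so on) follow the general FlowDesigner pattern, but the stub types are simplifications made for this illustration, not the actual FlowDesigner headers:

```cpp
#include <string>

// Stand-in stubs for FlowDesigner types (assumptions for this sketch only).
struct ParameterSet {};                  // node attributes set from the GUI
struct ObjectRef {};                     // reference-counted handle to shared data
struct Buffer {
  ObjectRef& operator[](int) { static ObjectRef slot; return slot; }
};

// Common superclass: inheriting from it standardizes the module interface.
class BufferedNode {
public:
  BufferedNode(std::string name, ParameterSet params) {}
  virtual ~BufferedNode() {}
  int addInput(std::string name)  { return 0; }   // declare an input terminal
  int addOutput(std::string name) { return 0; }   // declare an output terminal
  ObjectRef getInput(int id, int count) { return ObjectRef(); }
  // Called once per frame; a connection between modules is just this
  // virtual-method call, which keeps the integration overhead small.
  virtual void calculate(int output_id, int count, Buffer& out) = 0;
};

// Example user module: declares its terminals and processes one frame.
class GainNode : public BufferedNode {
  int inputID, outputID;
public:
  GainNode(std::string name, ParameterSet params)
      : BufferedNode(name, params) {
    inputID  = addInput("INPUT");
    outputID = addOutput("OUTPUT");
  }
  void calculate(int output_id, int count, Buffer& out) override {
    // Data arrive as a reference (ObjectRef), not a copy, so the same
    // acoustic frame is shared by every module in the network.
    ObjectRef in = getInput(inputID, count);
    out[count] = in;   // pass-through; a real node would process here
  }
};
```

Because `calculate()` is an ordinary virtual-method call and data travel as references, connecting two nodes costs little more than a function call, which is exactly the property discussed above.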

Figure 1.2 shows a FlowDesigner network for typical robot audition with HARK. Multichannel acoustic signals are acquired from an input device or file, and sound source localization and source separation are performed. Acoustic features are extracted from the separated sounds, missing feature masks (MFMs) are generated, and both are sent to the automatic speech recognizer (ASR). The attributes of each module can be set on the attribute setting screen (Figure 1.3 shows an example of the attribute setting screen of GHDSS). Table 1.1 lists the HARK modules and external tools currently provided. In the following sections, each module is outlined together with its design strategy.

### 1.2.2 Input device

In HARK, multiple microphones (a microphone array) mounted as the ears of a robot are used for processing. Figure 1.4 shows examples of microphones installed as the ears of robots. Each of these examples is equipped with an eight-channel microphone array, although microphone arrays with an arbitrary number of channels can be used in HARK. The multichannel A/D conversion devices supported by HARK are listed below.

• the RASP series from System In Frontier, Inc.,

• ALSA-based A/D conversion devices (e.g., the RME Hammerfall DSP series Multiface AE),

• Microsoft Kinect

• Sony PS-EYE

• Dev-Audio Microcone

These A/D systems have different numbers of input channels. Any number of channels can be used in HARK by changing its internal parameters, although processing speed may drop when the number of channels is large. Both 16-bit and 24-bit quantization are supported. Furthermore, the sampling rate assumed by HARK is 16 kHz, so a downsampling module is provided for 48 kHz data. Note that the Tokyo Electron Device Ltd. TD-BD-16ADUSB (USB interface) is no longer supported, since the Linux kernel it supports is too old.
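As an illustration of the downsampling step, the following minimal sketch decimates a 48 kHz signal to 16 kHz. The anti-aliasing filter here is a toy moving average chosen for brevity; it is an assumption for this sketch, not the filter used by HARK's downsampling module:

```cpp
#include <cstddef>
#include <vector>

// Downsample 48 kHz audio to 16 kHz: low-pass filter, then keep 1 of 3 samples.
std::vector<float> downsample48to16(const std::vector<float>& in) {
  const std::size_t taps = 9;  // toy 9-tap moving-average low-pass (assumption;
                               // a real module needs a cutoff below 8 kHz)
  std::vector<float> out;
  for (std::size_t i = 0; i + taps <= in.size(); i += 3) {
    float acc = 0.0f;
    for (std::size_t k = 0; k < taps; ++k) acc += in[i + k];
    out.push_back(acc / taps);  // filtered sample at one third the rate
  }
  return out;
}
```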

Low-priced pin microphones are sufficient, although it is better to use a preamplifier to compensate for a lack of gain; for example, the OctaMic II is available from RME.

Table 1.1: Nodes and Tools provided by HARK 1.1.0
| Function | Category name | Description |
|----------|---------------|-------------|
| Voice input/output | AudioIO | Acquire sound from a microphone |
| | | Acquire sound from a file |
| | | Save sound to a file |
| | | Save sound to a WAV-format file |
| | | Socket-based data communication |
| Sound source localization / tracking | Localization | Output a constant localization value |
| | | Display localization results |
| | | Localize sound sources |
| | | Load localization information from a file |
| | | Save source location information to a file |
| | | Extend the tracking result forward |
| | | Track sound sources |
| | | Load a correlation matrix (CM) file |
| | | Save a CM file |
| | | Select channels of a CM |
| | | Create a CM |
| | | Create a CM |
| | | Divide each element of a CM |
| | | Multiply each element of a CM |
| | | Invert a CM |
| | | Multiply CMs |
| | | Output an identity CM |
| Sound source separation | Separation | Estimate background noise |
| | | Subtract the noise spectrum and estimate the optimum gain |
| | | Add power spectra |
| | | Estimate inter-channel leak noise |
| | | Separate sound sources by GHDSS |
| | | Estimate the noise spectrum |
| | | Perform post-filtering after sound source separation |
| | | Estimate the voice spectrum |
| Feature extraction | FeatureExtraction | Calculate the $\Delta$ term |
| | | Remove terms from a feature |
| | | Perform mel-scale filter bank processing |
| | | Extract MFCC features |
| | | Extract MSLS features |
| | | Perform pre-emphasis |
| | | Save features |
| | | Save features in HTK format |
| | | Normalize the spectral mean |
| Missing feature mask | MFM | Calculate the $\Delta$ mask term |
| | | Calculate the $\Delta$ power mask term |
| | | Generate MFMs |
| Communication with ASR | ASRIF | Send features to ASR |
| | | Same as above, with spectral mean normalization (SMN) of the features |
| Others | MISC | Select channels |
| | | Generate log output of data |
| | | Convert Matrix to Map |
| | | Calculate the gain of multichannel input |
| | | Perform downsampling |
| | | Perform multichannel FFT |
| | | Calculate the power of Map input |
| | | Calculate the power of Matrix input |
| | | Select audio stream segments by ID |
| | | Select sound sources by direction |
| | | Select sound sources by ID |
| | | Convert waveforms |
| | | Add white noise |

| Function | Category name | Tool name | Description |
|----------|---------------|-----------|-------------|
| Data generation | External tool | harktool4 | Visualize data / generate setting files |
| | | wios | Sound recording tool for transfer function measurements |

### 1.2.3 Sound source localization

The MUltiple SIgnal Classification (MUSIC) method, which has shown the best performance with microphone arrays in our past experience, is employed. MUSIC localizes sound sources using the impulse responses (transfer functions) between the source positions and each microphone. The impulse responses can be obtained by actual measurement or calculated from the geometric positions of the microphones. In HARK 0.1.7, the beamformer of ManyEars [3] was also available for microphone arrays. That module works in a two-dimensional polar coordinate space ("2D" in the sense that only direction information is obtained within the three-dimensional polar coordinate space). It has been reported that its localization error is about 1.4$^\circ$ when the sources are within 5 m of the microphone array and separated by more than 20$^\circ$. However, ManyEars was originally designed for 48 kHz sampling rather than the 16 kHz used in HARK, and it simulates impulse responses from the microphone layout under the assumption that the microphones are placed in free space, so the influence of the robot body cannot be taken into account. Moreover, the localization accuracy of adaptive beamformers such as MUSIC is higher than that of conventional beamformers. For these reasons, HARK 1.0.0 supports only the MUSIC method. HARK 1.1 added GEVD-MUSIC and GSVD-MUSIC, extended versions of MUSIC that suppress or whiten a known high-power noise, such as the robot's own ego noise, so that desired sounds can be localized under that noise. In HARK 1.2, the algorithm was further extended to localize sound sources in three dimensions.
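For reference, the spatial spectrum of the standard narrowband MUSIC method is shown below; this is the textbook formulation, while HARK's implementation adds broadband processing and the GEVD/GSVD extensions mentioned above:

$$
P(\theta) = \frac{\mathbf{a}^{H}(\theta)\,\mathbf{a}(\theta)}{\sum_{i=L+1}^{M} \left|\mathbf{a}^{H}(\theta)\,\mathbf{e}_{i}\right|^{2}},
$$

where $\mathbf{a}(\theta)$ is the steering vector for direction $\theta$ (built from the measured or simulated transfer functions), $M$ is the number of microphones, $L$ is the number of sources, and $\mathbf{e}_{L+1},\dots,\mathbf{e}_{M}$ are the noise-subspace eigenvectors of the spatial correlation matrix of the input. Since the steering vectors of actual sources are orthogonal to the noise subspace, $P(\theta)$ exhibits sharp peaks at the source directions.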

### 1.2.4 Sound source separation

For sound source separation, HARK 1.0.0 employs Geometric High-order Decorrelation-based Source Separation (GHDSS) [8], which has shown the highest overall performance across various acoustic environments in our past experience, together with PostFilter and the noise estimation method Histogram-based Recursive Level Estimation (HRLE). At present, the best performance and stability in various acoustic environments are obtained by the combination of GHDSS and HRLE. Until now, various methods, such as beamformers (delay-and-sum and adaptive types), Independent Component Analysis (ICA), and Geometric Source Separation (GSS), have been developed and evaluated. The sound source separation methods employed in HARK are summarized as follows (a sketch of the separation model they share follows the list):

1. Delay-and-sum beamformer, employed in HARK 0.1.7,

2. Combination of the ManyEars Geometric Source Separation (GSS) and PostFilter [4], supported as external modules in HARK 0.1.7,

3. Combination of GSS and PostFilter of our own design [5], employed in the HARK 1.0.0 pre-release,

4. Combination of GHDSS and HRLE, employed in HARK 1.0.0 [6, 8].
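In outline, the GSS/GHDSS variants above share the same frequency-domain model; the following is a sketch of that common structure, not HARK's exact cost function:

$$
\mathbf{y}(f,t) = \mathbf{W}(f)\,\mathbf{x}(f,t), \qquad \mathbf{W}(f)\,\mathbf{D}(f) \approx \mathbf{I},
$$

where $\mathbf{x}(f,t)$ is the multichannel input spectrum, $\mathbf{y}(f,t)$ the separated outputs, and $\mathbf{D}(f)$ the matrix of steering vectors for the localized source directions. The separation matrix $\mathbf{W}(f)$ is adapted online so that the outputs become mutually uncorrelated (GHDSS applies a nonlinear function to $\mathbf{y}$, giving a higher-order criterion), while the geometric constraint $\mathbf{W}\mathbf{D} \approx \mathbf{I}$ ties the separation to the directions obtained from sound source localization.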

To improve the effectiveness of missing feature theory (MFT) based recognition, the spectral distortion should be confined to specific acoustic features; for this reason, the Mel-Scale Log Spectrum (MSLS) [4] is used as the acoustic feature. The Mel-Frequency Cepstrum Coefficient (MFCC), generally used in speech recognition, is also available in HARK; however, in MFCC the distortion spreads over all feature dimensions, so it is poorly suited to MFT. When simultaneous speech is infrequent, recognition with MFCC nevertheless performs better in some cases. HARK 1.0.0 provides a new module that adds the $\Delta$ power term to MSLS features [6]; the effectiveness of the $\Delta$ power term has already been reported for MFCC features. It has been confirmed that the 27-dimensional feature consisting of 13-dimensional MSLS, 13-dimensional $\Delta$MSLS, and $\Delta$ power performs better than the 48-dimensional feature (24-dimensional MSLS plus 24-dimensional $\Delta$MSLS) used in HARK 0.1.7. In HARK, the influence of the distortion caused by the non-linear separation processing described above is reduced by adding a small amount of white noise: an acoustic model is constructed by multi-condition training on clean speech and on speech with white noise added, and recognition is then performed with the same amount of white noise added to the separated speech. In this way, highly accurate recognition is achieved even at an S/N of around -3 dB for a single speaker's speech [6].
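The $\Delta$ terms above are computed by the regression formula that is standard in speech recognition front ends (shown here as a reference; the exact window width in HARK's $\Delta$ module may differ):

$$
\Delta x_{t} = \frac{\sum_{\tau=1}^{\Theta} \tau \,(x_{t+\tau} - x_{t-\tau})}{2 \sum_{\tau=1}^{\Theta} \tau^{2}},
$$

where $x_{t}$ is a static feature (an MSLS coefficient or the log power) at frame $t$ and $\Theta$ is the half-width of the regression window; applying the formula to the log power yields the $\Delta$ power term discussed above.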