1.3.1 Three-speaker simultaneous speech recognition

\includegraphics[width=0.9\linewidth ]{fig/Intro/Order-e-1.eps} a) Robovie takes orders.
\includegraphics[width=0.9\linewidth ]{fig/Intro/Order-e-2.eps} b) Three people order dishes simultaneously.
\includegraphics[width=0.9\linewidth ]{fig/Intro/Order-e-3.eps} c) Robovie repeats the orders after 1.9 seconds and states the total amount.

Figure 1.6: Robovie-R2 recognizing dish orders from three people speaking simultaneously

The three-speaker simultaneous speech recognition system returns a speech recognition result for each speaker through a series of processing steps: microphone input, sound source localization, sound source separation, missing feature mask generation and ASR. The module network in FlowDesigner is shown in Figure 1.2. The dialogue management module performs the following:

  1. Listen to the user's speech. If it is judged to be an order request, perform the following steps.

  2. Perform the robot audition processing chain: sound source localization, sound source separation, post-filter processing, extraction of acoustic features, and missing feature mask generation.

  3. Send the acoustic features and missing feature masks for each speaker to the speech recognition engine and receive the speech recognition results.

  4. Analyze the speech recognition results. If they are orders for dishes, repeat the orders and state the total amount for the dishes.

  5. Accept any subsequent orders.
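
The control flow of steps 1, 4 and 5 can be sketched as follows. This is only an illustrative outline, not the actual dialogue management module: `handle_turn`, `parse_order` and the `recognize` callback are hypothetical names, `recognize` stands in for the whole audition and ASR chain of steps 2 and 3, and the prices in `PRICES` are made up, since the source gives no menu.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical menu with made-up prices (assumption; not from the source).
PRICES: Dict[str, int] = {"ramen": 600, "curry": 700, "rice": 150}


def parse_order(text: str) -> Optional[str]:
    """Return the dish name if the utterance is a known order, else None."""
    dish = text.strip().lower()
    return dish if dish in PRICES else None


def handle_turn(recognize: Callable[[], List[str]]) -> Optional[str]:
    """Run one dialogue turn and return the robot's reply, if any.

    `recognize` is a placeholder for steps 2 and 3: it returns one
    recognized string per separated speaker.
    """
    results = recognize()
    orders = [d for d in (parse_order(r) for r in results) if d is not None]
    if not orders:
        return None  # not an order request; keep listening (step 1)
    total = sum(PRICES[d] for d in orders)
    # Step 4: repeat the orders and state the total amount.
    return f"You ordered {', '.join(orders)}. The total is {total}."
```

Calling `handle_turn` repeatedly corresponds to step 5; a turn whose recognition results contain no order produces no reply.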

The acoustic model used for speech recognition is speaker-independent. The language model is written as a context-free grammar, so by designing the grammar appropriately it becomes possible to recognize “large portion of ramen”, “large portion of spicy ramen” or “large portion of ramen and rice”.

With the conventional processing, which had to pass data through multiple files, it took 7.9 seconds from the end of the three speakers’ speech to the completion of recognition. With HARK, the response time has been shortened to around 1.9 seconds.¹ Because the response is fast, the robot appears to repeat each order and state the total amount immediately after taking orders from all the speakers. Further, in the case of file input, the time at which speech ends is known exactly, and the latency from the completion of recognition after the utterance to the start of the robot's response is around 0.4 seconds, although this depends on the module setup. Moreover, the robot can turn to face each speaker while repeating the order. HRP-2 accompanies its responses with gestures. However, adding gestures delays the response while the gesture is prepared, which can lead to clumsy motions, so such settings must be made carefully.
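
To illustrate how a small grammar can cover the example phrases above, the following sketch hand-codes a recognizer for a toy grammar. The grammar rules, the dish list and the function `parse_utterance` are assumptions made for this example; the actual grammar file and its format for the speech recognition engine are not shown in the source.

```python
from typing import List, Optional, Sequence, Tuple

# Toy grammar (assumed for illustration):
#   order -> [ "large portion of" ] dish ( "and" dish )*
#   dish  -> "spicy ramen" | "ramen" | "rice"
SIZE: Tuple[str, ...] = ("large", "portion", "of")
# Longer alternatives first so "spicy ramen" is tried before "ramen".
DISHES: List[Tuple[str, ...]] = [("spicy", "ramen"), ("ramen",), ("rice",)]


def _match(tokens: Sequence[str], pos: int, words: Sequence[str]) -> Optional[int]:
    """Return the position after `words` if they occur at `pos`, else None."""
    end = pos + len(words)
    return end if tuple(tokens[pos:end]) == tuple(words) else None


def parse_utterance(utterance: str) -> Optional[Tuple[bool, List[str]]]:
    """Return (is_large, dishes) if the utterance fits the grammar, else None."""
    tokens = utterance.lower().split()
    pos, large = 0, False
    nxt = _match(tokens, pos, SIZE)
    if nxt is not None:  # optional size prefix
        large, pos = True, nxt
    dishes: List[str] = []
    while True:
        for words in DISHES:
            nxt = _match(tokens, pos, words)
            if nxt is not None:
                dishes.append(" ".join(words))
                pos = nxt
                break
        else:
            return None  # a dish is required here but none matched
        if pos < len(tokens) and tokens[pos] == "and":
            pos += 1  # another dish must follow
            continue
        break
    # Accept only if every token was consumed.
    return (large, dishes) if pos == len(tokens) else None
```

Extending the recognizable phrases then amounts to adding alternatives to the rules, for example another entry in `DISHES`, rather than changing the parsing logic.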


  1. A demonstration is available at http://winnie.kuis.kyoto-u.ac.jp/SIG/