6.1 Creating an acoustic model


This recipe describes how to create the acoustic model used in speech recognition. Doing so is useful for improving speech recognition performance after HARK has been installed on a robot.


An acoustic model is a statistical representation of the relationship between phonemes and acoustic features, and it has a substantial impact on speech recognition performance. The Hidden Markov Model (HMM) is usually used as the acoustic model.

When the layout of the microphones mounted on the robot is changed, or when the algorithms or parameters used for separation and speech enhancement are changed, the properties of the acoustic features input to speech recognition may also change. In such cases, speech recognition performance can improve greatly by adapting an acoustic model to the new conditions or by creating a new acoustic model that matches them.

The Hidden Markov Model Toolkit (HTK) is used to create acoustic models for Julius, the speech recognition engine used in HARK.

The next section describes the following three methods of constructing an acoustic model:

  1. Multi-condition training

  2. Additional training

  3. MLLR/MAP adaptation

Although an actual acoustic model has various parameters, a 3-state, 16-mixture triphone model is used here as an example. For details about each parameter, refer to published textbooks such as “HTK Book” and “IT Text Speech Recognition System”.
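To make the phrase "3-state 16-mixture" concrete, the following Python sketch builds a left-to-right transition table with three emitting states and evaluates a 16-component diagonal-covariance Gaussian mixture for one state. It is only an illustration of the model structure and is not how HTK or Julius stores or evaluates a model. The function names, the self-loop probability of 0.6, and the 2-dimensional feature vector are all assumptions chosen for readability; real acoustic features have many more dimensions.

```python
import math

NUM_STATES = 3      # emitting states per triphone (from the text)
NUM_MIXTURES = 16   # Gaussian components per state (from the text)
FEATURE_DIM = 2     # hypothetical; kept tiny for illustration


def left_to_right_transitions(num_states, stay=0.6):
    """Rows are emitting states; the extra last column is the model exit.
    Each state either stays where it is or advances by one (assumed topology)."""
    rows = []
    for i in range(num_states):
        row = [0.0] * (num_states + 1)
        row[i] = stay
        row[i + 1] = 1.0 - stay
        rows.append(row)
    return rows


def diag_gaussian_logpdf(x, mean, var):
    """Log density of a Gaussian with diagonal covariance."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_loglik(x, weights, means, variances):
    """log sum_k w_k N(x; mean_k, diag(var_k)), computed with log-sum-exp."""
    terms = [
        math.log(w) + diag_gaussian_logpdf(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def parameters_per_state(num_mixtures, dim):
    """Stored values per state: dim means, dim variances and 1 weight per component."""
    return num_mixtures * (2 * dim + 1)


if __name__ == "__main__":
    for row in left_to_right_transitions(NUM_STATES):
        print(row)
    print(parameters_per_state(NUM_MIXTURES, FEATURE_DIM), "values per state")
```

Each triphone therefore carries three such mixtures, one per state, which is why the amount of training or adaptation data matters for the methods listed above.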