6.1.1 Multi-condition Training

The basic procedure for creating a typical triphone-based acoustic model is as follows.

  1. Creation of training data

  2. Extraction of acoustic features

  3. Training of a monophone model

  4. Training of a non-context-dependent triphone model

  5. State clustering

  6. Training of a context-dependent triphone model

In multi-condition training, audio recorded with the robot is used for training in addition to the clean audio signal. The data used for speech recognition in HARK are recorded with a microphone array and have undergone sound source separation and speech enhancement. For this reason, the training data should also undergo sound source separation and speech enhancement.
However, acoustic model training requires a large amount of audio data, so recording all of it in this way is not realistic. Instead, the impulse response of the transmission path between the sound source and the microphone array is measured beforehand, and the data that would be recorded by the microphone array are created virtually by convolving this impulse response with the clean audio. See 5.25.4 for details on how the data are actually created.
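
The virtual recording described above amounts to convolving the clean signal with one measured impulse response per microphone. The following is a minimal sketch in plain Python. The function names and the toy two-channel impulse responses are illustrative assumptions, not part of HARK, and a real pipeline would read wav files and use an FFT-based convolution for speed.

```python
def convolve(signal, impulse_response):
    """Direct-form linear convolution of two sequences of floats."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def simulate_array_recording(clean, impulse_responses):
    """Return one simulated channel per microphone.

    clean: samples of the clean utterance.
    impulse_responses: one impulse response per microphone, measured
    between the sound source position and that microphone.
    """
    return [convolve(clean, h) for h in impulse_responses]


# Toy example (hypothetical values): a 2-microphone array in which the
# second microphone receives the sound one sample later and attenuated.
clean = [1.0, 0.5, -0.25]
irs = [[1.0, 0.0], [0.0, 0.5]]
channels = simulate_array_recording(clean, irs)
```

The simulated channels would then be passed through the same sound source separation and speech enhancement as real recordings before feature extraction.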

Acoustic feature extraction

The mel-frequency cepstral coefficient (MFCC) is often used as the acoustic feature. Although MFCC can be used, the Mel Scale Logarithmic Spectral coefficient (MSLS) is recommended for HARK. MSLS can be created easily from a wav file with a HARK network, and MFCC can be created with a HARK network in the same way. However, HTK provides its own feature extraction tool, HCopy, which offers more parameters for MFCC extraction than HARK does. For the usage of HCopy, refer to the HTKBook. In either case, the acoustic model is created with HTK after feature extraction. See 10 for more details.

Dictionary and MLF (Master Label File) preparation

  1. Data revision:

    Even with a distributed corpus, it is generally difficult to eliminate all notational inconsistencies and transcription errors. These are hard to detect in advance, but they should be corrected as soon as they are found, because such errors can degrade performance.

  2. Creation of words.mlf:

    Create the file words.mlf, which contains the (virtual) label filenames corresponding to the feature files, together with the utterances written word by word. The first line of words.mlf must be #!MLF!#. Each entry starts with a line giving the label filename enclosed in double quotation marks ("..."), followed by the words of the utterance, one word per line. The last line of each entry must be a single ASCII period (".").

  3. Creation of word dictionary:

    Create a word dictionary that associates each word with its phoneme string. A simple example of words and their phoneme strings is shown below:

    AME a m e
    TOQTE t o q t e
    SILENCE sil
    SP      sp

  4. Creation of phoneme MLF (phones1.mlf):

    The phoneme MLF is created from the dictionary and the word MLF. Specifically, use HLEd.

    % HLEd -d dic -i phones1.mlf phones1.led words.mlf

    The rules are described in phones1.led. The rule that allows sp (short pause) is described in the HTKBook.

    IS silB silE

    The format of the phoneme MLF is almost the same as that of the word MLF, except that each line holds a phoneme instead of a word. An example of phones1.mlf is shown below.
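
The word-level MLF layout from step 2 can be sketched as follows. This is an illustrative helper, not a HARK or HTK tool. The function name, the label filenames (including the "*/" prefix) and the utterances are assumptions made for the example. A phoneme MLF has the same layout with one phoneme per line.

```python
def format_word_mlf(utterances):
    """Build the text of a word-level MLF.

    utterances: list of (label_filename, list_of_words) pairs.
    """
    lines = ["#!MLF!#"]                     # mandatory header line
    for label_name, words in utterances:
        lines.append('"%s"' % label_name)   # label filename in double quotes
        lines.extend(words)                 # one word per line
        lines.append(".")                   # a single period closes the entry
    return "\n".join(lines) + "\n"


# Hypothetical utterances using the words from the dictionary example.
mlf_text = format_word_mlf([
    ("*/utt0001.lab", ["SILENCE", "AME", "SILENCE"]),
    ("*/utt0002.lab", ["SILENCE", "TOQTE", "SILENCE"]),
])
print(mlf_text, end="")
```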


Preparation of the feature file list train.scp

Basically, this is a list of the feature filenames written as full paths, one filename per line. However, feature files sometimes contain abnormal values, so it is recommended to check the values with HList first and to list only the files whose values are correct.
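
One possible way to build train.scp while skipping files with abnormal values is sketched below. It is a rough stand-in for a manual HList check, not a replacement for it. It assumes that the feature files are uncompressed HTK parameter files (a 12-byte big-endian header followed by 32-bit float values) and that they use a hypothetical .mfc extension. Adjust both to match how the features were actually saved.

```python
import math
import os
import struct


def feature_file_is_sane(path):
    """Return True if the file parses as HTK features with finite values."""
    with open(path, "rb") as f:
        header = f.read(12)
        if len(header) != 12:
            return False
        # nSamples, sampPeriod (100 ns units), sampSize (bytes), parmKind
        n_samples, _period, samp_size, _kind = struct.unpack(">iihh", header)
        if n_samples <= 0 or samp_size <= 0 or samp_size % 4 != 0:
            return False
        body = f.read()
    if len(body) != n_samples * samp_size:
        return False
    values = struct.unpack(">%df" % (len(body) // 4), body)
    return all(math.isfinite(v) for v in values)


def write_scp(feature_dir, scp_path, extension=".mfc"):
    """Write the full paths of all sane feature files, one per line."""
    kept = []
    for name in sorted(os.listdir(feature_dir)):
        path = os.path.abspath(os.path.join(feature_dir, name))
        if name.endswith(extension) and feature_file_is_sane(path):
            kept.append(path)
    with open(scp_path, "w") as f:
        for path in kept:
            f.write(path + "\n")
    return kept
```

A call such as write_scp("features", "train.scp") returns the list of accepted files, so the rejected ones can be found by comparing it with the directory contents.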

Preparation of triphones

Although this step can be performed after monophone training, phones1.mlf may need to be revised depending on the results of the check. To save time, the step is therefore performed here.

  1. Creation of tri.mlf:

    First, simply convert the phonemes into groups of three.

    % HLEd -i tmptri.mlf mktri.led phones1.mlf

    An example of mktri.led is shown below. The phonemes listed in mktri.led are excluded from the context.

    WB sp
    WB silB
    WB silE

    The number of parameters is reduced by treating the long vowel contexts before and after a phoneme as the corresponding short vowel contexts. An example of the created tri.mlf is shown below.

    q-sh+u: q-sh+u
    sh-u:+k y-u:+k
    u:-k+a u-k+a
    ny+u: ny+u
    ny-u:+y y-u:+y
    u:-y+o: u-y+o
    o:-k+u o-k+u

  2. Creation of triphones:

    The file triphones is the list of triphones (phonemes grouped by three) that appear in tri.mlf.

    grep -v lab tri.mlf |
    grep -v MLF |
    grep -v "\."|
    sort |
    uniq > triphones

  3. physicalTri:

    A triphone list that also includes the phoneme contexts that do not appear in the training data (tri.mlf).

  4. Consistency check:

    Check that triphones and physicalTri are consistent with each other. This check is important.
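
The list extraction in step 2 and the consistency check in step 4 can be sketched in Python as follows. The check assumed here is that every triphone found in tri.mlf also appears in physicalTri. The function names are illustrative, tri.mlf is assumed to hold one label per line without time stamps, and physicalTri is assumed to hold one model name per line.

```python
def triphones_in_mlf(mlf_text):
    """Collect the unique labels of an MLF, in the spirit of the
    grep/sort/uniq pipeline."""
    labels = set()
    for line in mlf_text.splitlines():
        line = line.strip()
        # Skip blanks, the header, quoted label filenames and terminators.
        if not line or line == "#!MLF!#" or line.startswith('"') or line == ".":
            continue
        labels.add(line)
    return sorted(labels)


def missing_from_physical(triphones, physical_text):
    """Return the triphones that are absent from the physicalTri list."""
    physical = set(physical_text.split())
    return [t for t in triphones if t not in physical]


# Hypothetical miniature inputs.
mlf = '#!MLF!#\n"*/utt0001.lab"\nsilB\nq-sh+u\ny-u:+k\nu-k+a\nsilE\n.\n'
physical = "silB\nsilE\nq-sh+u\nu-k+a\n"
found = triphones_in_mlf(mlf)
print(missing_from_physical(found, physical))
```

An empty result means that every triphone used in training has an entry in physicalTri. Any name that is printed points to a list that needs to be revised before training.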

Preparation of monophone

  1. Create an HMM prototype (proto-ini):

    proto can be created by using the HTK tool called MakeProtoHMMSet. Below is an example of proto-ini used for MSLS.

    <STREAMINFO> 1 27
    ~h "proto"
    <STATE> 2
    <MEAN> 27
    0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 
    0.0 0.0 0.0 0.0 0.0 0.0 0.0
    <VARIANCE> 27
    <VARIANCE> 27
    1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
    <STATE> 3
    <MEAN> 27
    0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
    <VARIANCE> 27
    1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
    <STATE> 4
    <MEAN> 27
    0.0 0.0 0.0 0.0 0.0 0.0 0.0
    <VARIANCE> 27
    1.0 1.0 1.0 1.0 1.0 1.0 1.0
    <TRANSP> 5
    0.0 1.0 0.0 0.0 0.0
    0.0 0.6 0.4 0.0 0.0
    0.0 0.0 0.6 0.4 0.0
    0.0 0.0 0.0 0.7 0.3
    0.0 0.0 0.0 0.0 0.0
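
Typing 27 values per line by hand is error-prone, so a listing like the one above can also be generated. The sketch below only reproduces the layout of the example for a given feature dimension. The function name is hypothetical, and any additional definitions that a particular HTK setup requires are outside its scope.

```python
def make_proto(dim, name="proto"):
    """Return prototype text with zero means and unit variances."""
    zeros = " ".join(["0.0"] * dim)
    ones = " ".join(["1.0"] * dim)
    lines = ["<STREAMINFO> 1 %d" % dim, '~h "%s"' % name]
    for state in (2, 3, 4):                 # the three emitting states
        lines += ["<STATE> %d" % state,
                  "<MEAN> %d" % dim, zeros,
                  "<VARIANCE> %d" % dim, ones]
    # Same transition matrix as in the example listing.
    lines += ["<TRANSP> 5",
              "0.0 1.0 0.0 0.0 0.0",
              "0.0 0.6 0.4 0.0 0.0",
              "0.0 0.0 0.6 0.4 0.0",
              "0.0 0.0 0.0 0.7 0.3",
              "0.0 0.0 0.0 0.0 0.0"]
    return "\n".join(lines) + "\n"


proto_text = make_proto(27)
```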

Creation of initial model

Use HCompV to create the initial HMM.

   %   mkdir hmm0
   %   HCompV -C config.train -f 0.01 -m -S train.scp -M hmm0 proto-ini

The config.train used for MSLS is below:


As a result, proto and vFloor (the initial model) are created under hmm0/, with the means and variances estimated from all the training data listed in train.scp. This may take some time depending on the amount of data.

  1. Creation of initial monophones:

    • hmm0/hmmdefs

      Copy the parameters of hmm0/proto to all phonemes.

      % cd hmm0
      % ../mkmonophone.pl proto ../monophone1.list > hmmdefs

      monophone1.list is a list of phonemes that includes sp. The HTKBook recommends first training with the phoneme list "monophone0.list", which has no sp, and then switching to "monophone1.list". Here, the phoneme list that includes sp is used from the beginning.

    • hmm0/macro

      Create the file "macro" by rewriting part of the contents of vFloor. It is used for variance flooring when data are insufficient.

      % cp vFloor macro

      In this example, add the following line as the header of macro. In general, the header should be the same as that of hmmdefs, i.e., it depends on the content of proto.

      <STREAMINFO> 1 27
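
mkmonophone.pl itself is not listed here. As a rough idea of what such a script does, the following Python sketch copies the prototype definition once per phoneme and renames each copy. It is a hypothetical stand-in that assumes the prototype contains exactly one ~h line and that everything before that line is a shared header to be written once.

```python
def clone_proto(proto_text, phonemes):
    """Build hmmdefs text by copying the prototype for every phoneme."""
    lines = proto_text.splitlines()
    start = next(i for i, l in enumerate(lines) if l.startswith("~h"))
    header, body = lines[:start], lines[start + 1:]
    out = list(header)                      # shared header, written once
    for p in phonemes:
        out.append('~h "%s"' % p)           # model name = phoneme name
        out.extend(body)                    # same parameters as proto
    return "\n".join(out) + "\n"


def read_phoneme_list(list_text):
    """One phoneme per line; blank lines are ignored."""
    return [l.strip() for l in list_text.splitlines() if l.strip()]


# Tiny hypothetical prototype and phoneme list.
proto = '<STREAMINFO> 1 2\n~h "proto"\n<STATE> 2\n<MEAN> 2\n0.0 0.0\n'
hmmdefs = clone_proto(proto, read_phoneme_list("a\nsp\n"))
```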

Monophone Training

% cd ../
% mkdir hmm1 hmm2 hmm3

Perform the training at least three times (hmm1, hmm2, hmm3).

* hmm1

% HERest -C config.train -I phones1.mlf -t 250.0 150.0 1000.0 -T 1 \
-S train.scp -H hmm0/macro -H hmm0/hmmdefs -M hmm1 monophone1.list

* hmm2

% HERest -C config.train -I phones1.mlf -t 250.0 150.0 1000.0 -T 1 \
-S train.scp -H hmm1/macro -H hmm1/hmmdefs -M hmm2 monophone1.list

* hmm3

% HERest -C config.train -I phones1.mlf -t 250.0 150.0 1000.0 -T 1 \
-S train.scp -H hmm2/macro -H hmm2/hmmdefs -M hmm3 monophone1.list

Although the alignment should normally be redone at this point, that step is omitted here.
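
The three passes differ only in their input and output directories, so they are easy to script. The sketch below builds the command lines in Python with the same HERest options as above. The helper name is an assumption. The macro file name (set here to macro, the file created from vFloor) and the HMM list given as the last argument (monophone1.list) are parameters. The commented-out subprocess call requires HTK to be installed.

```python
def herest_commands(first, last, prefix="hmm", macro="macro",
                    mlf="phones1.mlf", hmm_list="monophone1.list"):
    """Return HERest command lines for passes first..last."""
    commands = []
    for n in range(first, last + 1):
        src, dst = "%s%d" % (prefix, n - 1), "%s%d" % (prefix, n)
        commands.append([
            "HERest", "-C", "config.train", "-I", mlf,
            "-t", "250.0", "150.0", "1000.0", "-T", "1",
            "-S", "train.scp",
            "-H", "%s/%s" % (src, macro), "-H", "%s/hmmdefs" % src,
            "-M", dst, hmm_list,
        ])
    return commands


for cmd in herest_commands(1, 3):
    print(" ".join(cmd))
    # To actually run it (requires HTK on PATH and an existing output dir):
    # import subprocess; subprocess.run(cmd, check=True)
```

The same helper covers the later triphone passes by changing the arguments, for example herest_commands(2, 10, prefix="tri", mlf="tri.mlf", hmm_list="triphones").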

Creation of Triphone

  1. Creation of Triphone from Monophone:

    % mkdir tri0
    % HHEd -H hmm3/macro -H hmm3/hmmdefs -M tri0 mktri.hed monophone1.list

  2. Initial training of triphones:

    % mkdir tri1
    % HERest -C config.train -I tri.mlf -t 250.0 150.0 1000.0 -T 1 -s stats \
    -S train.scp -H tri0/macro -H tri0/hmmdefs -M tri1 triphones

    Repeat the training about 10 times (tri1 through tri10).


  1. Clustering into 2000 states:

    % mkdir s2000
    % mkdir s2000/tri-01-00
    % HHEd -H tri10/macro -H tri10/hmmdefs -M s2000/tri-01-00 2000.hed \
    triphones > log.s2000

    Here, 2000.hed is written as follows. The stats on the first line is the output file obtained in 9.2 (the file written by the -s stats option of HERest). First, replace thres with a value of around 1000. Then adjust this value by trial and error until the number of states reported in the execution log is about 2000.

    RO 100.0 stats
    TR 0
    QS "L_Nasal" { N-*,n-*,m-* }
    QS "R_Nasal" { *+N,*+n,*+m }
    QS "L_Bilabial" { p-*,b-*,f-*,m-*,w-* }
    QS "R_Bilabial" { *+p,*+b,*+f,*+m,*+w }
    TR 2
    TB thres "TC_N2_" {("N","*-N+*","N+*","*-N").state[2]}
    TB thres "TC_a2_" {("a","*-a+*","a+*","*-a").state[2]}
    TR 1
    AU "physicalTri"
    ST "Tree,thres"



    The items written in the TB lines are the targets of clustering. In this example, only states at the same position of the same central phoneme are tied together.

    thres is the split threshold. Control the final number of states by adjusting this threshold appropriately (e.g., 1000 or 1200) and checking the log.

  2. Training: Perform training after clustering.

    % mkdir s2000/tri-01-01
    % HERest -C config.train -I tri.mlf -t 250.0 150.0 1000.0 -T 1 \
    -S train.scp -H s2000/tri-01-00/macro -H s2000/tri-01-00/hmmdefs \
    -M s2000/tri-01-01 physicalTri

    Repeat this at least three times (s2000/tri-01-01 through s2000/tri-01-03).
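
A 2000.hed file needs one TB line per phoneme and emitting state, which is tedious to write by hand. The following sketch generates such lines. It assumes the usual HTKBook form of the item list, ending in .state[n], and the macro naming pattern TC_<phoneme><state>_ seen in the example. The function name and the two-phoneme list are illustrative only.

```python
def tb_lines(phonemes, states=(2, 3, 4), threshold="thres"):
    """Return HHEd TB commands that cluster each state of each phoneme."""
    lines = []
    for state in states:
        for p in phonemes:
            # Monophone, triphone, right-context and left-context forms.
            items = '("{0}","*-{0}+*","{0}+*","*-{0}")'.format(p)
            lines.append('TB %s "TC_%s%d_" {%s.state[%d]}'
                         % (threshold, p, state, items, state))
    return lines


for line in tb_lines(["N", "a"], states=(2,), threshold=1000):
    print(line)
```

Passing the numeric threshold as an argument makes it easy to regenerate the file while tuning the value against the state count in the log.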

Increasing the number of mixtures

  1. Increasing the number of mixtures (example of 1 → 2 mixtures):

    % mkdir s2000/tri-02-00
    % HHEd -H s2000/tri-01-03/macro -H s2000/tri-01-03/hmmdefs -M s2000/tri-02-00 \
    tiedmix2.hed physicalTri

  2. Training:

    Perform training after increasing the number of mixtures.

    % mkdir s2000/tri-02-01
    % HERest -C config.train -I tri.mlf -t 250.0 150.0 1000.0 -T 1 \
    -S train.scp -H s2000/tri-02-00/macro -H s2000/tri-02-00/hmmdefs \
    -M s2000/tri-02-01 physicalTri

    Repeat this at least three times. Then repeat these steps, increasing the number of mixtures sequentially up to around 16. Doubling the number of mixtures at each step is recommended (2 → 4 → 8 → 16).
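
The doubling schedule can be written down as a simple plan of directories. The sketch below assumes the naming pattern tri-<mixtures>-<pass> used above and three training passes per mixture count. The function name is hypothetical, and the code only lists the steps without calling HTK.

```python
def mixture_plan(max_mixtures=16, passes=3, root="s2000"):
    """List (tool, source_dir, target_dir) steps for 1 -> 2 -> 4 ... mixtures."""
    plan = []
    previous = 1
    mixtures = 2
    while mixtures <= max_mixtures:
        src = "%s/tri-%02d-%02d" % (root, previous, passes)
        split = "%s/tri-%02d-00" % (root, mixtures)
        plan.append(("HHEd", src, split))           # increase the mixtures
        for n in range(1, passes + 1):              # retrain after the split
            plan.append(("HERest",
                         "%s/tri-%02d-%02d" % (root, mixtures, n - 1),
                         "%s/tri-%02d-%02d" % (root, mixtures, n)))
        previous, mixtures = mixtures, mixtures * 2
    return plan


for step in mixture_plan():
    print(*step)
```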