## 6.8.2 KaldiDecoder

### 6.8.2.1 Outline

KaldiDecoder is an acoustic model decoder developed for HARK. It is built on libraries from Kaldi [1], a deep learning speech recognition toolkit. Up to HARK version 2.2.0, JuliusMFT (a large vocabulary speech recognition decoder based on Julius) was the core of HARK’s speech recognition. In response to recent trends in speech recognition software design, a Kaldi-based decoder, KaldiDecoder, is provided beginning with HARK version 2.3.0.

Compared with the standard Kaldi-based decoders, KaldiDecoder offers the following features:

1. Connectivity with HARK  modules (same handling as JuliusMFT )

• Compatible with both MSLS and MFCC feature data input via the network (mfcnet)

• Supports the addition of source location info (SrcInfo)

• Supports the recognition of simultaneous speech (mutual exclusion)

HARK connects to KaldiDecoder through SpeechRecognitionClient (or SpeechRecognitionSMNClient), in the same way as it connects to JuliusMFT.

2. Compatibility with JuliusMFT .

• Compatible with JuliusMFT output emulation (in both module-mode and standard output formats)

KaldiDecoder replicates the JuliusMFT output as closely as possible, so that only minimal modification should be needed in a JuliusMFT-based system (such as a demonstration system or a scoring system).

• Implementation of online decoding for nnet1 models

Kaldi ’s standard decoder provides only offline nnet1 model decoding.

4. Functions to be implemented

• Missing feature recognition (except for the already implemented mfcnet-masked data structure recognition)

• nnet2 and nnet3 models

The features and functions described in numbers 1, 2, and 3 above have been implemented without any change to Kaldi .

The following section explains the method of installing and using KaldiDecoder , with a step-by-step procedure for connecting to HARK  in FlowDesigner.

### 6.8.2.2 Start up and setting

Assuming a settings file named kaldi.conf, KaldiDecoder is started as follows.

    > kaldidecoder --config=kaldi.conf (for Ubuntu OS)
    > kaldidecoder.exe --config=kaldi.conf (for Windows OS)


After KaldiDecoder has started in online mode, HARK establishes the socket connection when a network containing SpeechRecognitionClient (or SpeechRecognitionSMNClient) is started with the IP address and port number set correctly. Speech recognition is then available.

The abovementioned kaldi.conf is a text file that describes the settings for KaldiDecoder. The settings file consists basically of argument options that begin with “--”; the same arguments can also be given directly on the command line when starting KaldiDecoder. Text that follows # is treated as a comment. To list the options available in KaldiDecoder, execute the following command:

    > kaldidecoder --help (for Ubuntu OS)
    > kaldidecoder.exe --help (for Windows OS)


The minimum required settings for using nnet1 models in KaldiDecoder are the following seven items:

• --filename-words=<YOUR_PATH>/words.txt

• --filename-align-lexicon=<YOUR_PATH>/align_lexicon.int

• --filename-feature-transform=<YOUR_PATH>/final.feature_transform

• --filename-nnet=<YOUR_PATH>/final.nnet

• --filename-mdl=<YOUR_PATH>/final.mdl

• --filename-class-frame-counts=<YOUR_PATH>/ali_train_pdf.counts

• --filename-fst=<YOUR_PATH>/HCLG.fst

The offline decoding mode requires an additional setting to specify the list of features to be evaluated:

• --filename-features-list=<YOUR_PATH>/features_list.txt

If the above setting is excluded, KaldiDecoder will automatically execute in the online decoding mode (mfcnet input mode).

Modify the online decoding mode mfcnet input port or result output port by changing the following options (default port number values are shown):

• --port-mfcnet=5530

• --port-result=10500
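As an illustration, the minimum settings listed above can be assembled into a settings file programmatically. The following Python sketch only builds the text of such a file. The helper name `render_kaldi_conf` and the directory `/opt/models/demo` are hypothetical, and the sketch assumes, for simplicity, that all model files sit in a single directory, which need not be the case in a real Kaldi setup.

```python
from pathlib import PurePosixPath

# Option name -> file name, taken from the minimum required settings above.
REQUIRED_FILES = {
    "filename-words": "words.txt",
    "filename-align-lexicon": "align_lexicon.int",
    "filename-feature-transform": "final.feature_transform",
    "filename-nnet": "final.nnet",
    "filename-mdl": "final.mdl",
    "filename-class-frame-counts": "ali_train_pdf.counts",
    "filename-fst": "HCLG.fst",
}


def render_kaldi_conf(model_dir, port_mfcnet=5530, port_result=10500):
    """Return the text of a settings file for online decoding.

    Hypothetical helper: assumes every model file lives in `model_dir`.
    """
    base = PurePosixPath(model_dir)
    # Text after '#' is treated as a comment by KaldiDecoder.
    lines = ["# Minimal settings for an nnet1 model (online decoding mode)"]
    for option, name in REQUIRED_FILES.items():
        lines.append(f"--{option}={base / name}")
    lines.append(f"--port-mfcnet={port_mfcnet}")
    lines.append(f"--port-result={port_result}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_kaldi_conf("/opt/models/demo"), end="")
```

Because “--filename-features-list” is not written, a file produced this way would select the online decoding mode.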

1. Basic Configurations (Input Files and Modes Settings)

• --nnet-type=model type number

This is to set the nnet model type in Kaldi . The default setting value is “1”, and it is the only working setting in the current version. Two more model types will be made available in a future version, and you will be able to change values in the setting as follows:

Set the value to “1” when using an nnet1 model created using Karel Vesely’s method. Change it to “2” if you use an nnet2 model created using Daniel Povey’s method. Likewise, set it to “3” when using an nnet3 model created using Daniel Povey’s method.

• --filename-words=word list file name

This setting specifies the word list file. You can set it as a path relative to the current directory or as an absolute path. The file format of the word list file is the same as that of the words.txt file included in the lexicon directory used in Kaldi for training/validation or in the language model used for evaluation. This setting is mandatory for the output of recognition results.

• --filename-phones=phoneme list file name

This is to set the phoneme list file. You can set it as a path relative to the current directory or as an absolute path. The file format of the phoneme list file is the same as the phones.txt file included in the lexicon directory used in Kaldi for training/validation or in the language model used for evaluation. This setting is optional, as it is only necessary when you conduct phoneme alignment.

• --filename-align-lexicon=lexicon file name

This is to set the lexicon file. You can set it as a path relative to the current directory or as an absolute path. The file format of the lexicon file is the same as that of the align_lexicon.int file included in the lexicon directory used in Kaldi for training/validation or in the language model used for evaluation. If you create a lang directory using prepare_lang.sh, the lexicon file will be output to lang/phones/align_lexicon.int. This setting is mandatory for the output of recognition results.

• --filename-feature-transform=FeatureTransform file name

This is to set the FeatureTransform file, which is output to the path of the trained DNN as exp/tri*dnn*/final.feature_transform. This setting is mandatory when you use KaldiDecoder in nnet1 model decoding, as it is separate from the acoustic model.

• --filename-nnet=nnet file name

This is to set the nnet file, which is output to the path of the trained DNN as exp/tri*dnn*/final.nnet (note: this path is a symbolic link). This setting is mandatory when you use KaldiDecoder in nnet1 model decoding, as it is separate from the acoustic model.

• --filename-mdl=mdl file name

This is to set the mdl file, which is output to the path of the trained DNN as exp/tri*dnn*/final.mdl (note: this path is a symbolic link). This setting is mandatory, as the acoustic model is required for the output of recognition results.

• --filename-class-frame-counts=class-frame-counts file name

This is to set the class-frame-counts file, which is output to the path of the trained DNN as exp/tri*dnn*/ali_train_pdf.counts. This setting is mandatory when you use KaldiDecoder in nnet1 model decoding, as it is separate from the acoustic model.

• --filename-fst=FST file name

This is to set the FST file. If you create the graph directory using mkgraph.sh , the FST file will be output to graph*/HCLG.fst . This setting is mandatory, as it is required for the output of recognition results.

• --filename-features-list=features list file name

When using KaldiDecoder in offline decoding mode, you must specify a text list containing the paths to the feature file(s) to be evaluated (note: this is the name of the list, not of the feature file(s)). If this option is not set, KaldiDecoder runs in the online decoding mode.

• --port-mfcnet=port number

This is to set the port number that receives the acoustic features and masks transmitted via network by SpeechRecognitionClient (or SpeechRecognitionSMNClient ). It is similar to the “-adport” input port setting for mfcnet input mode in JuliusMFT . The same default port number as in JuliusMFT , 5530, will be used if none is provided. This option is valid only in the online decoding mode.

• --port-result=port number

This is to set the connection port that transmits the recognition results through the network. It is similar to the “-module” output port setting for module mode in JuliusMFT . The same default port number as in JuliusMFT , 10500, will be used if none is provided.

• --lm-name

This is to set the language model name. When this setting is active, the LMNAME attribute is provided in module mode output. It has no default value.

2. Decoder Tuning Configuration (Weighting and Pruning)

The items described in this section are options inherited from the features implemented in the decoder (classes) in Kaldi .

• --acoustic-scale=acoustic scale value

This is to set the acoustic scale value. The default setting value is “0.5”, which is generally the inverse of the best LM weight obtained in scoring.

• --max-active

This is to set the decoder’s maximum number of active states. Although increasing the value provides more accurate results, it also significantly affects the decoding speed. The default setting value is “2147483647” (maximum value of an int32 type). We empirically recommend setting it between 2000 and 5000 to obtain better performance.

• --min-active

This is to set the decoder’s minimum number of active states. The default setting value is “200”.

• --beam=beam width

This is to set the decoding beam width. Although increasing the value provides more accurate results, it also considerably affects the decoding speed. The default setting value is “16”. For details, refer to the quotation in Table 6.89.

• --beam-delta=beam delta

This is to set the decoding beam delta. This parameter is obscure and relates to a speed-up in the way in which the “max-active” constraint is applied. Increasing the value provides more accurate results. The default setting value is “0.5”. For details, refer to the quotation in Table 6.89.

• --delta=delta

This is the tolerance value used in determinization, which is set to “0.000976562” by default. For details, refer to the quotation in Table 6.89.

• --hash-ratio

This is a ratio value to control the hash behavior in decoding process. The default setting value is “2”.

• --prune-interval=frame count

This is to set the frame interval at which tokens are to be pruned. The default setting value is “25”.

• --splice=splice count

This is to set the DNN input splice count. It is the time context around the current frame. The default setting value is “3”, which means that there are 3 frames both before and after the current frame.
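The effect of the splice count can be illustrated with a short Python sketch. It concatenates each frame with its neighbouring frames, so a splice count of 3 yields a DNN input of 7 frames. The function name `splice_frames` is hypothetical, and the handling of utterance boundaries (repeating the first or last frame) is an assumption of this sketch rather than a documented behavior of KaldiDecoder.

```python
def splice_frames(frames, splice=3):
    """Concatenate each frame with `splice` frames of left and right context.

    `frames` is a list of feature vectors (lists of floats). Context frames
    that would fall outside the utterance are replaced by the nearest
    existing frame (an assumption of this sketch).
    """
    last = len(frames) - 1
    spliced = []
    for t in range(len(frames)):
        window = []
        for offset in range(-splice, splice + 1):
            # Clamp the index to the valid range [0, last].
            idx = min(max(t + offset, 0), last)
            window.extend(frames[idx])
        spliced.append(window)
    return spliced


if __name__ == "__main__":
    utterance = [[float(i)] for i in range(5)]  # five 1-dimensional frames
    for row in splice_frames(utterance, splice=3):
        print(row)
```

With D-dimensional features and a splice count of s, each spliced vector has (2s + 1) × D elements.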

The text in Table 6.89 is quoted from the Kaldi website (http://www.danielpovey.com/kaldi-docs/decoders.html).

Table 6.89: Quotation from “FasterDecoder: a more optimized decoder”
  The code in FasterDecoder as it relates to cutoffs is a little more complicated than just having the one pruning step. The basic observation is this: it's pointless to create a very large number of tokens if you are only going to ignore most of them later. So the situation in ProcessEmitting is: we have "weight_cutoff" but wouldn't it be nice if we knew what the value of "weight_cutoff" on the next frame was going to be? Call this "next_weight_cutoff". Then, whenever we process arcs that have the current frame's acoustic likelihoods, we could just avoid creating the token if the likelihood is worse than "next_weight_cutoff". In order to know the next weight cutoff we have to know two things. We have to know the best token's weight on the next frame, and we have to know the effective beam width on the next frame. The effective beam width may differ from "beam" if the "max_active" constraint is limiting, and we use the heuristic that the effective beam width does not change very much from frame to frame. We attempt to estimate the best token's weight on the next frame by propagating the currently best token (later on, if we find even better tokens on the next frame we will update this estimate). We get a rough upper bound on the effective beam width on the next frame by using the variable "adaptive_beam". This is always set to the smaller of "beam" (the specified maximum beam width), or the effective beam width as determined by max_active, plus beam_delta (default value: 0.5). When we say it is a "rough upper bound" we mean that it will usually be greater than or equal to the effective beam width on the next frame. The pruning value we use when creating new tokens equals our current estimate of the next frame's best token, plus "adaptive_beam". With finite "beam_delta", it is possible for the pruning to be stricter than dictated by the "beam" and "max_active" parameters alone, although at the value 0.5 we do not believe this happens very often. 

Povey, Daniel:
Citing Sources: [http://www.danielpovey.com/kaldi-docs/decoders.html#decoders_faster]: para. 3: [December 6, 2016]
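The interplay of “beam”, “max-active” and “beam-delta” described in the quotation can be sketched as follows. This Python fragment is a simplified illustration of the idea for a single frame, not Kaldi's implementation: the function name `prune_tokens` is hypothetical, costs are treated as "lower is better", and the “min-active” constraint is ignored.

```python
def prune_tokens(costs, beam=16.0, max_active=2147483647, beam_delta=0.5):
    """Prune the token costs of one frame.

    Returns (surviving costs, adaptive_beam). Simplified sketch: lower cost
    is better and the min-active constraint is not modelled.
    """
    if not costs:
        return [], beam
    ordered = sorted(costs)
    best = ordered[0]
    # Plain beam pruning: drop everything worse than best + beam.
    within_beam = [c for c in ordered if c <= best + beam]
    if len(within_beam) <= max_active:
        # The beam alone is the limiting constraint.
        return within_beam, beam
    # max_active is limiting: keep only the best max_active tokens.
    survivors = within_beam[:max(1, max_active)]
    effective_beam = survivors[-1] - best
    # "The smaller of beam, or the effective beam width plus beam_delta".
    adaptive_beam = min(beam, effective_beam + beam_delta)
    return survivors, adaptive_beam


if __name__ == "__main__":
    print(prune_tokens([0.0, 1.0, 2.0, 30.0]))
    print(prune_tokens([0.0, 1.0, 2.0, 30.0], max_active=2))
```

In the second call only two tokens survive, so the adaptive beam shrinks to the effective beam width (1.0) plus beam_delta (0.5).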


3. Lattice Configuration

The items described in this section are options that have been inherited from the features implemented in the decoder (classes) in Kaldi .

• --determinize-lattice

This is to determinize the lattices. It keeps only the best probability density function (pdf) sequence for each word sequence.

• --lattice-beam=beam width

This is to set the beam width in lattice generation. Increasing the value gives deeper lattices, which also significantly affects the decoding speed. The default setting value is “10”.

• --max-mem=maximum memory allocation size

This is to set the maximum approximate size of memory allocated when determinizing the lattices. However, the actual usage may be higher than the specified value because the allocation may occur more than once.

• --minimize

When this option is given, the lattices are minimized after determinization.

• --phone-determinize

When this option is given, an initial pass of determinization is performed on both phonemes and words. See also the entry for “--word-determinize”.

• --word-determinize

When this option is given, a second pass of determinization is performed on words only. See also the entry for “--phone-determinize”.

4. Others

• --config=configuration file name

This setting specifies the config file, which can be specified repeatedly.

• --enable-debug

This option enables the debugging output. The default setting is “disabled”.

• --help

This is to display the help menu. When this option is given, all other options are ignored.

• --print-args

When this option is enabled, the command line arguments are sent to the standard output. The default setting is “enabled”. Set it as “--print-args=false” to disable it.

• --verbose=log level

This is to set the detail level of the log information. Increasing the value gives more detailed log output. The default setting value is “0”.

The functionality called “module mode” in the original Julius or JuliusMFT is also available in KaldiDecoder . Selecting the online decoding mode automatically enables it. In addition, the standard output is not deactivated as in Julius or JuliusMFT ; both standard and socket (network) outputs can be used in KaldiDecoder ’s online decoding mode.

### 6.8.2.3 Detailed description

#### 6.8.2.3.1 mfcnet communication specification

To use mfcnet as the acoustic input source, the argument “--filename-features-list” must not be given when starting KaldiDecoder, as mentioned above. In this case, KaldiDecoder acts as a TCP/IP server, starting up in the listening state and waiting for input. The HARK modules SpeechRecognitionClient and SpeechRecognitionSMNClient work as clients that transmit acoustic features and the Missing Feature Mask to KaldiDecoder. The client connects to KaldiDecoder for every utterance and closes the connection immediately after the transmission is complete. The transmitted data must be little endian (note that this is not network byte order). Concretely, communication for one utterance proceeds as follows.

1. Socket connection

The client opens the socket and connects to the mfcnet communication port in KaldiDecoder .

2. Communication initialization (data transmitted once at the beginning)

The client transmits information on the sound source that is about to be sent, as shown in Table 6.90, immediately after the socket connection. The sound source information is expressed in a SourceInfo structure (Table 6.91) and contains a sound source ID, the sound source direction, and the transmission start time. The time is given as a timeval structure defined in <sys/time.h> and is the time elapsed since the starting time point (January 1, 1970 00:00:00) in the system time zone. Hereafter, “time” refers to the period elapsed from this starting point.

3. Data transmission (data transmitted at every frame)

Acoustic features and Missing Feature Mask are transmitted. Features of one utterance are transmitted as frames, shown in Table 6.92, repeatedly until the end of the speech section. It is assumed inside the KaldiDecoder that the dimension number of feature vectors and mask vectors are the same.

4. Connection end (data transmitted once at the end)

After transmitting features for one sound source, data (Table 6.93) that indicate completion is transmitted. KaldiDecoder will return to the listening state to receive the next sound source data until either the data indicating completion is received or its socket connection is severed. It is therefore possible to resume and continue data reception in an environment with a relatively unstable connection.

5. Socket disconnection

After the ending process, the sockets are closed. If the sockets are closed without the ending process, KaldiDecoder performs exception handling, so the output of the recognition results may be delayed. Likewise, any data transmitted after the ending process is ignored, regardless of whether the sockets are open or closed.

Table 6.90: Data to be transmitted only once at the beginning (acoustic source information)

| Size [byte] | Type | Data to be transmitted |
|---|---|---|
| 4 | | 28 (= sizeof(SourceInfo)) |
| 28 | SourceInfo | Sound source information of features that are going to be transmitted |

Table 6.91: SourceInfo structure

| Member variable name | Type | Description |
|---|---|---|
| source_id | | Sound source ID |
| azimuth | | Horizontal direction [deg] |
| elevation | | Vertical direction [deg] |
| time | timeval | Time (standardized to 64 bit processor and 16 bytes long) |

Table 6.92: Data to be transmitted for every frame (features, masks data and dimensions information)

| Size [byte] | Type | Data to be transmitted |
|---|---|---|
| 4 | | N1 = (dimension number of feature vector) $\times$ sizeof(float) |
| N1 | float[N1] | Feature vector (float array) |
| 4 | | N2 = (dimension number of mask vector) $\times$ sizeof(float) |
| N2 | float[N2] | Mask vector (float array) |

Table 6.93: Data to be transmitted only once at the end (data to indicate completion)

| Size [byte] | Type | Data to be transmitted |
|---|---|---|
| 4 | | 0 |
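The message layout of Tables 6.90 to 6.93 can be sketched in Python with the standard `struct` module. This is a minimal sketch under stated assumptions, not the HARK client implementation: the size fields and source_id are assumed to be 4-byte integers, azimuth and elevation 4-byte floats, and the 16-byte timeval two 8-byte integers (seconds and microseconds), which together give the 28 bytes stated for SourceInfo. All function names are hypothetical.

```python
import struct
import time

# Assumed SourceInfo layout (28 bytes, little endian, no padding):
# int32 source_id, float32 azimuth, float32 elevation,
# timeval as int64 seconds + int64 microseconds.
SOURCE_INFO = struct.Struct("<iffqq")
assert SOURCE_INFO.size == 28

# Table 6.93: a single 4-byte zero marks the end of the utterance.
END_OF_UTTERANCE = struct.pack("<i", 0)


def pack_header(source_id, azimuth, elevation, timestamp=None):
    """Table 6.90: size field (28) followed by the SourceInfo structure."""
    ts = time.time() if timestamp is None else timestamp
    sec = int(ts)
    usec = int((ts - sec) * 1_000_000)
    body = SOURCE_INFO.pack(source_id, azimuth, elevation, sec, usec)
    return struct.pack("<i", len(body)) + body


def pack_frame(features, mask):
    """Table 6.92: byte count and float array for features, then for the mask."""
    if len(features) != len(mask):
        raise ValueError("feature and mask vectors must have the same dimension")
    payload = b""
    for vector in (features, mask):
        data = struct.pack(f"<{len(vector)}f", *vector)
        payload += struct.pack("<i", len(data)) + data
    return payload


def build_utterance(source_id, azimuth, elevation, frames, timestamp=None):
    """Return the byte stream for one utterance; `frames` holds (features, mask) pairs."""
    chunks = [pack_header(source_id, azimuth, elevation, timestamp)]
    chunks += [pack_frame(features, mask) for features, mask in frames]
    chunks.append(END_OF_UTTERANCE)
    return b"".join(chunks)
```

The resulting bytes could then be written to a TCP socket connected to the port given by “--port-mfcnet” (5530 by default), one connection per utterance.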

#### 6.8.2.3.2 Module mode communication specification

In online decoding mode, KaldiDecoder operates similarly to the module mode in Julius. In the module mode, KaldiDecoder works as a TCP/IP communication server and provides recognition results to clients such as jcontrol. The character encoding for Japanese text depends on that of the language model used. As in Julius, an XML-like format is used for data representation, and a “.” (period) is transmitted to indicate the end of every message. As an additional feature, KaldiDecoder can also output results in the standard XML format without the “.” (period) mark. The meaning of the most common tags transmitted by KaldiDecoder is as follows.

• INPUT tag

This tag represents information related to inputs and has STATUS and TIME as attributes. The values for STATUS are LISTEN, STARTREC or ENDREC. LISTEN indicates that KaldiDecoder is ready to receive speech. STARTREC indicates that the reception of features has started. ENDREC indicates that the last feature of the sound source being received has arrived. TIME indicates the time at that instant.

• SOURCEINFO tag

This tag represents information related to sound sources and is an original tag of KaldiDecoder . It has ID, AZIMUTH, ELEVATION, SEC and USEC as attributes. The SOURCEINFO tag is transmitted when starting the recognition process. Its ID indicates a sound source ID given by HARK  (not the speaker ID but numbers uniformly given to each sound source). AZIMUTH and ELEVATION indicate horizontal and vertical direction (degrees), respectively, seen from the microphone array coordinate system for the first frame of the sound source. SEC and USEC indicate the time of the first frame of the sound source. SEC indicates seconds and USEC indicates the microseconds fraction.

• RECOGOUT tag

This tag represents recognition results, and its sub-element is either a gradual output or the final output. For gradual output, it has the PHYPO tag as a sub-element, and for the final output, it has the SHYPO tag as a sub-element. In the case of the final output, only SHYPO tags for the number of candidates specified in the parameters are output.

• PHYPO tag

This tag represents gradual candidates and it has vectors of WHYPO tags for candidate words as sub-elements. It has PASS, SCORE, FRAME and TIME as attributes. PASS indicates the number of decoding passes and is always 1. SCORE indicates the accumulated score of this candidate. FRAME indicates the number of frames that have been processed in order to output this candidate. TIME indicates time (sec) at that instant.

• SHYPO tag

This tag represents a sentence hypothesis and it has vectors of WHYPO tags for candidate words as sub-elements. It has PASS, RANK, SCORE, AMSCORE and LMSCORE as attributes. PASS indicates the number of decoding passes and, when available, is always set to 1. RANK indicates the rank order of a hypothesis. SCORE indicates the logarithmic likelihood of this hypothesis, AMSCORE indicates a logarithmic acoustic likelihood and LMSCORE indicates a logarithmic language probability.

• WHYPO tag

This tag represents word hypotheses and has WORD, CLASSID, PHONE and CM as attributes. WORD indicates the notation, CLASSID indicates the word that is the key in a statistical language model, PHONE indicates the phoneme sequence and CM indicates the confidence for the word. Word confidence is included only to maintain compatibility with the Julius-based decoder output, and its value, fixed at 1.0, is irrelevant to the actual performance.

• SYSINFO tag

This tag represents the status of the system and it has PROCESS as an attribute. When PROCESS is EXIT, it indicates normal termination. When PROCESS is ERREXIT, it indicates abnormal termination. When PROCESS is ACTIVE, it indicates that speech recognition can be performed. When PROCESS is SLEEP, it indicates that speech recognition is halted.

Whether or not these tags and attributes are output depends on the arguments set when starting KaldiDecoder. The SOURCEINFO tag is always output. The others behave as in the original Julius, so refer to the argument help of the original Julius for details.

Compared with the original Julius, KaldiDecoder differs in the following two respects.

• Addition of items related to the SOURCEINFO tag for information on source localization as described above, and also the embedding of sound source ID (SOURCEID) to the following tags: STARTRECOG, ENDRECOG, INPUTPARAM, GMM, RECOGOUT, REJECTED, RECOGFAIL, GRAPHOUT, SOURCEINFO

• Changes were made to the format of the module mode in order to reduce the delay caused by mutual exclusion when processing simultaneous utterances. Concretely, mutual exclusion used to be performed utterance-wise, but now the output is divided so that the exclusion control can be performed tag-wise. Also, modifications were made to the output of the following one-time tags.

<< Tags separated by start-tag / end-tag >>

• <RECOGOUT> ... </RECOGOUT>

• <GRAPHOUT> ... </GRAPHOUT>

• <GRAMINFO> ... </GRAMINFO>

• <RECOGPROCESS> ... </RECOGPROCESS>

<< One-line tags that are internally split and output multiple times >>

• <RECOGFAIL ... />

• <REJECTED ... />

• <SR ... />
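A client of the module mode output has to split the received text into messages and read the tag attributes. The following Python sketch shows one way to do this. It assumes that the terminating “.” arrives on a line of its own, as in Julius, and the sample text in `SAMPLE` is hypothetical: it only uses the tag and attribute names described above, and is not captured KaldiDecoder output.

```python
import re

ATTRIBUTE = re.compile(r'(\w+)="([^"]*)"')

# Hypothetical sample built from the tag and attribute names described above.
SAMPLE = (
    '<SOURCEINFO ID="0" AZIMUTH="0.000000" ELEVATION="16.700001" '
    'SEC="1466144473" USEC="169637"/>\n'
    ".\n"
    '<INPUT STATUS="ENDREC" TIME="1466144475"/>\n'
    ".\n"
)


def split_messages(stream_text):
    """Split received text into messages.

    Assumes each message ends with a line holding only '.'. Text after the
    last terminator is an incomplete message and is not returned.
    """
    messages, current = [], []
    for line in stream_text.splitlines():
        if line.strip() == ".":
            if current:
                messages.append("\n".join(current))
            current = []
        else:
            current.append(line)
    return messages


def tag_attributes(message, tag):
    """Return the attributes of the first `tag` element as a dict, or None."""
    match = re.search(rf"<{re.escape(tag)}\b([^>]*)>", message)
    if match is None:
        return None
    return dict(ATTRIBUTE.findall(match.group(1)))


if __name__ == "__main__":
    for message in split_messages(SAMPLE):
        print(tag_attributes(message, "SOURCEINFO"))
```

A regular expression is used instead of an XML parser because the stream is only XML-like; a stricter client could parse the standard XML output format instead.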

#### 6.8.2.3.3 Example output of KaldiDecoder

1. Example output of standard output mode

    source_id = 0, azimuth = 0.000000, elevation = 16.700001, sec = 1466144473, usec = 169637
    ### Recognition: 2nd pass (RL heuristic best-first)
    STAT: 00
    sentence1: ORDER PLEASE
    wseq1: ORDER PLEASE
    phseq1: ao ao ao r r r r r d d d d er er er er er er er p p p l l l iy iy iy iy iy iy iy z z z
    cmscore1: 1.000 1.000
    score1: 260.002472 ( AM: 274.768372, LM: -14.765888 )

2. Example output of socket output mode (module mode)

  . . . . 
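The standard output shown in the first example can be post-processed with a few regular expressions. The Python sketch below extracts the recognized sentence and the scores. The function name `parse_result` is hypothetical, and the patterns are derived only from the single example above, so other output variants may need adjustments.

```python
import re

# Patterns derived from the standard output example; '^' anchoring keeps
# "score1:" from matching inside "cmscore1:".
SENTENCE = re.compile(r"^[ \t]*sentence1:[ \t]*(.*)$", re.MULTILINE)
SCORE = re.compile(
    r"^[ \t]*score1:\s*(\S+)\s*\(\s*AM:\s*([^,\s]+),\s*LM:\s*(\S+)\s*\)",
    re.MULTILINE,
)

SAMPLE = """\
source_id = 0, azimuth = 0.000000, elevation = 16.700001, sec = 1466144473, usec = 169637
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00
sentence1: ORDER PLEASE
wseq1: ORDER PLEASE
cmscore1: 1.000 1.000
score1: 260.002472 ( AM: 274.768372, LM: -14.765888 )
"""


def parse_result(block):
    """Return the sentence and scores found in one block of standard output."""
    parsed = {}
    sentence = SENTENCE.search(block)
    if sentence:
        parsed["sentence"] = sentence.group(1).strip()
    score = SCORE.search(block)
    if score:
        total, am, lm = (float(value) for value in score.groups())
        parsed.update(score=total, am=am, lm=lm)
    return parsed


if __name__ == "__main__":
    print(parse_result(SAMPLE))
```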

### 6.8.2.4 Notice

• Restraint of the PHONE tag

Although JuliusMFT supports PHONE output for each WORD, this feature is not implemented in KaldiDecoder because, owing to Kaldi’s structure, it would degrade performance. Therefore, in the socket output mode, the phoneme output for each WHYPO tag is not supported. In the standard output mode, only output with no pipes ("|") between words is supported.

• Known issue in the Windows version

There were confirmed issues of corrupted characters when outputting recognition results in a multi-byte encoding to the standard output on Windows. The default character set for Ubuntu terminal’s standard output is UTF-8, so the issue occurs when using the same language model on Windows. In other words, a mismatch between the character sets used in the operating system console and language model causes this problem. To avoid this, start KaldiDecoder with the output redirection "> filename", and open the output recognition text file with an appropriate text editor. The reason for this restriction is that, in Julius , it was possible to convert the character set at the output time using the “iconv” library or the internal implementation “libjcode” as needed; however, Kaldi does not have a character set conversion feature, and it has not been implemented in KaldiDecoder .
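As an alternative to a text editor, the redirected output file can be read by a small script that decodes it with the character set of the language model. The Python sketch below assumes UTF-8 by default; the function name `read_results` and the file name `result.txt` are hypothetical.

```python
def read_results(path, encoding="utf-8"):
    """Read a redirected output file and return its lines.

    `encoding` must match the character set of the language model; bytes that
    cannot be decoded are replaced rather than raising an error.
    """
    with open(path, "r", encoding=encoding, errors="replace") as handle:
        return handle.read().splitlines()


if __name__ == "__main__":
    import sys

    # Usage: python read_results.py result.txt [encoding]
    file_name = sys.argv[1] if len(sys.argv) > 1 else "result.txt"
    charset = sys.argv[2] if len(sys.argv) > 2 else "utf-8"
    for line in read_results(file_name, charset):
        print(line)
```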

### 6.8.2.5 Installation method

• Using apt-get

If the apt-get setting is ready, installation can be done as follows.

    > apt-get install kaldidecoder-hark

• Installing from source

Since KaldiDecoder uses libraries from Kaldi, Kaldi must be built in advance. However, the Kaldi libraries are not packaged in Ubuntu, so Kaldi has to be compiled from source by executing the following commands, which takes a long time.

    > apt-get install libatlas-base-dev git automake autoconf libtool
    > apt-get install portaudio19-dev speex libspeex-dev libpoco-dev
    > mkdir
    > cd
    > git clone https://github.com/kaldi-asr/kaldi.git
    > git checkout
    > cd kaldi/tools
    > make
    > cd extras
    > ./install_irstlm.sh
    > cd ../../src
    > ./configure
    > make depend -j
    > make -j
    > cd ../
    > wget http://archive.hark.jp/harkrepos/dists//non-free/source/kaldidecoder-hark_.tar.xz
    > tar -Jxvf kaldidecoder-hark_.tar.xz
    > cd kaldidecoder
    > ./configure
    > make
    > sudo make install

    ** : Your work directory -- e.g.) kaldi_build
    ** : Git commit id of the Kaldi -- e.g.) 7df7e0008c
         To find out which commit ID KaldiDecoder was developed against, read the KaldiDecoder README.
    ** : How many cores do you have -- e.g.) 4
    ** : Ubuntu distribution -- e.g.) xenial
    ** : HARK version -- e.g.) 2.3.0

Since it is installed in /usr/local/bin by default, the “--prefix” must be set as follows in order to install in /usr/bin like the package version.

    > cd kaldidecoder
    > ./configure --prefix=/usr
    > make
    > sudo make install

If the output of “kaldidecoder --help” is as shown below, the KaldiDecoder installation was successful.

    > kaldidecoder --help
    usage: If you requests need use ONLINE decoding with DNN. (ONLINE mode is default)
    e.g.) ./kaldidecoder [--port-mfcnet=5530] [--port-result=10500]
    ...(The middle part is omitted)...
    usage: If you requests need use OFFLINE decoding with DNN. (OFFLINE mode is must need --filelist option)
    e.g.) ./kaldidecoder --filename-features-list=features_list.txt
    ...(The rest is omitted)...

With the above, installation is complete.

• For the installation method on Windows OS, please refer to Section 3.2.

Footnotes

1. http://kaldi-asr.org/