6.2 Creating a language model

Problem

To describe a method for constructing language models for speech recognition.

Solution

There are two types of language model: the large statistical language model and the network grammar model. Both can be used in Julius. (In earlier releases the two were separate binaries: the former was called Julius and the latter Julian. From version 4, both are integrated into Julius.) This section focuses on the creation of statistical language models. Refer to the Julius website for details on creating network grammar models.

Creation of a dictionary for Julius

The dictionary is created in HTK format from a morphological analysis of the correct text. chasen is used for the morphological analysis. After installing chasen, create .chasenrc in the home directory. In it, specify the directory that contains grammar.cha as the "grammar file", and define the output format as follows:

(grammar file /usr/local/chasen-2.02/dic)
(output format "%m+%y+%h/%t/%f\n")

Prepare the correct text file and name it "seikai.txt". Insert <s> at the beginning and </s> at the end of each sentence, since the text will also be used for language model creation.

Example of seikai.txt (words do not need to be separated)
<s> Twisted all reality towards themselves. </s>
<s> Gather information in New York for about a week. </s>
:
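
The marker insertion can be scripted. The following is a minimal sketch, assuming the raw sentences are available one per line. The function name add_sentence_markers is hypothetical and is not part of Julius or chasen.

```python
def add_sentence_markers(sentences):
    """Wrap each non-empty sentence in <s> ... </s>, one sentence per entry."""
    marked = []
    for sentence in sentences:
        text = sentence.strip()
        if not text:
            continue  # skip blank lines instead of emitting empty sentences
        marked.append("<s> " + text + " </s>")
    return marked


if __name__ == "__main__":
    raw = ["Twisted all reality towards themselves.", ""]
    for line in add_sentence_markers(raw):
        print(line)
```

Writing the returned lines to seikai.txt gives a file in the form shown in the example above.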

% chasen seikai.txt > seikai.keitaiso

Check the contents of seikai.keitaiso and revise any part of the morphological analysis that is incorrect. In addition, the particles written "he" and "ha" are read differently from their notation, so change their readings to "e" and "wa", respectively. It may also be necessary to normalize morphemes and remove other unwanted parts. These steps are omitted here.
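
The reading correction for the two particles could be automated along the following lines. This is only a sketch under several assumptions: each line has the form surface+reading+tags produced by the output format above, the readings are written in katakana, and the caller supplies the part-of-speech ids that denote particles. The function name fix_particle_reading and the parameter particle_pos_ids are hypothetical, and the real ids depend on the chasen dictionary in use.

```python
# (surface, reading) pairs whose reading differs from the notation.
# Hiragana "ha" (U+306F) read as katakana "ha" (U+30CF) becomes "wa" (U+30EF).
# Hiragana "he" (U+3078) read as katakana "he" (U+30D8) becomes "e" (U+30A8).
PARTICLE_READING_FIXES = {
    ("\u306f", "\u30cf"): "\u30ef",
    ("\u3078", "\u30d8"): "\u30a8",
}


def fix_particle_reading(line, particle_pos_ids):
    """Return the line with the particle reading corrected, if applicable.

    particle_pos_ids is an assumption: a set of part-of-speech id strings
    that mark particles in the dictionary being used.
    """
    text = line.rstrip("\n")
    parts = text.split("+")
    if len(parts) != 3:
        return text  # not in surface+reading+tags form, leave untouched
    surface, reading, tags = parts
    pos_id = tags.split("/")[0]
    new_reading = PARTICLE_READING_FIXES.get((surface, reading))
    if new_reading is not None and pos_id in particle_pos_ids:
        reading = new_reading
    return "+".join((surface, reading, tags))
```

Restricting the fix to particle ids keeps ordinary words that contain the same characters unchanged. The result should still be checked by eye.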

Example of seikai.keitaiso
<s>+<s>+17/0/0

+
+75/0/0
</s>+</s>+17/0/0
EOS++
<s>+<s>+17/0/0

Next, execute the command below:

% w2s.pl seikai.keitaiso > seikai-k.txt

This joins the morphemes of each sentence on one line, separated by half-width spaces. The <s>+<s>+17/0/0 entry above is replaced by <s>, and </s>+</s>+17/0/0 is replaced by </s>. In addition, EOS is replaced by a newline. The result is a morphologically analyzed correct text, which is used to create the language model described below.



+
+75/0/0 </s>
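
To make the conversion concrete, the following sketch reproduces the behavior described above. It is not the actual w2s.pl: the function name keitaiso_to_text is hypothetical, and it assumes one morpheme per input line, with a line whose first field is EOS marking the end of a sentence.

```python
def keitaiso_to_text(lines):
    """Join morpheme lines into one space-separated line per sentence."""
    sentences = []
    current = []
    for raw in lines:
        token = raw.strip()
        if not token:
            continue  # ignore empty lines
        first_field = token.split("+")[0]
        if first_field == "EOS":
            # EOS closes the sentence, i.e. it becomes a newline
            if current:
                sentences.append(" ".join(current))
            current = []
        elif first_field == "<s>":
            current.append("<s>")   # <s>+<s>+17/0/0 becomes <s>
        elif first_field == "</s>":
            current.append("</s>")  # </s>+</s>+17/0/0 becomes </s>
        else:
            current.append(token)   # keep surface+reading+tags as one word
    if current:
        sentences.append(" ".join(current))
    return sentences
```

A morpheme whose surface form is literally EOS would be mistaken for a sentence end in this sketch.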

Lastly, use seikai.keitaiso to create the dictionary.

% dic.pl seikai.keitaiso kana2phone_rule.ipa |
sort |
uniq > HTKDIC
% gzip HTKDIC

HTKDIC.gz is the dictionary that will be used by Julius. Specify it with the "-v" option.

Terms: morphological analysis, chasen, HTK format, and w2s.pl, dic.pl and kana2phone_rule.ipa (included in vocab2htkdic)

Creation of language model for Julius

For details on creating a language model, see "Speech recognition system" (Ohmsha). However, the CMU-Cambridge Toolkit alone is not sufficient to create a 2-gram and a reversed 3-gram like those in the jconf samples; the CMU-Cambridge Toolkit compatible "palmkit" should be used instead. Note that recent versions of Julius no longer require the reversed 3-gram, so there may be cases where palmkit is not needed.

Below is an example of how to use palmkit. Prepare the correct text and name the file learn.txt, the name used in the commands below. This file must already be morphologically analyzed. (In other words, punctuation marks are expressed as words, and the words are separated by spaces.) <s> and </s> are inserted at the beginning and end of each sentence so that transitions spanning </s> and <s>, that is, across sentence boundaries, can be removed.

For this purpose, <s> and </s> must be listed in the learn.ccs file.
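
The effect of keeping sentences separate can be illustrated by counting 2-grams sentence by sentence. The sketch below is not how palmkit handles learn.ccs internally; it only shows, assuming one sentence per line of learn.txt, that no count then spans </s> and <s>. The function name count_sentence_bigrams is hypothetical.

```python
from collections import Counter


def count_sentence_bigrams(lines):
    """Count word pairs within each sentence, never across sentences."""
    counts = Counter()
    for line in lines:
        words = line.split()
        if not words:
            continue  # skip empty lines
        if words[0] != "<s>" or words[-1] != "</s>":
            raise ValueError("sentence lacks <s>/</s> markers: " + line.strip())
        # adjacent pairs inside this sentence only
        counts.update(zip(words, words[1:]))
    return counts
```

The check on the first and last word also catches lines where a marker was forgotten before the toolkit commands are run.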

% text2wfreq < learn.txt > learn.wfreq
% wfreq2vocab < learn.wfreq > learn.vocab
% text2idngram -n 2 -vocab learn.vocab < learn.txt > learn.id2gram
% text2idngram -vocab learn.vocab < learn.txt > learn.id3gram
% reverseidngram learn.id3gram learn.revid3gram
% idngram2lm -idngram learn.revid3gram -vocab learn.vocab -context learn.ccs \
          -arpa learn.rev3gram.arpa
% idngram2lm -n 2 -idngram learn.id2gram -vocab learn.vocab -context learn.ccs \
          -arpa learn.2gram.arpa

This creates the 2-gram and the reversed 3-gram. Merging the two with mkbingram, a tool included with Julius, produces a language model for Julius, as in the example below.

% mkbingram learn.2gram.arpa learn.rev3gram.arpa julius.bingram
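
As an illustration of what "reversed 3-gram" means, the sketch below counts 3-grams over each sentence read from the last word to the first. This is conceptual only and does not reproduce the file formats or the internals of reverseidngram; the function name count_reversed_trigrams is hypothetical.

```python
from collections import Counter


def count_reversed_trigrams(lines):
    """Count 3-grams over each sentence with its word order reversed."""
    counts = Counter()
    for line in lines:
        words = line.split()[::-1]  # read the sentence backwards
        counts.update(zip(words, words[1:], words[2:]))
    return counts
```

For the sentence "<s> a b </s>", the reversed order is "</s> b a <s>", so the counted 3-grams are (</s>, b, a) and (b, a, <s>).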