Problem
To describe a method of constructing an acoustic model for speech recognition. Such a model can improve speech recognition performance after HARK has been introduced into a robot.
Solution
An acoustic model is a statistical representation of the relationship between phonemes and acoustic features, and it can have a substantial impact on speech recognition performance; a Hidden Markov Model (HMM) is frequently used. When the microphone layout on a robot, or the algorithms and parameters for separation and speech enhancement, are changed, the properties of the acoustic features input to speech recognition may also change. Speech recognition may therefore be improved by adapting an acoustic model to the new conditions, or by creating a new acoustic model that matches them. Here we describe three methods of acoustic model construction:
Multi-condition training
MLLR/MAP adaptation
Additional training
In each case, the HMM is handled with the Hidden Markov Model Toolkit (HTK), which is used to create acoustic models for Julius, the speech recognition engine used in HARK.
Although acoustic models have various parameters, we describe here the training of a triphone HMM with three states and 16 mixtures. For more information about each parameter, consult textbooks such as the “HTK Book” and the “IT Text speech recognition system”. The fundamental flow for creating a typical triphone-based acoustic model is shown below.
Extraction of acoustic features
Training of a monophone model
Training of a non-context-dependent triphone model
State clustering
Training of a context-dependent triphone model
Acoustic feature extraction
The mel frequency cepstrum coefficient (MFCC) is often used for acoustic features. Although MFCC can be used, the mel scale logarithmic spectral coefficient (MSLS) is recommended for HARK. MSLS can be created easily from a wav file on a HARK network, and MFCC can also be created on a HARK network in the same way. MFCC can also be extracted with the HTK tool HCopy, which offers more parameters for MFCC extraction than HARK does.
% HCopy -T 1 -C config.mfcc -S scriptfile.scp
(Example) scriptfile.scp:
nf001001.dt.wav nf001001.mfc
nf001002.dt.wav nf001002.mfc
...
Sample of config.mfcc
-----
# HTK Configuration Parameters for Generating MFCC_D_E_N from
# headerless SPEECH Corpus.
# Copyright 1996 Kazuya TAKEDA, takeda@nuee.nagoya-u.ac.jp
# IPA Japanese Dictation Software (1997)
SOURCEFORMAT=NOHEAD   # ASJ Corpus has no header part
SOURCEKIND=WAVEFORM
SOURCERATE=625        # source sampling frequency is 16 [kHz]
TARGETKIND=MFCC_E_D_Z
TARGETRATE=100000.0   # frame interval is 10 [msec]
SAVECOMPRESSED=F      # set T if you want to save disk storage
SAVEWITHCRC=F
WINDOWSIZE=250000.0   # window length is 25 [msec]
USEHAMMING=T          # use HAMMING window
PREEMCOEF=0.97        # apply highpass filtering
NUMCHANS=24           # number of filterbanks for MFCC is 24
NUMCEPS=12            # number of parameters for MFCC presentation
ZMEANSOURCE=T
# Rather local parameters
ENORMALISE=F
ESCALE=1.0
TRACE=0
RAWENERGY=F
# CAUTION!! Do not use the following option for NIST-encoded data.
BYTEORDER=SUN
-----
In any event, create an acoustic model with HTK after feature extraction.
Data revision: Generally, even when using a distributed corpus, it is difficult to completely remove transcription inconsistencies and transcription errors. Although these are hard to notice beforehand, they should be corrected as soon as they are found, since such errors can degrade performance.
Creation of words.mlf: Create a word-level Master Label File, words.mlf, in which each (virtual) label file name is associated with the utterance it contains, written one word per line. The first line of words.mlf must be #!MLF!#. After the label file name, enclosed in double quotation marks, the utterance contained in that file is divided into words, each written on its own line. In addition, a single-byte period “.” is placed on the last line of each entry.
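As an illustration, an entry of words.mlf for the utterance used in the examples below might look like this (the label file path and word segmentation here are assumptions):

```
#!MLF!#
"/hoge/mfcc/can1001/a/a01.lab"
あらゆる
現実
を
.
```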
-exec /phonem/rom2mlf ; > words.mlf
Creation of the word dictionary: A word dictionary is created that maps words to phoneme sequences. Generally, a dictionary with registered word classes is used. For a small dictionary, it is sufficient to list the words and their corresponding phoneme sequences.
-exec /phonem/rom2dic ; | sort | uniq > dic
-exec /phonem/rom2dic2 ; | sort | uniq > dic
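For illustration, dictionary entries for the utterance あらゆる現実を used in the examples below might look like the following (the word spellings are assumptions; the phoneme sequences follow the phones1.mlf example):

```
あらゆる  a r a y u r u
現実      g e N j i ts u
を        o
```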
Creation of the phoneme MLF (phones1.mlf): The phoneme MLF is created from the dictionary and the word MLF. Concretely, use HLEd. The rules are described in phones1.led; the rule that allows sp (short pause) is described in the HTKBook.
% HLEd -d dic -i phones1.mlf phones1.led words.mlf
The format of the phoneme MLF is almost the same as that of the word MLF, except that the unit of each line is a phoneme rather than a word. An example of phones1.mlf is shown below.
------------------------------
#!MLF!#
"/hoge/mfcc/can1001/a/a01.lab"
silB
a
r
a
y
u
r
u
g
e
N
j
i
ts
u
o
sp
...
------------------------------
Preparation of the feature-file list train.scp
Basically, create a file that lists the feature file names, one full path per line. However, since feature files may contain abnormal values, it is preferable to check their contents with HList and include only normal files.
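As a rough sketch of this screening, the following builds a train.scp while skipping zero-byte feature files (a crude stand-in for a full HList check; the directory and file names here are illustrative, and a real check should also inspect the values with HList):

```shell
# Build train.scp from a feature directory, excluding empty files.
demo=/tmp/scp_demo
mkdir -p "$demo"
printf 'dummy-mfcc-data' > "$demo/nf001001.mfc"   # normal file
: > "$demo/nf001002.mfc"                          # zero-byte (abnormal) file
for f in "$demo"/*.mfc; do
  [ -s "$f" ] && printf '%s\n' "$f"               # keep only non-empty files
done | sort > /tmp/train.scp
cat /tmp/train.scp
```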
Preparation of triphone
Although this operation may be performed after monophone training, phones1.mlf may need to be remade depending on the results of the check, so to save time it can be performed here.
Creation of tri.mlf: First, expand the phonemes into triphone sequences.
% HLEd -i tmptri.mlf mktri.led phones1.mlf
The phonemes listed as WB in mktri.led are excluded from the phoneme context.
mktri.led
------------------
WB sp
WB silB
WB silE
TC
------------------
The number of parameters is reduced by identifying long-vowel contexts with the corresponding short-vowel contexts. An example of the resulting tri.mlf is shown here.
------------------------------
#!MLF!#
"/hoge/mfcc/can1001/a/a01.lab"
silB
a+r
a-r+a
r-a+y
a-y+u
y-u+r
u-r+u
r-u+g
u-g+e
g-e+N
e-N+j
N-j+i
j-i+ts
i-ts+u
ts-u+o
u-o
sp
...
------------------------------
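The effect of HLEd's TC command can be sketched in shell: each phoneme takes its left and right neighbors as context, while the word-boundary phones (the WB entries sp, silB, silE in mktri.led) stay context-free and contexts do not cross them. This is only an illustration of the expansion rule, not a replacement for HLEd:

```shell
# Expand a monophone label sequence into triphones, HLEd-TC style.
tri=$(echo "silB a r a y u r u sp" | awk '
{
  for (i = 1; i <= NF; i++) {
    p = $i
    # Word-boundary phones are emitted unchanged.
    if (p == "sp" || p == "silB" || p == "silE") { print p; continue }
    out = p
    # Attach left context unless the neighbor is a word boundary.
    if (i > 1  && $(i-1) != "sp" && $(i-1) != "silB" && $(i-1) != "silE")
      out = $(i-1) "-" out
    # Attach right context unless the neighbor is a word boundary.
    if (i < NF && $(i+1) != "sp" && $(i+1) != "silB" && $(i+1) != "silE")
      out = out "+" $(i+1)
    print out
  }
}')
printf '%s\n' "$tri"
```

For the input above this prints silB, a+r, a-r+a, r-a+y, a-y+u, y-u+r, u-r+u, r-u, sp, matching the pattern in the tri.mlf example.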
Creation of triphones: The file triphones is the list of the triphones that appear in tri.mlf.
% grep -v lab tri.mlf | grep -v MLF | grep -v "\." | sort | uniq > triphones
physicalTri: The triphone list that also includes phoneme contexts that do not appear in tri.mlf at training time.
Check of consistency: Check that triphones and physicalTri are consistent. This check is important.
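One way to sketch this check in shell: every triphone in the training list (triphones) should also appear in the full list (physicalTri). comm -23 prints lines unique to its first file, so empty output means the lists are consistent. Tiny stand-in lists are used here; with the real files, pass sorted copies of triphones and physicalTri instead:

```shell
# Stand-in lists (the real ones come from the steps above).
printf 'a+r\na-r+a\n' | sort > /tmp/triphones
printf 'a+r\na-r+a\nr-a+y\n' | sort > /tmp/physicalTri
# Triphones present in training but missing from physicalTri:
missing=$(comm -23 /tmp/triphones /tmp/physicalTri)
if [ -z "$missing" ]; then echo consistent; else echo "missing: $missing"; fi
```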
Preparation of monophone
Create a prototype (proto) of the HMM: The proto can be created with the HTK tool MakeProtoHMMSet.
% ./MakeProtoHMMSet proto.pcf
An example of proto.pcf for MFCC is shown below.
--------
<BEGINproto_config_file>
<COMMENT>
   This PCF produces a 1 stream, single mixture prototype system
<BEGINsys_setup>
hsKind: P
covKind: D
nStates: 3
nStreams: 1
sWidths: 25
mixes: 1
parmKind: MFCC_D_E_N_Z
vecSize: 25
outDir: ./test
hmmList: protolist/protolist
<ENDsys_setup>
<ENDproto_config_file>
--------
Creation of initial model
% mkdir hmm0
% HCompV -C config.train -f 0.01 -m -S train.scp -M hmm0 proto
These steps create, under hmm0/, a proto whose means and variances have been estimated from all the training data, together with vFloor (the initial model). Note that this operation takes time, depending on the volume of data.
Creation of initial monophones:
hmm0/hmmdefs: Assign the values of hmm0/proto to all phonemes.
% cd hmm0
% ../mkmonophone.pl proto ../monophone1.list > hmmdefs
The monophone1.list is a list of phonemes including sp. In the HTKBook, monophone1.list is used after training with the phoneme list monophone0.list, which does not contain sp; here, the phoneme list that includes sp is used from the beginning.
hmm0/macro: Create a file macro by rewriting part of vFloor. This is used for flooring when data are insufficient.
% cp vFloor macro
In this example, add the following as the header of macro. Generally, the header should match that of hmmdefs, i.e., it depends on the contents of proto.
-----
~o
<STREAMINFO> 1 25
<VECSIZE> 25<NULLD><MFCC_E_D_N_Z>
-----
% cd ../
% mkdir hmm1 hmm2 hmm3
Perform the re-estimation at least three times (hmm1, hmm2, hmm3). Note that HERest takes the HMM list (here monophone1.list) as its final argument.
* hmm1
% HERest -C config.train -I phones1.mlf -t 250.0 150.0 1000.0 -T 1 \
    -S train.scp -H hmm0/macro -H hmm0/hmmdefs -M hmm1 monophone1.list
* hmm2
% HERest -C config.train -I phones1.mlf -t 250.0 150.0 1000.0 -T 1 \
    -S train.scp -H hmm1/macro -H hmm1/hmmdefs -M hmm2 monophone1.list
* hmm3
% HERest -C config.train -I phones1.mlf -t 250.0 150.0 1000.0 -T 1 \
    -S train.scp -H hmm2/macro -H hmm2/hmmdefs -M hmm3 monophone1.list
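The re-estimation passes follow one pattern, so they can be generated in a loop. This dry-run sketch only prints the commands (pipe it to sh once HTK and the input files from the earlier steps are in place); the macro/hmmdefs file names and the monophone1.list argument follow the steps above:

```shell
# Dry-run: print one HERest re-estimation command per pass (hmm0 -> hmm3).
cmds=$(
  prev=hmm0
  for i in 1 2 3; do
    echo "mkdir hmm$i"
    echo "HERest -C config.train -I phones1.mlf -t 250.0 150.0 1000.0 -T 1" \
         "-S train.scp -H $prev/macro -H $prev/hmmdefs -M hmm$i monophone1.list"
    prev=hmm$i
  done
)
printf '%s\n' "$cmds"
```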
Ideally, realignment would be performed at this point, but it is omitted here.
Creation of triphone
Creation of triphones from the monophone model:
% mkdir tri0
% HHEd -H hmm3/macro -H hmm3/hmmdefs -M tri0 mktri.hed monophone1.list
Initial training of triphones:
% mkdir tri1
% HERest -C config.train -I tri.mlf -t 250.0 150.0 1000.0 -T 1 -s stats \
    -S train.scp -H tri0/macro -H tri0/hmmdefs -M tri1 triphones
Repeat this training about 10 times (up to tri10).
Clustering
Clustering to 2000 states:
% mkdir s2000
% mkdir s2000/tri-01-00
% HHEd -H tri10/macro -H tri10/hmmdefs -M s2000/tri-01-00 2000.hed \
    triphones > log.s2000
Here, 2000.hed can be written as follows. The file stats on the first line is the output obtained in 9.2 (via the -s stats option of HERest). Temporarily set the threshold thres to a value around 1000, then adjust it by trial and error, watching the execution log, so that the number of states becomes 2000.
--------------------------------------------------------------
RO 100.0 stats
TR 0
QS "L_Nasal"    { N-*,n-*,m-* }
QS "R_Nasal"    { *+N,*+n,*+m }
QS "L_Bilabial" { p-*,b-*,f-*,m-*,w-* }
QS "R_Bilabial" { *+p,*+b,*+f,*+m,*+w }
...
TR 2
TB thres "TC_N2_" {("N","*-N+*","N+*","*-N").state[2]}
TB thres "TC_a2_" {("a","*-a+*","a+*","*-a").state[2]}
...
TR 1
AU "physicalTri"
ST "Tree,thres"
--------------------------------------------------------------
QS (question): These lines describe the questions that define the targets of clustering. In this example, only states that share the same central phoneme are clustered together. Control the final number of states by adjusting the dividing threshold appropriately (e.g. 1000 or 1200), confirming the result in the log.
Training: Perform training after clustering.
% mkdir s2000/tri-01-01
% HERest -C config.train -I tri.mlf -t 250.0 150.0 1000.0 -T 1 \
    -S train.scp -H s2000/tri-01-00/macro -H s2000/tri-01-00/hmmdefs \
    -M s2000/tri-01-01 physicalTri
Repeat at least three times.
Increase of the number of mixtures
Increasing the number of mixtures (example: 1 → 2 mixtures):
% cd s2000
% mkdir tri-02-00
% HHEd -H tri-01-03/macro -H tri-01-03/hmmdefs -M tri-02-00 \
    tiedmix2.hed physicalTri
Training: Perform training after increasing the number of mixtures.
% mkdir tri-02-01
% HERest -C config.train -I tri.mlf -t 250.0 150.0 1000.0 -T 1 \
    -S train.scp -H s2000/tri-02-00/macro -H s2000/tri-02-00/hmmdefs \
    -M tri-02-01 physicalTri
Repeat at least three times. Then repeat these steps, increasing the number of mixtures successively to around 16; we recommend doubling the number of mixtures each time (2 → 4 → 8 → 16).
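The mixture-doubling sequence can likewise be generated. This dry-run sketch prints the HHEd command for each doubling, assuming that tiedmix4.hed, tiedmix8.hed and tiedmix16.hed exist analogously to tiedmix2.hed, and that each stage was re-estimated three times, ending in tri-NN-03 (both assumptions, following the naming pattern above):

```shell
# Dry-run: print the HHEd command for each mixture doubling (2->4->8->16).
cmds=$(
  prev=02
  for m in 04 08 16; do
    echo "mkdir tri-$m-00"
    echo "HHEd -H tri-$prev-03/macro -H tri-$prev-03/hmmdefs -M tri-$m-00" \
         "tiedmix${m#0}.hed physicalTri"
    prev=$m
  done
)
printf '%s\n' "$cmds"
```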
See Also
HTK Speech Recognition Toolkit. For acoustic models for Julius, see the source information of this document. For acoustic model construction with HTK, see “Acoustic model construction for HTK”. For acoustic model construction with Sphinx, see “Acoustic model construction for Sphinx”, the tutorial developed by CMU.