JuliusMFT is the speech recognition module obtained by adapting the large vocabulary speech recognition system Julius for HARK. For HARK 0.1.x, it was provided as a patch to the multiband edition of Julius, which was itself an improved version of the large vocabulary speech recognition system Julius 3.5. For HARK 1.0, its implementation and functions were reviewed and rebuilt on Julius 4.1. Compared with the original Julius, JuliusMFT in HARK 1.0 has been modified in the following four respects.
Introduction of the Missing Feature theory
Support for network input (mfcnet) of MSLS features
Support for attached sound source information (SrcInfo)
Support for simultaneous utterances (exclusion control)
The implementation uses the plug-in feature introduced in Julius 4.0, with minimal modification to the main body of Julius. This section describes the differences from Julius and the connection with the HARK modules in FlowDesigner, as well as the installation method and usage.
JuliusMFT is executed as follows, assuming for example that the setting file is named julius.jconf.
> julius_mft -C julius.jconf
> julius_mft.exe -C julius.jconf (for Windows OS)
In HARK, after JuliusMFT has been started, a socket connection to it is established by starting a network that contains SpeechRecognitionClient (or SpeechRecognitionSMNClient) with the IP address and port number set correctly; this enables speech recognition. The julius.jconf file mentioned above is a text file that describes the settings of JuliusMFT. Its content basically consists of argument options beginning with "-", so the same options can also be given directly on the command line when starting Julius. Text following # is treated as a comment. The options used for Julius are summarized at http://julius.sourceforge.jp/juliusbook/ja/desc_option.html, and users are recommended to refer to that page. The minimum required settings are the following seven items.
-notypecheck
-plugindir /usr/lib/julius_plugin
-input mfcnet
-gprune add_mask_to_safe
-gram grammar
-h hmmdefs
-hlist allTriphones
-notypecheck Skips the type check of feature parameters. This option is optional in the original Julius but mandatory in JuliusMFT. Without it, the type check is performed. However, the JuliusMFT plug-ins compute mask data as well as features (a mask value of 1.0 is output even when no mask is used), so the type check judges that the sizes do not match and recognition is not performed.
-plugindir plug-in directory name Designates the directory that contains the plug-ins (*.jpi). Give the argument either as a relative path from the current directory or as an absolute path. The default location is /usr/lib/julius_plugin when installed with apt-get, and /usr/local/lib/julius_plugin when the source code is compiled and installed without specifying a path. This option must be given before any option whose function is realized in a plug-in, such as -input mfcnet or -gprune add_mask_to_safe. Note that all files with the plug-in extension in this directory are loaded at run time. On Windows, this option must be set even when the input option is not mfcnet; if mfcnet is not used, the directory name can be arbitrary.
-input mfcnet The -input option itself exists in the original Julius, where it supports microphones, files and network input. JuliusMFT extends it so that acoustic features and masks transmitted by SpeechRecognitionClient (or SpeechRecognitionSMNClient) can be received through a network, and mfcnet can be designated as the input source. This function is enabled by designating -input mfcnet. The port number for mfcnet is set with the same option that sets the port number for the acoustic input source adinnet in the original Julius, in the form "-adport port number".
-gprune Designates the pruning algorithm used when masks are applied to the existing output probability calculation. The algorithms are basically ported from the functions of julius_mft(ver3.5), which was provided for HARK 0.1.x. The user selects one of {add_mask_to_safe, add_mask_to_heu, add_mask_to_beam, add_mask_to_none} (if none is designated, the default calculation method is used). These correspond to {safe, heuristic, beam, none} in the original Julius, respectively. In addition, the calculation method based on eachgconst in julius_mft(ver3.5) was not strictly accurate, so its results (scores) differed from those of the original Julius. The current version adopts the same calculation method as the original to solve this problem.
-gram grammar Designates a language model. Same as in the original Julius.
-h hmmdefs Designates the acoustic models (HMM). Same as in the original Julius.
-hlist allTriphones Designates an HMMList file. Same as in the original Julius.
Further, when using JuliusMFT in the module mode described later, the -module option must be designated, as in the original Julius.
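The ordering rule above (-plugindir must come before any option realized in a plug-in) is easy to get wrong in a hand-edited file. The following is a minimal sketch of a checker for it, not part of JuliusMFT. The function names `parse_jconf` and `check_jconf` are hypothetical, and the tokenization with `shlex` is only an approximation of how Julius reads a jconf file.

```python
import shlex

# Option/value pairs that the text describes as realized in plug-ins.
PLUGIN_OPTIONS = {
    ("-input", "mfcnet"),
    ("-gprune", "add_mask_to_safe"),
    ("-gprune", "add_mask_to_heu"),
    ("-gprune", "add_mask_to_beam"),
    ("-gprune", "add_mask_to_none"),
}
# The seven minimum settings listed in this section.
REQUIRED = ["-notypecheck", "-plugindir", "-input", "-gprune",
            "-gram", "-h", "-hlist"]


def parse_jconf(text):
    """Return the option tokens of a jconf text, with # comments removed.

    ASSUMPTION: shell-like tokenization is close enough for simple files.
    """
    tokens = []
    for line in text.splitlines():
        tokens.extend(shlex.split(line, comments=True))
    return tokens


def check_jconf(text):
    """Return a list of human-readable problems (empty if none were found)."""
    tokens = parse_jconf(text)
    problems = [f"missing {opt}" for opt in REQUIRED if opt not in tokens]
    plugindir_pos = tokens.index("-plugindir") if "-plugindir" in tokens else None
    for i in range(len(tokens) - 1):
        pair = (tokens[i], tokens[i + 1])
        if pair in PLUGIN_OPTIONS and plugindir_pos is not None and i < plugindir_pos:
            problems.append(f"{pair[0]} {pair[1]} appears before -plugindir")
    return problems


SAMPLE = """\
-notypecheck
-plugindir /usr/lib/julius_plugin   # must precede the plug-in options
-input mfcnet
-gprune add_mask_to_safe
-gram grammar
-h hmmdefs
-hlist allTriphones
"""
print(check_jconf(SAMPLE))
```

For the seven-line sample, which mirrors the minimum settings listed earlier, the checker reports no problems. Moving the -plugindir line below -input mfcnet would produce an ordering message.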
To use mfcnet as the acoustic input source, designate "-input mfcnet" as an argument when starting JuliusMFT, as mentioned above. JuliusMFT then acts as a TCP/IP server and receives features. SpeechRecognitionClient and SpeechRecognitionSMNClient, which are HARK modules, act as clients that transmit acoustic features and Missing Feature Masks to JuliusMFT. The client connects to JuliusMFT for every utterance and closes the connection promptly after transmission is complete. The transmitted data must be little endian (note that this is not network byte order). Concretely, communication for one utterance proceeds as follows.
Socket connection The client opens the socket and connects to JuliusMFT .
Communication initialization (data transmitted once at the beginning) Immediately after the socket connection, the client transmits, only once, the information shown in Table 6.74 about the sound source whose features are about to be transmitted. The sound source information is expressed as a SourceInfo structure (Table 6.75) and consists of a sound source ID, a sound source direction and the time at which transmission starts. The time is expressed as a timeval structure defined in <sys/time.h> and is the elapsed time from the epoch (January 1, 1970 00:00:00) in the time zone of the system. Hereafter, "time" refers to the elapsed time from this epoch.
Table 6.74: Data transmitted once at the beginning
Size [byte] | Type | Data to be transmitted
4 | | 28 (= sizeof(SourceInfo))
28 | SourceInfo | Sound source information on the features that are going to be transmitted

Table 6.75: SourceInfo structure
Member variable name | Type | Description
source_id | | Sound source ID
azimuth | | Horizontal direction [deg]
elevation | | Vertical direction [deg]
time | timeval | Time (standardized to the 64-bit processor layout; the size is 16 bytes)
Data transmission (every frame) Acoustic features and Missing Feature Masks are transmitted. Taking the data shown in Table 6.76 as one frame, the features of one utterance are transmitted frame by frame until the end of the speech section. JuliusMFT assumes internally that the feature vectors and the mask vectors have the same number of dimensions.
Completing process After the features for one sound source have been transmitted, the data indicating completion (Table 6.77) are transmitted and the socket is closed.
Table 6.77: Data indicating completion
Size [byte] | Type | Data to be transmitted
4 | | 0
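The initialization and completion messages described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the HARK client code. The member types are not given in this section, so the sketch assumes a 4-byte signed integer for source_id, 4-byte floats for azimuth and elevation, and two 8-byte integers for timeval, which adds up to the stated 28 bytes. The 4-byte size field is also assumed to be a signed integer. The function names are hypothetical.

```python
import struct

# ASSUMPTION: source_id is a 4-byte signed integer, azimuth and elevation are
# 4-byte floats, and timeval is two 8-byte integers (seconds, microseconds).
# "<" selects little endian with no padding: 4 + 4 + 4 + 8 + 8 = 28 bytes.
SOURCE_INFO = struct.Struct("<iffqq")
# ASSUMPTION: the 4-byte size/terminator field is a signed integer.
SIZE_FIELD = struct.Struct("<i")


def init_message(source_id, azimuth, elevation, sec, usec):
    """Build the data sent once after connecting: size field, then SourceInfo."""
    body = SOURCE_INFO.pack(source_id, azimuth, elevation, sec, usec)
    return SIZE_FIELD.pack(len(body)) + body


def end_message():
    """Build the 4-byte zero that signals completion for one sound source."""
    return SIZE_FIELD.pack(0)


# Values taken from the standard output example in this section.
msg = init_message(0, 5.0, 16.7, 1268718777, 474575)
print(len(msg), SOURCE_INFO.size)
```

A client would write these bytes to the connected socket, for example with `socket.sendall`, placing the per-frame data of Table 6.76 between the two messages. The frame layout is not sketched here because it is not reproduced in this section.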
When -module is designated, JuliusMFT operates in module mode, as the original Julius does. In module mode, JuliusMFT functions as a TCP/IP server and provides clients such as jcontrol with its status and recognition results. Its operation can also be changed by transmitting commands. EUC-JP is normally used as the character code for Japanese character strings and can be changed with an argument. An XML-like format is used for data representation, and a "." (period) is transmitted after every message as a mark indicating the end of the data. An example of module-mode output is shown later in this section. The meanings of the representative tags transmitted by JuliusMFT are as follows.
INPUT tag
This tag gives information related to inputs and has the attributes STATUS and TIME. STATUS takes the value LISTEN, STARTREC or ENDREC. LISTEN indicates that Julius is ready to receive speech. STARTREC indicates that reception of features has started. ENDREC indicates that the last feature of the sound source being received has arrived. TIME indicates the time of the event.
SOURCEINFO tag
This tag gives information related to sound sources and is unique to JuliusMFT. Its attributes are ID, AZIMUTH, ELEVATION, SEC and USEC. The SOURCEINFO tag is transmitted when the second pass starts. ID indicates the sound source ID given in HARK (not a speaker ID but a number assigned uniquely to each sound source). AZIMUTH and ELEVATION indicate the horizontal and vertical directions (degrees) of the first frame of the sound source, seen from the microphone array coordinate system. SEC and USEC indicate the time of the first frame of the sound source: SEC holds the seconds and USEC the microseconds part.
RECOGOUT tag
This tag gives recognition results; its subelement is a progressive output, a first-pass output or a second-pass output. For progressive output, it has PHYPO tags as subelements. For first-pass and second-pass output, it has SHYPO tags as subelements. In the first pass, only the result with the maximum score is output. In the second pass, as many candidates as specified by the parameter are output, so one SHYPO tag is output per candidate.
PHYPO tag
This tag gives a progressive candidate and contains a sequence of WHYPO tags for the candidate words as subelements. Its attributes are PASS, SCORE, FRAME and TIME. PASS indicates the pass number and is always 1. SCORE indicates the score accumulated so far for this candidate. FRAME indicates the number of frames processed to output this candidate. TIME indicates the time (sec) at that point.
SHYPO tag
This tag gives a sentence hypothesis and contains a sequence of WHYPO tags for the candidate words as subelements. Its attributes are PASS, RANK, SCORE, AMSCORE and LMSCORE. PASS indicates the pass number and, when present, is always 1. RANK indicates the rank of the hypothesis and exists only for the second pass. SCORE indicates the log likelihood of this hypothesis, AMSCORE the log acoustic likelihood and LMSCORE the log language probability.
WHYPO tag
This tag gives a word hypothesis and has the attributes WORD, CLASSID, PHONE and CM. WORD indicates the notation, CLASSID the word name used as the key to the statistical language model, PHONE the phoneme sequence and CM the word confidence. The word confidence is included only in second-pass results.
SYSINFO tag
This tag gives the status of the system and has the attribute PROCESS. EXIT indicates normal termination and ERREXIT indicates abnormal termination. ACTIVE indicates that speech recognition can be performed, and SLEEP indicates that speech recognition is halted. Whether these tags and attributes are output is determined by the arguments designated when starting JuliusMFT. The SOURCEINFO tag is always output; the others behave as in the original Julius, so users are recommended to refer to the argument help of the original Julius.
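Because every module-mode message ends with a period, a client reading the socket has to cut the incoming text at those marks before interpreting the tags. The following is a minimal sketch of that step, assuming the period arrives on a line of its own. The function name `split_messages` is hypothetical, and the text is assumed to be already decoded from the configured character code.

```python
def split_messages(buffer):
    """Split received text into complete messages and an unfinished tail.

    ASSUMPTION: each message is followed by a line containing only ".".
    Returns (list of complete messages, leftover text to keep for later).
    """
    messages, current = [], []
    lines = buffer.split("\n")
    tail = lines.pop()  # text after the last newline may be incomplete
    for line in lines:
        if line.strip() == ".":
            messages.append("\n".join(current))
            current = []
        else:
            current.append(line)
    # Lines of a message whose terminator has not arrived go back to the tail.
    tail = "\n".join(current + [tail])
    return messages, tail


received = '<STARTPROC/>\n.\n<STARTRECOG SOURCEID="0"/>\n.\n<RECOGOUT SOUR'
done, rest = split_messages(received)
print(done)
print(repr(rest))
```

The leftover text is meant to be prepended to the next chunk read from the socket, so that a message split across two reads is reassembled before it is parsed.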
Compared with the original Julius, JuliusMFT differs in the following two points.
Addition of the SOURCEINFO tag, which carries the sound source localization information described above, and embedding of the sound source ID (SOURCEID) in the following related tags: STARTRECOG, ENDRECOG, INPUTPARAM, GMM, RECOGOUT, REJECTED, RECOGFAIL, GRAPHOUT, SOURCEINFO
To reduce the processing delay caused by exclusion control during simultaneous utterances, the output format of module mode was changed. Previously, exclusion control was performed for each utterance. The output is now divided into several parts so that exclusion control can be performed for each part, and the output of the following tags, which had to be output in one piece, was modified accordingly.
<<Tags separated into start-tag / end-tag>>
<RECOGOUT>...</RECOGOUT>
<GRAPHOUT>...</GRAPHOUT>
<GRAMINFO>...</GRAMINFO>
<RECOGPROCESS>...</RECOGPROCESS>
<<One-line tags whose output is divided into several parts internally>>
<RECOGFAIL .../>
<REJECTED .../>
<SR .../>
Example output of standard output mode
Stat: server-client: connect from 127.0.0.1
forked process [6212] handles this request
waiting connection...
source_id = 0, azimuth = 5.000000, elevation = 16.700001, sec = 1268718777, usec = 474575
### Recognition: 1st pass (LR beam)
..........................................................................................................................................................................................................
read_count < 0, read_count=-1, veclen=54
pass1_best: <s> Please order </s>
pass1_best_wordseq: 0 2 1
pass1_best_phonemeseq: silB | ch u: m o N o n e g a i sh i m a s u | silE
pass1_best_score: 403.611420
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 19 generated, 19 pushed, 4 nodes popped in 202
sentence1: <s> Please order </s>
wseq1: 0 2 1
phseq1: silB | ch u: m o N o n e g a i sh i m a s u | silE
cmscore1: 1.000 1.000 1.000
score1: 403.611786
connection end
ERROR: an error occurred while recognition, terminate stream <- This error log is part of the specification
Output sample of module mode The output to clients (e.g. jcontrol) is in the following XML-like format. The ">" at the head of each line is printed by jcontrol when jcontrol is used (it is not part of the output information).
> <STARTPROC/>
> <STARTRECOG SOURCEID="0"/>
> <ENDRECOG SOURCEID="0"/>
> <INPUTPARAM SOURCEID="0" FRAMES="202" MSEC="2020"/>
> <SOURCEINFO SOURCEID="0" AZIMUTH="5.000000" ELEVATION="16.700001" SEC="1268718638" USEC="10929"/>
> <RECOGOUT SOURCEID="0">
>   <SHYPO RANK="1" SCORE="403.611786" GRAM="0">
>     <WHYPO WORD="<s>" CLASSID="0" PHONE="silB" CM="1.000"/>
>     <WHYPO WORD="Please order" CLASSID="2" PHONE="ch u: m o N o n e g a i sh i m a s u" CM="1.000"/>
>     <WHYPO WORD="</s>" CLASSID="1" PHONE="silE" CM="1.000"/>
>   </SHYPO>
> </RECOGOUT>
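The output is only XML-like: attribute values such as WORD="<s>" contain unescaped angle brackets, so a strict XML parser may reject these lines. As a rough sketch, the attributes of one line can instead be extracted with regular expressions. The helper names `parse_tag` and `source_time` are hypothetical, and interpreting SEC as seconds since the epoch in UTC is an assumption.

```python
import re
from datetime import datetime, timezone

TAG_RE = re.compile(r"<\s*([A-Z]+)\b")
ATTR_RE = re.compile(r'([A-Z]+)="(.*?)"')


def parse_tag(line):
    """Return (tag name, attribute dict) for a start or one-line tag.

    Returns None for end tags and for lines that are not tags.
    """
    line = line.lstrip("> ").strip()  # drop the ">" printed by jcontrol
    m = TAG_RE.match(line)
    if m is None:
        return None
    return m.group(1), dict(ATTR_RE.findall(line))


def source_time(attrs):
    """Combine the SEC and USEC attributes of SOURCEINFO into a datetime.

    ASSUMPTION: SEC counts seconds from the epoch, interpreted here as UTC.
    """
    sec, usec = int(attrs["SEC"]), int(attrs["USEC"])
    return datetime.fromtimestamp(sec, tz=timezone.utc).replace(microsecond=usec)


name, attrs = parse_tag(
    '> <SOURCEINFO SOURCEID="0" AZIMUTH="5.000000" ELEVATION="16.700001" '
    'SEC="1268718638" USEC="10929"/>'
)
print(name, float(attrs["AZIMUTH"]), source_time(attrs).isoformat())
print(parse_tag('> <WHYPO WORD="<s>" CLASSID="0" PHONE="silB" CM="1.000"/>'))
```

The non-greedy match stops at the first closing quote, which is what lets values such as "<s>" and "</s>" pass through unchanged.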
Restriction on the -outcode option
Because the tag outputs are implemented in plug-in functions, the -outcode option, with which the user designates the types of output information, was also modified to be realized by plug-in functions. Designating the -outcode option when the plug-ins have not been loaded therefore causes an error.
Error message at the end of an utterance in standard output mode In standard output mode, the error "ERROR: an error occurred while recognition, terminate stream" (see the example output) appears because an error code is forcibly returned to the main body of Julius when the child process created by the feature input plug-in (mfcnet) finishes. By design, no countermeasure is taken for this error, in order to avoid modifying the main body of Julius as far as possible. In module mode, this error is not output.
Method using apt-get
If apt-get is set up, installation is completed as follows. Because the original Julius is also packaged for Ubuntu, if the original Julius is installed, remove it before executing the following.
> apt-get install julius-4.1.4-hark julius-4.1.3-hark-plugin
Method to install from source
Download julius-4.1.4-hark and julius_4.1.3_plugin and extract them in an appropriate directory.
Move to the julius-4.1.4-hark directory and execute the following command. By default it is installed in /usr/local/bin, so designate --prefix as follows to install in /usr/bin, the same location as the package.
./configure --prefix=/usr --enable-mfcnet; make; sudo make install
If the following is displayed after execution, Julius has been installed normally.
> /usr/bin/julius
Julius rev.4.1.4 - based on
JuliusLib rev.4.1.4 (fast) built for i686-pc-linux
Copyright (c) 1991-2009 Kawahara Lab., Kyoto University
Copyright (c) 1997-2000 Information-technology Promotion Agency, Japan
Copyright (c) 2000-2005 Shikano Lab., Nara Institute of Science and Technology
Copyright (c) 2005-2009 Julius project team, Nagoya Institute of Technology
Try '-setting' for built-in engine configuration.
Try '-help' for run time options.
>
Next, install the plug-ins. Move to the julius_4.1.3_plugin directory and execute the following command.
> export JULIUS_SOURCE_DIR=../julius_4.1.4-hark; make; sudo make install
Designate the path to the julius_4.1.4-hark source in JULIUS_SOURCE_DIR. The example here assumes that the Julius and plug-in sources were extracted into the same directory. The installation is now complete.
Confirm that the plug-in files are present under /usr/lib/julius_plugin.
> ls /usr/lib/julius_plugin
calcmix_beam.jpi calcmix_none.jpi mfcnet.jpi
calcmix_heu.jpi calcmix_safe.jpi
>
If the five plug-in files are listed as above, installation has completed normally.
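The same confirmation can be scripted, for example in a setup check that runs before starting JuliusMFT. The sketch below compares a directory listing against the five file names shown above. The function names are hypothetical, and the default directory is the apt-get location mentioned in this section.

```python
import os

# The five plug-in files listed in this section.
EXPECTED_PLUGINS = {
    "calcmix_beam.jpi",
    "calcmix_heu.jpi",
    "calcmix_none.jpi",
    "calcmix_safe.jpi",
    "mfcnet.jpi",
}


def missing_from(names):
    """Return the expected plug-in files absent from an iterable of names."""
    return sorted(EXPECTED_PLUGINS - set(names))


def missing_plugins(plugin_dir="/usr/lib/julius_plugin"):
    """List the directory and report which expected plug-ins are missing."""
    names = os.listdir(plugin_dir) if os.path.isdir(plugin_dir) else []
    return missing_from(names)


print(missing_from(["calcmix_beam.jpi", "calcmix_safe.jpi", "mfcnet.jpi"]))
```

An empty result from `missing_plugins()` corresponds to the five-file listing above. For a source install without --prefix, the directory argument would be /usr/local/lib/julius_plugin instead.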
For the method in Windows OS, please refer to Section 3.2.