6.8.1 JuliusMFT

6.8.1.1 Outline

JuliusMFT is a speech recognition module obtained by adapting the large vocabulary speech recognition system Julius for HARK. For HARK 0.1.x systems, it was provided as a patch for the multiband version of Julius (see footnote 1), which was itself an improvement based on the large vocabulary speech recognition system Julius 3.5. For HARK 1.0, however, its implementation and functions were reviewed and rebuilt with Julius 4.1 as the base. Compared with the original Julius, JuliusMFT in HARK 1.0 has been modified in the following four respects.

These changes were implemented with the plug-in feature introduced in Julius 4.0, with minimal modification to the main body of Julius. This section describes the differences from Julius, the connection with the HARK modules in FlowDesigner, and the installation and usage of JuliusMFT.

6.8.1.2 Start up and setting

JuliusMFT is started as follows, assuming for example that the setting file is named julius.jconf.

> julius_mft -C julius.jconf
> julius_mft.exe -C julius.jconf (for Windows OS)

In HARK, once JuliusMFT has been started, the socket connection to JuliusMFT is established by starting a network that contains SpeechRecognitionClient (or SpeechRecognitionSMNClient) with the IP address and port number set correctly, which enables speech recognition. The julius.jconf file mentioned above is a text file that describes the settings of JuliusMFT. Its content basically consists of argument options that begin with "-", so the same options can also be given directly as command-line arguments when starting Julius. Text that follows # is treated as a comment. The options available in Julius are summarized at http://julius.sourceforge.jp/juliusbook/ja/desc_option.html, and users are advised to consult that page. The minimum required settings are the following seven items.

Further, to use JuliusMFT in the module mode described later, the -module option must be specified, as in the original Julius.
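The file format described above (options that begin with "-", and "#" starting a comment) can be illustrated with a minimal Python sketch that converts jconf-style text into an argument list. The function name `jconf_to_args` is hypothetical, and the sketch assumes plain whitespace-separated tokens; it ignores quoting and any other rules that the real Julius parser may apply. Only `-input mfcnet` and `-module`, which appear in this section, are used in the sample.

```python
def jconf_to_args(text):
    """Convert jconf-style text into a flat list of arguments.

    Assumption: everything after '#' on a line is a comment, and the
    remainder consists of whitespace-separated tokens. The parser in
    Julius itself may handle more cases (e.g. quoting).
    """
    args = []
    for line in text.splitlines():
        line = line.split("#", 1)[0]  # drop the comment part, if any
        args.extend(line.split())     # empty lines contribute nothing
    return args


sample = """
# acoustic input is received from HARK over the network
-input mfcnet
-module        # enable the module mode
"""

print(jconf_to_args(sample))
```

The resulting list has the same form as the options that would be typed directly after the julius_mft command.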

6.8.1.3 Detail description

6.8.1.3.1 mfcnet communication specification

 

To use mfcnet as the acoustic input source, specify "-input mfcnet" as an argument when starting JuliusMFT, as mentioned above. JuliusMFT then acts as a TCP/IP server and receives features. SpeechRecognitionClient and SpeechRecognitionSMNClient, which are HARK modules, act as clients that transmit acoustic features and Missing Feature Masks to JuliusMFT. The client connects to JuliusMFT for every utterance and closes the connection promptly once transmission is complete. The transmitted data must be little endian (note that this is not network byte order). Concretely, communication for one utterance proceeds as follows.

  1. Socket connection: the client opens a socket and connects to JuliusMFT.

  2. Communication initialization (data transmitted once at the beginning): immediately after the socket connection, the client transmits, only once, the information shown in Table 6.74 about the sound source whose features are about to be sent. The sound source information is expressed as a SourceInfo structure (Table 6.75) and contains a sound source ID, the sound source direction, and the time at which transmission started. The time is stored in a timeval structure, defined in <sys/time.h>, and is the elapsed time since the epoch (January 1, 1970 00:00:00) in the time zone of the system. Hereafter, "time" always means this elapsed time since the epoch.

    Table 6.74: Data to be transmitted only once at the beginning

    Size [byte]  Type        Data to be transmitted
    -----------  ----------  ----------------------------------------------------------
    4            int         28 (= sizeof(SourceInfo))
    28           SourceInfo  Sound source information on the features to be transmitted

    Table 6.75: SourceInfo structure

    Member variable name  Type     Description
    --------------------  -------  ---------------------------------------------------------------
    source_id             int      Sound source ID
    azimuth               float    Horizontal direction [deg]
    elevation             float    Vertical direction [deg]
    time                  timeval  Time (standardized to 64 bit processor and the size is 16 bytes)

  3. Data transmission (every frame): acoustic features and the Missing Feature Mask are transmitted. The data shown in Table 6.76 form one frame, and frames are transmitted repeatedly until the end of the speech section, so that the features of one utterance are sent. JuliusMFT assumes internally that the feature vectors and the mask vectors have the same number of dimensions.

    Table 6.76: Data to be transmitted for every frame

    Size [byte]  Type       Data to be transmitted
    -----------  ---------  -------------------------------------------------------------
    4            int        N1 = (dimension number of feature vector) × sizeof(float)
    N1           float[N1]  Array of feature vector
    4            int        N2 = (dimension number of mask vector) × sizeof(float)
    N2           float[N2]  Array of mask vector

  4. Completion: after the features for one sound source have been transmitted, the data that indicate completion (Table 6.77) are transmitted and the socket is closed.

    Table 6.77: Data to indicate completion

    Size [byte]  Type  Data to be transmitted
    -----------  ----  ----------------------
    4            int   0
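The four steps above can be sketched in Python with the standard struct and socket modules. This is a minimal illustration under stated assumptions, not the implementation used by SpeechRecognitionClient: the names `SOURCE_INFO`, `pack_header`, `pack_frame`, `END_MARKER` and `send_utterance` are hypothetical, and the 28-byte SourceInfo is assumed to be packed without padding as int, float, float, followed by two 64-bit integers (seconds and microseconds) for the 16-byte timeval. The "<" prefix in each format string selects little endian, as the specification requires.

```python
import socket
import struct
import time

# Assumed packed little-endian layout of SourceInfo (28 bytes):
# int source_id, float azimuth, float elevation,
# 64-bit seconds, 64-bit microseconds (16-byte timeval).
SOURCE_INFO = struct.Struct("<iffqq")


def pack_header(source_id, azimuth, elevation, now=None):
    """Build the data sent once after connecting (Tables 6.74 and 6.75)."""
    now = time.time() if now is None else now
    sec = int(now)
    usec = int((now - sec) * 1_000_000)
    info = SOURCE_INFO.pack(source_id, azimuth, elevation, sec, usec)
    return struct.pack("<i", len(info)) + info  # size field, then struct


def pack_frame(features, mask):
    """Build one frame (Table 6.76): size-prefixed features, then mask."""
    if len(features) != len(mask):
        raise ValueError("feature and mask dimensions must match")
    body = b""
    for vector in (features, mask):
        data = struct.pack("<%df" % len(vector), *vector)
        body += struct.pack("<i", len(data)) + data  # byte count, floats
    return body


# A 4-byte integer 0 indicates completion (Table 6.77).
END_MARKER = struct.pack("<i", 0)


def send_utterance(host, port, source_id, azimuth, elevation, frames):
    """Send one utterance; frames is an iterable of (features, mask)."""
    with socket.create_connection((host, port)) as sock:  # step 1
        sock.sendall(pack_header(source_id, azimuth, elevation))  # step 2
        for features, mask in frames:  # step 3
            sock.sendall(pack_frame(features, mask))
        sock.sendall(END_MARKER)  # step 4; the socket closes on exit
```

The host and port passed to `send_utterance` would have to match the address on which JuliusMFT is listening. One connection is opened per utterance, which mirrors the behavior described for the HARK client modules.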

6.8.1.3.2 Module mode communication specification

When -module is specified, JuliusMFT operates in the module mode, as the original Julius does. In the module mode, JuliusMFT functions as a TCP/IP server and provides clients such as jcontrol with its status and recognition results. Its operation can also be changed by sending it commands. Japanese character strings are usually encoded in EUC-JP, and the encoding can be changed with an argument. Data are represented in an XML-like format, and a "." (period) is transmitted after every message as a mark that indicates the end of the data. An example of module mode output is given in Section 6.8.1.3.3. The meanings of representative tags transmitted by JuliusMFT are as follows.

Compared with the original Julius, JuliusMFT differs in the following two points.

6.8.1.3.3 Example output of JuliusMFT 

 

  1. Example output of standard output mode

    Stat: server-client: connect from 127.0.0.1
    forked process [6212] handles this request
    waiting connection...
    source_id = 0, azimuth = 5.000000, elevation = 16.700001, sec = 1268718777, usec = 474575
    ### Recognition: 1st pass (LR beam)
    ..........................................................................................................................................................................................................
    read_count < 0, read_count=-1, veclen=54
    pass1_best: <s> Please order </s>
    pass1_best_wordseq: 0 2 1
    pass1_best_phonemeseq: silB | ch u: m o N o n e g a i sh i m a s u | silE
    pass1_best_score: 403.611420
    ### Recognition: 2nd pass (RL heuristic best-first)
    STAT: 00 _default: 19 generated, 19 pushed, 4 nodes popped in 202
    sentence1: <s> Please order </s>
    wseq1: 0 2 1
    phseq1: silB | ch u: m o N o n e g a i sh i m a s u | silE
    cmscore1: 1.000 1.000 1.000
    score1: 403.611786
    connection end
    ERROR: an error occurred while recognition, terminate stream   <- This error log is part of the specification
    
  2. Example output of module mode: output is sent to clients (e.g. jcontrol) in the following XML-like format. The ">" at the head of each line is printed by jcontrol when jcontrol is used; it is not part of the output information.

    > <STARTPROC/>
    > <STARTRECOG SOURCEID="0"/>
    > <ENDRECOG SOURCEID="0"/>
    > <INPUTPARAM SOURCEID="0" FRAMES="202" MSEC="2020"/>
    > <SOURCEINFO SOURCEID="0" AZIMUTH="5.000000" ELEVATION="16.700001" SEC="1268718638" USEC="10929"/>
    > <RECOGOUT SOURCEID="0">
    > <SHYPO RANK="1" SCORE="403.611786" GRAM="0">
    > <WHYPO WORD="<s>" CLASSID="0" PHONE="silB" CM="1.000"/>
    > <WHYPO WORD="Please order" CLASSID="2" PHONE="ch u: m o N o n e g a i sh i m a s u" CM="1.000"/>
    > <WHYPO WORD="</s>" CLASSID="1" PHONE="silE" CM="1.000"/>
    > </SHYPO>
    > </RECOGOUT>
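A client other than jcontrol has to split the received text into messages and read the attributes of each tag. The following Python sketch shows one way to do this with regular expressions; a standard XML parser is deliberately not used, because values such as WORD="<s>" contain unescaped angle brackets. The names `split_messages` and `parse_line` are hypothetical, and the sketch assumes that each tag arrives on its own line and that every message is followed by a line containing only ".". The sample text omits the ">" prefix, since that prefix is printed by jcontrol and is not part of the data.

```python
import re

# One tag per line, e.g. <WHYPO WORD="<s>" CM="1.000"/> or </SHYPO>.
TAG_LINE = re.compile(r'^\s*<(/?)([A-Z]+)(.*?)(/?)>\s*$')
# NAME="value" pairs; a value may itself contain '<' or '>'.
ATTRIBUTE = re.compile(r'([A-Z]+)="(.*?)"(?=\s|$)')


def split_messages(text):
    """Split received text into messages, each a list of tag lines.

    Assumption: a line consisting only of '.' terminates a message.
    """
    messages, current = [], []
    for line in text.splitlines():
        if line.strip() == ".":
            messages.append(current)
            current = []
        elif line.strip():
            current.append(line)
    return messages


def parse_line(line):
    """Return (tag, attributes, kind); kind is 'open', 'close' or 'empty'."""
    match = TAG_LINE.match(line)
    if match is None:
        return None
    closing, tag, body, empty = match.groups()
    kind = "close" if closing else ("empty" if empty else "open")
    return tag, dict(ATTRIBUTE.findall(body)), kind


sample = """<STARTRECOG SOURCEID="0"/>
.
<RECOGOUT SOURCEID="0">
<SHYPO RANK="1" SCORE="403.611786" GRAM="0">
<WHYPO WORD="<s>" CLASSID="0" PHONE="silB" CM="1.000"/>
<WHYPO WORD="</s>" CLASSID="1" PHONE="silE" CM="1.000"/>
</SHYPO>
</RECOGOUT>
.
"""

for message in split_messages(sample):
    for line in message:
        print(parse_line(line))
```

Keeping the attributes as strings leaves the choice of conversion (for example, SCORE to a floating-point value) to the caller.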
    

6.8.1.4 Notice

6.8.1.5 Installation method


Footnotes

  1. http://www.furui.cs.titech.ac.jp/mband_julius/