The design philosophy of the robot audition software HARK is summarized as follows.
Provision of total functions, from input to sound source localization, source separation, and speech recognition: HARK guarantees total performance across the whole chain, covering input from microphones installed on a robot, sound source localization, source separation, noise suppression, and automatic speech recognition.
Correspondence to robot shape: HARK handles the microphone layout required by the user and incorporates it into signal processing.
Correspondence to multichannel A/D systems: HARK supports various multichannel A/D systems that differ in price range and function.
Provision of optimal sound processing modules and advice: Each signal processing algorithm rests on premises under which it is effective, so multiple algorithms have been developed for the same function, and the optimal modules are provided based on usage experience.
Real-time processing: This is essential for performing interactions and behaviors through sounds.
As shown in Figure 1.1, HARK uses FlowDesigner [2] as middleware for everything except the speech recognition part (Julius) and the support tools. As Figure 1.1 also shows, only Linux is supported. One reason is that HARK uses the ALSA (Advanced Linux Sound Architecture) API to support multiple multichannel A/D systems. A PortAudio version of HARK is also being developed, since PortAudio has recently become available for Windows systems.
In robot audition, sound sources are typically separated based on sound source localization data, and speech recognition is performed on the separated speech. Each processing stage becomes more flexible when it is composed of multiple modules, so that an algorithm can be partially replaced. It is therefore essential to introduce middleware that allows efficient integration of modules. However, as the number of integrated modules increases, the total overhead of the module connections grows and real-time performance is lost. A general-purpose framework such as CORBA (Common Object Request Broker Architecture), which requires data serialization at each module connection, has difficulty coping with this problem. In HARK, every module processes the same acoustic data in the same time frame. If each module copied the acoustic data in memory every time, processing would be inefficient in both speed and memory use.
To address this problem, we have employed FlowDesigner [2] as middleware. FlowDesigner is free (LGPL/GPL) middleware equipped with a data-flow-oriented GUI development environment; it realizes high-speed, lightweight module integration on the premise that everything runs on a single computer. Its processing is faster and lighter than that of universally applicable integration frameworks such as CORBA. In FlowDesigner, each module is implemented as a C++ class. Because these classes inherit from common superclasses, the interface between modules is naturally unified. Because module connections are realized by calling a specific method of each class (a function call), the overhead is small. Because data are passed by reference and by pointer, the acoustic data mentioned above are processed at high speed with few resources.
In other words, FlowDesigner maintains both the data transfer rate between modules and module reusability. We publish a version of FlowDesigner in which bugs such as memory leaks have been fixed and usability (mainly of the attribute settings) has been improved based on past usage experience.
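The connection style described above (a common superclass, connections made by plain method calls, and data handed over by reference) can be illustrated with a minimal sketch. It is written in Python for brevity rather than in C++, and the class names Node, FrameSource, ChannelSelector, and PowerMeter are hypothetical; they are not part of the FlowDesigner or HARK API.

```python
class Node:
    """Hypothetical common superclass: every module exposes the same interface."""

    def __init__(self, upstream=None):
        self.upstream = upstream

    def get_output(self, frame_index):
        raise NotImplementedError


class FrameSource(Node):
    """Stands in for an audio input module; holds multichannel frames."""

    def __init__(self, frames):
        super().__init__()
        self.frames = frames

    def get_output(self, frame_index):
        # Hand out a reference to the stored frame; nothing is copied.
        return self.frames[frame_index]


class ChannelSelector(Node):
    """Selects one channel of the upstream frame."""

    def __init__(self, upstream, channel):
        super().__init__(upstream)
        self.channel = channel

    def get_output(self, frame_index):
        # The connection to the upstream module is a plain method call.
        frame = self.upstream.get_output(frame_index)
        return frame[self.channel]  # still a reference to the original buffer


class PowerMeter(Node):
    """Computes the mean power of the samples provided by the upstream module."""

    def get_output(self, frame_index):
        samples = self.upstream.get_output(frame_index)
        return sum(s * s for s in samples) / len(samples)


# One frame with two channels of two samples each.
frames = [[[1.0, -1.0], [2.0, 2.0]]]
source = FrameSource(frames)
selector = ChannelSelector(source, channel=1)
meter = PowerMeter(selector)
```

Calling meter.get_output(0) pulls the frame through the chain by nested method calls, and selector.get_output(0) returns the very same list object that the source holds, so no serialization or copying takes place between modules.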
Figure 1.2 shows a FlowDesigner network for typical robot audition with HARK. Multichannel acoustic signals are acquired from input files, and sound source localization and source separation are performed. Acoustic features are extracted from the separated sound, missing feature masks (MFM) are generated, and these are sent to speech recognition (ASR). The attributes of each module can be set on the attribute setting screen (Figure 1.3 shows an example for GHDSS). Table 1.1 lists the HARK modules and external tools that are currently provided. The following sections outline each module together with its design strategy.
In HARK, multiple microphones (a microphone array) are mounted on a robot as its ears. Figure 1.4 shows examples of how such ears are installed. Each example is equipped with an eight-channel microphone array, although HARK can use microphone arrays with an arbitrary number of channels. HARK supports the following three types of multichannel A/D conversion devices.
The RASP series (JEOL System Technology Co., Ltd.),
TD-BD-16ADUSB, USB interface (Tokyo Electron Device Ltd.),
ALSA-based A/D conversion devices (e.g., RME Hammerfall DSP series, Multiface AE).
These A/D systems have 16 input channels. All sixteen channels can be used in HARK by changing its internal parameters, although processing speed may fall in that case. No changes are required even when samples are represented with 24 bits. HARK assumes a sampling rate of 16 kHz, so a downsampling module can be used for data sampled at 48 kHz. Low-priced pin microphones are sufficient, although a preamplifier is advisable to compensate for insufficient gain; OctaMic II, available from RME, is one example.
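Converting 48 kHz data to the 16 kHz assumed by HARK means low-pass filtering and then keeping every third sample. The following is a minimal sketch of that idea, not the implementation of the HARK downsampling module; the function names, the 63-tap Hamming-windowed sinc filter, and the cutoff at 0.45 of the output sampling rate are all assumptions chosen for illustration.

```python
import math


def design_lowpass(num_taps, cutoff):
    """Hamming-windowed sinc low-pass filter.

    cutoff is given as a fraction of the input sampling rate.
    """
    mid = (num_taps - 1) / 2.0
    taps = []
    for n in range(num_taps):
        x = n - mid
        if x == 0:
            ideal = 2.0 * cutoff
        else:
            ideal = math.sin(2.0 * math.pi * cutoff * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    gain = sum(taps)
    return [t / gain for t in taps]  # unit gain at 0 Hz


def downsample(signal, factor=3, num_taps=63):
    """Low-pass filter the signal and keep one output per `factor` inputs."""
    # Cutoff slightly below the new Nyquist frequency (0.5 / factor).
    taps = design_lowpass(num_taps, 0.45 / factor)
    out = []
    # Evaluate the filter only at the positions that are kept.
    for start in range(0, len(signal) - num_taps + 1, factor):
        out.append(sum(t * signal[start + k] for k, t in enumerate(taps)))
    return out
```

With the default factor of 3, a 1 kHz tone recorded at 48 kHz passes almost unchanged, while a 20 kHz tone, which could not be represented at 16 kHz, is strongly attenuated instead of being aliased into the speech band.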
Function (Category name): Description of each module
Voice input/output (AudioIO): Acquire sound from microphone; Acquire sound from file; Save sound in file; Save sound in wav-formatted file; Socket-based data communication
Sound source localization / tracking (Localization): Output constant localized value; Display localization result; Localize sound source; Load localization information from file; Save source location information in file; Extend forward the tracking result; Source tracking; Load a Correlation Matrix (CM) file; Save a CM file; Channel selection for CM; Create a CM; Create a CM; Division of each element of CM; Multiplication of each element of CM; Inverse of CM; Multiplication of CM; Output identity CM
Sound source separation (Separation): Estimate background noise; Subtract noise spectrum and estimate optimum gain; Add power spectrum; Estimate inter-channel leak noise; Separate sound source by GHDSS; Estimate noise spectrum; Perform post-filtering after sound source separation; Estimate voice spectrum
Feature extraction (FeatureExtraction): Calculate Δ term; Remove term; Perform mel-scale filter bank processing; Extract MFCC; Extract MSLS; Perform pre-emphasis; Save features; Save features in the HTK form; Normalize spectrum mean
Missing Feature Mask (MFM): Calculate Δ mask term; Calculate Δ power mask term; Generate MFM
Communication with ASR (ASRIF): Send feature to ASR; Same as above, with the feature SMN
Others (MISC): Select channel; Generate log output of data; Convert Matrix to Map; Calculate gain of multiple-channel; Perform downsampling; Perform multichannel FFT; Calculate power of Map input; Calculate power of matrix input; Select audio stream segment by ID; Select sound source by direction; Select sound source by ID; Convert waveform; Add white noise
Function (Category), Tool name: Description
Data generation (External tool), hark-tool: Visualize data / Generate setting file
For sound source localization with microphone arrays, HARK employs the MUltiple SIgnal Classification (MUSIC) method, which has shown the best performance in past experience. MUSIC localizes sound sources based on the impulse responses (transfer functions) between source positions and each microphone. Impulse responses can be obtained by actual measurement or by calculation from the geometric positions of the microphones. In HARK 0.1.7, the beamformer of ManyEars [3] was also available for microphone-array localization. This module localizes in a 2D polar coordinate space ("2D" in the sense that direction information is recognized in a 3D polar coordinate space). It has been reported that the direction error is about 1.4 degrees when the source is within 5 m of the microphone array and the sound sources are more than 20 degrees apart. However, ManyEars as a whole is designed for 48 kHz sampling rather than the 16 kHz used in HARK, and it assumes that microphones are arranged in free space when impulse responses are simulated from the microphone layout, so the influence of the robot body cannot be taken into account. In addition, the localization accuracy of adaptive beamformers such as MUSIC is higher than that of common beamformers. For these reasons, HARK 1.0.0 supports only the MUSIC method. HARK 1.1.0 added GEVD-MUSIC and GSVD-MUSIC, which are extended versions of MUSIC. These extensions make it possible to suppress or whiten known high-power noise, such as robot ego-noise, and to localize desired sounds under such noise.
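The principle of MUSIC can be sketched for the simplest case: one narrowband source and a uniform linear array. The sketch below is an illustration under several assumptions and is not the HARK localization module: steering vectors are computed for free space (HARK can use measured transfer functions instead), the number of sources is fixed at one so that the signal subspace is a single eigenvector found by power iteration, and all function names and parameter values are hypothetical.

```python
import cmath
import math
import random


def steering_vector(angle_deg, num_mics, spacing, freq, speed=340.0):
    """Free-field steering vector of a uniform linear array (assumption)."""
    delay_unit = spacing * math.sin(math.radians(angle_deg)) / speed
    return [cmath.exp(-2j * math.pi * freq * m * delay_unit)
            for m in range(num_mics)]


def simulate_snapshots(angle_deg, num_mics, spacing, freq,
                       count=200, noise_std=0.1, seed=0):
    """Synthetic narrowband snapshots: one source plus white sensor noise."""
    rng = random.Random(seed)
    a = steering_vector(angle_deg, num_mics, spacing, freq)
    snapshots = []
    for _ in range(count):
        s = cmath.exp(2j * math.pi * rng.random())  # unit amplitude, random phase
        snapshots.append([s * c + complex(rng.gauss(0, noise_std),
                                          rng.gauss(0, noise_std)) for c in a])
    return snapshots


def correlation_matrix(snapshots):
    """Average of x x^H over all snapshots."""
    n = len(snapshots[0])
    return [[sum(x[i] * x[j].conjugate() for x in snapshots) / len(snapshots)
             for j in range(n)] for i in range(n)]


def dominant_eigenvector(matrix, iterations=200):
    """Power iteration; enough to span the signal subspace of one source."""
    n = len(matrix)
    v = [1.0 + 0j] * n
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(abs(c) ** 2 for c in w))
        v = [c / norm for c in w]
    return v


def music_spectrum(matrix, angles, spacing, freq):
    """MUSIC spectrum: |a|^2 divided by the energy of a in the noise subspace."""
    n = len(matrix)
    e1 = dominant_eigenvector(matrix)
    spectrum = []
    for angle in angles:
        a = steering_vector(angle, n, spacing, freq)
        total = sum(abs(c) ** 2 for c in a)
        in_signal = abs(sum(e.conjugate() * c for e, c in zip(e1, a))) ** 2
        # The noise-subspace projector is I - e1 e1^H in the one-source case.
        spectrum.append(total / max(total - in_signal, 1e-12))
    return spectrum
```

As a usage example, snapshots simulated with simulate_snapshots(30.0, 8, 0.05, 1000.0) can be turned into a correlation matrix and scanned with music_spectrum over the angles -90 to 90 degrees; the direction estimate is the angle at which the spectrum peaks.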
For sound source separation, HARK 1.0.0 employs Geometric-Constrained High-order Decorrelation-based Source Separation (GHDSS) [8], which past usage experience has shown to have the highest total performance in various acoustic environments, together with PostFilter and the noise estimation method Histogram-based Recursive Level Estimation (HRLE). At present, the combination of GHDSS and HRLE gives the best performance and stability in various acoustic environments. Various methods have been developed and evaluated so far, including beamformers (delay-and-sum type, adaptive type), Independent Component Analysis (ICA), and Geometric Source Separation (GSS). The sound source separation methods employed for HARK are summarized as follows:
Delay-and-sum beamformer, employed for HARK 0.1.7,
Combination of ManyEars Geometric Source Separation (GSS) and PostFilter [4], supported as an external module with HARK 0.1.7,
Combination of GSS and PostFilter as an original design [5], employed for the HARK 1.0.0 prerelease,
Combination of GHDSS and HRLE, employed for HARK 1.0.0 [6, 8].
The ManyEars GSS used for HARK 0.1.7 uses transfer functions from a sound source to each microphone as a geometric constraint and separates the signal arriving from a given sound source direction. These transfer functions are calculated from the relation between microphone positions and sound source positions. Calculating them this way degraded performance whenever the transfer function changed with the shape of the robot even though the microphone layout stayed the same. GSS was therefore redesigned for the HARK 1.0.0 prerelease. It was extended so that measured transfer functions can be used as the geometric constraint. Further modifications, such as adaptive step-size control, were made to accelerate convergence of the separation matrix. It also became possible to configure GSS as a delay-and-sum beamformer by changing its attribute settings. Accordingly, the delay-and-sum beamformer DSBeamformer, which had been employed for HARK 0.1.7, was removed. Most sound source separation methods other than ICA require the direction of the sound source to be separated as a parameter. If localization information is not provided, separation itself cannot be executed. A robot's steady noise, on the other hand, behaves fairly strongly as a directional sound source, so it can be removed if it is localized. In practice, however, such noise often fails to be localized, and there were actual cases in which separation of steady noise degraded as a result.
A function that continuously specifies noise sources in specific directions was added to GSS and GHDSS in the HARK 1.0.0 prerelease, which makes it possible to keep separating sound sources that cannot be localized. In general, separation based on linear processing, such as GSS and GHDSS, has limited performance, so nonlinear processing called a post-filter is essential to improve the quality of separated sounds. The ManyEars post-filter was redesigned, and a post-filter with considerably fewer parameters is employed for the HARK 1.0.0 prerelease and the final version. The post-filter can be a "good knife" if it is used properly, but it is difficult to use to full effect, and users may suffer adverse effects if it is used wrongly. PostFilter still has several parameters that must be set, and setting them properly is difficult. Furthermore, the post-filter performs nonlinear processing based on a probabilistic model. It therefore introduces nonlinear spectral distortion into separated sounds, and the speech recognition rate for separated sounds does not improve easily. HARK 1.0.0 employs the steady noise estimation method HRLE (Histogram-based Recursive Level Estimation), which is suited to GHDSS. Separated sounds of improved quality are obtained by combining HRLE with EstimateLeak, which estimates inter-channel leak energy and was developed through a thorough examination of the GHDSS separation algorithm.
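The general idea behind a histogram-based recursive level estimate can be sketched as follows: keep an exponentially decaying histogram of observed levels and report the level at a chosen point of its cumulative distribution as the steady noise level. This is a simplified illustration only, not the published HRLE algorithm or the HARK module; the class name, the bin layout, the decay factor, and the percentile are all assumptions.

```python
class HistogramLevelEstimator:
    """Simplified steady-noise level tracker for one frequency bin (sketch)."""

    def __init__(self, min_db=-100.0, step_db=1.0, num_bins=200,
                 decay=0.99, percentile=0.5):
        self.min_db = min_db
        self.step_db = step_db
        self.decay = decay
        self.percentile = percentile
        self.histogram = [0.0] * num_bins

    def update(self, level_db):
        """Add one observed level (in dB) and return the current noise estimate."""
        index = int((level_db - self.min_db) / self.step_db)
        index = max(0, min(len(self.histogram) - 1, index))
        # Recursive update: old observations fade, the new one is added.
        self.histogram = [self.decay * h for h in self.histogram]
        self.histogram[index] += 1.0 - self.decay
        # Walk the cumulative histogram up to the requested percentile.
        target = self.percentile * sum(self.histogram)
        cumulative = 0.0
        for i, h in enumerate(self.histogram):
            cumulative += h
            if cumulative >= target:
                return self.min_db + i * self.step_db
        return self.min_db + (len(self.histogram) - 1) * self.step_db


def track_noise(levels_db):
    """Run the estimator over a sequence of levels and return the last estimate."""
    estimator = HistogramLevelEstimator()
    estimate = None
    for level in levels_db:
        estimate = estimator.update(level)
    return estimate
```

When most frames contain only steady noise and louder speech frames appear occasionally, the estimate stays at the noise level, because the loud frames occupy only the upper part of the cumulative histogram.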
The spectral distortion caused by factors such as sound mixture and separation exceeds what the conventional speech recognition community assumes. Dealing with it requires connecting sound source separation and speech recognition more closely. HARK does this with speech recognition based on the missing feature theory (MFT-ASR) [4]. The concept of MFT-ASR is shown in Figure 1.5. The black and red lines in the figure indicate the time variation of the acoustic features of a separated sound and of the acoustic model for the corresponding speech used by the ASR system, respectively. Because of distortion, the acoustic features of a separated sound differ greatly at some points from those of the model (Figure 1.5(a)). In MFT-ASR, the influence of the distortion is ignored by masking the distorted points with a Missing Feature Mask (MFM) (Figure 1.5(b)). An MFM is a reliability map over time that corresponds to the acoustic features of a separated sound. A binary mask (also called a hard mask) is usually used; masks with continuous values from 0 to 1 are called soft masks. In HARK, the MFM is generated from the steady noise obtained from the post-filter and from inter-channel leak energy. MFT-ASR, like common speech recognition, is based on a Hidden Markov Model (HMM). The parts related to acoustic scores calculated from the HMM (mainly the output probability calculation) are modified so that the MFM can be used. HARK uses Multiband Julius, developed by the Furui Laboratory of Tokyo Institute of Technology, reinterpreted as MFT-ASR [13]. HARK 1.0.0 uses the plug-in feature of the Julius 4 series, and the main part of MFT-ASR is implemented as a Julius plug-in. Implementing MFT-ASR as a plug-in allows Julius to be updated without modifying MFT-ASR. Moreover, MFT-ASR runs as a server/daemon independent of FlowDesigner and outputs recognition results for the acoustic features and their MFM that a HARK speech recognition client transmits via socket communication.
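How a mask enters the output probability calculation can be illustrated with a single diagonal-covariance Gaussian whose per-dimension log densities are weighted by the mask. This is a simplified sketch of the idea, not the computation used in Julius or HARK (real acoustic models use Gaussian mixtures per HMM state); the function names and the reliability threshold are hypothetical.

```python
import math


def hard_mask(reliability, threshold=0.5):
    """Binary mask: 1.0 for reliable feature dimensions, 0.0 otherwise."""
    return [1.0 if r >= threshold else 0.0 for r in reliability]


def masked_log_likelihood(features, mask, means, variances):
    """Log output probability of a diagonal Gaussian, weighted by the mask.

    A dimension with mask 0 contributes nothing, so distortion there is
    ignored; a soft mask with values between 0 and 1 scales the contribution.
    """
    score = 0.0
    for x, m, mu, var in zip(features, mask, means, variances):
        log_density = -0.5 * (math.log(2.0 * math.pi * var) + (x - mu) ** 2 / var)
        score += m * log_density
    return score


# The second feature dimension is badly distorted (5.0 instead of about 0.0).
features = [1.0, 5.0]
means = [1.0, 0.0]
variances = [1.0, 1.0]
mask = hard_mask([0.9, 0.1])
masked_score = masked_log_likelihood(features, mask, means, variances)
unmasked_score = masked_log_likelihood(features, [1.0, 1.0], means, variances)
```

In this example the distorted dimension dominates the unmasked score and makes the correct model look unlikely, whereas the masked score depends only on the reliable dimension.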
To improve the effectiveness of MFT by confining spectral distortion to specific acoustic features, the Mel Scale Log Spectrum (MSLS) [4] is used as the acoustic feature. The Mel-Frequency Cepstrum Coefficient (MFCC), which is generally used for speech recognition, is also available in HARK. In MFCC, however, distortion spreads across all features, so it is poorly suited to MFT. When simultaneous speech is infrequent, speech recognition with MFCC performs better in some cases. HARK 1.0.0 provides a new module for using the power term with MSLS features [6]. The effectiveness of the power term for MFCC features has already been reported. It has been confirmed that the 27-dimensional MSLS feature, consisting of 13-dimensional MSLS, 13-dimensional ΔMSLS, and a power term, performs better than the 24-dimensional MSLS and ΔMSLS (48 dimensions in total) used for HARK 0.1.7. In HARK, the influence of the distortion caused by the aforementioned nonlinear separation is reduced by adding a small amount of white noise. An acoustic model is built by multi-condition training on clean speech and on speech with white noise added. Speech recognition is then performed after adding the same amount of white noise to the separated speech. In this way, highly precise recognition is achieved for one speaker's speech even when the S/N is around -3 dB [6].
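Adding the same amount of white noise in training and in recognition amounts to adding Gaussian noise with one fixed standard deviation to every waveform. The following minimal sketch shows that step; the function name and the value of noise_std are assumptions for illustration and do not reflect the settings of the HARK white noise module.

```python
import random


def add_white_noise(signal, noise_std, rng=None):
    """Return a copy of the signal with zero-mean Gaussian white noise added.

    noise_std is an absolute level, so the same value can be applied both to
    the training data and to separated speech before recognition.
    """
    rng = rng or random.Random()
    return [s + rng.gauss(0.0, noise_std) for s in signal]


NOISE_STD = 0.01  # hypothetical level, shared by training and recognition

clean_training_speech = [0.0] * 10000   # placeholder waveform
separated_speech = [0.0] * 10000        # placeholder waveform

noisy_training_speech = add_white_noise(clean_training_speech, NOISE_STD,
                                        random.Random(1))
noisy_separated_speech = add_white_noise(separated_speech, NOISE_STD,
                                         random.Random(2))
```

Because the level is fixed rather than tied to the signal power, the noise floor seen by the recognizer matches the one in the multi-condition training data regardless of how loud the separated speech is.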