HARK Document Version 3.2.0. (Revision: 9448) : SemiBlindICA

6.3.12 SemiBlindICA

6.3.12.1 Overview

This module removes a known signal (e.g., the utterance from a spoken dialogue system) from the multichannel observed mixture. The algorithm presented in Reference $^{(1)}$ is implemented in this module.

6.3.12.2 Necessary files

N/A.

6.3.12.3 Usage

When to use

Here is an example in the context of a spoken dialogue system. When a spoken dialogue system dispenses with a close-talk microphone, the microphone may capture a mixture of the user's utterance and the utterance from the system itself, because the microphone is located at a distance from the mouth of the user. In this situation, the speech recognition quality is degraded because the input signal captured by the system contains both the target voice of the user and the interfering system utterance.

Generally speaking, when the multichannel observed signal recorded by a microphone array contains a known signal component, we can remove the known signal component from the observed mixture signal. In the example above, the system utterance corresponds to the known signal component. Here, a “known” signal means that we know the waveform of the signal when it is played. For example, when we have a wav file to play with a loudspeaker, the signal is known. The waveform of a signal observed by a microphone is usually different from the waveform that was played from a loudspeaker. This is because the spatial propagation of the sound changes the waveform and adds some reverberation. Besides, there is some time difference of arrival due to the distance between the sound source (e.g., a loudspeaker) and the receivers (e.g., microphones). Since the SemiBlindICA module explicitly models this propagation of the known sounds, the original waveform being played is sufficient information to remove the known signal component.

Typical connection

Figures 6.79 and 6.80 show the typical usage of SemiBlindICA node. In Figure 6.79, the multichannel observed mixture including the known and unknown signal components is connected to INPUT. The known component is connected to REFERENCE terminal. These signals are converted into the time-frequency domain with MultiFFT node. From OUTPUT terminal, the unknown signal component is produced where the REFERENCE is removed from INPUT. This OUTPUT can be connected to LocalizeMUSIC node for further analysis of the unknown component.

Figure 6.80 illustrates an example that uses a stereo wav file. In the wav file, the first (left) channel contains the known signal component while the second (right) channel contains the mixture of known and unknown components. ChannelSelector node is used to extract each channel from a multichannel wav file. In Figure 6.80, the separated unknown signal component is finally saved as another wav file using SaveWavePCM node.

$\includegraphics[width=.75\textwidth ]{fig/modules/SemiBlindICA1}$

Figure 6.79: Typical connection of SemiBlindICA node

$\includegraphics[width=.75\textwidth ]{fig/modules/SemiBlindICA2}$

Figure 6.80: Extraction of unknown signal component with SemiBlindICA node by using the left and right channels

6.3.12.4 Input-output and property of the node

Table 6.75: Parameters of SemiBlindICA

Parameter	Type	Default value	Unit	Description
CHANNEL	`int`	1		Number of channels of INPUT terminal
LENGTH	`int`	512	[pt]	Length of FFT window
INTERVAL	`int`	1		This parameter adjusts the length of the filter to remove the known signal depending on the step size of the short-time Fourier transform. If the overlap of the windows is large (i.e., the step size is small), use a larger value.
TAP_LOWERFREQ	`int`	8	[frame]	The length of the filter at 0 Hz frequency bin
TAP_UPPERFREQ	`int`	4	[frame]	The length of the filter at the Nyquist frequency
DECAY	`float`	0.8		Decaying parameter for the learning rate of each element of the filter to remove the known signal
MU_FILTER	`float`	0.01		Learning rate of the filter to remove the known signal component. Specify a positive value.
MU_REFERENCE	`float`	0.01		Learning rate of the normalization parameter of the known signal. Specify a positive value.
MU_UNKNOWNSIGNAL	`float`	0.01		Learning rate of the normalization parameter of the unknown signal component. Specify a positive value.
IS_ZERO	`float`	0.0001		A threshold to detect an active signal in INPUT. This threshold is applied to the power at each time frequency point in the time-frequency domain.
FILE_FILTER_IN	`string`	`-null`		File name (path) that contains the filter value used to initialize the processing. If “-null” (the default value) is specified, no file is used for the initialization.
FILE_FILTER_OUT	`string`	`-null`		File name (path) to write out the filter values. If “-null” (the default value) is specified, no file is written.
OUTPUT_FREQ	`int`	150	[frame]	The interval to save the filter values in terms of time frames.

Input

INPUT: : Matrix<complex<float> > type. Multichannel observation that contains both known and unknown signal components. These are the complex-valued spectra obtained by MultiFFT node.
REFERENCE: : Matrix<complex<float> > type. Complex-valued spectra of the known signal component.

Output

OUTPUT: : Matrix<complex<float> > type. The output signal is generated by suppressing the known signal REFERENCE from the observed mixture INPUT. This is in the same multichannel format as INPUT terminal.

Parameter

CHANNEL: Number of channels of the input mixture INPUT.
LENGTH: Window length of the short-time Fourier transform. HARK uses 512 [pt] by default.
INTERVAL: This parameter adjusts the length of the filter to remove the known signal component depending on the step size of the short-time Fourier transform. This multirate repeating is introduced to improve the convergence of the filter learning $^{(2)}$ . This value is denoted by $K$ in the math expressions below.
TAP_LOWERFREQ: The length of the filter at 0 [Hz]. The length of the filter accounts for the time difference between the REFERENCE signal and the known component in INPUT, and the reverberation in the observed signal. When the environment contains a long reverberation, this value is set larger accordingly. This value is denoted by $M_ L$ .
TAP_UPPERFREQ: The length of the filter at the Nyquist frequency. The filter length at each frequency bin is determined by the linear interpolation of TAP_LOWERFREQ and TAP_UPPERFREQ. This value is denoted by $M_ U$ .
DECAY: The decaying parameter for the learning rate of each element of the filter to remove the known signal component. In a reverberant environment, such as an indoor situation, the filter elements corresponding to older time frames decay exponentially. Since the filter element values follow an exponential curve, the learning of the filter values becomes efficient when the learning rate for each filter element also decays exponentially. When this value is set to $1$ , the learning rate becomes identical for all filter values. We empirically set this value to 0.6–0.8. This value is denoted by $\lambda$ .
MU_FILTER: The filter values are obtained through the stochastic gradient method. This is the learning rate for the learning procedure. Generally speaking, a large learning rate updates the parameters quickly, at the risk of the parameters fluctuating around the (local) optimum value. On the other hand, a small learning rate avoids this fluctuation, but more parameter updates may be needed before convergence. This is denoted by $\mu _ w$ .
MU_REFERENCE: Learning rate for the normalization parameter of the known signal component. The normalization of the known signal is carried out to accelerate the learning. This value is denoted by $\mu _{\alpha }$ .
MU_UNKNOWNSIGNAL: Learning rate for the normalization parameter of the unknown signal component. Similarly to MU_REFERENCE, this normalization is introduced for an efficient learning. This is denoted by $\mu _{\beta }$ .
IS_ZERO: To save the computational resource, the filter update procedure is omitted if the INPUT contains no signal. This is the threshold to detect the existence of a signal. When the power of the INPUT is below this value, this time frame is ignored. Note that this threshold is applied to the power of the signal in the time-frequency domain, instead of the waveform in the time domain.
FILE_FILTER_IN: File name (path) that contains the initial filter values. If “-null” is specified, no file is used for the initialization.
FILE_FILTER_OUT: File name (path) to write out the filter values. If “-null” is specified, no file is written out.
OUTPUT_FREQ: The interval in terms of time frame to save the filter values.

6.3.12.5 Detail of the node

SemiBlindICA node suppresses the known signal component from the multichannel observation that contains both the known and unknown signals using independent component analysis (ICA). The ICA algorithm is derived based on the mixing process in the time-frequency domain and the statistical independence between the known and unknown signals.

Mixing model and the separation process: SemiBlindICA node uses the following mixing process in a reverberant environment. This model is a linear mixing process in the time-frequency domain. Let $\omega$ be the frequency bin index, $f$ be the time frame index, and $X(\omega , f)$ be the observed signal at frequency $\omega$ and time $f$ . The observation is modeled as

$\displaystyle X(\omega, f) = N(\omega, f) + \sum_{m=0}^{M} H(\omega, m) S(\omega, f-m),$

where $N(\omega , f)$ , $S(\omega ,f)$ , and $H(\omega ,m)$ denote the unknown signal, the known signal, and the propagation coefficient with a time lag of $m$ frames, respectively, and $M$ is the maximum time lag.
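As an illustration of this mixing model, the following is a minimal NumPy sketch for a single frequency bin. The function name `mix_one_bin` is hypothetical and not part of HARK, and the sketch assumes that the known signal is silent before frame 0.

```python
import numpy as np

def mix_one_bin(N, S, H):
    """Return X(f) = N(f) + sum_m H(m) S(f - m) for one frequency bin.

    N, S: complex sequences indexed by time frame f.
    H:    complex sequence of M + 1 propagation coefficients (lag m = 0..M).
    Frames before f = 0 are treated as silent (an assumption of this sketch).
    """
    N = np.asarray(N, dtype=complex)
    S = np.asarray(S, dtype=complex)
    X = N.copy()
    for m, h in enumerate(H):
        if m >= len(S):
            break  # lags longer than the signal contribute nothing
        X[m:] += h * S[:len(S) - m]
    return X

# A unit impulse in the known signal exposes the propagation coefficients.
S = np.array([1, 0, 0, 0], dtype=complex)
N = np.zeros(4, dtype=complex)
H = np.array([0.5, 0.25j])
X = mix_one_bin(N, S, H)
print(X)
```

With a unit impulse as the known signal and no unknown signal, the observation reproduces the coefficients $H(\omega, 0), H(\omega, 1), \cdots$ frame by frame, which is the delayed and reverberated copy that the separation filter has to cancel.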

The separation process is derived as follows using ICA.

$\displaystyle \left( \begin{array}{c} \hat{N}(\omega, f) \\ {\boldsymbol S}(\omega, f) \end{array} \right) = \left( \begin{array}{cc} a(\omega) & -{\boldsymbol w}^T(\omega) \\ {\boldsymbol 0} & {\boldsymbol I} \end{array} \right) \left( \begin{array}{c} X(\omega, f) \\ {\boldsymbol S}(\omega, f) \end{array} \right),$ (129)

$\displaystyle {\boldsymbol S}(\omega, f) = [S(\omega, f), S(\omega, f-K), \cdots, S(\omega, f-M(\omega)K)]^T,$

$\displaystyle {\boldsymbol w}(\omega) = [w_0(\omega), w_1(\omega), \cdots, w_{M(\omega)}(\omega)]^T,$

$\displaystyle M(\omega) = {\rm floor}\left(\frac{\omega}{\omega_{nyq}}(M_U - M_L)\right) + M_L.$

Here, $\omega _{nyq}$ denotes the maximum value of the frequency bin index (corresponding to the Nyquist frequency). The separation filter ${\boldsymbol w}(\omega )$ is an $(M(\omega )+1)$-dimensional vector, and ${\boldsymbol w}^ T$ is the transpose of ${\boldsymbol w}$ . $K$ is the factor for multirate repeating $^{(2)}$ , introduced for efficient convergence of the filter.
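The filter length $M(\omega)$ can be checked numerically. The sketch below evaluates the formula above with the default values TAP_LOWERFREQ = 8 and TAP_UPPERFREQ = 4; the function name `filter_taps` is hypothetical, and the choice of 256 as the Nyquist bin index assumes LENGTH = 512.

```python
import math

def filter_taps(omega, omega_nyq, m_lower=8, m_upper=4):
    """M(omega) = floor(omega / omega_nyq * (M_U - M_L)) + M_L."""
    return math.floor(omega / omega_nyq * (m_upper - m_lower)) + m_lower

omega_nyq = 256  # assumed Nyquist bin index for a 512-point window
taps = [filter_taps(w, omega_nyq) for w in range(omega_nyq + 1)]
print(taps[0], taps[128], taps[256])  # prints: 8 6 4
```

Because of the floor operation, only the 0 Hz bin receives the full TAP_LOWERFREQ length when TAP_LOWERFREQ is larger than TAP_UPPERFREQ; the next bin already drops by one frame.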

Estimation of the separation filter: The separation filter is estimated through ICA processing: the filter is obtained by minimizing the Kullback-Leibler divergence (KLD) between the product of the marginal probability density functions of $\hat{N}$ and ${\boldsymbol S}$ and the joint distribution of these variables. The update procedure is derived by using the nonholonomic constraint $^{(3)}$ and the natural gradient method. Let ${\hat N}_ n$ be the normalized unknown signal. The separation filter to suppress the known signal component is incrementally updated as follows.

$\displaystyle {\boldsymbol w}(\omega, f+1) = {\boldsymbol w}(\omega, f) + {\boldsymbol \mu}_w \Phi_{\hat{N}_n(\omega)}\left(\hat{N}_n(\omega, f)\right) \bar{\boldsymbol S}_n(\omega, f),$

$\displaystyle a(\omega) = 1,$

where $\bar{x}$ denotes the complex conjugate of $x$ . Here, the higher-order correlation is defined as $\Phi _ x(x) = \tanh (|x|)e^{j\theta (x)}$ , where $\theta (x)$ is the phase of $x$ . ${\boldsymbol \mu }_ w$ is defined as follows.

$\displaystyle {\boldsymbol \mu}_w = {\rm diag}\left(\mu_w, \mu_w \lambda^{-1}, \cdots, \mu_w \lambda^{-M(\omega)}\right).$ (130)

This element-wise learning rate has an exponential decay so as to accelerate the learning of the separation filter. From Eq. (129), the unknown signal component $\hat{N}$ is obtained as follows.

$\displaystyle \hat{N}(\omega, f) = X(\omega, f) - {\boldsymbol w}(\omega, f)^T {\boldsymbol S}_n(\omega, f).$ (131)
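The nonlinearity $\Phi$, the element-wise learning rate, the filter update, and Eq. (131) can be sketched in a few lines of standard-library Python. All function names are hypothetical. One assumption is worth stating: the sketch uses rates $\mu_w \lambda^{m}$, which decay for $\lambda < 1$ as the prose describes, whereas the diagonal matrix as printed carries a negative exponent.

```python
import cmath
import math

def phi(x):
    """Phi(x) = tanh(|x|) * exp(j * theta(x)), with theta(x) the phase of x."""
    return math.tanh(abs(x)) * cmath.exp(1j * cmath.phase(x))

def tap_rates(mu_w, lam, n_taps):
    """Per-element learning rates for lags m = 0..n_taps.

    Assumption of this sketch: mu_w * lam**m, i.e. decaying for lam < 1.
    """
    return [mu_w * lam ** m for m in range(n_taps + 1)]

def separate(x, w, s_n):
    """Eq. (131): N_hat = X - w^T S_n (plain transpose, no conjugation)."""
    return x - sum(wi * si for wi, si in zip(w, s_n))

def update_filter(w, rates, n_hat_n, s_n):
    """w <- w + diag(rates) * Phi(N_hat_n) * conj(S_n)."""
    g = phi(n_hat_n)
    return [wi + r * g * si.conjugate() for wi, r, si in zip(w, rates, s_n)]

rates = tap_rates(mu_w=0.01, lam=0.8, n_taps=2)
w = [0j, 0j, 0j]
s_n = [1 + 0j, 0.5j, -0.2 + 0j]   # normalized known signal at lags 0, K, 2K
x = 0.3 + 0.1j + 0.7 * s_n[0]     # observation containing a scaled copy
n_hat = separate(x, w, s_n)
w = update_filter(w, rates, n_hat, s_n)
print(rates, w)
```

Note that $\Phi$ keeps the phase of its argument and only compresses the magnitude, so a single large sample cannot move the filter by more than the learning rate times the reference magnitude.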

To satisfy the nonholonomic constraint, $\hat{N}$ should be normalized. This is because the constraint requires $E[1-\Phi _ x(x\alpha _ x)\bar{x}\bar{\alpha }_ x] = 0$ . In the general framework of KLD minimization based on the natural gradient method, a variable $x$ has a normalization factor $\nu _ x$ that is incrementally updated as follows:

$\displaystyle \nu_x(f+1) = \nu_x(f) + \mu_x[1 - \Phi_x(x(f)\nu_x(f))\bar{x}(f)\bar{\nu}_x(f)]\nu_x(f).$

This is applied to the normalization of the estimated unknown signal component $\hat{N}$ using the factor $\alpha$ as

$\displaystyle \hat{N}_n(f) = \alpha(f)\hat{N}(f),$ (132)

$\displaystyle \alpha(f+1) = \alpha(f) + \mu_\alpha[1 - \Phi_{\hat{N}_n}(\hat{N}_n(f))\bar{\hat{N}}_n(f)]\alpha(f).$ (133)

For efficient convergence of the separation filter, SemiBlindICA node also normalizes the known signal $S$ using the normalization factor $\beta$ . Similarly to the case of $\hat{N}$ , these quantities are updated as

$\displaystyle S_n(f) = \beta(f)S(f),$ (134)

$\displaystyle \beta(f+1) = \beta(f) + \mu_\beta[1 - \Phi_{S_n}(S_n(f))\bar{S}_n(f)]\beta(f),$ (135)

$\displaystyle {\boldsymbol S}_n(f) = [S_n(f), S_n(f-K), \cdots, S_n(f-M(\omega)K)]^T.$
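The behavior of the normalization factors can be illustrated with the general update rule for $\nu_x$. The sketch below is an assumption-laden toy, not HARK code: the function name `update_scale` is hypothetical, and the input is a constant-magnitude signal with a rotating phase. Since $\Phi(y)\bar{y} = |y|\tanh(|y|)$, the factor settles where the scaled magnitude $r = |\nu x|$ satisfies $r\tanh(r) = 1$.

```python
import cmath
import math

def phi(x):
    """Phi(x) = tanh(|x|) * exp(j * theta(x))."""
    return math.tanh(abs(x)) * cmath.exp(1j * cmath.phase(x))

def update_scale(nu, x, mu):
    """nu(f+1) = nu(f) + mu * [1 - Phi(y) * conj(y)] * nu(f), with y = nu(f) * x(f)."""
    y = nu * x
    return nu + mu * (1 - phi(y) * y.conjugate()) * nu

nu = 1.0 + 0j
for f in range(2000):
    x = 2.0 * cmath.exp(1j * 0.3 * f)   # magnitude 2, phase advancing each frame
    nu = update_scale(nu, x, mu=0.01)

r = abs(nu) * 2.0                        # magnitude of the normalized signal
print(r, r * math.tanh(r))
```

The same rule with $x = \hat{N}$ and $\mu_x = \mu_\alpha$ gives Eq. (133), and with $x = S$ and $\mu_x = \mu_\beta$ gives Eq. (135).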

Flow: The main algorithm of SemiBlindICA node consists of Eqs. (130)–(135) for each frequency bin $\omega$ and time frame $f$ . Algorithm 6.1 summarizes the core procedures at a certain frequency bin and a channel.

Algorithm 6.1: The core algorithm of SemiBlindICA node

1: At frequency $\omega$ and frame $f$ , the following procedures are carried out.
2: Calculate $\hat{N}(\omega , f)$ by Eq. (131).
3: Normalize $\hat{N}(\omega , f)$ and ${\boldsymbol S}(\omega , f)$ by Eqs. (132) and (134).
4: Update the separation filter ${\boldsymbol w}(\omega , f)$ using Eq. (130).
5: Update the normalization factors $\alpha (\omega , f)$ and $\beta (\omega , f)$ using Eqs. (133) and (135).
6: Output $\hat{N}(\omega , f)$ .

Algorithm 6.2 presents the overall procedures applied to all the frequency bins and channels. Every time a new time frame is observed, Algorithm 6.1 is applied to each channel and frequency bin.

Algorithm 6.2: Overall procedures

The following procedures are carried out with a new observation of a time frame.

$f \leftarrow f + 1$

for $ch$ in $0, \cdots , C$ do

  for $\omega$ in $0, \cdots , \omega _{nyq}$ do

    Run Algorithm 6.1 for $\omega , ch$ .

  end for

end for
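Algorithms 6.1 and 6.2 can be put together as the following NumPy sketch on synthetic data. This is an illustration under stated assumptions rather than the implementation used by HARK: the class `BinFilter` and the function `process_frame` are hypothetical, the per-element learning rates are assumed to decay as $\lambda^{m}$, channels are indexed from 0 to $C-1$, and the test signals are random numbers instead of speech.

```python
import numpy as np

def phi(x):
    """Phi(x) = tanh(|x|) * exp(j * angle(x))."""
    return np.tanh(np.abs(x)) * np.exp(1j * np.angle(x))

class BinFilter:
    """State for one (channel, frequency bin) pair; hypothetical helper."""

    def __init__(self, n_taps, interval=1, mu_w=0.01, mu_alpha=0.01,
                 mu_beta=0.01, decay=0.8, is_zero=1e-4):
        self.K = interval
        self.w = np.zeros(n_taps + 1, dtype=complex)
        # Assumption: rates decay as decay**m for lag index m.
        self.rates = mu_w * decay ** np.arange(n_taps + 1)
        # Most recent normalized reference frames, newest first.
        self.hist = np.zeros(n_taps * interval + 1, dtype=complex)
        self.alpha, self.beta = 1.0, 1.0
        self.mu_alpha, self.mu_beta, self.is_zero = mu_alpha, mu_beta, is_zero

    def step(self, x, s):
        s_n = self.beta * s                              # Eq. (134)
        self.hist = np.roll(self.hist, 1)
        self.hist[0] = s_n
        S_n = self.hist[::self.K]                        # S_n(f), S_n(f-K), ...
        n_hat = x - self.w @ S_n                         # Eq. (131)
        if abs(x) ** 2 < self.is_zero:                   # skip learning on silence
            return n_hat
        n_n = self.alpha * n_hat                         # Eq. (132)
        self.w = self.w + self.rates * phi(n_n) * np.conj(S_n)   # filter update
        # Phi(y) * conj(y) is real by construction; .real drops rounding noise.
        self.alpha += self.mu_alpha * (1 - (phi(n_n) * np.conj(n_n)).real) * self.alpha
        self.beta += self.mu_beta * (1 - (phi(s_n) * np.conj(s_n)).real) * self.beta
        return n_hat

def process_frame(filters, X, S):
    """One pass of Algorithm 6.2. X: (C, B) observed spectra, S: (B,) reference."""
    out = np.empty_like(X)
    for ch in range(X.shape[0]):
        for b in range(X.shape[1]):
            out[ch, b] = filters[ch][b].step(X[ch, b], S[b])
    return out

# Synthetic check: known signal through a 3-coefficient path plus an unknown signal.
rng = np.random.default_rng(0)
F, C, B = 4000, 1, 2
h = np.array([1.0, 0.5j, -0.25])
S = (rng.standard_normal((F, B)) + 1j * rng.standard_normal((F, B))) / np.sqrt(2)
N = 0.3 * (rng.laplace(size=(F, C, B)) + 1j * rng.laplace(size=(F, C, B)))
known = np.zeros((F, C, B), dtype=complex)
for m, hm in enumerate(h):
    known[m:, 0, :] += hm * S[:F - m, :]
X = N + known

filters = [[BinFilter(n_taps=3) for _ in range(B)] for _ in range(C)]
out = np.array([process_frame(filters, X[f], S[f]) for f in range(F)])

before = np.mean(np.abs(known[-500:]) ** 2)          # known-signal power in INPUT
residual = np.mean(np.abs(out[-500:] - N[-500:]) ** 2)  # what remains in OUTPUT
print(f"known-signal power in input: {before:.3f}, remaining in output: {residual:.3f}")
```

Each (channel, bin) pair keeps its own filter and normalization factors, so the loops in `process_frame` are independent of one another; the sketch keeps them sequential only for readability.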

6.3.12.6 Reference

(1) R. Takeda et al., “Barge-in-able Robot Audition Based on ICA and Missing Feature Theory,” in Proc. of IROS, pp. 1718–1723, 2008.
(2) H. Kiya et al., “Improvement of convergence speed for subband adaptive digital filter using the multirate repeating method,” Electronics and Communications in Japan, Part III, Vol. 78, no. 10, pp. 37–45, 1995.
(3) C. Choi et al., “Natural gradient learning with nonholonomic constraint for blind deconvolution of multiple channels,” in Proc. of Int’l Workshop on ICA and BSS, pp. 371–376.