HARK Document Version 2.5.0. (Revision: 9008) : MSNR

6.3.9 MSNR

6.3.9.1 Outline of the node

This node performs sound source separation using the maximum signal-to-noise ratio (Maximum SNR) method. The algorithm updates the separation matrix so that the ratio of the gain in the target sound source direction to the gain in the known noise direction is maximized. Transfer functions from the sound source to the microphones are not required in advance; however, section information for the sound source (the detection result of the utterance section) is necessary.

Node inputs are:

Multi-channel complex spectrum of mixed sound,
Direction of localized sound sources.

Node outputs are a set of complex spectra, one for each separated sound.

6.3.9.2 Necessary files

No files are required.

6.3.9.3 Usage

When to use

This node performs sound source separation for the sound source directions obtained with a microphone array. The sound source direction can be either a value estimated by sound source localization or a constant value. Because this node uses the ratio of the gain for the target sound source to the gain for the known noise, it requires information on the period in which the known noise is present. The node treats any time period with no sound source direction input as the period containing only noise (the Noise Period).

$\includegraphics[width=.5\textwidth ]{fig/modules/MSNR_NoisePeriod.png}$

Figure 6.65: Noise Period vs Source Period
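
The Noise Period rule described above can be sketched as follows. This is a minimal illustration, not HARK code: the per-frame lists of source IDs and the function name `label_periods` are hypothetical stand-ins for the Source objects that arrive at INPUT_SOURCES.

```python
def label_periods(sources_per_frame):
    """Label a frame 'source' if any localized source is present, else 'noise'."""
    return ["source" if sources else "noise" for sources in sources_per_frame]


# Hypothetical localization result: one list of source IDs per frame.
frames = [[], [], [0], [0], [0, 1], [], []]
labels = label_periods(frames)
for index, label in enumerate(labels):
    print(index, label)
```

Frames with an empty list are the ones in which only the known noise is assumed to be present.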

Typical connection

Figure 6.66 shows a connection example for MSNR. The node has two inputs:

INPUT_FRAMES takes a multi-channel complex spectrum containing the mixture of sounds, produced by, for example, MultiFFT.
INPUT_SOURCES takes the results of sound source localization, produced by, for example, LocalizeMUSIC or ConstantLocalization.

The output is the separated signals.

$\includegraphics[width=.8\textwidth ]{fig/modules/MSNR.png}$

Figure 6.66: Example of connection of the MSNR

6.3.9.4 Input-output and property of the node

Input

INPUT_FRAMES: Matrix<complex<float> > type. Multi-channel complex spectra, one for the input waveform of each microphone. Rows correspond to channels and columns correspond to frequency bins.
INPUT_SOURCES: Vector<ObjectRef> type. A Vector of Source type objects in which sound source localization results are stored. Typically, this takes the output of SourceIntervalExtender connected to SourceTracker.

Output

OUTPUT: Map<int, ObjectRef> type. Pairs of the sound source ID of a separated sound and the 1-channel complex spectrum of that sound (Vector<complex<float> > type).

Parameter

LENGTH: int type. Analysis frame length [samples], which must be equal to the value used at the preceding nodes (e.g. AudioStreamFromMic or MultiFFT). The default value is 512 [samples].
ADVANCE: int type. Shift length of a frame [samples], which must be equal to the value used at the preceding nodes (e.g. AudioStreamFromMic or MultiFFT). The default value is 160 [samples].
SAMPLING_RATE: int type. Sampling frequency of the input waveform [Hz]. The default value is 16000 [Hz].
LOWER_BOUND_FREQUENCY: int type. The minimum frequency used for separation processing. For frequencies below this value, no processing is performed and the output spectrum is 0. Specify a value between 0 and half of the sampling frequency.
UPPER_BOUND_FREQUENCY: int type. The maximum frequency used for separation processing. For frequencies above this value, no processing is performed and the output spectrum is 0. UPPER_BOUND_FREQUENCY must be greater than LOWER_BOUND_FREQUENCY.
DECOMPOSITION_ALGORITHM: string type. The decomposition algorithm used for sound source separation. GEVD is generalized eigenvalue decomposition and GSVD is generalized singular value decomposition. GEVD suppresses noise better than GSVD but takes longer to compute. Select the algorithm according to the purpose and the computing environment.
ALPHA: float type. The step size for updating the correlation matrices. The default value is 0.99.
ENABLE_DEBUG: bool type. The default value is false. Setting the value to true outputs the separation status to the standard output.

Table 6.54: Parameter list of MSNR

Parameter name	Type	Default value	Unit	Description
LENGTH	`int`	512	[pt]	Analysis frame length.
ADVANCE	`int`	160	[pt]	Shift length of frame.
SAMPLING_RATE	`int`	16000	[Hz]	Sampling frequency.
LOWER_BOUND_FREQUENCY	`int`	0	[Hz]	The minimum frequency value used for separation processing.
UPPER_BOUND_FREQUENCY	`int`	8000	[Hz]	The maximum frequency value used for separation processing.
DECOMPOSITION_ALGORITHM	`string`	GEVD		The decomposition algorithm.
ALPHA	`float`	0.99		The step size for updating correlation matrices.
ENABLE_DEBUG	`bool`	`false`		Whether to output the separation status to standard output.
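
To illustrate how LOWER_BOUND_FREQUENCY and UPPER_BOUND_FREQUENCY restrict processing, the following sketch maps the two bounds to frequency bin indices. It assumes that bin k is centered at k * SAMPLING_RATE / LENGTH, that there are LENGTH / 2 + 1 bins, and that both bounds are inclusive. These conventions and the function name `active_bins` are illustrative assumptions and are not taken from the HARK implementation.

```python
def active_bins(length=512, sampling_rate=16000, lower_hz=0, upper_hz=8000):
    """Return the indices of the frequency bins inside [lower_hz, upper_hz]."""
    if not 0 <= lower_hz < upper_hz <= sampling_rate / 2:
        raise ValueError("require 0 <= lower < upper <= sampling_rate / 2")
    n_bins = length // 2 + 1           # assumed one-sided spectrum size
    bin_hz = sampling_rate / length    # assumed spacing between bin centers
    return [k for k in range(n_bins) if lower_hz <= k * bin_hz <= upper_hz]


# With the default parameters every bin is processed.
print(len(active_bins()))

# A narrower, telephone-like band keeps only part of the spectrum;
# bins outside the returned list would be output as 0.
band = active_bins(lower_hz=300, upper_hz=3400)
print(band[0], band[-1])
```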

6.3.9.5 Details of the node

Technical details: Refer to the reference in Section 6.3.9.6 for details.

Brief explanation of sound source separation: Table 6.55 shows the notation of variables used in sound source separation problems. Because source separation is performed frame by frame in the frequency domain, all variables are complex-valued. The separation is performed for each of the $K$ frequency bins ( $1 \leq k \leq K$ ); the index $k$ is omitted from the notation below. Let $N$ , $M$ , and $f$ denote the number of sound sources, the number of microphones, and the frame index, respectively.

Table 6.55: Notation of variables

Variables	Description
$\boldsymbol {S}(f) = \left[S_1(f), \dots , S_ N(f)\right]^ T$	Complex spectrum of target sound sources at the $f$ -th frame.
$\boldsymbol {X}(f) = \left[X_1(f), \dots , X_ M(f)\right]^ T$	Complex spectrum of a microphone observation at the $f$ -th frame, which corresponds to INPUT_FRAMES.
$\boldsymbol {N}(f) = \left[N_1(f), \dots , N_ M(f)\right]^ T$	Complex spectrum of added noise.
$\boldsymbol {H} = \left[ \boldsymbol {H}_1, \dots , \boldsymbol {H}_ N \right] \in \mathbb {C}^{M \times N}$	Transfer function matrix from the $n$ -th sound source ( $1 \leq n \leq N$ ) to the $m$ -th microphone ( $1 \leq m \leq M$ )
$\boldsymbol {K}(f) \in \mathbb {C}^{M \times M}$	Correlation matrix of known noise.
$\boldsymbol {W}(f) = \left[ \boldsymbol {W}_1, \dots , \boldsymbol {W}_ M \right] \in \mathbb {C}^{N \times M}$	Separation matrix at the $f$ -th frame.
$\boldsymbol {Y}(f) = \left[Y_1(f), \dots , Y_ N(f)\right]^ T$	Complex spectrum of separated signals.

The signal processing uses the following linear model:

$\displaystyle \boldsymbol {X}(f) = \boldsymbol {H}\boldsymbol {S}(f) + \boldsymbol {N}(f)$ (77)

The purpose of the separation is to estimate $\boldsymbol {W}(f)$ based on the following equation:

$\displaystyle \boldsymbol {Y}(f) = \boldsymbol {W}(f)\boldsymbol {X}(f)$ (78)

so that $\boldsymbol {Y}(f)$ approaches $\boldsymbol {S}(f)$ .
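
As a small worked example of equations (77) and (78), the sketch below mixes two sources with a known 2 x 2 transfer function matrix, sets the noise to zero, and separates them with the inverse of that matrix. The values of H and S are made up for illustration. MSNR itself does not know H; the inverse is used here only to show what an ideal W would achieve.

```python
def matvec(A, x):
    """Multiply matrix A (list of rows) by column vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def inverse_2x2(A):
    """Invert a 2 x 2 complex matrix."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


# Hypothetical transfer functions and source spectra for one frequency bin.
H = [[1 + 0j, 0.5 + 0.5j],
     [0.5 - 0.5j, 1 + 0j]]
S = [1 + 2j, -1 + 0.5j]

X = matvec(H, S)      # equation (77) with N(f) = 0
W = inverse_2x2(H)    # ideal separation matrix for this noise-free case
Y = matvec(W, X)      # equation (78)

print(Y)
```

With noise present, or with more microphones than sources, no exact inverse exists, which is why W(f) is estimated by optimizing a criterion instead.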

The evaluation function $J_{\textrm{MSNR}}(\boldsymbol {W}(f))$ for updating the separation matrix is defined using the target source period, in which source directions are received at the INPUT_SOURCES input terminal, and the noise period, in which no source direction is received.

Assuming that the correlation matrix of the target sound signal is $\boldsymbol {R}_{ss}(f)$ and the correlation matrix of the noise signal is $\boldsymbol {R}_{nn}(f)$ , the evaluation function $J_{\textrm{MSNR}}(\boldsymbol {W}(f))$ for updating the separation matrix is expressed as follows.

$\displaystyle J_{\textrm{MSNR}}(\boldsymbol {W}(f)) = \frac{\boldsymbol {W}(f)\boldsymbol {R}_{ss}(f)\boldsymbol {W}(f)^ H}{\boldsymbol {W}(f)\boldsymbol {R}_{nn}(f)\boldsymbol {W}(f)^ H}$ (79)
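
The ratio in equation (79) can be evaluated for a single row vector w of the separation matrix, and its maximizer is the principal generalized eigenvector of the pair (R_ss, R_nn). The sketch below is restricted to two microphones and, to stay dependency-free, uses power iteration on R_nn^{-1} R_ss in place of a full generalized eigenvalue or singular value decomposition routine. The matrices and all function names are hypothetical, and this is not the HARK implementation.

```python
import math


def matvec(A, x):
    """Multiply matrix A (list of rows) by column vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def quad_form(w, R):
    """Return the real value w R w^H for a row vector w and Hermitian R."""
    R_wh = matvec(R, [c.conjugate() for c in w])
    return sum(a * b for a, b in zip(w, R_wh)).real


def snr_ratio(w, R_ss, R_nn):
    """Evaluate the ratio of equation (79) for one row vector w."""
    return quad_form(w, R_ss) / quad_form(w, R_nn)


def inverse_2x2(A):
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def max_snr_filter(R_ss, R_nn, iterations=200):
    """Power iteration for the dominant eigenvector v of R_nn^{-1} R_ss."""
    R_nn_inv = inverse_2x2(R_nn)
    v = [1 + 0j, 1 + 0j]
    for _ in range(iterations):
        v = matvec(R_nn_inv, matvec(R_ss, v))
        norm = math.sqrt(sum(abs(c) ** 2 for c in v))
        v = [c / norm for c in v]
    return [c.conjugate() for c in v]  # row vector w = v^H


# Hypothetical Hermitian correlation matrices for one frequency bin.
R_ss = [[2 + 0j, 1j], [-1j, 2 + 0j]]
R_nn = [[2 + 0j, 0j], [0j, 1 + 0j]]

w = max_snr_filter(R_ss, R_nn)
# For these matrices the largest generalized eigenvalue is (3 + sqrt(3)) / 2.
print(snr_ratio(w, R_ss, R_nn))
```

Any other choice of w, for example selecting a single microphone, gives a ratio no larger than the value obtained for the principal generalized eigenvector.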

In this node, the $\boldsymbol {W}(f)$ that maximizes $J_{\textrm{MSNR}}(\boldsymbol {W}(f))$ is obtained using generalized eigenvalue decomposition or generalized singular value decomposition. The correlation matrix of the target signal, $\boldsymbol {R}_{ss}(f)$ , is updated with the correlation matrix $\boldsymbol {R}_{xx}(f)$ computed from the input signal during the signal period (the period in which a sound source is present at the INPUT_SOURCES input terminal, that is, the target sound exists) as follows.

$\displaystyle \boldsymbol {R}_{ss}(f+1) = \alpha \boldsymbol {R}_{ss}(f) + (1-\alpha )\boldsymbol {R}_{xx}(f)$ (80)

On the other hand, the correlation matrix of the noise, $\boldsymbol {R}_{nn}(f)$ , is updated with the correlation matrix $\boldsymbol {R}_{xx}(f)$ computed from the input signal during the noise period (the period in which no sound source is present at the INPUT_SOURCES input terminal) as follows.

$\displaystyle \boldsymbol {R}_{nn}(f+1) = \alpha \boldsymbol {R}_{nn}(f) + (1-\alpha )\boldsymbol {R}_{xx}(f)$ (81)

The $\alpha$ in equations (80) and (81) is specified by the ALPHA property. $\boldsymbol {W}(f)$ is updated from $\boldsymbol {R}_{ss}(f)$ and $\boldsymbol {R}_{nn}(f)$ , and the separation is performed with the updated matrix.
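
The recursive updates of equations (80) and (81) can be sketched as below. The sketch assumes that R_xx(f) is the instantaneous outer product X(f) X(f)^H of the current observation and that exactly one of the two matrices is updated per frame, depending on whether a source is present. Both points, and the function names, are assumptions made for illustration rather than details confirmed by this document.

```python
def outer(x):
    """Instantaneous correlation matrix x x^H of one observation vector."""
    return [[a * b.conjugate() for b in x] for a in x]


def smooth(R, R_xx, alpha):
    """Return alpha * R + (1 - alpha) * R_xx, element by element."""
    return [[alpha * r + (1 - alpha) * rx for r, rx in zip(row, row_x)]
            for row, row_x in zip(R, R_xx)]


def update_correlations(R_ss, R_nn, x, source_present, alpha=0.99):
    """Apply equation (80) in a source period, equation (81) otherwise."""
    R_xx = outer(x)
    if source_present:
        return smooth(R_ss, R_xx, alpha), R_nn
    return R_ss, smooth(R_nn, R_xx, alpha)


# Two-microphone example with made-up observations for one frequency bin.
zeros = [[0j, 0j], [0j, 0j]]
R_ss, R_nn = zeros, zeros
observations = [([0.1 + 0j, 0.1j], False),   # noise period
                ([1 + 0j, 1j], True),        # source period
                ([0.9 + 0j, 1.1j], True)]    # source period
for x, source_present in observations:
    R_ss, R_nn = update_correlations(R_ss, R_nn, x, source_present)
print(R_ss)
print(R_nn)
```

A value of ALPHA close to 1, such as the default 0.99, makes each new frame contribute only slightly, so the matrices change slowly.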

Troubleshooting: Basically the same as the troubleshooting for the GHDSS node.

6.3.9.6 Reference

P. W. Howells, "Intermediate Frequency Sidelobe Canceller", U.S. Patent No. 3202990, 1965.