HARK version 1.1.0 Document : PostFilter

6.3.7 PostFilter

Outline of the node

This node post-processes the separated complex spectra output by the sound source separation node GHDSS in order to improve speech recognition accuracy. At the same time, it generates the noise power spectra used to generate Missing Feature Masks.

Necessary files

No files are required.

When to use

Use this node to shape the spectra separated by the GHDSS node and to generate the noise spectra required for Missing Feature Mask generation.

Typical connection

Figure 382 shows an example of a connection for the PostFilter node. The output of the GHDSS node is connected to the INPUT_SPEC input and the output of the BGNEstimator node is connected to the INIT_NOISE_POWER input. The figure also shows typical output connections:

Speech feature extraction from separated sound (OUTPUT_SPEC) (MSLSExtraction node)
Generation of Missing Feature Masks, at speech recognition time, from the separated sound and the power of the noise contained in it (EST_NOISE_POWER) (MFMGeneration node)

$\includegraphics[width=.9\textwidth ]{fig/modules/PostFilter}$

Input-output and property of the node

Input

INPUT_SPEC: Map<int, ObjectRef> type. The same type as the output from the GHDSS node. A pair of a sound source ID and a complex spectrum of the separated sound as Vector<complex<float> > type data.
INIT_NOISE_POWER: Matrix<float> type. The power spectrum of the stationary noise estimated by the BGNEstimator node.

Output

OUTPUT_SPEC: Map<int, ObjectRef> type. The Object is the complex spectrum from the input INPUT_SPEC, with noise removed.
EST_NOISE_POWER: Map<int, ObjectRef> type. For each separated sound in OUTPUT_SPEC, the estimated power of the noise contained in it, as Vector<float> type data, paired with the sound source ID.

Parameter

Table 6.38: Parameter list of PostFilter (first half)

Parameter name	Type	Default value	Description
MCRA_SETTING	`bool`	`false`	When the user sets parameters for the MCRA estimation, which is a noise removal method, select `true`.
MCRA_SETTING			The following are valid when MCRA_SETTING is set to `true`
STATIONARY_NOISE_FACTOR	`float`	1.2	Coefficient at the time of stationary noise estimation.
SPEC_SMOOTH_FACTOR	`float`	0.5	Smoothing coefficient of an input power spectrum.
AMP_LEAK_FACTOR	`float`	1.5	Leakage coefficient.
STATIONARY_NOISE_MIXTURE_FACTOR	`float`	0.98	Mixing ratio of stationary noise.
LEAK_FLOOR	`float`	0.1	Minimum value of leakage noise.
BLOCK_LENGTH	`int`	80	Detection time width.
VOICEP_THRESHOLD	`int`	3	Threshold value of speech presence judgment.
EST_LEAK_SETTING	`bool`	`false`	When the user sets parameters related to the leakage rate estimation, select `true`.
EST_LEAK_SETTING			The following are valid when EST_LEAK_SETTING is set to `true`.
LEAK_FACTOR	`float`	0.25	Leakage rate.
OVER_CANCEL_FACTOR	`float`	1	Leakage rate weighting factor.
EST_REV_SETTING	`bool`	`false`	When the user sets parameters related to the reverberation component estimation, select `true`.
EST_REV_SETTING			The following are valid when EST_REV_SETTING is set to `true`.
REVERB_DECAY_FACTOR	`float`	0.5	Damping coefficient of reverberant power.
DIRECT_DECAY_FACTOR	`float`	0.2	Damping coefficient of a separated spectrum.
EST_SN_SETTING	`bool`	`false`	When the user sets parameters related to the SN ratio estimation, select `true`.
EST_SN_SETTING			The following are valid when EST_SN_SETTING is set to `true`.
PRIOR_SNR_FACTOR	`float`	0.8	Ratio of priori and posteriori SNRs.
VOICEP_PROB_FACTOR	`float`	0.9	Amplitude coefficient of the probability of speech presence.
MIN_VOICEP_PROB	`float`	0.05	Probability of the minimum speech presence.
MAX_PRIOR_SNR	`float`	100	Maximum value of preliminary SNR.
MAX_OPT_GAIN	`float`	20	Maximum value of the optimal gain intermediate variable v.
MIN_OPT_GAIN	`float`	6	Minimum value of the optimal gain intermediate variable v.

Table 6.39: Parameter list of PostFilter (latter half)

Parameter name	Type	Default value	Description
EST_VOICEP_SETTING	`bool`	`false`	When the user sets parameters related to the speech probability estimation, select `true`.
EST_VOICEP_SETTING			The following are valid when EST_VOICEP_SETTING is set to `true`.
PRIOR_SNR_SMOOTH_FACTOR	`float`	0.7	Time smoothing coefficient.
MIN_FRAME_SMOOTH_SNR	`float`	0.1	Minimum value of the frequency smoothing SNR (frame).
MAX_FRAME_SMOOTH_SNR	`float`	0.316	Maximum value of the frequency smoothing SNR (frame).
MIN_GLOBAL_SMOOTH_SNR	`float`	0.1	Minimum value of the frequency smoothing SNR (global).
MAX_GLOBAL_SMOOTH_SNR	`float`	0.316	Maximum value of the frequency smoothing SNR (global).
MIN_LOCAL_SMOOTH_SNR	`float`	0.1	Minimum value of the frequency smoothing SNR (local).
MAX_LOCAL_SMOOTH_SNR	`float`	0.316	Maximum value of the frequency smoothing SNR (local).
UPPER_SMOOTH_FREQ_INDEX	`int`	99	Frequency smoothing upper limit bin index.
LOWER_SMOOTH_FREQ_INDEX	`int`	8	The frequency smoothing lower limit bin index.
GLOBAL_SMOOTH_BANDWIDTH	`int`	29	Frequency smoothing band width (global).
LOCAL_SMOOTH_BANDWIDTH	`int`	5	The frequency smoothing band width (local).
FRAME_SMOOTH_SNR_THRESH	`float`	1.5	Threshold value of frequency smoothing SNR.
MIN_SMOOTH_PEAK_SNR	`float`	1.0	Minimum value of the frequency smoothing SNR peak.
MAX_SMOOTH_PEAK_SNR	`float`	10.0	Maximum value of the frequency smoothing SNR peak.
FRAME_VOICEP_PROB_FACTOR	`float`	0.7	Speech probability smoothing coefficient (frame).
GLOBAL_VOICEP_PROB_FACTOR	`float`	0.9	Speech probability smoothing coefficient (global).
LOCAL_VOICEP_PROB_FACTOR	`float`	0.9	Speech probability smoothing coefficient (local).
MIN_VOICE_PAUSE_PROB	`float`	0.02	Minimum value of speech quiescent probability.
MAX_VOICE_PAUSE_PROB	`float`	0.98	Maximum value of speech quiescent probability.

Details of the node

$\includegraphics[width=0.7\textwidth ]{fig/modules/PF-fc-overview.eps}$

Figure 6.49: Flowchart of PostFilter

The subscripts used in the equations are based on the definitions in Table 6.1. Moreover, the time frame index $f$ is abbreviated in the following equations unless especially needed. Figure 6.49 shows a flowchart of the PostFilter node. A separated sound spectrum from the GHDSS node and a stationary noise power spectrum of the BGNEstimator node are obtained as inputs. Outputs are the separated sound spectrum for which the speech is emphasized, and a power spectrum of noise mixed with the separated sound. The processing flow is as follows.

Noise estimation
SNR estimation
Speech presence probability estimation
Noise removal

1) Noise estimation

$\includegraphics[width=0.7\textwidth ]{fig/modules/PF-fc-noise.eps}$

Figure 6.50: Procedure of noise estimation

Figure 6.50 shows the processing flow of noise estimation. The PostFilter node processes three kinds of noise:
a) Stationary noise, to which the contact points of the microphones contribute,
b) Sound from other sound sources that could not be completely removed (leakage noise),
c) Reverberation from the previous frame.

The noise contained in the final separated sound ${\mbox{\boldmath {$\lambda $}}}(f, k_ i)$ is obtained by the following equation.

$\displaystyle {\mbox{\boldmath {$\lambda $}}}(f,k_ i) = {\mbox{\boldmath {$\lambda $}}}^{sta}(f,k_ i) + {\mbox{\boldmath {$\lambda $}}}^{leak}(f,k_ i) + {\mbox{\boldmath {$\lambda $}}}^{rev}(f-1,k_ i)$ (56)

Here, ${\mbox{\boldmath {$\lambda $}}}^{sta}(f,k_ i), {\mbox{\boldmath {$\lambda $}}}^{leak}(f,k_ i)$ and ${\mbox{\boldmath {$\lambda $}}}^{rev}(f-1,k_ i)$ indicate stationary noise, leakage noise and reverberation from the previous frame, respectively.

1-a) Stationary noise estimation by MCRA method

The parameters used in 1-a) are based on Table 6.40.

Table 6.40: Definition of variable

Parameter	Description, Corresponding parameter
${\mbox{\boldmath {$Y$}}}(k_ i) = \left[Y_1(k_ i),\dots , Y_ N(k_ i) \right]^ T$	Complex spectrum of separated sound corresponding to the frequency bin $k_ i$
${\mbox{\boldmath {$\lambda $}}}^{init}(k_ i) = \left[\lambda ^{init}_{1}(k_ i),\dots , \lambda ^{init}_ N(k_ i)\right]^ T$	Initial value power spectrum used for the stationary noise estimation
${\mbox{\boldmath {$\lambda $}}}^{sta}(k_ i) = \left[\lambda ^{sta}_{1}(k_ i),\dots , \lambda ^{sta}_ N(k_ i) \right]^ T$	Estimated stationary noise power spectrum.
$\alpha _ s$	Smoothing coefficient of the input power spectrum. Parameter SPEC_SMOOTH_FACTOR. The default value is 0.5
${\mbox{\boldmath {$S$}}}^{tmp}(k_ i)= \left[S^{tmp}_1(k_ i),\dots , S^{tmp}_ N(k_ i) \right]$	Temporary parameter for minimum power calculation.
${\mbox{\boldmath {$S$}}}^{min}(k_ i)= \left[S^{min}_1(k_ i),\dots , S^{min}_ N(k_ i) \right]$	The parameter that maintains the minimum power.
$L$	Number of frames for which ${\mbox{\boldmath {$S$}}}^{tmp}$ is held. Parameter BLOCK_LENGTH. The default value is 80
$\delta$	Threshold value of speech presence judgment. Parameter VOICEP_THRESHOLD. The default value is 3.0
$\alpha _ d$	Mixing ratio of estimated stationary noise. Parameter STATIONARY_NOISE_MIXTURE_FACTOR. The default value is 0.98
${\mbox{\boldmath {$S$}}}^{leak}(k_ i)$	Estimated power spectrum of the leakage noise contained in the separated sound
$q$	Coefficient for when leakage noise is removed from the input separated sound power. Parameter AMP_LEAK_FACTOR. The default value is 1.5.
$S_{floor}$	Minimum value of leakage noise. Parameter LEAK_FLOOR. The default value is 0.1.
$r$	Coefficient at the time of stationary noise estimation. Parameter STATIONARY_NOISE_FACTOR. The default value is 1.2

First, calculate the power spectrum ${\mbox{\boldmath {$S$}}}(f,k_ i) = \left[S_1(f,k_ i),\dots , S_ N(f,k_ i)\right]$ obtained by smoothing the input power spectrum with the smoothed power of the previous frame.

$\displaystyle S_ n(f,k_ i) = \alpha _ s S_ n(f-1,k_ i)+ (1 - \alpha _ s)|Y_ n(k_ i)|^2$ (57)

Next, update ${\mbox{\boldmath {$S$}}}^{tmp}$ , ${\mbox{\boldmath {$S$}}}^{min}$ .

	$\displaystyle S^{min}_ n(f,k_ i)$	$\displaystyle =$	$\displaystyle \left\{ \begin{array}{cr} \min \{ S^{min}_ n(f-1,k_ i),S_ n(f,k_ i)\} & \mathrm{if}\ \ f \neq nL\\ \min \{ S^{tmp}_ n(f-1,k_ i),S_ n(f,k_ i)\} & \mathrm{if}\ \ f = nL \end{array}\right.,$		(58)
	$\displaystyle S^{tmp}_ n(f,k_ i)$	$\displaystyle =$	$\displaystyle \left\{ \begin{array}{cr} \min \{ S^{tmp}_ n(f-1,k_ i),S_ n(f,k_ i)\} & \mathrm{if}\ \ f \neq nL\\ S_ n(f,k_ i) & \mathrm{if}\ \ f = nL \end{array}\right.,$		(59)

Here, $n$ in $nL$ indicates an arbitrary integer. ${\mbox{\boldmath {$S$}}}^{min}$ holds the minimum power since the noise estimation began, and ${\mbox{\boldmath {$S$}}}^{tmp}$ holds the minimum power over recent frames. ${\mbox{\boldmath {$S$}}}^{tmp}$ is reset every $L$ frames. Next, judge whether the frame contains speech based on the ratio of the smoothed power of the input separated sound to the minimum power.

	$\displaystyle S_ n^{r}(k_ i)$	$\displaystyle =$	$\displaystyle \frac{S_ n(k_ i)}{S^{min}_ n(k_ i)},$		(60)
	$\displaystyle I_ n(k_ i)$	$\displaystyle =$	$\displaystyle \left\{ \begin{array}{cr} 1 & \mathrm{if}\ \ S_ n^ r(k_ i) > \delta \\ 0 & \mathrm{if}\ \ S_ n^ r(k_ i) \leq \delta \end{array} \right.$		(61)

When speech is included, $I_ n(k_ i)$ is 1 and when it is not included, it is 0. Based on this result, we determine the mixing ratio $\alpha _{d,n}^ C(k_ i)$ of the frame’s estimated stationary noise.

$\displaystyle \alpha _{d,n}^ C(k_ i) = (\alpha _ d - 1)I_ n(k_ i)+ 1.$ (62)
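
The recursion in Equations (57) to (62) can be sketched in Python for a single separated sound $n$ and a single frequency bin $k_i$. This is an illustrative sketch under stated assumptions, not the HARK implementation: the function name `mcra_step` and the `state` dictionary are hypothetical, the second minimum update is read as the update of $S^{tmp}$, and $S^{min}$ is assumed to be strictly positive.

```python
def mcra_step(state, power, frame, alpha_s=0.5, block_len=80,
              delta=3.0, alpha_d=0.98):
    """One update of Equations (57)-(62) for one source and one bin.

    state : dict with keys "S", "S_min", "S_tmp" from the previous frame
            (hypothetical container, not part of HARK)
    power : |Y_n(k_i)|^2 of the current frame
    frame : frame index f
    Returns (new_state, speech_flag, mixing_ratio).
    """
    # Equation (57): recursive smoothing of the input power
    s = alpha_s * state["S"] + (1.0 - alpha_s) * power

    if frame % block_len != 0:
        # Case f != nL: keep tracking both minima
        s_min = min(state["S_min"], s)
        s_tmp = min(state["S_tmp"], s)
    else:
        # Case f == nL: hand the short-term minimum over and restart it
        s_min = min(state["S_tmp"], s)
        s_tmp = s

    # Equations (60) and (61): speech presence judgment
    ratio = s / s_min
    speech = 1 if ratio > delta else 0

    # Equation (62): mixing ratio for the stationary noise update
    alpha_c = (alpha_d - 1.0) * speech + 1.0

    return {"S": s, "S_min": s_min, "S_tmp": s_tmp}, speech, alpha_c
```

For example, starting from `S = S_min = S_tmp = 1.0`, an input power of 9.0 gives a smoothed power of 5.0. The ratio 5.0 exceeds the default threshold 3, so the frame is judged to contain speech and the mixing ratio becomes 0.98.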

Next, subtract leakage noise contained in the power spectrum of the separated sound.

	$\displaystyle S^{leak}_ n(k_ i)$	$\displaystyle =$	$\displaystyle \sum _{p=1}^{N}\|Y_ p(k_ i)\|^2 - \|Y_ n(k_ i)\|^2,$		(63)
	$\displaystyle S_ n^0(k_ i)$	$\displaystyle =$	$\displaystyle \|Y_ n(k_ i)\|^2 - q S^{leak}_ n(k_ i),$		(64)

Here, when $S_ n^0(k_ i) < S_{floor}$ , the value is replaced as follows.

$\displaystyle S_ n^0(k_ i) = S_{floor}$ (65)

Obtain the stationary noise of the current frame by mixing $S_ n^0(f,k_ i)$ , the power spectrum with leakage noise removed, with either the estimated stationary noise of the previous frame ${\mbox{\boldmath {$\lambda $}}}^{sta}(f-1,k_ i)$ or ${\mbox{\boldmath {$\lambda $}}}^{init}(f,k_ i)$ , the output of BGNEstimator. The latter is used when the source position has changed.

$\displaystyle \lambda ^{sta}_ n(f,k_ i) = \left\{ \begin{array}{cl} \alpha _{d,n}^ C(k_ i) \lambda ^{sta}_ n(f-1,k_ i)+ \left(1-\alpha _{d,n}^ C(k_ i)\right) r S_ n^0(f,k_ i) & \mathrm{if\ no\ change\ in\ source\ position}\\ \alpha _{d,n}^ C(k_ i) \lambda ^{init}_ n(f,k_ i) + \left(1-\alpha _{d,n}^ C(k_ i)\right) r S_ n^0(f,k_ i) & \mathrm{if\ change\ in\ source\ position} \end{array} \right.$ (66)
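
As a worked illustration of Equations (63) to (66), the following Python sketch computes the stationary noise for one frequency bin. The function name `stationary_noise_update` and its argument layout are assumptions made for this example. The caller is assumed to pass either $\lambda^{sta}_n(f-1,k_i)$ or $\lambda^{init}_n(f,k_i)$ as `prev_noise`, depending on whether the source position changed.

```python
def stationary_noise_update(powers, n, prev_noise, alpha_c,
                            q=1.5, s_floor=0.1, r=1.2):
    """Stationary noise of source n in one bin, after Equations (63)-(66).

    powers     : list of |Y_p(k_i)|^2 for all separated sounds p
    n          : index of the separated sound being processed
    prev_noise : lambda^sta of the previous frame, or lambda^init
                 when the source position has changed
    alpha_c    : mixing ratio from Equation (62)
    """
    # Equation (63): power of all other separated sounds
    s_leak = sum(powers) - powers[n]

    # Equations (64) and (65): subtract leakage, then apply the floor
    s0 = max(powers[n] - q * s_leak, s_floor)

    # Equation (66): mix with the previous (or initial) noise estimate
    return alpha_c * prev_noise + (1.0 - alpha_c) * r * s0
```

With `powers = [4.0, 1.0]`, the first source keeps `4.0 - 1.5 * 1.0 = 2.5` after leakage subtraction, while the second falls below the floor and is set to 0.1.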

1-b) Leakage noise estimation

The variables used in 1-b) are based on Table 6.41.

Table 6.41: Definition of variable

Variable	Description, Corresponding parameter
${\mbox{\boldmath {$\lambda $}}}^{leak}(k_ i)$	Power spectrum of leakage noise. Vector comprising elements of each separated sound.
$\alpha ^{leak}$	Leakage rate for the total of separated sound power. LEAK_FACTOR $\times$ OVER_CANCEL_FACTOR
$S_ n(f,k_ i)$	Smoothing power spectrum obtained by Equation (57)

Some parameters are calculated as follows.

	$\displaystyle \beta$	$\displaystyle =$	$\displaystyle -\frac{\alpha ^{leak}}{1-(\alpha ^{leak})^2+\alpha ^{leak}(1-\alpha ^{leak})(N-2)}$		(67)
	$\displaystyle \alpha$	$\displaystyle =$	$\displaystyle 1 - (N-1)\alpha ^{leak}\beta$		(68)

Using these parameters, mix the smoothed spectrum $\mbox{\boldmath {$S$ }}(k_ i)$ with $S^{leak}_ n(k_ i)$ obtained by Equation (63), which is the total power of all separated sounds minus the power of the separated sound itself.

$\displaystyle Z_ n(k_ i) = \alpha S_ n(k_ i)+ \beta S^{leak}_ n(k_ i),$ (69)

Here, when $Z_ n(k_ i) < 1$ , assume $Z_ n(k_ i) = 1$ . The power spectrum of final leakage noise ${\mbox{\boldmath {$\lambda $}}}^{leak}(k_ i)$ is obtained as follows.

$\displaystyle \lambda ^{leak}_ n(k_ i) = \alpha ^{leak} \left(\sum _{n' \neq n}Z_{n'}(k_ i) \right)$ (70)
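
Equations (67) to (70) can be followed step by step in the Python sketch below, which processes all separated sounds of one frequency bin at once. The function name `leakage_noise` and the list-based interface are assumptions for illustration only.

```python
def leakage_noise(smoothed, powers, leak_factor=0.25, over_cancel=1.0):
    """Leakage noise power per separated sound, after Equations (67)-(70).

    smoothed : list of S_n(k_i), the smoothed powers from Equation (57)
    powers   : list of |Y_n(k_i)|^2 for the current frame
    """
    n_src = len(smoothed)
    a_leak = leak_factor * over_cancel  # alpha^leak

    # Equations (67) and (68): mixing coefficients
    beta = -a_leak / (1.0 - a_leak ** 2
                      + a_leak * (1.0 - a_leak) * (n_src - 2))
    alpha = 1.0 - (n_src - 1) * a_leak * beta

    # Equation (69), using S^leak_n from Equation (63); values below 1 become 1
    total = sum(powers)
    z = [max(alpha * s + beta * (total - p), 1.0)
         for s, p in zip(smoothed, powers)]

    # Equation (70): sum Z over all other sources, scaled by alpha^leak
    z_total = sum(z)
    return [a_leak * (z_total - z_n) for z_n in z]
```

For two sources with powers 10 and 2 (and identical smoothed powers), the weaker source receives a leakage estimate of about 2.53 from the stronger one, while the stronger source receives only 0.25 because $Z$ of the weaker source is clipped to 1.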

1-c) Reverberant estimation

The variables used in 1-c) are based on Table 6.42.

Table 6.42: Definition of variable

Variable	Description, Corresponding parameter
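${\mbox{\boldmath {$\lambda $}}}^{rev}(f,k_ i)$	Power spectrum of the reverberation in time frame $f$
${\hat{\mbox{\boldmath {$S$}}}}(f-1,k_ i)$	Complex spectrum of the enhanced separated sound output by the PostFilter node in the previous frame
$\gamma$	Damping coefficient of reverberant power. Parameter REVERB_DECAY_FACTOR. The default value is 0.5.
$\Delta$	Damping coefficient of a separated spectrum. Parameter DIRECT_DECAY_FACTOR. The default value is 0.2.

The power spectrum of the reverberation is estimated recursively from the values of the previous frame.

$\displaystyle \lambda _ n^{rev}(f,k_ i) = \gamma \left(\lambda _ n^{rev}(f-1,k_ i)+ \Delta |{\hat S}_ n(f-1,k_ i)|^2 \right)$ (72)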
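
The recursion of Equation (72) behaves like a decaying tail that follows each output frame. The short Python sketch below illustrates this. The function name `reverb_update` is hypothetical, and the mapping of $\gamma$ to REVERB_DECAY_FACTOR and of $\Delta$ to DIRECT_DECAY_FACTOR is an assumption based on the parameter descriptions.

```python
def reverb_update(prev_reverb, prev_output_power, gamma=0.5, delta=0.2):
    """Reverberant power of the current frame, after Equation (72).

    prev_reverb       : lambda^rev_n(f-1, k_i)
    prev_output_power : |S_hat_n(f-1, k_i)|^2
    gamma, delta      : assumed to be REVERB_DECAY_FACTOR and
                        DIRECT_DECAY_FACTOR, respectively
    """
    return gamma * (prev_reverb + delta * prev_output_power)


def reverb_tail(output_powers, gamma=0.5, delta=0.2):
    """Run the recursion over a sequence of output powers, starting from 0."""
    tail, reverb = [], 0.0
    for power in output_powers:
        tail.append(reverb)  # lambda^rev for this frame uses earlier frames only
        reverb = reverb_update(reverb, power, gamma, delta)
    return tail
```

A single output frame of power 10 followed by silence produces reverberant powers of 1.0, 0.5 and 0.25 in the next three frames with the default values.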

2) SNR estimation

$\includegraphics[width=0.7\textwidth ]{fig/modules/PF-fc-SNR.eps}$

Figure 6.51: Procedure of SNR estimation

Figure 6.51 shows the flow of the SNR estimation. The SNR estimation consists of the following steps:
a) Calculation of SNR
b) Estimation of the speech content rate
c) Preliminary SNR estimation before noise mixture
d) Estimation of the optimal gain

Table 6.43: Definition of major variable

Variable	Description, corresponding parameter
${\mbox{\boldmath {$Y$}}}(k_ i)$	Complex spectra of the separated sound, which is an input of the PostFilter node
${\hat{\mbox{\boldmath {$S$}}}}(k_ i)$	Complex spectra of the formed separated sound, which is an output of the PostFilter node
${\mbox{\boldmath {$\lambda $}}}(k_ i)$	Power spectrum of noise estimated above
$\gamma _ n(k_ i)$	SNR of the separated sound $n$
$\alpha _ n^ p(k_ i)$	Speech content rate
$\xi _ n(k_ i)$	Preliminary SNR
${\mbox{\boldmath {$G$}}}^{H1}(k_ i)$	Optimal gain to improve SNR of the separated sound

The elements of each vector in Table 6.43 hold the values for the individual separated sounds.

2-a) Calculation of SNR

The variables used in 2-a) are based on Table 6.43. Here, SNR $\gamma _ n(k_ i)$ is calculated based on the complex spectra ${\mbox{\boldmath {$Y$}}}(k_ i)$ of the input and the power spectrum of the noise estimated above.

	$\displaystyle \gamma _ n(k_ i)$	$\displaystyle =$	$\displaystyle \frac{\|Y_ n(k_ i)\|^2}{\lambda _ n(k_ i)}$		(73)
	$\displaystyle \gamma _ n^ C(k_ i)$	$\displaystyle =$	$\displaystyle \left\{ \begin{array}{cr} \gamma _ n(k_ i) & \mathrm{if}\ \ \gamma _ n(k_ i)> 0\\ 0 & \mathrm{otherwise} \end{array} \right.$		(74)

That is, $\gamma _ n^ C(k_ i)$ equals $\gamma _ n(k_ i)$ with negative values replaced by 0.

2-b) Estimation of speech content rate

The variables used in 2-b) are based on Table 6.44.

Table 6.44: Definition of variable

Variable	Description, corresponding parameter
$\alpha ^ p_{mag}$	Preliminary SNR coefficient. Parameter VOICEP_PROB_FACTOR. The default value is 0.9.
$\alpha ^ p_{min}$	Minimum speech content rate. Parameter MIN_VOICEP_PROB. The default value is 0.05.

The speech content rate $\alpha _ n^ p(f,k_ i)$ is calculated as follows, with the preliminary SNR $\xi _ n(f-1,k_ i)$ of the former frame.

$\displaystyle \alpha _ n^ p(f,k_ i) = \alpha ^ p_{mag} \left(\frac{\xi _ n(f-1,k_ i)}{\xi _ n(f-1,k_ i)+1}\right)^2 + \alpha ^ p_{min}$ (75)

2-c) Preliminary SNR estimation before noise mixture

The variables used in 2-c) are based on Table 6.45.

Table 6.45: Definition of variable

Variable	Description, corresponding parameter
$a$	Internal ratio of the former frame SNR. Parameter PRIOR_SNR_FACTOR. The default value is 0.8.
$\xi ^{max}$	Upper limit of the preliminary SNR. Parameter MAX_PRIOR_SNR. The default value is 100.

The preliminary SNR $\xi _ n(k_ i)$ is calculated as follows.

	$\displaystyle \xi _ n(k_ i)$	$\displaystyle =$	$\displaystyle \left(1-\alpha _ n^ p(k_ i)\right) \xi _{tmp} + \alpha _ n^ p(k_ i) \gamma _ n^ C(k_ i)$		(76)
	$\displaystyle \xi _{tmp}$	$\displaystyle =$	$\displaystyle a \frac{\|{\hat S}_ n(f-1,k_ i)\|^2}{\lambda _ n(f-1,k_ i)} + (1-a) \xi _ n(f-1,k_ i)$		(77)

Here, $\xi _{tmp}$ is a temporary variable in the calculation: an interior division of the SNR of the previous frame's output, $\|{\hat S}_ n(f-1,k_ i)\|^2 / \lambda _ n(f-1,k_ i)$ , and the preliminary SNR $\xi _ n(f-1,k_ i)$ of the previous frame. Moreover, when $\xi _ n(k_ i) > \xi ^{max}$ is satisfied, set $\xi _ n(k_ i) = \xi ^{max}$ .
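
Steps 2-a) to 2-c) chain together as in the following Python sketch for one separated sound and one frequency bin. The function name `prior_snr` and its argument names are assumptions made for illustration, and the noise powers are assumed to be strictly positive.

```python
def prior_snr(y_power, noise, prev_out_power, prev_noise, prev_xi,
              a=0.8, a_mag=0.9, a_min=0.05, xi_max=100.0):
    """SNR, speech content rate and preliminary SNR, after Equations (73)-(77).

    y_power        : |Y_n(k_i)|^2 of the current frame
    noise          : lambda_n(k_i) of the current frame
    prev_out_power : |S_hat_n(f-1, k_i)|^2
    prev_noise     : lambda_n(f-1, k_i)
    prev_xi        : xi_n(f-1, k_i)
    Returns (gamma_c, alpha_p, xi).
    """
    # Equations (73) and (74): SNR, with negative values replaced by 0
    gamma_c = max(y_power / noise, 0.0)

    # Equation (75): speech content rate from the previous preliminary SNR
    alpha_p = a_mag * (prev_xi / (prev_xi + 1.0)) ** 2 + a_min

    # Equation (77): interior division of previous-frame quantities
    xi_tmp = a * prev_out_power / prev_noise + (1.0 - a) * prev_xi

    # Equation (76), followed by the upper limit xi_max
    xi = (1.0 - alpha_p) * xi_tmp + alpha_p * gamma_c
    return gamma_c, alpha_p, min(xi, xi_max)
```

For instance, with an input SNR of 10, a previous preliminary SNR of 1 and a previous output SNR of 4, the speech content rate is 0.275 and the preliminary SNR is 0.725 × 3.4 + 0.275 × 10 = 5.215.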

2-d) Estimation of optimal gain

The variables used in 2-d) are based on Table 6.46.

Table 6.46: Definition of variable

Variable	Description, corresponding parameter
$\theta ^{max}$	Intermediate variable $v_ n(k_ i)$ maximum value. Parameter MAX_OPT_GAIN. The default value is 20.
$\theta ^{min}$	The intermediate variable $v_ n(k_ i)$ minimum value. Parameter MIN_OPT_GAIN. The default value is 6

Prior to calculating an optimal gain, the following intermediate variable $v_ n(k_ i)$ is calculated with the preliminary SNR $\xi _ n(k_ i)$ obtained above and the estimated SNR $\gamma _ n(k_ i)$ .

$\displaystyle v_ n(k_ i) = \frac{\xi _ n(k_ i)}{1+\xi _ n(k_ i)} \gamma _ n(k_ i)$ (78)

When $v_ n(k_ i) > \theta ^{max}$ is satisfied, $v_ n(k_ i) = \theta ^{max}$ . The optimal gain ${\mbox{\boldmath {$G$}}}^{H1}(k_ i) = [G^{H1}_1(k_ i),\dots , G^{H1}_ N(k_ i)]$ when speech exists is obtained as follows.

$\displaystyle G^{H1}_ n(k_ i) = \frac{\xi _ n(k_ i)}{1+\xi _ n(k_ i)}\exp \left\{ \frac{1}{2}\int _{v_ n(k_ i)}^{\infty }\frac{e^{-t}}{t}\mathrm{d}t \right\}$ (79)

Here,

$\displaystyle \begin{array}{cl} G^{H1}_ n(k_ i) = 1 & \mathrm{if}\ \ v_ n(k_ i) < \theta ^{min} \\ G^{H1}_ n(k_ i) = 1 & \mathrm{if}\ \ G^{H1}_ n(k_ i) > 1. \end{array}$ (80)
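
The integral in Equation (79) is the exponential integral $E_1(v)$. The Python sketch below evaluates it with a truncated continued fraction and then applies Equations (78) to (80) as written. Both function names are hypothetical, and the continued fraction is a choice made for this example to avoid external libraries. It is only intended for the range $v \geq 1$, which covers the values between $\theta^{min}$ and $\theta^{max}$.

```python
import math


def exp1(x, depth=60):
    """Exponential integral E1(x) for x >= 1 via a truncated continued fraction.

    E1(x) = exp(-x) / (x + 1/(1 + 1/(x + 2/(1 + 2/(x + ...)))))
    """
    cf = 0.0
    for k in range(depth, 0, -1):
        cf = k / (1.0 + k / (x + cf))
    return math.exp(-x) / (x + cf)


def optimal_gain(xi, gamma, theta_max=20.0, theta_min=6.0):
    """Optimal gain G^H1 for one source and bin, after Equations (78)-(80)."""
    # Equation (78), followed by the upper limit theta_max
    v = min(xi / (1.0 + xi) * gamma, theta_max)

    # Equation (80), first condition: gain is 1 below theta_min
    if v < theta_min:
        return 1.0

    # Equation (79)
    gain = xi / (1.0 + xi) * math.exp(0.5 * exp1(v))

    # Equation (80), second condition: gain never exceeds 1
    return min(gain, 1.0)
```

Because $E_1(v)$ is already very small for $v \geq 6$, the exponential factor is close to 1 in this range and the gain is dominated by $\xi_n(k_i)/(1+\xi_n(k_i))$.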

3) Estimation of probability of speech presence

$\includegraphics[width=0.7\textwidth ]{fig/modules/PF-fc-VP.eps}$

Figure 6.52: Procedure for estimation of probability of speech presence

Figure 6.52 shows the flow of the estimation of the probability of speech presence. The estimation consists of the following steps:
a) Smoothing of the preliminary SNR in each of three types of bands
b) Estimation of the provisional probability of speech presence based on the smoothed SNR in each band
c) Estimation of the probability of speech pause based on the three provisional probabilities
d) Estimation of the final probability of speech presence

3-a) Smoothing of preliminary SNR

The variables used in 3-a) are summarized in Table 6.47.

Table 6.47: Definition of variable

Variable	Description, corresponding parameter
$\zeta _ n(k_ i)$	Temporally smoothed preliminary SNR
$\xi _ n(k_ i)$	Preliminary SNR
$\zeta ^{f}_ n(k_ i)$	Frequency-smoothed SNR (frame)
$\zeta ^{g}_ n(k_ i)$	Frequency-smoothed SNR (global)
$\zeta ^{l}_ n(k_ i)$	Frequency-smoothed SNR (local)
$b$	Parameter PRIOR_SNR_SMOOTH_FACTOR. The default value is 0.7
$F_{st}$	Parameter LOWER_SMOOTH_FREQ_INDEX. The default value is 8
$F_{en}$	Parameter UPPER_SMOOTH_FREQ_INDEX. The default value is 99
$G$	Parameter GLOBAL_SMOOTH_BANDWIDTH. The default value is 29
$L$	Parameter LOCAL_SMOOTH_BANDWIDTH. The default value is 5

First, temporal smoothing is performed with the preliminary SNR $\xi _ n(f,k_ i)$ calculated by Equation (76) and the temporally smoothed preliminary SNR $\zeta _ n(f-1,k_ i)$ of the previous frame.

$\displaystyle \zeta _ n(f,k_ i) = b \zeta _ n(f-1,k_ i)+ (1-b) \xi _ n(f,k_ i)$ (81)

Smoothing in the frequency direction is performed at three scales. The smoothing width decreases in the order frame, global, local.

Frequency smoothing in frame
Smoothing by averaging is performed in the frequency bin range $F_{st} \sim F_{en}$ .

$\displaystyle \zeta ^{f}_ n(k_ i)$ $\displaystyle =$ $\displaystyle \frac{1}{F_{en}-F_{st}+1}\sum _{k_ j=F_{st}}^{F_{en}}\zeta _ n(k_ j)$ (82)

Global frequency smoothing
Frequency smoothing with a Hanning window of width $G$ is performed globally.

	$\displaystyle \zeta ^{g}_ n(k_ i)$	$\displaystyle =$	$\displaystyle \sum _{j=-(G-1)/2}^{(G-1)/2}w_{han}(j+(G-1)/2)\zeta _ n(k_{i+j}),$		(83)
	$\displaystyle w_{han}(j)$	$\displaystyle =$	$\displaystyle \frac{1}{C}\left(0.5 - 0.5 \cos \left( \frac{2 \pi j}{G}\right)\right),$		(84)

Here, $C$ is a normalization coefficient so that $\sum _{j=0}^{G-1} w_{han}(j) = 1$ can be satisfied.

Local frequency smoothing
Frequency smoothing with a triangular window of width $L$ is performed locally.

$\displaystyle \zeta ^{l}_ n(k_ i)$ $\displaystyle =$ $\displaystyle 0.25 \zeta _ n(k_ i-1)+ 0.5 \zeta _ n(k_ i)+ 0.25 \zeta _ n(k_ i+1)$ (85)
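
The three smoothing scales of Equations (82) to (85) are sketched in Python below for one separated sound, with `zeta` holding the temporally smoothed SNR of every frequency bin. All function names are hypothetical. The sketch assumes 0-based bin indices, skips neighbors that fall outside the spectrum in the global window, and requires an interior bin for the local window. None of these edge rules is stated in this section.

```python
import math


def frame_smooth(zeta, f_st=8, f_en=99):
    """Equation (82): average over bins F_st..F_en (same value for every bin)."""
    band = zeta[f_st:f_en + 1]
    return sum(band) / len(band)


def hann_weights(width=29):
    """Equation (84): window weights normalized so that they sum to 1."""
    raw = [0.5 - 0.5 * math.cos(2.0 * math.pi * j / width)
           for j in range(width)]
    total = sum(raw)
    return [w / total for w in raw]


def global_smooth(zeta, i, width=29):
    """Equation (83): windowed sum centered on bin i."""
    half = (width - 1) // 2
    weights = hann_weights(width)
    return sum(weights[j + half] * zeta[i + j]
               for j in range(-half, half + 1)
               if 0 <= i + j < len(zeta))  # assumption: skip out-of-range bins


def local_smooth(zeta, i):
    """Equation (85): three-point weighting, valid for 1 <= i <= len(zeta) - 2."""
    return 0.25 * zeta[i - 1] + 0.5 * zeta[i] + 0.25 * zeta[i + 1]
```

A constant input is left unchanged by all three functions, which is a quick way to check that each set of weights sums to 1.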

3-b) Estimation of the provisional probability of speech

The variables used in 3-b) are shown in Table 6.48.

Table 6.48: Definition of variable

Variable	Description, corresponding parameter
$\zeta ^{f,g,l}_ n(k_ i)$	SNR smoothed in each band
$P^{f,g,l}_ n(k_ i)$	Provisional probability of speech in each band
$\zeta ^{peak}_ n(k_ i)$	Peak of smoothed SNR
$Z^{peak}_{min}$	Parameter MIN_SMOOTH_PEAK_SNR. The default value is 1.
$Z^{peak}_{max}$	Parameter MAX_SMOOTH_PEAK_SNR. The default value is 10.
$Z_{thres}$	Parameter FRAME_SMOOTH_SNR_THRESH. The default value is 1.5.
$Z_{min}^{f,g,l}$	Parameters MIN_FRAME_SMOOTH_SNR, MIN_GLOBAL_SMOOTH_SNR and MIN_LOCAL_SMOOTH_SNR. The default value of each is 0.1.
$Z_{max}^{f,g,l}$	Parameters MAX_FRAME_SMOOTH_SNR, MAX_GLOBAL_SMOOTH_SNR and MAX_LOCAL_SMOOTH_SNR. The default value of each is 0.316.

Calculation of $P^{f}_ n(k_ i)$ and $\zeta ^{peak}_ n(k_ i)$
First, calculate $\zeta ^{peak}_ n(f,k_ i)$ as follows.

$\displaystyle \zeta ^{peak}_ n(f,k_ i)$ $\displaystyle =$ $\displaystyle \left\{ \begin{array}{cr} \zeta ^ f_ n(f,k_ i),& \mathrm{if\ } \zeta ^{f}_ n(f,k_ i)> Z_{thres} \zeta ^ f_ n(f-1,k_ i)\\ \zeta ^{peak}_ n(f-1,k_ i),& \mathrm{otherwise}. \end{array} \right.$ (86)

Here, the value of $\zeta ^{peak}_ n(k_ i)$ must be within the range of the parameter $Z^{peak}_{min},Z^{peak}_{max}$ . That is,

$\displaystyle \zeta ^{peak}_ n(k_ i)$ $\displaystyle =$ $\displaystyle \left\{ \begin{array}{cr} Z^{peak}_{min},& \mathrm{if} \ \zeta ^{peak}_ n(k_ i) <Z^{peak}_{min}\\ Z^{peak}_{max},& \mathrm{if} \ \zeta ^{peak}_ n(k_ i)>Z^{peak}_{max} \end{array} \right.$ (87)

Next, $P_ n^ f(k_ i)$ is obtained as follows.

$\displaystyle P^{f}_ n(k_ i)$ $\displaystyle =$ $\displaystyle \left\{ \begin{array}{cr} 0, & \mathrm{if}\ \zeta ^ f_ n(k_ i)< \zeta ^{peak}_ n(k_ i) Z_{min}^ f \\ 1, & \mathrm{if}\ \zeta ^ f_ n(k_ i)> \zeta ^{peak}_ n(k_ i) Z_{max}^ f\\ \frac{\log \left(\zeta ^ f_ n(k_ i)\slash \zeta ^{peak}_ n(k_ i)Z^ f_{min}\right)}{\log \left( Z^ f_{max} \slash Z^ f_{min} \right)},& \mathrm{otherwise} \end{array} \right.$ (88)
Calculation of $P^{g}_ n(k_ i)$
Calculate as follows.

$\displaystyle P^ g_ n(k_ i)$ $\displaystyle =$ $\displaystyle \left\{ \begin{array}{cr} 0,& \mathrm{if}\ \ \zeta ^ g_ n(k_ i) < Z_{min}^ g\\ 1,& \mathrm{if}\ \ \zeta ^ g_ n(k_ i) > Z_{max}^ g\\ \frac{\ \log \left(\zeta ^ g_ n(k_ i)\slash Z_{min}^ g\right)\ }{\ \log \left(Z_{max}^ g\slash Z_{min}^ g\right)\ },& \mathrm{otherwise} \end{array} \right.$ (89)

Calculation of $P^{l}_ n(k_ i)$
Calculate as follows.

$\displaystyle P^ l_ n(k_ i)$ $\displaystyle =$ $\displaystyle \left\{ \begin{array}{cr} 0,& \mathrm{if}\ \ \zeta ^ l_ n(k_ i) < Z_{min}^ l\\ 1,& \mathrm{if}\ \ \zeta ^ l_ n(k_ i) > Z_{max}^ l\\ \frac{\ \log \left(\zeta ^ l_ n(k_ i)\slash Z_{min}^ l\right)\ }{\ \log \left(Z_{max}^ l\slash Z_{min}^ l\right)\ },& \mathrm{otherwise} \end{array} \right.$ (90)
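
Equations (89) and (90) share the same shape: a linear ramp on a logarithmic SNR axis between $Z_{min}$ and $Z_{max}$. The Python sketch below shows this mapping. The function name `band_probability` is hypothetical and the default thresholds are the documented parameter defaults.

```python
import math


def band_probability(zeta, z_min=0.1, z_max=0.316):
    """Provisional speech probability, after Equations (89) and (90).

    zeta : frequency-smoothed SNR of one band type for one bin
    """
    if zeta < z_min:
        return 0.0
    if zeta > z_max:
        return 1.0
    # Linear interpolation in the log domain between z_min and z_max
    return math.log(zeta / z_min) / math.log(z_max / z_min)
```

If the denominator inside the logarithm of Equation (88) is read as $\zeta^{peak}_n(k_i) Z^f_{min}$, the frame probability follows from the same function as `band_probability(zeta_f / zeta_peak)`. This reading is an assumption. The geometric mean of the two thresholds maps to a probability of 0.5.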

3-c) Estimation of the probability of speech pause

The variables used in 3-c) are shown in Table 6.49.

Table 6.49: Definition of variable

Variable	Description, corresponding parameter
$q_ n(k_ i)$	Probability of speech pause.
$a^{f}$	FRAME_VOICEP_PROB_FACTOR. The default value is 0.7.
$a^{g}$	GLOBAL_VOICEP_PROB_FACTOR. The default value is 0.9.
$a^{l}$	LOCAL_VOICEP_PROB_FACTOR. The default value is 0.9.
$q_{min}$	MIN_VOICE_PAUSE_PROB. The default value is 0.02.
$q_{max}$	MAX_VOICE_PAUSE_PROB. The default value is 0.98.

As shown below, the probability of speech pause $q_ n(k_ i)$ is obtained by integrating the provisional probability of speech calculated from a smoothing result of the three frequency bands $P^{f,g,l}_ n(k_ i)$ .

$\displaystyle q_ n(k_ i) = 1 - \left( 1-a^ l+a^ l P^ l_ n(k_ i) \right) \left( 1-a^ g +a^ g P^ g_ n(k_ i) \right) \left( 1-a^ f+ a^ f P^ f_ n(k_ i) \right),$ (91)

Here, when $q_ n(k_ i) < q_{min}$ , $q_ n(k_ i) = q_{min}$ , and when $q_ n(k_ i) > q_{max}$ , $q_ n(k_ i) = q_{max}$ .
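
Equation (91) and the limits that follow it translate directly into the Python sketch below. The function name `pause_probability` is an assumption made for this example.

```python
def pause_probability(p_local, p_global, p_frame,
                      a_l=0.9, a_g=0.9, a_f=0.7,
                      q_min=0.02, q_max=0.98):
    """Probability of speech pause, after Equation (91) with its limits."""
    q = 1.0 - ((1.0 - a_l + a_l * p_local)
               * (1.0 - a_g + a_g * p_global)
               * (1.0 - a_f + a_f * p_frame))
    # Keep q inside [q_min, q_max]
    return min(max(q, q_min), q_max)
```

When all three provisional probabilities are 1, the unclipped value is 0 and the result is raised to `q_min`. When all three are 0, the unclipped value is 1 - 0.1 × 0.1 × 0.3 = 0.997 and the result is lowered to `q_max`.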

3-d) Estimation of the probability of speech presence

The probability of speech presence $p_ n(k_ i)$ is obtained from the probability of speech pause $q_ n(k_ i)$ , the smoothed preliminary SNR $\zeta _ n(k_ i)$ and the intermediate variable $v_ n(k_ i)$ derived by Equation (78).

$\displaystyle p_ n(k_ i) = \left\{ 1 + \frac{q_ n(k_ i)}{1-q_ n(k_ i)} \left( 1+\zeta _ n(k_ i)\right) \exp \left(-v_ n(k_ i)\right)\right\} ^{-1}$ (92)

4) Noise removal

The enhanced separated sound ${\hat S}_ n(k_ i)$ , which is the output, is derived by applying the optimal gain $G^{H1}_ n(k_ i)$ and the probability of speech presence $p_ n(k_ i)$ to the input separated sound spectrum $Y_ n(k_ i)$ .

$\displaystyle {\hat S}_ n(k_ i) = Y_ n(k_ i) G^{H1}_ n(k_ i) p_ n(k_ i)$ (93)
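
The last two steps, Equations (92) and (93), can be sketched in Python for one separated sound and one frequency bin as follows. The function names `speech_presence` and `enhance` are hypothetical, and $q_n(k_i)$ is assumed to be already limited to a value below 1, so the division is safe.

```python
import math


def speech_presence(q, zeta, v):
    """Probability of speech presence, after Equation (92).

    q    : probability of speech pause (assumed < 1)
    zeta : smoothed preliminary SNR
    v    : intermediate variable from Equation (78)
    """
    return 1.0 / (1.0 + q / (1.0 - q) * (1.0 + zeta) * math.exp(-v))


def enhance(y, gain, p):
    """Enhanced complex spectrum value, after Equation (93)."""
    return y * gain * p
```

With `q = 0.5`, `zeta = 0` and `v = 0` the probability is exactly 0.5. A large `v` drives the exponential toward 0 and the probability toward 1, so the input is then scaled almost only by the gain.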