
6.3.8 PostFilter

6.3.8.1 Outline of the node

This node post-processes the separated complex spectrum output by the sound source separation node GHDSS in order to improve the accuracy of speech recognition. At the same time, it generates the noise power spectra needed to generate Missing Feature Masks.

6.3.8.2 Necessary files

No files are required.

When to use

This node is used to enhance the spectra separated by the GHDSS node and to generate the noise spectra required to generate Missing Feature Masks.

Typical connection

Figure 469 shows an example of a connection for the PostFilter node. The output of the GHDSS node is connected to the INPUT_SPEC input, and the output of the BGNEstimator node is connected to the INIT_NOISE_POWER input. The figure also shows two typical output connections:

  1. Speech feature extraction from separated sound (OUTPUT_SPEC) (MSLSExtraction node)

  2. Generation of Missing Feature Masks at the time of speech recognition, from the separated sound and the power of the noise contained in it (EST_NOISE_POWER) (MFMGeneration node)

\includegraphics[width=.9\textwidth ]{fig/modules/PostFilter}

6.3.8.3 Input-output and property of the node

Input

INPUT_SPEC

: Map<int, ObjectRef> type. The same type as the output from the GHDSS node. A pair of a sound source ID and a complex spectrum of the separated sound as Vector<complex<float> > type data.

INIT_NOISE_POWER

: Matrix<float> type. The power spectrum of the stationary noise estimated by the BGNEstimator node.

Output

OUTPUT_SPEC

: Map<int, ObjectRef> type. The Object is the complex spectrum from the input INPUT_SPEC, with noise removed.

EST_NOISE_POWER

: Map<int, ObjectRef> type. A pair of a sound source ID and the power of the estimated noise contained in the corresponding separated sound of OUTPUT_SPEC, as Vector<float> type data.

Parameter

Table 6.46: Parameter list of PostFilter (first half)

| Parameter name | Type | Default value | Description |
|---|---|---|---|
| MCRA_SETTING | bool | false | Select true to set the parameters of the MCRA estimation, which is a noise removal method. |
| MCRA_SETTING | | | The following parameters are valid when MCRA_SETTING is set to true. |
| STATIONARY_NOISE_FACTOR | float | 1.2 | Coefficient at the time of stationary noise estimation. |
| SPEC_SMOOTH_FACTOR | float | 0.5 | Smoothing coefficient of an input power spectrum. |
| AMP_LEAK_FACTOR | float | 1.5 | Leakage coefficient. |
| STATIONARY_NOISE_MIXTURE_FACTOR | float | 0.98 | Mixing ratio of stationary noise. |
| LEAK_FLOOR | float | 0.1 | Minimum value of leakage noise. |
| BLOCK_LENGTH | int | 80 | Detection time width. |
| VOICEP_THRESHOLD | int | 3 | Threshold value of speech presence judgment. |
| EST_LEAK_SETTING | bool | false | Select true to set the parameters related to the leakage rate estimation. |
| EST_LEAK_SETTING | | | The following parameters are valid when EST_LEAK_SETTING is set to true. |
| LEAK_FACTOR | float | 0.25 | Leakage rate. |
| OVER_CANCEL_FACTOR | float | 1 | Leakage rate weighting factor. |
| EST_REV_SETTING | bool | false | Select true to set the parameters related to the reverberation component estimation. |
| EST_REV_SETTING | | | The following parameters are valid when EST_REV_SETTING is set to true. |
| REVERB_DECAY_FACTOR | float | 0.5 | Damping coefficient of reverberant power. |
| DIRECT_DECAY_FACTOR | float | 0.2 | Damping coefficient of a separated spectrum. |
| EST_SN_SETTING | bool | false | Select true to set the parameters related to the SN ratio estimation. |
| EST_SN_SETTING | | | The following parameters are valid when EST_SN_SETTING is set to true. |
| PRIOR_SNR_FACTOR | float | 0.8 | Ratio of priori and posteriori SNRs. |
| VOICEP_PROB_FACTOR | float | 0.9 | Amplitude coefficient of the probability of speech presence. |
| MIN_VOICEP_PROB | float | 0.05 | Minimum probability of speech presence. |
| MAX_PRIOR_SNR | float | 100 | Maximum value of preliminary SNR. |
| MAX_OPT_GAIN | float | 20 | Maximum value of the optimal gain intermediate variable v. |
| MIN_OPT_GAIN | float | 6 | Minimum value of the optimal gain intermediate variable v. |

Table 6.47: Parameter list of PostFilter (latter half)

| Parameter name | Type | Default value | Description |
|---|---|---|---|
| EST_VOICEP_SETTING | bool | false | Select true to set the parameters related to the speech probability estimation. |
| EST_VOICEP_SETTING | | | The following parameters are valid when EST_VOICEP_SETTING is set to true. |
| PRIOR_SNR_SMOOTH_FACTOR | float | 0.7 | Time smoothing coefficient. |
| MIN_FRAME_SMOOTH_SNR | float | 0.1 | Minimum value of the frequency smoothing SNR (frame). |
| MAX_FRAME_SMOOTH_SNR | float | 0.316 | Maximum value of the frequency smoothing SNR (frame). |
| MIN_GLOBAL_SMOOTH_SNR | float | 0.1 | Minimum value of the frequency smoothing SNR (global). |
| MAX_GLOBAL_SMOOTH_SNR | float | 0.316 | Maximum value of the frequency smoothing SNR (global). |
| MIN_LOCAL_SMOOTH_SNR | float | 0.1 | Minimum value of the frequency smoothing SNR (local). |
| MAX_LOCAL_SMOOTH_SNR | float | 0.316 | Maximum value of the frequency smoothing SNR (local). |
| UPPER_SMOOTH_FREQ_INDEX | int | 99 | Frequency smoothing upper limit bin index. |
| LOWER_SMOOTH_FREQ_INDEX | int | 8 | Frequency smoothing lower limit bin index. |
| GLOBAL_SMOOTH_BANDWIDTH | int | 29 | Frequency smoothing band width (global). |
| LOCAL_SMOOTH_BANDWIDTH | int | 5 | Frequency smoothing band width (local). |
| FRAME_SMOOTH_SNR_THRESH | float | 1.5 | Threshold value of frequency smoothing SNR. |
| MIN_SMOOTH_PEAK_SNR | float | 1.0 | Minimum value of the frequency smoothing SNR peak. |
| MAX_SMOOTH_PEAK_SNR | float | 10.0 | Maximum value of the frequency smoothing SNR peak. |
| FRAME_VOICEP_PROB_FACTOR | float | 0.7 | Speech probability smoothing coefficient (frame). |
| GLOBAL_VOICEP_PROB_FACTOR | float | 0.9 | Speech probability smoothing coefficient (global). |
| LOCAL_VOICEP_PROB_FACTOR | float | 0.9 | Speech probability smoothing coefficient (local). |
| MIN_VOICE_PAUSE_PROB | float | 0.02 | Minimum value of the speech pause probability. |
| MAX_VOICE_PAUSE_PROB | float | 0.98 | Maximum value of the speech pause probability. |

6.3.8.4 Details of the node

\includegraphics[width=0.7\textwidth ]{fig/modules/PF-fc-overview.eps}
Figure 6.55: Flowchart of PostFilter 

The subscripts used in the equations are based on the definitions in Table 6.1. The time frame index $f$ is omitted in the following equations unless it is needed. Figure 6.55 shows a flowchart of the PostFilter node. The inputs are the separated sound spectrum from the GHDSS node and the stationary noise power spectrum from the BGNEstimator node. The outputs are the separated sound spectrum in which speech is enhanced, and the power spectrum of the noise mixed into the separated sound. The processing flow is as follows.

  1. Noise estimation

  2. SNR estimation

  3. Speech presence probability estimation

  4. Noise removal

1) Noise estimation:

\includegraphics[width=0.7\textwidth ]{fig/modules/PF-fc-noise.eps}
Figure 6.56: Procedure of noise estimation

Figure 6.56 shows the processing flow of noise estimation. The PostFilter node handles three kinds of noise:
a) Stationary noise caused by factors such as the contact points of the microphones,
b) The sound of other sound sources that could not be completely removed (leakage noise),
c) Reverberation from the previous frame.

The final noise $\lambda(f,k_i)$ contained in the separated sound is obtained by the following equation.

$$\lambda(f,k_i) = \lambda^{sta}(f,k_i) + \lambda^{leak}(f,k_i) + \lambda^{rev}(f-1,k_i) \tag{63}$$

Here, $\lambda^{sta}(f,k_i)$, $\lambda^{leak}(f,k_i)$ and $\lambda^{rev}(f-1,k_i)$ indicate the stationary noise, the leakage noise and the reverberation from the previous frame, respectively.

6.3.8.4.1 1-a) Stationary noise estimation by MCRA method

The parameters used in 1-a) are based on Table 6.48.

Table 6.48: Definition of variable

| Parameter | Description, corresponding parameter |
|---|---|
| $Y(k_i) = [Y_1(k_i), \dots, Y_N(k_i)]^T$ | Complex spectrum of separated sound corresponding to the frequency bin $k_i$ |
| $\lambda^{init}(k_i) = [\lambda^{init}_1(k_i), \dots, \lambda^{init}_N(k_i)]^T$ | Initial value power spectrum used for the stationary noise estimation |
| $\lambda^{sta}(k_i) = [\lambda^{sta}_1(k_i), \dots, \lambda^{sta}_N(k_i)]^T$ | Estimated stationary noise power spectrum. |
| $\alpha_s$ | Smoothing coefficient of the input power spectrum. Parameter SPEC_SMOOTH_FACTOR. The default value is 0.5. |
| $S^{tmp}(k_i) = [S^{tmp}_1(k_i), \dots, S^{tmp}_N(k_i)]$ | Temporary parameter for minimum power calculation. |
| $S^{min}(k_i) = [S^{min}_1(k_i), \dots, S^{min}_N(k_i)]$ | The parameter that maintains the minimum power. |
| $L$ | Number of frames for which $S^{tmp}$ is maintained. Parameter BLOCK_LENGTH. The default value is 80. |
| $\delta$ | Threshold value of speech presence judgment. Parameter VOICEP_THRESHOLD. The default value is 3.0. |
| $\alpha_d$ | Mixing ratio of estimated stationary noise. Parameter STATIONARY_NOISE_MIXTURE_FACTOR. The default value is 0.98. |
| $Y^{leak}(k_i)$ | Estimated power spectrum of the leakage noise contained in the separated sound |
| $q$ | Coefficient used when leakage noise is removed from the input separated sound power. Parameter AMP_LEAK_FACTOR. The default value is 1.5. |
| $S_{floor}$ | Minimum value of leakage noise. Parameter LEAK_FLOOR. The default value is 0.1. |
| $r$ | Coefficient at the time of stationary noise estimation. Parameter STATIONARY_NOISE_FACTOR. The default value is 1.2. |

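First, calculate the smoothed power spectrum $S(f,k_i) = [S_1(f,k_i), \dots, S_N(f,k_i)]$ by smoothing the input power with the smoothed power of the previous frame.

$$S_n(f,k_i) = \alpha_s S_n(f-1,k_i) + (1-\alpha_s)\,|Y_n(k_i)|^2 \tag{64}$$

Next, update $S^{tmp}$ and $S^{min}$.

$$S^{min}_n(f,k_i) = \begin{cases} \min\{S^{min}_n(f-1,k_i),\, S_n(f,k_i)\} & \text{if } f \neq nL \\ \min\{S^{tmp}_n(f-1,k_i),\, S_n(f,k_i)\} & \text{if } f = nL \end{cases} \tag{65}$$
$$S^{tmp}_n(f,k_i) = \begin{cases} \min\{S^{tmp}_n(f-1,k_i),\, S_n(f,k_i)\} & \text{if } f \neq nL \\ S_n(f,k_i) & \text{if } f = nL \end{cases} \tag{66}$$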

Here, $n$ in the condition $f = nL$ indicates an arbitrary integer, so the second case applies once every $L$ frames. $S^{min}$ maintains the minimum power since the noise estimation began, and $S^{tmp}$ maintains the minimum power of the recent frames. $S^{tmp}$ is reset every $L$ frames. Next, judge whether the frame contains speech based on the ratio of the input separated sound power to the minimum power.

$$S^r_n(k_i) = \frac{S_n(k_i)}{S^{min}_n(k_i)} \tag{67}$$
$$I_n(k_i) = \begin{cases} 1 & \text{if } S^r_n(k_i) > \delta \\ 0 & \text{if } S^r_n(k_i) \leq \delta \end{cases} \tag{68}$$

When speech is included, $I_n(k_i)$ is 1, and when it is not included, $I_n(k_i)$ is 0. Based on this result, determine the mixing ratio $\alpha^C_{d,n}(k_i)$ of the frame's estimated stationary noise.

$$\alpha^C_{d,n}(k_i) = (\alpha_d - 1)\, I_n(k_i) + 1 \tag{69}$$
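The update in Equations (64) to (69) can be sketched for a single separated sound and a single frequency bin as follows. This is only an illustrative sketch, not the node's actual implementation: the function name `mcra_step`, the dictionary layout of the state, and the guard against a zero minimum power are assumptions introduced here. Default arguments mirror the defaults listed in Table 6.48.

```python
def mcra_step(prev, y_power, f, alpha_s=0.5, block_len=80, delta=3.0, alpha_d=0.98):
    """One frame of Equations (64)-(69) for one source and one frequency bin.

    prev    : dict with the previous frame's "S", "S_tmp" and "S_min"
    y_power : |Y_n(k_i)|^2 of the current frame
    f       : frame index (the block boundary f = nL is detected with f % L == 0)
    """
    # Eq. (64): recursive smoothing of the input power.
    s = alpha_s * prev["S"] + (1.0 - alpha_s) * y_power

    if f % block_len == 0:
        # f = nL: second cases of Eq. (65) and Eq. (66).
        s_min = min(prev["S_tmp"], s)
        s_tmp = s
    else:
        # f != nL: first cases of Eq. (65) and Eq. (66).
        s_min = min(prev["S_min"], s)
        s_tmp = min(prev["S_tmp"], s)

    # Eq. (67): power ratio. The zero guard is an assumption of this sketch.
    ratio = s / s_min if s_min > 0.0 else float("inf")
    # Eq. (68): speech presence judgment.
    speech = 1 if ratio > delta else 0
    # Eq. (69): mixing ratio of the estimated stationary noise.
    alpha_cd = (alpha_d - 1.0) * speech + 1.0

    return {"S": s, "S_tmp": s_tmp, "S_min": s_min}, speech, alpha_cd


state = {"S": 1.0, "S_tmp": 1.0, "S_min": 1.0}
state, speech, alpha_cd = mcra_step(state, y_power=9.0, f=1)
print(state["S"], speech, alpha_cd)
```

With the values above, the smoothed power jumps to 5.0 while the tracked minimum stays at 1.0, so the ratio exceeds the threshold and the frame is judged to contain speech.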

Next, subtract leakage noise contained in the power spectrum of the separated sound.

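$$S^{leak}_n(k_i) = \sum_{p=1}^{N} |Y_p(k_i)|^2 - |Y_n(k_i)|^2 \tag{70}$$
$$S^0_n(k_i) = |Y_n(k_i)|^2 - q\, S^{leak}_n(k_i) \tag{71}$$

Here, when $S^0_n(k_i) < S_{floor}$, the value is replaced as follows.

$$S^0_n(k_i) = S_{floor} \tag{72}$$

Obtain the stationary noise of the current frame by mixing $S^0_n(f,k_i)$, the power spectrum with leakage noise removed, with either the estimated stationary noise of the previous frame $\lambda^{sta}(f-1,k_i)$ or $\lambda^{init}(f,k_i)$, which is the output of the BGNEstimator node.

$$\lambda^{sta}_n(f,k_i) = \begin{cases} \alpha^C_{d,n}(k_i)\,\lambda^{sta}_n(f-1,k_i) + (1-\alpha^C_{d,n}(k_i))\, r\, S^0_n(f,k_i) & \text{if no change in source position} \\ \alpha^C_{d,n}(k_i)\,\lambda^{init}_n(f,k_i) + (1-\alpha^C_{d,n}(k_i))\, r\, S^0_n(f,k_i) & \text{if change in source position} \end{cases} \tag{73}$$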
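As an illustration of Equations (70) to (73), the following sketch computes the stationary noise of one separated sound at one frequency bin. The function name `stationary_noise_update` and the boolean flag `source_moved` are hypothetical names chosen for this example; how the node actually detects a change in source position is not described in this section.

```python
def stationary_noise_update(y_power, n, alpha_cd, lam_prev, lam_init,
                            source_moved=False, q=1.5, s_floor=0.1, r=1.2):
    """Equations (70)-(73) for separated sound n at one frequency bin.

    y_power  : list of |Y_p(k_i)|^2 for all N separated sounds
    alpha_cd : mixing ratio from Eq. (69)
    lam_prev : stationary noise estimate of the previous frame
    lam_init : initial noise power from the BGNEstimator node
    """
    # Eq. (70): total power of the other separated sounds.
    s_leak = sum(y_power) - y_power[n]
    # Eq. (71) with the floor of Eq. (72).
    s0 = max(y_power[n] - q * s_leak, s_floor)
    # Eq. (73): choose which estimate is mixed with the leak-free power.
    base = lam_init if source_moved else lam_prev
    return alpha_cd * base + (1.0 - alpha_cd) * r * s0


# Two separated sounds; update the noise estimate of source 0.
print(stationary_noise_update([4.0, 1.0], n=0, alpha_cd=0.98,
                              lam_prev=1.0, lam_init=2.0))
```

For source 1 in the same example, the leak-subtracted power becomes negative, so the floor `s_floor` is used instead.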

6.3.8.4.2 1-b) Leakage noise estimation

The variables used in 1-b) are based on Table 6.49.

Table 6.49: Definition of variable

| Variable | Description, corresponding parameter |
|---|---|
| $\lambda^{leak}(k_i)$ | Power spectrum of leakage noise. Vector comprising elements of each separated sound. |
| $\alpha^{leak}$ | Leakage rate for the total of separated sound power. LEAK_FACTOR $\times$ OVER_CANCEL_FACTOR |
| $S_n(f,k_i)$ | Smoothed power spectrum obtained by Equation (64) |

Some parameters are calculated as follows.

$$\beta = -\frac{\alpha^{leak}}{1 - (\alpha^{leak})^2 + \alpha^{leak}(1-\alpha^{leak})(N-2)} \tag{74}$$
$$\alpha = 1 - (N-1)\,\alpha^{leak}\beta \tag{75}$$

Using these parameters, mix the smoothed spectrum $S_n(k_i)$ with $S^{leak}_n(k_i)$, the total power of all separated sounds minus the power of the separated sound itself, obtained by Equation (70).

$$Z_n(k_i) = \alpha\, S_n(k_i) + \beta\, S^{leak}_n(k_i) \tag{76}$$

Here, when $Z_n(k_i) < 1$, assume $Z_n(k_i) = 1$. The final power spectrum of the leakage noise $\lambda^{leak}(k_i)$ is obtained as follows.

$$\lambda^{leak}_n(k_i) = \alpha^{leak} \left( \sum_{n' \neq n} Z_{n'}(k_i) \right) \tag{77}$$
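The leakage noise estimation of Equations (74) to (77) can be sketched as below. The function names are hypothetical, and the sign of $\beta$ is an assumption: it is taken as negative, with $\alpha = 1 - (N-1)\alpha^{leak}\beta$. With that choice, $\alpha$ and $\beta$ are exactly the diagonal and off-diagonal entries of the inverse of the $N \times N$ matrix that has 1 on the diagonal and $\alpha^{leak}$ elsewhere, which is the check printed at the end.

```python
def leak_coefficients(alpha_leak, n_src):
    """Eq. (74) and Eq. (75), assuming beta carries a leading minus sign."""
    denom = 1.0 - alpha_leak ** 2 + alpha_leak * (1.0 - alpha_leak) * (n_src - 2)
    beta = -alpha_leak / denom
    alpha = 1.0 - (n_src - 1) * alpha_leak * beta
    return alpha, beta


def leak_noise(s_smooth, y_power, leak_factor=0.25, over_cancel=1.0):
    """Eq. (76) and Eq. (77) for all N separated sounds at one frequency bin.

    s_smooth : smoothed powers S_n(k_i) from Eq. (64)
    y_power  : instantaneous powers |Y_n(k_i)|^2
    """
    n_src = len(s_smooth)
    alpha_leak = leak_factor * over_cancel       # LEAK_FACTOR x OVER_CANCEL_FACTOR
    alpha, beta = leak_coefficients(alpha_leak, n_src)
    total = sum(y_power)
    z = []
    for n in range(n_src):
        s_leak = total - y_power[n]              # Eq. (70)
        # Eq. (76), floored at 1 as stated in the text.
        z.append(max(alpha * s_smooth[n] + beta * s_leak, 1.0))
    z_total = sum(z)
    # Eq. (77): sum of Z over all other separated sounds.
    return [alpha_leak * (z_total - z[n]) for n in range(n_src)]


alpha, beta = leak_coefficients(0.25, 3)
# Row of (mixing matrix) x (its inverse): diagonal should be 1, off-diagonal 0.
print(alpha + 2 * 0.25 * beta, 0.25 * alpha + beta + 0.25 * beta)
print(leak_noise([100.0, 10.0], [100.0, 10.0]))
```

In the two-source example, the loud source leaks strongly into the quiet one, while the quiet source contributes only the floored value to the loud one.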

6.3.8.4.3 1-c) Reverberant estimation

The variables used in 1-c) are based on Table 6.50.

Table 6.50: Definition of variable

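| Variable | Description, corresponding parameter |
|---|---|
| $\lambda^{rev}(f,k_i)$ | Power spectrum of the reverberation in the time frame $f$ |
| $\hat{S}(f-1,k_i)$ | Complex spectrum of the enhanced separated sound output by the PostFilter node in the previous frame (see Table 6.51) |

The power spectrum of the reverberation is calculated as follows.

$$\lambda^{rev}_n(f,k_i) = \gamma \left( \lambda^{rev}_n(f-1,k_i) + \Delta\, |\hat{S}_n(f-1,k_i)|^2 \right) \tag{79}$$

Here, $\gamma$ and $\Delta$ are damping coefficients. Judging from the parameter list, they appear to correspond to REVERB_DECAY_FACTOR (damping coefficient of reverberant power) and DIRECT_DECAY_FACTOR (damping coefficient of a separated spectrum), respectively.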
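The reverberation recursion can be sketched together with the noise sum of Equation (63). The function names are hypothetical. The default values 0.5 and 0.2 assume that the two coefficients of the recursion correspond to REVERB_DECAY_FACTOR and DIRECT_DECAY_FACTOR, which this section does not state explicitly.

```python
def reverb_power(lam_rev_prev, s_hat_prev, gamma=0.5, delta=0.2):
    """Reverberation power of the current frame from the previous frame.

    lam_rev_prev : reverberation power of the previous frame
    s_hat_prev   : enhanced complex spectrum value of the previous frame
    gamma, delta : damping coefficients (assumed parameter mapping, see lead-in)
    """
    return gamma * (lam_rev_prev + delta * abs(s_hat_prev) ** 2)


def total_noise(lam_sta, lam_leak, lam_rev_prev):
    """Eq. (63): stationary + leakage + reverberation of the previous frame."""
    return lam_sta + lam_leak + lam_rev_prev


# An impulse in the enhanced output decays geometrically in later frames.
lam_rev = 0.0
for s_hat in [3 + 4j, 0j, 0j]:
    lam_rev = reverb_power(lam_rev, s_hat)
    print(lam_rev)
```

The loop shows the intended behavior of the recursion: a single strong frame produces a reverberation estimate that is halved in every following silent frame.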

2) SNR estimation:

\includegraphics[width=0.7\textwidth ]{fig/modules/PF-fc-SNR.eps}
Figure 6.57: Procedure of SNR estimation

Figure 6.57 shows the flow of the SNR estimation. The SNR estimation consists of the following steps:
a) Calculation of SNR
b) Estimation of the speech content rate
c) Preliminary SNR estimation before noise mixture
d) Estimation of the optimal gain

Table 6.51: Definition of major variable

| Variable | Description, corresponding parameter |
|---|---|
| $Y(k_i)$ | Complex spectra of the separated sound, which is an input of the PostFilter node |
| $\hat{S}(k_i)$ | Complex spectra of the enhanced separated sound, which is an output of the PostFilter node |
| $\lambda(k_i)$ | Power spectrum of the noise estimated above |
| $\gamma_n(k_i)$ | SNR of the separated sound $n$ |
| $\alpha^p_n(k_i)$ | Speech content rate |
| $\xi_n(k_i)$ | Preliminary SNR |
| $G^{H1}(k_i)$ | Optimal gain to improve the SNR of the separated sound |

Each vector in Table 6.51 holds one element per separated sound.

6.3.8.4.4 2-a) Calculation of SNR

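The variables used in 2-a) are based on Table 6.51. Here, the SNR $\gamma_n(k_i)$ is calculated from the complex spectra $Y(k_i)$ of the input and the power spectrum of the noise estimated above.

$$\gamma_n(k_i) = \frac{|Y_n(k_i)|^2}{\lambda_n(k_i)} \tag{80}$$
$$\gamma^C_n(k_i) = \begin{cases} \gamma_n(k_i) & \text{if } \gamma_n(k_i) > 0 \\ 0 & \text{otherwise} \end{cases} \tag{81}$$

That is, $\gamma^C_n(k_i)$ is $\gamma_n(k_i)$ with negative values replaced by 0.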
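A minimal sketch of Equations (80) and (81) for one separated sound and one frequency bin is shown below. The function name `estimate_snr` is a hypothetical name used only for this example.

```python
def estimate_snr(y, lam):
    """Return (gamma, gamma_c) for one bin.

    y   : complex spectrum value Y_n(k_i) of the separated sound
    lam : estimated noise power lambda_n(k_i)
    """
    gamma = abs(y) ** 2 / lam                 # Eq. (80)
    gamma_c = gamma if gamma > 0 else 0.0     # Eq. (81)
    return gamma, gamma_c


print(estimate_snr(3 + 4j, 5.0))
```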

6.3.8.4.5 2-b) Estimation of speech content rate

The variables used in 2-b) are based on Table 6.52.

Table 6.52: Definition of variable

| Variable | Description, corresponding parameter |
|---|---|
| $\alpha^p_{mag}$ | Preliminary SNR coefficient. Parameter VOICEP_PROB_FACTOR. The default value is 0.9. |
| $\alpha^p_{min}$ | Minimum speech content rate. Parameter MIN_VOICEP_PROB. The default value is 0.05. |

The speech content rate $\alpha^p_n(f,k_i)$ is calculated as follows, using the preliminary SNR $\xi_n(f-1,k_i)$ of the previous frame.

$$\alpha^p_n(f,k_i) = \alpha^p_{mag} \left( \frac{\xi_n(f-1,k_i)}{\xi_n(f-1,k_i) + 1} \right)^2 + \alpha^p_{min} \tag{82}$$
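The behavior of Equation (82) is easy to see numerically. In the sketch below, the function name `speech_content_rate` is hypothetical and the default arguments follow Table 6.52.

```python
def speech_content_rate(xi_prev, mag=0.9, floor=0.05):
    """Eq. (82): speech content rate from the previous preliminary SNR.

    xi_prev : preliminary SNR xi_n(f-1, k_i) of the previous frame
    mag     : VOICEP_PROB_FACTOR
    floor   : MIN_VOICEP_PROB
    """
    return mag * (xi_prev / (xi_prev + 1.0)) ** 2 + floor


for xi in [0.0, 1.0, 100.0]:
    print(xi, speech_content_rate(xi))
```

The rate grows from the floor value at a preliminary SNR of 0 toward the sum of the two coefficients as the preliminary SNR becomes large.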

6.3.8.4.6 2-c) Preliminary SNR estimation before noise mixture

The variables used in 2-c) are based on Table 6.53.

Table 6.53: Definition of variable

| Variable | Description, corresponding parameter |
|---|---|
| $a$ | Interior division ratio of the SNR of the previous frame. Parameter PRIOR_SNR_FACTOR. The default value is 0.8. |
| $\xi_{max}$ | Upper limit of the preliminary SNR. Parameter MAX_PRIOR_SNR. The default value is 100. |

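The preliminary SNR $\xi_n(k_i)$ is calculated as follows.

$$\xi_n(k_i) = (1 - \alpha^p_n(k_i))\, \xi_{tmp} + \alpha^p_n(k_i)\, \gamma^C_n(k_i) \tag{83}$$
$$\xi_{tmp} = a\, \frac{|\hat{S}_n(f-1,k_i)|^2}{\lambda_n(f-1,k_i)} + (1-a)\, \xi_n(f-1,k_i) \tag{84}$$

Here, $\xi_{tmp}$ is a temporary variable in the calculation. It is the interior division, with ratio $a$, of the SNR estimated from the output of the previous frame and the preliminary SNR $\xi_n(f-1,k_i)$ of the previous frame. Moreover, when $\xi_n(k_i) > \xi_{max}$ is satisfied, the value is replaced by $\xi_n(k_i) = \xi_{max}$.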
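Equations (83) and (84), including the upper limit, can be sketched as follows. The function name `preliminary_snr` and its argument names are hypothetical; the default arguments follow Table 6.53.

```python
def preliminary_snr(alpha_p, gamma_c, s_hat_prev, lam_prev, xi_prev,
                    a=0.8, xi_max=100.0):
    """Eq. (83)-(84) for one separated sound and one frequency bin.

    alpha_p    : speech content rate of the current frame
    gamma_c    : clipped SNR of the current frame, Eq. (81)
    s_hat_prev : enhanced complex spectrum value of the previous frame
    lam_prev   : noise power of the previous frame
    xi_prev    : preliminary SNR of the previous frame
    """
    # Eq. (84): interior division of the two previous-frame SNR values.
    xi_tmp = a * abs(s_hat_prev) ** 2 / lam_prev + (1.0 - a) * xi_prev
    # Eq. (83): mix with the SNR of the current frame.
    xi = (1.0 - alpha_p) * xi_tmp + alpha_p * gamma_c
    # Upper limit MAX_PRIOR_SNR.
    return min(xi, xi_max)


print(preliminary_snr(0.5, 4.0, 2 + 0j, 1.0, 1.0))
```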

6.3.8.4.7 2-d) Estimation of optimal gain

The variables used in 2-d) are based on Table 6.54.

Table 6.54: Definition of variable

| Variable | Description, corresponding parameter |
|---|---|
| $\theta_{max}$ | Maximum value of the intermediate variable $v_n(k_i)$. Parameter MAX_OPT_GAIN. The default value is 20. |
| $\theta_{min}$ | Minimum value of the intermediate variable $v_n(k_i)$. Parameter MIN_OPT_GAIN. The default value is 6. |

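Prior to calculating the optimal gain, the following intermediate variable $v_n(k_i)$ is calculated from the preliminary SNR $\xi_n(k_i)$ obtained above and the estimated SNR $\gamma_n(k_i)$.

$$v_n(k_i) = \frac{\xi_n(k_i)}{1 + \xi_n(k_i)}\, \gamma_n(k_i) \tag{85}$$

When $v_n(k_i) > \theta_{max}$ is satisfied, $v_n(k_i) = \theta_{max}$. The optimal gain $G^{H1}(k_i) = [G^{H1}_1(k_i), \dots, G^{H1}_N(k_i)]$ when speech exists is obtained as follows.

$$G^{H1}_n(k_i) = \frac{\xi_n(k_i)}{1 + \xi_n(k_i)} \exp\left\{ \frac{1}{2} \int_{v_n(k_i)}^{\infty} \frac{e^{-t}}{t}\, dt \right\} \tag{86}$$

Here,

$$G^{H1}_n(k_i) = 1 \ \text{ if } v_n(k_i) < \theta_{min}, \qquad G^{H1}_n(k_i) = 1 \ \text{ if } G^{H1}_n(k_i) > 1. \tag{87}$$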
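The integral in Equation (86) is the exponential integral $E_1(v)$. The sketch below evaluates it with a power series for small arguments and a continued fraction for larger ones, then applies Equations (85) to (87) literally as written above. The function names `exp1` and `optimal_gain`, the tolerance and the iteration limit are assumptions of this example and do not describe how the node itself evaluates the integral.

```python
import math

EULER_GAMMA = 0.5772156649015329


def exp1(x, eps=1e-12, max_iter=200):
    """E1(x): integral from x to infinity of exp(-t)/t dt, for x > 0."""
    if x <= 0.0:
        raise ValueError("exp1 is only defined here for x > 0")
    if x <= 1.0:
        # Power series: -gamma - ln(x) - sum_{k>=1} (-x)^k / (k * k!)
        total = -math.log(x) - EULER_GAMMA
        term = 1.0
        for k in range(1, max_iter + 1):
            term *= -x / k
            delta = -term / k
            total += delta
            if abs(delta) < eps * abs(total):
                break
        return total
    # Continued fraction evaluated with the modified Lentz method.
    b = x + 1.0
    c = 1.0 / 1e-300
    d = 1.0 / b
    h = d
    for k in range(1, max_iter + 1):
        a = -float(k * k)
        b += 2.0
        d = 1.0 / (a * d + b)
        c = b + a / c
        delta = c * d
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h * math.exp(-x)


def optimal_gain(xi, gamma, theta_max=20.0, theta_min=6.0):
    """Eq. (85)-(87): return (gain, v) for one bin."""
    v = min(xi / (1.0 + xi) * gamma, theta_max)   # Eq. (85) with upper limit
    if v < theta_min:                             # Eq. (87), first condition
        return 1.0, v
    gain = xi / (1.0 + xi) * math.exp(0.5 * exp1(v))   # Eq. (86)
    return min(gain, 1.0), v                      # Eq. (87), second condition


print(optimal_gain(10.0, 10.0))
```

If SciPy is available, `scipy.special.exp1` computes the same integral and could replace the hand-written `exp1` above.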

3) Estimation of probability of speech presence

\includegraphics[width=0.7\textwidth ]{fig/modules/PF-fc-VP.eps}
Figure 6.58: Procedure for estimation of probability of speech presence

Figure 6.58 shows the flow of estimation of probability of speech presence. Estimation of the probability of speech presence consists of:
a) Smoothing of the preliminary SNR for each of the three types of bands
b) Estimation of the provisional probability of speech presence based on the smoothed SNR in each band
c) Estimation of the probability of speech pause based on the three provisional probabilities
d) Estimation of the final probability of speech presence

6.3.8.4.8 3-a) Smoothing of preliminary SNR

The variables used in 3-a) are summarized in Table 6.55.

Table 6.55: Definition of variable

| Variable | Description, corresponding parameter |
|---|---|
| $\zeta_n(k_i)$ | Temporally smoothed preliminary SNR |
| $\xi_n(k_i)$ | Preliminary SNR |
| $\zeta^f_n(k_i)$ | Frequency-smoothed SNR (frame) |
| $\zeta^g_n(k_i)$ | Frequency-smoothed SNR (global) |
| $\zeta^l_n(k_i)$ | Frequency-smoothed SNR (local) |
| $b$ | Parameter PRIOR_SNR_SMOOTH_FACTOR. The default value is 0.7. |
| $F_{st}$ | Parameter LOWER_SMOOTH_FREQ_INDEX. The default value is 8. |
| $F_{en}$ | Parameter UPPER_SMOOTH_FREQ_INDEX. The default value is 99. |
| $G$ | Parameter GLOBAL_SMOOTH_BANDWIDTH. The default value is 29. |
| $L$ | Parameter LOCAL_SMOOTH_BANDWIDTH. The default value is 5. |

First, temporal smoothing is performed using the preliminary SNR $\xi_n(f,k_i)$ calculated by Equation (83) and the temporally smoothed preliminary SNR $\zeta_n(f-1,k_i)$ of the previous frame.

$$\zeta_n(f,k_i) = b\, \zeta_n(f-1,k_i) + (1-b)\, \xi_n(f,k_i) \tag{88}$$
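Equation (88) is a first-order recursive average. The sketch below, with the hypothetical function name `smooth_preliminary_snr`, shows how the smoothed value approaches a constant preliminary SNR over a few frames with the default coefficient of 0.7.

```python
def smooth_preliminary_snr(zeta_prev, xi, b=0.7):
    """Eq. (88): temporal smoothing of the preliminary SNR.

    zeta_prev : smoothed preliminary SNR of the previous frame
    xi        : preliminary SNR of the current frame
    b         : PRIOR_SNR_SMOOTH_FACTOR
    """
    return b * zeta_prev + (1.0 - b) * xi


zeta = 0.0
for frame in range(5):
    zeta = smooth_preliminary_snr(zeta, 10.0)
    print(frame, zeta)
```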

Smoothing in the frequency direction is then performed with three window sizes, which decrease in the order of frame, global and local.

6.3.8.4.9 3-b) Estimation of the provisional probability of speech

The variables used in 3-b) are shown in Table 6.56.

Table 6.56: Definition of variable

| Variable | Description, corresponding parameter |
|---|---|
| $\zeta^{f,g,l}_n(k_i)$ | SNR smoothed in each band |
| $P^{f,g,l}_n(k_i)$ | Provisional probability of speech in each band |
| $\zeta^{peak}_n(k_i)$ | Peak of the smoothed SNR |
| $Z^{peak}_{min}$ | Parameter MIN_SMOOTH_PEAK_SNR. The default value is 1. |
| $Z^{peak}_{max}$ | Parameter MAX_SMOOTH_PEAK_SNR. The default value is 10. |
| $Z_{thres}$ | Parameter FRAME_SMOOTH_SNR_THRESH. The default value is 1.5. |
| $Z^{f,g,l}_{min}$ | Parameters MIN_FRAME_SMOOTH_SNR, MIN_GLOBAL_SMOOTH_SNR, MIN_LOCAL_SMOOTH_SNR. The default value is 0.1. |
| $Z^{f,g,l}_{max}$ | Parameters MAX_FRAME_SMOOTH_SNR, MAX_GLOBAL_SMOOTH_SNR, MAX_LOCAL_SMOOTH_SNR. The default value is 0.316. |

6.3.8.4.10 3-c) Estimation of the probability of speech pause

The variables used in 3-c) are shown in Table 6.57.

Table 6.57: Definition of variable

| Variable | Description, corresponding parameter |
|---|---|
| $q_n(k_i)$ | Probability of speech pause. |
| $a_f$ | Parameter FRAME_VOICEP_PROB_FACTOR. The default value is 0.7. |
| $a_g$ | Parameter GLOBAL_VOICEP_PROB_FACTOR. The default value is 0.9. |
| $a_l$ | Parameter LOCAL_VOICEP_PROB_FACTOR. The default value is 0.9. |
| $q_{min}$ | Parameter MIN_VOICE_PAUSE_PROB. The default value is 0.02. |
| $q_{max}$ | Parameter MAX_VOICE_PAUSE_PROB. The default value is 0.98. |

As shown below, the probability of speech pause $q_n(k_i)$ is obtained by combining the provisional probabilities of speech $P^{f,g,l}_n(k_i)$ calculated from the smoothing results in the three frequency bands.

$$q_n(k_i) = 1 - \left(1 - a_l + a_l P^l_n(k_i)\right) \left(1 - a_g + a_g P^g_n(k_i)\right) \left(1 - a_f + a_f P^f_n(k_i)\right) \tag{98}$$

Here, when $q_n(k_i) < q_{min}$, $q_n(k_i) = q_{min}$, and when $q_n(k_i) > q_{max}$, $q_n(k_i) = q_{max}$.
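The combination of the three provisional probabilities in Equation (98), including the lower and upper limits, can be sketched as follows. The function name `speech_pause_prob` is hypothetical and the default arguments follow Table 6.57.

```python
def speech_pause_prob(p_local, p_global, p_frame,
                      a_l=0.9, a_g=0.9, a_f=0.7, q_min=0.02, q_max=0.98):
    """Eq. (98) with clipping to [q_min, q_max]."""
    product = ((1.0 - a_l + a_l * p_local)
               * (1.0 - a_g + a_g * p_global)
               * (1.0 - a_f + a_f * p_frame))
    q = 1.0 - product
    return min(max(q, q_min), q_max)


print(speech_pause_prob(1.0, 1.0, 1.0))   # speech likely in all bands
print(speech_pause_prob(0.5, 0.5, 0.5))
print(speech_pause_prob(0.0, 0.0, 0.0))   # speech unlikely in all bands
```

When all three provisional probabilities are 1, the unclipped pause probability is 0 and the lower limit applies. When all three are 0, the unclipped value exceeds the upper limit and is clipped to it.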

6.3.8.4.11 3-d) Estimation of the probability of speech presence

The probability of speech presence $p_n(k_i)$ is obtained from the probability of speech pause $q_n(k_i)$, the preliminary SNR $\zeta_n(k_i)$ and the intermediate variable $v_n(k_i)$ derived by Equation (85).

$$p_n(k_i) = \left\{ 1 + \frac{q_n(k_i)}{1 - q_n(k_i)} \left(1 + \zeta_n(k_i)\right) \exp\left(-v_n(k_i)\right) \right\}^{-1} \tag{99}$$
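A minimal sketch of Equation (99) is given below. It assumes that the exponent is $-v_n(k_i)$ and that the whole braced expression is inverted. The function name `speech_presence_prob` is hypothetical. Because the speech pause probability is limited to at most 0.98 by default, the division by $1 - q_n(k_i)$ is safe.

```python
import math


def speech_presence_prob(q, zeta, v):
    """Eq. (99): probability of speech presence for one bin.

    q    : probability of speech pause, already clipped below 1
    zeta : preliminary SNR used in Eq. (99)
    v    : intermediate variable from Eq. (85)
    """
    return 1.0 / (1.0 + q / (1.0 - q) * (1.0 + zeta) * math.exp(-v))


print(speech_presence_prob(0.5, 1.0, 0.0))
print(speech_presence_prob(0.02, 10.0, 20.0))
```

A large intermediate variable drives the exponential term toward 0, so the probability of speech presence approaches 1.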

4) Noise removal:

The enhanced separated sound $\hat{S}_n(k_i)$, which is the output, is derived by applying the optimal gain $G^{H1}_n(k_i)$ and the probability of speech presence $p_n(k_i)$ to the input separated sound spectrum $Y_n(k_i)$.

$$\hat{S}_n(k_i) = Y_n(k_i)\, G^{H1}_n(k_i)\, p_n(k_i) \tag{100}$$
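The final step can be sketched for a whole frame as follows. Python dictionaries keyed by sound source ID stand in for the Map<int, ObjectRef> data of INPUT_SPEC and OUTPUT_SPEC. This representation and the function name `apply_postfilter` are assumptions made for illustration only.

```python
def apply_postfilter(input_spec, gain, presence):
    """Eq. (100) applied to every source and every frequency bin.

    input_spec : {source_id: [complex Y_n(k_i) for each bin]}
    gain       : {source_id: [G^H1_n(k_i) for each bin]}
    presence   : {source_id: [p_n(k_i) for each bin]}
    """
    output_spec = {}
    for src_id, spectrum in input_spec.items():
        output_spec[src_id] = [
            y * g * p
            for y, g, p in zip(spectrum, gain[src_id], presence[src_id])
        ]
    return output_spec


out = apply_postfilter({0: [1 + 1j, 2 + 0j]},
                       {0: [0.5, 1.0]},
                       {0: [1.0, 0.5]})
print(out)
```

Because the gain and the probability are both real values no greater than 1, each bin is attenuated while its phase is left unchanged.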