This node post-processes the complex spectra separated by the sound source separation node GHDSS in order to improve speech recognition accuracy. At the same time, it estimates the noise power spectra needed to generate Missing Feature Masks.
No files are required.
When to use
This node is used to shape the spectra separated by the GHDSS node and to generate the noise spectra required for generating Missing Feature Masks.
Typical connection
Figure 469 shows an example of a connection for the PostFilter node. The output of the GHDSS node is connected to the INPUT_SPEC input and the output of the BGNEstimator node is connected to the INIT_NOISE_POWER input. Figure 469 shows examples for typical output connections:
Speech feature extraction from separated sound (OUTPUT_SPEC) (MSLSExtraction node)
Generation of Missing Feature Masks for speech recognition from the separated sound and the power of the noise it contains (EST_NOISE_POWER) (MFMGeneration node)
Input
INPUT_SPEC: Map<int, ObjectRef> type. The same type as the output of the GHDSS node. Each entry pairs a sound source ID with the complex spectrum of the separated sound, stored as Vector<complex<float> > type data.
INIT_NOISE_POWER: Matrix<float> type. The power spectrum of the stationary noise estimated by the BGNEstimator node.
Output
OUTPUT_SPEC: Map<int, ObjectRef> type. The ObjectRef is the complex spectrum from the input INPUT_SPEC with noise removed.
EST_NOISE_POWER: Map<int, ObjectRef> type. For each separated sound in OUTPUT_SPEC, the power of the noise estimated to be contained in it is paired with the source ID as Vector<float> type data.
Parameter
Parameter name | Type | Default value | Description
MCRA_SETTING | | false | When the user sets parameters for the MCRA estimation, which is a noise removal method, select true.
MCRA_SETTING | The following are valid when MCRA_SETTING is set to true.
STATIONARY_NOISE_FACTOR | | 1.2 | Coefficient at the time of stationary noise estimation.
SPEC_SMOOTH_FACTOR | | 0.5 | Smoothing coefficient of an input power spectrum.
AMP_LEAK_FACTOR | | 1.5 | Leakage coefficient.
STATIONARY_NOISE_MIXTURE_FACTOR | | 0.98 | Mixing ratio of stationary noise.
LEAK_FLOOR | | 0.1 | Minimum value of leakage noise.
BLOCK_LENGTH | | 80 | Detection time width.
VOICEP_THRESHOLD | | 3 | Threshold value of speech presence judgment.
EST_LEAK_SETTING | | false | When the user sets parameters related to the leakage rate estimation, select true.
EST_LEAK_SETTING | The following are valid when EST_LEAK_SETTING is set to true.
LEAK_FACTOR | | 0.25 | Leakage rate.
OVER_CANCEL_FACTOR | | 1 | Leakage rate weighting factor.
EST_REV_SETTING | | false | When the user sets parameters related to the reverberation component estimation, select true.
EST_REV_SETTING | The following are valid when EST_REV_SETTING is set to true.
REVERB_DECAY_FACTOR | | 0.5 | Damping coefficient of reverberant power.
DIRECT_DECAY_FACTOR | | 0.2 | Damping coefficient of a separated spectrum.
EST_SN_SETTING | | false | When the user sets parameters related to the SN ratio estimation, select true.
EST_SN_SETTING | The following are valid when EST_SN_SETTING is set to true.
PRIOR_SNR_FACTOR | | 0.8 | Ratio of a priori and a posteriori SNRs.
VOICEP_PROB_FACTOR | | 0.9 | Amplitude coefficient of the probability of speech presence.
MIN_VOICEP_PROB | | 0.05 | Minimum probability of speech presence.
MAX_PRIOR_SNR | | 100 | Maximum value of preliminary SNR.
MAX_OPT_GAIN | | 20 | Maximum value of the optimal gain intermediate variable v.
MIN_OPT_GAIN | | 6 | Minimum value of the optimal gain intermediate variable v.
EST_VOICEP_SETTING | | false | When the user sets parameters related to the speech probability estimation, select true.
EST_VOICEP_SETTING | The following are valid when EST_VOICEP_SETTING is set to true.
PRIOR_SNR_SMOOTH_FACTOR | | 0.7 | Time smoothing coefficient.
MIN_FRAME_SMOOTH_SNR | | 0.1 | Minimum value of the frequency smoothing SNR (frame).
MAX_FRAME_SMOOTH_SNR | | 0.316 | Maximum value of the frequency smoothing SNR (frame).
MIN_GLOBAL_SMOOTH_SNR | | 0.1 | Minimum value of the frequency smoothing SNR (global).
MAX_GLOBAL_SMOOTH_SNR | | 0.316 | Maximum value of the frequency smoothing SNR (global).
MIN_LOCAL_SMOOTH_SNR | | 0.1 | Minimum value of the frequency smoothing SNR (local).
MAX_LOCAL_SMOOTH_SNR | | 0.316 | Maximum value of the frequency smoothing SNR (local).
UPPER_SMOOTH_FREQ_INDEX | | 99 | Frequency smoothing upper limit bin index.
LOWER_SMOOTH_FREQ_INDEX | | 8 | Frequency smoothing lower limit bin index.
GLOBAL_SMOOTH_BANDWIDTH | | 29 | Frequency smoothing band width (global).
LOCAL_SMOOTH_BANDWIDTH | | 5 | Frequency smoothing band width (local).
FRAME_SMOOTH_SNR_THRESH | | 1.5 | Threshold value of frequency smoothing SNR.
MIN_SMOOTH_PEAK_SNR | | 1.0 | Minimum value of the frequency smoothing SNR peak.
MAX_SMOOTH_PEAK_SNR | | 10.0 | Maximum value of the frequency smoothing SNR peak.
FRAME_VOICEP_PROB_FACTOR | | 0.7 | Speech probability smoothing coefficient (frame).
GLOBAL_VOICEP_PROB_FACTOR | | 0.9 | Speech probability smoothing coefficient (global).
LOCAL_VOICEP_PROB_FACTOR | | 0.9 | Speech probability smoothing coefficient (local).
MIN_VOICE_PAUSE_PROB | | 0.02 | Minimum value of the speech pause probability.
MAX_VOICE_PAUSE_PROB | | 0.98 | Maximum value of the speech pause probability.
The subscripts used in the equations follow the definitions in Table 6.1. The time frame index f is omitted in the following equations unless it is needed. Figure 6.55 shows a flowchart of the PostFilter node. The inputs are a separated sound spectrum from the GHDSS node and a stationary noise power spectrum from the BGNEstimator node. The outputs are the separated sound spectrum with the speech enhanced and the power spectrum of the noise mixed into the separated sound. The processing flow is as follows.
Noise estimation
SNR estimation
Speech presence probability estimation
Noise removal
1) Noise estimation:
Figure 6.56 shows the processing flow of noise estimation. The PostFilter node processes three kinds of noise:
a) Stationary noise, to which the contact points of the microphones contribute,
b) The sound of other sound sources that cannot be completely removed (leakage noise),
c) Reverberations from the previous frame.
The noise contained in the final separated sound, $\lambda(f,k_i)$, is obtained by the following equation.
$$\lambda(f,k_i) = \lambda^{sta}(f,k_i) + \lambda^{leak}(f,k_i) + \lambda^{rev}(f-1,k_i) \tag{63}$$
Here, $\lambda^{sta}(f,k_i)$, $\lambda^{leak}(f,k_i)$ and $\lambda^{rev}(f-1,k_i)$ denote the stationary noise, the leakage noise and the reverberation from the previous frame, respectively.
The parameters used in 1-a) are based on Table 6.48.
Parameter | Description, corresponding parameter
$Y(k_i)=[Y_1(k_i),\dots,Y_N(k_i)]^T$ | Complex spectrum of separated sound corresponding to the frequency bin $k_i$
$\lambda^{init}(k_i)=[\lambda^{init}_1(k_i),\dots,\lambda^{init}_N(k_i)]^T$ | Initial value power spectrum used for the stationary noise estimation
$\lambda^{sta}(k_i)=[\lambda^{sta}_1(k_i),\dots,\lambda^{sta}_N(k_i)]^T$ | Estimated stationary noise power spectrum
$\alpha_s$ | Smoothing coefficient of the input power spectrum. Parameter SPEC_SMOOTH_FACTOR. The default value is 0.5
$S^{tmp}(k_i)=[S^{tmp}_1(k_i),\dots,S^{tmp}_N(k_i)]$ | Temporary parameter for minimum power calculation
$S^{min}(k_i)=[S^{min}_1(k_i),\dots,S^{min}_N(k_i)]$ | The parameter that maintains the minimum power
$L$ | Number of frames over which $S^{tmp}$ is maintained. Parameter BLOCK_LENGTH. The default value is 80
$\delta$ | Threshold value of speech presence judgment. Parameter VOICEP_THRESHOLD. The default value is 3.0
$\alpha_d$ | Mixing ratio of estimated stationary noise. Parameter STATIONARY_NOISE_MIXTURE_FACTOR. The default value is 0.98
$Y^{leak}(k_i)$ | Power spectrum of the leakage noise estimated to be contained in the separated sound
$q$ | Coefficient used when leakage noise is removed from the input separated sound power. Parameter AMP_LEAK_FACTOR. The default value is 1.5
$S^{floor}$ | Minimum value of leakage noise. Parameter LEAK_FLOOR. The default value is 0.1
$r$ | Coefficient at the time of stationary noise estimation. Parameter STATIONARY_NOISE_FACTOR. The default value is 1.2
First, calculate the smoothed power spectrum $S(f,k_i)=[S_1(f,k_i),\dots,S_N(f,k_i)]$ by smoothing the input power spectrum with the power from the previous frame.
$$S_n(f,k_i) = \alpha_s S_n(f-1,k_i) + (1-\alpha_s)|Y_n(k_i)|^2 \tag{64}$$
Next, update $S^{min}$ and $S^{tmp}$.
$$S^{min}_n(f,k_i) = \begin{cases} \min\{S^{min}_n(f-1,k_i),\, S_n(f,k_i)\} & \text{if } f \neq nL \\ \min\{S^{tmp}_n(f-1,k_i),\, S_n(f,k_i)\} & \text{if } f = nL \end{cases} \tag{65}$$
$$S^{tmp}_n(f,k_i) = \begin{cases} \min\{S^{tmp}_n(f-1,k_i),\, S_n(f,k_i)\} & \text{if } f \neq nL \\ S_n(f,k_i) & \text{if } f = nL \end{cases} \tag{66}$$
Here, n in the condition on f denotes an arbitrary integer. $S^{min}$ maintains the minimum power after the noise estimation begins, and $S^{tmp}$ maintains the minimum power over recent frames. $S^{tmp}$ is re-initialized every L frames. Next, judge whether the frame contains speech based on the ratio of the input separated sound power to the minimum power.
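The recursion in Equations (64) to (66) can be sketched for a single source and a single frequency bin as follows. This is a minimal illustration, not the HARK implementation: the function names are hypothetical, the update of the temporary minimum follows the description of Stmp in the text, and frame 0 is assumed to count as a multiple of L.

```python
def smooth_power(prev_s, y, alpha_s=0.5):
    """Equation (64): recursive smoothing of the input power |Y|^2."""
    return alpha_s * prev_s + (1.0 - alpha_s) * abs(y) ** 2


def update_minima(f, s, s_min_prev, s_tmp_prev, block_length=80):
    """Equations (65) and (66): track the minimum power.

    f is the frame index, s the smoothed power of the current frame.
    Returns the pair (s_min, s_tmp) for the current frame.
    """
    if f % block_length == 0:
        # Block boundary (f = nL): hand the recent minimum over to s_min
        # and restart the temporary minimum from the current frame.
        s_min = min(s_tmp_prev, s)
        s_tmp = s
    else:
        s_min = min(s_min_prev, s)
        s_tmp = min(s_tmp_prev, s)
    return s_min, s_tmp
```

With the default BLOCK_LENGTH of 80, the temporary minimum restarts at frames 0, 80, 160 and so on, which lets the tracked minimum rise again when the noise level increases.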
$$S^r_n(k_i) = \frac{S_n(k_i)}{S^{min}_n(k_i)} \tag{67}$$
$$I_n(k_i) = \begin{cases} 1 & \text{if } S^r_n(k_i) > \delta \\ 0 & \text{if } S^r_n(k_i) \le \delta \end{cases} \tag{68}$$
When speech is included, $I_n(k_i)$ is 1; when it is not included, it is 0. Based on this result, determine the mixing ratio $\alpha^C_{d,n}(k_i)$ of the estimated stationary noise for the frame.
$$\alpha^C_{d,n}(k_i) = (\alpha_d - 1) I_n(k_i) + 1 \tag{69}$$
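Equations (67) to (69) amount to a threshold test followed by a two-valued mixing ratio. The sketch below illustrates this for one source and one bin; the function names are hypothetical and the minimum power is assumed to be strictly positive.

```python
def speech_flag(s, s_min, delta=3.0):
    """Equations (67) and (68): 1 if the power ratio exceeds delta, else 0."""
    return 1 if s / s_min > delta else 0


def mixing_ratio(s, s_min, alpha_d=0.98, delta=3.0):
    """Equation (69): alpha_d when the flag is 1, and 1 when the flag is 0."""
    return (alpha_d - 1.0) * speech_flag(s, s_min, delta) + 1.0
```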
Next, subtract the leakage noise contained in the power spectrum of the separated sound.
$$S^{leak}_n(k_i) = \sum_{p=1}^{N} |Y_p(k_i)|^2 - |Y_n(k_i)|^2 \tag{70}$$
$$S^0_n(k_i) = |Y_n(k_i)|^2 - q\, S^{leak}_n(k_i) \tag{71}$$
Here, when $S^0_n(k_i) < S^{floor}$, the value is replaced as follows.
$$S^0_n(k_i) = S^{floor} \tag{72}$$
Obtain the stationary noise of the current frame by mixing the power spectrum with leakage noise removed, $S^0_n(f,k_i)$, with either the estimated stationary noise of the previous frame, $\lambda^{sta}(f-1,k_i)$, or $\lambda^{init}(f,k_i)$, which is the output of BGNEstimator.
$$\lambda^{sta}_n(f,k_i) = \begin{cases} \alpha^C_{d,n}(k_i)\,\lambda^{sta}_n(f-1,k_i) + (1-\alpha^C_{d,n}(k_i))\, r\, S^0_n(f,k_i) & \text{if no change in source position} \\ \alpha^C_{d,n}(k_i)\,\lambda^{init}_n(f,k_i) + (1-\alpha^C_{d,n}(k_i))\, r\, S^0_n(f,k_i) & \text{if change in source position} \end{cases} \tag{73}$$
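The leakage subtraction and the stationary noise update in Equations (70) to (73) can be sketched as below for one frequency bin. The function names are hypothetical. The caller is assumed to pass either the previous stationary noise estimate or the BGNEstimator output as `prev_noise`, depending on whether the source position changed.

```python
def remove_leak(y_all, n, q=1.5, s_floor=0.1):
    """Equations (70) to (72) for source n.

    y_all holds the complex spectra of all N separated sounds at one bin.
    """
    powers = [abs(y) ** 2 for y in y_all]
    s_leak = sum(powers) - powers[n]   # power of all other sources
    s0 = powers[n] - q * s_leak        # own power with leakage removed
    return max(s0, s_floor)            # floor at LEAK_FLOOR


def update_stationary_noise(alpha_c, prev_noise, s0, r=1.2):
    """Equation (73): mix the previous estimate with the scaled current power."""
    return alpha_c * prev_noise + (1.0 - alpha_c) * r * s0
```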
The variables used in 1-b) are based on Table 6.49.
Variable | Description, corresponding parameter
$\lambda^{leak}(k_i)$ | Power spectrum of leakage noise. Vector comprising elements of each separated sound.
$\alpha^{leak}$ | Leakage rate for the total of separated sound power. LEAK_FACTOR × OVER_CANCEL_FACTOR
$S_n(f,k_i)$ | Smoothed power spectrum obtained by Equation (64)
Two coefficients are calculated as follows.
$$\beta = -\frac{\alpha^{leak}}{1 - (\alpha^{leak})^2 + \alpha^{leak}(1-\alpha^{leak})(N-2)} \tag{74}$$
$$\alpha = 1 - (N-1)\,\alpha^{leak}\beta \tag{75}$$
With these coefficients, mix the smoothed spectrum $S_n(k_i)$ with $S^{leak}_n(k_i)$, the power of the other separated sounds obtained by Equation (70), that is, the total power with the power of the own separated sound removed.
$$Z_n(k_i) = \alpha S_n(k_i) + \beta S^{leak}_n(k_i) \tag{76}$$
Here, when $Z_n(k_i) < 1$, set $Z_n(k_i) = 1$. The final power spectrum of leakage noise $\lambda^{leak}(k_i)$ is obtained as follows.
$$\lambda^{leak}_n(k_i) = \alpha^{leak} \left( \sum_{n' \neq n} Z_{n'}(k_i) \right) \tag{77}$$
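As an illustration of Equations (74) to (77), the following sketch computes the leakage noise of every source at one frequency bin. The function names are hypothetical, and the leakage rate is taken as LEAK_FACTOR × OVER_CANCEL_FACTOR, which is 0.25 with the default values.

```python
def leak_coefficients(alpha_leak, n_sources):
    """Equations (74) and (75): return the mixing coefficients (alpha, beta)."""
    denom = (1.0 - alpha_leak ** 2
             + alpha_leak * (1.0 - alpha_leak) * (n_sources - 2))
    beta = -alpha_leak / denom
    alpha = 1.0 - (n_sources - 1) * alpha_leak * beta
    return alpha, beta


def leak_noise(s, s_leak, alpha_leak=0.25):
    """Equations (76) and (77) for all sources at one bin.

    s[n] is the smoothed power of source n and s_leak[n] the summed power
    of the other sources from Equation (70).
    """
    alpha, beta = leak_coefficients(alpha_leak, len(s))
    # Equation (76), with values below 1 replaced by 1 as stated in the text.
    z = [max(alpha * s_n + beta * l_n, 1.0) for s_n, l_n in zip(s, s_leak)]
    total = sum(z)
    # Equation (77): sum over all sources except n itself.
    return [alpha_leak * (total - z_n) for z_n in z]
```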
The variables used in 1-c) are based on Table 6.50.
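Variable | Description, corresponding parameter
$\lambda^{rev}(f,k_i)$ | Power spectrum of the reverberation in the time frame f
$\hat{S}(f-1,k_i)$ | Spectrum of the enhanced separated sound output by the PostFilter node in the previous frame f−1
$$\lambda^{leak}_n(k_i) = \alpha^{leak} \left( \sum_{n' \neq n} Z_{n'}(k_i) \right) \tag{78}$$
$$\lambda^{rev}_n(f,k_i) = \gamma \left( \lambda^{rev}_n(f-1,k_i) + \Delta\, |\hat{S}_n(f-1,k_i)|^2 \right) \tag{79}$$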
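Equation (79) is a one-line recursion. The sketch below assumes that γ corresponds to REVERB_DECAY_FACTOR (default 0.5) and Δ to DIRECT_DECAY_FACTOR (default 0.2). This mapping is inferred from the parameter descriptions and is not stated explicitly in the text; the function name is hypothetical.

```python
def update_reverb(prev_rev, prev_s_hat, gamma=0.5, delta=0.2):
    """Equation (79): decay the previous reverberation power and add a
    fraction of the previous frame's enhanced output power.

    gamma and delta are assumed to map to REVERB_DECAY_FACTOR and
    DIRECT_DECAY_FACTOR respectively.
    """
    return gamma * (prev_rev + delta * abs(prev_s_hat) ** 2)
```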
2) SNR estimation:
Figure 6.57 shows the flow of the SNR estimation. The SNR estimation consists of the following steps:
a) Calculation of SNR
b) Estimation of a speech content rate
c) Preliminary SNR estimation before noise mixture
d) Estimation of an optimal gain
Variable | Description, corresponding parameter
$Y(k_i)$ | Complex spectra of the separated sound, which is an input of the PostFilter node
$\hat{S}(k_i)$ | Complex spectra of the formed separated sound, which is an output of the PostFilter node
$\lambda(k_i)$ | Power spectrum of noise estimated above
$\gamma_n(k_i)$ | SNR of the separated sound n
$\alpha^p_n(k_i)$ | Speech content rate
$\xi_n(k_i)$ | Preliminary SNR
$G^{H1}(k_i)$ | Optimal gain to improve SNR of the separated sound
The vector elements in Table 6.51 indicate the value for each separated sound.
The variables used in 2-a) are based on Table 6.51. Here, the SNR $\gamma_n(k_i)$ is calculated from the complex spectra $Y(k_i)$ of the input and the power spectrum of the noise estimated above.
$$\gamma_n(k_i) = \frac{|Y_n(k_i)|^2}{\lambda_n(k_i)} \tag{80}$$
$$\gamma^C_n(k_i) = \begin{cases} \gamma_n(k_i) & \text{if } \gamma_n(k_i) > 0 \\ 0 & \text{otherwise} \end{cases} \tag{81}$$
That is, when $\gamma_n(k_i) < 0$, the value is replaced by 0.
The variables used in 2-b) are based on Table 6.52.
Variable | Description, corresponding parameter
$\alpha^p_{mag}$ | Preliminary SNR coefficient. Parameter VOICEP_PROB_FACTOR. The default value is 0.9.
$\alpha^p_{min}$ | Minimum speech content rate. Parameter MIN_VOICEP_PROB. The default value is 0.05.
The speech content rate $\alpha^p_n(f,k_i)$ is calculated as follows from the preliminary SNR $\xi_n(f-1,k_i)$ of the previous frame.
$$\alpha^p_n(f,k_i) = \alpha^p_{mag} \left( \frac{\xi_n(f-1,k_i)}{\xi_n(f-1,k_i)+1} \right)^2 + \alpha^p_{min} \tag{82}$$
The variables used in 2-c) are based on Table 6.53.
Variable | Description, corresponding parameter
$a$ | Internal ratio of the previous frame SNR. Parameter PRIOR_SNR_FACTOR. The default value is 0.8.
$\xi^{max}$ | Upper limit of the preliminary SNR. Parameter MAX_PRIOR_SNR. The default value is 100.
The preliminary SNR $\xi_n(k_i)$ is calculated as follows.
$$\xi_n(k_i) = (1-\alpha^p_n(k_i))\,\xi_{tmp} + \alpha^p_n(k_i)\,\gamma^C_n(k_i) \tag{83}$$
$$\xi_{tmp} = a\,\frac{|\hat{S}_n(f-1,k_i)|^2}{\lambda_n(f-1,k_i)} + (1-a)\,\xi_n(f-1,k_i) \tag{84}$$
Here, $\xi_{tmp}$ is a temporary variable: a weighted average, with weight a, of the SNR estimated from the previous frame's output and the preliminary SNR $\xi_n(f-1,k_i)$ of the previous frame. Moreover, when $\xi_n(k_i) > \xi^{max}$, the value is replaced by $\xi_n(k_i) = \xi^{max}$.
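Steps 2-a) to 2-c), Equations (80) to (84), can be sketched for one source and one frequency bin as follows. The function and argument names are hypothetical, and the noise powers are assumed to be strictly positive.

```python
def speech_content_rate(xi_prev, alpha_mag=0.9, alpha_min=0.05):
    """Equation (82): speech content rate from the previous preliminary SNR."""
    return alpha_mag * (xi_prev / (xi_prev + 1.0)) ** 2 + alpha_min


def prior_snr(y, s_hat_prev, noise, noise_prev, xi_prev, a=0.8, xi_max=100.0):
    """Equations (80), (81), (83) and (84).

    Returns (xi, gamma): the preliminary SNR and the SNR of Equation (80).
    """
    gamma = abs(y) ** 2 / noise                     # Equation (80)
    gamma_c = max(gamma, 0.0)                       # Equation (81)
    alpha_p = speech_content_rate(xi_prev)          # Equation (82)
    xi_tmp = (a * abs(s_hat_prev) ** 2 / noise_prev
              + (1.0 - a) * xi_prev)                # Equation (84)
    xi = (1.0 - alpha_p) * xi_tmp + alpha_p * gamma_c   # Equation (83)
    return min(xi, xi_max), gamma                   # clip at MAX_PRIOR_SNR
```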
The variables used in 2-d) are based on Table 6.54.
Variable | Description, corresponding parameter
$\theta^{max}$ | Maximum value of the intermediate variable $v_n(k_i)$. Parameter MAX_OPT_GAIN. The default value is 20.
$\theta^{min}$ | Minimum value of the intermediate variable $v_n(k_i)$. Parameter MIN_OPT_GAIN. The default value is 6.
Prior to calculating the optimal gain, the following intermediate variable $v_n(k_i)$ is calculated from the preliminary SNR $\xi_n(k_i)$ obtained above and the estimated SNR $\gamma_n(k_i)$.
$$v_n(k_i) = \frac{\xi_n(k_i)}{1+\xi_n(k_i)}\,\gamma_n(k_i) \tag{85}$$
When $v_n(k_i) > \theta^{max}$, set $v_n(k_i) = \theta^{max}$. The optimal gain $G^{H1}(k_i)=[G^{H1}_1(k_i),\dots,G^{H1}_N(k_i)]$ when speech exists is obtained as follows.
$$G^{H1}_n(k_i) = \frac{\xi_n(k_i)}{1+\xi_n(k_i)} \exp\left\{ \frac{1}{2} \int_{v_n(k_i)}^{\infty} \frac{e^{-t}}{t}\,dt \right\} \tag{86}$$
Here,
$$G^{H1}_n(k_i) = 1 \ \text{if } v_n(k_i) < \theta^{min}, \qquad G^{H1}_n(k_i) = 1 \ \text{if } G^{H1}_n(k_i) > 1. \tag{87}$$
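The integral in Equation (86) is the exponential integral E1. The sketch below evaluates it with a truncated continued fraction, which is a choice made for this illustration and not necessarily what HARK does, and then applies the limits of Equations (85) and (87) as stated in the text. Function names are hypothetical.

```python
import math


def exp1(x, terms=60):
    """E1(x) = integral from x to infinity of exp(-t)/t dt, for x > 0.

    Uses the continued fraction
    E1(x) = exp(-x) / (x + 1/(1 + 1/(x + 2/(1 + 2/(x + ...))))),
    truncated after `terms` levels. Intended for x of about 1 or larger.
    """
    f = 0.0
    for k in range(terms, 0, -1):
        f = k / (1.0 + k / (x + f))
    return math.exp(-x) / (x + f)


def optimal_gain(xi, gamma, theta_max=20.0, theta_min=6.0):
    """Equations (85) to (87). Returns (gain, v)."""
    v = min(xi / (1.0 + xi) * gamma, theta_max)   # Equation (85) with upper limit
    if v < theta_min:
        return 1.0, v                             # first case of Equation (87)
    gain = xi / (1.0 + xi) * math.exp(0.5 * exp1(v))   # Equation (86)
    return min(gain, 1.0), v                      # second case of Equation (87)
```

Because the integral is only evaluated for v between the two limits, it stays very small and the gain is close to ξ/(1+ξ) in that range.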
3) Estimation of probability of speech presence
Figure 6.58 shows the flow of estimation of probability of speech presence. Estimation of the probability of speech presence consists of:
a) Smoothing of the preliminary SNR for each of the 3 types of bands
b) Estimation of the provisional probability of speech presence based on the smoothed SNR in each band
c) Estimation of the speech pause probability based on the three provisional probabilities
d) Estimation of the final probability of speech presence.
The variables used in 3-a) are summarized in Table 6.55.
Variable | Description, corresponding parameter
$\zeta_n(k_i)$ | Temporally smoothed preliminary SNR
$\xi_n(k_i)$ | Preliminary SNR
$\zeta^f_n(k_i)$ | Frequency-smoothed SNR (frame)
$\zeta^g_n(k_i)$ | Frequency-smoothed SNR (global)
$\zeta^l_n(k_i)$ | Frequency-smoothed SNR (local)
$b$ | Parameter PRIOR_SNR_SMOOTH_FACTOR. The default value is 0.7
$F_{st}$ | Parameter LOWER_SMOOTH_FREQ_INDEX. The default value is 8
$F_{en}$ | Parameter UPPER_SMOOTH_FREQ_INDEX. The default value is 99
$G$ | Parameter GLOBAL_SMOOTH_BANDWIDTH. The default value is 29
$L$ | Parameter LOCAL_SMOOTH_BANDWIDTH. The default value is 5
First, temporal smoothing is performed using the preliminary SNR $\xi_n(f,k_i)$ calculated by Equation (83) and the temporally smoothed preliminary SNR $\zeta_n(f-1,k_i)$ of the previous frame.
$$\zeta_n(f,k_i) = b\,\zeta_n(f-1,k_i) + (1-b)\,\xi_n(f,k_i) \tag{88}$$
Smoothing along the frequency axis is then performed at three scales, with the smoothing width decreasing in the order frame, global, local.
Frequency smoothing in frame
Smoothing by averaging is performed over the frequency bin range $F_{st}$ to $F_{en}$.
$$\zeta^f_n(k_i) = \frac{1}{F_{en}-F_{st}+1} \sum_{k_j=F_{st}}^{F_{en}} \zeta_n(k_j) \tag{89}$$
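Equations (88) and (89) can be illustrated with plain Python lists holding one value per frequency bin. The function names are hypothetical, and the spectrum is assumed to contain at least `f_en + 1` bins.

```python
def smooth_in_time(zeta_prev, xi, b=0.7):
    """Equation (88): first-order recursive smoothing, bin by bin."""
    return [b * z + (1.0 - b) * x for z, x in zip(zeta_prev, xi)]


def frame_smoothed_snr(zeta, f_st=8, f_en=99):
    """Equation (89): average over bins f_st to f_en inclusive.

    The result does not depend on the bin index, so one value is returned
    for the whole frame.
    """
    return sum(zeta[f_st:f_en + 1]) / (f_en - f_st + 1)
```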
Global frequency smoothing
Frequency smoothing with a Hanning window of width G is performed globally.
$$\zeta^g_n(k_i) = \sum_{j=-(G-1)/2}^{(G-1)/2} w_{han}(j+(G-1)/2)\,\zeta_n(k_i+j) \tag{90}$$
$$w_{han}(j) = \frac{1}{C}\left(0.5 - 0.5\cos\left(\frac{2\pi j}{G}\right)\right) \tag{91}$$
Here, C is a normalization coefficient chosen so that $\sum_{j=0}^{G-1} w_{han}(j) = 1$.
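The window of Equation (91) and the weighted sum of Equation (90) can be sketched as follows. Two points are assumptions of this illustration and are not specified in the text: G is taken to be odd, and bins that fall outside the spectrum are skipped. The function names are hypothetical.

```python
import math


def hann_weights(g=29):
    """Equation (91): window weights normalized so that they sum to 1."""
    raw = [0.5 - 0.5 * math.cos(2.0 * math.pi * j / g) for j in range(g)]
    c = sum(raw)  # normalization coefficient C
    return [w / c for w in raw]


def global_smoothed_snr(zeta, k, g=29):
    """Equation (90) at bin k, for an odd window width g."""
    w = hann_weights(g)
    half = (g - 1) // 2
    total = 0.0
    for j in range(-half, half + 1):
        idx = k + j
        if 0 <= idx < len(zeta):  # assumption: out-of-range bins are skipped
            total += w[j + half] * zeta[idx]
    return total
```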
Local frequency smoothing
Frequency smoothing with a triangular window of width L is performed locally.
$$\zeta^l_n(k_i) = 0.25\,\zeta_n(k_i-1) + 0.5\,\zeta_n(k_i) + 0.25\,\zeta_n(k_i+1) \tag{92}$$
The variables used in 3-b) are shown in Table 6.56.
Variable | Description, corresponding parameter
$\zeta^{f,g,l}_n(k_i)$ | SNR smoothed in each band
$P^{f,g,l}_n(k_i)$ | Provisional probability of speech in each band
$\zeta^{peak}_n(k_i)$ | Peak of smoothed SNR
$Z^{peak}_{min}$ | Parameter MIN_SMOOTH_PEAK_SNR. The default value is 1.
$Z^{peak}_{max}$ | Parameter MAX_SMOOTH_PEAK_SNR. The default value is 10.
$Z_{thres}$ | Parameter FRAME_SMOOTH_SNR_THRESH. The default value is 1.5.
$Z^{f,g,l}_{min}$ | Parameters MIN_FRAME_SMOOTH_SNR, MIN_GLOBAL_SMOOTH_SNR, MIN_LOCAL_SMOOTH_SNR. The default value is 0.1.
$Z^{f,g,l}_{max}$ | Parameters MAX_FRAME_SMOOTH_SNR, MAX_GLOBAL_SMOOTH_SNR, MAX_LOCAL_SMOOTH_SNR. The default value is 0.316.
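Calculation of $P^f_n(k_i)$ and $\zeta^{peak}_n(k_i)$
First, calculate $\zeta^{peak}_n(f,k_i)$ as follows.
$$\zeta^{peak}_n(f,k_i) = \begin{cases} \zeta^f_n(f,k_i) & \text{if } \zeta^f_n(f,k_i) > Z_{thres}\,\zeta^f_n(f-1,k_i) \\ \zeta^{peak}_n(f-1,k_i) & \text{otherwise} \end{cases} \tag{93}$$
Here, the value of $\zeta^{peak}_n(k_i)$ must lie within the range $[Z^{peak}_{min}, Z^{peak}_{max}]$. That is,
$$\zeta^{peak}_n(k_i) = \begin{cases} Z^{peak}_{min} & \text{if } \zeta^{peak}_n(k_i) < Z^{peak}_{min} \\ Z^{peak}_{max} & \text{if } \zeta^{peak}_n(k_i) > Z^{peak}_{max} \end{cases} \tag{94}$$
Next, $P^f_n(k_i)$ is obtained as follows.
$$P^f_n(k_i) = \begin{cases} 0 & \text{if } \zeta^f_n(k_i) < \zeta^{peak}_n(k_i)\,Z^f_{min} \\ 1 & \text{if } \zeta^f_n(k_i) > \zeta^{peak}_n(k_i)\,Z^f_{max} \\ \dfrac{\log\left(\zeta^f_n(k_i) / (\zeta^{peak}_n(k_i)\,Z^f_{min})\right)}{\log\left(Z^f_{max}/Z^f_{min}\right)} & \text{otherwise} \end{cases} \tag{95}$$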
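Calculation of $P^g_n(k_i)$
Calculate as follows.
$$P^g_n(k_i) = \begin{cases} 0 & \text{if } \zeta^g_n(k_i) < Z^g_{min} \\ 1 & \text{if } \zeta^g_n(k_i) > Z^g_{max} \\ \dfrac{\log\left(\zeta^g_n(k_i)/Z^g_{min}\right)}{\log\left(Z^g_{max}/Z^g_{min}\right)} & \text{otherwise} \end{cases} \tag{96}$$
Calculation of $P^l_n(k_i)$
Calculate as follows.
$$P^l_n(k_i) = \begin{cases} 0 & \text{if } \zeta^l_n(k_i) < Z^l_{min} \\ 1 & \text{if } \zeta^l_n(k_i) > Z^l_{max} \\ \dfrac{\log\left(\zeta^l_n(k_i)/Z^l_{min}\right)}{\log\left(Z^l_{max}/Z^l_{min}\right)} & \text{otherwise} \end{cases} \tag{97}$$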
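Equations (95) to (97) share one shape: a ramp from 0 to 1 that is linear in the logarithm of the smoothed SNR. The sketch below writes that ramp once and reuses it, together with the peak tracking of Equations (93) and (94). The function names are hypothetical. Dividing by the peak in `frame_probability` is equivalent to the thresholds of Equation (95) because the peak is clamped to a positive range.

```python
import math


def log_ramp(zeta, z_min=0.1, z_max=0.316):
    """Equations (96) and (97): 0 below z_min, 1 above z_max, log-linear between."""
    if zeta < z_min:
        return 0.0
    if zeta > z_max:
        return 1.0
    return math.log(zeta / z_min) / math.log(z_max / z_min)


def update_peak(zeta_f, zeta_f_prev, peak_prev,
                z_thres=1.5, peak_min=1.0, peak_max=10.0):
    """Equations (93) and (94): follow sudden rises, then clamp the peak."""
    peak = zeta_f if zeta_f > z_thres * zeta_f_prev else peak_prev
    return min(max(peak, peak_min), peak_max)


def frame_probability(zeta_f, peak, z_min=0.1, z_max=0.316):
    """Equation (95): the same ramp applied to the SNR relative to its peak."""
    return log_ramp(zeta_f / peak, z_min, z_max)
```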
The variables used in 3-c) are shown in Table 6.57.
Variable | Description, corresponding parameter
$q_n(k_i)$ | Probability of speech pause.
$a_f$ | Parameter FRAME_VOICEP_PROB_FACTOR. The default value is 0.7.
$a_g$ | Parameter GLOBAL_VOICEP_PROB_FACTOR. The default value is 0.9.
$a_l$ | Parameter LOCAL_VOICEP_PROB_FACTOR. The default value is 0.9.
$q_{min}$ | Parameter MIN_VOICE_PAUSE_PROB. The default value is 0.02.
$q_{max}$ | Parameter MAX_VOICE_PAUSE_PROB. The default value is 0.98.
As shown below, the probability of speech pause $q_n(k_i)$ is obtained by combining the provisional probabilities of speech $P^{f,g,l}_n(k_i)$ calculated from the smoothing results of the three frequency bands.
$$q_n(k_i) = 1 - (1-a_l+a_l P^l_n(k_i))\,(1-a_g+a_g P^g_n(k_i))\,(1-a_f+a_f P^f_n(k_i)) \tag{98}$$
Here, when $q_n(k_i) < q_{min}$, set $q_n(k_i) = q_{min}$, and when $q_n(k_i) > q_{max}$, set $q_n(k_i) = q_{max}$.
The probability of speech presence $p_n(k_i)$ is obtained from the probability of speech pause $q_n(k_i)$, the temporally smoothed preliminary SNR $\zeta_n(k_i)$ and the intermediate variable $v_n(k_i)$ derived by Equation (85).
$$p_n(k_i) = \left\{ 1 + \frac{q_n(k_i)}{1-q_n(k_i)}\,(1+\zeta_n(k_i))\,\exp(-v_n(k_i)) \right\}^{-1} \tag{99}$$
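Equations (98) and (99) can be sketched for one source and one frequency bin as follows. The function names are hypothetical. Because the pause probability is limited to at most MAX_VOICE_PAUSE_PROB, the division by 1 − q is always defined.

```python
import math


def pause_probability(p_l, p_g, p_f, a_l=0.9, a_g=0.9, a_f=0.7,
                      q_min=0.02, q_max=0.98):
    """Equation (98), followed by clamping to [q_min, q_max]."""
    q = 1.0 - ((1.0 - a_l + a_l * p_l)
               * (1.0 - a_g + a_g * p_g)
               * (1.0 - a_f + a_f * p_f))
    return min(max(q, q_min), q_max)


def presence_probability(q, zeta, v):
    """Equation (99): probability of speech presence."""
    return 1.0 / (1.0 + q / (1.0 - q) * (1.0 + zeta) * math.exp(-v))
```

When all three provisional probabilities are 1, the pause probability falls to its lower limit, and when they are all 0 it rises to its upper limit.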
4) Noise removal:
The enhanced separated sound $\hat{S}_n(k_i)$, which is the output, is derived by applying the optimal gain $G^{H1}_n(k_i)$ and the probability of speech presence $p_n(k_i)$ to the input separated sound spectrum $Y_n(k_i)$.
$$\hat{S}_n(k_i) = Y_n(k_i)\,G^{H1}_n(k_i)\,p_n(k_i) \tag{100}$$