9.6 Reducing noise leakage by post processing

Problem

Use this section when the separated sound contains distortion and you want to improve automatic speech recognition through speech enhancement.

Solution

This section describes the settings of the nodes related to speech enhancement: PostFilter, HRLE, WhiteNoiseAdder, and MFMGeneration.

Depending on the situation, recognition performance may be better without PostFilter. The parameters of PostFilter must be set appropriately for the given environment. The default parameters were determined in the environment used by the HARK development team, so there is no guarantee that they suit the user's environment.

PostFilter has many parameters, many of which are interdependent, so tuning them by hand is extremely difficult. One solution is to use a combinatorial optimization method. If a data set is available, apply an optimization method such as a Genetic Algorithm or an Evolution Strategy, using recognition rates and SNR as the evaluation criteria. Note that the system may learn parameters that are too specialized for the given environment.
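As a rough illustration of such automatic tuning, the following is a minimal sketch of a (1+lambda) evolution strategy over parameters constrained to the range 0 to 1. It is not part of HARK. The function `toy_recognition_rate` is a hypothetical stand-in for "run separation and recognition on the data set and return the recognition rate", and the number of parameters, `sigma`, and the population sizes are arbitrary assumptions.

```python
import random


def clip01(value):
    """Keep a parameter inside the range [0, 1]."""
    return min(1.0, max(0.0, value))


def evolve(evaluate, n_params, generations=50, offspring=8, sigma=0.1, seed=0):
    """(1+lambda) evolution strategy: keep the parent unless a child scores higher."""
    rng = random.Random(seed)
    best = [rng.random() for _ in range(n_params)]
    best_score = evaluate(best)
    for _ in range(generations):
        # Create children by adding Gaussian noise to every parameter of the parent.
        children = [[clip01(p + rng.gauss(0.0, sigma)) for p in best]
                    for _ in range(offspring)]
        top_score, top = max(((evaluate(c), c) for c in children),
                             key=lambda pair: pair[0])
        if top_score > best_score:
            best, best_score = top, top_score
    return best, best_score


def toy_recognition_rate(params):
    """Hypothetical objective; a real one would run recognition on a data set."""
    target = [0.3, 0.7, 0.5]
    return 1.0 - sum((p - t) ** 2 for p, t in zip(params, target))


best_params, best_rate = evolve(toy_recognition_rate, n_params=3)
```

To reduce the risk of learning parameters that are too specialized, one option is to compute the score on data that was recorded separately from the data used for the final evaluation.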

In PostFilter, stationary noise, reverberation, and noise leakage are estimated dynamically from the magnitude relationships of the input signal power, and more precisely separated sounds are obtained by subtracting these estimates. Under some conditions, performance may be degraded because this subtraction distorts the speech. The effect of PostFilter therefore depends on how well stationary noise, reverberation, and noise leakage are estimated. The influence of PostFilter can be minimized by setting the following parameters to 0.

To increase the influence of PostFilter, bring these values closer to 1.
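The effect of moving such a value between 0 and 1 can be pictured with a simplified weighted subtraction. This is only a sketch and not the actual PostFilter algorithm: the single `weight` argument is a hypothetical stand-in for the PostFilter parameters, and the power values are made up.

```python
def subtract_noise(signal_power, noise_power, weight):
    """Subtract a weighted noise estimate from each frequency bin, flooring at 0."""
    return [max(s - weight * n, 0.0) for s, n in zip(signal_power, noise_power)]


signal = [4.0, 2.0, 1.0]   # made-up power of the separated sound per bin
noise = [1.0, 1.5, 2.0]    # made-up noise estimate per bin

untouched = subtract_noise(signal, noise, 0.0)  # weight 0: the input passes through
full = subtract_noise(signal, noise, 1.0)       # weight 1: the full estimate is removed
```

With a weight of 0 the signal is unchanged. With a weight of 1 the last bin is floored at 0, which is the kind of over-subtraction that can distort speech when the noise estimate is too large.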

HRLE has far fewer parameters than PostFilter. HRLE enhances speech by calculating spectral histograms of the separated speech signals and detecting the difference between noise and speech. The design of the histogram therefore has a marked effect on speech enhancement performance. HRLE has five parameters: LX, TIME_CONSTANT, NUM_BIN, MIN_LEVEL, and STEP_LEVEL. The default settings are appropriate for all of them except LX. Because LX defines the boundary level between noise and speech, its best value depends on the acoustic environment. A higher LX suppresses high-power noise but increases acoustic distortion. Conversely, a lower LX reduces distortion but does not suppress high-power noise. Set LX appropriately for your environment.
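The role of LX can be sketched with a simplified histogram-based level estimate for a single frequency bin. This is an illustration under stated assumptions, not the HRLE implementation: the argument names mirror the HRLE parameters, but the values are illustrative rather than the defaults, and `decay` is a hypothetical per-frame forgetting factor standing in for the effect of TIME_CONSTANT.

```python
def estimate_noise_level(levels_db, lx, num_bin=100, min_level=0.0,
                         step_level=1.0, decay=1.0):
    """Return the level (dB) below which a fraction `lx` of the histogram mass lies."""
    hist = [0.0] * num_bin
    for level in levels_db:
        hist = [h * decay for h in hist]          # gradually forget old frames
        idx = int((level - min_level) / step_level)
        idx = min(max(idx, 0), num_bin - 1)       # clamp to the histogram range
        hist[idx] += 1.0
    target = lx * sum(hist)
    cumulative = 0.0
    for i, count in enumerate(hist):
        cumulative += count
        if cumulative >= target:
            return min_level + i * step_level
    return min_level + (num_bin - 1) * step_level


# Made-up data: eight noise-only frames at 10 dB and two speech frames at 40 dB.
levels = [10.0] * 8 + [40.0] * 2
low_lx_level = estimate_noise_level(levels, lx=0.5)
high_lx_level = estimate_noise_level(levels, lx=0.9)
```

In this toy data the lower LX places the estimated level at the noise frames, while the higher LX moves it up to the speech frames, so more of the signal would be treated as noise and suppressed.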

For WhiteNoiseAdder, adjust the value of WN_LEVEL. If it is too small, the added white noise cannot sufficiently mask the distortion generated in the separated sound. If it is too large, the added noise affects not only the distorted part but also the separated sound itself.
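As its name suggests, WhiteNoiseAdder adds white noise to the signal. The trade-off in choosing the noise level can be pictured with the following time-domain sketch. It is an assumption-laden illustration, not the node's implementation: `wn_level` is treated here as a peak amplitude for uniformly distributed noise, which may differ from how WN_LEVEL is interpreted in HARK, and the sample values are made up.

```python
import random


def add_white_noise(samples, wn_level, seed=0):
    """Add uniformly distributed noise in [-wn_level, wn_level] to every sample."""
    rng = random.Random(seed)
    return [s + rng.uniform(-wn_level, wn_level) for s in samples]


clean = [0.0, 100.0, -100.0, 50.0]   # made-up separated-sound samples
noisy = add_white_noise(clean, wn_level=15.0)
```

A level that is small relative to the signal leaves the samples almost unchanged, while a level comparable to the signal amplitude starts to bury the signal itself.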

For MFMGeneration, the threshold used to mask features can be changed by setting THRESHOLD to a value in the range from 0 to 1. When THRESHOLD is 0, no features are masked, which means that some unreliable features are used for speech recognition. As THRESHOLD approaches 1, more features are masked, and at 1 all features are masked, which means that no features are used. A THRESHOLD that is either too high or too low degrades speech recognition.
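The effect of THRESHOLD can be sketched as a simple comparison per feature. This is a minimal illustration, assuming that each feature has a reliability value between 0 and 1 and that a feature is kept only when its reliability exceeds the threshold. The function name and the reliability values are hypothetical.

```python
def make_mask(reliabilities, threshold):
    """Return 1 for features that are kept and 0 for features that are masked."""
    return [1 if r > threshold else 0 for r in reliabilities]


reliabilities = [0.05, 0.4, 0.6, 0.95]   # made-up reliability per feature

keep_all = make_mask(reliabilities, 0.0)    # nothing is masked
keep_some = make_mask(reliabilities, 0.5)   # only the more reliable features remain
keep_none = make_mask(reliabilities, 1.0)   # everything is masked
```

The middle setting is the one that discards the unreliable features while still leaving features for the recognizer to use.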

Discussion

See Also

The separation cannot be performed properly. What should I do?