VoiceActivityDetection Node

Outline of the node

This node delimits a speech-present period.

Typical connection

This node is connected with the VoiceActivityDetection node. A typical connection of this node is shown below:

_images/vv_connection.png

Input-output and property of the node

Input

AUDIO_SPECTRUM Matrixd<complex<float> >
Windowed spectrum data. Rows correspond to channels and columns correspond to frequency bins.

Output

VAD_DECISION Vector<ObjectRef>
Decision on whether the frame is speech-present.

Parameters

The parameters of this node are listed below:

Parameter name      Type   Default value  Unit    Description
VAD_NOISE_DURATION  float  3.0            second  Duration, counted from the first frame, that is regarded as noise.
VAD_THRESHOLD       float  50.0           -       Threshold for the voice activity decision.
ADVANCE             int    160            sample  Shift length in samples between consecutive frames.
SAMPLING_RATE       int    16000          Hz      Sampling rate.
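
As a rough illustration of how these parameters relate, the sketch below converts VAD_NOISE_DURATION into a number of leading frames using SAMPLING_RATE and ADVANCE. The function name noise_frame_count is hypothetical, and the truncating conversion is an assumption; the node's actual implementation may round differently.

```python
def noise_frame_count(noise_duration_s=3.0, sampling_rate_hz=16000, advance_samples=160):
    """Estimate how many leading frames fall inside the noise-only period.

    Hypothetical helper: it assumes one frame is produced every
    `advance_samples` samples and that partial frames are dropped.
    """
    noise_samples = int(noise_duration_s * sampling_rate_hz)
    return noise_samples // advance_samples


if __name__ == "__main__":
    # With the default values, one frame is produced every 160 / 16000 = 0.01 s.
    print(noise_frame_count())
```

With the default values, the frame shift is 160 / 16000 = 0.01 second, so a 3.0 second noise period corresponds to 300 frames under these assumptions.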

Detail of the node

This node estimates voice activity using the log-likelihood ratio of speech and noise variances under a zero-mean Gaussian statistical model [1]. Let \(X_{l/r}[f,n]\) be the input audio signal at frequency bin \(f\) and time frame \(n\). The method regards frame \(n\) as speech-present when the following condition is satisfied:

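\[ \frac{1}{F} \sum_{f=1}^{F} \left( \gamma[f,n] - \log \gamma[f,n] - 1 \right) > \eta_{\mathrm{VAD}}, \]

\[ \lambda_N[f] = E\left| N_l[f] N_r[f] \right|, \]

\[ \gamma[f,n] = \left| X_l[f,n] X_r[f,n] \right| / \lambda_N[f], \]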

where \(\lambda_N[f]\) and \(\eta_{\mathrm{VAD}}\) represent the variance of the estimated noise and the threshold parameter, respectively, and \(F\) is the number of frequency bins.
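
The decision rule can be sketched in a few lines of Python. This is a minimal illustration, not the node's implementation: the function names are hypothetical, the expectation in the noise variance is approximated by a sample mean over noise-only frames (an assumption), and the small constant eps that guards against division by zero and log(0) is also an assumption.

```python
import math


def estimate_noise_variance(noise_left, noise_right):
    """Approximate lambda_N[f] as the mean of |N_l[f] N_r[f]| over noise frames.

    noise_left, noise_right: lists of frames, each frame a list of complex bins.
    Assumption: the expectation is replaced by a sample mean.
    """
    num_frames = len(noise_left)
    num_bins = len(noise_left[0])
    sums = [0.0] * num_bins
    for frame_l, frame_r in zip(noise_left, noise_right):
        for f in range(num_bins):
            sums[f] += abs(frame_l[f] * frame_r[f])
    return [s / num_frames for s in sums]


def vad_statistic(x_left, x_right, noise_var, eps=1e-12):
    """Mean over frequency bins of gamma - log(gamma) - 1 for one frame."""
    total = 0.0
    for xl, xr, lam in zip(x_left, x_right, noise_var):
        # gamma[f, n] = |X_l[f, n] X_r[f, n]| / lambda_N[f]
        # eps is an assumed guard, not part of the documented formula.
        gamma = max(abs(xl * xr) / max(lam, eps), eps)
        total += gamma - math.log(gamma) - 1.0
    return total / len(noise_var)


def is_speech(x_left, x_right, noise_var, threshold=50.0):
    """Compare the frame statistic with the threshold (VAD_THRESHOLD default)."""
    return vad_statistic(x_left, x_right, noise_var) > threshold


if __name__ == "__main__":
    noise_l = [[1 + 0j, 2 + 0j]]
    noise_r = [[1 + 0j, 2 + 0j]]
    lam = estimate_noise_variance(noise_l, noise_r)
    print(is_speech([1 + 0j, 2 + 0j], [1 + 0j, 2 + 0j], lam))
    print(is_speech([20 + 0j, 40 + 0j], [20 + 0j, 40 + 0j], lam))
```

Each term gamma - log(gamma) - 1 is zero when gamma equals 1, that is, when the frame power matches the estimated noise variance, and it grows as the frame power departs from that level. A frame at the noise level therefore stays well below the threshold.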

References

[1] J. Sohn, N. S. Kim, and W. Sung, “A statistical model-based voice activity detection,” IEEE Signal Processing Letters, vol. 6, no. 1, pp. 1-3, January 1999.