VoiceActivityDetection Node¶
Outline of the node¶
This node delimits speech-present periods.
Typical connection¶
This node is connected with the VoiceActivityDetection node. A typical connection of this node is depicted as follows:

Input-output and property of the node¶
Input¶
- AUDIO_SPECTRUM Matrixd<complex<float> >
- Windowed spectrum data. A row index is channel, and a column index is frequency.
Output¶
- VAD_DECISION Vector<ObjectRef>
- Speech-presence decision for the frame.
Parameters¶
Parameters of this node are listed as follows:
Parameter name | Type | Default value | Unit | Description |
---|---|---|---|---|
VAD_NOISE_DURATION | float | 3.0 | second | Duration, counted from the first frame, that is regarded as “noise”. |
VAD_THRESHOLD | float | 50.0 | | Threshold for voice activity decision. |
ADVANCE | int | 160 | sample | Shift length, in samples, between a frame and the previous frame. |
SAMPLING_RATE | int | 16000 | Hz | Sampling rate. |
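
As a rough illustration of how the timing parameters relate, the sketch below converts VAD_NOISE_DURATION into a number of leading frames using ADVANCE and SAMPLING_RATE. The function name `noise_frame_count` is hypothetical, and the conversion itself is an assumption about how the node interprets these parameters, not something this page states.

```python
def noise_frame_count(noise_duration_s: float = 3.0,
                      advance: int = 160,
                      sampling_rate: int = 16000) -> int:
    """Number of leading frames covered by noise_duration_s.

    Assumption: one frame is produced every `advance` samples, so the
    noise period spans (duration * sampling_rate) / advance frames.
    """
    noise_samples = int(noise_duration_s * sampling_rate)
    return noise_samples // advance


if __name__ == "__main__":
    # With the default values in the table above.
    print(noise_frame_count())
```

With the defaults, one frame shift is 160 / 16000 = 0.01 s, so under this assumption the first 300 frames would be treated as noise.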
Detail of the node¶
This node estimates voice activity using the log-likelihood ratio of speech and noise variances under a zero-mean Gaussian statistical model [1]. Let $X_{l/r}[f,n]$ be an input audio signal at frequency bin $f$ and time frame $n$. The method regards a frame as speech-present when the following condition is satisfied:
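$$\frac{1}{F}\sum_{f=1}^{F}\left(\gamma[f,n]-\log\gamma[f,n]-1\right) > \eta_{\mathrm{VAD}},$$
$$\lambda_N[f] = E\left[\,\left|N_l[f]\cdot N_r[f]^{*}\right|\,\right],$$
$$\gamma[f,n] = \left|X_l[f,n]\cdot X_r[f,n]^{*}\right| / \lambda_N[f],$$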
where $\lambda_N[f]$ and $\eta_{\mathrm{VAD}}$ represent the variance of the estimated noise and the threshold parameter, respectively, and $F$ is the number of frequency bins.
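
The decision rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not the node's actual implementation: the function names `estimate_noise_variance` and `is_speech` are hypothetical, the expectation in the noise variance is approximated here by a simple average over the noise frames, and the small `eps` guard against division by zero and log of zero is an added assumption.

```python
import math


def estimate_noise_variance(noise_left, noise_right):
    """Approximate lambda_N[f] as the mean of |N_l[f] * conj(N_r[f])|.

    noise_left / noise_right: lists of frames, each frame a list of
    complex spectrum bins (assumed to come from the initial noise period).
    """
    num_frames = len(noise_left)
    num_bins = len(noise_left[0])
    return [
        sum(abs(noise_left[n][f] * noise_right[n][f].conjugate())
            for n in range(num_frames)) / num_frames
        for f in range(num_bins)
    ]


def is_speech(frame_left, frame_right, noise_var, threshold=50.0, eps=1e-12):
    """Return True when the averaged score exceeds the threshold."""
    num_bins = len(noise_var)
    total = 0.0
    for f in range(num_bins):
        cross = abs(frame_left[f] * frame_right[f].conjugate())
        # gamma[f, n]: cross-power relative to the noise variance.
        gamma = max(cross / max(noise_var[f], eps), eps)
        total += gamma - math.log(gamma) - 1.0
    return total / num_bins > threshold


if __name__ == "__main__":
    noise = [[1 + 0j, 1j], [1 + 0j, 1j]]          # two noise frames, two bins
    noise_var = estimate_noise_variance(noise, noise)
    print(is_speech([1 + 0j, 1j], [1 + 0j, 1j], noise_var))      # noise-like frame
    print(is_speech([10 + 0j, 10j], [10 + 0j, 10j], noise_var))  # louder frame
```

Each term $\gamma - \log\gamma - 1$ is zero when $\gamma = 1$ and grows as the input cross-power departs from the noise level, so a frame that matches the noise estimate scores near zero and stays below the threshold.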