This node reads speech waveform data from a WAVE file. The waveform data is read into a Matrix<float> type: indexed, multichannel audio waveform data with rows as channels and columns as samples.
The node requires an audio file in RIFF WAVE format. There is no limit on the number of channels or the sampling frequency. The samples are assumed to be linear PCM stored as 16-bit or 24-bit signed integers.
When to use
Use this node to feed WAVE files into the HARK system as input.
Typical connection
Figures 6.7 and 6.8 show usage examples of the AudioStreamFromWave node. In Figure 6.7, AudioStreamFromWave reads Matrix<float> type multichannel waveforms from a file, and the MultiFFT node converts them into the frequency domain. To read a file with AudioStreamFromWave, designate the filename in a Constant node (a standard FlowDesigner node) and generate a file descriptor with the InputStream node, as shown in Figure 6.8. Then connect the output of the InputStream node to the iterator subnetwork (LOAD_WAVE in Figure 6.8), which contains the network of HARK nodes such as AudioStreamFromWave.
Parameter name | Type | Default value | Unit | Description |
LENGTH | int | 512 | [pt] | Frame length as a fundamental unit for processing. |
ADVANCE | int | 160 | [pt] | Frame shift length. |
USE_WAIT | bool | false | | Designates whether processing is performed in real time. |
Input
: Stream type. Receives input from the InputStream node, a standard FlowDesigner node in the IO category.
Output
: Matrix<float> type. Indexed, multichannel audio waveform data with rows as channels and columns as samples. The number of columns is equal to the parameter LENGTH.
: bool type. Indicates whether the file can still be read. It is used as the termination flag for loop processing of files: it outputs false when the end of the file is reached and true otherwise.
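The relationship between the two outputs can be illustrated with a minimal Python sketch. It is not the HARK implementation: the function name `iter_frames`, the flag name `more_data`, and the zero-padding of the last frame are assumptions made for illustration only, and the actual behaviour of the node at the end of a file may differ.

```python
def iter_frames(waveform, length=512, advance=160):
    """Yield (frame, more_data) pairs from a channels-by-samples waveform.

    waveform is a list of channels, each a list of float samples.
    Each frame has one row per channel and `length` columns.
    Assumption: a frame that runs past the end of the data is zero-padded.
    """
    n_channels = len(waveform)
    n_samples = len(waveform[0]) if n_channels else 0
    start = 0
    while start < n_samples:
        frame = []
        for channel in waveform:
            chunk = list(channel[start:start + length])
            # Pad so that every frame has exactly `length` columns.
            chunk.extend([0.0] * (length - len(chunk)))
            frame.append(chunk)
        start += advance
        # The flag turns false once no further frame can be started.
        yield frame, start < n_samples
```

A caller would loop over the pairs and stop as soon as the flag is false, which mirrors how the bool output is used as the termination condition of the iterator subnetwork.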
Parameter
LENGTH: int type. The default value is 512. Designates the frame length, the base unit of processing, as a number of samples. The higher the value, the higher the frequency resolution but the lower the temporal resolution. A length corresponding to $20 \sim 40$ [ms] is known to be appropriate for the analysis of audio waveforms. The default value corresponds to 32 [ms] at a sampling frequency of 16,000 [Hz].
ADVANCE: int type. The default value is 160. Designates the frame shift length in samples. The default value corresponds to 10 [ms] at a sampling frequency of 16,000 [Hz].
USE_WAIT: bool type. The default value is false. Acoustic processing in the HARK system usually runs faster than real time. This option adds wait time to the processing: set it to true to process input files at real-time speed. It has no effect when processing is already slower than real time.
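The idea of adding wait time can be sketched as follows. This is only an illustration under stated assumptions, not the node's actual mechanism: the function `process_paced` and its `clock` and `sleep` arguments are hypothetical, and the sketch assumes that one frame corresponds to ADVANCE samples of audio time.

```python
import time


def process_paced(frames, process, advance=160, sampling_rate=16000,
                  use_wait=False, clock=time.monotonic, sleep=time.sleep):
    """Process frames, optionally pacing the loop to real time.

    Each frame advances the audio by advance / sampling_rate seconds.
    With use_wait=True the loop sleeps until that much wall-clock time
    has passed; if processing is already slower, no sleep happens.
    """
    period = advance / sampling_rate
    started = clock()
    results = []
    for index, frame in enumerate(frames):
        results.append(process(frame))
        if use_wait:
            # Wall-clock time at which the next frame is due.
            deadline = started + (index + 1) * period
            remaining = deadline - clock()
            if remaining > 0:
                sleep(remaining)
    return results
```

Because the sleep is skipped whenever the deadline has already passed, the sketch cannot speed anything up, which matches the remark that the option has no effect when processing is slower than real time.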
Applicable file format: RIFF WAVE files can be read. The number of channels and the quantization bit rate are read from the file header. The format IDs that indicate sampling frequency and quantization method are ignored. Any number of channels and any sampling frequency are accepted. When the sampling frequency is required for processing, it should be set as a parameter of the nodes that need it (e.g. GHDSS, MelFilterBank). The samples are assumed to be linear PCM with 16-bit or 24-bit signed integers.
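For readers who want to inspect such a file outside HARK, the following minimal sketch decodes 16-bit or 24-bit linear PCM into a matrix with rows as channels and columns as samples, using Python's standard `wave` module. The function name `read_wave_as_matrix` is hypothetical. Two assumptions differ from or go beyond the description above: the `wave` module validates the format tag in the header rather than ignoring it, and the samples are kept at their integer scale as floats, since the text does not state whether the node rescales them.

```python
import io
import wave


def read_wave_as_matrix(file_obj):
    """Decode a PCM WAVE file into a list of channels of float samples."""
    with wave.open(file_obj, "rb") as wav:
        n_channels = wav.getnchannels()
        sample_width = wav.getsampwidth()  # bytes per sample
        if sample_width not in (2, 3):
            raise ValueError("this sketch handles only 16-bit or 24-bit PCM")
        raw = wav.readframes(wav.getnframes())

    frame_size = n_channels * sample_width
    matrix = [[] for _ in range(n_channels)]
    # Samples are interleaved: one sample per channel, frame after frame.
    for offset in range(0, len(raw) - frame_size + 1, frame_size):
        for ch in range(n_channels):
            begin = offset + ch * sample_width
            value = int.from_bytes(raw[begin:begin + sample_width],
                                   "little", signed=True)
            matrix[ch].append(float(value))
    return matrix
```

The function accepts either a filename or an open binary file object such as `io.BytesIO`, because `wave.open` supports both.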
Rough indication of parameters: When the goal of processing is speech analysis (speech recognition), a LENGTH corresponding to about $20 \sim 40$ [ms] and an ADVANCE of $1/3 \sim 1/2$ of LENGTH are appropriate. At a sampling frequency of 16,000 [Hz], the default values of LENGTH (512) and ADVANCE (160) correspond to 32 [ms] and 10 [ms], respectively.
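The arithmetic behind these guidelines is simple enough to check with a short sketch. The helper names below are hypothetical and are not part of HARK; the sketch only converts between samples and milliseconds and applies the $1/3 \sim 1/2$ rule of thumb for a given sampling frequency.

```python
def samples_to_ms(n_samples, sampling_rate):
    """Duration in milliseconds of n_samples at the given sampling rate."""
    return 1000.0 * n_samples / sampling_rate


def bin_spacing_hz(length, sampling_rate):
    """Spacing between frequency bins for a frame of `length` samples."""
    return sampling_rate / length


def suggest_frame_params(sampling_rate, frame_ms=32.0):
    """Return (length, advance_low, advance_high) in samples.

    length corresponds to frame_ms; the advance range spans 1/3 to 1/2
    of length, rounded to the nearest sample.
    """
    length = round(sampling_rate * frame_ms / 1000.0)
    return length, round(length / 3), round(length / 2)


length_ms = samples_to_ms(512, 16000)   # 32.0
advance_ms = samples_to_ms(160, 16000)  # 10.0
```

At 16,000 [Hz], `suggest_frame_params(16000)` gives a LENGTH of 512 with an ADVANCE range of 171 to 256 samples, so the default ADVANCE of 160 sits slightly below one third of LENGTH. The bin spacing for the default LENGTH is 31.25 [Hz]; a longer frame narrows this spacing at the cost of temporal resolution.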