The GHDSS node performs sound source separation based on the GHDSS (Geometric High-order Decorrelation-based Source Separation) algorithm. The GHDSS algorithm uses a microphone array and performs the following two processes:
Higher-order decorrelation between sound source signals,
Directivity formation toward the sound source direction.
For directivity formation, the positional relation of the microphones, given beforehand, is used as a geometric constraint. The GHDSS algorithm implemented in the current version of HARK uses the transfer function of the microphone array as the positional relation of the microphones. Node inputs are the multi-channel complex spectrum of the sound mixture and data concerning sound source directions. Node outputs are the complex spectra of the separated sounds, one per sound source.
Corresponding parameter name | Description
TF_CONJ_FILENAME | Transfer function of microphone array
MIC_FILENAME | Coordinate of microphone position
FIXED_NOISE_FILENAME | Coordinate of noise source position
INITW_FILENAME | Initial value of separation matrix
When to use
Given a sound source direction, the node uses a microphone array to separate the sound arriving from that direction. The direction may be either a value estimated by sound source localization or a constant value.
Typical connection
Figure 6.41 shows a connection example of the GHDSS node. The node has the following two inputs:
INPUT_FRAMES takes a multi-channel complex spectrum containing the mixture of sounds,
INPUT_SOURCES takes the results of sound source localization.
To recognize the output, that is, a separated sound, it may be passed to MelFilterBank and converted into speech features for speech recognition. To improve the performance of automatic speech recognition, it may instead be passed to one of the following:
The PostFilter node, to suppress the inter-channel leakage and diffuse noise caused by the source separation processing (shown in the upper right part of Fig. 6.41),
PowerCalcForMap, HRLE and SpectralGainFilter in cascade, to suppress the inter-channel leakage and diffuse noise caused by the source separation processing (this is easier to tune than PostFilter), or
PowerCalcForMap, MelFilterBank and MFMGeneration in cascade, to generate missing feature masks so that the separated speech is recognized by an automatic speech recognition system based on missing feature theory (shown in the lower right part of Fig. 6.41).
Parameter name | Type | Default value | Unit | Description
LENGTH | int | 512 | [pt] | Analysis frame length.
ADVANCE | int | 160 | [pt] | Shift length of frame.
SAMPLING_RATE | int | 16000 | [Hz] | Sampling frequency.
LOWER_BOUND_FREQUENCY | | 0 | [Hz] | The minimum frequency used for separation processing.
UPPER_BOUND_FREQUENCY | | 8000 | [Hz] | The maximum frequency used for separation processing.
TF_CONJ | string | CALC | | Select CALC or DATABASE.
TF_CONJ==DATABASE | | | | The following is valid when DATABASE is chosen for TF_CONJ.
TF_CONJ_FILENAME | string | | | The name of the file in which the transfer function of the microphone array is described.
TF_CONJ==CALC | | | | The following is valid when CALC is chosen for TF_CONJ.
MIC_FILENAME | string | | | The name of the file in which the coordinates of the microphone positions are described.
MIC_POS_SHIFT | string | FIX | | Determine whether the origin of the microphone coordinates is shifted. Select FIX or SHIFT. With FIX, the origin is left unchanged. With SHIFT, the centroid of the microphone coordinates becomes the origin.
SPEED_OF_SOUND | float | 343 | [m/s] | Speed of sound.
FIXED_NOISE | bool | false | | Designate whether a fixed noise source is present.
FIXED_NOISE==true | | | | The following is valid when true is chosen for FIXED_NOISE.
FIXED_NOISE_FILENAME | string | | | The name of the file containing the coordinates of the fixed noise source position.
INITW_FILENAME | string | | | The name of the file in which the initial value of the separation matrix is described.
SS_METHOD | string | ADAPTIVE | | The stepsize calculation method based on higher-order decorrelation. Select FIX, LC_MYU or ADAPTIVE. FIX indicates a fixed value, LC_MYU indicates a value linked to the stepsize based on geometric constraints, and ADAPTIVE indicates automatic regulation.
SS_METHOD==FIX | | | | The following is valid when FIX is chosen for SS_METHOD.
SS_MYU | float | 0.001 | | The stepsize used when updating a separation matrix based on higher-order decorrelation.
SS_SCAL | float | 1.0 | | The scale factor in the higher-order correlation matrix computation.
NOISE_FLOOR | float | 0.0 | | The threshold value (upper limit) of the amplitude for judging the input signal as noise.
LC_CONST | string | DIAG | | Determine the geometric constraints. Select DIAG or FULL. With DIAG, the focus is formed in the target sound source direction. With FULL, blind spots are formed in non-target sound source directions in addition to the focus in the target direction.
LC_METHOD | string | ADAPTIVE | | The stepsize calculation method based on geometric constraints. Select FIX or ADAPTIVE. FIX indicates a fixed value and ADAPTIVE indicates automatic regulation.
LC_METHOD==FIX | | | | The following is valid when FIX is chosen for LC_METHOD.
LC_MYU | float | 0.001 | | The stepsize used when updating a separation matrix based on geometric constraints.
UPDATE_METHOD_TF_CONJ | string | POS | | Designate the method used to update transfer functions. Select POS or ID.
UPDATE_METHOD_W | string | ID | | Designate the method used to update separation matrices. Select ID, POS or ID_POS.
UPDATE_ACCEPT_ANGLE | float | 5.0 | [deg] | The threshold value of the angular difference for judging a sound source as identical to another in separation processing.
EXPORT_W | bool | false | | Designate whether separation matrices are written to a file.
EXPORT_W==true | | | | The following is valid when true is chosen for EXPORT_W.
EXPORT_W_FILENAME | string | | | The name of the file to which the separation matrix is written.
UPDATE | | STEP | | The method used to update separation matrices. Select STEP or TOTAL. With STEP, separation matrices are updated based on the geometric constraints after an update based on higher-order decorrelation. With TOTAL, separation matrices are updated based on the geometric constraints and higher-order decorrelation at the same time.
Input
INPUT_FRAMES: Matrix<complex<float> > type. Multi-channel complex spectra. Rows correspond to channels, i.e., complex spectra of the waveforms input from the microphones, and columns correspond to frequency bins.
INPUT_SOURCES: Vector<ObjectRef> type. A Vector of Source type objects in which source localization results are stored. It is typically connected to the output of the SourceTracker node or the SourceIntervalExtender node.
Output
Map<int, ObjectRef> type. Each entry is a pair of the sound source ID of a separated sound and the 1-channel complex spectrum of that separated sound (Vector<complex<float> > type).
Parameter
LENGTH: int type. Analysis frame length, which must be equal to the value used at the preceding stage (e.g. AudioStreamFromMic or the MultiFFT node).
ADVANCE: int type. Shift length of a frame, which must be equal to the value used at the preceding stage (e.g. AudioStreamFromMic or the MultiFFT node).
SAMPLING_RATE: int type. Sampling frequency of the input waveform.
LOWER_BOUND_FREQUENCY: The minimum frequency used in GHDSS processing. Frequencies below this value are not processed, and the output spectrum is zero there. Designate a value in the range from 0 to half of the sampling frequency.
UPPER_BOUND_FREQUENCY: The maximum frequency used in GHDSS processing. Frequencies above this value are not processed, and the output spectrum is zero there. LOWER_BOUND_FREQUENCY < UPPER_BOUND_FREQUENCY must hold.
TF_CONJ: string type. Select whether the transfer function is obtained by simulation or from measured values. Select CALC to use simulation and DATABASE to use measured values. With CALC, an ordinary transfer function is calculated. With DATABASE, however, the measured values must be stored as complex conjugates of the transfer function. Transfer function data generated with hark-tools are converted into complex conjugates automatically.
DATABASE: Set TF_CONJ_FILENAME.
TF_CONJ_FILENAME: string type. The name of the binary file in which the transfer function is described. For the file format, see Section 5.1.2.
CALC: Set MIC_FILENAME, MIC_POS_SHIFT and SPEED_OF_SOUND.
MIC_FILENAME: string type. The name of the text file in which the coordinates of the microphone positions are described. For the file format, see Section 5.2.
MIC_POS_SHIFT: string type. Select FIX or SHIFT. The default value is FIX. When SHIFT is chosen, the origin of the microphone coordinate system is moved to the centroid of the microphone positions given in the above-mentioned file. When FIX is chosen, the coordinates are used as they are.
SPEED_OF_SOUND: float type. The default value is 343. Designate the speed of sound [m/s].
FIXED_NOISE: bool type. Designate whether a fixed noise source, for example fan noise at the back of a robot, is included in the GHDSS processing. When true, designate FIXED_NOISE_FILENAME.
FIXED_NOISE_FILENAME: string type. This parameter is valid when FIXED_NOISE is set to true. Designate the name of the text file in which the position coordinates of the fixed noise source are described. For the file format, see Section 5.2.
INITW_FILENAME: string type. The name of the file in which the initial value of the separation matrix is described. Initializing with a separation matrix that has already converged in a preliminary computation allows precise separation from the beginning. The file given here must be prepared beforehand by running with EXPORT_W set to true. For the file format, see Section 5.1.3.
SS_METHOD: string type. Select the stepsize calculation method based on higher-order decorrelation. To fix the stepsize at a designated value, select FIX. To link it to the stepsize based on geometric constraints, select LC_MYU. To regulate it automatically, select ADAPTIVE.
When FIX is chosen, set SS_MYU.
SS_MYU: float type. The default value is 0.001. Designate the stepsize used when updating a separation matrix based on higher-order decorrelation. Setting both this value and LC_MYU to zero and passing a separation matrix of delay-and-sum beamformer type as INITW_FILENAME gives processing equivalent to delay-and-sum beamforming.
SS_SCAL: float type. The default value is 1.0. Designate the scale factor of the hyperbolic tangent function (tanh) used in the calculation of the higher-order correlation matrix. The value must be a positive real number. The smaller the value, the weaker the nonlinearity, which brings the calculation closer to an ordinary correlation matrix calculation.
NOISE_FLOOR: float type. The default value is 0. Designate the threshold value (upper limit) of the amplitude for judging the input signal as noise. When the amplitude of the input signal is equal to or less than this value, the frame is judged to be a noise section and the separation matrix is not updated. When noise is large and the separation matrix becomes unstable and does not converge, designate a positive real number.
LC_CONST: string type. Select the type of geometric constraint. To constrain only the focus in the target sound source direction, select DIAG. To form blind spots in non-target sound source directions in addition to the focus in the target direction, select FULL. Since blind spots are formed automatically by the higher-order decorrelation, highly precise separation is achieved with DIAG. The default value is DIAG.
LC_METHOD: string type. Select the stepsize calculation method based on the geometric constraints. To fix the stepsize at a designated value, select FIX. To regulate it automatically, select ADAPTIVE.
When FIX is chosen, set LC_MYU.
LC_MYU: float type. The default value is 0.001. Designate the stepsize used when updating a separation matrix based on the geometric constraints. Setting both this value and SS_MYU to zero and passing a separation matrix of delay-and-sum beamformer type as INITW_FILENAME gives processing equivalent to delay-and-sum beamforming.
UPDATE_METHOD_TF_CONJ: string type. Select ID or POS. The default value is POS. Designate whether the complex conjugate TF_CONJ of the transfer function is updated based on the ID given to each sound source (ID) or on the source position (POS).
UPDATE_METHOD_W: string type. Select ID, POS or ID_POS. The default value is ID. When source position information changes, the separation matrix must be recalculated. This parameter designates how a change in source position information is detected. A separation matrix is saved together with its sound source ID and sound source direction for a certain period of time. Even if the sound stops once, when a newly detected sound is judged to come from the same direction, separation is resumed with the saved separation matrix. This parameter sets the criterion for that judgment. When ID is selected, the sound source is judged to be the same by comparing sound source IDs. When POS is selected, it is judged by comparing sound source directions. When ID_POS is selected, the sound source IDs are compared first, and if they differ, the sound source directions are compared as a further check.
UPDATE_ACCEPT_ANGLE: float type. The default value is 5. The unit is [deg]. Designate the allowable angular error for judging that a sound comes from the same direction when POS or ID_POS is selected for UPDATE_METHOD_TF_CONJ or UPDATE_METHOD_W.
EXPORT_W: bool type. The default value is false. Designate whether the separation matrix updated by GHDSS is written out. When true, designate EXPORT_W_FILENAME.
EXPORT_W_FILENAME: string type. This parameter is valid when EXPORT_W is set to true. Designate the name of the file to which the separation matrix is written. For the file format, see Section 5.1.3.
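Two of the parameter behaviours above can be illustrated with a short Python sketch: the centroid shift described for MIC_POS_SHIFT = SHIFT, and the range of frequency bins that remain active between LOWER_BOUND_FREQUENCY and UPPER_BOUND_FREQUENCY. The function names are hypothetical, and the mapping "bin k lies at k * SAMPLING_RATE / LENGTH Hz" is the usual FFT convention assumed here, not something this section specifies.

```python
def centre_microphones(mic_positions):
    """Shift microphone coordinates so that their centroid becomes the origin
    (the behaviour described for MIC_POS_SHIFT = SHIFT)."""
    count = len(mic_positions)
    centroid = [sum(p[axis] for p in mic_positions) / count for axis in range(3)]
    return [tuple(p[axis] - centroid[axis] for axis in range(3)) for p in mic_positions]


def processed_bins(length, sampling_rate, lower_hz, upper_hz):
    """Return the indices of the frequency bins inside [lower_hz, upper_hz].
    Bins outside this range would be output as zero.
    Assumption: bin k lies at k * sampling_rate / length Hz."""
    if not 0 <= lower_hz < upper_hz <= sampling_rate / 2:
        raise ValueError("need 0 <= lower < upper <= sampling_rate / 2")
    bin_width = sampling_rate / length
    return [k for k in range(length // 2 + 1) if lower_hz <= k * bin_width <= upper_hz]


# Hypothetical 3-microphone layout in metres.
mics = [(0.1, 0.0, 0.0), (-0.1, 0.0, 0.0), (0.0, 0.3, 0.0)]
shifted = centre_microphones(mics)

# Default settings: every bin from 0 Hz to 8000 Hz is processed.
all_bins = processed_bins(512, 16000, 0, 8000)
# A narrower band leaves the remaining bins at zero.
speech_bins = processed_bins(512, 16000, 1000, 4000)
```

With the default LENGTH and SAMPLING_RATE the bin spacing is 31.25 Hz, so narrowing the band is a simple way to exclude low-frequency noise from separation at the cost of zeroing those bins in the output.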
Formulation of sound source separation
Table 6.36 shows the symbols used to formulate the sound source separation problem. The meaning of the indices is given in Table 6.1. Since the calculation is performed in the frequency domain, the symbols generally denote complex numbers in the frequency domain. Parameters other than transfer functions generally vary with time, but within a single time frame they are written without the time index $f$. The following calculation is described for one frequency bin $k_i$; in practice, it is performed for every frequency bin.
Parameter | Description
$\boldsymbol{S}(k_i) = [S_1(k_i), \ldots, S_N(k_i)]^T$ | The sound source complex spectrum corresponding to the frequency bin $k_i$.
$\boldsymbol{X}(k_i) = [X_1(k_i), \ldots, X_M(k_i)]^T$ | The vector of a microphone observation complex spectrum, which corresponds to INPUT_FRAMES.
$\boldsymbol{N}(k_i) = [N_1(k_i), \ldots, N_M(k_i)]^T$ | The additive noise that acts on each microphone.
$\boldsymbol{H}(k_i) = [H_{m,n}(k_i)]$ | The transfer function matrix including reflection and diffraction ($M \times N$).
$\boldsymbol{H}_D(k_i) = [H_{D,m,n}(k_i)]$ | The transfer function matrix of direct sound ($M \times N$).
$\boldsymbol{W}(k_i) = [W_{n,m}(k_i)]$ | The separation matrix ($N \times M$).
$\boldsymbol{Y}(k_i) = [Y_1(k_i), \ldots, Y_N(k_i)]^T$ | The separated sound complex spectrum.
$\mu_{SS}$ | The stepsize used when updating a separation matrix based on the higher-order decorrelation, which corresponds to SS_MYU.
$\mu_{LC}$ | The stepsize used when updating a separation matrix based on geometric constraints, which corresponds to LC_MYU.
The sound emitted from the sound sources is affected by the transfer function of the space and is observed by the microphones as expressed by Equation (31).
$$\boldsymbol{X}(k_i) = \boldsymbol{H}(k_i)\boldsymbol{S}(k_i) + \boldsymbol{N}(k_i) \tag{31}$$
The transfer function $\boldsymbol{H}(k_i)$ generally varies with the shape of the room and the positional relation between microphones and sound sources, and is therefore difficult to estimate. However, if acoustic reflection and diffraction are ignored and the relative positions of the microphones and the sound source are known, the transfer function limited to the direct sound, $\boldsymbol{H}_D(k_i)$, can be calculated as expressed in Equations (32) and (33).
$$H_{D,m,n}(k_i) = \exp\left(-j\, l_i\, r_{m,n}\right) \tag{32}$$
$$l_i = \frac{2\pi\omega_i}{c} \tag{33}$$
Here, $c$ is the speed of sound and $l_i$ is the wave number corresponding to the frequency $\omega_i$ of the frequency bin $k_i$. Moreover, $r_{m,n}$ is the difference between the distance from microphone $m$ to sound source $n$ and the distance from the reference point of the coordinate system (e.g. the origin) to sound source $n$. In other words, $\boldsymbol{H}_D(k_i)$ is defined by the phase difference caused by the difference in arrival time from the sound source to each microphone.
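As an illustration of the direct-sound model described above, the following Python sketch computes one column of the direct-sound transfer function matrix for a single source. It assumes the wave number is 2πf/c for a frequency f in Hz and that the reference point is the origin; the function name and the example geometry are hypothetical and are not taken from HARK.

```python
import cmath
import math


def direct_sound_tf(mic_positions, source_position, frequency_hz, speed_of_sound=343.0):
    """Return the direct-sound transfer function from one source to every
    microphone at one frequency (one column of H_D)."""
    wave_number = 2.0 * math.pi * frequency_hz / speed_of_sound
    origin = (0.0, 0.0, 0.0)  # assumed reference point of the coordinate system
    reference_distance = math.dist(origin, source_position)
    column = []
    for mic in mic_positions:
        # Path length to this microphone relative to the path to the origin.
        path_difference = math.dist(mic, source_position) - reference_distance
        # Pure phase term: only the arrival-time difference is modelled.
        column.append(cmath.exp(-1j * wave_number * path_difference))
    return column


# Hypothetical layout: one microphone at the origin, one 10 cm closer to a
# source placed 1 m away on the x axis.
mics = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0)]
source = (1.0, 0.0, 0.0)
h = direct_sound_tf(mics, source, 1000.0)
```

Every entry has unit magnitude, which reflects the assumption that reflection, diffraction and distance attenuation are ignored. A microphone located at the reference point always receives the value 1.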
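The vector of complex spectra of the separated sounds is obtained from the following equation.
$$\boldsymbol{Y}(k_i) = \boldsymbol{W}(k_i)\boldsymbol{X}(k_i) \tag{34}$$
The GHDSS algorithm estimates the separation matrix $\boldsymbol{W}(k_i)$ so that $\boldsymbol{Y}(k_i)$ approaches $\boldsymbol{S}(k_i)$.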
The algorithm assumes that the following information is known:
The number of sound sources
Source position (The LocalizeMUSIC node estimates source location in HARK)
Microphone position
Transfer function of the direct sound component (measurement or approximation by Equation (32))
The following information is unknown:
Actual transfer function at the time of an observation
Observation noise
GHDSS estimates $\boldsymbol{W}(k_i)$ so that the following two conditions are satisfied.
Higher-order decorrelation of the separated signals
In other words, the off-diagonal components of the higher-order correlation matrix $\boldsymbol{R}^{\phi(y)y}(k_i) = E[\phi(\boldsymbol{Y}(k_i))\boldsymbol{Y}^H(k_i)]$ of the separated sound are made 0. Here, the operators $^H$, $E[\cdot]$ and $\phi(\cdot)$ denote the Hermitian transpose, the time average operator and a nonlinear function, respectively. This node uses the hyperbolic tangent function defined as follows.
$$\phi(\boldsymbol{Y}) = [\phi(Y_1), \phi(Y_2), \ldots, \phi(Y_N)]^T \tag{35}$$
$$\phi(Y_k) = \tanh(\sigma |Y_k|) \exp(j \angle Y_k) \tag{36}$$
Here, $\sigma$ is a scaling factor (corresponds to SS_SCAL).
The direct sound component is separated without distortion (geometric constraints)
The product of the separation matrix and the transfer function of the direct sound is made a unit matrix ($\boldsymbol{W}(k_i)\boldsymbol{H}_D(k_i) = \boldsymbol{I}$).
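The two conditions above can be sketched in Python as two error matrices that should both approach zero. This is a minimal illustration, assuming that the nonlinear function applies tanh to the magnitude and keeps the phase, and that a single frame stands in for the time average; the function names are hypothetical.

```python
import cmath
import math


def phi(y, scale=1.0):
    """Nonlinear function: tanh compresses the magnitude, the phase is kept.
    `scale` plays the role of SS_SCAL."""
    return math.tanh(scale * abs(y)) * cmath.exp(1j * cmath.phase(y))


def higher_order_error(separated, scale=1.0):
    """Single-frame estimate of phi(Y) Y^H with its diagonal set to zero.
    Higher-order decorrelation drives these off-diagonal terms to zero."""
    n = len(separated)
    return [[0j if i == j else phi(separated[i], scale) * separated[j].conjugate()
             for j in range(n)] for i in range(n)]


def geometric_error(w, h_d):
    """W H_D - I for an N x M separation matrix and an M x N transfer matrix.
    The geometric constraint drives this matrix to zero."""
    n, m = len(w), len(h_d)
    return [[sum(w[i][k] * h_d[k][j] for k in range(m)) - (1 if i == j else 0)
             for j in range(n)] for i in range(n)]


# Hypothetical separated spectrum for two sources in one frequency bin.
separated = [1 + 1j, 0.5j]
e_ss = higher_order_error(separated)
# With W and H_D both equal to the identity the constraint holds exactly.
identity = [[1, 0], [0, 1]]
e_lc = geometric_error(identity, identity)
```

A smaller `scale` keeps tanh in its near-linear region, which is why a small SS_SCAL makes the higher-order correlation behave like an ordinary correlation.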
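The cost function that combines the two conditions above is as follows. For simplicity, the frequency bin $k_i$ is omitted.
$$J(\boldsymbol{W}) = \alpha J_1(\boldsymbol{W}) + \beta J_2(\boldsymbol{W}) \tag{37}$$
$$J_1(\boldsymbol{W}) = \sum_{i \neq j} \left| R^{\phi(y)y}_{i,j} \right|^2 \tag{38}$$
$$J_2(\boldsymbol{W}) = \left\| \boldsymbol{W}\boldsymbol{H}_D - \boldsymbol{I} \right\|^2 \tag{39}$$
Here, $\alpha$ and $\beta$ are weighting factors, and $R^{\phi(y)y}_{i,j}$ is the $(i,j)$ element of the higher-order correlation matrix $E[\phi(\boldsymbol{Y})\boldsymbol{Y}^H]$. Moreover, the norm of a matrix is defined as $\|\boldsymbol{M}\|^2 = \mathrm{tr}(\boldsymbol{M}\boldsymbol{M}^H) = \sum_{i,j} |m_{i,j}|^2$.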
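An update equation of the separation matrix that minimizes Equation (37) is obtained by the gradient method using the complex gradient $\partial / \partial \boldsymbol{W}^*$.
$$\boldsymbol{W}(k_i, f+1) = \boldsymbol{W}(k_i, f) - \mu \frac{\partial J}{\partial \boldsymbol{W}^*}(\boldsymbol{W}(k_i, f)) \tag{40}$$
Here, $\mu$ is the stepsize that regulates the update amount of the separation matrix. Usually, obtaining the complex gradient on the right-hand side of Equation (40) requires values from multiple frames to calculate expectations such as $\boldsymbol{R}^{xx} = E[\boldsymbol{X}\boldsymbol{X}^H]$ and $\boldsymbol{R}^{yy} = E[\boldsymbol{Y}\boldsymbol{Y}^H]$. The GHDSS node does not compute these correlation matrices. Instead, it uses Equation (41), which requires only one frame.
$$\boldsymbol{W}(k_i, f+1) = \boldsymbol{W}(k_i, f) - \left[ \mu_{SS} \frac{\partial J_1}{\partial \boldsymbol{W}^*}(\boldsymbol{W}) + \mu_{LC} \frac{\partial J_2}{\partial \boldsymbol{W}^*}(\boldsymbol{W}) \right] \tag{41}$$
$$\frac{\partial J_1}{\partial \boldsymbol{W}^*}(\boldsymbol{W}) = \left( \phi(\boldsymbol{Y})\boldsymbol{Y}^H - \mathrm{diag}[\phi(\boldsymbol{Y})\boldsymbol{Y}^H] \right) \tilde{\phi}(\boldsymbol{W}\boldsymbol{X}) \boldsymbol{X}^H \tag{42}$$
$$\frac{\partial J_2}{\partial \boldsymbol{W}^*}(\boldsymbol{W}) = 2 \left( \boldsymbol{W}\boldsymbol{H}_D - \boldsymbol{I} \right) \boldsymbol{H}_D^H \tag{43}$$
Here, $\tilde{\phi}$ is the partial derivative of $\phi$ and is defined as follows.
$$\tilde{\phi}(\boldsymbol{Y}) = [\tilde{\phi}(Y_1), \tilde{\phi}(Y_2), \ldots, \tilde{\phi}(Y_N)]^T \tag{44}$$
$$\tilde{\phi}(Y_k) = \phi(Y_k) + Y_k \frac{\partial \phi(Y_k)}{\partial Y_k} \tag{45}$$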
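Moreover, $\mu_{SS} = \mu\alpha$ and $\mu_{LC} = \mu\beta$ are the stepsizes based on the higher-order decorrelation and the geometric constraints, respectively. When the stepsizes are regulated automatically, they are calculated by the following equations.
$$\mu_{SS} = \frac{J_1(\boldsymbol{W})}{2 \left\| \frac{\partial J_1}{\partial \boldsymbol{W}^*}(\boldsymbol{W}) \right\|^2} \tag{46}$$
$$\mu_{LC} = \frac{J_2(\boldsymbol{W})}{2 \left\| \frac{\partial J_2}{\partial \boldsymbol{W}^*}(\boldsymbol{W}) \right\|^2} \tag{47}$$
The indices of each parameter in Equations (42) and (43) are $(k_i, f)$, which are omitted above. The initial value of the separation matrix is obtained as follows.
$$\boldsymbol{W}(k_i) = \boldsymbol{H}_D^H(k_i) / M \tag{48}$$
Here, $M$ is the number of microphones.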
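The initialization and the geometric-constraint part of the update can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the HARK implementation: the separation matrix is initialized as the Hermitian transpose of the direct-sound transfer matrix divided by the number of microphones, the gradient of the constraint cost is taken as 2 (W H_D - I) H_D^H, the error of the full matrix is used (closest to the FULL setting of LC_CONST as described above), and the stepsize is fixed. All function names and the example matrix are hypothetical.

```python
import cmath


def hermitian(a):
    """Conjugate transpose of a matrix stored as a list of rows."""
    return [[a[r][c].conjugate() for r in range(len(a))] for c in range(len(a[0]))]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def initial_separation_matrix(h_d):
    """W = H_D^H / M, i.e. a delay-and-sum beamformer for each source."""
    m = len(h_d)
    return [[v / m for v in row] for row in hermitian(h_d)]


def constraint_error(w, h_d):
    """W H_D - I."""
    e = matmul(w, h_d)
    for i in range(len(e)):
        e[i][i] -= 1
    return e


def j2(w, h_d):
    """Geometric-constraint cost: sum of squared magnitudes of W H_D - I."""
    return sum(abs(v) ** 2 for row in constraint_error(w, h_d) for v in row)


def lc_update(w, h_d, mu_lc):
    """One fixed-stepsize step W <- W - mu_lc * 2 (W H_D - I) H_D^H."""
    grad = matmul(constraint_error(w, h_d), hermitian(h_d))
    return [[w[i][j] - mu_lc * 2 * grad[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]


# Hypothetical 3-microphone, 2-source transfer matrix with unit-modulus entries.
phases = [(0.0, 0.0), (0.4, -0.7), (0.8, -1.4)]
h_d = [[cmath.exp(-1j * p) for p in row] for row in phases]
w0 = initial_separation_matrix(h_d)
w1 = lc_update(w0, h_d, mu_lc=0.001)
```

With unit-modulus transfer functions the initial matrix already has unit gain toward each target direction, so the remaining cost comes from leakage between sources. The update step reduces that leakage slightly, and leaving the stepsizes at zero keeps the delay-and-sum behaviour unchanged.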
Processing flow
The main processing for time frame $f$ in the GHDSS node is shown in Figure 6.42. The processing consists of the following steps:
Checking the presence of fixed noise
Acquiring a transfer function (direct sound)
Estimating the separation matrix
Performing sound source separation in accordance with Equation (34)
Writing of a separation matrix (When EXPORT_W is set to true)
Checking the presence of fixed noise
When FIXED_NOISE is set to true and the source localization result contains a sound source in the fixed noise direction, that source is separated with -1 assigned as its ID.
Acquiring a transfer function
In the first frame, the transfer function is obtained differently depending on the parameter TF_CONJ.
When the parameter TF_CONJ is set to CALC, the transfer function is calculated based on Equation (32) from the input source localization result and the microphone coordinates designated by the parameter MIC_FILENAME.
When the parameter TF_CONJ is set to DATABASE, the entry closest to the direction of the input source localization result is looked up in the transfer function file designated by the parameter TF_CONJ_FILENAME.
From the second frame onward, processing is as follows.
The flow for obtaining a transfer function is shown in Figure 6.43.
When the parameter TF_CONJ is set to CALC, the transfer function is calculated based on Equation (32) from the input source localization result.
When the parameter TF_CONJ is set to DATABASE, the value of UPDATE_METHOD_TF_CONJ determines whether the transfer function of the previous frame is carried over or read again from the file, as follows.
UPDATE_METHOD_TF_CONJ is ID
The acquired ID is compared with the ID of the previous frame.
Same: carry over
Different: read
UPDATE_METHOD_TF_CONJ is POS
The acquired direction is compared with the sound source direction of the previous frame.
Error is less than UPDATE_ACCEPT_ANGLE: carry over
Error is more than UPDATE_ACCEPT_ANGLE: read
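The decision above can be written as a small Python function. This is a sketch under assumptions: source directions are represented as 3-D vectors and compared by the angle between them, and an error exactly equal to UPDATE_ACCEPT_ANGLE is treated as "read" because the text does not define the boundary case. The function names are hypothetical.

```python
import math


def angle_between_deg(a, b):
    """Angle in degrees between two 3-D direction vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.hypot(*a) * math.hypot(*b)
    cosine = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.degrees(math.acos(cosine))


def reuse_transfer_function(method, prev, cur, accept_angle_deg=5.0):
    """Return True to carry over the previous frame's transfer function and
    False to read it again from the file. `prev` and `cur` are
    (source_id, direction_vector) pairs for the previous and current frame."""
    if method == "ID":
        return prev[0] == cur[0]
    if method == "POS":
        return angle_between_deg(prev[1], cur[1]) < accept_angle_deg
    raise ValueError("method must be 'ID' or 'POS'")


def direction(azimuth_deg):
    """Unit vector in the horizontal plane for a given azimuth."""
    rad = math.radians(azimuth_deg)
    return (math.cos(rad), math.sin(rad), 0.0)


# The source keeps its direction within 3 degrees but receives a new ID.
previous = (0, direction(0.0))
current = (1, direction(3.0))
by_id = reuse_transfer_function("ID", previous, current)
by_pos = reuse_transfer_function("POS", previous, current)
```

In this example the ID comparison triggers a re-read while the position comparison carries the transfer function over, which shows how the two settings differ when a tracker reassigns IDs to a stationary source.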
Estimating the separation matrix
The initial value of the separation matrix depends on whether a value is designated for the parameter INITW_FILENAME.
When the parameter INITW_FILENAME is not designated, the separation matrix is calculated from the transfer function $\boldsymbol{H}_D$.
When the parameter INITW_FILENAME is designated, the entry closest to the direction of the input source localization result is looked up in the designated separation matrix file.
From the second frame onward, processing is as follows.
The flow for estimating the separation matrix is shown in Figure 6.44. Here, either the separation matrix of the previous frame is updated based on Equation (41), or an initial value of the separation matrix is derived from the transfer function based on Equation (48).
When the source localization information of the previous frame shows that a sound source has disappeared, the separation matrix is reinitialized.
When the number of sound sources does not change, processing branches according to the value of UPDATE_METHOD_W. The sound source ID and localization direction of the previous frame are compared with those of the current frame to determine whether the separation matrix continues to be used or is initialized.
UPDATE_METHOD_W is ID
Compare with the ID of the previous frame.
Same: update
Different: initialize
UPDATE_METHOD_W is POS
Compare with the localization direction of the previous frame.
Error is less than UPDATE_ACCEPT_ANGLE: update
Error is more than UPDATE_ACCEPT_ANGLE: initialize
UPDATE_METHOD_W is ID_POS
Compare with the ID of the previous frame.
Same: update
When the IDs are different, the localization directions are compared.
Error is less than UPDATE_ACCEPT_ANGLE: update
Error is more than UPDATE_ACCEPT_ANGLE: initialize
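The three branches above reduce to one boolean rule per setting, as the following Python sketch shows. The function name and the string return values are hypothetical, the angular error is assumed to be computed beforehand, and an error exactly equal to UPDATE_ACCEPT_ANGLE is treated as "initialize" because the text leaves the boundary case open.

```python
def separation_matrix_action(method, same_id, angle_error_deg, accept_angle_deg=5.0):
    """Decide whether the separation matrix of the previous frame is updated
    or a new one is initialized, following the UPDATE_METHOD_W branches."""
    within_angle = angle_error_deg < accept_angle_deg
    if method == "ID":
        keep = same_id
    elif method == "POS":
        keep = within_angle
    elif method == "ID_POS":
        # The direction is only consulted when the IDs differ.
        keep = same_id or within_angle
    else:
        raise ValueError("method must be 'ID', 'POS' or 'ID_POS'")
    return "update" if keep else "initialize"


# A source that moved 2 degrees and was given a new ID by the tracker.
actions = {m: separation_matrix_action(m, same_id=False, angle_error_deg=2.0)
           for m in ("ID", "POS", "ID_POS")}
```

ID_POS is the most permissive setting: it keeps the separation matrix whenever either criterion indicates the same source, so an already adapted matrix is discarded less often.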
Writing of a separation matrix (when EXPORT_W is set to true)
When EXPORT_W is set to true, the converged separation matrix is written to the file designated by EXPORT_W_FILENAME.
When multiple sound sources are detected, all of their separation matrices are written to one file. The separation matrix of a sound source is written when that sound source disappears.
At that time, whether to overwrite an existing entry or to add the sound source as a new entry is decided by comparing its localization direction with those of the sound sources already saved.
Disappearance of sound source
Compare with the localization directions of the sound sources already saved.
Error is less than UPDATE_ACCEPT_ANGLE: overwrite and save
Error is more than UPDATE_ACCEPT_ANGLE: save additionally
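The overwrite-or-add rule can be sketched in Python as an in-memory list of saved entries. This sketch makes several assumptions that the text does not state: directions are compared by azimuth only, the first saved entry within the accepted angle is the one overwritten, and the overwritten entry takes the new direction. The names are hypothetical and the matrices are stand-in strings.

```python
def azimuth_error_deg(a, b):
    """Smallest absolute difference between two azimuths in degrees."""
    diff = abs(a - b) % 360.0
    return min(diff, 360.0 - diff)


def save_separation_matrix(saved, azimuth_deg, matrix, accept_angle_deg=5.0):
    """Store the matrix of a source that has just disappeared.
    `saved` is a list of [azimuth_deg, matrix] entries and is modified in place."""
    for entry in saved:
        if azimuth_error_deg(entry[0], azimuth_deg) < accept_angle_deg:
            entry[0], entry[1] = azimuth_deg, matrix  # overwrite and save
            return "overwritten"
    saved.append([azimuth_deg, matrix])  # save additionally
    return "added"


saved = []
results = [
    save_separation_matrix(saved, 30.0, "W_a"),   # first source
    save_separation_matrix(saved, 32.0, "W_b"),   # within 5 degrees of 30
    save_separation_matrix(saved, 358.0, "W_c"),  # a new direction
    save_separation_matrix(saved, 1.0, "W_d"),    # 3 degrees from 358 across 0
]
```

Handling the wrap-around at 0/360 degrees matters for sources near the front of the array, where two nearby directions can otherwise appear almost 360 degrees apart.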