6.3.5 GHDSS

6.3.5.1 Outline of the node

The GHDSS node performs sound source separation based on the GHDSS (Geometric High-order Decorrelation-based Source Separation) algorithm. The GHDSS algorithm uses a microphone array and performs the following two processes:

  1. Higher-order decorrelation between sound source signals,

  2. Directivity formation towards the sound source direction.

For directivity formation, the positional relation of the microphones, given beforehand, is used as a geometric constraint. The GHDSS algorithm implemented in the current version of HARK uses the transfer function of the microphone array as this positional relation. The node inputs are the multi-channel complex spectrum of the sound mixture and data on the sound source directions. The node outputs a set of complex spectra, one for each separated sound.

6.3.5.2 Necessary files

Table 6.35: Necessary files

Corresponding parameter name | Description
TF_CONJ_FILENAME | Transfer function of the microphone array
INITW_FILENAME | Initial value of the separation matrix

6.3.5.3 Usage

When to use

Given a sound source direction, the node separates the sound originating from that direction with a microphone array. As the sound source direction, either a value estimated by sound source localization or a constant value may be used.

Typical connection

Figure 6.43 shows a connection example of the GHDSS node. The node has the following two inputs:

  1. INPUT_FRAMES takes a multi-channel complex spectrum containing the mixture of sounds,

  2. INPUT_SOURCES takes the results of sound source localization.

To recognize the output, that is, a separated sound, it may be passed to MelFilterBank to convert it into speech features for speech recognition. To improve the performance of automatic speech recognition, the output may instead be passed to one of the following:

  1. The PostFilter node, to suppress the inter-channel leakage and diffusive noise caused by the source separation processing (shown in the upper right part of Fig. 6.43),

  2. PowerCalcForMap , HRLE , and SpectralGainFilter  in cascade to suppress the inter-channel leakage and diffusive noise caused by source separation processing (this tuning would be easier than with PostFilter ), or

  3. PowerCalcForMap , MelFilterBank  and MFMGeneration  in cascade to generate missing feature masks so that the separated speech is recognized by a missing-feature-theory based automatic speech recognition system (shown in the lower right part of Fig.6.43).

\includegraphics[width=.8\textwidth ]{fig/modules/GHDSS}
Figure 6.43: Example of Connections of GHDSS 

6.3.5.4 Input-output and property of the node

Table 6.36: Parameter list of GHDSS

Parameter name | Type | Default value | Unit | Description
LENGTH | int | 512 | [pt] | Analysis frame length.
ADVANCE | int | 160 | [pt] | Shift length of frame.
SAMPLING_RATE | int | 16000 | [Hz] | Sampling frequency.
LOWER_BOUND_FREQUENCY | int | 0 | [Hz] | The minimum frequency used for separation processing.
UPPER_BOUND_FREQUENCY | int | 8000 | [Hz] | The maximum frequency used for separation processing.
TF_CONJ_FILENAME | string | | | File name of the transfer function database of your microphone array.
INITW_FILENAME | string | | | File name in which the initial value of the separation matrix is described.
SS_METHOD | string | ADAPTIVE | | Stepsize calculation method for the higher-order decorrelation. Select FIX, LC_MYU or ADAPTIVE. FIX uses a fixed value, LC_MYU links the stepsize to the one based on geometric constraints, and ADAPTIVE regulates the stepsize automatically.
SS_MYU | float | 0.001 | | (Valid when SS_METHOD is FIX.) Stepsize based on higher-order decorrelation when updating the separation matrix.
SS_SCAL | float | 1.0 | | Scale factor in the higher-order correlation matrix computation.
NOISE_FLOOR | float | 0.0 | | Amplitude threshold (upper limit) for judging the input signal as noise.
LC_CONST | string | DIAG | | Geometric constraint type. Select DIAG or FULL. DIAG forms only a focus in the target sound source direction. FULL additionally forms blind spots in the non-target sound source directions.
LC_METHOD | string | ADAPTIVE | | Stepsize calculation method for the geometric constraints. Select FIX or ADAPTIVE. FIX uses a fixed value; ADAPTIVE regulates the stepsize automatically.
LC_MYU | float | 0.001 | | (Valid when LC_METHOD is FIX.) Stepsize based on geometric constraints when updating the separation matrix.
UPDATE_METHOD_TF_CONJ | string | POS | | Method to update transfer functions. Select POS or ID.
UPDATE_METHOD_W | string | ID | | Method to update separation matrices. Select ID, POS or ID_POS.
UPDATE_SEARCH_AZIMUTH | float | | [deg] | Azimuth range for searching for a sound source judged identical to another in separation processing.
UPDATE_SEARCH_ELEVATION | float | | [deg] | Elevation range for searching for a sound source judged identical to another in separation processing.
UPDATE_ACCEPT_ANGLE | float | 5.0 | [deg] | Threshold of the angle difference for judging a sound source as identical to another in separation processing.
EXPORT_W | bool | false | | Whether separation matrices are written to files.
EXPORT_W_FILENAME | string | | | (Valid when EXPORT_W is true.) Name of the file to which the separation matrices are written.
UPDATE | string | STEP | | Method to update separation matrices. Select STEP or TOTAL. In STEP, the separation matrix is updated based on the geometric constraints after an update based on higher-order decorrelation. In TOTAL, both updates are applied at the same time.

Input

INPUT_FRAMES

: Matrix<complex<float> > type. Multi-channel complex spectra. Rows correspond to channels, i.e., complex spectra of waveforms input from microphones, and columns correspond to frequency bins.

INPUT_SOURCES

: Vector<ObjectRef> type. A Vector array of Source-type objects in which sound source localization results are stored. Typically, the outputs of the SourceTracker node and the SourceIntervalExtender node are connected here.

Output

OUTPUT

: Map<int, ObjectRef> type. A pair containing the sound source ID of a separated sound and a 1-channel complex spectrum of the separated sound
(Vector<complex<float> > type).

Parameter

LENGTH

: int type. Analysis frame length, which must equal the value used at the preceding stage (e.g. the AudioStreamFromMic or MultiFFT node).

ADVANCE

: int type. Shift length of a frame, which must equal the value used at the preceding stage (e.g. the AudioStreamFromMic or MultiFFT node).

SAMPLING_RATE

: int type. Sampling frequency of the input waveform.

LOWER_BOUND_FREQUENCY

This parameter is the minimum frequency used in GHDSS processing. Frequencies below this value are not processed, and the output spectrum is zero there. Designate a value in the range from 0 to half the sampling frequency.

UPPER_BOUND_FREQUENCY

This parameter is the maximum frequency used in GHDSS processing. Frequencies above this value are not processed, and the output spectrum is zero there. LOWER_BOUND_FREQUENCY $<$ UPPER_BOUND_FREQUENCY must hold.
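The frequency bounds translate to FFT bin indices. Below is a minimal sketch (the helper name is made up, not part of HARK), assuming the standard bin spacing of SAMPLING_RATE / LENGTH hertz per bin:

```python
# Sketch (not HARK code): mapping the frequency bounds to FFT bin
# indices, assuming bin k corresponds to k * SAMPLING_RATE / LENGTH [Hz].

def frequency_bounds_to_bins(lower_hz, upper_hz, length=512, sampling_rate=16000):
    """Return the inclusive range of bins actually processed."""
    hz_per_bin = sampling_rate / length          # 31.25 Hz with the defaults
    lower_bin = int(round(lower_hz / hz_per_bin))
    upper_bin = int(round(upper_hz / hz_per_bin))
    # Only bins up to the Nyquist bin (LENGTH // 2) carry unique information.
    upper_bin = min(upper_bin, length // 2)
    return lower_bin, upper_bin

# With the default parameters the whole 0-8000 Hz band is processed:
lo, hi = frequency_bounds_to_bins(0, 8000)   # -> (0, 256)
```

With the defaults, the bounds 0 and 8000 Hz cover every bin up to Nyquist; narrowing the band simply zeroes the excluded bins in the output spectrum.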

TF_CONJ_FILENAME

: string type. The file name in which the transfer function database of your microphone array is saved. Refer to Section 5.1.2 for details of the file format.

INITW_FILENAME

: string type. The file name in which the initial value of a separation matrix is described. Initializing with a separation matrix that has already converged through preliminary computation allows separation with good precision from the beginning. The file given here must be prepared beforehand by setting EXPORT_W to true. For its format, see Section 5.1.3.

SS_METHOD

: string type. Select a stepsize calculation method based on higher-order decorrelation. When wishing to fix it at a designated value, select FIX. When wishing to set a stepsize based on geometric constraints, select LC_MYU. When wishing to perform automatic regulation, select ADAPTIVE.

  1. When FIX is chosen: set SS_MYU.
    SS_MYU: float type. The default value is 0.001. Designate the stepsize used when updating a separation matrix based on higher-order decorrelation. By setting this value and LC_MYU to zero and giving a separation matrix of the delay-and-sum beamformer type as INITW_FILENAME, processing equivalent to delay-and-sum beamforming is performed.

SS_SCAL

: float type. The default value is 1.0. Designate the scale factor of the hyperbolic tangent function (tanh) used in calculating the higher-order correlation matrix. A positive real number greater than zero must be designated. The smaller the value, the weaker the non-linearity, which makes the calculation closer to an ordinary correlation matrix calculation.
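The role of SS_SCAL can be seen directly from the nonlinearity of Equations (34)-(35). The sketch below (illustrative, not HARK code) shows that for a small scale factor $\sigma$, $\tanh(\sigma|Y|) \approx \sigma|Y|$, so the function becomes an ordinary linear scaling of the spectrum:

```python
import numpy as np

# Element-wise nonlinearity phi(Y) = tanh(sigma*|Y|) * exp(j*angle(Y)),
# as in Equations (34)-(35); sigma corresponds to SS_SCAL.
def phi(Y, sigma=1.0):
    return np.tanh(sigma * np.abs(Y)) * np.exp(1j * np.angle(Y))

Y = np.array([0.3 + 0.4j, -1.0 + 2.0j])

# For small sigma, tanh(sigma*|Y|) ~ sigma*|Y|, so phi(Y) ~ sigma*Y:
# the higher-order correlation approaches a scaled ordinary correlation.
near_linear = np.allclose(phi(Y, 1e-4), 1e-4 * Y, atol=1e-8)
```

Larger values of SS_SCAL saturate the magnitude more strongly and emphasize the higher-order statistics of the separated signals.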

NOISE_FLOOR

: float type. The default value is 0. The user designates the threshold value (upper limit) of the amplitude for judging the input signal as noise. When the amplitude of the input signal is equal to or less than this value, the frame is judged to be a noise section and the separation matrix is not updated. When the noise is large and the separation matrix does not converge stably, designate a positive real number.

LC_CONST

: string type. Select the method for the geometric constraints. To impose only a focus on the target sound source direction, select DIAG. To also form blind spots in the non-target sound source directions, select FULL. Since blind spots are formed automatically by the higher-order decorrelation, DIAG achieves highly precise separation. The default is DIAG.

LC_METHOD

: string type. Select a stepsize calculation method based on the geometric constraints. When wishing to fix at the designated value, select FIX. When wishing to perform automatic regulation, select ADAPTIVE.

  1. When FIX is chosen: set LC_MYU.
    LC_MYU: float type. The default value is 0.001. Designate the stepsize used when updating a separation matrix based on the geometric constraints. Setting this value and SS_MYU to zero and giving a separation matrix of the delay-and-sum beamformer type as INITW_FILENAME enables processing equivalent to delay-and-sum beamforming.

UPDATE_METHOD_TF_CONJ

: string type. Select ID or POS. The default value is POS. The user designates whether updates of the complex conjugate TF_CONJ of a transfer function are performed based on the ID given to each sound source (ID) or on the source position (POS).

UPDATE_METHOD_W

: string type. Select ID, POS or ID_POS. The default value is ID. When source position information is changed, recalculation of the separation matrix is required. The user designates a method to judge that the source location information has changed. A separation matrix is saved along with its corresponding sound source ID and sound source direction angle for a given period of time. Even if the sound stops once, when a detected sound is judged to be from the same direction, separation processing is performed with the values of the saved separation matrix again. The user sets criteria to judge if such a separation matrix will be updated in the above case. When ID is selected, it is judged if the sound source is in the same direction by the sound source ID. When POS is selected, it is judged by comparing the sound source directions. When ID_POS is selected, if the sound source is judged not to be the same sound source using a sound source ID comparison, further judgment is performed by comparing the angles of the sound source direction.

UPDATE_SEARCH_AZIMUTH

: float type. The default value is blank. The unit is [deg]. This specifies the azimuth range for searching for a sound source judged identical to another in separation processing. This parameter is valid when UPDATE_METHOD_TF_CONJ or UPDATE_METHOD_W is POS or ID_POS. If the value is $x$, sound sources in the next frame are searched for within $x$ degrees of the sound sources in the previous frame. If this box is blank, sound sources are searched for over the whole range. An appropriate value reduces the cost of the sound source search.

UPDATE_SEARCH_ELEVATION

: float type. This is almost the same parameter as UPDATE_SEARCH_AZIMUTH but works with the elevation of sound sources.

UPDATE_ACCEPT_ANGLE

: float type. The default value is 5. The unit is [deg]. The user sets the allowable angle error for judging whether the sound comes from the same direction when POS or ID_POS is selected for UPDATE_METHOD_TF_CONJ or UPDATE_METHOD_W.
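The same-source test implied by UPDATE_ACCEPT_ANGLE can be sketched as follows. This is an assumed helper for illustration only (the function name, geometry convention, and azimuth/elevation parameterization are not from HARK):

```python
import numpy as np

# Sketch (assumed helper, not HARK code): judging whether two localized
# directions belong to the same source under UPDATE_ACCEPT_ANGLE.
def same_source(az1, el1, az2, el2, accept_angle_deg=5.0):
    """Angles in degrees; compares the angle between the two unit vectors."""
    def unit(az, el):
        az, el = np.radians(az), np.radians(el)
        return np.array([np.cos(el) * np.cos(az),
                         np.cos(el) * np.sin(az),
                         np.sin(el)])
    cos_diff = np.clip(np.dot(unit(az1, el1), unit(az2, el2)), -1.0, 1.0)
    return np.degrees(np.arccos(cos_diff)) <= accept_angle_deg

same_source(30.0, 0.0, 33.0, 0.0)   # 3 deg apart: judged identical
same_source(30.0, 0.0, 40.0, 0.0)   # 10 deg apart: treated as a new source
```

Within the threshold the saved separation matrix is reused; beyond it the source is treated as new and the matrix is reinitialized.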

EXPORT_W

: bool type. The default value is false. The user determines whether the separation matrices updated by GHDSS are written to a file. When true, set EXPORT_W_FILENAME.

EXPORT_W_FILENAME

: string type. This parameter is valid when EXPORT_W is set to true. Designate the name of the file into which a separation matrix will be output. For its format, see Section 5.1.3.

6.3.5.5 Details of the node

Formulation of sound source separation: Table 6.37 shows the symbols used in formulating the sound source separation problem. The meaning of the indices is given in Table 6.1. Since the calculation is performed in the frequency domain, the symbols generally indicate complex numbers in the frequency domain. Parameters other than the transfer functions generally vary with time; for calculations closed within the same time frame, the time index $f$ is omitted. Moreover, the following formulation is written for a single frequency bin $k_ i$; in practice, the calculation is performed for each of the $K$ frequency bins $k_0, \dots ,k_{K-1}$.

Table 6.37: Definitions of the parameters

Parameter | Description
$\boldsymbol {S}(k_ i)= \left[S_1(k_ i), \dots ,S_ N(k_ i)\right]^ T$ | The complex spectra of the sound sources for the frequency bin $k_ i$.
$\boldsymbol {X}(k_ i)= \left[X_1(k_ i), \dots ,X_ M(k_ i)\right]^ T$ | The vector of complex spectra observed at the microphones, corresponding to INPUT_FRAMES.
$\boldsymbol {N}(k_ i)= \left[N_1(k_ i), \dots ,N_ M(k_ i)\right]^ T$ | The additive noise acting on each microphone.
$\boldsymbol {H}(k_ i)= \left[H_{m, n}(k_ i)\right]$ | The transfer function matrix including reflection and diffraction ($M \times N$).
$\boldsymbol {H}_ D(k_ i)= \left[H_{Dm, n}(k_ i)\right]$ | The transfer function matrix of the direct sound ($M \times N$).
$\boldsymbol {W}(k_ i)= \left[W_{n, m}(k_ i)\right]$ | The separation matrix ($N \times M$).
$\boldsymbol {Y}(k_ i)= \left[Y_1(k_ i), \dots ,Y_ N(k_ i)\right]^ T$ | The complex spectra of the separated sounds.
$\mu _{SS}$ | The stepsize for updating the separation matrix based on higher-order decorrelation, corresponding to SS_MYU.
$\mu _{LC}$ | The stepsize for updating the separation matrix based on geometric constraints, corresponding to LC_MYU.

6.3.5.5.1 Mixture model

The sound that is emitted from $N$ sound sources is affected by the transfer function $\boldsymbol {H}(k_ i)$ in space and observed through $M$ microphones as expressed by Equation (30).

  $\displaystyle \boldsymbol {X}(k_ i) $ $\displaystyle = $ $\displaystyle \boldsymbol {H}(k_ i)\boldsymbol {S}(k_ i) + \boldsymbol {N}(k_ i). \label{eq:observation} $   (30)

The transfer function $\boldsymbol {H}(k_ i)$ generally varies depending on the shape of the room and the positional relations between the microphones and the sound sources, and is therefore difficult to estimate. However, if acoustic reflection and diffraction are ignored and the relative positions of the microphones and the sound source are known, the transfer function limited to the direct sound, $\boldsymbol {H}_ D(k_ i)$, can be calculated as expressed in Equation (31).

  $\displaystyle H_{Dm, n}(k_ i) $ $\displaystyle = $ $\displaystyle \exp \left(-j2\pi l_ ir_{m, n}\right) \label{eq:tfd} $   (31)
  $\displaystyle l_ i $ $\displaystyle = $ $\displaystyle \frac{2\pi \omega _ i}{c}, \label{eq:wavenumber} $   (32)

Here, $c$ indicates the speed of sound and $l_ i$ is the wave number corresponding to the frequency $\omega _ i$ of the frequency bin $k_ i$. Moreover, $r_{m, n}$ indicates the difference between the distance from microphone $m$ to sound source $n$ and the distance from the reference point of the coordinate system (e.g. the origin) to sound source $n$. In other words, $\boldsymbol {H}_ D(k_ i)$ is defined by the phase differences incurred by the differences in arrival time from the sound source to each microphone.
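The direct-sound transfer function above can be sketched numerically for a far-field (plane-wave) source. This is an illustrative sketch only: the geometry, the sign convention for the path-length difference, and all numbers below are assumptions, and the phase is written directly as $\exp(-j\,2\pi f\, r_{m,n}/c)$, i.e. the arrival-time difference $r_{m,n}/c$ at frequency $f$:

```python
import numpy as np

# Sketch of the direct-sound transfer function in the spirit of Equation (31):
# each entry carries the phase of the arrival-time difference r_{m,n}/c
# between microphone m and the array reference point.
C = 343.0  # speed of sound [m/s] (assumed)

def direct_tf(mic_positions, source_directions, freqs_hz):
    """Return H_D with shape (len(freqs_hz), M, N).

    mic_positions: (M, 3) coordinates [m] relative to the reference point.
    source_directions: (N, 3) unit vectors pointing toward the sources.
    """
    # r[m, n]: path-length difference = -<mic_m, direction_n> (plane wave)
    r = -mic_positions @ source_directions.T                    # (M, N)
    freqs = np.asarray(freqs_hz, dtype=float)[:, None, None]    # (K, 1, 1)
    return np.exp(-2j * np.pi * freqs * r[None, :, :] / C)      # (K, M, N)

mics = np.array([[0.05, 0.0, 0.0], [-0.05, 0.0, 0.0]])  # 2 mics, 10 cm apart
dirs = np.array([[1.0, 0.0, 0.0]])                      # one source, endfire
H = direct_tf(mics, dirs, [1000.0])                     # shape (1, 2, 1)
```

Every entry has unit magnitude, since only the phase (arrival-time difference) is modeled; reflection and diffraction are ignored, as stated above.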

6.3.5.5.2 Separation model

The vector of complex spectra of the separated sounds $\boldsymbol {Y}(k_ i)$ is obtained from the following equation.

  $\displaystyle \boldsymbol {Y}(k_ i) $ $\displaystyle = $ $\displaystyle \boldsymbol {W}(k_ i)\boldsymbol {X}(k_ i) \label{eq:GHDSS-separation} $   (33)

The GHDSS algorithm estimates the separation matrix $\boldsymbol {W}(k_ i)$ so that $\boldsymbol {Y}(k_ i)$ approaches $\boldsymbol {S}(k_ i)$.

6.3.5.5.3 Assumption in the model

The information assumed to be known by this algorithm is as follows.

  1. The number of sound sources $N$

  2. Source position (The LocalizeMUSIC node estimates source location in HARK)

  3. Microphone position

  4. Transfer function of the direct sound component $\boldsymbol {H}_ D(k_ i)$ (measurement or approximation by Equation (31))

The unknown information is:

  1. Actual transfer function at the time of an observation $\boldsymbol {H}(k_ i)$

  2. Observation noise $\boldsymbol {N}(k_ i)$

6.3.5.5.4 Update equation of separation matrix

GHDSS estimates $\boldsymbol {W}(k_ i)$ so that the following conditions are satisfied.

  1. Higher-order decorrelation of the separated signals
    In other words, the diagonal component of the higher-order matrix $\boldsymbol {R}^{\phi (y)y}(k_ i) = E[\phi (\boldsymbol {Y}(k_ i)) \boldsymbol { Y}^ H(k_ i)] $ of the separated sound $\boldsymbol {Y}(k_ i)$ is made 0. Here, the operators $^ H$, $E[]$ and $\phi ()$ indicate a hermite transpose, time average operator and nonlinear function, respectively and a hyperbolic tangent function defined by the followings is used in this node.

      $\displaystyle \phi (\boldsymbol {Y}) $ $\displaystyle = $ $\displaystyle [\phi (Y_1), \phi (Y_2), \dots , \phi (Y_ N)] ^ T $   (34)
      $\displaystyle \phi (Y_ k) $ $\displaystyle = $ $\displaystyle \tanh (\sigma |Y_ k|) \exp (j\angle (Y_ k)) $   (35)

    Here, $\sigma $ indicates a scaling factor (corresponds to SS_SCAL).

  2. The direct sound component is separated without distortions (geometric constraints)
    The product of the separation matrix $\boldsymbol {W}(k_ i)$ and the transfer function of the direct sound $\boldsymbol {H}_ D(k_ i)$ is made a unit matrix ($\boldsymbol {W}(k_ i)\boldsymbol {H}_ D(k_ i)= \boldsymbol {I}$).

The evaluation function that satisfies the above two conditions is as follows. For simplicity, the frequency bin $k_ i$ is omitted.

  $\displaystyle J(\boldsymbol {W}) $ $\displaystyle = $ $\displaystyle \alpha J_1(\boldsymbol {W}) + \beta J_2(\boldsymbol {W}), \label{eq:evalFuncTotal} $   (36)
  $\displaystyle J_1(\boldsymbol {W}) $ $\displaystyle = $ $\displaystyle \sum _{i \ne j}| R^{\phi (y)y}_{i, j}|^2, \label{eq:evalFunc1} $   (37)
  $\displaystyle J_2(\boldsymbol {W}) $ $\displaystyle = $ $\displaystyle \| \boldsymbol {W}\boldsymbol {H}_ D-\boldsymbol {I}\| ^2, \label{eq:evalFunc2} $   (38)

Here, $\alpha $ and $\beta $ are weighting factors. Moreover, the norm of a matrix is defined as $\| \boldsymbol {M} \| ^2 = \mathrm{tr}(\boldsymbol {MM}^ H)= \sum _{i, j}|m_{i, j}|^2$.
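The two forms of the matrix norm stated above agree, which is easy to verify numerically (the matrix below is an arbitrary example):

```python
import numpy as np

# Quick check of the matrix norm used above: ||M||^2 = tr(M M^H) = sum |m_ij|^2.
M = np.array([[1 + 1j, 2.0],
              [0.5j, -1.0]])
norm_sq_trace = np.trace(M @ M.conj().T).real   # tr(M M^H)
norm_sq_sum = np.sum(np.abs(M) ** 2)            # sum of squared magnitudes
agree = np.isclose(norm_sq_trace, norm_sq_sum)  # the two definitions coincide
```

This is the squared Frobenius norm, so $J_2$ measures the total element-wise deviation of $\boldsymbol{W}\boldsymbol{H}_D$ from the identity.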

An update equation of the separation matrix to minimize Equation (36) is obtained by the gradient method that uses the complex gradient calculation $\frac{\partial }{\partial \boldsymbol {W}^*}$.

  $\displaystyle \boldsymbol {W}(k_ i, f+1) $ $\displaystyle = $ $\displaystyle \boldsymbol {W}(k_ i, f) - \mu \frac{\partial J}{\partial \boldsymbol {W}^*}(\boldsymbol {W}(k_ i, f)) \label{eq:updateSepMatStat} $   (39)

Here, $\mu $ indicates a stepsize regulating the amount by which the separation matrix is updated. Usually, obtaining the complex gradient on the right-hand side of Equation (39) requires values from multiple frames to compute expectations such as $\boldsymbol {R}^{xx} = E[\boldsymbol {XX}^ H]$ and $\boldsymbol {R}^{yy} = E[\boldsymbol {YY}^ H]$. The GHDSS node does not compute such correlation matrices; instead, Equation (40), which uses only the current frame, is used.

  $\displaystyle \boldsymbol {W}(k_ i, f+1) $ $\displaystyle = $ $\displaystyle \boldsymbol {W}(k_ i, f) - \left[ \mu _{SS} \frac{\partial J_1}{\partial \boldsymbol {W}^*}(\boldsymbol {W}(k_ i, f)) + \mu _{LC} \frac{\partial J_2}{\partial \boldsymbol {W}^*}(\boldsymbol {W}(k_ i, f)) \right], \label{eq:updateSepMatInst} $   (40)
  $\displaystyle \frac{\partial J_1}{\partial \boldsymbol {W}^*}(\boldsymbol {W}) $ $\displaystyle = $ $\displaystyle \left(\phi (\boldsymbol {Y})\boldsymbol {Y}^ H - \mathrm{diag}[\phi (\boldsymbol {Y})\boldsymbol {Y}^ H] \right)\tilde{\phi }(\boldsymbol {W}\boldsymbol {X})\boldsymbol {X}^ H, \label{eq:J1} $   (41)
  $\displaystyle \frac{\partial J_2}{\partial \boldsymbol {W}^*}(\boldsymbol {W}) $ $\displaystyle = $ $\displaystyle 2\left(\boldsymbol {W}\boldsymbol {H}_ D - \boldsymbol {I} \right)\boldsymbol {H}_ D^ H, \label{eq:J2} $   (42)

Here, $\tilde{\phi }$ is a partial differential of $\phi $ and is defined as follows.

      $\displaystyle \tilde{\phi }(\boldsymbol {Y}) $ $\displaystyle = $ $\displaystyle [\tilde{\phi }(Y_1), \tilde{\phi }(Y_2),\dots ,\tilde{\phi }(Y_ N)]^ T $   (43)
  $\displaystyle \tilde{\phi }(Y_ k) $ $\displaystyle = $ $\displaystyle \phi (Y_ k)+ Y_ k \frac{\partial \phi (Y_ k)}{\partial Y_ k} $   (44)

Moreover, $\mu _{SS} = \mu \alpha $ and $\mu _{LC} = \mu \beta $ are the stepsizes based on the higher-order decorrelation and the geometric constraints, respectively. When automatic regulation is selected, the stepsizes are calculated by the equations

  $\displaystyle \mu _{SS} $ $\displaystyle = $ $\displaystyle \frac{J_1(\boldsymbol {W})}{2 \left\| \frac{\partial J_1}{\partial \boldsymbol {W}}(\boldsymbol {W}) \right\| ^2} $   (45)
  $\displaystyle \mu _{LC} $ $\displaystyle = $ $\displaystyle \frac{J_2(\boldsymbol {W})}{2 \left\| \frac{\partial J_2}{\partial \boldsymbol {W}}(\boldsymbol {W}) \right\| ^2} $   (46)

The indices of each parameter in Equations (41, 42) are $(k_ i, f)$, which are abbreviated above. The initial values of the separation matrix are obtained as follows.

  $\displaystyle \boldsymbol {W}(k_ i) $ $\displaystyle = $ $\displaystyle \boldsymbol {H}_ D^ H(k_ i) / M, \label{eq:initSepMat} $   (47)

Here, $M$ indicates the number of microphones.
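One update step of Equations (40)-(42), with the initialization of Equation (47) and fixed stepsizes, can be sketched as follows. This is a minimal numerical sketch, not the HARK implementation; in particular, reading the derivative in Equation (44) along the magnitude of $Y_k$ (giving $\tilde{\phi}(Y_k) = [\tanh(\sigma|Y_k|) + \sigma|Y_k|\,\mathrm{sech}^2(\sigma|Y_k|)]\, e^{j\angle Y_k}$) is an assumption, and the toy geometry is made up:

```python
import numpy as np

# phi is the tanh nonlinearity of Equations (34)-(35).
def phi(Y, sigma=1.0):
    return np.tanh(sigma * np.abs(Y)) * np.exp(1j * np.angle(Y))

# phi_tilde follows Equation (44), derivative taken along |Y_k| (assumption).
def phi_tilde(Y, sigma=1.0):
    a = np.abs(Y)
    return (np.tanh(sigma * a) + sigma * a / np.cosh(sigma * a) ** 2) \
        * np.exp(1j * np.angle(Y))

def ghdss_step(W, X, H_D, mu_ss=0.001, mu_lc=0.001, sigma=1.0):
    """One frame of Equation (40) for a single frequency bin.

    W: (N, M) separation matrix, X: (M, 1) observation, H_D: (M, N).
    """
    Y = W @ X                                            # Equation (33)
    Ryy = phi(Y, sigma) @ Y.conj().T                     # phi(Y) Y^H, (N, N)
    grad_j1 = (Ryy - np.diag(np.diag(Ryy))) \
        @ phi_tilde(W @ X, sigma) @ X.conj().T           # Equation (41)
    grad_j2 = 2.0 * (W @ H_D - np.eye(W.shape[0])) \
        @ H_D.conj().T                                   # Equation (42)
    return W - (mu_ss * grad_j1 + mu_lc * grad_j2)       # Equation (40)

# Toy setup: N = 2 sources, M = 3 microphones, one frequency bin.
rng = np.random.default_rng(0)
H_D = rng.standard_normal((3, 2)) + 1j * rng.standard_normal((3, 2))
W = H_D.conj().T / 3                                     # Equation (47), M = 3
X = rng.standard_normal((3, 1)) + 1j * rng.standard_normal((3, 1))
W_next = ghdss_step(W, X, H_D)                           # shape stays (2, 3)
```

In the node this step runs once per frame and per frequency bin; with ADAPTIVE stepsizes, $\mu_{SS}$ and $\mu_{LC}$ would instead be recomputed from Equations (45)-(46) at each step.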
Processing flow

\includegraphics[width=.8\textwidth ]{fig/modules/GHDSS-fc-overview.eps}
Figure 6.44: Flowchart of GHDSS 

The main processing for time frame $f$ in the GHDSS node is shown in Figure 6.44. It consists of the following steps:

  1. Acquiring a transfer function (direct sound)

  2. Estimating the separation matrix $\boldsymbol {W}$

  3. Performing sound source separation in accordance with Equation (33)

  4. Writing of a separation matrix (When EXPORT_W is set to true)

Acquiring a transfer function: At the first frame, the transfer function closest to the localization result is searched for in the file specified by TF_CONJ_FILENAME.

Processing after the second frame is as follows.

UPDATE_METHOD_TF_CONJ is ID

  1. The acquired ID is compared with the ID of one frame before.

    • Same: carry over the current transfer function

    • Different: read the transfer function anew

UPDATE_METHOD_TF_CONJ is POS

  1. The acquired direction is compared with the sound source direction of one frame before.

    • Error is less than UPDATE_ACCEPT_ANGLE: carry over the current transfer function

    • Error is UPDATE_ACCEPT_ANGLE or more: read the transfer function anew

Estimating the separation matrix: The initial value of the separation matrix depends on whether the user designates the parameter INITW_FILENAME.
When INITW_FILENAME is not designated, the separation matrix $\boldsymbol {W}$ is calculated from the transfer function $\boldsymbol {H}_ D$.
When INITW_FILENAME is designated, the entry closest to the direction of the input source localization result is searched for in the designated separation matrix file.
Processing from the second frame onward is as follows.

\includegraphics[width=.8\textwidth ]{fig/modules/GHDSS-fc-sep.eps}
Figure 6.45: Flowchart of separation matrix estimation

The flow for estimating the separation matrix is shown in Figure 6.45. Here, either the separation matrix of the previous frame is updated based on Equation (40), or an initial separation matrix is derived from the transfer function based on Equation (47).

UPDATE_METHOD_W is ID

  1. Compare with the previous frame ID

    • Same: Update $\boldsymbol {W}$

    • Different: Initialize $\boldsymbol {W}$

UPDATE_METHOD_W is POS

  1. Compare with the localization direction of the previous frame

    • Error is less than UPDATE_ACCEPT_ANGLE: Update $\boldsymbol {W}$

    • Error is UPDATE_ACCEPT_ANGLE or more: Initialize $\boldsymbol {W}$

UPDATE_METHOD_W is ID_POS

  1. Compare with the ID of the previous frame

    • Same: Update $\boldsymbol {W}$

  2. When the IDs differ, the localization directions are compared

    • Error is less than UPDATE_ACCEPT_ANGLE: Update $\boldsymbol {W}$

    • Error is UPDATE_ACCEPT_ANGLE or more: Initialize $\boldsymbol {W}$

Writing of a separation matrix (when EXPORT_W is set to true): When EXPORT_W is set to true, a converged separation matrix is output to the file designated by EXPORT_W_FILENAME.
When multiple sound sources are detected, all of their separation matrices are output to one file. When a sound source disappears, its separation matrix is written to the file.
When writing to the file, whether to overwrite an existing entry or to append the sound source as a new entry is determined by comparing the localization direction with those of the sound sources already saved.

Disappearance of a sound source

  1. Compare with the localization directions of the sound sources already saved

    • Error is less than UPDATE_ACCEPT_ANGLE: Overwrite the saved $\boldsymbol {W}$

    • Error is UPDATE_ACCEPT_ANGLE or more: Save $\boldsymbol {W}$ as a new entry