From multichannel speech waveform data, the direction of arrival (DOA) in the horizontal plane is estimated using the MUltiple SIgnal Classification (MUSIC) method. This is the main node for Sound Source Localization (SSL) in HARK.
A transfer function file consisting of steering vectors is required. It is generated either from the positional relationship between the microphones and the sound sources, or from measured transfer functions.
This node estimates the directions of sound sources and their power using the MUSIC method. Detecting directions with high power in each frame gives the system an estimate of the sound directions, the number of sound sources, the speech periods, and so on. The localization result output from this node is used for post-processing such as tracking and source separation.
Typical connection
Figure 6.22 shows a typical connection example.
Input
INPUT: Matrix<complex<float> > type. Complex frequency representation of the input signals, with size $M \times (NFFT/2+1)$, where $M$ is the number of channels and $NFFT$ is the number of FFT points (LENGTH).
NOISECM: Matrix<complex<float> > type. The correlation matrix for each frequency bin. $NFFT/2+1$ correlation matrices are input, each an $M$-th order complex square matrix. The rows of the Matrix<complex<float> > correspond to frequency ($NFFT/2+1$ rows) and the columns hold the elements of the complex correlation matrix ($M \times M$ columns). This input terminal can also be left disconnected; an identity matrix is then used as the correlation matrix.
Output
Source position (direction) is expressed as Vector<ObjectRef> type. Each ObjectRef is a Source, a structure consisting of the power of the MUSIC spectrum of the source and its direction. The number of elements of the Vector is the number of detected sound sources. Refer to the node details for the MUSIC spectrum.
Vector<float> type. Power of the MUSIC spectrum for every direction, output from the SPECTRUM terminal. This output terminal is not displayed by default.
Matrix<float> type. The power of the MUSIC spectrum for every frequency bin and every direction is output as a two-dimensional array. The SPECTRUM output terminal outputs the sum of these values along the frequency axis. This output terminal is not displayed by default.
Matrix<float> type. The eigenvalues (or singular values) computed by the eigenvalue decomposition (or singular value decomposition) of the correlation matrix are output for every frequency bin. Rows correspond to frequency bins and columns correspond to the order of the correlation matrix. This output terminal is not displayed by default.
Refer to Figure 6.23 for how to add the hidden output terminals.
| Parameter name | Type | Default value | Unit | Description |
|---|---|---|---|---|
| MUSIC_ALGORITHM | string | SEVD | | Algorithm of MUSIC |
| TF_CHANNEL_SELECTION | Vector<int> | See below. | | Channel number used |
| LENGTH | int | 512 | [pt] | FFT points ($NFFT$) |
| SAMPLING_RATE | int | 16000 | [Hz] | Sampling rate |
| A_MATRIX | string | | | Transfer function file name |
| ELEVATION | float | 16.7 | [deg] | Elevation angle of sound |
| WINDOW | int | 50 | [frame] | Frames to normalize CM |
| PERIOD | int | 50 | [frame] | The cycle to compute SSL |
| NUM_SOURCE | int | 2 | | Number of sounds |
| MIN_DEG | int | -180 | [deg] | Minimum azimuth |
| MAX_DEG | int | 180 | [deg] | Maximum azimuth |
| LOWER_BOUND_FREQUENCY | int | 500 | [Hz] | Lower bound frequency |
| UPPER_BOUND_FREQUENCY | int | 2800 | [Hz] | Upper bound frequency |
| SPECTRUM_WEIGHT_TYPE | string | Uniform | | Type of frequency weight |
| A_CHAR_SCALING | float | 1.0 | | Coefficient of weight |
| MANUAL_WEIGHT_SPLINE | Matrix<float> | See below. | | Coefficient of spline weight |
| MANUAL_WEIGHT_SQUARE | Vector<float> | See below. | | Key point of rectangular weight |
| ENABLE_EIGENVALUE_WEIGHT | bool | true | | Enable eigenvalue weight |
| DEBUG | bool | false | | ON/OFF of debug output |
Parameter
MUSIC_ALGORITHM: string type. Selects the algorithm used to calculate the signal subspace in the MUSIC method. SEVD is standard eigenvalue decomposition, GEVD is generalized eigenvalue decomposition, and GSVD is generalized singular value decomposition. LocalizeMUSIC accepts a correlation matrix containing noise information from the NOISECM terminal and can whiten (suppress) that noise during SSL. SEVD performs SSL without this function, so when SEVD is chosen the input from the NOISECM terminal is ignored. Both GEVD and GSVD whiten the noise given at the NOISECM terminal. GEVD suppresses noise better than GSVD, but its calculation takes approximately 4 times longer. Select one of the three algorithms according to the scene and the computing environment. Refer to the node details for the algorithms.
TF_CHANNEL_SELECTION: Vector<int> type. Selects which channels of the multichannel steering vectors stored in the transfer function file are used. Channel numbers begin from 0, as in ChannelSelector. Eight-channel signal processing is assumed by default, so the default value is <Vector<int> 0 1 2 3 4 5 6 7>. The number of elements of this parameter must match the number of channels of the input signal, and the channel order of TF_CHANNEL_SELECTION must match the order of the channels supplied to the INPUT terminal.
LENGTH: int type. 512 is the default value. Number of FFT points used in the Fourier transform. It must match the number of FFT points used by the preceding nodes.
SAMPLING_RATE: int type. 16000 is the default value. Sampling frequency of the input acoustic signal. Like LENGTH, it must match the value used by the other nodes.
A_MATRIX: string type. There is no default value. Specifies the file name of the transfer function file. Both absolute and relative paths are supported. Refer to harktool3 for how to create the transfer function file.
ELEVATION: float type. Specifies the elevation angle of the sound sources. HARK does not currently support estimation of the elevation angle, so it is treated as a fixed value. The default value assumes, as one example of the positional relationship between a robot and a person, that the microphone array of the robot is at a height of 1.2 [m], the person (sound source) is at a height of 1.5 [m], and the distance between the robot and the person is about 1.0 [m], which gives $\arctan((1.5 - 1.2)/1.0) \approx 0.291$ [rad] $\approx 16.7$ [deg].
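The default elevation can be checked with a few lines of arithmetic. This is only an illustrative sketch; the variable names are hypothetical and are not HARK parameters.

```python
import math

# Geometry from the example above (variable names are illustrative only).
mic_height_m = 1.2     # height of the robot's microphone array
source_height_m = 1.5  # height of the speaking person
distance_m = 1.0       # horizontal distance between robot and person

elevation_rad = math.atan2(source_height_m - mic_height_m, distance_m)
elevation_deg = math.degrees(elevation_rad)
print(f"{elevation_rad:.3f} rad = {elevation_deg:.1f} deg")  # prints: 0.291 rad = 16.7 deg
```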
WINDOW: int type. 50 is the default value. Specifies the number of frames over which the correlation matrix is smoothed. Within the node, a correlation matrix is generated for every frame from the complex spectrum of the input signal, and the mean is taken over the number of frames specified in WINDOW. A larger value stabilizes the correlation matrix but lengthens the time delay because the averaging interval becomes longer.
PERIOD: int type. 50 is the default value. Specifies the cycle of the SSL calculation in frames. If this value is large, the time interval between localization results becomes long, which degrades detection of speech intervals and tracking of moving sound sources. If it is small, the computational cost increases, so it must be tuned to the computing environment.
NUM_SOURCE: int type. 2 is the default value. This is the number of dimensions of the signal subspace in the MUSIC method, and in practice can be interpreted as the number of desired sound sources to be emphasized in peak detection. It is written $N_s$ in the node details below. It should satisfy $1 \leq N_s \leq M - 1$, where $M$ is the number of channels. It is desirable to match it to the number of desired sound sources, but even when there are, for example, three desired sources, they are usually active at different times, so a value smaller than the actual number is sufficient in practice.
MIN_DEG: int type. -180 is the default value. This is the minimum angle for the peak search and is written $\theta_{min}$ in the node details. 0 degrees is the front of the robot, negative values are to the robot's right, and positive values are to the robot's left. The range is nominally -180 to 180 degrees for convenience, but ranges spanning 360 degrees or more are also supported, so there is no particular limitation.
MAX_DEG: int type. 180 is the default value. This is the maximum angle for the peak search and is written $\theta_{max}$ in the node details. Otherwise it is the same as MIN_DEG.
LOWER_BOUND_FREQUENCY: int type. 500 is the default value. This is the lowest frequency taken into account in peak detection and is written $\omega_{min}$ in the node details. It should satisfy $0 \leq \omega_{min} \leq$ SAMPLING_RATE/2.
UPPER_BOUND_FREQUENCY: int type. 2800 is the default value. This is the highest frequency taken into account in peak detection and is written $\omega_{max}$ in the node details. It should satisfy $\omega_{min} < \omega_{max} \leq$ SAMPLING_RATE/2.
SPECTRUM_WEIGHT_TYPE: string type. ‘Uniform’ is the default value. Specifies how the MUSIC spectrum used for peak detection is weighted along the frequency axis. ‘Uniform’ turns weighting off. ‘A_Characteristic’ weights the MUSIC spectrum in imitation of the sound pressure sensitivity of human hearing. ‘Manual_Spline’ weights the MUSIC spectrum with a cubic spline curve whose interpolation points are specified in MANUAL_WEIGHT_SPLINE. ‘Manual_Square’ generates a rectangular weight from the frequencies specified in MANUAL_WEIGHT_SQUARE and applies it to the MUSIC spectrum.
A_CHAR_SCALING: float type. 1.0 is the default value. This scaling term modifies the A characteristic weight along the frequency axis. Because the A characteristic weight imitates the sound pressure sensitivity of human hearing, it can filter out sound outside the speech frequency range. The A characteristic weight follows a standard, but depending on the sound environment, noise may fall within the speech frequency range and localization may fail. In that case, the scaling of the A characteristic weight should be increased, which strengthens the suppression, especially at lower frequencies.
MANUAL_WEIGHT_SPLINE: Matrix<float> type. <Matrix<float> <rows 2> <cols 5> <data 0.0 2000.0 4000.0 6000.0 8000.0 1.0 1.0 1.0 1.0 1.0> > is the default value. It is specified as a 2-by-$K$ matrix of float values, where $K$ is the number of interpolation points for the spline interpolation. The first row specifies the frequencies and the second row specifies the corresponding weights. Weighting follows the spline curve that passes through the interpolation points. By default, the weights are all 1 over the frequency range from 0 [Hz] to 8000 [Hz].
MANUAL_WEIGHT_SQUARE: Vector<float> type. <Vector<float> 0.0 2000.0 4000.0 6000.0 8000.0> is the default value. A rectangular weight is generated from the frequencies specified in MANUAL_WEIGHT_SQUARE and applied to the MUSIC spectrum. Frequency bands from an odd-numbered element of MANUAL_WEIGHT_SQUARE to the next even-numbered element receive a weight of 1, and bands from an even-numbered element to the next odd-numbered element receive a weight of 0. With the default value, the MUSIC spectrum from 2000 [Hz] to 4000 [Hz] and from 6000 [Hz] to 8000 [Hz] is suppressed.
ENABLE_EIGENVALUE_WEIGHT: bool type. True is the default value. When true, the MUSIC spectrum is weighted by the square root of the maximum eigenvalue (or maximum singular value) obtained from the eigenvalue decomposition (or singular value decomposition) of the correlation matrix. When GEVD or GSVD is chosen for MUSIC_ALGORITHM, this weight changes greatly depending on the eigenvalues of the correlation matrix input from the NOISECM terminal, so false is recommended in that case.
DEBUG: bool type. Turns the debug output ON or OFF. The format of the debug output is as follows. First, for each sound source detected in the frame, a tab-delimited set of the source index, direction, and power is output. The index is a number assigned for convenience, in order from 0 in every frame, and has no meaning in itself. The direction [deg] is displayed as an integer rounded from the decimal value. The power is the value of the MUSIC spectrum of Eq. (16), output as is. Next, “MUSIC spectrum” is output after a line feed, and the value of the broadband MUSIC spectrum of Eq. (16) is displayed for all directions.
The MUSIC method estimates the direction of arrival (DOA) using the eigenvalue decomposition of the correlation matrix between input signal channels. The algorithm is summarized below.
Generation of transfer function
In the MUSIC method, the transfer function from a sound source to each microphone is measured or calculated numerically and used as a priori information. If the transfer function in the frequency domain from a sound source in direction $\theta$, as seen from the microphone array, to the $i$-th microphone is written $h_i(\theta,\omega)$, the multichannel transfer function can be expressed as follows.
$$H(\theta,\omega) = [h_1(\theta,\omega), \cdots, h_M(\theta,\omega)]^T \tag{8}$$
This transfer function vector is prepared in advance, by calculation or measurement, at a suitable interval $\Delta\theta$ (irregular intervals are also possible). In HARK, harktool3 is provided as a tool that can generate the transfer function file either by numerical calculation or from measurements. Refer to the harktool3 section for how to prepare a transfer function file. The LocalizeMUSIC node imports this a priori information (the transfer function file) from the file name specified in A_MATRIX. Because the transfer function is prepared for every sound direction and localization scans over direction using these vectors, the transfer function is sometimes called a ‘steering vector’.
Calculation of the correlation matrix between input signal channels
Processing in HARK begins here. First, the signal vector in the frequency domain, obtained by the short-time Fourier transform of the $M$-channel input acoustic signal, is written as follows.
$$X(\omega,f) = [X_1(\omega,f), \cdots, X_M(\omega,f)]^T \tag{9}$$
where $\omega$ is the frequency and $f$ is the frame index. In HARK, the processing up to this point is performed by the MultiFFT node in the preceding stage.
The correlation matrix between the channels of the input signal is defined as follows for every frame and every frequency.
$$R(\omega,f) = X(\omega,f) X^{*}(\omega,f) \tag{10}$$
where $X(\omega,f)$ is the frequency-domain signal vector of Eq. (9) and $()^{*}$ represents the conjugate transpose operator. In theory this matrix could be used as is in the following processing, but in practice HARK uses its time average in order to obtain a stable correlation matrix.
$$R'(\omega,f) = \frac{1}{\mathrm{WINDOW}} \sum_{i=0}^{\mathrm{WINDOW}-1} R(\omega,f+i) \tag{11}$$
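The per-frame correlation matrix and its time average can be sketched as follows. This is a minimal NumPy illustration, not HARK's implementation; the function name and the array layout (frames, channels, frequency bins) are assumptions made for the example.

```python
import numpy as np

def averaged_correlation(X):
    """Average the per-frame correlation matrices over a block of frames.

    X: complex array of shape (n_frames, n_channels, n_bins), i.e. the
       spectra of WINDOW consecutive frames (layout is an assumption).
    Returns an array of shape (n_bins, n_channels, n_channels).
    """
    # For each bin w: mean over frames of x x^H, where x is the channel vector.
    return np.einsum("fmw,fnw->wmn", X, X.conj()) / X.shape[0]
```

Each returned matrix is Hermitian, which is what allows an eigenvalue decomposition with real eigenvalues in the next step.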
Decomposition to the signal and noise subspace
In the MUSIC method, an eigenvalue decomposition or singular value decomposition of the correlation matrix found in Eq. (11) is performed, and the $M$-dimensional space is decomposed into the signal subspace and the remaining subspace.
Because this processing has a high computational cost, it is designed to be calculated only once every several frames. In LocalizeMUSIC, this operation period can be specified in PERIOD.
In LocalizeMUSIC , the method for decomposing into subspace can be specified by MUSIC_ALGORITHM.
When MUSIC_ALGORITHM is set to SEVD, the following standard eigenvalue decomposition is performed.
$$R'(\omega,f) = E(\omega,f) \Lambda(\omega,f) E^{-1}(\omega,f) \tag{12}$$
where $R'(\omega,f)$ is the time-averaged correlation matrix, $E(\omega,f) = [e_1(\omega,f), \cdots, e_M(\omega,f)]$ is the matrix consisting of mutually orthogonal eigenvectors, and $\Lambda(\omega,f)$ is the diagonal matrix whose diagonal components are the eigenvalues corresponding to the individual eigenvectors. The diagonal components of $\Lambda(\omega,f)$, $[\lambda_1(\omega,f), \cdots, \lambda_M(\omega,f)]$, are assumed to be sorted in descending order.
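When MUSIC_ALGORITHM is set to GEVD, the following generalized eigenvalue decomposition is performed.
$$K^{-\frac{1}{2}}(\omega,f) R'(\omega,f) K^{-\frac{1}{2}}(\omega,f) = E(\omega,f) \Lambda(\omega,f) E^{-1}(\omega,f) \tag{13}$$
where $K(\omega,f)$ represents the correlation matrix input from the NOISECM terminal at the $f$-th frame and $R'(\omega,f)$ is the time-averaged correlation matrix of the input signal. Because large eigenvalues originating from the noise sources included in $K(\omega,f)$ can be whitened (suppressed) by the generalized eigenvalue decomposition with $K(\omega,f)$, SSL with suppressed noise can be realized.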
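The whitening effect can be illustrated with a short NumPy sketch. It assumes the generalized decomposition is computed by pre- and post-multiplying the input correlation matrix by the inverse square root of the noise correlation matrix, and that the noise matrix is Hermitian and positive definite; the function names are hypothetical and do not come from HARK.

```python
import numpy as np

def inv_sqrt_hermitian(K):
    """Inverse square root of a Hermitian positive definite matrix."""
    w, V = np.linalg.eigh(K)
    return (V / np.sqrt(w)) @ V.conj().T  # V diag(w^-1/2) V^H

def gevd_whitened(R, K):
    """Eigen-decompose K^-1/2 R K^-1/2; eigenvalues returned in descending order."""
    K_ih = inv_sqrt_hermitian(K)
    Rw = K_ih @ R @ K_ih
    Rw = (Rw + Rw.conj().T) / 2           # remove rounding asymmetry
    lam, E = np.linalg.eigh(Rw)           # eigh returns ascending order
    return lam[::-1], E[:, ::-1]
```

When the input contains only the known noise (`R` equal to `K`), all eigenvalues become 1, so no direction stands out; a source added on top of the noise raises only the leading eigenvalue.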
When MUSIC_ALGORITHM is set to GSVD, the following generalized singular value decomposition is performed.
$$K^{-1}(\omega,f) R'(\omega,f) = E_l(\omega,f) \Lambda(\omega,f) E_r^{*}(\omega,f) \tag{14}$$
where $E_l(\omega,f)$ and $E_r(\omega,f)$ represent the matrices consisting of the left singular vectors and the right singular vectors, respectively, and $\Lambda(\omega,f)$ represents the diagonal matrix whose diagonal components are the singular values.
Because the eigenvalues (or singular values) associated with the eigenvectors obtained by the decomposition are correlated with the power of the sound sources, taking the eigenvectors with large eigenvalues selects only the subspace of the desired sound sources with large power. If the number of sound sources to be considered is $N_s$, the eigenvectors $[e_1(\omega,f), \cdots, e_{N_s}(\omega,f)]$ correspond to the sound sources and $[e_{N_s+1}(\omega,f), \cdots, e_M(\omega,f)]$ correspond to noise. In LocalizeMUSIC, $N_s$ can be specified as NUM_SOURCE.
Calculation of MUSIC spectrum
The MUSIC spectrum for SSL is calculated as follows using only the noise-related eigenvectors.
$$P(\theta,\omega,f) = \frac{|H^{*}(\theta,\omega) H(\theta,\omega)|}{\sum_{i=N_s+1}^{M} |H^{*}(\theta,\omega) e_i(\omega,f)|} \tag{15}$$
The denominator on the right-hand side is computed from the inner products of the noise-related eigenvectors and the steering vector. In the space spanned by the eigenvectors, the noise subspace (small eigenvalues) and the signal subspace (large eigenvalues) are orthogonal to each other, so if the steering vector corresponds to the direction of a desired sound source, this inner product is theoretically 0. Therefore $P(\theta,\omega,f)$ diverges to infinity. In practice it does not diverge because of noise and other effects, but a sharper peak is observed than with beamforming. The numerator on the right-hand side is a normalization term.
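The following sketch computes a single-frequency MUSIC spectrum with standard eigenvalue decomposition. It is an illustration under stated assumptions rather than HARK's code: `steering_vectors` builds hypothetical free-field steering vectors for an 8-microphone circular array as a stand-in for the transfer function file, and all names and default values are invented for the example.

```python
import numpy as np

def steering_vectors(angles_deg, freq_hz, radius_m=0.1, n_mics=8, c=343.0):
    """Hypothetical free-field steering vectors, shape (n_angles, n_mics)."""
    mic_az = 2 * np.pi * np.arange(n_mics) / n_mics
    th = np.deg2rad(np.asarray(angles_deg, dtype=float))[:, None]
    delay = -radius_m * np.cos(th - mic_az[None, :]) / c  # relative to array centre
    return np.exp(-2j * np.pi * freq_hz * delay)

def music_spectrum(R, H, num_source):
    """MUSIC spectrum for one frequency bin.

    R: (M, M) Hermitian correlation matrix, H: (n_angles, M) steering vectors.
    """
    _, E = np.linalg.eigh(R)                   # eigenvalues in ascending order
    E_noise = E[:, : R.shape[0] - num_source]  # eigenvectors of the smallest eigenvalues
    numerator = np.abs(np.sum(H.conj() * H, axis=1))
    denominator = np.sum(np.abs(H.conj() @ E_noise), axis=1)
    return numerator / denominator
```

For a correlation matrix built from one source plus weak white noise, the spectrum peaks sharply at the source direction because the matching steering vector is almost orthogonal to every noise eigenvector.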
Because the MUSIC spectrum $P(\theta,\omega,f)$ is obtained for each frequency, broadband SSL is performed by integrating it as follows.
$$\bar{P}(\theta,f) = \sum_{\omega=\omega_{min}}^{\omega_{max}} W_{\Lambda}(\omega,f) W_{\omega}(\omega,f) P(\theta,\omega,f) \tag{16}$$
where $\omega_{min}$ and $\omega_{max}$ are the minimum and maximum of the frequency band used in the broadband integration of the MUSIC spectrum. In LocalizeMUSIC they can be specified as LOWER_BOUND_FREQUENCY and UPPER_BOUND_FREQUENCY, respectively.
Moreover, $W_{\Lambda}(\omega,f)$ is the eigenvalue weight used in the broadband integration and is the square root of the maximum eigenvalue (or maximum singular value) $\lambda_1(\omega,f)$.
In LocalizeMUSIC, the eigenvalue weight can be enabled or disabled with ENABLE_EIGENVALUE_WEIGHT: when false, $W_{\Lambda}(\omega,f) = 1$, and when true, $W_{\Lambda}(\omega,f) = \sqrt{\lambda_1(\omega,f)}$. Moreover, $W_{\omega}(\omega,f)$ is the frequency weight used in the broadband integration, and its type can be specified as follows by SPECTRUM_WEIGHT_TYPE in LocalizeMUSIC.
When SPECTRUM_WEIGHT_TYPE is Uniform
the weights are uniform, and the frequency weight is 1 for all frequency bins.
When SPECTRUM_WEIGHT_TYPE is A_Characteristic
the weight is the A characteristic weight standardized by the International Electrotechnical Commission. Figure 6.24 shows the frequency characteristic of the A characteristic weight, with frequency on the horizontal axis and the weight on the vertical axis. In LocalizeMUSIC, A_CHAR_SCALING is introduced as a scaling term along the frequency direction of this characteristic, and Figure 6.24 plots two scaling settings as an example. The weight finally applied to the MUSIC spectrum is derived from the scaled characteristic. As an example, the case of A_CHAR_SCALING=1 is shown in Figure 6.25.
When SPECTRUM_WEIGHT_TYPE is Manual_Spline
the frequency weight follows the curve obtained by spline interpolation through the interpolation points specified in MANUAL_WEIGHT_SPLINE. MANUAL_WEIGHT_SPLINE is specified as a Matrix<float> with 2 rows and one column per interpolation point. The first row gives the frequencies and the second row gives the weights at those frequencies. Any number of interpolation points may be used. As an example, when MANUAL_WEIGHT_SPLINE is
<Matrix<float> <rows 2> <cols 3> <data 0.0 4000.0 8000.0 1.0 0.5 1.0> >
the number of interpolation points is 3, and a spline curve is created that takes the weights 1, 0.5, and 1 at the three frequencies 0, 4000, and 8000 [Hz] on the frequency axis, respectively. The resulting frequency weight is shown in Figure 6.26.
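A spline weight of this kind can be sketched with NumPy alone. The boundary condition used by LocalizeMUSIC is not stated here, so this example assumes a natural cubic spline (zero second derivative at both ends); the function name is hypothetical, and the 31.25 Hz bin spacing in the usage lines simply follows from the default SAMPLING_RATE and LENGTH (16000 / 512).

```python
import numpy as np

def natural_cubic_spline(xk, yk, x):
    """Evaluate a natural cubic spline through (xk, yk) at the points x."""
    xk = np.asarray(xk, dtype=float)
    yk = np.asarray(yk, dtype=float)
    x = np.asarray(x, dtype=float)
    n = len(xk)
    h = np.diff(xk)
    # Linear system for the second derivatives m at the knots.
    A = np.zeros((n, n))
    b = np.zeros(n)
    A[0, 0] = A[-1, -1] = 1.0  # natural boundary: m[0] = m[-1] = 0
    for i in range(1, n - 1):
        A[i, i - 1] = h[i - 1]
        A[i, i] = 2.0 * (h[i - 1] + h[i])
        A[i, i + 1] = h[i]
        b[i] = 6.0 * ((yk[i + 1] - yk[i]) / h[i] - (yk[i] - yk[i - 1]) / h[i - 1])
    m = np.linalg.solve(A, b)
    # Locate the interval of each evaluation point and apply the cubic formula.
    i = np.clip(np.searchsorted(xk, x, side="right") - 1, 0, n - 2)
    dl, dr = x - xk[i], xk[i + 1] - x
    return ((m[i] * dr**3 + m[i + 1] * dl**3) / (6.0 * h[i])
            + (yk[i] / h[i] - m[i] * h[i] / 6.0) * dr
            + (yk[i + 1] / h[i] - m[i + 1] * h[i] / 6.0) * dl)

# Weight for every frequency bin of a 512-point FFT at 16 kHz.
bin_freqs_hz = np.arange(257) * 16000 / 512
weights = natural_cubic_spline([0.0, 4000.0, 8000.0], [1.0, 0.5, 1.0], bin_freqs_hz)
```

The curve passes through the three specified points and dips smoothly to 0.5 at 4000 Hz.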
When SPECTRUM_WEIGHT_TYPE is Manual_Square
the frequency weight is a rectangular weight that switches at the frequencies specified in MANUAL_WEIGHT_SQUARE. MANUAL_WEIGHT_SQUARE is specified as a Vector<float> whose elements are the frequencies at which the rectangle switches. The number of switching points is arbitrary. As an example, the rectangular weight when MANUAL_WEIGHT_SQUARE is
<Vector<float> 0.0 2000.0 4000.0 6000.0 8000.0>
is shown in Figure 6.27. With this weight, two or more separate frequency ranges can be selected, which is not possible with UPPER_BOUND_FREQUENCY and LOWER_BOUND_FREQUENCY alone.
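The rectangular weight can be sketched as a count of how many switching points lie at or below each frequency: an odd count means the frequency falls in a band that starts at an odd-numbered element, so its weight is 1. The function name is hypothetical, and the treatment of frequencies exactly on a switching point or beyond the last one is an assumption of this sketch.

```python
import numpy as np

def square_weight(freqs_hz, switch_points_hz):
    """Rectangular weight: 1 from odd-numbered to even-numbered switch points, else 0."""
    freqs_hz = np.asarray(freqs_hz, dtype=float)
    switch_points_hz = np.asarray(switch_points_hz, dtype=float)
    # Number of switch points at or below each frequency.
    passed = np.searchsorted(switch_points_hz, freqs_hz, side="right")
    return (passed % 2 == 1).astype(float)

default_points = [0.0, 2000.0, 4000.0, 6000.0, 8000.0]
print(square_weight([1000.0, 3000.0, 5000.0, 7000.0], default_points))  # prints: [1. 0. 1. 0.]
```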
Search for sound sources
Next, local peaks of the broadband MUSIC spectrum $\bar{P}(\theta,f)$ of Eq. (16) are detected in the range from $\theta_{min}$ to $\theta_{max}$, and the DOAs of the top $N_s$ peaks are output together with the corresponding power of the MUSIC spectrum, in descending order of power, where $N_s$ is the value of NUM_SOURCE. The number of output sound sources may be less than $N_s$ when fewer than $N_s$ peaks are found. In LocalizeMUSIC, $\theta_{min}$ and $\theta_{max}$ can be specified in MIN_DEG and MAX_DEG, respectively.
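The broadband integration and the peak search can be sketched together as follows. This is an illustration only: the function names are hypothetical, the band limits are treated as inclusive, and the peak search assumes the direction grid covers the full circle so that the first and last directions are neighbors.

```python
import numpy as np

def broadband_music(P, eig_max, freq_weight, sampling_rate, length,
                    lower_hz, upper_hz, eigenvalue_weight=True):
    """Sum the per-frequency MUSIC spectrum P (n_bins, n_dirs) over a band."""
    freqs = np.arange(P.shape[0]) * sampling_rate / length
    band = (freqs >= lower_hz) & (freqs <= upper_hz)
    w = np.asarray(freq_weight, dtype=float)[band]
    if eigenvalue_weight:
        w = w * np.sqrt(eig_max[band])  # square root of the largest eigenvalue
    return (w[:, None] * P[band]).sum(axis=0)

def top_peaks(p_bar, angles_deg, num_source, min_deg=-180, max_deg=180):
    """Return up to num_source (angle, power) local peaks, strongest first."""
    n = len(p_bar)
    peaks = [i for i in range(n)
             if min_deg <= angles_deg[i] <= max_deg
             and p_bar[i] > p_bar[i - 1] and p_bar[i] > p_bar[(i + 1) % n]]
    peaks.sort(key=lambda i: p_bar[i], reverse=True)
    return [(angles_deg[i], p_bar[i]) for i in peaks[:num_source]]
```

If fewer local peaks than `num_source` exist within the search range, the returned list is simply shorter, which matches the behavior described above.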
Discussion
Finally, we describe the effect that whitening (noise suppression) has on the MUSIC spectrum in Eq. (15) when GEVD or GSVD is chosen for MUSIC_ALGORITHM.
Here, as an example, consider the situation of four speakers (directions 75 [deg], 25 [deg], -25 [deg], and -75 [deg]) speaking simultaneously.
Figure 6.28(a) shows the result of choosing SEVD for MUSIC_ALGORITHM, without whitening the noise. The horizontal axis is the azimuth, the vertical axis is the frequency, and the value is the per-frequency MUSIC spectrum of Eq. (15). As shown in the figure, diffuse noise is present in the low frequency range and directional noise is present around -150 degrees, so peaks cannot be detected correctly in only the directions of the 4 speakers.
Figure 6.28(b) shows the MUSIC spectrum, with SEVD chosen for MUSIC_ALGORITHM, in an interval in which the 4 speakers are not speaking. The diffuse noise and the directional noise observed in Figure 6.28(a) can be seen.
Figure 6.28(c) shows the MUSIC spectrum obtained when the noise correlation matrix is generated from the information in Figure 6.28(b) as known noise information, GSVD is chosen for MUSIC_ALGORITHM, and the noise is whitened. As shown in the figure, the diffuse noise and the directional noise contained in the noise correlation matrix are suppressed correctly, and strong peaks appear only in the directions of the 4 speakers.
Thus, GEVD and GSVD are useful when the noise is known.
F. Asano et al., “Real-Time Sound Source Localization and Separation System and Its Application to Automatic Speech Recognition”, in Proc. of International Conference on Speech Processing (Eurospeech 2001), pp. 1013–1016, 2001.
Toshiro Oga, Yutaka Kaneda, Yoshio Yamazaki, "Acoustic system and digital processing" The Institute of Electronics, Information and Communication Engineers.
K. Nakamura, K. Nakadai, F. Asano, Y. Hasegawa, and H. Tsujino, “Intelligent Sound Source Localization for Dynamic Environments”, in Proc. of IEEE/RSJ Int’l Conf. on Intelligent Robots and Systems (IROS 2009), pp. 664–669, 2009.