A transfer function in HARK is a complex-valued spectral function that represents how sound waves propagate between a sound source and a microphone array. HARK uses two types of transfer function data sets. One is used to localize sound sources (sound source localization) and is called the localization transfer function. The other is used to separate a mixture of multiple sound sources (sound source separation) and is called the separation transfer function.
1.1. Sound Source Localization
In HARK, sound source localization means estimating the direction of a sound source in one or two dimensions. For 1D localization, HARK estimates the azimuth, that is, the angle between a reference direction and the sound source direction on a horizontal plane (Figure 1). For 2D localization, once the azimuth has been estimated, HARK also estimates the elevation, that is, the angle between the horizontal plane and the sound source (Figure 2). The locations of multiple sound sources can be estimated at the same time in both 1D (Figure 3) and 2D (Figure 4).
Figure 1. Example of locating a single sound source in 1D in HARK
Figure 2. Example of locating a single sound source in 2D in HARK
Figure 3. Example of locating multiple simultaneous sound sources in 1D in HARK
Figure 4. Example of locating multiple simultaneous sound sources in 2D in HARK
LocalizeMUSIC is one of HARK's main nodes for sound source localization. This node provides MUltiple SIgnal Classification (MUSIC) methods to estimate sound source directions together with their individual spectral power. A MUSIC method uses a set of transfer functions, each of which represents sound propagation from one sound source position to the microphone array. LocalizeMUSIC compares the propagation characteristics of the input signal with those of each transfer function in the set and selects the best match. The best-matching transfer function indicates the general direction of the sound source.
For more information on LocalizeMUSIC and the MUSIC algorithm, please refer to the HARK Document at https://hark.jp/document/document-en/.
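The matching idea described above can be illustrated with a minimal narrowband MUSIC sketch. This is a textbook formulation under simplifying assumptions, not the implementation used by LocalizeMUSIC: the transfer functions are free-field steering vectors, the speed of sound (343 m/s), the frequency (2000 Hz), and all function names are hypothetical choices for this example. The microphone layout is an 8-channel circle with a radius of 0.0365 m, and the candidate directions are spaced at 30° on a 1 m circle.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed value for this sketch


def circle_xyz(radius, azimuths_deg):
    """Cartesian points on a horizontal circle, one row per azimuth."""
    az = np.deg2rad(np.asarray(azimuths_deg, dtype=float))
    return np.stack([radius * np.cos(az), radius * np.sin(az), np.zeros_like(az)], axis=1)


def free_field_steering(mics, sources, freq_hz):
    """Free-field transfer functions, shape (n_directions, n_mics)."""
    dist = np.linalg.norm(sources[:, None, :] - mics[None, :, :], axis=2)
    return np.exp(-2j * np.pi * freq_hz * dist / SPEED_OF_SOUND) / dist


def music_spectrum(snapshots, steering, n_sources):
    """Narrowband MUSIC spectrum, one value per candidate direction."""
    n_mics, n_frames = snapshots.shape
    corr = snapshots @ snapshots.conj().T / n_frames      # spatial correlation matrix
    _, eigvecs = np.linalg.eigh(corr)                     # eigenvalues in ascending order
    noise_subspace = eigvecs[:, : n_mics - n_sources]     # weakest eigenvectors
    leakage = np.sum(np.abs(steering.conj() @ noise_subspace) ** 2, axis=1)
    power = np.sum(np.abs(steering) ** 2, axis=1)
    return power / leakage                                # large where the match is good


candidates_deg = np.arange(0, 360, 30)
mics = circle_xyz(0.0365, np.arange(0, 360, 45))
steering = free_field_steering(mics, circle_xyz(1.0, candidates_deg), 2000.0)

# Simulate one source at the 120 degree candidate plus weak sensor noise.
rng = np.random.default_rng(0)
signal = rng.standard_normal(200) + 1j * rng.standard_normal(200)
noise = 0.01 * (rng.standard_normal((8, 200)) + 1j * rng.standard_normal((8, 200)))
snapshots = np.outer(steering[4], signal) + noise

spectrum = music_spectrum(snapshots, steering, n_sources=1)
estimated_deg = int(candidates_deg[np.argmax(spectrum)])
print(estimated_deg)
```

The candidate whose transfer function has the least overlap with the noise subspace gives the highest spectrum value, which is the "best match" in the sense described above.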
1.2. Sound Source Separation
HARK can separate an input sound mixture into a set of separated sounds. This is called sound source separation. Figure 5 shows an overview of sound source separation. The algorithms that perform sound source separation require the locations of the sound sources as an input. This location information can be obtained directly from sound source localization, or it can be set to constant values. Sound source separation also uses a set of transfer functions to estimate the separation matrix required by the separation process.
Figure 5. A sound mixture is separated into the separated sounds using the sound location information and transfer functions
GHDSS and BeamForming can be used for sound source separation. For more information, please refer to the HARK Document at https://hark.jp/document/document-en/.
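To make the role of the separation matrix concrete, the following sketch builds one from transfer functions by taking a pseudo-inverse. This is a deliberately simple textbook approach and is not the algorithm used by GHDSS or BeamForming. The free-field transfer functions, the speed of sound, the frequency, the three source directions, and all names are assumptions made for this example only.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed value for this sketch


def circle_xyz(radius, azimuths_deg):
    """Cartesian points on a horizontal circle, one row per azimuth."""
    az = np.deg2rad(np.asarray(azimuths_deg, dtype=float))
    return np.stack([radius * np.cos(az), radius * np.sin(az), np.zeros_like(az)], axis=1)


def transfer_matrix(mics, sources, freq_hz):
    """Free-field transfer functions at one frequency, shape (n_mics, n_sources)."""
    dist = np.linalg.norm(mics[:, None, :] - sources[None, :, :], axis=2)
    return np.exp(-2j * np.pi * freq_hz * dist / SPEED_OF_SOUND) / dist


mics = circle_xyz(0.0365, np.arange(0, 360, 45))   # 8-channel circular array
sources = circle_xyz(1.0, [0, 90, 210])            # three known source directions
H = transfer_matrix(mics, sources, 2000.0)

# Separation matrix: maps the 8 microphone channels back to 3 source signals.
W = np.linalg.pinv(H)

rng = np.random.default_rng(0)
source_signals = rng.standard_normal((3, 100)) + 1j * rng.standard_normal((3, 100))
mixture = H @ source_signals                       # what the microphones observe
separated = W @ mixture                            # one row per separated sound

print(np.allclose(separated, source_signals, atol=1e-8))
```

In this noise-free simulation the separated signals equal the original ones. With real recordings the result depends on how well the transfer functions describe the actual propagation.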
1.3. Methods to generate transfer functions
When we record a sound emitted by a sound source, the recorded data do not exactly match the emitted sound because of reflected sounds, better known as reverberation. For instance, as shown in Figure 6, consider a loudspeaker, a microphone array, and an object positioned in a room surrounded by four walls. When a sound signal is emitted from the loudspeaker, part of it is reflected by the walls and/or the object before arriving at the microphone array (Reflected Sound). These reflections add delayed and altered copies of the signal to the data recorded by the microphone array.
Figure 6. Sound traveling from a speaker to a microphone array in a room with four walls and an object. [Top View]
The characteristics of this propagation are represented by the transfer functions. Transfer functions can be obtained either by actual measurement or by numerical simulation based on the geometric relationship between the microphone array and the sound sources.
The measurement-based method uses recorded TSP (Time Stretched Pulse) responses or impulse responses to generate the transfer functions.
An impulse response is the signal recorded by a microphone array when an impulse signal is emitted from a sound source. An impulse signal contains all frequencies at a single point in time. The difficulty in measuring an impulse response directly is that a large amount of power is needed to emit an impulse loud enough to maintain a high signal-to-noise ratio (SNR), which is hard to achieve with a conventional loudspeaker.
To solve this problem, a Time Stretched Pulse (TSP) signal is used instead. The TSP signal is obtained by stretching out the frequencies of the impulse signal over time. Figure 7 shows spectrograms of the impulse and TSP signals. In the impulse signal, all frequencies are concentrated in a single instant, whereas in the TSP signal they are spread over time.
Stretching out the frequencies allows each frequency to be played at maximum power. The signal can also be repeated multiple times, and averaging the repetitions improves the SNR of the recorded TSP response. The impulse response can then be calculated from the TSP response.
Figure 7. A graph of the spectrum of frequencies of the Impulse and Time Stretched Pulse (TSP) signal over time.
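The relationship between a TSP signal, repeated recordings, and the recovered impulse response can be sketched as follows. The TSP here is built from one common textbook definition (unit magnitude with a quadratic phase in the frequency domain) and may differ from the TSP file distributed with HARK. The signal length, the stretch parameter `m`, the simulated three-tap room response, the noise level, and all function names are assumptions for this example.

```python
import numpy as np


def make_tsp(n=1024, m=256):
    """Return a length-n TSP and its inverse filter (hypothetical parameters)."""
    k = np.arange(n // 2 + 1)
    # Flat magnitude, quadratic phase: each frequency is delayed by a different amount.
    spectrum = np.exp(-1j * 4.0 * np.pi * m * k**2 / n**2)
    tsp = np.fft.irfft(spectrum, n)
    inverse = np.fft.irfft(np.conj(spectrum), n)
    return tsp, inverse


def circular_convolve(a, b):
    """Circular convolution of two equal-length real signals via the FFT."""
    return np.fft.irfft(np.fft.rfft(a) * np.fft.rfft(b), len(a))


def estimate_impulse_response(recordings, inverse):
    """Average repeated TSP responses, then apply the inverse filter."""
    averaged = np.mean(recordings, axis=0)
    return circular_convolve(averaged, inverse)


tsp, inverse = make_tsp()

# Simulated propagation: a direct sound followed by two weaker reflections.
true_ir = np.zeros(len(tsp))
true_ir[[5, 40, 90]] = [1.0, 0.5, 0.25]

# Sixteen repetitions of the same TSP response, each with independent noise.
rng = np.random.default_rng(0)
clean_response = circular_convolve(tsp, true_ir)
recordings = clean_response + 0.1 * rng.standard_normal((16, len(tsp)))


def rms_error(estimate):
    return float(np.sqrt(np.mean((estimate - true_ir) ** 2)))


error_single = rms_error(estimate_impulse_response(recordings[:1], inverse))
error_averaged = rms_error(estimate_impulse_response(recordings, inverse))
print(error_single, error_averaged)
```

Because the inverse filter only changes phase, the noise level of a single recording carries over to the impulse response estimate, while averaging the repetitions before filtering reduces it.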
The geometric-calculation-based method uses listed location data of both the sound sources and the microphones to generate the transfer functions.
Impulse responses are simulated from the microphone positions and sound source positions. From these impulse responses, the transfer functions are generated on the assumption that the microphones are set in a free space. This means that sound reflections, such as those from the walls and the object in Figure 6, are ignored. For this reason, transfer functions generated using this method are less accurate than those generated using the measurement-based method.
This section discusses how to generate transfer functions from actual measurements using TSP. The procedure can be divided into two parts: (1) playing and recording the TSP signal, and (2) calculating impulse responses from the TSP responses. The steps for recording TSP signals differ for each type of recording device and are covered in separate manuals.
The impulse response for each recording can be calculated from the recorded TSP response. This calculation can be performed using HARKTOOL5 (see Figure 8). HARKTOOL5 then creates transfer functions from the calculated impulse responses.
Figure 8. Calculating the transfer function using HARKTOOL5
To generate measurement-based transfer functions, recording TSP responses is essential. TSP responses can be obtained in two different ways: synchronized recording and unsynchronized recording. Synchronized recording requires special equipment (e.g., the RASP series), but it provides accurate transfer functions, which give the best performance in sound source localization and separation. Unsynchronized recording, on the other hand, can be done with most microphone array devices, but it provides less accurate transfer functions and may therefore give poorer localization and separation performance than synchronized recording.
The transfer functions will perform best when the TSP responses are recorded in the same set-up where the transfer functions are to be used. However, recording TSP responses in that set-up can be difficult. The recording set-up needs a loudspeaker to play the TSP signals and a microphone array to record the responses. The positions of both the loudspeaker and the microphone array need to be precise for every recording. The TSP signal needs to be played multiple times in order to reduce the impact of background noise and reverberation when creating the transfer function. The TSP signal must also be played from every direction in which sound source localization and/or sound source separation will be performed with the resulting transfer functions.
The environment of the recording set-up can affect the difficulty of this procedure. For example, if the recording set-up is located in a public place such as a park, conducting this procedure can be difficult. When the recording cannot be done in the same set-up where the transfer function is to be used, please find a more suitable environment. To make a transfer function that works in most environments, please record TSP responses by moving a loudspeaker around a circle at 5° intervals in a noise-free environment.
The transfer functions obtained this way are less accurate than those generated in the same set-up; however, they are generally more accurate than transfer functions created using the Geometric-calculation-based method. Please note that the accuracy of transfer functions may not be directly related to localization and separation performance.
Figure 9 shows a sample set-up in which a TSP response is recorded from every direction at 30° intervals (the recommended interval is 5°). The recording environment is assumed to be noise free and without any objects present. A microphone array is placed at the center of a circle with a radius of 1 m. Decide on a starting point where you will begin recording and mark it as 0°. Then mark the remaining loudspeaker positions at 30° intervals along the circle's circumference. When starting the recording, place the speaker on the starting point marked as 0°. After successfully obtaining the data by playing the TSP signal multiple times, move the speaker to the next position as shown in Figure 9. For the set-up with 30° intervals shown in Figure 9, the recording needs to be done at 12 different positions.
Figure 9. Speaker Positions [Top View]
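The loudspeaker positions for such a circular set-up follow directly from the radius and the interval. The short sketch below lists them. The function name and its parameters are hypothetical, and the coordinate convention (0° on the positive x axis, angles increasing counterclockwise) is an assumption chosen to match the example coordinates used in the position files later in this document.

```python
import math


def speaker_positions(radius_m=1.0, interval_deg=30):
    """Return (azimuth_deg, x, y) for each loudspeaker position on the circle."""
    if 360 % interval_deg != 0:
        raise ValueError("interval_deg must divide 360 evenly")
    positions = []
    for azimuth in range(0, 360, interval_deg):
        rad = math.radians(azimuth)
        positions.append((azimuth, radius_m * math.cos(rad), radius_m * math.sin(rad)))
    return positions


coarse = speaker_positions(interval_deg=30)   # set-up shown in Figure 9
fine = speaker_positions(interval_deg=5)      # recommended interval

print(len(coarse), len(fine))
for azimuth, x, y in coarse[:3]:
    print(f"{azimuth:3d} deg  x={x:.4f} m  y={y:.4f} m")
```

Moving from 30° to 5° intervals multiplies the number of loudspeaker positions, and therefore the recording effort, by six.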
Please note that the directions from which the TSP signal is recorded determine the directions that the transfer functions can resolve. For example, a sound source at 130° cannot be localized precisely when the transfer functions were generated from TSP responses recorded at 30° intervals in a circular set-up. In that case, HARK will output 120° as the direction (Figure 10).
Figure 10. Sound source localization for a set-up with a sound source at 130° but using a transfer function generated using impulse response with 30° interval [Top View]
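The effect of the measurement interval on the reported direction can be illustrated with a small helper that snaps a true azimuth to the nearest measured direction. This is only a sketch of the quantization effect: it assumes that the best-matching transfer function is the one closest in angle, and the function name is hypothetical.

```python
def nearest_measured_direction(azimuth_deg, interval_deg=30):
    """Return the measured direction (in degrees) closest to azimuth_deg."""
    measured = range(0, 360, interval_deg)

    def angular_distance(candidate):
        # Wrap the difference into [-180, 180) so that 350 and 0 are 10 degrees apart.
        return abs((candidate - azimuth_deg + 180) % 360 - 180)

    return min(measured, key=angular_distance)


print(nearest_measured_direction(130, interval_deg=30))
print(nearest_measured_direction(130, interval_deg=5))
```

With 30° intervals a source at 130° maps to the 120° measurement, while with 5° intervals it maps to 130° itself.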
When the recording can be done in the same set-up, please refer to the example set-up in Figure 11. Figure 11 shows a bar in which a bartender stands behind a bar counter and customers give orders to the bartender across the counter. In this example, the bartender has been replaced with a robot equipped with HARK. To get the best performance, it is best to record the TSP responses in this same environment. The sample set-up illustrates how to record TSP responses so that the resulting transfer functions can be used for both sound source localization and sound source separation of simultaneous orders in a bar.
Suppose there are up to 5 customers who wish to place an order, as shown in Figure 11 (a). Assume that the customers are positioned around the bartender at equal intervals of, say, 30°, as shown in Figure 11 (b). Place a microphone array in the bartender's position and a loudspeaker in the 1st customer's position, as shown in Figure 11 (b). The height of the microphone array and the speaker must be the same, although in real situations the height may vary. Record the TSP responses and then move the speaker to the next customer position. Repeat this for all 5 positions. The transfer functions generated using TSP responses recorded in the Figure 11 (b) setting will perform best at localizing and separating sound sources in the Figure 11 (a) setting.
Figure 11. Bar set-up [Top View]: (a) an order taker and the customers, facing each other across the bar counter; (b) a microphone array and a speaker placed in the positions of the order taker and a customer, respectively.
Recording equipment (microphone array, recording tool, etc.) and the settings (source position, etc.) should be carefully chosen depending on the purpose. Three different recording procedures are provided. The sample conditions of recording set-ups used in these recording procedures are described in Table 1.
Table 1. TSP Recording set-up condition example
Microphone array | Recording Tool | Synchronized/Unsynchronized | SRC Position Type | SRC Position Radius | SRC Position Interval |
TAMAGO-02 | HARK-Designer | Unsynchronized | Circular: 0°-360° | 1 m | 30° |
RASP-ZX + Mic | HARK-Designer | Synchronized | Circular: 0°-360° | 1 m | 30° |
RASP-24 + Mic | Wios | Synchronized | Circular: 0°-180° | 1.2 m | 30° |
Any recording tool can be used.
We recommend using one of the following tools which has a function of multi-channel recording:
The procedure for recording TSP responses differs slightly depending on the equipment. The detailed procedures for the following HARK-supported hardware are provided as separate manuals.
After recording the TSP responses, the transfer functions can be generated from the TSP response wav files with HARKTOOL5. In addition, the Microphone Array Position XML and the Sound Source Position XML are necessary, as shown in Figure 13.
Figure 13. Transfer function generation by Measurement-based method
The template files for both microphone array positions and sound source positions are generated by HARKTOOL5. For a spherical or cylindrical layout, HARKTOOL5 auto-generates template files with the positions of the microphones and sound sources. To specify another layout, the XML file needs to be edited manually. Table 2 summarizes the requirements for the contents of the setting files in the measurement-based method.
Table 2. HARKTOOL list files requirements for Measurement-based method
Method | Microphone Array Positions XML | Sound Source Positions XML |
Measurement-based Method | The number of microphones should be correctly specified. The attributes for the x, y and z positions need to be filled with any non-null number string. | The positions where the loudspeaker played the TSP signal should be specified as accurately as possible. The path of the wav file for each recorded TSP response should be correctly specified. |
<hark_xml version="1.3">
  <positions type="microphone" frame="0" coordinate="cartesian">
    <position x="0.0365" y="0.0000" z="0.0000" id="0" path=""/>
    <position x="0.0258" y="0.0258" z="0.0000" id="1" path=""/>
    <position x="0.0000" y="0.0365" z="0.0000" id="2" path=""/>
    <position x="-0.0258" y="0.0258" z="0.0000" id="3" path=""/>
    <position x="-0.0365" y="0.0000" z="0.0000" id="4" path=""/>
    <position x="-0.0258" y="-0.0258" z="0.0000" id="5" path=""/>
    <position x="0.0000" y="-0.0365" z="0.0000" id="6" path=""/>
    <position x="0.0258" y="-0.0258" z="0.0000" id="7" path=""/>
  </positions>
</hark_xml>
<hark_xml version="1.3">
  <positions type="TSP" coordinate="cartesian">
    <position x="1.0000" y="0.0000" z="0.0000" id="0" path="/home/user/tamago_rec/sep_0_0.wav"/>
    <position x="0.8660" y="0.5000" z="0.0000" id="1" path="/home/user/tamago_rec/sep_0_30.wav"/>
    <position x="0.5000" y="0.8660" z="0.0000" id="2" path="/home/user/tamago_rec/sep_0_60.wav"/>
    <position x="0.0000" y="1.0000" z="0.0000" id="3" path="/home/user/tamago_rec/sep_0_90.wav"/>
    <position x="-0.5000" y="0.8660" z="0.0000" id="4" path="/home/user/tamago_rec/sep_0_120.wav"/>
    <position x="-0.8660" y="0.5000" z="0.0000" id="5" path="/home/user/tamago_rec/sep_0_150.wav"/>
    <position x="-1.0000" y="0.0000" z="0.0000" id="6" path="/home/user/tamago_rec/sep_0_180.wav"/>
    <position x="-0.8660" y="-0.5000" z="0.0000" id="7" path="/home/user/tamago_rec/sep_0_210.wav"/>
    <position x="-0.5000" y="-0.8660" z="0.0000" id="8" path="/home/user/tamago_rec/sep_0_240.wav"/>
    <position x="0.0000" y="-1.0000" z="0.0000" id="9" path="/home/user/tamago_rec/sep_0_270.wav"/>
    <position x="0.5000" y="-0.8660" z="0.0000" id="10" path="/home/user/tamago_rec/sep_0_300.wav"/>
    <position x="0.8660" y="-0.5000" z="0.0000" id="11" path="/home/user/tamago_rec/sep_0_330.wav"/>
  </positions>
  <neighbors algorithm="NearestNeighbor">
    <neighbor id="0" ids="0;"/>
    <neighbor id="1" ids="1;"/>
    <neighbor id="2" ids="2;"/>
    <neighbor id="3" ids="3;"/>
    <neighbor id="4" ids="4;"/>
    <neighbor id="5" ids="5;"/>
    <neighbor id="6" ids="6;"/>
    <neighbor id="7" ids="7;"/>
    <neighbor id="8" ids="8;"/>
    <neighbor id="9" ids="9;"/>
    <neighbor id="10" ids="10;"/>
    <neighbor id="11" ids="11;"/>
  </neighbors>
</hark_xml>
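Before passing a manually edited Sound Source Positions XML to HARKTOOL5, it can be useful to check it against the requirements in Table 2. The following sketch does this with Python's standard XML parser. It is not part of HARK: the function name and the wording of the messages are hypothetical, and the checks are limited to numeric coordinates and non-empty wav paths. It does not verify that the wav files exist.

```python
import xml.etree.ElementTree as ET


def check_source_positions(xml_text):
    """Return a list of problems found in a Sound Source Positions XML string."""
    problems = []
    root = ET.fromstring(xml_text)  # raises ParseError if the XML is malformed
    positions = root.findall("./positions/position")
    if not positions:
        problems.append("no <position> entries found")
    for pos in positions:
        pid = pos.get("id", "?")
        for axis in ("x", "y", "z"):
            try:
                float(pos.get(axis, ""))
            except ValueError:
                problems.append(f"position {pid}: {axis} is not a number")
        if not pos.get("path"):
            problems.append(f"position {pid}: path is empty")
    return problems


# Two entries for illustration: the first is valid, the second has two mistakes.
SAMPLE = """<hark_xml version="1.3">
  <positions type="TSP" coordinate="cartesian">
    <position x="1.0000" y="0.0000" z="0.0000" id="0" path="/home/user/tamago_rec/sep_0_0.wav"/>
    <position x="0.8660" y="" z="0.0000" id="1" path=""/>
  </positions>
</hark_xml>"""

for problem in check_source_positions(SAMPLE):
    print(problem)
```

A malformed file, for example one with a mistyped closing tag, is rejected by the parser before any of the individual checks run.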
HARKTOOL5 is used to generate the transfer functions based on the Microphone Array Positions XML, Sound Source Positions XML, and the TSP Response recording wav files. Transfer function generation procedures by Measurement-based method using the following HARK supported hardware are provided as separate manuals.
This section discusses how to generate transfer functions using the Geometric-calculation-based method. In this method, impulse responses are obtained by numerical calculation from the geometric relationship between the microphones and the sound sources, rather than by the actual measurement of responses required in the Measurement-based method (see Section 2.1 Recording TSP response).
The Geometric-calculation-based method assumes that both the sound source and the microphones are placed in a free space. "Free space" means that the space contains no obstructing objects, walls, or floor that would reflect sound, so the sound travels from the source without being altered by the environment. Figure 14 illustrates the difference between how sound travels from a speaker to a microphone array in a limited space and in a free space. Figure 14 (a) shows both "Direct sound", which is not affected by the environment, and "Reflected sound", which reflects off the environment, as normally happens in the real world. Figure 14 (b) shows only "Direct sound" traveling in an infinitely large space without walls or objects.
Figure 14. (a) Sound traveling from a speaker to a microphone array placed in a room with four walls and an object. (b) Sound traveling from a speaker to a microphone array in free space. [Top View]
Because the Geometric-calculation-based method excludes reflected sound when generating transfer functions, it does not take the effects of the environment into account. The performance of transfer functions generated using this method will therefore generally be lower than that of transfer functions generated using the Measurement-based method. The user will have to evaluate whether the transfer functions generated by this method provide the desired level of performance.
3.1. Transfer Function Generation
In this method, transfer functions can be generated without any of the pre-recorded TSP response data required for the Measurement-based method. Instead, impulse responses are simulated from the microphone position information and the sound source position information (Figure 15).
Figure 15. Geometric-calculation-based Transfer Function Generation
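The free-space assumption can be written down compactly: each microphone receives the direct sound only, delayed by the travel time and attenuated with distance. The sketch below computes such transfer functions on a grid of FFT frequencies. It illustrates the principle only and is not the calculation performed by HARKTOOL5. The sampling rate (16000 Hz), the FFT length (512), the speed of sound (343 m/s), the 1/distance attenuation, and the function name are all assumptions for this example.

```python
import numpy as np


def free_field_transfer_functions(mic_xyz, src_xyz, sample_rate=16000, n_fft=512, c=343.0):
    """Direct-sound-only transfer functions, shape (n_frequencies, n_mics)."""
    mic_xyz = np.asarray(mic_xyz, dtype=float)
    src_xyz = np.asarray(src_xyz, dtype=float)
    dist = np.linalg.norm(mic_xyz - src_xyz, axis=1)        # source-to-microphone distance (m)
    delay = dist / c                                        # travel time (s)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)     # frequency of each FFT bin (Hz)
    # Pure delay (phase rotation) plus 1/distance attenuation; no reflections.
    tf = np.exp(-2j * np.pi * np.outer(freqs, delay)) / dist
    return freqs, tf, dist


# Three microphones of an 8-channel circular array and a source 1 m away at 0 degrees.
mics = [(0.0365, 0.0, 0.0), (0.0, 0.0365, 0.0), (-0.0365, 0.0, 0.0)]
source = (1.0, 0.0, 0.0)

freqs, tf, dist = free_field_transfer_functions(mics, source)
print(tf.shape)
print(np.round(dist, 4))
```

The microphone nearest to the source has the shortest delay and the largest magnitude. The differences in phase and magnitude between microphones are what the localization and separation algorithms rely on.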
The template files for both microphone array positions and sound source positions are generated by HARKTOOL5. For a spherical or cylindrical layout, HARKTOOL5 auto-generates template files with the positions of the microphones and sound sources. To specify another layout, the XML file needs to be edited manually. Table 5 summarizes the requirements for the contents of the setting files in the geometric-calculation-based method.
Table 5. HARKTOOL list files requirements for Geometric-calculation-based method
Method | Microphone Array Positions XML | Sound Source Positions XML |
Geometric-calculation-based Method | Both the number of microphones and the position of each microphone should be correctly specified. | The sound source positions used to simulate the impulse responses should be specified as accurately as possible. The attribute for the file path needs to be filled with any non-null string. |
<hark_xml version="1.3">
  <positions type="microphone" frame="0" coordinate="cartesian">
    <position x="0.0365" y="0.0000" z="0.0000" id="0" path=""/>
    <position x="0.0258" y="0.0258" z="0.0000" id="1" path=""/>
    <position x="0.0000" y="0.0365" z="0.0000" id="2" path=""/>
    <position x="-0.0258" y="0.0258" z="0.0000" id="3" path=""/>
    <position x="-0.0365" y="0.0000" z="0.0000" id="4" path=""/>
    <position x="-0.0258" y="-0.0258" z="0.0000" id="5" path=""/>
    <position x="0.0000" y="-0.0365" z="0.0000" id="6" path=""/>
    <position x="0.0258" y="-0.0258" z="0.0000" id="7" path=""/>
  </positions>
</hark_xml>
<hark_xml version="1.3">
  <positions type="TSP" coordinate="cartesian">
    <position x="1.0000" y="0.0000" z="0.0000" id="0" path="/dummy"/>
    <position x="0.8660" y="0.5000" z="0.0000" id="1" path="/dummy"/>
    <position x="0.5000" y="0.8660" z="0.0000" id="2" path="/dummy"/>
    <position x="0.0000" y="1.0000" z="0.0000" id="3" path="/dummy"/>
    <position x="-0.5000" y="0.8660" z="0.0000" id="4" path="/dummy"/>
    <position x="-0.8660" y="0.5000" z="0.0000" id="5" path="/dummy"/>
    <position x="-1.0000" y="0.0000" z="0.0000" id="6" path="/dummy"/>
    <position x="-0.8660" y="-0.5000" z="0.0000" id="7" path="/dummy"/>
    <position x="-0.5000" y="-0.8660" z="0.0000" id="8" path="/dummy"/>
    <position x="0.0000" y="-1.0000" z="0.0000" id="9" path="/dummy"/>
    <position x="0.5000" y="-0.8660" z="0.0000" id="10" path="/dummy"/>
    <position x="0.8660" y="-0.5000" z="0.0000" id="11" path="/dummy"/>
  </positions>
  <neighbors algorithm="NearestNeighbor">
    <neighbor id="0" ids="0;"/>
    <neighbor id="1" ids="1;"/>
    <neighbor id="2" ids="2;"/>
    <neighbor id="3" ids="3;"/>
    <neighbor id="4" ids="4;"/>
    <neighbor id="5" ids="5;"/>
    <neighbor id="6" ids="6;"/>
    <neighbor id="7" ids="7;"/>
    <neighbor id="8" ids="8;"/>
    <neighbor id="9" ids="9;"/>
    <neighbor id="10" ids="10;"/>
    <neighbor id="11" ids="11;"/>
  </neighbors>
</hark_xml>
HARKTOOL5 is used to generate the transfer functions based on the Microphone Array Positions XML and Sound Source Positions XML. Transfer function generation procedures by Geometric-calculation-based method using the following HARK supported hardware are provided as separate manuals.
This section discusses how to evaluate both localization transfer functions and separation transfer functions.
4.1. Evaluating Localization Transfer Functions
To evaluate localization transfer functions, a network that localizes sound sources and displays their locations needs to be created. The following subsections describe how to create this network, how to execute it, and how to confirm the result.
The following nodes are typically used in a network when sound source localization is needed:
The procedures to create a network to evaluate localization transfer functions are as follows:
Sheet LOOP0 created in Step 3 is now ready to be used.
To execute the network file created in the prior section, follow the steps below.
To see if we have successfully conducted Sound Source Localization using the transfer function we have created, follow the steps below.
This window will display any sound sources located using the transfer function. Every time a new source is located, the source will be represented by a different color.
If not every expected sound source was located, please decrease the value of SourceTracker's THRESH parameter and run the network file once more. If more sound sources were located than expected, increase this value.
4.2. Evaluating Separation Transfer Functions
To evaluate the separation transfer functions, a network that separates a mixture of multiple sound sources needs to be created. This network should contain GHDSS, which requires a file containing separation transfer functions. The following subsections describe how to create this network, how to execute it, and how to confirm the result.
The following nodes are typically used in a network when sound source separation is needed:
The procedures to create a network to evaluate separation transfer functions are as follows:
MAIN_LOOP sheet is now set and ready to be used.
To execute the network file created in Section 4.2.1, follow the steps below.
The execution should generate 3 files, because location information for 3 sound sources was specified in ConstantLocalization in Section 4.2.1, as shown in Figure 76.
Figure 76. ConstantLocalization Parameters in Section 4.2.1
To check the results of the execution, follow the steps below: