Detecting the direction of emergency vehicle sirens with microphones

As drivers we use both our eyes and ears as sensors, whereas current autonomous vehicle sensing and decision making do not rely on sound. Sound is particularly important in cases involving Emergency Vehicle (EV) sirens, horns, shouts, accident noise, a vehicle approaching from around a sharp corner, poor visibility, and other situations where the line of sight is limited or absent. In this work the Direction of Arrival (DoA) of an EV is estimated using microphone arrays. The decision of an Autonomous Vehicle (AV) whether to yield to the EV then depends on the estimated DoA.


INTRODUCTION
As human drivers, we use both our eyes and ears to obtain useful information in a traffic environment [1]. The development of AVs is therefore challenging: the vehicle should perform no worse than a human driver (if not better) and should be able to collect data from the external environment under the same conditions [2]. Rapidly moving objects such as other vehicles or bicycles, slower objects such as pedestrians, and static objects such as parked cars and barriers should all be sensed by the AV and used in algorithms for correct decision making [3]. Several sensors can be used for sensing these objects, e.g., radar [4], cameras [5], and microphones [6]. When an object that emits sound is too far away, or is near but concealed from the car, sound recorded by microphones may be the only reliable source of information. Sound information is important in many cases, including EV sirens, horns, shouts, accident noise, a vehicle approaching from around a sharp corner, and poor visibility.
Recently, Waymo shared a report with the US Department of Transportation in which microphones are used as "supplemental sensors" [7]. Furthermore, Waymo has developed microphones that let its robocars hear sounds twice as far away as previous sensors while also letting them discern where the sound is coming from [8]. Moreover, a video available on the web shows how Waymo is learning to recognize emergency vehicles in Arizona using sound and light [9].
In this work, we focus on the DoA estimation of EVs using microphone arrays. The estimated DoA can be used to decide whether to yield to an approaching EV. In practice, an EV siren is detected prior to the estimation of its DoA; however, detection is a different and easier problem that can be handled using an audio signature, and is therefore not addressed in this work. The DoA is estimated using a Multiple Signal Classification (MUSIC)-based algorithm that includes a temporal smoothing technique to improve the reliability of the estimated DoA values. For DoA estimation using internal microphones we implement a transfer function projection. Here, the DoA can be roughly estimated to determine whether the EV is approaching from behind, in which case the decision of the AV should be to yield to the EV.
Both internal and external microphone array approaches were investigated. The rationale for using an external microphone array is that the results are more reliable and free-field steering vectors can be used; however, the microphones need to be protected from wind. Internal microphones have the advantage of already being available in the car for other applications, e.g., beamforming to enhance Automatic Speech Recognition (ASR) performance. Unfortunately, free-field steering vectors cannot be used with them, and transfer functions were measured with a lower spatial resolution instead. The results showed that, despite the additional cost of mounting it, an external microphone array is recommended, since its estimated DoA values are far more reliable than those achieved using the internal array.

DOA ESTIMATION
In this work a MUSIC-based algorithm was used for DoA estimation. Let $s(t, f)$ be the source signal in the Short Time Fourier Transform (STFT) domain. This signal is received at the $m$'th microphone as

$$x_m(t, f) = a_m(f, \theta_i)\, s(t, f), \qquad (1)$$

where $a_m(\cdot, \theta_i)$ is the transfer function from a source at direction $\theta_i$ to the $m$'th microphone. The signal vector at all $M$ microphones can be represented as

$$\mathbf{x}(t, f) = [x_1(t, f), \ldots, x_M(t, f)]^T = \mathbf{a}(f, \theta_i)\, s(t, f), \qquad (2)$$

where

$$\mathbf{a}(f, \theta_i) = [a_1(f, \theta_i), \ldots, a_M(f, \theta_i)]^T \qquad (3)$$

is referred to as the steering vector from direction $\theta_i$ at frequency $f$. In practice, the signal vector $\mathbf{x}$ is received by the microphones and used to calculate $\hat{\theta}_i$, the estimate of $\theta_i$. The autocorrelation of $\mathbf{x}$ is given by

$$\mathbf{R}_x = E\left[\mathbf{x}\,\mathbf{x}^H\right]. \qquad (4)$$

Assuming $\mathbf{R}_x$ has full rank, it has $M$ eigenvectors. The eigenvector with the largest eigenvalue is associated with the signal space, and all the rest are associated with the noise space. In general, the MUSIC algorithm is designed for any number of sources up to $M - 1$, but in this application only one source was of interest. Hence, if the eigenvectors $\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_M$ are sorted in descending order of their eigenvalues, the noise-space eigenmatrix is defined by

$$\tilde{\mathbf{U}} = [\mathbf{u}_2, \ldots, \mathbf{u}_M]. \qquad (5)$$

The MUSIC spectrum $P$ is then calculated using the noise-space eigenmatrix $\tilde{\mathbf{U}}$ and the steering vector of the hypothetical DoA $\mathbf{a}(f, \theta_h)$ to form

$$P(t, f, \theta_h) = \frac{1}{\mathbf{a}^H(f, \theta_h)\, \tilde{\mathbf{U}}\, \tilde{\mathbf{U}}^H\, \mathbf{a}(f, \theta_h)}. \qquad (6)$$

As a first step, the frequency for which the MUSIC spectrum is calculated is selected as the one with the highest energy in the received signal at the first microphone. That is,

$$f_0(t) = \arg\max_f \left|x_1(t, f)\right|^2, \qquad (7)$$

and then the estimated DoA is given by

$$\hat{\theta}_i(t) = \arg\max_{\theta_h} P(t, f_0(t), \theta_h). \qquad (8)$$

Temporal smoothing is performed to prevent unrealistic estimated DoA values from being considered. If in Eq. (8) the raw estimated DoA value $\hat{\theta}_i(t)$ is given using the plain maximum value of the MUSIC spectrum $P(t, f, \theta_h)$, then the smoothed DoA is [Eq. (9), missing in the source].

Frequency smoothing can be used to select frequencies $f_0$ that are near the previously selected frequency, since the siren signal is essentially an ascending and descending chirp. However, based on some preliminary results, it was decided not to use frequency smoothing.
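The single-source MUSIC procedure described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the synthetic signal, the noise level, the test frequency of 1 kHz, and the microphone coordinates (a 5 × 3 cm rectangle centred on the origin) are all hypothetical, and temporal smoothing is left out because its exact form is not given here.

```python
import numpy as np

def select_frequency_bin(x1_frame):
    """Index of the highest-energy STFT bin at the first microphone."""
    return int(np.argmax(np.abs(x1_frame) ** 2))

def music_spectrum(R_x, candidates):
    """Single-source MUSIC pseudo-spectrum.

    R_x:        (M, M) Hermitian covariance of the microphone signals in one bin.
    candidates: (H, M) steering vectors, one row per hypothesised DoA.
    """
    _, eigvecs = np.linalg.eigh(R_x)      # eigenvalues come back in ascending order
    U_noise = eigvecs[:, :-1]             # drop the dominant (signal) eigenvector
    proj = candidates.conj() @ U_noise    # a^H u_k for every candidate and every k
    return 1.0 / np.maximum(np.sum(np.abs(proj) ** 2, axis=1), 1e-12)

# --- synthetic check; every value below is an illustrative assumption ---
rng = np.random.default_rng(0)
freq_hz, c = 1000.0, 343.0
mics_xy = np.array([[0.025, 0.015], [-0.025, 0.015],
                    [-0.025, -0.015], [0.025, -0.015]])   # metres

def plane_wave_steering(doa_deg):
    """Free-field steering vectors, shape (H, M), for DoAs given in degrees."""
    theta = np.deg2rad(np.atleast_1d(doa_deg))
    unit = np.stack([np.cos(theta), np.sin(theta)], axis=1)   # (H, 2)
    k = 2.0 * np.pi * freq_hz / c
    return np.exp(1j * k * (unit @ mics_xy.T))

true_doa = 120.0
n_snapshots = 200
s = rng.standard_normal(n_snapshots) + 1j * rng.standard_normal(n_snapshots)
noise = 0.05 * (rng.standard_normal((4, n_snapshots))
                + 1j * rng.standard_normal((4, n_snapshots)))
X = np.outer(plane_wave_steering(true_doa)[0], s) + noise
R_x = X @ X.conj().T / n_snapshots        # sample covariance over the snapshots

grid = np.arange(0.0, 360.0, 1.0)         # hypothesised DoAs, 1 degree apart
P = music_spectrum(R_x, plane_wave_steering(grid))
estimated_doa = grid[np.argmax(P)]
```

In a real pipeline `select_frequency_bin` would be applied to each STFT frame of the first microphone, and the covariance would be estimated from the frames of that bin rather than from synthetic snapshots.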

Hardware
The external microphone array consisted of 4 microelectromechanical system (MEMS) microphones selected from 32, as can be seen in Fig. 1, arranged as a rectangle of dimensions 5 × 3 cm². The grid dimensions of the microphone array were taken into consideration when calculating the free-field steering vectors for the DoA estimation algorithm. The external microphone array was placed outside the car and mounted on the roof, as can be seen in Fig. 2.

Steering Vector
The advantage of the external microphone array is that, for the DoA estimation algorithm, the steering vectors can be approximated by those of a free field. Since the EV is far away, the incident wavefront can be considered to be a plane wave, as shown in Fig. 3, where $\theta_i$ is the DoA angle, and $r_m$, $\theta_m$ are the distance and angle of the $m$'th microphone from the origin of the microphone array, respectively.
The free-field steering vector can therefore be calculated in an x-y plane. Let $f$ be the frequency of the sound that is generated by the EV. At this frequency the wave number is $k = 2\pi f / c$, where $c = 343$ m/s is the speed of sound. The frequency response from the source at $\theta_i$ to the $m$'th microphone with regard to the origin is given by

$$a_m(f, \theta_i) = e^{\,j k r_m \cos(\theta_i - \theta_m)}, \qquad (10)$$

neglecting differences in amplitude attenuation from the source to the origin and to the microphones. The steering vector of the array that contains $M$ microphones is given by

$$\mathbf{a}(f, \theta_i) = [a_1(f, \theta_i), \ldots, a_M(f, \theta_i)]^T. \qquad (11)$$
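A plane-wave steering vector of this kind can be computed in a few lines. The sketch below assumes the phase convention $e^{jkr_m\cos(\theta_i-\theta_m)}$ (the opposite sign is equally common) and assumes that the four microphones sit at the corners of a 5 × 3 cm rectangle centred on the array origin; the function name and the 1 kHz, 30° example are hypothetical.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, as stated in the text

def free_field_steering(freq_hz, doa_deg, mic_r, mic_theta_deg):
    """Plane-wave steering vector for microphones given in polar coordinates.

    mic_r:         distances r_m of the microphones from the array origin (m).
    mic_theta_deg: angles theta_m of the microphones (degrees).
    Amplitude differences between microphones are neglected.
    """
    k = 2.0 * np.pi * freq_hz / SPEED_OF_SOUND          # wave number
    theta_i = np.deg2rad(doa_deg)
    theta_m = np.deg2rad(np.asarray(mic_theta_deg, dtype=float))
    r_m = np.asarray(mic_r, dtype=float)
    return np.exp(1j * k * r_m * np.cos(theta_i - theta_m))

# Assumed layout: corners of a 5 x 3 cm rectangle centred on the origin.
corners = np.array([[0.025, 0.015], [-0.025, 0.015],
                    [-0.025, -0.015], [0.025, -0.015]])
mic_r = np.hypot(corners[:, 0], corners[:, 1])
mic_theta_deg = np.rad2deg(np.arctan2(corners[:, 1], corners[:, 0]))

a = free_field_steering(1000.0, 30.0, mic_r, mic_theta_deg)
```

Each entry has unit magnitude, and diagonally opposite microphones carry conjugate phases because their angles differ by 180°.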

EV Experimental Results
The external microphone array was mounted on the roof of the XTS car. The car was parked near a hospital. The EVs were ambulances recorded arriving at or departing from the hospital. The parked car and the EV station can be seen in Fig. 4. As an example, the case where an EV approached from the opposite lane and made its way to the hospital is shown. In this case the sound first arrives from the frontal direction and then switches to behind the car. Figure 5 shows the MUSIC spectrum, the estimated DoA, and the selected frequency for this case. At t = 25 s the peak of the MUSIC spectrum shifted from values near 360° to values near 180°. It was found that the porches of the microphone array board reflect the sound, and therefore even though the EV was on the left side, the peak values appeared at angles that corresponded to the right side. Nevertheless, it was easy to determine when the EV was in front of or behind the car. In this case, the decision of an AV should be to continue normal driving and not to yield to the EV.

Hardware
The internal microphone array consisted of different microphones but used the same dedicated sound acquisition hardware as the external microphone array. The array contained two sub-arrays with 3 MEMS microphones each, together forming an array of 6 microphones, from which a subset of microphones could be selected to form a different microphone array configuration.
The internal microphone array was placed inside the car and mounted either above the rear-right or the front-left passenger, corresponding to the placement of arrays for speech recognition or hands-free calls. A subset of 4 microphones can be selected to form an end-fire configuration above the rear-right passenger, as presented in Fig. 6a, or a broad-side configuration above the front-left passenger, as can be seen in Fig. 6b. For the rear-right array, the distance between any pair of microphones on each sub-array was 2 cm, and the minimum distance between microphones from different sub-arrays was 2.8 cm, as shown in Fig. 6a. For the front-left array, the distance between any pair of microphones on each sub-array was 2 cm, and the minimum distance between microphones from different sub-arrays was 3 cm, as shown in Fig. 6b.

Steering Vector
In the case of the internal microphone array, the steering vector cannot be calculated using a free-field representation; instead, the frequency response of the car from each DoA needs to be considered. Therefore, rather than an analytic calculation of the steering vector with high spatial resolution, as used for the external array, the transfer function of the internal array needs to be measured in a quiet area with much lower spatial resolution. Since it is very difficult to measure the Acoustic Transfer Function (ATF) from the source to each microphone, the Relative Transfer Function (RTF) was used instead, in such a way that at each microphone the frequency response was calculated relative to the ATF at the 1st microphone.
Let $h_m(f)$ be the ATF from the source to the $m$'th microphone. The RTF is given by

$$\mathrm{RTF}_m(f) = \frac{h_m(f)}{h_1(f)}. \qquad (12)$$

If an acoustic source emits a signal $x(f)$, and assuming a noise signal $n_m(f)$ at the $m$'th microphone, the recorded signal at the $m$'th microphone is

$$y_m(f) = h_m(f)\, x(f) + n_m(f). \qquad (13)$$

Wiener Filter
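The RTF estimate based on the Wiener filter solution, which minimizes the variance of the error, is obtained from

$$\widehat{\mathrm{RTF}}_m(f) = \arg\min_{g}\; E\left[\left|y_m(f) - g\, y_1(f)\right|^2\right], \qquad (14)$$

which leads to

$$\widehat{\mathrm{RTF}}_m(f) = \frac{E\left[y_m(f)\, y_1^*(f)\right]}{E\left[\left|y_1(f)\right|^2\right]}.$$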
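A per-frequency Wiener (least-squares) RTF estimate can be sketched as below. It assumes the closed form "cross-spectrum with the first microphone divided by the auto-spectrum of the first microphone", with expectations replaced by averages over STFT frames; the function name, array shapes, and synthetic data are hypothetical.

```python
import numpy as np

def estimate_rtf_wiener(Y_ref, Y_m):
    """Least-squares RTF of microphone m relative to the reference microphone.

    Y_ref, Y_m: (T, F) complex STFT frames (T frames, F frequency bins).
    Returns an (F,) complex array, one RTF value per frequency bin.
    """
    cross = np.mean(Y_m * Y_ref.conj(), axis=0)    # estimate of E[y_m y_1^*]
    auto = np.mean(np.abs(Y_ref) ** 2, axis=0)     # estimate of E[|y_1|^2]
    return cross / np.maximum(auto, 1e-12)

# Synthetic check with a known RTF (all values are illustrative).
rng = np.random.default_rng(1)
T, F = 500, 8
true_rtf = rng.standard_normal(F) + 1j * rng.standard_normal(F)
Y_ref = rng.standard_normal((T, F)) + 1j * rng.standard_normal((T, F))
sensor_noise = 0.01 * (rng.standard_normal((T, F))
                       + 1j * rng.standard_normal((T, F)))
Y_m = true_rtf * Y_ref + sensor_noise
rtf_hat = estimate_rtf_wiener(Y_ref, Y_m)
```

Noise on the reference microphone biases this estimator, which is why measurements of this kind are normally taken in a quiet area.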

Generalized Eigenvalue Decomposition (GEVD)
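Defining vectors with microphone indices rather than frequencies as coordinates yields the following vector form of Eq. (13):

$$\mathbf{y}(f) = \mathbf{h}(f)\, x(f) + \mathbf{n}(f). \qquad (19)$$

Applying the autocorrelation operator to Eq. (19) yields

$$\mathbf{R}_y(f) = \sigma_x^2(f)\, \mathbf{h}(f)\, \mathbf{h}^H(f) + \mathbf{R}_n(f),$$

where $\sigma_x^2(f)$ is the power of the source signal. The GEVD of $\mathbf{R}_y(f)$ with respect to $\mathbf{R}_n(f)$ relates the generalized eigenvalues $\lambda_m(f)$ to the corresponding generalized eigenvectors $\mathbf{v}_m(f)$ by solving

$$\mathbf{R}_y(f)\, \mathbf{v}_m(f) = \lambda_m(f)\, \mathbf{R}_n(f)\, \mathbf{v}_m(f),$$

assuming that the rank and the number of microphones are identical and equal to $M$.
Assuming that the eigenvectors are sorted in descending order of their eigenvalues,

$$\lambda_1(f) \geq \lambda_2(f) \geq \cdots \geq \lambda_M(f),$$

the generalized eigenvector that corresponds to the largest generalized eigenvalue is a rotated and scaled form of the ATF [10]. The RTF can be calculated using

$$\mathbf{RTF}(f) = \frac{\mathbf{R}_n(f)\, \mathbf{v}_1(f)}{\left(\mathbf{R}_n(f)\, \mathbf{v}_1(f)\right)_{(1)}},$$

where the subscript $(\cdot)_{(1)}$ indicates the first coordinate of a vector.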
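The GEVD-based estimate can be illustrated with NumPy alone by whitening with a Cholesky factor of the noise covariance. The sketch assumes a rank-one signal model, in which the covariance of the recordings is the noise covariance plus a scaled outer product of the ATF with itself, and it recovers the ATF direction by multiplying the principal generalized eigenvector by the noise covariance. The function name and the test matrices are hypothetical.

```python
import numpy as np

def estimate_rtf_gevd(R_y, R_n):
    """RTF vector from the principal generalized eigenvector of (R_y, R_n).

    R_y: (M, M) Hermitian covariance of the recorded signals at one frequency.
    R_n: (M, M) Hermitian positive-definite noise covariance at that frequency.
    """
    L = np.linalg.cholesky(R_n)                  # R_n = L L^H
    L_inv = np.linalg.inv(L)
    whitened = L_inv @ R_y @ L_inv.conj().T      # ordinary Hermitian eigenproblem
    whitened = 0.5 * (whitened + whitened.conj().T)   # remove round-off asymmetry
    _, eigvecs = np.linalg.eigh(whitened)        # ascending eigenvalues
    w1 = eigvecs[:, -1]                          # principal eigenvector
    # The generalized eigenvector is v1 = L^{-H} w1, hence R_n v1 = L w1.
    atf_direction = L @ w1
    return atf_direction / atf_direction[0]      # normalise by the first coordinate

# Synthetic check with a known ATF (all values are illustrative).
rng = np.random.default_rng(2)
h = np.array([1.0 + 0.5j, 0.8 - 0.2j, -0.3 + 0.9j, 0.4 + 0.1j])
A = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
R_n = A @ A.conj().T / 4.0 + 0.1 * np.eye(4)     # coloured noise covariance
R_y = 2.0 * np.outer(h, h.conj()) + R_n
rtf_gevd = estimate_rtf_gevd(R_y, R_n)
```

When SciPy is available, `scipy.linalg.eigh(R_y, R_n)` solves the same generalized problem directly; the whitening route is used here only to keep the example dependent on NumPy alone.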

RTF Estimation Performance
The estimation of the RTF was evaluated using the Signal to Distortion Ratio (SDR). The SDR was used to quantify the distortion between the signal recorded by a microphone in the array, $y_m$, and the signal generated by filtering the signal recorded at the first microphone, $y_1$, with $\mathrm{RTF}_m$:

$$\mathrm{SDR}_m = 10 \log_{10} \frac{\sum \left|y_m\right|^2}{\sum \left|y_m - \mathrm{RTF}_m\, y_1\right|^2}.$$

The SDR values are displayed in Fig. 7 and Fig. 8 for the performance evaluation of the RTF estimation process using the internal microphone array in the broad-side and end-fire configurations, respectively. The RTF was evaluated using a controlled measurement in which the recording car was placed in an isolated parking spot, and another car played a sweep signal through a speaker mounted on its roof from different directions with a resolution of 45°. The angle of direction is displayed on the horizontal axes, and the microphone index $m$ is displayed on the vertical axes. The corresponding SDR value is expressed in dB using gray levels.
For the case examined in this work, the most interesting directions are 0° and 180°, which correspond to the frontal and back directions, respectively. For these directions, the RTF was estimated better for the broad-side configuration than for the end-fire configuration. Comparing Fig. 7a to Fig. 7b, and also comparing Fig. 8a to Fig. 8b, shows that the RTFs were estimated better using the LS method than when using the GEVD method for all directions and all microphones.
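An SDR of this kind can be computed as sketched below. The example assumes that the RTF is applied as a per-bin multiplication in the STFT domain and that the energies are summed over all frames and bins; the function name and the toy numbers are hypothetical.

```python
import numpy as np

def sdr_db(Y_m, Y_ref, rtf_m):
    """Signal-to-distortion ratio (dB) of microphone m predicted from the reference.

    Y_m, Y_ref: (T, F) complex STFT frames of microphone m and the first microphone.
    rtf_m:      (F,) complex RTF estimate of microphone m.
    """
    predicted = rtf_m * Y_ref                          # filter y_1 with RTF_m
    signal_energy = np.sum(np.abs(Y_m) ** 2)
    distortion_energy = np.sum(np.abs(Y_m - predicted) ** 2)
    return 10.0 * np.log10(signal_energy / max(distortion_energy, 1e-20))

# Toy example: the prediction is off by a constant 0.2 in every bin.
Y_ref = np.ones((10, 4), dtype=complex)
rtf_good = np.full(4, 2.0 + 0.0j)
Y_m = rtf_good * Y_ref + 0.2
sdr_good = sdr_db(Y_m, Y_ref, rtf_good)               # equals 10*log10(121)
sdr_bad = sdr_db(Y_m, Y_ref, np.full(4, 1.0 + 0.0j))  # poorer RTF, lower SDR
```

A higher value means that the RTF-filtered reference signal reproduces the recording at microphone m more faithfully.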

EV Experimental Results
Only the front and back RTFs were used as steering vectors. Figure 9 shows the MUSIC spectrum and DoA estimation results for the case where an EV is approaching the car from the opposite lane. The results in the figures show that with the end-fire array it is impossible to determine whether the EV was behind or in front of the car. The DoA was estimated better using the broad-side array. This result may appear surprising, since one would expect the symmetry of the broad-side array around the driving direction to cause an ambiguity between waves arriving from the front and from the back. However, as explained in the previous section, the RTFs at each microphone are estimated with less distortion using the broad-side array than using the end-fire array. Regardless, DoA estimation was more difficult and had lower angular resolution with the internal microphone array than with the external one.

CONCLUSION
The feasibility of detecting the direction of an approaching EV was validated using an external microphone array equipped with 4 microphones. An algorithm for using internal microphones was also developed but was found to be inferior to the external array.

Figure 2: External microphone array on the roof of the car.

Figure 3: Plane wave propagating to the external microphone array.

Figure 4: XTS car parked near an EV station.

Figure 5: (a) An EV is approaching from the frontal direction. (b) MUSIC spectrum and estimated DoA show switching from the frontal (∼360°) to the back (∼180°) direction. The selected frequency matches the siren.

Figure 6: Distances between microphones in the internal array. A subset of 4 microphones forms (a) an end-fire array above the rear-right passenger and (b) a broad-side array above the front-left passenger.

Figure 7: Evaluation of the RTF estimation using SDR for the internal array in the broad-side configuration using the (a) LS and (b) GEVD estimation methods.