Abstract : Convolutional Neural Networks (CNNs) have been used in Automatic Speech Recognition (ASR) to learn representations directly from the raw signal instead of hand-crafted acoustic features, providing a richer and lossless input signal. Recent research proposes to inject prior acoustic knowledge into the first convolutional layer by integrating the shape of the impulse responses, in order to increase both the interpretability of the learnt acoustic model and its performance. We propose to combine the complex Gabor filter with complex-valued deep neural networks to replace the usual CNN weight kernels, to take full advantage of its optimal time-frequency resolution and of the complex domain. The experiments conducted on the TIMIT phoneme recognition task show that the proposed approach reaches top-of-the-line performance while remaining interpretable.
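To make the idea of replacing free CNN weight kernels with complex Gabor filters more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names, the parameterization by a center frequency and a Gaussian width, the kernel size, and the 16 kHz sample rate are all hypothetical choices made for the example. In the approach described above these filter parameters would be learned and the output fed to a complex-valued network, neither of which is shown here.

```python
import numpy as np

def complex_gabor_kernel(center_hz, sigma_s, kernel_size, sample_rate):
    """Complex Gabor impulse response: a Gaussian envelope times a complex sinusoid.

    All parameter names are hypothetical and chosen for this sketch only.
    """
    # Time axis in seconds, centred on zero.
    t = (np.arange(kernel_size) - (kernel_size - 1) / 2) / sample_rate
    envelope = np.exp(-0.5 * (t / sigma_s) ** 2)   # Gaussian window of width sigma_s
    carrier = np.exp(2j * np.pi * center_hz * t)   # complex exponential at center_hz
    return envelope * carrier

def gabor_layer(signal, center_hz, sigma_s, kernel_size=401, sample_rate=16000):
    """Filter a raw 1-D waveform with a bank of complex Gabor kernels.

    Each kernel is described by two numbers (center frequency, width)
    instead of kernel_size free weights. Returns a complex array of
    shape (n_filters, len(signal)).
    """
    kernels = [
        complex_gabor_kernel(f, s, kernel_size, sample_rate)
        for f, s in zip(center_hz, sigma_s)
    ]
    return np.stack([np.convolve(signal, k, mode="same") for k in kernels])

# Usage: a one-second 1 kHz tone passed through three filters.
sample_rate = 16000
time = np.arange(sample_rate) / sample_rate
tone = np.cos(2 * np.pi * 1000.0 * time)
features = gabor_layer(tone, center_hz=[500.0, 1000.0, 2000.0],
                       sigma_s=[0.002, 0.002, 0.002], sample_rate=sample_rate)
energy = np.abs(features).mean(axis=1)  # mean magnitude per filter
```

Because each filter is fully described by its center frequency and width, the first layer can be read directly as a filter bank, which is the kind of interpretability the abstract refers to. The magnitude of the complex output gives a smooth envelope of the signal in each band.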
https://hal.archives-ouvertes.fr/hal-02474746
Contributor : Paul-Gauthier Noé
Submitted on : Tuesday, February 11, 2020 - 3:46:50 PM
Last modification on : Monday, February 8, 2021 - 11:18:02 AM
Long-term archiving on : Tuesday, May 12, 2020 - 3:18:43 PM