TimeScaleNet : a Multiresolution Approach for Raw Audio Recognition using Learnable Biquadratic IIR Filters and Residual Networks of Depthwise-Separable One-Dimensional Atrous Convolutions

Eric Bavu; Aro Ramamonjy; Hadrien Pujol; Alexandre Garcia

doi:10.1109/JSTSP.2019.2908696

Journal article. IEEE Journal of Selected Topics in Signal Processing. Year: 2019

Eric Bavu

Role: Author
PersonId : 173967
IdHAL : eric-bavu
ORCID : 0000-0001-6395-634X
IdRef : 134574214

Laboratoire de Mécanique des Structures et des Systèmes Couplés

Aro Ramamonjy

Role: Author

Laboratoire de Mécanique des Structures et des Systèmes Couplés

Hadrien Pujol

Role: Author
PersonId : 185083
IdHAL : hadrien-pujol

Laboratoire de Mécanique des Structures et des Systèmes Couplés

Alexandre Garcia

Role: Author
PersonId : 170689
IdHAL : alexandre-garcia
ORCID : 0000-0003-3933-8562

Laboratoire de Mécanique des Structures et des Systèmes Couplés

Abstract

In this paper, we show the benefit of a multi-resolution approach for encoding the relevant information contained in unprocessed time-domain acoustic signals. TimeScaleNet aims to learn an efficient representation of a sound by learning time dependencies at both the sample level and the frame level. The proposed approach improves the interpretability of the learning scheme by unifying advanced deep learning and signal processing techniques. In particular, TimeScaleNet's architecture introduces a new form of recurrent neural layer that is directly inspired by digital IIR signal processing. This layer acts as a learnable bandpass biquadratic digital IIR filterbank. The learnable filterbank builds a time-frequency-like feature map that self-adapts to the specific recognition task and dataset, with a large receptive field and very few learnable parameters. The resulting frame-level feature map is then processed by a residual network of depthwise-separable atrous convolutions. This second scale of analysis aims to efficiently encode relationships between time fluctuations at the frame timescale, in different learned pooled frequency bands, in the range [20 ms, 200 ms]. TimeScaleNet is tested on both the Speech Commands Dataset and the ESC-10 Dataset. We report a very high mean accuracy of 94.87 ± 0.24% (macro-averaged F1-score: 94.9 ± 0.24%) for speech recognition, and a rather moderate accuracy of 69.71 ± 1.91% (macro-averaged F1-score: 70.14 ± 1.57%) for the environmental sound classification task.
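To make the idea of a bandpass biquadratic IIR filterbank concrete, here is a minimal sketch in plain Python. It is not the authors' layer: the coefficient formulas are the widely used "Audio EQ Cookbook" bandpass design (unit gain at the center frequency), and the function names, the center frequencies, and the quality factor `q` are illustrative assumptions. In a learnable version, the center frequency and quality factor of each band would presumably be the trainable parameters; here they are fixed.

```python
import math

def bandpass_biquad_coeffs(f0, q, fs):
    """Normalized (b, a) coefficients of a bandpass biquad.

    Assumption: Audio EQ Cookbook design with 0 dB gain at f0,
    not necessarily the parametrization used in the paper.
    """
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b = (alpha / a0, 0.0, -alpha / a0)
    a = (1.0, -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0)
    return b, a

def biquad_filter(x, b, a):
    """Apply the second-order recursion (direct form I) sample by sample."""
    y = []
    x1 = x2 = y1 = y2 = 0.0
    for xn in x:
        yn = b[0] * xn + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        y.append(yn)
    return y

def filterbank(x, centers, q, fs):
    """Filter x with one bandpass biquad per center frequency."""
    return [biquad_filter(x, *bandpass_biquad_coeffs(f0, q, fs)) for f0 in centers]

def rms(sig):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(s * s for s in sig) / len(sig))

# Hypothetical usage: a 1 kHz tone passed through three bands.
fs = 16000
tone = [math.sin(2.0 * math.pi * 1000.0 * n / fs) for n in range(4000)]
bands = filterbank(tone, centers=[250.0, 1000.0, 4000.0], q=4.0, fs=fs)
# Skip the first samples so the filter transient has decayed.
energies = [rms(band[1000:]) for band in bands]
```

The band centered on 1 kHz retains the tone almost unchanged while the other two bands attenuate it strongly, which is the kind of time-frequency-like decomposition the abstract describes. Each band needs only two parameters, yet the recursion gives it an impulse response far longer than a short FIR kernel.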

Keywords

Audio recognition; Machine hearing; Learnable biquadratic filters; Multiresolution; Deep learning; Time domain modelling

Domains

Machine Learning [stat.ML]; Signal and Image Processing [eess.SP]; Acoustics [physics.class-ph]

Main file

bavu_al_jstsp_ml_4audio_review_2_twocols.pdf (830.98 KB)

Origin: Files produced by the author(s)

https://hal.science/hal-02088214

Submitted on: Tuesday, April 2, 2019, 16:29:48

Last modified on: Wednesday, September 28, 2022, 05:51:22

Long-term archiving on: Wednesday, July 3, 2019, 17:01:21

Dates and versions

hal-02088214 , version 1 (02-04-2019)

Identifiers

HAL Id : hal-02088214 , version 1
DOI : 10.1109/JSTSP.2019.2908696

Cite

Eric Bavu, Aro Ramamonjy, Hadrien Pujol, Alexandre Garcia. TimeScaleNet : a Multiresolution Approach for Raw Audio Recognition using Learnable Biquadratic IIR Filters and Residual Networks of Depthwise-Separable One-Dimensional Atrous Convolutions. IEEE Journal of Selected Topics in Signal Processing, 2019, 13 (2), pp.220-235. ⟨10.1109/JSTSP.2019.2908696⟩. ⟨hal-02088214⟩

Collections

CNAM LMSSC-CNAM HESAM
