Optimal spectral transportation with application to music transcription

Abstract : Many spectral unmixing methods rely on the non-negative decomposition of spectral data onto a dictionary of spectral templates. In particular, state-of-the-art music transcription systems decompose the spectrogram of the input signal onto a dictionary of representative note spectra. The typical measures of fit used to quantify the adequacy of the decomposition compare the data and template entries frequency-wise. As such, small displacements of energy from a frequency bin to another as well as variations of timbre can disproportionally harm the fit. We address these issues by means of optimal transportation and propose a new measure of fit that treats the frequency distributions of energy holistically as opposed to frequency-wise. Building on the harmonic nature of sound, the new measure is invariant to shifts of energy to harmonically-related frequencies, as well as to small and local displacements of energy. Equipped with this new measure of fit, the dictionary of note templates can be considerably simplified to a set of Dirac vectors located at the target fundamental frequencies (musical pitch values). This in turns gives ground to a very fast and simple decomposition algorithm that achieves state-of-the-art performance on real musical data. 1 Context Many of nowadays spectral unmixing techniques rely on non-negative matrix decompositions. This concerns for example hyperspectral remote sensing (with applications in Earth observation, astronomy, chemistry, etc.) or audio signal processing. The spectral sample v n (the spectrum of light observed at a given pixel n, or the audio spectrum in a given time frame n) is decomposed onto a dictionary W of elementary spectral templates, characteristic of pure materials or sound objects, such that v n ≈ Wh n. The composition of sample n can be inferred from the non-negative expansion coefficients h n. This paradigm has led to state-of-the-art results for various tasks (recognition, classification, denoising, separation) in the aforementioned areas, and in particular in music transcription, the central application of this paper. In state-of-the-art music transcription systems, the spectrogram V (with columns v n) of a musical signal is decomposed onto a dictionary of pure notes (in so-called multi-pitch estimation) or chords. V typically consists of (power-)magnitude values of a regular short-time Fourier transform (Smaragdis and Brown, 2003). It may also consists of an audio-specific spectral transform such as the Mel-frequency transform, like in (Vincent et al., 2010), or the Q-constant based transform, like in (Oudre et al., 2011). The success of the transcription system depends of course on the adequacy of the time-frequency transform & the dictionary to represent the data V.
Type de document :
Communication dans un congrès
NIPS, Dec 2016, Barcelona, Spain
Liste complète des métadonnées

Contributeur : Nicolas Courty <>
Soumis le : vendredi 7 octobre 2016 - 10:17:31
Dernière modification le : samedi 18 février 2017 - 01:10:24
Document(s) archivé(s) le : vendredi 3 février 2017 - 18:58:08


Fichiers produits par l'(les) auteur(s)


  • HAL Id : hal-01377533, version 1
  • ARXIV : 1609.09799


Rémi Flamary, Cédric Févotte, Nicolas Courty, Valentin Emiya. Optimal spectral transportation with application to music transcription. NIPS, Dec 2016, Barcelona, Spain. <hal-01377533>



Consultations de
la notice


Téléchargements du document