Conference papers

Optimal spectral transportation with application to music transcription

Rémi Flamary 1, 2 Cédric Févotte 3, 2, 4 Nicolas Courty 5 Valentin Emiya 6 
3 IRIT-SC - Signal et Communications
IRIT - Institut de recherche en informatique de Toulouse
5 OBELIX - Environment observation with complex imagery
6 QARMA - éQuipe AppRentissage et MultimediA [Marseille]
LIF - Laboratoire d'informatique Fondamentale de Marseille
Abstract : Many spectral unmixing methods rely on the non-negative decomposition of spectral data onto a dictionary of spectral templates. In particular, state-of-the-art music transcription systems decompose the spectrogram of the input signal onto a dictionary of representative note spectra. The typical measures of fit used to quantify the adequacy of the decomposition compare the data and template entries frequency-wise. As such, small displacements of energy from one frequency bin to another, as well as variations of timbre, can disproportionately harm the fit. We address these issues by means of optimal transportation and propose a new measure of fit that treats the frequency distributions of energy holistically as opposed to frequency-wise. Building on the harmonic nature of sound, the new measure is invariant to shifts of energy to harmonically-related frequencies, as well as to small and local displacements of energy. Equipped with this new measure of fit, the dictionary of note templates can be considerably simplified to a set of Dirac vectors located at the target fundamental frequencies (musical pitch values). In turn, this leads to a very fast and simple decomposition algorithm that achieves state-of-the-art performance on real musical data.

1 Context

Many of today's spectral unmixing techniques rely on non-negative matrix decompositions. This is the case, for example, in hyperspectral remote sensing (with applications in Earth observation, astronomy, chemistry, etc.) and in audio signal processing. The spectral sample v_n (the spectrum of light observed at a given pixel n, or the audio spectrum in a given time frame n) is decomposed onto a dictionary W of elementary spectral templates, characteristic of pure materials or sound objects, such that v_n ≈ W h_n. The composition of sample n can be inferred from the non-negative expansion coefficients h_n.
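The decomposition v_n ≈ W h_n can be illustrated with a minimal sketch. It assumes a toy dictionary W with three made-up templates over six frequency bins, and it uses a plain Euclidean fit solved by non-negative least squares (`scipy.optimize.nnls`). This is a frequency-wise stand-in chosen for simplicity, not the transportation-based measure of fit proposed in the paper; all numeric values are hypothetical.

```python
import numpy as np
from scipy.optimize import nnls

# Hypothetical dictionary W: 3 spectral templates (columns) over 6 frequency bins.
# Each column is normalized to sum to 1. These values are illustrative only.
W = np.array([
    [0.7, 0.0, 0.0],
    [0.3, 0.0, 0.0],
    [0.0, 0.6, 0.0],
    [0.0, 0.4, 0.1],
    [0.0, 0.0, 0.5],
    [0.0, 0.0, 0.4],
])

# Ground-truth non-negative activations, used only to synthesize a sample v_n.
h_true = np.array([2.0, 0.0, 1.0])
v_n = W @ h_true

# Solve min_h ||v_n - W h||_2 subject to h >= 0 (Euclidean fit, an assumption
# made for this sketch; the paper argues for a transportation-based fit instead).
h_n, residual = nnls(W, v_n)

print("estimated activations:", h_n)
print("residual norm:", residual)
```

The non-zero entries of `h_n` indicate which templates are active in the sample, which is how the composition of sample n is read off from the expansion coefficients.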
This paradigm has led to state-of-the-art results for various tasks (recognition, classification, denoising, separation) in the aforementioned areas, and in particular in music transcription, the central application of this paper. In state-of-the-art music transcription systems, the spectrogram V (with columns v_n) of a musical signal is decomposed onto a dictionary of pure notes (in so-called multi-pitch estimation) or chords. V typically consists of (power-)magnitude values of a regular short-time Fourier transform (Smaragdis and Brown, 2003). It may also consist of an audio-specific spectral transform such as the Mel-frequency transform, as in (Vincent et al., 2010), or the constant-Q transform, as in (Oudre et al., 2011). The success of the transcription system depends, of course, on the adequacy of the time-frequency transform and of the dictionary to represent the data V.
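As a rough illustration of how a spectrogram V with columns v_n might be obtained, the sketch below computes power-magnitude values of a regular short-time Fourier transform with `scipy.signal.stft`. The test signal (a 440 Hz sine), the sampling rate and the window length are assumptions made for this example and are not taken from the paper.

```python
import numpy as np
from scipy.signal import stft

# Hypothetical test signal: one second of a 440 Hz sine sampled at 8 kHz.
fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440.0 * t)

# Short-time Fourier transform with a 512-sample window (scipy's default Hann
# window and 50% overlap). Z has shape (n_freq_bins, n_frames).
nperseg = 512
freqs, frame_times, Z = stft(x, fs=fs, nperseg=nperseg)

# Power spectrogram V: each column is one non-negative spectral sample v_n.
V = np.abs(Z) ** 2

# Frequency bin carrying the most energy on average across frames.
peak_hz = freqs[np.argmax(V.mean(axis=1))]
print("spectrogram shape:", V.shape)
print("dominant frequency (Hz):", peak_hz)
```

With a linear frequency axis like this one, the dominant bin can only approximate the true pitch to within the bin spacing `fs / nperseg`, which is one reason small displacements of energy between neighboring bins occur in practice.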

Cited literature [19 references]
Contributor : Nicolas Courty
Submitted on : Friday, October 7, 2016 - 10:17:31 AM
Last modification on : Thursday, August 4, 2022 - 4:55:31 PM
Long-term archiving on : Friday, February 3, 2017 - 6:58:08 PM


Files produced by the author(s)


  • HAL Id : hal-01377533, version 1
  • ARXIV : 1609.09799


Rémi Flamary, Cédric Févotte, Nicolas Courty, Valentin Emiya. Optimal spectral transportation with application to music transcription. Advances in Neural Information Processing Systems (NIPS), Dec 2016, Barcelona, Spain. ⟨hal-01377533⟩


