Automatic phonetization and phonetic content of an utterance
6.1.1 Voxygen's automatic phonetization system

Multimodal phonetic transcription

Other languages
6.4.1 Creating a voice for an already-covered language

Conclusion

Personal bibliography

K. Vythelingum, Y. Estève, and O. Rosec, Acoustic-dependent Phonemic Transcription for Text-to-speech Synthesis, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01870866

K. Vythelingum, Y. Estève, and O. Rosec, Transcription phonétique automatique pour la synthèse de la parole, XXXIIe Journées d'Etudes sur la Parole (JEP 2018), 2018.

K. Vythelingum, Y. Estève, and O. Rosec, Error detection of grapheme-to-phoneme conversion in text-to-speech synthesis using speech signal and lexical context, IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), December 2017.

K. Vythelingum, Détection des erreurs de phonétisation pour la synthèse de parole, 2017.

N. Tomashenko, K. Vythelingum, A. Rousseau, and Y. Estève, LIUM ASR systems for the 2016 Multi-Genre Broadcast Arabic Challenge, IEEE Workshop on Spoken Language Technology (SLT), 2016.
URL : https://hal.archives-ouvertes.fr/hal-01433188

T. Alhanai, W. Hsu, and J. Glass, Development of the MIT ASR system for the 2016 Arabic multi-genre broadcast challenge, IEEE Spoken Language Technology Workshop (SLT), pp.299-304, 2016.

A. Ali, S. Vogel, and S. Renals, Speech recognition challenge in the wild: Arabic MGB-3, IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp.316-322, 2017.

A. Ali, Y. Zhang, P. Cardinal, N. Dahak, S. Vogel et al., A complete Kaldi recipe for building Arabic speech recognition systems, IEEE Spoken Language Technology Workshop (SLT), pp.525-529, 2014.

A. Ali, P. Bell, J. Glass, Y. Messaoui, H. Mubarak et al., The MGB-2 challenge: Arabic multi-dialect broadcast media recognition, IEEE Spoken Language Technology Workshop (SLT), pp.279-284, 2016.

M. Ali, M. Elshafei, M. Al-Ghamdi, and H. Al-Muhtaseb, Arabic phonetic dictionaries for speech recognition, Journal of Information Technology Research (JITR), vol.2, issue.4, pp.67-80, 2009.

D. Altinok, Towards Turkish ASR: Anatomy of a rule-based Turkish g2p, arXiv preprint, 2016.

D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg et al., Deep speech 2: End-to-end speech recognition in English and Mandarin, International Conference on Machine Learning (ICML), pp.173-182, 2016.

J. Andresen, A. Bills, E. Dubinski, J. G. Fiscus, B. Gillies et al., IARPA Babel Turkish Language Pack, LDC2016S10, web download, Philadelphia: Linguistic Data Consortium, 2016.

S. Arik, M. Chrzanowski, A. Coates, G. Diamos, A. Gibiansky et al., Deep voice: Real-time neural text-to-speech, Proceedings of the 34th International Conference on Machine Learning (ICML), pp.195-204, 2017.

E. Arisoy, D. Can, S. Parlak, H. Sak, and M. Saraçlar, Turkish broadcast news transcription and retrieval, Transactions on Audio, Speech, and Language Processing, vol.17, pp.874-883, 2009.

D. Bahdanau, K. Cho, and Y. Bengio, Neural machine translation by jointly learning to align and translate, 2014.

F. Béchet, LIA-TAGG, 2001.

F. Béchet, LIA-PHON : Un système complet de phonétisation de textes, Traitement automatique des langues (TAL), vol.42, issue.1, pp.47-67, 2001.

J. G. Beerends, C. Schmidmer, J. Berger, M. Obermann, R. Ullmann et al., Perceptual objective listening quality assessment (POLQA), the third generation ITU-T standard for end-to-end speech quality measurement part I - Temporal alignment, Journal of the Audio Engineering Society, vol.61, pp.366-384, 2013.

A. M. Bell, Visible Speech: The science of Universal alphabetics, London: Simpkin, 1867.

P. Bell, M. J. Gales, T. Hain, J. Kilgour, P. Lanchantin et al., The MGB challenge: Evaluating multi-genre broadcast media recognition, IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp.687-693, 2015.

S. Bengio and G. Heigold, Word embeddings for speech recognition, 2014.

Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin, A neural probabilistic language model, Journal of Machine Learning Research, vol.3, issue.2, pp.1137-1155, 2003.

F. Biadsy, N. Habash, and J. Hirschberg, Improving the Arabic pronunciation dictionary for phone and word recognition with linguistically-based pronunciation rules, Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp.397-405, 2009.

B. Bigi, A multilingual text normalization approach, Language and Technology Conference, pp.515-526, 2011.

M. Bisani and H. Ney, Joint-sequence models for grapheme-to-phoneme conversion, Speech Communication, vol.50, issue.5, pp.434-451, 2008.

A. W. Black and K. Tokuda, The Blizzard Challenge - 2005: Evaluating corpus-based speech synthesis on common datasets, Ninth European Conference on Speech Communication and Technology, 2005.

A. W. Black and P. Taylor, CHATR: a generic speech synthesis system, Proceedings of the 15th Conference on Computational Linguistics, vol.2, pp.983-986, 1994.

H. Bourlard and C. J. Wellekens, Multilayer perceptrons and automatic speech recognition, Proceedings of the First International Conference on Neural Networks, vol.4, pp.407-416, 1987.

S. Brognaux, B. Picart, and T. Drugman, Speech synthesis in various communicative situations: Impact of pronunciation variations, 2014.

T. Buckwalter, Arabic transliteration, 2002.

O. Caglayan, M. Garcia-Martinez, A. Bardet, W. Aransa, F. Bougares et al., Nmtpy: A flexible toolkit for advanced neural machine translation systems, The Prague Bulletin of Mathematical Linguistics, vol.109, issue.1, pp.15-28, 2017.

L. Calliope, La parole et son traitement automatique, 1989.

B. Can and H. Artuner, A syllable-based Turkish speech recognition system by using time delay neural networks (TDNNs), International Conference on Soft Computing and Pattern Recognition (SoCPaR), pp.219-224, 2013.

W. Chan, N. Jaitly, Q. Le, and O. Vinyals, Listen, attend and spell: A neural network for large vocabulary conversational speech recognition, International Conference on Acoustics, Speech and Signal Processing, pp.4960-4964, 2016.

S. F. Chen and J. Goodman, An empirical study of smoothing techniques for language modeling, Computer Speech & Language, vol.13, issue.4, pp.359-394, 1999.

J. Chevelu, D. Lolive, S. Le Maguer, and D. Guennec, How to compare TTS systems: A new subjective evaluation methodology focused on differences, 2015.

K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares et al., Learning phrase representations using RNN encoder-decoder for statistical machine translation, 2014.

Ç. Çöltekin, A Freely Available Morphological Analyzer for Turkish, LREC, vol.2, pp.19-28, 2010.

Ç. Çöltekin, A set of open source tools for Turkish natural language processing, LREC, pp.1079-1086, 2014.

G. Cybenko, Approximation by superpositions of a sigmoidal function, Mathematics of Control, Signals, and Systems, pp.303-314, 1989.

R. Dall, S. Brognaux, K. Richmond, C. Valentini-Botinhao, G. E. Henter et al., Testing the consistency assumption: Pronunciation variant forced alignment in read and spontaneous speech synthesis, International Conference on Acoustics, Speech and Signal Processing, pp.5155-5159, 2016.

S. Davis and P. Mermelstein, Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences, IEEE Transactions on Acoustics, Speech, and Signal Processing, vol.28, issue.4, pp.357-366, 1980.

C. de Brosses, Traité de la formation mécanique des langues et des principes physiques de l'étymologie, 1765.

P. Delattre, Les Dix Intonations de base du français, The French Review, vol.40, pp.1-14, 1966.

A. P. Dempster, N. M. Laird, and D. B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society, vol.39, pp.1-22, 1977.

A. Di Cristo, Interpréter la prosodie, XXIIe Journées d'Etudes sur la Parole (JEP), 2000.

V. V. Digalakis, D. Rtischev, and L. G. Neumeyer, Speaker adaptation using constrained estimation of Gaussian mixtures, IEEE Transactions on Speech and Audio Processing, vol.3, pp.357-366, 1995.

N. Dixon and H. Maxey, Terminal analog synthesis of continuous speech using the diphone method of segment assembly, IEEE Transactions on Audio and Electroacoustics, vol.16, pp.40-50, 1968.

J. Duchi, E. Hazan, and Y. Singer, Adaptive subgradient methods for online learning and stochastic optimization, Journal of Machine Learning Research, vol.12, pp.2121-2159, 2011.

J. L. Elman, Finding structure in time, Cognitive Science, vol.14, issue.2, pp.179-211, 1990.

F. Espic, A. Govender, S. Ribeiro, C. Valentini-Botinhao, and O. Watts, The CSTR entry to the 2018 Blizzard Challenge, Blizzard Challenge Workshop, 2018.

G. Fant, Acoustic theory of speech production. 2, 1970.

J. L. Flanagan, Voices of men and machines, The Journal of the Acoustical Society of America, vol.51, issue.5, pp.1375-1387, 1972.

I. Fónagy, Des fonctions de l'intonation : Essai de synthèse, Flambeau, vol.29, pp.1-20, 2003.

M. J. Gales, Maximum likelihood linear transformations for HMM-based speech recognition, Computer Speech & Language, vol.12, issue.2, pp.75-98, 1998.

L. Galescu and J. F. Allen, Bi-directional conversion between graphemes and phonemes using a joint n-gram model, 4th ISCA Tutorial and Research Workshop (ITRW) on Speech Synthesis, 2001.

L. Galescu and J. F. Allen, Pronunciation of proper names with a joint n-gram model for bi-directional grapheme-to-phoneme conversion, Seventh International Conference on Spoken Language Processing, 2002.

J.-L. Gauvain and C.-H. Lee, Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains, IEEE Transactions on Speech and Audio Processing, vol.2, pp.291-298, 1994.

S. Ghannay, Etude sur les représentations continues de mots appliquées à la détection automatique des erreurs de reconnaissance de la parole, 2017.

A. Gibiansky, S. Arik, G. Diamos, J. Miller, K. Peng et al., Deep voice 2: Multi-speaker neural text-to-speech, Advances in Neural Information Processing Systems, pp.2962-2970, 2017.

I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, 2016.

K. Gorman and R. Sproat, Minimally supervised number normalization, Transactions of the Association for Computational Linguistics, vol.4, pp.507-519, 2016.

A. Graves and N. Jaitly, Towards end-to-end speech recognition with recurrent neural networks, International Conference on Machine Learning (ICML), pp.1764-1772, 2014.

A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks, Proceedings of the 23rd International Conference on Machine Learning, pp.369-376, 2006.

R. Gretter, Euronews: a multilingual speech corpus for ASR, LREC, pp.2635-2638, 2014.

T. Güngör, Computer processing of Turkish: Morphological and lexical investigation, 1995.

W. I. Hallahan, DECtalk software: Text-to-speech technology and implementation, Digital Technical Journal, pp.5-19, 1995.

B. Han and T. Baldwin, Lexical normalisation of short text messages: Makn sens a #twitter, Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, vol.1, pp.368-378, 2011.

C. M. Harris, A study of the building blocks in speech, The Journal of the Acoustical Society of America, vol.25, pp.962-969, 1953.

H. Hermansky, Perceptual linear predictive (PLP) analysis of speech, Journal of the Acoustical Society of America, vol.87, pp.1738-1752, 1990.

F. Hernandez, V. Nguyen, S. Ghannay, N. Tomashenko, and Y. Estève, TED-LIUM 3: twice as much data and corpus repartition for experiments on speaker adaptation, International Conference on Speech and Computer, pp.198-208, 2018.

G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed et al., Deep neural networks for acoustic modeling in speech recognition, IEEE Signal Processing Magazine, vol.29, 2012.

S. Hochreiter and J. Schmidhuber, Long short-term memory, Neural Computation, vol.9, issue.8, pp.1735-1780, 1997.

K. Hornik, Approximation Capabilities of Multilayer Feedforward Networks, Neural Networks, pp.251-257, 1991.

D. H. Hubel and T. N. Wiesel, Receptive fields, binocular interaction and functional architecture in the cat's visual cortex, The Journal of Physiology, vol.160, pp.106-154, 1962.

A. J. Hunt and A. W. Black, Unit selection in a concatenative speech synthesis system using a large speech database, International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol.1, IEEE, pp.373-376, 1996.

F. Jelinek, Continuous speech recognition by statistical methods, Proceedings of the IEEE, vol.64, issue.4, pp.532-556, 1976.

Y. Jiang, Z. Ling, M. Lei, C. Wang, L. Heng et al., The USTC system for Blizzard Challenge, 2018.

M. Jordan, Attractor dynamics and parallelism in a connectionist sequential machine, Proc. of the Eighth Annual Conference of the Cognitive Science Society, 1986.

N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande et al., Efficient neural audio synthesis, 2018.

R. M. Kaplan and M. Kay, Regular models of phonological rule systems, Computational Linguistics, vol.20, issue.3, pp.331-378, 1994.

S. Katz, Estimation of probabilities from sparse data for the language model component of a speech recognizer, IEEE Transactions on Acoustics, Speech, and Signal Processing, vol.35, pp.400-401, 1987.

S. Khurana and A. Ali, QCRI advanced transcription system (QATS) for the Arabic multi-dialect broadcast media recognition: MGB-2 challenge, IEEE Spoken Language Technology Workshop (SLT), pp.292-298, 2016.

S. King, J. Crumlish, A. Martin, and L. Wihlborg, The Blizzard Challenge, Blizzard Challenge Workshop, 2018.

D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, 2014.

D. Klatt, The Klattalk text-to-speech conversion system, International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol.7, IEEE, pp.1589-1592, 1982.

R. Kneser and H. Ney, Improved backing-off for m-gram language modeling, International Conference on Acoustics, Speech, and Signal Processing, 1995.

P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico et al., Moses: Open source toolkit for statistical machine translation, Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, pp.177-180, 2007.

A. Laurent, P. Deléglise, and S. Meignier, Grapheme to phoneme conversion using an SMT system, 2009.

Y. LeCun, K. Kavukcuoglu, and C. Farabet, Convolutional networks and applications in vision, Proceedings of IEEE International Symposium on Circuits and Systems, pp.253-256, 2010.

C. J. Leggetter and P. C. Woodland, Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models, Computer Speech & Language, vol.9, issue.2, pp.171-185, 1995.

P. R. Léon, Précis de phonostylistique : parole et expressivité, 1993.

F. Liu, F. Weng, and X. Jiang, A broad-coverage normalization system for social media language, Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers, vol.1, pp.1035-1044, 2012.

L. Lu, X. Zhang, K. Cho, and S. Renals, A study of the recurrent neural network encoder-decoder for large vocabulary speech recognition, 2015.

T. Masuko, K. Tokuda, T. Kobayashi, and S. Imai, Speech synthesis using HMMs with dynamic features, International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol.1, IEEE, pp.389-392, 1996.

P. Mermelstein, Articulatory model for the study of speech production, The Journal of the Acoustical Society of America, vol.53, pp.1070-1082, 1973.

Y. Miao, M. Gowayyed, and F. Metze, EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding, IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp.167-174, 2015.

T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur, Recurrent neural network based language model, 2010.

T. Mikolov, K. Chen, G. Corrado, and J. Dean, Efficient estimation of word representations in vector space, 2013.

J. Montmignon, Système de prononciation figurée applicable à toutes les langues, 1785.

E. Moulines and F. Charpentier, Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones, Speech Communication, vol.9, pp.453-467, 1990.

A. H. Ng, K. Gorman, and R. Sproat, Minimally supervised written-to-spoken text normalization, IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp.665-670, 2017.

L. Nguyen, T. Ng, K. Nguyen, R. Zbib, and J. Makhoul, Lexical and phonetic modeling for Arabic automatic speech recognition, 2009.

J. R. Novak, N. Minematsu, and K. Hirose, Failure transitions for joint n-gram models and G2P conversion, Interspeech, pp.1821-1825, 2013.

J. R. Novak, P. R. Dixon, N. Minematsu, K. Hirose, C. Hori et al., Improving WFST-based G2P conversion with alignment constraints and RNNLM N-best rescoring, 2012.

K. Oflazer, Two-level description of Turkish morphology, Literary and Linguistic Computing, vol.9, issue.2, pp.137-148, 1994.

K. Oflazer and S. Inkelas, A finite state pronunciation lexicon for Turkish, Proceedings of the EACL Workshop on Finite State Methods in NLP, vol.82, pp.900-918, 2003.

K. Oflazer and S. Inkelas, The architecture and the implementation of a finite state pronunciation lexicon for Turkish, Computer Speech & Language, vol.20, pp.80-106, 2006.

A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals et al., WaveNet: A generative model for raw audio, arXiv preprint, 2016.

D. Pallett, J. Fiscus, J. Garofolo, A. Martin, M. Przybocki et al., The history of automatic speech recognition evaluations at NIST, 2009.

A. Pasha, M. Al-Badrashiny, M. Diab, A. El Kholy, R. Eskander et al., MADAMIRA: A fast, comprehensive tool for morphological analysis and disambiguation of Arabic, LREC, vol.14, pp.1094-1101, 2014.

V. Peddinti, D. Povey, and S. Khudanpur, A time delay neural network architecture for efficient modeling of long temporal contexts, 2015.

D. Pennell and Y. Liu, A character-level machine translation approach for normalization of SMS abbreviations, Proceedings of 5th International Joint Conference on Natural Language Processing, pp.974-982, 2011.

J. Pennington, R. Socher, and C. Manning, GloVe: Global vectors for word representation, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pp.1532-1543, 2014.

G. E. Peterson, W. S. Wang, and E. Sivertsen, Segmentation techniques in speech synthesis, The Journal of the Acoustical Society of America, vol.30, pp.739-742, 1958.

W. Ping, K. Peng, A. Gibiansky, S. O. Arik, A. Kannan et al., Deep voice 3: Scaling text-to-speech with convolutional sequence learning, 2017.

D. Povey, B. Kingsbury, L. Mangu, G. Saon, H. Soltau et al., fMPE: Discriminatively trained features for speech recognition, International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol.1, IEEE, p.961, 2005.

D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek et al., The Kaldi speech recognition toolkit, IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), IEEE Signal Processing Society, 2011.

D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar et al., Purely sequence-trained neural networks for ASR based on lattice-free MMI, Interspeech, pp.2751-2755, 2016.

L. R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Proceedings of the IEEE, vol.77, issue.2, pp.257-286, 1989.

K. Rao, F. Peng, H. Sak, and F. Beaufays, Grapheme-to-phoneme conversion using long short-term memory recurrent neural networks, International Conference on Acoustics, Speech and Signal Processing, pp.4225-4229, 2015.

A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs, International Conference on Acoustics, Speech, and Signal Processing, pp.749-752, 2001.

A. Rousseau, P. Deléglise, and Y. Estève, TED-LIUM: an Automatic Speech Recognition dedicated corpus, LREC, pp.125-129, 2012.
URL : https://hal.archives-ouvertes.fr/hal-01434928

A. Rousseau, P. Deléglise, and Y. Estève, Enhancing the TED-LIUM corpus with selected data for language modeling and more TED talks, pp.3935-3939, 2014.

Y. Sagisaka, Speech synthesis by rule using an optimal selection of nonuniform synthesis units, International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp.679-682, 1988.

H. Sak, T. Güngör, and M. Saraçlar, Turkish language resources: Morphological parser, morphological disambiguator and web corpus, International Conference on Natural Language Processing, pp.417-427, 2008.

M. Saraçlar, Turkish Broadcast News Speech and Transcripts, web download, Philadelphia: Linguistic Data Consortium, 2012.

S. Schwarm and M. Ostendorf, Text normalization with varied data sources for conversational speech language modeling, International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol.1, IEEE, p.789, 2002.

C. Schweitzer, C. Dodane, and J. Lazar, L'histoire des alphabets phonétiques du XVIIIe siècle jusqu'à l'API, XXXIIe Journées d'Etudes sur la Parole (JEP), 2018.

H. Schwenk, Continuous space language models, Computer Speech & Language, vol.21, pp.492-518, 2007.

J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly et al., Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions, International Conference on Acoustics, Speech and Signal Processing, pp.4779-4783, 2018.

Y. Shibata, T. Kida, S. Fukamachi, M. Takeda, A. Shinohara et al., Byte Pair encoding: A text compression scheme that accelerates pattern matching, Technical report, 1999.

H. Soltau, G. Saon, B. Kingsbury, H. J. Kuo, L. Mangu et al., Advances in Arabic speech transcription at IBM under the DARPA GALE program, IEEE Transactions on Audio, Speech, and Language Processing, vol.17, pp.884-894, 2009.

R. Sproat, A. W. Black, S. Chen, S. Kumar, M. Ostendorf et al., Normalization of non-standard words, Computer Speech & Language, vol.15, issue.3, pp.287-333, 2001.

S. S. Stevens, J. Volkmann, and E. B. Newman, A scale for the measurement of the psychological magnitude pitch, The Journal of the Acoustical Society of America, vol.8, pp.185-190, 1937.

A. Stolcke, SRILM - an extensible language modeling toolkit, Seventh International Conference on Spoken Language Processing, 2002.

B. H. Story, Advances in simulation of sentence-level speech production with kinematic models of the vocal tract and vocal folds, The Journal of the Acoustical Society of America, vol.126, p.2205, 2009.

B. H. Story, TubeTalker: An airway modulation model of human sound production, Proceedings of the First International Workshop on Performative Speech and Singing Synthesis (P3S), pp.1-8, 2011.

K. Tokuda, T. Kobayashi, and S. Imai, Speech parameter generation from HMM using dynamic features, International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol.1, IEEE, pp.660-663, 1995.

N. Tomashenko and Y. Khokhlov, Speaker adaptation of context dependent deep neural networks based on MAP-adaptation and GMM-derived feature processing, 2014.

N. Tomashenko, Y. Khokhlov, and Y. Estève, On the Use of Gaussian Mixture Model Framework to Improve Speaker Adaptation of Deep Neural Network Acoustic Models, Interspeech, pp.3788-3792, 2016.

N. Tomashenko, K. Vythelingum, A. Rousseau, and Y. Estève, LIUM ASR systems for the 2016 Multi-Genre Broadcast Arabic challenge, IEEE Spoken Language Technology Workshop (SLT), pp.285-291, 2016.

T. Virtanen, R. Singh, and B. Raj, Techniques for noise robustness in automatic speech recognition, 2012.

A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang, Phoneme recognition using time-delay neural networks, IEEE Transactions on Acoustics, Speech, and Signal Processing, vol.37, pp.328-339, 1989.

Y. Wang, R. J. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss et al., Tacotron: Towards end-to-end speech synthesis, 2017.

X. Yang, D. Qu, W. Zhang, and W. Zhang, The NDSC transcription system for the 2016 multi-genre broadcast challenge, IEEE Spoken Language Technology Workshop (SLT), pp.273-278, 2016.

Y. Yang and J. Eisenstein, A log-linear model for unsupervised text normalization, Empirical Methods in Natural Language Processing Conference (EMNLP), pp.61-72, 2013.

K. Yao and G. Zweig, Sequence-to-sequence neural net models for grapheme-to-phoneme conversion, 2015.

S. J. Young, J. J. Odell, and P. C. Woodland, Tree-based state tying for high accuracy acoustic modelling, Proceedings of the Workshop on Human Language Technology, pp.307-312, 1994.

M. D. Zeiler, ADADELTA: an adaptive learning rate method, 2012.

H. Zen, T. Nose, J. Yamagishi, S. Sako, T. Masuko et al., The HMM-based speech synthesis system (HTS) version 2.0, pp.294-299, 2007.

X. Zhang, V. Manohar, D. Povey, and S. Khudanpur, Acoustic Data-Driven Lexicon Learning Based on a Greedy Pronunciation Selection Framework, Proc. Interspeech, pp.2541-2545, 2017.