M. Versteegh, X. Anguera, A. Jansen, and E. Dupoux, The Zero Resource Speech Challenge 2015: Proposed approaches and results, Procedia Computer Science, vol.81, pp.67-72, 2016.

E. Dunbar, X. N. Cao, J. Benjumea, J. Karadayi, M. Bernard et al., The Zero Resource Speech Challenge 2017, IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp.323-330, 2017.
URL: https://hal.archives-ouvertes.fr/hal-01687504

P. K. Muthukumar and A. W. Black, Automatic discovery of a phonetic inventory for unwritten languages for statistical speech synthesis, IEEE International Conference on Acoustics, Speech and Signal Processing, pp.2594-2598, 2014.

O. Scharenborg, L. Besacier, A. W. Black, M. Hasegawa-Johnson, F. Metze et al., Linguistic unit discovery from multi-modal inputs in unwritten languages: Summary of the "Speaking Rosetta" JSALT 2017 workshop, IEEE International Conference on Acoustics, Speech and Signal Processing, pp.4979-4983, 2018.
URL: https://hal.archives-ouvertes.fr/hal-01709578

L. Ondel, L. Burget, and J. Cernocký, Variational inference for acoustic unit discovery, SLTU, ser. Procedia Computer Science, vol.81, pp.80-86, 2016.

Z. Wu, O. Watts, and S. King, Merlin: An open source neural network speech synthesis system, Speech Synthesis Workshop. ISCA, pp.202-207, 2016.

L. Badino, C. Canevari, L. Fadiga, and G. Metta, An autoencoder based approach to unsupervised learning of subword units, ICASSP, pp.7634-7638, 2014.

A. Myrman and G. Salvi, Partitioning of posteriorgrams using siamese models for unsupervised acoustic modelling, International Workshop on Grounding Language Understanding (GLU), 2017.

M. Heck, S. Sakti, and S. Nakamura, Feature optimized DPGMM clustering for unsupervised subword modeling: A contribution to ZeroSpeech 2017, IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp.740-746, 2017.

A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals et al., WaveNet: A generative model for raw audio, SSW. ISCA, p.125, 2016.

S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain et al., SampleRNN: An unconditional end-to-end neural audio generation model, CoRR, 2016.

J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly et al., Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions, ICASSP, pp.4779-4783, 2018.

W. Ping, K. Peng, A. Gibiansky, S. Ö. Arik, A. Kannan et al., Deep Voice 3: 2000-speaker neural text-to-speech, CoRR, 2017.

N. Li, S. Liu, Y. Liu, S. Zhao, M. Liu et al., Close to human quality TTS with transformer, CoRR, 2018.

A. Tjandra, S. Sakti, and S. Nakamura, Listening while speaking: Speech chain by deep learning, IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp.301-308, 2017.

T. Kaneko and H. Kameoka, Parallel-data-free voice conversion using cycle-consistent adversarial networks, CoRR, 2017.

C. Hsu, H. Hwang, Y. Wu, Y. Tsao, and H. Wang, Voice conversion from non-parallel corpora using variational auto-encoder, Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, pp.1-6, 2016.

J. Chou, C. Yeh, H. Lee, and L. Lee, Multi-target voice conversion without parallel data by adversarially learning disentangled audio representations, CoRR, 2018.

Y. Gao, R. Singh, and B. Raj, Voice impersonation using generative adversarial networks, ICASSP, pp.2506-2510, 2018.

A. van den Oord, O. Vinyals, and K. Kavukcuoglu, Neural discrete representation learning, Advances in Neural Information Processing Systems, pp.6306-6315, 2017.

J. Chorowski, R. J. Weiss, S. Bengio, and A. van den Oord, Unsupervised speech representation learning using WaveNet autoencoders, 2019.

S. Sakti, R. Maia, S. Sakai, T. Shimizu, and S. Nakamura, Development of HMM-based Indonesian speech synthesis, Proc. Oriental COCOSDA, pp.215-219, 2008.

S. Sakti, E. Kelana, H. Riza, S. Sakai, K. Markov et al., Development of Indonesian large vocabulary continuous speech recognition system within A-STAR project, Proceedings of the Workshop on Technologies and Corpora for Asia-Pacific Speech Translation (TCAST), 2008.

L. Ondel, P. Godard, L. Besacier, E. Larsen, M. Hasegawa-Johnson et al., Bayesian models for unit discovery on a very low resource language, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, pp.5939-5943, 2018.
URL: https://hal.archives-ouvertes.fr/hal-01709589

D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek et al., The Kaldi speech recognition toolkit, IEEE Signal Processing Society, Tech. Rep., 2011.

K. Pandia and H. Murthy, Zero Resource Speech Synthesis Using Transcripts Derived from Perceptual Acoustic Units, 2019.

S. Feng, T. Lee, and Z. Peng, Combining Adversarial Training and Disentangled Speech Representation for Robust Zero-Resource Subword Modeling, 2019.

A. Liu, P. Hsu, and H. Lee, Unsupervised End-to-End Learning of Discrete Linguistic Units for Voice Conversion, 2019.

R. Eloff, A. Nortje, B. van Niekerk, A. Govender, L. Nortje et al., Unsupervised acoustic unit discovery for speech synthesis using discrete latent-variable neural networks, 2019.

S. Nayak, C. S. Kumar, G. Ramesh, S. Bhati, and K. S. Murty, Virtual Phone Discovery for Speech Synthesis, 2019.

B. Yusuf, A. Gok, B. Gundogdu, O. D. Kose, and M. Saraclar, Temporally-Aware Acoustic Unit Discovery for Zerospeech 2019 Challenge, 2019.

A. Tjandra, B. Sisman, M. Zhang, S. Sakti, H. Li et al., VQVAE Unsupervised Unit Discovery and Multi-scale Code2Spec Inverter for Zerospeech Challenge 2019, 2019.