Vocoder-free text-to-speech synthesis incorporating generative adversarial networks using low-/multi-frequency STFT amplitude spectra. (November 2019)

Record Type:: Journal Article
Title:: Vocoder-free text-to-speech synthesis incorporating generative adversarial networks using low-/multi-frequency STFT amplitude spectra. (November 2019)
Main Title:: Vocoder-free text-to-speech synthesis incorporating generative adversarial networks using low-/multi-frequency STFT amplitude spectra
Authors:: Saito, Yuki
Takamichi, Shinnosuke
Saruwatari, Hiroshi
Abstract:: Highlights: We propose novel training algorithms for vocoder-free text-to-speech synthesis using STFT spectra based on generative adversarial networks (GANs). We demonstrate that using GANs with the original-frequency-resolution amplitude spectra degrades the synthetic speech quality. We show that the proposed low-frequency-resolution GANs improve the synthetic speech quality. We also show that using the inverse mel scale for the proposed algorithm further improves the synthetic speech quality.
Abstract: This paper proposes novel training algorithms for vocoder-free text-to-speech (TTS) synthesis based on generative adversarial networks (GANs) that compensate for short-time Fourier transform (STFT) amplitude spectra in low/multi-frequency resolution. Vocoder-free TTS using STFT amplitude spectra can avoid the degradation of synthetic speech quality caused by the vocoder-based parameterization used in conventional TTS. Our previous work on vocoder-based TTS proposed a method for incorporating GAN-based distribution compensation into acoustic model training to improve synthetic speech quality. This paper extends the algorithm to vocoder-free TTS and proposes a GAN-based training algorithm using low-frequency-resolution amplitude spectra to overcome the difficulty of modeling the complicated distribution of the high-dimensional spectra. In the proposed algorithm, amplitude spectra are transformed into low-frequency-resolution amplitude spectra by applying an average …
Is Part Of:: Computer speech & language. Volume 58(2019)
Journal:: Computer speech & language
Issue:: Volume 58(2019)
Issue Display:: Volume 58, Issue 2019 (2019)
Year:: 2019
Volume:: 58
Issue:: 2019
Issue Sort Value:: 2019-0058-2019-0000
Page Start:: 347
Page End:: 363
Publication Date:: 2019-11
Subjects:: Vocoder-free text-to-speech -- Training algorithm -- STFT amplitude spectra -- Generative adversarial networks -- Frequency resolution -- Frequency warping
Speech processing systems -- Periodicals
Automatic speech recognition -- Periodicals
Computers -- Periodicals
Linguistics -- Periodicals
Speech-Language Pathology -- Periodicals
Traitement automatique de la parole -- Périodiques
Reconnaissance automatique de la parole -- Périodiques
Automatic speech recognition
Speech processing systems
Electronic journals
Periodicals
006.454
Journal URLs:: http://www.journals.elsevier.com/computer-speech-and-language/
http://www.elsevier.com/journals
DOI:: 10.1016/j.csl.2019.05.008
Languages:: English
ISSNs:: 0885-2308
Deposit Type:: Legal deposit
View Content:: Available online (eLD content is only available in our Reading Rooms)
Physical Locations:: British Library DSC - 3394.276600
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
Ingest File:: 11148.xml