Vocoder-free text-to-speech synthesis incorporating generative adversarial networks using low-/multi-frequency STFT amplitude spectra. (November 2019)
- Record Type:
- Journal Article
- Title:
- Vocoder-free text-to-speech synthesis incorporating generative adversarial networks using low-/multi-frequency STFT amplitude spectra. (November 2019)
- Main Title:
- Vocoder-free text-to-speech synthesis incorporating generative adversarial networks using low-/multi-frequency STFT amplitude spectra
- Authors:
- Saito, Yuki
Takamichi, Shinnosuke
Saruwatari, Hiroshi
- Abstract:
- Highlights: We propose novel training algorithms for vocoder-free text-to-speech synthesis using STFT spectra based on generative adversarial networks (GANs). We demonstrate that using GANs with the original-frequency-resolution amplitude spectra degrades the synthetic speech quality. We show that the proposed low-frequency-resolution GANs improve the synthetic speech quality. We also show that using the inverse mel scale for the proposed algorithm further improves the synthetic speech quality.
Abstract: This paper proposes novel training algorithms for vocoder-free text-to-speech (TTS) synthesis based on generative adversarial networks (GANs) that compensate for short-term Fourier transform (STFT) amplitude spectra in low/multi frequency resolution. Vocoder-free TTS using STFT amplitude spectra can avoid the degradation of synthetic speech quality caused by the vocoder-based parameterization used in conventional TTS. Our previous work on vocoder-based TTS proposed a method for incorporating GAN-based distribution compensation into acoustic model training to improve synthetic speech quality. This paper extends the algorithm to vocoder-free TTS and proposes a GAN-based training algorithm using low-frequency-resolution amplitude spectra to overcome the difficulty of modeling the complicated distribution of the high-dimensional spectra. In the proposed algorithm, amplitude spectra are transformed into low-frequency-resolution amplitude spectra by applying an average pooling function along the frequency axis; then the GAN-based distribution compensation is performed in the low-frequency-resolution domain. Because the low-frequency-resolution amplitude spectra approximately emulate filter banks, the proposed algorithm is expected to improve synthetic speech quality by reducing differences in the spectral envelopes of natural and synthetic speech. Furthermore, various frequency scales that are related to human speech perception (e.g., mel and inverse mel frequency scales) can be introduced to the proposed training algorithm by applying a frequency warping function to amplitude spectra. This paper also proposes a GAN-based training algorithm using multi-frequency-resolution amplitude spectra that uses both low- and original-frequency-resolution amplitude spectra to reduce the differences in not only spectral envelopes but also fine structures.
Experimental results demonstrate that (1) GANs using low-frequency-resolution amplitude spectra improve speech quality and work robustly against the settings of the frequency resolution and hyperparameters, (2) in a comparison among low-, original-, and multi-frequency-resolution amplitude spectra, the low-frequency-resolution ones work best for improving the synthetic speech quality, and (3) the use of the inverse mel frequency scale for obtaining low-frequency-resolution amplitude spectra further improves synthetic speech quality.
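The abstract describes obtaining low-frequency-resolution amplitude spectra by average pooling along the frequency axis. The following is a minimal sketch of that one step only, not the authors' implementation: the function name `average_pool_frequency`, the parameters `width` and `stride`, the frames-by-bins list layout, and the toy values are all assumptions made for illustration, and the GAN training and frequency warping are not shown.

```python
from typing import List, Sequence


def average_pool_frequency(
    spectra: Sequence[Sequence[float]], width: int, stride: int
) -> List[List[float]]:
    """Average-pool each frame of an amplitude spectrogram along frequency.

    spectra: frames x frequency bins (assumed layout, not from the paper).
    width, stride: pooling window size and hop in bins (hypothetical names).
    """
    if width <= 0 or stride <= 0:
        raise ValueError("width and stride must be positive")
    pooled = []
    for frame in spectra:
        row = []
        # Slide the window over the frequency axis of this frame;
        # an incomplete window at the upper end is dropped.
        for start in range(0, len(frame) - width + 1, stride):
            window = frame[start:start + width]
            row.append(sum(window) / width)
        pooled.append(row)
    return pooled


# Two frames with eight frequency bins each (made-up values).
toy = [
    [1.0, 3.0, 2.0, 4.0, 0.0, 2.0, 6.0, 2.0],
    [0.0, 0.0, 4.0, 4.0, 1.0, 1.0, 3.0, 5.0],
]
low_res = average_pool_frequency(toy, width=2, stride=2)
print(low_res)  # [[2.0, 3.0, 1.0, 4.0], [0.0, 4.0, 1.0, 4.0]]
```

With `width=2` and `stride=2`, each frame shrinks from eight bins to four, so fine structure within each pair of bins is averaged away while the coarse spectral shape is kept, which is the sense in which the abstract says the pooled spectra approximately emulate filter banks.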
- Is Part Of:
- Computer speech & language. Volume 58(2019)
- Journal:
- Computer speech & language
- Issue:
- Volume 58(2019)
- Issue Display:
- Volume 58, Issue 2019 (2019)
- Year:
- 2019
- Volume:
- 58
- Issue:
- 2019
- Issue Sort Value:
- 2019-0058-2019-0000
- Page Start:
- 347
- Page End:
- 363
- Publication Date:
- 2019-11
- Subjects:
- Vocoder-free text-to-speech -- Training algorithm -- STFT amplitude spectra -- Generative adversarial networks -- Frequency resolution -- Frequency warping
Speech processing systems -- Periodicals
Automatic speech recognition -- Periodicals
Computers -- Periodicals
Linguistics -- Periodicals
Speech-Language Pathology -- Periodicals
Traitement automatique de la parole -- Périodiques
Reconnaissance automatique de la parole -- Périodiques
Automatic speech recognition
Speech processing systems
Electronic journals
Periodicals
006.454
- Journal URLs:
- http://www.journals.elsevier.com/computer-speech-and-language/
http://www.elsevier.com/journals
- DOI:
- 10.1016/j.csl.2019.05.008
- Languages:
- English
- ISSNs:
- 0885-2308
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - 3394.276600
British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 11148.xml