Conference paper. Year: 2023

Stochastic Pitch Prediction Improves the Diversity and Naturalness of Speech in Glow-TTS

Abstract

Flow-based generative models are widely used in text-to-speech (TTS) systems to learn the distribution of audio features (e.g., Mel-spectrograms) given the input tokens and to sample from this distribution to generate diverse utterances. However, in the zero-shot multi-speaker TTS scenario, the generated utterances lack diversity and naturalness. In this paper, we propose to improve the diversity of utterances by explicitly learning the distribution of fundamental frequency sequences (pitch contours) of each speaker during training using a stochastic flow-based pitch predictor, then conditioning the model on generated pitch contours during inference. Experimental results show that a Glow-TTS model with explicit stochastic pitch prediction generates significantly more natural and diverse speech than both a Glow-TTS baseline and an improved Glow-TTS model that uses a stochastic duration predictor.
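
As a rough illustration of what a stochastic flow-based pitch predictor does, the sketch below uses the simplest possible flow: a single element-wise affine transform between a log-F0 contour and Gaussian noise. It is not the architecture from the paper. The names `forward`, `log_likelihood`, and `sample`, as well as the fixed `mu` and `log_sigma` arrays, are assumptions made for this example; in a real system these parameters would be predicted by a network conditioned on the text and the speaker, and the flow would stack several invertible layers.

```python
import numpy as np

def forward(f0, mu, log_sigma):
    """Training direction: map a log-F0 contour to latent z, return log|det J|."""
    z = (f0 - mu) * np.exp(-log_sigma)
    log_det = -np.sum(log_sigma)
    return z, log_det

def log_likelihood(f0, mu, log_sigma):
    """Exact log-likelihood of a contour under a standard normal prior on z."""
    z, log_det = forward(f0, mu, log_sigma)
    log_prior = -0.5 * np.sum(z ** 2 + np.log(2.0 * np.pi))
    return log_prior + log_det

def sample(mu, log_sigma, rng, temperature=1.0):
    """Inference direction: draw noise and invert the flow to get a contour."""
    z = temperature * rng.standard_normal(mu.shape)
    return mu + np.exp(log_sigma) * z

# Hypothetical conditioning: placeholders for values a network would predict
# from the input tokens and a speaker embedding.
rng = np.random.default_rng(0)
n_frames = 50
mu = np.log(120.0) + 0.1 * np.sin(np.linspace(0.0, np.pi, n_frames))
log_sigma = np.full(n_frames, np.log(0.05))

# Two draws for the same text and speaker yield two different pitch contours.
contour_a = sample(mu, log_sigma, rng)
contour_b = sample(mu, log_sigma, rng)
```

Each call to `sample` produces a different contour for the same conditioning, which is the source of diversity the abstract refers to. Setting `temperature=0.0` collapses the output to the deterministic contour `mu`.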
Main file
Stochastic_Pitch_Prediction_for_Improving_the_Diversity_and_Naturalness_in_GlowTTS.pdf (281.52 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-04108825, version 1 (28-05-2023)

License

Attribution

Identifiers

  • HAL Id: hal-04108825, version 1

Cite

Sewade Ogun, Vincent Colotte, Emmanuel Vincent. Stochastic Pitch Prediction Improves the Diversity and Naturalness of Speech in Glow-TTS. InterSpeech 2023, Aug 2023, Dublin, Ireland. ⟨hal-04108825⟩