Conference paper, Year: 2024

SpecDiff-GAN: A Spectrally-Shaped Noise Diffusion GAN for Speech and Music Synthesis

Abstract

Generative adversarial network (GAN) models can synthesize high-quality audio signals while ensuring fast sample generation. However, they are difficult to train and are prone to several issues, including mode collapse and divergence. In this paper, we introduce SpecDiff-GAN, a neural vocoder based on HiFi-GAN, which was initially devised for speech synthesis from mel spectrograms. In our model, training stability is enhanced by means of a forward diffusion process, which consists of injecting noise from a Gaussian distribution into both real and fake samples before feeding them to the discriminator. We further improve the model by exploiting a spectrally-shaped noise distribution with the aim of making the discriminator's task more challenging. We then show the merits of our proposed model for speech and music synthesis on several datasets. Our experiments confirm that our model compares favorably with several baselines in audio quality and efficiency.
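
The noise-injection idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the linear beta schedule, the two-tap low-pass filter used to shape the noise, the toy signals, and all function names are assumptions made for illustration. The abstract does not say how the noise is shaped or how the diffusion step is chosen, so the sketch simply draws one step t and applies it to both inputs.

```python
import math
import random


def white_noise(rng, n):
    """Draw n samples of standard Gaussian (white) noise."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def shaped_noise(rng, n, taps):
    """Colour white Gaussian noise with an FIR filter (circular convolution).

    `taps` is a hypothetical filter; it only stands in for whatever
    spectral shaping the paper actually uses.
    """
    white = white_noise(rng, n)
    return [sum(h * white[(i - k) % n] for k, h in enumerate(taps))
            for i in range(n)]


def cumulative_alphas(betas):
    """Running product of (1 - beta), one value per diffusion step."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def diffuse(x0, noise, alpha_bar_t):
    """Forward diffusion: attenuate the signal and add scaled noise."""
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    return [a * x + b * e for x, e in zip(x0, noise)]


rng = random.Random(0)
n = 256
betas = [1e-4 + (0.05 - 1e-4) * i / 49 for i in range(50)]  # assumed schedule
alpha_bar = cumulative_alphas(betas)

# Toy stand-ins for a real waveform and a generator output.
real = [math.sin(2 * math.pi * 5 * i / n) for i in range(n)]
fake = [x + 0.1 * rng.gauss(0.0, 1.0) for x in real]

taps = [0.5, 0.5]  # assumed low-pass shaping filter
t = rng.randrange(len(betas))  # one diffusion step for both inputs

# Both signals are noised before they would reach the discriminator.
noisy_real = diffuse(real, shaped_noise(rng, n, taps), alpha_bar[t])
noisy_fake = diffuse(fake, shaped_noise(rng, n, taps), alpha_bar[t])
```

In this sketch, setting `taps = [1.0]` reduces the shaped noise to plain white Gaussian noise, which corresponds to the unshaped variant mentioned first in the abstract.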
Main file
ICASSP_2024_SpecDiff_GAN___Preprint.pdf (466.56 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-04423979, version 1 (29-01-2024)

Identifiers

  • HAL Id: hal-04423979, version 1

Cite

Teysir Baoueb, Haocheng Liu, Mathieu Fontaine, Jonathan Le Roux, Gael Richard. SpecDiff-GAN: A Spectrally-Shaped Noise Diffusion GAN for Speech and Music Synthesis. IEEE International Conference on Acoustics, Speech and Signal Processing, Apr 2024, Seoul, South Korea. ⟨hal-04423979⟩