SpecDiff-GAN: A Spectrally-Shaped Noise Diffusion GAN for Speech and Music Synthesis

Teysir Baoueb; Haocheng Liu; Mathieu Fontaine; Jonathan Le Roux; Gael Richard

Conference paper, Year: 2024


Teysir Baoueb

Role: Author
PersonId : 1343186
ORCID : 0009-0001-2263-4309

Signal, Statistique et Apprentissage

Département Images, Données, Signal

Haocheng Liu

Role: Author
PersonId : 1344278

Signal, Statistique et Apprentissage

Département Images, Données, Signal

Mathieu Fontaine

Role: Author
PersonId : 13405
IdHAL : mathieu-fontaine
ORCID : 0000-0002-7657-6271
IdRef : 236886681

Signal, Statistique et Apprentissage

Département Images, Données, Signal

Jonathan Le Roux

Role: Author

Mitsubishi Electric Research Laboratories

Gael Richard

Role: Author
PersonId : 14146
IdHAL : gael-richard
IdRef : 094977208

Signal, Statistique et Apprentissage

Département Images, Données, Signal

Abstract

Generative adversarial network (GAN) models can synthesize high-quality audio signals while ensuring fast sample generation. However, they are difficult to train and are prone to several issues, including mode collapse and divergence. In this paper, we introduce SpecDiff-GAN, a neural vocoder based on HiFi-GAN, which was initially devised for speech synthesis from mel spectrograms. In our model, training stability is enhanced by a forward diffusion process that injects noise from a Gaussian distribution into both real and fake samples before they are fed to the discriminator. We further improve the model by exploiting a spectrally-shaped noise distribution, with the aim of making the discriminator's task more challenging. We then show the merits of our proposed model for speech and music synthesis on several datasets. Our experiments confirm that our model compares favorably with several baselines in audio quality and efficiency.
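
The noise-injection idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract gives no formulas, so the standard variance-preserving forward diffusion step, the linear noise schedule, the 16 kHz sample rate, the frequency-domain filtering used to shape the noise, and the placeholder `envelope` are all assumptions, and the function names `make_alpha_bar`, `shape_noise`, and `diffuse` are hypothetical.

```python
import numpy as np


def make_alpha_bar(num_steps=10, beta_min=1e-4, beta_max=0.02):
    """Cumulative signal-retention factors for an assumed linear beta schedule."""
    betas = np.linspace(beta_min, beta_max, num_steps)
    return np.cumprod(1.0 - betas)


def shape_noise(white, envelope):
    """Filter white noise so its magnitude spectrum follows `envelope`.

    `envelope` has one gain per rfft bin (length n // 2 + 1). How the
    envelope is obtained is not covered here; it is a placeholder.
    """
    spectrum = np.fft.rfft(white)
    shaped = np.fft.irfft(spectrum * envelope, n=white.shape[-1])
    # Rescale to unit variance so the schedule alone sets the noise level.
    return shaped / (shaped.std() + 1e-12)


def diffuse(x0, t, alpha_bar, rng, envelope=None):
    """Assumed forward step: sqrt(a_t) * x0 + sqrt(1 - a_t) * noise."""
    noise = rng.standard_normal(x0.shape)
    if envelope is not None:
        noise = shape_noise(noise, envelope)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise


rng = np.random.default_rng(0)
n = 1024
# Toy stand-ins for a real waveform and a generator output.
real = np.sin(2 * np.pi * 220 * np.arange(n) / 16000)
fake = real + 0.1 * rng.standard_normal(n)

alpha_bar = make_alpha_bar()
t = int(rng.integers(len(alpha_bar)))        # one diffusion step for both inputs
envelope = np.linspace(1.0, 0.1, n // 2 + 1)  # placeholder low-pass-like shape

# Both signals are noised before they would reach the discriminator.
noisy_real = diffuse(real, t, alpha_bar, rng, envelope)
noisy_fake = diffuse(fake, t, alpha_bar, rng, envelope)
```

Passing `envelope=None` falls back to plain white Gaussian noise, which corresponds to the unshaped variant mentioned first in the abstract.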

Keywords

Generative adversarial network (GAN); diffusion process; deep audio synthesis; spectral envelope

Domains

Machine Learning [cs.LG]; Sound [cs.SD]; Signal and Image Processing [eess.SP]

Main file

ICASSP_2024_SpecDiff_GAN___Preprint.pdf (466.56 KB)

Origin: Files produced by the author(s)


https://hal.science/hal-04423979

Submitted on: Monday, January 29, 2024, 13:54:24

Last modified on: Wednesday, February 14, 2024, 15:19:27

Dates and versions

hal-04423979 , version 1 (29-01-2024)

Identifiers

HAL Id : hal-04423979 , version 1

Cite

Teysir Baoueb, Haocheng Liu, Mathieu Fontaine, Jonathan Le Roux, Gael Richard. SpecDiff-GAN: A Spectrally-Shaped Noise Diffusion GAN for Speech and Music Synthesis. IEEE International Conference on Acoustics, Speech and Signal Processing, Apr 2024, Seoul, South Korea. ⟨hal-04423979⟩


Collections

INSTITUT-TELECOM LTCI IDS S2A IP_PARIS

151 views

193 downloads
