Sergan: Speech Enhancement Using Relativistic Generative Adversarial Networks with Gradient Penalty

Deepak Baby, Sarah Verhulst
2019 ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)  
Popular neural network-based speech enhancement systems operate on the magnitude spectrogram and ignore the phase mismatch between the noisy and clean speech signals. Recently, conditional generative adversarial networks (cGANs) have shown promise in addressing the phase mismatch problem by directly mapping the raw noisy speech waveform to the underlying clean speech signal. However, stabilizing and training cGAN systems is difficult and they still fall short of the performance achieved by
spectral enhancement approaches. This paper introduces relativistic GANs with a relativistic cost function at the discriminator and a gradient penalty to improve time-domain speech enhancement. Simulation results show that relativistic discriminators make cGAN training more stable and yield a better generator network, which improves speech enhancement performance.

Index Terms: speech enhancement, relativistic GAN, convolutional neural networks

This work was funded with support from the EU Horizon 2020 programme under grant agreement No 678120 (RobSpear). We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.

problem, this paper investigates the use of generative neural networks which can directly map the raw noisy speech waveform to the underlying clean speech waveform. Recently, generative adversarial network (GAN)-based models [9] have been explored for raw speech waveform enhancement [10, 11, 12, 13, 14]. A GAN consists of a generative model or generator network (G) and a discriminator network (D) that play a min-max game against each other. The work in [12] demonstrated that the generator G alone, trained with an L1 loss, can perform similarly to GANs that adversarially train G to fool D. There is therefore a growing debate on the suitability of GANs for speech enhancement. Part of this concern is attributed to their complex training, which requires finding a Nash equilibrium of a nonconvex game between G and D [9, 15]; the quality of the generated samples depends critically on the equilibrium reached. This paper investigates whether an improved discriminator could lead to a better generator that yields a cleaner speech signal. We introduce SERGANs: speech enhancement systems that make use of relativistic GANs (RGANs) [16]. RGANs use a relativistic loss function at the discriminator and have been shown to be successful in image generation [16].
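To make the idea of a relativistic discriminator loss concrete, the following is a minimal sketch of the binary cross-entropy form of the relativistic GAN objective as it is usually stated for RGANs [16]: the discriminator is scored on whether a real sample looks more realistic than a paired fake one, rather than on each sample in isolation. This excerpt does not give the paper's exact loss definitions, so the function names, the pairing of scores, and the absence of conditioning, L1 terms, and gradient penalty are all assumptions made for illustration.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def relativistic_d_loss(real_scores, fake_scores):
    """Discriminator loss: mean of -log sigmoid(C(real) - C(fake)).

    real_scores / fake_scores are hypothetical raw (pre-sigmoid) critic
    outputs C(.) for paired real and generated samples.
    """
    pairs = list(zip(real_scores, fake_scores))
    return -sum(log_sigmoid(r - f) for r, f in pairs) / len(pairs)


def relativistic_g_loss(real_scores, fake_scores):
    """Generator loss: the same expression with real and fake swapped."""
    pairs = list(zip(real_scores, fake_scores))
    return -sum(log_sigmoid(f - r) for r, f in pairs) / len(pairs)
```

When the critic cannot tell the two sets apart (equal scores), both losses equal log 2; as the critic rates real samples higher than fake ones, the discriminator loss falls and the generator loss rises.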
This paper investigates whether RGANs can yield a better generator network for speech enhancement. We also investigate the use of a gradient penalty in D [17] to stabilize such systems. We evaluate and compare several relativistic models, namely relativistic GANs and relativistic average GANs, with mean-square error and binary cross-entropy loss functions and with a gradient penalty in the discriminator. In addition, we introduce Wasserstein GANs [18] for speech enhancement. Simulation results show that SERGAN models with gradient penalty improve speech enhancement performance and also yield more stable GAN training. To the best of our knowledge, this is the first time the standard binary cross-entropy loss has been shown to be successful for GAN-based speech enhancement.

SPEECH ENHANCEMENT USING GAN

Speech enhancement systems aim to estimate the clean speech signal x from the noisy mixture y = x + w, where w is the added background noise.
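The additive model y = x + w can be illustrated with a short sketch that builds a noisy mixture from a clean signal and a noise signal. Scaling the noise to reach a target signal-to-noise ratio is an assumption added here for illustration (the excerpt only states the additive model), and `mix_at_snr` is a hypothetical helper name.

```python
import math
import random


def mix_at_snr(clean, noise, snr_db):
    """Return y = x + w, where w is `noise` scaled to the requested SNR.

    The gain g is chosen so that 10*log10(P_x / P_w) == snr_db with w = g*noise.
    This scaling step is an illustrative assumption, not taken from the paper.
    """
    p_clean = sum(s * s for s in clean) / len(clean)
    p_noise = sum(s * s for s in noise) / len(noise)
    gain = math.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return [x + gain * w for x, w in zip(clean, noise)]


# Usage: a synthetic sine "clean" signal plus Gaussian noise at 5 dB SNR.
rng = random.Random(0)
x = [math.sin(0.1 * n) for n in range(1000)]
w = [rng.gauss(0.0, 1.0) for _ in range(1000)]
y = mix_at_snr(x, w, snr_db=5.0)
```

An enhancement system would then take `y` as input and try to recover `x`.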
doi:10.1109/icassp.2019.8683799 dblp:conf/icassp/BabyV19 fatcat:mvjq67d245deretpd3mm4nfb5u