
Fast Spectrogram Inversion using Multi-head Convolutional Neural Networks [article]

Sercan O. Arik, Heewoo Jun, Gregory Diamos
2018 arXiv   pre-print
We propose the multi-head convolutional neural network (MCNN) architecture for waveform synthesis from spectrograms.  ...  Nonlinear interpolation in MCNN is employed with transposed convolution layers in parallel heads.  ...  We propose the multi-head convolutional neural network (MCNN) architecture.  ... 
arXiv:1808.06719v1 fatcat:cugkvqp55fgbzab3gjxdufp27u
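As an aside on the upsampling operation this entry mentions: a transposed convolution scatters a scaled copy of the kernel into the output for each input sample, spaced `stride` apart, which increases temporal resolution. A minimal 1-D sketch with an illustrative kernel and stride (not the MCNN implementation, just the basic operation):

```python
def conv_transpose_1d(x, kernel, stride=2):
    """Naive 1-D transposed convolution: each input sample scatters a
    scaled copy of the kernel into the output, `stride` apart."""
    out_len = (len(x) - 1) * stride + len(kernel)
    out = [0.0] * out_len
    for i, xi in enumerate(x):
        for j, kj in enumerate(kernel):
            out[i * stride + j] += xi * kj
    return out

# With stride 2 the 3-sample input becomes (3-1)*2 + 3 = 7 samples.
up = conv_transpose_1d([1.0, 2.0, 3.0], [0.5, 1.0, 0.5], stride=2)
print(len(up))  # 7
```

With stride 2 and overlapping kernel taps, adjacent outputs interpolate between input samples, which is why MCNN's abstract describes its parallel transposed-convolution heads as performing nonlinear interpolation.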

WaveGlow: A Flow-based Generative Network for Speech Synthesis [article]

Ryan Prenger, Rafael Valle, Bryan Catanzaro
2018 arXiv   pre-print
In this paper we propose WaveGlow: a flow-based network capable of generating high quality speech from mel-spectrograms.  ...  WaveGlow is implemented using only a single network, trained using only a single cost function: maximizing the likelihood of the training data, which makes the training procedure simple and stable.  ...  MCNN for spectrogram inversion [8] produces audio using one multi-headed convolutional network.  ... 
arXiv:1811.00002v1 fatcat:etqou46otjfkdgdi6uygcjooji

Neural Speech Synthesis with Transformer Network

Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, Ming Liu
2019 Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33
and inference; 2) hard to model long dependency using current recurrent neural networks (RNNs).  ...  Inspired by the success of Transformer network in neural machine translation (NMT), in this paper, we introduce and adapt the multi-head attention mechanism to replace the RNN structures and also the original  ...  Recently, ClariNet (Ping, Peng, and Chen 2018), a fully convolutional text-to-wave neural architecture, is proposed to enable the fast end-to-end training from scratch.  ... 
doi:10.1609/aaai.v33i01.33016706 fatcat:325z3grkm5bh7ozwmbbzjjizdq

Neural Speech Synthesis with Transformer Network [article]

Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, Ming Liu, Ming Zhou
2019 arXiv   pre-print
and inference; 2) hard to model long dependency using current recurrent neural networks (RNNs).  ...  Inspired by the success of Transformer network in neural machine translation (NMT), in this paper, we introduce and adapt the multi-head attention mechanism to replace the RNN structures and also the original  ...  Recently, ClariNet (Ping, Peng, and Chen 2018), a fully convolutional text-to-wave neural architecture, is proposed to enable the fast end-to-end training from scratch.  ... 
arXiv:1809.08895v3 fatcat:2lrukvnil5ds7b5nxj7y46ucle

Table of Contents

2019 IEEE Signal Processing Letters  
Channappayya 89  Fast Spectrogram Inversion Using Multi-Head Convolutional Neural Networks, S.Ö. Arık, H. Jun, and G.  ...  Kim 109  Edge-Aware Convolution Neural Network Based Salient Object Detection  ... 
doi:10.1109/lsp.2018.2880624 fatcat:jk53rhxerzg4lidtunzeqj2t4m

TSTNN: Two-stage Transformer based Neural Network for Speech Enhancement in the Time Domain [article]

Kai Wang, Bengbeng He, Wei-Ping Zhu
2021 arXiv   pre-print
In this paper, we propose a transformer-based architecture, called two-stage transformer neural network (TSTNN) for end-to-end speech denoising in the time domain.  ...  Finally, the decoder uses the masked encoder feature to reconstruct the enhanced speech.  ...  Lu, "Complex spectrogram enhancement by convolutional neural network with multi-metrics learning," in 2017 IEEE 27th International Workshop on Machine Learning for Signal Processing (MLSP).  ... 
arXiv:2103.09963v1 fatcat:7wspvk2imnfr7cw4wu6sasmboy

A light-weight full-band speech enhancement model [article]

Qinwen Hu, Zhongshu Hou, Xiaohuai Le, Jing Lu
2022 arXiv   pre-print
Deep neural network based full-band speech enhancement systems face challenges of high demand of computational resources and imbalanced frequency distribution.  ...  ., a learnable spectral compression mapping for more effective high-band spectral information compression, and the utilization of the multi-head attention mechanism for more effective modeling of the global  ...  The fast Fourier transform length is thus 1200 points, resulting in a dimension of 601 for frequency features fed into the network. A Hanning window is used when performing the STFT.  ... 
arXiv:2206.14524v2 fatcat:odqovwuhbfgkfjiqxaqldatvki
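The analysis front-end this entry describes is fully determined by its stated numbers: a 1200-point FFT yields 1200/2 + 1 = 601 one-sided frequency bins per frame. A minimal pure-Python sketch of one Hann-windowed analysis frame (naive O(n²) DFT, for illustration only; the frame contents here are random, not from the paper):

```python
import math, cmath, random

n_fft = 1200              # FFT length stated in the entry
n_bins = n_fft // 2 + 1   # one-sided spectrum: 601 frequency features

def hann(n):
    # Hann ("Hanning") window, as used for the STFT in the paper
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def frame_spectrum(frame):
    """Magnitudes of the one-sided DFT of one windowed frame (naive O(n^2))."""
    w = hann(len(frame))
    x = [s * wi for s, wi in zip(frame, w)]
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / len(x))
                    for t in range(len(x))))
            for k in range(n_bins)]

frame = [random.gauss(0, 1) for _ in range(n_fft)]
spec = frame_spectrum(frame)
print(len(spec))  # 601
```

In practice one would use an FFT routine rather than this direct DFT; the point is only that the 601-dimensional feature vector per frame follows directly from the 1200-point transform.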

K-Space Transformer for Fast MRI Reconstruction with Implicit Representation [article]

Ziheng Zhao, Tianjiao Zhang, Weidi Xie, Yanfeng Wang, Ya Zhang
2022 arXiv   pre-print
This paper considers the problem of fast MRI reconstruction.  ...  We adopt an implicit representation of spectrogram, treating spatial coordinates as inputs, and dynamically query the partially observed measurements to complete the spectrogram, i.e. learning the inductive  ...  by applying a prediction layer (g(•)) on the output from transformer decoders (ψ k-dec (•)), consisting of multi-head self-attention (MHSA), multi-head cross-attention (MHCA), a feed-forward network (  ... 
arXiv:2206.06947v1 fatcat:nac4gvyz4zectpvoxtyyiaasra

A Survey of Sound Source Localization with Deep Learning Methods [article]

Pierre-Amaury Grumiaux, Srđan Kitić, Laurent Girin, Alexandre Guérin
2022 arXiv   pre-print
We provide an exhaustive topography of the neural-based localization literature in this context, organized according to several aspects: the neural network architecture, the type of input features, the  ...  output strategy (classification or regression), the types of data used for model training and evaluation, and the model training strategy.  ...  Convolutional recurrent neural networks CRNNs are neural networks containing one or more convolutional layers and one or more recurrent layers.  ... 
arXiv:2109.03465v3 fatcat:tq5vmgikwrenlbqba4lrqo3pee

An Empirical Study on End-to-End Singing Voice Synthesis with Encoder-Decoder Architectures [article]

Dengfeng Ke and Yuxing Lu and Xudong Liu and Yanyan Xu and Jing Sun and Cheng-Hao Cai
2021 arXiv   pre-print
With the rapid development of neural network architectures and speech processing models, singing voice synthesis with neural networks is becoming the cutting-edge technique of digital music production.  ...  In this work, in order to explore how to improve the quality and efficiency of singing voice synthesis, we use encoder-decoder neural models and a number of vocoders to achieve singing voice  ...  The pitch and spectrogram features are used to train the same neural network with different hyper-parameters, and the neural network is used to learn the mapping relationships between the two features  ... 
arXiv:2108.03008v1 fatcat:p73gewntybebbi2pggotyyqtwy

Multi-instrument Music Synthesis with Spectrogram Diffusion [article]

Curtis Hawthorne, Ian Simon, Adam Roberts, Neil Zeghidour, Josh Gardner, Ethan Manilow, Jesse Engel
2022 arXiv   pre-print
We use a simple two-stage process: MIDI to spectrograms with an encoder-decoder Transformer, then spectrograms to audio with a generative adversarial network (GAN) spectrogram inverter.  ...  In this work, we focus on a middle ground of neural synthesizers that can generate audio from MIDI sequences with arbitrary combinations of instruments in realtime.  ...  Spectrograms to Audio To translate the model's magnitude spectrogram output to audio, we use a convolutional spectrogram inversion network as proposed in MelGAN [2] .  ... 
arXiv:2206.05408v2 fatcat:nt7sl2qirjcptlx56lkfpi62rm

Natural statistics as inference principles of auditory tuning in biological and artificial midbrain networks

Sangwook Park, Angeles Salles, Kathryne Allen, Cynthia F. Moss, Mounya Elhilali
2021 eNeuro  
Natural statistics as inference principles of auditory tuning in biological and artificial midbrain networks.  ...  Fig. 5: Operations using multi-scale filters. A, convolution using multi-scale filters. B, transposed convolution using multi-scale filters. Fig. 6: STRF calculation.  ...  A transposed convolution using multi-scale filters is performed in three steps (Fig. 5B).  ... 
doi:10.1523/eneuro.0525-20.2021 fatcat:z76wcmm26bhdrkq4mp23qinstm

Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram [article]

Ryuichi Yamamoto, Eunwoo Song, Jae-Min Kim
2020 arXiv   pre-print
We propose Parallel WaveGAN, a distillation-free, fast, and small-footprint waveform generation method using a generative adversarial network.  ...  In the proposed method, a non-autoregressive WaveNet is trained by jointly optimizing multi-resolution spectrogram and adversarial loss functions, which can effectively capture the time-frequency distribution  ...  The model consisted of a six-layer encoder and a sixlayer decoder, each was based on multi-head attention (with eight heads).  ... 
arXiv:1910.11480v2 fatcat:uh6nagxx7fan5lf4gknmii3qhi

A^3T: Alignment-Aware Acoustic and Text Pretraining for Speech Synthesis and Editing [article]

He Bai, Renjie Zheng, Junkun Chen, Xintong Li, Mingbo Ma, Liang Huang
2022 arXiv   pre-print
Experiments show A^3T outperforms SOTA models on speech editing, and improves multi-speaker speech synthesis without the external speaker verification model.  ...  In this way, the pretrained model can generate high-quality reconstructed spectrograms, which can be applied to speech editing and unseen-speaker TTS directly.  ...  FastSpeech: Fast, robust and controllable text to speech. arXiv preprint arXiv:1905.09263, 2019.  ... 
arXiv:2203.09690v2 fatcat:h44bnzrjerge7b33srgsb6txii

Glow-TTS: A Generative Flow for Text-to-Speech via Monotonic Alignment Search [article]

Jaehyeon Kim, Sungwon Kim, Jungil Kong, Sungroh Yoon
2020 arXiv   pre-print
We further show that our model can be easily extended to a multi-speaker setting.  ...  Recently, text-to-speech (TTS) models such as FastSpeech and ParaNet have been proposed to generate mel-spectrograms from text in parallel.  ...  Hyper-parameters for Glow-TTS (LJ dataset): encoder multi-head attention hidden dimension 192; encoder multi-head attention heads 2; encoder multi-head attention maximum relative position embedding dimension  ... 
arXiv:2005.11129v2 fatcat:efmkcdp6j5hwjbnd22nqir6pce