Learning Energy-Based Generative Models via Coarse-to-Fine Expanding and Sampling

Yang Zhao, Jianwen Xie, Ping Li
2021 International Conference on Learning Representations  
Energy-based models (EBMs) parameterized by neural networks can be trained by Markov chain Monte Carlo (MCMC) sampling-based maximum likelihood estimation. Despite the recent significant success of EBMs in image generation, current approaches to training EBMs are unstable and have difficulty synthesizing diverse and high-fidelity images. In this paper, we propose to train EBMs via a multistage coarse-to-fine expanding and sampling strategy, which starts with learning a coarse-level EBM
from images at low resolution and then gradually transitions to learning a finer-level EBM from images at higher resolution by expanding the energy function as the learning progresses. The proposed framework is computationally efficient with smooth learning and sampling. It achieves the best performance on image generation amongst all EBMs and is the first successful EBM to synthesize high-fidelity images at 512 × 512 resolution. It can also be useful for image restoration and out-of-distribution detection. Lastly, the proposed framework is further generalized to one-sided unsupervised image-to-image translation and beats baseline methods in terms of model size and training budget. We also present a gradient-based generative saliency method to interpret the translation dynamics.

Published as a conference paper at ICLR 2021.

... et al., 2019)), etc., as well as image-to-image translation (Xie et al., 2021b;c), out-of-distribution detection (Liu et al., 2020) and inverse optimal control (Xu et al., 2019). EBMs are characterized by (i) Simplicity: the maximum likelihood learning of EBMs unifies representation and generation in a single model, and (ii) Explicitness: EBMs provide an explicit density distribution of data by training an energy function that assigns lower values to observed data and higher values to unobserved ones. However, it is still difficult to train an EBM to synthesize diverse and high-fidelity images. The maximum likelihood estimation (MLE) of EBMs requires Markov chain Monte Carlo (MCMC) (Liu, 2008; Barbu & Zhu, 2020) to sample from the model and then updates the model parameters according to the difference between those samples and the observed data. Such an "analysis by synthesis" (Grenander et al., 2007) learning scheme is challenging because the sampling step is neither efficient nor stable.
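The maximum likelihood update described above can be written out explicitly. The following is a sketch of the standard derivation in notation that is assumed here rather than taken from the text: $E_\theta$ denotes the energy function, $Z(\theta)$ the normalizing constant, and $p_{\text{data}}$ the distribution of observed data.

```latex
\begin{align}
p_\theta(x) &= \frac{\exp\big(-E_\theta(x)\big)}{Z(\theta)},
\qquad Z(\theta) = \int \exp\big(-E_\theta(x)\big)\,dx, \\
\nabla_\theta \log p_\theta(x) &= -\nabla_\theta E_\theta(x)
  + \mathbb{E}_{\tilde{x} \sim p_\theta}\big[\nabla_\theta E_\theta(\tilde{x})\big], \\
\nabla_\theta\, \mathbb{E}_{x \sim p_{\text{data}}}\big[\log p_\theta(x)\big]
  &= \mathbb{E}_{\tilde{x} \sim p_\theta}\big[\nabla_\theta E_\theta(\tilde{x})\big]
   - \mathbb{E}_{x \sim p_{\text{data}}}\big[\nabla_\theta E_\theta(x)\big].
\end{align}
```

The second expectation is estimated from the training data, while the first requires samples from the current model, which is where MCMC enters. Ascending this gradient lowers the energy of observed data and raises the energy of synthesized samples.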
In particular, when the energy function is multimodal due to highly varied or high-resolution training data, it is not easy for the MCMC chains to traverse the modes of the learned model. Fortunately, it is common knowledge that the manifold residing in a downsampled low-dimensional image space is smoother than that in the original high-dimensional counterpart. Thus, learning an EBM from low-dimensional data is much more stable and faster than learning from high-dimensional data in terms of convergence (Odena et al., 2017; Gao et al., 2018). Inspired by the above knowledge, we propose to train EBMs via a multistage coarse-to-fine expanding and sampling strategy (CF-EBM). As shown in Figure 1(a), the approach starts with learning a coarse-level EBM on low-resolution images and then smoothly transitions to learning the finer-level EBM by adding new layers that take into account the higher-resolution information as the learning progresses. Gradient-based short-run MCMC, e.g., Langevin dynamics (Neal et al., 2011), is used for sampling. From the modeling aspect, the coarse-level training can be useful for exploring the global structure of an image, while the fine-level training then gradually refines the image details. Recent works have demonstrated the advantages of this incremental learning (Karras et al., 2018; Wang et al., 2018). However, no prior work has focused on the incremental learning of EBMs that incorporates bottom-up representation and top-down sampling in a single net. Besides, as shown in Figure 1(a), the top-down gradient information for synthesis flows from coarse-level layers towards fine-level layers. Thus, during the coarse-to-fine expanding, we can use the coarse-level synthesis to help the fine-level synthesis and thereby stabilize the sampling. Such a coarse-to-fine expanding and sampling scheme is useful for high-fidelity synthesis in several vision tasks. See Figure 1(b).
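The idea of letting coarse-level synthesis help fine-level synthesis can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the quadratic toy energies stand in for learned networks, and the helper names (`langevin`, `quadratic_energy_grad`, `avg_pool2`, `upsample2`, `coarse_to_fine_sample`), the step size, and the step counts are all assumptions chosen for illustration.

```python
import numpy as np

def langevin(x, grad_energy, n_steps, step_size, rng):
    """Short-run Langevin dynamics: x <- x - (s^2 / 2) * grad E(x) + s * noise."""
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x - 0.5 * step_size**2 * grad_energy(x) + step_size * noise
    return x

def quadratic_energy_grad(center, sigma=0.1):
    """Gradient of the toy energy E(x) = ||x - center||^2 / (2 * sigma^2).

    Assumption: a real EBM would use a neural network and autodiff here.
    """
    return lambda x: (x - center) / sigma**2

def avg_pool2(img):
    """Downsample a 2-D array by a factor of 2 with average pooling."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample2(img):
    """Upsample a 2-D array by a factor of 2 with nearest-neighbor copying."""
    return np.kron(img, np.ones((2, 2)))

def coarse_to_fine_sample(fine_center, seed=0):
    rng = np.random.default_rng(seed)
    # Stage 1: sample at low resolution, starting from noise.
    coarse_center = avg_pool2(fine_center)
    x_coarse = rng.standard_normal(coarse_center.shape)
    x_coarse = langevin(x_coarse, quadratic_energy_grad(coarse_center), 60, 0.05, rng)
    # Stage 2: initialize the fine-level chain from the upsampled coarse sample,
    # so fewer steps are needed than when starting from noise.
    x_fine = upsample2(x_coarse)
    x_fine = langevin(x_fine, quadratic_energy_grad(fine_center), 30, 0.05, rng)
    return x_coarse, x_fine

# Toy 8x8 "image" that plays the role of the mode of the fine-level energy.
fine_center = np.add.outer(np.linspace(0.0, 1.0, 8), np.linspace(0.0, 1.0, 8)) / 2.0
x_coarse, x_fine = coarse_to_fine_sample(fine_center)
```

In this toy setting the fine-level chain only has to correct the detail lost by downsampling, which mirrors the division of labor between global structure and image detail described in the text.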
Furthermore, we propose a one-sided energy-based unsupervised image-to-image translation method and scale it up to high resolution. The approach is immediately available with the CF-EBM by using its iterative Langevin dynamics, without the need for cycle consistency or geometry constraints (Fu et al., 2019). Specifically, we learn an EBM of the target domain with Langevin dynamics initialized by examples from the source domain. The resulting translator is the short-run MCMC. Compared with prior works (Huang et al., 2018; Park et al., 2020) that learn black-box encoder-decoder networks between domains, our method is much more interpretable in the sense that it can be explained by a visualization method (Simonyan et al., 2014; Adebayo et al., 2018) that uses gradients to visualize the most essential regions, i.e., the generative saliency, when translating an image from the source domain to the target domain. See Figure 1(c).

The contributions of our paper can be summarized as:
• To the best of our knowledge, this is the first work that trains EBMs under the "analysis by synthesis" scheme via a multistage coarse-to-fine expanding and sampling strategy. Besides, we propose several essential techniques for improving EBMs, e.g., smooth activations. In particular, our work is the first to train a pure EBM for synthesizing 512 × 512 images.
• We propose a novel energy-based unsupervised image-to-image translation approach, which is essentially different from all other existing GAN-based approaches. We demonstrate noticeable results in terms of both translation quality and efficiency of time and memory.
• We conduct extensive experiments to validate our approach, including image generation, denoising, inpainting, out-of-distribution detection and unsupervised image translation. Strong results show that our method outperforms or is competitive with the prior art.

The rest of the paper is organized as follows. Section 2 summarizes related works and how the proposed method differs from the prior art. Section 3 introduces the proposed methodology in detail. Section 4 presents extensive experiments to test our method. Section 5 concludes the paper and discusses some future research directions.
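The energy-based translation described above, in which Langevin dynamics on a target-domain energy is initialized with a source-domain image, can be sketched on a toy problem. This is only an illustration under stated assumptions: the quadratic `energy_grad` replaces a learned target-domain EBM, and the saliency map, taken here as the mean absolute energy gradient with respect to the pixels along the chain, is one simple gradient-based choice and not necessarily the paper's exact definition. The names `translate`, `target_mean`, and all hyperparameters are hypothetical.

```python
import numpy as np

def energy_grad(x, target_mean, sigma=0.1):
    """Gradient of a toy target-domain energy E(x) = ||x - target_mean||^2 / (2 sigma^2)."""
    return (x - target_mean) / sigma**2

def translate(source, target_mean, n_steps=60, step_size=0.05, seed=0):
    """Run short-run Langevin dynamics from a source image and record saliency."""
    rng = np.random.default_rng(seed)
    x = source.copy()
    saliency = np.zeros_like(source)
    for _ in range(n_steps):
        g = energy_grad(x, target_mean)
        # Pixels with large gradient magnitude are the ones being changed most.
        saliency += np.abs(g)
        x = x - 0.5 * step_size**2 * g + step_size * rng.standard_normal(x.shape)
    return x, saliency / n_steps

# Source "image" is blank; the toy target domain differs only in a central patch.
source = np.zeros((8, 8))
target_mean = np.zeros((8, 8))
target_mean[2:6, 2:6] = 1.0
translated, saliency = translate(source, target_mean)

# Compare saliency inside and outside the patch that had to change.
mask = target_mean > 0.5
center_saliency = saliency[mask].mean()
border_saliency = saliency[~mask].mean()
```

In this toy case the saliency concentrates on the central patch, the only region that must change to move the source image into the target domain.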
dblp:conf/iclr/ZhaoXL21 fatcat:v5qs2rzaevgxnc53g5dtpqaf3m