Energy-Based Model Meets Deep Learning

Generative Modeling of Data

Statistical modeling of high-dimensional signals, such as images, videos, and 3D shapes, is a very challenging task in computer vision and machine learning. Even though the Generative Adversarial Network (GAN) and the Variational Auto-Encoder (VAE) have become very popular for data generation, they still have drawbacks, such as mode collapse in GANs and posterior collapse in VAEs. Moreover, these models require an assisting network to train the generator, which makes them less natural. Most importantly, GANs and VAEs are not able to provide explicit density functions of the data; in other words, their data likelihood functions do not have closed forms. Despite the popularity of GANs and VAEs, researchers have kept looking for statistical models that can explicitly represent the density of visual data and help us understand the data space. This research direction is fundamental to the fields of statistics and computer vision.

Traditional Energy-Based Models

The energy-based model (or Markov random field model), with its origin in statistical physics, provides a way to define an unnormalized probability density of images through a well-designed and trainable potential energy function. The energy function usually captures some statistical properties of the input data. The parameters of the energy function can be learned via maximum likelihood estimation, and data generation can be achieved by drawing samples from the learned model via Markov chain Monte Carlo (MCMC) methods, such as Langevin dynamics.
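As a concrete illustration, here is a minimal, self-contained sketch of Langevin dynamics sampling from an unnormalized density p(x) ∝ exp(-E(x)). The quadratic energy E(x) = ½‖x‖², the step size, and the chain length are our own toy choices (the stationary distribution is then a standard normal), not anything from the models discussed here:

```python
import numpy as np

rng = np.random.default_rng(0)

def energy_grad(x):
    # Gradient of the toy energy E(x) = 0.5 * ||x||^2
    return x

def langevin_sample(x, n_steps=200, step=0.1):
    """One Langevin chain: x <- x - (step/2) * grad E(x) + sqrt(step) * noise."""
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x - 0.5 * step * energy_grad(x) + np.sqrt(step) * noise
    return x

# Run 1000 parallel chains from a far-off initialization; after enough
# steps the samples approximately follow p(x) ∝ exp(-E(x)), i.e. N(0, I).
samples = langevin_sample(np.full((1000, 2), 5.0))
print(samples.mean(), samples.var())  # roughly 0 and 1
```

The same recipe applies unchanged when the energy is a deep network: only `energy_grad` changes, and it is then computed by back-propagation.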

In the past few decades, researchers have devoted themselves to developing energy-based image models. The Hopfield network (Hopfield 1982) adapted the Ising energy model into a model that can represent arbitrary observed data. The FRAME (Filters, Random field, And Maximum Entropy) model (Zhu, Wu, and Mumford 1998) and the RBM (Restricted Boltzmann Machine) (Hinton 2012) adopted energy functions with greater representational capacity. More specifically, the RBM uses binary-valued hidden units that have a joint density with the observed units, such as image pixels, and its energy function, defined on both hidden and visible units, is analogous to that of a Hopfield network. FRAME uses convolutional filters and parameterizes the energy function as one-dimensional non-linear transformations of linear filter responses. Following an article titled “A Tutorial on Energy-Based Learning” (LeCun 2006), people in machine learning tend to call this type of model an Energy-based model. In physics, such models are called Markov Random Fields or Gibbs distributions, and in statistics, they are referred to as Exponential Family Models.
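The RBM's bilinear energy over visible and hidden units can be written down in a few lines. This is a sketch of the standard formulation, E(v, h) = -a·v - b·h - vᵀWh, with illustrative (untrained) weights of our own choosing; the free-energy function shows how the binary hidden units can be summed out analytically:

```python
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 3
W = rng.standard_normal((n_visible, n_hidden)) * 0.1  # pairwise couplings
a = np.zeros(n_visible)   # visible biases
b = np.zeros(n_hidden)    # hidden biases

def rbm_energy(v, h):
    """Joint energy of the RBM; p(v, h) is proportional to exp(-E(v, h))."""
    return -a @ v - b @ h - v @ W @ h

def free_energy(v):
    """Marginal energy after summing out the binary h analytically:
    F(v) = -a.v - sum_j log(1 + exp(b_j + (v W)_j)), so p(v) ∝ exp(-F(v))."""
    return -a @ v - np.sum(np.log1p(np.exp(b + v @ W)))

v = rng.integers(0, 2, n_visible).astype(float)
h = rng.integers(0, 2, n_hidden).astype(float)
print(rbm_energy(v, h), free_energy(v))
```

The analytic marginalization over hidden units is exactly what makes the RBM's bipartite ("restricted") structure convenient compared with a fully connected Hopfield-style energy.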

Using a Neural Network as the Energy Function

By 2012, deep learning, especially the convolutional neural network (ConvNet or CNN), had become the most successful predictive learning machine in computer vision. Can we turn the discriminative ConvNet into a generative ConvNet? The answer is “Yes”! In early 2016, a deep energy-based model with an energy function parameterized by a ConvNet was first proposed in the research paper “A Theory of Generative ConvNet (ICML 2016)” by Song-Chun Zhu’s group from UCLA. See Figure 1 for an illustration of the energy function parameterized by a ConvNet. The ConvNet takes an image as input and maps it to a scalar energy by feedforward computation, where all the parameters of the ConvNet are trained by maximum likelihood estimation (MLE) without relying on any class labels.
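To make the "image in, scalar energy out" mapping concrete, here is a toy sketch of our own, not the paper's architecture: a single convolutional filter, a ReLU, and a sum stand in for the deep ConvNet, producing one scalar per image:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d_valid(image, kernel):
    """Plain 'valid' 2D cross-correlation with explicit loops (no padding)."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def energy(image, kernel):
    """E(x) = -sum(ReLU(conv(x))): a feedforward map from image to a scalar.
    A real model stacks many such conv/ReLU/down-sampling layers."""
    return -np.sum(np.maximum(conv2d_valid(image, kernel), 0.0))

image = rng.standard_normal((8, 8))
kernel = rng.standard_normal((3, 3)) * 0.1
print(energy(image, kernel))  # a single scalar energy value
```

In the actual model the gradient of this scalar with respect to the image, needed by Langevin dynamics, is obtained for free by back-propagation through the same network.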

Figure 1: The energy-based model with ConvNet structure as an energy function proposed in 2016

They call it the Energy-Based Generative ConvNet model, because the form of the model can be derived from the discriminative ConvNet. The model is also called the deep FRAME model, because it is a deep generalization of the FRAME model of 1998. The model is trained by Langevin dynamics-based maximum likelihood estimation, which iterates (1) a sampling step: synthesizing new examples via Langevin dynamics from the current model, and (2) a learning step: updating the model parameters by maximum likelihood learning, where the gradient takes the form of a difference between the observed examples and the synthesized examples. Song-Chun Zhu’s group recognized that both steps can be computed efficiently by back-propagation, which makes the training of this type of deep energy-based model possible and efficient.
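The two-step alternation above can be sketched numerically on a toy 1-D exponential-family model instead of a ConvNet. Everything below is our own construction for illustration: the energy is E_θ(x) = θx², so data drawn from N(0, 1) corresponds to the true parameter θ = 0.5, and the learning gradient is the difference of sufficient statistics over synthesized versus observed examples:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal(2000)   # observed examples from N(0, 1)
theta = 0.1                        # initial model parameter

def langevin(x, theta, n_steps=30, step=0.05):
    """Sampling step: approximate draws from p_theta(x) ∝ exp(-theta * x^2)."""
    for _ in range(n_steps):
        grad = 2.0 * theta * x  # gradient of E_theta(x) = theta * x^2
        x = x - 0.5 * step * grad + np.sqrt(step) * rng.standard_normal(x.shape)
    return x

x_syn = rng.standard_normal(2000)  # persistent synthesis chains
for _ in range(200):
    # (1) Sampling step: synthesize examples from the current model.
    x_syn = langevin(x_syn, theta)
    # (2) Learning step: the log-likelihood gradient is the difference
    #     between synthesized and observed statistics of x^2.
    theta += 0.05 * (np.mean(x_syn ** 2) - np.mean(data ** 2))

print(theta)  # should approach the true value 0.5
```

In the deep case the only changes are that E is a ConvNet, the statistic difference becomes a difference of back-propagated parameter gradients, and x lives in image space.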

Figure 2: Langevin dynamics-based maximum likelihood estimation of the energy-based generative ConvNet

The model proposed in 2016 adopted a simple ConvNet energy function, consisting of multiple layers of convolution, down-sampling, and ReLU non-linearity. The model generated state-of-the-art realistic texture and object image patterns in 2016. See Figure 3 for an illustration of examples generated by Langevin dynamics from the learned models.

Figure 3: Image generation by Langevin dynamics from energy-based models with ConvNets as potential energy functions in 2016 (by Song-Chun Zhu’s group)

Energy-Based Generative Neural Network

The idea of combining energy-based models, deep neural networks, and Langevin dynamics provides an elegant, efficient, and powerful way to synthesize high-quality high-dimensional data. Most importantly, it has established a new type of generative framework: the Energy-Based Generative Neural Network. This is a new way to train deep neural networks in an unsupervised manner. Moreover, compared with deep generative models that use a top-down generator, such as VAEs and GANs, this energy-based framework uses a bottom-up neural network as the energy function and generates new examples by an implicit process, i.e., Langevin dynamics.

Figure 4: Energy function is parameterized by a deep neural network

Under this framework, Song-Chun Zhu’s group has further developed the Energy-Based Generative Spatial-Temporal ConvNet for video patterns in this research paper in 2017, the Energy-Based 3D ConvNet for voxelized shape patterns in this research paper in 2018, as well as the Energy-Based Generative PointNet for unordered point sets in 2020. Unlike GANs and VAEs, this approach does not require an assisting neural network during training, and it offers an explicit density of the data. The maximum likelihood learning of the model does not suffer from issues like mode collapse in GANs or posterior collapse in VAEs. Figure 5 displays some synthesized results for different types of data.

Figure 5: Synthesizing images, videos, 3D volumetric shapes, and point clouds by Langevin dynamics from the learned Energy-based Generative Neural Networks

Generative Cooperative Network

In 2016, Song-Chun Zhu’s group proposed a Cooperative Learning framework, in which a deep energy-based model and a deep latent variable model are trained jointly via MCMC teaching, in this paper. They called the model the Cooperative Network (CoopNets). As we know, in order to learn energy-based models, such as energy-based generative ConvNet models, synthesized examples need to be sampled from the current model using Langevin dynamics, which is time-consuming. A generator model, a much more efficient sampler that produces synthesized examples via non-iterative direct ancestral sampling, can be recruited to initialize the Langevin dynamics of the energy-based model. In return, the energy-based model teaches the generator how to mimic the Markov chain Monte Carlo (MCMC) transitions.

Figure 6 shows the flow chart of the Cooperative Learning algorithm. The energy-based model is called the descriptor model, while the latent variable model is called the generator model. The descriptor and the generator cooperate with each other like a teacher and a student. The generator plays the role of the student: it generates the initial draft of the synthesized examples. The descriptor plays the role of the teacher: it revises the initial draft by running a number of Langevin revisions. The descriptor learns from the outside review (the training examples), where the learning gradient is in the form of the difference between the observed examples and the revised synthesized examples. The generator learns from how the descriptor’s MCMC revises the initial draft, by reconstructing the revised draft. That is, the descriptor model (teacher) distills its knowledge to the generator model (student) via MCMC, and this is called MCMC teaching.
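The teacher-student loop above can be sketched on scalar toy models. This is our own construction, not the CoopNets code: the generator ("student") is x = a·z + b with z ~ N(0, 1), and the descriptor ("teacher") is the energy E(x) = θ(x - μ)², with μ fixed at the data mean for simplicity:

```python
import numpy as np

rng = np.random.default_rng(0)
data = 2.0 + 0.5 * rng.standard_normal(5000)   # observations ~ N(2, 0.25)
mu = data.mean()
theta, a, b = 1.0, 1.0, 0.0                    # initial parameters

for _ in range(300):
    # Student: generate an initial draft by direct ancestral sampling.
    z = rng.standard_normal(1000)
    draft = a * z + b
    # Teacher: revise the draft with a few Langevin steps under E.
    x = draft.copy()
    for _ in range(20):
        grad = 2.0 * theta * (x - mu)
        x = x - 0.5 * 0.02 * grad + np.sqrt(0.02) * rng.standard_normal(x.shape)
    # Descriptor learning: observed vs. revised synthesized statistics.
    theta += 0.2 * (np.mean((x - mu) ** 2) - np.mean((data - mu) ** 2))
    # Generator learning (MCMC teaching): fit the revised draft from the
    # same z by a gradient step on the least-squares loss for x ~ a*z + b.
    a += 0.1 * np.mean((x - draft) * z)
    b += 0.1 * np.mean(x - draft)

print(theta, a, b)  # b moves toward the data mean 2.0
```

The key point the sketch preserves is the division of labor: the generator supplies cheap initializations, the Langevin revision supplies the corrections, and each model is updated from its own maximum likelihood gradient.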

Figure 6: A flow chart of the cooperative learning algorithm

The cooperative learning process interweaves the existing maximum likelihood learning algorithms of the two models; therefore the training is stable and does not suffer from mode collapse. Figure 7 displays some examples generated by the learned cooperative networks.

Figure 7: Synthesized results by Generative Cooperative Networks

The cooperative learning framework can also be easily generalized to a conditional version that is very useful for different types of computer vision tasks, such as supervised image-to-image translation and image in-painting. A conditional cooperative network that trains both the conditional generator model and the conditional energy-based model simultaneously via MCMC teaching was proposed in this research paper in 2019. Figure 8 shows an application of the conditional cooperative network to image recovery.

Figure 8: Conditional cooperative learning for image recovery

New Direction of Deep Generative Models

The pioneering works from Song-Chun Zhu’s group at UCLA have shown that deep energy-based generative models with modern neural networks as energy functions can represent the probability density functions of high-dimensional data, and can generate meaningful, realistic new examples by Langevin dynamics. Their works also demonstrate that the models are useful for applications in computer vision, for example, data recovery, unsupervised feature learning for classification, in-painting, and super-resolution. This line of work is inspiring other researchers in computer vision, machine learning, and statistics to consider the Energy-Based Generative Neural Network and the Generative Cooperative Network (CoopNets) as powerful new deep generative models for data generation and representation learning.

Figure 9: Connections between different learning frameworks

Reading Sources: Papers, Survey, and Textbook

Some useful reading sources are listed in the reference section below: (1) research papers that originally propose the models discussed above, (2) survey papers that present good summaries of them, and (3) a textbook that covers both MCMC-based generative frameworks and is used for teaching purposes in graduate schools.


Original Research Papers:

[1] A Theory of Generative ConvNet. Jianwen Xie *, Yang Lu *, Song-Chun Zhu, Ying Nian Wu (ICML 2016)

[2] Synthesizing Dynamic Pattern by Spatial-Temporal Generative ConvNet. Jianwen Xie, Song-Chun Zhu, Ying Nian Wu (CVPR 2017)

[3] Learning Descriptor Networks for 3D Shape Synthesis and Analysis. Jianwen Xie *, Zilong Zheng *, Ruiqi Gao, Wenguan Wang, Song-Chun Zhu, Ying Nian Wu (CVPR 2018)

[4] Learning Generative ConvNets via Multigrid Modeling and Sampling. Ruiqi Gao*, Yang Lu*, Junpei Zhou, Song-Chun Zhu, and Ying Nian Wu (CVPR 2018).

[5] On Learning Non-Convergent Non-Persistent Short-Run MCMC Toward Energy-Based Model. E Nijkamp, M Hill, Song-Chun Zhu, and Ying Nian Wu (NeurIPS 2019)

[6] On the anatomy of MCMC-based maximum likelihood learning of energy-based models. Erik Nijkamp*, Mitch Hill*, Tian Han, Song-Chun Zhu, and Ying Nian Wu (AAAI 2020)

[7] Generative PointNet: Energy-Based Learning on Unordered Point Sets for 3D Generation, Reconstruction and Classification. Jianwen Xie, Yifei Xu, Zilong Zheng, Song-Chun Zhu, Ying Nian Wu (ArXiv 2020)

[8] Cooperative Training of Descriptor and Generator Networks. Jianwen Xie, Yang Lu, Ruiqi Gao, Song-Chun Zhu, Ying Nian Wu (ArXiv 2016, TPAMI 2018)

[9] Cooperative Learning of Energy-Based Model and Latent Variable Model via MCMC Teaching. Jianwen Xie, Yang Lu, Ruiqi Gao, Ying Nian Wu (AAAI 2018)

[10] Cooperative Training of Fast Thinking Initializer and Slow Thinking Solver for Multi-Modal Conditional Learning. Jianwen Xie, Zilong Zheng, Xiaolin Fang, Song-Chun Zhu, Ying Nian Wu, (ArXiv 2019)

Survey Papers:

[11] Sparse and Deep Generalizations of the FRAME Model. Ying Nian Wu, Jianwen Xie, Yang Lu, Song-Chun Zhu. Annals of Mathematical Sciences and Applications 2018.

[12] A Tale of Three Probabilistic Families: Discriminative, Descriptive and Generative Models. Ying Nian Wu, Ruiqi Gao, Tian Han, and Song-Chun Zhu. Quarterly of Applied Mathematics 2019.

[13] Representation Learning: A Statistical Perspective. Jianwen Xie, Ruiqi Gao, Erik Nijkamp, Song-Chun Zhu, Ying Nian Wu. Annual Review of Statistics and Its Application (ARSIA 2020)


Textbook:

[14] Monte Carlo Methods. Adrian Barbu, Song-Chun Zhu. Springer. 2020.