Variational Autoencoder

Introduction

In my previous article “Expectation Maximization Algorithm”, we discussed how to optimize probabilistic models with latent variables using the Expectation Maximization (EM) algorithm.

In this article, I would like to discuss how to optimize probabilistic models with latent variables using Variational Autoencoder (VAE).

Variational Autoencoder

Similar to the maximization-maximization interpretation of the EM algorithm, the log evidence of a probabilistic model with parameters $\theta$, $\log p_{\theta}(x)$, can be expressed as the sum of an evidence lower bound (ELBO) and a Kullback-Leibler (KL) divergence.

$$
\begin{aligned}
\log p_{\theta}(x)
&= \mathbb{E}_{z \sim q(z)}\left[\log p_{\theta}(x)\right] \\
&= \mathbb{E}_{z \sim q(z)}\left[\log \frac{p_{\theta}(x, z)}{p_{\theta}(z|x)}\right] \\
&= \mathbb{E}_{z \sim q(z)}\left[\log \frac{p_{\theta}(x, z)\, q(z)}{q(z)\, p_{\theta}(z|x)}\right] \\
&= \mathbb{E}_{z \sim q(z)}\left[\log \frac{p_{\theta}(x, z)}{q(z)} + \log \frac{q(z)}{p_{\theta}(z|x)}\right] \\
&= \mathbb{E}_{z \sim q(z)}\left[\log \frac{p_{\theta}(x, z)}{q(z)}\right] + \mathbb{E}_{z \sim q(z)}\left[\log \frac{q(z)}{p_{\theta}(z|x)}\right] \\
&= \mathbb{E}_{z \sim q(z)}\left[\log \frac{p_{\theta}(x, z)}{q(z)}\right] - \mathbb{E}_{z \sim q(z)}\left[\log \frac{p_{\theta}(z|x)}{q(z)}\right] \\
&= \text{ELBO}(q, \theta) + D_{\text{KL}}\left(q(z) \,\|\, p_{\theta}(z|x)\right)
\end{aligned}
$$

where ELBO(q,θ) is the evidence lower bound (ELBO) and DKL(q(z)||pθ(z|x)) is the Kullback-Leibler (KL) divergence between the approximate posterior q(z) and the true posterior pθ(z|x).

In the iterative EM algorithm, the approximate posterior $q(z)$ is not learned explicitly. In the VAE, the approximate posterior is learned explicitly by an inference model. Specifically, the approximate posterior is parameterized by a neural network $q_{\phi}(z|x)$ conditioned on the observed data $x$, where $\phi$ are the variational parameters of the neural network, so called because they parameterize the variational distribution $q_{\phi}(z|x)$. Concretely,

$$
\begin{aligned}
\log p_{\theta}(x)
&= \mathbb{E}_{z \sim q_{\phi}(z|x)}\left[\log \frac{p_{\theta}(x, z)}{q_{\phi}(z|x)}\right] + \mathbb{E}_{z \sim q_{\phi}(z|x)}\left[\log \frac{q_{\phi}(z|x)}{p_{\theta}(z|x)}\right] \\
&= \text{ELBO}(\phi, \theta) + D_{\text{KL}}\left(q_{\phi}(z|x) \,\|\, p_{\theta}(z|x)\right)
\end{aligned}
$$

where ELBO(ϕ,θ) is the evidence lower bound (ELBO) and DKL(qϕ(z|x)||pθ(z|x)) is the Kullback-Leibler (KL) divergence between the approximate posterior qϕ(z|x) and the true posterior pθ(z|x).
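As a sanity check, the identity $\log p_{\theta}(x) = \text{ELBO}(\phi, \theta) + D_{\text{KL}}(q_{\phi}(z|x) \,\|\, p_{\theta}(z|x))$ can be verified numerically on a toy model with a discrete latent variable. The joint probabilities and the approximate posterior in the sketch below are arbitrary illustrative values, not part of any real model.

```python
import numpy as np

# Toy model: one observation x and a binary latent variable z.
# The joint probabilities p(x, z) are arbitrary illustrative values.
p_xz = np.array([0.10, 0.25])        # p(x, z=0), p(x, z=1)
p_x = p_xz.sum()                     # evidence p(x)
p_z_given_x = p_xz / p_x             # true posterior p(z | x)

q = np.array([0.6, 0.4])             # an arbitrary approximate posterior q(z | x)

elbo = np.sum(q * (np.log(p_xz) - np.log(q)))       # E_q[log p(x, z) - log q(z | x)]
kl = np.sum(q * (np.log(q) - np.log(p_z_given_x)))  # D_KL(q(z | x) || p(z | x))

print(np.log(p_x), elbo + kl)        # both print the same value: log p(x)
```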

Model Parameters Optimization

In the ELBO(ϕ,θ), pθ(x,z) is called the generative model and qϕ(z|x) is called the inference model. pθ(x) is also the generative model, even though it is not explicitly modeled.

DKL(qϕ(z|x)||pθ(z|x)) is non-negative and it is zero if and only if the approximate posterior qϕ(z|x) is equal to the true posterior pθ(z|x).

If the parameters θ are fixed and only the parameters ϕ are optimized, maximizing the ELBO is equivalent to minimizing the KL divergence between the approximate posterior qϕ(z|x) and the true posterior pθ(z|x), making the approximate posterior qϕ(z|x) as close as possible to the true posterior pθ(z|x).

If the parameters ϕ are fixed and only the parameters θ are optimized, maximizing the ELBO is approximately equivalent to maximizing the log evidence logpθ(x), making the generative model pθ(x) better.

Therefore, the parameters ϕ and θ can even be jointly optimized by maximizing the ELBO so that the approximate posterior qϕ(z|x) becomes close to the true posterior pθ(z|x) and the generative model pθ(x) becomes better.

Gradient-based optimization algorithms can be used to optimize the parameters ϕ and θ. The target function is defined as follows.

$$
\begin{aligned}
\mathcal{L}_{\phi, \theta}(x)
&= \text{ELBO}(\phi, \theta) \\
&= \mathbb{E}_{z \sim q_{\phi}(z|x)}\left[\log \frac{p_{\theta}(x, z)}{q_{\phi}(z|x)}\right] \\
&= \mathbb{E}_{z \sim q_{\phi}(z|x)}\left[\log p_{\theta}(x, z) - \log q_{\phi}(z|x)\right]
\end{aligned}
$$

The gradients of the target function Lϕ,θ(x) with respect to the generative model parameters θ are straightforward to obtain.

$$
\begin{aligned}
\nabla_{\theta} \mathcal{L}_{\phi, \theta}(x)
&= \nabla_{\theta} \mathbb{E}_{z \sim q_{\phi}(z|x)}\left[\log p_{\theta}(x, z) - \log q_{\phi}(z|x)\right] \\
&= \mathbb{E}_{z \sim q_{\phi}(z|x)}\left[\nabla_{\theta} \left(\log p_{\theta}(x, z) - \log q_{\phi}(z|x)\right)\right] \\
&= \mathbb{E}_{z \sim q_{\phi}(z|x)}\left[\nabla_{\theta} \log p_{\theta}(x, z) - \nabla_{\theta} \log q_{\phi}(z|x)\right] \\
&= \mathbb{E}_{z \sim q_{\phi}(z|x)}\left[\nabla_{\theta} \log p_{\theta}(x, z)\right]
\end{aligned}
$$

This gradient $\nabla_{\theta} \mathcal{L}_{\phi, \theta}(x)$ can be approximated using an unbiased Monte Carlo estimate by sampling $z^{(i)} \sim q_{\phi}(z|x)$.

$$
\nabla_{\theta} \mathcal{L}_{\phi, \theta}(x) \approx \frac{1}{M} \sum_{i=1}^{M} \nabla_{\theta} \log p_{\theta}\left(x, z^{(i)}\right)
$$

where $M$ is the number of samples, which can be as small as $1$.
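As a minimal sketch, the single-sample ($M = 1$) estimate can be implemented with automatic differentiation by evaluating $\log p_{\theta}(x, z^{(i)})$ at a sample drawn from the inference model and backpropagating only into $\theta$. The helpers `sample_q` and `log_p_xz` below are hypothetical stand-ins for the inference model's sampler and the generative model's joint log-density.

```python
import torch

def grad_theta_estimate(x, theta, sample_q, log_p_xz):
    # theta: a tensor of generative model parameters with requires_grad=True.
    # sample_q(x): draws z ~ q_phi(z | x); detached because no gradient with
    # respect to theta flows through the sample itself.
    z = sample_q(x).detach()
    if theta.grad is not None:
        theta.grad = None                 # clear any stale gradient
    log_p_xz(x, z, theta).backward()      # accumulates d/dtheta log p_theta(x, z^(i))
    return theta.grad                     # the M = 1 Monte Carlo gradient estimate
```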

The gradients of the target function Lϕ,θ(x) with respect to the inference model parameters ϕ are somewhat more complicated to obtain.

$$
\nabla_{\phi} \mathcal{L}_{\phi, \theta}(x) = \nabla_{\phi} \mathbb{E}_{z \sim q_{\phi}(z|x)}\left[\log p_{\theta}(x, z) - \log q_{\phi}(z|x)\right]
$$

Here we encounter the same problem as in the EM algorithm. The distribution that the latent variables $z$ follow is parameterized by $\phi$, and $\phi$ is also the parameter we want to optimize. Therefore, the Monte Carlo estimate of $\mathbb{E}_{z \sim q_{\phi}(z|x)}\left[\log p_{\theta}(x, z) - \log q_{\phi}(z|x)\right]$ obtained by sampling $z^{(i)} \sim q_{\phi_{\text{old}}}(z|x)$ is biased. Moreover, even if the Monte Carlo estimate were unbiased, the sampling process is not differentiable with respect to $\phi$, so gradient-based methods cannot be applied directly.

If the latent variables $z$ are obtained via a deterministic transformation of some auxiliary random variables that do not depend on $\phi$, an unbiased Monte Carlo estimate can be obtained and the sampling process becomes differentiable with respect to $\phi$. This is the idea of the reparameterization trick.

Concretely, we have some differentiable and invertible transformation g that transforms some other latent variables ϵ that are not parameterized by ϕ into the latent variables z that are parameterized by ϕ.

$$
\begin{aligned}
z &= g(\epsilon, \phi, x) \\
\epsilon &\sim p(\epsilon)
\end{aligned}
$$

The distribution $p(\epsilon)$ is usually a simple fixed distribution, such as a standard Gaussian, chosen together with $g$ so that the resulting distribution of $z$ is exactly $q_{\phi}(z|x)$.

As a result, the target function Lϕ,θ(x) can be expressed as an expectation over the latent variables ϵ that are not parameterized by ϕ and this target function is completely differentiable with respect to θ and ϕ.

$$
\begin{aligned}
\mathcal{L}_{\phi, \theta}(x)
&= \mathbb{E}_{z \sim q_{\phi}(z|x)}\left[\log p_{\theta}(x, z) - \log q_{\phi}(z|x)\right] \\
&= \mathbb{E}_{\epsilon \sim p(\epsilon)}\left[\log p_{\theta}(x, z) - \log q_{\phi}(z|x)\right]
\end{aligned}
$$

where z=g(ϵ,ϕ,x).
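For the common case of a diagonal-Gaussian inference model $q_{\phi}(z|x) = \mathcal{N}\left(z; \mu_{\phi}(x), \operatorname{diag}(\sigma_{\phi}^{2}(x))\right)$, the transformation is simply $g(\epsilon, \phi, x) = \mu_{\phi}(x) + \sigma_{\phi}(x) \odot \epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$. A minimal PyTorch sketch, assuming the encoder outputs `mu` and `logvar`:

```python
import torch

def reparameterize(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    # z = g(eps, phi, x) = mu + sigma * eps, with eps ~ N(0, I).
    # All randomness lives in eps, so z is differentiable w.r.t. mu and logvar.
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * logvar) * eps
```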

The gradients of the target function Lϕ,θ(x) with respect to the generative model parameters θ and the inference model parameters ϕ become straightforward to obtain.

$$
\begin{aligned}
\nabla_{\theta} \mathcal{L}_{\phi, \theta}(x)
&= \nabla_{\theta} \mathbb{E}_{\epsilon \sim p(\epsilon)}\left[\log p_{\theta}(x, z) - \log q_{\phi}(z|x)\right] \\
&= \mathbb{E}_{\epsilon \sim p(\epsilon)}\left[\nabla_{\theta} \left(\log p_{\theta}(x, z) - \log q_{\phi}(z|x)\right)\right] \\
&= \mathbb{E}_{\epsilon \sim p(\epsilon)}\left[\nabla_{\theta} \log p_{\theta}(x, z)\right]
\end{aligned}
$$

$$
\begin{aligned}
\nabla_{\phi} \mathcal{L}_{\phi, \theta}(x)
&= \nabla_{\phi} \mathbb{E}_{\epsilon \sim p(\epsilon)}\left[\log p_{\theta}(x, z) - \log q_{\phi}(z|x)\right] \\
&= \mathbb{E}_{\epsilon \sim p(\epsilon)}\left[\nabla_{\phi} \left(\log p_{\theta}(x, z) - \log q_{\phi}(z|x)\right)\right]
\end{aligned}
$$

Under the reparameterization trick, the change-of-variables formula for probability densities relates the density $q_{\phi}(z|x)$ to the density $p(\epsilon)$.

$$
q_{\phi}(z|x) = \frac{p(\epsilon)}{\left|\det J_{g}(\epsilon)\right|}
$$

where Jg(ϵ) is the Jacobian matrix of the transformation g at ϵ.

$$
J_{g}(\epsilon) = \frac{\partial z}{\partial \epsilon} =
\begin{bmatrix}
\frac{\partial z_{1}}{\partial \epsilon_{1}} & \frac{\partial z_{1}}{\partial \epsilon_{2}} & \cdots & \frac{\partial z_{1}}{\partial \epsilon_{n}} \\
\frac{\partial z_{2}}{\partial \epsilon_{1}} & \frac{\partial z_{2}}{\partial \epsilon_{2}} & \cdots & \frac{\partial z_{2}}{\partial \epsilon_{n}} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial z_{n}}{\partial \epsilon_{1}} & \frac{\partial z_{n}}{\partial \epsilon_{2}} & \cdots & \frac{\partial z_{n}}{\partial \epsilon_{n}}
\end{bmatrix}
$$
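For example, for the diagonal-Gaussian transformation $z = \mu + \sigma \odot \epsilon$ mentioned above, the Jacobian is $J_{g}(\epsilon) = \operatorname{diag}(\sigma)$, so

$$
\log \left|\det J_{g}(\epsilon)\right| = \sum_{j=1}^{n} \log \sigma_{j}
$$

and $\log q_{\phi}(z|x) = \log p(\epsilon) - \sum_{j=1}^{n} \log \sigma_{j}$, which is exactly the log-density of $\mathcal{N}\left(z; \mu, \operatorname{diag}(\sigma^{2})\right)$.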

Thus, the target function Lϕ,θ(x) can be rewritten as follows.

$$
\begin{aligned}
\mathcal{L}_{\phi, \theta}(x)
&= \mathbb{E}_{\epsilon \sim p(\epsilon)}\left[\log p_{\theta}(x, z) - \log q_{\phi}(z|x)\right] \\
&= \mathbb{E}_{\epsilon \sim p(\epsilon)}\left[\log p_{\theta}(x, z) - \log \frac{p(\epsilon)}{\left|\det J_{g}(\epsilon)\right|}\right] \\
&= \mathbb{E}_{\epsilon \sim p(\epsilon)}\left[\log p_{\theta}(x, z) - \log p(\epsilon) + \log \left|\det J_{g}(\epsilon)\right|\right] \\
&= \mathbb{E}_{\epsilon \sim p(\epsilon)}\left[\log \left(p_{\theta}(x|z)\, p_{\theta}(z)\right) - \log p(\epsilon) + \log \left|\det J_{g}(\epsilon)\right|\right] \\
&= \mathbb{E}_{\epsilon \sim p(\epsilon)}\left[\log p_{\theta}(x|z) + \log p_{\theta}(z) - \log p(\epsilon) + \log \left|\det J_{g}(\epsilon)\right|\right] \\
&= \mathbb{E}_{\epsilon \sim p(\epsilon)}\left[\log p_{\theta}(x|g(\epsilon, \phi, x)) + \log p_{\theta}(g(\epsilon, \phi, x)) - \log p(\epsilon) + \log \left|\det J_{g}(\epsilon)\right|\right] \\
&= \mathbb{E}_{\epsilon \sim p(\epsilon)}\left[\log p_{\theta}(x|g(\epsilon, \phi, x)) + \log p_{\theta}(g(\epsilon, \phi, x)) + \log \left|\det J_{g}(\epsilon)\right|\right] + \text{constant}
\end{aligned}
$$

where the constant is just $-\mathbb{E}_{\epsilon \sim p(\epsilon)}\left[\log p(\epsilon)\right]$, which does not depend on $\phi$ or $\theta$.

The gradients of the target function $\mathcal{L}_{\phi, \theta}(x)$ with respect to the generative model parameters $\theta$ and the inference model parameters $\phi$ can be approximated using unbiased Monte Carlo estimates by sampling $\epsilon^{(i)} \sim p(\epsilon)$, and they are easy to compute in automatic differentiation frameworks such as PyTorch and TensorFlow.
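Putting the pieces together, below is a minimal PyTorch sketch of a VAE trained with a single-sample Monte Carlo estimate of the ELBO, assuming a diagonal-Gaussian inference model, a standard-Gaussian prior $p_{\theta}(z) = \mathcal{N}(0, I)$, and a Bernoulli decoder. The layer sizes and module names are illustrative assumptions, not a prescribed architecture.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim: int = 784, z_dim: int = 16, h_dim: int = 256):
        super().__init__()
        # Inference model q_phi(z | x): mean and log-variance of a diagonal Gaussian.
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.enc_mu = nn.Linear(h_dim, z_dim)
        self.enc_logvar = nn.Linear(h_dim, z_dim)
        # Generative model p_theta(x | z): Bernoulli logits.
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, x_dim))

    def loss(self, x: torch.Tensor) -> torch.Tensor:
        h = self.enc(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        # Reparameterization: z = mu + sigma * eps, eps ~ N(0, I).
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        logits = self.dec(z)
        # Single-sample Monte Carlo estimate of ELBO = log p(x | z) + log p(z) - log q(z | x).
        log_px_z = -F.binary_cross_entropy_with_logits(logits, x, reduction="none").sum(-1)
        log_pz = -0.5 * (z.pow(2) + math.log(2.0 * math.pi)).sum(-1)
        log_qz_x = -0.5 * (logvar + (z - mu).pow(2) / logvar.exp() + math.log(2.0 * math.pi)).sum(-1)
        return -(log_px_z + log_pz - log_qz_x).mean()  # loss = -ELBO, averaged over the batch

# A typical training step (x is a batch of data scaled to [0, 1]):
# optimizer = torch.optim.Adam(vae.parameters(), lr=1e-3)
# loss = vae.loss(x); optimizer.zero_grad(); loss.backward(); optimizer.step()
```

In the diagonal-Gaussian case, the term $\log p_{\theta}(z) - \log q_{\phi}(z|x)$ is often replaced by the closed-form KL divergence between the two Gaussians, which lowers the variance of the estimate; the Monte Carlo form above follows the target function derived in this article more literally.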

Generative Model Evaluation

To evaluate the performance of the generative model pθ(x), the marginal likelihood logpθ(x) can be estimated via importance sampling.

$$
\log p_{\theta}(x) = \log \mathbb{E}_{z \sim q_{\phi}(z|x)}\left[\frac{p_{\theta}(x, z)}{q_{\phi}(z|x)}\right] \approx \log \frac{1}{M} \sum_{i=1}^{M} \frac{p_{\theta}\left(x, z^{(i)}\right)}{q_{\phi}\left(z^{(i)}|x\right)}
$$

where $z^{(i)} \sim q_{\phi}(z|x)$ and $M$ is the number of samples.
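A sketch of this importance-sampling estimate, computed stably in log space with `logsumexp`. The helpers `sample_q`, `log_p_xz`, and `log_q_z_x` are hypothetical stand-ins for the trained inference model's sampler and the two log-densities.

```python
import math

import torch

def estimate_log_px(x, sample_q, log_p_xz, log_q_z_x, M: int = 1000) -> torch.Tensor:
    z = sample_q(x, M)                        # M samples z^(i) ~ q_phi(z | x)
    log_w = log_p_xz(x, z) - log_q_z_x(z, x)  # log importance weights, shape (M,)
    # log( (1/M) * sum_i exp(log_w_i) ), computed with the log-sum-exp trick.
    return torch.logsumexp(log_w, dim=0) - math.log(M)
```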

One can also sample from the generative model $p_{\theta}(x, z)$ and evaluate the quality of the generated samples $x$ by other means, such as visual inspection. To generate a sample, we first draw the latent variables $z$ from the prior distribution $p_{\theta}(z)$ and then draw the observed data $x$ from the conditional distribution $p_{\theta}(x|z)$.
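For the hypothetical `VAE` module sketched earlier, this ancestral sampling amounts to drawing $z$ from the standard-Gaussian prior and then drawing $x$ from the Bernoulli decoder:

```python
import torch

@torch.no_grad()
def generate(vae, num_samples: int = 16, z_dim: int = 16) -> torch.Tensor:
    z = torch.randn(num_samples, z_dim)   # z ~ p_theta(z) = N(0, I)
    probs = torch.sigmoid(vae.dec(z))     # parameters of p_theta(x | z)
    return torch.bernoulli(probs)         # x ~ p_theta(x | z)
```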

Increasing the Inference Model Flexibility

What if the inference model qϕ(z|x) is not flexible enough to approximate the true posterior pθ(z|x)?

The tightness of the ELBO depends on the flexibility of the approximate posterior qϕ(z|x). If qϕ(z|x) is not flexible enough to approximate the true posterior pθ(z|x), the ELBO may not be tight enough to approximate the log evidence logpθ(x) well. In this case, the generative model pθ(x) may not be learned well.

In many simple settings, the approximate posterior $q_{\phi}(z|x)$ is just a multivariate Gaussian distribution, which is insufficient if the true posterior $p_{\theta}(z|x)$ is more complex.

Similar to what we have discussed in the article “Expectation Maximization Algorithm”, having more hierarchical latent variables can make the approximate posterior qϕ(z|x) more flexible and the ELBO tighter. In our case, we will introduce more hierarchical latent variables to both the generative model and the inference model.

Suppose we have $n$ sets of latent variables $z_{1}, z_{2}, \ldots, z_{n}$, a generative model $p_{\theta}(x, z_{1}, z_{2}, \ldots, z_{n})$, and an inference model $q_{\phi}(z_{1}, z_{2}, \ldots, z_{n}|x)$. The ELBO can be expressed as follows.

$$
\text{ELBO}(\phi, \theta) = \mathbb{E}_{z_{1}, z_{2}, \ldots, z_{n} \sim q_{\phi}(z_{1}, z_{2}, \ldots, z_{n}|x)}\left[\log p_{\theta}(x, z_{1}, z_{2}, \ldots, z_{n}) - \log q_{\phi}(z_{1}, z_{2}, \ldots, z_{n}|x)\right]
$$

In the generative model, we can introduce more hierarchical latent variables by factorizing the joint distribution of the observed data $x$ and the latent variables $z_{1}, z_{2}, \ldots, z_{n}$.

$$
\begin{aligned}
p_{\theta}(x, z_{1}, z_{2}, \ldots, z_{n})
&= p_{\theta}(x | z_{1}, z_{2}, \ldots, z_{n})\, p_{\theta}(z_{1}, z_{2}, \ldots, z_{n}) \\
&= p_{\theta}(x | z_{1}, z_{2}, \ldots, z_{n})\, p_{\theta}(z_{1} | z_{2}, \ldots, z_{n})\, p_{\theta}(z_{2}, z_{3}, \ldots, z_{n}) \\
&= p_{\theta}(x | z_{1}, z_{2}, \ldots, z_{n})\, p_{\theta}(z_{1} | z_{2}, \ldots, z_{n})\, p_{\theta}(z_{2} | z_{3}, \ldots, z_{n}) \cdots p_{\theta}(z_{n})
\end{aligned}
$$

Given the generative model, there can be different ways to introduce more hierarchical latent variables to the inference model.

One common way is to follow the same hierarchy as the generative model.

$$
q_{\phi}(z_{1}, z_{2}, \ldots, z_{n}|x) = q_{\phi}(z_{1} | z_{2}, \ldots, z_{n}, x)\, q_{\phi}(z_{2} | z_{3}, \ldots, z_{n}, x) \cdots q_{\phi}(z_{n} | x)
$$

In the first step, the inference model will sample the latent variables zn from the approximate posterior qϕ(zn|x), and the sampled latent variables zn can be used to compute the prior pθ(zn).

In the second step, the inference model will sample the latent variables $z_{n-1}$ from the approximate posterior $q_{\phi}(z_{n-1}|z_{n}, x)$, and the sampled latent variables $z_{n-1}$ and $z_{n}$ can be used to compute the prior $p_{\theta}(z_{n-1}|z_{n})$.

In the $n$-th step, the inference model will sample the latent variables $z_{1}$ from the approximate posterior $q_{\phi}(z_{1}|z_{2}, \ldots, z_{n}, x)$, and the sampled latent variables $z_{1}, z_{2}, \ldots, z_{n}$ can be used to compute the prior $p_{\theta}(z_{1}|z_{2}, \ldots, z_{n})$.

This kind of inference model is sometimes also called the top-down inference model.
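A schematic of the top-down sampling order, assuming hypothetical per-level inference modules `q_blocks[k]`, ordered from the top level $z_{n}$ down to $z_{1}$, each mapping the data $x$ and the already-sampled higher-level latents to the mean and log-variance of a diagonal Gaussian:

```python
import torch

def top_down_sample(x, q_blocks):
    # q_blocks: hypothetical modules, one per level, ordered [z_n, z_{n-1}, ..., z_1].
    zs = []
    for q_block in q_blocks:
        mu, logvar = q_block(x, zs)       # condition on x and the higher-level samples
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        zs.append(z)
    return zs                             # [z_n, z_{n-1}, ..., z_1]
```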

The other common way is to reverse the hierarchy of the generative model.

$$
q_{\phi}(z_{1}, z_{2}, \ldots, z_{n}|x) = q_{\phi}(z_{n} | x, z_{1}, z_{2}, \ldots, z_{n-1})\, q_{\phi}(z_{n-1} | x, z_{1}, z_{2}, \ldots, z_{n-2}) \cdots q_{\phi}(z_{1} | x)
$$

In the first step, the inference model will sample the latent variables z1 from the approximate posterior qϕ(z1|x).

In the second step, the inference model will sample the latent variables z2 from the approximate posterior qϕ(z2|x,z1).

In the $n$-th step, the inference model will sample the latent variables $z_{n}$ from the approximate posterior $q_{\phi}(z_{n}|x, z_{1}, z_{2}, \ldots, z_{n-1})$.

Once all the latent variables $z_{1}, z_{2}, \ldots, z_{n}$ are sampled, they can be used to compute the prior $p_{\theta}(z_{1}, z_{2}, \ldots, z_{n})$ in the generative model.

This kind of inference model is sometimes also called the bottom-up inference model.

In practice, the top-down inference model appears to be favored over the bottom-up inference model.
