# Expectation Maximization Algorithm

## Introduction

Optimizing probabilistic models with latent variables is a challenging task because the marginal distribution of the observed random variables is usually intractable. The Expectation Maximization (EM) algorithm is an iterative optimization algorithm that alternates between two steps: the Expectation (E) step and the Maximization (M) step. In the Expectation (E) step, the algorithm computes the expected value of the log-likelihood of the joint distribution given the observed random variables and the current probabilistic parameters. In the Maximization (M) step, the algorithm computes the probabilistic parameters that maximize the expected value of the log-likelihood of the joint distribution given the observed random variables and the latent variables. The EM algorithm can be viewed as two alternating maximization steps where the first step is to maximize the evidence lower bound with respect to the distribution of the latent variables and the second step is to maximize the evidence lower bound with respect to the probabilistic parameters.

## Probabilistic Models

In machine learning, probabilistic models are mathematical descriptions to some phenomena that is represented by data. If such mathematical descriptions are accurate, it will be helpful for understanding the phenomena and making predictions.

Let $\mathbf{x}$ be the vector representing the set of random variables that can be observed and we are interested in modeling the joint distribution. Assuming the $\mathbf{x}$ is sampled from an unknown underlying process whose true probability distribution $p^{\ast}(\mathbf{x})$ is unknown. We come up with a probabilistic model $p_{\boldsymbol{\theta}}(\mathbf{x})$ parameterized by $\boldsymbol{\theta}$, to approximate the true probability distribution $p^{\ast}(\mathbf{x})$.

The probabilistic model learning or optimization is the process of search for a value of $\boldsymbol{\theta}$ such that

$$p_{\boldsymbol{\theta}}(\mathbf{x}) \approx p^{\ast}(\mathbf{x})$$

The probabilistic model learning or optimization, usually, requires samples, sometimes called data, from the true probability distribution $p^{\ast}(\mathbf{x})$. This is sometimes referred as the model training process.

The probabilistic model $p_{\boldsymbol{\theta}}(\mathbf{x})$ should be sufficiently flexible to be adapted to the data. We wish with more data and better optimization techniques, the probabilistic model has a chance to approximate the true probability distribution $p^{\ast}(\mathbf{x})$ accurately. Otherwise, if the probabilistic model is not flexible enough and the unknown underlying process is too complex, no matter how the probabilistic model is trained, it will not be able to approximate the true probability distribution well.

For example, if the true probability distribution $p^{\ast}(\mathbf{x})$ is a mixture of Gaussians distributions, but the probabilistic model is $p_{\boldsymbol{\theta}}(\mathbf{x})$ a single Gaussian distribution, no matter how the probabilistic model is trained, it will not be able to approximate the true probability distribution well. On the contrary, if the true probability distribution $p^{\ast}(\mathbf{x})$ is a single Gaussian distribution, but the probabilistic model is $p_{\boldsymbol{\theta}}(\mathbf{x})$ a mixture of Gaussians distributions, the probabilistic model has a chance to approximate the true probability distribution well.

## Probabilistic Models Optimization

Given a probabilistic model $p_{\boldsymbol{\theta}}(\mathbf{x})$ and a dataset $\mathcal{D} = \{ x_{1}, x_{2}, \ldots, x_{N} \}$, the probabilistic model learning or optimization is the process of search for a value of $\boldsymbol{\theta}$ such that the expected value of the probability of the samples in the dataset $\mathcal{D}$ is maximized, i.e., the maximum likelihood estimation of the probabilistic parameters $\boldsymbol{\theta}$.

\begin{align} \boldsymbol{\theta}^{\ast} &= \arg\max_{\boldsymbol{\theta}} \mathbb{E}_{\mathbf{x} \sim \mathcal{D}} \left[ \log p_{\boldsymbol{\theta}}(\mathbf{x}) \right] \\ &= \arg\max_{\boldsymbol{\theta}} \sum_{n=1}^{N} p_{\mathcal{D}}(x_{n}) \log p_{\boldsymbol{\theta}}(x_{n}) \\ &= \arg\max_{\boldsymbol{\theta}} \sum_{n=1}^{N} \frac{1}{N} \log p_{\boldsymbol{\theta}}(x_{n}) \\ &= \arg\max_{\boldsymbol{\theta}} \frac{1}{N} \sum_{n=1}^{N} \log p_{\boldsymbol{\theta}}(x_{n}) \end{align}

The probabilistic parameters $\boldsymbol{\theta}$ can be optimized using various optimization algorithms, such as the gradient descent algorithm, directly.

## Probabilistic Models With Latent Variables

In addition to the observed random variables $\mathbf{x}$, the unknown underlying process sometimes also involves some latent variables $\mathbf{z}$ that are not observed. The latent variables $\mathbf{z}$ are usually introduced to model the complexity of the unknown underlying process and make the probabilistic model more flexible. The more complicated the unknown underlying process is, the more latent variables are usually involved.

For example, modeling the probability distribution of earthquakes at certain locations at certain times is an extremely difficult even with the most recent technology. The observed random variables $\mathbf{x}$ can be the time of the earthquakes, the magnitude of the earthquakes, etc., but we know that the process of earthquakes is extremely complex and involves many latent variables $\mathbf{z}$ that are not observed. The latent variables $\mathbf{z}$ can be the geological features of the locations, the underground movements, etc. If we only model the probability distribution of earthquakes with the observed random variables $\mathbf{x}$, the probabilistic model will, of course, not be able to approximate the true probability distribution well. If some latent variables $\mathbf{z}$ are introduced, the probabilistic model will at least have a chance to approximate the true probability distribution better than the one without the latent variables.

In this case, instead of modeling the joint distribution $p_{\boldsymbol{\theta}}(\mathbf{x})$, we will model the joint distribution $p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z})$ in the probabilistic model. The marginal distribution $p_{\boldsymbol{\theta}}(\mathbf{x})$, which is the distribution we are interested in, can be obtained by integrating out the latent variables $\mathbf{z}$ from the joint distribution $p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z})$. $p_{\boldsymbol{\theta}}(\mathbf{x})$ can be formulated as follows.

\begin{align} p_{\boldsymbol{\theta}}(\mathbf{x}) &= \mathbb{E}_{\mathbf{z} \sim p_{\boldsymbol{\theta}}(\mathbf{z})} \left[ p_{\boldsymbol{\theta}}(\mathbf{x} | \mathbf{z}) \right] \\ \end{align}

If the latent variables $\mathbf{z}$ are continuous and defined on a continuous space $\mathcal{Z}$, the integral can be formulated as follows.

\begin{align} p_{\boldsymbol{\theta}}(\mathbf{x}) &= \mathbb{E}_{\mathbf{z} \sim p_{\boldsymbol{\theta}}(\mathbf{z})} \left[ p_{\boldsymbol{\theta}}(\mathbf{x} | \mathbf{z}) \right] \\ &= \int_{\mathcal{Z}} p_{\boldsymbol{\theta}}(\mathbf{x} | \mathbf{z}) p_{\boldsymbol{\theta}}(\mathbf{z}) d\mathbf{z} \\ &= \int_{\mathcal{Z}} p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}) d\mathbf{z} \\ \end{align}

If the latent variables $\mathbf{z}$ are discrete and defined on a discrete space $\mathcal{Z}$, the sum can be formulated as follows.

\begin{align} p_{\boldsymbol{\theta}}(\mathbf{x}) &= \mathbb{E}_{\mathbf{z} \sim p_{\boldsymbol{\theta}}(\mathbf{z})} \left[ p_{\boldsymbol{\theta}}(\mathbf{x} | \mathbf{z}) \right] \\ &= \sum_{\mathcal{Z}} p_{\boldsymbol{\theta}}(\mathbf{x} | \mathbf{z}) p_{\boldsymbol{\theta}}(\mathbf{z}) \\ &= \sum_{\mathcal{Z}} p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}) \\ \end{align}

We could see that because the marginal distribution $p_{\boldsymbol{\theta}}(\mathbf{x})$ is a mixture of distributions $p_{\boldsymbol{\theta}}(\mathbf{x} | \mathbf{z})$ with different weights $p_{\boldsymbol{\theta}}(\mathbf{z})$, even if $p_{\boldsymbol{\theta}}(\mathbf{x} | \mathbf{z})$ is just a Gaussian distribution, the marginal distribution $p_{\boldsymbol{\theta}}(\mathbf{x})$ becomes a mixture of Gaussians distributions. Thus, introducing latent variables $\mathbf{z}$ makes the probabilistic model more flexible.

To make the marginal distribution $p_{\boldsymbol{\theta}}(\mathbf{x})$ even more flexible, we can introduce more latent variables $\mathbf{z}$ and add dependencies between the latent variables $\mathbf{z}$. For example, suppose we have two latent variables $\mathbf{z}_{1}$ and $\mathbf{z}_{2}$, and $\mathbf{z} = \{ \mathbf{z}_{1}, \mathbf{z}_{2} \}$, we can add a dependency between them by introducing a conditional distribution $p_{\boldsymbol{\theta}}(\mathbf{z}_{2} | \mathbf{z}_{1})$.

\begin{align} p_{\boldsymbol{\theta}}(\mathbf{x}) &= \int_{\mathcal{Z}} p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}) d\mathbf{z} \\ &= \int_{\mathcal{Z}} p_{\boldsymbol{\theta}}(\mathbf{x} | \mathbf{z}) p_{\boldsymbol{\theta}}(\mathbf{z}) d\mathbf{z} \\ &= \int_{\mathcal{Z}} p_{\boldsymbol{\theta}}(\mathbf{x} | \mathbf{z}_{1}, \mathbf{z}_{2}) p_{\boldsymbol{\theta}}(\mathbf{z}_{1}, \mathbf{z}_{2}) d\mathbf{z} \\ &= \int_{\mathcal{Z}_{1}} \int_{\mathcal{Z}_{2}} p_{\boldsymbol{\theta}}(\mathbf{x} | \mathbf{z}_{1}, \mathbf{z}_{2}) p_{\boldsymbol{\theta}}(\mathbf{z}_{2} | \mathbf{z}_{1}) p_{\boldsymbol{\theta}}(\mathbf{z}_{1}) d\mathbf{z}_{2} d\mathbf{z}_{1} \end{align}

Similarly, even if $p_{\boldsymbol{\theta}}(\mathbf{x} | \mathbf{z}_{1}, \mathbf{z}_{2})$, $p_{\boldsymbol{\theta}}(\mathbf{z}_{2} | \mathbf{z}_{1})$, and $p_{\boldsymbol{\theta}}(\mathbf{z}_{1})$ are simple distributions, the marginal distribution $p_{\boldsymbol{\theta}}(\mathbf{x})$ becomes a very complex distribution. Thus, introducing more latent variables $\mathbf{z}$ and adding dependencies between the latent variables $\mathbf{z}$ makes the probabilistic model even more flexible.

## Probabilistic Models With Latent Variables Optimization

Similar to the optimization of the probabilistic models without latent variables, we want the expected value of the probability of the samples in the dataset $\mathcal{D}$ to be maximized.

\begin{align} \boldsymbol{\theta}^{\ast} &= \arg\max_{\boldsymbol{\theta}} \frac{1}{N} \sum_{n=1}^{N} \log p_{\boldsymbol{\theta}}(x_{n}) \end{align}

When it comes to optimizing probabilistic models with latent variables, however, there is a problem. In computing the marginal distribution $p_{\boldsymbol{\theta}}(\mathbf{x})$ using the integral or the sum of the joint distribution $p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z})$ when the latent variables $\mathbf{z}$ are involved. This integral or the sum is usually intractable, because the number of latent variables $\mathbf{z}$ can be very large and the integral or the sum can be very high-dimensional.

For example, suppose the latent variables $\mathbf{z}$ consists of $N$ variables, $\mathbf{z} = \{ z_1, z_2, \ldots, z_N \}$, and each variable, in the simplest scenario, is binary, $z_i \in \{ 0, 1 \}$, the number of possible values of the latent variables $\mathbf{z}$ is $2^{N}$.

Because

\begin{align} p_{\boldsymbol{\theta}}(\mathbf{x} = x) &= \sum_{\mathcal{Z}} p_{\boldsymbol{\theta}}(\mathbf{x} = x, \mathbf{z} = z) \\ &= \sum_{\mathcal{Z}} p_{\boldsymbol{\theta}}(\mathbf{x} = x | \mathbf{z} = z) p_{\boldsymbol{\theta}}(\mathbf{z} = z) \end{align}

Even though $p_{\boldsymbol{\theta}}(\mathbf{x} = x | \mathbf{z} = z)$, $p_{\boldsymbol{\theta}}(\mathbf{z} = z)$, $p_{\boldsymbol{\theta}}(\mathbf{x} = x, \mathbf{z} = z)$ are simple to compute given the latent variables $\mathbf{z} = z$, $p_{\boldsymbol{\theta}}(\mathbf{x} = x)$ contains $2^{N}$ terms and computing $p_{\boldsymbol{\theta}}(\mathbf{x} = x)$ still requires summing over $2^{N}$ terms, which is intractable.

Therefore, unless the summation or the integral has a closed-form solution, computing the marginal distribution $p_{\boldsymbol{\theta}}(\mathbf{x})$ is usually intractable and infeasible using the summation or the integral when the latent variables $\mathbf{z}$ are involved.

Evaluating the marginal distribution $p_{\boldsymbol{\theta}}(\mathbf{x})$ is already so difficult, how could we optimize the probabilistic models with latent variables? There are approximate inference algorithms that can be used to approximate the marginal distribution $p_{\boldsymbol{\theta}}(\mathbf{x})$. Among them, the Expectation Maximization (EM) algorithm is one of the most widely used algorithms.

## Expectation Maximization Algorithm

Instead of performing the optimization directly on the marginal distribution $p_{\boldsymbol{\theta}}(\mathbf{x})$, the Expectation Maximization (EM) algorithm performs the optimization on the joint distribution $p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z})$.

### Expectation Maximization Algorithm

The Expectation Maximization (EM) algorithm is an iterative optimization algorithm that alternates between two steps: the Expectation (E) step and the Maximization (M) step.

In the Expectation (E) step, the algorithm computes the expected value of the log-likelihood of the joint distribution $p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z})$ given the observed random variables $\mathbf{x}$ and the current probabilistic parameters $\boldsymbol{\theta}$.

\begin{align} Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}}) &= \mathbb{E}_{\mathbf{z} \sim p_{\boldsymbol{\theta}^{\text{old}}}(\mathbf{z} | \mathbf{x})} \left[ \log p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}) \right] \\ &= \sum_{\mathcal{Z}} p_{\boldsymbol{\theta}^{\text{old}}}(\mathbf{z} | \mathbf{x}) \log p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}) \end{align}

In the Maximization (M) step, the algorithm computes the probabilistic parameters $\boldsymbol{\theta}$ that maximizes the expected value of the log-likelihood of the joint distribution $p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z})$ given the observed random variables $\mathbf{x}$ and the latent variables $\mathbf{z}$.

\begin{align} \boldsymbol{\theta}^{\text{new}} &= \arg\max_{\boldsymbol{\theta}} Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}}) \end{align}

The Expectation Maximization (EM) algorithm iterates between the Expectation (E) step and the Maximization (M) step until the change in the probabilistic parameters $\boldsymbol{\theta}$ is small enough.

### Why Expectation Maximization Algorithm Works?

The Expectation Maximization (EM) algorithm optimizes the probabilistic parameters $\boldsymbol{\theta}$ by maximizing $Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}})$ rather than directly maximizing the expected value of the log-likelihood of the marginal distribution $p_{\boldsymbol{\theta}}(\mathbf{x})$. However, we could show that improving $Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}})$ in each iteration also improves the expected value of the log-likelihood of the marginal distribution $p_{\boldsymbol{\theta}}(\mathbf{x})$.

Proof

\begin{align} \log p_{\boldsymbol{\theta}}(\mathbf{x}) &= \log \int_{\mathcal{Z}} p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}) d\mathbf{z} \\ &= \log \int_{\mathcal{Z}} p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}) \frac{p_{\boldsymbol{\theta}^{\text{old}}}(\mathbf{z} | \mathbf{x})}{p_{\boldsymbol{\theta}^{\text{old}}}(\mathbf{z} | \mathbf{x})} d\mathbf{z} \\ &= \log \mathbb{E}_{\mathbf{z} \sim p_{\boldsymbol{\theta}^{\text{old}}}(\mathbf{z} | \mathbf{x})} \left[ \frac{p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z})}{p_{\boldsymbol{\theta}^{\text{old}}}(\mathbf{z} | \mathbf{x})} \right] \\ &\geq \mathbb{E}_{\mathbf{z} \sim p_{\boldsymbol{\theta}^{\text{old}}}(\mathbf{z} | \mathbf{x})} \left[ \log \frac{p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z})}{p_{\boldsymbol{\theta}^{\text{old}}}(\mathbf{z} | \mathbf{x})} \right] \\ &= \mathbb{E}_{\mathbf{z} \sim p_{\boldsymbol{\theta}^{\text{old}}}(\mathbf{z} | \mathbf{x})} \left[ \log p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}) \right] - \mathbb{E}_{\mathbf{z} \sim p_{\boldsymbol{\theta}^{\text{old}}}(\mathbf{z} | \mathbf{x})} \left[ \log p_{\boldsymbol{\theta}^{\text{old}}}(\mathbf{z} | \mathbf{x}) \right] \\ &= Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}}) - \mathbb{E}_{\mathbf{z} \sim p_{\boldsymbol{\theta}^{\text{old}}}(\mathbf{z} | \mathbf{x})} \left[ \log p_{\boldsymbol{\theta}^{\text{old}}}(\mathbf{z} | \mathbf{x}) \right] \end{align}

Note that we have applied the Jensen’s inequality in the proof because the log function is concave and $\mathbb{E}_{\mathbf{z} \sim p_{\boldsymbol{\theta}^{\text{old}}}(\mathbf{z} | \mathbf{x})} \left[ \log p_{\boldsymbol{\theta}^{\text{old}}}(\mathbf{z} | \mathbf{x}) \right]$ does not depend on $\boldsymbol{\theta}$.

Because $\log p_{\boldsymbol{\theta}}(\mathbf{x}) \geq Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}}) - \mathbb{E}_{\mathbf{z} \sim p_{\boldsymbol{\theta}^{\text{old}}}(\mathbf{z} | \mathbf{x})} \left[ \log p_{\boldsymbol{\theta}^{\text{old}}}(\mathbf{z} | \mathbf{x}) \right]$, maximizing $Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}})$ in each iteration also improves the expected value of the log-likelihood of the marginal distribution $p_{\boldsymbol{\theta}}(\mathbf{x})$ by the same amount.

### Expectation Maximization As Maximization - Maximization Procedure

The EM algorithm can be viewed as two alternating maximization steps.

An arbitrary distribution $q(\mathbf{z})$ is introduced.

\begin{align} \log p_{\boldsymbol{\theta}}(\mathbf{x}) &= \log p_{\boldsymbol{\theta}}(\mathbf{x}) \int_{\mathcal{Z}} q(\mathbf{z}) d\mathbf{z} \\ &= \int_{\mathcal{Z}} q(\mathbf{z}) \log p_{\boldsymbol{\theta}}(\mathbf{x}) d\mathbf{z} \\ &= \int_{\mathcal{Z}} q(\mathbf{z}) \log \frac{p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z})}{p_{\boldsymbol{\theta}}(\mathbf{z} | \mathbf{x})} d\mathbf{z} \\ &= \int_{\mathcal{Z}} q(\mathbf{z}) \log \left( \frac{p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z})}{q(\mathbf{z})} \frac{q(\mathbf{z})}{p_{\boldsymbol{\theta}}(\mathbf{z} | \mathbf{x})} \right) d\mathbf{z} \\ &= \int_{\mathcal{Z}} q(\mathbf{z}) \log \frac{p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z})}{q(\mathbf{z})} d\mathbf{z} + \int_{\mathcal{Z}} q(\mathbf{z}) \log \frac{q(\mathbf{z})}{p_{\boldsymbol{\theta}}(\mathbf{z} | \mathbf{x})} d\mathbf{z} \\ &= \mathbb{E}_{\mathbf{z} \sim q(\mathbf{z})} \left[ \log \frac{p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z})}{q(\mathbf{z})} \right] + \mathbb{E}_{\mathbf{z} \sim q(\mathbf{z})} \left[ \log \frac{q(\mathbf{z})}{p_{\boldsymbol{\theta}}(\mathbf{z} | \mathbf{x})} \right] \\ &= \mathbb{E}_{\mathbf{z} \sim q(\mathbf{z})} \left[ \log \frac{p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z})}{q(\mathbf{z})} \right] - \mathbb{E}_{\mathbf{z} \sim q(\mathbf{z})} \left[ \log \frac{p_{\boldsymbol{\theta}}(\mathbf{z} | \mathbf{x})}{q(\mathbf{z})} \right] \\ &= \text{ELBO}(q, \boldsymbol{\theta}) + D_{\text{KL}}(q(\mathbf{z}) || p_{\boldsymbol{\theta}}(\mathbf{z} | \mathbf{x})) \end{align}

where $\text{ELBO}(q, \boldsymbol{\theta})$ is the evidence lower bound (ELBO) and $D_{\text{KL}}(q(\mathbf{z}) || p_{\boldsymbol{\theta}}(\mathbf{z} | \mathbf{x}))$ is the Kullback-Leibler divergence between the distribution $q(\mathbf{z})$ and the posterior distribution $p_{\boldsymbol{\theta}}(\mathbf{z} | \mathbf{x})$.

Because the KL divergence is always non-negative, the evidence lower bound (ELBO) $\text{ELBO}(q, \boldsymbol{\theta})$ is a lower bound of the log-likelihood of the marginal distribution $p_{\boldsymbol{\theta}}(\mathbf{x})$. When the distribution $q(\mathbf{z})$ is the posterior distribution $p_{\boldsymbol{\theta}}(\mathbf{z} | \mathbf{x})$, the KL divergence $D_{\text{KL}}(q(\mathbf{z}) || p_{\boldsymbol{\theta}}(\mathbf{z} | \mathbf{x}))$ is zero and the ELBO is equal to the log-likelihood of the marginal distribution $p_{\boldsymbol{\theta}}(\mathbf{x})$.

This is corresponding to the Expectation (E) step of the EM algorithm where we set $q(\mathbf{z}) = p_{\boldsymbol{\theta}^{\text{old}}}(\mathbf{z} | \mathbf{x})$.

\begin{align} \log p_{\boldsymbol{\theta}}(\mathbf{x}) &= \text{ELBO}(p_{\boldsymbol{\theta}^{\text{old}}}(\mathbf{z} | \mathbf{x}), \boldsymbol{\theta}) \\ &= \mathbb{E}_{\mathbf{z} \sim p_{\boldsymbol{\theta}^{\text{old}}}(\mathbf{z} | \mathbf{x})} \left[ \log \frac{p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z})}{p_{\boldsymbol{\theta}^{\text{old}}}(\mathbf{z} | \mathbf{x})} \right] \\ &= \mathbb{E}_{\mathbf{z} \sim p_{\boldsymbol{\theta}^{\text{old}}}(\mathbf{z} | \mathbf{x})} \left[ \log p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}) \right] - \mathbb{E}_{\mathbf{z} \sim p_{\boldsymbol{\theta}^{\text{old}}}(\mathbf{z} | \mathbf{x})} \left[ \log p_{\boldsymbol{\theta}^{\text{old}}}(\mathbf{z} | \mathbf{x}) \right] \end{align}

To maximize $\log p_{\boldsymbol{\theta}}(\mathbf{x})$ with respect to $\boldsymbol{\theta}$, we maximize the $\text{ELBO}(q, \boldsymbol{\theta})$ with respect to $\boldsymbol{\theta}$.

\begin{align} \boldsymbol{\theta}^{\text{new}} &= \arg\max_{\boldsymbol{\theta}} \log p_{\boldsymbol{\theta}}(\mathbf{x}) \\ &= \arg\max_{\boldsymbol{\theta}} \text{ELBO}(p_{\boldsymbol{\theta}^{\text{old}}}(\mathbf{z} | \mathbf{x}), \boldsymbol{\theta}) \\ &= \arg\max_{\boldsymbol{\theta}} \left( \mathbb{E}_{\mathbf{z} \sim p_{\boldsymbol{\theta}^{\text{old}}}(\mathbf{z} | \mathbf{x})} \left[ \log p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}) \right] - \mathbb{E}_{\mathbf{z} \sim p_{\boldsymbol{\theta}^{\text{old}}}(\mathbf{z} | \mathbf{x})} \left[ \log p_{\boldsymbol{\theta}^{\text{old}}}(\mathbf{z} | \mathbf{x}) \right] \right) \\ &= \arg\max_{\boldsymbol{\theta}} \mathbb{E}_{\mathbf{z} \sim p_{\boldsymbol{\theta}^{\text{old}}}(\mathbf{z} | \mathbf{x})} \left[ \log p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}) \right] \end{align}

This is corresponding to the Maximization (M) step of the EM algorithm.

Therefore, the EM algorithm can be viewed as two alternating maximization steps where the first step is to maximize the $\text{ELBO}(q, \boldsymbol{\theta})$ with respect to the distribution $q(\mathbf{z})$ and the second step is to maximize the $\text{ELBO}(q, \boldsymbol{\theta})$ with respect to the probabilistic parameters $\boldsymbol{\theta}$.

## FAQs

### Why Expectation Maximization Algorithm Can Be Tractable Whereas Computing Marginal Distribution Is Intractable?

One key question for understanding the Expectation Maximization (EM) algorithm is why computing $Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}})$ is usually tractable. Looking at the formula of $Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}})$ and $p_{\boldsymbol{\theta}}(\mathbf{x})$, it is not obvious why computing the former is any easier than computing the latter.

\begin{align} Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}}) &= \mathbb{E}_{\mathbf{z} \sim p_{\boldsymbol{\theta}^{\text{old}}}(\mathbf{z} | \mathbf{x})} \left[ \log p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}) \right] \\ &= \sum_{\mathcal{Z}} p_{\boldsymbol{\theta}^{\text{old}}}(\mathbf{z} | \mathbf{x}) \log p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}) \end{align}

\begin{align} p_{\boldsymbol{\theta}}(\mathbf{x}) &= \mathbb{E}_{\mathbf{z} \sim p_{\boldsymbol{\theta}}(\mathbf{z})} \left[ p_{\boldsymbol{\theta}}(\mathbf{x} | \mathbf{z}) \right] \\ &= \sum_{\mathcal{Z}} p_{\boldsymbol{\theta}}(\mathbf{x} | \mathbf{z}) p_{\boldsymbol{\theta}}(\mathbf{z}) \\ &= \sum_{\mathcal{Z}} p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}) \\ \end{align}

The number of summations to perform seems to be the same in both cases. This does suggest that computing $Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}})$ is not always tractable.

In fact, if there is no approximation being used for computing $Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}})$ and $p_{\boldsymbol{\theta}}(\mathbf{x})$, the number of summations to perform is the same in both cases. So neither of them is tractable to compute exactly.

$Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}})$ has an unbiased Monte Carlo estimate by sampling $\mathbf{z}$ from $p_{\boldsymbol{\theta}^{\text{old}}}(\mathbf{z} | \mathbf{x})$. According to the central limit theorem, we have

\begin{align} Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}}) &= \mathbb{E}_{\mathbf{z} \sim p_{\boldsymbol{\theta}^{\text{old}}}(\mathbf{z} | \mathbf{x})} \left[ \log p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}) \right] \\ &\approx \frac{1}{M} \sum_{m=1}^{M} \log p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}^{(m)}) \end{align}

where $\mathbf{z}^{(m)}$ is sampled from $p_{\boldsymbol{\theta}^{\text{old}}}(\mathbf{z} | \mathbf{x})$ and $M$ is the number of samples. This Monte Carlo estimate is usually tractable to compute because $\log p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}^{(m)})$ is usually simple and efficient to compute given $\mathbf{z}^{(m)}$.

To see why this estimate is unbiased, we have

\begin{align} \mathbb{E}_{\mathbf{z} \sim p_{\boldsymbol{\theta}^{\text{old}}}(\mathbf{z} | \mathbf{x})} \left[ \frac{1}{M} \sum_{m=1}^{M} \log p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}^{(m)}) \right] &= \frac{1}{M} \sum_{m=1}^{M} \mathbb{E}_{\mathbf{z} \sim p_{\boldsymbol{\theta}^{\text{old}}}(\mathbf{z} | \mathbf{x})} \left[ \log p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}^{(m)}) \right] \\ &= \frac{1}{M} \sum_{m=1}^{M} \mathbb{E}_{\mathbf{z} \sim p_{\boldsymbol{\theta}^{\text{old}}}(\mathbf{z} | \mathbf{x})} \left[ \log p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}) \right] \\ &= \mathbb{E}_{\mathbf{z} \sim p_{\boldsymbol{\theta}^{\text{old}}}(\mathbf{z} | \mathbf{x})} \left[ \log p_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{z}) \right] \\ \end{align}

$p_{\boldsymbol{\theta}}(\mathbf{x})$ seems to also have an unbiased Monte carlo estimate by sampling $\mathbf{z}$ from $p_{\boldsymbol{\theta}^{\text{old}}}(\mathbf{z})$, where $\boldsymbol{\theta}^{\text{old}}$ is the current probabilistic parameters. So, the following Monte Carlo estimate was proposed.

\begin{align} p_{\boldsymbol{\theta}}(\mathbf{x}) &= \mathbb{E}_{\mathbf{z} \sim p_{\boldsymbol{\theta}}(\mathbf{z})} \left[ p_{\boldsymbol{\theta}}(\mathbf{x} | \mathbf{z}) \right] \\ &\approx \frac{1}{M} \sum_{m=1}^{M} p_{\boldsymbol{\theta}}(\mathbf{x} | \mathbf{z}^{(m)}) \end{align}

where $\mathbf{z}^{(m)}$ is sampled from $p_{\boldsymbol{\theta}^{\text{old}}}(\mathbf{z})$.

If this is the case, computing $p_{\boldsymbol{\theta}}(\mathbf{x})$ can also be tractable using the Monte Carlo estimate. However, as we will see, this Monte Carlo estimate is actually biased.

\begin{align} \mathbb{E}_{\mathbf{z} \sim p_{\boldsymbol{\theta}^{\text{old}}}(\mathbf{z})} \left[ \frac{1}{M} \sum_{m=1}^{M} p_{\boldsymbol{\theta}}(\mathbf{x} | \mathbf{z}^{(m)}) \right] &= \frac{1}{M} \sum_{m=1}^{M} \mathbb{E}_{\mathbf{z} \sim p_{\boldsymbol{\theta}^{\text{old}}}(\mathbf{z})} \left[ p_{\boldsymbol{\theta}}(\mathbf{x} | \mathbf{z}^{(m)}) \right] \\ &= \frac{1}{M} \sum_{m=1}^{M} \mathbb{E}_{\mathbf{z} \sim p_{\boldsymbol{\theta}^{\text{old}}}(\mathbf{z})} \left[ p_{\boldsymbol{\theta}}(\mathbf{x} | \mathbf{z}) \right] \\ &= \mathbb{E}_{\mathbf{z} \sim p_{\boldsymbol{\theta}^{\text{old}}}(\mathbf{z})} \left[ p_{\boldsymbol{\theta}}(\mathbf{x} | \mathbf{z}) \right] \\ &\neq \mathbb{E}_{\mathbf{z} \sim p_{\boldsymbol{\theta}}(\mathbf{z})} \left[ p_{\boldsymbol{\theta}}(\mathbf{x} | \mathbf{z}) \right] \\ \end{align}

The difference is very subtle. It is easy to overlook and cause the people to think this Monte Carlo estimate is unbiased. Before $\boldsymbol{\theta}$ is updated, $\boldsymbol{\theta} = \boldsymbol{\theta}^{\text{old}}$, and $\mathbb{E}_{\mathbf{z} \sim p_{\boldsymbol{\theta}^{\text{old}}}(\mathbf{z})} \left[ p_{\boldsymbol{\theta}}(\mathbf{x} | \mathbf{z}) \right] = \mathbb{E}_{\mathbf{z} \sim p_{\boldsymbol{\theta}}(\mathbf{z})} \left[ p_{\boldsymbol{\theta}}(\mathbf{x} | \mathbf{z}) \right]$. However, once $\boldsymbol{\theta}$ is updated, no matter how subtle the difference is, $\boldsymbol{\theta} \neq \boldsymbol{\theta}^{\text{old}}$, and $\mathbb{E}_{\mathbf{z} \sim p_{\boldsymbol{\theta}^{\text{old}}}(\mathbf{z})} \left[ p_{\boldsymbol{\theta}}(\mathbf{x} | \mathbf{z}) \right] \neq \mathbb{E}_{\mathbf{z} \sim p_{\boldsymbol{\theta}}(\mathbf{z})} \left[ p_{\boldsymbol{\theta}}(\mathbf{x} | \mathbf{z}) \right]$. This means optimizing $\boldsymbol{\theta}$ for $\mathbb{E}_{\mathbf{z} \sim p_{\boldsymbol{\theta}^{\text{old}}}(\mathbf{z})} \left[ p_{\boldsymbol{\theta}}(\mathbf{x} | \mathbf{z}) \right]$ is actually different from optimizing $\boldsymbol{\theta}$ for $\mathbb{E}_{\mathbf{z} \sim p_{\boldsymbol{\theta}}(\mathbf{z})} \left[ p_{\boldsymbol{\theta}}(\mathbf{x} | \mathbf{z}) \right]$.

If it’s still unclear why $\mathbb{E}_{\mathbf{z} \sim p_{\boldsymbol{\theta}^{\text{old}}}(\mathbf{z})} \left[ p_{\boldsymbol{\theta}}(\mathbf{x} | \mathbf{z}) \right] \neq \mathbb{E}_{\mathbf{z} \sim p_{\boldsymbol{\theta}}(\mathbf{z})} \left[ p_{\boldsymbol{\theta}}(\mathbf{x} | \mathbf{z}) \right]$. Let’s assume $\mathbf{z}$ are discrete variables and we use gradient ascent to optimize $\boldsymbol{\theta}$.

\begin{align} \mathbb{E}_{\mathbf{z} \sim p_{\boldsymbol{\theta}^{\text{old}}}(\mathbf{z})} \left[ p_{\boldsymbol{\theta}}(\mathbf{x} | \mathbf{z}) \right] &= \sum_{\mathcal{Z}} p_{\boldsymbol{\theta}^{\text{old}}}(\mathbf{z}) p_{\boldsymbol{\theta}}(\mathbf{x} | \mathbf{z}) \\ \end{align}

\begin{align} \mathbb{E}_{\mathbf{z} \sim p_{\boldsymbol{\theta}}(\mathbf{z})} \left[ p_{\boldsymbol{\theta}}(\mathbf{x} | \mathbf{z}) \right] &= \sum_{\mathcal{Z}} p_{\boldsymbol{\theta}}(\mathbf{z}) p_{\boldsymbol{\theta}}(\mathbf{x} | \mathbf{z}) \\ \end{align}

The gradient of $\mathbb{E}_{\mathbf{z} \sim p_{\boldsymbol{\theta}^{\text{old}}}(\mathbf{z})} \left[ p_{\boldsymbol{\theta}}(\mathbf{x} | \mathbf{z}) \right]$ with respect to $\boldsymbol{\theta}$ is

\begin{align} \nabla_{\boldsymbol{\theta}} \mathbb{E}_{\mathbf{z} \sim p_{\boldsymbol{\theta}^{\text{old}}}(\mathbf{z})} \left[ p_{\boldsymbol{\theta}}(\mathbf{x} | \mathbf{z}) \right] &= \sum_{\mathcal{Z}} \nabla_{\boldsymbol{\theta}} \left( p_{\boldsymbol{\theta}^{\text{old}}}(\mathbf{z}) p_{\boldsymbol{\theta}}(\mathbf{x} | \mathbf{z}) \right) \\ &= \sum_{\mathcal{Z}} p_{\boldsymbol{\theta}^{\text{old}}}(\mathbf{z}) \nabla_{\boldsymbol{\theta}} p_{\boldsymbol{\theta}}(\mathbf{x} | \mathbf{z}) \\ \end{align}

The gradient of $\mathbb{E}_{\mathbf{z} \sim p_{\boldsymbol{\theta}}(\mathbf{z})} \left[ p_{\boldsymbol{\theta}}(\mathbf{x} | \mathbf{z}) \right]$ with respect to $\boldsymbol{\theta}$ is

\begin{align} \nabla_{\boldsymbol{\theta}} \mathbb{E}_{\mathbf{z} \sim p_{\boldsymbol{\theta}}(\mathbf{z})} \left[ p_{\boldsymbol{\theta}}(\mathbf{x} | \mathbf{z}) \right] &= \sum_{\mathcal{Z}} \nabla_{\boldsymbol{\theta}} \left( p_{\boldsymbol{\theta}}(\mathbf{z}) p_{\boldsymbol{\theta}}(\mathbf{x} | \mathbf{z}) \right) \\ &= \sum_{\mathcal{Z}} \left( \nabla_{\boldsymbol{\theta}} p_{\boldsymbol{\theta}} (\mathbf{z}) \right) p_{\boldsymbol{\theta}}(\mathbf{x} | \mathbf{z}) + p_{\boldsymbol{\theta}}(\mathbf{z}) \nabla_{\boldsymbol{\theta}} p_{\boldsymbol{\theta}}(\mathbf{x} | \mathbf{z}) \\ \end{align}

Even if initially $\boldsymbol{\theta} = \boldsymbol{\theta}^{\text{old}}$, the gradients are completely different.

Finally, as mentioned in the article “Marginal Likelihood Estimation”, the naive Monte Carlo estimate for $p_{\boldsymbol{\theta}}(\mathbf{x})$ by sampling from the prior distribution $p_{\boldsymbol{\theta}}(\mathbf{z})$ can have a very high variance, making it an inferior estimator for optimizing $\boldsymbol{\theta}$. The importance sampling Monte Carlo estimate can reduce the variance, but it requires the knowledge of the proposal distribution to be close to the true posterior distribution, which is usually not the case before the model has been optimized well.

Thus, all of these things suggest that we cannot use the Monte Carlo estimate to approximate $p_{\boldsymbol{\theta}}(\mathbf{x})$ for optimizing $\boldsymbol{\theta}$.

Therefore, computing $Q(\boldsymbol{\theta}, \boldsymbol{\theta}^{\text{old}})$ can be tractable using the unbiased Monte Carlo estimate approximations, whereas computing $p_{\boldsymbol{\theta}}(\mathbf{x})$ is usually intractable because the simple Monte Carlo estimate approximations are biased.

Lei Mao

04-15-2024

04-15-2024