# Exponential Moving Average

## Introduction

Exponential moving average is a neural network training trick that sometimes improves model accuracy. Concretely, instead of using the optimized parameters from the final training iteration (parameter update step) as the final model parameters, the exponential moving average of the parameters over the course of all the training iterations is used.

In this blog post, I would like to quickly discuss the exponential moving average and how it improves deep learning model generalization.

## Moving Average

Moving average is a smoothing technique commonly used to reduce noise and fluctuation in time-series data. For example, if we would like to build a model for predicting hotel room prices from historical data, applying a moving average to the data will usually help the model predict better.

The simplest form of moving average is the simple moving average (SMA), which takes the equally weighted mean of the previous $k$ data points. Concretely, given a sequence of data points $p_1, p_2, \cdots, p_t, \cdots$, the SMA over the last $k$ data points at data point $t$, denoted as $\text{SMA}_t$, is calculated as

\begin{align} \text{SMA}_t &= \frac{1}{k} \sum_{i = t - k + 1}^{t} p_i \\ \end{align}

Iteratively, $\text{SMA}_t$ can also be calculated using $\text{SMA}_{t-1}$. This is very useful for computing the SMA for online streaming data.

\begin{align} \text{SMA}_t &= \text{SMA}_{t-1} + \frac{1}{k} (p_{t} - p_{t-k}) \\ \end{align}
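As a sketch, the iterative update might look like this in Python (the function name `sma_stream` and the warm-up behavior for the first $k-1$ points are my own choices, not part of the formula above):

```python
from collections import deque

def sma_stream(points, k):
    """Yield the simple moving average over the last k points for each
    incoming point, using the iterative update
    SMA_t = SMA_{t-1} + (p_t - p_{t-k}) / k once the window is full."""
    window = deque()
    sma = 0.0
    for p in points:
        window.append(p)
        if len(window) > k:
            # Iterative update: add the newest point, drop the oldest.
            sma += (p - window.popleft()) / k
        else:
            # Warm-up: average over however many points we have so far.
            sma = sum(window) / len(window)
        yield sma
```

Because each update touches only two data points, this runs in constant time per incoming point, which is what makes it attractive for online streaming data.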

There are other moving average variants, such as the cumulative average and the exponential moving average.

Mathematically, a moving average is a convolution that computes the weighted average of a certain number of previous data points.

## Exponential Moving Average

Exponential moving average (EMA) computes the weighted mean of all the previous data points and the weights decay exponentially. Concretely, given a sequence of data points $p_1, p_2, \cdots, p_t, \cdots$, the EMA at data point $t$, denoted as $\text{EMA}_t$, can be calculated iteratively or recursively.

\begin{align} \text{EMA}_t &= \begin{cases} p_1 & \text{if } t = 1\\ \alpha p_t + (1- \alpha) \text{EMA}_{t-1} & \text{otherwise}\\ \end{cases} \\ \end{align}

where $\alpha \in (0, 1]$.
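The recursion above translates directly into code (the function name `ema` is illustrative):

```python
def ema(points, alpha):
    """Compute the EMA iteratively for each data point:
    EMA_1 = p_1; EMA_t = alpha * p_t + (1 - alpha) * EMA_{t-1}."""
    result = []
    e = None
    for p in points:
        # Base case: the EMA starts at the first data point.
        e = p if e is None else alpha * p + (1 - alpha) * e
        result.append(e)
    return result
```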

It’s not obvious from the above equation what “exponential” means here. However, if we expand it,

\begin{align} \text{EMA}_t &= \alpha \left[ p_{t} + (1-\alpha)p_{t-1} + (1-\alpha)^2 p_{t-2} + \cdots + (1-\alpha)^{t-2} p_{2} \right] + (1-\alpha)^{t-1} p_{1} \\ &\approx \alpha \left[ p_{t} + (1-\alpha)p_{t-1} + (1-\alpha)^2 p_{t-2} + \cdots (1-\alpha)^{t-2} p_{2} + (1-\alpha)^{t-1} p_{1} \right] \\ &= \alpha \sum_{i = 1}^{t} (1-\alpha)^{i-1} p_{t-i+1} \\ \end{align}

We find that the coefficient for each data point decays exponentially. The above equation can be further rearranged into a weighted moving average form.

\begin{align} \text{EMA}_t &= \alpha \sum_{i = 1}^{t} (1-\alpha)^{i-1} p_{t-i+1} \\ &= \frac{1}{\frac{1}{\alpha}} \sum_{i = 1}^{t} (1-\alpha)^{i-1} p_{t-i+1} \\ \end{align}

When $t \rightarrow \infty$, $\frac{1}{\alpha}$ is the sum of all the coefficients for the data points.

\begin{align} \frac{1}{\alpha} &= \sum_{i = 1}^{t} (1-\alpha)^{i-1} \\ \end{align}

This is because

\begin{align} \sum_{i = 1}^{t} (1-\alpha)^{i-1} &= \frac{1 (1 - (1-\alpha)^t) }{\left( 1 - \left( 1 - \alpha \right) \right)} \\ &= \frac{1 - (1-\alpha)^t }{\alpha} \\ &\stackrel{t \rightarrow \infty}{=}\frac{1}{\alpha} \\ \end{align}
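A quick numerical check of this geometric series, with illustrative values of $\alpha$ and $t$:

```python
def geometric_sum(alpha, t):
    """Partial sum of (1 - alpha)^(i - 1) for i = 1..t."""
    return sum((1 - alpha) ** (i - 1) for i in range(1, t + 1))

# The partial sum matches the closed form (1 - (1-alpha)**t) / alpha,
# and for large t it approaches 1 / alpha since (1-alpha)**t vanishes.
```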

It is now clear that EMA is just a weighted moving average whose weights decay exponentially.

## How Exponential Moving Average Improves Model Generalization

Ideally, given a labeled training dataset $\mathcal{D}$ consisting of $N$ samples, the parameters $\Theta$ of the neural network are optimized by minimizing some training loss function $L(\mathcal{D}; \Theta)$. If the neural network does not overfit the training dataset and the validation dataset is representative of the training dataset, the optimal parameters that minimize the training loss will also minimize the validation loss. In this case, we denote the parameters after optimization as $\Theta^{\ast}$, which is a local optimum.

In practice, $N$ is usually very large, so optimizing the parameters using the entire training dataset becomes very costly, especially with common neural network optimization techniques such as gradient descent. Therefore, instead of using the entire training dataset, small subsets of the training dataset, called mini-batches, are randomly sampled for training. This introduces noise into the training, which has both advantages and disadvantages.

In mini-batch gradient descent, the gradient that would be computed on the entire training dataset is estimated from the gradient on a mini-batch, so the gradient computation is much faster but also less accurate. However, the less accurate, noisy gradient can sometimes help the optimization, resulting in a better local optimum than the one found using the full dataset. This is the advantage of the training noise introduced by mini-batches.

In gradient descent, when the parameter optimization is about to converge, the gradient computed on the entire training dataset becomes smaller and smaller. With the learning rate also becoming smaller, the optimized parameters will settle into a true local optimum. However, since the gradient computed on a mini-batch is noisy and less accurate, when the optimization is about to converge, there is a chance that the optimized parameters are not at a local minimum but are fluctuating around one. This is the disadvantage of the training noise introduced by mini-batches.

A smoothing technique such as EMA can be used to reduce this parameter fluctuation noise, so the parameter EMAs are more likely to be close to a local minimum. Concretely, the optimized parameters after each update step are $\Theta_1, \Theta_2, \cdots, \Theta_t, \cdots, \Theta_n$, respectively, just like a noisy time series. This is how EMA can improve model generalization.

To remove the noise, the EMA for the optimized parameters is computed as

\begin{align} \text{EMA}_t &= \begin{cases} \Theta_1 & \text{if } t = 1\\ \alpha \Theta_t + (1- \alpha) \text{EMA}_{t-1} & \text{otherwise}\\ \end{cases} \\ \end{align}
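A minimal sketch of how this parameter EMA might be maintained during training, assuming the parameters are plain NumPy arrays keyed by name (the class name `ParameterEMA` is my own; deep learning frameworks ship their own utilities for this):

```python
import numpy as np

class ParameterEMA:
    """Maintains an exponential moving average of model parameters:
    EMA_1 = Theta_1; EMA_t = alpha * Theta_t + (1 - alpha) * EMA_{t-1}.
    Note: in practice alpha is small (e.g. 0.001), so the averaged
    parameters change slowly relative to the raw parameters; many
    implementations equivalently write this with decay = 1 - alpha."""

    def __init__(self, alpha=0.001):
        self.alpha = alpha
        self.shadow = None  # the running EMA of the parameters

    def update(self, params):
        """Call once after each optimizer step with the current parameters."""
        if self.shadow is None:
            # Base case: initialize the EMA with the first parameter values.
            self.shadow = {k: v.copy() for k, v in params.items()}
        else:
            for k, v in params.items():
                self.shadow[k] = self.alpha * v + (1 - self.alpha) * self.shadow[k]
        return self.shadow
```

At evaluation or deployment time, the `shadow` values would be loaded into the model in place of the raw final-iteration parameters.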