One-Pass Naive Algorithm for Computing Variance

Introduction

In many applications, we would like to compute the variance of some data efficiently and accurately.

In this blog post, I would like to discuss an efficient naive algorithm for computing variance, as well as its caveats and improvements.

One-Pass Naive Algorithm for Computing Variance

Mathematically, the variance of a random variable can be computed using the following formula.

$$
\begin{align}
\text{Var}(X) = \mathbb{E}[X^2] - \left(\mathbb{E}[X]\right)^2
\end{align}
$$

In our naive algorithm, we will compute $\mathbb{E}[X^2]$ and $\mathbb{E}[X]$ simultaneously as we scan the data, and take the difference $\mathbb{E}[X^2] - \left(\mathbb{E}[X]\right)^2$ at the end. This means we can compute the variance in a single pass through the data, which is efficient.

Concretely, suppose we have a dataset $X = \{x_1, x_2, \ldots, x_N\}$.

$$
\begin{align}
\mathbb{E}[X]
&= \frac{1}{N} \sum_{i=1}^{N} x_i \\
&= \sum_{i=1}^{N} \frac{x_i}{N} \\
\end{align}
$$

$$
\begin{align}
\mathbb{E}[X^2]
&= \frac{1}{N} \sum_{i=1}^{N} x_i^2 \\
&= \sum_{i=1}^{N} \left( \frac{x_i}{N} \right) x_i \\
\end{align}
$$
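As a concrete illustration, here is a minimal Python sketch of this one-pass accumulation. The function name and the small example dataset are just for illustration.

```python
def naive_variance_one_pass(xs):
    """One-pass naive variance: accumulate E[X] and E[X^2] in a single scan."""
    n = len(xs)
    mean = 0.0
    mean_of_squares = 0.0
    for x in xs:
        mean += x / n
        mean_of_squares += (x / n) * x
    # Var(X) = E[X^2] - (E[X])^2
    return mean_of_squares - mean * mean


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(naive_variance_one_pass(data))  # 4.0 (population variance)
```

Note that this computes the population variance; the sample variance would scale the result by $N / (N - 1)$.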

Numerical Instability

In practice, there can be numerical problems when using the aforementioned formula for computing the variance in floating point precision.

Because floating point values are not evenly distributed on a linear scale, accumulation can suffer from significant round-off errors. The larger a floating point value is, the larger the round-off error near that value can be. According to the “Floating Point Value Binary Representation”, the spacing between adjacent FP32 values near the largest representable magnitudes is $2^{-23} \times 2^{2^{8 - 1} - 1} = 2^{-23} \times 2^{127} = 2^{104} \approx 2.03 \times 10^{31}$, so a single rounding at that magnitude can introduce an error of that order.
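To get a feel for these magnitudes, we can ask NumPy for the spacing between adjacent FP32 values at a large magnitude. The specific value `3.0e38` below is just an arbitrary large FP32 value picked for illustration.

```python
import numpy as np

# Spacing (ULP) between adjacent float32 values near the top of the
# representable range: for binary exponent 127 it is 2^(127 - 23) = 2^104.
print(np.spacing(np.float32(3.0e38)))  # Roughly 2.03e31.
print(2.0 ** 104)                      # For comparison.
```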

Because of this, when we compute two quantities that are mathematically equal or very close in floating point representation, their computed difference can be far larger than the true difference, especially when the two quantities themselves are very large. Subtracting two such quantities can lead to what’s called catastrophic cancellation.

In our case, if the variance of the data is very small but the mean is very large, $\mathbb{E}[X^2]$ and $\left(\mathbb{E}[X]\right)^2$ will be two large values that are mathematically close. Therefore, computing the variance using $\text{Var}(X) = \mathbb{E}[X^2] - \left(\mathbb{E}[X]\right)^2$ can result in catastrophic cancellation. Consequently, the mathematically small variance can come out as a very large positive or even negative value, which is not acceptable. For instance, if all the values in the data are extremely large but identical, the variance should mathematically be exactly zero, yet the naive algorithm can return a very large positive or negative value. On the other hand, if the variance is truly very large, then $\mathbb{E}[X^2] \gg \left(\mathbb{E}[X]\right)^2$, and this relationship usually still holds when computing in floating point, so cancellation is not a concern in that case.
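The following sketch, assuming NumPy is available, illustrates this failure mode: the data has a huge mean and a true variance close to one, yet the naive one-pass formula accumulated in FP32 can produce a result that is far off or even negative. The exact output depends on the data and the rounding.

```python
import numpy as np

rng = np.random.default_rng(0)
# All values share a huge mean; the true variance is close to 1.
data = (1.0e7 + rng.standard_normal(10_000)).astype(np.float32)

# Naive one-pass accumulation, entirely in float32.
mean = np.float32(0.0)
mean_of_squares = np.float32(0.0)
n = np.float32(data.size)
for x in data:
    mean += x / n
    mean_of_squares += (x / n) * x

print(mean_of_squares - mean * mean)    # Can be wildly wrong, even negative.
print(np.var(data.astype(np.float64)))  # float64 reference, close to 1.
```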

Therefore, to avoid this numerical instability, we should not apply the naive algorithm to data whose variance is small and whose mean is large.

Data Shift Trick

Based on the previous analysis, we learned that the naive algorithm does not work for data whose variance is small and whose mean is large. However, notice that if all the data are shifted by a constant value, the variance mathematically remains the same.

Proof

Let $X' = X + c$ be the shifted data, where $c$ is a constant. Then we have:

$$
\begin{align}
\mathbb{E}[X'] &= \mathbb{E}[X] + c \\
\mathbb{E}[(X')^2] &= \mathbb{E}[X^2] + 2c\mathbb{E}[X] + c^2
\end{align}
$$

Substituting these into the variance formula gives:

$$
\begin{align}
\text{Var}(X') &= \mathbb{E}[(X')^2] - \left(\mathbb{E}[X']\right)^2 \\
&= \left(\mathbb{E}[X^2] + 2c\mathbb{E}[X] + c^2\right) - \left(\mathbb{E}[X] + c\right)^2 \\
&= \left(\mathbb{E}[X^2] + 2c\mathbb{E}[X] + c^2\right) - \left(\mathbb{E}[X]^2 + 2c\mathbb{E}[X] + c^2\right) \\
&= \mathbb{E}[X^2] - \mathbb{E}[X]^2 \\
&= \text{Var}(X)
\end{align}
$$

Thus, the variance remains unchanged under shifting.

This concludes the proof. $\square$

In the situation where the variance is small and the mean is large, we can simply pick any value from the data, usually the first one, as the shift value. After shifting, both the variance and the mean of the shifted data are small, so the chance of encountering numerical issues, such as getting a negative variance, is greatly reduced. In other situations where the variance is large, applying the same data shift trick is mostly harmless.

Therefore, we could use the data shift trick to improve the numerical stability of the naive algorithm for computing variance, regardless of the data distribution.

Concretely, suppose we have a dataset $X = \{x_1, x_2, \ldots, x_N\}$. We can slightly modify the naive algorithm by adding the data shift trick, using the first value $x_1$ as the shift (i.e., $c = -x_1$ in the proof above).

$$
\begin{align}
\mathbb{E}[X - x_1]
&= \frac{1}{N} \sum_{i=1}^{N} \left(x_i - x_1 \right) \\
&= \sum_{i=1}^{N} \frac{x_i - x_1}{N} \\
\end{align}
$$

$$
\begin{align}
\mathbb{E}[(X - x_1)^2]
&= \frac{1}{N} \sum_{i=1}^{N} \left(x_i - x_1 \right)^2 \\
&= \sum_{i=1}^{N} \left( \frac{x_i - x_1}{N} \right) \left(x_i - x_1 \right) \\
\end{align}
$$

The variance is then $\text{Var}(X) = \mathbb{E}[(X - x_1)^2] - \left(\mathbb{E}[X - x_1]\right)^2$, which is valid by the shift-invariance proof above. The performance of the modified algorithm will be very close to that of the original one, while being much more numerically stable.
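A minimal sketch of the shifted one-pass algorithm, reusing the same difficult dataset as in the earlier example, might look like the following. The function name is again just for illustration.

```python
import numpy as np

def shifted_variance_one_pass(xs):
    """One-pass variance with the data shift trick, accumulating in float32."""
    shift = xs[0]  # Use the first value as the shift.
    n = np.float32(xs.size)
    mean = np.float32(0.0)
    mean_of_squares = np.float32(0.0)
    for x in xs:
        d = x - shift
        mean += d / n
        mean_of_squares += (d / n) * d
    # Var(X) = E[(X - x_1)^2] - (E[X - x_1])^2
    return mean_of_squares - mean * mean


rng = np.random.default_rng(0)
data = (1.0e7 + rng.standard_normal(10_000)).astype(np.float32)
print(shifted_variance_one_pass(data))  # Close to the float64 reference.
print(np.var(data.astype(np.float64)))  # Reference, close to 1.
```

Because the shifted values and their accumulated sums stay small, the final subtraction no longer cancels catastrophically, even though every operation is still performed in FP32.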

Author

Lei Mao

Posted on

10-29-2025

Updated on

10-29-2025
