Median-of-Medians Selection Algorithm
Introduction
Although the name of the median-of-medians selection algorithm suggests the algorithm finds the median of an unsorted array by finding the median of the medians from each subarray, the algorithm is actually an $O(N)$ selection algorithm for finding the $i$-th smallest element in an unsorted array by finding the approximate median of the array for partition. The median of medians from each subarray is not necessary the median of the entire array, but it is a good enough pivot for partitioning the array into two subarrays, one of which contains the $i$-th smallest element.
In this blog post, I would like to quickly discuss the median-of-medians selection algorithm, its asympotoic complexity analysis, and its implementation in C++.
Prerequisite
Master Theorem
The master theorem is useful for the asymptotic analysis for divide-and-conquer algorithms. It states that the runtime of the divide-and-conquer algorithm could usually be described as
$$
\begin{align}
T(N) &=
\begin{cases}
\Theta(1) & \text{if $N = 1$}\\
aT(\frac{N}{b}) + f(N) & \text{if $N > 1$}\\
\end{cases} \\
\end{align}
$$
where $a \geq 1$ and $b > 1$.
The master theorem states that
- If $f(N) = O(N^{\log_{b} a - \epsilon})$ for some constant $\epsilon > 0$, then $T(N) = \Theta(N^{\log_{b} a})$.
- If $f(N) = \Theta(N^{\log_{b} a})$, then $T(N) = \Theta(N^{\log_{b} a} \lg N)$.
- If $f(N) = \Omega(N^{\log_{b} a + \epsilon})$ for some constant $\epsilon > 0$, and if $aT(\frac{N}{b}) \leq cf(N)$ for some constant $c < 1$ and all sufficiently large N, then $T(N) = \Theta(f(N))$.
Median-of-Medians Algorithm
The median-of-medians selection algorithm solves the selection problem, i.e., finding the $i$-th smallest element in an unsorted array. The core idea is if we can find a good enough pivot and its location in the array if sorted, we can partition the array into two subarrays, one of which contains the $i$-th smallest element. We already know that given a pivot a partition algorithm can partition the array into two subarrays in $O(N)$ time. For any array of size $N$, if we can find a pivot that is approximately the median of the array, we would only have to recursively partition the array roughly $O(\log N)$ times to find the $i$-th smallest element. Of course, finding the approximate median of the array should not take more than $O(N)$ time. Otherwise, why not just sort the array and find the $i$-th smallest element in $O(N \log N)$ time?
Algorithm Analysis
To find the approximate median of an unsorted array, we can use the median-of-medians algorithm. The idea is to divide the array into subarrays of certain size $k$, sort each subarray in constant time (because the subarray size is always a constant), and find the median of each subarray. Then, we recursively find the approximate median of the medians to approximate median of the entire array. The approximate median usually has some theoretical guarantees that it is close to the median of the entire array.
For example, as illustrated in the figure below, the array can be divided into subarrays of size $k = 5$ recursively. If the entire array size is 5, we would find the approximate median of the array in the fractional range of $[p_{1}, 1-p_{1}]$ where $p_{1} = \frac{2}{5}$, which is exactly the median of the array. If the entire array size is $5^2 = 25$, we would find the approximate median of the array in the fractional range of $[p_{2}, 1-p_{2}]$ where $p_{2} = p_{1} \times \frac{3}{5} + \frac{2}{5} \times \frac{1}{5^{1}} = \frac{2}{5} \times \frac{3}{5} + \frac{2}{5} \times \frac{1}{5} = \frac{8}{25}$. If the entire array size is $5^3 = 125$, we would find the approximate median of the array in the fractional range of $[p_{3}, 1-p_{3}]$ where $p_{3} = p_{2} \times \frac{3}{5} + \frac{2}{5} \times \frac{1}{5^{2}} = \frac{8}{25} \times \frac{3}{5} + \frac{2}{5} \times \frac{1}{25} = \frac{26}{125}$. If the entire array size is $5^{i}$, we would find the approximate median of the array in the fractional range of $[p_{i}, 1-p_{i}]$ where $p_{i} = p_{i-1} \times \frac{3}{5} + \frac{2}{5} \times \frac{1}{5^{i-1}} = p_{i-1} \times \frac{3}{5} + \frac{2}{5^{i}}$.
Let’s further analyze exactly the value of $p_{i}$ given $i$.
$$
\begin{align}
p_{i} &= p_{i-1} \times \frac{3}{5} + \frac{2}{5^{i}} \\
&= \left( p_{i-2} \times \frac{3}{5} + \frac{2}{5^{i-1}} \right) \times \frac{3}{5} + \frac{2}{5^{i}} \\
&= p_{i-2} \times \left( \frac{3}{5} \right)^{2} + \frac{2 + 2 \times 3}{5^{i}} \\
&= \left( p_{i-3} \times \frac{3}{5} + \frac{2}{5^{i-2}} \right) \times \left( \frac{3}{5} \right)^{2} + \frac{2 + 2 \times 3}{5^{i}} \\
&= p_{i-3} \times \left( \frac{3}{5} \right)^{3} + \frac{2 + 2 \times 3 + 2 \times 3^{2}}{5^{i}} \\
&= \cdots \\
&= p_{1} \times \left( \frac{3}{5} \right)^{i-1} + \frac{2 + 2 \times 3 + 2 \times 3^{2} + \cdots + 2 \times 3^{i-2}}{5^{i}} \\
&= \frac{2}{5} \times \left( \frac{3}{5} \right)^{i-1} + \frac{\frac{2\left(1 - 3^{i-1}\right)}{1 - 3}}{5^{i}} \\
&= \frac{2}{5} \times \left( \frac{3}{5} \right)^{i-1} + \frac{3^{i-1} - 1}{5^{i}} \\
&= \frac{2}{5} \times \left( \frac{3}{5} \right)^{i-1} + \left(\frac{3}{5}\right)^{i-1} \frac{1}{5} - \frac{1}{5^{i}} \\
&= \left(\frac{3}{5}\right)^{i} - \frac{1}{5^{i}} \\
\end{align}
$$
In general, when the array can be divided into subarrays of size $k$ recursively, we have
$$
\begin{align}
p_{i} &= p_{i-1} \times \frac{\lceil \frac{k}{2} \rceil}{k} + \frac{\lceil \frac{k}{2} \rceil - 1}{k^{i}} \\
&= \left( p_{i-2} \times \frac{\lceil \frac{k}{2} \rceil}{k} + \frac{\lceil \frac{k}{2} \rceil - 1}{k^{i-1}} \right) \times \frac{\lceil \frac{k}{2} \rceil}{k} + \frac{\lceil \frac{k}{2} \rceil - 1}{k^{i}} \\
&= p_{i-2} \times \left( \frac{\lceil \frac{k}{2} \rceil}{k} \right)^{2} + \frac{\lceil \frac{k}{2} \rceil - 1 + (\lceil \frac{k}{2} \rceil - 1) \times \lceil \frac{k}{2} \rceil}{k^{i}} \\
&= \left( p_{i-3} \times \frac{\lceil \frac{k}{2} \rceil}{k} + \frac{\lceil \frac{k}{2} \rceil - 1}{k^{i-2}} \right) \times \left( \frac{\lceil \frac{k}{2} \rceil}{k} \right)^{2} + \frac{\lceil \frac{k}{2} \rceil - 1 + (\lceil \frac{k}{2} \rceil - 1) \times \lceil \frac{k}{2} \rceil + \cdots + (\lceil \frac{k}{2} \rceil - 1) \times \lceil \frac{k}{2} \rceil^{i-3}}{k^{i}} \\
&= p_{i-3} \times \left( \frac{\lceil \frac{k}{2} \rceil}{k} \right)^{3} + \frac{\lceil \frac{k}{2} \rceil - 1 + (\lceil \frac{k}{2} \rceil - 1) \times \lceil \frac{k}{2} \rceil + \cdots + (\lceil \frac{k}{2} \rceil - 1) \times \lceil \frac{k}{2} \rceil^{i-2}}{k^{i}} \\
&= \cdots \\
&= p_{1} \times \left( \frac{\lceil \frac{k}{2} \rceil}{k} \right)^{i-1} + \frac{\lceil \frac{k}{2} \rceil - 1 + (\lceil \frac{k}{2} \rceil - 1) \times \lceil \frac{k}{2} \rceil + \cdots + (\lceil \frac{k}{2} \rceil - 1) \times \lceil \frac{k}{2} \rceil^{i-2}}{k^{i}} \\
&= \frac{\lceil \frac{k}{2} \rceil - 1}{k} \times \left( \frac{\lceil \frac{k}{2} \rceil}{k} \right)^{i-1} + \frac{\frac{\left( \lceil \frac{k}{2} \rceil - 1 \right)\left(1 - \lceil \frac{k}{2} \rceil^{i-1}\right)}{1 - \lceil \frac{k}{2} \rceil}}{k^{i}} \\
&= \frac{\lceil \frac{k}{2} \rceil - 1}{k} \times \left( \frac{\lceil \frac{k}{2} \rceil}{k} \right)^{i-1} + \frac{\lceil \frac{k}{2} \rceil^{i-1} - 1}{k^{i}} \\
&= \frac{\lceil \frac{k}{2} \rceil - 1}{k} \times \left( \frac{\lceil \frac{k}{2} \rceil}{k} \right)^{i-1} + \left(\frac{\lceil \frac{k}{2} \rceil}{k}\right)^{i-1} \frac{1}{k} - \frac{1}{k^{i}} \\
&= \left(\frac{\lceil \frac{k}{2} \rceil}{k}\right)^{i} - \frac{1}{k^{i}} \\
\end{align}
$$
We could see that when $i$ is large, i.e., when the array size $N$ is large because $i = \lceil \log_{k}{N} \rceil$, $p_{i}$ is very small and the approximate of the median of the array can be extremely poor in the worst case.
This suggests that using the median-of-medians algorithm to find the approximate median of the array might not be a good idea when the array size is large.
Nevertheless, the recursive divide-and-conquer median-of-medians algorithm can be analyzed using the master theorem to show that the algorithm is $O(N)$.
In our case,
$$
\begin{align}
T(N) &=
\begin{cases}
\Theta(1) & \text{if $N \leq k$}\\
k T(\frac{N}{k}) + \Theta(1) & \text{if $N > k$}\\
\end{cases} \\
\end{align}
$$
Because $T(N) = aT(\frac{N}{b}) + f(N)$, $a = k$, $b = k$, $\log_{b} {a} = 1$, $\epsilon = k$, $f(N) = O(N^{\log_{b} {a} - \epsilon})$, the master theorem tells us that $T(N) = \Theta(N^{\log_{b} {a}}) = \Theta(N)$ for the median-of-medians algorithm.
Algorithm Implementation
The median-of-medians algorithm can be implemented in C++ as follows.
1 |
|
From the test result below, we can see that the approximate median of the array using the median-of-medians algorithm is 102, which is close to the median of the array using sorting, which is 100.
1 | $ g++ approximate_median.cpp -o approximate_median --std=c++14 |
Median-of-Medians Selection Algorithm
The median-of-medians selection algorithm is a selection algorithm that combines the $O(N)$ partition algorithm and the $O(N)$ median-of-medians algorithm to find the $i$-th smallest element in an unsorted array. Concretely, we find the approximate median of the array using the median-of-medians algorithm, using the approximate median as the pivot to partition the array into two subarrays and we would know how many elements are there in the left and right subarrays after partition. If the number of elements in the left subarray is less than $i$, we would recursively find the $i$-th smallest element in the right subarray. If the number of elements in the left subarray is greater than $i$, we would recursively find the $i$-th smallest element in the left subarray. If the number of elements in the left subarray is equal to $i$, we would return the pivot as the $i$-th smallest element.
Algorithm Analysis
We will analyze the median-of-medians selection algorithm in the worst case. In the worst case, the approximate median that’s found using the median-of-medians algorithm will always be at the boundary of the fractional range and the $i$-th smallest element would never occur until all the elements in the entire array are excluded from the partition. This means that each time we partition the array, we only exclude the theoretical lower-bound number of elements from the entire array for finding the median.
Based on the analysis of the median-of-medians algorithm, given an array of size $N$, we know that the median-of-medians algorithm is $O(N)$ and the maximum number of remained elements after exclusion using partition is $N(1-p)$ where $p = \left(\frac{\lceil \frac{k}{2} \rceil}{k}\right)^{i} - \frac{1}{k^{i}}$ in $(0, 1)$ and $i = \lceil \log_{k}{N} \rceil$.
So to find the most difficult $i$-th smallest element in the worst case, we have
$$
\begin{align}
N(1 - p_{1})(1 - p_{2})\cdots(1 - p_{i})\cdots(1 - p_{n}) & = 1
\end{align}
$$
We need to know how many iterations $n$ we have to run the median-of-medians selection algorithm to find the $i$-th smallest element in the worst case. It is, however, difficult to solve the equation and compute the exact value for $n$ because $p_{i}$ is a function of $k$ and $N$ and this $N$ is subject to change in each iteration.
We can, however, make a bold assumption that $p_{i}$ is a constant $p$ where $p \leq p_{i}$ for any $i$ and solve the equation to find the value for $n$. Such $p$ can just be $p = \left(\frac{\lceil \frac{k}{2} \rceil}{k}\right)^{\lceil \log_{k}{N} \rceil} - \frac{1}{k^{\lceil \log_{k}{N} \rceil}}$ in $(0, 1)$.
Therefore, the number of iterations we have run the median-of-medians selection algorithm to find the $i$-th smallest element in the worst case can be easily solved using the following equation.
$$
\begin{align}
N(1 - p)^{n} & = 1 \\
\end{align}
$$
The asymptotic complexity of the median-of-medians selection algorithm can be computed as follows.
$$
\begin{align}
T(N) &= \sum_{i=1}^{n} \left( \underbrace{O\left(N (1 - p)^{i-1}\right)}_{\text{median of medians}} + \underbrace{O\left(N (1 - p)^{i-1}\right)}_{\text{partition}} \right) \\
&= \sum_{i=1}^{n} O\left(N (1 - p)^{i-1}\right) \\
&= O\left(N \sum_{i=1}^{n} (1 - p)^{i-1}\right) \\
&= O\left(N \frac{1 \left( 1- (1-p)^{n} \right)}{1 - \left(1 - p\right)} \right) \\
&= O\left(N \frac{1 - (1-p)^{n}}{p} \right) \\
&= O\left(\frac{N - N(1-p)^{n}}{p} \right) \\
&= O\left(\frac{N - 1}{p} \right) \\
&= O(N) \\
\end{align}
$$
Therefore, even if we have assumed everything in the worst scenario, the median-of-medians selection algorithm is still $O(N)$. It is also important to note that the asymptotic constant for the median-of-medians selection algorithm can be very large as you could probably feel from the analysis.
Algorithm Implementation
The median-of-medians selection algorithm can be implemented in C++ as follows.
1 |
|
From the test result below, we can see that the $i$-th smallest element in the array using the median-of-medians selection algorithm is 100, which is exactly the same as the $i$-th smallest element in the array using the sorting selection algorithm.
1 | $ g++ selection.cpp -o selection --std=c++14 |
Conclusions
The median-of-medians selection algorithm is an $O(N)$ selection algorithm for finding the $i$-th smallest element in an unsorted array. However, its asymptotic constant can be very large and the performance of the algorithm implementation can be less satistactory in practice.
References
Median-of-Medians Selection Algorithm
https://leimao.github.io/blog/Median-of-Medians-Select-Algorithm/