# CPU TopK Algorithm

## Introduction

TopK algorithms are used for finding the top $K$ largest or smallest elements in an unsorted array. They are useful in many applications such as information retrieval, data mining, and machine learning.

There are usually two commonly seen TopK algorithms for CPU. The first one is an $O(N + K \log N)$ TopK algorithm which uses the $O(N)$ heap building algorithm for building a min-heap or a max-heap from the array followed by an $O(K \log N)$ heap extraction operation. The second one is an $O(N)$ TopK algorithm which uses the $O(N)$ median-of-medians selection algorithm for finding the $K$-th order statistics in the array followed by an $O(N)$ linear scan or an $O(N)$ partition algorithm.

In this blog post, I would like to quickly discuss these two CPU TopK algorithms.

## Heap Building TopK Algorithm

### Algorithm Analysis

In my previous article “Heap Building Algorithm”, I discussed the $O(N)$ heap building algorithm for building a min-heap or a max-heap from an unsorted array. The next step is to extract the top $K$ largest or smallest elements from the heap. The heap property states that the key stored in each node is either greater than or equal to or less than or equal to the keys in the node’s children, according to some total order. For a min-heap, the key stored in the root node is the smallest, and for a max-heap, the key stored in the root node is the largest. To retrieve the top $K$ elements, we cannot just simply take the first $K$ elements from the min-heap or max-heap arrays because the heap property does not theoretically guarantee that the top $K$ elements are the first $K$ elements in the array.

For example, the following heap is a valid min-heap.

But its top $K$ smallest elements are not always the first $K$ elements in the array. Its array representation is [1, 2, 5, 3, 4]. When $K = 3$, and the top 3 smallest elements are 1, 2, 3, which are not the first 3 elements in the array.

To extract the top $K$ largest or smallest elements from a min-heap or a max-heap, we need to perform the heap extraction operations by removing the root node from the heap and then recursively heapify the heap $K$ times. Each such operation takes $O(\log N)$ time, and the total time complexity for extracting $K$ elements is $O(K \log N)$.

Therefore, the total time complexity for the heap building TopK algorithm is $O(N + K \log N)$. Notice that the top $K$ elements we extracted are always in sorted order because of the heap property. But this is sometimes useless in the TopK problem because we only need the top $K$ elements, not the top $K$ elements in sorted order. Because of this unwanted side effect, the heap building TopK algorithm is not a linear time algorithm.

### Algorithm Implementation

To extract the top $K$ largest or smallest elements from a min-heap or a max-heap, we can use the std::make_heap for heap building and std::pop_heap and the vector pop_back functions for heap root removal in C++.

Sometimes, we also need to get the top $K$ element indices from the original array. In this case, when we build the heap, we can use a vector of pairs to store the original array element values and their indices. We can then use the std::make_heap and std::pop_heap functions with custom comparison functions to build the heap and extract the top $K$ element indices.

The output of the program is as follows.

Notice that the top $K$ elements we extracted are always in sorted order.

## Median-of-Medians Selection TopK Algorithm

### Algorithm Analysis

In my previous article “Median-of-Medians Selection Algorithm”, I discussed the $O(N)$ median-of-medians selection algorithm for finding the $K$-th order statistics in an unsorted array. The next step is to partition the array into two parts, one part containing the top $K$ largest elements and the other part containing the remaining elements. The partition algorithm takes $O(N)$ time, and the total time complexity for the median-of-medians selection TopK algorithm is $O(N)$.

This time, the top $K$ elements we extracted are not always in sorted order. The the median-of-medians selection TopK algorithm is a true linear time algorithm.

### Algorithm Implementation

The implementation of the median-of-medians selection TopK algorithm in C++ is derived from the median-of-medians selection algorithm implemented in my previous article “Median-of-Medians Selection Algorithm”. To partition the array into two parts using the order statistics obtained from the median-of-medians selection algorithm, we can use the std::partition function in C++.

The output of the program is as follows.

Notice that if the original array is mutable, there is no need to compute the top $K$ element indices as the partition algorithm can directly partition the array into two parts, one part containing the top $K$ largest elements and the other part containing the remaining elements.

Lei Mao

03-01-2024

03-01-2024