TopK algorithms are used for finding the top $K$ largest or smallest elements in an unsorted array. They are useful in many applications such as information retrieval, data mining, and machine learning.
There are usually two commonly seen TopK algorithms for CPU. The first one is an $O(N + K \log N)$ TopK algorithm which uses the $O(N)$ heap building algorithm for building a min-heap or a max-heap from the array followed by an $O(K \log N)$ heap extraction operation. The second one is an $O(N)$ TopK algorithm which uses the $O(N)$ median-of-medians selection algorithm for finding the $K$-th order statistics in the array followed by an $O(N)$ linear scan or an $O(N)$ partition algorithm.
In this blog post, I would like to quickly discuss these two CPU TopK algorithms.
Heap Building TopK Algorithm
Algorithm Analysis
In my previous article “Heap Building Algorithm”, I discussed the $O(N)$ heap building algorithm for building a min-heap or a max-heap from an unsorted array. The next step is to extract the top $K$ largest or smallest elements from the heap. The heap property states that the key stored in each node is either greater than or equal to or less than or equal to the keys in the node’s children, according to some total order. For a min-heap, the key stored in the root node is the smallest, and for a max-heap, the key stored in the root node is the largest. To retrieve the top $K$ elements, we cannot just simply take the first $K$ elements from the min-heap or max-heap arrays because the heap property does not theoretically guarantee that the top $K$ elements are the first $K$ elements in the array.
For example, the following heap is a valid min-heap.
1 2 3 4 5
1 / \ 2 5 / \ 3 4
But its top $K$ smallest elements are not always the first $K$ elements in the array. Its array representation is [1, 2, 5, 3, 4]. When $K = 3$, and the top 3 smallest elements are 1, 2, 3, which are not the first 3 elements in the array.
To extract the top $K$ largest or smallest elements from a min-heap or a max-heap, we need to perform the heap extraction operations by removing the root node from the heap and then recursively heapify the heap $K$ times. Each such operation takes $O(\log N)$ time, and the total time complexity for extracting $K$ elements is $O(K \log N)$.
Therefore, the total time complexity for the heap building TopK algorithm is $O(N + K \log N)$. Notice that the top $K$ elements we extracted are always in sorted order because of the heap property. But this is sometimes useless in the TopK problem because we only need the top $K$ elements, not the top $K$ elements in sorted order. Because of this unwanted side effect, the heap building TopK algorithm is not a linear time algorithm.
Algorithm Implementation
To extract the top $K$ largest or smallest elements from a min-heap or a max-heap, we can use the std::make_heap for heap building and std::pop_heap and the vector pop_back functions for heap root removal in C++.
Sometimes, we also need to get the top $K$ element indices from the original array. In this case, when we build the heap, we can use a vector of pairs to store the original array element values and their indices. We can then use the std::make_heap and std::pop_heap functions with custom comparison functions to build the heap and extract the top $K$ element indices.
autoconst min_heap_pair_comp = [](autoconst& a, autoconst& b) { return a.first > b.first; }; autoconst max_heap_pair_comp = [](autoconst& a, autoconst& b) { return a.first < b.first; }; std::make_heap(min_heap_with_origin_index.begin(), min_heap_with_origin_index.end(), min_heap_pair_comp); std::make_heap(max_heap_with_origin_index.begin(), max_heap_with_origin_index.end(), max_heap_pair_comp);
size_tconst k{3};
// Get the top k elements. std::cout << "Top " << k << " Smallest Elements: "; for (size_t i{0}; i < k; ++i) { std::cout << min_heap.front() << " "; std::pop_heap(min_heap.begin(), min_heap.end(), std::greater<>{}); min_heap.pop_back(); } std::cout << std::endl;
std::cout << "Top " << k << " Largest Elements: "; for (size_t i{0}; i < k; ++i) { std::cout << max_heap.front() << " "; std::pop_heap(max_heap.begin(), max_heap.end(), std::less<>{}); max_heap.pop_back(); } std::cout << std::endl;
// Get the top k element indices from the original array. std::cout << "Top " << k << " Smallest Element Indices: "; for (size_t i{0}; i < k; ++i) { std::cout << min_heap_with_origin_index.front().second << " "; std::pop_heap(min_heap_with_origin_index.begin(), min_heap_with_origin_index.end(), min_heap_pair_comp); min_heap_with_origin_index.pop_back(); } std::cout << std::endl;
std::cout << "Top " << k << " Largest Element Indices: "; for (size_t i{0}; i < k; ++i) { std::cout << max_heap_with_origin_index.front().second << " "; std::pop_heap(max_heap_with_origin_index.begin(), max_heap_with_origin_index.end(), max_heap_pair_comp); max_heap_with_origin_index.pop_back(); } std::cout << std::endl;
return0; }
The output of the program is as follows.
1 2 3 4 5 6 7 8
$ g++ heap_building_topk.cpp -o heap_building_topk -std=c++14 $ ./heap_building_topk Original Array: 3 4 1 5 2 Top 3 Smallest Elements: 1 2 3 Top 3 Largest Elements: 5 4 3 Top 3 Smallest Element Indices: 2 4 0 Top 3 Largest Element Indices: 3 1 0
Notice that the top $K$ elements we extracted are always in sorted order.
Median-of-Medians Selection TopK Algorithm
Algorithm Analysis
In my previous article “Median-of-Medians Selection Algorithm”, I discussed the $O(N)$ median-of-medians selection algorithm for finding the $K$-th order statistics in an unsorted array. The next step is to partition the array into two parts, one part containing the top $K$ largest elements and the other part containing the remaining elements. The partition algorithm takes $O(N)$ time, and the total time complexity for the median-of-medians selection TopK algorithm is $O(N)$.
This time, the top $K$ elements we extracted are not always in sorted order. The the median-of-medians selection TopK algorithm is a true linear time algorithm.
Algorithm Implementation
The implementation of the median-of-medians selection TopK algorithm in C++ is derived from the median-of-medians selection algorithm implemented in my previous article “Median-of-Medians Selection Algorithm”. To partition the array into two parts using the order statistics obtained from the median-of-medians selection algorithm, we can use the std::partition function in C++.
// Find the approximate median value. template <typename T, size_t SmallArraySize, classForwardIt> T median_of_medians(ForwardIt first, ForwardIt last) { autoconst size{std::distance(first, last)}; if (size <= SmallArraySize) { std::array<T, SmallArraySize> v; std::copy(first, last, std::begin(v)); std::sort(std::begin(v), std::begin(v) + size); size_tconst median_index{static_cast<size_t>(size / 2)}; return v.at(median_index); } std::vector<T> medians; for (auto it{first}; it < last; it += SmallArraySize) { autoconst sub_last{std::min(it + SmallArraySize, last)}; medians.push_back(median_of_medians<T, SmallArraySize>(it, sub_last)); } returnmedian_of_medians<T, SmallArraySize>(medians.begin(), medians.end()); }
template <typename T, classForwardIt> T custom_selection_algorithm(ForwardIt first, ForwardIt last, size_t i) { autoconst size{std::distance(first, last)}; // Check if the size is large or equal to the order statistics index. if (size < i) { throw std::out_of_range{"The order statistics is out of range."}; } // Find the approximate median value. autoconst pivot{median_of_medians<T, 5>(first, last)}; // Partition the elements around the pivot. autoconst partition_it{ std::partition(first, last, [pivot](T const n) { return n < pivot; })}; // Calculate the number of elements less than the pivot. autoconst num_less_than_pivot{std::distance(first, partition_it)}; // Check if the pivot is the order statistics. if (num_less_than_pivot == i) { return pivot; } // Check if the pivot is less than the order statistics. elseif (num_less_than_pivot < i) { returncustom_selection_algorithm<T>(partition_it, last, i - num_less_than_pivot); } // Check if the pivot is greater than the order statistics. else { returncustom_selection_algorithm<T>(first, partition_it, i); } }
template <typename T, classForwardIt> T custom_selection(ForwardIt first, ForwardIt last, size_t i) { // Copy the elements to a vector. std::vector<T> v; v.reserve(std::distance(first, last)); std::copy(first, last, std::back_inserter(v)); // Call the custom selection algorithm. returncustom_selection_algorithm<T>(v.begin(), v.end(), i); }
// Get the top k smallest elements. std::vector<int> v_1{v}; int kth_smallest{custom_selection<int>(v_1.begin(), v_1.end(), k)}; std::partition(v_1.begin(), v_1.end(), [kth_smallest](intconst n) { return n < kth_smallest; }); std::cout << "Top " << k << " Smallest Elements: "; for (size_t i{0}; i < k; ++i) { std::cout << v_1.at(i) << " "; } std::cout << std::endl;
// Get the top k largest elements. std::vector<int> v_2{v}; int kth_largest{ custom_selection<int>(v_2.begin(), v_2.end(), v_2.size() - k)}; std::partition(v_2.begin(), v_2.end(), [kth_largest](intconst n) { return n < kth_largest; }); std::cout << "Top " << k << " Largest Elements: "; for (size_t i{0}; i < k; ++i) { std::cout << v_2.at(v_2.size() - 1 - i) << " "; } std::cout << std::endl; }
The output of the program is as follows.
1 2 3 4 5 6
$ g++ selection_topk.cpp -o selection_topk -std=c++14 $ ./selection_topk Original Array: 3 4 1 5 2 Top 3 Smallest Elements: 3 2 1 Top 3 Largest Elements: 3 5 4
Notice that if the original array is mutable, there is no need to compute the top $K$ element indices as the partition algorithm can directly partition the array into two parts, one part containing the top $K$ largest elements and the other part containing the remaining elements.