CPU TopK Algorithm

Introduction

TopK algorithms are used for finding the top $K$ largest or smallest elements in an unsorted array. They are useful in many applications such as information retrieval, data mining, and machine learning.

There are usually two commonly seen TopK algorithms for CPU. The first one is an $O(N + K \log N)$ TopK algorithm which uses the $O(N)$ heap building algorithm for building a min-heap or a max-heap from the array followed by an $O(K \log N)$ heap extraction operation. The second one is an $O(N)$ TopK algorithm which uses the $O(N)$ median-of-medians selection algorithm for finding the $K$-th order statistics in the array followed by an $O(N)$ linear scan or an $O(N)$ partition algorithm.

In this blog post, I would like to quickly discuss these two CPU TopK algorithms.

Heap Building TopK Algorithm

Algorithm Analysis

In my previous article “Heap Building Algorithm”, I discussed the $O(N)$ heap building algorithm for building a min-heap or a max-heap from an unsorted array. The next step is to extract the top $K$ largest or smallest elements from the heap. The heap property states that the key stored in each node is either greater than or equal to or less than or equal to the keys in the node’s children, according to some total order. For a min-heap, the key stored in the root node is the smallest, and for a max-heap, the key stored in the root node is the largest. To retrieve the top $K$ elements, we cannot just simply take the first $K$ elements from the min-heap or max-heap arrays because the heap property does not theoretically guarantee that the top $K$ elements are the first $K$ elements in the array.

For example, the following heap is a valid min-heap.

1
2
3
4
5
    1
/ \
2 5
/ \
3 4

But its top $K$ smallest elements are not always the first $K$ elements in the array. Its array representation is [1, 2, 5, 3, 4]. When $K = 3$, and the top 3 smallest elements are 1, 2, 3, which are not the first 3 elements in the array.

To extract the top $K$ largest or smallest elements from a min-heap or a max-heap, we need to perform the heap extraction operations by removing the root node from the heap and then recursively heapify the heap $K$ times. Each such operation takes $O(\log N)$ time, and the total time complexity for extracting $K$ elements is $O(K \log N)$.

Therefore, the total time complexity for the heap building TopK algorithm is $O(N + K \log N)$. Notice that the top $K$ elements we extracted are always in sorted order because of the heap property. But this is sometimes useless in the TopK problem because we only need the top $K$ elements, not the top $K$ elements in sorted order. Because of this unwanted side effect, the heap building TopK algorithm is not a linear time algorithm.

Algorithm Implementation

To extract the top $K$ largest or smallest elements from a min-heap or a max-heap, we can use the std::make_heap for heap building and std::pop_heap and the vector pop_back functions for heap root removal in C++.

Sometimes, we also need to get the top $K$ element indices from the original array. In this case, when we build the heap, we can use a vector of pairs to store the original array element values and their indices. We can then use the std::make_heap and std::pop_heap functions with custom comparison functions to build the heap and extract the top $K$ element indices.

heap_building_topk.cpp
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
#include <algorithm>
#include <functional>
#include <iostream>
#include <vector>

int main()
{
std::vector<int> const v{3, 4, 1, 5, 2};
std::cout << "Original Array:" << std::endl;
for (auto const& i : v)
{
std::cout << i << " ";
}
std::cout << std::endl;

std::vector<int> min_heap{v};
std::vector<int> max_heap{v};

std::vector<std::pair<int, size_t>> min_heap_with_origin_index{};
std::vector<std::pair<int, size_t>> max_heap_with_origin_index{};
for (size_t i{0}; i < v.size(); ++i)
{
min_heap_with_origin_index.push_back({v[i], i});
max_heap_with_origin_index.push_back({v[i], i});
}

std::make_heap(min_heap.begin(), min_heap.end(), std::greater<>{});
std::make_heap(max_heap.begin(), max_heap.end(), std::less<>{});

auto const min_heap_pair_comp = [](auto const& a, auto const& b)
{ return a.first > b.first; };
auto const max_heap_pair_comp = [](auto const& a, auto const& b)
{ return a.first < b.first; };
std::make_heap(min_heap_with_origin_index.begin(),
min_heap_with_origin_index.end(), min_heap_pair_comp);
std::make_heap(max_heap_with_origin_index.begin(),
max_heap_with_origin_index.end(), max_heap_pair_comp);

size_t const k{3};

// Get the top k elements.
std::cout << "Top " << k << " Smallest Elements: ";
for (size_t i{0}; i < k; ++i)
{
std::cout << min_heap.front() << " ";
std::pop_heap(min_heap.begin(), min_heap.end(), std::greater<>{});
min_heap.pop_back();
}
std::cout << std::endl;

std::cout << "Top " << k << " Largest Elements: ";
for (size_t i{0}; i < k; ++i)
{
std::cout << max_heap.front() << " ";
std::pop_heap(max_heap.begin(), max_heap.end(), std::less<>{});
max_heap.pop_back();
}
std::cout << std::endl;

// Get the top k element indices from the original array.
std::cout << "Top " << k << " Smallest Element Indices: ";
for (size_t i{0}; i < k; ++i)
{
std::cout << min_heap_with_origin_index.front().second << " ";
std::pop_heap(min_heap_with_origin_index.begin(),
min_heap_with_origin_index.end(), min_heap_pair_comp);
min_heap_with_origin_index.pop_back();
}
std::cout << std::endl;

std::cout << "Top " << k << " Largest Element Indices: ";
for (size_t i{0}; i < k; ++i)
{
std::cout << max_heap_with_origin_index.front().second << " ";
std::pop_heap(max_heap_with_origin_index.begin(),
max_heap_with_origin_index.end(), max_heap_pair_comp);
max_heap_with_origin_index.pop_back();
}
std::cout << std::endl;

return 0;
}

The output of the program is as follows.

1
2
3
4
5
6
7
8
$ g++ heap_building_topk.cpp -o heap_building_topk -std=c++14
$ ./heap_building_topk
Original Array:
3 4 1 5 2
Top 3 Smallest Elements: 1 2 3
Top 3 Largest Elements: 5 4 3
Top 3 Smallest Element Indices: 2 4 0
Top 3 Largest Element Indices: 3 1 0

Notice that the top $K$ elements we extracted are always in sorted order.

Median-of-Medians Selection TopK Algorithm

Algorithm Analysis

In my previous article “Median-of-Medians Selection Algorithm”, I discussed the $O(N)$ median-of-medians selection algorithm for finding the $K$-th order statistics in an unsorted array. The next step is to partition the array into two parts, one part containing the top $K$ largest elements and the other part containing the remaining elements. The partition algorithm takes $O(N)$ time, and the total time complexity for the median-of-medians selection TopK algorithm is $O(N)$.

This time, the top $K$ elements we extracted are not always in sorted order. The the median-of-medians selection TopK algorithm is a true linear time algorithm.

Algorithm Implementation

The implementation of the median-of-medians selection TopK algorithm in C++ is derived from the median-of-medians selection algorithm implemented in my previous article “Median-of-Medians Selection Algorithm”. To partition the array into two parts using the order statistics obtained from the median-of-medians selection algorithm, we can use the std::partition function in C++.

selection_topk.cpp
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
#include <algorithm>
#include <array>
#include <iostream>
#include <iterator>
#include <vector>

// Find the approximate median value.
template <typename T, size_t SmallArraySize, class ForwardIt>
T median_of_medians(ForwardIt first, ForwardIt last)
{
auto const size{std::distance(first, last)};
if (size <= SmallArraySize)
{
std::array<T, SmallArraySize> v;
std::copy(first, last, std::begin(v));
std::sort(std::begin(v), std::begin(v) + size);
size_t const median_index{static_cast<size_t>(size / 2)};
return v.at(median_index);
}
std::vector<T> medians;
for (auto it{first}; it < last; it += SmallArraySize)
{
auto const sub_last{std::min(it + SmallArraySize, last)};
medians.push_back(median_of_medians<T, SmallArraySize>(it, sub_last));
}
return median_of_medians<T, SmallArraySize>(medians.begin(), medians.end());
}

template <typename T, class ForwardIt>
T custom_selection_algorithm(ForwardIt first, ForwardIt last, size_t i)
{
auto const size{std::distance(first, last)};
// Check if the size is large or equal to the order statistics index.
if (size < i)
{
throw std::out_of_range{"The order statistics is out of range."};
}
// Find the approximate median value.
auto const pivot{median_of_medians<T, 5>(first, last)};
// Partition the elements around the pivot.
auto const partition_it{
std::partition(first, last, [pivot](T const n) { return n < pivot; })};
// Calculate the number of elements less than the pivot.
auto const num_less_than_pivot{std::distance(first, partition_it)};
// Check if the pivot is the order statistics.
if (num_less_than_pivot == i)
{
return pivot;
}
// Check if the pivot is less than the order statistics.
else if (num_less_than_pivot < i)
{
return custom_selection_algorithm<T>(partition_it, last,
i - num_less_than_pivot);
}
// Check if the pivot is greater than the order statistics.
else
{
return custom_selection_algorithm<T>(first, partition_it, i);
}
}

template <typename T, class ForwardIt>
T custom_selection(ForwardIt first, ForwardIt last, size_t i)
{
// Copy the elements to a vector.
std::vector<T> v;
v.reserve(std::distance(first, last));
std::copy(first, last, std::back_inserter(v));
// Call the custom selection algorithm.
return custom_selection_algorithm<T>(v.begin(), v.end(), i);
}

int main()
{
std::vector<int> const v{3, 4, 1, 5, 2};
std::cout << "Original Array:" << std::endl;
for (auto const& i : v)
{
std::cout << i << " ";
}
std::cout << std::endl;

size_t const k{3};

// Get the top k smallest elements.
std::vector<int> v_1{v};
int kth_smallest{custom_selection<int>(v_1.begin(), v_1.end(), k)};
std::partition(v_1.begin(), v_1.end(),
[kth_smallest](int const n) { return n < kth_smallest; });
std::cout << "Top " << k << " Smallest Elements: ";
for (size_t i{0}; i < k; ++i)
{
std::cout << v_1.at(i) << " ";
}
std::cout << std::endl;

// Get the top k largest elements.
std::vector<int> v_2{v};
int kth_largest{
custom_selection<int>(v_2.begin(), v_2.end(), v_2.size() - k)};
std::partition(v_2.begin(), v_2.end(),
[kth_largest](int const n) { return n < kth_largest; });
std::cout << "Top " << k << " Largest Elements: ";
for (size_t i{0}; i < k; ++i)
{
std::cout << v_2.at(v_2.size() - 1 - i) << " ";
}
std::cout << std::endl;
}

The output of the program is as follows.

1
2
3
4
5
6
$ g++ selection_topk.cpp -o selection_topk -std=c++14
$ ./selection_topk
Original Array:
3 4 1 5 2
Top 3 Smallest Elements: 3 2 1
Top 3 Largest Elements: 3 5 4

Notice that if the original array is mutable, there is no need to compute the top $K$ element indices as the partition algorithm can directly partition the array into two parts, one part containing the top $K$ largest elements and the other part containing the remaining elements.

References

Author

Lei Mao

Posted on

03-01-2024

Updated on

03-01-2024

Licensed under


Comments