CPU Cache False Sharing

Introduction

In addition to CPU clock rate and core count, the CPU cache is another key attribute for CPU performance. For example, an Intel server-grade Xeon CPU usually has a lower maximum clock rate than the flagship Intel desktop-grade Core i9 CPU, but the Xeon usually has more cores and a much larger cache. As a result, Intel Xeon CPUs often deliver much better performance for multi-threaded applications than Intel Core i9 CPUs from the same generation, and the Intel Xeon CPU price is of course much higher.

Although we usually cannot control how the CPU caches data from main memory, there are certain heuristics or rules that the CPU follows when caching. Users therefore have to ensure that their programs are written in a fashion that is friendly to the CPU cache. If the CPU cache behavior aligns with what the program intends to do, good performance can be achieved.

In this blog post, I would like to borrow Scott Meyers’s CPU cache false sharing example to demonstrate the importance of implementation details for CPU caching and program performance.

CPU Cache

The CPU specifications can be obtained using the lscpu command in Linux. In this case, my Intel Core i9-9900K CPU has a 32 KiB L1 data cache, a 32 KiB L1 instruction cache, and a 256 KiB L2 cache for each CPU core, plus a 16 MiB L3 cache shared across all the CPU cores (lscpu reports the L1 and L2 sizes summed over the 8 instances).

$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Vendor ID: GenuineIntel
Model name: Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz
CPU family: 6
Model: 158
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 1
Stepping: 12
CPU max MHz: 5000.0000
CPU min MHz: 800.0000
BogoMIPS: 7200.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mc
a cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss
ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art
arch_perfmon pebs bts rep_good nopl xtopology nonstop_
tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cp
l vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid ss
e4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes
xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_f
ault epb invpcid_single ssbd ibrs ibpb stibp tpr_shadow
vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust
bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap cl
flushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm
ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp
md_clear flush_l1d arch_capabilities
Virtualization features:
Virtualization: VT-x
Caches (sum of all):
L1d: 256 KiB (8 instances)
L1i: 256 KiB (8 instances)
L2: 2 MiB (8 instances)
L3: 16 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-15
Vulnerabilities:
Itlb multihit: KVM: Mitigation: VMX disabled
L1tf: Not affected
Mds: Mitigation; Clear CPU buffers; SMT vulnerable
Meltdown: Not affected
Mmio stale data: Mitigation; Clear CPU buffers; SMT vulnerable
Retbleed: Mitigation; IBRS
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
and seccomp
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer
sanitization
Spectre v2: Mitigation; IBRS, IBPB conditional, RSB filling
Srbds: Mitigation; Microcode
Tsx async abort: Mitigation; TSX disabled

The L1 cache is much faster than the L2 cache, the L2 cache is much faster than the L3 cache, and the L3 cache is much faster than main memory. The access latency of each cache level is related to its physical distance from the CPU core.

CPU Cache Rules

There are a few rules that CPU caching follows. Understanding these rules helps us write cache-friendly code; a small sketch that queries the cache line size is shown right after the list.

  • Caches consist of cache lines, each holding multiple adjacent words from main memory.
  • If a word is already in the cache, the word will be read from the cache instead of main memory, which is much faster. Overwriting the word in the cache will eventually result in the word being overwritten in main memory.
  • If a word is not in the cache, the full cache line that contains the word will be read from main memory, which is slow.
  • Hardware speculatively prefetches cache lines. That is to say, the next cache line might already be in the cache while the CPU is still reading the current cache line during iterative reads.
  • If the same cache line is cached by multiple caches (belonging to different CPU cores) and one thread overwrites its copy, the copies held by the other caches are invalidated and have to be read from main memory again after the write has been propagated.
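
To make the first rule concrete, the following minimal sketch, assuming Linux/glibc and a C++17 compiler (it is not part of the original example, and the file name cache_line_size.cpp is made up for this post), queries the cache line size from the OS and from the standard library.

cache_line_size.cpp
#include <iostream>
#include <new>
#include <unistd.h>

int main()
{
    // Cache line size as reported by the OS (Linux-specific query).
    long const os_line_size{sysconf(_SC_LEVEL1_DCACHE_LINESIZE)};
    std::cout << "L1d cache line size (sysconf): " << os_line_size << " bytes"
              << std::endl;
#ifdef __cpp_lib_hardware_interference_size
    // Compile-time hint provided by the standard library since C++17.
    std::cout << "std::hardware_destructive_interference_size: "
              << std::hardware_destructive_interference_size << " bytes"
              << std::endl;
#endif
    std::cout << "ints per cache line: " << os_line_size / sizeof(int)
              << std::endl;
}

The sketch can be compiled with g++ cache_line_size.cpp -o cache_line_size -std=c++17. On typical x86-64 CPUs such as the i9-9900K above, the cache line size is 64 bytes, which means 16 consecutive int values share a single cache line.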

False Sharing

The following example is an implementation of Scott Meyers’s false sharing pseudocode from his presentation “CPU Caches and Why You Care” at CppCon 2014. It demonstrates how the same cache line gets invalidated across multiple caches in almost every iteration, which significantly compromises the performance of the multi-threaded application. The fix to the problem is also simple and elegant.

false_sharing.cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <iomanip>
#include <iostream>
#include <random>
#include <thread>
#include <vector>

template <typename T>
class Matrix
{
public:
Matrix(size_t m, size_t n) : m_num_rows{m}, m_num_cols{n}, m_data(m * n){};
T& operator()(size_t i, size_t j) noexcept
{
return m_data[i * m_num_cols + j];
}
T operator()(size_t i, size_t j) const noexcept
{
return m_data[i * m_num_cols + j];
}
// Enable mat[i][j]
T* operator[](size_t i) noexcept { return &m_data[i * m_num_cols]; }
T const* operator[](size_t i) const noexcept
{
return &m_data[i * m_num_cols];
}
size_t get_num_rows() const noexcept { return m_num_rows; };
size_t get_num_cols() const noexcept { return m_num_cols; };
T* data() noexcept { return m_data.data(); }
T const* data() const noexcept { return m_data.data(); }

private:
size_t m_num_rows;
size_t m_num_cols;
std::vector<T> m_data;
};

template <class T>
float measure_performance(std::function<T(void)> bound_function,
int num_repeats = 100, int num_warmups = 100)
{
for (int i{0}; i < num_warmups; ++i)
{
bound_function();
}

std::chrono::steady_clock::time_point time_start{
std::chrono::steady_clock::now()};
for (int i{0}; i < num_repeats; ++i)
{
bound_function();
}
std::chrono::steady_clock::time_point time_end{
std::chrono::steady_clock::now()};

auto const time_elapsed{
std::chrono::duration_cast<std::chrono::milliseconds>(time_end -
time_start)
.count()};
float const latency{time_elapsed / static_cast<float>(num_repeats)};

return latency;
}

void random_initialize_matrix(Matrix<int>& mat, unsigned int seed)
{
size_t const num_rows{mat.get_num_rows()};
size_t const num_cols{mat.get_num_cols()};
std::default_random_engine e(seed);
std::uniform_int_distribution<int> uniform_dist(-1024, 1024);
for (size_t i{0}; i < num_rows; ++i)
{
for (size_t j{0}; j < num_cols; ++j)
{
mat[i][j] = uniform_dist(e);
}
}
}

size_t count_odd_values_row_major(Matrix<int> const& mat)
{
size_t num_odd_values{0};
size_t const num_rows{mat.get_num_rows()};
size_t const num_cols{mat.get_num_cols()};
for (size_t i{0}; i < num_rows; ++i)
{
for (size_t j{0}; j < num_cols; ++j)
{
if (mat[i][j] % 2 != 0)
{
++num_odd_values;
}
}
}
return num_odd_values;
}

size_t count_odd_values_column_major(Matrix<int> const& mat)
{
size_t num_odd_values{0};
size_t const num_rows{mat.get_num_rows()};
size_t const num_cols{mat.get_num_cols()};
for (size_t j{0}; j < num_cols; ++j)
{
for (size_t i{0}; i < num_rows; ++i)
{
if (mat[i][j] % 2 != 0)
{
++num_odd_values;
}
}
}
return num_odd_values;
}

size_t
multi_thread_count_odd_values_row_major_non_scalable(Matrix<int> const& mat,
size_t num_threads)
{
std::vector<std::thread> workers{};
workers.reserve(num_threads);
size_t const num_rows{mat.get_num_rows()};
size_t const num_cols{mat.get_num_cols()};
size_t const num_elements{num_rows * num_cols};
size_t const trunk_size{(num_elements + num_threads - 1) / num_threads};

std::vector<size_t> results(num_threads, 0);
for (size_t i{0}; i < num_threads; ++i)
{
workers.emplace_back(
[&, i]()
{
size_t const start_pos{i * trunk_size};
size_t const end_pos{
std::min((i + 1) * trunk_size, num_elements)};
for (size_t j{start_pos}; j < end_pos; ++j)
{
if (mat.data()[j] % 2 != 0)
{
// False sharing:
// The results array is shared across multiple threads.
// The cache line that contains results[i] is cached by
// every core whose thread touches a nearby entry.
// Whenever one thread writes to its entry, the copies of
// that cache line held by the other cores are invalidated
// and have to be re-read from main memory, which slows
// the program down significantly.
++results[i];
}
}
});
}
for (int i{0}; i < num_threads; ++i)
{
workers[i].join();
}

size_t num_odd_values{0};
for (int i{0}; i < num_threads; ++i)
{
num_odd_values += results[i];
}

return num_odd_values;
}

size_t multi_thread_count_odd_values_row_major_scalable(Matrix<int> const& mat,
size_t num_threads)
{
std::vector<std::thread> workers{};
workers.reserve(num_threads);
size_t const num_rows{mat.get_num_rows()};
size_t const num_cols{mat.get_num_cols()};
size_t const num_elements{num_rows * num_cols};
size_t const trunk_size{(num_elements + num_threads - 1) / num_threads};

std::vector<size_t> results(num_threads, 0);
for (size_t i{0}; i < num_threads; ++i)
{
workers.emplace_back(
[&, i]()
{
size_t count = 0;
size_t const start_pos{i * trunk_size};
size_t const end_pos{
std::min((i + 1) * trunk_size, num_elements)};
for (size_t j{start_pos}; j < end_pos; ++j)
{
if (mat.data()[j] % 2 != 0)
{
++count;
}
}
results[i] = count;
});
}
for (int i{0}; i < num_threads; ++i)
{
workers[i].join();
}

size_t num_odd_values{0};
for (int i{0}; i < num_threads; ++i)
{
num_odd_values += results[i];
}

return num_odd_values;
}

int main()
{
unsigned int const seed{0U};
int const num_repeats{100};
int const num_warmups{100};

size_t const num_threads{8};

float latency{0};

Matrix<int> mat(1000, 2000);
random_initialize_matrix(mat, seed);

assert(count_odd_values_row_major(mat) ==
count_odd_values_column_major(mat));

assert(
count_odd_values_row_major(mat) ==
multi_thread_count_odd_values_row_major_non_scalable(mat, num_threads));

assert(count_odd_values_row_major(mat) ==
multi_thread_count_odd_values_row_major_scalable(mat, num_threads));

std::function<size_t(void)> const function_1{
std::bind(count_odd_values_row_major, mat)};
std::function<size_t(void)> const function_2{
std::bind(count_odd_values_column_major, mat)};
std::function<size_t(void)> const function_3{
std::bind(multi_thread_count_odd_values_row_major_non_scalable, mat,
num_threads)};
std::function<size_t(void)> const function_4{std::bind(
multi_thread_count_odd_values_row_major_scalable, mat, num_threads)};

latency = measure_performance(function_1, num_repeats, num_warmups);
std::cout << "Single-Thread Row-Major Traversal" << std::endl;
std::cout << std::fixed << std::setprecision(3) << "Latency: " << latency
<< " ms" << std::endl;

latency = measure_performance(function_2, num_repeats, num_warmups);
std::cout << "Single-Thread Column-Major Traversal" << std::endl;
std::cout << std::fixed << std::setprecision(3) << "Latency: " << latency
<< " ms" << std::endl;

latency = measure_performance(function_3, num_repeats, num_warmups);
std::cout << num_threads << "-Thread Row-Major Non-Scalable Traversal"
<< std::endl;
std::cout << std::fixed << std::setprecision(3) << "Latency: " << latency
<< " ms" << std::endl;

latency = measure_performance(function_4, num_repeats, num_warmups);
std::cout << num_threads << "-Thread Row-Major Scalable Traversal"
<< std::endl;
std::cout << std::fixed << std::setprecision(3) << "Latency: " << latency
<< " ms" << std::endl;
}

We can see from the latency measurements that, with false sharing, the 8-thread traversal performs so poorly that it is almost no better than the single-thread traversal. By accumulating into a thread-local variable and writing to the shared results array only once at the end, almost all the false sharing is removed, and the 8-thread traversal becomes roughly six times faster than the single-thread version, much closer to the ideal 8x scaling.

$ g++ false_sharing.cpp -o false_sharing -lpthread -std=c++14
$ ./false_sharing
Single-Thread Row-Major Traversal
Latency: 16.840 ms
Single-Thread Column-Major Traversal
Latency: 20.210 ms
8-Thread Row-Major Non-Scalable Traversal
Latency: 16.520 ms
8-Thread Row-Major Scalable Traversal
Latency: 2.740 ms
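
Another common remedy for false sharing, which is not used in the example above, is to pad and align each per-thread counter so that it occupies a full cache line of its own; writes from different threads then never land on the same line. The following is a minimal sketch, assuming a C++17 compiler, where padded_counter.cpp, PaddedCounter, and cache_line_size are illustrative names rather than part of the original code.

padded_counter.cpp
#include <cstddef>
#include <iostream>
#include <new>

// Use the standard library's hint when available, otherwise fall back to the
// 64-byte cache line size that is typical on x86-64.
#ifdef __cpp_lib_hardware_interference_size
inline constexpr std::size_t cache_line_size{
    std::hardware_destructive_interference_size};
#else
inline constexpr std::size_t cache_line_size{64};
#endif

// Each counter is aligned to (and padded out to) a full cache line, so that
// incrementing one thread's counter cannot invalidate another thread's line.
struct alignas(cache_line_size) PaddedCounter
{
    std::size_t value{0};
};

static_assert(sizeof(PaddedCounter) == cache_line_size,
              "each counter occupies exactly one cache line");

int main()
{
    std::cout << "sizeof(PaddedCounter): " << sizeof(PaddedCounter)
              << " bytes" << std::endl;
}

With such a type, std::vector<size_t> results(num_threads, 0) in the non-scalable version could become std::vector<PaddedCounter> results(num_threads), and each thread could increment results[i].value directly. Accumulating into a local variable, as the scalable version does, achieves the same effect without spending extra memory on padding.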

CPU Cache vs. GPU Cache

There are a lot of similarities between CPU cache and GPU cache.

For example, to improve GPU memory IO from global memory, we want memory accesses to be coalesced so that all the entries in a fetched cache line can be used by the GPU threads.

In addition, there can also be false sharing on the GPU if many threads read from and write to global memory in every iteration. To fix false sharing on the GPU, similar to the solution for the CPU, we use local variables or shared memory to store the intermediate values, and only write from the local variables or shared memory to global memory once at the end of the algorithm.

References

  • Scott Meyers, “CPU Caches and Why You Care”, CppCon 2014.

Author

Lei Mao

Posted on

08-27-2022

Updated on

08-27-2022
