In addition to clock rate and core count, cache size is another key attribute of CPU performance. For example, Intel's server-grade Xeon CPUs usually have a lower maximum clock rate than the flagship desktop-grade Core i9 CPUs, but Xeon CPUs usually have more cores and much larger caches. As a result, Xeon CPUs typically perform much better on multi-threaded applications than Core i9 CPUs from the same generation, and the Xeon CPU price is of course much higher.
Although we usually cannot control how the CPU caches main memory, there are certain heuristics or rules that the CPU follows when caching. Users therefore have to ensure that their programs are written in a fashion that is friendly to the CPU cache. If the CPU cache behavior is aligned with what the program intends to do, the program can achieve good performance.
In this blog post, I would like to borrow Scott Meyers’s CPU cache false sharing example to demonstrate the importance of implementation details for CPU caching and program performance.
CPU Cache
The CPU specifications can be obtained using the lscpu command in Linux. In this case, my Intel Core i9-9900K CPU has a 32 KB L1 data cache, a 32 KB L1 instruction cache, and a 256 KB L2 cache for each CPU core, plus a 16 MB L3 cache shared across all the CPU cores. (The lscpu output below reports the sums across all 8 cores: 256 KiB L1d, 256 KiB L1i, and 2 MiB L2.)
$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         39 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  16
  On-line CPU(s) list:   0-15
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz
    CPU family:          6
    Model:               158
    Thread(s) per core:  2
    Core(s) per socket:  8
    Socket(s):           1
    Stepping:            12
    CPU max MHz:         5000.0000
    CPU min MHz:         800.0000
    BogoMIPS:            7200.00
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d arch_capabilities
Virtualization features:
  Virtualization:        VT-x
Caches (sum of all):
  L1d:                   256 KiB (8 instances)
  L1i:                   256 KiB (8 instances)
  L2:                    2 MiB (8 instances)
  L3:                    16 MiB (1 instance)
NUMA:
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-15
Vulnerabilities:
  Itlb multihit:         KVM: Mitigation: VMX disabled
  L1tf:                  Not affected
  Mds:                   Mitigation; Clear CPU buffers; SMT vulnerable
  Meltdown:              Not affected
  Mmio stale data:       Mitigation; Clear CPU buffers; SMT vulnerable
  Retbleed:              Mitigation; IBRS
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl and seccomp
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; IBRS, IBPB conditional, RSB filling
  Srbds:                 Mitigation; Microcode
  Tsx async abort:       Mitigation; TSX disabled
L1 cache is much faster than L2 cache, L2 cache is much faster than L3 cache, and L3 cache is much faster than main memory. The access latency of each cache level is related to its physical distance from the CPU core.
CPU Cache Rules
There are a few rules that CPU caching follows. Understanding these rules will help us write cache-friendly code; a short example after the list below illustrates the effect.
Caches consist of lines, each holding multiple adjacent words from main memory.
If a word is already in the cache, it is read from the cache instead of main memory, so the read is much faster. Overwriting the word in the cache will eventually result in the word being overwritten in main memory.
If a word is not in the cache, the full cache line that contains the word is read from main memory, so the read is slow.
Hardware speculatively prefetches cache lines. That is to say, when the CPU is reading consecutive addresses, the next cache line might already be in the cache while the current cache line is still being read.
If the same cache line is cached by multiple caches (belonging to different CPU cores) and any copy gets overwritten by one thread, all the other copies become invalidated and have to be re-read from main memory after the write has propagated back to main memory.
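To make these rules concrete, here is a minimal, hypothetical sketch (not taken from Scott Meyers's materials) that sums the same buffer twice, once sequentially and once with a cache-line-sized stride. Both loops read exactly the same elements, but on typical hardware the strided version is noticeably slower because every access lands on a different cache line and the hardware prefetcher cannot hide the misses as effectively.

#include <chrono>
#include <cstddef>
#include <iostream>
#include <vector>

int main()
{
    // 16M ints (64 MB), larger than the 16 MB L3 cache.
    constexpr size_t num_elements{1U << 24};
    // 16 ints * 4 bytes = 64 bytes, one cache line per access.
    constexpr size_t stride{16};
    std::vector<int> data(num_elements, 1);

    long long sum{0};

    // Sequential traversal: consecutive words share cache lines and the
    // prefetcher can stay ahead of the loop.
    auto const t0 = std::chrono::steady_clock::now();
    for (size_t i{0}; i < num_elements; ++i)
    {
        sum += data[i];
    }
    auto const t1 = std::chrono::steady_clock::now();

    // Strided traversal: the same number of reads, but every access touches
    // a different cache line.
    for (size_t offset{0}; offset < stride; ++offset)
    {
        for (size_t i{offset}; i < num_elements; i += stride)
        {
            sum += data[i];
        }
    }
    auto const t2 = std::chrono::steady_clock::now();

    std::cout << "Sequential: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count()
              << " ms, Strided: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count()
              << " ms, checksum: " << sum << std::endl;
}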
False Sharing
The following example is an implementation of Scott Meyers's false sharing pseudocode example from his CppCon 2014 presentation "CPU Caches and Why You Care". It demonstrates how the same cache line gets invalidated across multiple caches in almost every iteration, which significantly compromises the performance of the multi-threaded application. The fix to the problem is also simple and elegant.
std::vector<size_t> results(num_threads, 0);
for (size_t i{0}; i < num_threads; ++i)
{
    workers.emplace_back(
        [&, i]()
        {
            size_t const start_pos{i * trunk_size};
            size_t const end_pos{
                std::min((i + 1) * trunk_size, num_elements)};
            for (size_t j{start_pos}; j < end_pos; ++j)
            {
                if (mat.data()[j] % 2 != 0)
                {
                    // False sharing
                    // The results array is shared across multiple threads.
                    // A consecutive piece of the array that contains
                    // results[i] is cached by the core running each thread,
                    // so multiple cores hold copies of the same cache line.
                    // Writing to the array invalidates all the other copies
                    // of that cache line, and those cores have to re-read
                    // the updated line from main memory and re-cache it.
                    // Because this happens on almost every iteration, it
                    // slows down the performance significantly.
                    ++results[i];
                }
            }
        });
}
for (int i{0}; i < num_threads; ++i)
{
    workers[i].join();
}
size_t num_odd_values{0};
for (int i{0}; i < num_threads; ++i)
{
    num_odd_values += results[i];
}
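The scalable version of the traversal is not reproduced here verbatim; the following is a minimal sketch, assuming the same mat, workers, results, num_threads, trunk_size, and num_elements variables as above, of the fix described below: each thread accumulates into a local counter and touches the shared results array only once at the very end.

std::vector<size_t> results(num_threads, 0);
for (size_t i{0}; i < num_threads; ++i)
{
    workers.emplace_back(
        [&, i]()
        {
            size_t const start_pos{i * trunk_size};
            size_t const end_pos{
                std::min((i + 1) * trunk_size, num_elements)};
            // Accumulate into a thread-local counter so that the cache line
            // holding results is not written on every iteration.
            size_t num_odd_values_local{0};
            for (size_t j{start_pos}; j < end_pos; ++j)
            {
                if (mat.data()[j] % 2 != 0)
                {
                    ++num_odd_values_local;
                }
            }
            // Single write to the shared array at the very end.
            results[i] = num_odd_values_local;
        });
}
for (int i{0}; i < num_threads; ++i)
{
    workers[i].join();
}

Because the shared cache line holding results is now written once per thread instead of once per odd element, the copies cached by the other cores are almost never invalidated.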
We can see from the latency measurements that, with false sharing, the 8-thread traversal performs so poorly that it is almost no better than the single-thread traversal. By removing almost all the false sharing, except for the final write to the output array, the 8-thread traversal becomes significantly faster and approaches its theoretical scaling.
$ g++ false_sharing.cpp -o false_sharing -lpthread -std=c++14
$ ./false_sharing
Single-Thread Row-Major Traversal Latency: 16.840 ms
Single-Thread Column-Major Traversal Latency: 20.210 ms
8-Thread Row-Major Non-Scalable Traversal Latency: 16.520 ms
8-Thread Row-Major Scalable Traversal Latency: 2.740 ms
CPU Cache vs GPU Cache
There are a lot of similarities between CPU cache and GPU cache.
For example, to improve GPU memory IO from global memory, we want the memory accesses to be coalesced so that all the entries in a fetched cache line can be used by the GPU threads.
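As a hypothetical CUDA illustration (not from the original post), the two copy kernels below move the same data, but only the first one is coalesced: consecutive threads in a warp read consecutive addresses, so every global memory line that is fetched is fully consumed, while the strided kernel wastes most of each line.

#include <cstddef>

// Coalesced: thread k of each warp reads element k of a contiguous segment,
// so one global memory transaction serves the whole warp.
__global__ void copy_coalesced(float const* input, float* output, size_t n)
{
    size_t const idx{blockIdx.x * blockDim.x + threadIdx.x};
    if (idx < n)
    {
        output[idx] = input[idx];
    }
}

// Strided: consecutive threads read addresses far apart, so most of each
// line fetched from global memory is wasted.
__global__ void copy_strided(float const* input, float* output, size_t n,
                             size_t stride)
{
    size_t const idx{(blockIdx.x * blockDim.x + threadIdx.x) * stride};
    if (idx < n)
    {
        output[idx] = input[idx];
    }
}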
In addition, there can also be false sharing on the GPU if many threads read from and write to global memory in every iteration. To fix the false sharing issue on the GPU, similar to the solution for the CPU, we use local variables (registers) or shared memory to store the intermediate values, and only write them to global memory once at the end of the algorithm.
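A minimal CUDA sketch of that pattern, using a hypothetical kernel that counts odd values: each thread accumulates into a register and updates the global counter with a single atomicAdd at the end, instead of writing to global memory in every iteration.

#include <cstddef>

__global__ void count_odd(int const* data, size_t n, unsigned long long* count)
{
    size_t const idx{blockIdx.x * blockDim.x + threadIdx.x};
    size_t const stride{gridDim.x * blockDim.x};
    // Accumulate in a per-thread register (local variable) first.
    unsigned long long local_count{0};
    for (size_t i{idx}; i < n; i += stride)
    {
        if (data[i] % 2 != 0)
        {
            ++local_count;
        }
    }
    // Write to global memory only once per thread at the end.
    atomicAdd(count, local_count);
}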