In addition to clock rate and core count, cache size is another key attribute of CPU performance. For example, Intel's server-grade Xeon CPUs usually have a lower maximum clock rate than the flagship desktop-grade Core i9 CPUs, but Xeon CPUs usually have more cores and much larger caches. As a result, Xeon CPUs typically perform much better on multi-threaded applications than Core i9 CPUs from the same generation, and the Xeon CPU price is of course much higher.
Although we usually cannot control how the CPU caches main memory, there are certain heuristics or rules that the CPU follows when caching. Users therefore have to ensure that their programs are written in a fashion that is friendly to the CPU cache. If the CPU cache behavior is aligned with what the program intends to do, the program can achieve good performance.
In this blog post, I would like to borrow Scott Meyers’s CPU cache false sharing example to demonstrate the importance of implementation details for CPU caching and program performance.
CPU Cache
The CPU specifications can be obtained using the lscpu command in Linux. In this case, my Intel Core i9-9900K CPU has a 32 KB L1 data cache, a 32 KB L1 instruction cache, and a 256 KB L2 cache for each CPU core, plus a 16 MB L3 cache shared across all the CPU cores. (The lscpu output below reports the sums across all 8 cores: 256 KiB L1d, 256 KiB L1i, and 2 MiB L2.)
$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         39 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  16
  On-line CPU(s) list:   0-15
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz
    CPU family:          6
    Model:               158
    Thread(s) per core:  2
    Core(s) per socket:  8
    Socket(s):           1
    Stepping:            12
    CPU max MHz:         5000.0000
    CPU min MHz:         800.0000
    BogoMIPS:            7200.00
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d arch_capabilities
Virtualization features:
  Virtualization:        VT-x
Caches (sum of all):
  L1d:                   256 KiB (8 instances)
  L1i:                   256 KiB (8 instances)
  L2:                    2 MiB (8 instances)
  L3:                    16 MiB (1 instance)
NUMA:
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-15
Vulnerabilities:
  Itlb multihit:         KVM: Mitigation: VMX disabled
  L1tf:                  Not affected
  Mds:                   Mitigation; Clear CPU buffers; SMT vulnerable
  Meltdown:              Not affected
  Mmio stale data:       Mitigation; Clear CPU buffers; SMT vulnerable
  Retbleed:              Mitigation; IBRS
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl and seccomp
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; IBRS, IBPB conditional, RSB filling
  Srbds:                 Mitigation; Microcode
  Tsx async abort:       Mitigation; TSX disabled
L1 cache is much faster than L2 cache, L2 cache is much faster than L3 cache, and L3 cache is much faster than main memory. The access latency of each cache level is related to its physical distance from the CPU core.
CPU Cache Rules
There are a few rules that CPU caching follows. Understanding these rules will help us write cache-friendly code; a short example after the list below illustrates the effect.
Caches consist of lines, each holding multiple adjacent words from main memory.
If a word is already in the cache, it is read from the cache instead of main memory, so the read is much faster. Overwriting the word in the cache will eventually result in the word being overwritten in main memory.
If a word is not in the cache, the full cache line that contains the word is read from main memory, so the read is slow.
Hardware speculatively prefetches cache lines. That is to say, when the CPU is reading consecutive addresses, the next cache line might already be in the cache while the current cache line is still being read.
If the same cache line is cached by multiple caches (belonging to different CPU cores) and any copy gets overwritten by one thread, all the other copies become invalidated and have to be re-read from main memory after the write has propagated back to main memory.
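To make these rules concrete, here is a minimal, hypothetical sketch (not taken from Scott Meyers's materials) that sums the same buffer twice, once sequentially and once with a cache-line-sized stride. Both loops read exactly the same elements, but on typical hardware the strided version is noticeably slower because every access lands on a different cache line and the hardware prefetcher cannot hide the misses as effectively.

#include <chrono>
#include <cstddef>
#include <iostream>
#include <vector>

int main()
{
    // 16M ints (64 MB), larger than the 16 MB L3 cache.
    constexpr size_t num_elements{1U << 24};
    // 16 ints * 4 bytes = 64 bytes, one cache line per access.
    constexpr size_t stride{16};
    std::vector<int> data(num_elements, 1);

    long long sum{0};

    // Sequential traversal: consecutive words share cache lines and the
    // prefetcher can stay ahead of the loop.
    auto const t0 = std::chrono::steady_clock::now();
    for (size_t i{0}; i < num_elements; ++i)
    {
        sum += data[i];
    }
    auto const t1 = std::chrono::steady_clock::now();

    // Strided traversal: the same number of reads, but every access touches
    // a different cache line.
    for (size_t offset{0}; offset < stride; ++offset)
    {
        for (size_t i{offset}; i < num_elements; i += stride)
        {
            sum += data[i];
        }
    }
    auto const t2 = std::chrono::steady_clock::now();

    std::cout << "Sequential: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count()
              << " ms, Strided: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count()
              << " ms, checksum: " << sum << std::endl;
}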
False Sharing
The following example is an implementation of Scott Meyers's false sharing pseudocode example from his CppCon 2014 presentation "CPU Caches and Why You Care". It demonstrates how the same cache line gets invalidated across multiple caches in almost every iteration, which significantly compromises the performance of the multi-threaded application. The fix to the problem is also simple and elegant.
std::vector<size_t> results(num_threads, 0);
for (size_t i{0}; i < num_threads; ++i)
{
    workers.emplace_back(
        [&, i]()
        {
            size_t const start_pos{i * trunk_size};
            size_t const end_pos{
                std::min((i + 1) * trunk_size, num_elements)};
            for (size_t j{start_pos}; j < end_pos; ++j)
            {
                if (mat.data()[j] % 2 != 0)
                {
                    // False sharing
                    // The results array is shared across multiple threads.
                    // A consecutive piece of the array that contains
                    // results[i] is cached by the core running each thread,
                    // so multiple cores hold copies of the same cache line.
                    // Writing to the array invalidates all the other copies
                    // of that cache line, and those cores have to re-read
                    // the updated line from main memory and re-cache it.
                    // Because this happens on almost every iteration, it
                    // slows down the performance significantly.
                    ++results[i];
                }
            }
        });
}
for (int i{0}; i < num_threads; ++i)
{
    workers[i].join();
}
size_t num_odd_values{0};
for (int i{0}; i < num_threads; ++i)
{
    num_odd_values += results[i];
}
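The scalable version of the traversal is not reproduced here verbatim; the following is a minimal sketch, assuming the same mat, workers, results, num_threads, trunk_size, and num_elements variables as above, of the fix described below: each thread accumulates into a local counter and touches the shared results array only once at the very end.

std::vector<size_t> results(num_threads, 0);
for (size_t i{0}; i < num_threads; ++i)
{
    workers.emplace_back(
        [&, i]()
        {
            size_t const start_pos{i * trunk_size};
            size_t const end_pos{
                std::min((i + 1) * trunk_size, num_elements)};
            // Accumulate into a thread-local counter so that the cache line
            // holding results is not written on every iteration.
            size_t num_odd_values_local{0};
            for (size_t j{start_pos}; j < end_pos; ++j)
            {
                if (mat.data()[j] % 2 != 0)
                {
                    ++num_odd_values_local;
                }
            }
            // Single write to the shared array at the very end.
            results[i] = num_odd_values_local;
        });
}
for (int i{0}; i < num_threads; ++i)
{
    workers[i].join();
}

Because the shared cache line holding results is now written once per thread instead of once per odd element, the copies cached by the other cores are almost never invalidated.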
We can see from the latency measurements that, with false sharing, the 8-thread traversal performs so poorly that it is almost no better than the single-thread traversal. By removing almost all the false sharing, except for the final write to the output array, the 8-thread traversal becomes significantly faster and approaches its theoretical scaling.
$ g++ false_sharing.cpp -o false_sharing -lpthread -std=c++14
$ ./false_sharing
Single-Thread Row-Major Traversal Latency: 16.840 ms
Single-Thread Column-Major Traversal Latency: 20.210 ms
8-Thread Row-Major Non-Scalable Traversal Latency: 16.520 ms
8-Thread Row-Major Scalable Traversal Latency: 2.740 ms
CPU Cache vs GPU Cache
There are a lot of similarities between CPU cache and GPU cache.
For example, to improve GPU memory IO from global memory, we want the memory accesses to be coalesced so that all the entries in a fetched cache line can be used by the GPU threads.
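As a hypothetical CUDA illustration (not from the original post), the two copy kernels below move the same data, but only the first one is coalesced: consecutive threads in a warp read consecutive addresses, so every global memory line that is fetched is fully consumed, while the strided kernel wastes most of each line.

#include <cstddef>

// Coalesced: thread k of each warp reads element k of a contiguous segment,
// so one global memory transaction serves the whole warp.
__global__ void copy_coalesced(float const* input, float* output, size_t n)
{
    size_t const idx{blockIdx.x * blockDim.x + threadIdx.x};
    if (idx < n)
    {
        output[idx] = input[idx];
    }
}

// Strided: consecutive threads read addresses far apart, so most of each
// line fetched from global memory is wasted.
__global__ void copy_strided(float const* input, float* output, size_t n,
                             size_t stride)
{
    size_t const idx{(blockIdx.x * blockDim.x + threadIdx.x) * stride};
    if (idx < n)
    {
        output[idx] = input[idx];
    }
}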
In addition, there can also be false sharing on the GPU if many threads read from and write to global memory in every iteration. To fix the false sharing issue on the GPU, similar to the solution for the CPU, we use local variables (registers) or shared memory to store the intermediate values, and only write them to global memory once at the end of the algorithm.
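A minimal CUDA sketch of that pattern, using a hypothetical kernel that counts odd values: each thread accumulates into a register and updates the global counter with a single atomicAdd at the end, instead of writing to global memory in every iteration.

#include <cstddef>

__global__ void count_odd(int const* data, size_t n, unsigned long long* count)
{
    size_t const idx{blockIdx.x * blockDim.x + threadIdx.x};
    size_t const stride{gridDim.x * blockDim.x};
    // Accumulate in a per-thread register (local variable) first.
    unsigned long long local_count{0};
    for (size_t i{idx}; i < n; i += stride)
    {
        if (data[i] % 2 != 0)
        {
            ++local_count;
        }
    }
    // Write to global memory only once per thread at the end.
    atomicAdd(count, local_count);
}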