CuTe Swizzle

Introduction

In my article “CUDA Shared Memory Swizzling”, I discussed how to use swizzling to avoid bank conflicts when a warp of threads accesses shared memory in a strided pattern. Because the swizzle operation and its mathematical proof involve both integer and bit operations, it can be hard to understand and error-prone to implement.

CuTe provides a shared memory swizzling abstraction class, Swizzle, to simplify the shared memory swizzling implementation. Because its implementation involves only bit operations, it is more readable and easier to reason about.

In this blog post, I would like to quickly discuss the implementation of the CuTe shared memory swizzling abstraction class Swizzle and its configurations in practice.

CuTe Swizzle

CuTe Swizzle Implementation

The CuTe shared memory swizzling abstraction class Swizzle from the source code is as follows. Only three template parameters are used for the swizzle configuration: BBits, the number of bits in the mask; MBase, the number of least-significant bits to keep constant; and SShift, the distance to shift the mask. This might be obscure at first glance, so let’s walk through a quick example.

For simplicity, suppose we have a 16-bit integer offset whose value is 65 and a swizzle configuration Swizzle<5, 0, 6>. The bit representation of offset is 0b0000000001000001. Then bit_msk is 0b0000000000011111, yyy_msk is 0b0000011111000000, zzz_msk is 0b0000000000011111, and msk_sft is 6. To swizzle the offset, offset & yyy_msk{} is 0b0000000001000000, keeping only the bits in the masked region, and shiftr(offset & yyy_msk{}, msk_sft{}) is 0b0000000000000001, shifting the masked bits to the right. The final result, offset ^ shiftr(offset & yyy_msk{}, msk_sft{}), is 0b0000000001000001 xor 0b0000000000000001 = 0b0000000001000000, whose value is 64. That is, the swizzle operation Swizzle<5, 0, 6> maps the offset 65 to 64.

// A generic Swizzle functor
/* 0bxxxxxxxxxxxxxxxYYYxxxxxxxZZZxxxx
 *                               ^--^ MBase is the number of least-sig bits to keep constant
 *                  ^-^       ^-^     BBits is the number of bits in the mask
 *                    ^---------^     SShift is the distance to shift the YYY mask
 *                                      (pos shifts YYY to the right, neg shifts YYY to the left)
 *
 * e.g. Given
 * 0bxxxxxxxxxxxxxxxxYYxxxxxxxxxZZxxx
 * the result is
 * 0bxxxxxxxxxxxxxxxxYYxxxxxxxxxAAxxx where AA = ZZ xor YY
 */
template <int BBits, int MBase, int SShift = BBits>
struct Swizzle
{
  static constexpr int num_bits = BBits;
  static constexpr int num_base = MBase;
  static constexpr int num_shft = SShift;

  static_assert(num_base >= 0,             "MBase must be positive.");
  static_assert(num_bits >= 0,             "BBits must be positive.");
  static_assert(abs(num_shft) >= num_bits, "abs(SShift) must be more than BBits.");

  // using 'int' type here to avoid unintentially casting to unsigned... unsure.
  using bit_msk = cute::constant<int, (1 << num_bits) - 1>;
  using yyy_msk = cute::constant<int, bit_msk{} << (num_base + max(0,num_shft))>;
  using zzz_msk = cute::constant<int, bit_msk{} << (num_base - min(0,num_shft))>;
  using msk_sft = cute::constant<int, num_shft>;

  static constexpr uint32_t swizzle_code = uint32_t(yyy_msk{} | zzz_msk{});

  template <class Offset>
  CUTE_HOST_DEVICE constexpr static
  auto
  apply(Offset const& offset)
  {
    return offset ^ shiftr(offset & yyy_msk{}, msk_sft{});   // ZZZ ^= YYY
  }

  template <class Offset>
  CUTE_HOST_DEVICE constexpr
  auto
  operator()(Offset const& offset) const
  {
    return apply(offset);
  }

  template <int B, int M, int S>
  CUTE_HOST_DEVICE constexpr
  auto
  operator==(Swizzle<B,M,S> const&) const
  {
    return B == BBits && M == MBase && S == SShift;
  }
};

Offset Bijection

Given an integer $m$, a domain $X = [0, 2^m - 1]$, and a constant $c \in X$, the function $f: X \to X$ defined as $f(x) = x \oplus c$, where $\oplus$ is the XOR operation, is a bijection.

Proof

The XOR operation is commutative and associative, and $x \oplus x = 0$ for any $x$. Suppose there exist two values $x_1, x_2 \in X$ such that $f(x_1) = f(x_2)$. Then $x_1 \oplus c = x_2 \oplus c$, and XORing both sides with $c$ yields $x_1 = x_2$. Therefore, the function $f$ is injective. Because $f$ is an injective function from the finite set $X$ to itself, it is also surjective.

This concludes the proof. $\square$

Therefore, assuming MBase is zero and BBits $= m$, for the offsets $x$ in the range $[k \cdot 2^m, (k + 1) \cdot 2^m - 1]$ for any non-negative integer $k$, the constant $c$ used in the XOR is the same, because it is derived from bit positions no lower than $m$ (guaranteed by $|\text{SShift}| \geq \text{BBits}$) and those bits are identical across the range. Hence the swizzle operation permutes the offsets in this range bijectively.

Shared Memory 2D Layout and Shared Memory Bank Conflicts

On devices of compute capability 5.x or newer, each bank has a bandwidth of 32 bits every clock cycle, and successive 32-bit words are assigned to successive banks. So element size matters for shared memory bank conflicts.

In many use cases, the shared memory layout is a 2D row-major matrix whose row size is a multiple of 32 and whose element size is 32-bit. When a warp of threads performs strided access along a column of the matrix, severe 32-way shared memory bank conflicts will occur. If the elements in each column are mapped to different shared memory banks, the shared memory bank conflicts can be mitigated.

Assuming MBase is zero, the element size is 32-bit, and the matrix row size is $n$, we can design the swizzle operation so that accessing each column of the matrix is free of shared memory bank conflicts, based on the offset bijection property proved above. We need to configure BBits and SShift so that offset % 32 is distinct across the rows a warp accesses in a column: set BBits to $\log_2 32 = 5$ and SShift to $\log_2 n$, so that the constant $c$ used in $f(x) = x \oplus c$ is different for each row, resulting in distinct values of offset % 32 when a warp of threads accesses a column of the matrix.

Vectorized Memory Access

In CuTe, when we want to perform vectorized memory access, which is common in CUDA kernels such as matrix transpose, all the elements in the vector, from both the source and the target, must be contiguous in memory. If MBase is zero, the swizzle operation cannot guarantee contiguous memory access, because the XOR operation may modify the low bits that index into the vector. However, when we know the number of elements in the vector, we can set MBase to the log2 of that number, and as we will prove below, the swizzle operation will then guarantee contiguous memory access.

Suppose a vector holds $n$ values where $n$ is a power of 2, then all the bits of offsets $kn, kn + 1, \ldots, kn + n - 1$, where $k$ is an integer, are the same except the least significant $\log_2 n$ bits.

Proof

The least significant $\log_2 n$ bits of $kn, kn + 1, \ldots, kn + n - 1$ are $0, 1, \ldots, n - 1$, respectively. Because $n$ is a power of 2, the least significant $\log_2 n$ bits of $kn$ are all zero, so adding any $j < n$ to $kn$ never carries into the higher bits. Therefore, all the bits above the least significant $\log_2 n$ bits are identical across $kn, kn + 1, \ldots, kn + n - 1$.

This concludes the proof. $\square$

Therefore, no matter how the swizzle operation is configured, as long as MBase is set to $\log_2 n$ and the starting offset of the memory access is a multiple of $n$, contiguous memory access is guaranteed: the low $\log_2 n$ bits are never modified, and the XOR constant is identical for all elements of the vector.

If the element size is less than or equal to 32-bit, such as 1-bit, 2-bit, 4-bit, 8-bit, 16-bit, and 32-bit, we could set the number of values in the vector to be 32, 16, 8, 4, 2, and 1, respectively, and set MBase to $\log_2 32 = 5$, $\log_2 16 = 4$, $\log_2 8 = 3$, $\log_2 4 = 2$, $\log_2 2 = 1$, and $\log_2 1 = 0$, respectively, such that each vectorized memory access transaction is 32-bit. Consequently, the vector is treated as if it were a single 32-bit element, and the offset can be re-indexed by offset >>= MBase before the swizzle operation. All the swizzle properties proved above then still hold, and the swizzle configurations can be reused.

If the element size is greater than 32-bit, such as 64-bit, 128-bit, or 256-bit, we could treat each element as a vectorized memory access of multiple 32-bit elements. The offset can be re-indexed by offset <<= log2(element_size / 32) before the swizzle operation, and the swizzle configurations can be reused. This, however, has the consequence that accessing a column of the matrix can never be completely free of shared memory bank conflicts, because the image of the shared memory bank ids offset % 32 becomes smaller than $[0, 32)$. By the pigeonhole principle, there will always be at least an (element_size / 32)-way shared memory bank conflict.

Examples

Let’s see a few more complicated examples in which the data type is not of 32-bit size and vectorized memory access is used.

Suppose we have an INT8 8 x 128 row-major matrix, and we want to use 128-bit vectorized memory access. In this case, there are 16 elements in a vector, and MBase should be set to $\log_2 16 = 4$. Because there are 32 shared memory banks of 32-bit size, each 32-bit word contains 4 elements, so BBits should be set to $\log_2 (32 \times 4) - \text{MBase} = 7 - 4 = 3$. The SShift is set to $\log_2 128 - \text{MBase} = 7 - 4 = 3$ to ensure the constant $c$ used for the XOR operation is different for each row. Therefore, the swizzle configuration is Swizzle<3, 4, 3>.

Suppose we have an FP16 8 x 64 row-major matrix, and we want to use 128-bit vectorized memory access. In this case, there are 8 elements in a vector, and MBase should be set to $\log_2 8 = 3$. Because there are 32 shared memory banks of 32-bit size, each 32-bit word contains 2 elements, so BBits should be set to $\log_2 (32 \times 2) - \text{MBase} = 6 - 3 = 3$. The SShift is set to $\log_2 64 - \text{MBase} = 6 - 3 = 3$ to ensure the constant $c$ used for the XOR operation is different for each row. Therefore, the swizzle configuration is Swizzle<3, 3, 3>.

CuTe Swizzle Preview

The shared memory bank ids of CuTe swizzled layout can be previewed using the CuTe Swizzle Preview App I created. The app saves the shared memory bank ids of the swizzled layout to a LaTeX file, and the LaTeX file can be compiled to a PDF file for previewing.

For example, given a 2D row-major $32 \times 64$ matrix consisting of elements of 32-bit size, we printed the shared memory bank ids of the swizzled layouts, including Swizzle<5, 0, 6>, Swizzle<5, 0, 8>, Swizzle<5, 2, 6>, and Swizzle<5, 2, 8>. Only Swizzle<5, 0, 6> results in conflict-free access to each column of the matrix.

Figure: Shared memory bank ids of the Swizzle<5, 0, 6>, Swizzle<5, 0, 8>, Swizzle<5, 2, 6>, and Swizzle<5, 2, 8> layouts.

Author

Lei Mao

Posted on

12-01-2024

Updated on

12-03-2024
