CuTe Swizzle
Introduction
In my article “CUDA Shared Memory Swizzling”, we have discussed how to use swizzling to avoid bank conflicts when a warp of threads accesses shared memory in a strided pattern. Because the swizzle operation and the mathematical proof involve both integer and bit operations, it might not be straightforward to understand and could be error-prone to implement.
CuTe provided a shared memory swizzling abstraction class Swizzle
to simplify the shared memory swizzling implementation. Its implementation only involves bit operations, therefore it is more readable and somewhat easier to prove.
In this blog post, I would like to quickly discuss the implementation of the CuTe shared memory swizzling abstraction class Swizzle
and its configurations in practice.
CuTe Swizzle
CuTe Swizzle Implementation
The CuTe shared memory swizzling abstraction class Swizzle
from the source code is as follows. Only three parameters are used for the swizzle configuration: BBits
, MBase
, and SShift
, where BBits
is the number of bits in the mask, MBase
is the number of least-significant bits to keep constant, and SShift
is the distance to shift the mask. This might be obscure at first glance, let’s walk through a quick example.
For simplicity, suppose we have a 16-bit integer offset
whose value is 65
and a swizzle configuration Swizzle<5, 0, 6>
. The bit representation of offset
is 0b0000000001000001
. The bit_msk
is 0b0000000000011111
, the yyy_msk
is 0b0000011111000000
, the zzz_msk
is 0b0000000000011111
, and msk_sft
is 6
. To swizzle the offset
, offset & yyy_msk{}
is 0b0000000001000000
where only the bits in the masked region are kept, and shiftr(offset & yyy_msk{}, msk_sft{})
is 0b0000000000000001
where the masked bits are shifted to the right. The final result is offset ^ shiftr(offset & yyy_msk{}, msk_sft{})
which is 0b0000000001000001
xor 0b0000000000000001
equals 0b0000000001000000
whose value is 64
. This means the swizzle operation Swizzle<5, 0, 6>
projects the offset
from 65
to 64
.
1 | // A generic Swizzle functor |
Offset Bijection
Given an integer $m$, a domain $X = [0, 2^m - 1]$, and a constant $c \in X$, the function $f: X \to X$ defined as $f(x) = x \oplus c$, where $\oplus$ is the XOR operation, is a bijection.
Proof
It is trivial to see the XOR operation is commutative and associative. Suppose there exist two different values $x_1, x_2 \in X$ such that $f(x_1) = f(x_2)$, then $x_1 \oplus c = x_2 \oplus c$, which implies $x_1 = x_2$. Therefore, the function $f$ is injective. Because $X = [0, 2^m - 1]$ and the XOR operation cannot produce a value outside of $X$, the function $f$ is surjective.
This concludes the proof. $\square$
Therefore, assuming MBase
is zero, and BBits = m
, for the offsets $x$ in the range $X = [k, k \cdot 2^m]$, because $c$ is a constant (as it’s in the bit position higher than $m$), the swizzle operation will permute the offsets in $X$ bijectively.
Shared Memory 2D Layout and Shared Memory Bank Conflicts
On devices of compute capability 5.x or newer, each bank has a bandwidth of 32 bits every clock cycle, and successive 32-bit words are assigned to successive banks. So element size matters for shared memory bank conflicts.
In many use cases, the shared memory layout is a 2D row-major matrix whose row size is a multiple of 32 and element size is 32-bit. When strided access from a warp of threads is performed on the column of the matrix, severe 32-way shared memory bank conflicts will occur. If the element in each column are mapped to different shared memory banks, the shared memory bank conflicts can be mitigated.
Assuming MBase
is zero, element size is 32-bit, and the matrix row size is $n$, we could design the swizzle operation such that there is free of shared memory bank conflicts when accessing each column of the matrix, based on the offset bijection property we just proved above. To configure BBits
and SShift
such that offset % 32
is distinct when a warp of threads accesses the column of the matrix, we have to set SShift
to be $\log_2 n$ and BBits
to be $\log_2 32 = 5$, so that the $c$ used in $f(x) = x \oplus c$ are different for each row, resulting in distinct offset % 32
when a warp of threads accesses the column of the matrix.
Vectorized Memory Access
In CuTe, when we want to perform vectorized access, which is common in CUDA kernels such as matrix transpose, all the elements in the vector from both source and target must be contiguous on the memory. If MBase
is zero, the swizzle operation will not be able to guarantee the contiguous memory access because of the XOR operation. However, when we know the number of elements in the vector, we could set MBase
to the log2 of the number of elements in the vector, and as we will prove later, the swizzle operation will guarantee the contiguous memory access.
Suppose a vector holds $n$ values where $n$ is a power of 2, then all the bits of offsets $kn, kn + 1, \ldots, kn + n - 1$, where $k$ is an integer, are the same except the least significant $\log_2 n$ bits.
Proof
It is trivial to see the least significant $\log_2 n$ bits of $kn, kn + 1, \ldots, kn + n - 1$ are $0, 1, \ldots, n - 1$ in decimal. Suppose there exist one value among $kn, kn + 1, \ldots, kn + n - 1$ whose non-least significant bits are different from one of its neighbors, then the difference between the value and its neighbor must be different from $1$, which contradicts the fact that the difference between any two consecutive values is $1$ in decimal.
This concludes the proof. $\square$
Therefore, no matter how the swizzle operation is performed, as long as MBase
is set to $\log_2 n$ and the memory access starting offset is a multiple of $n$, the contiguous memory access is guaranteed.
If the element size is less than or equal to 32-bit, such as 1-bit, 2-bit, 4-bit, 8-bit, 16-bit, and 32-bit, we could set the number of values in the vector to be 32, 16, 8, 4, 2, and 1, and set MBase
to be $\log_2 32 = 5$, $\log_2 16 = 4$, $\log_2 8 = 3$, $\log_2 4 = 2$, $\log_2 2 = 1$, and $\log_2 1 = 0$, respectively, such that each vectorized memory access transaction is 32-bit. Consequently, the vector is just treated as if it were just a 32-bit element, and the offset can be re-indexed by offset >>= MBase
before the swizzle operation. Then all the swizzle properties we just proved above still hold and the swizzle configurations can be reused.
If the element size is greater than 32-bit, such as 64-bit, 128-bit, 256-bit, etc., we could treat it as a vectorized memory access of multiple 32-bit elements. The offset can be re-indexed by offset <<= log2(element_size / 32)
before the swizzle operation, and the swizzle configurations can be reused. This, however, will bring a consequence that there can never be free of shared memory bank conflicts when accessing the column of the matrix, because the image of shared memory bank ids offset % 32
will become smaller than $[0, 32)$. According to the pigeon hole principle, there will always be at least a log2(element_size / 32)
-way shared memory bank conflict.
Universal Swizzle Equations and Configurations
Suppose the element size is $S$-byte, the number of elements in a vector is $N$, the number of elements in the fast dimension of the shared memory is $X$. The MBase
should be set to $\log_2 N$, the BBits
should be set to $\log_2 (32 \times 4 / S) - \text{MBase}$, and the SShift
should be set to $\log_2 X - \text{MBase}$.
An example of the universal swizzle configurations could be implemented as follows.
1 | constexpr int constexpr_log2(int n) |
Examples
Let’s see a few more complicated examples in which the data type is not of 32-bit size and vectorized memory access is used.
Suppose we have an INT8 8 x 128 row-major matrix, and we want to use 128-bit vectorized memory access. In this case, there are 16 elements in a vector, and MBase
should be set to $\log_2 16 = 4$. Because there are 32 shared memory banks of 32-bit size, each 32-bit word contains 4 elements, so BBits
should be set to $\log_2 (32 \times 4) - \text{MBase} = 7 - 4 = 3$. The SShift
is set to $\log_2 128 - \text{MBase} = 7 - 4 = 3$ to ensure the constant $c$ used for the XOR operation is different for each row. Therefore, the swizzle configuration is Swizzle<3, 4, 3>
.
Suppose we have an FP16 8 x 64 row-major matrix, and we want to use 128-bit vectorized memory access. In this case, there are 8 elements in a vector, and MBase
should be set to $\log_2 8 = 3$. Because there are 32 shared memory banks of 32-bit size, each 32-bit word contains 2 elements, so BBits
should be set to $\log_2 (32 \times 2) - \text{MBase} = 6 - 3 = 3$. The SShift
is set to $\log_2 64 - \text{MBase} = 6 - 3 = 3$ to ensure the constant $c$ used for the XOR operation is different for each row. Therefore, the swizzle configuration is Swizzle<3, 3, 3>
.
CuTe Swizzle Preview
The shared memory bank ids of CuTe swizzled layout can be previewed using the CuTe Swizzle Preview App I created. The app saves the shared memory bank ids of the swizzled layout to a LaTeX file, and the LaTeX file can be compiled to a PDF file for previewing.
For example, given a 2D row-major $32 \times 64$ matrix consisting of elements of 32-bit size, we printed the shared memory bank ids of the swizzled layouts, including Swizzle<5, 0, 6>
, Swizzle<5, 0, 8>
, Swizzle<5, 2, 6>
, and Swizzle<5, 2, 8>
. Only Swizzle<5, 0, 6>
results in free of shared memory bank conflicts when accessing each column of the matrix.
References
CuTe Swizzle