CuTe Tiled Copy
Introduction
CuTe tiled copy is used almost everywhere for performing efficient data copy operations in a CuTe program. It is usually used for performing vectorized copy operations between global memory and shared memory, and for matrix load (e.g., ldmatrix) operations from shared memory to registers.
In this article, I would like to discuss the CuTe tiled copy implementation and how it performs tiled copy operations.
CuTe Tiled Copy
There are two key templated classes in CuTe for performing tiled copy operations, TiledCopy and ThrCopy, along with a few helper factory functions. They are defined in the cute/atom/copy_atom.hpp header file in CuTe. Having at least a shallow understanding of these classes and functions gives us a better idea of how tiled copy operations are performed in CuTe.
TiledCopy and ThrCopy Templated Classes
The TiledCopy templated class in CuTe is used for producing and assisting a ThrCopy templated class instance for each thread in the thread group that performs the tiled copy operation. The TiledCopy templated class is templated on three template parameters, Copy_Atom, LayoutCopy_TV, and ShapeTiler_MN, where Copy_Atom is a copy atom that defines the atomic copy operation, LayoutCopy_TV is the TV layout for the tiled copy operation, and ShapeTiler_MN is the tiler shape for the tiled copy operation.
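For reference, the class template declaration looks roughly like the following (abridged from cute/atom/copy_atom.hpp, with member details omitted and comments added by me; details may vary across CUTLASS versions).

template <class Copy_Atom,
          class LayoutCopy_TV,   // (thr_idx, val_idx) -> coordinate in the tile
          class ShapeTiler_MN>   // the tile of the target tensor covered by the whole thread group
struct TiledCopy : Copy_Atom
{
    using TiledLayout_TV = LayoutCopy_TV;
    using Tiler_MN       = ShapeTiler_MN;
    // Member functions such as tidfrg_S, tidfrg_D, get_slice, and get_thread_slice
    // partition source and target tensors and produce per-thread ThrCopy instances.
};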
The ThrCopy templated class is templated on two template parameters, TiledCopy and ThrIdx, where TiledCopy is the templated class that produces this ThrCopy templated class instance using member functions such as get_slice and get_thread_slice, and ThrIdx is the data type of the thread index value, which is usually int or unsigned int.
template <class TiledCopy, class ThrIdx>
struct ThrCopy;
Given a source tensor and a target tensor, the TiledCopy templated class knows how to partition the source tensor and the target tensor into multiple smaller tensors using the Tiler_MN tiler information and the TiledLayout_TV TV layout information. Such partitioning allows slicing the partitioned tensor using the thread index value and allows the tiled copy operation to be performed iteratively for tensors that are larger than the tiler shape. The partition functions are implemented in the tidfrg_S and tidfrg_D member functions for the source and target tensors, respectively.
The ThrCopy templated class instance is produced by the TiledCopy templated class instance using the get_slice or get_thread_slice member function, and it owns the thread index value at runtime. Therefore, it knows how to slice the partitioned source tensor and target tensor for a specific thread in the thread group using the thread index value.
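To make this concrete, the following is a minimal device-side sketch, assuming a tiled copy between a global memory tile and a shared memory tile of matching shapes and a one-dimensional thread block (the function and tensor names are my own, not from CuTe).

#include <cute/tensor.hpp>

// A schematic helper: the TiledCopy produces a per-thread ThrCopy, which
// partitions the source and target tensors for this thread, and cute::copy
// then issues all of the per-thread atomic copy operations.
template <class TiledCopyA, class GmemTensor, class SmemTensor>
__device__ void copy_tile_gmem_to_smem(TiledCopyA const& tiled_copy,
                                       GmemTensor const& gmem_tensor,  // e.g., (M, K) tile in global memory
                                       SmemTensor&       smem_tensor)  // e.g., (M, K) tile in shared memory
{
    using namespace cute;
    auto thr_copy = tiled_copy.get_thread_slice(threadIdx.x);  // ThrCopy for this thread
    auto thr_gmem = thr_copy.partition_S(gmem_tensor);         // (CPY, CPY_M, CPY_K)
    auto thr_smem = thr_copy.partition_D(smem_tensor);         // (CPY, CPY_M, CPY_K)
    copy(tiled_copy, thr_gmem, thr_smem);                      // one atomic copy per CPY_M x CPY_K element
}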
TiledCopy Helper Factory Function
There is a common helper factory function make_tiled_copy_impl in CuTe for constructing a TiledCopy templated class instance from a copy atom, a TV layout, and a tiler. It is used by all the overloaded make_tiled_copy functions in CuTe.
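Its definition is essentially a one-liner that instantiates the TiledCopy class template. Roughly (paraphrasing the CUTLASS source, which may differ slightly across versions):

template <class... Args,
          class LayoutCopy_TV,
          class Tiler>
CUTE_HOST_DEVICE
auto
make_tiled_copy_impl(Copy_Atom<Args...> const& atom,
                     LayoutCopy_TV      const&,   // (thr_idx, val_idx) -> coordinate in the tile
                     Tiler              const&)   // the tile covered by the whole thread group
{
    // The TV layout and the tiler are carried purely as template parameters of the TiledCopy type.
    return TiledCopy<Copy_Atom<Args...>, LayoutCopy_TV, Tiler>{atom};
}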
CuTe Tiled Copy from Thread Layout and Value Layout
One of the most widely used overloaded make_tiled_copy functions in CuTe is the one that takes a copy atom copy_atom, a thread layout thr_layout, and a value layout val_layout as inputs and returns a TiledCopy templated class instance. It is commonly used for performing vectorized tiled copy operations between global memory and shared memory (especially before SM90), resulting in fewer memory transactions and higher memory bandwidth utilization.
It constructs a TV layout layout_tv and a tiler tiler from the thread layout and the value layout, and uses them to construct a TiledCopy templated class instance via the make_tiled_copy_impl helper factory function.
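A sketch of this function, abridged from the cute/atom/copy_atom.hpp header (comments added by me; details may vary across CUTLASS versions), is shown below.

/** Produce a TiledCopy from logical thread and value layouts. */
template <class... Args,
          class ThrLayout,
          class ValLayout>
CUTE_HOST_DEVICE
auto
make_tiled_copy(Copy_Atom<Args...> const& copy_atom,
                ThrLayout          const& thr_layout,   // (m, n) -> thr_idx
                ValLayout          const& val_layout)   // (m, n) -> val_idx
{
    // (M, N) -> (thr_idx, val_idx): the inverse TV layout.
    auto layout_mn = raked_product(thr_layout, val_layout);
    // (thr_idx, val_idx) -> (M, N): the TV layout.
    auto layout_tv = right_inverse(layout_mn).with_shape(make_shape(size(thr_layout), size(val_layout)));
    // The (M, N) tile covered by the whole thread group.
    auto tiler = product_each(shape(layout_mn));

    return make_tiled_copy_impl(copy_atom, layout_tv, tiler);
}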
CuTe Tiled Copy Thread-Value Layout
A CuTe Thread-Value (TV) layout is used for partitioning a tensor into multiple smaller tensors with specific access patterns, each of which is accessed by one thread in a thread group. This is particularly useful for tiled copy and tiled MMA operations in CuTe.
To construct a TV layout for tiled copy, CuTe provides a few convenient functions, such as cute::make_tiled_copy shown in the previous section, which takes a copy atom copy_atom, a thread layout thr_layout, and a value layout val_layout as inputs and returns a cute::TiledCopy object.
The val_layout is the layout of the values that a thread will access in an atomic copy operation. Usually, uint128_t is used as the vector type for the atomic copy operation, and each thread will access a vector of values whose total size is 128 bits. If the size of a value is 32 bits, each thread will access a vector of 4 values at a time, and if the size of a value is 16 bits, each thread will access a vector of 8 values at a time. In this case, val_layout is $4 : 1$ for a vector of four 32-bit values, or $8 : 1$ for a vector of eight 16-bit values.
The thr_layout is the layout of the threads in a thread group. Usually this layout is a 2D layout, such as $(32, 4) : (1, 32)$ for a thread group of 128 threads in column-major order.
With the thr_layout and val_layout, we could compute the TV layout layout_tv for tiled copy. To construct the TV layout, cute::make_tiled_copy first computes the raked product of thr_layout and val_layout to get a layout layout_mn, which maps the coordinates of the target tensor to the thread index and value index, i.e., $(m, n) \rightarrow (\text{thr_idx}, \text{val_idx})$. This is also known as the inverse TV layout. Then, the TV layout layout_tv, which maps the thread index and value index to the coordinates of the target tensor, i.e., $(\text{thr_idx}, \text{val_idx}) \rightarrow (m, n)$, is computed by taking the right inverse of layout_mn. The inverse of the inverse TV layout is the TV layout.
This is actually a brilliant piece of layout algebra for computing the TV layout for tiled copy, although it might not seem straightforward at first glance. With an understanding of the CuTe raked product, the CuTe TV layout, and the CuTe inverse TV layout, and by looking at the example raked product layouts in the CuTe documentation, it becomes somewhat more intuitive. However, I would still like to show an informal mathematical proof of why this algebra works.
Let $A$ and $B$ be the thr_layout and val_layout, respectively. Both $A$ and $B$ are compact layouts. We would like to prove that, in the TV layout computed by cute::make_tiled_copy, the thread index recovered for each value is exactly the thread index assigned by thr_layout $A$, and the value index within each thread is exactly the value index assigned by val_layout $B$, so that each thread accesses its own values with the access pattern described by val_layout.
Proof
Without loss of generality, we can assume that both $A$ and $B$ have rank 2, i.e., $A = (M_0, M_1) : (d_0, d_1)$ and $B = (N_0, N_1) : (r_0, r_1)$. Let $C$ be the inverse TV layout layout_mn computed by the raked product of $A$ and $B$, i.e., $C = A \bowtie B$. Let $D$ be the TV layout layout_tv computed by taking the right inverse of $C$, i.e., $D = C^{-1}$.
By definition of the raked product, and being a little bit sloppy, we must have
$$
\begin{align}
C &= A \bowtie B \\
&= (M_0, M_1) : (d_0, d_1) \bowtie (N_0, N_1) : (r_0, r_1) \\
&= ((N_0, M_0), (N_1, M_1)) : ((r_0 \, \text{cosize}(A), d_0), (r_1 \, \text{cosize}(A), d_1)) \\
\end{align}
$$
where $\text{cosize}(A) = \text{size}(A) = M_0 M_1$ because $A$ is a compact layout.
For a certain thread whose coordinate is $(m_0, m_1)$ in $A$ and a certain value whose coordinate is $(n_0, n_1)$ in $B$, the combined index under the inverse TV layout $C$ could be computed as follows, from which the thread index and the value index in the TV layout $D$ are then recovered.
$$
\begin{align}
C((n_0, m_0), (n_1, m_1)) &= n_0 \times r_0 \, \text{cosize}(A) + m_0 \times d_0 + n_1 \times r_1 \, \text{cosize}(A) + m_1 \times d_1 \\
&= n_0 r_0 M_0 M_1 + m_0 d_0 + n_1 r_1 M_0 M_1 + m_1 d_1 \\
\end{align}
$$
Because the TV layout $D$ is given the shape $(\text{size}(A), \text{size}(B)) = (M_0 M_1, N_0 N_1)$, the combined index decomposes as $\text{thr_idx} + M_0 M_1 \times \text{val_idx}$. The thread index $\text{thr_idx}$ in the TV layout $D$ of the thread whose coordinate is $(m_0, m_1)$ in $A$ could therefore be computed as follows.
$$
\begin{align}
\text{thr_idx} &= C((n_0, m_0), (n_1, m_1)) \mod (M_0 M_1) \\
&= (n_0 r_0 M_0 M_1 + m_0 d_0 + n_1 r_1 M_0 M_1 + m_1 d_1) \mod (M_0 M_1) \\
&= m_0 d_0 + m_1 d_1 \\
\end{align}
$$
where $m_0 d_0 + m_1 d_1$ is exactly the expected thread index computed by the layout $A$, $A(m_0, m_1)$, and the last equality holds because $A(m_0, m_1) < M_0 M_1$, as $A$ is a compact layout.
Thus, we have proven that the thread index recovered from the TV layout computed by cute::make_tiled_copy is exactly the thread index assigned by thr_layout, i.e., the values belong to the expected thread.
$$
C((n_0, m_0), (n_1, m_1)) \mod (M_0 M_1) = A(m_0, m_1)
$$
Then, the value index $\text{val_idx}$ of the value in the TV layout $D$ whose coordinate is $(n_0, n_1)$ in $B$ could be computed as follows.
$$
\begin{align}
\text{val_idx} &= \left\lfloor \frac{C((n_0, m_0), (n_1, m_1))}{M_0 M_1} \right\rfloor \\
&= \left\lfloor \frac{n_0 r_0 M_0 M_1 + m_0 d_0 + n_1 r_1 M_0 M_1 + m_1 d_1}{M_0 M_1} \right\rfloor \\
&= \left\lfloor n_0 r_0 + n_1 r_1 + \frac{m_0 d_0 + m_1 d_1}{M_0 M_1} \right\rfloor \\
&= n_0 r_0 + n_1 r_1 + \left\lfloor \frac{m_0 d_0 + m_1 d_1}{M_0 M_1} \right\rfloor \\
&= n_0 r_0 + n_1 r_1 + 0 \\
&= n_0 r_0 + n_1 r_1 \\
\end{align}
$$
where $n_0 r_0 + n_1 r_1$ is exactly the expected value index computed by the layout $B$, $B(n_0, n_1)$.
Thus, we have proven that the value index recovered from the TV layout is exactly the value index assigned by val_layout, i.e., within each thread the values are arranged exactly according to val_layout.
This concludes the proof. $\square$
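As a quick numerical sanity check of the proof, take the example thr_layout and val_layout from above, $A = (32, 4) : (1, 32)$ and $B = 4 : 1$, so that $M_0 M_1 = 128$, and consider the thread at coordinate $(m_0, m_1) = (5, 2)$ and the value at coordinate $n_0 = 3$.
$$
\begin{align}
C((3, 5), 2) &= 3 \times 1 \times 128 + 5 \times 1 + 2 \times 32 = 453 \\
\text{thr_idx} &= 453 \mod 128 = 69 = A(5, 2) \\
\text{val_idx} &= \left\lfloor \frac{453}{128} \right\rfloor = 3 = B(3) \\
\end{align}
$$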
Example of CuTe Tiled Copy from Thread Layout and Value Layout
The following is a simple host preview example of using cute::make_tiled_copy from a thread layout and a value layout to perform a tiled copy operation from global memory to shared memory.
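A minimal sketch along these lines, assuming 16-bit integer elements and the tensor shapes, thread layout, and value layout discussed below (the exact listing may differ in details), could look like the following.

// Compile with, e.g., nvcc -std=c++17 -I<path-to-cutlass>/include tiled_copy_preview.cu
#include <cute/tensor.hpp>
#include <cstdint>
#include <vector>

int main()
{
    using namespace cute;
    using Element = int16_t;  // 16-bit values, so 8 values form one 128-bit vector

    // Global memory tensor viewed by one CTA: (M, K, num_K_tiles) = (128, 32, 32).
    auto gmem_cta_layout_A = make_layout(make_shape(Int<128>{}, Int<32>{}, Int<32>{}));
    // Shared memory tensor: (M, K) = (128, 32).
    auto smem_layout_A = make_layout(make_shape(Int<128>{}, Int<32>{}));

    std::vector<Element> gmem_buffer(size(gmem_cta_layout_A));
    std::vector<Element> smem_buffer(size(smem_layout_A));
    auto gmem_cta_tensor_A = make_tensor(gmem_buffer.data(), gmem_cta_layout_A);
    auto smem_tensor_A     = make_tensor(smem_buffer.data(), smem_layout_A);

    // 32 threads in column-major order; each thread copies 8 contiguous 16-bit values at a time.
    auto thread_layout_A = make_layout(make_shape(Int<8>{}, Int<4>{}),
                                       make_stride(Int<1>{}, Int<8>{}));  // (8, 4) : (1, 8)
    auto vector_layout_A = make_layout(make_shape(Int<8>{}));             // 8 : 1

    auto gmem_tiled_copy_A = make_tiled_copy(
        Copy_Atom<UniversalCopy<cute::uint128_t>, Element>{},
        thread_layout_A, vector_layout_A);

    print("gmem_tiled_copy_A: "); print(gmem_tiled_copy_A); print("\n");

    // Preview the per-thread partitions for thread 0 on the host.
    auto thread_copy_A = gmem_tiled_copy_A.get_thread_slice(0);
    auto thread_layout_A_gmem_cta_tensor_A = thread_copy_A.partition_S(gmem_cta_tensor_A);
    auto thread_layout_A_smem_tensor_A     = thread_copy_A.partition_D(smem_tensor_A);

    print("thread_layout_A_gmem_cta_tensor_A: "); print(thread_layout_A_gmem_cta_tensor_A); print("\n");
    print("thread_layout_A_smem_tensor_A: ");     print(thread_layout_A_smem_tensor_A);     print("\n");

    return 0;
}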
The tiled copy object and the layouts of the tensors involved in the tiled copy operation can be printed out and inspected. They match our intuition and expectations.
More specifically, we want each thread to access a vector of 8 values of 16-bit integers at a time, so the vector_layout_A is $8 : 1$. Given that we have a thread group of 32 threads in the thread block whose layout is $(8, 4) : (1, 8)$, the tiler Tiler_MN of the tiled copy gmem_tiled_copy_A must be $(64, 4)$. The TV layout TiledLayout_TV of the tiled copy gmem_tiled_copy_A must be $(32, 8) : (8, 1)$ so that each thread accesses 8 values and the coordinates of those values on the target tensor are contiguous. It has to be emphasized that it is the coordinates of the values on the target tensor that are contiguous, not necessarily the indices of the values on the target tensor.
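The tiler shape $(64, 4)$ claimed above follows directly from the raked product of the thread layout and the value layout: along $M$, 8 threads each cover 8 values, and along $K$, 4 threads each cover 1 value.
$$
\text{Tiler_MN} = (8 \times 8, 4 \times 1) = (64, 4)
$$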
Given the shared memory tensor smem_tensor_A of shape $(128, 32)$ and the global memory tensor gmem_cta_tensor_A of shape $(128, 32, 32)$, the partitioned source tensor thread_layout_A_gmem_cta_tensor_A of shape $((8, 1), 2, 8, 32)$ and the partitioned target tensor thread_layout_A_smem_tensor_A of shape $((8, 1), 2, 8)$ are exactly what we expect. The first mode of the partitioned source and target tensors, i.e., CPY, is the copy atom mode, which has a layout of $(8, 1) : (1, 0)$. This means that each thread performs a vector copy operation of 8 contiguous values at a time. In this case, each thread performs 2 atomic copy operations along the $M$ dimension and 8 atomic copy operations along the $K$ dimension to copy a tile of shape $(128, 32)$ from global memory to shared memory. That's why CPY_M is 2 and CPY_K is 8.
It should be noted that we have to design the tiled copy carefully. Otherwise, the CuTe program might not compile, or, in the worst case, the copy behavior might not be what we expect. In our previous example, the tensor layouts, thread layout, and value layout are all column-major. However, if they are inconsistent, it is possible to end up with an undesired tiler Tiler_MN or a copy atom mode CPY that is not compatible with the intended copy behavior.
References
CuTe Tiled Copy