CuTe Tiled Copy
Introduction
CuTe tiled copy is used almost everywhere in a CuTe program for performing efficient data copy operations. It is usually used for performing vectorized copy operations between global memory and shared memory and for performing load matrix operations from shared memory to registers.
In this article, I would like to discuss the CuTe tiled copy implementation and how it works for performing tiled copy operations.
CuTe Tiled Copy
There are two key templated classes in CuTe for performing tiled copy operations, namely TiledCopy and ThrCopy. These classes are defined in the cute/atom/copy_atom.hpp header file in CuTe. We would like to have some basic understanding of these classes so that we have a better idea of how tiled copy operations are performed in CuTe.
TiledCopy and ThrCopy Templated Classes
The TiledCopy templated class in CuTe is used for producing and assisting a ThrCopy templated class instance for each thread in the thread group that performs the tiled copy operation. The TiledCopy templated class takes three template parameters, Copy_Atom, LayoutCopy_TV, and ShapeTiler_MN, where Copy_Atom is a copy atom that defines the atomic copy operation, LayoutCopy_TV is the TV layout for the tiled copy operation, and ShapeTiler_MN is the tiler shape for the tiled copy operation.
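For reference, the TiledCopy class template declared in the cute/atom/copy_atom.hpp header file looks roughly like the following abridged sketch, with member functions omitted.
template <class Copy_Atom,
          class LayoutCopy_TV,  // (thr_idx, val_idx) -> coordinate
          class ShapeTiler_MN>  // tiler shape
struct TiledCopy : Copy_Atom
{
  using TiledLayout_TV = LayoutCopy_TV;
  using Tiler_MN       = ShapeTiler_MN;
  // Member functions such as tidfrg_S, tidfrg_D, get_slice, and get_thread_slice are omitted.
};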
The ThrCopy templated class takes two template parameters, TiledCopy and ThrIdx, where TiledCopy is the templated class that produces the ThrCopy templated class instance using member functions such as get_slice and get_thread_slice, and ThrIdx is the data type of the thread index value, which is usually int or unsigned int.
template <class TiledCopy, class ThrIdx>
struct ThrCopy
{
  ThrIdx thr_idx_;
  // Member functions such as partition_S and partition_D are omitted.
};
Given a source tensor and a target tensor, the TiledCopy templated class knows how to partition the source tensor and the target tensor into multiple smaller tensors using the Tiler_MN tiler information and the TiledLayout_TV TV layout information. Such partitioning allows slicing the partitioned tensors using the thread index value and allows the tiled copy operation to be performed iteratively for tensors that are larger than the tiler shape. The partitioning of the source tensor and the target tensor is implemented in the tidfrg_S and tidfrg_D member functions, respectively.
The ThrCopy templated class instance is produced by the TiledCopy templated class instance using the get_slice or get_thread_slice member function, and it owns the thread index value at runtime. Therefore, it knows how to slice the partitioned source tensor and target tensor for a specific thread in the thread group using the thread index value.
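For example, inside a CUDA kernel, a tiled copy is typically used as in the following minimal sketch, assuming a TiledCopy object tiled_copy_A for 16-bit elements and statically shaped global memory and shared memory layouts. The kernel and variable names here are illustrative assumptions rather than part of the CuTe API.
#include <cute/tensor.hpp>

using namespace cute;

// Each thread slices the tiled copy by its thread index and copies its own
// fragments from global memory to shared memory.
template <class TiledCopyA, class GmemLayout, class SmemLayout>
__global__ void copy_kernel(TiledCopyA tiled_copy_A, half_t const* gmem_ptr,
                            GmemLayout gmem_layout, SmemLayout smem_layout)
{
    __shared__ half_t smem[cosize_v<SmemLayout>];

    Tensor gA = make_tensor(make_gmem_ptr(gmem_ptr), gmem_layout);  // (M, K)
    Tensor sA = make_tensor(make_smem_ptr(smem), smem_layout);      // (M, K)

    // Slice the tiled copy for this thread and partition both tensors.
    auto thr_copy_A = tiled_copy_A.get_thread_slice(threadIdx.x);
    Tensor tAgA = thr_copy_A.partition_S(gA);  // (CPY, CPY_M, CPY_K)
    Tensor tAsA = thr_copy_A.partition_D(sA);  // (CPY, CPY_M, CPY_K)

    copy(tiled_copy_A, tAgA, tAsA);
    __syncthreads();
}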
TiledCopy Helper Factory Function
There is a common helper factory function make_tiled_copy_impl in CuTe for constructing a TiledCopy templated class instance from a copy atom, a TV layout, and a tiler. It is used by all the overloaded make_tiled_copy functions in CuTe.
template <class... Args,
          class LayoutCopy_TV,
          class Tiler>
CUTE_HOST_DEVICE
auto
make_tiled_copy_impl(Copy_Atom<Args...> const& atom,
                     LayoutCopy_TV      const&,
                     Tiler              const&)
{
  return TiledCopy<Copy_Atom<Args...>, LayoutCopy_TV, Tiler>{};
}
CuTe Tiled Copy from Thread Layout and Value Layout
One of the most widely used overloaded make_tiled_copy functions in CuTe is the one that takes a copy atom copy_atom, a thread layout thr_layout, and a value layout val_layout as inputs and returns a TiledCopy templated class instance. It is commonly used for performing vectorized tiled copy operations between global memory and shared memory (especially before SM90), resulting in fewer memory transactions and higher memory bandwidth utilization.
/** Produce a TiledCopy from logical thread and values layouts.
 * The thread and value layouts map coordinates to thr_idx and val_idx.
 *    The product of these layouts is taken to produce the TV layout and the Tiler.
 * Useful when threads and values need very specific mappings onto coordinates
 *    in the target tensors.
 */
template <class... Args,
          class ThrLayout,
          class ValLayout = Layout<_1>>
CUTE_HOST_DEVICE
auto
make_tiled_copy(Copy_Atom<Args...> const& copy_atom,
                ThrLayout          const& thr_layout = {},     // (m,n) -> thr_idx
                ValLayout          const& val_layout = {})     // (m,n) -> val_idx
{
  // (M,N) -> (thr_idx, val_idx)
  auto layout_mn = raked_product(thr_layout, val_layout);
  // (thr_idx, val_idx) -> (M,N)
  auto layout_tv = right_inverse(layout_mn).with_shape(make_shape(size(thr_layout), size(val_layout)));
  // Tiler for extracting the relevant elements: (M,N) -> tensor coord
  auto tiler = product_each(shape(layout_mn));

  return make_tiled_copy_impl(copy_atom, layout_tv, tiler);
}
It constructs a TV layout layout_tv and a tiler tiler, and uses them to construct a TiledCopy templated class instance via the make_tiled_copy_impl helper factory function.
CuTe Tiled Copy Thread-Value Layout
A CuTe thread-value (TV) layout is used for partitioning a tensor into multiple smaller tensors with specific access patterns, each of which will be accessed by one thread in a thread group. This is particularly useful for tiled copy and tiled MMA operations in CuTe.
To construct a TV layout for tiled copy, CuTe provides a few convenient functions, such as the cute::make_tiled_copy function shown above, which takes a copy atom copy_atom, a thread layout thr_layout, and a value layout val_layout as inputs and returns a cute::TiledCopy object.
The val_layout is the layout of the values that a thread will access in an atomic copy operation. Usually, uint128_t is used as the vector type for the atomic copy operation, and each thread will access a vector of values whose total size is 128 bits. If the size of a value is 32 bits, each thread will access a vector of 4 values at a time, and if the size of a value is 16 bits, each thread will access a vector of 8 values at a time. In this case, val_layout is $4 : 1$ for a vector of four 32-bit values, or $8 : 1$ for a vector of eight 16-bit values.
The thr_layout is the layout of the threads in a thread group. Usually, this layout is a 2D layout, such as $(32, 4) : (1, 32)$ for a thread group of 128 threads in column-major order.
With the thr_layout and val_layout, we could compute the TV layout layout_tv for tiled copy. To construct the TV layout, cute::make_tiled_copy first computes the raked product of thr_layout and val_layout to get a layout layout_mn, which maps the coordinates of the target tensor to the thread index and value index, i.e., $(m, n) \rightarrow (\text{thr_idx}, \text{val_idx})$. This is also known as the inverse TV layout. Then, the TV layout layout_tv, which maps the thread index and value index to the coordinates of the target tensor, i.e., $(\text{thr_idx}, \text{val_idx}) \rightarrow (m, n)$, is computed by taking the right inverse of layout_mn. The inverse of the inverse TV layout is the TV layout.
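As a concrete sanity check, the following small host program, a sketch using the thread layout and value layout from the example later in this article, computes and prints layout_mn and layout_tv with the same algebra.
#include <cute/tensor.hpp>

using namespace cute;

int main()
{
    // 32 threads in column-major (8, 4):(1, 8); each thread handles 8 contiguous values.
    auto thr_layout = make_layout(make_shape(Int<8>{}, Int<4>{}),
                                  make_stride(Int<1>{}, Int<8>{}));
    auto val_layout = make_layout(make_shape(Int<8>{}));

    // Inverse TV layout: (m, n) -> (thr_idx, val_idx).
    auto layout_mn = raked_product(thr_layout, val_layout);
    // TV layout: (thr_idx, val_idx) -> (m, n).
    auto layout_tv = right_inverse(layout_mn).with_shape(
        make_shape(size(thr_layout), size(val_layout)));
    // Tiler shape, expected to be (64, 4).
    auto tiler = product_each(shape(layout_mn));

    print(layout_mn); print("\n");
    print(layout_tv); print("\n");  // Expected to be equivalent to (32, 8):(8, 1).
    print(tiler); print("\n");
    return 0;
}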
This is actually a brilliant piece of layout algebra for computing the TV layout for tiled copy, although it might not seem straightforward at first glance. With an understanding of the CuTe raked product, the CuTe TV layout, and the CuTe inverse TV layout, and by looking at the example raked product layouts in the CuTe documentation, it might become somewhat more intuitive. However, I would still like to show an informal mathematical proof of why this algebra works.
Let $A$ and $B$ be the thr_layout and val_layout, respectively. Both $A$ and $B$ are compact layouts. We would like to prove that the values for each thread in the TV layout computed by cute::make_tiled_copy belong to that thread, and that the layout of the values each thread accesses within the tiled copy tiler is the same as val_layout.
Proof
Without loss of generality, we could assume that the rank of both $A$ and $B$ is 2, i.e., $A = (M_0, M_1) : (d_0, d_1)$ and $B = (N_0, N_1) : (r_0, r_1)$. Let $C$ be the inverse TV layout layout_mn computed by the raked product of $A$ and $B$, i.e., $C = A \bowtie B$. Let $D$ be the TV layout layout_tv computed by taking the right inverse of $C$, i.e., $D = C^{-1}$.
By definition of the raked product, and being a little bit sloppy, we must have
$$
\begin{align}
C &= A \bowtie B \\
&= \left( (M_0, M_1) : (d_0, d_1) \right) \bowtie \left( (N_0, N_1) : (r_0, r_1) \right) \\
&= ((N_0, M_0), (N_1, M_1)) : ((r_0 \, \text{cosize}(A), d_0), (r_1 \, \text{cosize}(A), d_1))
\end{align}
$$
where $\text{cosize}(A) = \text{size}(A) = M_0 M_1$ because $A$ is a compact layout.
For a certain thread whose coordinate is $(m_0, m_1)$ in $A$ and a certain value whose coordinate is $(n_0, n_1)$ in $B$, the combined thread-value index produced by the inverse TV layout $C$ could be computed as follows.
$$
\begin{align}
C((n_0, m_0), (n_1, m_1)) &= n_0 \, r_0 \, \text{cosize}(A) + m_0 \, d_0 + n_1 \, r_1 \, \text{cosize}(A) + m_1 \, d_1 \\
&= n_0 r_0 M_0 M_1 + m_0 d_0 + n_1 r_1 M_0 M_1 + m_1 d_1
\end{align}
$$
The thread index $\text{thr_idx}$ of the thread in the TV layout $D$ whose coordinate is $(m_0, m_1)$ in $A$ could be computed as follows.
$$
\begin{align}
\text{thr_idx} &= C((n_0, m_0), (n_1, m_1)) \mod (M_0 M_1) \\
&= (n_0 r_0 M_0 M_1 + m_0 d_0 + n_1 r_1 M_0 M_1 + m_1 d_1) \mod (M_0 M_1) \\
&= m_0 d_0 + m_1 d_1 \\
\end{align}
$$
where $m_0 d_0 + m_1 d_1$ is exactly the expected thread index computed by the layout $A$, i.e., $A(m_0, m_1)$. Note that the modulo leaves this term unchanged because $A$ is a compact layout and therefore $A(m_0, m_1) = m_0 d_0 + m_1 d_1 < M_0 M_1$.
Thus, we have proven that the values for each thread in the TV layout computed by cute::make_tiled_copy belong to that thread.
$$
C((n_0, m_0), (n_1, m_1)) \mod (M_0 M_1) = A(m_0, m_1)
$$
Then, the value index $\text{val_idx}$ of the value in the TV layout $D$ whose coordinate is $(n_0, n_1)$ in $B$ could be computed as follows.
$$
\begin{align}
\text{val_idx} &= \left\lfloor \frac{C((n_0, m_0), (n_1, m_1))}{M_0 M_1} \right\rfloor \\
&= \left\lfloor \frac{n_0 r_0 M_0 M_1 + m_0 d_0 + n_1 r_1 M_0 M_1 + m_1 d_1}{M_0 M_1} \right\rfloor \\
&= \left\lfloor n_0 r_0 + n_1 r_1 + \frac{m_0 d_0 + m_1 d_1}{M_0 M_1} \right\rfloor \\
&= n_0 r_0 + n_1 r_1 + \left\lfloor \frac{m_0 d_0 + m_1 d_1}{M_0 M_1} \right\rfloor \\
&= n_0 r_0 + n_1 r_1 + 0 \\
&= n_0 r_0 + n_1 r_1 \\
\end{align}
$$
where $n_0 r_0 + n_1 r_1$ is exactly the expected value index computed by the layout $B$, $B(n_0, n_1)$.
Thus, we have proven that the layout of the values that each thread accesses within the tiled copy tiler is the same as val_layout.
$$
\left\lfloor \frac{C((n_0, m_0), (n_1, m_1))}{M_0 M_1} \right\rfloor = B(n_0, n_1)
$$
This concludes the proof. $\square$
Example of CuTe Tiled Copy from Thread Layout and Value Layout
The following is a simple host preview example of using cute::make_tiled_copy from a thread layout and a value layout to perform a tiled copy operation from global memory to shared memory.
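The listing below is a minimal sketch of such a host preview program, assuming 16-bit integer elements and constructing the per-CTA global memory view of shape $(128, 32, 32)$ directly rather than tiling it from a full global memory tensor. The variable names follow the discussion below.
#include <cstdint>
#include <vector>

#include <cute/tensor.hpp>

using namespace cute;

int main()
{
    using TA = uint16_t;

    // Per-CTA view of the global memory tensor, (BLK_M, BLK_K, k) = (128, 32, 32).
    auto gmem_cta_layout_A = make_layout(make_shape(Int<128>{}, Int<32>{}, Int<32>{}));
    // Shared memory tile of shape (BLK_M, BLK_K) = (128, 32).
    auto smem_layout_A = make_layout(make_shape(Int<128>{}, Int<32>{}));

    // Host buffers stand in for global and shared memory for this preview.
    std::vector<TA> gmem_data(size(gmem_cta_layout_A));
    std::vector<TA> smem_data(size(smem_layout_A));
    auto gmem_cta_tensor_A = make_tensor(gmem_data.data(), gmem_cta_layout_A);
    auto smem_tensor_A = make_tensor(smem_data.data(), smem_layout_A);

    // 32 threads in column-major (8, 4):(1, 8); each thread copies 8 contiguous
    // 16-bit values, i.e., 128 bits, per atomic copy.
    auto thread_layout_A = make_layout(make_shape(Int<8>{}, Int<4>{}),
                                       make_stride(Int<1>{}, Int<8>{}));
    auto vector_layout_A = make_layout(make_shape(Int<8>{}));

    auto gmem_tiled_copy_A = make_tiled_copy(
        Copy_Atom<UniversalCopy<cute::uint128_t>, TA>{},
        thread_layout_A, vector_layout_A);
    print(gmem_tiled_copy_A);
    print("\n");

    // Preview the per-thread partitions for thread 0.
    auto thread_gmem_copy_A = gmem_tiled_copy_A.get_thread_slice(0);
    auto thread_layout_A_gmem_cta_tensor_A =
        thread_gmem_copy_A.partition_S(gmem_cta_tensor_A);  // ((CPY), CPY_M, CPY_K, k)
    auto thread_layout_A_smem_tensor_A =
        thread_gmem_copy_A.partition_D(smem_tensor_A);  // ((CPY), CPY_M, CPY_K)
    print(thread_layout_A_gmem_cta_tensor_A.layout());
    print("\n");
    print(thread_layout_A_smem_tensor_A.layout());
    print("\n");

    return 0;
}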
Printing the tiled copy object and the layouts of the tensors involved in the tiled copy operation shows that they match our intuition and expectations.
More specifically, we want each thread to access a vector of 8 16-bit integer values at a time, so the vector_layout_A is $8 : 1$. Given that we have a thread group of 32 threads in the thread block whose layout is $(8, 4) : (1, 8)$, the tiler Tiler_MN of the tiled copy gmem_tiled_copy_A must be $(64, 4)$. The TV layout TiledLayout_TV of the tiled copy gmem_tiled_copy_A must be $(32, 8) : (8, 1)$ so that each thread will access 8 values whose coordinates on the target tensor are contiguous. It has to be emphasized that it is the coordinates of the values on the target tensor that are contiguous, not necessarily the indices of the values on the target tensor.
Given the shared memory tensor smem_tensor_A of shape $(128, 32)$ and the global memory tensor gmem_cta_tensor_A of shape $(128, 32, 32)$, the partitioned source tensor thread_layout_A_gmem_cta_tensor_A of shape $((8, 1), 2, 8, 32)$ and the partitioned target tensor thread_layout_A_smem_tensor_A of shape $((8, 1), 2, 8)$ are exactly what we expect. The first mode of the partitioned source tensor and target tensor, i.e., CPY, is the copy atom mode, which has a layout of $(8, 1) : (1, 0)$. This means that each thread will perform a vectorized copy operation of 8 contiguous values at a time. In this case, each thread will perform 2 atomic copy operations along the $M$ dimension and 8 atomic copy operations along the $K$ dimension to copy a tile of shape $(128, 32)$ from global memory to shared memory. That's why CPY_M is 2 and CPY_K is 8.
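Concretely, the number of atomic copy operations per thread follows from dividing the tile shape by the tiler shape Tiler_MN.
$$
\text{CPY_M} = \frac{128}{64} = 2, \quad \text{CPY_K} = \frac{32}{4} = 8
$$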
It should be noted that we have to design the tiled copy carefully. Otherwise, the CuTe program might not compile, or in the worst case, the copy behavior might not be what we expect. In our previous example, the tensor layouts, thread layout, and value layout are all column-major. However, if they are inconsistent, it is possible that we see an undesired tiler Tiler_MN or a copy atom mode CPY that is not compatible with the intended copy behavior.
References
CuTe Tiled Copy