CuTe Tiled Copy

Introduction

CuTe tiled copy is used almost everywhere efficient data copy operations are needed in a CuTe program. It is typically used for vectorized copy operations between global memory and shared memory, and for matrix load (ldmatrix) operations from shared memory to registers.

In this article, I would like to discuss the CuTe tiled copy implementation and how it performs tiled copy operations.

CuTe Tiled Copy

There are two key templated classes in CuTe for performing tiled copy operations, namely TiledCopy and ThrCopy. These classes are defined in the cute/atom/copy_atom.hpp header file in CuTe. Having at least a shallow understanding of them gives us a better idea of how tiled copy operations are performed in CuTe.

TiledCopy and ThrCopy Templated Classes

The TiledCopy templated class in CuTe is used for producing and assisting a ThrCopy instance for each thread in the thread group that performs the tiled copy operation. It is templated on three types, Copy_Atom, LayoutCopy_TV, and ShapeTiler_MN, where Copy_Atom is the copy atom that defines the atomic copy operation, LayoutCopy_TV is the TV layout for the tiled copy operation, and ShapeTiler_MN is the tiler shape for the tiled copy operation.

The ThrCopy templated class is templated on two types, TiledCopy and ThrIdx, where TiledCopy is the class that produces the ThrCopy instance via member functions such as get_slice and get_thread_slice, and ThrIdx is the data type of the thread index value, which is usually int or unsigned int.

template <class TiledCopy, class ThrIdx>
struct ThrCopy;

template <class Copy_Atom,
          class LayoutCopy_TV,  // (tid,vid) -> coord   [Need not be 2D...]
          class ShapeTiler_MN>  // coord space
struct TiledCopy : Copy_Atom
{
  // Layout information from the CopyAtom
  using AtomThrID     = typename Copy_Atom::ThrID;        // thrid -> thr_idx
  using AtomLayoutSrc = typename Copy_Atom::ValLayoutSrc; // (thr,val) -> offset
  using AtomLayoutDst = typename Copy_Atom::ValLayoutDst; // (thr,val) -> offset
  using AtomLayoutRef = typename Copy_Atom::ValLayoutRef; // (thr,val) -> offset

  using AtomNumThr = decltype(size<0>(AtomLayoutRef{}));
  using AtomNumVal = decltype(size<1>(AtomLayoutRef{}));

  // Layout information for the TiledCopy
  using Tiler_MN       = ShapeTiler_MN;
  using TiledLayout_TV = LayoutCopy_TV;
  using TiledNumThr    = decltype(size<0>(TiledLayout_TV{}));
  using TiledNumVal    = decltype(size<1>(TiledLayout_TV{}));

  CUTE_STATIC_ASSERT_V(TiledNumThr{} % AtomNumThr{} == Int<0>{}, "TiledCopy uses too few thrs for selected CopyAtom");
  CUTE_STATIC_ASSERT_V(TiledNumVal{} % AtomNumVal{} == Int<0>{}, "TiledCopy uses too few vals for selected CopyAtom");

  // Tile a tensor or a layout from shape
  //   (M,N,...)
  // to shape
  //   ((ThrV,ThrX),FrgV,(RestM,RestN,...))
  // where
  //   ThrV:  The threads local to a COPY_ATOM Src.
  //   ThrX:  The threads tiled across COPY_ATOMs Src.
  //   FrgV:  The values local to a COPY_ATOM Src.
  //   RestM: The values tiled in M.
  //   RestN: The values tiled in N.
  template <class STensor>
  CUTE_HOST_DEVICE constexpr static
  auto
  tidfrg_S(STensor&& stensor)
  {
    CUTE_STATIC_ASSERT_V(rank(stensor) >= rank(Tiler_MN{}), "Rank of tensor to be partitioned too small.");

    // Tile the stensor and compute the (src-thr, src-val) -> (ref-thr, ref-val) layout
    return tile2thrfrg(zipped_divide(stensor,Tiler_MN{}), right_inverse(AtomLayoutRef{}).compose(AtomLayoutSrc{}));
  }

  // Tile a tensor or a layout from shape
  //   (M,N,...)
  // to shape
  //   ((ThrV,ThrX),FrgV,(RestM,RestN,...))
  // where
  //   ThrV:  The threads local to a COPY_ATOM Dst.
  //   ThrX:  The threads tiled across COPY_ATOMs Dst.
  //   FrgV:  The values local to a COPY_ATOM Dst.
  //   RestM: The values tiled in M.
  //   RestN: The values tiled in N.
  template <class DTensor>
  CUTE_HOST_DEVICE constexpr static
  auto
  tidfrg_D(DTensor&& dtensor)
  {
    CUTE_STATIC_ASSERT_V(rank(dtensor) >= rank(Tiler_MN{}), "Rank of tensor to be partitioned too small.");

    // Tile the dtensor and compute the (dst-thr, dst-val) -> (ref-thr, ref-val) layout
    return tile2thrfrg(zipped_divide(dtensor,Tiler_MN{}), right_inverse(AtomLayoutRef{}).compose(AtomLayoutDst{}));
  }

  // Tile a tensor or a layout from shape
  //   ((TileM,TileN,...), (RestM,RestN,...))
  // to shape
  //   ((ThrV,ThrX),FrgV,(RestM,RestN,...))
  template <class Tensor, class Ref2TrgLayout>
  CUTE_HOST_DEVICE constexpr static
  auto
  tile2thrfrg(Tensor&& tensor, Ref2TrgLayout const& ref2trg)
  {
    // Take the thrs/vals that the atom is interested in
    // NOTE: Assumes the AtomNumThr are contiguous and identity within TiledThrID
    auto atom_layout_TV = zipped_divide(TiledLayout_TV{}, make_shape(AtomNumThr{}, AtomNumVal{}));
    // ((atom_tid,atom_val),(rest_tid,rest_val)) -> (m,n)

    // Transform to the trg layout
    auto trg_layout_TV = atom_layout_TV.compose(ref2trg, _);
    // ((trg_tid,trg_val),(rest_tid,rest_val)) -> (m,n)

    // Transform the thrs mode from thrid to thr_idx
    // NOTE: Assumes the AtomNumThr are contiguous and identity within TiledThrID
    auto thrval2mn = coalesce(zip(trg_layout_TV), Shape<_1,Shape<_1,_1>>{});
    // ((trg_tid,rest_tid),(trg_val,rest_val)) -> (m,n)

    /// ==================

    // Transform the tile mode
    auto tv_tensor = tensor.compose(thrval2mn, _);
    // ((thrid,val),(RestM,RestN,...))

    // Unfold and return
    return tv_tensor(make_coord(_,_), _);
  }

  // retile_S and retile_D assume they are working with the reference layout -- they are the same
  template <class Tensor>
  CUTE_HOST_DEVICE constexpr static
  auto
  retile(Tensor&& tensor)
  {
    constexpr int R = remove_cvref_t<Tensor>::rank;
    // Assert that AtomLayoutSrc|Dst is identity so we can skip the Ref transformation

    // Assume the first size<0>(tensor) elements are the first val_ids in TiledLayout_TV.
    // Then, we only need the shape+layout of those size<0>(tensor) elements in TiledLayout_TV
    //   and that shape is what we gather from the other modes of tensor

    auto V = size<0>(tensor);

    auto frg_layout_mn = upcast<TiledNumThr{} * V>(right_inverse(TiledLayout_TV{}).with_shape(shape(Tiler_MN{})));
    // (m,n) -> v_idx -- The shape and order of the V inside of TiledLayout_TV

    auto frg_layout_v = zipped_divide(logical_product(make_layout(V), right_inverse(frg_layout_mn)), make_layout(AtomNumVal{}));
    // (atom_vals,rest_vals) -> (v,m,n)

    /// =======

    // Tile the tensor for TileFrg
    auto t_tensor = zipped_divide(tensor, prepend(product_each(shape(frg_layout_mn)), V));
    // ((TileV,TileM,TileN,...),(1,RestM,RestN,...))

    // Transform the tile mode
    auto v_tensor = t_tensor.compose(frg_layout_v, _);
    // ((atom_vals,rest_vals),(1,RM,RN,...))

    // Unfold and return
    return v_tensor(_, append<R>(Int<0>{},_));
  }

  CUTE_HOST_DEVICE constexpr static
  auto
  get_layoutS_TV()
  {
    // (M,N) -> (M,N)
    auto ref_S = make_layout(make_shape(shape(Tiler_MN{}), Int<1>{}));
    // (thr_idx,val_idx) -> (M,N)
    return tile2thrfrg(ref_S, right_inverse(AtomLayoutRef{}).compose(AtomLayoutSrc{}))(_,_,Int<0>{});
  }

  CUTE_HOST_DEVICE constexpr static
  auto
  get_layoutD_TV()
  {
    // (M,N) -> (M,N)
    auto ref_D = make_layout(make_shape(shape(Tiler_MN{}), Int<1>{}));
    // (thr_idx,val_idx) -> (M,N)
    return tile2thrfrg(ref_D, right_inverse(AtomLayoutRef{}).compose(AtomLayoutDst{}))(_,_,Int<0>{});
  }

  template <class ThrIdx,
            __CUTE_REQUIRES(is_integral<ThrIdx>::value)>
  CUTE_HOST_DEVICE static
  auto
  get_slice(ThrIdx const& thr_idx)
  {
    return ThrCopy<TiledCopy, ThrIdx>(thr_idx);
  }

  template <class ThrIdx,
            __CUTE_REQUIRES(is_integral<ThrIdx>::value)>
  CUTE_HOST_DEVICE static
  auto
  get_thread_slice(ThrIdx const& thr_idx)
  {
    return get_slice(thr_idx);
  }
};

template <class TiledCopy, class ThrIdx>
struct ThrCopy
{
  ThrIdx thr_idx_;

  CUTE_HOST_DEVICE
  ThrCopy(ThrIdx const& thr_idx) : thr_idx_(thr_idx) {}

  template <class STensor>
  CUTE_HOST_DEVICE
  auto
  partition_S(STensor&& stensor) const {
    //static_assert(sizeof(typename remove_cvref_t<STensor>::value_type) == sizeof(typename TiledCopy::ValType),
    //              "Expected ValType for tiling SrcTensor.");
    auto thr_tensor = make_tensor(static_cast<STensor&&>(stensor).data(), TiledCopy::tidfrg_S(stensor.layout()));
    return thr_tensor(thr_idx_, _, repeat<rank_v<STensor>>(_));
  }

  template <class DTensor>
  CUTE_HOST_DEVICE
  auto
  partition_D(DTensor&& dtensor) const {
    //static_assert(sizeof(typename remove_cvref_t<DTensor>::value_type) == sizeof(typename TiledCopy::ValType),
    //              "Expected ValType for tiling DstTensor.");
    auto thr_tensor = make_tensor(static_cast<DTensor&&>(dtensor).data(), TiledCopy::tidfrg_D(dtensor.layout()));
    return thr_tensor(thr_idx_, _, repeat<rank_v<DTensor>>(_));
  }

  template <class STensor>
  CUTE_HOST_DEVICE static
  auto
  retile_S(STensor&& stensor) {
    // static_assert(sizeof(typename remove_cvref_t<STensor>::value_type) == sizeof(typename TiledCopy::ValType),
    //               "Expected ValType for tiling SrcTensor.");
    return make_tensor(static_cast<STensor&&>(stensor).data(), TiledCopy::retile(stensor.layout()));
  }

  template <class DTensor>
  CUTE_HOST_DEVICE static
  auto
  retile_D(DTensor&& dtensor) {
    // static_assert(sizeof(typename remove_cvref_t<DTensor>::value_type) == sizeof(typename TiledCopy::ValType),
    //               "Expected ValType for tiling DstTensor.");
    return make_tensor(static_cast<DTensor&&>(dtensor).data(), TiledCopy::retile(dtensor.layout()));
  }
};

Given a source tensor and a target tensor, the TiledCopy templated class knows how to partition them into multiple smaller tensors using the Tiler_MN tiler information and the TiledLayout_TV TV layout information. Such partitioning allows slicing the partitioned tensors by thread index and allows the tiled copy operation to be performed iteratively for tensors that are larger than the tiler shape. The source and target partition functions are implemented in the tidfrg_S and tidfrg_D member functions, respectively.

The ThrCopy instance is produced by the TiledCopy instance using the get_slice or get_thread_slice member function, and it owns the thread index value at runtime. So it knows how to slice the partitioned source tensor and target tensor for a specific thread in the thread group using the thread index value.
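
For example, inside a CUDA kernel, each thread would typically obtain its ThrCopy slice from the TiledCopy instance, partition the source and target tensors, and then issue the copy. The following is a minimal device-side sketch of this pattern; the names tiled_copy, tile_gmem, and tile_smem are hypothetical placeholders that are assumed to have been set up beforehand.

// A minimal device-side sketch, assuming tiled_copy is a cute::TiledCopy instance and
// tile_gmem / tile_smem are the global memory source tile and shared memory destination
// tile for this thread block. These names are hypothetical, not from a specific example.
auto thr_copy = tiled_copy.get_thread_slice(threadIdx.x); // ThrCopy for this thread
auto thr_gmem = thr_copy.partition_S(tile_gmem);          // (CPY, CPY_M, CPY_N)
auto thr_smem = thr_copy.partition_D(tile_smem);          // (CPY, CPY_M, CPY_N)
cute::copy(tiled_copy, thr_gmem, thr_smem);               // issue the tiled copy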

TiledCopy Helper Factory Function

There is a common helper factory function make_tiled_copy_impl in CuTe for constructing a TiledCopy templated class instance from a copy atom, a thread layout, and a value layout. It is used by all the overloaded make_tiled_copy functions in CuTe.

template <class... Args,
          class LayoutCopy_TV,
          class Tiler>
CUTE_HOST_DEVICE
auto
make_tiled_copy_impl(Copy_Atom<Args...> const& atom,
                     LayoutCopy_TV      const&,
                     Tiler              const&)
{
  return TiledCopy<Copy_Atom<Args...>, LayoutCopy_TV, Tiler>{atom};
}

CuTe Tiled Copy from Thread Layout and Value Layout

One of the most widely used overloaded make_tiled_copy functions in CuTe takes a copy atom copy_atom, a thread layout thr_layout, and a value layout val_layout as inputs and returns a TiledCopy instance. It is commonly used for performing tiled copy operations between global memory and shared memory (especially before SM90), resulting in fewer memory transactions and higher memory bandwidth utilization.

/** Produce a TiledCopy from logical thread and values layouts.
 * The thread and value layouts map coordinates to thr_idx and val_idx.
 * The product of these layouts is taken to produce the TV layout and the Tiler.
 * Useful when threads and values need very specific mappings onto coordinates
 * in the target tensors.
 */
template <class... Args,
          class ThrLayout,
          class ValLayout = Layout<_1>>
CUTE_HOST_DEVICE
auto
make_tiled_copy(Copy_Atom<Args...> const& copy_atom,
                ThrLayout          const& thr_layout = {},  // (m,n) -> thr_idx
                ValLayout          const& val_layout = {})  // (m,n) -> val_idx
{
  // Take the raked_products to compute the Layout_MN
  // (M,N) -> (thr_idx, val_idx)
  auto layout_mn = raked_product(thr_layout, val_layout);
  // (thr_idx, val_idx) -> (M,N)
  auto layout_tv = right_inverse(layout_mn).with_shape(make_shape(size(thr_layout), size(val_layout)));
  // Tiler for extracting relevant elements
  // (M,N) -> tensor coord
  auto tiler = product_each(shape(layout_mn));

#if 0
  print("thr_layout: "); print(thr_layout); print("\n");
  print("val_layout: "); print(val_layout); print("\n");
  print("layout_mn : "); print(layout_mn);  print("\n");
  print("layout_tv : "); print(layout_tv);  print("\n");
  print("tiler     : "); print(tiler);      print("\n");
#endif

  return make_tiled_copy_impl(copy_atom, layout_tv, tiler);
}

It constructs a TV layout layout_tv and a tiler tiler, and uses them to construct a TiledCopy instance via the make_tiled_copy_impl helper factory function.

CuTe Tiled Copy Thread-Value Layout

The CuTe Thread-Value (TV) layout is used for partitioning a tensor into multiple smaller tensors with specific access patterns, each of which will be accessed by one thread in a thread group. This is particularly useful for tiled copy and tiled MMA operations in CuTe.

To construct a TV layout for tiled copy, CuTe provides a few convenient functions, such as cute::make_tiled_copy, which takes a copy atom copy_atom, a thread layout thr_layout, and a value layout val_layout as inputs and returns a cute::TiledCopy object.

/** Produce a TiledCopy from logical thread and values layouts.
 * The thread and value layouts map coordinates to thr_idx and val_idx.
 * The product of these layouts is taken to produce the TV layout and the Tiler.
 * Useful when threads and values need very specific mappings onto coordinates
 * in the target tensors.
 */
template <class... Args,
          class ThrLayout,
          class ValLayout = Layout<_1>>
CUTE_HOST_DEVICE
auto
make_tiled_copy(Copy_Atom<Args...> const& copy_atom,
                ThrLayout          const& thr_layout = {},  // (m,n) -> thr_idx
                ValLayout          const& val_layout = {})  // (m,n) -> val_idx
{
  // Take the raked_products to compute the Layout_MN
  // (M,N) -> (thr_idx, val_idx)
  auto layout_mn = raked_product(thr_layout, val_layout);
  // (thr_idx, val_idx) -> (M,N)
  auto layout_tv = right_inverse(layout_mn).with_shape(make_shape(size(thr_layout), size(val_layout)));
  // Tiler for extracting relevant elements
  // (M,N) -> tensor coord
  auto tiler = product_each(shape(layout_mn));

#if 0
  print("thr_layout: "); print(thr_layout); print("\n");
  print("val_layout: "); print(val_layout); print("\n");
  print("layout_mn : "); print(layout_mn);  print("\n");
  print("layout_tv : "); print(layout_tv);  print("\n");
  print("tiler     : "); print(tiler);      print("\n");
#endif

  return make_tiled_copy_impl(copy_atom, layout_tv, tiler);
}

The val_layout is the layout of the values that a thread will access in an atomic copy operation. Usually, uint128_t is used as the vector type for the atomic copy operation, and each thread will access a vector of values whose total size is 128 bits. If the size of a value is 32 bits, each thread will access a vector of 4 values at a time, and if the size of a value is 16 bits, each thread will access a vector of 8 values at a time. In this case, val_layout is $4 : 1$ for a vector of four 32-bit values, or $8 : 1$ for a vector of eight 16-bit values.

The thr_layout is the layout of the threads in a thread group. Usually this layout is a 2D layout, such as $(32, 4) : (1, 32)$ for a thread group of 128 threads in column-major order.

With the thr_layout and val_layout, we could compute the TV layout layout_tv for tiled copy. To construct the TV layout, cute::make_tiled_copy first computes the raked product of thr_layout and val_layout to get a layout layout_mn, which maps the coordinates of the target tensor to the thread index and value index, i.e., $(m, n) \rightarrow (\text{thr_idx}, \text{val_idx})$. This is also known as the inverse TV layout. Then, the TV layout layout_tv, which maps the thread index and value index to the coordinates of the target tensor, i.e., $(\text{thr_idx}, \text{val_idx}) \rightarrow (m, n)$, is computed by taking the right inverse of layout_mn. The inverse of the inverse TV layout is the TV layout.
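
To make this concrete, the following is a small host-only sketch (my own, assuming the CuTe headers are available) that performs the same raked_product and right_inverse computation for the thread and value layouts used in the example later in this article, i.e., thr_layout $(8, 4) : (1, 8)$ and val_layout $8 : 1$. The printed layout_tv should match the TiledLayout_TV $(32, 8) : (8, 1)$ reported by that example.

#include <cute/layout.hpp>

int main()
{
    using namespace cute;
    // thr_layout: (8,4):(1,8), val_layout: (8):(1)
    auto thr_layout = make_layout(make_shape(Int<8>{}, Int<4>{}),
                                  make_stride(Int<1>{}, Int<8>{}));
    auto val_layout = make_layout(make_shape(Int<8>{}), make_stride(Int<1>{}));
    // (M,N) -> (thr_idx, val_idx): the inverse TV layout
    auto layout_mn = raked_product(thr_layout, val_layout);
    // (thr_idx, val_idx) -> (M,N): the TV layout
    auto layout_tv = right_inverse(layout_mn).with_shape(
        make_shape(size(thr_layout), size(val_layout)));
    print("layout_mn: "); print(layout_mn); print("\n");
    print("layout_tv: "); print(layout_tv); print("\n"); // expect (_32,_8):(_8,_1)
    return 0;
}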

This is actually a brilliant piece of layout algebra for computing the TV layout for tiled copy, although it might not seem straightforward at first glance. With an understanding of the CuTe raked product, the CuTe TV layout, and the CuTe inverse TV layout, and by looking at the example raked product layouts in the CuTe documentation, it becomes somewhat more intuitive. However, I would still like to show an informal mathematical proof of why this algebra works.

Let $A$ and $B$ be the thr_layout and val_layout, respectively. Both $A$ and $B$ are compact layouts. We would like to prove that the values assigned to each thread in the TV layout computed by cute::make_tiled_copy indeed belong to that thread, and that the layout of the values each thread accesses within the tiled copy tiler is the same as val_layout.

Proof

Without loss of generality, we could assume that the rank of $A$ and $B$ is 2, i.e., $A = (M_0, M_1) : (d_0, d_1)$ and $B = (N_0, N_1) : (r_0, r_1)$. Let $C$ be the inverse TV layout layout_mn computed by the raked product of $A$ and $B$, i.e., $C = A \bowtie B$. Let $D$ be the TV layout layout_tv computed by taking the right inverse of $C$, i.e., $D = C^{-1}$.

By definition of the raked product, and being a little bit sloppy, we must have

$$
\begin{align}
C &= A \bowtie B \\
&= (M_0, M_1) : (d_0, d_1) \bowtie (N_0, N_1) : (r_0, r_1) \\
&= ((N_0, M_0), (N_1, M_1)) : ((r_0 \left( \text{cosize}(A) \right), d_0), (r_1 \left( \text{cosize}(A) \right), d_1)) \\
\end{align}
$$

where $\text{cosize}(A) = \text{size}(A) = M_0 M_1$ because $A$ is a compact layout.

For a certain thread whose coordinate is $(m_0, m_1)$ in $A$ and a certain value whose coordinate is $(n_0, n_1)$ in $B$, the flattened index in the $(\text{thr_idx}, \text{val_idx})$ codomain produced by the inverse TV layout $C$ can be computed as follows.

$$
\begin{align}
C((n_0, m_0), (n_1, m_1)) &= n_0 \times (r_0 \left( \text{cosize}(A) \right)) + m_0 \times d_0 + n_1 \times (r_1 \left( \text{cosize}(A) \right)) + m_1 \times d_1 \\
&= n_0 r_0 M_0 M_1 + m_0 d_0 + n_1 r_1 M_0 M_1 + m_1 d_1 \\
\end{align}
$$

The thread index $\text{thr_idx}$ of the thread in the TV layout $D$ whose coordinate is $(m_0, m_1)$ in $A$ could be computed as follows.

$$
\begin{align}
\text{thr_idx} &= C((n_0, m_0), (n_1, m_1)) \mod (M_0 M_1) \\
&= (n_0 r_0 M_0 M_1 + m_0 d_0 + n_1 r_1 M_0 M_1 + m_1 d_1) \mod (M_0 M_1) \\
&= m_0 d_0 + m_1 d_1 \\
\end{align}
$$

where the modulo eliminates the $n_0 r_0 M_0 M_1$ and $n_1 r_1 M_0 M_1$ terms, and $m_0 d_0 + m_1 d_1 = A(m_0, m_1) < M_0 M_1$ because $A$ is a compact layout, so $m_0 d_0 + m_1 d_1$ is exactly the expected thread index computed by the layout $A$.

Thus, we have proven that the values for each thread in the TV layout computed by cute::make_tiled_copy belong to that thread.

$$
\text{thr_idx} = C((n_0, m_0), (n_1, m_1)) \bmod (M_0 M_1) = A(m_0, m_1)
$$

Then, the value index $\text{val_idx}$ of the value in the TV layout $D$ whose coordinate is $(n_0, n_1)$ in $B$ could be computed as follows.

$$
\begin{align}
\text{val_idx} &= \left\lfloor \frac{C((n_0, m_0), (n_1, m_1))}{M_0 M_1} \right\rfloor \\
&= \left\lfloor \frac{n_0 r_0 M_0 M_1 + m_0 d_0 + n_1 r_1 M_0 M_1 + m_1 d_1}{M_0 M_1} \right\rfloor \\
&= \left\lfloor n_0 r_0 + n_1 r_1 + \frac{m_0 d_0 + m_1 d_1}{M_0 M_1} \right\rfloor \\
&= n_0 r_0 + n_1 r_1 + \left\lfloor \frac{m_0 d_0 + m_1 d_1}{M_0 M_1} \right\rfloor \\
&= n_0 r_0 + n_1 r_1 + 0 \\
&= n_0 r_0 + n_1 r_1 \\
\end{align}
$$

where $n_0 r_0 + n_1 r_1$ is exactly the expected value index computed by the layout $B$, $B(n_0, n_1)$.

Thus, we have proven that the values each thread accesses are indexed exactly according to val_layout, i.e., $\text{val_idx} = B(n_0, n_1)$.

This concludes the proof. $\square$
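
As a quick numeric sanity check of the algebra above (my own example, not from the CuTe documentation), take $A = (2, 2) : (1, 2)$ and $B = (2, 2) : (1, 2)$, so that $M_0 = M_1 = N_0 = N_1 = 2$, $d_0 = r_0 = 1$, $d_1 = r_1 = 2$, and $\text{cosize}(A) = 4$. For the thread $(m_0, m_1) = (1, 1)$ and the value $(n_0, n_1) = (1, 1)$, we have

$$
C((1, 1), (1, 1)) = 1 \times 4 + 1 \times 1 + 1 \times 8 + 1 \times 2 = 15
$$

so $\text{thr_idx} = 15 \bmod 4 = 3 = A(1, 1)$ and $\text{val_idx} = \left\lfloor 15 / 4 \right\rfloor = 3 = B(1, 1)$, as expected.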

Example of CuTe Tiled Copy from Thread Layout and Value Layout

The following is a simple host preview example that uses cute::make_tiled_copy with a thread layout and a value layout to build a tiled copy for copying a tile from global memory to shared memory, and prints the partitioned tensors involved.

cute_tiled_copy_preview.cpp
#include <cassert>
#include <fstream>
#include <iomanip>
#include <iostream>

#include <cute/layout.hpp>
#include <cute/swizzle.hpp>
#include <cute/tensor.hpp>

#include <thrust/host_vector.h>

int main(int argc, const char** argv)
{
    using TA = uint16_t;
    using VectorTypeA = cute::uint128_t;
    using CopyOperationA = cute::SM80_CP_ASYNC_CACHEALWAYS<VectorTypeA>;

    constexpr int M{1024};
    constexpr int K{1024};
    auto const shape_M = cute::Int<M>{};
    auto const shape_K = cute::Int<K>{};

    auto const gmem_shape_A{cute::make_shape(shape_M, shape_K)};
    auto const gmem_stride_A{
        cute::make_stride(cute::Int<1>{}, shape_M)}; // Column-major
    auto const gmem_layout_A{
        cute::make_layout(gmem_shape_A, gmem_stride_A)}; // (M, K)

    auto const gmem_size_A{M * K};
    auto h_gmem_A = thrust::host_vector<TA>(gmem_size_A);

    auto gmem_tensor_A{
        cute::make_tensor(cute::make_gmem_ptr(h_gmem_A.data()), gmem_layout_A)};

    constexpr int bM{128 * 2 / sizeof(TA)};
    constexpr int bN{128 * 2 / sizeof(TA)};
    constexpr int bK{32};
    auto const blk_M = cute::Int<bM>{};
    auto const blk_K = cute::Int<bK>{};
    auto const cta_tiler{cute::make_shape(bM, bN, bK)}; // (BLK_M, BLK_N, BLK_K)

    auto const smem_shape_A{cute::make_shape(blk_M, blk_K)};
    auto const smem_stride_A{
        cute::make_stride(cute::Int<1>{}, blk_M)}; // Column-major
    auto const smem_layout_A{
        cute::make_layout(smem_shape_A, smem_stride_A)}; // (blk_M, blk_K)

    auto const smem_size_A{blk_M * blk_K};

    auto h_smem_A = thrust::host_vector<TA>(smem_size_A);

    auto smem_tensor_A{cute::make_tensor(h_smem_A.data(), smem_layout_A)};

    auto const cta_coord{cute::make_coord(0, 0, cute::_)}; // (m, n, :)
    auto gmem_cta_tensor_A{
        cute::local_tile(gmem_tensor_A, cta_tiler, cta_coord,
                         cute::Step<cute::Int<1>, cute::X,
                                    cute::Int<1>>{})}; // (BLK_M, BLK_K, k)

    auto const thread_shape_A{
        cute::make_shape(cute::Int<8>{}, cute::Int<4>{})}; // (THR_M, THR_K)
    auto const thread_stride_A{cute::make_stride(
        cute::Int<1>{}, cute::size<0>(thread_shape_A))}; // column-major
    auto const thread_layout_A{
        cute::make_layout(thread_shape_A, thread_stride_A)}; // (THR_M, THR_K)

    constexpr auto NUM_VECTOR_ELEMENTS_A{sizeof(VectorTypeA) / sizeof(TA)};
    auto const vector_shape_A{cute::make_shape(
        cute::Int<NUM_VECTOR_ELEMENTS_A>{})}; // (NUM_VECTOR_ELEMENTS_A,)
    auto const vector_stride_A{cute::make_stride(cute::Int<1>{})};
    auto const vector_layout_A{cute::make_layout(
        vector_shape_A, vector_stride_A)}; // (NUM_VECTOR_ELEMENTS_A,)
    auto gmem_tiled_copy_A{
        cute::make_tiled_copy(cute::Copy_Atom<CopyOperationA, TA>{},
                              thread_layout_A, vector_layout_A)};

    auto const thread_idx{0};
    auto thread_gmem_copy_A{gmem_tiled_copy_A.get_slice(thread_idx)};
    auto thread_layout_A_gmem_cta_tensor_A{thread_gmem_copy_A.partition_S(
        gmem_cta_tensor_A)}; // (CPY, CPY_M, CPY_K, k)
    auto thread_layout_A_smem_tensor_A{
        thread_gmem_copy_A.partition_D(smem_tensor_A)}; // (CPY, CPY_M, CPY_K)

    std::cout << "gmem_tensor_A: " << std::endl;
    cute::print(gmem_tensor_A);
    std::cout << std::endl;
    std::cout << "gmem_cta_tensor_A: " << std::endl;
    cute::print(gmem_cta_tensor_A);
    std::cout << std::endl;
    std::cout << "smem_tensor_A: " << std::endl;
    cute::print(smem_tensor_A);
    std::cout << std::endl;
    std::cout << "thread_layout_A: " << std::endl;
    cute::print(thread_layout_A);
    std::cout << std::endl;
    std::cout << "vector_layout_A: " << std::endl;
    cute::print(vector_layout_A);
    std::cout << std::endl;
    std::cout << "gmem_tiled_copy_A: " << std::endl;
    cute::print(gmem_tiled_copy_A);
    std::cout << std::endl;
    std::cout << "thread_gmem_copy_A: " << std::endl;
    cute::print(thread_gmem_copy_A);
    std::cout << std::endl;
    std::cout << "thread_layout_A_gmem_cta_tensor_A: " << std::endl;
    cute::print(thread_layout_A_gmem_cta_tensor_A);
    std::cout << std::endl;
    std::cout << "thread_layout_A_smem_tensor_A: " << std::endl;
    cute::print(thread_layout_A_smem_tensor_A);
    std::cout << std::endl;

    return 0;
}

The tiled copy object and the tensor layouts involved in the tiled copy operation are printed as follows. They match our intuition and expectations.

gmem_tensor_A:
gmem_ptr[16b](0x7892c94a5010) o (_1024,_1024):(_1,_1024)
gmem_cta_tensor_A:
gmem_ptr[16b](0x7892c94a5010) o (128,32,32):(_1,_1024,32768)
smem_tensor_A:
ptr[16b](0x5674bb3e9010) o (_128,_32):(_1,_128)
thread_layout_A:
(_8,_4):(_1,_8)
vector_layout_A:
(_8):(_1)
gmem_tiled_copy_A:
TiledCopy
Tiler_MN: (_64,_4)
TiledLayout_TV: (_32,_8):(_8,_1)
Copy_Atom
ThrID: _1:_0
ValLayoutSrc: (_1,_8):(_0,_1)
ValLayoutDst: (_1,_8):(_0,_1)
ValLayoutRef: (_1,_8):(_0,_1)
ValueType: 16b

thread_gmem_copy_A:
ThrCopy
ThrIdx: 0
TiledCopy
Tiler_MN: (_64,_4)
TiledLayout_TV: (_32,_8):(_8,_1)
Copy_Atom
ThrID: _1:_0
ValLayoutSrc: (_1,_8):(_0,_1)
ValLayoutDst: (_1,_8):(_0,_1)
ValLayoutRef: (_1,_8):(_0,_1)
ValueType: 16b

thread_layout_A_gmem_cta_tensor_A:
gmem_ptr[16b](0x7892c94a5010) o ((_8,_1),2,8,32):((_1,_0),_64,_4096,32768)
thread_layout_A_smem_tensor_A:
ptr[16b](0x5674bb3e9010) o ((_8,_1),_2,_8):((_1,_0),_64,_512)

More specifically, we want each thread to access a vector of 8 16-bit values at a time, so the vector_layout_A is $8 : 1$. Given that we have a thread group of 32 threads in the thread block whose layout is $(8, 4) : (1, 8)$, the tiler Tiler_MN of the tiled copy gmem_tiled_copy_A must be $(64, 4)$. The TV layout TiledLayout_TV of the tiled copy gmem_tiled_copy_A must be $(32, 8) : (8, 1)$ so that each thread accesses 8 values whose coordinates on the target tensor are contiguous. It has to be emphasized that it is the coordinates of the values on the target tensor that are contiguous, not necessarily the indices of the values on the target tensor.

Given the shared memory tensor smem_tensor_A of shape $(128, 32)$ and the global memory tensor gmem_cta_tensor_A of shape $(128, 32, 32)$, the partitioned source tensor thread_layout_A_gmem_cta_tensor_A of shape $((8, 1), 2, 8, 32)$ and the partitioned target tensor thread_layout_A_smem_tensor_A of shape $((8, 1), 2, 8)$ are exactly what we expect. The first mode of the partitioned source and target tensors, i.e., CPY, is the copy atom mode, which has a layout of $(8, 1) : (1, 0)$. This means that each thread performs a vector copy of 8 contiguous values at a time. Because the tiler Tiler_MN is $(64, 4)$, covering a tile of shape $(128, 32)$ requires $128 / 64 = 2$ atomic copy operations along the $M$ dimension and $32 / 4 = 8$ atomic copy operations along the $K$ dimension per thread. That's why CPY_M is 2 and CPY_K is 8.
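
In a real kernel, the copy itself (which the host preview above does not perform) would then be issued with cute::copy using these partitioned tensors, iterating over the k mode of the partitioned global memory tensor. A minimal sketch, assuming the same variable names as in the example and that the tensors actually reside in device global memory and shared memory:

// Hypothetical device-side continuation of the example above: iterate over the k mode
// and issue one tiled copy from global memory to shared memory per iteration.
for (int k = 0; k < cute::size<3>(thread_layout_A_gmem_cta_tensor_A); ++k)
{
    cute::copy(gmem_tiled_copy_A,
               thread_layout_A_gmem_cta_tensor_A(cute::_, cute::_, cute::_, k),
               thread_layout_A_smem_tensor_A);
    // Because the copy atom is SM80_CP_ASYNC_CACHEALWAYS, each iteration would in
    // practice be followed by cp.async commit/wait and a __syncthreads() before the
    // shared memory tile is consumed.
}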

It should be noted that we have to design the tiled copy carefully. Otherwise, the CuTe program might not compile or, in the worst case, the copy behavior might not be what we expect. In our previous example, the tensor layouts, thread layout, and value layout are all column-major. If they are inconsistent, however, we might see an undesired tiler Tiler_MN or a copy atom mode CPY that is not compatible with the intended copy behavior.
