CuTe Thread-Value Layout

Introduction

CuTe Thread-Value (TV) layout is used for partitioning a tensor into multiple smaller tensors with specific access patterns, each of which will be accessed by one thread in a thread group. This is particularly useful for the tiled copy and tiled MMA operations in CuTe.

In this article, I would like to quickly discuss CuTe TV layout, CuTe inverse TV layout, and CuTe TV partition.

CuTe Thread-Value Layout

CuTe Thread-Value (TV) layout is a 2D layout that describes how each thread in a thread group, and each of the values that thread will access, maps to a 1D coordinate of the target data, whatever the layout of that data is. The first mode of the TV layout corresponds to the threads in the thread group, and the second mode corresponds to the values each thread will access.

By orchestrating the TV layout, we can specify exactly which data each thread in a thread group will access. This is particularly useful for tiled copy, where each thread in a thread group accesses multiple values whose memory locations follow the pattern required by the atomic copy operation, and for tiled MMA, where each thread in a thread group accesses multiple values from the MMA sub-matrices so that the atomic MMA operation can be performed.
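
To make this concrete, the following is a minimal host-side sketch, assuming the CUTLASS CuTe headers are available. The 4-thread, 2-value TV layout here is made up purely for illustration: it assigns each thread 2 contiguous elements, as a vectorized copy atom might. Evaluating the layout at a (thread, value) coordinate yields the 1D coordinate of the data the thread accesses.

#include <cstdio>

#include <cute/tensor.hpp>

int main()
{
    using namespace cute;

    // A made-up (T4,V2) TV layout: thread t accesses the 2 contiguous
    // elements at 1D coordinates 2t and 2t + 1.
    auto tv_layout = Layout<Shape<_4, _2>, Stride<_2, _1>>{};

    for (int t = 0; t < size<0>(tv_layout); ++t)
    {
        for (int v = 0; v < size<1>(tv_layout); ++v)
        {
            // tv_layout(t, v) is the 1D coordinate of the target data.
            printf("thread %d value %d -> coord %d\n", t, v,
                   static_cast<int>(tv_layout(t, v)));
        }
    }
    return 0;
}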

CuTe Inverse Thread-Value Layout

CuTe inverse TV layout is a layout that describes how the coordinate of each element in the target data maps to the thread index in a thread group and the index of the value that the thread will access. Because an inverse TV layout returns a 1D index, the index has to be converted to the 2D thread index and value index using the shape of the TV layout, i.e., the number of threads in the thread group and the number of values each thread will access.

The inverse TV layout is often used for illustration purposes: with knowledge of the layout of the target data, it allows us to verify whether the data access pattern is as expected.
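
As a sketch of this conversion, again assuming the CuTe headers and reusing the made-up (T4,V2) layout from above, cute::right_inverse produces the inverse TV layout, and cute::idx2crd converts its 1D result into the 2D (thread, value) coordinate.

#include <cstdio>

#include <cute/tensor.hpp>

int main()
{
    using namespace cute;

    auto tv_layout = Layout<Shape<_4, _2>, Stride<_2, _1>>{};
    // The inverse TV layout: 1D data coordinate -> 1D (thread, value) index.
    auto inv_tv_layout = right_inverse(tv_layout);
    // The TV shape used to split the 1D index:
    // (number of threads, number of values per thread).
    auto tv_shape = make_shape(size<0>(tv_layout), size<1>(tv_layout));

    for (int coord = 0; coord < size(tv_layout); ++coord)
    {
        // Convert the 1D (thread, value) index into a 2D coordinate.
        auto tv_coord = idx2crd(inv_tv_layout(coord), tv_shape);
        printf("coord %d -> (thread, value) = ", coord);
        print(tv_coord);
        print("\n");
    }
    return 0;
}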

CuTe Thread-Value Layout and Inverse Thread-Value Layout Example

For example, in my previous article “CuTe Tiled MMA”, we used an MMA atom cute::SM80_16x8x16_F16F16F16F16_TN whose TV layouts for the three MMA matrices are as follows.

MMA_Atom
ThrID: _32:_1
Shape_MNK: (_16,_8,_16)
LayoutA_TV: ((_4,_8),(_2,_2,_2)):((_32,_1),(_16,_8,_128))
LayoutB_TV: ((_4,_8),(_2,_2)):((_16,_1),(_8,_64))
LayoutC_TV: ((_4,_8),(_2,_2)):((_32,_1),(_16,_8))

The TV layouts were designed by the CuTe architects to match the data access pattern of the underlying CUDA PTX mma instruction. This saves us, to some extent, from having to study the CUDA PTX mma instruction and design the TV layouts ourselves.
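
As a side note, the printout above can be reproduced with cute::print on a default-constructed MMA atom. The following is a minimal sketch assuming the CUTLASS CuTe headers.

#include <cute/atom/mma_atom.hpp>
#include <cute/atom/mma_traits_sm80.hpp>
#include <cute/tensor.hpp>

int main()
{
    using namespace cute;
    // Prints ThrID, Shape_MNK, and the A/B/C TV layouts of the MMA atom.
    print(MMA_Atom<SM80_16x8x16_F16F16F16F16_TN>{});
    return 0;
}

For reference, the atom's definition and its traits in CUTLASS are as follows.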

// MMA 16x8x16 TN
struct SM80_16x8x16_F16F16F16F16_TN
{
  using DRegisters = uint32_t[2];
  using ARegisters = uint32_t[4];
  using BRegisters = uint32_t[2];
  using CRegisters = uint32_t[2];

  CUTE_HOST_DEVICE static void
  fma(uint32_t      & d0, uint32_t      & d1,
      uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3,
      uint32_t const& b0, uint32_t const& b1,
      uint32_t const& c0, uint32_t const& c1)
  {
#if defined(CUTE_ARCH_MMA_SM80_ENABLED)
    asm volatile(
        "mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 "
        "{%0, %1},"
        "{%2, %3, %4, %5},"
        "{%6, %7},"
        "{%8, %9};\n"
        : "=r"(d0), "=r"(d1)
        :  "r"(a0),  "r"(a1),  "r"(a2),  "r"(a3),
           "r"(b0),  "r"(b1),
           "r"(c0),  "r"(c1));
#else
    CUTE_INVALID_CONTROL_PATH("Attempting to use SM80_16x8x16_F16F16F16F16_TN without CUTE_ARCH_MMA_SM80_ENABLED");
#endif
  }
};

template <>
struct MMA_Traits<SM80_16x8x16_F16F16F16F16_TN>
{
  using ValTypeD = half_t;
  using ValTypeA = half_t;
  using ValTypeB = half_t;
  using ValTypeC = half_t;

  using Shape_MNK = Shape<_16,_8,_16>;
  using ThrID     = Layout<_32>;
  using ALayout   = Layout<Shape <Shape < _4,_8>,Shape < _2,_2,  _2>>,
                           Stride<Stride<_32,_1>,Stride<_16,_8,_128>>>;
  using BLayout   = Layout<Shape <Shape < _4,_8>,Shape <_2, _2>>,
                           Stride<Stride<_16,_1>,Stride<_8,_64>>>;
  using CLayout   = SM80_16x8_Row;
};

The inverse TV layouts are visualized as follows. From this visualization, we know exactly what data each thread will access for the MMA atom operation.

[Figure: Inverse TV layouts of the cute::SM80_16x8x16_F16F16F16F16_TN MMA atom]

The CuTe APIs for layout inverse and its mathematical derivations can be found in my previous article “CuTe Inverse Layout”.
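
For instance, the following is a minimal sketch, assuming the CuTe headers, that inverts LayoutC_TV of this MMA atom and recovers which (thread, value) pair holds an arbitrarily chosen element (m, n) = (5, 3) of the 16×8 C matrix. Note that the TV layout maps into the 1D column-major coordinate m + n * 16 of the C tile.

#include <cstdio>

#include <cute/atom/mma_traits_sm80.hpp>
#include <cute/tensor.hpp>

int main()
{
    using namespace cute;

    using CLayout = typename MMA_Traits<SM80_16x8x16_F16F16F16F16_TN>::CLayout;
    auto c_tv = CLayout{};               // ((_4,_8),(_2,_2)):((_32,_1),(_16,_8))
    auto inv_c_tv = right_inverse(c_tv); // 1D C coordinate -> 1D (thread, value) index

    int const m = 5;
    int const n = 3;
    int const coord = m + n * 16;        // 1D column-major coordinate of (m, n)
    // Split the 1D (thread, value) index using the TV shape (32 threads, 4 values).
    auto tv_coord = idx2crd(inv_c_tv(coord),
                            make_shape(size<0>(c_tv), size<1>(c_tv)));
    printf("C(%d, %d) is held by (thread, value) = ", m, n);
    print(tv_coord);
    print("\n");
    return 0;
}

For element (5, 3), this prints thread 21 and value 1, which agrees with the mma.m16n8k16 data layout in the PTX documentation, where the row is the lane's group index T / 4 and the column is 2 * (T % 4) plus the register half.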

CuTe Thread-Value Partition

CuTe TV partition is a way to partition a large tensor into multiple smaller tensors according to the TV layout. Each smaller partitioned tensor will be accessed by a thread in a thread group.

Compared to CuTe inner partition and outer partition, CuTe TV partition is more straightforward. It is nothing but composing the original layout of the target tensor with the TV layout. To select the partition for a specific thread, we just need to use the thread's index in the thread group to slice the first mode of the TV-partitioned tensor.

// Construct a TV-layout that maps 8 thread indices and 4 value indices
// to 1D coordinates within a 4x8 tensor
// (T8,V4) -> (M4,N8)
auto tv_layout = Layout<Shape <Shape <_2,_4>,Shape <_2, _2>>,
                        Stride<Stride<_8,_1>,Stride<_4,_16>>>{}; // (8,4)

// Construct a 4x8 tensor with any layout
Tensor A = make_tensor<float>(Shape<_4,_8>{}, LayoutRight{});    // (4,8)
// Compose A with the tv_layout to transform its shape and order
Tensor tv = composition(A, tv_layout);                           // (8,4)
// Slice so each thread has 4 values in the shape and order that the tv_layout prescribes
Tensor v = tv(threadIdx.x, _);                                   // (4)
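
Because of threadIdx.x, the snippet above only runs on device. The following host-side sketch, assuming the CuTe headers, loops over all 8 thread indices instead and prints the 4 values each thread owns, which is handy for verifying the access pattern.

#include <cstdio>

#include <cute/tensor.hpp>

int main()
{
    using namespace cute;

    auto tv_layout = Layout<Shape <Shape <_2,_4>,Shape <_2, _2>>,
                            Stride<Stride<_8,_1>,Stride<_4,_16>>>{}; // (8,4)

    // Fill a 4x8 row-major tensor with 0..31 so that each value identifies
    // its 1D coordinate in the tensor.
    Tensor A = make_tensor<float>(Shape<_4, _8>{}, LayoutRight{});
    for (int i = 0; i < size(A); ++i)
    {
        A(i) = static_cast<float>(i);
    }

    Tensor tv = composition(A, tv_layout); // (8,4)
    for (int t = 0; t < size<0>(tv); ++t)
    {
        printf("thread %d:", t);
        for (int v = 0; v < size<1>(tv); ++v)
        {
            // tv(t, v) is the v-th value owned by thread t.
            printf(" %4.0f", tv(t, v));
        }
        printf("\n");
    }
    return 0;
}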
