A CuTe Tensor is parameterized by two template parameters: Engine and Layout. The Engine holds an iterator that can be dereferenced to access the data.
When we print a Tensor instance from a CuTe program, in most cases, we will see the following printout, which consists of an iterator and a layout.
ptr[16b](0x5ded6f122010) o (_128,_32):(_1,_128)
In this case, the iterator is a pointer described as ptr[16b](0x5ded6f122010), which is a pointer to 16-bit elements. The layout is (_128,_32):(_1,_128), which means the tensor has a shape of $(128, 32)$ and a stride of $(1, 128)$.
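For example, a tensor that prints like this could be created as follows. This is a minimal sketch; the cute::half_t global memory buffer is an assumption for illustration.

// A minimal sketch; data is assumed to point to at least 128 * 32
// cute::half_t elements in global memory.
cute::half_t* data{/* device buffer */};
auto tensor{cute::make_tensor(
    cute::make_gmem_ptr(data),
    cute::make_layout(cute::make_shape(cute::Int<128>{}, cute::Int<32>{}),
                      cute::make_stride(cute::Int<1>{}, cute::Int<128>{})))};
cute::print(tensor); // ptr[16b](...) o (_128,_32):(_1,_128)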
However, sometimes we might encounter a different printout from a CuTe Tensor instance, such as:
ArithTuple(0,0) o (_128,_128):(_1@0,_1@1)
In this case, the iterator is an ArithmeticTuple described as ArithTuple(0,0) and the layout is (_128,_128):(_1@0,_1@1), both of which look quite different from the previous common case.
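Such a printout can be produced, for example, by cute::make_identity_tensor, which will be discussed in more detail below. A minimal sketch:

// A minimal sketch that produces a printout like the one above.
auto identity_tensor{cute::make_identity_tensor(
    cute::make_shape(cute::Int<128>{}, cute::Int<128>{}))};
cute::print(identity_tensor); // ArithTuple(0,0) o (_128,_128):(_1@0,_1@1)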
In the CuTe official documentation, such CuTe Tensors are referred to as “CuTe TMA Tensors”. However, because this kind of tensor is not only used for TMA operations in CuTe, and it already existed before TMA became available on NVIDIA Hopper GPUs, I personally don’t like to call it a “CuTe TMA Tensor”. Instead, I would like to call it a “CuTe Arithmetic Tuple Tensor”, because the iterator used in the tensor is an ArithmeticTuple, as opposed to a “CuTe Data Tensor”, whose iterator is a pointer to data.
In this article, I would like to quickly discuss the CuTe Arithmetic Tuple Tensor.
CuTe Arithmetic Tuple Tensor Example
CuTe Arithmetic Tuple Identity Tensor
In addition to CuTe TMA operations, the CuTe Arithmetic Tuple Tensor is commonly used for computing the coordinate of each element of a partitioned tensor in the original tensor, for data access boundary checking. More specifically, the cute::make_identity_tensor function is often used for creating a CuTe Arithmetic Tuple Tensor that represents the coordinates of each element in the original CuTe Data Tensor. The coordinate tensor then follows the same problem partitioning as the corresponding CuTe Data Tensor, so the partitioned coordinate tensor can serve as a predicate for the partitioned data tensor, as in the following sketch.
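This is a minimal sketch of such a boundary check; fill_in_bounds is a hypothetical helper, and the runtime extents m and n are illustrative names rather than code from this article.

// A sketch of boundary checking with a partitioned coordinate tensor.
// fill_in_bounds is a hypothetical helper; m and n are the runtime extents
// of the original tensor.
template <class DataTensor, class IdentityTensor>
__host__ __device__ void fill_in_bounds(DataTensor thread_tensor,
                                        IdentityTensor thread_identity_tensor,
                                        int m, int n)
{
    for (int i{0}; i < cute::size(thread_tensor); ++i)
    {
        // thread_identity_tensor(i) is the (m, n) coordinate in the
        // original global tensor of the element thread_tensor(i) refers to.
        if (cute::elem_less(thread_identity_tensor(i), cute::make_coord(m, n)))
        {
            thread_tensor(i) = cute::half_t{0};
        }
    }
}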
The following is a simple host example of using cute::make_identity_tensor and the CuTe partition functions to compute the coordinates of each element in a partitioned tensor for an MMA problem.
// Configure tiled MMA.
using MmaTraits = cute::MMA_Traits<cute::SM80_16x8x16_F16F16F16F16_TN>;
using MmaAtomShape = MmaTraits::Shape_MNK;
auto const mma_atom = cute::MMA_Atom<MmaTraits>{};
auto const mma_atom_shape = MmaAtomShape{};
// Repeating the mma atom along the M, N, and K dimensions.
// This increases the number of threads to process the tiled MMA.
constexpr int MMA_LAYOUT_M{2};
constexpr int MMA_LAYOUT_N{2};
constexpr int MMA_LAYOUT_K{1};
auto mma_layout{cute::make_layout(
    cute::make_shape(cute::Int<MMA_LAYOUT_M>{}, cute::Int<MMA_LAYOUT_N>{},
                     cute::Int<MMA_LAYOUT_K>{}))};
// Repeating the mma processing along the M, N, and K dimensions.
// This does not increase the number of threads to process the tiled MMA.
// But the number of registers required for processing the tiled MMA
// increases.
constexpr int NUM_MMA_TILE_M{1};
constexpr int NUM_MMA_TILE_N{2};
constexpr int NUM_MMA_TILE_K{1};
constexpr int MMA_TILE_M{cute::get<0>(mma_atom_shape) * MMA_LAYOUT_M * NUM_MMA_TILE_M};
constexpr int MMA_TILE_N{cute::get<1>(mma_atom_shape) * MMA_LAYOUT_N * NUM_MMA_TILE_N};
constexpr int MMA_TILE_K{cute::get<2>(mma_atom_shape) * MMA_LAYOUT_K * NUM_MMA_TILE_K};
auto mma_tile{cute::make_tile(cute::Int<MMA_TILE_M>{}, cute::Int<MMA_TILE_N>{},
                              cute::Int<MMA_TILE_K>{})};
auto tiled_mma{cute::make_tiled_mma(mma_atom, mma_layout, mma_tile)};
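The snippet above only configures the tiled MMA. The coordinate tensors referenced below could be created and partitioned along the following lines. This is a sketch rather than the article's full code: the problem size M = N = 512, the 128 x 128 block tile, and the block and thread indices are assumptions chosen to match the printouts in this article.

// A sketch of creating and partitioning the coordinate tensor; the sizes
// and indices are assumptions chosen to match the printouts below.
constexpr int M{512};
constexpr int N{512};
constexpr int BLOCK_TILE_M{128};
constexpr int BLOCK_TILE_N{128};
// Each element of this tensor is its own (m, n) coordinate. No storage is
// allocated because the coordinates are generated on the fly.
auto global_identity_tensor{cute::make_identity_tensor(
    cute::make_shape(cute::Int<M>{}, cute::Int<N>{}))};
// Take the 128 x 128 tile at block coordinate (1, 1).
auto block_identity_tensor{cute::local_tile(
    global_identity_tensor,
    cute::make_shape(cute::Int<BLOCK_TILE_M>{}, cute::Int<BLOCK_TILE_N>{}),
    cute::make_coord(1, 1))};
// Partition the tile among the threads of the tiled MMA and take the slice
// of thread 1.
auto thread_mma{tiled_mma.get_slice(1)};
auto thread_identity_tensor{thread_mma.partition_C(block_identity_tensor)};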
The partitioned data tensors, global_tensor, block_tensor, and thread_tensor, and the coordinate tensors created by cute::make_identity_tensor and partitioned in the same way, global_identity_tensor, block_identity_tensor, and thread_identity_tensor, can all be printed using cute::print or cute::print_tensor. The data tensors are printed as follows, whereas the coordinate tensors are all CuTe Arithmetic Tuple Tensors.
global_tensor
ptr[16b](0x7a06afbba010) o (_512,_512):(_1,_512)
block_tensor
ptr[16b](0x7a06afbda110) o (_128,_128):(_1,_512)
thread_tensor
ptr[16b](0x7a06afbda910) o ((_2,_2),_4,_8):((_512,_8),_32,_8192)
The CuTe Arithmetic Tuple Tensor thread_identity_tensor, for example, has the iterator ArithTuple(128,130) and the layout ((_2,_2),_4,_8):((_1@1,_8@0),_32@0,_16@1). The cute::print_tensor function will iterate through each element in the tensor, compute the coordinate of each element from the iterator and the layout, and print the coordinates. In this case, the stride of the layout is ((_1@1,_8@0),_32@0,_16@1), which is different from an integer stride like (_1,_128) in a CuTe Data Tensor, yet CuTe layout algebra, such as logical division, still seems to work. We will discuss how CuTe layout algebra works for layouts whose strides are arithmetic tuples in the next section.
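As a sketch, printing the tensor evaluates it at every coordinate; for instance, the element at coordinate ((1,1),2,3) evaluates to (200,179), as derived in the next section.

// Prints the (m, n) coordinate of every element the thread accesses.
cute::print_tensor(thread_identity_tensor);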
In fact, before I learned the difference between the CuTe Arithmetic Tuple Tensor and the CuTe Data Tensor, I was fooled by the name of cute::make_identity_tensor. I thought cute::make_identity_tensor would produce a CuTe Data Tensor whose storage holds the coordinates. However, if that were the case, the coordinate data for large tensors would usually not fit into the storage, especially when the storage is registers, which would make no sense at all in high performance computing. Therefore, the coordinates are generated on the fly from the CuTe Arithmetic Tuple Tensor without taking additional storage. This is also consistent with how we compute the coordinates of each element in a tensor when we write CUDA kernels without using CuTe.
CuTe Layout Algebra In CuTe Arithmetic Tuple Tensor
CuTe Layout Coordinate Mapping With Arithmetic Tuple Stride
In my previous article “CuTe Layout Algebra”, I discussed how CuTe layout algebra works for layouts whose shapes and strides are all integers. In a CuTe Arithmetic Tuple Tensor, the stride of the layout is not an integer but an arithmetic tuple, such as (_1@0,_1@1).
So what are 1@0, 1@1, etc. in the non-integer stride? The descriptions can be found in the CuTe official documentation. Basically, they represent basis elements of an infinite-dimensional vector space.
String Representation    Description
1                        1
1@0                      (1,0,...)
1@1                      (0,1,0,...)
1@0@0                    ((1,0,...),0,...)
1@1@0                    ((0,1,0,...),0,...)
1@0@1                    (0,(1,0,...),0,...)
1@1@1                    (0,(0,1,0,...),0,...)
The basis elements can be nested, which is why we can see multiple @ in the string representation. For example, 1@0@1 represents the basis element (0,(1,0,...),0,...).
The basis elements can be scaled by integers. For example, 3@1 represents (0,3,0,...).
The basis elements can be added together. For example, 1@0 + 2@1 represents (1,2,0,...).
With these properties of basis elements defined, we can compute the coordinate of each element in a CuTe Arithmetic Tuple Tensor from its iterator and layout; the layout mapping is essentially an inner product between the input coordinate and the stride.
Take the thread_identity_tensor from the previous section: its iterator is ArithTuple(128,130) and its layout is ((_2,_2),_4,_8):((_1@1,_8@0),_32@0,_16@1). Given an input coordinate ((1,1),2,3) to the layout, the inner product is computed as (1,1) x (_1@1,_8@0) + 2 x _32@0 + 3 x _16@1 = 1 x (0,1,0,...) + 1 x (8,0,0,...) + 2 x (32,0,0,...) + 3 x (0,16,0,...) = (0,1,0,...) + (8,0,0,...) + (64,0,0,...) + (0,48,0,...) = (72,49,0,...). Adding the iterator ArithTuple(128,130) to this output coordinate gives the final coordinate of the element in the original tensor: (72+128,49+130) = (200,179).
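This worked example could be verified programmatically with a sketch like the following.

// A sketch verifying the worked example above.
auto coordinate{thread_identity_tensor(
    cute::make_coord(cute::make_coord(1, 1), 2, 3))};
cute::print(coordinate); // (200,179)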
CuTe Layout Algebra With Arithmetic Tuple Stride
CuTe layout coordinate mapping with an arithmetic tuple stride seems to be straightforward. The next question is how CuTe layout algebra, such as composition, complement, logical division, and logical product, applies in this context. It turns out that CuTe layout algebra is still applicable to layouts whose strides are arithmetic tuples.
Without loss of generality, suppose we have a 2D layout with arithmetic tuple strides L1 = (S1,S2):(T1@i,T2@j). Its CuTe layout algebra is exactly the same as that of the 2D layout with integer strides L2 = (S1,S2):(T1,T2). In many cases, L2 would not be a valid layout for a CuTe Data Tensor because its index mapping might not be injective. However, L2 can still be used for CuTe layout algebra.
For example, CuTe composition is an essential operation for both logical division and logical product. In the following example, we can see that the CuTe composition and logical division behaviors are the same for layouts whose strides are arithmetic tuples as for layouts whose strides are integers.
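A minimal sketch demonstrating this, assuming a 128 x 128 identity layout and an arbitrary (32,4):(4,1) tiler layout chosen for illustration:

// A layout with arithmetic tuple strides and its integer-strided
// counterpart. The shapes and the tiler are assumptions for illustration.
auto tuple_layout{cute::make_identity_layout(
    cute::make_shape(cute::Int<128>{}, cute::Int<128>{}))};  // (_128,_128):(_1@0,_1@1)
auto integer_layout{cute::make_layout(
    cute::make_shape(cute::Int<128>{}, cute::Int<128>{}),
    cute::make_stride(cute::Int<1>{}, cute::Int<1>{}))};     // (_128,_128):(_1,_1)
auto tiler_layout{cute::make_layout(
    cute::make_shape(cute::Int<32>{}, cute::Int<4>{}),
    cute::make_stride(cute::Int<4>{}, cute::Int<1>{}))};     // (_32,_4):(_4,_1)
// The compositions follow the same rules; only the kind of stride differs.
cute::print(cute::composition(tuple_layout, tiler_layout));   // (_32,_4):(_4@0,_1@0)
cute::print(cute::composition(integer_layout, tiler_layout)); // (_32,_4):(_4,_1)
// Logical division, which builds on composition and complement, behaves
// analogously for both layouts.
cute::print(cute::logical_divide(tuple_layout, tiler_layout));
cute::print(cute::logical_divide(integer_layout, tiler_layout));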
We could also verify the idea using the example from the previous section. Because the arithmetic tuple stride of the global_identity_tensor in the example is (_1@0,_1@1), we will use the integer stride (_1,_1) for global_layout_C instead of the original integer stride (_1,_M).
// auto global_layout_C{
//     cute::make_layout(cute::make_shape(cute::Int<M>{}, cute::Int<N>{}),
//                       cute::make_stride(cute::Int<1>{}, cute::Int<M>{}))};
auto global_layout_C{cute::make_layout(
    cute::make_shape(cute::Int<M>{}, cute::Int<N>{}),
    cute::make_stride(cute::Int<1>{}, cute::Int<1>{}))};
The layouts of the integer-strided tensors global_tensor, block_tensor, and thread_tensor are then printed as follows, to be compared against the layouts of the arithmetic-tuple-strided tensors global_identity_tensor, block_identity_tensor, and thread_identity_tensor.
global_tensor
ptr[16b](0x73332bde7010) o (_512,_512):(_1,_1)
block_tensor
ptr[16b](0x73332bde7210) o (_128,_128):(_1,_1)
thread_tensor
ptr[16b](0x73332bde7214) o ((_2,_2),_4,_8):((_1,_8),_32,_16)
The layouts of global_identity_tensor, block_identity_tensor, and thread_identity_tensor are (_512,_512):(_1@0,_1@1), (_128,_128):(_1@0,_1@1), and ((_2,_2),_4,_8):((_1@1,_8@0),_32@0,_16@1), respectively. The layouts of global_tensor, block_tensor, and thread_tensor are (_512,_512):(_1,_1), (_128,_128):(_1,_1), and ((_2,_2),_4,_8):((_1,_8),_32,_16), respectively. Replacing each basis element T@i in the former with the integer T yields exactly the latter, which verifies that CuTe layout algebra works the same for layouts whose strides are arithmetic tuples as for layouts whose strides are integers.
To build a better intuition of how CuTe layout algebra works for layouts whose strides are arithmetic tuples, try deriving the CuTe composition of the global_identity_tensor layout and tiler_layout manually.
Conclusions
The CuTe Arithmetic Tuple Tensor is just like a Python generator expression: it generates coordinates on the fly from the arithmetic tuple iterator and the layout, without taking additional storage.