CuTe Tilers
Introduction
Tiler is an important concept that allows the user to design how to access data in a tiled manner using CuTe layout algebra. This is critical for CUDA accelerated computing which involves multiple levels of tiling for data access.
In this blog post, I would like to quickly discuss the concept and usage of tilers in CuTe layout algebra.
CuTe Logical Coordinate
CuTe logical coordinate describes the logical location of a specific data of certain shape using a fixed convention. How the data is actually stored in storage is some details described using CuTe layout, which maps from logical coordinate to index for storage. Even if the logical description of the data is the same, there can be different ways to store the data, hence the layout can vary.
Suppose we have logical coordinates $0$, $1$, $2$, $3$, that maps to the data a
, b
, c
, d
respectively. There can be infinite number of ways, described by layout, to store the data a
, b
, c
, d
in a linear storage. If the layout is $4:1$, the data is stored as abcd
. If the layout is $4:2$, the data is stored as a▢b▢c▢d▢
. If the layout is $(2,2):(2,4)$, the data is also stored as a▢b▢c▢d▢
.
On the contrary, suppose we have the data a
, b
, c
, d
stored in a linear storage as abcd
. The layout $4:1$ maps the logical coordinates $0$, $1$, $2$, $3$ to the data a
, b
, c
, d
respectively. The layout $(2,2):(1,2)$ maps the logical coordinates $0$, $1$, $2$, $3$ to the data a
, b
, c
, d
respectively. The layout $(2,2):(2,1)$, however, maps the logical coordinates $0$, $1$, $2$, $3$ to the data a
, c
, b
, d
respectively.
When we want to access the data via certain way, we will also need to come up with new layouts that changes the mapping from logical coordinate to index for storage. CuTe layout algebra allows us to compute the new layout from the original layout using relative simple descriptions.
CuTe Composition
Suppose I have a 2D matrix whose logical shape is $(32, 16)$ and layout is $A$. Regardless of layout $A$, I would like to come up with a layout $C$ whose shape is $(2, 3)$ that maps the logical coordinate $(0, 0)$, $(1, 0)$, $(0, 1)$, $(1, 1)$, $(0, 2)$, and $(1, 2)$ to the data in $A$ at the logical coordinate $(0, 0)$, $(3, 0)$, $(0, 2)$, $(3, 2)$, $(0, 4)$, and $(3, 4)$. That is also to say, the stride of accessing the data in $A$ is $(3, 2)$. Such layout $C$ can be computed using CuTe composition using the original layout $A$ and a tiler, which is a tuple of layouts that describe the number of data and the stride of the data to be accessed along each mode. In this case, the tiler is $B = \langle 2:3, 3:2 \rangle$ and $C = A \circ B$ for any layout of $A$ that is compatible with the tiler $B$. More complicated mappings from logical coordinate of layout $C$ to the logical coordinate of layout $A$ can also be orchestrated on a per-mode basis.
I can also come up with a different layout $D$ whose shape is also $(2, 3)$ but maps the logical coordinate $(0, 0)$, $(1, 0)$, $(0, 1)$, $(1, 1)$, $(0, 2)$, and $(1, 2)$ to the data in $A$ at the logical coordinate $(0, 0)$, $(1, 0)$, $(0, 1)$, $(1, 1)$, $(0, 2)$, and $(1, 2)$. In this case, the tiler is $B = \langle 2:1, 3:1 \rangle$.
We could see that to create data access patterns, we typically could only focus on creating a tiler $B$ even on a per-mode basis. In fact, the tiler is always designed for each mode without considering about other modes. The actual layout $A$ is not important for correctness as CuTe by-mode composition will figure out the correct mapping from logical coordinate of layout $C$ to the indexing of the data in storage.
CuTe by-mode composition is defined as follows, taking the example of rank size of 2.
$$
\begin{align}
A \circ B
&:= (A_{0}, A_{1}) \circ \langle B_{0}, B_{1} \rangle \\
&:= (A_{0} \circ B_{0}, A_{1} \circ B_{1}) \\
\end{align}
$$
CuTe Complement
CuTe complement computes the way of tiling a layout so that the tiled layout has the size and the cosize of a same specified value.
Taking the previously mentioned example, suppose $A = (A_{0}, A_{1})$ where $\text{size}(A_{0}) = 32$ and $\text{size}(A_{1}) = 16$, we would like to know how to repeat each layout in the tiler $B = \langle B_{0}, B_{1} \rangle = \langle 2:3, 3:2 \rangle$ so that the layout $(B_{0}, B_{0}^{\ast})$, where $\text{size}((B_{0}, B_{0}^{\ast})) = \text{cosize}((B_{0}, B_{0}^{\ast})) = \text{size}(A_{0}) = 32$, maps the logical coordinate $0$, $1$, $\cdots$, $\text{size}(A_{0}) - 1$ to the logical coordinate of layout $A_{0}$, $0$, $1$, $\cdots$, $\text{size}(A_{0}) - 1$, and the layout $(B_{1}, B_{1}^{\ast})$, where $\text{size}((B_{1}, B_{1}^{\ast})) = \text{cosize}((B_{1}, B_{1}^{\ast})) = \text{size}(A_{1}) = 16$, maps the logical coordinate $0$, $1$, $\cdots$, $\text{size}(A_{1}) - 1$ to the logical coordinate of layout $A_{1}$, $0$, $1$, $\cdots$, $\text{size}(A_{1}) - 1$.
In this case, the repeat layout for the first layout and the second layout of the tiler are $B_{0}^{\ast} = \text{complement}(B_{0}, \text{size}(A_{0})) = \text{complement}(2:3, 32)$ and $B_{1}^{\ast} = \text{complement}(B_{1}, \text{size}(A_{1})) = \text{complement}(3:2, 16)$, respectively.
CuTe Logical Division
CuTe logical division is just an algebraic combination of CuTe composition and CuTe complement.
In the previous section, we have already computed layout for coordinate mapping for each mode using a tiled layout, $(B_{0}, B_{0}^{\ast})$ and $(B_{1}, B_{1}^{\ast})$. We could access the data in the matrix by composing $A_{0}$ with $(B_{0}, B_{0}^{\ast})$ and composing $A_{1}$ with $(B_{1}, B_{1}^{\ast})$ and the consequent layout is the concatenation of the two layouts, i.e., $C = (A_{0} \circ (B_{0}, B_{0}^{\ast}), A_{1} \circ (B_{1}, B_{1}^{\ast}))$. The layout $C$ is the new layout that maps the logical coordinate to the index of data in storage. Now the layout $A$ and $C$ are both layouts that maps the same logical coordinate domain to the same data in storage, but the mapping is different.
In fact, CuTe single-mode logical division is defined as follows.
$$
\begin{align}
A / B
&:= A \circ (B, \text{complement}(B, \text{size}(A))) \\
&= (A \circ B, A \circ \text{complement}(B, \text{size}(A))) \\
\end{align}
$$
We could see that the first mode of $A / B$ is $A \circ B$. So $B$ decides what the data mapping of each tile is. That’s also why it is called tiler.
Similarly, CuTe by-mode logical division is defined as follows, taking the example of rank size of 2.
$$
\begin{align}
A / B
&:= (A_{0}, A_{1}) / \langle B_{0}, B_{1} \rangle \\
&= (A_{0} \circ (B_{0}, \text{complement}(B_{0}, \text{size}(A_{0}))), A_{1} \circ (B_{1}, \text{complement}(B_{1}, \text{size}(A_{1})))) \\
&= ((A_{0} \circ B_{0}, A_{0} \circ \text{complement}(B_{0}, \text{size}(A_{0}))), (A_{1} \circ B_{1}, A_{1} \circ \text{complement}(B_{1}, \text{size}(A_{1}))))
\end{align}
$$
It is typical to rearrange the modes in the resulted layout of logical division, so that we could access the data is a certain partitioned manner. For example we could rearrange $((A_{0} \circ B_{0}, A_{0} \circ \text{complement}(B_{0}, \text{size}(A_{0}))), (A_{1} \circ B_{1}, A_{1} \circ \text{complement}(B_{1}, \text{size}(A_{1}))))$ to $((A_{0} \circ B_{0}, A_{1} \circ B_{1}), (A_{0} \circ \text{complement}(B_{0}, \text{size}(A_{0})), A_{1} \circ \text{complement}(B_{1}, \text{size}(A_{1}))))$ and this is called zipped division.
The division operations are useful for partitioning the data or problem into smaller problems based on the thread blocks we have, i.e., inner partition, or partitioning the data for each thread to iterate based on the threads we have, i.e., outer partition.
CuTe Logical Product
Suppose we already have a layout $A$ that specifies how we access certain data in storage. Now we could orchestrate how to tile the layout $A$ so that we could access a larger chunk of data in a tiled manner.
Given a 1D tiler $B$ so that the layout $A$ can be repeated $\text{size}(B)$ times. In addition, the stride of the tiler $B$, which might not be $1$, should also be considered. The stride of the tiler $B$ determines the offset between the two consecutive tiles of $A$ is $\text{size}(A) \cdot \text{stride}(B)$. Therefore, the tiled layout must have a cosize of $M = \text{size}(A) \cdot \text{cosize}(B)$.
CuTe complement allows us to compute the complement layout $\text{complement}(A, M)$, such that the layout $(A, \text{complement}(A, M))$ is a bijection $[0, M) \cong [0, M)$. This will tile the layout $A$ more than we want if $\text{size}(B) \neq \text{cosize}(B)$. Therefore, to accommodate the stride of $B$ and tile the layout $A$ exactly $\text{size}(B)$ times, the repeat layout should be $\text{complement}(A, M) \circ B$.
Consequently, CuTe logical product is defined as follows.
$$
\begin{align}
A \times B := \left(A , \text{complement}(A, M) \circ B\right)
\end{align}
$$
where $B$ is usually a 1D tiler to be useful and less error-prone in practice.
Similar to the logical division, we could also rearrange the modes in the resulted layout of logical product, so that we could access the data in a certain way that we want.
The product operation is usually used for designing the thread-value (tv) layout for the atom operation of copy or mma operations.
References
CuTe Tilers