CuTe Blocked and Raked Products

08-07-202508-07-2025 blog 9 minutes read (About 1283 words) visits

Introduction

Given a multi-dimensional layout, sometimes we would like to create a tiled layout that repeats the original layout along each mode. In those cases, we could use the blocked product or raked product to create the tiled layout.

In this article, I would like to discuss the blocked product and raked product in CuTe and how they work.

CuTe Logical Product

Given a tiler and a layout, we could compute how the repeat layout that repeats the tiler to fill the domain and codomain of the layout using logical division. Similarly, given a tiler and a repeat layout, we could compute the resulting layout that is a tile of the repeat layout using logical product.

Let $A$ and $B$ be layouts, $M = \text{size}(A) \text{cosize}(B)$. Suppose that the pairs $\{A, M\}$ and $\{\text{complement}(A, M), B\}$ are admissible (for complementation and composition, respectively). Then we define the logical product $A \times B$ to be the layout

$$
\begin{align}
A \times B := \left(A , \text{complement}(A, M) \circ B\right)
\end{align}
$$

CuTe 1D Logical Product

If the logical product is 1D, i.e., $B$ is a layout and the logical product is not by-mode, the tiling of the layout $A$ up to the cosize of $M = \text{size}(A) \text{cosize}(B)$ is intuitive. The 1D logical product $A \times B$ always exists if $\{A, M\}$ is admissible for complementation.

Here are some examples of 1D logical products:

$$
\begin{align}
(2, 2) : (4, 1) \times 6 : 1 &= ((2, 2), (2, 3)) : ((4, 1), ((2, 8))) \\
(2, 2) : (4, 1) \times 6 : 2 &= ((2, 2), 6) : ((4, 1), 8) \\
(2, 2) : (4, 1) \times (4, 2) : (2, 1) &= ((2, 2), (4, 2)) : ((4, 1), (8, 2)) \\
(2, 2) : (4, 1) \times (4, 2) : (1, 4) &= ((2, 2), ((2, 2), 2)) : ((4, 1), ((2, 8), 16)) \\
\end{align}
$$

Tiling seems to be straightforward for 1D logical products, especially for the tilers that are compact. The user will just have to specify how many repeats of the original layout $A$ in the shape of the tiler $B$ and also specify the stride of the tiler $B$ if necessary. The CuTe layout algebra could automatically figure out the tiled layout.

CuTe 2D Logical Product

If the logical product is 2D, i.e., $A$ is a layout of rank 2, $B$ is a tuple of two layouts and the logical product is by-mode, however, the tiling of the layout $A$ along the first mode and the second mode becomes not intuitive.

Consider the layout $A = (2, 5) : (5, 1)$, which is a compact 2D layout, and we would like to first tile it along the first mode 3 times and then along the second mode 4 times, i.e., the column-major $(3, 4): (1, 3)$, resulting in a tiled layout of shape $((2, 3), (5, 4))$. But what is the stride of the tiled layout? What is the tiler $B$ we should use in the by-mode logical product so that the tiled layout is desired?

It is feasible to compute the stride of the tiled layout. Because the layout $A = (2, 5) : (5, 1)$ is first repeated along the first mode 3 times, the layout becomes $((2, 3), 5) : ((5, 2 \times 5), 1) = ((2, 3), 5) : ((5, 10), 1)$. Then the layout is repeated along the second mode 4 times, resulting in $((2, 3), (5, 4)) : ((5, 10), (1, 3 \times 10)) = ((2, 3), (5, 4)) : ((5, 10), (1, 30))$. Therefore, the stride of the tiled layout is $((5, 10), (1, 30))$.

Although somewhat tricky, it is also feasible to compute the tiler $B$ for the by-mode logical product. Note that the size of the first mode of $A$ is 2, the cosize of the first mode of the tiled layout is $3 \times 10 = 30$, therefore, the cosize of the first mode of the tiler $B$ must be $\frac{30}{2} = 15$, and consequently the first mode of the tiler $B$ is $3: 5$. Similarly, the size of the second mode of $A$ is 5, the cosize of the second mode of the tiled layout is $4 \times 30 = 120$, therefore, the cosize of the second mode of the tiler $B$ must be $\frac{120}{5} = 24$, and consequently the second mode of the tiler $B$ is $4: 6$. Therefore, the tiler $B$ is $\langle 3: 5, 4: 6 \rangle$.

However, we could see that the algebra used for computing the tiled layout is complex and it seems that we would have to know how to the tiled layout to compute the tiler $B$ which is not practical. In addition, when the rank of each layout in the tiler $B$ is greater than 2, the algebra used above does not seem to be applicable. Therefore, we need a more intuitive way to compute the tiled layout using a tiler $B$ in which the user only have to specify the number of repeats of the original layout $A$ to be tiled along each mode.

CuTe Blocked Product

Because CuTe 1D logical product is intuitive, straightforward, and “error-free”, we could take advantage of it to compute the tiled layout of a layout given a tiler.

Continue with the example in the 2D logical product, if the tiler $B = (3, 4) : (1, 3)$, we could tile the layout $A = (2, 5) : (5, 1)$ in an 1D fashion. The tiled layout using 1D logical product is $A \times B = (2, 5) : (5, 1) \times (3, 4) : (1, 3) = ((2, 3), (5, 4)) : ((5, 10), (1, 30))$. The stride of the tiled layout is $((5, 10), (1, 30))$ and the shape of the tiled layout is $((2,5),(3,4))$.

By definition, the first mode of the tiled layout from logical product is the layout of $A$, which is $(2,5):(5,1)$ in our case. The layout of the repeat layout, which is the second mode of the tiled layout, correctly saves not only the shape of the tiler $B$ but also the stride of the tiled layout. This means that the zip of the first mode and second of the tiled layout from CuTe 1D logical product is the tiled layout we wanted in CuTe 2D logical product mentioned previously. We call this the blocked product. This algebra remains intuitive and unchanged even when the rank of the layout $A$ and the tiler $B$ is greater than 2.

In our case, the first mode and second mode of the tiled layout from 1D logical product are $(2,5):(5,1)$ and $(3,4):(10,30)$, respectively. The zip of the two modes is $((2,3),(5,4)) : ((5,10),(1,30))$, which is the tiled layout we wanted. The tiler $B$ we used is column-major $(3, 4): (1, 3)$ layout, which intuitively means that we repeat the layout $A$ first along the first mode 3 times and then along the second mode 4 times.

In fact, this is how CuTe implements the blocked product.

// blocked_product -- Reproduce a block over a tiler.
// Think of every element of "tiler" as a "block"
//   and return the layout of the resulting structure.
// @post rank(@a result) == cute::max(rank(@a block), rank(@a tiler))
template <class TShape, class TStride,
          class UShape, class UStride>
CUTE_HOST_DEVICE constexpr
auto
blocked_product(Layout<TShape,TStride> const& block,
                Layout<UShape,UStride> const& tiler)
{
  constexpr int R = cute::max(rank_v<TShape>, rank_v<UShape>);

  auto result = logical_product(append<R>(block), append<R>(tiler));

  return zip(get<0>(result), get<1>(result));
}

Note that there is an append operation in the blocked_product function that appends dummy mode $1:0$ to the original layout and tiler so that they have the same rank. This is to ensure that the zip operation works correctly because the zip operation requires the two layouts to have the same rank.

CuTe Raked Product

There is also a similar algebra called raked product that is used to reproduce a block over a tiler with block-interleaving. The raked product is similar to the blocked product, but it interleaves the blocks of the tiler instead of stacking them. The only difference between the raked product and the blocked product implementations is the order of the two modes in the zip operation.

// raked_product -- Reproduce a block over a tiler with block-interleaving.
// Think of every element of "tiler" as a "block", interleave those blocks,
//   and return the layout of the resulting structure.
// @post rank(@a result) == cute::max(rank(@a block), rank(@a tiler))
template <class TShape, class TStride,
          class UShape, class UStride>
CUTE_HOST_DEVICE constexpr
auto
raked_product(Layout<TShape,TStride> const& block,
              Layout<UShape,UStride> const& tiler)
{
  constexpr int R = cute::max(rank_v<TShape>, rank_v<UShape>);

  auto result = logical_product(append<R>(block), append<R>(tiler));

  return zip(get<1>(result), get<0>(result));
}

References

CuTe Blocked and Raked Products

https://leimao.github.io/blog/CuTe-Blocked-Raked-Products/

Author

Lei Mao

Posted on

08-07-2025

Updated on

08-07-2025

Licensed under

Mathematics,

Accelerated Computing,

CUDA,

CUTLASS,

CuTe

CuTe Blocked and Raked Products

Introduction

CuTe Logical Product

CuTe 1D Logical Product

CuTe 2D Logical Product

CuTe Blocked Product

CuTe Raked Product

References

Author

Posted on

Updated on

Licensed under

Like this article? Support the author with

Comments

Advertisement

Catalogue