NVIDIA Tensor Core TN Layout MMA Instruction

Introduction

NVIDIA Tensor Cores are specialized hardware units designed to accelerate matrix operations, particularly matrix multiplications, which are fundamental to many deep learning and high-performance computing applications. Tensor Cores are programmed through Matrix Multiply-Accumulate (MMA) instructions that perform these operations efficiently. However, most of the NVIDIA Tensor Core MMA instructions are optimized for a specific data layout known as the TN layout, in which the input matrix $A$ is stored in row-major order and the input matrix $B$ is stored in column-major order.

In this article, I would like to discuss why TN layout is the most optimized layout for GEMM problems across different hardware architectures, how NVIDIA Tensor Core MMA instructions are designed specifically for TN layout, and how non-TN layout GEMM solutions can be implemented using these TN layout MMA instructions.

GEMM and MMA Layouts

GEMM (General Matrix-Matrix Multiplication) computes the product of an input matrix $A$ of shape $(M, K)$ and an input matrix $B$ of shape $(K, N)$, producing an output matrix $C$ of shape $(M, N)$, where the matrices may be stored in different data layouts. In the context of this article, we ignore the accumulation of matrix $C$ in GEMM problems.
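Concretely, ignoring the accumulation of $C$, the computation is

$$C_{m, n} = \sum_{k = 0}^{K - 1} A_{m, k} B_{k, n}, \quad 0 \leq m < M, \quad 0 \leq n < N$$

and the layout only determines how the entries of $A$, $B$, and $C$ are ordered in memory.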

MMA (Matrix Multiply-Accumulate) instructions are low-level instructions that perform matrix multiplications on small sub-matrices (tiles) of the input matrices $A$ and $B$, producing a sub-matrix (tile) of the output matrix $C$. MMA instructions are often used to accelerate GEMM operations on NVIDIA GPUs when using Tensor Cores.
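To make the tile-level view concrete, below is a minimal sketch using the CUDA WMMA API, a higher-level warp-level wrapper around the Tensor Core MMA instructions (the kernel name and the single-tile setup are my own illustrative assumptions, not code from this article). One warp multiplies a $16 \times 16$ row-major tile of $A$ with a $16 \times 16$ column-major tile of $B$ and stores the $16 \times 16$ tile of $C$ in column-major order, which corresponds to the TN layout.

#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

// One warp computes a single 16x16x16 tile: C_tile = A_tile * B_tile.
// A_tile is row-major, B_tile is column-major, C_tile is column-major (TN).
__global__ void wmma_tn_tile_example(half const* A_tile, half const* B_tile,
                                     half* C_tile)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, half> c_frag;

    wmma::fill_fragment(c_frag, __float2half(0.0f));
    wmma::load_matrix_sync(a_frag, A_tile, 16); // leading dimension of A_tile
    wmma::load_matrix_sync(b_frag, B_tile, 16); // leading dimension of B_tile
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    wmma::store_matrix_sync(C_tile, c_frag, 16, wmma::mem_col_major);
}

The kernel is expected to be launched with a single warp of 32 threads, e.g., wmma_tn_tile_example<<<1, 32>>>(...), and requires SM70 or newer.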

There are four different layouts for GEMM problems as well as for MMA instructions:

  1. TN: Matrix $A$ is stored in row-major and matrix $B$ is stored in column-major.
  2. NT: Matrix $A$ is stored in column-major and matrix $B$ is stored in row-major.
  3. NN: Both matrix $A$ and matrix $B$ are stored in column-major.
  4. TT: Both matrix $A$ and matrix $B$ are stored in row-major.

The resulting matrix $C$ is always stored in column-major format. If the user wants to store the resulting matrix $C$ in row-major format, there are ways to translate the GEMM problem into one of the four layouts described above. Please refer to my previous article “cuBLAS GEMM API Usages for Column-Major and Row-Major Matrices” for more details.
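For example, because a row-major matrix occupies exactly the same memory as its transpose stored in column-major format, a GEMM that needs a row-major $C$ can instead compute

$$C^{\top} = B^{\top} A^{\top}$$

with the roles of the two input matrices swapped and their layout interpretations flipped, which reduces the problem to one of the four column-major-output layouts above.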

MMA Instruction Performances of Different Layouts

Most of the NVIDIA Tensor Core MMA instructions, especially those on newer GPU architectures, such as SM80, SM90, SM100, and SM120, are specific to TN layout. There are no direct MMA instructions for other layouts such as NT, NN, and TT for those GPU architectures.

The only platform, as far as I know, that provides MMA instructions for all four layouts is the SM70 (Volta) architecture. Below are the benchmark results of different SM70 FP16 MMA instructions using CuTe MMA Benchmark on an NVIDIA RTX 5080 GPU (SM120).

MMA Instruction               Accumulation Datatype    Throughput (TOPS)
SM70_8x8x4_F16F16F16F16_TN    FP16                     95.499
SM70_8x8x4_F16F16F16F16_NT    FP16                     13.971
SM70_8x8x4_F16F16F16F16_NN    FP16                     30.857
SM70_8x8x4_F16F16F16F16_TT    FP16                     19.596
SM70_8x8x4_F32F16F16F32_TN    FP32                     107.138
SM70_8x8x4_F32F16F16F32_NT    FP32                     15.362
SM70_8x8x4_F32F16F16F32_NN    FP32                     28.888
SM70_8x8x4_F32F16F16F32_TT    FP32                     28.578

We can see that among all four layouts of MMA instructions in the same “family”, the TN layout MMA instruction has the highest throughput, about 3x to 7x faster than the other layouts on the SM120 architecture. It should be noted that, according to the NVIDIA Parallel Thread Execution ISA, mma.sync.m8n8k4, the MMA instruction behind these CuTe MMA Atom wrappers, is optimized for the SM70 target architecture and may have substantially reduced performance on other target architectures. Because I no longer have access to SM70 GPUs, I could not verify the true performance of the SM70 MMA instructions on the SM70 architecture.

Why Is TN Layout Chosen as the Most Optimized Layout for GEMM Problems?

The next question is: why is the TN layout chosen as the most optimized, and most of the time the only supported, MMA instruction layout on NVIDIA GPU architectures?

In fact, it is not only NVIDIA GPU architectures that choose to optimize TN layout GEMM problems. Even before deep learning became a thing, GEMM was widely used in high-performance computing (HPC) applications, and the TN layout has often been the preferred layout for matrix multiplications in HPC as well. This is because the TN layout allows for better memory access patterns and cache utilization, leading to improved performance: with $A$ row-major and $B$ column-major, the innermost loop over the $K$ dimension reads both operands with unit stride, whereas the other layouts force strided access to at least one operand. Even if we are just using a naive single-threaded implementation on CPU without SIMD instructions, the TN layout can still outperform the other layouts due to better cache locality.

benchmark_matmul_layouts.cpp
#include <chrono>
#include <cstddef>
#include <iomanip>
#include <iostream>
#include <random>
#include <string>
#include <vector>

void fill_random(std::vector<int>& mat)
{
    std::mt19937 gen(42);
    std::uniform_int_distribution<int> dist(-100, 100);
    for (auto& x : mat)
    {
        x = dist(gen);
    }
}

// TN: A row-major, B col-major, C col-major
void matmul_TN(std::vector<int> const& A, std::vector<int> const& B,
               std::vector<int>& C, size_t M, size_t N, size_t K)
{
    for (size_t m{0}; m < M; ++m)
    {
        for (size_t n{0}; n < N; ++n)
        {
            long long sum{0};
            for (size_t k{0}; k < K; ++k)
            {
                sum += static_cast<long long>(A[m * K + k]) *
                       B[n * K + k]; // A row-major, B col-major
            }
            C[n * M + m] = static_cast<int>(sum); // C col-major
        }
    }
}

// NT: A col-major, B row-major, C col-major
void matmul_NT(std::vector<int> const& A, std::vector<int> const& B,
               std::vector<int>& C, size_t M, size_t N, size_t K)
{
    for (size_t m{0}; m < M; ++m)
    {
        for (size_t n{0}; n < N; ++n)
        {
            long long sum{0};
            for (size_t k{0}; k < K; ++k)
            {
                sum += static_cast<long long>(A[k * M + m]) *
                       B[k * N + n]; // A col-major, B row-major
            }
            C[n * M + m] = static_cast<int>(sum); // C col-major
        }
    }
}

// NN: A col-major, B col-major, C col-major
void matmul_NN(std::vector<int> const& A, std::vector<int> const& B,
               std::vector<int>& C, size_t M, size_t N, size_t K)
{
    for (size_t m{0}; m < M; ++m)
    {
        for (size_t n{0}; n < N; ++n)
        {
            long long sum{0};
            for (size_t k{0}; k < K; ++k)
            {
                sum += static_cast<long long>(A[k * M + m]) *
                       B[n * K + k]; // A col-major, B col-major
            }
            C[n * M + m] = static_cast<int>(sum); // C col-major
        }
    }
}

// TT: A row-major, B row-major, C col-major
void matmul_TT(std::vector<int> const& A, std::vector<int> const& B,
               std::vector<int>& C, size_t M, size_t N, size_t K)
{
    for (size_t m{0}; m < M; ++m)
    {
        for (size_t n{0}; n < N; ++n)
        {
            long long sum{0};
            for (size_t k{0}; k < K; ++k)
            {
                sum += static_cast<long long>(A[m * K + k]) *
                       B[k * N + n]; // A row-major, B row-major
            }
            C[n * M + m] = static_cast<int>(sum); // C col-major
        }
    }
}

typedef void (*matmul_func_t)(std::vector<int> const&, std::vector<int> const&,
                              std::vector<int>&, size_t, size_t, size_t);

void benchmark(std::string const& name, matmul_func_t matmul,
               std::vector<int> const& A, std::vector<int> const& B,
               std::vector<int>& C, size_t M, size_t N, size_t K,
               size_t warmup_runs, size_t measure_runs)
{
    // Warm-up to stabilize caches and branch predictors
    for (size_t i{0}; i < warmup_runs; ++i)
    {
        matmul(A, B, C, M, N, K);
    }

    // Measure multiple times and take the average
    double total_ms{0.0};
    for (size_t i{0}; i < measure_runs; ++i)
    {
        auto start = std::chrono::high_resolution_clock::now();
        matmul(A, B, C, M, N, K);
        auto end = std::chrono::high_resolution_clock::now();
        double ms =
            std::chrono::duration<double, std::milli>(end - start).count();
        total_ms += ms;
    }
    double avg_ms{total_ms / static_cast<double>(measure_runs)};
    std::cout << std::setw(2) << name << ": " << avg_ms << " ms" << " (avg of "
              << measure_runs << " runs, " << warmup_runs << " warmup)"
              << std::endl;
}

int main()
{
    size_t const M{1024};
    size_t const N{1024};
    size_t const K{1024};
    std::vector<int> A_row(M * K), A_col(M * K);
    std::vector<int> B_row(K * N), B_col(K * N);
    std::vector<int> C(M * N);

    size_t const warmup_runs{2};
    size_t const measure_runs{5};

    // Fill A and B in both layouts
    fill_random(A_row);
    fill_random(B_row);

    // Convert A_row to col-major
    for (size_t m{0}; m < M; ++m)
    {
        for (size_t k{0}; k < K; ++k)
        {
            A_col[k * M + m] = A_row[m * K + k];
        }
    }
    // Convert B_row to col-major
    for (size_t k{0}; k < K; ++k)
    {
        for (size_t n{0}; n < N; ++n)
        {
            B_col[n * K + k] = B_row[k * N + n];
        }
    }

    std::cout << "Benchmarking " << M << "x" << N << "x" << K
              << " matrix multiplication (single-threaded, CPU)" << std::endl;
    benchmark("TN", matmul_TN, A_row, B_col, C, M, N, K, warmup_runs,
              measure_runs);
    benchmark("NT", matmul_NT, A_col, B_row, C, M, N, K, warmup_runs,
              measure_runs);
    benchmark("NN", matmul_NN, A_col, B_col, C, M, N, K, warmup_runs,
              measure_runs);
    benchmark("TT", matmul_TT, A_row, B_row, C, M, N, K, warmup_runs,
              measure_runs);

    return 0;
}

The benchmark results below, collected on an Intel Core i9-9900K CPU, show that even with a naive single-threaded implementation, the TN layout still outperforms the other layouts by a large margin, especially when compiler optimizations are enabled with the -O1, -O2, or -O3 flags.

$ g++ benchmark_matmul_layouts.cpp -o benchmark_matmul_layouts -O0
$ ./benchmark_matmul_layouts
Benchmarking 1024x1024x1024 matrix multiplication (single-threaded, CPU)
TN: 3896.94 ms
NT: 7866.73 ms
NN: 5844.9 ms
TT: 6077.64 ms
$ g++ benchmark_matmul_layouts.cpp -o benchmark_matmul_layouts -O1
$ ./benchmark_matmul_layouts
Benchmarking 1024x1024x1024 matrix multiplication (single-threaded, CPU)
TN: 377.64 ms
NT: 6355.77 ms
NN: 1433.67 ms
TT: 1362.49 ms
$ g++ benchmark_matmul_layouts.cpp -o benchmark_matmul_layouts -O2
$ ./benchmark_matmul_layouts
Benchmarking 1024x1024x1024 matrix multiplication (single-threaded, CPU)
TN: 455.691 ms
NT: 6346.08 ms
NN: 1391.93 ms
TT: 1318.56 ms
$ g++ benchmark_matmul_layouts.cpp -o benchmark_matmul_layouts -O3
$ ./benchmark_matmul_layouts
Benchmarking 1024x1024x1024 matrix multiplication (single-threaded, CPU)
TN: 455.647 ms
NT: 6503.69 ms
NN: 1500.05 ms
TT: 1358.68 ms

Therefore, the TN layout is generally the most important layout to optimize for GEMM problems across different hardware architectures. Whenever we invent a new hardware architecture or a new algorithm for GEMM problems, we should always consider optimizing the TN layout first to compete with existing solutions.

NVIDIA Tensor Core TN Layout MMA Instructions

The following figure shows an example of the NVIDIA Tensor Core TN layout MMA instruction SM80_16x8x16_F16F16F16F16_TN, which performs a matrix multiplication of an input matrix $A$ of shape $(16, 16)$ in row-major layout and an input matrix $B$ of shape $(16, 8)$ in column-major layout, resulting in an output matrix $C$ of shape $(16, 8)$ in column-major layout.

SM80_16x8x16_F16F16F16F16_TN MMA Atom
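For reference, the PTX instruction behind this MMA Atom is mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16, where row.col corresponds exactly to the TN layout. Below is a minimal inline-PTX sketch of how it can be invoked from CUDA C++; the fragment register arrays are assumed to already hold each thread's portion of the tiles in the register layout prescribed by the NVIDIA Parallel Thread Execution ISA, and the helper function name is my own.

#include <cstdint>

// Minimal sketch: one warp executes D = A * B + C for the 16x8x16 FP16 tile.
// Each thread supplies 4 registers for A (8 halves), 2 for B (4 halves),
// and 2 each for C and D (4 halves), with 2 halves packed per register.
__device__ void mma_m16n8k16_f16_tn(uint32_t (&d_frag)[2],
                                    uint32_t const (&a_frag)[4],
                                    uint32_t const (&b_frag)[2],
                                    uint32_t const (&c_frag)[2])
{
    asm volatile(
        "mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 "
        "{%0, %1}, {%2, %3, %4, %5}, {%6, %7}, {%8, %9};\n"
        : "=r"(d_frag[0]), "=r"(d_frag[1])
        : "r"(a_frag[0]), "r"(a_frag[1]), "r"(a_frag[2]), "r"(a_frag[3]),
          "r"(b_frag[0]), "r"(b_frag[1]), "r"(c_frag[0]), "r"(c_frag[1]));
}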

Typically, the data is stored in shared memory before being loaded into registers for the MMA instruction. We can see that multiple low-bitwidth values along the $K$ dimension, stored as row-major for matrix $A$ and column-major for matrix $B$, can be loaded into one 32-bit register by each thread in the warp (via the ldmatrix instruction). Shared memory bank conflicts are usually eliminated by swizzling the data in shared memory before loading it into registers.
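As an illustration of such a load, here is a minimal sketch of the non-transposed ldmatrix instruction issued via inline PTX (a sketch under my own simplifying assumptions: the four 8x8 FP16 tiles are stored consecutively in shared memory without swizzling, and the helper function name is hypothetical). Each of the 32 threads in the warp supplies the shared memory address of one 8-element row, and each thread receives four 32-bit registers holding two FP16 values each.

#include <cstdint>
#include <cuda_fp16.h>

// Minimal ldmatrix sketch: a warp loads four 8x8 FP16 tiles, stored
// consecutively and unswizzled in shared memory, into 4 registers per thread.
__device__ void load_four_8x8_tiles(__half const* smem_tiles, uint32_t (&r)[4])
{
    // Threads 0-7 address the rows of the first tile, threads 8-15 the
    // second tile, and so on.
    __half const* row_ptr = smem_tiles + (threadIdx.x % 32) * 8;
    uint32_t const smem_addr =
        static_cast<uint32_t>(__cvta_generic_to_shared(row_ptr));
    asm volatile(
        "ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%0, %1, %2, %3}, [%4];\n"
        : "=r"(r[0]), "=r"(r[1]), "=r"(r[2]), "=r"(r[3])
        : "r"(smem_addr));
}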

Implementing Non-TN Layout GEMM Using NVIDIA Tensor Core TN Layout MMA Instructions

The next question is, given that there are no non-TN layout MMA instructions on most NVIDIA GPU architectures, how could we implement NT, NN, and TT layout GEMM solutions? Will they still be efficient?

Suppose we want to implement an NN layout GEMM solution using NVIDIA Tensor Core TN layout MMA instructions. This means the input matrix $A$ has to be transposed conceptually from column-major to row-major layout. We could think of two possible ways to achieve this:

  1. Physically transpose the input matrix $A$ from column-major to row-major layout in shared memory before loading it into registers for the MMA instruction. This is typically done when loading data from global memory to shared memory.
  2. Logically transpose the input matrix $A$ by swapping the thread mapping and the value mapping when loading data from shared memory to registers for the MMA instruction. This is typically done when loading data from shared memory to registers.

The first approach is relatively straightforward and it is MMA-instruction agnostic. However, it may introduce additional overhead due to shared memory bank conflicts. With proper techniques to avoid shared memory bank conflicts, such as swizzling, the overhead could be minimized to almost zero for most GEMM problems.
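Here is a minimal, unoptimized sketch of the first approach (the tile sizes, the absence of swizzling, and the function name are my own illustrative assumptions): each thread block reads a tile of the column-major matrix $A$ from global memory and writes it into shared memory in row-major order, so the downstream TN layout MMA pipeline can consume it unchanged.

#include <cuda_fp16.h>

// Minimal sketch: load a TILE_M x TILE_K tile of a column-major matrix A
// (leading dimension lda = M) from global memory and store it row-major
// in shared memory. No swizzling is applied, so bank conflicts may occur.
template <int TILE_M, int TILE_K>
__device__ void load_a_tile_col_major_to_row_major(
    __half const* A, int lda, int tile_m0, int tile_k0,
    __half (&A_smem)[TILE_M][TILE_K])
{
    for (int i = threadIdx.x; i < TILE_M * TILE_K; i += blockDim.x)
    {
        int const m = i / TILE_K;
        int const k = i % TILE_K;
        // Column-major global memory read, row-major shared memory write.
        A_smem[m][k] = A[(tile_k0 + k) * lda + (tile_m0 + m)];
    }
    __syncthreads();
}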

The second approach is relatively MMA-instruction dependent. Taking the TN layout MMA instruction SM80_16x8x16_F16F16F16F16_TN as an example, the thread $T0$ has to load two FP16 values from the logical coordinates $(0, 0)$ and $(0, 1)$ of matrix $A$ in row-major layout. If the matrix $A$ is stored in column-major layout, the two FP16 values are actually located at the physical coordinates $(0, 0)$ and $(1, 0)$. There are utility instructions that help load data from shared memory to registers with logical transposition, such as the ldmatrix instruction and its transposed variant.
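Compared with the earlier ldmatrix sketch, the only difference is the .trans qualifier, which transposes each 8x8 tile on the fly as it is loaded from shared memory into registers, so a column-major tile lands in registers as if it were row-major (again a sketch under the same unswizzled layout assumptions, with a hypothetical helper name).

#include <cstdint>
#include <cuda_fp16.h>

// Minimal sketch of the transposed ldmatrix variant: each 8x8 FP16 tile is
// transposed while being loaded from shared memory into registers.
__device__ void load_four_8x8_tiles_transposed(__half const* smem_tiles,
                                               uint32_t (&r)[4])
{
    __half const* row_ptr = smem_tiles + (threadIdx.x % 32) * 8;
    uint32_t const smem_addr =
        static_cast<uint32_t>(__cvta_generic_to_shared(row_ptr));
    asm volatile(
        "ldmatrix.sync.aligned.m8n8.x4.trans.shared.b16 "
        "{%0, %1, %2, %3}, [%4];\n"
        : "=r"(r[0]), "=r"(r[1]), "=r"(r[2]), "=r"(r[3])
        : "r"(smem_addr));
}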

Conclusions

In this article, we discussed why the TN layout is the most optimized layout for GEMM problems across different hardware architectures. We also explored how NVIDIA Tensor Core MMA instructions are designed specifically for the TN layout and how non-TN layout GEMM solutions can be implemented efficiently using these TN layout MMA instructions. Because of this, NVIDIA does not have to support all four layouts for MMA instructions on its GPUs.
