NVIDIA Tensor Core TN Layout MMA Instruction

Introduction

NVIDIA Tensor Cores are specialized hardware units designed to accelerate matrix operations, particularly matrix multiplications, which are fundamental to many deep learning and high-performance computing applications. Tensor Cores are programmed through Matrix Multiply-Accumulate (MMA) instructions that perform these operations efficiently. However, most of the NVIDIA Tensor Core MMA instructions are optimized for a specific data layout known as the TN layout, in which the input matrix $A$ is stored in row-major order and the input matrix $B$ is stored in column-major order.

In this article, I would like to discuss why TN layout is the most optimized layout for GEMM problems across different hardware architectures, how NVIDIA Tensor Core MMA instructions are designed specifically for TN layout, and how non-TN layout GEMM solutions can be implemented using these TN layout MMA instructions.

GEMM and MMA Layouts

GEMM (General Matrix-Matrix Multiplication) computes the product of an input matrix $A$ of shape $(M, K)$ and an input matrix $B$ of shape $(K, N)$, producing an output matrix $C$ of shape $(M, N)$, where the matrices may be stored in different data layouts. In the context of this article, we ignore the accumulation of matrix $C$ in GEMM problems.
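Concretely, ignoring the accumulation of $C$, the computation is

$$C_{m, n} = \sum_{k = 0}^{K - 1} A_{m, k} B_{k, n}, \quad 0 \leq m < M, \quad 0 \leq n < N$$

and the layout only determines how the entries of $A$, $B$, and $C$ are ordered in memory.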

MMA (Matrix Multiply-Accumulate) instructions are low-level instructions that perform matrix multiplications on small sub-matrices (tiles) of the input matrices $A$ and $B$, producing a sub-matrix (tile) of the output matrix $C$. MMA instructions are often used to accelerate GEMM operations on NVIDIA GPUs when using Tensor Cores.
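To make the tile-level view concrete, below is a minimal sketch using the CUDA WMMA API, a higher-level warp-level wrapper around the Tensor Core MMA instructions (the kernel name and the single-tile setup are my own illustrative assumptions, not code from this article). One warp multiplies a $16 \times 16$ row-major tile of $A$ with a $16 \times 16$ column-major tile of $B$ and stores the $16 \times 16$ tile of $C$ in column-major order, which corresponds to the TN layout.

#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

// One warp computes a single 16x16x16 tile: C_tile = A_tile * B_tile.
// A_tile is row-major, B_tile is column-major, C_tile is column-major (TN).
__global__ void wmma_tn_tile_example(half const* A_tile, half const* B_tile,
                                     half* C_tile)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, half> c_frag;

    wmma::fill_fragment(c_frag, __float2half(0.0f));
    wmma::load_matrix_sync(a_frag, A_tile, 16); // leading dimension of A_tile
    wmma::load_matrix_sync(b_frag, B_tile, 16); // leading dimension of B_tile
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    wmma::store_matrix_sync(C_tile, c_frag, 16, wmma::mem_col_major);
}

The kernel is expected to be launched with a single warp of 32 threads, e.g., wmma_tn_tile_example<<<1, 32>>>(...), and requires SM70 or newer.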

There are four different layouts for GEMM problems as well as for MMA instructions:

  1. TN: Matrix $A$ is stored in row-major and matrix $B$ is stored in column-major.
  2. NT: Matrix $A$ is stored in column-major and matrix $B$ is stored in row-major.
  3. NN: Both matrix $A$ and matrix $B$ are stored in column-major.
  4. TT: Both matrix $A$ and matrix $B$ are stored in row-major.

The resulting matrix $C$ is always stored in column-major format. If the user wants to store the resulting matrix $C$ in row-major format, there are ways to translate the GEMM problem into one of the four layouts described above. Please refer to my previous article “cuBLAS GEMM API Usages for Column-Major and Row-Major Matrices” for more details.
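For example, because a row-major matrix occupies exactly the same memory as its transpose stored in column-major format, a GEMM that needs a row-major $C$ can instead compute

$$C^{\top} = B^{\top} A^{\top}$$

with the roles of the two input matrices swapped and their layout interpretations flipped, which reduces the problem to one of the four column-major-output layouts above.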

MMA Instruction Performances of Different Layouts

Most of the NVIDIA Tensor Core MMA instructions, especially those on newer GPU architectures, such as SM80, SM90, SM100, and SM120, are specific to TN layout. There are no direct MMA instructions for other layouts such as NT, NN, and TT for those GPU architectures.

The only platform, as far as I know, that provides MMA instructions for all four layouts is the SM70 (Volta) architecture. Below are the benchmark results of different SM70 FP16 MMA instructions using CuTe MMA Benchmark on an NVIDIA RTX 5080 GPU (SM120).

MMA Instruction               Accumulation Datatype    Throughput (TOPS)
SM70_8x8x4_F16F16F16F16_TN    FP16                     95.499
SM70_8x8x4_F16F16F16F16_NT    FP16                     13.971
SM70_8x8x4_F16F16F16F16_NN    FP16                     30.857
SM70_8x8x4_F16F16F16F16_TT    FP16                     19.596
SM70_8x8x4_F32F16F16F32_TN    FP32                     107.138
SM70_8x8x4_F32F16F16F32_NT    FP32                     15.362
SM70_8x8x4_F32F16F16F32_NN    FP32                     28.888
SM70_8x8x4_F32F16F16F32_TT    FP32                     28.578

We can see that among all four layouts of MMA instructions in the same “family”, the TN layout MMA instruction has the highest throughput, about 3x to 7x faster than the other layouts on the SM120 architecture. It should be noted that, according to the NVIDIA Parallel Thread Execution ISA, mma.sync.m8n8k4, the MMA instruction behind these CuTe MMA Atom wrappers, is optimized for the SM70 target architecture and may have substantially reduced performance on other target architectures. Because I no longer have access to SM70 GPUs, I could not verify the true performance of the SM70 MMA instructions on the SM70 architecture.

Why Is TN Layout Chosen as the Most Optimized Layout for GEMM Problems?

The next question is: why is the TN layout chosen as the most optimized, and most of the time the only supported, MMA instruction layout on NVIDIA GPU architectures?

In fact, it is not only NVIDIA GPU architectures that choose to optimize TN layout GEMM problems. Even before deep learning became a thing, GEMM was widely used in high-performance computing (HPC) applications, and the TN layout has often been the preferred layout for matrix multiplications in HPC as well. This is because the TN layout allows for better memory access patterns and cache utilization, leading to improved performance: with $A$ row-major and $B$ column-major, the innermost loop over the $K$ dimension reads both operands with unit stride, whereas the other layouts force strided access to at least one operand. Even if we are just using a naive single-threaded implementation on CPU without SIMD instructions, the TN layout can still outperform the other layouts due to better cache locality.

benchmark_matmul_layouts.cpp
#include <chrono>
#include <cstddef>
#include <iomanip>
#include <iostream>
#include <random>
#include <string>
#include <vector>

void fill_random(std::vector<int>& mat)
{
    std::mt19937 gen(42);
    std::uniform_int_distribution<int> dist(-100, 100);
    for (auto& x : mat)
    {
        x = dist(gen);
    }
}

// TN: A row-major, B col-major, C col-major
void matmul_TN(std::vector<int> const& A, std::vector<int> const& B,
               std::vector<int>& C, size_t M, size_t N, size_t K)
{
    for (size_t m{0}; m < M; ++m)
    {
        for (size_t n{0}; n < N; ++n)
        {
            long long sum{0};
            for (size_t k{0}; k < K; ++k)
            {
                sum += static_cast<long long>(A[m * K + k]) *
                       B[n * K + k]; // A row-major, B col-major
            }
            C[n * M + m] = static_cast<int>(sum); // C col-major
        }
    }
}

// NT: A col-major, B row-major, C col-major
void matmul_NT(std::vector<int> const& A, std::vector<int> const& B,
               std::vector<int>& C, size_t M, size_t N, size_t K)
{
    for (size_t m{0}; m < M; ++m)
    {
        for (size_t n{0}; n < N; ++n)
        {
            long long sum{0};
            for (size_t k{0}; k < K; ++k)
            {
                sum += static_cast<long long>(A[k * M + m]) *
                       B[k * N + n]; // A col-major, B row-major
            }
            C[n * M + m] = static_cast<int>(sum); // C col-major
        }
    }
}

// NN: A col-major, B col-major, C col-major
void matmul_NN(std::vector<int> const& A, std::vector<int> const& B,
               std::vector<int>& C, size_t M, size_t N, size_t K)
{
    for (size_t m{0}; m < M; ++m)
    {
        for (size_t n{0}; n < N; ++n)
        {
            long long sum{0};
            for (size_t k{0}; k < K; ++k)
            {
                sum += static_cast<long long>(A[k * M + m]) *
                       B[n * K + k]; // A col-major, B col-major
            }
            C[n * M + m] = static_cast<int>(sum); // C col-major
        }
    }
}

// TT: A row-major, B row-major, C col-major
void matmul_TT(std::vector<int> const& A, std::vector<int> const& B,
               std::vector<int>& C, size_t M, size_t N, size_t K)
{
    for (size_t m{0}; m < M; ++m)
    {
        for (size_t n{0}; n < N; ++n)
        {
            long long sum{0};
            for (size_t k{0}; k < K; ++k)
            {
                sum += static_cast<long long>(A[m * K + k]) *
                       B[k * N + n]; // A row-major, B row-major
            }
            C[n * M + m] = static_cast<int>(sum); // C col-major
        }
    }
}

typedef void (*matmul_func_t)(std::vector<int> const&, std::vector<int> const&,
                              std::vector<int>&, size_t, size_t, size_t);

void benchmark(std::string const& name, matmul_func_t matmul,
               std::vector<int> const& A, std::vector<int> const& B,
               std::vector<int>& C, size_t M, size_t N, size_t K,
               size_t warmup_runs, size_t measure_runs)
{
    // Warm-up to stabilize caches and branch predictors
    for (size_t i{0}; i < warmup_runs; ++i)
    {
        matmul(A, B, C, M, N, K);
    }

    // Measure multiple times and take the average
    double total_ms{0.0};
    for (size_t i{0}; i < measure_runs; ++i)
    {
        auto start = std::chrono::high_resolution_clock::now();
        matmul(A, B, C, M, N, K);
        auto end = std::chrono::high_resolution_clock::now();
        double ms =
            std::chrono::duration<double, std::milli>(end - start).count();
        total_ms += ms;
    }
    double avg_ms{total_ms / static_cast<double>(measure_runs)};
    std::cout << std::setw(2) << name << ": " << avg_ms << " ms" << " (avg of "
              << measure_runs << " runs, " << warmup_runs << " warmup)"
              << std::endl;
}

int main()
{
    size_t const M{1024};
    size_t const N{1024};
    size_t const K{1024};
    std::vector<int> A_row(M * K), A_col(M * K);
    std::vector<int> B_row(K * N), B_col(K * N);
    std::vector<int> C(M * N);

    size_t const warmup_runs{2};
    size_t const measure_runs{5};

    // Fill A and B in both layouts
    fill_random(A_row);
    fill_random(B_row);

    // Convert A_row to col-major
    for (size_t m{0}; m < M; ++m)
    {
        for (size_t k{0}; k < K; ++k)
        {
            A_col[k * M + m] = A_row[m * K + k];
        }
    }
    // Convert B_row to col-major
    for (size_t k{0}; k < K; ++k)
    {
        for (size_t n{0}; n < N; ++n)
        {
            B_col[n * K + k] = B_row[k * N + n];
        }
    }

    std::cout << "Benchmarking " << M << "x" << N << "x" << K
              << " matrix multiplication (single-threaded, CPU)" << std::endl;
    benchmark("TN", matmul_TN, A_row, B_col, C, M, N, K, warmup_runs,
              measure_runs);
    benchmark("NT", matmul_NT, A_col, B_row, C, M, N, K, warmup_runs,
              measure_runs);
    benchmark("NN", matmul_NN, A_col, B_col, C, M, N, K, warmup_runs,
              measure_runs);
    benchmark("TT", matmul_TT, A_row, B_row, C, M, N, K, warmup_runs,
              measure_runs);

    return 0;
}

The benchmark results below, collected on an Intel Core i9-9900K CPU, show that even with a naive single-threaded implementation, the TN layout still outperforms the other layouts by a large margin, especially when compiler optimizations are enabled with the -O1, -O2, or -O3 flags.

$ g++ benchmark_matmul_layouts.cpp -o benchmark_matmul_layouts -O0
$ ./benchmark_matmul_layouts
Benchmarking 1024x1024x1024 matrix multiplication (single-threaded, CPU)
TN: 3896.94 ms
NT: 7866.73 ms
NN: 5844.9 ms
TT: 6077.64 ms
$ g++ benchmark_matmul_layouts.cpp -o benchmark_matmul_layouts -O1
$ ./benchmark_matmul_layouts
Benchmarking 1024x1024x1024 matrix multiplication (single-threaded, CPU)
TN: 377.64 ms
NT: 6355.77 ms
NN: 1433.67 ms
TT: 1362.49 ms
$ g++ benchmark_matmul_layouts.cpp -o benchmark_matmul_layouts -O2
$ ./benchmark_matmul_layouts
Benchmarking 1024x1024x1024 matrix multiplication (single-threaded, CPU)
TN: 455.691 ms
NT: 6346.08 ms
NN: 1391.93 ms
TT: 1318.56 ms
$ g++ benchmark_matmul_layouts.cpp -o benchmark_matmul_layouts -O3
$ ./benchmark_matmul_layouts
Benchmarking 1024x1024x1024 matrix multiplication (single-threaded, CPU)
TN: 455.647 ms
NT: 6503.69 ms
NN: 1500.05 ms
TT: 1358.68 ms

Therefore, the TN layout is generally the most important layout to optimize for GEMM problems across different hardware architectures. Whenever we invent a new hardware architecture or a new algorithm for GEMM problems, we should always consider optimizing the TN layout first to compete with existing solutions.

NVIDIA Tensor Core TN Layout MMA Instructions

The following figure shows an example of the NVIDIA Tensor Core TN layout MMA instruction SM80_16x8x16_F16F16F16F16_TN, which performs a matrix multiplication of an input matrix $A$ of shape $(16, 16)$ in row-major layout and an input matrix $B$ of shape $(16, 8)$ in column-major layout, resulting in an output matrix $C$ of shape $(16, 8)$ in column-major layout.

SM80_16x8x16_F16F16F16F16_TN MMA Atom
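For reference, the PTX instruction behind this MMA Atom is mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16, where row.col corresponds exactly to the TN layout. Below is a minimal inline-PTX sketch of how it can be invoked from CUDA C++; the fragment register arrays are assumed to already hold each thread's portion of the tiles in the register layout prescribed by the NVIDIA Parallel Thread Execution ISA, and the helper function name is my own.

#include <cstdint>

// Minimal sketch: one warp executes D = A * B + C for the 16x8x16 FP16 tile.
// Each thread supplies 4 registers for A (8 halves), 2 for B (4 halves),
// and 2 each for C and D (4 halves), with 2 halves packed per register.
__device__ void mma_m16n8k16_f16_tn(uint32_t (&d_frag)[2],
                                    uint32_t const (&a_frag)[4],
                                    uint32_t const (&b_frag)[2],
                                    uint32_t const (&c_frag)[2])
{
    asm volatile(
        "mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 "
        "{%0, %1}, {%2, %3, %4, %5}, {%6, %7}, {%8, %9};\n"
        : "=r"(d_frag[0]), "=r"(d_frag[1])
        : "r"(a_frag[0]), "r"(a_frag[1]), "r"(a_frag[2]), "r"(a_frag[3]),
          "r"(b_frag[0]), "r"(b_frag[1]), "r"(c_frag[0]), "r"(c_frag[1]));
}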

Typically, the data is stored in shared memory before being loaded into registers for the MMA instruction. We can see that multiple low-bitwidth values along the $K$ dimension, stored as row-major for matrix $A$ and column-major for matrix $B$, can be loaded into one 32-bit register by each thread in the warp (via the ldmatrix instruction). Shared memory bank conflicts are usually eliminated by swizzling the data in shared memory before loading it into registers.
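As an illustration of such a load, here is a minimal sketch of the non-transposed ldmatrix instruction issued via inline PTX (a sketch under my own simplifying assumptions: the four 8x8 FP16 tiles are stored consecutively in shared memory without swizzling, and the helper function name is hypothetical). Each of the 32 threads in the warp supplies the shared memory address of one 8-element row, and each thread receives four 32-bit registers holding two FP16 values each.

#include <cstdint>
#include <cuda_fp16.h>

// Minimal ldmatrix sketch: a warp loads four 8x8 FP16 tiles, stored
// consecutively and unswizzled in shared memory, into 4 registers per thread.
__device__ void load_four_8x8_tiles(__half const* smem_tiles, uint32_t (&r)[4])
{
    // Threads 0-7 address the rows of the first tile, threads 8-15 the
    // second tile, and so on.
    __half const* row_ptr = smem_tiles + (threadIdx.x % 32) * 8;
    uint32_t const smem_addr =
        static_cast<uint32_t>(__cvta_generic_to_shared(row_ptr));
    asm volatile(
        "ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%0, %1, %2, %3}, [%4];\n"
        : "=r"(r[0]), "=r"(r[1]), "=r"(r[2]), "=r"(r[3])
        : "r"(smem_addr));
}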

Implementing Non-TN Layout GEMM Using NVIDIA Tensor Core TN Layout MMA Instructions

The next question is, given that there are no non-TN layout MMA instructions on most NVIDIA GPU architectures, how could we implement NT, NN, and TT layout GEMM solutions? Will they still be efficient?

Suppose we want to implement an NN layout GEMM solution using NVIDIA Tensor Core TN layout MMA instructions. This means the input matrix $A$ has to be transposed conceptually from column-major to row-major layout. We could think of two possible ways to achieve this:

  1. Physically transpose the input matrix $A$ from column-major to row-major layout in shared memory before loading it into registers for the MMA instruction. This is typically done when loading data from global memory to shared memory.
  2. Logically transpose the input matrix $A$ by swapping the thread mapping and the value mapping when loading data from shared memory to registers for the MMA instruction. This is typically done when loading data from shared memory to registers.

The first approach is relatively straightforward and it is MMA-instruction agnostic. However, it may introduce additional overhead due to shared memory bank conflicts. With proper techniques to avoid shared memory bank conflicts, such as swizzling, the overhead could be minimized to almost zero for most GEMM problems.
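Here is a minimal, unoptimized sketch of the first approach (the tile sizes, the absence of swizzling, and the function name are my own illustrative assumptions): each thread block reads a tile of the column-major matrix $A$ from global memory and writes it into shared memory in row-major order, so the downstream TN layout MMA pipeline can consume it unchanged.

#include <cuda_fp16.h>

// Minimal sketch: load a TILE_M x TILE_K tile of a column-major matrix A
// (leading dimension lda = M) from global memory and store it row-major
// in shared memory. No swizzling is applied, so bank conflicts may occur.
template <int TILE_M, int TILE_K>
__device__ void load_a_tile_col_major_to_row_major(
    __half const* A, int lda, int tile_m0, int tile_k0,
    __half (&A_smem)[TILE_M][TILE_K])
{
    for (int i = threadIdx.x; i < TILE_M * TILE_K; i += blockDim.x)
    {
        int const m = i / TILE_K;
        int const k = i % TILE_K;
        // Column-major global memory read, row-major shared memory write.
        A_smem[m][k] = A[(tile_k0 + k) * lda + (tile_m0 + m)];
    }
    __syncthreads();
}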

The second approach is relatively MMA-instruction dependent. Taking the TN layout MMA instruction SM80_16x8x16_F16F16F16F16_TN as an example, the thread $T0$ has to load two FP16 values from the logical coordinates $(0, 0)$ and $(0, 1)$ of matrix $A$ in row-major layout. If the matrix $A$ is stored in column-major layout, the two FP16 values are actually located at the physical coordinates $(0, 0)$ and $(1, 0)$. There are utility instructions that help load data from shared memory to registers with logical transposition, such as the ldmatrix instruction and its transposed variant.
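Compared with the earlier ldmatrix sketch, the only difference is the .trans qualifier, which transposes each 8x8 tile on the fly as it is loaded from shared memory into registers, so a column-major tile lands in registers as if it were row-major (again a sketch under the same unswizzled layout assumptions, with a hypothetical helper name).

#include <cstdint>
#include <cuda_fp16.h>

// Minimal sketch of the transposed ldmatrix variant: each 8x8 FP16 tile is
// transposed while being loaded from shared memory into registers.
__device__ void load_four_8x8_tiles_transposed(__half const* smem_tiles,
                                               uint32_t (&r)[4])
{
    __half const* row_ptr = smem_tiles + (threadIdx.x % 32) * 8;
    uint32_t const smem_addr =
        static_cast<uint32_t>(__cvta_generic_to_shared(row_ptr));
    asm volatile(
        "ldmatrix.sync.aligned.m8n8.x4.trans.shared.b16 "
        "{%0, %1, %2, %3}, [%4];\n"
        : "=r"(r[0]), "=r"(r[1]), "=r"(r[2]), "=r"(r[3])
        : "r"(smem_addr));
}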

Conclusions

In this article, we discussed why the TN layout is the most optimized layout for GEMM problems across different hardware architectures. We also explored how NVIDIA Tensor Core MMA instructions are designed specifically for the TN layout and how non-TN layout GEMM solutions can be implemented efficiently using these TN layout MMA instructions. Because of this, NVIDIA does not have to support all four layouts for MMA instructions on its GPUs.
