In computing, row-major order and column-major order are two ways of storing multidimensional arrays in linear storage such as random-access memory. The following figure demonstrates row-major order and column-major order for a 2D matrix.
In this blog post, I would like to discuss the difference between row-major order and column-major order, and their consequences for matrix multiplication performance.
Row-Major vs. Column-Major
Given a matrix $A$ of shape $(M, N)$, if it is stored in row-major order, its leading dimension is $N$, and if it is stored in column-major order, its leading dimension is $M$.
To read $A^{\top}$ from the same piece of memory in which $A$ was stored in row-major order with a leading dimension of $N$, we could simply treat the matrix in memory as if it were stored in column-major order, with the leading dimension remaining $N$.
To read $A^{\top}$ from the same piece of memory in which $A$ was stored in column-major order with a leading dimension of $M$, we could simply treat the matrix in memory as if it were stored in row-major order, with the leading dimension remaining $M$.
Take, for example, the $2 \times 3$ matrix $A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}$. If $A$ is stored in row-major order, the matrix values in linear memory are $[1, 2, 3, 4, 5, 6]$. If $A$ is stored in column-major order, the matrix values in linear memory are $[1, 4, 2, 5, 3, 6]$.
If $A^{\top}$ is stored in row-major order, the matrix values in linear memory are $[1, 4, 2, 5, 3, 6]$. If $A^{\top}$ is stored in column-major order, the matrix values in linear memory are $[1, 2, 3, 4, 5, 6]$.
It’s easy to see that $A$ stored in row-major order is exactly the same in memory as $A^{\top}$ stored in column-major order, and that $A$ stored in column-major order is exactly the same in memory as $A^{\top}$ stored in row-major order.
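We can verify this layout equivalence with a minimal sketch; the `flatten` helper below is just for illustration and not part of the benchmark code.

```cpp
#include <cstddef>
#include <vector>

// Flatten a matrix, given as a vector of rows, into a linear buffer in
// either row-major or column-major order.
std::vector<int> flatten(std::vector<std::vector<int>> const& mat,
                         bool row_major)
{
    size_t const num_rows{mat.size()};
    size_t const num_cols{mat.front().size()};
    std::vector<int> buf{};
    buf.reserve(num_rows * num_cols);
    if (row_major)
    {
        // Row-major: iterate over each row, then over its columns.
        for (size_t i{0}; i < num_rows; ++i)
            for (size_t j{0}; j < num_cols; ++j)
                buf.push_back(mat[i][j]);
    }
    else
    {
        // Column-major: iterate over each column, then over its rows.
        for (size_t j{0}; j < num_cols; ++j)
            for (size_t i{0}; i < num_rows; ++i)
                buf.push_back(mat[i][j]);
    }
    return buf;
}
```

For the $2 \times 3$ example above, `flatten(A, true)` (row-major $A$) produces the same buffer as `flatten(At, false)` (column-major $A^{\top}$), and vice versa.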
For a matrix $A$ stored in row-major order, reading rows of $A$ and reading columns of $A^{\top}$ are fast and cache-friendly, whereas reading columns of $A$ and reading rows of $A^{\top}$ are slow because they defeat caching.
For a matrix $A$ stored in column-major order, reading columns of $A$ and reading rows of $A^{\top}$ are fast and cache-friendly, whereas reading rows of $A$ and reading columns of $A^{\top}$ are slow because they defeat caching.
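The effect of these access patterns can be seen with a simple traversal experiment. The sketch below sums a row-major matrix once row by row (unit stride) and once column by column (a stride of one full row); timing the two functions on a large matrix, for example with `std::chrono`, typically shows the row-wise sweep running several times faster, though the exact gap is hardware-dependent.

```cpp
#include <cstddef>
#include <vector>

// Sum a row-major matrix row by row: consecutive memory accesses,
// cache-friendly.
float sum_row_wise(std::vector<float> const& data, size_t num_rows,
                   size_t num_cols)
{
    float sum{0.0f};
    for (size_t i{0}; i < num_rows; ++i)
        for (size_t j{0}; j < num_cols; ++j)
            sum += data[i * num_cols + j];
    return sum;
}

// Sum the same matrix column by column: each access jumps num_cols
// elements ahead, cache-unfriendly for large matrices.
float sum_col_wise(std::vector<float> const& data, size_t num_rows,
                   size_t num_cols)
{
    float sum{0.0f};
    for (size_t j{0}; j < num_cols; ++j)
        for (size_t i{0}; i < num_rows; ++i)
            sum += data[i * num_cols + j];
    return sum;
}
```

Both functions compute the same value; only the memory access order differs.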
Matrix Multiplication
The way matrices are stored in memory affects the performance of matrix multiplication on many processors, such as CPUs and GPUs. Depending on whether the matrices need to be mathematically transposed for the multiplication, there are four ways of computing a matrix product: $C=AB$, $C=A^{\top}B$, $C=AB^{\top}$, and $C=A^{\top}B^{\top}$. Each of them performs better than the others for certain storage orderings of the matrices $A$ and $B$, even though the theoretical MACs of the operations remain the same.
$C=AB$
Suppose a matrix $A$ is of shape $(M, K)$ and a matrix $B$ is of shape $(K, N)$. To compute $C = AB$, where $C$ is a matrix of shape $(M, N)$, each element of $C$ is the dot product of one row of size $K$ from the matrix $A$ and one column of size $K$ from the matrix $B$.
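In index notation, this is

$$C_{m, n} = \sum_{k=1}^{K} A_{m, k} B_{k, n}, \quad 1 \leq m \leq M, \quad 1 \leq n \leq N.$$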
Depending on the storage ordering of the two matrices, there are four scenarios.
| Matrix $A$ Storage Order | Matrix $B$ Storage Order | Matrix $A$ Row Reading | Matrix $B$ Column Reading |
|--------------------------|--------------------------|------------------------|---------------------------|
| column-major             | column-major             | slow                   | fast                      |
| column-major             | row-major                | slow                   | slow                      |
| row-major                | column-major             | fast                   | fast                      |
| row-major                | row-major                | fast                   | slow                      |
When $A$ is stored in row-major order and $B$ is stored in column-major order, reading rows from $A$ and columns from $B$ are both fast thanks to the caching mechanism of modern processors, and faster reads result in better performance given the same amount of math to compute.
Therefore, the matrix multiplication $C=AB$ is more suitable for the situation where $A$ is stored in row-major order and $B$ is stored in column-major order.
$C=A^{\top}B$
Suppose a matrix $A$ is of shape $(K, M)$ and a matrix $B$ is of shape $(K, N)$. To compute $C = A^{\top}B$, where $C$ is a matrix of shape $(M, N)$, each element of $C$ is the dot product of one column of size $K$ from the matrix $A$ and one column of size $K$ from the matrix $B$.
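In index notation, this is

$$C_{m, n} = \sum_{k=1}^{K} A_{k, m} B_{k, n}, \quad 1 \leq m \leq M, \quad 1 \leq n \leq N.$$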
Depending on the storage ordering of the two matrices, there are four scenarios.
| Matrix $A$ Storage Order | Matrix $B$ Storage Order | Matrix $A$ Column Reading | Matrix $B$ Column Reading |
|--------------------------|--------------------------|---------------------------|---------------------------|
| column-major             | column-major             | fast                      | fast                      |
| column-major             | row-major                | fast                      | slow                      |
| row-major                | column-major             | slow                      | fast                      |
| row-major                | row-major                | slow                      | slow                      |
When $A$ is stored in column-major order and $B$ is stored in column-major order, reading columns from $A$ and columns from $B$ are both fast thanks to the caching mechanism of modern processors, and faster reads result in better performance given the same amount of math to compute.
Therefore, the matrix multiplication $C=A^{\top}B$ is more suitable for the situation where $A$ is stored in column-major order and $B$ is stored in column-major order.
$C=AB^{\top}$
Suppose a matrix $A$ is of shape $(M, K)$ and a matrix $B$ is of shape $(N, K)$. To compute $C = AB^{\top}$, where $C$ is a matrix of shape $(M, N)$, each element of $C$ is the dot product of one row of size $K$ from the matrix $A$ and one row of size $K$ from the matrix $B$.
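In index notation, this is

$$C_{m, n} = \sum_{k=1}^{K} A_{m, k} B_{n, k}, \quad 1 \leq m \leq M, \quad 1 \leq n \leq N.$$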
Depending on the storage ordering of the two matrices, there are four scenarios.
| Matrix $A$ Storage Order | Matrix $B$ Storage Order | Matrix $A$ Row Reading | Matrix $B$ Row Reading |
|--------------------------|--------------------------|------------------------|------------------------|
| column-major             | column-major             | slow                   | slow                   |
| column-major             | row-major                | slow                   | fast                   |
| row-major                | column-major             | fast                   | slow                   |
| row-major                | row-major                | fast                   | fast                   |
When $A$ is stored in row-major order and $B$ is stored in row-major order, reading rows from $A$ and rows from $B$ are both fast thanks to the caching mechanism of modern processors, and faster reads result in better performance given the same amount of math to compute.
Therefore, the matrix multiplication $C=AB^{\top}$ is more suitable for the situation where $A$ is stored in row-major order and $B$ is stored in row-major order.
$C=A^{\top}B^{\top}$
Suppose a matrix $A$ is of shape $(K, M)$ and a matrix $B$ is of shape $(N, K)$. To compute $C = A^{\top}B^{\top}$, where $C$ is a matrix of shape $(M, N)$, each element of $C$ is the dot product of one column of size $K$ from the matrix $A$ and one row of size $K$ from the matrix $B$.
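In index notation, this is

$$C_{m, n} = \sum_{k=1}^{K} A_{k, m} B_{n, k}, \quad 1 \leq m \leq M, \quad 1 \leq n \leq N.$$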
Depending on the storage ordering of the two matrices, there are four scenarios.
| Matrix $A$ Storage Order | Matrix $B$ Storage Order | Matrix $A$ Column Reading | Matrix $B$ Row Reading |
|--------------------------|--------------------------|---------------------------|------------------------|
| column-major             | column-major             | fast                      | slow                   |
| column-major             | row-major                | fast                      | fast                   |
| row-major                | column-major             | slow                      | slow                   |
| row-major                | row-major                | slow                      | fast                   |
When $A$ is stored in column-major order and $B$ is stored in row-major order, reading columns from $A$ and rows from $B$ are both fast thanks to the caching mechanism of modern processors, and faster reads result in better performance given the same amount of math to compute.
Therefore, the matrix multiplication $C=A^{\top}B^{\top}$ is more suitable for the situation where $A$ is stored in column-major order and $B$ is stored in row-major order.
Matrix Multiplication Preference
The matrix multiplication preference for different combinations of the storage orders of the matrices being multiplied can be summarized as follows.
| Matrix $A$ Storage Order | Matrix $B$ Storage Order | Matrix Multiplication Preference |
|--------------------------|--------------------------|----------------------------------|
| column-major             | column-major             | $C = A^{\top}B$                  |
| column-major             | row-major                | $C = A^{\top}B^{\top}$           |
| row-major                | column-major             | $C = AB$                         |
| row-major                | row-major                | $C = AB^{\top}$                  |
Because all the matrices within one software framework usually use the same storage order, in practice only $C = A^{\top}B$ (for column-major storage) and $C = AB^{\top}$ (for row-major storage) are the preferred options.
Optimizations can reduce the performance gap between the optimal matrix multiplication option and the other options, sometimes even to almost zero, depending on the implementation and the processor.
In addition, it's usually not a good idea to physically transpose a matrix in memory just to use the most performant of the four matrix multiplication options, because the overhead of the transpose can be much larger than the performance difference between the four options, especially when they are all well optimized.
Matrix Multiplication Benchmarks
Additionally, we could verify our analysis using single-threaded naive C++ implementations of matrix multiplication.
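The full benchmark source is not reproduced here; a minimal kernel consistent with the `std::bind` calls below might look like the following sketch. The function name and argument order are taken from those calls, but the exact semantics are my assumption: all buffers are addressed column-major, and the transpose flags select whether each operand is read as transposed.

```cpp
#include <cstddef>

// Naive GEMM on column-major buffers: computes C = op(A) * op(B), where
// op(X) = X^T when the corresponding transpose flag is set. Every matrix
// is addressed column-major: element (i, j) lives at data[j * ld + i].
template <typename T>
void mm_a_col_major_b_col_major(T const* A, T const* B, T* C,
                                size_t M, size_t N, size_t K,
                                size_t lda, size_t ldb, size_t ldc,
                                bool transa, bool transb)
{
    for (size_t n{0}; n < N; ++n)
    {
        for (size_t m{0}; m < M; ++m)
        {
            T acc{0};
            for (size_t k{0}; k < K; ++k)
            {
                // op(A)(m, k): A(k, m) if A is stored as (K, M) and read
                // transposed, otherwise A(m, k) stored as (M, K).
                T const a{transa ? A[m * lda + k] : A[k * lda + m]};
                // op(B)(k, n): B(n, k) if B is stored as (N, K) and read
                // transposed, otherwise B(k, n) stored as (K, N).
                T const b{transb ? B[k * ldb + n] : B[n * ldb + k]};
                acc += a * b;
            }
            C[n * ldc + m] = acc;
        }
    }
}
```

The transpose flags change only the index arithmetic, not the amount of math, which is exactly why the four options differ only in memory access pattern.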
```cpp
std::function<void(void)> const mm_a_col_major_b_col_major_a_b{
    std::bind(mm_a_col_major_b_col_major<float>, A, B, C, M, N, K,
              matrix_a_col_major_ld, matrix_b_col_major_ld,
              matrix_c_col_major_ld, false, false)};
std::function<void(void)> const mm_a_col_major_b_col_major_a_transpose_b{
    std::bind(mm_a_col_major_b_col_major<float>, A, B, C, M, N, K,
              matrix_a_transpose_col_major_ld, matrix_b_col_major_ld,
              matrix_c_col_major_ld, true, false)};
std::function<void(void)> const mm_a_col_major_b_col_major_a_transpose_b_transpose{
    std::bind(mm_a_col_major_b_col_major<float>, A, B, C, M, N, K,
              matrix_a_transpose_col_major_ld, matrix_b_transpose_col_major_ld,
              matrix_c_col_major_ld, true, true)};
std::function<void(void)> const mm_a_col_major_b_col_major_a_b_transpose{
    std::bind(mm_a_col_major_b_col_major<float>, A, B, C, M, N, K,
              matrix_a_col_major_ld, matrix_b_transpose_col_major_ld,
              matrix_c_col_major_ld, false, true)};
```
We could see that, given matrices $A$ and $B$ stored in column-major order, as expected, $C = A^{\top}B$ performs the best and $C = AB^{\top}$ performs the worst.
```
$ g++ naive_mm.cpp -o naive_mm
$ ./naive_mm
C = A * B Latency: 45.400 ms
C = A^T * B Latency: 32.500 ms
C = A * B^T Latency: 57.800 ms
C = A^T * B^T Latency: 48.300 ms
```
Using multi-threaded optimized matrix multiplication implementations, such as the GEMM functions from the cuBLAS library, can almost eliminate the performance difference between the four options.
```cpp
CHECK_CUDA_ERROR(cudaMalloc(&A, M * K * sizeof(float)));
CHECK_CUDA_ERROR(cudaMalloc(&B, K * N * sizeof(float)));
CHECK_CUDA_ERROR(cudaMalloc(&C, M * N * sizeof(float)));
```
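The GEMM calls themselves are not shown above. As a sketch, the $C = A^{\top}B$ case might look like the following fragment, where `handle` is an existing `cublasHandle_t` and `CHECK_CUBLAS_ERROR` is a hypothetical error-checking macro analogous to `CHECK_CUDA_ERROR`; note that cuBLAS assumes column-major storage, so the four variants differ only in the `transa`/`transb` arguments and leading dimensions.

```cpp
// Fragment: C = A^T * B via cuBLAS (column-major convention).
// A is stored as (K, M), B as (K, N), C as (M, N), all on the device.
float const alpha{1.0f};
float const beta{0.0f};
CHECK_CUBLAS_ERROR(cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                               M, N, K, &alpha,
                               A, K,   // lda = K: A stored as (K, M)
                               B, K,   // ldb = K: B stored as (K, N)
                               &beta,
                               C, M)); // ldc = M: C stored as (M, N)
```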
With the highly optimized implementations, there is almost no difference between the four options.
```
$ nvcc cublas_mm.cu -o cublas_mm -lcublas
$ ./cublas_mm
C = A * B Latency: 0.008 ms
C = A^T * B Latency: 0.010 ms
C = A * B^T Latency: 0.009 ms
C = A^T * B^T Latency: 0.008 ms
```