# Row-Major VS Column-Major

## Introduction

In computing, row-major order and column-major order are two ways of storing multidimensional arrays in linear storage such as random access memory. The following figure demonstrates the row-major order and the column-major order for a 2D matrix.

In this blog post, I would like to discuss the difference between the row-major order and the column-major order, and their consequence for matrix multiplication performance.

## Row-Major VS Column-Major

Given a matrix $A$ of shape $(M, N)$, if it is stored in row-major order, its leading dimension is $N$, and if it is stored in column-major order, its leading dimension is $M$.

To read $A^{\top}$ from the same piece of memory in which $A$ was stored in row-major order with a leading dimension of $N$, we can just treat the matrix in memory as if it were stored in column-major order with the leading dimension still being $N$.

To read $A^{\top}$ from the same piece of memory in which $A$ was stored in column-major order with a leading dimension of $M$, we can just treat the matrix in memory as if it were stored in row-major order with the leading dimension still being $M$.

For example, we have a matrix $A$,

\begin{align} A &= \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ \end{bmatrix} \\ \end{align}

If $A$ is stored in row-major order, the matrix values in the linear memory are $[1, 2, 3, 4, 5, 6]$.
If $A$ is stored in column-major order, the matrix values in the linear memory are $[1, 4, 2, 5, 3, 6]$.

The transpose of $A$, $A^{\top}$, is

\begin{align} A^{\top} &= \begin{bmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \\ \end{bmatrix} \\ \end{align}

If $A^{\top}$ is stored in row-major order, the matrix values in the linear memory are $[1, 4, 2, 5, 3, 6]$.
If $A^{\top}$ is stored in column-major order, the matrix values in the linear memory are $[1, 2, 3, 4, 5, 6]$.

It’s easy to see that $A$ stored in row-major order is exactly the same in memory as $A^{\top}$ stored in column-major order, and $A$ stored in column-major order is exactly the same in memory as $A^{\top}$ stored in row-major order.

For a matrix $A$ stored in row-major order, reading rows of $A$ and reading columns of $A^{\top}$ are fast and cache-friendly, whereas reading columns of $A$ and reading rows of $A^{\top}$ are slow and cache-unfriendly.

For a matrix $A$ stored in column-major order, reading columns of $A$ and reading rows of $A^{\top}$ are fast and cache-friendly, whereas reading rows of $A$ and reading columns of $A^{\top}$ are slow and cache-unfriendly.

## Matrix Multiplication

The way matrices are stored in memory affects the performance of matrix multiplication on many processors, such as CPUs and GPUs. Depending on whether the matrices need to be mathematically transposed before multiplication, there are four ways of computing a matrix multiplication: $C=AB$, $C=A^{\top}B$, $C=AB^{\top}$, and $C=A^{\top}B^{\top}$. Each of them performs better than the others for certain storage orderings of matrices $A$ and $B$, even though the theoretical number of multiply-accumulate operations (MACs) remains the same.

### $C=AB$

Suppose a matrix $A$ is of shape $(M, K)$ and a matrix $B$ is of shape $(K, N)$. To compute $C = AB$, where $C$ is a matrix of shape $(M, N)$, each element of $C$ is an accumulated sum of products between one row of size $K$ in the matrix $A$ and one column of size $K$ in the matrix $B$.

Depending on the storage ordering of the two matrices, there are four scenarios.

| Matrix $A$ Storage Order | Matrix $B$ Storage Order | Matrix $A$ Row Reading | Matrix $B$ Column Reading |
|--------------------------|--------------------------|------------------------|---------------------------|
| column-major             | column-major             | slow                   | fast                      |
| column-major             | row-major                | slow                   | slow                      |
| row-major                | column-major             | fast                   | fast                      |
| row-major                | row-major                | fast                   | slow                      |

When $A$ is stored in row-major order and $B$ is stored in column-major order, reading rows from $A$ and columns from $B$ are both fast because of the caching mechanism of modern processors, and faster reads result in better performance, given the same amount of math to compute.

Therefore, the matrix multiplication $C=AB$ is more suitable for the situation where $A$ is stored in row-major order and $B$ is stored in column-major order.

### $C=A^{\top}B$

Suppose a matrix $A$ is of shape $(K, M)$ and a matrix $B$ is of shape $(K, N)$. To compute $C = A^{\top}B$, where $C$ is a matrix of shape $(M, N)$, each element of $C$ is an accumulated sum of products between one column of size $K$ in the matrix $A$ and one column of size $K$ in the matrix $B$.

Depending on the storage ordering of the two matrices, there are four scenarios.

| Matrix $A$ Storage Order | Matrix $B$ Storage Order | Matrix $A$ Column Reading | Matrix $B$ Column Reading |
|--------------------------|--------------------------|---------------------------|---------------------------|
| column-major             | column-major             | fast                      | fast                      |
| column-major             | row-major                | fast                      | slow                      |
| row-major                | column-major             | slow                      | fast                      |
| row-major                | row-major                | slow                      | slow                      |

When both $A$ and $B$ are stored in column-major order, reading columns from $A$ and columns from $B$ are both fast because of the caching mechanism of modern processors, and faster reads result in better performance, given the same amount of math to compute.

Therefore, the matrix multiplication $C=A^{\top}B$ is more suitable for the situation where $A$ is stored in column-major order and $B$ is stored in column-major order.

### $C=AB^{\top}$

Suppose a matrix $A$ is of shape $(M, K)$ and a matrix $B$ is of shape $(N, K)$. To compute $C = AB^{\top}$, where $C$ is a matrix of shape $(M, N)$, each element of $C$ is an accumulated sum of products between one row of size $K$ in the matrix $A$ and one row of size $K$ in the matrix $B$.

Depending on the storage ordering of the two matrices, there are four scenarios.

| Matrix $A$ Storage Order | Matrix $B$ Storage Order | Matrix $A$ Row Reading | Matrix $B$ Row Reading |
|--------------------------|--------------------------|------------------------|------------------------|
| column-major             | column-major             | slow                   | slow                   |
| column-major             | row-major                | slow                   | fast                   |
| row-major                | column-major             | fast                   | slow                   |
| row-major                | row-major                | fast                   | fast                   |

When both $A$ and $B$ are stored in row-major order, reading rows from $A$ and rows from $B$ are both fast because of the caching mechanism of modern processors, and faster reads result in better performance, given the same amount of math to compute.

Therefore, the matrix multiplication $C=AB^{\top}$ is more suitable for the situation where $A$ is stored in row-major order and $B$ is stored in row-major order.

### $C=A^{\top}B^{\top}$

Suppose a matrix $A$ is of shape $(K, M)$ and a matrix $B$ is of shape $(N, K)$. To compute $C = A^{\top}B^{\top}$, where $C$ is a matrix of shape $(M, N)$, each element of $C$ is an accumulated sum of products between one column of size $K$ in the matrix $A$ and one row of size $K$ in the matrix $B$.

Depending on the storage ordering of the two matrices, there are four scenarios.

| Matrix $A$ Storage Order | Matrix $B$ Storage Order | Matrix $A$ Column Reading | Matrix $B$ Row Reading |
|--------------------------|--------------------------|---------------------------|------------------------|
| column-major             | column-major             | fast                      | slow                   |
| column-major             | row-major                | fast                      | fast                   |
| row-major                | column-major             | slow                      | slow                   |
| row-major                | row-major                | slow                      | fast                   |

When $A$ is stored in column-major order and $B$ is stored in row-major order, reading columns from $A$ and rows from $B$ are both fast because of the caching mechanism of modern processors, and faster reads result in better performance, given the same amount of math to compute.

Therefore, the matrix multiplication $C=A^{\top}B^{\top}$ is more suitable for the situation where $A$ is stored in column-major order and $B$ is stored in row-major order.

## Matrix Multiplication Preference

The matrix multiplication preference for different combinations of the storage orders of the matrices being multiplied can be summarized as follows.

| Matrix $A$ Storage Order | Matrix $B$ Storage Order | Matrix Multiplication Preference |
|--------------------------|--------------------------|----------------------------------|
| column-major             | column-major             | $C = A^{\top}B$                  |
| column-major             | row-major                | $C = A^{\top}B^{\top}$           |
| row-major                | column-major             | $C = AB$                         |
| row-major                | row-major                | $C = AB^{\top}$                  |

Because all the matrices in one software framework usually use the same storage order, only $C = A^{\top}B$ and $C = AB^{\top}$ are preferred in practice.

Optimizations can reduce the performance gap between the optimal matrix multiplication option and the other options, sometimes even to almost zero, depending on the implementation and the processor.

In addition, it’s often not a good idea to physically transpose a matrix in memory just to use the most performant of the four matrix multiplication options, because the overhead of transposing the matrix in memory might be much larger than the performance difference between the four options, especially when they are all well optimized.

## Matrix Multiplication Benchmarks

Additionally, we can verify our analysis using single-threaded naive C++ implementations of matrix multiplication.

We can see that, when matrix $A$ and matrix $B$ are both stored in column-major order, as expected, the performance of $C = A^{\top}B$ is the best and the performance of $C = AB^{\top}$ is the worst.

Using multi-threaded, highly optimized matrix multiplication implementations, such as the GEMM functions from the cuBLAS library, can almost entirely eliminate the performance difference between the four options.

Lei Mao

05-12-2023
