In computing, row-major order and column-major order are two ways of storing multidimensional arrays in linear storage such as random-access memory. The following figure demonstrates row-major order and column-major order for a 2D matrix.
In this blog post, I would like to discuss the difference between row-major order and column-major order, and their consequences for matrix multiplication performance.
Row-Major vs. Column-Major
Given a matrix $A$ of shape $(M, N)$, if it is stored in row-major order, its leading dimension is $N$, and if it is stored in column-major order, its leading dimension is $M$.
To read $A^{\top}$ from the same piece of memory in which $A$ was stored in row-major order with a leading dimension of $N$, we could simply treat the matrix in memory as if it were stored in column-major order, with the leading dimension remaining $N$.
To read $A^{\top}$ from the same piece of memory in which $A$ was stored in column-major order with a leading dimension of $M$, we could simply treat the matrix in memory as if it were stored in row-major order, with the leading dimension remaining $M$.
Take, for example, the $2 \times 3$ matrix $A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix}$. If $A$ is stored in row-major order, the matrix values in linear memory are $[1, 2, 3, 4, 5, 6]$. If $A$ is stored in column-major order, the matrix values in linear memory are $[1, 4, 2, 5, 3, 6]$.
If $A^{\top}$ is stored in row-major order, the matrix values in linear memory are $[1, 4, 2, 5, 3, 6]$. If $A^{\top}$ is stored in column-major order, the matrix values in linear memory are $[1, 2, 3, 4, 5, 6]$.
It’s easy to see that $A$ stored in row-major order is exactly the same in memory as $A^{\top}$ stored in column-major order, and that $A$ stored in column-major order is exactly the same in memory as $A^{\top}$ stored in row-major order.
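We can verify this layout equivalence with a minimal sketch; the `flatten` helper below is just for illustration and not part of the benchmark code.

```cpp
#include <cstddef>
#include <vector>

// Flatten a matrix, given as a vector of rows, into a linear buffer in
// either row-major or column-major order.
std::vector<int> flatten(std::vector<std::vector<int>> const& mat,
                         bool row_major)
{
    size_t const num_rows{mat.size()};
    size_t const num_cols{mat.front().size()};
    std::vector<int> buf{};
    buf.reserve(num_rows * num_cols);
    if (row_major)
    {
        // Row-major: iterate over each row, then over its columns.
        for (size_t i{0}; i < num_rows; ++i)
            for (size_t j{0}; j < num_cols; ++j)
                buf.push_back(mat[i][j]);
    }
    else
    {
        // Column-major: iterate over each column, then over its rows.
        for (size_t j{0}; j < num_cols; ++j)
            for (size_t i{0}; i < num_rows; ++i)
                buf.push_back(mat[i][j]);
    }
    return buf;
}
```

For the $2 \times 3$ example above, `flatten(A, true)` (row-major $A$) produces the same buffer as `flatten(At, false)` (column-major $A^{\top}$), and vice versa.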
For a matrix $A$ stored in row-major order, reading rows of $A$ and reading columns of $A^{\top}$ are fast and cache-friendly, whereas reading columns of $A$ and reading rows of $A^{\top}$ are slow because they defeat caching.
For a matrix $A$ stored in column-major order, reading columns of $A$ and reading rows of $A^{\top}$ are fast and cache-friendly, whereas reading rows of $A$ and reading columns of $A^{\top}$ are slow because they defeat caching.
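The effect of these access patterns can be seen with a simple traversal experiment. The sketch below sums a row-major matrix once row by row (unit stride) and once column by column (a stride of one full row); timing the two functions on a large matrix, for example with `std::chrono`, typically shows the row-wise sweep running several times faster, though the exact gap is hardware-dependent.

```cpp
#include <cstddef>
#include <vector>

// Sum a row-major matrix row by row: consecutive memory accesses,
// cache-friendly.
float sum_row_wise(std::vector<float> const& data, size_t num_rows,
                   size_t num_cols)
{
    float sum{0.0f};
    for (size_t i{0}; i < num_rows; ++i)
        for (size_t j{0}; j < num_cols; ++j)
            sum += data[i * num_cols + j];
    return sum;
}

// Sum the same matrix column by column: each access jumps num_cols
// elements ahead, cache-unfriendly for large matrices.
float sum_col_wise(std::vector<float> const& data, size_t num_rows,
                   size_t num_cols)
{
    float sum{0.0f};
    for (size_t j{0}; j < num_cols; ++j)
        for (size_t i{0}; i < num_rows; ++i)
            sum += data[i * num_cols + j];
    return sum;
}
```

Both functions compute the same value; only the memory access order differs.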
Matrix Multiplication
The way matrices are stored in memory affects the performance of matrix multiplication on many processors, such as CPUs and GPUs. Depending on whether the matrices need to be mathematically transposed for the multiplication, there are four ways of computing a matrix product: $C=AB$, $C=A^{\top}B$, $C=AB^{\top}$, and $C=A^{\top}B^{\top}$. Each of them performs better than the others for certain storage orderings of the matrices $A$ and $B$, even though the theoretical MACs of the operations remain the same.
$C=AB$
Suppose a matrix $A$ is of shape $(M, K)$ and a matrix $B$ is of shape $(K, N)$. To compute $C = AB$, where $C$ is a matrix of shape $(M, N)$, each element of $C$ is the dot product of one row of size $K$ from the matrix $A$ and one column of size $K$ from the matrix $B$.
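In index notation, this is

$$C_{m, n} = \sum_{k=1}^{K} A_{m, k} B_{k, n}, \quad 1 \leq m \leq M, \quad 1 \leq n \leq N.$$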
Depending on the storage ordering of the two matrices, there are four scenarios.
| Matrix $A$ Storage Order | Matrix $B$ Storage Order | Matrix $A$ Row Reading | Matrix $B$ Column Reading |
|--------------------------|--------------------------|------------------------|---------------------------|
| column-major             | column-major             | slow                   | fast                      |
| column-major             | row-major                | slow                   | slow                      |
| row-major                | column-major             | fast                   | fast                      |
| row-major                | row-major                | fast                   | slow                      |
When $A$ is stored in row-major order and $B$ is stored in column-major order, reading rows from $A$ and columns from $B$ are both fast thanks to the caching mechanism of modern processors, and faster reads result in better performance given the same amount of math to compute.
Therefore, the matrix multiplication $C=AB$ is more suitable for the situation where $A$ is stored in row-major order and $B$ is stored in column-major order.
$C=A^{\top}B$
Suppose a matrix $A$ is of shape $(K, M)$ and a matrix $B$ is of shape $(K, N)$. To compute $C = A^{\top}B$, where $C$ is a matrix of shape $(M, N)$, each element of $C$ is the dot product of one column of size $K$ from the matrix $A$ and one column of size $K$ from the matrix $B$.
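In index notation, this is

$$C_{m, n} = \sum_{k=1}^{K} A_{k, m} B_{k, n}, \quad 1 \leq m \leq M, \quad 1 \leq n \leq N.$$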
Depending on the storage ordering of the two matrices, there are four scenarios.
| Matrix $A$ Storage Order | Matrix $B$ Storage Order | Matrix $A$ Column Reading | Matrix $B$ Column Reading |
|--------------------------|--------------------------|---------------------------|---------------------------|
| column-major             | column-major             | fast                      | fast                      |
| column-major             | row-major                | fast                      | slow                      |
| row-major                | column-major             | slow                      | fast                      |
| row-major                | row-major                | slow                      | slow                      |
When $A$ is stored in column-major order and $B$ is stored in column-major order, reading columns from $A$ and columns from $B$ are both fast thanks to the caching mechanism of modern processors, and faster reads result in better performance given the same amount of math to compute.
Therefore, the matrix multiplication $C=A^{\top}B$ is more suitable for the situation where $A$ is stored in column-major order and $B$ is stored in column-major order.
$C=AB^{\top}$
Suppose a matrix $A$ is of shape $(M, K)$ and a matrix $B$ is of shape $(N, K)$. To compute $C = AB^{\top}$, where $C$ is a matrix of shape $(M, N)$, each element of $C$ is the dot product of one row of size $K$ from the matrix $A$ and one row of size $K$ from the matrix $B$.
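In index notation, this is

$$C_{m, n} = \sum_{k=1}^{K} A_{m, k} B_{n, k}, \quad 1 \leq m \leq M, \quad 1 \leq n \leq N.$$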
Depending on the storage ordering of the two matrices, there are four scenarios.
| Matrix $A$ Storage Order | Matrix $B$ Storage Order | Matrix $A$ Row Reading | Matrix $B$ Row Reading |
|--------------------------|--------------------------|------------------------|------------------------|
| column-major             | column-major             | slow                   | slow                   |
| column-major             | row-major                | slow                   | fast                   |
| row-major                | column-major             | fast                   | slow                   |
| row-major                | row-major                | fast                   | fast                   |
When $A$ is stored in row-major order and $B$ is stored in row-major order, reading rows from $A$ and rows from $B$ are both fast thanks to the caching mechanism of modern processors, and faster reads result in better performance given the same amount of math to compute.
Therefore, the matrix multiplication $C=AB^{\top}$ is more suitable for the situation where $A$ is stored in row-major order and $B$ is stored in row-major order.
$C=A^{\top}B^{\top}$
Suppose a matrix $A$ is of shape $(K, M)$ and a matrix $B$ is of shape $(N, K)$. To compute $C = A^{\top}B^{\top}$, where $C$ is a matrix of shape $(M, N)$, each element of $C$ is the dot product of one column of size $K$ from the matrix $A$ and one row of size $K$ from the matrix $B$.
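In index notation, this is

$$C_{m, n} = \sum_{k=1}^{K} A_{k, m} B_{n, k}, \quad 1 \leq m \leq M, \quad 1 \leq n \leq N.$$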
Depending on the storage ordering of the two matrices, there are four scenarios.
| Matrix $A$ Storage Order | Matrix $B$ Storage Order | Matrix $A$ Column Reading | Matrix $B$ Row Reading |
|--------------------------|--------------------------|---------------------------|------------------------|
| column-major             | column-major             | fast                      | slow                   |
| column-major             | row-major                | fast                      | fast                   |
| row-major                | column-major             | slow                      | slow                   |
| row-major                | row-major                | slow                      | fast                   |
When $A$ is stored in column-major order and $B$ is stored in row-major order, reading columns from $A$ and rows from $B$ are both fast thanks to the caching mechanism of modern processors, and faster reads result in better performance given the same amount of math to compute.
Therefore, the matrix multiplication $C=A^{\top}B^{\top}$ is more suitable for the situation where $A$ is stored in column-major order and $B$ is stored in row-major order.
Matrix Multiplication Preference
The matrix multiplication preference for different combinations of the storage orders of the matrices being multiplied can be summarized as follows.
| Matrix $A$ Storage Order | Matrix $B$ Storage Order | Matrix Multiplication Preference |
|--------------------------|--------------------------|----------------------------------|
| column-major             | column-major             | $C = A^{\top}B$                  |
| column-major             | row-major                | $C = A^{\top}B^{\top}$           |
| row-major                | column-major             | $C = AB$                         |
| row-major                | row-major                | $C = AB^{\top}$                  |
Because all the matrices within one software framework usually use the same storage order, in practice only $C = A^{\top}B$ (for column-major storage) and $C = AB^{\top}$ (for row-major storage) are the preferred options.
Optimizations can reduce the performance gap between the optimal matrix multiplication option and the other options, sometimes even to almost zero, depending on the implementation and the processor.
In addition, it's usually not a good idea to physically transpose a matrix in memory just to use the most performant of the four matrix multiplication options, because the overhead of the transpose can be much larger than the performance difference between the four options, especially when they are all well optimized.
Matrix Multiplication Benchmarks
Additionally, we could verify our analysis using single-threaded naive C++ implementations of matrix multiplication.
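The full benchmark source is not reproduced here; a minimal kernel consistent with the `std::bind` calls below might look like the following sketch. The function name and argument order are taken from those calls, but the exact semantics are my assumption: all buffers are addressed column-major, and the transpose flags select whether each operand is read as transposed.

```cpp
#include <cstddef>

// Naive GEMM on column-major buffers: computes C = op(A) * op(B), where
// op(X) = X^T when the corresponding transpose flag is set. Every matrix
// is addressed column-major: element (i, j) lives at data[j * ld + i].
template <typename T>
void mm_a_col_major_b_col_major(T const* A, T const* B, T* C,
                                size_t M, size_t N, size_t K,
                                size_t lda, size_t ldb, size_t ldc,
                                bool transa, bool transb)
{
    for (size_t n{0}; n < N; ++n)
    {
        for (size_t m{0}; m < M; ++m)
        {
            T acc{0};
            for (size_t k{0}; k < K; ++k)
            {
                // op(A)(m, k): A(k, m) if A is stored as (K, M) and read
                // transposed, otherwise A(m, k) stored as (M, K).
                T const a{transa ? A[m * lda + k] : A[k * lda + m]};
                // op(B)(k, n): B(n, k) if B is stored as (N, K) and read
                // transposed, otherwise B(k, n) stored as (K, N).
                T const b{transb ? B[k * ldb + n] : B[n * ldb + k]};
                acc += a * b;
            }
            C[n * ldc + m] = acc;
        }
    }
}
```

The transpose flags change only the index arithmetic, not the amount of math, which is exactly why the four options differ only in memory access pattern.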
```cpp
std::function<void(void)> const mm_a_col_major_b_col_major_a_b{
    std::bind(mm_a_col_major_b_col_major<float>, A, B, C, M, N, K,
              matrix_a_col_major_ld, matrix_b_col_major_ld,
              matrix_c_col_major_ld, false, false)};
std::function<void(void)> const mm_a_col_major_b_col_major_a_transpose_b{
    std::bind(mm_a_col_major_b_col_major<float>, A, B, C, M, N, K,
              matrix_a_transpose_col_major_ld, matrix_b_col_major_ld,
              matrix_c_col_major_ld, true, false)};
std::function<void(void)> const mm_a_col_major_b_col_major_a_transpose_b_transpose{
    std::bind(mm_a_col_major_b_col_major<float>, A, B, C, M, N, K,
              matrix_a_transpose_col_major_ld, matrix_b_transpose_col_major_ld,
              matrix_c_col_major_ld, true, true)};
std::function<void(void)> const mm_a_col_major_b_col_major_a_b_transpose{
    std::bind(mm_a_col_major_b_col_major<float>, A, B, C, M, N, K,
              matrix_a_col_major_ld, matrix_b_transpose_col_major_ld,
              matrix_c_col_major_ld, false, true)};
```
We could see that, given matrices $A$ and $B$ stored in column-major order, as expected, $C = A^{\top}B$ performs the best and $C = AB^{\top}$ performs the worst.
```
$ g++ naive_mm.cpp -o naive_mm
$ ./naive_mm
C = A * B Latency: 45.400 ms
C = A^T * B Latency: 32.500 ms
C = A * B^T Latency: 57.800 ms
C = A^T * B^T Latency: 48.300 ms
```
Using multi-threaded optimized matrix multiplication implementations, such as the GEMM functions from the cuBLAS library, can almost eliminate the performance difference between the four options.
```cpp
CHECK_CUDA_ERROR(cudaMalloc(&A, M * K * sizeof(float)));
CHECK_CUDA_ERROR(cudaMalloc(&B, K * N * sizeof(float)));
CHECK_CUDA_ERROR(cudaMalloc(&C, M * N * sizeof(float)));
```
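The GEMM calls themselves are not shown above. As a sketch, the $C = A^{\top}B$ case might look like the following fragment, where `handle` is an existing `cublasHandle_t` and `CHECK_CUBLAS_ERROR` is a hypothetical error-checking macro analogous to `CHECK_CUDA_ERROR`; note that cuBLAS assumes column-major storage, so the four variants differ only in the `transa`/`transb` arguments and leading dimensions.

```cpp
// Fragment: C = A^T * B via cuBLAS (column-major convention).
// A is stored as (K, M), B as (K, N), C as (M, N), all on the device.
float const alpha{1.0f};
float const beta{0.0f};
CHECK_CUBLAS_ERROR(cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                               M, N, K, &alpha,
                               A, K,   // lda = K: A stored as (K, M)
                               B, K,   // ldb = K: B stored as (K, N)
                               &beta,
                               C, M)); // ldc = M: C stored as (M, N)
```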
With the highly optimized implementations, there is almost no difference between the four options.
```
$ nvcc cublas_mm.cu -o cublas_mm -lcublas
$ ./cublas_mm
C = A * B Latency: 0.008 ms
C = A^T * B Latency: 0.010 ms
C = A * B^T Latency: 0.009 ms
C = A^T * B^T Latency: 0.008 ms
```