Row-Major VS Column-Major

Introduction

In computing, row-major order and column-major order are two ways of storing multidimensional arrays in linear storage such as random access memory. The following figure demonstrates row-major order and column-major order for a 2D matrix.

Row-Major VS Column-Major

In this blog post, I would like to discuss the difference between row-major order and column-major order, and their consequences for matrix multiplication performance.

Row-Major VS Column-Major

Given a matrix $A$ of shape $(M, N)$, if it is stored in row-major order, its leading dimension, i.e., the stride in memory between the beginnings of two consecutive rows, is $N$; if it is stored in column-major order, its leading dimension, i.e., the stride between the beginnings of two consecutive columns, is $M$.

To read $A^{\top}$ from the same piece of memory in which $A$ was stored in row-major order with a leading dimension of $N$, we could simply treat the memory as if it stored $A^{\top}$ in column-major order, with the leading dimension still being $N$.

To read $A^{\top}$ from the same piece of memory in which $A$ was stored in column-major order with a leading dimension of $M$, we could simply treat the memory as if it stored $A^{\top}$ in row-major order, with the leading dimension still being $M$.
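
As a minimal sketch (the helper functions below are hypothetical, written just for this post), reinterpreting the same buffer only changes how the flat index is computed from the leading dimension:

#include <cstdint>

// Element (i, j) of a matrix stored in row-major order with leading dimension ld.
inline float read_row_major(float const* data, uint32_t i, uint32_t j, uint32_t ld)
{
    return data[i * ld + j];
}

// Element (i, j) of a matrix stored in column-major order with leading dimension ld.
inline float read_col_major(float const* data, uint32_t i, uint32_t j, uint32_t ld)
{
    return data[j * ld + i];
}

// If data holds A of shape (M, N) in row-major order with ld = N, then
// read_col_major(data, j, i, N) == read_row_major(data, i, j, N), i.e., the same
// buffer read as a column-major (N, M) matrix with ld = N yields A^T.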

For example, we have a matrix $A$,

$$
\begin{align}
A &=
\begin{bmatrix}
1 & 2 & 3 \\
4 & 5 & 6 \\
\end{bmatrix} \\
\end{align}
$$

If $A$ is stored in row-major order, the matrix values in linear memory are $[1, 2, 3, 4, 5, 6]$.
If $A$ is stored in column-major order, the matrix values in linear memory are $[1, 4, 2, 5, 3, 6]$.

The transpose of $A$, $A^{\top}$, is

$$
\begin{align}
A^{\top} &=
\begin{bmatrix}
1 & 4 \\
2 & 5 \\
3 & 6 \\
\end{bmatrix} \\
\end{align}
$$

If $A^{\top}$ is stored in row-major order, the matrix values in linear memory are $[1, 4, 2, 5, 3, 6]$.
If $A^{\top}$ is stored in column-major order, the matrix values in linear memory are $[1, 2, 3, 4, 5, 6]$.

It’s easy to see that storing $A$ in row-major order produces exactly the same memory layout as storing $A^{\top}$ in column-major order, and storing $A$ in column-major order produces exactly the same memory layout as storing $A^{\top}$ in row-major order.
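
As a quick sanity check, the following minimal C++ snippet (written for this post) prints both layouts of the example matrix:

#include <cstdint>
#include <iostream>

int main()
{
    constexpr uint32_t M{2};
    constexpr uint32_t N{3};
    // The example matrix A = [[1, 2, 3], [4, 5, 6]].
    float const A[M][N]{{1, 2, 3}, {4, 5, 6}};

    // Row-major layout: iterate over rows first, then columns.
    std::cout << "Row-major: ";
    for (uint32_t i{0}; i < M; ++i)
        for (uint32_t j{0}; j < N; ++j)
            std::cout << A[i][j] << " ";
    std::cout << std::endl;

    // Column-major layout: iterate over columns first, then rows.
    std::cout << "Column-major: ";
    for (uint32_t j{0}; j < N; ++j)
        for (uint32_t i{0}; i < M; ++i)
            std::cout << A[i][j] << " ";
    std::cout << std::endl;
    // Prints 1 2 3 4 5 6 and 1 4 2 5 3 6, matching the layouts above.
}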

For a matrix $A$ stored in row-major order, reading rows of $A$ and reading columns of $A^{\top}$ are fast and cache-friendly, whereas reading columns of $A$ and reading rows of $A^{\top}$ are slow and cache-unfriendly because the accesses are strided.

For a matrix $A$ stored in column-major order, reading columns of $A$ and reading rows of $A^{\top}$ are fast and cache-friendly, whereas reading rows of $A$ and reading columns of $A^{\top}$ are slow and cache-unfriendly because the accesses are strided.
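
To see the effect of contiguous versus strided access in isolation, here is a minimal single-threaded sketch (the sizes and names are arbitrary choices for this post) that sums a row-major matrix once row by row and once column by column; on typical hardware the row-wise traversal is noticeably faster:

#include <chrono>
#include <cstdint>
#include <iostream>
#include <vector>

int main()
{
    constexpr uint32_t M{4096};
    constexpr uint32_t N{4096};
    // A row-major (M, N) matrix filled with ones.
    std::vector<float> a(static_cast<size_t>(M) * N, 1.0f);

    auto time_sum = [&](bool row_wise)
    {
        float sum{0};
        auto const start{std::chrono::steady_clock::now()};
        if (row_wise)
        {
            for (uint32_t i{0}; i < M; ++i)
                for (uint32_t j{0}; j < N; ++j)
                    sum += a[i * N + j]; // contiguous, cache-friendly
        }
        else
        {
            for (uint32_t j{0}; j < N; ++j)
                for (uint32_t i{0}; i < M; ++i)
                    sum += a[i * N + j]; // strided, cache-unfriendly
        }
        auto const end{std::chrono::steady_clock::now()};
        std::cout << (row_wise ? "Row-wise: " : "Column-wise: ")
                  << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count()
                  << " ms (sum = " << sum << ")" << std::endl;
    };

    time_sum(true);
    time_sum(false);
}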

Matrix Multiplication

The way matrices are stored in memory affects the performance of matrix multiplication on many processors, such as CPUs and GPUs. Depending on whether the matrices need to be mathematically transposed for the multiplication, there are four ways of computing a matrix product: $C=AB$, $C=A^{\top}B$, $C=AB^{\top}$, and $C=A^{\top}B^{\top}$. Each of them performs better than the others for certain storage orderings of the matrices $A$ and $B$, even though the theoretical multiply-accumulate operations (MACs) of the four options are the same.

$C=AB$

Suppose a matrix $A$ is of shape $(M, K)$ and a matrix $B$ is of shape $(K, N)$. To compute $C = AB$, where $C$ is a matrix of shape $(M, N)$, each element of $C$ is the dot product of one row of size $K$ from the matrix $A$ and one column of size $K$ from the matrix $B$.
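
Concretely, for $1 \leq m \leq M$ and $1 \leq n \leq N$,

$$
\begin{align}
C_{m, n} &= \sum_{k = 1}^{K} A_{m, k} B_{k, n} \\
\end{align}
$$

so the inner loop over $k$ walks along the $m$-th row of $A$ and the $n$-th column of $B$.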

Depending on the storage ordering of the two matrices, there are four scenarios.

| Matrix $A$ Storage Order | Matrix $B$ Storage Order | Matrix $A$ Row Reading | Matrix $B$ Column Reading |
|--------------------------|--------------------------|------------------------|---------------------------|
| column-major             | column-major             | slow                   | fast                      |
| column-major             | row-major                | slow                   | slow                      |
| row-major                | column-major             | fast                   | fast                      |
| row-major                | row-major                | fast                   | slow                      |

When $A$ is stored in row-major order and $B$ is stored in column-major order, reading both rows from $A$ and columns from $B$ is fast for matrix multiplication because of the caching mechanism of modern processors, and faster reads result in better performance given the same amount of math to compute.

Therefore, the matrix multiplication $C=AB$ is more suitable for the situation where $A$ is stored in row-major order and $B$ is stored in column-major order.

$C=A^{\top}B$

Suppose a matrix $A$ is of shape $(K, M)$ and a matrix $B$ is of shape $(K, N)$. To compute $C = A^{\top}B$, where $C$ is a matrix of shape $(M, N)$, each element of $C$ is the dot product of one column of size $K$ from the matrix $A$ and one column of size $K$ from the matrix $B$.
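
Concretely,

$$
\begin{align}
C_{m, n} &= \sum_{k = 1}^{K} A_{k, m} B_{k, n} \\
\end{align}
$$

so the inner loop over $k$ walks along the $m$-th column of $A$ and the $n$-th column of $B$.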

Depending on the storage ordering of the two matrices, there are four scenarios.

| Matrix $A$ Storage Order | Matrix $B$ Storage Order | Matrix $A$ Column Reading | Matrix $B$ Column Reading |
|--------------------------|--------------------------|---------------------------|---------------------------|
| column-major             | column-major             | fast                      | fast                      |
| column-major             | row-major                | fast                      | slow                      |
| row-major                | column-major             | slow                      | fast                      |
| row-major                | row-major                | slow                      | slow                      |

When both $A$ and $B$ are stored in column-major order, reading columns from $A$ and columns from $B$ is fast for matrix multiplication because of the caching mechanism of modern processors, and faster reads result in better performance given the same amount of math to compute.

Therefore, the matrix multiplication $C=A^{\top}B$ is more suitable for the situation where $A$ is stored in column-major order and $B$ is stored in column-major order.

$C=AB^{\top}$

Suppose a matrix $A$ is of shape $(M, K)$ and a matrix $B$ is of shape $(N, K)$. To compute $C = AB^{\top}$, where $C$ is a matrix of shape $(M, N)$, each element of $C$ is the dot product of one row of size $K$ from the matrix $A$ and one row of size $K$ from the matrix $B$.
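
Concretely,

$$
\begin{align}
C_{m, n} &= \sum_{k = 1}^{K} A_{m, k} B_{n, k} \\
\end{align}
$$

so the inner loop over $k$ walks along the $m$-th row of $A$ and the $n$-th row of $B$.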

Depending on the storage ordering of the two matrices, there are four scenarios.

| Matrix $A$ Storage Order | Matrix $B$ Storage Order | Matrix $A$ Row Reading | Matrix $B$ Row Reading |
|--------------------------|--------------------------|------------------------|------------------------|
| column-major             | column-major             | slow                   | slow                   |
| column-major             | row-major                | slow                   | fast                   |
| row-major                | column-major             | fast                   | slow                   |
| row-major                | row-major                | fast                   | fast                   |

When both $A$ and $B$ are stored in row-major order, reading rows from $A$ and rows from $B$ is fast for matrix multiplication because of the caching mechanism of modern processors, and faster reads result in better performance given the same amount of math to compute.

Therefore, the matrix multiplication $C=AB^{\top}$ is more suitable for the situation where $A$ is stored in row-major order and $B$ is stored in row-major order.

$C=A^{\top}B^{\top}$

Suppose a matrix $A$ is of shape $(K, M)$ and a matrix $B$ is of shape $(N, K)$. To compute $C = A^{\top}B^{\top}$, where $C$ is a matrix of shape $(M, N)$, each element of $C$ is the dot product of one column of size $K$ from the matrix $A$ and one row of size $K$ from the matrix $B$.
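
Concretely,

$$
\begin{align}
C_{m, n} &= \sum_{k = 1}^{K} A_{k, m} B_{n, k} \\
\end{align}
$$

so the inner loop over $k$ walks along the $m$-th column of $A$ and the $n$-th row of $B$.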

Depending on the storage ordering of the two matrices, there are four scenarios.

| Matrix $A$ Storage Order | Matrix $B$ Storage Order | Matrix $A$ Column Reading | Matrix $B$ Row Reading |
|--------------------------|--------------------------|---------------------------|------------------------|
| column-major             | column-major             | fast                      | slow                   |
| column-major             | row-major                | fast                      | fast                   |
| row-major                | column-major             | slow                      | slow                   |
| row-major                | row-major                | slow                      | fast                   |

When $A$ is stored in column-major order and $B$ is stored in row-major order, reading columns from $A$ and rows from $B$ is fast for matrix multiplication because of the caching mechanism of modern processors, and faster reads result in better performance given the same amount of math to compute.

Therefore, the matrix multiplication $C=A^{\top}B^{\top}$ is more suitable for the situation where $A$ is stored in column-major order and $B$ is stored in row-major order.

Matrix Multiplication Preference

The matrix multiplication preference for different combinations of the storage orders of the matrices being multiplied can be summarized as follows.

| Matrix $A$ Storage Order | Matrix $B$ Storage Order | Matrix Multiplication Preference |
|--------------------------|--------------------------|----------------------------------|
| column-major             | column-major             | $C = A^{\top}B$                  |
| column-major             | row-major                | $C = A^{\top}B^{\top}$           |
| row-major                | column-major             | $C = AB$                         |
| row-major                | row-major                | $C = AB^{\top}$                  |

Because all the matrices in one software framework usually share the same storage order, in practice only $C = A^{\top}B$ (for column-major frameworks) and $C = AB^{\top}$ (for row-major frameworks) are the preferred options.

Optimizations can reduce the performance gap between the optimal matrix multiplication option and the other options, sometimes even to almost zero, depending on the implementation and the processor.

In addition, it's often not a good idea to physically transpose a matrix in memory just to use the most performant of the four matrix multiplication options, because the overhead of transposing the matrix might be much larger than the performance difference between the four options, especially when they are all well optimized.
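
To make that overhead concrete, a naive out-of-place transpose is itself a full, strided pass over the matrix. The sketch below (written for this post, not taken from any library) shows the access pattern:

#include <cstdint>

// Naive out-of-place transpose: reads A (row-major, M x N) and writes A^T (row-major, N x M).
// Either the reads or the writes are necessarily strided, so the pass is memory-bound.
void transpose(float const* A, float* At, uint32_t M, uint32_t N)
{
    for (uint32_t i{0}; i < M; ++i)
    {
        for (uint32_t j{0}; j < N; ++j)
        {
            At[j * M + i] = A[i * N + j];
        }
    }
}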

Matrix Multiplication Benchmarks

Additionally, we could verify our analysis using a naive single-threaded C++ implementation of matrix multiplication.

naive_mm.cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <functional>
#include <iomanip>
#include <iostream>
#include <tuple>
#include <utility>
#include <vector>

template <class T>
float measure_performance(std::function<T(void)> bound_function,
                          int num_repeats = 100, int num_warmups = 100)
{
    for (int i{0}; i < num_warmups; ++i)
    {
        bound_function();
    }

    std::chrono::steady_clock::time_point time_start{
        std::chrono::steady_clock::now()};
    for (int i{0}; i < num_repeats; ++i)
    {
        bound_function();
    }
    std::chrono::steady_clock::time_point time_end{
        std::chrono::steady_clock::now()};

    auto time_elapsed{std::chrono::duration_cast<std::chrono::milliseconds>(
                          time_end - time_start)
                          .count()};
    float latency{time_elapsed / static_cast<float>(num_repeats)};

    return latency;
}

// A and B are column-major matrices.
template <typename T>
void mm_a_col_major_b_col_major(T const* A, T const* B, T* C, uint32_t m,
                                uint32_t n, uint32_t k, uint32_t lda,
                                uint32_t ldb, uint32_t ldc, bool is_A_transpose,
                                bool is_B_transpose)
{
    for (uint32_t ni{0}; ni < n; ++ni)
    {
        for (uint32_t mi{0}; mi < m; ++mi)
        {
            // Compute C[mi, ni]
            T accum{0};
            // A * B
            if ((!is_A_transpose) && (!is_B_transpose))
            {
                for (uint32_t ki{0}; ki < k; ++ki)
                {
                    // A[mi, ki] * B[ki, ni]
                    accum += A[ki * lda + mi] * B[ni * ldb + ki];
                }
            }
            // A^T * B
            else if ((is_A_transpose) && (!is_B_transpose))
            {
                for (uint32_t ki{0}; ki < k; ++ki)
                {
                    // A[ki, mi] * B[ki, ni]
                    accum += A[mi * lda + ki] * B[ni * ldb + ki];
                }
            }
            // A * B^T
            else if ((!is_A_transpose) && (is_B_transpose))
            {
                for (uint32_t ki{0}; ki < k; ++ki)
                {
                    // A[mi, ki] * B[ni, ki]
                    accum += A[ki * lda + mi] * B[ki * ldb + ni];
                }
            }
            // A^T * B^T
            else
            {
                for (uint32_t ki{0}; ki < k; ++ki)
                {
                    // A[ki, mi] * B[ni, ki]
                    accum += A[mi * lda + ki] * B[ki * ldb + ni];
                }
            }
            C[ni * ldc + mi] = accum;
        }
    }
}

void print_latency(float latency)
{
    std::cout << std::fixed << std::setprecision(3) << "Latency: " << latency
              << " ms" << std::endl;
}

int main()
{
    constexpr uint32_t num_repeats{10};
    constexpr uint32_t num_warmups{10};

    constexpr uint32_t M{256};
    constexpr uint32_t K{256};
    constexpr uint32_t N{256};

    std::vector<float> matrix_a(M * K);
    std::vector<float> matrix_b(K * N);
    std::vector<float> matrix_c(M * N);

    float const* A{matrix_a.data()};
    float const* B{matrix_b.data()};
    float* C{matrix_c.data()};

    uint32_t const matrix_a_col_major_ld{M};
    uint32_t const matrix_a_row_major_ld{K};
    uint32_t const matrix_a_transpose_col_major_ld{matrix_a_row_major_ld};
    uint32_t const matrix_a_transpose_row_major_ld{matrix_a_col_major_ld};

    uint32_t const matrix_b_col_major_ld{K};
    uint32_t const matrix_b_row_major_ld{N};
    uint32_t const matrix_b_transpose_col_major_ld{matrix_b_row_major_ld};
    uint32_t const matrix_b_transpose_row_major_ld{matrix_b_col_major_ld};

    uint32_t const matrix_c_col_major_ld{M};
    uint32_t const matrix_c_row_major_ld{N};
    uint32_t const matrix_c_transpose_col_major_ld{matrix_c_row_major_ld};
    uint32_t const matrix_c_transpose_row_major_ld{matrix_c_col_major_ld};

    std::function<void(void)> const mm_a_col_major_b_col_major_a_b{
        std::bind(mm_a_col_major_b_col_major<float>, A, B, C, M, N, K,
                  matrix_a_col_major_ld, matrix_b_col_major_ld,
                  matrix_c_col_major_ld, false, false)};

    std::function<void(void)> const mm_a_col_major_b_col_major_a_transpose_b{
        std::bind(mm_a_col_major_b_col_major<float>, A, B, C, M, N, K,
                  matrix_a_transpose_col_major_ld, matrix_b_col_major_ld,
                  matrix_c_col_major_ld, true, false)};

    std::function<void(void)> const
        mm_a_col_major_b_col_major_a_transpose_b_transpose{std::bind(
            mm_a_col_major_b_col_major<float>, A, B, C, M, N, K,
            matrix_a_transpose_col_major_ld, matrix_b_transpose_col_major_ld,
            matrix_c_col_major_ld, true, true)};

    std::function<void(void)> const mm_a_col_major_b_col_major_a_b_transpose{
        std::bind(mm_a_col_major_b_col_major<float>, A, B, C, M, N, K,
                  matrix_a_col_major_ld, matrix_b_transpose_col_major_ld,
                  matrix_c_col_major_ld, false, true)};

    std::cout << "C = A * B" << std::endl;
    float const latency_a_b = measure_performance(
        mm_a_col_major_b_col_major_a_b, num_repeats, num_warmups);
    print_latency(latency_a_b);

    std::cout << "C = A^T * B" << std::endl;
    float const latency_a_transpose_b = measure_performance(
        mm_a_col_major_b_col_major_a_transpose_b, num_repeats, num_warmups);
    print_latency(latency_a_transpose_b);

    std::cout << "C = A * B^T" << std::endl;
    float const latency_a_b_transpose = measure_performance(
        mm_a_col_major_b_col_major_a_b_transpose, num_repeats, num_warmups);
    print_latency(latency_a_b_transpose);

    std::cout << "C = A^T * B^T" << std::endl;
    float const latency_a_transpose_b_transpose =
        measure_performance(mm_a_col_major_b_col_major_a_transpose_b_transpose,
                            num_repeats, num_warmups);
    print_latency(latency_a_transpose_b_transpose);

    assert(latency_a_transpose_b ==
           std::min({latency_a_b, latency_a_transpose_b, latency_a_b_transpose,
                     latency_a_transpose_b_transpose}));
    assert(latency_a_b_transpose ==
           std::max({latency_a_b, latency_a_transpose_b, latency_a_b_transpose,
                     latency_a_transpose_b_transpose}));
}

We could see that, given that matrices $A$ and $B$ are stored in column-major order, the performance of $C = A^{\top}B$ is the best and the performance of $C = AB^{\top}$ is the worst, as expected.

$ g++ naive_mm.cpp -o naive_mm
$ ./naive_mm
C = A * B
Latency: 45.400 ms
C = A^T * B
Latency: 32.500 ms
C = A * B^T
Latency: 57.800 ms
C = A^T * B^T
Latency: 48.300 ms

Using multi-threaded, highly optimized matrix multiplication implementations, such as the GEMM functions from the cuBLAS library, can almost eliminate the difference between the four options.

cublas_mm.cu
#include <cassert>
#include <chrono>
#include <cstdint>
#include <functional>
#include <iomanip>
#include <iostream>
#include <tuple>
#include <utility>
#include <vector>

#include <cublas_v2.h>
#include <cuda_runtime.h>

#define CHECK_CUBLAS_ERROR(val) checkCuBlas((val), #val, __FILE__, __LINE__)
template <typename T>
void checkCuBlas(T err, const char* const func, const char* const file,
                 const int line)
{
    if (err != CUBLAS_STATUS_SUCCESS)
    {
        std::cerr << "cuBlas Runtime Error at: " << file << ":" << line
                  << std::endl;
        std::exit(EXIT_FAILURE);
    }
}

#define CHECK_CUDA_ERROR(val) checkCuda((val), #val, __FILE__, __LINE__)
template <typename T>
void checkCuda(T err, const char* const func, const char* const file,
               const int line)
{
    if (err != cudaSuccess)
    {
        std::cerr << "CUDA Runtime Error at: " << file << ":" << line
                  << std::endl;
        std::cerr << cudaGetErrorString(err) << " " << func << std::endl;
        std::exit(EXIT_FAILURE);
    }
}

#define CHECK_LAST_CUDA_ERROR() checkCudaLast(__FILE__, __LINE__)
void checkCudaLast(const char* const file, const int line)
{
    cudaError_t const err{cudaGetLastError()};
    if (err != cudaSuccess)
    {
        std::cerr << "CUDA Runtime Error at: " << file << ":" << line
                  << std::endl;
        std::cerr << cudaGetErrorString(err) << std::endl;
        std::exit(EXIT_FAILURE);
    }
}

float measure_cublas_performance(
    std::function<cublasStatus_t(void)> bound_cublas_function,
    cudaStream_t stream, int num_repeats = 100, int num_warmups = 100)
{
    cudaEvent_t start, stop;
    float time;

    CHECK_CUDA_ERROR(cudaEventCreate(&start));
    CHECK_CUDA_ERROR(cudaEventCreate(&stop));

    for (int i{0}; i < num_warmups; ++i)
    {
        CHECK_CUBLAS_ERROR(bound_cublas_function());
    }

    CHECK_CUDA_ERROR(cudaStreamSynchronize(stream));

    CHECK_CUDA_ERROR(cudaEventRecord(start, stream));
    for (int i{0}; i < num_repeats; ++i)
    {
        CHECK_CUBLAS_ERROR(bound_cublas_function());
    }
    CHECK_CUDA_ERROR(cudaEventRecord(stop, stream));
    CHECK_CUDA_ERROR(cudaEventSynchronize(stop));
    CHECK_LAST_CUDA_ERROR();
    CHECK_CUDA_ERROR(cudaEventElapsedTime(&time, start, stop));
    CHECK_CUDA_ERROR(cudaEventDestroy(start));
    CHECK_CUDA_ERROR(cudaEventDestroy(stop));

    float const latency{time / num_repeats};

    return latency;
}

void print_latency(float latency)
{
    std::cout << std::fixed << std::setprecision(3) << "Latency: " << latency
              << " ms" << std::endl;
}

int main()
{
    constexpr uint32_t num_repeats{100};
    constexpr uint32_t num_warmups{100};

    constexpr uint32_t M{256};
    constexpr uint32_t K{256};
    constexpr uint32_t N{256};

    float* A{nullptr};
    float* B{nullptr};
    float* C{nullptr};

    CHECK_CUDA_ERROR(cudaMalloc(&A, M * K * sizeof(float)));
    CHECK_CUDA_ERROR(cudaMalloc(&B, K * N * sizeof(float)));
    CHECK_CUDA_ERROR(cudaMalloc(&C, M * N * sizeof(float)));

    uint32_t const matrix_a_col_major_ld{M};
    uint32_t const matrix_a_row_major_ld{K};
    uint32_t const matrix_a_transpose_col_major_ld{matrix_a_row_major_ld};
    uint32_t const matrix_a_transpose_row_major_ld{matrix_a_col_major_ld};

    uint32_t const matrix_b_col_major_ld{K};
    uint32_t const matrix_b_row_major_ld{N};
    uint32_t const matrix_b_transpose_col_major_ld{matrix_b_row_major_ld};
    uint32_t const matrix_b_transpose_row_major_ld{matrix_b_col_major_ld};

    uint32_t const matrix_c_col_major_ld{M};
    uint32_t const matrix_c_row_major_ld{N};
    uint32_t const matrix_c_transpose_col_major_ld{matrix_c_row_major_ld};
    uint32_t const matrix_c_transpose_row_major_ld{matrix_c_col_major_ld};

    cublasHandle_t cublas_handle;
    cudaStream_t stream;

    CHECK_CUDA_ERROR(cudaStreamCreate(&stream));
    CHECK_CUBLAS_ERROR(cublasCreate(&cublas_handle));
    CHECK_CUBLAS_ERROR(cublasSetStream(cublas_handle, stream));

    float const alpha{1.0};
    float const beta{0.0};

    // cublasSgemm assumes column-major matrices.
    std::function<cublasStatus_t(void)> const mm_a_col_major_b_col_major_a_b{
        std::bind(cublasSgemm, cublas_handle, CUBLAS_OP_N, CUBLAS_OP_N, M, N, K,
                  &alpha, A, matrix_a_col_major_ld, B, matrix_b_col_major_ld,
                  &beta, C, matrix_c_col_major_ld)};

    std::function<cublasStatus_t(void)> const
        mm_a_col_major_b_col_major_a_transpose_b{
            std::bind(cublasSgemm, cublas_handle, CUBLAS_OP_T, CUBLAS_OP_N, M,
                      N, K, &alpha, A, matrix_a_transpose_col_major_ld, B,
                      matrix_b_col_major_ld, &beta, C, matrix_c_col_major_ld)};

    std::function<cublasStatus_t(void)> const
        mm_a_col_major_b_col_major_a_transpose_b_transpose{std::bind(
            cublasSgemm, cublas_handle, CUBLAS_OP_T, CUBLAS_OP_T, M, N, K,
            &alpha, A, matrix_a_transpose_col_major_ld, B,
            matrix_b_transpose_col_major_ld, &beta, C, matrix_c_col_major_ld)};

    std::function<cublasStatus_t(void)> const
        mm_a_col_major_b_col_major_a_b_transpose{std::bind(
            cublasSgemm, cublas_handle, CUBLAS_OP_N, CUBLAS_OP_T, M, N, K,
            &alpha, A, matrix_a_col_major_ld, B,
            matrix_b_transpose_col_major_ld, &beta, C, matrix_c_col_major_ld)};

    std::cout << "C = A * B" << std::endl;
    float const latency_a_b = measure_cublas_performance(
        mm_a_col_major_b_col_major_a_b, stream, num_repeats, num_warmups);
    print_latency(latency_a_b);

    std::cout << "C = A^T * B" << std::endl;
    float const latency_a_transpose_b =
        measure_cublas_performance(mm_a_col_major_b_col_major_a_transpose_b,
                                   stream, num_repeats, num_warmups);
    print_latency(latency_a_transpose_b);

    std::cout << "C = A * B^T" << std::endl;
    float const latency_a_b_transpose =
        measure_cublas_performance(mm_a_col_major_b_col_major_a_b_transpose,
                                   stream, num_repeats, num_warmups);
    print_latency(latency_a_b_transpose);

    std::cout << "C = A^T * B^T" << std::endl;
    float const latency_a_transpose_b_transpose = measure_cublas_performance(
        mm_a_col_major_b_col_major_a_transpose_b_transpose, stream, num_repeats,
        num_warmups);
    print_latency(latency_a_transpose_b_transpose);

    CHECK_CUDA_ERROR(cudaFree(A));
    CHECK_CUDA_ERROR(cudaFree(B));
    CHECK_CUDA_ERROR(cudaFree(C));
    CHECK_CUBLAS_ERROR(cublasDestroy(cublas_handle));
    CHECK_CUDA_ERROR(cudaStreamDestroy(stream));
}

With the highly optimized implementations, there is almost no difference between the four options.

$ nvcc cublas_mm.cu -o cublas_mm -lcublas
$ ./cublas_mm
C = A * B
Latency: 0.008 ms
C = A^T * B
Latency: 0.010 ms
C = A * B^T
Latency: 0.009 ms
C = A^T * B^T
Latency: 0.008 ms
