CUDA Zero Copy Mapped Memory

Introduction

Unified memory is used on NVIDIA embedded platforms, such as the NVIDIA Drive series and the NVIDIA Jetson series. Since the same physical memory is used by both the CPU and the integrated GPU, it is possible to eliminate the CUDA memory copy between host and device that normally happens on a system that uses a discrete GPU, so that the GPU can directly access the outputs from the CPU and the CPU can directly access the outputs from the GPU. In this way, the system performance can be improved significantly in some use cases.

In this blog post, I would like to discuss CUDA mapped pinned memory versus CUDA non-mapped pinned memory and compare their performance on memory-bound kernels.

CUDA Pinned Mapped Memory

CUDA pinned mapped memory enables GPU threads to directly access host memory. For this purpose, it requires mapped pinned (non-pageable, page-locked) memory. On integrated GPUs (i.e., GPUs with the integrated field of the CUDA device properties structure set to 1), mapped pinned memory is always a performance gain because it avoids superfluous copies as integrated GPU and CPU memory are physically the same. On discrete GPUs, mapped pinned memory is advantageous only in certain cases. Because the data is not cached on the GPU, mapped pinned memory should be read or written only once, and the global loads and stores that read and write the memory should be coalesced. Zero copy can be used in place of streams because kernel-originated data transfers automatically overlap kernel execution without the overhead of setting up and determining the optimal number of streams.
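
As a minimal sketch (with error checking omitted for brevity, and with the buffer size and variable names chosen purely for illustration), querying these device properties and allocating mapped pinned memory might look as follows:

#include <cuda_runtime.h>
#include <iostream>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // prop.integrated is 1 on integrated GPUs that physically share memory with the CPU.
    // prop.canMapHostMemory is 1 if the device can map host memory into its address space.
    std::cout << "integrated: " << prop.integrated
              << " canMapHostMemory: " << prop.canMapHostMemory << std::endl;

    // Allocate pinned (page-locked) host memory that is also mapped into the device
    // address space. On some older platforms, cudaSetDeviceFlags(cudaDeviceMapHost)
    // may need to be called before this.
    float* h_data{nullptr};
    cudaHostAlloc(&h_data, 1024 * sizeof(float), cudaHostAllocMapped);

    // Obtain the device pointer that GPU kernels can dereference directly,
    // without any explicit memory copy between host and device.
    float* d_data{nullptr};
    cudaHostGetDevicePointer(&d_data, h_data, 0);

    // ... launch kernels that read from or write to d_data ...

    cudaFreeHost(h_data);
    return 0;
}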

CUDA Pinned Memory Non-Mapped VS Mapped

The following implementation compares the latency of a memory-bound kernel, including the memory copies between host and device when they are necessary.

CUDA mapped memory also uses pinned memory. For CUDA non-mapped pinned memory, we still need to allocate device memory and transfer the data between the host memory and the device memory, whereas for CUDA mapped memory, the device memory allocation and the memory transfer, if there is any, are abstracted away.

mapped_memory.cu
#include <cassert>
#include <chrono>
#include <functional>
#include <iomanip>
#include <iostream>
#include <stdexcept>
#include <thread>
#include <tuple>
#include <utility>
#include <vector>

#include <cuda_runtime.h>

// Error checking helpers for CUDA runtime API calls and kernel launches.
#define CHECK_CUDA_ERROR(val) check((val), #val, __FILE__, __LINE__)
void check(cudaError_t err, const char* const func, const char* const file,
           const int line)
{
    if (err != cudaSuccess)
    {
        std::cerr << "CUDA Runtime Error at: " << file << ":" << line
                  << std::endl;
        std::cerr << cudaGetErrorString(err) << " " << func << std::endl;
        std::exit(EXIT_FAILURE);
    }
}

#define CHECK_LAST_CUDA_ERROR() checkLast(__FILE__, __LINE__)
void checkLast(const char* const file, const int line)
{
    cudaError_t const err{cudaGetLastError()};
    if (err != cudaSuccess)
    {
        std::cerr << "CUDA Runtime Error at: " << file << ":" << line
                  << std::endl;
        std::cerr << cudaGetErrorString(err) << std::endl;
        std::exit(EXIT_FAILURE);
    }
}

// Measure the mean latency of a stream-bound callable using CUDA events.
template <class T>
float measure_performance(std::function<T(cudaStream_t)> bound_function,
                          cudaStream_t stream, int num_repeats = 100,
                          int num_warmups = 100)
{
    cudaEvent_t start, stop;
    float time;

    CHECK_CUDA_ERROR(cudaEventCreate(&start));
    CHECK_CUDA_ERROR(cudaEventCreate(&stop));

    for (int i{0}; i < num_warmups; ++i)
    {
        bound_function(stream);
    }

    CHECK_CUDA_ERROR(cudaStreamSynchronize(stream));

    CHECK_CUDA_ERROR(cudaEventRecord(start, stream));
    for (int i{0}; i < num_repeats; ++i)
    {
        bound_function(stream);
    }
    CHECK_CUDA_ERROR(cudaEventRecord(stop, stream));
    CHECK_CUDA_ERROR(cudaEventSynchronize(stop));
    CHECK_LAST_CUDA_ERROR();
    CHECK_CUDA_ERROR(cudaEventElapsedTime(&time, start, stop));
    CHECK_CUDA_ERROR(cudaEventDestroy(start));
    CHECK_CUDA_ERROR(cudaEventDestroy(stop));

    float const latency{time / num_repeats};

    return latency;
}

// Element-wise float addition using a grid-stride loop.
__global__ void float_addition(float* output, float const* input_1,
                               float const* input_2, uint32_t n)
{
    const uint32_t idx{blockDim.x * blockIdx.x + threadIdx.x};
    const uint32_t stride{blockDim.x * gridDim.x};
    for (uint32_t i{idx}; i < n; i += stride)
    {
        output[i] = input_1[i] + input_2[i];
    }
}

// Non-mapped pinned memory path: explicit host-to-device copies,
// kernel launch, and device-to-host copy.
void launch_float_addition_non_mapped_pinned_memory(
    float* h_output, float const* h_input_1, float const* h_input_2,
    float* d_output, float* d_input_1, float* d_input_2, uint32_t n,
    cudaStream_t stream)
{
    CHECK_CUDA_ERROR(cudaMemcpyAsync(d_input_1, h_input_1, n * sizeof(float),
                                     cudaMemcpyHostToDevice, stream));
    CHECK_CUDA_ERROR(cudaMemcpyAsync(d_input_2, h_input_2, n * sizeof(float),
                                     cudaMemcpyHostToDevice, stream));
    dim3 const threads_per_block{1024};
    dim3 const blocks_per_grid{32};
    float_addition<<<blocks_per_grid, threads_per_block, 0, stream>>>(
        d_output, d_input_1, d_input_2, n);
    CHECK_LAST_CUDA_ERROR();
    CHECK_CUDA_ERROR(cudaMemcpyAsync(h_output, d_output, n * sizeof(float),
                                     cudaMemcpyDeviceToHost, stream));
}

// Mapped pinned memory path: the kernel dereferences the mapped host memory
// directly, so no explicit memory copies are needed.
void launch_float_addition_mapped_pinned_memory(float* d_output,
                                                float* d_input_1,
                                                float* d_input_2, uint32_t n,
                                                cudaStream_t stream)
{
    dim3 const threads_per_block{1024};
    dim3 const blocks_per_grid{32};
    float_addition<<<blocks_per_grid, threads_per_block, 0, stream>>>(
        d_output, d_input_1, d_input_2, n);
    CHECK_LAST_CUDA_ERROR();
}

void initialize_host_memory(float* h_buffer, uint32_t n, float value)
{
    for (uint32_t i{0}; i < n; ++i)
    {
        h_buffer[i] = value;
    }
}

bool verify_host_memory(float* h_buffer, uint32_t n, float value)
{
    for (uint32_t i{0}; i < n; ++i)
    {
        if (h_buffer[i] != value)
        {
            return false;
        }
    }
    return true;
}

int main()
{
    constexpr int const num_repeats{10};
    constexpr int const num_warmups{10};

    constexpr int const n{1000000};
    cudaStream_t stream;
    CHECK_CUDA_ERROR(cudaStreamCreate(&stream));

    float const v_input_1{1.0f};
    float const v_input_2{1.0f};
    float const v_output{0.0f};
    float const v_output_reference{v_input_1 + v_input_2};

    cudaDeviceProp prop;
    CHECK_CUDA_ERROR(cudaGetDeviceProperties(&prop, 0));
    if (!prop.canMapHostMemory)
    {
        throw std::runtime_error{"Device does not support mapped memory."};
    }

    float *h_input_1, *h_input_2, *h_output;
    float *d_input_1, *d_input_2, *d_output;

    float *a_input_1, *a_input_2, *a_output;
    float *m_input_1, *m_input_2, *m_output;

    // Non-mapped pinned host buffers and their separate device buffers.
    CHECK_CUDA_ERROR(cudaMallocHost(&h_input_1, n * sizeof(float)));
    CHECK_CUDA_ERROR(cudaMallocHost(&h_input_2, n * sizeof(float)));
    CHECK_CUDA_ERROR(cudaMallocHost(&h_output, n * sizeof(float)));

    CHECK_CUDA_ERROR(cudaMalloc(&d_input_1, n * sizeof(float)));
    CHECK_CUDA_ERROR(cudaMalloc(&d_input_2, n * sizeof(float)));
    CHECK_CUDA_ERROR(cudaMalloc(&d_output, n * sizeof(float)));

    // Mapped pinned host buffers and the corresponding device pointers.
    CHECK_CUDA_ERROR(
        cudaHostAlloc(&a_input_1, n * sizeof(float), cudaHostAllocMapped));
    CHECK_CUDA_ERROR(
        cudaHostAlloc(&a_input_2, n * sizeof(float), cudaHostAllocMapped));
    CHECK_CUDA_ERROR(
        cudaHostAlloc(&a_output, n * sizeof(float), cudaHostAllocMapped));

    CHECK_CUDA_ERROR(cudaHostGetDevicePointer(&m_input_1, a_input_1, 0));
    CHECK_CUDA_ERROR(cudaHostGetDevicePointer(&m_input_2, a_input_2, 0));
    CHECK_CUDA_ERROR(cudaHostGetDevicePointer(&m_output, a_output, 0));

    // Verify the implementation correctness.
    initialize_host_memory(h_input_1, n, v_input_1);
    initialize_host_memory(h_input_2, n, v_input_2);
    initialize_host_memory(h_output, n, v_output);
    launch_float_addition_non_mapped_pinned_memory(
        h_output, h_input_1, h_input_2, d_output, d_input_1, d_input_2, n,
        stream);
    CHECK_CUDA_ERROR(cudaStreamSynchronize(stream));
    assert(verify_host_memory(h_output, n, v_output_reference));

    initialize_host_memory(a_input_1, n, v_input_1);
    initialize_host_memory(a_input_2, n, v_input_2);
    initialize_host_memory(a_output, n, v_output);
    launch_float_addition_mapped_pinned_memory(m_output, m_input_1, m_input_2,
                                               n, stream);
    CHECK_CUDA_ERROR(cudaStreamSynchronize(stream));
    assert(verify_host_memory(a_output, n, v_output_reference));

    // Measure latencies.
    std::function<void(cudaStream_t)> function_non_mapped_pinned_memory{
        std::bind(launch_float_addition_non_mapped_pinned_memory, h_output,
                  h_input_1, h_input_2, d_output, d_input_1, d_input_2, n,
                  std::placeholders::_1)};
    std::function<void(cudaStream_t)> function_mapped_pinned_memory{
        std::bind(launch_float_addition_mapped_pinned_memory, m_output,
                  m_input_1, m_input_2, n, std::placeholders::_1)};
    float const latency_non_mapped_pinned_memory{measure_performance(
        function_non_mapped_pinned_memory, stream, num_repeats, num_warmups)};
    float const latency_mapped_pinned_memory{measure_performance(
        function_mapped_pinned_memory, stream, num_repeats, num_warmups)};
    std::cout << std::fixed << std::setprecision(3)
              << "CUDA Kernel With Non-Mapped Pinned Memory Latency: "
              << latency_non_mapped_pinned_memory << " ms" << std::endl;
    std::cout << std::fixed << std::setprecision(3)
              << "CUDA Kernel With Mapped Pinned Memory Latency: "
              << latency_mapped_pinned_memory << " ms" << std::endl;

    CHECK_CUDA_ERROR(cudaFree(d_input_1));
    CHECK_CUDA_ERROR(cudaFree(d_input_2));
    CHECK_CUDA_ERROR(cudaFree(d_output));
    CHECK_CUDA_ERROR(cudaFreeHost(h_input_1));
    CHECK_CUDA_ERROR(cudaFreeHost(h_input_2));
    CHECK_CUDA_ERROR(cudaFreeHost(h_output));
    CHECK_CUDA_ERROR(cudaFreeHost(a_input_1));
    CHECK_CUDA_ERROR(cudaFreeHost(a_input_2));
    CHECK_CUDA_ERROR(cudaFreeHost(a_output));
    CHECK_CUDA_ERROR(cudaStreamDestroy(stream));
}

Discrete GPU

This is the latency profiling on a desktop that has an Intel Core i9-9900K CPU and an NVIDIA RTX 3090 GPU.

$ nvcc mapped_memory.cu -o mapped_memory -std=c++14
$ ./mapped_memory
CUDA Kernel With Non-Mapped Pinned Memory Latency: 0.964 ms
CUDA Kernel With Mapped Pinned Memory Latency: 0.631 ms

We could see that for a memory-bound kernel, on a platform that uses a discrete GPU with separate host and device memory, using mapped pinned memory has roughly 35% lower latency (0.631 ms vs 0.964 ms) than using non-mapped pinned memory.

Integrated GPU

This is the latency profiling on an NVIDIA Jetson Xavier.

$ nvcc mapped_memory.cu -o mapped_memory -std=c++14
$ ./mapped_memory
CUDA Kernel With Non-Mapped Pinned Memory Latency: 2.343 ms
CUDA Kernel With Mapped Pinned Memory Latency: 0.431 ms

We could see that for a memory-bound kernel, on a platform that uses an integrated GPU and unified memory, using mapped pinned memory is more than 5x faster than using non-mapped pinned memory. This is because using mapped memory truly eliminates the memory copy between host and device on unified memory.
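
To confirm that the mapped memory path issues no explicit host-device memory copies, one could profile the executable, for example with NVIDIA Nsight Systems (assuming it is installed); the non-mapped path should show cudaMemcpyAsync transfers in the trace while the mapped path should show none.

$ nsys profile --stats=true ./mapped_memory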

Caveats

Because CUDA zero copy (mapped) memory is not cached on the GPU, kernels that access the same data more than once, such as math-bound kernels that reuse their inputs, might see a performance drop.
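
For illustration only (this hypothetical kernel is not part of the benchmark above), consider a kernel with data reuse across threads: neighboring threads read overlapping windows of the input. With regular device memory, the overlapping reads are largely served from the GPU caches; with zero copy mapped memory they are not cached, and on a discrete GPU each read would also cross PCIe, so such a kernel may run noticeably slower on mapped memory.

#include <cstdint>

// Hypothetical kernel with data reuse: each thread sums a window of neighboring
// elements, so nearby threads read overlapping data. With device memory these
// overlapping reads mostly hit the GPU caches; with zero copy mapped memory the
// data is not cached on the GPU, so the repeated reads become expensive.
__global__ void window_sum(float* output, float const* input, uint32_t n,
                           uint32_t window_size)
{
    uint32_t const idx{blockDim.x * blockIdx.x + threadIdx.x};
    uint32_t const stride{blockDim.x * gridDim.x};
    for (uint32_t i{idx}; i < n; i += stride)
    {
        float sum{0.0f};
        for (uint32_t j{0}; j < window_size && i + j < n; ++j)
        {
            sum += input[i + j];
        }
        output[i] = sum;
    }
}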

Author: Lei Mao

Posted on: 12-16-2022

Updated on: 12-16-2022
