Page Table for Page-Locked Host Memory

Introduction

In GPU programming, page-locked (pinned) host memory is a type of memory that is allocated on the CPU and is prevented from being paged out to disk. This allows for faster data transfer between the CPU and GPU, as the GPU can directly access the page-locked memory without the overhead of copying data to a temporary buffer.

In some large-scale GPU applications, especially those that involve multiple processes accessing the same page-locked memory, there can be significant GPU memory overhead due to the page tables that map the virtual addresses of the page-locked memory to physical addresses in RAM.

In this blog post, I would like to discuss the concept of page tables for page-locked host memory, the GPU memory overhead associated with it, and how to share page-locked host memory across multiple processes while minimizing GPU memory usage.

Virtual Memory

Computers don’t usually let programs access physical RAM directly. Instead, they give each program a “fake” or Virtual Address Space. Virtual memory is divided into fixed-size blocks called pages. Physical memory (RAM) is divided into blocks of the same size called frames.

It should be noted that contiguity in virtual memory and contiguity in physical memory are different concepts. A program may have a buffer that is allocated as contiguous pages in the virtual address space, but those pages may be mapped to non-contiguous frames in physical memory. For example, when we allocate memory using malloc in C or cudaMalloc in CUDA, the allocated memory is contiguous in the virtual address space, but it may not be contiguous in physical memory. The hardware memory management unit (MMU), using page tables maintained by the operating system, handles the mapping between virtual pages and physical frames, and it can map virtual pages to non-contiguous physical frames as needed to optimize memory usage and performance.

Page Size

The size of a page varies depending on the architecture and operating system. In a GPU system, there are both CPU pages and GPU pages. The CPU page size is typically 4 KB, while the GPU page size typically varies from 4 KB to 2 MB and is not necessarily constant during runtime.

A page of size 2 MB is called a “huge page”. Huge pages reduce the overhead of managing many small pages, which can improve performance for certain workloads. However, they can also increase memory fragmentation and are not always beneficial, depending on the workload and memory access patterns. In addition, the physical RAM backing a huge page must be allocated as contiguous frames, which can be difficult to achieve on a system with little free memory and high memory fragmentation.

Page Table

A page table is the “map” or data structure used by the operating system to translate those fake virtual addresses into the actual physical locations in your RAM. When a program wants to read data from “Virtual Address A,” the processor looks at the page table to find which “Physical Frame B” that data actually lives in.

Page tables also store “permissions.” They tell the processor whether a piece of memory is read-only, whether it can be executed (like code), or whether it is off-limits to certain programs. This is why a program receives a “Segmentation Fault” when it tries to access memory it should not.

In a GPU system, there are both CPU and GPU page tables. The CPU page table manages the virtual memory for the CPU, while the GPU page table manages the virtual memory for the GPU. The CPU page tables are highly optimized for latency, while the GPU page tables are optimized for throughput.

A page table entry (PTE) is a data structure that contains information about a single page of virtual memory. Each PTE typically takes 8 bytes (64 bits) and contains the following information:

  • Valid bit: Indicates whether the page is currently mapped to a physical frame.
  • Physical frame number: The address of the physical frame in RAM that the virtual page is mapped to.
  • Permissions: Information about whether the page is read-only, read-write, executable, etc.
  • Other flags: Additional information such as whether the page is dirty (has been modified), accessed, or if it is a large page.

The page table is process-specific. Each process has its own page table, even if in some cases, different processes may want to access the same physical memory and some entries in their page tables may point to the same physical frames.

Page-Locked Memory

Pageable memory is the default type of memory allocated on the host by the operating system, and it can be paged out to disk when the system is under memory pressure. Page-locked memory, also known as pinned memory, is a type of memory that is allocated on the host and is “locked” in place, meaning it cannot be paged out to disk. This ensures that the data in page-locked memory remains in RAM and can be accessed quickly by the GPU.

To truly achieve asynchronous data transfer between the CPU and GPU, page-locked memory must be used. If pageable memory is used in asynchronous data transfer, the transfer will not be truly asynchronous, because the CPU must first copy the data from pageable memory to a temporary page-locked buffer before transferring it to the GPU. This introduces additional overhead, prevents true asynchronous data transfer, and can result in lower performance.

Page Table for Page-Locked Host Memory

To perform data transfer between CPU host memory and GPU device memory, because both use virtual memory addresses, there must be a page table that maps the virtual addresses of the page-locked memory to the physical addresses in RAM. This page table, sometimes referred to as the “page-locked memory page table”, is used by the GPU to access the data in page-locked memory during data transfer operations and it is located in the GPU memory.

Allocating too much page-locked memory reduces the memory available to other applications and to the operating system, which can lead to performance degradation and system instability. However, even on an advanced system with a huge amount of RAM, allocating a large amount of page-locked memory still has a critical downside for the GPU. The page-locked memory page table also resides in GPU memory, and if too much page-locked memory is allocated, the page table can become very large and consume a significant portion of the GPU’s memory, reducing the GPU memory available for other applications.

Nowadays, some very advanced host systems can have 16 TB of CPU memory, but GPU memory is still limited to a few hundred GB. For example, an NVIDIA DGX B200 system has 8 Blackwell GPUs, each with about 180 GB of usable GPU memory. Suppose I allocated a chunk of 1 TB of page-locked memory on the host; obviously no single GPU can use all the data in that page-locked memory at once.

In the best scenario, the page size is 2 MB, so the page-locked memory page table has 1 TB / 2 MB = 512K entries; each entry takes 8 bytes, so the page table takes 512K × 8 bytes = 4 MB of GPU memory. In the worst scenario, however, the page size is 4 KB, so the page-locked memory page table has 1 TB / 4 KB = 256M entries, and the table takes 256M × 8 bytes = 2 GB of GPU memory. On a GPU with less memory, for example an H100 with 80 GB or an A100 with 40 GB, the GPU memory consumed by the page-locked memory page table is not negligible.

Things become even worse if an application has multiple processes and each process tries to access the same page-locked memory. Because each process has its own page table, the page-locked memory page table is duplicated for each process, consuming even more GPU memory, even though the page-locked memory page tables are almost exactly the same.

For example, a straightforward approach to share pinned memory across processes is to use POSIX shared memory combined with cudaHostRegister. One process creates the shared memory region and all processes map it into their address spaces and register it with CUDA. However, this will result in duplicated GPU page table entries because each process registers the same physical memory independently.

creator.cpp
#include <cuda_runtime.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <semaphore.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

#define SHM_NAME "/shared_pinned_memory"
#define SEM_READY "/shm_ready"
#define SEM_DONE "/shm_done"
#define BUFFER_SIZE (1ULL << 30) // 1 GB

int main()
{
    // Create synchronization semaphores.
    // sem_ready: creator signals workers that the shared memory is ready.
    // sem_done: worker signals creator that it has finished using the memory.
    sem_t* sem_ready = sem_open(SEM_READY, O_CREAT | O_EXCL, 0666, 0);
    sem_t* sem_done = sem_open(SEM_DONE, O_CREAT | O_EXCL, 0666, 0);

    // Create and size the POSIX shared memory object.
    int shm_fd = shm_open(SHM_NAME, O_CREAT | O_RDWR, 0666);
    ftruncate(shm_fd, BUFFER_SIZE);

    // Map the shared memory into this process's address space.
    void* shm_ptr = mmap(
        nullptr, BUFFER_SIZE,
        PROT_READ | PROT_WRITE,
        MAP_SHARED, shm_fd, 0
    );

    // Initialize shared memory with data.
    memset(shm_ptr, 0xAB, BUFFER_SIZE);

    // Register the shared memory region as page-locked (pinned) with CUDA.
    // cudaHostRegisterPortable makes the mapping valid in all CUDA contexts.
    // This builds a GPU page table covering these physical frames.
    cudaHostRegister(shm_ptr, BUFFER_SIZE, cudaHostRegisterPortable);

    // Allocate GPU device memory and transfer data from the pinned shared memory.
    void* device_ptr;
    cudaMalloc(&device_ptr, BUFFER_SIZE);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaMemcpyAsync(device_ptr, shm_ptr, BUFFER_SIZE,
                    cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    // Signal the worker that the shared memory region is ready.
    sem_post(sem_ready);

    // Wait for the worker to finish using the shared memory before cleanup.
    sem_wait(sem_done);

    // Cleanup.
    cudaStreamDestroy(stream);
    cudaFree(device_ptr);
    cudaHostUnregister(shm_ptr);
    munmap(shm_ptr, BUFFER_SIZE);
    close(shm_fd);
    shm_unlink(SHM_NAME);
    sem_close(sem_ready);
    sem_close(sem_done);
    sem_unlink(SEM_READY);
    sem_unlink(SEM_DONE);
    return 0;
}
worker.cpp
#include <cuda_runtime.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <semaphore.h>
#include <unistd.h>
#include <cstdio>

#define SHM_NAME "/shared_pinned_memory"
#define SEM_READY "/shm_ready"
#define SEM_DONE "/shm_done"
#define BUFFER_SIZE (1ULL << 30) // 1 GB

int main()
{
    // Open synchronization semaphores created by the creator process.
    sem_t* sem_ready = sem_open(SEM_READY, 0);
    sem_t* sem_done = sem_open(SEM_DONE, 0);

    // Wait for the creator to signal that shared memory is initialized.
    sem_wait(sem_ready);

    // Open the existing shared memory object created by the creator process.
    int shm_fd = shm_open(SHM_NAME, O_RDWR, 0666);

    // Map the same shared memory into this process's address space.
    // The virtual address may differ from the creator's, but the underlying
    // physical frames are identical.
    void* shm_ptr = mmap(
        nullptr, BUFFER_SIZE,
        PROT_READ | PROT_WRITE,
        MAP_SHARED, shm_fd, 0
    );

    // Each worker independently registers the region with CUDA.
    // This causes the GPU to build a separate page table for this process,
    // even though the entries point to the same physical frames as the
    // creator's page table -- duplicating GPU memory usage per process.
    cudaHostRegister(shm_ptr, BUFFER_SIZE, cudaHostRegisterPortable);

    // Allocate GPU device memory and transfer data from the pinned shared memory.
    void* device_ptr;
    cudaMalloc(&device_ptr, BUFFER_SIZE);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaMemcpyAsync(device_ptr, shm_ptr, BUFFER_SIZE,
                    cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    // Signal the creator that this worker is done with the shared memory.
    sem_post(sem_done);

    // Cleanup.
    cudaStreamDestroy(stream);
    cudaFree(device_ptr);
    cudaHostUnregister(shm_ptr);
    munmap(shm_ptr, BUFFER_SIZE);
    close(shm_fd);
    sem_close(sem_ready);
    sem_close(sem_done);
    return 0;
}

Note that in this example, we did not use cudaMallocHost to allocate page-locked memory because cudaMallocHost does not support sharing memory across processes.

To mitigate the memory overhead of accessing the same page-locked memory from multiple processes, we could take advantage of CUDA Inter-Process Communication (IPC) for transferring data between different processes. More concretely, the processes could be divided into one master process and multiple worker processes. The master process is responsible for allocating the page-locked memory, owning the page-locked memory page table, and transferring data from the page-locked memory to the GPU memory in different worker processes.

For example, the master process can allocate a chunk of page-locked memory using cudaHostAlloc, and the worker process can share a handle to its GPU memory, which the master opens with cudaIpcOpenMemHandle to obtain a GPU memory address valid in its own process. The master process can then transfer data from the page-locked memory to the worker’s GPU memory using cudaMemcpyAsync. In this way, only the master process needs to build the page-locked memory page table, and the worker processes can receive data from the same page-locked memory without building their own page tables, thus avoiding duplicated GPU memory usage.

master.cpp
#include <cuda_runtime.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

#define BUFFER_SIZE (1ULL << 20) // 1 MB
#define FIFO_PATH "/tmp/cuda_ipc_fifo"

int main()
{
    // Allocate page-locked (pinned) host memory.
    // Only the master process allocates pinned memory, so only one
    // GPU page table is built for this buffer.
    void* master_pinned_ptr;
    cudaHostAlloc(&master_pinned_ptr, BUFFER_SIZE, cudaHostAllocDefault);

    // Initialize the buffer with data to transfer.
    memset(master_pinned_ptr, 0xAB, BUFFER_SIZE);

    // Receive the IPC memory handle from the worker via a named pipe.
    // The worker creates the FIFO and writes the handle first.
    cudaIpcMemHandle_t received_handle;
    int fd = open(FIFO_PATH, O_RDONLY);
    read(fd, &received_handle, sizeof(received_handle));
    close(fd);

    // Open the handle to obtain a local pointer to the worker's GPU memory.
    void* remote_dev_ptr;
    cudaIpcOpenMemHandle(
        &remote_dev_ptr,
        received_handle,
        cudaIpcMemLazyEnablePeerAccess
    );

    // Create a stream and transfer data from pinned host memory
    // to the worker's GPU memory asynchronously.
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaMemcpyAsync(
        remote_dev_ptr,     // Destination: Worker's GPU memory
        master_pinned_ptr,  // Source: Master's pinned host memory
        BUFFER_SIZE,
        cudaMemcpyHostToDevice,
        stream
    );
    cudaStreamSynchronize(stream);

    // Cleanup.
    cudaIpcCloseMemHandle(remote_dev_ptr);
    cudaStreamDestroy(stream);
    cudaFreeHost(master_pinned_ptr);
    return 0;
}
worker.cpp
#include <cuda_runtime.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

#define BUFFER_SIZE (1ULL << 20) // 1 MB
#define FIFO_PATH "/tmp/cuda_ipc_fifo"

int main()
{
    // Allocate GPU device memory that the master will write into.
    void* device_ptr;
    cudaMalloc(&device_ptr, BUFFER_SIZE);

    // Obtain an IPC handle for the device allocation.
    cudaIpcMemHandle_t handle;
    cudaIpcGetMemHandle(&handle, device_ptr);

    // Create a named pipe and send the handle to the master process.
    // The master opens the FIFO for reading after this write completes.
    mkfifo(FIFO_PATH, 0666);
    int fd = open(FIFO_PATH, O_WRONLY);
    write(fd, &handle, sizeof(handle));
    close(fd);

    // Wait for the master to finish the transfer.
    // In production, replace with a semaphore or condition variable.
    sleep(2);

    // Cleanup.
    cudaFree(device_ptr);
    unlink(FIFO_PATH);
    return 0;
}

Author

Lei Mao

Posted on

04-12-2026

Updated on

04-12-2026
