Page Table for Page-Locked Host Memory
Introduction
In GPU programming, page-locked (pinned) host memory is memory that is allocated on the CPU and prevented from being paged out to disk. This allows for faster data transfer between the CPU and GPU, as the GPU can DMA directly from the page-locked memory without the overhead of first copying the data to a temporary staging buffer.
In some large-scale GPU applications, especially those that involve multiple processes accessing the same page-locked memory, there can be significant GPU memory overhead due to the page tables that map the virtual addresses of the page-locked memory to physical addresses in RAM.
In this blog post, I would like to discuss the concept of page tables for page-locked host memory, the GPU memory overhead associated with it, and how to share page-locked host memory across multiple processes while minimizing GPU memory usage.
Virtual Memory
Computers don’t usually let programs access physical RAM directly. Instead, they give each program a “fake” or Virtual Address Space. Virtual memory is divided into fixed-size blocks called pages. Physical memory (RAM) is divided into blocks of the same size called frames.
Note that contiguity in the virtual address space and contiguity in physical memory are distinct concepts. A program may have a buffer that is allocated as contiguous pages in the virtual address space, but those pages may be mapped to non-contiguous frames in physical memory. For example, when we allocate memory using malloc in C or cudaMalloc in CUDA, the allocated memory is contiguous in the virtual address space, but it may not be contiguous in physical memory. The operating system manages the mapping between virtual pages and physical frames, while the hardware memory management unit (MMU) performs the translation, and virtual pages can be mapped to non-contiguous physical frames as needed to optimize memory usage and performance.
Page Size
The size of a page can vary, depending on the architecture and operating system. In a GPU system, there are both CPU pages and GPU pages. The CPU page size is typically 4 KB, while the GPU page size typically varies from 4 KB to 2 MB and is not necessarily constant at runtime.
A page of size 2 MB is called a "huge page". Huge pages reduce the overhead of managing many small pages, which can improve performance for certain workloads. However, using huge pages can also lead to increased memory fragmentation and may not always be beneficial, depending on the workload and memory access patterns. In addition, a huge page must be backed by physically contiguous frames in RAM, which can be difficult to guarantee on a system with little memory and high memory fragmentation.
Page Table
A page table is the “map” or data structure used by the operating system to translate those fake virtual addresses into the actual physical locations in your RAM. When a program wants to read data from “Virtual Address A,” the processor looks at the page table to find which “Physical Frame B” that data actually lives in.
Page tables also store "permissions." They tell the processor whether a piece of memory is read-only, whether it can be executed (like code), or whether it is off-limits to certain programs. This is why a program that accesses memory it should not will see a "Segmentation Fault."
In a GPU system, there are both CPU and GPU page tables. The CPU page table manages the virtual memory for the CPU, while the GPU page table manages the virtual memory for the GPU. The CPU page tables are highly optimized for latency, while the GPU page tables are optimized for throughput.
A page table entry (PTE) is a data structure that contains information about a single page of virtual memory. Each PTE typically takes 8 bytes (64 bits) and contains the following information:
- Valid bit: Indicates whether the page is currently mapped to a physical frame.
- Physical frame number: The address of the physical frame in RAM that the virtual page is mapped to.
- Permissions: Information about whether the page is read-only, read-write, executable, etc.
- Other flags: Additional information such as whether the page is dirty (has been modified), accessed, or if it is a large page.
The page table is process-specific. Each process has its own page table, even if in some cases, different processes may want to access the same physical memory and some entries in their page tables may point to the same physical frames.
Page-Locked Memory
Pageable memory is the default type of memory allocated on the host by the operating system, and it can be paged out to disk when the system is under memory pressure. Page-locked memory, also known as pinned memory, is a type of memory that is allocated on the host and is “locked” in place, meaning it cannot be paged out to disk. This ensures that the data in page-locked memory remains in RAM and can be accessed quickly by the GPU.
To truly achieve asynchronous data transfer between the CPU and GPU, page-locked memory must be used. If pageable memory is used for an asynchronous transfer, the transfer will not be truly asynchronous: the CUDA driver first copies the data from pageable memory to a temporary page-locked buffer before transferring it to the GPU, which introduces additional overhead and prevents truly asynchronous data transfer, resulting in lower performance in some cases.
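A minimal sketch of the pinned-memory pattern, assuming a single device and stream (the buffer size is arbitrary, and error checking is elided for brevity):

```cpp
#include <cuda_runtime.h>

int main() {
    constexpr size_t kBytes = 64ULL << 20;  // 64 MB, arbitrary
    void* h_pinned = nullptr;
    cudaHostAlloc(&h_pinned, kBytes, cudaHostAllocDefault);  // page-locked host buffer
    void* d_buf = nullptr;
    cudaMalloc(&d_buf, kBytes);
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // With a pinned source buffer, this copy is truly asynchronous with respect
    // to the host: cudaMemcpyAsync returns immediately and the DMA engine
    // transfers directly from the page-locked pages. With a malloc'd (pageable)
    // source, the same call would first stage the data through an internal
    // pinned buffer, blocking the host.
    cudaMemcpyAsync(d_buf, h_pinned, kBytes, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
    cudaStreamDestroy(stream);
    return 0;
}
```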
Page Table for Page-Locked Host Memory
To perform data transfer between CPU host memory and GPU device memory, because both use virtual memory addresses, there must be a page table that maps the virtual addresses of the page-locked memory to the physical addresses in RAM. This page table, sometimes referred to as the “page-locked memory page table”, is used by the GPU to access the data in page-locked memory during data transfer operations and it is located in the GPU memory.
Allocating too much page-locked memory reduces the memory available to other applications and to the operating system, which can lead to performance degradation and system instability. However, even on a system with a huge amount of RAM, allocating a large amount of page-locked memory can still have a critical downside for the GPU: the page-locked memory page table also takes up GPU memory, and if too much page-locked memory is allocated, the page table can become very large and consume a significant portion of the GPU's memory, reducing the GPU memory available to other applications.
Nowadays, some very advanced host systems can have 16 TB of CPU memory, but GPU memory is still limited to a few hundred GBs. For example, an NVIDIA B200 SuperPOD node has 8 Blackwell GPUs, each with 180 GB of "usable" GPU memory. Suppose I allocated a chunk of 1 TB of page-locked memory on the host; obviously, no single GPU can use all the data in that page-locked memory at once.
In the best scenario, the page size is 2 MB, and the page-locked memory page table will have 1 TB / 2 MB = 512K entries; each entry takes 8 bytes, so the page table will take 512K × 8 bytes = 4 MB of GPU memory. In the worst scenario, however, the page size is 4 KB, and the page table will have 1 TB / 4 KB = 256M entries; each entry takes 8 bytes, so the page table will take 256M × 8 bytes = 2 GB of GPU memory. On a GPU with less memory, for example an H100 GPU with 80 GB or an A100 with 40 GB, the amount of GPU memory the page-locked memory page table takes will not be negligible.
Things become even worse if an application has multiple processes and each process tries to access the same page-locked memory. Because each process has its own page table, the page-locked memory page table will be duplicated for each process, consuming even more GPU memory, even though the page-locked memory page tables are almost exactly the same.
For example, a straightforward approach to share pinned memory across processes is to use POSIX shared memory combined with cudaHostRegister. One process creates the shared memory region and all processes map it into their address spaces and register it with CUDA. However, this will result in duplicated GPU page table entries because each process registers the same physical memory independently.
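A sketch of this shared-memory approach, assuming a hypothetical shared-memory name "/pinned_pool" and a 1 GB region; error handling is elided for brevity:

```cpp
#include <cuda_runtime.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

constexpr char const* kShmName = "/pinned_pool";  // hypothetical name
constexpr size_t kShmSize = 1ULL << 30;           // 1 GB, for illustration

// Run once in the creating process.
void* create_and_register() {
    int const fd = shm_open(kShmName, O_CREAT | O_RDWR, 0600);
    ftruncate(fd, static_cast<off_t>(kShmSize));
    void* const ptr =
        mmap(nullptr, kShmSize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    // Pin the mapping. This builds a page-locked memory page table on the
    // GPU for *this* process.
    cudaHostRegister(ptr, kShmSize, cudaHostRegisterDefault);
    return ptr;
}

// Run in every other process that wants to use the same memory.
void* open_and_register() {
    int const fd = shm_open(kShmName, O_RDWR, 0600);
    void* const ptr =
        mmap(nullptr, kShmSize, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    // Registers the same physical memory again, duplicating the GPU page
    // table: the overhead grows linearly with the number of processes.
    cudaHostRegister(ptr, kShmSize, cudaHostRegisterDefault);
    return ptr;
}
```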
Note that in this example, we did not use cudaMallocHost to allocate page-locked memory because cudaMallocHost does not support sharing memory across processes.
To mitigate the memory overhead of accessing the same page-locked memory from multiple processes, we could take advantage of CUDA Inter-Process Communication (IPC) for transferring data between different processes. More concretely, the processes could be divided into one master process and multiple worker processes. The master process is responsible for allocating the page-locked memory, owning the page-locked memory page table, and transferring data from the page-locked memory to the GPU memory in different worker processes.
For example, the master process can allocate a chunk of page-locked memory using cudaHostAlloc. Each worker process allocates GPU memory and exports it with cudaIpcGetMemHandle, and the master process maps that allocation into its own address space using cudaIpcOpenMemHandle. The master process can then transfer data from the page-locked memory to the GPU memory owned by the worker process using cudaMemcpyAsync. In this way, only the master process needs to build the page-locked memory page table, and the worker processes can receive the data without building their own page tables, thus avoiding duplicated GPU memory usage.
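A sketch of the master/worker scheme under these assumptions; send_handle_to_master and recv_handle_from_worker are hypothetical placeholders for any CPU-side IPC channel (a pipe, a socket, or a small shared-memory region), and error checking is elided:

```cpp
#include <cuda_runtime.h>

// Hypothetical CPU-side transport for the 64-byte IPC handle.
void send_handle_to_master(cudaIpcMemHandle_t const& handle);
cudaIpcMemHandle_t recv_handle_from_worker();

// --- Worker process: owns the destination GPU buffer. ---
void worker(size_t size) {
    void* d_buf = nullptr;
    cudaMalloc(&d_buf, size);
    cudaIpcMemHandle_t handle;
    cudaIpcGetMemHandle(&handle, d_buf);
    send_handle_to_master(handle);  // hypothetical IPC transport
    // ... later: consume the data the master copied into d_buf ...
}

// --- Master process: the only owner of pinned memory and its page table. ---
void master(size_t size) {
    void* h_pinned = nullptr;
    cudaHostAlloc(&h_pinned, size, cudaHostAllocDefault);
    // ... fill h_pinned with data for the worker ...
    cudaIpcMemHandle_t const handle = recv_handle_from_worker();  // hypothetical
    void* d_buf = nullptr;
    cudaIpcOpenMemHandle(&d_buf, handle, cudaIpcMemLazyEnablePeerAccess);
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    // Copy from the master's pinned memory into the worker's GPU buffer.
    cudaMemcpyAsync(d_buf, h_pinned, size, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);
    cudaIpcCloseMemHandle(d_buf);
    cudaStreamDestroy(stream);
    cudaFreeHost(h_pinned);
}
```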
References
Page Table for Page-Locked Host Memory
https://leimao.github.io/blog/Page-Locked-Host-Memory-Page-Table/