
Accelerating data transfer with CUDA devices using pinned memory
2024.08.16.
16 minute read
Maximizing the performance of a program is not possible without considering the target hardware. After all, different hardware calls for different solutions. When it comes to working with CUDA devices, pinned memory is an important optimization tool. If properly utilized, it can improve the data-transfer bandwidth between the host and the device, which matters in many areas, including deep learning.
Incentive for performance
“The purpose of all programs, and all parts of those programs, is to transform data from one form to another.”, said Mike Acton in his presentation at CppCon [1]. The former VP of Unity3d and Engine Director at Insomniac Games later continued: “Software does not run in a magic fairy aether…”
Throughout his talk, Acton stressed the importance of programmers understanding the data and the hardware they work with, so they can understand the (data transformation) problem and solve it appropriately. He demonstrated his points through examples, showing how data arrangement affects CPU cache utilization [2], memory bandwidth and overall speed.
In the video game industry, performance is key [1]. Creators can only add elements (be those mechanics or visuals) if there is enough room for them; otherwise, the frame's time budget would be exceeded and the user experience would suffer. For console games, developers want to take full advantage of the limited hardware. For PC games, they want to lower the minimum requirements of the game to reach a big enough audience. It is not surprising, therefore, that Mike Acton's points are also advocated by other members of the industry [3], [4], [5].
Video games, however, are not the only area where performance is important. In embedded and mobile systems, resources are limited just as they are in game consoles. On the other end of the spectrum, performance cannot be neglected even where computational capacity can be scaled, such as in deep (reinforcement) learning projects.
OpenAI Five [6] was the first AI system to defeat the world champions at an esports game, Dota 2, in 2019. Its rerun – which utilized the conclusions of the original ten-month-long run – used 1024 GPUs and 51200 CPUs over two months of training. (The numbers for the original version fluctuated between this scale and three times larger.) At such a scale, the reduction of running costs alone is enough of an incentive to maximize performance.
Programs can be accelerated in multiple ways: upgrading the hardware, using a faster language, switching the data structures and fine-tuning the algorithms. Depending on the field of application, these topics are discussed online to various degrees. I feel, however, that one particular topic in the context of PyTorch, C++ and CUDA deserves more publicity: pinned host memory [7], [8], [9], [10].
Pinned memory
Modern computers use virtual memory (VM) [8], [11], [12], [13]. It is an abstraction that decouples user code from the storage resources available on a given machine. Programs access memory using virtual addresses; these are translated by the processor and the operating system (OS) into physical addresses, specifying where the data is actually physically located. (The many benefits of VM are discussed in [12].)
In most virtual memory subsystems, the virtual address space is divided into pages, contiguous regions of virtual memory managed as a single unit. Similarly, the physical address space is divided into page frames. Since pages are mapped onto page frames, the two are equal in size.
Pages need not reside only in memory; they can also be kept on secondary (backing) storage. This allows programs to exceed the size of the available RAM. For certain uses, however, data must not be paged out (swapped) from memory to such storage. A good example is peripheral devices using direct memory access (DMA), such as GPUs. DMA enables devices to access main system memory independently of the CPU. If the data being read or written were paged out in the middle of a DMA transfer and another page were paged in to the same page frame, the data would be corrupted. To prevent this, pages can be pinned, which prevents them from being paged out.
Host (CPU) data allocations are pageable by default. If a pageable host array is the source of a device data transfer, the CUDA driver first allocates a temporary pinned host array as a staging area; the host data is then copied into this array, which is read by the device. Similarly, if a pageable host array is the destination of a device data transfer, a pinned array is allocated, the device transfers the data there, and the CPU then copies it into the destination array. In both scenarios, the pinned buffer may be smaller than the total amount of data, in which case the transfer occurs in multiple stages.
By using pinned host memory directly as the source or destination, the extra overhead of the host-to-host copy can be avoided. This enables data to be read or written with much higher bandwidth. (Be wary, though: excessive use of pinned memory may degrade system performance.)
The degree of acceleration depends on the actual hardware. A PC equipped with a Ryzen 5900X CPU, an Nvidia RTX 4080 GPU in a PCI Express x16 Gen4 slot and 2666 MHz DDR4 RAM yields the following bandwidths in a CUDA benchmark:
Device: NVIDIA GeForce RTX 4080
Transfer size (MB): 16
Pageable transfers
Host to Device bandwidth (GB/s): 17.115696
Device to Host bandwidth (GB/s): 10.471728
Pinned transfers
Host to Device bandwidth (GB/s): 26.002480
Device to Host bandwidth (GB/s): 26.780814
Here I would like to note that page-locked memory (locked pages) and pinned memory (pinned pages) may not be as interchangeable as sources [7], [8], [9] and [10] lead one to believe. According to [14], although neither can be paged out, a locked page can still be moved within RAM, while a pinned page must remain in the same physical location. DMA transfers require the latter; thus I only use the term pinned.
Pinned memory with PyTorch
PyTorch is an optimized tensor library for deep learning using GPUs and CPUs. Its documentation briefly discusses memory pinning as a technique that enables faster and asynchronous copies from the host to the GPU [10], [15], [16]. The underlying reasons, however, are not mentioned; neither is the possibility of using it for device-to-host transfers.
In PyTorch, memory pinning is used in two cases: tensors and data-loaders (DataLoader objects). Tensors can be initialized in pinned memory by passing pin_memory=True, and can be copied into it by calling .pin_memory(). Data-loaders, objects that iterate through datasets and yield batches of data, can be instructed to place the batches into pinned memory by passing pin_memory=True.
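As a quick illustration of these entry points (a minimal sketch; the shapes, dataset and batch size below are arbitrary placeholders):
import torch
from torch.utils.data import DataLoader, TensorDataset
# Tensor allocated directly in pinned memory
a = torch.zeros(1024, 1024, pin_memory=True)
# Existing pageable tensor copied into a new pinned tensor
b = torch.rand(1024, 1024).pin_memory()
# Data-loader that places the batches it yields into pinned memory
data = TensorDataset(torch.rand(10_000, 128))
loader = DataLoader(data, batch_size=64, pin_memory=True)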
To transfer tensors, PyTorch provides several options. Some of these also offer the possibility of changing the destination data type.
The .cpu(), .cuda(), .to(dtype), .to(device, dtype) and .to(other) functions return a copy of the caller. They differ primarily in how the device and the data type of the destination tensor are determined; under the same circumstances, they perform equally, based on my measurements. As a result, the Python benchmarks presented in this post only use .to(device, dtype).
The .copy_(src) function copies the elements of the source tensor into the caller. Compared to the previous ones, this function provides more control over the destination. As far as I know, this is the only way to utilize pinned memory in device-to-host transfers using a PyTorch function.
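A short sketch of the two approaches with pinned host memory (the device, shape and dtype below are placeholders):
import torch
device = torch.device("cuda")
host = torch.rand(4_000_000, pin_memory=True)  # pinned source on the host
# Host to device: .to() returns a copy on the requested device (and dtype, if given)
dev = host.to(device, torch.float32)
# Device to host: .copy_() fills an existing tensor, so the destination itself can be pinned
out = torch.empty(host.shape, dtype=host.dtype, pin_memory=True)
out.copy_(dev)  # device-to-host transfer into pinned memory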
Measuring the performance of the tensor transfers between the host and the device yields these results:
Data size: 16.0 MB
Iterations: 10001
Abbreviations: d-device, h-host, p-pinned
tensor.to(device, dtype)
h to d 985.15 us 15.9 GB/s
hp to d 662.33 us 23.6 GB/s
d to h 7256.03 us 2.2 GB/s
tensor.copy_(src)
h to d 981.33 us 15.9 GB/s
hp to d 645.16 us 24.2 GB/s
d to h 1588.58 us 9.8 GB/s
d to hp 625.61 us 25.0 GB/s
Similarly to the earlier CUDA benchmark, the pinned transfers clearly outperform the pageable ones. Interestingly though, the .to(device, dtype) function in the device-to-host transfer is much slower than expected: ~2 GB/s instead of ~10 GB/s. This raises some questions and makes .copy_(src) even more appealing in this transfer direction.
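These numbers come from a simple microbenchmark (the full scripts are available on GitHub, as noted later); the core of each measurement looks roughly like the sketch below, not the exact code:
import time
import torch
def median_time_us(fn, iters=10001):
    # Time a transfer callable many times and report the median, in microseconds
    times = []
    for _ in range(iters):
        torch.cuda.synchronize()  # start from an idle device
        start = time.perf_counter()
        fn()  # e.g. lambda: host.to(device, torch.float32)
        torch.cuda.synchronize()  # wait for the transfer to finish
        times.append((time.perf_counter() - start) * 1e6)
    return sorted(times)[len(times) // 2]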
C++ containers in pinned memory
C++ is commonly used in areas that require high performance. It is no surprise, therefore, that developers would want to utilize pinned memory with it when data transfer between the host and the device is involved.
As far as I know, the C++ standard library does not offer a way to pin memory. In CUDA C/C++, on the other hand, pinned memory can be allocated using cudaMallocHost() or cudaHostAlloc(), and deallocated with cudaFreeHost() [7]. Using these, a custom C++ allocator would enable (many of) the standard library containers to be created within pinned memory, combining their convenience with the improved performance.
Using the allocator implementation of [17] as a base, the pinned allocator looks like this:
#include "cuda_runtime.h" // CUDA runtime API (the C header is "cuda_runtime_api.h")
#include <cstddef>        // std::size_t
#include <stdexcept>      // std::runtime_error
/* The allocator class */
template <typename T>
class pinned_alloc {
public:
using value_type = T;
using pointer = value_type *;
using size_type = std::size_t;
pinned_alloc() noexcept = default;
template <typename U>
pinned_alloc(pinned_alloc<U> const &) noexcept
{
}
auto allocate(size_type n, const void * = 0) -> value_type *
{
value_type *tmp;
auto error = cudaMallocHost((void **)&tmp, n * sizeof(T));
if (error != cudaSuccess) {
throw std::runtime_error{cudaGetErrorString(error)};
}
return tmp;
}
auto deallocate(pointer p, size_type n) -> void
{
if (p) {
auto error = cudaFreeHost(p);
if (error != cudaSuccess) {
throw std::runtime_error{cudaGetErrorString(error)};
}
}
}
};
template <class T, class U>
auto operator==(pinned_alloc<T> const &, pinned_alloc<U> const &) -> bool
{
return true;
}
template <class T, class U>
auto operator!=(pinned_alloc<T> const &, pinned_alloc<U> const &) -> bool
{
return false;
}
Using the allocator, creating a std::vector in pinned memory is simple enough: pass pinned_alloc<T> as its second template argument, e.g. std::vector<float, pinned_alloc<float>>. With an alias template [17], the repetition of specifying the allocator can be omitted:
// Alias template
template <typename T>
using pinned_vector = std::vector<T, pinned_alloc<T>>;
pinned_vector<float> my_vec; // replace float with the desired type
For more information on allocators and for a good read, see [17].
Pinned C++ arrays as PyTorch tensors
To improve runtime, Python programs often have their performance-critical parts written in fast languages, such as C++. In such scenarios, the underlying data structures usually need to be accessed through Python as well. When it comes to simple arrays, exposing the data through a NumPy ndarray is a reasonable and convenient solution.
NumPy [18], [19], [20] is a fundamental library for scientific computing in Python. It provides a multidimensional array object called ndarray, and utilizes optimized pre-compiled C code for operating on large amounts of data rapidly. NumPy arrays are used in several popular Python libraries such as Pandas, Scikit-learn and SciPy. OpenCV, an open-source computer-vision library written in C++, uses them to represent its n-dimensional dense array class cv::Mat in its Python bindings [21].
In PyTorch, NumPy arrays can be used to create tensors with the torch.from_numpy() function (if the data type is supported). A tensor created this way shares memory with the ndarray. This is beneficial for two reasons. Firstly, performance improves because the data is not copied. Secondly, it provides a simple approach for creating a tensor that shares memory with a C++ container: just use the NumPy array of the container to create the Tensor. As a consequence, when the data is in pinned memory, so is the tensor.
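A minimal sketch of this zero-copy relationship (using an ordinary pageable array for simplicity):
import numpy as np
import torch
arr = np.zeros(4, dtype=np.float32)
t = torch.from_numpy(arr)  # shares memory with arr, no copy is made
t[0] = 1.0                 # writing through the tensor...
print(arr[0])              # ...is visible through the ndarray: prints 1.0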
NumPy offers multiple approaches to treating a foreign object as an ndarray; these are described in [22]. With pybind11, a header-only library for binding C++ code to Python, exposing our pinned std::vector as a NumPy array looks as follows:
// include the pinned vector implementation
#include <pybind11/numpy.h>
#include <pybind11/pybind11.h>
namespace py = pybind11;
/// Pinned vector
template <typename T>
py::array_t<T> as_numpy(const std::vector<T, pinned_alloc<T>> &myvec)
{
return py::array_t<T>({myvec.size()}, // Shape
{sizeof(T)}, // Stride
myvec.data(), // Data pointer
py::none());
}
Here, the ndarray shares memory with the std::vector. As the memory is not managed by Python, the user is responsible for managing the lifetime of the vector; using the NumPy array after the C++ object is deleted will result in errors.
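On the Python side, using the binding might look like the sketch below; the module name (pinned_ext) and the helper constructing the vector (make_pinned_vector) are assumptions for illustration only:
import torch
import pinned_ext  # hypothetical pybind11 extension module built from the snippets above
vec = pinned_ext.make_pinned_vector(4_000_000)  # assumed helper returning a bound pinned std::vector
arr = pinned_ext.as_numpy(vec)                  # zero-copy NumPy view of the vector's data
t = torch.from_numpy(arr)                       # tensor sharing the same pinned memory
t_gpu = t.to("cuda", non_blocking=True)         # benefits from the pinned source
# vec must outlive arr, t and any use of this memory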
By binding both the regular pageable and the pinned std::vector variants to Python, the transfer rates of the vector-turned-tensors and the regular tensors can be compared:
Data size: 16.0 MB
Iterations: 10001
Abbreviations: d-device, h-host, p-pinned, v-vector
--- tensor ---
tensor.to(device, dtype)
h to d 979.66 us 15.9 GB/s
hp to d 669.72 us 23.3 GB/s
d to h 7148.27 us 2.2 GB/s
tensor.copy_(src)
h to d 976.80 us 16.0 GB/s
hp to d 645.16 us 24.2 GB/s
d to h 1583.10 us 9.9 GB/s
d to hp 625.61 us 25.0 GB/s
--- vector ---
tensor.to(device, dtype)
v to d 979.66 us 15.9 GB/s
vp to d 652.31 us 24.0 GB/s
tensor.copy_(src)
v to d 975.85 us 16.0 GB/s
vp to d 645.16 us 24.2 GB/s
d to v 1539.23 us 10.2 GB/s
d to vp 625.61 us 25.0 GB/s
cast
v as ndarray 0.24 us
v as tensor 1.19 us
vp as ndarray 0.24 us
vp as tensor 1.19 us
The results reveal three key points. First of all, creating a NumPy array or PyTorch tensor from a bound vector is fast with the zero-copy approaches. Secondly, a tensor that shares memory with a bound array automatically receives the same benefits from being in pinned memory as a regular tensor does; no need for additional tinkering. Lastly, a vector-turned-tensor and a regular PyTorch tensor perform equally under the same conditions.
Summary
Certain areas have a strong incentive to maximize the performance of programs. To do so, it is crucial to understand the data and the hardware being worked with. When it comes to working with PyTorch, C++ and CUDA devices, utilizing pinned memory is an important optimization step that I feel deserves more publicity.
Modern computers use virtual memory. It is an abstraction that decouples programs from the storage resources of a given machine. Alongside the RAM, computers can also make use of a secondary storage (disks, SSDs), which enables them to exceed the size of the available memory. For certain uses, however, data must remain – at the same location – within the RAM. Devices using direct memory access (DMA), such as GPUs, are prime examples. To achieve this, memory segments can be pinned.
Unlike regular pageable memory, pinned memory can be directly accessed by the device (GPU). This allows data to be read or written with much higher bandwidth. In a custom PyTorch benchmark, switching to pinned host memory increased the host-to-device transfer rate from 16 GB/s to 24 GB/s; in the device-to-host direction, from 10 GB/s to 25 GB/s. In addition to this speedup, switching also enables the GPU to move data asynchronously with respect to the CPU, which frees the latter to do other tasks.
Python programs often utilize C++ code to increase their performance. When doing so in PyTorch projects, creating a tensor that shares memory with a bound array can be beneficial: it grants PyTorch functions direct access to the data, which simplifies integrating custom code into the project. Based on the measurements, a tensor created this way receives the same benefits from being in pinned memory as a regular tensor does. Furthermore, they perform equally under the same conditions.
All in all, pinning memory is an important tool for improving the performance of projects that rely on CUDA GPUs. It enables increased bandwidth between the host and the device, can free the CPU for other tasks for the duration of the data transfer, and can be used in tandem with leveraging C++ code in Python. The degree of acceleration, nonetheless, will depend on the actual data and hardware.
The presented results were acquired using a PC equipped with a Ryzen 5900X CPU, an Nvidia RTX 4080 GPU in a PCI Express x16 Gen4 slot and 2666 MHz DDR4 RAM. The benchmarks are available on GitHub.
References
Appendix
The primary goal of this post was to raise awareness of pinned memory; it is by no means an in-depth guide. This is reflected by the microbenchmarks: they just print the median execution time for a single data size without further analysis. Yet, the performance of moving data from the device to the host with .to(device, dtype) is worth a second look.
The tests measured this transfer to be the slowest, reaching a rate of only 2 GB/s. For comparison, .copy_(src) reached 10 GB/s with pageable and 25 GB/s with pinned host memory. I started investigating.
I checked whether the function is simply not intended for device-to-host transfers, but .cpu() performed equally. I then tested whether the slowness is caused by the repeated allocation of the destination container; however, creating the destination tensor myself and calling .copy_(src) on it reached only 6 GB/s with pageable and 14 GB/s with pinned memory. See:
Data size: 16.0 MB
Iterations: 5001
Abbreviations: d-device, h-host, p-pinned
tensor.to(device, dtype)
h to d 980.14 us 15.9 GB/s
hp to d 648.50 us 24.1 GB/s
d to h 7480.86 us 2.1 GB/s
tensor.copy_(src)
h to d 972.99 us 16.1 GB/s
hp to d 645.40 us 24.2 GB/s
d to h 1600.27 us 9.8 GB/s
d to hp 625.85 us 25.0 GB/s
tensor.cpu()
d to h 7508.99 us 2.1 GB/s
torch.zeros() > tensor.copy_(src)
d to h 2511.02 us 6.2 GB/s
d to hp 1111.27 us 14.1 GB/s
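For reference, the torch.zeros() > tensor.copy_(src) rows above correspond roughly to the following pattern (a sketch with assumed shape and dtype, not the exact benchmark code):
import torch
src = torch.empty(16 * 2**20 // 4, dtype=torch.float32, device="cuda")  # 16 MB of float32 on the device
# Pre-allocate the destination once, then copy into it
dst_pageable = torch.zeros(src.shape, dtype=src.dtype)                   # the "d to h" row
dst_pinned = torch.zeros(src.shape, dtype=src.dtype, pin_memory=True)    # the "d to hp" row
dst_pinned.copy_(src)     # device-to-host copy into the existing pinned buffer
torch.cuda.synchronize()  # make sure the transfer has finished before reading a timestamp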
Unexpectedly, decreasing the data size from 16 MB to 11 MB increased the transfer rate to 10 GB/s.
Data size: 11.0 MB
Iterations: 5001
Abbreviations: d-device, h-host, p-pinned
tensor.to(device, dtype)
h to d 686.41 us 15.6 GB/s
hp to d 448.70 us 23.9 GB/s
d to h 1074.79 us 10.0 GB/s
tensor.copy_(src)
h to d 682.12 us 15.7 GB/s
hp to d 445.84 us 24.1 GB/s
d to h 1087.43 us 9.9 GB/s
d to hp 432.49 us 24.8 GB/s
tensor.cpu()
d to h 1071.93 us 10.0 GB/s
torch.zeros() > tensor.copy_(src)
d to h 1419.54 us 7.6 GB/s
d to hp 744.58 us 14.4 GB/s
To me, this suggests some PyTorch mechanic to be the underlying culprit. Regardless, the conclusion stays the same: .copy_(src) with pinned host memory is significantly faster than .to(device, dtype) for moving data from the device to the host in PyTorch.