
Accelerating data transfer with CUDA devices using pinned memory
2024.08.16.
16 minute read
Maximizing the performance of a program is not possible without considering the target hardware. After all, different hardware calls for different solutions. When it comes to working with CUDA devices, pinned memory is an important optimization tool. If properly utilized, it can improve the data-transfer bandwidth between the host and the device, which matters in many areas, including deep learning.
Incentive for performance
“The purpose of all programs, and all parts of those programs, is to transform data from one form to another.”, said Mike Acton in his presentation at CppCon [1]. The former VP of Unity3d and Engine Director at Insomniac Games later continued: “Software does not run in a magic fairy aether…”
Throughout his talk, Acton stressed the importance of programmers understanding the data and the hardware they work with, so they can understand the (data transformation) problem and solve it appropriately. He demonstrated his points through examples, showing how data arrangement affects CPU cache utilization [2], memory bandwidth and overall speed.
In the video game industry, performance is key [1]. Creators can only add elements (be those mechanics or visuals) if there is enough room for them; otherwise, the frame's time budget would be exceeded and the user experience would suffer. For console games, developers want to take full advantage of the limited hardware. For PC games, they want to lower the minimum requirements of the game to reach a big enough audience. It is not surprising, therefore, that Mike Acton's points are also advocated by other members of the industry [3], [4], [5].
Video games, however, are not the only area where performance is important. In embedded and mobile systems, resources are limited just as they are in game consoles. On the other end of the spectrum, performance cannot be neglected even where computational capacity can be scaled, such as in deep (reinforcement) learning projects.
OpenAI Five [6] was the first AI system to defeat the world champions at an esports game, Dota 2, in 2019. Its rerun – which utilized the conclusions of the original ten-month-long run – used 1024 GPUs and 51200 CPUs over two months of training. (The numbers for the original version fluctuated between this scale and three times larger.) At such a scale, the reduction of running costs alone is enough of an incentive to maximize performance.
Programs can be accelerated in multiple ways: upgrading the hardware, using a faster language, switching the data structures and fine-tuning the algorithms. Depending on the field of application, these topics are discussed online to various degrees. I feel, however, that one particular topic in the context of PyTorch, C++ and CUDA deserves more publicity: pinned host memory [7], [8], [9], [10].
Pinned memory
Modern computers use virtual memory (VM) [8], [11], [12], [13]. It is an abstraction that decouples user code from the storage resources available on a given machine. Programs access memory using virtual addresses; these are translated by the processor and the operating system (OS) into physical addresses, specifying where the data is actually physically located. (The many benefits of VM are discussed in [12].)
In most virtual memory subsystems, the virtual address space is divided into pages, contiguous regions of virtual memory managed as a single unit. Similarly, the physical address space is divided into page frames. Since pages are mapped onto page frames, the two are equal in size.
Pages need not reside only in memory; they can also be kept on secondary (backing) storage. This allows programs to exceed the size of the available RAM. For certain uses, however, data must not be paged out (swapped) from memory to such storage. A good example is peripheral devices using direct memory access (DMA), such as GPUs. DMA enables devices to access main system memory independently of the CPU. If the data being read or written were paged out in the middle of a DMA transfer and another page were paged in to the same page frame, the data would be corrupted. To prevent this, pages can be pinned, which prevents them from being paged out.
Host (CPU) data allocations are pageable by default. If a pageable host array is the source of a device data transfer, the CUDA driver first allocates a temporary pinned host array as a staging area; the host data is then copied into this array, which is read by the device. Similarly, if a pageable host array is the destination of a device data transfer, a pinned array is allocated, the device transfers the data there, and the CPU then copies it into the destination array. In both scenarios, the pinned buffer may be smaller than the total amount of data, in which case the transfer occurs in multiple stages.
By using pinned host memory directly as the source or destination, the extra overhead of the host-to-host copy can be avoided. This enables data to be read or written with much higher bandwidth. (Be wary, though: excessive use of pinned memory may degrade system performance.)
The degree of acceleration depends on the actual hardware. A PC equipped with a Ryzen 5900X CPU, an Nvidia RTX 4080 GPU in a PCI Express x16 Gen4 slot and 2666 MHz DDR4 RAM yields the following bandwidths in a CUDA benchmark:
Device: NVIDIA GeForce RTX 4080
Transfer size (MB): 16
Pageable transfers
Host to Device bandwidth (GB/s): 17.115696
Device to Host bandwidth (GB/s): 10.471728
Pinned transfers
Host to Device bandwidth (GB/s): 26.002480
Device to Host bandwidth (GB/s): 26.780814
Here I would like to note that page-locked memory (locked pages) and pinned memory (pinned pages) may not be as interchangeable as sources [7], [8], [9] and [10] lead one to believe. According to [14], although neither can be paged out, a locked page can still be moved within RAM, while a pinned page must remain in the same physical location. DMA transfers require the latter; thus I only use the term pinned.
Pinned memory with PyTorch
PyTorch is an optimized tensor library for deep learning using GPUs and CPUs. Its documentation briefly discusses memory pinning as a technique that enables faster and asynchronous copies from the host to the GPU [10], [15], [16]. The underlying reasons, however, are not mentioned; neither is the possibility of using it for device-to-host transfers.
In PyTorch, memory pinning is used in two cases: tensors and data-loaders (DataLoader objects). Tensors can be initialized in pinned memory by passing pin_memory=True, and can be copied into it by calling .pin_memory(). Data-loaders, objects that iterate through datasets and yield batches of data, can be instructed to place the batches into pinned memory by passing pin_memory=True.
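As a quick illustration of these entry points (a minimal sketch; the shapes, dataset and batch size below are arbitrary placeholders):
import torch
from torch.utils.data import DataLoader, TensorDataset
# Tensor allocated directly in pinned memory
a = torch.zeros(1024, 1024, pin_memory=True)
# Existing pageable tensor copied into a new pinned tensor
b = torch.rand(1024, 1024).pin_memory()
# Data-loader that places the batches it yields into pinned memory
data = TensorDataset(torch.rand(10_000, 128))
loader = DataLoader(data, batch_size=64, pin_memory=True)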
To transfer tensors, PyTorch provides several options. Some of these also offer the possibility of changing the destination data type.
The .cpu(), .cuda(), .to(dtype), .to(device, dtype) and .to(other) functions return a copy of the caller. They differ primarily in how the device and the data type of the destination tensor are determined; under the same circumstances, they perform equally, based on my measurements. As a result, the Python benchmarks presented in this post only use .to(device, dtype).
The .copy_(src) function copies the elements of the source tensor into the caller. Compared to the previous ones, this function provides more control over the destination. As far as I know, this is the only way to utilize pinned memory in device-to-host transfers using a PyTorch function.
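A short sketch of the two approaches with pinned host memory (the device, shape and dtype below are placeholders):
import torch
device = torch.device("cuda")
host = torch.rand(4_000_000, pin_memory=True)  # pinned source on the host
# Host to device: .to() returns a copy on the requested device (and dtype, if given)
dev = host.to(device, torch.float32)
# Device to host: .copy_() fills an existing tensor, so the destination itself can be pinned
out = torch.empty(host.shape, dtype=host.dtype, pin_memory=True)
out.copy_(dev)  # device-to-host transfer into pinned memory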
Measuring the performance of the tensor transfers between the host and the device yields these results:
Data size: 16.0 MB
Iterations: 10001
Abbreviations: d-device, h-host, p-pinned
tensor.to(device, dtype)
h to d 985.15 us 15.9 GB/s
hp to d 662.33 us 23.6 GB/s
d to h 7256.03 us 2.2 GB/s
tensor.copy_(src)
h to d 981.33 us 15.9 GB/s
hp to d 645.16 us 24.2 GB/s
d to h 1588.58 us 9.8 GB/s
d to hp 625.61 us 25.0 GB/s
Similarly to the earlier CUDA benchmark, the pinned transfers clearly outperform the pageable ones. Interestingly though, the .to(device, dtype) function in the device-to-host transfer is much slower than expected: ~2 GB/s instead of ~10 GB/s. This raises some questions and makes .copy_(src) even more appealing in this transfer direction.
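These numbers come from a simple microbenchmark (the full scripts are available on GitHub, as noted later); the core of each measurement looks roughly like the sketch below, not the exact code:
import time
import torch
def median_time_us(fn, iters=10001):
    # Time a transfer callable many times and report the median, in microseconds
    times = []
    for _ in range(iters):
        torch.cuda.synchronize()  # start from an idle device
        start = time.perf_counter()
        fn()  # e.g. lambda: host.to(device, torch.float32)
        torch.cuda.synchronize()  # wait for the transfer to finish
        times.append((time.perf_counter() - start) * 1e6)
    return sorted(times)[len(times) // 2]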
C++ containers in pinned memory
C++ is commonly used in areas that require high performance. It is no surprise, therefore, that developers would want to utilize pinned memory with it when data transfer between the host and the device is involved.
As far as I know, the C++ standard library does not offer a way to pin memory. In CUDA C/C++, on the other hand, pinned memory can be allocated using cudaMallocHost() or cudaHostAlloc(), and deallocated with cudaFreeHost() [7]. Using these, a custom C++ allocator would enable (many of) the standard library containers to be created within pinned memory, combining their convenience with the improved performance.
Using the allocator implementation of [17] as a base, the pinned allocator looks like this:
#include "cuda_runtime.h" // CUDA runtime API (the C header is "cuda_runtime_api.h")
#include <cstddef>        // std::size_t
#include <stdexcept>      // std::runtime_error
/* The allocator class */
template <typename T>
class pinned_alloc {
public:
using value_type = T;
using pointer = value_type *;
using size_type = std::size_t;
pinned_alloc() noexcept = default;
template <typename U>
pinned_alloc(pinned_alloc<U> const &) noexcept
{
}
auto allocate(size_type n, const void * = 0) -> value_type *
{
value_type *tmp;
auto error = cudaMallocHost((void **)&tmp, n * sizeof(T));
if (error != cudaSuccess) {
throw std::runtime_error{cudaGetErrorString(error)};
}
return tmp;
}
auto deallocate(pointer p, size_type n) -> void
{
if (p) {
auto error = cudaFreeHost(p);
if (error != cudaSuccess) {
throw std::runtime_error{cudaGetErrorString(error)};
}
}
}
};
template <class T, class U>
auto operator==(pinned_alloc<T> const &, pinned_alloc<U> const &) -> bool
{
return true;
}
template <class T, class U>
auto operator!=(pinned_alloc<T> const &, pinned_alloc<U> const &) -> bool
{
return false;
}
Using the allocator, creating a std::vector in pinned memory is simple enough: pass pinned_alloc<T> as its second template argument, e.g. std::vector<float, pinned_alloc<float>>. With an alias template [17], the repetition of specifying the allocator can be omitted:
// Alias template
template <typename T>
using pinned_vector = std::vector<T, pinned_alloc<T>>;
pinned_vector<float> my_vec; // replace float with the desired type
For more information on allocators and for a good read, see [17].
Pinned C++ arrays as PyTorch tensors
To improve runtime, Python programs often have their performance-critical parts written in fast languages, such as C++. In such scenarios, the underlying data structures usually need to be accessed through Python as well. When it comes to simple arrays, exposing the data through a NumPy ndarray is a reasonable and convenient solution.
NumPy [18], [19], [20] is a fundamental library for scientific computing in Python. It provides a multidimensional array object called ndarray, and utilizes optimized pre-compiled C code for operating on large amounts of data rapidly. NumPy arrays are used in several popular Python libraries such as Pandas, Scikit-learn and SciPy. OpenCV, an open-source computer-vision library written in C++, uses them to represent its n-dimensional dense array class cv::Mat in its Python bindings [21].
In PyTorch, NumPy arrays can be used to create tensors with the torch.from_numpy() function (if the data type is supported). A tensor created this way shares memory with the ndarray. This is beneficial for two reasons. Firstly, performance improves because the data is not copied. Secondly, it provides a simple approach for creating a tensor that shares memory with a C++ container: just use the NumPy array of the container to create the Tensor. As a consequence, when the data is in pinned memory, so is the tensor.
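A minimal sketch of this zero-copy relationship (using an ordinary pageable array for simplicity):
import numpy as np
import torch
arr = np.zeros(4, dtype=np.float32)
t = torch.from_numpy(arr)  # shares memory with arr, no copy is made
t[0] = 1.0                 # writing through the tensor...
print(arr[0])              # ...is visible through the ndarray: prints 1.0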
NumPy offers multiple approaches to treating a foreign object as an ndarray; these are described in [22]. With pybind11, a header-only library for binding C++ code to Python, exposing our pinned std::vector as a NumPy array looks as follows:
// include the pinned vector implementation
#include <pybind11/numpy.h>
#include <pybind11/pybind11.h>
namespace py = pybind11;
/// Pinned vector
template <typename T>
py::array_t<T> as_numpy(const std::vector<T, pinned_alloc<T>> &myvec)
{
return py::array_t<T>({myvec.size()}, // Shape
{sizeof(T)}, // Stride
myvec.data(), // Data pointer
py::none());
}
Here, the ndarray shares memory with the std::vector. As the memory is not managed by Python, the user is responsible for managing the lifetime of the vector; using the NumPy array after the C++ object is deleted will result in errors.
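On the Python side, using the binding might look like the sketch below; the module name (pinned_ext) and the helper constructing the vector (make_pinned_vector) are assumptions for illustration only:
import torch
import pinned_ext  # hypothetical pybind11 extension module built from the snippets above
vec = pinned_ext.make_pinned_vector(4_000_000)  # assumed helper returning a bound pinned std::vector
arr = pinned_ext.as_numpy(vec)                  # zero-copy NumPy view of the vector's data
t = torch.from_numpy(arr)                       # tensor sharing the same pinned memory
t_gpu = t.to("cuda", non_blocking=True)         # benefits from the pinned source
# vec must outlive arr, t and any use of this memory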
By binding both the regular pageable and the pinned std::vector variants to Python, the transfer rates of the vector-turned-tensors and the regular tensors can be compared:
Data size: 16.0 MB
Iterations: 10001
Abbreviations: d-device, h-host, p-pinned, v-vector
--- tensor ---
tensor.to(device, dtype)
h to d 979.66 us 15.9 GB/s
hp to d 669.72 us 23.3 GB/s
d to h 7148.27 us 2.2 GB/s
tensor.copy_(src)
h to d 976.80 us 16.0 GB/s
hp to d 645.16 us 24.2 GB/s
d to h 1583.10 us 9.9 GB/s
d to hp 625.61 us 25.0 GB/s
--- vector ---
tensor.to(device, dtype)
v to d 979.66 us 15.9 GB/s
vp to d 652.31 us 24.0 GB/s
tensor.copy_(src)
v to d 975.85 us 16.0 GB/s
vp to d 645.16 us 24.2 GB/s
d to v 1539.23 us 10.2 GB/s
d to vp 625.61 us 25.0 GB/s
cast
v as ndarray 0.24 us
v as tensor 1.19 us
vp as ndarray 0.24 us
vp as tensor 1.19 us
The results reveal three key points. First of all, creating a NumPy array or PyTorch tensor from a bound vector is fast with the zero-copy approaches. Secondly, a tensor that shares memory with a bound array automatically receives the same benefits from being in pinned memory as a regular tensor does; no need for additional tinkering. Lastly, a vector-turned-tensor and a regular PyTorch tensor perform equally under the same conditions.
Summary
Certain areas have a strong incentive to maximize the performance of programs. To do so, it is crucial to understand the data and the hardware being worked with. When it comes to working with PyTorch, C++ and CUDA devices, utilizing pinned memory is an important optimization step that I feel deserves more publicity.
Modern computers use virtual memory. It is an abstraction that decouples programs from the storage resources of a given machine. Alongside the RAM, computers can also make use of a secondary storage (disks, SSDs), which enables them to exceed the size of the available memory. For certain uses, however, data must remain – at the same location – within the RAM. Devices using direct memory access (DMA), such as GPUs, are prime examples. To achieve this, memory segments can be pinned.
Unlike regular pageable memory, pinned memory can be directly accessed by the device (GPU). This allows data to be read or written with much higher bandwidth. In a custom PyTorch benchmark, switching to pinned host memory increased the host-to-device transfer rate from 16 GB/s to 24 GB/s; in the device-to-host direction, from 10 GB/s to 25 GB/s. In addition to this speedup, switching also enables the GPU to move data asynchronously with respect to the CPU, which frees the latter to do other tasks.
Python programs often utilize C++ code to increase their performance. When doing so in PyTorch projects, creating a tensor that shares memory with a bound array can be beneficial: it grants PyTorch functions direct access to the data, which simplifies integrating custom code into the project. Based on the measurements, a tensor created this way receives the same benefits from being in pinned memory as a regular tensor does. Furthermore, they perform equally under the same conditions.
All in all, pinning memory is an important tool for improving the performance of projects that rely on CUDA GPUs. It enables increased bandwidth between the host and the device, can free the CPU for other tasks for the duration of the data transfer, and can be used in tandem with leveraging C++ code in Python. The degree of acceleration, nonetheless, will depend on the actual data and hardware.
The presented results were acquired using a PC equipped with a Ryzen 5900X CPU, an Nvidia RTX 4080 GPU in a PCI Express x16 Gen4 slot and 2666 MHz DDR4 RAM. The benchmarks are available on GitHub.
References
Appendix
The primary goal of this post was to raise awareness of pinned memory; it is by no means an in-depth guide. This is reflected by the microbenchmarks: they just print the median execution time for a single data size without further analysis. Yet, the performance of moving data from the device to the host with .to(device, dtype) is worth a second look.
The tests measured this transfer to be the slowest, reaching a rate of only 2 GB/s. For comparison, .copy_(src) reached 10 GB/s with pageable and 25 GB/s with pinned host memory. I started investigating.
I checked whether the function is simply not intended for device-to-host transfers, but .cpu() performed equally. I then tested whether the slowness is caused by the repeated allocation of the destination container; however, creating the destination tensor myself and calling .copy_(src) on it reached only 6 GB/s with pageable and 14 GB/s with pinned memory. See:
Data size: 16.0 MB
Iterations: 5001
Abbreviations: d-device, h-host, p-pinned
tensor.to(device, dtype)
h to d 980.14 us 15.9 GB/s
hp to d 648.50 us 24.1 GB/s
d to h 7480.86 us 2.1 GB/s
tensor.copy_(src)
h to d 972.99 us 16.1 GB/s
hp to d 645.40 us 24.2 GB/s
d to h 1600.27 us 9.8 GB/s
d to hp 625.85 us 25.0 GB/s
tensor.cpu()
d to h 7508.99 us 2.1 GB/s
torch.zeros() > tensor.copy_(src)
d to h 2511.02 us 6.2 GB/s
d to hp 1111.27 us 14.1 GB/s
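For reference, the torch.zeros() > tensor.copy_(src) rows above correspond roughly to the following pattern (a sketch with assumed shape and dtype, not the exact benchmark code):
import torch
src = torch.empty(16 * 2**20 // 4, dtype=torch.float32, device="cuda")  # 16 MB of float32 on the device
# Pre-allocate the destination once, then copy into it
dst_pageable = torch.zeros(src.shape, dtype=src.dtype)                   # the "d to h" row
dst_pinned = torch.zeros(src.shape, dtype=src.dtype, pin_memory=True)    # the "d to hp" row
dst_pinned.copy_(src)     # device-to-host copy into the existing pinned buffer
torch.cuda.synchronize()  # make sure the transfer has finished before reading a timestamp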
Unexpectedly, decreasing the data size from 16 MB to 11 MB increased the transfer rate to 10 GB/s.
Data size: 11.0 MB
Iterations: 5001
Abbreviations: d-device, h-host, p-pinned
tensor.to(device, dtype)
h to d 686.41 us 15.6 GB/s
hp to d 448.70 us 23.9 GB/s
d to h 1074.79 us 10.0 GB/s
tensor.copy_(src)
h to d 682.12 us 15.7 GB/s
hp to d 445.84 us 24.1 GB/s
d to h 1087.43 us 9.9 GB/s
d to hp 432.49 us 24.8 GB/s
tensor.cpu()
d to h 1071.93 us 10.0 GB/s
torch.zeros() > tensor.copy_(src)
d to h 1419.54 us 7.6 GB/s
d to hp 744.58 us 14.4 GB/s
To me, this suggests some PyTorch mechanic to be the underlying culprit. Regardless, the conclusion stays the same: .copy_(src) with pinned host memory is significantly faster than .to(device, dtype) for moving data from the device to the host in PyTorch.