Mathieu GAILLARD

Add with CUDA: memory management

There are multiple ways to manage memory transfers between host and device when adding two vectors in CUDA. I briefly benchmarked several methods: Zero-Copy host memory (registered and allocated), standard copy, Thrust, and Unified Memory.

Method

We choose a memory-bound algorithm, adding two very long vectors, so that most of the time is spent transferring data. We generate two integer arrays of size N, with N varying between 1,000 and 10 million. For each value of N, we add the two vectors 5 times and report the average runtime along with its standard deviation.

We compare five different methods:

Standard Copy

The textbook implementation:

  1. Dynamic allocation on the device using cudaMalloc
  2. Copy the data from host to device using cudaMemcpy
  3. Kernel execution
  4. Copy back the result from device to host

See the function addWithCuda in the code section.
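Condensed, the four steps look like this (error checking omitted for brevity; the full version with checks is in the code section):

```cuda
// Minimal sketch of the standard copy approach (no error checking)
int *dev_a, *dev_b, *dev_c;
cudaMalloc(&dev_a, size * sizeof(int));                            // 1. allocate on the device
cudaMalloc(&dev_b, size * sizeof(int));
cudaMalloc(&dev_c, size * sizeof(int));
cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);  // 2. copy host -> device
cudaMemcpy(dev_b, b, size * sizeof(int), cudaMemcpyHostToDevice);
addKernel<<<blocks, threadPerBlock>>>(dev_c, dev_a, dev_b, size);  // 3. run the kernel
cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost);  // 4. copy device -> host
cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
```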

Zero-Copy Register

Use the Zero-Copy Memory and pin an existing portion of memory to the GPU using the function cudaHostRegister.

See the function addWithCudaZeroCopyRegister in the code section.
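In short (error checking omitted), registering pins the pages of an existing host allocation and maps them into the device address space:

```cuda
// Pin the existing host array and get a device-side pointer to it
cudaHostRegister((void *)a, size * sizeof(int), cudaHostRegisterMapped);
int *dev_a;
cudaHostGetDevicePointer((void **)&dev_a, (void *)a, 0);
// ... same for b and c, then launch the kernel on the mapped pointers ...
addKernel<<<blocks, threadPerBlock>>>(dev_c, dev_a, dev_b, size);
cudaDeviceSynchronize();          // the result is visible in c on the host afterwards
cudaHostUnregister((void *)a);    // unpin when done
```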

Zero-Copy Alloc

Use the Zero-Copy Memory and allocate a portion of memory on the host and make it accessible by the GPU using the function cudaHostAlloc.

See the function addWithCudaZeroCopyAlloc in the code section.

Thrust

Similar to the standard copy method except that we let Thrust perform the copy using the thrust::device_vector<T> class.

See the function addWithThrust in the code section.
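As a side note, with Thrust the hand-written kernel is not even necessary: the same addition can be expressed with thrust::transform. This variant is not part of the benchmark; it is sketched here only to show the more idiomatic Thrust style:

```cuda
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/functional.h>

// Equivalent addition without a hand-written kernel (not benchmarked above)
thrust::device_vector<int> dev_a(a, a + size);   // copies host -> device
thrust::device_vector<int> dev_b(b, b + size);
thrust::device_vector<int> dev_c(size);
thrust::transform(dev_a.begin(), dev_a.end(), dev_b.begin(),
                  dev_c.begin(), thrust::plus<int>());
thrust::copy(dev_c.begin(), dev_c.end(), c);     // copies device -> host
```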

Unified Memory

Use Unified Memory: allocate the arrays with cudaMallocManaged, which returns pointers valid on both the host and the device, and let the CUDA runtime migrate pages between them on demand.

See the function addWithCudaManaged in the code section.
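The gist of it (error checking omitted): a managed pointer can be used directly on both sides, as long as the host waits for the kernel to finish before touching the data:

```cuda
// Minimal sketch of the unified memory approach (no error checking)
int *managed_a;
cudaMallocManaged((void **)&managed_a, size * sizeof(int));
memcpy(managed_a, a, size * sizeof(int));      // host writes through the same pointer
// ... same for b and c ...
addKernel<<<blocks, threadPerBlock>>>(managed_c, managed_a, managed_b, size);
cudaDeviceSynchronize();                       // wait before reading the result on the host
memcpy(c, managed_c, size * sizeof(int));
cudaFree(managed_a);                           // managed memory is freed with cudaFree
```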

Result

The code is presented in the next section. It was compiled on Windows 10 with MSVC 2017 (toolset v14.13) and CUDA 9.2. The computer is equipped with an Intel Core i7-7820HQ @ 2.90 GHz and an Nvidia Quadro M2200.

The following table shows the average runtime and standard deviation, in milliseconds, as a function of the size of the two arrays.

Runtime (ms)                   1k            10k           100k          1M             10M
addWithCudaZeroCopyRegister    0.14 ± 0.09   0.10 ± 0.00   0.24 ± 0.09   1.10 ± 0.23    9.78 ± 0.87
addWithCudaZeroCopyAlloc       0.10 ± 0.00   0.10 ± 0.00   0.22 ± 0.04   1.68 ± 0.47    15.30 ± 0.58
addWithCuda                    1.02 ± 0.28   0.84 ± 0.09   1.96 ± 0.19   7.08 ± 0.54    57.14 ± 8.98
addWithThrust                  0.34 ± 0.05   0.42 ± 0.04   1.76 ± 0.43   9.70 ± 0.51    76.28 ± 2.94
addWithCudaManaged             0.32 ± 0.04   0.72 ± 0.30   4.10 ± 0.42   29.48 ± 4.95   271.54 ± 16.63

Zero-Copy Memory is the fastest approach, because each element is read or written across the bus exactly once. Since GPUs are very good at hiding the latency of memory accesses, the kernel streams the data from the host and processes it as it arrives on the device. This gives a huge boost in performance over the approaches that perform a separate bulk copy.
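One caveat: mapped (zero-copy) memory is not guaranteed to be available on every device, so a robust implementation would check the canMapHostMemory device property first. A possible check, not included in the benchmark code below:

```cuda
// Check that the device supports mapping host memory before using zero-copy
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
if (prop.canMapHostMemory == 0) {
    fprintf(stderr, "Device cannot map host memory, falling back to cudaMemcpy\n");
}
// On older (pre-UVA) setups, cudaSetDeviceFlags(cudaDeviceMapHost) must also be
// called before the CUDA context is created.
```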

Thrust is equivalent to the standard approach: it copies the data to the device before running the kernel. For small arrays (fewer than 100k elements) it is faster; for bigger arrays (more than 1M elements) it is slower.

Unified Memory is not as performant as the other approaches; however, it is much more convenient to program.

Update on September 13, 2020:

I ran the same benchmark on a laptop equipped with an Intel Core i7-10875H with an Nvidia GeForce RTX 2060 Max-Q.

Runtime (ms)                   1k            10k           100k           1M              10M
addWithCudaZeroCopyRegister    0.40 ± 0.00   0.40 ± 0.00   0.64 ± 0.11    2.32 ± 0.26     17.58 ± 0.59
addWithCudaZeroCopyAlloc       1.24 ± 0.09   1.22 ± 0.04   1.86 ± 0.09    7.30 ± 0.21     49.60 ± 6.61
addWithCuda                    0.30 ± 0.00   0.32 ± 0.04   1.96 ± 0.19    3.58 ± 0.24     28.22 ± 1.23
addWithThrust                  0.40 ± 0.00   0.44 ± 0.05   1.76 ± 0.43    5.18 ± 0.40     37.46 ± 2.20
addWithCudaManaged             0.80 ± 0.12   1.50 ± 0.12   10.58 ± 0.16   120.14 ± 8.67   822.28 ± 14.54

Overall, the conclusion is the same. Interestingly, though, some results differ between the Maxwell and Turing architectures: on the newer GPU, Zero-Copy with allocation is actually slower than the standard copy approach, and Unified Memory became a lot slower as well.

Code

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>
#include <string.h>

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/copy.h>

const int threadPerBlock = 512;

bool benchmark(int *c, const int *a, const int *b, unsigned int arraySize);
bool checkResult(const int* c, const int *a, const int *b, unsigned int size);
cudaError_t addWithCuda(int *c, const int *a, const int *b, unsigned int size);
cudaError_t addWithCudaZeroCopyRegister(int *c, const int *a, const int *b, unsigned int size);
cudaError_t addWithCudaZeroCopyAlloc(int *c, const int *a, const int *b, unsigned int size);
cudaError_t addWithCudaManaged(int *c, const int *a, const int *b, unsigned int size);
cudaError_t addWithThrust(int *c, const int *a, const int *b, unsigned int size);

__global__ void addKernel(int *c, const int *a, const int *b, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;

    if (i < n)
    {
        c[i] = a[i] + b[i];
    }
}

int main()
{
    const int arraySize = 10000000;

    int* a = new int[arraySize];
    int* b = new int[arraySize];
    int* c = new int[arraySize];

    // Init A and B
    for (int i = 0; i < arraySize; i++)
    {
        a[i] = i + 1;
        b[i] = 10 * (i + 1);
    }

    for (int size = 10; size <= arraySize; size *= 10)
    {
        for (int i = 0; i < 5; i++)
        {
            benchmark(c, a, b, size);
        }
    }

    delete[] a;
    delete[] b;
    delete[] c;

    // cudaDeviceReset must be called before exiting in order for profiling and
    // tracing tools such as Nsight and Visual Profiler to show complete traces.
    cudaError_t cudaStatus = cudaDeviceReset();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceReset failed!");
        return 1;
    }

    return 0;
}

bool benchmark(int *c, const int *a, const int *b, unsigned int arraySize)
{
    printf("\nBenchmark for size\t%d\n", arraySize);

    cudaError_t cudaStatus;

    // Add vectors in parallel with memory copy
    printf("addWithCuda\t");
    cudaStatus = addWithCuda(c, a, b, arraySize);
    if (cudaStatus != cudaSuccess || !checkResult(c, a, b, arraySize)) {
        fprintf(stderr, "addWithCuda failed!");
        return false;
    }

    // Add vectors in parallel with zero copy (registered host memory)
    printf("addWithCudaZeroCopyRegister\t");
    cudaStatus = addWithCudaZeroCopyRegister(c, a, b, arraySize);
    if (cudaStatus != cudaSuccess || !checkResult(c, a, b, arraySize)) {
        fprintf(stderr, "addWithCudaZeroCopyRegister failed!");
        return false;
    }

    // Add vectors in parallel with zero copy
    printf("addWithCudaZeroCopyAlloc\t");
    cudaStatus = addWithCudaZeroCopyAlloc(c, a, b, arraySize);
    if (cudaStatus != cudaSuccess || !checkResult(c, a, b, arraySize)) {
        fprintf(stderr, "addWithCudaZeroCopyAlloc failed!");
        return false;
    }

    // Add vectors in parallel with managed memory
    printf("addWithCudaManaged\t");
    cudaStatus = addWithCudaManaged(c, a, b, arraySize);
    if (cudaStatus != cudaSuccess || !checkResult(c, a, b, arraySize)) {
        fprintf(stderr, "addWithCudaManaged failed!");
        return false;
    }

    printf("addWithThrust\t");
    cudaStatus = addWithThrust(c, a, b, arraySize);
    if (cudaStatus != cudaSuccess || !checkResult(c, a, b, arraySize)) {
        fprintf(stderr, "addWithThrust failed!");
        return false;
    }

    return true;
}

bool checkResult(const int* c, const int *a, const int *b, unsigned int size)
{
    for (unsigned int i = 0; i < size; i++)
    {
        if (c[i] != (a[i] + b[i]))
        {
            return false;
        }
    }

    return true;
}

cudaError_t addWithCuda(int *c, const int *a, const int *b, unsigned int size)
{
    int *dev_a = 0;
    int *dev_b = 0;
    int *dev_c = 0;
    cudaError_t cudaStatus;

    const int blocks = (size + threadPerBlock - 1) / threadPerBlock;

    // Record computing time
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);

    // Choose which GPU to run on, change this on a multi-GPU system.
    cudaStatus = cudaSetDevice(0);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?");
        goto Error;
    }

    // Allocate GPU buffers for three vectors (two input, one output).
    cudaStatus = cudaMalloc((void**)&dev_c, size * sizeof(int));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }

    cudaStatus = cudaMalloc((void**)&dev_a, size * sizeof(int));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }

    cudaStatus = cudaMalloc((void**)&dev_b, size * sizeof(int));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed!");
        goto Error;
    }

    // Copy input vectors from host memory to GPU buffers.
    cudaStatus = cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }

    cudaStatus = cudaMemcpy(dev_b, b, size * sizeof(int), cudaMemcpyHostToDevice);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }

    // Launch a kernel on the GPU.
    addKernel<<<blocks, threadPerBlock>>>(dev_c, dev_a, dev_b, size);

    // Check for any errors launching the kernel
    cudaStatus = cudaGetLastError();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "addKernel launch failed: %s\n", cudaGetErrorString(cudaStatus));
        goto Error;
    }
    
    // cudaDeviceSynchronize waits for the kernel to finish, and returns
    // any errors encountered during the launch.
    cudaStatus = cudaDeviceSynchronize();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSynchronize returned error code %d after launching addKernel!\n", cudaStatus);
        goto Error;
    }

    // Copy output vector from GPU buffer to host memory.
    cudaStatus = cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed!");
        goto Error;
    }

    // Get stop time, and display the timing results
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float elapsedTime;
    cudaEventElapsedTime(&elapsedTime, start, stop);
    printf("%3.1f ms\n", elapsedTime);

Error:
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dev_c);
    cudaFree(dev_a);
    cudaFree(dev_b);
    
    return cudaStatus;
}

cudaError_t addWithCudaZeroCopyRegister(int *c, const int *a, const int *b, unsigned int size)
{
    int *dev_a = 0;
    int *dev_b = 0;
    int *dev_c = 0;
    cudaError_t cudaStatus;

    const int blocks = (size + threadPerBlock - 1) / threadPerBlock;

    // Record computing time
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);

    // Register (pin) the existing host memory and map it for device access
    cudaStatus = cudaHostRegister((void *)a, size * sizeof(int), cudaHostRegisterMapped);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaHostRegister A failed! %s\n", cudaGetErrorString(cudaStatus));
        goto Error;
    }

    cudaStatus = cudaHostRegister((void *)b, size * sizeof(int), cudaHostRegisterMapped);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaHostRegister B failed! %s\n", cudaGetErrorString(cudaStatus));
        goto Error;
    }

    cudaStatus = cudaHostRegister((void *)c, size * sizeof(int), cudaHostRegisterMapped);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaHostRegister C failed! %s\n", cudaGetErrorString(cudaStatus));
        goto Error;
    }

    // Get device pointers to the mapped host memory
    cudaStatus = cudaHostGetDevicePointer((void **)&dev_a, (void *)a, 0);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaHostGetDevicePointer failed!");
        goto Error;
    }

    cudaStatus = cudaHostGetDevicePointer((void **)&dev_b, (void *)b, 0);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaHostGetDevicePointer failed!");
        goto Error;
    }

    cudaStatus = cudaHostGetDevicePointer((void **)&dev_c, (void *)c, 0);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaHostGetDevicePointer failed!");
        goto Error;
    }

    // Execute kernel with zero copy memory
    addKernel<<<blocks, threadPerBlock>>>(dev_c, dev_a, dev_b, size);

    // Check for any errors launching the kernel
    cudaStatus = cudaGetLastError();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "addKernel launch failed: %s\n", cudaGetErrorString(cudaStatus));
        goto Error;
    }

    // cudaDeviceSynchronize waits for the kernel to finish, and returns
    // any errors encountered during the launch.
    cudaStatus = cudaDeviceSynchronize();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSynchronize returned error code %d after launching addKernel!\n", cudaStatus);
        goto Error;
    }

    // Get stop time, and display the timing results
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float elapsedTime;
    cudaEventElapsedTime(&elapsedTime, start, stop);
    printf("%3.1f ms\n", elapsedTime);

Error:
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaHostUnregister((void *)a);
    cudaHostUnregister((void *)b);
    cudaHostUnregister((void *)c);

    return cudaStatus;
}

cudaError_t addWithCudaZeroCopyAlloc(int *c, const int *a, const int *b, unsigned int size)
{
    int *host_a = 0;
    int *host_b = 0;
    int *host_c = 0;

    int *dev_a = 0;
    int *dev_b = 0;
    int *dev_c = 0;
    cudaError_t cudaStatus;

    const int blocks = (size + threadPerBlock - 1) / threadPerBlock;

    // Record computing time
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);

    // Allocate zero copy memory
    cudaStatus = cudaHostAlloc((void **)&host_a, size * sizeof(int), cudaHostAllocWriteCombined | cudaHostAllocMapped);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaHostAlloc failed!");
        goto Error;
    }

    cudaStatus = cudaHostAlloc((void **)&host_b, size * sizeof(int), cudaHostAllocWriteCombined | cudaHostAllocMapped);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaHostAlloc failed!");
        goto Error;
    }

    cudaStatus = cudaHostAlloc((void **)&host_c, size * sizeof(int), cudaHostAllocMapped);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaHostAlloc failed!");
        goto Error;
    }

    // Copy data on host
    memcpy(host_a, a, size * sizeof(int));
    memcpy(host_b, b, size * sizeof(int));

    // Get device pointers to the mapped host memory
    cudaStatus = cudaHostGetDevicePointer((void **)&dev_a, (void *)host_a, 0);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaHostGetDevicePointer failed!");
        goto Error;
    }

    cudaStatus = cudaHostGetDevicePointer((void **)&dev_b, (void *)host_b, 0);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaHostGetDevicePointer failed!");
        goto Error;
    }

    cudaStatus = cudaHostGetDevicePointer((void **)&dev_c, (void *)host_c, 0);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaHostGetDevicePointer failed!");
        goto Error;
    }

    // Execute kernel with zero copy memory
    addKernel<<<blocks, threadPerBlock>>>(dev_c, dev_a, dev_b, size);

    // Check for any errors launching the kernel
    cudaStatus = cudaGetLastError();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "addKernel launch failed: %s\n", cudaGetErrorString(cudaStatus));
        goto Error;
    }

    // cudaDeviceSynchronize waits for the kernel to finish, and returns
    // any errors encountered during the launch.
    cudaStatus = cudaDeviceSynchronize();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSynchronize returned error code %d after launching addKernel!\n", cudaStatus);
        goto Error;
    }

    // Copy the result back
    memcpy(c, host_c, size * sizeof(int));

    // Get stop time, and display the timing results
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float elapsedTime;
    cudaEventElapsedTime(&elapsedTime, start, stop);
    printf("%3.1f ms\n", elapsedTime);

Error:
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFreeHost(host_a);
    cudaFreeHost(host_b);
    cudaFreeHost(host_c);

    return cudaStatus;
}

cudaError_t addWithCudaManaged(int *c, const int *a, const int *b, unsigned int size)
{
    int *managed_a = 0;
    int *managed_b = 0;
    int *managed_c = 0;
    cudaError_t cudaStatus;

    const int blocks = (size + threadPerBlock - 1) / threadPerBlock;

    // Record computing time
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);

    // Allocate managed memory, accessible from both host and device
    cudaStatus = cudaMallocManaged((void **)&managed_a, size * sizeof(int));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMallocManaged failed!");
        goto Error;
    }

    cudaStatus = cudaMallocManaged((void **)&managed_b, size * sizeof(int));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMallocManaged failed!");
        goto Error;
    }

    cudaStatus = cudaMallocManaged((void **)&managed_c, size * sizeof(int));
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaMallocManaged failed!");
        goto Error;
    }

    // Copy the input data on the host to the managed zone
    memcpy(managed_a, a, size * sizeof(int));
    memcpy(managed_b, b, size * sizeof(int));

    // Execute kernel with managed memory
    addKernel<<<blocks, threadPerBlock>>>(managed_c, managed_a, managed_b, size);

    // Check for any errors launching the kernel
    cudaStatus = cudaGetLastError();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "addKernel launch failed: %s\n", cudaGetErrorString(cudaStatus));
        goto Error;
    }

    // cudaDeviceSynchronize waits for the kernel to finish, and returns
    // any errors encountered during the launch.
    cudaStatus = cudaDeviceSynchronize();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSynchronize returned error code %d after launching addKernel!\n", cudaStatus);
        goto Error;
    }

    // Copy the result back
    memcpy(c, managed_c, size * sizeof(int));

    // Get stop time, and display the timing results
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float elapsedTime;
    cudaEventElapsedTime(&elapsedTime, start, stop);
    printf("%3.1f ms\n", elapsedTime);

Error:
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(managed_a);
    cudaFree(managed_b);
    cudaFree(managed_c);

    return cudaStatus;
}

cudaError_t addWithThrust(int *c, const int *a, const int *b, unsigned int size)
{
    cudaError_t cudaStatus;
    const int blocks = (size + threadPerBlock - 1) / threadPerBlock;

    // Record computing time
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);

    thrust::device_vector<int> dev_a(a, a + size);
    thrust::device_vector<int> dev_b(b, b + size);
    thrust::device_vector<int> dev_c(size);

    // Execute kernel on the Thrust device vectors
    addKernel<<<blocks, threadPerBlock>>>(thrust::raw_pointer_cast(dev_c.data()),
                                          thrust::raw_pointer_cast(dev_a.data()),
                                          thrust::raw_pointer_cast(dev_b.data()),
                                          size);

    // Check for any errors launching the kernel
    cudaStatus = cudaGetLastError();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "addKernel launch failed: %s\n", cudaGetErrorString(cudaStatus));
    }

    // cudaDeviceSynchronize waits for the kernel to finish, and returns
    // any errors encountered during the launch.
    cudaStatus = cudaDeviceSynchronize();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSynchronize returned error code %d after launching addKernel!\n", cudaStatus);
    }

    // Copy back the results
    thrust::copy(dev_c.begin(), dev_c.end(), c);

    // Get stop time, and display the timing results
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float elapsedTime;
    cudaEventElapsedTime(&elapsedTime, start, stop);
    printf("%3.1f ms\n", elapsedTime);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    return cudaStatus;
}