Pass Function Pointers to Kernels in CUDA Programming

Introduction

Ever since I started learning CUDA, my impression of CUDA kernels has been that they are very isolated pieces of code with lots of restrictions. Because of this, I used to write CUDA kernel functions that duplicated code to do similar jobs. Today, let us take a look at how to use C++ templates and function pointers with CUDA kernels to reduce code duplication.

It should be noted that, to the best of my knowledge, there is no similar tutorial on this. I experimented a lot and have made the final program available to the public.

Tutorial

Code

The following is the code to compute the sum and the product of two values by passing different function pointers to the CUDA kernel. It also uses C++ templates extensively. The code is also available on my GitHub Gist.

The key to passing function pointers to a CUDA kernel is to declare static device-side pointers to the device functions, and then copy those pointers to the host side with cudaMemcpyFromSymbol. This step is necessary because taking the address of a __device__ function directly in host code does not yield a valid device function pointer. Skipping it is a reliable way to get all kinds of weird errors.

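Before the full program, here is the pattern in isolation. This is a minimal sketch of my own (the square, p_square, and apply names are hypothetical and not part of the program below), just to show the three steps: define a __device__ function, store its address in a static __device__ pointer, and copy that pointer to the host before launching the kernel.

#include <cstdio>

// Step 1: a device function we want to pass around
__device__ float square(float x) { return x * x; }

// Step 2: a static device-side pointer holding the device function's address
__device__ float (*p_square)(float) = square;

__global__ void apply(float (*op)(float), float x)
{
    printf("%f\n", op(x));
}

int main()
{
    float (*h_square)(float);
    // Step 3: copy the device function pointer to the host, then launch
    cudaMemcpyFromSymbol(&h_square, p_square, sizeof(h_square));
    apply<<<1, 1>>>(h_square, 3.0f);
    cudaDeviceSynchronize();
}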
#include <iostream>

// Alias template for a binary-operation function pointer (since C++11)
template <typename T>
using func_t = T (*)(T, T);

template <typename T>
__device__ T add_func(T x, T y)
{
    return x + y;
}

template <typename T>
__device__ T mul_func(T x, T y)
{
    return x * y;
}

// Required for passing function pointers as kernel arguments:
// static device-side pointers to the device functions
template <typename T>
__device__ func_t<T> p_add_func = add_func<T>;
template <typename T>
__device__ func_t<T> p_mul_func = mul_func<T>;

template <typename T>
__global__ void kernel(func_t<T> op, T * d_x, T * d_y, T * result)
{
    *result = (*op)(*d_x, *d_y);
}

template <typename T>
void test(T x, T y)
{
    func_t<T> h_add_func;
    func_t<T> h_mul_func;

    T * d_x, * d_y;
    cudaMalloc(&d_x, sizeof(T));
    cudaMalloc(&d_y, sizeof(T));
    cudaMemcpy(d_x, &x, sizeof(T), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, &y, sizeof(T), cudaMemcpyHostToDevice);

    T result;
    T * d_result, * h_result;
    cudaMalloc(&d_result, sizeof(T));
    h_result = &result;

    // Copy the device function pointers to the host side
    cudaMemcpyFromSymbol(&h_add_func, p_add_func<T>, sizeof(func_t<T>));
    cudaMemcpyFromSymbol(&h_mul_func, p_mul_func<T>, sizeof(func_t<T>));

    kernel<T><<<1, 1>>>(h_add_func, d_x, d_y, d_result);
    cudaDeviceSynchronize();
    cudaMemcpy(h_result, d_result, sizeof(T), cudaMemcpyDeviceToHost);
    std::cout << "Sum: " << result << std::endl;

    kernel<T><<<1, 1>>>(h_mul_func, d_x, d_y, d_result);
    cudaDeviceSynchronize();
    cudaMemcpy(h_result, d_result, sizeof(T), cudaMemcpyDeviceToHost);
    std::cout << "Product: " << result << std::endl;

    // Release the device memory
    cudaFree(d_x);
    cudaFree(d_y);
    cudaFree(d_result);
}

int main()
{
    std::cout << "Test for type int ..." << std::endl;
    test<int>(2.05, 10.00);

    std::cout << "Test for type float ..." << std::endl;
    test<float>(2.05, 10.00);

    std::cout << "Test for type double ..." << std::endl;
    test<double>(2.05, 10.00);
}
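The listing above omits error checking for brevity. In practice, it is good to wrap every CUDA runtime call in a small checking macro so that failures are reported with file and line information. Below is a minimal sketch; the macro name CHECK_CUDA is my own convention, not part of the CUDA API.

#include <cstdio>
#include <cstdlib>

// Abort with file and line information if a CUDA runtime call fails
#define CHECK_CUDA(call)                                                \
    do {                                                                \
        cudaError_t err = (call);                                       \
        if (err != cudaSuccess) {                                       \
            std::fprintf(stderr, "CUDA error at %s:%d: %s\n", __FILE__, \
                         __LINE__, cudaGetErrorString(err));            \
            std::exit(EXIT_FAILURE);                                    \
        }                                                               \
    } while (0)

// Usage: CHECK_CUDA(cudaMalloc(&d_x, sizeof(T)));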

Compile

To compile the program, use nvcc.

$ nvcc main.cu -o main
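By default, nvcc compiles for a baseline GPU architecture. If you want to target your specific GPU, you may pass its compute capability explicitly with the -arch flag; sm_70 below is just an example, so substitute the value for your own card.

$ nvcc main.cu -o main -arch=sm_70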

Run

If the program compiles successfully, you should see the following output when you run it.

$ ./main
Test for type int ...
Sum: 12
Product: 20
Test for type float ...
Sum: 12.05
Product: 20.5
Test for type double ...
Sum: 12.05
Product: 20.5
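Note that for the int instantiation, the arguments 2.05 and 10.00 are truncated to 2 and 10 when converted to int at the call site, which is why the sum and product are 12 and 20 rather than 12.05 and 20.5.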
