Lei Mao

Machine Learning, Artificial Intelligence, Computer Science.

CUDA Block and Grid

Introduction

I just started learning CUDA and read the useful NVIDIA blog post “An Even Easier Introduction to CUDA”. However, I found that the figures of “Block” and “Grid” in that post did not quite match its code, so I want to illustrate the two concepts in a way that follows the code more closely.

Basic Code

This is the piece of CUDA code that I copied from the blog post.

#include <iostream>
#include <math.h>

// Kernel function to add the elements of two arrays
__global__
void add(int n, float *x, float *y)
{
    // Global index of this thread across the whole grid
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    // Total number of threads in the grid
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}

int main(void)
{
    int N = 1<<20;
    float *x, *y;

    // Allocate Unified Memory – accessible from CPU or GPU
    cudaMallocManaged(&x, N*sizeof(float));
    cudaMallocManaged(&y, N*sizeof(float));

    // initialize x and y arrays on the host
    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }

    // Run the kernel with enough blocks of 256 threads to cover all N elements
    int blockSize = 256;
    int numBlocks = (N + blockSize - 1) / blockSize;
    add<<<numBlocks, blockSize>>>(N, x, y);

    // Wait for GPU to finish before accessing on host
    cudaDeviceSynchronize();

    // Check for errors (all values should be 3.0f)
    float maxError = 0.0f;
    for (int i = 0; i < N; i++)
        maxError = fmax(maxError, fabs(y[i]-3.0f));
    std::cout << "Max error: " << maxError << std::endl;

    // Free memory
    cudaFree(x);
    cudaFree(y);

    return 0;
}
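
The numBlocks expression in main is an integer ceiling division: it picks the smallest number of blocks whose threads together cover all N elements. Below is a minimal host-side sketch of that arithmetic in plain C++ (no GPU needed). The helper name blocksNeeded is my own and is not part of CUDA or the NVIDIA post.

```cpp
#include <cassert>

// Hypothetical helper (not a CUDA API): smallest number of blocks of
// `blockSize` threads that provides at least `n` threads in total.
int blocksNeeded(int n, int blockSize)
{
    assert(n >= 0 && blockSize > 0);
    // Integer ceiling division, the same expression as numBlocks in main().
    return (n + blockSize - 1) / blockSize;
}
```

With the values from the post, blocksNeeded(1 << 20, 256) is 4096. For a length that is not a multiple of the block size, such as 1000, it rounds up to 4, and the i < n bound in the kernel loop keeps the surplus threads from touching memory past the end of the arrays.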


Block and Grid

I found that Figure 1 in the NVIDIA blog post did not quite reflect how the add function is executed in parallel, so I made my own versions.

Block

A block consists of many threads. In our case, block_dim == block_size == num_threads == 256.

In the above figure, each small rectangle is one element of the array. When there is only one block, the parallel process can be imagined as block_dim pointers moving asynchronously: thread t starts at element t and then jumps ahead block_dim elements at a time. That is why the index moves with a stride of block_dim in the following add function, which is the version used when there is only one block.

__global__
void add(int n, float *x, float *y)
{
    // With a single block, the thread index is also the starting element
    int index = threadIdx.x;
    int stride = blockDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}
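
To make the "moving pointers" picture concrete, here is a small host-side sketch in plain C++ that replays the single-block loop for one thread and records which elements it would touch. It only imitates the index arithmetic on the CPU. The function name singleBlockIndices and its parameter names are hypothetical stand-ins for threadIdx.x and blockDim.x.

```cpp
#include <cassert>
#include <vector>

// Hypothetical host-side helper (plain C++, not a CUDA API): lists the array
// indices that thread `threadIdxX` of a single block of `blockDimX` threads
// would visit in an array of length `n`.
std::vector<int> singleBlockIndices(int threadIdxX, int blockDimX, int n)
{
    assert(blockDimX > 0 && threadIdxX >= 0 && threadIdxX < blockDimX);
    std::vector<int> visited;
    int index = threadIdxX;  // starting element, as in the kernel
    int stride = blockDimX;  // one block: jump ahead by the block size
    for (int i = index; i < n; i += stride)
        visited.push_back(i);
    return visited;
}
```

For a toy block of 4 threads and an array of 10 elements, thread 1 visits elements 1, 5, and 9, while thread 3 visits 3 and 7. With the real sizes (256 threads, 2^20 elements), each thread would walk over 4096 elements on its own.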


Grid

Similarly, a grid consists of many blocks. In our case, grid_dim == grid_size == 4096, which is N / block_size = 1048576 / 256.

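In the above figure, each small rectangle is a block in the grid. The parallel process can be imagined as block_dim * grid_dim pointers moving asynchronously. That is why the index moves with a stride of block_dim * grid_dim in the following add function. Note that with our sizes, block_dim * grid_dim = 256 * 4096 equals N, so each thread ends up handling exactly one element; the loop only iterates more than once when the array is longer than the total number of threads.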

__global__
void add(int n, float *x, float *y)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}
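
The same replay trick works for the full grid. The sketch below, again plain C++ on the host rather than real device code, lists the elements one (block, thread) pair would visit and checks that all threads together touch every element exactly once. The names gridStrideIndices and coversEachElementOnce are hypothetical, and the parameters stand in for blockIdx.x, threadIdx.x, blockDim.x, and gridDim.x.

```cpp
#include <cassert>
#include <vector>

// Hypothetical host-side helper (not a CUDA API): indices visited by thread
// `threadIdxX` of block `blockIdxX` in a grid of `gridDimX` blocks with
// `blockDimX` threads each, for an array of length `n`.
std::vector<int> gridStrideIndices(int blockIdxX, int threadIdxX,
                                   int blockDimX, int gridDimX, int n)
{
    assert(blockDimX > 0 && gridDimX > 0);
    assert(blockIdxX >= 0 && blockIdxX < gridDimX);
    assert(threadIdxX >= 0 && threadIdxX < blockDimX);
    std::vector<int> visited;
    int index = blockIdxX * blockDimX + threadIdxX;  // global thread index
    int stride = blockDimX * gridDimX;               // total threads in grid
    for (int i = index; i < n; i += stride)
        visited.push_back(i);
    return visited;
}

// Replays every thread of the grid and reports whether each of the `n`
// elements is visited exactly once.
bool coversEachElementOnce(int blockDimX, int gridDimX, int n)
{
    std::vector<int> hits(n, 0);
    for (int b = 0; b < gridDimX; ++b)
        for (int t = 0; t < blockDimX; ++t)
            for (int i : gridStrideIndices(b, t, blockDimX, gridDimX, n))
                ++hits[i];
    for (int h : hits)
        if (h != 1)
            return false;
    return true;
}
```

For a toy grid of 2 blocks with 4 threads each and 20 elements, thread 2 of block 1 starts at index 6 and then jumps by 8 to index 14. With the sizes from the post (256 threads, 4096 blocks, 2^20 elements), the last thread of the last block visits only element 1048575.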


Final Remarks

I personally feel it is easier to understand the concepts of block and grid alongside the CUDA code using my figures rather than the one in the original blog post. That figure is also correct, though, if you think of it this way: a grid wraps a bunch of blocks, a block wraps a bunch of threads, and a thread wraps a bunch of basic array elements.
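
The "wrapping" view can also be read backwards: given an array element, which thread of which block owns it? Here is a minimal sketch of that inverse mapping in plain host-side C++, assuming the grid-stride indexing used in the add kernel. The struct ThreadCoord and the function ownerOfElement are hypothetical names I made up for illustration.

```cpp
#include <cassert>

// Hypothetical pair of coordinates, standing in for blockIdx.x / threadIdx.x.
struct ThreadCoord {
    int blockIdxX;
    int threadIdxX;
};

// Hypothetical helper (not a CUDA API): finds the block and thread that
// process element `i` under the grid-stride loop of the add kernel.
ThreadCoord ownerOfElement(int i, int blockDimX, int gridDimX)
{
    assert(i >= 0 && blockDimX > 0 && gridDimX > 0);
    int stride = blockDimX * gridDimX;  // total threads in the grid
    int globalThread = i % stride;      // undo the "i += stride" jumps
    return ThreadCoord{globalThread / blockDimX, globalThread % blockDimX};
}
```

In the toy grid of 2 blocks with 4 threads each, element 14 belongs to thread 2 of block 1. With the sizes from the post, element 1048575 belongs to thread 255 of block 4095.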