CUDA Block and Grid

03-12-2019

Introduction

I just started learning CUDA and read the useful NVIDIA blog post “An Even Easier Introduction to CUDA”. However, I found that the images of “Block” and “Grid” in that post did not quite match its code, so I want to present the idea in a way that does.

Basic Code

This is the piece of CUDA code that I copied from the blog post.

#include <iostream>
#include <math.h>
// Kernel function to add the elements of two arrays
__global__
void add(int n, float *x, float *y)
{
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < n; i += stride)
    y[i] = x[i] + y[i];
}

int main(void)
{
  int N = 1<<20;
  float *x, *y;

  // Allocate Unified Memory – accessible from CPU or GPU
  cudaMallocManaged(&x, N*sizeof(float));
  cudaMallocManaged(&y, N*sizeof(float));

  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Run the kernel
  int blockSize = 256;
  int numBlocks = (N + blockSize - 1) / blockSize;
  // add<<<1, blockSize>>>(N, x, y);
  add<<<numBlocks, blockSize>>>(N, x, y);

  // Wait for GPU to finish before accessing on host
  cudaDeviceSynchronize();

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  // Free memory
  cudaFree(x);
  cudaFree(y);

  return 0;
}
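
The numBlocks expression in main is an integer ceiling division: it rounds N / blockSize up so that a partial final block is still launched when N is not a multiple of blockSize. A minimal host-only C++ sketch of that arithmetic follows; the helper name ceil_div is hypothetical and not part of the CUDA API.

```cpp
#include <cassert>

// Hypothetical helper (not a CUDA API): the smallest number of blocks of
// block_size threads whose threads cover at least n elements.
int ceil_div(int n, int block_size)
{
  assert(n >= 0 && block_size > 0);
  return (n + block_size - 1) / block_size;
}
```

With N = 1<<20 and blockSize = 256 this gives 4096 blocks. With n = 1000 it would give 4 blocks (1024 threads), and the `i < n` condition in the kernel loop keeps the 24 surplus threads from touching memory past the end of the arrays.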

Block and Grid

I found that Figure 1 in the NVIDIA blog post did not quite reflect how the add function is executed in parallel, so I made my own versions.

Block

A block consists of many threads. In our case, block_dim == block_size == num_threads == 256, where block_dim is blockDim.x in the kernel and block_size is blockSize in main.

In the above figure, each small rectangle is one element of the array. With only one block, the parallel process can be pictured as block_dim pointers moving through the array independently, each one jumping forward by block_dim elements at a time. That is why the index advances with a stride of block_dim in the following add function, which assumes a single block.

__global__
void add(int n, float *x, float *y)
{
  int index = threadIdx.x;
  int stride = blockDim.x;
  for (int i = index; i < n; i += stride)
    y[i] = x[i] + y[i];
}
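
To check the "block_dim pointers" picture without a GPU, the single-block kernel can be emulated on the CPU. The sketch below is only an illustration under that assumption: it runs the "threads" one after another instead of in parallel, and the function name visits_single_block is hypothetical.

```cpp
#include <cassert>
#include <vector>

// CPU emulation of the single-block kernel: "thread" t starts at index t
// and advances by block_dim. Returns how many times each element is visited.
std::vector<int> visits_single_block(int n, int block_dim)
{
  assert(n >= 0 && block_dim > 0);
  std::vector<int> visits(n, 0);
  for (int thread_idx = 0; thread_idx < block_dim; ++thread_idx) {
    int index = thread_idx;  // plays the role of threadIdx.x
    int stride = block_dim;  // plays the role of blockDim.x
    for (int i = index; i < n; i += stride)
      ++visits[i];
  }
  return visits;
}
```

Every entry of the returned vector should be 1: each element is handled by exactly one thread, and thread t handles elements t, t + block_dim, t + 2 * block_dim, and so on.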

Grid

Similarly, a grid consists of many blocks. In our case, grid_dim == grid_size == 4096, because numBlocks = (N + blockSize - 1) / blockSize = 2^20 / 256 = 4096.

In the above figure, each small rectangle is a block in the grid. The parallel process can be pictured as block_dim * grid_dim pointers moving through the array independently. That is why the index advances with a stride of block_dim * grid_dim in the following add function.

__global__
void add(int n, float *x, float *y)
{
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < n; i += stride)
    y[i] = x[i] + y[i];
}
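
The same CPU emulation idea extends to the grid-stride loop. The sketch below counts how many elements each global thread would process for a given launch configuration; it is a sequential illustration only, and the function name elements_per_thread is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU emulation of the grid-stride loop. Returns, for each global thread id,
// the number of array elements that thread would process.
std::vector<int> elements_per_thread(int n, int grid_dim, int block_dim)
{
  assert(n >= 0 && grid_dim > 0 && block_dim > 0);
  std::vector<int> counts(static_cast<std::size_t>(grid_dim) * block_dim, 0);
  for (int block_idx = 0; block_idx < grid_dim; ++block_idx) {
    for (int thread_idx = 0; thread_idx < block_dim; ++thread_idx) {
      int index = block_idx * block_dim + thread_idx;  // global thread id
      int stride = block_dim * grid_dim;               // total threads in the grid
      for (int i = index; i < n; i += stride)
        ++counts[index];
    }
  }
  return counts;
}
```

With the launch configuration from the post (4096 blocks of 256 threads, n = 1<<20) every count is 1, so each thread runs the loop body exactly once. With a single block of 256 threads, each thread handles 4096 elements instead.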

Final Remarks

I personally find it easier to understand blocks and grids from the CUDA code with my figures than with the one in the original blog post. That figure is also correct, though, if you think of it this way: a grid wraps a bunch of blocks, a block wraps a bunch of threads, and a thread wraps a bunch of basic array elements.

CUDA Block and Grid

https://leimao.github.io/blog/CUDA-Concept-Block-Grid/

Author

Lei Mao

Posted on

03-12-2019

Updated on

03-12-2019
