CUDA C/C++ 기초

References

Fundamentals of Accelerated Computing with CUDA C/C++ (NVIDIA Online Training)

Writing Application Code for the GPU(CUDA C/C++)
CUDA Thread Hierarchy
Allocating Memory to be accessed on the GPU and the CPU
Grid Size Mismatch
Grid-Stride Loops
Error Handling

CUDA를 공부하면서 여러 포스팅들을 작성했었는데, 최근까지 CUDA를 자주 사용하지는 않아서 잊어버린 부분들이 많았습니다. 우연히 NVIDIA Online Training를 접할 수 있는 기회가 생겨서, 이번 기회에 NVIDIA에서 제공하는 온라인 트레이닝을 들어보면서 기본적인 내용들을 다시 복습하며 정리해보고자 합니다.

Accelerated Systems

Accelerated Systems 또는 Heteroneneous Systems는 CPU와 GPU가 혼합된 것을 의미합니다. 이러한 시스템은 CPU 프로그램을 실행하고, GPU에서 제공되는 대규모 병렬처리의 이점을 누릴 수 있는 함수를 실행합니다. Accelerated Systems은 다양하겠지만, 우리는 NVIDIA GPU에 대해서 집중적으로 살펴보도록 하겠습니다.

GPU-accelerated VS. CPU-only Applications

어떠한 데이터를 초기화(initialize)하고, 태스크를 수행(performWork)하고 그 결과를 검증(verifyWork)하는 일련의 과정이 있다고 가정해봅시다.

만약 CPU만 사용하는 프로그램이라면 다음과 같이 모든 과정이 순차적으로 진행됩니다.

반면에, GPU 가속을 사용하는 프로그램의 경우에는 데이터를 처리하는 태스크를 병렬로 수행할 수 있습니다. 따라서 performWork() 함수의 수행 시간이 훨씬 더 짧아지게 됩니다.

위 과정에서 데이터의 메모리는 cudaMallocManaged() 함수를 통해서 할당되었다고 가정합니다. 따라서 별도의 커맨드없이 메모리는 CPU에서 GPU로, GPU에서 CPU로 migration합니다. 뒤에서 조금 더 살펴볼 예정이니 여기서는 태스크를 수행하는 작업이 CPU만을 사용하는 것보다 더 빠르다는 것에 주목합니다.

또한, GPU는 비동기(asynchronous)로 동작하기 때문에 CPU가 동시에 다른 작업을 수행할 수 있습니다. 따라서, CPU 코드와 GPU의 비동기 작업의 동기화(synchronize)를 위해서 GPU 작업이 완료될 때까지 기다려야 합니다. 이는 CUDA APIs의 cudaDeviceSynchronize()를 통해서 가능합니다.

Writing Application Code for the GPU

CUDA는 많은 일반적인 프로그래밍 언어에서 지원되며, 이번 포스팅에서는 C/C++을 사용합니다. 기존에 익숙한 언어를 사용하기 때문에 GPU를 사용하기 위해서 언어를 새롭게 배울 필요가 없으며, GPU를 사용하기 위한 확장된 것들만 숙지하면 됩니다.

CUDA-가속 프로그램은 '.cu'라는 파일 확장자를 갖습니다. 아래의 코드를 살펴보도록 하겠습니다.

 #include <stdio.h>
 
void CPUFunction()
{
    printf("This function is defined to run on the CPU.\n");
}
 
__global__ void GPUFunction()
{
    printf("This function is defined to run on the GPU.\n");
}
 
int main()
{
    CPUFunction();
 
    GPUFunction<<<1, 1>>>();
    cudaDeviceSynchronize();
}

위 코드에는 두 개의 함수가 있는데, 첫 번째 함수는 CPU에서 동작하는 함수이며 두 번째 함수는 GPU에서 동작하는 함수입니다.

GPU에서 동작하는 함수를 살펴보면, 다음과 같이 정의되어 있습니다.

__global__ void GPUFucntion()

여기서 '__global__' 키워드는 이 함수가 GPU에서 실행되며 globally하게 호출할 수 있다는 것을 의미합니다. global하다는 것은 CPU에 의해 호출될 수도 있고, GPU에 의해 호출될 수도 있다는 것을 의미합니다.

일반적으로 CPU에서 실행되는 코드를 host 코드라고 부르고, GPU에서 실행되는 코드를 device 코드라고 부릅니다.

여기서 주목할 점은 이 함수의 리턴 타입이 void라는 것인데, __global__ 키워드와 함께 정의된 함수는 반드시 리턴 타입이 void이어야만 합니다.

이렇게 정의한 GPU 함수는 일반적으로 커널(kernel)이라고 부르며, 이를 런치(launch)한다라고 합니다.

launch를 시작, 구동, 호출 등 다양하게 부르는 것 같습니다. 여기서는 '호출'한다라고 지칭하도록 하겠습니다.

위 코드에서 main 함수를 살펴보면, 다음과 같이 커널을 호출합니다.

GPUFunction<<<1, 1>>>();

커널을 호출할 때는 execution configuration을 제공해야 하는데, 이는 '<<< ... >>>' 문법을 커널 함수 이름과 인자 리스트를 감싸는 괄호 사이에 위치시킵니다. High level에서 execution configuration를 커널 실행을 위한 thread hierarchy를 지정합니다. 이 스레드 계층은 스레드 그룹(block)의 수와 각 블록에서 실행할 스레드 갯수를 정의합니다.

여기서는 1개의 스레드 블록(첫 번째 인수)과 1개의 스레드(두 번째 인수)로 지정하였습니다.

main 함수 마지막에는 CPU와 GPU의 동기화를 위한 CUDA API를 호출합니다.

cudaDeviceSynchronize();

위에서 언급했듯이, 실행된 커널은 비동기로 동작합니다. 따라서, 별다른 조치를 취하지 않으면 CPU 코드는 커널 실행이 완료될 때까지 기다리지 않고 계속 진행됩니다. CUDA runtime에서 제공되는 cudaDeviceSynchronize() 함수를 호출하면 host 코드(GPU)가 device 코드(GPU)가 완료될 때까지 기다리도록 하며, CPU는 GPU 커널 실행이 완료된 이후에 다시 재개됩니다.

Compiling and Running CUDA Code

이번에는 작성한 .cu 프로그램을 컴파일하는 방법에 대해서 살펴보도록 하겠습니다.

CUDA 플랫폼은 NVIDIA CUDA 컴파일러인 nvcc와 함께 제공되며, 이를 통해 host와 device 코드를 모두 포함하는 CUDA-가속 어플리케이션을 컴파일할 수 있습니다. nvcc에 대한 자세한 내용은 공식 문서(link)에서 살펴볼 수 있습니다.

그러면 이제 아래 코드를 컴파일하고 실행해보도록 하겠습니다.

 // 01_hello-gpu.cu
#include <stdio.h>
 
void helloCPU()
{
    printf("Hello from the CPU.\n");
}
 
__global__ void helloGPU()
{
    printf("Hello from the GPU.\n");
}
 
int main()
{
    helloCPU();
 
    helloGPU<<<1, 1>>>();
    cudaDeviceSynchronize();
}

nvcc는 gcc와 매우 유사한데, 위 파일(01_hello-gpu.cu)을 컴파일하기 위해서는 터미널에 다음과 같이 커맨드를 입력하면 됩니다.

nvcc -arch=sm_75 -o hello-gpu 01_hello-gpu.cu -run

여기서 o 플래그는 output file을 지정해주기 위해 사용되며, arch 플래그는 컴파일되는 파일이 어떤 아키텍처를 위한 것인지 지정합니다. 저의 경우, 사용하는 GPU의 compute capability가 7.5이기 때문에 sm_75로 입력하였습니다.

이에 대한 자세한 내용은 공식 문서의 arch_flag를 참조하시길 바랍니다.

마지막의 run 플래그는 컴파일이 성공적으로 수행된 이후에 바로 실행하도록 해주는 옵션입니다.

CUDA Thread Hierarchy

Launching Parallel Kernels

Execution Configuration을 통해 프로그래머가 커널을 병렬로 실행되도록 세부사항을 지정할 수 있습니다. 자세히 말하자면 execution configuration을 통해서 스레드 블록(thread blocks)의 수와 각 스레드 블록에 포함되는 스레드의 수를 지정할 수 있습니다. 문법은 다음과 같습니다.

'<<< NUMBER_OF_BLOCKS, NUMBER_OF_THREADS_PER_BLOCK >>>'

포스팅 초반에 살펴봤던, 일련의 프로그램 흐름을 다시 가져왔습니다.

여기서 performWork()는 데이터에 어떠한 작업을 수행하는 함수이며, GPU에서 동작하기 때문에 커널이라고 할 수 있습니다. 이러한 커널은 GPU에서 병렬로 수행될 수 있습니다.

그 작업들은 하나의 스레드(thread)에서 수행되며, 많은 스레드들이 병렬로 실행될 수 있습니다. 위 그림에서 초록색 테두리의 사각형들이 바로 각각의 스레드입니다.

그리고 스레드들이 모여서 블록(block)을 이루는데, 여기서는 4개의 스레드가 모여 하나의 블록을 구성하고 있습니다. 물론 많은 블록들이 존재할 수 있으며, 여기서는 두 개의 블록이 존재합니다.

여러 블록들은 하나의 그리드(grid)를 형성합니다. 여기서는 두 개의 블록이 모여 하나의 그리드를 구성하고 있습니다.

커널은 다음의 문법을 통해서 호출됩니다.

performWork<<<2, 4>>>();

execution configuration은 '<<<2, 4>>>'로 지정되어 있으며, 첫 번째 인수는 그리드에서 블록의 수를 정의하고 두 번째 인수는 각 블록에서의 스레드 수를 정의합니다. 따라서, 이 경우에는 커널을 실행하는데 하나의 그리드에 두 개의 블록이 있으며, 각 블록에는 4개의 스레드를 가지도록 정의합니다.

아래 코드를 컴파일하고 실행해보도록 하겠습니다.

 // 02_first-parallel.cu
#include <stdio.h>
 
__global__
void firstParallel()
{
    printf("This is running in parallel.\n");
}
 
int main()
{
    firstParallel<<<5, 5>>>();
    cudaDeviceSynchronize();
}

firstParallel이라는 커널의 정의되어 있으며, main 함수에서 이 커널을 총 25개의 스레드(5개의 블록, 각 블록은 5개의 스레드로 구성됨)에서 수행하도록 지정하였습니다.

전체 결과창을 캡처하지는 않았지만, 'This is running in parallel.'이라는 문장이 총 25번 출력되는 것을 확인할 수 있을 것입니다.

CUDA-Provided Thread Hierarchy Variables

각 스레드들은 해당 스레드가 속한 스레드 블록 내에서 인덱스가 부여되며, 인덱스는 0부터 시작합니다. 또한, 각 블록 또한 0부터 시작되는 인덱스가 부여됩니다. 블록들이 모여 CUDA 스레드 계층의 최상위 개체인 그리드를 이룹니다. 요약하면, CUDA 커널은 하나 이상의 블록들로 구성된 그리드에서 실행되며, 각 스레드 블록은 1개 이상의 동일한 갯수의 스레드들을 포함하고 있습니다.

CUDA 커널에는 커널을 수행하는 스레드의 인덱스와 해당 스레드가 속한 블록의 인덱스를 식별할 수 있는 특별한 변수가 정의되어 있습니다. threadIdx.x와 blockIdx.x 등 여러 변수들이 있는데 간단하게 그림을 통해서 살펴보겠습니다.

위 커널을 실행할 때, 그리드는 총 2개의 블록을 포함하고 있습니다. 커널 내에서 해당 정보는 'gridDim.x'로 접근할 수 있으며 이 변수는 그리드 내에서 블록의 수를 의미합니다. 이 경우, 이 변수의 값은 2가 됩니다.

'blockIdx.x'는 그리드 내에서 현재 블록의 인덱스를 식별합니다. 초록색 테두리 내에 있는 스레드에서 실행 중인 커널에서 이 변수의 값은 0이 됩니다.

커널 내에서 'blockDim.x'는 블록의 스레드 개수입니다. 이 경우, 이 변수의 값은 4가 됩니다. 그리드 내에 모든 블록은 동일한 수의 스레드를 포함하고 있기 때문에 이 값은 어디에서나 동일합니다.

커널 내에서 'threadIdx.x'는 한 블록 내에서의 스레드의 인덱스를 식별합니다. 위의 경우 1번째 블록의 1번째 스레드이므로, 이 값은 0이 됩니다.

또한, 2번째 블록, 1번째 스레드에서의 'threadIdx.x'의 값도 0입니다.

이전에 CUDA의 프로그래밍 모델에 대한 포스팅이 있는데, 이를 참고하셔도 좋을 것 같습니다.

CUDA Programming Model

References https://developer.nvidia.com/blog/cuda-refresher-cuda-programming-model/ CUDA Toolkit Documentation https://github.com/nvidia/cuda-samples Contents CUDA kernel and thread hierarchy Matrix..

junstar92.tistory.com

이렇게 내장된 변수들을 사용한 간단한 코드를 살펴보겠습니다.

 __global__
void printSuccessForCorrectExecutionConfiguration()
{
    if (threadIdx.x == 1023 && blockIdx.x == 255) {
        printf("Success.\n");
    }
}

위 코드를 살펴보면, if문의 조건이 'threadIdx.x == 1023 && blockIdx.x == 255' 입니다. 즉, if문을 통과하려면 해당 스레드는 256번째 블록의 1024번째 스레드이어야만 합니다. 따라서, 'Success'라는 문장을 출력하기 위해서 최소한 다음과 같이 execution configration을 지정해주어야 합니다.

printSuccessForCorrectExecutionConfiguration<<<256, 1024>>>();

Accelerating For Loops

CPU-only 프로그램에서의 for 루프는 GPU-가속을 적용하기에 적합합니다. 루프의 각 반복을 순차적으로 실행하는 대신, 루프의 각 반복을 스레드에서 병렬로 실행할 수 있습니다. 다음의 루프를 살펴보도록 하겠습니다.

 int N = 2 << 20;
for (int i = 0; i < N; i++) {
    printf("%d\n", i);
}

위 루프를 병렬화하려면 어떻게 해야 할까요?

먼저 우리는 커널을 병렬로 수행할 수 있다고 알고 있습니다. 따라서 작성되는 커널은 위 루프의 하나의 반복을 한 스레드에서 수행하도록 작성해야 합니다.

그리고, 커널이 루프의 횟수와 동일하도록 execution configration을 설정해주어야 합니다.

예를 들어, 다음의 루프를 병렬화하는 커널을 작성하면,

 int N = 10;
for (int i = 0; i < N; i++) {
    printf("This is iteration number %d\n", i);
}

다음과 같습니다.

 __global__
void loop()
{
    printf("This is iteration number %d\n", threadIdx.x);
}

이렇게 정의한 커널을 다음과 같이 실행하면,

 int main()
{
    loop<<<1, 10>>>();
    cudaDeviceSynchronize();
}

위와 같은 출력을 확인할 수 있습니다.

Coordinating Parallel Threads

데이터 처리를 예로 들면, GPU-가속을 통해서 배열 또는 벡터에 있는 각 데이터를 병렬로 처리하게 됩니다. 이때, GPU의 각 스레드는 일반적으로 전체 데이터에서 일부 데이터를 처리하게 됩니다. 이때, 각 스레드에서 수행하는 커널이 처리하는 데이터를 배치하는 방법이 필요합니다.

예를 들어, 8개의 데이터를 각 블록이 4개의 스레드로 구성되어 있고, 2개의 블록으로 구성한 그리드로 커널을 실행하여 처리한다고 가정해봅시다. 데이터의 갯수와 총 스레드 갯수가 동일하므로, 하나의 스레드가 하나의 데이터를 처리하도록 데이터와 스레드를 다음과 같이 매핑시켜주면 됩니다.

데이터 매핑은 커널에 내장된 변수들(blockIdx.x, threadIdx.x 등)을 이용합니다. 각 블록은 4개의 스레드를 포함하고 있기 때문에 blockDim.x의 값은 4라는 것을 알 수 있습니다. 그리고 그리드 내에 속한 블록들의 인덱스를 식별하는 blockIdx.x(이 경우, 0과 1)와 각 블록에서 스레드의 인덱스를 식별하는 threadIdx.x(이 경우, 0~3)을 함께 사용하면, 'threadIdx.x + blockIdx.x * blockDim.x' 식을 통해 스레드와 데이터를 1:1로 매핑시켜줄 수 있습니다.

예를 들어, 세 번째 데이터(ex, data[2])는 다음과 같이 매칭될 수 있습니다.

참고로, 스레드 블록 내에서 존재할 수 있는 스레드의 수에 제한이 있는데, 그 수는 1024입니다. 따라서, 병렬 처리량을 증가시키려면 스레드 블록을 여러 개 사용해야 합니다.

위에서 정의한 loop 커널을 다음과 같이 수정하고 스레드 블록의 수를 증가시켜서 실행시켜보면,

 // 05_multiple-block-loop
#include <stdio.h>
 
__global__
void loop()
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    printf("This is iteration number %d\n", idx);
}
 
int main()
{
    loop<<<2, 5>>>();
    cudaDeviceSynchronize();
}

위와 같은 출력을 확인할 수 있습니다.

Data Mapping과 관련하여 이전에 작성한 포스팅이 있는데, 필요하시다면 참조바랍니다 !

CUDA Thread 구조와 Data Mapping (예제 : 이미지 흑백, Blur 처리)

References Programming Massively Parallel Processors CUDA Toolkit Documentation https://github.com/nvidia/cuda-samples Contents CUDA Thread Organization Built-in variables : threadIdx, blockIdx, blo..

junstar92.tistory.com

Grids and Blocks of 2 and 3 Dimensions

그리드와 블록은 3차원까지 정의될 수 있습니다. 다차원으로 정의하더라도 성능에는 영향을 미치지는 않으며, 다차원 그리드/블록을 사용하면 다차원으로 구성된 데이터를 처리할 때 도움이 될 수 있습니다. 예를 들어, 2차원 행렬을 계산할 때 유용합니다. 2차원 또는 3차원의 그리드 또는 블록을 정의하려면, CUDA에서 제공하는 'dim3' 타입을 다음과 같이 사용하면 됩니다.

 dim3 threads_per_block(16, 16, 1);
dim3 number_of_blocks(16, 16, 1);
someKernel<<<number_of_blocks, threads_per_block>>>();

위의 경우, someKernel 내에서 gridDim.x, gridDim.y, blockDim.x, blockDim.y의 값은 모두 16이 됩니다.

Allocating Memory to be accessed on the GPU and the CPU

최근 버전의 CUDA(6 이상)에서는 CPU와 GPU 디바이스 모두에서 사용 가능한 메모리를 쉽게 할당할 수 있습니다. 이러한 메모리를 Unified Memory라고 하며, 이 메모리의 목적은 성능이 아닌 편리함이 포인트입니다. 따라서, 조금 더 좋은 성능을 위해 최적화하려면 다른 방법을 사용하는 것이 좋습니다.

전에 Unified Memory와 관련한 포스팅을 작성했는데, 필요하시다면 참조하시길 바랍니다.

Unified Memory

References Professional CUDA C Programming https://developer.nvidia.com/blog/unified-memory-in-cuda-6/ Contents Unified Memory Zero-Copy Memory & Unified Virtual Addressing Zero-Copy Memory & Unifie..

junstar92.tistory.com

메모리를 할당/해제하고, host와 device 코드에서 참조할 수 있는 포인터 주소를 얻기 위해서는 CPU에서 메모리를 할당하고 해제하는 함수인 malloc/free처럼 cudaMallocManaged/cudaFree를 사용하면 됩니다.

예를 들면, 다음과 같습니다.

 int N = 2<<20;
size_t size = N * sizeof(int);
 
int *a;
// Note the address of `a` is passed as first argument.
cudaMallocManaged(&a, size);
 
// Use `a` on the CPU and/or on any GPU in the accelerated system.
 
cudaFree(a);

아래 예제 코드는 할당된 데이터의 모든 값을 두 배 해주는 프로그램입니다.

 // 06_double-elements.cu
#include <stdio.h>
 
void init(int *a, const int N)
{
    for (int i = 0; i < N; i++) {
        a[i] = i;
    }
}
 
__global__
void doubleElements(int *a, const int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        a[i] *= 2;
    }
}
 
bool checkElementsAreDoubled(int *a, const int N)
{
    for (int i = 0; i < N; i++) {
        if (a[i] != i * 2)
            return false;
    }
 
    return true;
}
 
int main()
{
    int N = 1000;
    int *a;
 
    size_t size = N * sizeof(int);
 
    // Use 'cudaMallocManaged' to allocate pointer 'a' available
    // on both the host and the device.
    cudamallocManaged(&a, size);
 
    init(a, N);
 
    size_t threads_per_block = 256;
    size_t number_of_blocks = (N + threads_per_block - 1) / threads_per_block;
 
    doubleElements<<<number_of_blocks, threads_per_block>>>(a, N);
    cudaDeviceSynchronize();
    
    bool areDoubled = checkElementsAreDoubled(a, N);
    printf("All elements were doubled? %s\n", areDoubled ? "TRUE" : "FALSE");
 
    // Use 'cudaFree' to free memory allocated with 'cudaMallocManaged'
    cudaFree(a);
}

Grid Size Mismatch

performWork 예시에서 데이터의 갯수와 전체 스레드의 갯수가 일치했는데, 만약 다음과 같이 전체 스레드 갯수와 데이터의 수가 일치하지 않으면 어떻게 될까요 ? 아마 대부분의 경우, 스레드의 총 갯수와 데이터의 총 갯수는 일치하지 않는 경우일 것입니다.

이 경우에는 존재하지 않는 요소들에 대한 접근을 시도하게 되고, 런타임 에러가 발생할 것 입니다.

이전 예제 코드의 doubleElements 커널에서 이 문제의 해결 방안이 적용되어 있습니다.

이전 예제 코드를 살펴보면, 총 1000개의 데이터가 있습니다. 따라서, 우리는 1000개의 병렬 태스크를 실행해야 합니다. 하지만 각 블록이 256개의 스레드를 갖도록 설정해주었기 때문에 최소 4개의 블록(256 x 4 = 1024)이 필요합니다. 따라서, 별도로 처리를 해주지 않으면 1001번째부터 1024번째 스레드가 유효하지 않은 데이터 주소에 접근하려고 시도하기 때문에 런타임 에러가 발생할 수 있습니다.

 __global__
void doubleElements(int *a, const int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        a[i] *= 2;
    }
}

이 문제를 해결하기 위해서 doubleElements 커널에서는 총 데이터의 수를 파라미터로 전달하여, 커널 내에서 계산된 idx 값이 유효할 때(idx < N)만 작업을 수행하도록 합니다.

Grid-Stride Loops

종종 더 좋은 성능을 위해서 또는 필요에 따라 그리드 내의 스레드 수가 데이터의 수보다 작을 수 있습니다. 예를 들어, 1000개의 데이터와 250개의 스레드가 있는 경우를 생각해보겠습니다. 이 경우에서는 각 스레드는 4개의 데이터를 처리해야 모든 데이터를 처리할 수 있습니다. 이를 수행하는 한 가지 방법은 커널 내에서 Grid-Stride Loops를 사용하는 것입니다.

예를 들면, 첫 번째 루프에서 다음과 같이 데이터를 처리하고,

두 번째 루프에서 그 다음 데이터들을 처리하는 방식입니다.

stride의 값은 'gridDim.x * blockDim.x'로 계산합니다.

이전 예제 코드처럼 각 데이터를 두 배시키는 doubleElements 커널에 grid-stride loop를 적용하면 다음과 같이 작성할 수 있습니다.

 __global__
void doubleElements(int *a, const int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    for (int i = idx; i < N; i += stride) {
        a[i] *= 2;
    }
}

Error Handling

어떠한 프로그램에서든지 CUDA 코드에서의 error handling은 필수입니다. 대부분의 함수는 아니지만 많은 함수(ex, 메모리 관리 함수)들은 cudaError_t 타입의 값을 리턴합니다. 이 값은 함수를 호출하는 동안 에러가 발생했는지 여부를 확인하는데 사용할 수 있습니다. 아래 예제 코드는 cudaMallocManaged를 호출했을 때 어떻게 에러를 처리하는지 보여줍니다.

 cudaError_t err;
err = cudaMallocManaged(&a, N);
 
if (err != cudaSuccess) {
    printf("Error: %s\n", cudaGetErrorString(err));
}

리턴 타입이 void인 커널을 호출할 때는 cudaError_t 타입의 값을 리턴하지 않습니다. 이러한 커널이, 예를 들어, 잘못된 configuration으로 실행되었을 때, 발생하는 에러를 확인하기 위해서 CUDA는 cudaError_t 타입의 값을 리턴하는 cudaGetLastError() 함수를 제공합니다.

 someKernel<<<1, -1>>>(); // -1 is not a valid number of threads.
 
cudaError_t err;
err = cudaGetLastError(); // 'cudaGetLastError' will return the error from above.
if (err != cudaSuccess) {
    printf("Error: %s\n", cudaGetErrorString(err));
}

마지막으로, 비동기적으로 발생하는 에러를 처리하기 위해서는, cudaDeviceSynchronize()와 같은 CUDA 런타임 API 호출을 통해 반환되는 상태를 체크해야 합니다. 만약 이전에 호출된 커널이 실패했다면, cudaDeviceSynchronize()도 에러를 리턴할 것입니다.

아래 예제 코드는 위에서 작성한 예제 코드(doubleElements)에서 스레드 갯수를 2048로 설정하여 발생하는 에러를 처리하도록 작성하였습니다.

 // 08_add-error-handling.cu
#include <stdio.h>
 
void init(int *a, const int N)
{
    for (int i = 0; i < N; i++) {
        a[i] = i;
    }
}
 
__global__
void doubleElements(int *a, const int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    for (int i = idx; i < N; i += stride) {
        a[i] *= 2;
    }
}
 
bool checkElementsAreDoubled(int *a, const int N)
{
    for (int i = 0; i < N; i++) {
        if (a[i] != i * 2)
            return false;
    }
 
    return true;
}
 
int main()
{
    int N = 1000;
    int *a;
 
    size_t size = N * sizeof(int);
 
    // Use 'cudaMallocManaged' to allocate pointer 'a' available
    // on both the host and the device.
    cudaMallocManaged(&a, size);
 
    init(a, N);
 
    size_t threads_per_block = 2048;
    size_t number_of_blocks = 32;
 
    cudaError_t syncErr, asyncErr;
 
    doubleElements<<<number_of_blocks, threads_per_block>>>(a, N);
 
    // catch errors for both the kernel launch above and any errors that
    // occur during the asynchronous 'doubleElements' kernel execution.
    syncErr = cudaGetLastError();
    asyncErr = cudaDeviceSynchronize();
 
    // print errors should they exist.
    if (syncErr != cudaSuccess)
        printf("Error(sync): %s\n", cudaGetErrorString(syncErr));
    if (asyncErr != cudaSuccess)
        printf("Error(async): %s\n", cudaGetErrorString(asyncErr));
 
    bool areDoubled = checkElementsAreDoubled(a, N);
    printf("All elements were doubled? %s\n", areDoubled ? "TRUE" : "FALSE");
 
    // Use 'cudaFree' to free memory allocated with 'cudaMallocManaged'
    cudaFree(a);
}

위 코드를 컴파일하고 실행하면, 다음의 에러 출력을 확인할 수 있습니다.

일반적으로 Error Handling Function을 작성하여 편리하게 사용합니다.

 #include <stdio.h>
#include <assert.h>
 
inline cudaError_t checkCuda(cudaError_t result)
{
    if (result != cudaSuccess) {
        fprintf(stderr, "CUDA Runtime Error: %s\n", cudaGetErrorString(result));
        assert(result == cudaSuccess);
    }
    return result;
}
 
int main()
{
 
    /*
     * The macro can be wrapped around any function returning
     * a value of type `cudaError_t`.
     */
 
    checkCuda(cudaDeviceSynchronize())
}

위에서 작성한 모든 코드는 아래 github에서 확인하실 수 있습니다.

GitHub - junstar92/parallel_programming_study: Study parallel programming - CUDA, OpenMP, MPI, Pthread

Study parallel programming - CUDA, OpenMP, MPI, Pthread - GitHub - junstar92/parallel_programming_study: Study parallel programming - CUDA, OpenMP, MPI, Pthread

github.com

'NVIDIA > CUDA' 카테고리의 다른 글

CUDA C/C++ 기초 - (3) (0)	2022.06.14
CUDA C/C++ 기초 - (2) (0)	2022.06.13
CUDA Instructions (2) - Instruction 최적화 (0)	2022.01.28
CUDA Instructions (1) (0)	2022.01.26
Streams and Events (3) - Kernel and Data Transfer, Stream Callback (0)	2022.01.25

내 블로그 - 관리자 홈 전환	`Q` `Q`
새 글 쓰기	`W` `W`

글 수정 (권한 있는 경우)	`E` `E`
댓글 영역으로 이동	`C` `C`

이 페이지의 URL 복사	`S` `S`
맨 위로 이동	`T` `T`
티스토리 홈 이동	`H` `H`
단축키 안내	`Shift` + `/` `⇧` + `/`

별준

CUDA C/C++ 기초 - (1)

References

Contents

Accelerated Systems

GPU-accelerated VS. CPU-only Applications

Writing Application Code for the GPU

Compiling and Running CUDA Code

CUDA Thread Hierarchy

Launching Parallel Kernels

CUDA-Provided Thread Hierarchy Variables

Accelerating For Loops

Coordinating Parallel Threads

Grids and Blocks of 2 and 3 Dimensions

Allocating Memory to be accessed on the GPU and the CPU

Grid Size Mismatch

Grid-Stride Loops

Error Handling

'NVIDIA > CUDA' 카테고리의 다른 글

댓글

티스토리툴바

단축키

내 블로그

블로그 게시글

모든 영역

	#include <stdio.h>

	void CPUFunction()
	{
	printf("This function is defined to run on the CPU.\n");
	}

	__global__ void GPUFunction()
	{
	printf("This function is defined to run on the GPU.\n");
	}

	int main()
	{
	CPUFunction();

	GPUFunction<<<1, 1>>>();
	cudaDeviceSynchronize();
	}

	// 01_hello-gpu.cu
	#include <stdio.h>

	void helloCPU()
	{
	printf("Hello from the CPU.\n");
	}

	__global__ void helloGPU()
	{
	printf("Hello from the GPU.\n");
	}

	int main()
	{
	helloCPU();

	helloGPU<<<1, 1>>>();
	cudaDeviceSynchronize();
	}

	// 02_first-parallel.cu
	#include <stdio.h>

	__global__
	void firstParallel()
	{
	printf("This is running in parallel.\n");
	}

	int main()
	{
	firstParallel<<<5, 5>>>();
	cudaDeviceSynchronize();
	}

	__global__
	void printSuccessForCorrectExecutionConfiguration()
	{
	if (threadIdx.x == 1023 && blockIdx.x == 255) {
	printf("Success.\n");
	}
	}

	int N = 2 << 20;
	for (int i = 0; i < N; i++) {
	printf("%d\n", i);
	}

	int N = 10;
	for (int i = 0; i < N; i++) {
	printf("This is iteration number %d\n", i);
	}

	__global__
	void loop()
	{
	printf("This is iteration number %d\n", threadIdx.x);
	}

	// 05_multiple-block-loop
	#include <stdio.h>

	__global__
	void loop()
	{
	int idx = blockIdx.x * blockDim.x + threadIdx.x;
	printf("This is iteration number %d\n", idx);
	}

	int main()
	{
	loop<<<2, 5>>>();
	cudaDeviceSynchronize();
	}

	dim3 threads_per_block(16, 16, 1);
	dim3 number_of_blocks(16, 16, 1);
	someKernel<<<number_of_blocks, threads_per_block>>>();

	int N = 2<<20;
	size_t size = N * sizeof(int);

	int *a;
	// Note the address of `a` is passed as first argument.
	cudaMallocManaged(&a, size);

	// Use `a` on the CPU and/or on any GPU in the accelerated system.

	cudaFree(a);

	// 06_double-elements.cu
	#include <stdio.h>

	void init(int *a, const int N)
	{
	for (int i = 0; i < N; i++) {
	a[i] = i;
	}
	}

	__global__
	void doubleElements(int *a, const int N)
	{
	int idx = blockIdx.x * blockDim.x + threadIdx.x;
	if (idx < N) {
	a[i] *= 2;
	}
	}

	bool checkElementsAreDoubled(int *a, const int N)
	{
	for (int i = 0; i < N; i++) {
	if (a[i] != i * 2)
	return false;
	}

	return true;
	}

	int main()
	{
	int N = 1000;
	int *a;

	size_t size = N * sizeof(int);

	// Use 'cudaMallocManaged' to allocate pointer 'a' available
	// on both the host and the device.
	cudamallocManaged(&a, size);

	init(a, N);

	size_t threads_per_block = 256;
	size_t number_of_blocks = (N + threads_per_block - 1) / threads_per_block;

	doubleElements<<<number_of_blocks, threads_per_block>>>(a, N);
	cudaDeviceSynchronize();

	bool areDoubled = checkElementsAreDoubled(a, N);
	printf("All elements were doubled? %s\n", areDoubled ? "TRUE" : "FALSE");

	// Use 'cudaFree' to free memory allocated with 'cudaMallocManaged'
	cudaFree(a);
	}

	cudaError_t err;
	err = cudaMallocManaged(&a, N);

	if (err != cudaSuccess) {
	printf("Error: %s\n", cudaGetErrorString(err));
	}

	someKernel<<<1, -1>>>(); // -1 is not a valid number of threads.

	cudaError_t err;
	err = cudaGetLastError(); // 'cudaGetLastError' will return the error from above.
	if (err != cudaSuccess) {
	printf("Error: %s\n", cudaGetErrorString(err));
	}

	// 08_add-error-handling.cu
	#include <stdio.h>

	void init(int *a, const int N)
	{
	for (int i = 0; i < N; i++) {
	a[i] = i;
	}
	}

	__global__
	void doubleElements(int *a, const int N)
	{
	int idx = blockIdx.x * blockDim.x + threadIdx.x;
	int stride = gridDim.x * blockDim.x;
	for (int i = idx; i < N; i += stride) {
	a[i] *= 2;
	}
	}

	bool checkElementsAreDoubled(int *a, const int N)
	{
	for (int i = 0; i < N; i++) {
	if (a[i] != i * 2)
	return false;
	}

	return true;
	}

	int main()
	{
	int N = 1000;
	int *a;

	size_t size = N * sizeof(int);

	// Use 'cudaMallocManaged' to allocate pointer 'a' available
	// on both the host and the device.
	cudaMallocManaged(&a, size);

	init(a, N);

	size_t threads_per_block = 2048;
	size_t number_of_blocks = 32;

	cudaError_t syncErr, asyncErr;

	doubleElements<<<number_of_blocks, threads_per_block>>>(a, N);

	// catch errors for both the kernel launch above and any errors that
	// occur during the asynchronous 'doubleElements' kernel execution.
	syncErr = cudaGetLastError();
	asyncErr = cudaDeviceSynchronize();

	// print errors should they exist.
	if (syncErr != cudaSuccess)
	printf("Error(sync): %s\n", cudaGetErrorString(syncErr));
	if (asyncErr != cudaSuccess)
	printf("Error(async): %s\n", cudaGetErrorString(asyncErr));

	bool areDoubled = checkElementsAreDoubled(a, N);
	printf("All elements were doubled? %s\n", areDoubled ? "TRUE" : "FALSE");

	// Use 'cudaFree' to free memory allocated with 'cudaMallocManaged'
	cudaFree(a);
	}

	#include <stdio.h>
	#include <assert.h>

	inline cudaError_t checkCuda(cudaError_t result)
	{
	if (result != cudaSuccess) {
	fprintf(stderr, "CUDA Runtime Error: %s\n", cudaGetErrorString(result));
	assert(result == cudaSuccess);
	}
	return result;
	}

	int main()
	{

	/*
	* The macro can be wrapped around any function returning
	* a value of type `cudaError_t`.
	*/

	checkCuda(cudaDeviceSynchronize())
	}

CUDA C/C++ 기초 - (1)

References

Contents

Accelerated Systems

GPU-accelerated VS. CPU-only Applications

Writing Application Code for the GPU

Compiling and Running CUDA Code

CUDA Thread Hierarchy

Launching Parallel Kernels

CUDA-Provided Thread Hierarchy Variables

Accelerating For Loops

Coordinating Parallel Threads

Grids and Blocks of 2 and 3 Dimensions

Allocating Memory to be accessed on the GPU and the CPU

Grid Size Mismatch

Grid-Stride Loops

Error Handling

'NVIDIA > CUDA' 카테고리의 다른 글

관련글

댓글

티스토리툴바

단축키

내 블로그

블로그 게시글

모든 영역