Stanford CS149 Parallel Programming - Assignment 3

Part 1: saxpy

Code:
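For reference, the CUDA saxpy kernel boils down to something like the sketch below (illustrative names, not necessarily the exact starter code): one thread per element, each computing result[i] = alpha * x[i] + y[i].

__global__ void
saxpy_kernel(int N, float alpha, float* x, float* y, float* result) {
    // One thread per array element: compute this thread's global index.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        result[i] = alpha * x[i] + y[i];
}

// Launch with enough blocks to cover all N elements, e.g.:
// const int threadsPerBlock = 512;
// const int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;
// saxpy_kernel<<<blocks, threadsPerBlock>>>(N, alpha, device_x, device_y, device_result);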

saxpy serial cpu output from assign1 prog5:

(base) ➜  prog5_saxpy git:(master) ✗ ./saxpy
[saxpy serial]:         [20.605] ms     [14.464] GB/s   [1.941] GFLOPS
[saxpy ispc]:           [17.866] ms     [16.681] GB/s   [2.239] GFLOPS
[saxpy task ispc]:      [3.122] ms      [95.446] GB/s   [12.810] GFLOPS
                                (5.72x speedup from use of tasks)

saxpy gpu cuda output

Found 4 CUDA devices
Device 0: Tesla V100-SXM2-16GB
   SMs:        80
   Global mem: 16160 MB
   CUDA Cap:   7.0
Device 1: Tesla V100-SXM2-16GB
   SMs:        80
   Global mem: 16160 MB
   CUDA Cap:   7.0
Device 2: Tesla V100-SXM2-16GB
   SMs:        80
   Global mem: 16160 MB
   CUDA Cap:   7.0
Device 3: Tesla V100-SXM2-16GB
   SMs:        80
   Global mem: 16160 MB
   CUDA Cap:   7.0
---------------------------------------------------------
Running 3 timing tests:
Effective BW by CUDA saxpy: 225.263 ms          [4.961 GB/s]
kernel execution time: 1.503ms
Effective BW by CUDA saxpy: 247.816 ms          [4.510 GB/s]
kernel execution time: 1.504ms
Effective BW by CUDA saxpy: 245.998 ms          [4.543 GB/s]
kernel execution time: 1.506ms

It looks like the effective GPU bandwidth is lower than the CPU's.

The kernel execution time is very short; almost all of the elapsed time goes to copying memory between the host and the device.

I was a bit confused by the two bandwidths listed in this datasheet: https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/tesla-t4/t4-tensor-core-datasheet-951643.pdf

The GPU memory bandwidth is 300 GB/s and the interconnect bandwidth is 32 GB/s.

My understanding is that the GPU memory bandwidth is the bandwidth between the SMs and the GPU's own device memory, while the interconnect bandwidth is the bandwidth for transferring data between the CPU and the GPU over PCIe, which is what dominates the end-to-end saxpy time here.
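To see this split directly, the host-to-device copies and the kernel can be timed separately with CUDA events. A minimal sketch (host/device array names and the saxpy_kernel from the sketch above are my own, not the exact starter code):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

// Time the host -> device copies (limited by PCIe interconnect bandwidth).
cudaEventRecord(start);
cudaMemcpy(device_x, xarray, N * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(device_y, yarray, N * sizeof(float), cudaMemcpyHostToDevice);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float copyMs;
cudaEventElapsedTime(&copyMs, start, stop);

// Time the kernel alone (limited by GPU device-memory bandwidth).
cudaEventRecord(start);
saxpy_kernel<<<blocks, threadsPerBlock>>>(N, alpha, device_x, device_y, device_result);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float kernelMs;
cudaEventElapsedTime(&kernelMs, start, stop);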

Environment fix needed when running on the A800 node:

./cudaSaxpy: error while loading shared libraries: libcudart.so.12: cannot open shared object file: No such file or directory

[nsccgz_qylin_1@gpu72%tianhe2-K saxpy]$ echo $LD_LIBRARY_PATH | grep dart
[nsccgz_qylin_1@gpu72%tianhe2-K saxpy]$ export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
[nsccgz_qylin_1@gpu72%tianhe2-K saxpy]$ export LD_LIBRARY_PATH=/usr/local/cuda-12.0/lib64:$LD_LIBRARY_PATH

Part 2: parallel prefix sum

I got this libstdc++ version issue when running the executable:

[nsccgz_qylin_1@gpu72%tianhe2-K scan]$ ./cudaScan
./cudaScan: /usr/lib64/libstdc++.so.6: version `CXXABI_1.3.8' not found (required by ./cudaScan)
./cudaScan: /usr/lib64/libstdc++.so.6: version `CXXABI_1.3.9' not found (required by ./cudaScan)
./cudaScan: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by ./cudaScan)

Solution: use conda to install libstdc++ and add its lib directory to LD_LIBRARY_PATH:

conda activate myenv
conda install -c conda-forge libstdcxx-ng
find $CONDA_PREFIX -name "libstdc++.so.6"

export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH

Found 4 CUDA devices
Device 0: NVIDIA A800 80GB PCIe
   SMs:        108
   Global mem: 81229 MB
   CUDA Cap:   8.0
Device 1: NVIDIA A800 80GB PCIe
   SMs:        108
   Global mem: 81229 MB
   CUDA Cap:   8.0
Device 2: NVIDIA A800 80GB PCIe
   SMs:        108
   Global mem: 81229 MB
   CUDA Cap:   8.0
Device 3: NVIDIA A800 80GB PCIe
   SMs:        108
   Global mem: 81229 MB
   CUDA Cap:   8.0
---------------------------------------------------------
Array size: 64
Student GPU time: 0.069 ms
Scan outputs are correct!

Note that the calling code rounds the input length up to a power of 2 for cudaScan.
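For context, the exclusive scan itself can be implemented with the classic work-efficient upsweep/downsweep (Blelloch) pattern over the power-of-2 array. The sketch below shows that standard pattern; the kernel names, the in-place signature, and the threads-per-block value are illustrative, not necessarily my exact cudaScan code.

__global__ void
upsweep_kernel(int* data, int N, int twod) {
    int twod1 = twod * 2;
    // Each thread owns one stride-twod1 segment starting at i.
    long i = (long)(blockIdx.x * blockDim.x + threadIdx.x) * twod1;
    if (i + twod1 - 1 < N)
        data[i + twod1 - 1] += data[i + twod - 1];
}

__global__ void
downsweep_kernel(int* data, int N, int twod) {
    int twod1 = twod * 2;
    long i = (long)(blockIdx.x * blockDim.x + threadIdx.x) * twod1;
    if (i + twod1 - 1 < N) {
        int t = data[i + twod - 1];
        data[i + twod - 1] = data[i + twod1 - 1];
        data[i + twod1 - 1] += t;
    }
}

void exclusive_scan(int* device_data, int N) {
    // N is already rounded up to a power of 2.
    const int threadsPerBlock = 512;

    // Upsweep: build partial sums in place.
    for (int twod = 1; twod < N; twod *= 2) {
        int numTasks = N / (twod * 2);
        int blocks = (numTasks + threadsPerBlock - 1) / threadsPerBlock;
        upsweep_kernel<<<blocks, threadsPerBlock>>>(device_data, N, twod);
    }

    // Clear the last element before the downsweep.
    cudaMemset(device_data + N - 1, 0, sizeof(int));

    // Downsweep: distribute the partial sums back down the tree.
    for (int twod = N / 2; twod >= 1; twod /= 2) {
        int numTasks = N / (twod * 2);
        int blocks = (numTasks + threadsPerBlock - 1) / threadsPerBlock;
        downsweep_kernel<<<blocks, threadsPerBlock>>>(device_data, N, twod);
    }
}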

Find repeats

I did not understand why find_repeats could be parallelized, or how to do it, until I read code from others.

The key observation is that the result of find_repeats is simply the set of indices i for which A[i] == A[i+1].

This can be parallelized with an exclusive scan, but we first need to generate an intermediate representation of the input array.

First, a CUDA kernel builds a flags array in parallel: flags[i] = 1 if A[i] == A[i+1], else 0.

This flags array is then passed to the cudaScan function for a parallel exclusive scan, producing a new array flags_sum_arr, where flags_sum_arr[i] is the number of repeated elements seen before index i.

Finally, flags_sum_arr is passed to another CUDA kernel, also fully parallel, which writes the final indices array.

This is the fun part, and it shows the core of parallel programming: the subtasks have no dependencies on each other, so the output writes are completely independent.

// CUDA kernel code (the core of flags_extract_indices)
if (flags_sum_arr[i] < flags_sum_arr[i+1]) {
    // This means element i is a repeated element in the input array.
    // The output position to write to is flags_sum_arr[i],
    // and the repeated element's index is i.
    indices[flags_sum_arr[i]] = i;
}
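To make the three stages concrete, here is a hand-traced run on the 8-element all-ones input used in the test output further below:

// input A:         1 1 1 1 1 1 1 1
// flags arr:       1 1 1 1 1 1 1 0   (flags[i] = 1 if A[i] == A[i+1]; the last flag is always 0)
// exclusive scan:  0 1 2 3 4 5 6 7   (flags_sum_arr)
// extract:         flags_sum_arr[i] < flags_sum_arr[i+1] holds for i = 0..6,
//                  so indices[flags_sum_arr[i]] = i gives indices = 0 1 2 3 4 5 6
// repeat count:    flags_sum_arr[length-1] = 7 (valid because flags[length-1] is always 0)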

Problem:

The input array on the CPU and the input array copied back from the GPU are not the same.

Why is that?

input arr
0:1 1:1 2:1 3:1 4:1 5:1 6:1 7:1
input arr
0:1 1:1 2:337500088 3:10935 4:0 5:0 6:0 7:0
flags arr
0:1 1:1 2:337500088 3:10935 4:0 5:0 6:0 7:0

This is because I did not copy all of the bytes of the input array. The correct call is

cudaMemcpy(arr, device_input, length*sizeof(int), cudaMemcpyDeviceToHost);

not

cudaMemcpy(arr, device_input, length, cudaMemcpyDeviceToHost);

The count argument needs to be length*sizeof(int) bytes instead of length: cudaMemcpy, like memcpy, copies the number of bytes you specify, not the number of elements.

Code:

__global__ void 
flag_repeats_kernel(int* input, int* output, int N) {
  // flags[i] = 1 if input[i] == input[i+1], else 0.
  // Write the 0 case explicitly: cudaMalloc does not zero-initialize memory.
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  if (index < N) {
    output[index] = (index < N - 1 && input[index] == input[index + 1]) ? 1 : 0;
  }
}

__global__ void 
flags_extract_indices(int *input, int* output, int N) {
  // input is the exclusive scan of the flags array. A step between input[index]
  // and input[index+1] means element `index` was a repeat, and input[index] is
  // the slot in the output array where it should be written.
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  if (index < N - 1 && input[index] < input[index + 1]) {
    output[input[index]] = index;
  }
}

// find_repeats --
//
// Given an array of integers `device_input`, returns an array of all
// indices `i` for which `device_input[i] == device_input[i+1]`.
//
// Returns the total number of pairs found
int find_repeats(int* device_input, int length, int* device_output) {

    // CS149 TODO:
    //
    // Implement this function. You will probably want to
    // make use of one or more calls to exclusive_scan(), as well as
    // additional CUDA kernel launches.
    //    
    // Note: As in the scan code, the calling code ensures that
    // allocated arrays are a power of 2 in size, so you can use your
    // exclusive_scan function with them. However, your implementation
    // must ensure that the results of find_repeats are correct given
    // the actual array length.
  int *flags_arr;
  int *flags_sum_arr;

  cudaMalloc((void**)&flags_arr, length * sizeof(int));
  cudaMalloc((void**)&flags_sum_arr, length * sizeof(int));

  const int threadsPerBlock = 512;
  const int blocks = (length + threadsPerBlock - 1) / threadsPerBlock;
  int repeat_indices_count;

  // Step 1: build the flags array in parallel.
  flag_repeats_kernel<<<blocks, threadsPerBlock>>>(device_input, flags_arr, length);

  // Step 2: exclusive scan over the flags to get write offsets.
  cudaScan(flags_arr, flags_arr + length, flags_sum_arr);

  // Step 3: scatter each repeated index to its final position.
  flags_extract_indices<<<blocks, threadsPerBlock>>>(flags_sum_arr, device_output, length);

  // The last scan element is the total number of repeats
  // (flags[length-1] is always 0, so nothing is lost by the exclusive scan).
  cudaMemcpy(&repeat_indices_count, flags_sum_arr + length - 1, sizeof(int), cudaMemcpyDeviceToHost);

  cudaFree(flags_arr);
  cudaFree(flags_sum_arr);

  return repeat_indices_count;
}


//
// cudaFindRepeats --
//
// Timing wrapper around find_repeats. You should not modify this function.
double cudaFindRepeats(int *input, int length, int *output, int *output_length) {

    int *device_input;
    int *device_output;
    int rounded_length = nextPow2(length);
    
    // printf("input arr1\n");
    // print_arr(input, length);  
    cudaMalloc((void **)&device_input, rounded_length * sizeof(int));
    cudaMalloc((void **)&device_output, rounded_length * sizeof(int));
    cudaMemcpy(device_input, input, length * sizeof(int), cudaMemcpyHostToDevice);

    cudaDeviceSynchronize();
    double startTime = CycleTimer::currentSeconds();
    
    int result = find_repeats(device_input, length, device_output);

    cudaDeviceSynchronize();
    double endTime = CycleTimer::currentSeconds();

    // set output count and results array
    *output_length = result;
    cudaMemcpy(output, device_output, length * sizeof(int), cudaMemcpyDeviceToHost);

    // printf("output length:%d\n", *output_length);
    // printf("output indices\n");
    // print_arr(output, length);
    cudaFree(device_input);
    cudaFree(device_output);

    float duration = endTime - startTime; 
    return duration;
}

Output:

$ yhrun -p gpu_v100 ./cudaScan -m find_repeats -n 8  -i ones
---------------------------------------------------------
Found 4 CUDA devices
Device 0: Tesla V100-SXM2-16GB
   SMs:        80
   Global mem: 16160 MB
   CUDA Cap:   7.0
Device 1: Tesla V100-SXM2-16GB
   SMs:        80
   Global mem: 16160 MB
   CUDA Cap:   7.0
Device 2: Tesla V100-SXM2-16GB
   SMs:        80
   Global mem: 16160 MB
   CUDA Cap:   7.0
Device 3: Tesla V100-SXM2-16GB
   SMs:        80
   Global mem: 16160 MB
   CUDA Cap:   7.0
---------------------------------------------------------
Array size: 8
flags arr
0:1 1:1 2:1 3:1 4:1 5:1 6:1 7:0
output length:7
output indices
0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:0
flags arr
0:1 1:1 2:1 3:1 4:1 5:1 6:1 7:0
output length:7
output indices
0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:0
flags arr
0:1 1:1 2:1 3:1 4:1 5:1 6:1 7:0
output length:7
output indices
0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:0
Student GPU time: 0.199 ms
Find_repeats outputs are correct!

Part 3: renderer

Install the OpenGL libraries before compiling the project:

conda install -c anaconda pyopengl
conda install -c anaconda freeglut

Install the OpenGL libraries using conda, then compile the C code that uses them with a Makefile.

The thing to note is that we need to include the conda include and library paths in CFLAGS and LDFLAGS.

Step-by-Step Guide

  1. Install OpenGL using conda:
    conda create --name opengl_env
    conda activate opengl_env
    conda install -c anaconda pyopengl
    conda install -c anaconda freeglut
    
  2. Write Your C Code:
    • Create a file named main.c with the following content:
      #include <GL/glut.h>
      
      void display() {
          glClear(GL_COLOR_BUFFER_BIT);
          glBegin(GL_TRIANGLES);
          glVertex2f(-0.5, -0.5);
          glVertex2f(0.5, -0.5);
          glVertex2f(0.0, 0.5);
          glEnd();
          glFlush();
      }
      
      int main(int argc, char** argv) {
          glutInit(&argc, argv);
          glutCreateWindow("OpenGL Setup Test");
          glutDisplayFunc(display);
          glutMainLoop();
          return 0;
      }
      
  3. Create a Makefile:
    • Create a file named Makefile with the following content:
      CC = gcc
      CFLAGS = -I$(CONDA_PREFIX)/include
      LDFLAGS = -L$(CONDA_PREFIX)/lib -lGL -lGLU -lglut
      
      all: main
      
      main: main.o
          $(CC) -o main main.o $(LDFLAGS)
      
      main.o: main.c
          $(CC) -c main.c $(CFLAGS)
      
      clean:
          rm -f main main.o
      
  4. Compile and Run Your Code:
    • Run the following commands in your terminal:
      make
      ./main
      

Explanation

  • CC: Specifies the compiler to use (gcc in this case).
  • CFLAGS: Specifies the include directory for the OpenGL headers.
  • LDFLAGS: Specifies the library directory and the libraries to link against (-lGL, -lGLU, -lglut).
  • all: The default target that builds the main executable.
  • main: The target that links the object file to create the executable.
  • main.o: The target that compiles the source file into an object file.
  • clean: A target to clean up the compiled files.

This Makefile ensures that the compiler and linker use the correct paths for the OpenGL headers and libraries installed via conda.

One problem is how to get the drawing order right for a pixel when multiple circles overlap at that pixel.

The hints say that a prefix sum can help with this, but I did not know how to do that.

I do know that once we have the ordered list of contributing circles for each pixel, the drawing itself can be done for all pixels in parallel.

Take a look at shadePixel
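Roughly, shadePixel tests whether the circle covers the pixel and then alpha-blends the circle's color on top of the pixel's current value. The sketch below paraphrases that logic from memory (simple constant-alpha path only, not the snowflake shading), so treat it as an approximation rather than the exact starter code:

__device__ __inline__ void
shadePixel(int circleIndex, float2 pixelCenter, float3 p, float4* imagePtr) {
    // Distance test: does this circle cover the pixel center?
    float diffX = p.x - pixelCenter.x;
    float diffY = p.y - pixelCenter.y;
    float pixelDist = diffX * diffX + diffY * diffY;
    float rad = cuConstRendererParams.radius[circleIndex];
    if (pixelDist > rad * rad)
        return;  // circle does not contribute to this pixel

    float3 rgb = *(float3*)&(cuConstRendererParams.color[3 * circleIndex]);
    float alpha = 0.5f;

    // Blend on top of the existing color. The result depends on the order
    // of blends, which is why circles must be composited in index order.
    float oneMinusAlpha = 1.f - alpha;
    float4 existingColor = *imagePtr;
    float4 newColor;
    newColor.x = alpha * rgb.x + oneMinusAlpha * existingColor.x;
    newColor.y = alpha * rgb.y + oneMinusAlpha * existingColor.y;
    newColor.z = alpha * rgb.z + oneMinusAlpha * existingColor.z;
    newColor.w = alpha + existingColor.w;
    *imagePtr = newColor;
}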

Run pixel rendering in parallel instead of rendering circles in parallel (ref repo).

The naive solution is slow because each pixel thread has to iterate over all circles to check whether each circle contributes to the current pixel.

The good news is that with one thread per pixel we don't need to worry about the ordering issue: each pixel thread applies the circles in index order, so correctness is guaranteed.

I don’t know how prefix sum can be used to solve this problem yet.

Ref repo that uses a prefix sum to create unique offsets for the pixels covered by all the circles, so that each circle's thread can draw its pixels in the correct order.
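One way a prefix sum could help (my guess at the idea, not necessarily what that repo does): split the image into tiles, flag which circles overlap each tile, exclusive-scan the flags to get write offsets, and compact the overlapping circle indices into a per-tile list that is still sorted by circle index. Each pixel thread then only walks its own tile's list, in order. A rough sketch with illustrative names (a real version would do this per thread block, using shared memory and a block-level scan):

// Step 1: one thread per circle marks whether it overlaps this tile
// (tile bounds are in the same normalized coordinates as circle positions).
__global__ void flagCirclesForTile(float tileL, float tileR, float tileB, float tileT,
                                   int numCircles, int* flags) {
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= numCircles) return;
    float3 p = *(float3*)(&cuConstRendererParams.position[3 * c]);
    float rad = cuConstRendererParams.radius[c];
    flags[c] = (p.x + rad >= tileL && p.x - rad <= tileR &&
                p.y + rad >= tileB && p.y - rad <= tileT) ? 1 : 0;
}

// Step 2: exclusive_scan(flags) -> offsets, reusing the Part 2 scan.

// Step 3: compact the circle indices. Because the offsets come from a scan
// over circle indices 0..numCircles-1, the per-tile list stays in circle order.
__global__ void compactCirclesForTile(int* flags, int* offsets, int numCircles,
                                      int* tileCircleList) {
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c < numCircles && flags[c])
        tileCircleList[offsets[c]] = c;
}

// Step 4: each pixel thread in the tile loops over tileCircleList in order
// and calls shadePixel, so the blend order matches the sequential renderer.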

First naive solution: render all pixels in parallel

Ref implementation that uses a similar idea to render pixels in parallel.

Code:

void
CudaRenderer::render() {

    // 256 threads per block is a healthy number
    dim3 blockDim(THREADS_PER_BLOCK_X, THREADS_PER_BLOCK_Y);
    dim3 gridDim((image->width+THREADS_PER_BLOCK_X-1)/THREADS_PER_BLOCK_X, (image->height + THREADS_PER_BLOCK_Y-1)/THREADS_PER_BLOCK_Y );

    kernelRenderPixels<<<gridDim, blockDim>>>();
    cudaDeviceSynchronize();
}


__global__ void kernelRenderPixels() {
  int pixel_X =  blockIdx.x * blockDim.x + threadIdx.x;
  int pixel_Y = blockIdx.y * blockDim.y + threadIdx.y;


  int width = cuConstRendererParams.imageWidth;
  int height = cuConstRendererParams.imageHeight;

  if (pixel_X >= width || pixel_Y >= height)
      return;

  float4* imgPtr = (float4*)(&cuConstRendererParams.imageData[4 * (pixel_Y * width + pixel_X)]);
  int num_circles = cuConstRendererParams.numCircles ;

  float invWidth = 1.f / width;
  float invHeight = 1.f / height;

  float2 pixelCenterNorm = make_float2(invWidth * (static_cast<float>(pixel_X) + 0.5f),
                                       invHeight * (static_cast<float>(pixel_Y) + 0.5f));
  for(int circle_index=0; circle_index< num_circles; circle_index++) {
    int circle_index3 = 3 * circle_index;  

    float3 p = *(float3*)(&cuConstRendererParams.position[circle_index3]);
    // float  rad = cuConstRendererParams.radius[circle_index];
    shadePixel(circle_index, pixelCenterNorm, p, imgPtr);

  }

}

Output:

The output shows that the render time drops by roughly 17x on the GPU (390.4 ms -> 23.3 ms).

And it is correct: it passes the correctness check.

[nsccgz_qylin_1@ln101%tianhe2-K render]$ yhrun -p gpu_v100 ./render -r cpuref rand10k
Rendering to 1024x1024 image
Loaded scene with 10000 circles

Running benchmark, 1 frames, beginning at frame 0 ...
Dumping frames to output_xxx.ppm
Wrote image file output_0000.ppm
Clear:    661.9973 ms
Advance:  0.0079 ms
Render:   390.4090 ms
Total:    1052.4143 ms
File IO:  61.6616 ms

Overall:  1.1314 sec (note units are seconds)



[nsccgz_qylin_1@ln101%tianhe2-K render]$ yhrun -p gpu_v100 ./render -r cuda rand10k
Rendering to 1024x1024 image
Loaded scene with 10000 circles
---------------------------------------------------------
Initializing CUDA for CudaRenderer
Found 4 CUDA devices
Device 0: Tesla V100-SXM2-16GB
   SMs:        80
   Global mem: 16160 MB
   CUDA Cap:   7.0
Device 1: Tesla V100-SXM2-16GB
   SMs:        80
   Global mem: 16160 MB
   CUDA Cap:   7.0
Device 2: Tesla V100-SXM2-16GB
   SMs:        80
   Global mem: 16160 MB
   CUDA Cap:   7.0
Device 3: Tesla V100-SXM2-16GB
   SMs:        80
   Global mem: 16160 MB
   CUDA Cap:   7.0
---------------------------------------------------------

Running benchmark, 1 frames, beginning at frame 0 ...
Dumping frames to output_xxx.ppm
Copying image data from device
Wrote image file output_0000.ppm
Clear:    0.1329 ms
Advance:  0.0057 ms
Render:   23.3001 ms
Total:    23.4387 ms
File IO:  99.4054 ms

Overall:  0.1410 sec (note units are seconds)

Correctness check

[nsccgz_qylin_1@ln101%tianhe2-K render]$ yhrun -p gpu_v100 ./render -r cuda rand10k -c
Rendering to 1024x1024 image
Loaded scene with 10000 circles
Loaded scene with 10000 circles
---------------------------------------------------------
Initializing CUDA for CudaRenderer
Found 4 CUDA devices
Device 0: Tesla V100-SXM2-16GB
   SMs:        80
   Global mem: 16160 MB
   CUDA Cap:   7.0
Device 1: Tesla V100-SXM2-16GB
   SMs:        80
   Global mem: 16160 MB
   CUDA Cap:   7.0
Device 2: Tesla V100-SXM2-16GB
   SMs:        80
   Global mem: 16160 MB
   CUDA Cap:   7.0
Device 3: Tesla V100-SXM2-16GB
   SMs:        80
   Global mem: 16160 MB
   CUDA Cap:   7.0
---------------------------------------------------------

Running benchmark, 1 frames, beginning at frame 0 ...
Dumping frames to output_xxx.ppm
Copying image data from device
Wrote image file output_0000.ppm
Copying image data from device
***************** Correctness check passed **************************
Clear:    0.1450 ms
Advance:  0.0058 ms
Render:   23.3115 ms
Total:    23.4624 ms
File IO:  69.5518 ms

Overall:  0.4857 sec (note units are seconds)


