Stanford CS149 Parallel Programming - Assignment 3

Part 1: SAXPY

Code:

SAXPY serial CPU output from Assignment 1 prog5:

(base) ➜  prog5_saxpy git:(master) ✗ ./saxpy
[saxpy serial]:         [20.605] ms     [14.464] GB/s   [1.941] GFLOPS
[saxpy ispc]:           [17.866] ms     [16.681] GB/s   [2.239] GFLOPS
[saxpy task ispc]:      [3.122] ms      [95.446] GB/s   [12.810] GFLOPS
                                (5.72x speedup from use of tasks)

SAXPY GPU CUDA output:

Found 4 CUDA devices
Device 0: Tesla V100-SXM2-16GB
   SMs:        80
   Global mem: 16160 MB
   CUDA Cap:   7.0
Device 1: Tesla V100-SXM2-16GB
   SMs:        80
   Global mem: 16160 MB
   CUDA Cap:   7.0
Device 2: Tesla V100-SXM2-16GB
   SMs:        80
   Global mem: 16160 MB
   CUDA Cap:   7.0
Device 3: Tesla V100-SXM2-16GB
   SMs:        80
   Global mem: 16160 MB
   CUDA Cap:   7.0
---------------------------------------------------------
Running 3 timing tests:
Effective BW by CUDA saxpy: 225.263 ms          [4.961 GB/s]
kernel execution time: 1.503ms
Effective BW by CUDA saxpy: 247.816 ms          [4.510 GB/s]
kernel execution time: 1.504ms
Effective BW by CUDA saxpy: 245.998 ms          [4.543 GB/s]
kernel execution time: 1.506ms

It looks like the effective GPU bandwidth is lower than the CPU's.

The kernel execution time itself is very short (~1.5 ms); almost all of the end-to-end time goes to copying data between host and device.
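For reference, the code structure looks roughly like this (a minimal sketch, not the actual starter code; kernel and variable names are mine). The end-to-end "effective BW" timer wraps the whole function, including the cudaMemcpy calls, while the kernel timer covers only the launch:

#include <cuda_runtime.h>

__global__ void saxpy_kernel(int N, float alpha, float* x, float* y, float* result) {
    // One thread per element.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
        result[i] = alpha * x[i] + y[i];
}

void saxpyCuda(int N, float alpha, float* xarray, float* yarray, float* resultarray) {
    size_t bytes = N * sizeof(float);
    float *device_x, *device_y, *device_result;
    cudaMalloc((void**)&device_x, bytes);
    cudaMalloc((void**)&device_y, bytes);
    cudaMalloc((void**)&device_result, bytes);

    // Host-to-device copies: inside the end-to-end timer.
    cudaMemcpy(device_x, xarray, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(device_y, yarray, bytes, cudaMemcpyHostToDevice);

    // Kernel launch: the ~1.5 ms "kernel execution time" covers only this part.
    int threadsPerBlock = 512;
    int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;
    saxpy_kernel<<<blocks, threadsPerBlock>>>(N, alpha, device_x, device_y, device_result);
    cudaDeviceSynchronize();

    // Device-to-host copy: also inside the end-to-end timer.
    cudaMemcpy(resultarray, device_result, bytes, cudaMemcpyDeviceToHost);

    cudaFree(device_x);
    cudaFree(device_y);
    cudaFree(device_result);
}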

I was a little confused by the two bandwidth numbers listed in this datasheet: https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/tesla-t4/t4-tensor-core-datasheet-951643.pdf

GPU memory bandwidth is listed as 300 GB/s and interconnect bandwidth as 32 GB/s.

My guess: GPU memory bandwidth is the bandwidth between the SMs and the GPU's on-board device memory, while the interconnect bandwidth is the PCIe bandwidth available when transferring data between the CPU and the GPU.
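Note that the datasheet linked above is for the T4, while the machine above has V100s (HBM2, roughly 900 GB/s). A back-of-the-envelope check still supports this reading; here is a sketch where the element count N and both bandwidth figures are my assumptions:

#include <stdio.h>

int main(void) {
    // Assumed traffic: 3 arrays of N floats cross PCIe (x and y in, result out),
    // and the kernel touches the same 3 arrays in device memory.
    double N = 20e6;            // hypothetical element count per array
    double bytes = 3 * N * 4;   // ~240 MB of traffic
    double pcie_bw = 32e9;      // "interconnect bandwidth" (PCIe 3.0 x16)
    double hbm_bw = 900e9;      // approx. V100 device-memory bandwidth

    printf("PCIe transfers:     %.2f ms\n", bytes / pcie_bw * 1e3); // ~7.5 ms
    printf("Kernel HBM traffic: %.2f ms\n", bytes / hbm_bw * 1e3);  // ~0.27 ms
    return 0;
}

Even under ideal conditions the PCIe transfers dwarf the kernel's memory traffic, which matches the measurements: ~1.5 ms kernel time versus hundreds of milliseconds end to end (the measured total is larger still because it also includes cudaMalloc and pageable-host-memory overheads).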

Commands needed when running on the A800 machine. At first the binary could not find the CUDA runtime:

./cudaSaxpy: error while loading shared libraries: libcudart.so.12: cannot open shared object file: No such file or directory

[nsccgz_qylin_1@gpu72%tianhe2-K saxpy]$ echo $LD_LIBRARY_PATH | grep dart
[nsccgz_qylin_1@gpu72%tianhe2-K saxpy]$ export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
[nsccgz_qylin_1@gpu72%tianhe2-K saxpy]$ export LD_LIBRARY_PATH=/usr/local/cuda-12.0/lib64:$LD_LIBRARY_PATH
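To confirm the loader now resolves the CUDA runtime, ldd is a quick check:

ldd ./cudaSaxpy | grep cudart   # should now point into /usr/local/cuda-12.0/lib64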

Part 2: Parallel prefix sum

I hit this libstdc++ version issue when running the binary:

[nsccgz_qylin_1@gpu72%tianhe2-K scan]$ ./cudaScan
./cudaScan: /usr/lib64/libstdc++.so.6: version `CXXABI_1.3.8' not found (required by ./cudaScan)
./cudaScan: /usr/lib64/libstdc++.so.6: version `CXXABI_1.3.9' not found (required by ./cudaScan)
./cudaScan: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by ./cudaScan)

Solution: use conda to install libstdcxx-ng and add its lib directory to LD_LIBRARY_PATH:

conda activate myenv
conda install -c conda-forge libstdcxx-ng
find $CONDA_PREFIX -name "libstdc++.so.6"

export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH

With that in place, ./cudaScan runs:
Found 4 CUDA devices
Device 0: NVIDIA A800 80GB PCIe
   SMs:        108
   Global mem: 81229 MB
   CUDA Cap:   8.0
Device 1: NVIDIA A800 80GB PCIe
   SMs:        108
   Global mem: 81229 MB
   CUDA Cap:   8.0
Device 2: NVIDIA A800 80GB PCIe
   SMs:        108
   Global mem: 81229 MB
   CUDA Cap:   8.0
Device 3: NVIDIA A800 80GB PCIe
   SMs:        108
   Global mem: 81229 MB
   CUDA Cap:   8.0
---------------------------------------------------------
Array size: 64
Student GPU time: 0.069 ms
Scan outputs are correct!
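The algorithm behind this is the work-efficient exclusive scan from the handout: an upsweep phase builds partial sums in place, then a downsweep distributes prefixes back down, with one kernel launch per level of the tree. A rough sketch, assuming the device buffer is padded to a power of two (kernel names and launch configuration are my own, not the starter-code interface):

#include <cuda_runtime.h>

__global__ void upsweep_kernel(int* data, int N, int twod) {
    int twod1 = twod * 2;
    // Each thread handles one pair at stride twod1.
    long i = (long)(blockIdx.x * blockDim.x + threadIdx.x) * twod1;
    if (i + twod1 - 1 < N)
        data[i + twod1 - 1] += data[i + twod - 1];
}

__global__ void downsweep_kernel(int* data, int N, int twod) {
    int twod1 = twod * 2;
    long i = (long)(blockIdx.x * blockDim.x + threadIdx.x) * twod1;
    if (i + twod1 - 1 < N) {
        int t = data[i + twod - 1];
        data[i + twod - 1] = data[i + twod1 - 1];  // pass the prefix down
        data[i + twod1 - 1] += t;                  // extend it to the right half
    }
}

void exclusive_scan(int* device_data, int N) {
    // N assumed rounded up to a power of two (the starter code allocates
    // the device buffer with nextPow2 padding).
    const int threadsPerBlock = 256;
    for (int twod = 1; twod < N; twod *= 2) {
        int numPairs = N / (twod * 2);
        int blocks = (numPairs + threadsPerBlock - 1) / threadsPerBlock;
        upsweep_kernel<<<blocks, threadsPerBlock>>>(device_data, N, twod);
    }
    cudaMemset(&device_data[N - 1], 0, sizeof(int));  // root becomes the identity
    for (int twod = N / 2; twod >= 1; twod /= 2) {
        int numPairs = N / (twod * 2);
        int blocks = (numPairs + threadsPerBlock - 1) / threadsPerBlock;
        downsweep_kernel<<<blocks, threadsPerBlock>>>(device_data, N, twod);
    }
    cudaDeviceSynchronize();
}

Total work stays O(N) across all launches, since each level touches half as many pairs as the one below it, at the cost of 2*log2(N) kernel launches.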


