Stanford CS149 Parallel Programming - Lecture 7 - CUDA programming model
CUDA programming model (abstraction)
Three execution units and their memory address spaces
- thread
- thread block
- CUDA kernel
A thread block contains a group of threads.
A CUDA kernel launch contains all the thread blocks.
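The thread / thread block / kernel hierarchy can be sketched in plain Python. This is only an illustrative model, not CUDA code: `launch_1d` is a hypothetical helper that enumerates the threads of a one-dimensional launch, and the index formula mirrors the usual CUDA expression `blockIdx.x * blockDim.x + threadIdx.x`.

```python
def launch_1d(num_blocks, threads_per_block):
    """Enumerate (block_idx, thread_idx, global_idx) for a 1D kernel launch."""
    ids = []
    for block_idx in range(num_blocks):              # every thread block in the launch
        for thread_idx in range(threads_per_block):  # every thread in that block
            # Same arithmetic as blockIdx.x * blockDim.x + threadIdx.x in CUDA
            global_idx = block_idx * threads_per_block + thread_idx
            ids.append((block_idx, thread_idx, global_idx))
    return ids


ids = launch_1d(num_blocks=2, threads_per_block=4)
print(len(ids))  # 8 threads in total
print(ids[5])    # (1, 1, 5): block 1, thread 1 within the block, global index 5
```

On a real GPU these threads run concurrently; the nested loops only show how the indices relate to each other.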
Memory address space
- Each thread has its own private memory address space
- Each thread block has its own shared memory address space, visible to all threads in that block
- All threads across all thread blocks share a global (device) memory address space
Why this three-level address space hierarchy? It allows efficient memory access: the threads of a thread block are scheduled on the same core, so their shared memory can be kept in fast storage on that core.
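The three address spaces can be mimicked with a small Python model. This is a rough sketch under stated assumptions, not real CUDA: `block_sums` is a hypothetical function, plain lists stand in for global and shared memory, and the sequential loops stand in for threads that would run in parallel (with a barrier such as `__syncthreads()` before the reduction step).

```python
def block_sums(global_in, threads_per_block):
    """Sum the input one thread block at a time, using a per-block scratch buffer."""
    num_blocks = (len(global_in) + threads_per_block - 1) // threads_per_block
    global_out = [0] * num_blocks                 # global memory: visible to all blocks
    for block_idx in range(num_blocks):
        shared = [0] * threads_per_block          # shared memory: one buffer per block
        for thread_idx in range(threads_per_block):
            i = block_idx * threads_per_block + thread_idx  # private, per-thread value
            if i < len(global_in):                # guard threads past the end of the input
                shared[thread_idx] = global_in[i]
        # In CUDA the block would synchronize here before one thread reduces the buffer.
        global_out[block_idx] = sum(shared)
    return global_out


print(block_sums([1, 2, 3, 4, 5, 6, 7], threads_per_block=4))  # [10, 18]
```

Each block touches only its own `shared` buffer, which is why the hardware can keep that buffer local to the core running the block.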
NVIDIA GPU (implementation)
A warp on an NVIDIA GPU is a group of 32 threads from a thread block.
Each CUDA thread has its own PC (program counter), even when the threads are in the same warp.
However, since all threads in the same warp are likely to execute the same instructions, it effectively looks as if there are only 4 unique PCs (one per warp), even though in reality there are 4 * 32 = 128 PCs.
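The 4 * 32 = 128 arithmetic can be made concrete with a short Python sketch. The helper names `warp_and_lane` and `warps_per_block` are hypothetical; the sketch assumes a one-dimensional block in which consecutive thread indices fall into the same warp.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def warp_and_lane(thread_idx):
    """Return (warp index within the block, lane index within the warp)."""
    return thread_idx // WARP_SIZE, thread_idx % WARP_SIZE


def warps_per_block(threads_per_block):
    """Warps needed to hold a block; a partially filled last warp still counts."""
    return (threads_per_block + WARP_SIZE - 1) // WARP_SIZE


print(warps_per_block(128))  # 4 warps, so 4 * 32 = 128 per-thread PCs
print(warp_and_lane(70))     # (2, 6): thread 70 is lane 6 of warp 2
```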
Difference between a warp and a thread block
A thread block is a programming model abstraction.
A warp is a hardware implementation detail.
Both represent a group of threads.
In the diagram, each sub-core holds 4 warps.
On the V100, each SM (streaming multiprocessor) has 4 sub-cores.
The V100 has 80 SMs in total.
Instruction execution
Since a warp has 32 threads but a sub-core has fewer ALUs than that, each warp instruction finishes half of its work in one cycle and the other half in the next cycle.
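The hardware numbers above can be checked with a little arithmetic in Python. The 80 SMs and 4 sub-cores per SM come from these notes; the figure of 16 FP32 ALUs per sub-core is an assumption taken from NVIDIA's published V100 configuration, not from the notes, and `cycles_per_warp_instruction` is a hypothetical helper.

```python
NUM_SMS = 80                 # V100 SM count (from the notes)
SUB_CORES_PER_SM = 4         # sub-cores per SM (from the notes)
WARP_SIZE = 32               # threads per warp
FP32_ALUS_PER_SUB_CORE = 16  # assumption: V100's published per-sub-core FP32 ALU count


def cycles_per_warp_instruction(warp_size=WARP_SIZE, alus=FP32_ALUS_PER_SUB_CORE):
    """Cycles needed to push one warp-wide instruction through the available ALUs."""
    return -(-warp_size // alus)  # ceiling division


total_sub_cores = NUM_SMS * SUB_CORES_PER_SM
total_fp32_alus = total_sub_cores * FP32_ALUS_PER_SUB_CORE

print(total_sub_cores)                # 320
print(total_fp32_alus)                # 5120
print(cycles_per_warp_instruction())  # 2: half the warp per cycle
```

Under these assumptions, 32 threads over 16 ALUs gives the two-cycle behaviour described above.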