Stanford CS149 Parallel Programming - Lecture 7 - CUDA programming model

Lecture 7 slides

Video lecture

CUDA programming model (abstraction)

Three execution units and their memory address spaces

  1. thread
  2. thread block
  3. CUDA kernel

A thread block contains a group of threads.

A CUDA kernel launch contains all the thread blocks (together they form the grid).
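To make the hierarchy concrete, here is a minimal sketch (the kernel and all names are hypothetical, not from the lecture): one kernel launch creates many thread blocks, and each block contains many threads.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: each thread increments one element.
__global__ void addOne(float* data, int n) {
    // Global thread index = block offset + thread offset within the block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float* d;
    cudaMalloc(&d, n * sizeof(float));
    int threadsPerBlock = 256;                                 // threads per thread block
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // thread blocks in the launch
    addOne<<<blocks, threadsPerBlock>>>(d, n);                 // the kernel contains all the blocks
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```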

Memory address space

  1. Each thread has its own private memory address space.
  2. Each thread block has a shared memory address space, visible to all threads in that block.
  3. All threads, across all thread blocks, share one global memory address space (analogous to a process address space).

(figure: per-thread, per-block, and global address spaces)

Why this three-level address space hierarchy? It enables efficient memory access when the threads of a thread block are scheduled on the same core.
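Here is a minimal sketch of all three address spaces in one hypothetical kernel, assuming a block size of 256: a per-thread local variable, a per-block `__shared__` array, and global memory visible to every thread in every block.

```cuda
// Hypothetical kernel: doubles each input element, staging through shared memory.
__global__ void scaleViaShared(const float* in, float* out, int n) {
    __shared__ float tile[256];        // 2. per-block shared address space (assumes blockDim.x == 256)
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float x = (i < n) ? in[i] : 0.0f;  // 1. per-thread local variable
    tile[threadIdx.x] = 2.0f * x;
    __syncthreads();                   // every thread in the block now sees tile
    if (i < n)
        out[i] = tile[threadIdx.x];    // 3. global memory, shared by all thread blocks
}
```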

NVIDIA GPU (implementation)

A warp in an NVIDIA GPU is a group of 32 threads within a thread block.

Each CUDA thread has its own PC (program counter), even when the threads belong to the same warp.

However, since all threads in the same warp are likely to execute the same code and the same instructions, it effectively looks like there are only 4 unique PCs (one per warp), even though in reality there are 4 * 32 = 128 PCs.
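As a hypothetical illustration of the warp grouping, a thread can derive its warp and lane index from its thread index; launching 128 threads in one block gives the 4 warps mentioned above.

```cuda
#include <cstdio>

// Hypothetical kernel: print which warp each thread belongs to.
__global__ void warpInfo() {
    int tid    = threadIdx.x;
    int warpId = tid / warpSize;   // warpSize is 32 on NVIDIA GPUs
    int laneId = tid % warpSize;   // position within the warp (0..31)
    if (laneId == 0)               // one printf per warp
        printf("warp %d starts at thread %d\n", warpId, tid);
}

int main() {
    warpInfo<<<1, 128>>>();        // 128 threads = 4 warps of 32
    cudaDeviceSynchronize();
    return 0;
}
```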

Difference between a warp and a thread block.

A thread block is a programming model abstraction.

A warp is a hardware implementation concept.

Both represent a group of threads.

In the diagram below, each sub-core holds 4 warps.

Each SM (streaming multiprocessor) has 4 sub-cores.

V100 has 80 SMs in total.

(figure: a V100 SM with 4 sub-cores, each holding 4 warps)
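These counts can be checked at runtime with the standard `cudaGetDeviceProperties` API; a minimal sketch (on a V100 this should report 80 SMs and a warp size of 32):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0
    printf("%s: %d SMs, warp size %d\n",
           prop.name, prop.multiProcessorCount, prop.warpSize);
    return 0;
}
```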

Instruction execution.

Since there are more execution contexts than ALUs, each instruction finishes half of its work in one cycle and the other half in the next cycle: a warp has 32 threads, but each sub-core has only 16 FP32 ALUs, so one warp instruction takes 32 / 16 = 2 cycles.

(figure: one warp instruction executing over two cycles)



