Stanford CS149 Parallel Programming - Lectures 5 & 6 - Performance Optimization

Lecture 5

Choosing task granularity is important for dynamic scheduling in parallel programming.

Small granularity gives better workload distribution across workers, but comes with higher synchronization overhead; large granularity reduces overhead but risks load imbalance.
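This trade-off can be sketched with a shared work counter: workers grab the next chunk of iterations atomically, so each grab is a synchronization point. The names (`dynamicSchedule`, `processWork`) and the work function itself are illustrative assumptions, not code from the lecture:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// processWork stands in for one unit of work (hypothetical example task).
func processWork(i int) int64 { return int64(i) * int64(i) }

// dynamicSchedule: workers repeatedly grab the next `chunk` iterations
// from a shared atomic counter. Smaller chunks balance load better;
// larger chunks mean fewer atomic operations (less sync overhead).
func dynamicSchedule(n, chunk, workers int) int64 {
	var next int64  // shared work counter; each fetch-add is a sync point
	var total int64 // accumulated result
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				start := atomic.AddInt64(&next, int64(chunk)) - int64(chunk)
				if start >= int64(n) {
					return // no work left
				}
				end := start + int64(chunk)
				if end > int64(n) {
					end = int64(n)
				}
				var local int64
				for i := start; i < end; i++ {
					local += processWork(int(i))
				}
				atomic.AddInt64(&total, local)
			}
		}()
	}
	wg.Wait()
	return total
}

func main() {
	// Same result either way; only overhead and load balance differ.
	fmt.Println(dynamicSchedule(1000, 1, 4))   // fine-grained: ~1000 counter updates
	fmt.Println(dynamicSchedule(1000, 100, 4)) // coarse-grained: ~10 counter updates
}
```

With `chunk = 1` every iteration pays for an atomic fetch-add; with `chunk = 100` the synchronization cost drops 100x, at the cost of coarser load balancing.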

Lecture 6

Performance optimization: locality, communication, and contention.

Reduce the costs of communication:

  1. between processors.
  2. between processors and memory.

Shared-memory communication. NUMA (non-uniform memory access): memory access latency depends on which processor touches which memory node.

Message passing: blocking sends vs. non-blocking sends.
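The blocking/non-blocking distinction can be illustrated with Go channels (an analogy, not the lecture's message-passing API): an unbuffered channel send blocks until a receiver is ready, while a `select` with `default` makes a send non-blocking:

```go
package main

import "fmt"

// trySend is a non-blocking send: it delivers v if the channel has room
// and otherwise returns false immediately instead of waiting, similar in
// spirit to a non-blocking send in a message-passing system.
func trySend(ch chan int, v int) bool {
	select {
	case ch <- v:
		return true
	default:
		return false // send would block; give up immediately
	}
}

func main() {
	// Unbuffered channel: a plain send blocks until the matching receive,
	// like a blocking send that waits for the receiver.
	blocking := make(chan int)
	go func() { blocking <- 42 }() // blocks until main receives
	fmt.Println(<-blocking)        // 42

	// Buffered channel: non-blocking sends succeed until the buffer fills.
	buffered := make(chan int, 1)
	fmt.Println(trySend(buffered, 1)) // true: buffer had room
	fmt.Println(trySend(buffered, 2)) // false: buffer full
}
```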

Reducing communication is important for achieving maximum CPU utilization: the goal is to keep the CPU busy with useful work rather than stalled waiting on data.

Roofline model (figure omitted: attainable GFLOPS/s vs. arithmetic intensity).

To achieve the maximum computational throughput (GFLOPS/s) of a CPU or GPU, the algorithm must have high arithmetic intensity, i.e., a high FLOPs/byte ratio: many arithmetic operations per byte of memory accessed.
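The roofline bound itself is just a min of two terms. A minimal sketch, using hypothetical machine numbers (1000 GFLOPS/s peak, 100 GB/s bandwidth) chosen only for illustration:

```go
package main

import "fmt"

// attainable returns the roofline bound on throughput in GFLOPS/s:
// the minimum of peak compute throughput and memory bandwidth times
// arithmetic intensity. Machine numbers here are assumed, not measured.
func attainable(peakGflops, bwGBs, flopsPerByte float64) float64 {
	memBound := bwGBs * flopsPerByte
	if memBound < peakGflops {
		return memBound // memory-bandwidth bound
	}
	return peakGflops // compute bound
}

func main() {
	// SAXPY (y[i] = a*x[i] + y[i], float32): 2 FLOPs per 12 bytes moved,
	// an arithmetic intensity of ~0.17 FLOPs/byte -> memory bound.
	fmt.Println(attainable(1000, 100, 2.0/12.0))
	// A compute-heavy kernel at 20 FLOPs/byte hits the compute roof.
	fmt.Println(attainable(1000, 100, 20)) // 1000
}
```

Below the ridge point (here 10 FLOPs/byte) performance scales with intensity; above it, the machine's peak is the limit, which is why high FLOPs/byte is needed to reach peak throughput.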



