Stanford CS149 Parallel Programming - Assignment 1

Source link

Program 1: generate the Mandelbrot image with multiple threads.

Hardware: hyper-threading?

Code: partition the image-generation task across threads.

Plot the speedup curve with the number of threads on the x-axis (see the summary after the runs below).

(base) ➜  prog1_mandelbrot_threads git:(master) ✗ ./mandelbrot --threads 2
[mandelbrot serial]:            [525.650] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]:            [280.569] ms
Wrote image file mandelbrot-thread.ppm
                                (1.87x speedup from 2 threads)
(base) ➜  prog1_mandelbrot_threads git:(master) ✗ ./mandelbrot --threads 3
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]:            [341.063] ms
Wrote image file mandelbrot-thread.ppm
                                (1.56x speedup from 3 threads)
(base) ➜  prog1_mandelbrot_threads git:(master) ✗ ./mandelbrot --threads 4
[mandelbrot serial]:            [531.202] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]:            [237.027] ms
Wrote image file mandelbrot-thread.ppm
                                (2.24x speedup from 4 threads)
(base) ➜  prog1_mandelbrot_threads git:(master) ✗ ./mandelbrot --threads 5
[mandelbrot serial]:            [532.980] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]:            [213.517] ms
Wrote image file mandelbrot-thread.ppm
                                (2.50x speedup from 5 threads)
(base) ➜  prog1_mandelbrot_threads git:(master) ✗ ./mandelbrot --threads 6
[mandelbrot serial]:            [531.480] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]:            [182.457] ms
Wrote image file mandelbrot-thread.ppm
                                (2.91x speedup from 6 threads)
(base) ➜  prog1_mandelbrot_threads git:(master) ✗ ./mandelbrot --threads 7
[mandelbrot serial]:            [530.595] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]:            [173.007] ms
Wrote image file mandelbrot-thread.ppm
Mismatch : [1197][142], Expected : 1, Actual : 0
Error : Output from threads does not match serial output
(base) ➜  prog1_mandelbrot_threads git:(master) ✗ ./mandelbrot --threads 8
[mandelbrot serial]:            [534.651] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot thread]:            [150.080] ms
Wrote image file mandelbrot-thread.ppm
                                (3.56x speedup from 8 threads)
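
Measured speedups so far (the 7-thread run failed output verification, so no speedup was printed):

threads:  2      3      4      5      6      8
speedup:  1.87x  1.56x  2.24x  2.50x  2.91x  3.56x

Two things stand out. First, the curve is not monotonic: 3 threads is slower than 2, which the per-thread timings below explain. Second, the 7-thread failure is most likely a remainder bug: 1200 rows do not divide evenly by 7, so the last 1200 % 7 = 3 rows (1197-1199) were never computed, and the reported mismatch is indeed at row 1197.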

Time for each thread

Not all threads finish at the same time.

Why is that?

(base) ➜  prog1_mandelbrot_threads git:(master) ✗ ./mandelbrot --threads 2
[mandelbrot serial]:            [529.644] ms
Wrote image file mandelbrot-serial.ppm
exe time: 275.106996 ms
exe time: 294.014979 ms
exe time: 278.083589 ms
exe time: 280.188169 ms
exe time: 268.240355 ms
exe time: 288.978558 ms
exe time: 274.672702 ms
exe time: 285.212621 ms
exe time: 275.313959 ms
exe time: 291.359153 ms
[mandelbrot thread]:            [280.327] ms
Wrote image file mandelbrot-thread.ppm
                                (1.89x speedup from 2 threads)


(base) ➜  prog1_mandelbrot_threads git:(master) ✗ ./mandelbrot --threads 3
[mandelbrot serial]:            [532.551] ms
Wrote image file mandelbrot-serial.ppm
exe time: 120.048959 ms
exe time: 127.574176 ms
exe time: 346.292444 ms
exe time: 122.336693 ms
exe time: 122.885458 ms
exe time: 342.198521 ms
exe time: 123.949669 ms
exe time: 123.917334 ms
exe time: 343.334582 ms
exe time: 121.276554 ms
exe time: 121.796411 ms
exe time: 339.319866 ms
exe time: 122.690346 ms
exe time: 123.405423 ms
exe time: 341.921013 ms
[mandelbrot thread]:            [339.491] ms
Wrote image file mandelbrot-thread.ppm
                                (1.57x speedup from 3 threads)


(base) ➜  prog1_mandelbrot_threads git:(master) ✗ ./mandelbrot --threads 4
[mandelbrot serial]:            [532.573] ms
Wrote image file mandelbrot-serial.ppm
exe time: 66.314548 ms
exe time: 69.146506 ms
exe time: 236.007646 ms
exe time: 236.860119 ms
exe time: 67.293212 ms
exe time: 68.643764 ms
exe time: 235.531762 ms
exe time: 235.957604 ms
exe time: 67.872606 ms
exe time: 68.252590 ms
exe time: 231.048137 ms
exe time: 236.915503 ms
exe time: 68.757534 ms
exe time: 70.160590 ms
exe time: 219.524853 ms
exe time: 238.315675 ms
exe time: 66.293016 ms
exe time: 66.733379 ms
exe time: 233.316239 ms
exe time: 234.051295 ms
[mandelbrot thread]:            [234.236] ms
Wrote image file mandelbrot-thread.ppm
                                (2.27x speedup from 4 threads)


(base) ➜  prog1_mandelbrot_threads git:(master) ✗ ./mandelbrot --threads 8
[mandelbrot serial]:            [533.956] ms
Wrote image file mandelbrot-serial.ppm
exe time: 21.111727 ms
exe time: 21.316495 ms
exe time: 59.669584 ms
exe time: 59.972882 ms
exe time: 101.592321 ms
exe time: 104.972839 ms
exe time: 144.783191 ms
exe time: 145.647489 ms
[mandelbrot thread]:            [133.506] ms
Wrote image file mandelbrot-thread.ppm
                                (4.00x speedup from 8 threads)

More efficient implementation

Cache-line aware?

In this code the image rows are divided into blocks of numThreads rows, and the thread with a given threadId computes row threadId within each block (an interleaved layout).

This means that at any given moment all threads are likely working on rows in the same block, whose memory is hot in cache.

This is just my understanding.

The reason some threads take much longer to finish is that some adjacent rows need much more time to compute. With a round-robin assignment strategy the expensive rows are spread across all threads, so each thread gets roughly the same amount of work.

GPT gave me the following answer when I asked why to use round-robin assignment:

Thread 2 has one more row to process than the other threads, which can lead to a slight imbalance. However, the imbalance becomes more pronounced if the work done per row is not uniform. For example, if the computation for the Mandelbrot set is more complex for certain rows, the threads processing those rows will take longer to complete, leading to idle time for other threads.

This imbalance can be avoided by using a better load balancing strategy, such as the round-robin assignment used in the original code, which distributes the rows more evenly across the threads. 😊
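
My round-robin implementation of the worker: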

void workerThreadStart(WorkerArgs * const args) {

  // Round-robin (interleaved) row assignment: thread `threadId` computes
  // rows threadId, threadId + numThreads, threadId + 2*numThreads, ...
  // Looping until start_row reaches height also covers heights that are
  // not a multiple of numThreads (a bound of height/numThreads would drop
  // the remainder rows, which is likely what broke the 7-thread run above).
  double startTime = CycleTimer::currentSeconds();
  float x0 = args->x0;
  float y0 = args->y0;
  float x1 = args->x1;
  float y1 = args->y1;
  int width = args->width;
  int height = args->height;

  for (int start_row = args->threadId; start_row < height;
       start_row += args->numThreads) {
    mandelbrotSerial(x0, y0, x1, y1,
                     width, height,
                     start_row, /* numRows = */ 1,
                     args->maxIterations, args->output);
  }

  double endTime = CycleTimer::currentSeconds();
  double exe_time = endTime - startTime;
  printf("exe time: %f ms\n", exe_time * 1000);
}
(base) ➜  prog1_mandelbrot_threads git:(master) ✗ ./mandelbrot --threads 4
[mandelbrot serial]:            [533.651] ms
Wrote image file mandelbrot-serial.ppm
exe time: 142.692391 ms
exe time: 157.330707 ms
exe time: 157.454997 ms
exe time: 157.499295 ms
exe time: 141.290620 ms
exe time: 150.249667 ms
exe time: 150.373559 ms
exe time: 150.334526 ms
exe time: 138.450164 ms
exe time: 151.096858 ms
exe time: 151.156478 ms
exe time: 151.210513 ms
exe time: 138.270788 ms
exe time: 150.668491 ms
exe time: 150.774766 ms
exe time: 150.800020 ms
exe time: 138.079636 ms
exe time: 150.737014 ms
exe time: 150.741972 ms
exe time: 150.848471 ms
[mandelbrot thread]:            [150.471] ms
Wrote image file mandelbrot-thread.ppm
                                (3.55x speedup from 4 threads)

Comparison between inefficient assignment and round-robin assignment

Naive sequential assignment:
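
For reference, a minimal sketch of what the block assignment looks like (assuming the same WorkerArgs fields and mandelbrotSerial signature as in workerThreadStart above; this is my reconstruction, not the exact code used for these runs):

void workerThreadStartBlocked(WorkerArgs * const args) {
  // Block (contiguous) assignment: thread `threadId` computes one
  // contiguous band of rows, so a cluster of expensive adjacent rows
  // all lands on the same thread.
  int rowsPerThread = args->height / args->numThreads;
  int startRow = args->threadId * rowsPerThread;
  // the last thread also covers the remainder rows when
  // height % numThreads != 0
  int numRows = (args->threadId == args->numThreads - 1)
                    ? args->height - startRow
                    : rowsPerThread;
  mandelbrotSerial(args->x0, args->y0, args->x1, args->y1,
                   args->width, args->height,
                   startRow, numRows,
                   args->maxIterations, args->output);
}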

Thread running times for another image generation (--view 2):

(base) ➜  prog1_mandelbrot_threads git:(master) ✗ ./mandelbrot --view 2 --threads 4
[mandelbrot serial]:            [311.051] ms
Wrote image file mandelbrot-serial.ppm
exe time: 84.468454 ms
exe time: 86.912777 ms
exe time: 87.307919 ms
exe time: 133.808278 ms
[mandelbrot thread]:            [119.725] ms
Wrote image file mandelbrot-thread.ppm
                                (2.60x speedup from 4 threads)

Round-robin assignment:

(base) ➜  prog1_mandelbrot_threads git:(master) ✗ ./mandelbrot --view 2 --threads 4
[mandelbrot serial]:            [310.842] ms
Wrote image file mandelbrot-serial.ppm
exe time: 83.830711 ms
exe time: 93.051653 ms
exe time: 93.096461 ms
exe time: 93.373701 ms
[mandelbrot thread]:            [93.562] ms
Wrote image file mandelbrot-thread.ppm
                                (3.32x speedup from 4 threads)

Prog2

CPU: Xeon 6230N. Instruction set extensions: Intel® SSE4.2, Intel® AVX, Intel® AVX2, Intel® AVX-512.

CPU: i7-7700K. Instruction set extensions: Intel® SSE4.1, Intel® SSE4.2, Intel® AVX2.

Both support Hyper-Threading, which means each physical core exposes two hardware threads.

Prog2: Vectorizing code using SIMD intrinsics

Assignment link

Solution idea:

Just translate clampedExpSerial to use SIMD. Refer to absVector and absSerial to see how the translation works.

To handle a total number of loop iterations that is not a multiple of the SIMD vector width, we can set maskAll at the start of each iteration so that only the valid lanes of the input vector take part in the SIMD computation:

    if(i + VECTOR_WIDTH > N) {
      maskAll = _cs149_init_ones(remain_count);
    } else {
      maskAll = _cs149_init_ones();
    }

code:

void clampedExpVector(float* values, int* exponents, float* output, int N) {

  //
  // CS149 STUDENTS TODO: Implement your vectorized version of
  // clampedExpSerial() here.
  //
  // Your solution should work for any value of
  // N and VECTOR_WIDTH, not just when VECTOR_WIDTH divides N
  //

  __cs149_vec_float x;
  __cs149_vec_int   y_exponents;
  __cs149_vec_float result;
  __cs149_vec_int   zeros_int  = _cs149_vset_int(0);
  __cs149_vec_int   ones_int   = _cs149_vset_int(1);
  __cs149_vec_float ones_float = _cs149_vset_float(1.0f);
  __cs149_mask maskAll, maskExponentIsZero, maskExponentNotZero;
  int remain_count = N % VECTOR_WIDTH;

  for (int i = 0; i < N; i += VECTOR_WIDTH) {
    // For the final partial vector, enable only the lanes that map to
    // valid elements of the input.
    if (i + VECTOR_WIDTH > N) {
      maskAll = _cs149_init_ones(remain_count);
    } else {
      maskAll = _cs149_init_ones();
    }

    maskExponentIsZero = _cs149_init_ones(0);

    _cs149_vload_float(x, values + i, maskAll);
    _cs149_vload_int(y_exponents, exponents + i, maskAll);

    // Lanes where the exponent is zero produce 1.0f.
    _cs149_veq_int(maskExponentIsZero, y_exponents, zeros_int, maskAll);
    maskExponentNotZero = _cs149_mask_not(maskExponentIsZero);
    _cs149_vmove_float(result, ones_float, maskExponentIsZero);

    // Remaining lanes start with result = x and count = exponent - 1.
    _cs149_vmove_float(result, x, maskExponentNotZero);
    _cs149_vsub_int(y_exponents, y_exponents, ones_int, maskExponentNotZero);

    // Keep multiplying while any active lane still has count > 0.
    __cs149_mask maskExpNotZeroCnt;
    _cs149_vgt_int(maskExpNotZeroCnt, y_exponents, zeros_int, maskAll);
    while (_cs149_cntbits(maskExpNotZeroCnt)) {
      _cs149_vmult_float(result, result, x, maskExpNotZeroCnt);
      _cs149_vsub_int(y_exponents, y_exponents, ones_int, maskExponentNotZero);
      _cs149_vgt_int(maskExpNotZeroCnt, y_exponents, zeros_int, maskAll);
    }

    // Clamp: if (result > 9.999999f) result = 9.999999f;
    __cs149_mask mask_gt_9;
    __cs149_vec_float nine_float = _cs149_vset_float(9.999999f);
    _cs149_vgt_float(mask_gt_9, result, nine_float, maskAll);
    _cs149_vmove_float(result, nine_float, mask_gt_9);

    _cs149_vstore_float(output + i, result, maskAll);
  }
}
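
One subtlety: maskExponentNotZero is the complement of maskExponentIsZero over all lanes, including the invalid tail lanes of the final iteration. Those lanes may compute on garbage, but the final _cs149_vstore_float is masked by maskAll, so the garbage never reaches output; AND-ing the complement with maskAll would be the stricter alternative.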

Run ./myexp -s 10000 and sweep the vector width from 2, 4, 8, to 16. Record the resulting vector utilization. You can do this by changing the #define VECTOR_WIDTH value in CS149intrin.h. Does the vector utilization increase, decrease or stay the same as VECTOR_WIDTH changes? Why?

Answer: The vector utilization decreases as VECTOR_WIDTH increases. I think the reason is that more lanes go idle as the width grows: the inner while loop keeps running until the lane with the largest exponent finishes, and with more lanes per vector it is more likely that at least one lane keeps iterating while the rest sit masked off.

So it’s not a very good idea to have a very large VECTOR_WIDTH?

vector width 2:

(base) ➜  prog2_vecintrin git:(master) ✗ ./myexp -s 10000
CLAMPED EXPONENT (required)
Results matched with answer!
****************** Printing Vector Unit Statistics *******************
Vector Width:              2
Total Vector Instructions: 167727
Vector Utilization:        88.7%
Utilized Vector Lanes:     297685
Total Vector Lanes:        335454
************************ Result Verification *************************
Passed!!!

vector width 4:

(base) ➜  prog2_vecintrin git:(master) ✗ ./myexp -s 10000
CLAMPED EXPONENT (required)
Results matched with answer!
****************** Printing Vector Unit Statistics *******************
Vector Width:              4
Total Vector Instructions: 97075
Vector Utilization:        86.2%
Utilized Vector Lanes:     334817
Total Vector Lanes:        388300
************************ Result Verification *************************
Passed!!!

vector width 8:

(base) ➜  prog2_vecintrin git:(master) ✗ ./myexp -s 10000
CLAMPED EXPONENT (required)
Results matched with answer!
****************** Printing Vector Unit Statistics *******************
Vector Width:              8
Total Vector Instructions: 52877
Vector Utilization:        85.0%
Utilized Vector Lanes:     359535
Total Vector Lanes:        423016
************************ Result Verification *************************
Passed!!!

vector width 16:

(base) ➜  prog2_vecintrin git:(master) ✗ ./myexp -s 10000
CLAMPED EXPONENT (required)
Results matched with answer!
****************** Printing Vector Unit Statistics *******************
Vector Width:              16
Total Vector Instructions: 27592
Vector Utilization:        84.4%
Utilized Vector Lanes:     372781
Total Vector Lanes:        441472
************************ Result Verification *************************
Passed!!!
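
Summary of the sweep: utilization is just utilized lanes over total lanes, e.g. 297685 / 335454 ≈ 88.7% at width 2 versus 372781 / 441472 ≈ 84.4% at width 16. So each doubling of the width nearly halves the instruction count while utilization drops only a few points, consistent with the divergence explanation above.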

Prog3

Speedup with ISPC

Launching 80 tasks brings a 62x speedup.


export void mandelbrot_ispc_withtasks(uniform float x0, uniform float y0,
                                      uniform float x1, uniform float y1,
                                      uniform int width, uniform int height,
                                      uniform int maxIterations,
                                      uniform int output[])
{

    uniform int rowsPerTask = height / 80;

    // launch 80 tasks, each covering rowsPerTask rows
    launch[80] mandelbrot_ispc_task(x0, y0, x1, y1,
                                     width, height,
                                     rowsPerTask,
                                     maxIterations,
                                     output); 
}
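
Assuming the starter's 1200-row image, this works out to 1200 / 80 = 15 rows per task; launching far more tasks than cores lets the ISPC task system load-balance the expensive rows across cores.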
(base) ➜  prog3_mandelbrot_ispc git:(master) ✗ ./mandelbrot_ispc  --tasks
[mandelbrot serial]:            [268.624] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot ispc]:              [55.141] ms
Wrote image file mandelbrot-ispc.ppm
[mandelbrot multicore ispc]:    [4.328] ms
Wrote image file mandelbrot-task-ispc.ppm
                                (4.87x speedup from ISPC)
                                (62.07x speedup from task ISPC)

The speedup differs across views (image-generation tasks) even with the same task-parallelism settings.

(base) ➜  prog3_mandelbrot_ispc git:(master) ✗ ./mandelbrot_ispc  --tasks  --view 0
[mandelbrot serial]:            [267.687] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot ispc]:              [54.364] ms
Wrote image file mandelbrot-ispc.ppm
[mandelbrot multicore ispc]:    [4.698] ms
Wrote image file mandelbrot-task-ispc.ppm
                                (4.92x speedup from ISPC)
                                (56.97x speedup from task ISPC)

(base) ➜  prog3_mandelbrot_ispc git:(master) ✗ ./mandelbrot_ispc  --tasks  --view 1
[mandelbrot serial]:            [266.777] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot ispc]:              [53.877] ms
Wrote image file mandelbrot-ispc.ppm
[mandelbrot multicore ispc]:    [5.287] ms
Wrote image file mandelbrot-task-ispc.ppm
                                (4.95x speedup from ISPC)
                                (50.46x speedup from task ISPC)

(base) ➜  prog3_mandelbrot_ispc git:(master) ✗ ./mandelbrot_ispc  --tasks  --view 2
[mandelbrot serial]:            [159.744] ms
Wrote image file mandelbrot-serial.ppm
[mandelbrot ispc]:              [37.937] ms
Wrote image file mandelbrot-ispc.ppm
[mandelbrot multicore ispc]:    [5.020] ms
Wrote image file mandelbrot-task-ispc.ppm
                                (4.21x speedup from ISPC)
                                (31.82x speedup from task ISPC)

Difference between launch and foreach in ISPC

TL;DR: launch expresses task parallelism (independent tasks that may run on different cores), while foreach expresses data parallelism that ISPC maps onto SIMD lanes; launch by itself does not vectorize anything, though the body of a launched task can.

The difference between launch and foreach in ISPC lies in the type of parallelism they express and how they are used:

  • launch: This keyword is used to create a task that runs independently. When you use launch, you’re telling ISPC to execute a function asynchronously, potentially on a different core. This is useful for task parallelism, where you have separate tasks that can run concurrently. For example:
    launch myFunction();
    

    This will execute myFunction as a separate task.

  • foreach: This construct is used for data parallelism, where the same operation is applied to multiple data elements in parallel. The foreach loop allows ISPC to execute iterations in parallel, utilizing SIMD vector units. For example:
    foreach (i = 0 ... N) {
        // loop body
    }
    

    This will execute the loop body in parallel for each value of i from 0 to N-1.

In summary, launch is used for task parallelism to run independent tasks concurrently, while foreach is used for data parallelism to apply the same operation to multiple data elements in parallel. Both constructs help you write parallel code, but they are suited for different types of parallelism. 😊

The launch construct in ISPC is primarily designed for task parallelism, which means it can run tasks in parallel on multiple cores. However, launch itself does not directly utilize SIMD parallelism.

Here’s a more detailed breakdown:

  • launch: When you use launch, ISPC creates a new task that can run concurrently with other tasks. These tasks can be scheduled on different CPU cores, allowing for parallel execution. The focus of launch is on task-level parallelism, not on vectorization or SIMD parallelism.

  • foreach: On the other hand, foreach is designed for data parallelism: it maps the loop iterations onto the SIMD lanes of the executing gang. Note that foreach by itself does not distribute iterations across multiple cores; combining it with launch is what adds multi-core parallelism.

While launch does not directly use SIMD, the tasks that are launched can themselves contain foreach loops or other code that uses SIMD parallelism. So, you can combine launch and foreach to achieve a mix of task parallelism and data parallelism. For example:

// tasks must be declared with the `task` qualifier, and launch must be
// called from ISPC code
task void taskFunction(uniform float data[], uniform int N) {
    foreach (i = 0 ... N) {
        data[i] = data[i] * 2.0f;
    }
}

export void processData(uniform float data[], uniform int N) {
    launch taskFunction(data, N);
}

In this example, taskFunction uses foreach to process the data with SIMD, and launch runs taskFunction as a separate task that may execute on another core.


Prog4

Change the values array to change the speedup of sqrt

The best case is when all lanes of each SIMD vector handle the same value. We get the maximum speedup when each vector also takes a long time to finish, which shrinks the relative share of synchronization and the communication/scheduling overhead introduced by ISPC tasks.

(base) ➜  prog4_sqrt git:(master) ✗ ./sqrt;
[sqrt serial]:          [2062.501] ms
[sqrt ispc]:            [332.707] ms
[sqrt task ispc]:       [14.827] ms
                                (6.20x speedup from ISPC)
                                (139.11x speedup from task ISPC)

The worst case is when one lane in each vector takes an extremely long time to converge while all the other lanes finish almost immediately.

(base) ➜  prog4_sqrt git:(master) ✗ ./sqrt;
[sqrt serial]:          [286.506] ms
[sqrt ispc]:            [330.483] ms
[sqrt task ispc]:       [14.916] ms
                                (0.87x speedup from ISPC)
                                (19.21x speedup from task ISPC)
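
Rough arithmetic for the worst case, assuming an 8-wide vector: let c be the cost of the expensive value and ε the cost of a cheap one. The serial loop costs about (N/8)·c + (7N/8)·ε ≈ (N/8)·c, while the vector loop runs N/8 vectors, each gated by its single slow lane, for about (N/8)·c as well. So the ideal ratio is ~1x, and the SIMD masking overhead pushes it below 1 (the 0.87x above). The values are set in main.cpp: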
    for (unsigned int i = 0; i < N; i++)
    {
        // TODO: CS149 students.  Attempt to change the values in the
        // array here to meet the instructions in the handout: we want
        // you to generate best and worst-case speedups

        // best case: every lane performs the same (long) computation
        values[i] = 2.998f;

        // worst case: one lane per 8-wide vector gets the expensive
        // value while the other seven converge almost immediately
        // if (i % 8 == 0) {
        //     values[i] = 2.998f;
        // } else {
        //     values[i] = 1.f;
        // }

        // random: starter code populates the array with random inputs
        // values[i] = .001f + 2.998f * static_cast<float>(rand()) / RAND_MAX;
    }


Reference:

https://github.com/PKUFlyingPig/asst1/blob/master/prog4_sqrt/main.cpp

Prog5

Extra Credit: (1 point) Note that the total memory bandwidth consumed by the computation in main.cpp is TOTAL_BYTES = 4 * N * sizeof(float);. Even though saxpy loads one element from X, one element from Y, and writes one element to result, the multiplier of 4 is correct. Why is this the case? (Hint: think about how CPU caches work.)

Answer: although each iteration only loads X[i] and Y[i] and stores result[i], the store misses in a write-allocate cache, so the line containing result[i] is first read from memory before being overwritten and later written back. Per element that is four floats of bus traffic: load X, load Y, the write-allocate read of result, and the write-back of result, hence the factor of 4.

void saxpySerial(int N,
                       float scale,
                       float X[],
                       float Y[],
                       float result[])
{

    for (int i=0; i<N; i++) {
        result[i] = scale * X[i] + Y[i];
    }
}

Output:

(base) ➜  prog5_saxpy git:(master) ✗ ./saxpy
[saxpy serial]:         [20.605] ms     [14.464] GB/s   [1.941] GFLOPS
[saxpy ispc]:           [17.866] ms     [16.681] GB/s   [2.239] GFLOPS
[saxpy task ispc]:      [3.122] ms      [95.446] GB/s   [12.810] GFLOPS
                                (5.72x speedup from use of tasks)
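
A quick sanity check of those numbers, assuming the starter's N of 20 million elements (the arithmetic matches the reported figures): TOTAL_BYTES = 4 × 20,000,000 × 4 B = 320 MB ≈ 0.298 GiB, and 0.298 GiB / 0.020605 s ≈ 14.46 GB/s, matching the serial line. Likewise saxpy does 2 FLOPs per element, and 40 × 10^6 / 0.020605 s ≈ 1.94 GFLOPS.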

Extra Credit: (points handled on a case-by-case basis) Improve the performance of saxpy. We’re looking for a significant speedup here, not just a few percentage points. If successful, describe how you did it and what a best-possible implementation on these systems might achieve. Also, if successful, come tell the staff, we’ll be interested. ;-)

Answer: I don’t know how to do that.
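
For what it's worth, one standard direction (not attempted here, just a hedged sketch) is to avoid the write-allocate read of result with non-temporal (streaming) stores, cutting per-element bus traffic from four floats to three, so the ceiling is roughly a 4/3 improvement: saxpy stays memory-bound either way. The function name saxpyStream and the alignment assumptions below are mine, not from the assignment:

#include <immintrin.h>

// Sketch: SAXPY with non-temporal stores. _mm256_stream_ps writes
// around the cache, skipping the write-allocate read of result[],
// so each element moves ~12 bytes instead of 16 across the bus.
// Assumes result is 32-byte aligned and N is a multiple of 8.
void saxpyStream(int N, float scale, const float X[], const float Y[],
                 float result[]) {
    __m256 s = _mm256_set1_ps(scale);
    for (int i = 0; i < N; i += 8) {
        __m256 x = _mm256_loadu_ps(X + i);
        __m256 y = _mm256_loadu_ps(Y + i);
        _mm256_stream_ps(result + i, _mm256_add_ps(_mm256_mul_ps(s, x), y));
    }
    _mm_sfence();  // order streaming stores before any later reads
}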



