- We focus just on the computation performed by the loop, since this is the dominating factor in performance for large vectors.
- The compiled code for this loop consists of four instructions, with registers %rdx holding a pointer to the ith element of array data, %rax holding a pointer to the end of the array, and %xmm0 holding the accumulated value acc.
- The instruction decoder expands these four instructions into a series of five operations, with the initial multiplication instruction being expanded into a load operation to read the source operand from memory and a mul operation to perform the multiplication.
- Register %rax is used only as a source value by the cmp operation, so it has the same value at the end of the loop as at the beginning. Register %rdx, on the other hand, is both used and updated within the loop.
- The chains of operations between loop registers determine the performance-limiting data dependencies.
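- The purpose of the compare and branch operations is to test the branch condition and notify the instruction control unit (ICU) if the branch is not taken.
- Across the iterations of the loop, the program has two chains of data dependencies, corresponding to the updating of program values acc and data+i with operations mul and add, respectively.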
- Given that floating-point multiplication has a latency of 5 cycles, while integer addition has a latency of 1 cycle, we can see that the chain of mul operations updating acc (the chain on the left of the diagram) forms the critical path, requiring 5n cycles to execute.
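The loop and its two dependency chains can be sketched in C as follows. This is a minimal sketch, assuming the loop multiplies double values into an accumulator. The function names `product_loop` and `critical_path_cycles` are hypothetical, and the cycle estimate is a simplified model that considers only the two chains named above.

```c
#include <stddef.h>

/* Latencies quoted in the text: FP multiply = 5 cycles, integer add = 1 cycle. */
enum { MUL_LATENCY = 5, ADD_LATENCY = 1 };

/* Hypothetical stand-in for the inner loop. acc plays the role of %xmm0,
   p the role of %rdx (pointer to the ith element), and end the role of
   %rax (pointer to the end of the array, read but never updated). */
double product_loop(const double *data, size_t n)
{
    double acc = 1.0;
    const double *end = data + n;
    for (const double *p = data; p != end; p++) {
        acc = acc * *p;  /* load + mul: the dependency chain through acc */
    }                    /* p++ is the add chain; p != end is cmp + branch */
    return acc;
}

/* Simplified model (an assumption): the two chains proceed in parallel,
   so the longer one bounds the running time for n iterations. */
unsigned long critical_path_cycles(unsigned long n)
{
    unsigned long mul_chain = MUL_LATENCY * n;  /* updates of acc */
    unsigned long add_chain = ADD_LATENCY * n;  /* updates of data+i */
    return mul_chain > add_chain ? mul_chain : add_chain;
}
```

Under this model the mul chain always wins, which is why the pointer update and the loop test do not show up in the 5n estimate.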
5.8 Loop Unrolling
- Loop unrolling is a program transformation that reduces the number of iterations for a loop by increasing the number of elements computed on each iteration.
- Loop unrolling can improve performance in two ways. First, it reduces the number of operations that do not contribute directly to the program result, such as loop indexing and conditional branching. Second, it exposes ways in which we can further transform the code to reduce the number of operations in the critical paths of the overall computation.
- First, we make sure the first loop does not overrun the array bounds.
- We include the second loop to step through the final few elements of the vector one at a time.
- We see that the CPE for integer addition improves, achieving the latency bound of 1.00.
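The two-loop structure described above can be sketched as follows. This is a minimal sketch of unrolling by a factor of 2, assuming integer addition over an array of long. The function name `sum_unrolled2` is hypothetical, and the bound check `i + 1 < n` is one way (an assumption, chosen to stay safe with an unsigned length) of keeping the first loop inside the array.

```c
#include <stddef.h>

/* Hypothetical example: sum an array, combining two elements per iteration. */
long sum_unrolled2(const long *data, size_t n)
{
    long acc = 0;
    size_t i = 0;

    /* First loop: runs only while two elements remain, so data[i + 1]
       never reads past the end of the array. */
    for (; i + 1 < n; i += 2) {
        acc = (acc + data[i]) + data[i + 1];
    }

    /* Second loop: steps through any remaining element one at a time. */
    for (; i < n; i++) {
        acc = acc + data[i];
    }
    return acc;
}
```

For an odd length such as 5, the first loop handles indices 0 through 3 and the second loop handles index 4. The loop index and branch are evaluated about half as often as in the one-element-per-iteration version.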