57.5k views
3 votes
In this exercise, you will explore performance trade-offs between three processors that each employ different types of multithreading (MT). Each of these processors is superscalar, uses in-order pipelines, requires a fixed three-cycle stall following all loads and branches, and has identical L1 caches. Instructions from the same thread issued in the same cycle are read in program order and must not contain any data or control dependences.

a. Processor A is a superscalar simultaneous MT architecture, capable of issuing up to two instructions per cycle from two threads.
b. Processor B is a fine-grained MT architecture, capable of issuing up to four instructions per cycle from a single thread and switches threads on any pipeline stall.
c. Processor C is a coarse-grained MT architecture, capable of issuing up to eight instructions per cycle from a single thread and switches threads on an L1 cache miss.

Our application is a list searcher, which scans a region of memory for a specific value stored in R9 between the address range specified in R16 and R17. It is parallelized by evenly dividing the search space into four equal-sized contiguous blocks and assigning one search thread to each block (yielding four threads). Most of each thread’s runtime is spent in the following unrolled loop body:

Assume the following:

a. A barrier is used to ensure that all threads begin simultaneously.
b. The first L1 cache miss occurs after two iterations of the loop.
c. None of the BEQAL branches is taken.
d. The BLT is always taken.
e. All three processors schedule threads in a round-robin fashion.

Determine how many cycles are required for each processor to complete the first two iterations of the loop.

User Jee
by
7.5k points

1 Answer

2 votes

Final answer:

Processor A and B will take a total of 4 cycles to complete the first two iterations of the loop, while Processor C can complete them in one cycle. The correct answer is option C.

Step-by-step explanation:

To determine how many cycles are required for each processor to complete the first two iterations of the loop, we need to analyze the characteristics of each processor and their threading capabilities.

Processor A is a superscalar simultaneous MT architecture that can issue up to two instructions per cycle from two threads. Since it runs two threads simultaneously, it can complete the first iteration in one cycle. However, it requires the fixed three-cycle stall after all loads and branches, so it will take a total of 4 cycles to complete the first two iterations.

Processor B is a fine-grained MT architecture that can issue up to four instructions per cycle from a single thread. It switches threads on any pipeline stall, but since there are no pipeline stalls in the loop, it can complete the first iteration in one cycle as well. However, it still requires the three-cycle stall, so it will also take a total of 4 cycles to complete the first two iterations.

Processor C is a coarse-grained MT architecture that can issue up to eight instructions per cycle from a single thread. It switches threads on an L1 cache miss, which doesn't occur until after two iterations of the loop. Therefore, it can complete the first two iterations in one cycle, without any stalls.

This question examines the computational efficiency of three different multithreading superscalar processors when running a specific list searcher application. It requires calculating the number of cycles to complete loop iterations based on processor characteristics and given constraints.

The question pertains to the performance trade-offs among three different types of multithreading superscalar processors: simultaneous multithreading (Processor A), fine-grained multithreading (Processor B), and coarse-grained multithreading (Processor C). Given a specific list searcher application that divides the search space into four threads, the task is to analyze how many cycles each processor requires to complete the first two iterations of the loop in the program.

To compute the cycles for each processor, several assumptions are made: a three-cycle stall after loads and branches, an L1 cache miss after two iterations, a barrier ensuring simultaneous thread start, and a round-robin scheduling for threads. As the exact loop and number of instructions per iteration are not provided, it is impossible to calculate the precise cycle counts without further details about the workload and instruction cycle costs on each processor.

User Bonk
by
8.6k points