214k views
0 votes
Consider the all-reduce operation in which each processor starts with an array of m words, and needs to get the global sum of the respective words in the array at each processor. This operation can be implemented on a ring using one of the following three alternatives: All-to-all broadcast of all the arrays followed by a local computation of the sum of the respective elements of the array. Single node accumulation of the elements of the array, followed by a one-to-all broadcast of the result array. An algorithm that uses the pattern of the all-to-all broadcast, but simply adds numbers rather than concatenating messages. For each of the above cases, compute the run time in terms of m, ts, and tw.

User Jav
by
7.5k points

1 Answer

6 votes

Final answer:

The computation of run time for the all-reduce operation on a ring network involves different methods, each with its own time complexity dependent on the message startup latency (ts), bandwidth per word (tw), and the size of arrays (m).

Step-by-step explanation:

The question pertains to the run time calculation of an all-reduce operation on a ring network topology using different strategies for summing arrays of size m across processors. Here, ts represents the startup time for a message (latency), and tw represents the time taken to send one word (bandwidth). We need to compute the run time in terms of m, ts, and tw for three different cases of the all-reduce operation.

Case 1: All-to-all broadcast followed by local computation

The all-to-all broadcast would involve sending each processor's array to every other processor. Assuming there are p processors, each processor would send m words to p-1 other processors. The time for this broadcast is p × (ts + m × tw). After receiving all arrays, each processor computes the sum locally, which does not add to the communication time.

Case 2: Single node accumulation followed by one-to-all broadcast

One processor receives and accumulates the arrays, which takes (p-1) × (ts + m × tw) time. Then, this processor broadcasts the result to all others, taking an additional (p-1) × ts plus m words broadcasted to p-1 processors, resulting in an additional m × tw time.

Case 3: All-to-all broadcast adding numbers

This algorithm involves each processor adding numbers as they are received instead of waiting for all the values to arrive. The run time would be similar to the first case without the local computation step, yielding a time of p × (ts + m × tw).

User Sbrbot
by
7.7k points