4 votes
5.14 ◆ Write a version of the inner product procedure described in Problem 5.13 that uses 6 × 1 loop unrolling. For x86-64, our measurements of the unrolled version give a CPE of 1.07 for integer data but still 3.01 for both floating-point data. A. Explain why any (scalar) version of an inner product procedure running on an Intel Core i7 Haswell processor cannot achieve a CPE less than 1.00. B. Explain why the performance for floating-point data did not improve with loop unrolling.

1 Answer

5 votes

Answer:

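(a) Each element needs two loads, one from udata and one from vdata. The processor has only two load units, so it can process at most one element per clock cycle, and the CPE cannot go below 1.00. (b) The floating-point versions stay at a CPE of about 3.00 because unrolling does not shorten the chain of dependent additions into sum. That chain sets the bound, even though each multiplication takes 4 or 5 clock cycles.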

Step-by-step explanation:

Solution

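The two floating-point versions can have CPEs of 3.00, even though the multiplication requires 4 or 5 clock cycles, because latency and issue time measure different things. Latency is the total number of clock cycles needed to perform the operation, while issue time is the minimum number of cycles between two successive operations of the same type. The multiplier is pipelined and the multiplications do not depend on one another, so they can overlap. Only the additions into sum must run one after another, and a floating-point addition has a latency of 3 cycles.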

Now,

sum = sum + udata[i] * vdata[i]

As an illustration, let i run from 0 to 3.

Thus,

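The value of sum expands to

sum = ((((sum + udata[0] * vdata[0]) + (udata[1] * vdata[1])) + (udata[2] * vdata[2])) + (udata[3] * vdata[3]))

Each addition needs the result of the one before it, so the additions form a single sequential chain. The products do not depend on sum and can be computed ahead of time.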
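The question also asks for the 6 × 1 unrolled procedure itself. A minimal sketch follows, assuming plain arrays and a length argument instead of the book's vector abstraction; the name inner6x1, its signature, and the choice of double for data_t are assumptions made for illustration only.

```c
#include <stddef.h>

/* Assumption: element type. The measurements in the question also cover integer data. */
typedef double data_t;

/* Hypothetical name and signature: inner product with 6 x 1 loop unrolling. */
void inner6x1(const data_t *udata, const data_t *vdata, size_t n, data_t *dest)
{
    size_t i = 0;
    data_t sum = (data_t) 0;

    /* Main loop: six elements per iteration, one accumulator.
       The expression is left-associative, so every addition still
       waits for the previous one. */
    for (; i + 5 < n; i += 6) {
        sum = sum + udata[i]     * vdata[i]
                  + udata[i + 1] * vdata[i + 1]
                  + udata[i + 2] * vdata[i + 2]
                  + udata[i + 3] * vdata[i + 3]
                  + udata[i + 4] * vdata[i + 4]
                  + udata[i + 5] * vdata[i + 5];
    }

    /* Finish any remaining elements one at a time. */
    for (; i < n; i++)
        sum = sum + udata[i] * vdata[i];

    *dest = sum;
}
```

Unrolling this way removes loop overhead (fewer index updates and branches per element), which is why the integer version improves. The chain of additions into sum is unchanged, so the floating-point versions do not.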

Thus,

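(A) Each element requires two loads, one from udata and one from vdata. The processor has two load units, each of which can start one load per clock cycle, so at most one element can be processed per cycle. No scalar version can therefore achieve a CPE below 1.00.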

Thus,

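(B) Loop unrolling does not help the floating-point versions because the additions into sum still form one sequential chain of n operations. With a floating-point addition latency of 3 cycles, the CPE stays at about 3.00, even though each multiplication takes 4 or 5 clock cycles, because the multiplications are not on the critical path.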

by User Dtrunk (4.8k points)