(Lecture Notes in Computer Science 10793) Mladen Berekovic, Rainer Buchty, Heiko Hamann, Dirk Koch, Thilo Pionteck - Architecture of Computing Systems – ARCS

4 Evaluation
The suggested approach was modeled in Gem5 [6]. Based on Butko et al. [7],
the fast and slow cores were configured to match the ARM Cortex-A15 and
ARM Cortex-A7, respectively. The per-core 32 kB L1 caches are split into data
and instruction caches, while the per-core 512 kB L2 caches are unified. Power

162
R. Amslinger et al.
consumption was approximated as the product of the simulated benchmark runtime
and the average power consumption of an Exynos 5430 SoC. It was assumed
that a lockstep system runs as fast as a corresponding single-core machine but
consumes twice the energy. For our approach, a limit was put in place to prevent
the leading core from running too far ahead. The leading core stalls when it has
committed 1,000 instructions more than the trailing core, or if it tries to evict a
modified cache line that the trailing core has not written yet.
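
The two stall conditions described above can be sketched as a small predicate. This is only an illustration under stated assumptions: the `PairState` class, its field names, and the set-based tracking of unwritten lines are hypothetical and do not reflect how the Gem5 model is actually implemented. Only the limit of 1,000 instructions comes from the text.

```python
from dataclasses import dataclass, field

MAX_LEAD = 1_000  # instruction limit stated in the text


@dataclass
class PairState:
    """Hypothetical bookkeeping for one leading/trailing core pair."""
    leading_committed: int = 0
    trailing_committed: int = 0
    # Addresses of modified lines in the leading core's cache that the
    # trailing core has not written yet (assumed representation).
    unconfirmed_lines: set = field(default_factory=set)


def leading_must_stall(state: PairState, evict_addr=None) -> bool:
    """Return True if the leading core has to wait for the trailing core."""
    lead = state.leading_committed - state.trailing_committed
    if lead >= MAX_LEAD:
        return True  # leading core is too far ahead
    # Evicting a modified line the trailing core has not written yet.
    return evict_addr is not None and evict_addr in state.unconfirmed_lines
```

Passing `evict_addr=None` models a cycle in which the leading core does not attempt an eviction, so only the instruction distance is checked.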
We implemented several sequential microbenchmarks with different memory
access patterns. The following microbenchmarks were used to assess the
performance of the approach:
– The breadth-first benchmark calculates the size of a tree by traversing it in
breadth-first order. Each node has a random number of children.
– The heapsort benchmark sorts an array using heapsort. The array is initialized
with random positive integers.
– The matrixmul benchmark calculates the product of two dense matrices.
– The quicksort benchmark sorts an array using quicksort. The array is
initialized with random positive integers.
– The red-black tree benchmark inserts random entries into a red-black tree.
– The shuffle benchmark shuffles an array using the Fisher-Yates shuffle.
The seed of the random number generator was constant for all runs.
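
To make two of the listed access patterns concrete, the following sketch shows a Fisher-Yates shuffle and a breadth-first tree-size count driven by a fixed seed. It is not the authors' benchmark code: the seed value, the nested-list tree representation, and the `max_children` bound are assumptions chosen for illustration.

```python
import random
from collections import deque

SEED = 42  # hypothetical value; the text only says the seed was constant


def fisher_yates_shuffle(items, rng):
    """Shuffle items in place with the Fisher-Yates algorithm."""
    for i in range(len(items) - 1, 0, -1):
        j = rng.randint(0, i)  # inclusive on both ends
        items[i], items[j] = items[j], items[i]
    return items


def build_random_tree(rng, depth, max_children=4):
    """Build a tree as nested lists; each node is the list of its children."""
    if depth == 0:
        return []
    return [build_random_tree(rng, depth - 1, max_children)
            for _ in range(rng.randint(0, max_children))]


def tree_size_bfs(root):
    """Count the nodes of a tree by visiting them in breadth-first order."""
    count = 0
    queue = deque([root])
    while queue:
        node = queue.popleft()
        count += 1
        queue.extend(node)  # enqueue all children of this node
    return count
```

Creating the generator as `random.Random(SEED)` before each run reproduces the same shuffle order and the same tree shape, which mirrors the constant seed mentioned above.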
Figure 3 shows the microbenchmarks' throughput on the y-axis and the
corresponding energy consumption per run on the x-axis. The microbenchmarks
were executed on a lockstep system consisting of two Cortex-A7 cores, another
lockstep system consisting of two Cortex-A15 cores, and our approach, using a
Cortex-A15 as leading core and a Cortex-A7 as trailing core. The cores' clock
frequencies were varied across their respective ranges (Cortex-A7: 500–1300 MHz,
Cortex-A15: 700–1900 MHz) in 100 MHz steps. For our approach, the trailing
core's frequency was fixed at 1300 MHz, while the leading core's frequency was
varied from 700 MHz to 1900 MHz.
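
The frequency sweep described above can be enumerated as follows. This is a sketch for counting the operating points only. The tuple layout and function names are hypothetical, and it assumes that both cores of a lockstep pair run at the same frequency.

```python
def freq_range(lo_mhz, hi_mhz, step_mhz=100):
    """Inclusive list of frequencies in MHz."""
    return list(range(lo_mhz, hi_mhz + 1, step_mhz))


A7_FREQS = freq_range(500, 1300)    # Cortex-A7 range from the text
A15_FREQS = freq_range(700, 1900)   # Cortex-A15 range from the text
TRAILING_FIXED_MHZ = 1300           # trailing Cortex-A7 in our approach


def sweep_points():
    """Return (system, first core MHz, second core MHz) for every point."""
    points = []
    for f in A7_FREQS:
        points.append(("2x Cortex-A7", f, f))      # assumed equal clocks
    for f in A15_FREQS:
        points.append(("2x Cortex-A15", f, f))     # assumed equal clocks
    for f in A15_FREQS:
        # Leading Cortex-A15 varies; trailing Cortex-A7 stays fixed.
        points.append(("Cortex-A7 + Cortex-A15", f, TRAILING_FIXED_MHZ))
    return points
```

Under these assumptions the sweep yields 9 points for the Cortex-A7 lockstep system and 13 points each for the Cortex-A15 lockstep system and the heterogeneous pair.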
All systems were evaluated with and without a hardware prefetcher. The
Cortex-A7 utilizes early issue of memory operations [3] in all variants. A stride
prefetcher [4], which tries to detect regular access patterns on a per-instruction
basis, was used in the corresponding variants. If an instruction accesses memory
locations with a constant distance, the prefetcher will predict the next addresses
and preload them into the L1 cache. As the stride prefetcher works on physical
addresses, a detected stream will be terminated at the page boundary. For this
evaluation the prefetcher was configured to observe 32 distinct instructions on a
least recently used basis and prefetch up to 4 cache lines ahead of the last actual
access.
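
The prefetcher behavior described above can be illustrated with a simplified software model. This is a sketch, not Gem5's prefetcher: the class and method names are hypothetical, the 4 KiB page size is an assumption, and the model prefetches multiples of the detected stride rather than tracking cache lines. The table size of 32 and the prefetch depth of 4 are taken from the text.

```python
from collections import OrderedDict

PAGE_SIZE = 4096    # bytes; assumed, not stated in the text
TABLE_ENTRIES = 32  # tracked instructions, from the text
DEGREE = 4          # prefetches issued ahead of the last access, from the text


class StridePrefetcher:
    def __init__(self):
        # Maps instruction address (PC) -> (last address, last stride),
        # ordered from least to most recently used.
        self.table = OrderedDict()

    def access(self, pc, addr):
        """Record one memory access and return the addresses to prefetch."""
        prefetches = []
        if pc in self.table:
            last_addr, last_stride = self.table.pop(pc)
            stride = addr - last_addr
            # Predict only after the same non-zero stride was seen twice.
            if stride != 0 and stride == last_stride:
                page = addr // PAGE_SIZE
                for k in range(1, DEGREE + 1):
                    target = addr + k * stride
                    if target // PAGE_SIZE != page:
                        break  # stream ends at the page boundary
                    prefetches.append(target)
        else:
            stride = 0
            if len(self.table) >= TABLE_ENTRIES:
                self.table.popitem(last=False)  # evict least recently used
        self.table[pc] = (addr, stride)  # re-insert as most recently used
        return prefetches
```

In this model an instruction that reads addresses 0, 64, and 128 triggers prefetches on its third access, while a stream approaching the end of a page issues fewer than four prefetches.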
At first, only a small increase in voltage per 100 MHz step is required for
the core to run stably. Thus, for the lockstep systems, a large increase in
throughput can be achieved at low frequencies with only a small increase in power
consumption. Note that the frequency itself has only a minor influence on the results,
as power consumption is measured per benchmark run and not per time unit.
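
The per-run metric used here follows from the model stated at the start of this section: energy is the product of runtime and average power, and a lockstep system is assumed to consume twice the energy of a single core. A minimal sketch of that arithmetic is given below. The function names and the example numbers are hypothetical, and the calculation for the heterogeneous pair is not covered because the text above does not spell it out.

```python
def energy_per_run_mj(runtime_s, avg_power_w, lockstep=False):
    """Energy in mJ as runtime times average power; doubled for lockstep."""
    energy_j = runtime_s * avg_power_w
    if lockstep:
        energy_j *= 2  # the text assumes lockstep consumes twice the energy
    return energy_j * 1000.0


def throughput_per_s(runtime_s):
    """Benchmark runs completed per second."""
    return 1.0 / runtime_s
```

For example, a run of 0.5 s at an average of 2.0 W corresponds to 1000 mJ on a single core and 2000 mJ on a lockstep pair, at a throughput of 2 runs per second in both cases.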

Redundant Execution on Heterogeneous Multi-cores
163
[Figure 3: six panels, one per microbenchmark (breadth-first, heapsort, matrixmul, quicksort, red-black tree, shuffle). x-axis: Energy per run [mJ]; y-axis: Throughput [1/s]. Series: 2x Cortex-A7; 2x Cortex-A7 with prefetcher; 2x Cortex-A15; 2x Cortex-A15 with prefetcher; Cortex-A7 + Cortex-A15; Cortex-A7 + Cortex-A15 with prefetcher.]
