Table 2. % of accesses to upper-level memory and data migrated (in GB). The first six data columns give the percentage of accesses served by the upper-level memory; the last three give the total data migrated in GB.

| Benchmark | Cache-mode (%) | Static FT (%) | Static train (%) | Static ref (%) | Adaptive ref (%) | FTHP 1s (%) | Cache-mode (GB) | FTHP 1s (GB) | Adaptive ref (GB) |
|---|---|---|---|---|---|---|---|---|---|
| 512 KB cache | | | | | | | | | |
| bzip2 | 80.92 | 69.03 | 77.08 | 70.59 | 81.06 | 94.09 | 207.44 | 0.99 | 16.94 |
| gcc | 96.17 | 16.78 | 24.61 | 80.57 | 70.14 | 69.24 | 68.70 | 3.73 | 16.99 |
| mcf | 20.04 | 48.80 | 48.50 | 48.48 | 48.50 | 63.39 | 256.59 | 6.15 | 0.00 |
| milc | 80.97 | 31.58 | 32.54 | 27.16 | 53.80 | 56.98 | 55.36 | 0.50 | 19.75 |
| cactusADM | 79.27 | 46.89 | 46.70 | 46.69 | 46.54 | 44.37 | 57.24 | 0.66 | 0.00 |
| leslie3d | 76.12 | 28.21 | 27.83 | 28.76 | 18.10 | 37.17 | 106.71 | 0.15 | 0.00 |
| gobmk | 95.39 | 27.84 | 13.78 | 14.90 | 14.18 | 37.53 | 13.19 | 0.04 | 0.00 |
| soplex | 59.65 | 15.87 | 46.37 | 52.92 | 61.66 | 51.52 | 137.88 | 0.12 | 0.13 |
| hmmer | 96.04 | 63.31 | 34.41 | 75.00 | 75.79 | 71.08 | 59.20 | 0.01 | 0.00 |
| GemsFDTD | 11.05 | 13.46 | 15.56 | 21.55 | 12.93 | 17.70 | 79.66 | 2.02 | 63.95 |
| libquantum | 99.53 | 11.55 | 11.54 | 11.54 | 11.54 | 13.53 | 1.04 | 0.02 | 0.00 |
| h264ref | 95.63 | 72.13 | 85.55 | 86.84 | 88.21 | 77.83 | 32.04 | 0.02 | 0.75 |
| lbm | 94.97 | 12.77 | 12.72 | 12.72 | 12.51 | 10.64 | 145.82 | 0.55 | 29.81 |
| sphinx3 | 62.65 | 45.66 | 59.33 | 61.83 | 57.25 | 69.18 | 29.74 | 0.01 | 0.17 |
| Average | 74.88 | 35.99 | 38.32 | 45.68 | 46.59 | 51.02 | 89.33 | 1.07 | 10.61 |
| 8 MB cache | | | | | | | | | |
| mcf | 17.77 | 24.78 | 24.88 | 25.22 | 24.88 | 43.24 | 137.19 | 5.60 | 0.00 |
| milc | 57.89 | 15.48 | 16.09 | 21.58 | 32.42 | 29.44 | 43.87 | 0.58 | 20.92 |
| cactusADM | 30.58 | 29.74 | 29.74 | 29.69 | 32.66 | 27.31 | 13.69 | 0.52 | 0.00 |
| leslie3d | 43.66 | 20.81 | 20.96 | 20.75 | 20.04 | 14.74 | 52.86 | 0.10 | 0.00 |
| soplex | 43.93 | 30.85 | 19.64 | 30.39 | 35.40 | 30.16 | 80.07 | 0.11 | 2.17 |
| GemsFDTD | 4.72 | 14.40 | 15.00 | 14.99 | 11.87 | 9.60 | 62.14 | 1.38 | 61.97 |
| libquantum | 99.97 | 12.50 | 0.00 | 12.49 | 12.49 | 11.33 | 0.03 | 0.01 | 0.00 |
| lbm | 84.00 | 12.71 | 12.79 | 12.79 | 12.47 | 11.73 | 50.37 | 0.10 | 30.01 |
| Average | 47.82 | 20.16 | 17.39 | 20.99 | 22.78 | 22.19 | 55.03 | 1.05 | 14.38 |
Referring again to Fig. 6, a third bar shows the results for FTHP relative to DDR3-only, on an identical HBM-DDR3 platform with 12.5% HBM capacity. The comparison shows that, even though they do not have the benefit of dynamic feedback from specialized hardware, the application guidance policies often achieve performance similar to FTHP. With the 512 KB cache, static-ref and adaptive-ref outperform FTHP by 2.8% and 2.9%, respectively, while with the 8 MB cache, static-ref performs slightly (1.9%) worse and adaptive-ref performs 5.3% better. Even static-train (shown in Fig. 4) performs slightly (1.5%) better than FTHP with the larger cache, but exhibits some slowdown (9.6%) with the smaller cache.
Both adaptive-ref and FTHP limit the frequency of data migration to amortize the cost of page faults and TLB synchronization. The final three columns of Table 2 show the amount of data migrated (in GB) for each adaptive policy.
Note also that the amount of migration for both FTHP and adaptive-ref depends on the length of each epoch/phase. For instance, compared to the adaptive-ref configuration in the table (with k = 8, l = 100M), adaptive-ref with l = 10M migrates almost 2.4x more data over the course of each run, on average. Considering these results together with the performance results in Fig. 6 and the HBM traffic comparison in Table 2, we conclude that, although more frequent migration can steer a higher portion of traffic to the HBM, the additional costs often outweigh the performance benefits for our selected benchmarks.
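To make the role of the epoch parameters concrete, the sketch below shows one plausible form of epoch-based migration throttling. It is only an assumption-laden illustration, not the actual adaptive-ref or FTHP implementation: we assume l is the epoch length in retired instructions and k caps the number of pages promoted per epoch, hotness is approximated by per-epoch access counts, and the class and method names are hypothetical.

```python
# Hypothetical sketch of epoch-based migration throttling (not the paper's
# actual policy). Assumptions: l = epoch length in instructions, k = maximum
# pages promoted per epoch, hotness = per-epoch access count.
from collections import Counter

class EpochMigrator:
    def __init__(self, k=8, l=100_000_000, upper_capacity_pages=1024):
        self.k = k                      # cap on promotions per epoch
        self.l = l                      # epoch length (retired instructions)
        self.capacity = upper_capacity_pages
        self.in_upper = set()           # pages currently in the fast tier
        self.access_counts = Counter()  # per-epoch page access counts
        self.instructions = 0
        self.migrated_pages = 0         # total promotions (proxy for GB moved)

    def on_access(self, page):
        """Record one memory access; close the epoch after l instructions."""
        self.access_counts[page] += 1
        self.instructions += 1          # crude proxy: one instruction per access
        if self.instructions >= self.l:
            self._end_epoch()

    def _end_epoch(self):
        # Promote at most k of the hottest pages not already in the upper tier.
        hot = [p for p, _ in self.access_counts.most_common()
               if p not in self.in_upper][:self.k]
        for page in hot:
            if len(self.in_upper) >= self.capacity:
                self._evict_coldest()
            self.in_upper.add(page)     # stand-in for an actual page migration
            self.migrated_pages += 1
        self.access_counts.clear()
        self.instructions = 0

    def _evict_coldest(self):
        # Evict the resident page with the fewest accesses in the current epoch.
        coldest = min(self.in_upper, key=lambda p: self.access_counts[p])
        self.in_upper.discard(coldest)
```

Under this structure, shrinking l (for example from 100M to 10M instructions) creates roughly ten times as many epoch boundaries, and therefore many more opportunities to migrate, which is consistent with the larger migration volumes reported for l = 10M, while each migration still incurs the page-fault and TLB-synchronization overhead that the throttling is meant to amortize.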
6.5 Performance Summary

Table 3. Performance (IPC) summary of different allocation strategies.

| Policy | 512 KB cache, HBM-DDR3 | 512 KB cache, HBM-DDR4 | 8 MB cache, HBM-DDR3 | 8 MB cache, HBM-DDR4 |
|---|---|---|---|---|
| Cache-mode | 1.173 | 1.173 | 0.833 | 0.907 |
| Static-FT | 1.223 | 1.165 | 1.094 | 1.045 |
| Static-train | 1.269 | 1.226 | 1.136 | 1.115 |
| Static-ref | 1.393 | 1.325 | 1.102 | 1.113 |
| Adaptive-ref | 1.393 | 1.323 | 1.173 | 1.099 |
| FTHP | 1.347 | 1.272 | 1.107 | 1.084 |
| HBM-only | 1.838 | 1.568 | 1.466 | 1.255 |
Table 3 presents the average IPC of all of the management policies for both HBM-DDR3 and HBM-DDR4 platforms with 12.5% capacity in the HBM tier. Each set of results uses the corresponding DDR3/4-only configuration as its baseline. As expected, the policies on the HBM-DDR4 platform exhibit performance trends similar to those on the HBM-DDR3 platform. On average, the application-guided policies achieve the best performance on HBM-DDR4, boosting performance with the small and large caches by more than 15% and 20% compared to cache mode, by 16% and 7% compared to static FT, and by 5% and 3% compared to FTHP.
7 Conclusions and Future Work

This work demonstrates that emerging hybrid memory systems will not be able to rely solely on conventional hardware-based caching or coarse-grained software approaches, such as static NUMA assignments, and stand to benefit greatly from fine-grained, application-level guidance. The results point to a need for new source, binary, and run-time capabilities to make application-guided memory tiering practical. While the current evaluation uses simulation, our goal is to adapt our automated guidance framework for direct execution on real hybrid memory hardware. The immediate next steps include the development of hardware-based sampling to profile memory accesses during native execution and of low-overhead context detection techniques, as described in Sect. 4.
Other findings in this study warrant additional investigation. In many cases, we found that tailoring application guidance to each program phase has a relatively small impact on performance. Further research is necessary to understand the relationship between program phases and memory behavior, and whether this result is specific to our selected benchmarks and experimental configuration, or if it reflects a more fundamental property of hybrid memory systems.
While investigating these issues, we also plan to explore the feasibility of using pure static analysis, without program profiling, to guide hybrid memory management. Finally, we plan to evaluate the potential of extending guidance to other parts of the memory hierarchy, such as caching or prefetching.
Acknowledgements. This research is supported in part by the National Science Foundation under CCF-1619140, CCF-1617954, and CNS-1464288, as well as a grant from the Software and Services Group (SSG) at Intel®.
Operational Characterization of Weak Memory Consistency Models

M. Senftleben and K. Schneider

TU Kaiserslautern, 67653 Kaiserslautern, Germany
{senftleben,schneider}@cs.uni-kl.de
Abstract. To improve their overall performance, all current multicore and multiprocessor systems are based on memory architectures that allow behaviors that do not exist in interleaved (sequential) memory systems. The possible behaviors of such systems can be described by so-called weak memory consistency models. Several of these models have been introduced so far, and different ways to specify them have been considered, such as axiomatic or view-based formalizations, each with its particular advantages and disadvantages. In this paper, we propose the use of operational/architectural models to describe the semantics of weak memory consistency models in an operational, i.e., executable way. The operational semantics allow a more intuitive understanding of the possible behaviors and clearly point out the differences between these models. Furthermore, they can be used for simulation, formal verification, and even to automatically synthesize such memory systems.
Keywords: Memory models · Weak memory consistency · Processor architecture · Memory architecture
1 Introduction

Historically, computer architectures were considered to consist of a single processor connected to a single memory via a bus (von Neumann architecture, 1945). The sequentialization of the read and write operations over the single bus ensured that each read operation returns the value most recently written to the corresponding memory location, and that a most recently written value can be defined at all. Even if the processor of such a computer architecture were used to execute multiple processes by interleaving their executions, the memory operations would still take place one after the other and would therefore form a sequence in which all memory operations are totally ordered.
Nowadays, essentially all computer architectures consist of multicore processors or even multiple processors which share a common main memory. Early multiprocessor systems still connected multiple processors to the shared memory via a single bus. Processors therefore had to compete for bus access, which still enforced an ordering of the memory operations in a linear sequence. Modern multiprocessor systems, however, are based on much more complex memory
architectures that not only make use of caches with cache coherence protocols, but also add further local memories to improve their performance. In particular, the use of local store buffers between the processor cores and the caches allows a significantly faster execution: using store buffers, processors simply ‘execute’ store operations by putting a pair consisting of an address and a value to be stored at that address into a FIFO buffer. The processor can then continue with the execution of its next instruction and may consult its own store buffer when a later load operation is executed. The store buffer performs its pending stores as soon as it is given access to the main memory. This avoids idle times spent waiting for bus access on each store operation and allows a faster execution in general. However, since processors cannot see the store buffers of other processors, they temporarily have different views of the shared memory. Note that after the store buffers have finally been emptied, the cache coherence protocol ensures a coherent view of the shared memory; but before that point in time, the different views that exist due to the contents of the local store buffers allow executions that are otherwise impossible. For this reason, one speaks of weakly consistent memory models, which do not impose constraints as strong as those of the traditional sequential memory models that simply interleaved the memory operations of different processors.
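As a concrete illustration of this effect, the following minimal Python model sketches processors with private FIFO store buffers. It is an illustrative sketch rather than this paper's formalization, and the class and method names are our own. Running the classic store-buffering litmus test on it produces an outcome that no interleaving over a single shared memory could produce.

```python
# Minimal model of processors with private FIFO store buffers
# (illustrative sketch, not the paper's operational semantics).

class Processor:
    def __init__(self, memory):
        self.memory = memory
        self.buffer = []                    # FIFO of pending (address, value) stores

    def store(self, addr, value):
        self.buffer.append((addr, value))   # store is buffered, not yet globally visible

    def load(self, addr):
        # Store forwarding: check own buffer (youngest entry first), then memory.
        for a, v in reversed(self.buffer):
            if a == addr:
                return v
        return self.memory.get(addr, 0)

    def drain(self):
        # The buffer gets access to memory: commit pending stores in FIFO order.
        while self.buffer:
            a, v = self.buffer.pop(0)
            self.memory[a] = v

# Store-buffering litmus test:
#   P0: x = 1; r0 = y          P1: y = 1; r1 = x
# Under sequential consistency at least one of r0, r1 must be 1.
memory = {"x": 0, "y": 0}
p0, p1 = Processor(memory), Processor(memory)

p0.store("x", 1)
p1.store("y", 1)
r0 = p0.load("y")   # reads 0: P1's store is still in P1's private buffer
r1 = p1.load("x")   # reads 0: P0's store is still in P0's private buffer
p0.drain(); p1.drain()

print(r0, r1)       # prints 0 0, an outcome impossible in an interleaved memory system
```

Only after both buffers drain does the shared memory hold x = 1 and y = 1; until then the two processors observe different states, which is exactly the kind of weak behavior that interleaved (sequentially consistent) models rule out.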
Store buffers are one reason, but not the only one, for the introduction of weak memory consistency models [1,12,15,27]. For example, in distributed computer systems, the single memory is replaced by multiple distributed memories which can be private to single processors or shared with all or some other processors. Depending on the implemented memory architecture, very different weak memory models have been developed over the past decades, and some of them may lead to behaviors that are quite unexpected by programmers. It is therefore very important that the designers of modern computer systems are able to describe the potential memory behaviors of their systems in a precise yet comprehensive way, so that programmers are able to determine when memory synchronization is required in their programs.
Memory consistency models have been defined in different ways: the first descriptions of weak memory models were given only in natural language and were therefore often ambiguous. In fact, such ambiguous descriptions led to non-equivalent versions of the processor consistency model [3,10].
Another way to define a memory consistency model is the so-called view-based approach, where the different views processors may have during the execution of a multithreaded program are formally specified. From the viewpoint of a