“Release Phase” Timing Observations. Regarding the “release phase”, the delays measured on our 16-core platform are presented in Fig. 3a. In order to obtain relevant, reproducible behavior by limiting operating system artifacts, we loop 400 times over the for loop of Listing 1.1.
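For illustration, a minimal sketch of the kind of micro-benchmark described above is given below, assuming a short, well-balanced OpenMP parallel for loop executed by 16 threads and repeated 400 times; the constants and names (N_ITER, N_THREADS, work) are illustrative and are not taken from Listing 1.1 itself.

```c
#include <stdio.h>

#define N_ITER    400   /* repetitions of the measured loop                  */
#define N_THREADS 16    /* one thread per core of the 16-core platform       */

int main(void)
{
    static int work[N_THREADS];

    for (int rep = 0; rep < N_ITER; rep++) {
        /* Well-balanced parallel loop: the implicit barrier at its end
           exercises the GNU OpenMP barrier release phase on every repetition. */
        #pragma omp parallel for num_threads(N_THREADS) schedule(static)
        for (int i = 0; i < N_THREADS; i++)
            work[i] += i;
    }

    printf("work[%d] = %d\n", N_THREADS - 1, work[N_THREADS - 1]);
    return 0;
}
```

Compiled with GCC and -fopenmp, each repetition ends with a barrier whose release phase is the object of the measurements discussed below.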
Fig. 3a shows the release phase delays per thread. The Y-axis represents the number of cycles between the moment the last thread becomes aware that it is the last one and has to start the release process, and the instant a thread leaves the barrier to resume its nominal execution flow. The X-axis represents the threads in order of release. For example, the first column represents the first thread that resumes its execution, regardless of its thread ID. The figure shows box plots in which the red line is the median. The blue box contains 50% of the values. The minimum value is represented by a dot, and the maximum value by an ‘X’.
We can see in the figure that the box plots are tightly grouped around the median; however, the maximum values can lie far above the core group of values.
This is due to contention issues and to interrupt management subroutines triggered by the operating system independently of our application, which delay the nominal execution flow of the threads. Values delayed by these artifacts should therefore not be taken into account in our analysis, since they are not caused by the synchronization mechanism itself. Thus we focus our study on the median value and the box plot of each thread.
We note in this figure that the complete process, until all threads have resumed their execution, lasts 13194 cycles, and that one thread is especially delayed compared to the others. Hence we decide to analyze more accurately the sources of this large delay with our non-intrusive measurement tool chain.
Fig. 3. Delays between barrier completion awareness and thread release for 16 threads bound to 16 cores, for 400 barrier calls, without (a) and with (b) optimization
Library Study. Analyzing more precisely the execution flow inside the “release phase” using a time-annotated function call stack generated by our tool chain, we notice that the GNU OpenMP library calls the “threads wake-up function” every time during the release phase, whether or not threads are actually sleeping. Indeed, the dual active/passive wait policy implies that, depending on the thread waiting time, some threads can be performing active wait while others are sleeping. In this case, threads performing active wait should be released by switching the state of the waiting flag, whereas sleeping threads should be awakened by calling the wake-up function. However, if the software is well balanced, which is the case most of the time with the OpenMP workload split policy, threads do not sleep but only perform active wait because of their short waiting durations. The program shown in Listing 1.1 simulates this case of relatively well balanced threads.
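To make this dual-wait scheme concrete, the following simplified sketch shows how such a barrier release path can be structured; it is not the actual libgomp code, and the names (release_flag, SPIN_COUNT, barrier_wait_slow, barrier_release) are assumptions chosen for illustration.

```c
/* Simplified sketch of a dual active/passive wait barrier (illustrative only,
 * not the actual libgomp implementation; all names are assumptions). */
#include <limits.h>
#include <stdatomic.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

#define SPIN_COUNT 100000L              /* active-wait budget before sleeping */

static atomic_int release_flag;         /* generation flag, bumped by the last
                                           thread to release the others       */

/* Waiting side: spin first (active wait), then fall back to a futex sleep. */
static void barrier_wait_slow(int expected_gen)
{
    for (long i = 0; i < SPIN_COUNT; i++)
        if (atomic_load(&release_flag) != expected_gen)
            return;                     /* released while actively waiting    */

    /* Passive wait: sleep until the releasing thread wakes us up. */
    while (atomic_load(&release_flag) == expected_gen)
        syscall(SYS_futex, &release_flag, FUTEX_WAIT,
                expected_gen, NULL, NULL, 0);
}

/* Release side (last thread): bump the flag, then wake possible sleepers.
 * The wake-up call is issued unconditionally, even when every waiter is
 * still spinning and nobody is actually asleep. */
static void barrier_release(void)
{
    atomic_fetch_add(&release_flag, 1);           /* releases the spinners      */
    syscall(SYS_futex, &release_flag, FUTEX_WAKE, /* wakes the sleepers; costly */
            INT_MAX, NULL, NULL, 0);              /* even if there are none     */
}
```

In such a structure the releasing thread has no cheap way of knowing whether any waiter actually reached the futex sleep, which explains why the wake-up function may be called even when all threads are merely spinning.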
We found that the time spent in the wake-up function is about 12891 cycles even though no thread has to be awakened, that is to say about 97.7% of the whole release process for 16 threads. This observation leads us to consider a workaround to speed up the release phase in the case of a fully active wait policy.
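One possible shape of such a workaround, building on the sketch above, is to count the threads that actually enter passive wait and to skip the wake-up system call when that count is zero; again, this is an illustrative sketch and not the modification actually applied to the library.

```c
/* Illustrative workaround: track how many threads really went to sleep and
 * issue the wake-up system call only when at least one of them did.
 * Reuses release_flag, SPIN_COUNT and the futex calls from the sketch above. */
static atomic_int sleepers;             /* threads currently in passive wait  */

static void barrier_wait_slow_opt(int expected_gen)
{
    for (long i = 0; i < SPIN_COUNT; i++)
        if (atomic_load(&release_flag) != expected_gen)
            return;                     /* released during active wait        */

    atomic_fetch_add(&sleepers, 1);     /* announce passive wait before sleeping */
    while (atomic_load(&release_flag) == expected_gen)
        syscall(SYS_futex, &release_flag, FUTEX_WAIT,
                expected_gen, NULL, NULL, 0);
    atomic_fetch_sub(&sleepers, 1);
}

static void barrier_release_opt(void)
{
    atomic_fetch_add(&release_flag, 1);
    if (atomic_load(&sleepers) > 0)     /* skip the syscall when nobody sleeps */
        syscall(SYS_futex, &release_flag, FUTEX_WAKE, INT_MAX, NULL, NULL, 0);
}
```

With sequentially consistent atomics, either the releasing thread observes sleepers > 0 or the waiter observes the updated release_flag before blocking (and the futex wait itself re-checks the value), so no wake-up is lost in the fully active case targeted here.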