“Release Phase” Timing Observations. Regarding the “release phase”, the delays measured on our 16-core platform are presented in Fig. 3a. In order to obtain relevant, reproducible behavior by limiting operating system artifacts, we loop 400 times over the for loop of Listing 1.1.
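For illustration, a minimal sketch of the kind of micro-benchmark described above is given below, assuming a short, well-balanced OpenMP parallel for loop executed by 16 threads and repeated 400 times; the constants and names (N_ITER, N_THREADS, work) are illustrative and are not taken from Listing 1.1 itself.

```c
#include <stdio.h>

#define N_ITER    400   /* repetitions of the measured loop                  */
#define N_THREADS 16    /* one thread per core of the 16-core platform       */

int main(void)
{
    static int work[N_THREADS];

    for (int rep = 0; rep < N_ITER; rep++) {
        /* Well-balanced parallel loop: the implicit barrier at its end
           exercises the GNU OpenMP barrier release phase on every repetition. */
        #pragma omp parallel for num_threads(N_THREADS) schedule(static)
        for (int i = 0; i < N_THREADS; i++)
            work[i] += i;
    }

    printf("work[%d] = %d\n", N_THREADS - 1, work[N_THREADS - 1]);
    return 0;
}
```

Compiled with GCC and -fopenmp, each repetition ends with a barrier whose release phase is the object of the measurements discussed below.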
Fig. 3a shows the release phase delays per thread. The Y-axis represents the number of cycles between the moment the last thread becomes aware that it is the last one and has to start the release process, and the instant a thread leaves the barrier to resume its nominal execution flow. The X-axis represents the threads in order of release. For example, the first column represents the first thread that resumes its execution, regardless of its thread ID. The figure shows box plots in which the red line is the median. The blue box contains 50% of the values. The minimum value is represented by a dot, and the maximum value by an ‘X’.
We can see in the figure that the box plots are tightly grouped around the median; however, the maximum values can lie far above the core group of values.
This is due to contention issues and to interrupt management subroutines triggered by the operating system independently of our application, which delay the nominal execution flow of the threads. Values delayed by these artifacts should therefore not be taken into account in our analysis, since they are not caused by the synchronization mechanism itself. Thus we focus our study on the median value and the box plot of each thread.
We note in this figure that the complete process, until all threads have resumed their execution, lasts 13194 cycles, and that one thread is especially delayed compared to the others. Hence we decide to analyze more accurately the sources of this large delay with our non-intrusive measurement tool chain.
Fig. 3. Delays between barrier completion awareness and thread release for 16 threads bound to 16 cores, for 400 barrier calls, without (a) and with (b) optimization
Library Study. Analyzing more precisely the execution flow inside the “release phase” using a time-annotated function call stack generated by our tool chain, we notice that the GNU OpenMP library calls the “threads wake-up function” every time during the release phase, whether or not threads are actually sleeping. Indeed, the dual active/passive wait policy implies that, depending on the thread waiting time, some threads can be performing active wait while others are sleeping. In this case, threads performing active wait should be released by switching the state of the waiting flag, whereas sleeping threads should be awakened by calling the wake-up function. However, if the software is well balanced, which is the case most of the time with the OpenMP workload split policy, threads do not sleep but only perform active wait because of their short waiting durations. The program shown in Listing 1.1 simulates this case of relatively well balanced threads.
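To make this dual-wait scheme concrete, the following simplified sketch shows how such a barrier release path can be structured; it is not the actual libgomp code, and the names (release_flag, SPIN_COUNT, barrier_wait_slow, barrier_release) are assumptions chosen for illustration.

```c
/* Simplified sketch of a dual active/passive wait barrier (illustrative only,
 * not the actual libgomp implementation; all names are assumptions). */
#include <limits.h>
#include <stdatomic.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

#define SPIN_COUNT 100000L              /* active-wait budget before sleeping */

static atomic_int release_flag;         /* generation flag, bumped by the last
                                           thread to release the others       */

/* Waiting side: spin first (active wait), then fall back to a futex sleep. */
static void barrier_wait_slow(int expected_gen)
{
    for (long i = 0; i < SPIN_COUNT; i++)
        if (atomic_load(&release_flag) != expected_gen)
            return;                     /* released while actively waiting    */

    /* Passive wait: sleep until the releasing thread wakes us up. */
    while (atomic_load(&release_flag) == expected_gen)
        syscall(SYS_futex, &release_flag, FUTEX_WAIT,
                expected_gen, NULL, NULL, 0);
}

/* Release side (last thread): bump the flag, then wake possible sleepers.
 * The wake-up call is issued unconditionally, even when every waiter is
 * still spinning and nobody is actually asleep. */
static void barrier_release(void)
{
    atomic_fetch_add(&release_flag, 1);           /* releases the spinners      */
    syscall(SYS_futex, &release_flag, FUTEX_WAKE, /* wakes the sleepers; costly */
            INT_MAX, NULL, NULL, 0);              /* even if there are none     */
}
```

In such a structure the releasing thread has no cheap way of knowing whether any waiter actually reached the futex sleep, which explains why the wake-up function may be called even when all threads are merely spinning.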
We found that the time spent in the wake-up function is about 12891 cycles even though no thread has to be awakened, that is to say about 97.7% of the whole release process for 16 threads. This observation leads us to consider a workaround to speed up the release phase in the case of a fully active wait policy.
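One possible shape of such a workaround, building on the sketch above, is to count the threads that actually enter passive wait and to skip the wake-up system call when that count is zero; again, this is an illustrative sketch and not the modification actually applied to the library.

```c
/* Illustrative workaround: track how many threads really went to sleep and
 * issue the wake-up system call only when at least one of them did.
 * Reuses release_flag, SPIN_COUNT and the futex calls from the sketch above. */
static atomic_int sleepers;             /* threads currently in passive wait  */

static void barrier_wait_slow_opt(int expected_gen)
{
    for (long i = 0; i < SPIN_COUNT; i++)
        if (atomic_load(&release_flag) != expected_gen)
            return;                     /* released during active wait        */

    atomic_fetch_add(&sleepers, 1);     /* announce passive wait before sleeping */
    while (atomic_load(&release_flag) == expected_gen)
        syscall(SYS_futex, &release_flag, FUTEX_WAIT,
                expected_gen, NULL, NULL, 0);
    atomic_fetch_sub(&sleepers, 1);
}

static void barrier_release_opt(void)
{
    atomic_fetch_add(&release_flag, 1);
    if (atomic_load(&sleepers) > 0)     /* skip the syscall when nobody sleeps */
        syscall(SYS_futex, &release_flag, FUTEX_WAKE, INT_MAX, NULL, NULL, 0);
}
```

With sequentially consistent atomics, either the releasing thread observes sleepers > 0 or the waiter observes the updated release_flag before blocking (and the futex wait itself re-checks the value), so no wake-up is lost in the fully active case targeted here.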