(Lecture Notes in Computer Science 10793) Mladen Berekovic, Rainer Buchty, Heiko Hamann, Dirk Koch, Thilo Pionteck - Architecture of Computing Systems – ARCS
Fig. 1. Node C fills the receive buffer of node B, which is currently waiting for flits from node A. Thus, node B is busy processing the flits received from C before it can answer the request from node A that it was originally waiting for. Boxes represent local computation times and arrows the delivery of flits.
The problem is illustrated in Fig. 1: three nodes A, B, and C run a parallel application. Each of them performs some local computation (boxes), followed by communication (flits, represented as arrows). The computation of node A takes slightly longer than that of nodes B and C. Meanwhile, node C finishes its local computation and sends several flits to node B. Node A then sends a request to node B, but node B is busy processing the flits sent by node C. If the receive buffer of node B is full, the request from node A cannot even be stored there. Thus, A has to wait until C is finished before it can send its request to B again.
To avoid buffer overflows, we propose adding a synchronization mechanism: each node planning to send data has to wait for a flit from its intended receiver indicating that the receiver is ready to handle incoming flits. A receiver node sends this flit when it reaches its receive operation, ensuring that it is fully capable of processing incoming flits. When implementing this synchronization in software, a node may still receive many synchronization flits (at most one from every other node). Therefore, we suggest realizing it with hardware support. Our hardware implementation stores the synchronization information and makes it available to the processing element on request. In doing so, we focus on minimal hardware and synchronization overhead.
Altogether, the contribution of this paper is a cheap and simple hardware synchronization mechanism that can easily be controlled in software, increasing performance while decreasing receive buffer size and hardware costs. Our approach is independent of router design and network topology.
The remainder of this paper is structured as follows: In the next section, we present related work and background. Afterwards, our synchronization concept is explained in Sect. 3, followed by the description of the hardware implementation in Sect. 4, which is subsequently evaluated in Sect. 5. Finally, the paper is concluded in Sect. 6.