Related: (Lecture Notes in Computer Science 10793) Mladen Berekovic, Rainer Buchty, Heiko Hamann, Dirk Koch, Thilo Pionteck - Architecture of Computing Systems – ARCS
2 Related Work

Reinhardt and Mukherjee [15] propose to use the simultaneous multithreading
capabilities of modern processors for error detection. The program is executed
twice on the same core. The two executions are shifted in time and can use
different execution units for the same instruction. The authors propose
maintaining a constant slack between the executions to minimize memory stalls.
AR-SMT [16] is a time-redundant fault-tolerance approach. An SMT processor
executes the same program twice with some delay. The execution state of the
second thread is used to restore the first thread to a safe state if an error
occurs.
The Slipstream Processor [18] is a concept that provides not only fault
tolerance but also higher performance by executing a second, slimmer version
of the program on either the same or another core. The second version of the
program is generated by leaving out instructions which are predicted to be
ineffective. The resulting branch outcomes and prefetches are used to accelerate
the full version of the program. Errors are detected by comparing stores.
LBRA [17] is a loosely-coupled redundant architecture extending the
transaction system LogTM-SE [20]. The old value of every memory location
accessed by the leading thread is stored in a log. For writes, the new value is
immediately stored in memory. The trailing thread uses the log values for input
duplication. Both cores calculate signatures for every transaction. If an error
is detected, the log is traversed backwards to restore the old memory state.
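
The undo-log recovery described for LBRA can be illustrated with a minimal software sketch. This is only a toy model under stated assumptions, not the hardware mechanism of [17]: the class name `UndoLoggedMemory` and its methods are hypothetical, memory is a plain dictionary, and signatures and the trailing thread are left out.

```python
class UndoLoggedMemory:
    """Toy memory with an undo log (hypothetical names, illustration only)."""

    def __init__(self, initial):
        self.mem = dict(initial)  # address -> current value
        self.log = []             # (address, old value), in access order

    def read(self, addr):
        value = self.mem[addr]
        # Log the value the leading thread observed; a trailing thread
        # could replay it as its input instead of reading memory.
        self.log.append((addr, value))
        return value

    def write(self, addr, value):
        # Log the old value first, then update memory immediately.
        self.log.append((addr, self.mem[addr]))
        self.mem[addr] = value

    def rollback(self):
        # Traverse the log backwards so the oldest logged value of each
        # address is the one that remains in memory.
        while self.log:
            addr, old = self.log.pop()
            self.mem[addr] = old


def demo_rollback():
    m = UndoLoggedMemory({0x10: 1, 0x20: 2})
    m.write(0x10, 5)
    m.read(0x20)
    m.write(0x10, 7)
    m.write(0x20, 9)
    m.rollback()  # pretend an error was detected here
    return m.mem
```

Traversing the log in reverse matters when an address is written more than once: a forward traversal would leave the intermediate value 5 at address 0x10 instead of the original value.
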
FaulTM [19] is a fault-tolerance system utilizing transactional memory. Their
approach differs from ours in that it executes redundant transactions
synchronously. The write-sets and registers of both transactions are compared
simultaneously, with one transaction committing to memory. This prevents one
thread from running ahead of the other and thus suppresses possible
cache-related performance benefits.
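
The compare-then-commit idea behind redundant transactions can be sketched as follows. The sketch is a sequential software analogy, not FaulTM's hardware design: the functions `run_transaction` and `redundant_commit` and the optional `faulty_body` parameter (used here to emulate a fault in one copy) are hypothetical, and the returned result stands in for the register state.

```python
def run_transaction(body, memory):
    """Run body with buffered writes; return (write_set, result)."""
    write_set = {}

    def load(addr):
        # Reads see the transaction's own buffered writes first.
        return write_set.get(addr, memory.get(addr, 0))

    def store(addr, value):
        # Writes are buffered and do not touch memory yet.
        write_set[addr] = value

    result = body(load, store)
    return write_set, result


def redundant_commit(body, memory, faulty_body=None):
    """Execute the transaction twice and commit only if both copies agree."""
    ws_a, regs_a = run_transaction(body, memory)
    ws_b, regs_b = run_transaction(faulty_body or body, memory)
    if ws_a != ws_b or regs_a != regs_b:
        return False          # mismatch: discard both copies, memory untouched
    memory.update(ws_a)       # only one copy commits its write set
    return True


def increment(load, store):
    value = load(0) + 1
    store(0, value)
    return value


def increment_with_bit_flip(load, store):
    value = (load(0) + 1) ^ 0x4   # emulate a single-bit error
    store(0, value)
    return value
```

With `memory = {0: 3}`, `redundant_commit(increment, memory)` succeeds and leaves 4 at address 0, whereas passing `increment_with_bit_flip` as `faulty_body` makes the comparison fail and leaves memory unchanged. Because both write sets must be available before the comparison, neither copy can commit ahead of the other.
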
Haas et al. [9, 10] use the existing Intel HTM system (TSX) for error detection
and recovery. As Intel TSX does not support automatic transaction creation or
write-set comparison, the protected software is instrumented to provide those
features itself. Transactional blocks are executed with an offset in the trailing
process, as TSX does not allow checksum transfer within active transactions. As
the redundant processes use the same virtual addresses but different physical
memory locations, the trailing process gains no speedup from positive cache
effects.