Abstract:
Improving the energy efficiency of high-performance computing (HPC)
systems is currently one of the main drivers of scientific and
technological research. Because large-scale HPC systems require some
fault-tolerance method, the opportunities it offers to reduce energy
consumption should be explored. In particular, rollback-recovery methods
using uncoordinated checkpoints avoid the need for all processes to
re-execute when a failure occurs. In this context, it is possible to take
actions to reduce the energy
consumption of the nodes whose processes do not re-execute. This work
extends a previous one, in which we proposed a series of strategies to
manage energy consumption at failure time. Here, we have enriched our
simulator and the experimentation by including non-blocking
communications (with and without system buffering) and a larger number of
candidate processes to be analyzed. We call the latter cascade analysis,
because it includes processes that become blocked through indirect
communication with the failed process. The simulations show that
the savings were negligible in the worst case, but in some scenarios
significant savings were possible; the maximum saving achieved was
90% in a time interval of 16 minutes. As a result, we show the feasibility
of improving energy efficiency in HPC systems in the presence of a failure