Exploring Energy Saving Opportunities in Fault Tolerant HPC Systems

Morán, Marina; Balladini, Javier; Rexachs, Dolores; Rucci, Enzo

URI: http://rdi.uncoma.edu.ar/handle/uncomaid/18119

Abstract:

Nowadays, improving the energy efficiency of high-performance computing (HPC) systems is one of the main drivers of scientific and technological research. Since large-scale HPC systems require some fault-tolerance method, the opportunities to reduce energy consumption in that setting should be explored. In particular, rollback-recovery methods using uncoordinated checkpoints avoid the need for all processes to re-execute when a failure occurs. In this context, it is possible to take actions to reduce the energy consumption of the nodes whose processes do not re-execute. This work extends a previous one, in which we proposed a series of strategies to manage energy consumption at failure time. Here, we have enriched our simulator and the experimentation by including non-blocking communications (with and without system buffering) and a larger number of candidate processes to be analyzed. We call the latter cascade analysis, because it includes processes that get blocked by communicating indirectly with the failed process. The simulations show that the savings were negligible in the worst case, but in some scenarios it was possible to achieve significant ones; the maximum saving achieved was 90% in a time interval of 16 minutes. As a result, we show the feasibility of improving energy efficiency in HPC systems in the presence of a failure.
