Exploring Energy Saving Opportunities in Fault Tolerant HPC Systems

Morán, Marina; Balladini, Javier; Rexachs, Dolores; Rucci, Enzo

dc.creator	Morán, Marina
dc.creator	Balladini, Javier
dc.creator	Rexachs, Dolores
dc.creator	Rucci, Enzo
dc.date	2024
dc.date.accessioned	2024-09-04T15:39:28Z
dc.date.available	2024-09-04T15:39:28Z
dc.identifier.uri	http://rdi.uncoma.edu.ar/handle/uncomaid/18119
dc.description.abstract	Nowadays, improving the energy efficiency of high-performance computing (HPC) systems is one of the main drivers in scientific and technological research. As large-scale HPC systems require some fault-tolerant method, the opportunities to reduce energy consumption should be explored. In particular, rollback-recovery methods using uncoordinated checkpoints mean that not all processes have to re-execute when a failure occurs. In this context, it is possible to take actions to reduce the energy consumption of the nodes whose processes do not re-execute. This work is an extension of a previous one, in which we proposed a series of strategies to manage energy consumption at failure time. In this work, we have enriched our simulator and the experimentation by including non-blocking communications (with and without system buffering) and a larger number of candidate processes to be analyzed. We call the latter cascade analysis, because it includes processes that get blocked by communicating indirectly with the failed process. The simulations show that the savings were negligible in the worst case, but in some scenarios it was possible to achieve significant ones; the maximum saving achieved was 90% in a time interval of 16 minutes. As a result, we show the feasibility of improving energy efficiency in HPC systems in the presence of a failure.	es_ES
dc.format	application/pdf	es_ES
dc.format.extent	pp. 1-36	es_ES
dc.language	eng	es_ES
dc.publisher	Elsevier	es_ES
dc.relation.uri	https://doi.org/10.1016/j.jpdc.2023.104797	es_ES
dc.rights	Atribución-NoComercial-CompartirIgual 2.5 Argentina	es_ES
dc.rights.uri	https://creativecommons.org/licenses/by-nc-sa/2.5/ar/	es_ES
dc.source	Journal of Parallel and Distributed Computing Volume 185, March 2024	es_ES
dc.subject	Energy saving	es_ES
dc.subject	Fault tolerance methods	es_ES
dc.subject	Checkpoint	es_ES
dc.subject	Parallel applications	es_ES
dc.subject	ACPI	es_ES
dc.subject	DVFS	es_ES
dc.subject.other	Computer and Information Sciences	es_ES
dc.title	Exploring Energy Saving Opportunities in Fault Tolerant HPC Systems	es_ES
dc.type	Articulo	es
dc.type	article	eu
dc.type	acceptedVersion	eu
dc.description.fil	Fil: Morán, Marina. Universidad Nacional del Comahue. Facultad de Informática. Departamento de Ingeniería de Computadoras; Argentina.	es_ES
dc.description.fil	Fil: Balladini, Javier. Universidad Nacional del Comahue. Facultad de Informática. Departamento de Ingeniería de Computadoras; Argentina.	es_ES
dc.description.fil	Fil: Rexachs, Dolores. Universidad Autónoma de Barcelona. Departamento Arquitectura de Computadores y Sistemas Operativos; España.	es_ES
dc.description.fil	Fil: Rucci, Enzo. Universidad Nacional de La Plata. Facultad de Informática; Argentina.	es_ES
dc.cole	Artículos	es_ES