dc.creator |
Morán, Marina |
|
dc.creator |
Balladini, Javier |
|
dc.creator |
Rexachs, Dolores |
|
dc.creator |
Rucci, Enzo |
|
dc.date |
2024 |
|
dc.date.accessioned |
2024-09-04T15:39:28Z |
|
dc.date.available |
2024-09-04T15:39:28Z |
|
dc.identifier.uri |
http://rdi.uncoma.edu.ar/handle/uncomaid/18119 |
|
dc.description.abstract |
Improving the energy efficiency of high-performance computing (HPC)
systems is currently one of the main drivers of scientific and technological
research. Because large-scale HPC systems require some fault-tolerance
method, the opportunities that such methods offer to reduce energy
consumption should be explored. In particular, rollback-recovery methods
using uncoordinated checkpoints avoid the need for all processes to
re-execute when a failure occurs. In this context, it is possible to take
actions to reduce the energy consumption of the nodes whose processes do
not re-execute. This work extends a previous one, in which we proposed a
series of strategies to manage energy consumption at failure time. Here, we
have enriched our simulator and the experimentation by including
non-blocking communications (with and without system buffering) and a
larger number of candidate processes to be analyzed. We call the latter
cascade analysis, because it includes processes that become blocked through
indirect communication with the failed process. The simulations show that
the savings were negligible in the worst case, but that in some scenarios
significant savings were possible; the maximum saving achieved was 90% in a
time interval of 16 minutes. As a result, we show the feasibility of
improving energy efficiency in HPC systems in the presence of a failure. |
es_ES |
dc.format |
application/pdf |
es_ES |
dc.format.extent |
pp. 1-36 |
es_ES |
dc.language |
eng |
es_ES |
dc.publisher |
Elsevier |
es_ES |
dc.relation.uri |
https://doi.org/10.1016/j.jpdc.2023.104797 |
es_ES |
dc.rights |
Attribution-NonCommercial-ShareAlike 2.5 Argentina |
es_ES |
dc.rights.uri |
https://creativecommons.org/licenses/by-nc-sa/2.5/ar/ |
es_ES |
dc.source |
Journal of Parallel and Distributed Computing, Volume 185, March 2024 |
es_ES |
dc.subject |
Energy saving |
es_ES |
dc.subject |
Fault tolerance methods |
es_ES |
dc.subject |
Checkpoint |
es_ES |
dc.subject |
Parallel applications |
es_ES |
dc.subject |
ACPI |
es_ES |
dc.subject |
DVFS |
es_ES |
dc.subject.other |
Computer and Information Sciences |
es_ES |
dc.title |
Exploring Energy Saving Opportunities in Fault Tolerant HPC Systems |
es_ES |
dc.type |
Article |
es |
dc.type |
article |
eu |
dc.type |
acceptedVersion |
eu |
dc.description.fil |
Fil: Morán, Marina. Universidad Nacional del Comahue. Facultad de Informática. Departamento de Ingeniería de Computadoras; Argentina. |
es_ES |
dc.description.fil |
Fil: Balladini, Javier. Universidad Nacional del Comahue. Facultad de Informática. Departamento de Ingeniería de Computadoras; Argentina. |
es_ES |
dc.description.fil |
Fil: Rexachs, Dolores. Universidad Autónoma de Barcelona. Departamento Arquitectura de Computadores y Sistemas Operativos; España. |
es_ES |
dc.description.fil |
Fil: Rucci, Enzo. Universidad Nacional de La Plata. Facultad de Informática; Argentina. |
es_ES |
dc.cole |
Articles |
es_ES |