dc.creator | Morán, Marina | |
dc.creator | Balladini, Javier | |
dc.creator | Rexachs, Dolores | |
dc.creator | Rucci, Enzo | |
dc.date | 2024 | |
dc.date.accessioned | 2024-09-04T15:39:28Z | |
dc.date.available | 2024-09-04T15:39:28Z | |
dc.identifier.uri | http://rdi.uncoma.edu.ar/handle/uncomaid/18119 | |
dc.description.abstract | Nowadays, improving the energy efficiency of high-performance computing (HPC) systems is one of the main drivers of scientific and technological research. As large-scale HPC systems require some fault-tolerance method, the opportunities that such methods offer to reduce energy consumption should be explored. In particular, with rollback-recovery methods that use uncoordinated checkpoints, not all processes have to re-execute when a failure occurs. In this context, it is possible to take actions to reduce the energy consumption of the nodes whose processes do not re-execute. This work is an extension of a previous one, in which we proposed a series of strategies to manage energy consumption at failure time. In this work, we have enriched our simulator and the experimentation by including non-blocking communications (with and without system buffering) and a larger number of candidate processes to be analyzed. We call the latter cascade analysis, because it includes processes that get blocked by communication indirectly with the failed process. The simulations show that the savings were negligible in the worst case, but in some scenarios it was possible to achieve significant ones; the maximum saving achieved was 90% in a time interval of 16 minutes. As a result, we show the feasibility of improving energy efficiency in HPC systems in the presence of a failure. | es_ES |
dc.format | application/pdf | es_ES |
dc.format.extent | pp. 1-36 | es_ES |
dc.language | eng | es_ES |
dc.publisher | Elsevier | es_ES |
dc.relation.uri | https://doi.org/10.1016/j.jpdc.2023.104797 | es_ES |
dc.rights | Attribution-NonCommercial-ShareAlike 2.5 Argentina | es_ES |
dc.rights.uri | https://creativecommons.org/licenses/by-nc-sa/2.5/ar/ | es_ES |
dc.source | Journal of Parallel and Distributed Computing, Volume 185, March 2024 | es_ES |
dc.subject | Energy saving | es_ES |
dc.subject | Fault tolerance methods | es_ES |
dc.subject | Checkpoint | es_ES |
dc.subject | Parallel applications | es_ES |
dc.subject | ACPI | es_ES |
dc.subject | DVFS | es_ES |
dc.subject.other | Computer and Information Sciences | es_ES |
dc.title | Exploring Energy Saving Opportunities in Fault Tolerant HPC Systems | es_ES |
dc.type | Article | es |
dc.type | article | eu |
dc.type | acceptedVersion | eu |
dc.description.fil | Affiliation: Morán, Marina. Universidad Nacional del Comahue. Facultad de Informática. Departamento de Ingeniería de Computadoras; Argentina. | es_ES |
dc.description.fil | Affiliation: Balladini, Javier. Universidad Nacional del Comahue. Facultad de Informática. Departamento de Ingeniería de Computadoras; Argentina. | es_ES |
dc.description.fil | Affiliation: Rexachs, Dolores. Universidad Autónoma de Barcelona. Departamento Arquitectura de Computadores y Sistemas Operativos; Spain. | es_ES |
dc.description.fil | Affiliation: Rucci, Enzo. Universidad Nacional de La Plata. Facultad de Informática; Argentina. | es_ES |
dc.cole | Articles | es_ES |