Exploring Energy Saving Opportunities in Fault Tolerant HPC Systems

Authors
Morán, Marina; Balladini, Javier; Rexachs, Dolores; Rucci, Enzo
Publication year
2024
Language
English
Resource type
article
Status
accepted version
Description
Improving the energy efficiency of high-performance computing (HPC) systems is currently one of the main drivers of scientific and technological research. Because large-scale HPC systems require some fault-tolerance method, the opportunities to reduce energy consumption that such methods offer should be explored. In particular, rollback-recovery methods based on uncoordinated checkpoints avoid re-executing every process when a failure occurs. In this context, actions can be taken to reduce the energy consumption of the nodes whose processes do not re-execute. This work extends a previous one, in which we proposed a series of strategies to manage energy consumption at failure time. Here, we have enriched our simulator and the experimentation by including non-blocking communications (with and without system buffering) and a larger number of candidate processes to be analyzed. We call the latter cascade analysis, because it includes processes that become blocked through indirect communication with the failed process. The simulations show that the savings were negligible in the worst case but significant in some scenarios; the maximum saving achieved was 90% in a time interval of 16 minutes. As a result, we show the feasibility of improving energy efficiency in HPC systems in the presence of a failure.
Affiliation: Morán, Marina. Universidad Nacional del Comahue. Facultad de Informática. Departamento de Ingeniería de Computadoras; Argentina.
Affiliation: Balladini, Javier. Universidad Nacional del Comahue. Facultad de Informática. Departamento de Ingeniería de Computadoras; Argentina.
Affiliation: Rexachs, Dolores. Universidad Autónoma de Barcelona. Departamento Arquitectura de Computadores y Sistemas Operativos; Spain.
Affiliation: Rucci, Enzo. Universidad Nacional de La Plata. Facultad de Informática; Argentina.
Source
Journal of Parallel and Distributed Computing Volume 185, March 2024
Subject
Energy saving
Fault tolerance methods
Checkpoint
Parallel applications
ACPI
DVFS
Computer and Information Sciences
Access level
open access
Terms of use
https://creativecommons.org/licenses/by-nc-sa/2.5/ar/
Repository
Repositorio Digital Institucional (UNCo)
Institution
Universidad Nacional del Comahue
OAI identifier
oai:rdi.uncoma.edu.ar:uncomaid/18119

URL
http://rdi.uncoma.edu.ar/handle/uncomaid/18119
DOI
https://doi.org/10.1016/j.jpdc.2023.104797
Format
application/pdf
pp. 1-36
Publisher
Elsevier
Repository contact
mirtha.mateo@biblioteca.uncoma.edu.ar; adriana.acuna@biblioteca.uncoma.edu.ar