Exploring Energy Saving Opportunities in Fault Tolerant HPC Systems
- Authors
- Morán, Marina; Balladini, Javier; Rexachs, Dolores; Rucci, Enzo
- Publication year
- 2024
- Language
- English
- Resource type
- article
- Status
- accepted version
- Description
- Nowadays, improving the energy efficiency of high-performance computing (HPC) systems is one of the main drivers in scientific and technological research. As large-scale HPC systems require some fault-tolerant method, the opportunities to reduce energy consumption should be explored. In particular, with rollback-recovery methods that use uncoordinated checkpoints, not all processes have to re-execute when a failure occurs. In this context, it is possible to take actions to reduce the energy consumption of the nodes whose processes do not re-execute. This work is an extension of a previous one, in which we proposed a series of strategies to manage energy consumption at failure time. In this work, we have enriched our simulator and the experimentation by including non-blocking communications (with and without system buffering) and a larger number of candidate processes to be analyzed. We call the latter cascade analysis, because it includes processes that become blocked through indirect communication with the failed process. The simulations show that the savings were negligible in the worst case, but in some scenarios it was possible to achieve significant ones; the maximum saving achieved was 90% in a time interval of 16 minutes. As a result, we show the feasibility of improving energy efficiency in HPC systems in the presence of a failure.
Affiliation: Morán, Marina. Universidad Nacional del Comahue. Facultad de Informática. Departamento de Ingeniería de Computadoras; Argentina.
Affiliation: Balladini, Javier. Universidad Nacional del Comahue. Facultad de Informática. Departamento de Ingeniería de Computadoras; Argentina.
Affiliation: Rexachs, Dolores. Universidad Autónoma de Barcelona. Departamento Arquitectura de Computadores y Sistemas Operativos; Spain.
Affiliation: Rucci, Enzo. Universidad Nacional de La Plata. Facultad de Informática; Argentina.
- Source
- Journal of Parallel and Distributed Computing Volume 185, March 2024
- Subject
-
Energy saving
Fault tolerance methods
Checkpoint parallel
Applications ACPI DVFS
Computer and Information Sciences
- Accessibility level
- open access
- Terms of use
- https://creativecommons.org/licenses/by-nc-sa/2.5/ar/
- Repository
- Institution
- Universidad Nacional del Comahue
- OAI identifier
- oai:rdi.uncoma.edu.ar:uncomaid/18119
Full record metadata
id |
RDIUNCO_e603e9563ee26668a18a37e8e729bbbc |
---|---|
oai_identifier_str |
oai:rdi.uncoma.edu.ar:uncomaid/18119 |
network_acronym_str |
RDIUNCO |
repository_id_str |
7108 |
network_name_str |
Repositorio Digital Institucional (UNCo) |
dc.title.none.fl_str_mv |
Exploring Energy Saving Opportunities in Fault Tolerant HPC Systems |
title |
Exploring Energy Saving Opportunities in Fault Tolerant HPC Systems |
dc.creator.none.fl_str_mv |
Morán, Marina Balladini, Javier Rexachs, Dolores Rucci, Enzo |
author |
Morán, Marina |
author_facet |
Morán, Marina Balladini, Javier Rexachs, Dolores Rucci, Enzo |
author_role |
author |
author2 |
Balladini, Javier Rexachs, Dolores Rucci, Enzo |
author2_role |
author author author |
dc.subject.none.fl_str_mv |
Energy saving Fault tolerance methods Checkpoint parallel Applications ACPI DVFS Computer and Information Sciences |
topic |
Energy saving Fault tolerance methods Checkpoint parallel Applications ACPI DVFS Computer and Information Sciences |
publishDate |
2024 |
dc.date.none.fl_str_mv |
2024 |
dc.type.none.fl_str_mv |
info:eu-repo/semantics/article info:eu-repo/semantics/acceptedVersion http://purl.org/coar/resource_type/c_6501 info:ar-repo/semantics/articulo |
format |
article |
status_str |
acceptedVersion |
dc.identifier.none.fl_str_mv |
http://rdi.uncoma.edu.ar/handle/uncomaid/18119 |
url |
http://rdi.uncoma.edu.ar/handle/uncomaid/18119 |
dc.language.none.fl_str_mv |
eng |
language |
eng |
dc.relation.none.fl_str_mv |
https://doi.org/10.1016/j.jpdc.2023.104797 |
dc.rights.none.fl_str_mv |
info:eu-repo/semantics/openAccess https://creativecommons.org/licenses/by-nc-sa/2.5/ar/ |
eu_rights_str_mv |
openAccess |
rights_invalid_str_mv |
https://creativecommons.org/licenses/by-nc-sa/2.5/ar/ |
dc.format.none.fl_str_mv |
application/pdf; pp. 1-36 |
dc.publisher.none.fl_str_mv |
Elsevier |
publisher.none.fl_str_mv |
Elsevier |
dc.source.none.fl_str_mv |
Journal of Parallel and Distributed Computing Volume 185, March 2024 reponame:Repositorio Digital Institucional (UNCo) instname:Universidad Nacional del Comahue |
reponame_str |
Repositorio Digital Institucional (UNCo) |
collection |
Repositorio Digital Institucional (UNCo) |
instname_str |
Universidad Nacional del Comahue |
repository.name.fl_str_mv |
Repositorio Digital Institucional (UNCo) - Universidad Nacional del Comahue |
repository.mail.fl_str_mv |
mirtha.mateo@biblioteca.uncoma.edu.ar; adriana.acuna@biblioteca.uncoma.edu.ar |