Exploring Energy Saving Opportunities in Fault Tolerant HPC Systems

Authors
Morán, Marina; Balladini, Javier; Rexachs, Dolores; Rucci, Enzo
Publication year
2023
Language
English
Resource type
article
Status
accepted version
Description
Nowadays, improving the energy efficiency of high-performance computing (HPC) systems is one of the main drivers in scientific and technological research. As large-scale HPC systems require some fault-tolerance method, the opportunities to reduce energy consumption should be explored. In particular, rollback-recovery methods using uncoordinated checkpoints do not require every process to re-execute when a failure occurs. In this context, it is possible to take actions to reduce the energy consumption of the nodes whose processes do not re-execute. This work is an extension of a previous one, in which we proposed a series of strategies to manage energy consumption at failure time. In this work, we have enriched our simulator and the experimentation by including non-blocking communications (with and without system buffering) and a larger number of candidate processes to be analyzed. We call the latter cascade analysis, because it includes processes that become blocked through indirect communication with the failed process. The simulations show that the savings were negligible in the worst case, but in some scenarios it was possible to achieve significant ones; the maximum saving achieved was 90% over a time interval of 16 minutes. As a result, we show the feasibility of improving energy efficiency in HPC systems in the presence of a failure.
Affiliation: Morán, Marina. Universidad Nacional del Comahue. Facultad de Informática; Argentina.
Affiliation: Balladini, Javier. Universidad Nacional del Comahue. Facultad de Informática; Argentina.
Affiliation: Rexachs, Dolores. Universitat Autónoma de Barcelona. Departamento de Arquitectura de Computadores y Sistemas Operativos; Spain.
Affiliation: Rucci, Enzo. Universidad Nacional de La Plata. Facultad de Informática; Argentina.
Source
Journal of Parallel and Distributed Computing, October 2023
Subject
Energy saving
Fault
Tolerance
Methods
Checkpoint
Parallel
Applications
ACPI
DVFS
Computer and Information Sciences
Articles
Access level
open access
Terms of use
https://creativecommons.org/licenses/by-nc-sa/4.0/
Repository
Repositorio Digital Institucional (UNCo)
Institution
Universidad Nacional del Comahue
OAI identifier
oai:rdi.uncoma.edu.ar:uncomaid/19175

Handle
https://rdi.uncoma.edu.ar/handle/uncomaid/19175
Related resources
https://arxiv.org/abs/2311.06419
https://doi.org/10.1016/j.jpdc.2023.104797
Format
application/pdf
Publisher
arXiv
Repository contact
mirtha.mateo@biblioteca.uncoma.edu.ar; adriana.acuna@biblioteca.uncoma.edu.ar