Exploring Energy Saving Opportunities in Fault Tolerant HPC Systems

Authors: Morán, Marina; Balladini, Javier; Rexachs, Dolores; Rucci, Enzo
Publication year: 2024
Language: English
Resource type: article
Status: accepted version
Description: Nowadays, improving the energy efficiency of high-performance computing (HPC) systems is one of the main drivers of scientific and technological research. Because large-scale HPC systems require some fault-tolerance method, the opportunities it offers to reduce energy consumption should be explored. In particular, rollback-recovery methods using uncoordinated checkpoints avoid forcing all processes to re-execute when a failure occurs. In this context, it is possible to take actions that reduce the energy consumption of the nodes whose processes do not re-execute. This work extends a previous one, in which we proposed a series of strategies to manage energy consumption at failure time. Here, we have enriched our simulator and the experimentation by including non-blocking communications (with and without system buffering) and a larger number of candidate processes to be analyzed. We call the latter cascade analysis, because it includes processes that become blocked through indirect communication with the failed process. The simulations show that the savings were negligible in the worst case but significant in some scenarios; the maximum saving achieved was 90% over a time interval of 16 minutes. As a result, we show the feasibility of improving energy efficiency in HPC systems in the presence of a failure.
Affiliation: Morán, Marina. Universidad Nacional del Comahue. Facultad de Informática. Departamento de Ingeniería de Computadoras; Argentina.
Affiliation: Balladini, Javier. Universidad Nacional del Comahue. Facultad de Informática. Departamento de Ingeniería de Computadoras; Argentina.
Affiliation: Rexachs, Dolores. Universidad Autónoma de Barcelona. Departamento Arquitectura de Computadores y Sistemas Operativos; Spain.
Affiliation: Rucci, Enzo. Universidad Nacional de La Plata. Facultad de Informática; Argentina.
Source: Journal of Parallel and Distributed Computing, Volume 185, March 2024
Subject: Energy saving
Fault tolerance methods
Checkpoint parallel
Applications ACPI DVFS
Computer and Information Sciences
Access level: open access
Terms of use: https://creativecommons.org/licenses/by-nc-sa/2.5/ar/
Repository
Institution: Universidad Nacional del Comahue
OAI identifier: oai:rdi.uncoma.edu.ar:uncomaid/18119


id	RDIUNCO_e603e9563ee26668a18a37e8e729bbbc
oai_identifier_str	oai:rdi.uncoma.edu.ar:uncomaid/18119
network_acronym_str	RDIUNCO
repository_id_str	7108
network_name_str	Repositorio Digital Institucional (UNCo)
dc.title.none.fl_str_mv	Exploring Energy Saving Opportunities in Fault Tolerant HPC Systems
dc.creator.none.fl_str_mv	Morán, Marina Balladini, Javier Rexachs, Dolores Rucci, Enzo
dc.subject.none.fl_str_mv	Energy saving Fault tolerance methods Checkpoint parallel Applications ACPI DVFS Ciencias de la Computación e Información
publishDate	2024
dc.date.none.fl_str_mv	2024
dc.type.none.fl_str_mv	info:eu-repo/semantics/article info:eu-repo/semantics/acceptedVersion http://purl.org/coar/resource_type/c_6501 info:ar-repo/semantics/articulo
format	article
status_str	acceptedVersion
dc.identifier.none.fl_str_mv	http://rdi.uncoma.edu.ar/handle/uncomaid/18119
url	http://rdi.uncoma.edu.ar/handle/uncomaid/18119
dc.language.none.fl_str_mv	eng
language	eng
dc.relation.none.fl_str_mv	https://doi.org/10.1016/j.jpdc.2023.104797
dc.rights.none.fl_str_mv	info:eu-repo/semantics/openAccess https://creativecommons.org/licenses/by-nc-sa/2.5/ar/
eu_rights_str_mv	openAccess
rights_invalid_str_mv	https://creativecommons.org/licenses/by-nc-sa/2.5/ar/
dc.format.none.fl_str_mv	application/pdf pp. 1-36 application/pdf
dc.publisher.none.fl_str_mv	Elsevier
publisher.none.fl_str_mv	Elsevier
dc.source.none.fl_str_mv	Journal of Parallel and Distributed Computing Volume 185, March 2024 reponame:Repositorio Digital Institucional (UNCo) instname:Universidad Nacional del Comahue
reponame_str	Repositorio Digital Institucional (UNCo)
collection	Repositorio Digital Institucional (UNCo)
instname_str	Universidad Nacional del Comahue
repository.name.fl_str_mv	Repositorio Digital Institucional (UNCo) - Universidad Nacional del Comahue
repository.mail.fl_str_mv	mirtha.mateo@biblioteca.uncoma.edu.ar; adriana.acuna@biblioteca.uncoma.edu.ar
