Exploring Energy Saving Opportunities in Fault Tolerant HPC Systems
- Authors
- Morán, Marina; Balladini, Javier; Rexachs, Dolores; Rucci, Enzo
- Year of publication
- 2024
- Language
- English
- Resource type
- article
- Status
- accepted version
- Description
Improving the energy efficiency of high-performance computing (HPC) systems is currently one of the main drivers of scientific and technological research. Because large-scale HPC systems require some fault-tolerance method, the opportunities to reduce energy consumption should be explored. In particular, rollback-recovery methods based on uncoordinated checkpoints avoid re-executing all processes when a failure occurs. In this context, actions can be taken to reduce the energy consumption of the nodes whose processes do not re-execute. This work extends a previous one, in which we proposed a series of strategies to manage energy consumption at failure time. Here, we have enriched our simulator and the experimentation by including non-blocking communications (with and without system buffering) and a larger number of candidate processes to be analyzed. We call the latter cascade analysis, because it includes processes that become blocked through indirect communication with the failed process. The simulations show that the savings were negligible in the worst case, but in some scenarios significant savings were possible; the maximum saving achieved was 90% in a time interval of 16 minutes. As a result, we show the feasibility of improving energy efficiency in HPC systems in the presence of a failure.
Affiliation: Morán, Marina. Universidad Nacional del Comahue. Facultad de Informática. Departamento de Ingeniería de Computadoras; Argentina.
Affiliation: Balladini, Javier. Universidad Nacional del Comahue. Facultad de Informática. Departamento de Ingeniería de Computadoras; Argentina.
Affiliation: Rexachs, Dolores. Universidad Autónoma de Barcelona. Departamento Arquitectura de Computadores y Sistemas Operativos; Spain.
Affiliation: Rucci, Enzo. Universidad Nacional de La Plata. Facultad de Informática; Argentina.
- Source
- Journal of Parallel and Distributed Computing Volume 185, March 2024
- Subject
- Energy saving; Fault tolerance methods; Checkpoint; Parallel applications; ACPI; DVFS; Computer and Information Sciences
- Accessibility level
- open access
- Terms of use
- https://creativecommons.org/licenses/by-nc-sa/2.5/ar/
- Repository
- Repositorio Digital Institucional (UNCo)
- Institution
- Universidad Nacional del Comahue
- OAI identifier
- oai:rdi.uncoma.edu.ar:uncomaid/18119
Full record metadata
| Field | Value |
|---|---|
| id | RDIUNCO_e603e9563ee26668a18a37e8e729bbbc |
| oai_identifier_str | oai:rdi.uncoma.edu.ar:uncomaid/18119 |
| network_acronym_str | RDIUNCO |
| repository_id_str | 7108 |
| network_name_str | Repositorio Digital Institucional (UNCo) |
| dc.title.none.fl_str_mv | Exploring Energy Saving Opportunities in Fault Tolerant HPC Systems |
| dc.creator.none.fl_str_mv | Morán, Marina; Balladini, Javier; Rexachs, Dolores; Rucci, Enzo |
| author_role | author |
| dc.subject.none.fl_str_mv | Energy saving; Fault tolerance methods; Checkpoint; Parallel applications; ACPI; DVFS; Computer and Information Sciences |
| publishDate | 2024 |
| dc.type.none.fl_str_mv | info:eu-repo/semantics/article; info:eu-repo/semantics/acceptedVersion; http://purl.org/coar/resource_type/c_6501; info:ar-repo/semantics/articulo |
| format | article |
| status_str | acceptedVersion |
| dc.identifier.none.fl_str_mv | http://rdi.uncoma.edu.ar/handle/uncomaid/18119 |
| dc.language.none.fl_str_mv | eng |
| dc.relation.none.fl_str_mv | https://doi.org/10.1016/j.jpdc.2023.104797 |
| dc.rights.none.fl_str_mv | info:eu-repo/semantics/openAccess; https://creativecommons.org/licenses/by-nc-sa/2.5/ar/ |
| dc.format.none.fl_str_mv | application/pdf; pp. 1-36 |
| dc.publisher.none.fl_str_mv | Elsevier |
| dc.source.none.fl_str_mv | Journal of Parallel and Distributed Computing Volume 185, March 2024; reponame:Repositorio Digital Institucional (UNCo); instname:Universidad Nacional del Comahue |
| instname_str | Universidad Nacional del Comahue |
| repository.name.fl_str_mv | Repositorio Digital Institucional (UNCo) - Universidad Nacional del Comahue |
| repository.mail.fl_str_mv | mirtha.mateo@biblioteca.uncoma.edu.ar; adriana.acuna@biblioteca.uncoma.edu.ar |