Towards Management of Energy Consumption in HPC Systems with Fault Tolerance

Authors
Morán, Marina; Balladini, Javier; Rexachs del Rosario, Dolores; Rucci, Enzo
Publication year
2020
Language
English
Resource type
conference paper
Status
published version
Description
High-performance computing continues to increase its computing power and energy efficiency. However, energy consumption also continues to rise, and finding ways to limit or reduce it is a central concern of current research. For high-performance MPI applications, there are fault tolerance methods based on rollback recovery, such as uncoordinated checkpointing. With these methods, only some processes roll back when a failure occurs, while the remaining processes continue to run. In this article, we focus on the processes that continue executing and propose a series of strategies to manage energy consumption when a failure occurs and uncoordinated checkpoints are used. We present an energy model to evaluate these strategies and, through simulation, analyze the behavior of an application under different configurations and failure times. The results show that it is feasible to improve energy efficiency in HPC systems in the presence of a failure.
Instituto de Investigación en Informática
Comisión de Investigaciones Científicas de la provincia de Buenos Aires
Subject
Computer science
Energy consumption
energy saving
Power management
Fault tolerance
uncoordinated checkpoint
HPC
Distributed memory
MPI
DVFS
ACPI
Access level
open access
Terms of use
http://creativecommons.org/licenses/by-nc-sa/4.0/
Repository
SEDICI (UNLP)
Institution
Universidad Nacional de La Plata
OAI identifier
oai:sedici.unlp.edu.ar:10915/139146

id SEDICI_46813f5cbd85bf4d530af115d180d574
oai_identifier_str oai:sedici.unlp.edu.ar:10915/139146
network_acronym_str SEDICI
repository_id_str 1329
network_name_str SEDICI (UNLP)
publishDate 2020
dc.date.none.fl_str_mv 2020
dc.type.none.fl_str_mv info:eu-repo/semantics/conferenceObject
info:eu-repo/semantics/publishedVersion
Conference object
http://purl.org/coar/resource_type/c_5794
info:ar-repo/semantics/documentoDeConferencia
format conferenceObject
status_str publishedVersion
dc.identifier.none.fl_str_mv http://sedici.unlp.edu.ar/handle/10915/139146
url http://sedici.unlp.edu.ar/handle/10915/139146
dc.language.none.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv info:eu-repo/semantics/altIdentifier/isbn/978-1-7281-5957-7
info:eu-repo/semantics/altIdentifier/doi/10.1109/argencon49523.2020.9505498
info:eu-repo/semantics/altIdentifier/arxiv/2012.11396
dc.rights.none.fl_str_mv info:eu-repo/semantics/openAccess
http://creativecommons.org/licenses/by-nc-sa/4.0/
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
eu_rights_str_mv openAccess
rights_invalid_str_mv http://creativecommons.org/licenses/by-nc-sa/4.0/
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
dc.format.none.fl_str_mv application/pdf
dc.source.none.fl_str_mv reponame:SEDICI (UNLP)
instname:Universidad Nacional de La Plata
instacron:UNLP
reponame_str SEDICI (UNLP)
collection SEDICI (UNLP)
instname_str Universidad Nacional de La Plata
instacron_str UNLP
institution UNLP
repository.name.fl_str_mv SEDICI (UNLP) - Universidad Nacional de La Plata
repository.mail.fl_str_mv alira@sedici.unlp.edu.ar