Checkpoint and Restart: An Energy Consumption Characterization in Clusters

Authors
Morán, Marina; Balladini, Javier; Rexachs, Dolores; Luque, Emilio
Publication year
2024
Language
English
Resource type
article
Status
accepted version
Description
The fault tolerance method currently used in High Performance Computing (HPC) is rollback-recovery using checkpoints. Like any other fault tolerance method, it adds energy consumption on top of that of the application's execution. The objective of this work is to determine the factors that affect the energy consumption of the compute nodes of a homogeneous cluster when performing checkpoint and restart operations on SPMD (Single Program Multiple Data) applications. We focus on the energy study of compute nodes, considering different configurations of hardware and software parameters. We studied the effect of processor performance states (P-states) and power states (C-states), application problem size, checkpoint software (DMTCP), and distributed file system (NFS) configuration. Analysis of the results allowed us to identify opportunities to reduce the energy consumption of checkpoint and restart operations.
Affiliation: Morán, Marina. Universidad Nacional del Comahue. Facultad de Informática; Argentina.
Affiliation: Balladini, Javier. Universidad Nacional del Comahue. Facultad de Informática; Argentina.
Affiliation: Rexachs, Dolores. Universitat Autónoma de Barcelona. Departamento de Arquitectura de Computadores y Sistemas Operativos; Spain.
Affiliation: Luque, Emilio. Universitat Autónoma de Barcelona. Departamento de Arquitectura de Computadores y Sistemas Operativos; Spain.
Subject
Checkpoint
Restart
Energy consumption
Power
Fault tolerance methods
Computer and Information Sciences
Articles
Access level
open access
Terms of use
https://creativecommons.org/licenses/by-nc-sa/4.0/
Repository
Repositorio Digital Institucional (UNCo)
Institution
Universidad Nacional del Comahue
OAI Identifier
oai:rdi.uncoma.edu.ar:uncomaid/19173

URL
https://rdi.uncoma.edu.ar/handle/uncomaid/19173
Related resource
https://arxiv.org/abs/2409.02214
Format
application/pdf, 15 p.
Publisher
arXiv