A Methodology for Soft Errors Detection and Automatic Recovery

Authors
Montezanti, Diego; De Giusti, Armando Eduardo; Naiouf, Marcelo; Villamayor, Jorge; Rexachs, Dolores; Luque, Emilio
Year of publication
2017
Language
English
Resource type
conference paper
Status
published version
Description
Handling faults is a growing concern in HPC; higher error rates, longer detection intervals and silent faults are expected in the future. It is projected that, in exascale systems, errors will occur several times a day and will propagate, with effects ranging from process crashes to results corrupted by undetected errors. In this article, we propose a methodology that improves system reliability against transient faults when running parallel message-passing applications. The proposed solution, based on process replication, aims to help programmers and users of parallel scientific applications achieve reliable executions with correct results. This work presents a characterization of the strategy, defining its behavior in the presence of faults and modeling the temporal costs of employing it. As a result, we show its efficacy and viability for tolerating transient faults in HPC systems.
Subject
Engineering and Technology
soft error detection
automatic recovery
system-level checkpoint
user-level checkpoint
Access level
open access
Terms of use
http://creativecommons.org/licenses/by-nc-sa/4.0/
Repository
CIC Digital (CICBA)
Institution
Comisión de Investigaciones Científicas de la Provincia de Buenos Aires
OAI identifier
oai:digital.cic.gba.gob.ar:11746/8584

id CICBA_0694f2b7ea55781f22f53306b54b612f
oai_identifier_str oai:digital.cic.gba.gob.ar:11746/8584
network_acronym_str CICBA
repository_id_str 9441
network_name_str CIC Digital (CICBA)
dc.type.none.fl_str_mv info:eu-repo/semantics/conferenceObject
info:eu-repo/semantics/publishedVersion
http://purl.org/coar/resource_type/c_5794
info:ar-repo/semantics/documentoDeConferencia
dc.identifier.none.fl_str_mv https://digital.cic.gba.gob.ar/handle/11746/8584
dc.language.none.fl_str_mv eng
dc.relation.none.fl_str_mv info:eu-repo/semantics/altIdentifier/doi/10.1109/HPCS.2017.71
dc.rights.none.fl_str_mv info:eu-repo/semantics/openAccess
http://creativecommons.org/licenses/by-nc-sa/4.0/
dc.format.none.fl_str_mv application/pdf
dc.source.none.fl_str_mv reponame:CIC Digital (CICBA)
instname:Comisión de Investigaciones Científicas de la Provincia de Buenos Aires
instacron:CICBA
repository.mail.fl_str_mv marisa.degiusti@sedici.unlp.edu.ar