A methodology for soft errors detection and automatic recovery

Authors
Montezanti, Diego Miguel; De Giusti, Armando Eduardo; Naiouf, Marcelo; Villamayor, Jorge; Rexachs del Rosario, Dolores; Luque Fadón, Emilio
Publication year
2017
Language
English
Resource type
conference paper
Status
published version
Description
Handling faults is a growing concern in HPC; higher error rates, larger detection intervals and silent faults are expected in the future. It is projected that, in exascale systems, errors will occur several times a day, and they will propagate to generate errors that will range from process crashes to corrupted results because of undetected errors. In this article, we propose a methodology that improves system reliability against transient faults, when running parallel message-passing applications. The proposed solution, based on process replication, has the goal of helping programmers and users of parallel scientific applications to achieve reliable executions with correct results. This work presents a characterization of the strategy, defining its behavior in the presence of faults and modeling the temporal costs of employing it. As a result, we show its efficacy and viability to tolerate transient faults in HPC systems.
Instituto de Investigación en Informática
Subject
Computer Science
Soft error detection
Automatic recovery
System-level checkpoint
User-level checkpoint
Access level
open access
Terms of use
http://creativecommons.org/licenses/by-nc-sa/4.0/
Repository
SEDICI (UNLP)
Institution
Universidad Nacional de La Plata
OAI identifier
oai:sedici.unlp.edu.ar:10915/129169

id SEDICI_95830341331dc8416d4ee94e795b846a
network_acronym_str SEDICI
repository_id_str 1329
network_name_str SEDICI (UNLP)
publishDate 2017
dc.date.none.fl_str_mv 2017-07
dc.type.none.fl_str_mv info:eu-repo/semantics/conferenceObject
info:eu-repo/semantics/publishedVersion
Conference object
http://purl.org/coar/resource_type/c_5794
info:ar-repo/semantics/documentoDeConferencia
format conferenceObject
status_str publishedVersion
dc.identifier.none.fl_str_mv http://sedici.unlp.edu.ar/handle/10915/129169
url http://sedici.unlp.edu.ar/handle/10915/129169
dc.language.none.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv info:eu-repo/semantics/altIdentifier/isbn/978-1-5386-3250-5
info:eu-repo/semantics/altIdentifier/doi/10.1109/HPCS.2017.71
dc.rights.none.fl_str_mv info:eu-repo/semantics/openAccess
http://creativecommons.org/licenses/by-nc-sa/4.0/
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
434-441
dc.source.none.fl_str_mv reponame:SEDICI (UNLP)
instname:Universidad Nacional de La Plata
instacron:UNLP
repository.name.fl_str_mv SEDICI (UNLP) - Universidad Nacional de La Plata
repository.mail.fl_str_mv alira@sedici.unlp.edu.ar