A methodology for soft errors detection and automatic recovery
- Authors
- Montezanti, Diego Miguel; De Giusti, Armando Eduardo; Naiouf, Marcelo; Villamayor, Jorge; Rexachs del Rosario, Dolores; Luque Fadón, Emilio
- Publication year
- 2017
- Language
- English
- Resource type
- conference paper
- Status
- published version
- Description
- Handling faults is a growing concern in HPC; higher error rates, larger detection intervals and silent faults are expected in the future. It is projected that, in exascale systems, errors will occur several times a day, and they will propagate to generate errors that will range from process crashes to corrupted results because of undetected errors. In this article, we propose a methodology that improves system reliability against transient faults, when running parallel message-passing applications. The proposed solution, based on process replication, has the goal of helping programmers and users of parallel scientific applications to achieve reliable executions with correct results. This work presents a characterization of the strategy, defining its behavior in the presence of faults and modeling the temporal costs of employing it. As a result, we show its efficacy and viability to tolerate transient faults in HPC systems.
Instituto de Investigación en Informática
- Subject
- Computer Science
Soft error detection
Automatic recovery
System-level checkpoint
User-level checkpoint
- Access level
- open access
- Terms of use
- http://creativecommons.org/licenses/by-nc-sa/4.0/
- Repository
- Institution
- Universidad Nacional de La Plata
- OAI identifier
- oai:sedici.unlp.edu.ar:10915/129169
id |
SEDICI_95830341331dc8416d4ee94e795b846a |
---|---|
oai_identifier_str |
oai:sedici.unlp.edu.ar:10915/129169 |
network_acronym_str |
SEDICI |
repository_id_str |
1329 |
network_name_str |
SEDICI (UNLP) |
dc.title.none.fl_str_mv |
A methodology for soft errors detection and automatic recovery |
dc.creator.none.fl_str_mv |
Montezanti, Diego Miguel; De Giusti, Armando Eduardo; Naiouf, Marcelo; Villamayor, Jorge; Rexachs del Rosario, Dolores; Luque Fadón, Emilio |
dc.subject.none.fl_str_mv |
Ciencias Informáticas; Soft error detection; Automatic recovery; System-level checkpoint; User-level checkpoint |
dc.description.none.fl_txt_mv |
Handling faults is a growing concern in HPC; higher error rates, larger detection intervals and silent faults are expected in the future. It is projected that, in exascale systems, errors will occur several times a day, and they will propagate to generate errors that will range from process crashes to corrupted results because of undetected errors. In this article, we propose a methodology that improves system reliability against transient faults, when running parallel message-passing applications. The proposed solution, based on process replication, has the goal of helping programmers and users of parallel scientific applications to achieve reliable executions with correct results. This work presents a characterization of the strategy, defining its behavior in the presence of faults and modeling the temporal costs of employing it. As a result, we show its efficacy and viability to tolerate transient faults in HPC systems. Instituto de Investigación en Informática |
publishDate |
2017 |
dc.date.none.fl_str_mv |
2017-07 |
dc.type.none.fl_str_mv |
info:eu-repo/semantics/conferenceObject; info:eu-repo/semantics/publishedVersion; Objeto de conferencia; http://purl.org/coar/resource_type/c_5794; info:ar-repo/semantics/documentoDeConferencia |
format |
conferenceObject |
status_str |
publishedVersion |
dc.identifier.none.fl_str_mv |
http://sedici.unlp.edu.ar/handle/10915/129169 |
dc.language.none.fl_str_mv |
eng |
dc.relation.none.fl_str_mv |
info:eu-repo/semantics/altIdentifier/isbn/978-1-5386-3250-5; info:eu-repo/semantics/altIdentifier/doi/10.1109/HPCS.2017.71 |
dc.rights.none.fl_str_mv |
info:eu-repo/semantics/openAccess; http://creativecommons.org/licenses/by-nc-sa/4.0/; Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf; 434-441 |
dc.source.none.fl_str_mv |
reponame:SEDICI (UNLP); instname:Universidad Nacional de La Plata; instacron:UNLP |
repository.name.fl_str_mv |
SEDICI (UNLP) - Universidad Nacional de La Plata |
repository.mail.fl_str_mv |
alira@sedici.unlp.edu.ar |