A Methodology for Soft Errors Detection and Automatic Recovery
- Authors
- Montezanti, Diego; De Giusti, Armando Eduardo; Naiouf, Marcelo; Villamayor, Jorge; Rexachs, Dolores; Luque, Emilio
- Publication year
- 2017
- Language
- English
- Resource type
- conference paper
- Status
- published version
- Description
- Handling faults is a growing concern in HPC: higher error rates, larger detection intervals, and silent faults are expected in the future. In exascale systems, errors are projected to occur several times a day and to propagate, with consequences ranging from process crashes to results corrupted by undetected errors. In this article, we propose a methodology that improves system reliability against transient faults when running parallel message-passing applications. The proposed solution, based on process replication, aims to help programmers and users of parallel scientific applications achieve reliable executions with correct results. This work characterizes the strategy, defining its behavior in the presence of faults and modeling the temporal costs of employing it. As a result, we show its efficacy and viability for tolerating transient faults in HPC systems.
- Subject
- Engineering and Technology; soft error detection; automatic recovery; system-level checkpoint; user-level checkpoint
- Access level
- open access
- Terms of use
- http://creativecommons.org/licenses/by-nc-sa/4.0/
- Repository
- CIC Digital (CICBA)
- Institution
- Comisión de Investigaciones Científicas de la Provincia de Buenos Aires
- OAI identifier
- oai:digital.cic.gba.gob.ar:11746/8584
- Handle
- https://digital.cic.gba.gob.ar/handle/11746/8584
- DOI
- 10.1109/HPCS.2017.71