Characterizing a Detection Strategy for Transient Faults in HPC
- Authors
- Montezanti, Diego Miguel; Rexachs del Rosario, Dolores; Rucci, Enzo; Luque Fadón, Emilio; Naiouf, Marcelo; De Giusti, Armando Eduardo; Feierherd, Guillermo Eugenio; Pesado, Patricia Mabel; Russo, Claudia Cecilia
- Publication year
- 2016
- Language
- English
- Resource type
- book part
- Status
- published version
- Description
- Handling faults is a growing concern in HPC: a greater variety of faults, higher error rates, longer detection intervals, and silent faults are expected in the future. In exascale systems, errors are projected to occur several times a day and to propagate, with effects ranging from process crashes to corrupted results, including undetected errors in applications that keep running. In this article, we analyze a methodology for transient fault detection (called SMCV) for MPI applications. The methodology is based on software replication, and it assumes that data corruption becomes apparent when replicas produce different messages. SMCV makes it possible to obtain reliable executions with correct results or, at least, to bring the system to a safe stop. This work presents a complete characterization of the methodology, formally defining its behavior in the presence of faults and validating it experimentally to show its efficacy and viability for detecting transient faults in HPC systems.
Red de Universidades con Carreras en Informática (RedUNCI)
- Subject
- Computer Science
transient faults
detection
scientific parallel applications
silent data corruption
HPC
fault injection
- Access level
- open access
- Terms of use
- http://creativecommons.org/licenses/by-nc/4.0/
- Repository
- Institution
- Universidad Nacional de La Plata
- OAI identifier
- oai:sedici.unlp.edu.ar:10915/81217
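
The detection idea summarized in the description (replicas run the same computation, and corruption shows up as differing outgoing messages) can be illustrated with a minimal sketch. This is not the authors' SMCV implementation and it does not use MPI: it is plain Python, and every name in it (`compute`, `flip_bit`, `validate_before_send`, `SilentCorruptionDetected`) is hypothetical. It assumes the comparison happens just before a message would be sent and that a mismatch leads to a safe stop, modeled here as an exception.

```python
import struct


class SilentCorruptionDetected(Exception):
    """Raised when two replicas disagree on an outgoing message (safe stop)."""


def flip_bit(value: float, bit: int) -> float:
    """Simulate a transient fault by flipping one bit of a 64-bit float."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (corrupted,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return corrupted


def compute(data, fault_at=None):
    """Hypothetical replica workload: sum of squares, serialized as a message.

    If fault_at is given, one intermediate term is corrupted to mimic a
    transient fault striking that replica only.
    """
    acc = 0.0
    for i, x in enumerate(data):
        term = x * x
        if i == fault_at:
            term = flip_bit(term, 52)  # lowest exponent bit
        acc += term
    return struct.pack("<d", acc)


def validate_before_send(msg_a: bytes, msg_b: bytes) -> bytes:
    """Compare the messages built by two replicas; release one only if equal."""
    if msg_a != msg_b:
        raise SilentCorruptionDetected("replicas produced different messages")
    return msg_a


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0]
    # Fault-free run: both replicas agree, so the message is released.
    print(validate_before_send(compute(data), compute(data)).hex())
    # Faulty run: one replica is corrupted, so the send is blocked.
    try:
        validate_before_send(compute(data), compute(data, fault_at=1))
    except SilentCorruptionDetected as err:
        print("safe stop:", err)
```

Comparing only what leaves a process keeps the check at communication boundaries; a fault that never influences an outgoing message is not caught at that point, which is consistent with the stated assumption that corruption becomes apparent through differing messages.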
- Publication date
- 2016-03-31
- Handle
- http://sedici.unlp.edu.ar/handle/10915/81217
- ISBN (alternate identifier)
- 978-987-4127-00-6
- Related item (handle)
- 10915/58554
- Format
- application/pdf
- Pages
- 77-90
- Publisher
- Editorial de la Universidad Nacional de La Plata (EDULP)