Characterizing a Detection Strategy for Transient Faults in HPC

Authors
Montezanti, Diego Miguel; Rexachs del Rosario, Dolores; Rucci, Enzo; Luque Fadón, Emilio; Naiouf, Marcelo; De Giusti, Armando Eduardo; Feierherd, Guillermo Eugenio; Pesado, Patricia Mabel; Russo, Claudia Cecilia
Year of publication
2016
Language
English
Resource type
book part
Status
published version
Description
Handling faults is a growing concern in HPC: a greater variety of faults, higher error rates, longer detection intervals and silent faults are expected in the future. It is projected that, in exascale systems, errors will occur several times a day and will propagate, with effects ranging from process crashes to corrupted results, including undetected errors in applications that keep running. In this article, we analyze a methodology for transient fault detection in MPI applications, called SMCV. The methodology is based on software replication and assumes that data corruption becomes apparent as differing messages between replicas. SMCV makes it possible to obtain reliable executions with correct results or, at least, to bring the system to a safe stop. This work presents a complete characterization of SMCV, formally defining its behavior in the presence of faults and validating it experimentally to show its efficacy and viability for detecting transient faults in HPC systems.
Red de Universidades con Carreras en Informática (RedUNCI)
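The idea stated in the description, that corruption surfaces as a difference between the messages produced by two replicas, can be illustrated with a minimal sketch. This is not the SMCV implementation: the function names (`compute`, `validate_before_send`), the exception `ReplicaMismatch`, the workload, and the simulated bit flip are all hypothetical, and no MPI library is used.

```python
class ReplicaMismatch(Exception):
    """Hypothetical signal for a safe stop: the replicas disagree."""


def compute(data, flip_bit=None):
    """Toy workload (sum of squares), an assumption for illustration.

    flip_bit=(index, bit) simulates a transient fault by flipping one
    bit of one input value before the computation.
    """
    values = list(data)
    if flip_bit is not None:
        index, bit = flip_bit
        values[index] ^= 1 << bit
    return sum(v * v for v in values)


def validate_before_send(payload_a, payload_b):
    """Compare the outgoing message contents of two replicas.

    If they match, the message is considered safe to send; otherwise
    the mismatch is reported instead of propagating corrupted data.
    """
    if payload_a != payload_b:
        raise ReplicaMismatch("replicas produced different messages")
    return payload_a


if __name__ == "__main__":
    data = [1, 2, 3]
    # Fault-free run: both replicas agree, so the message is released.
    print(validate_before_send(compute(data), compute(data)))
    # Faulty run: one replica suffers a simulated bit flip.
    try:
        validate_before_send(compute(data), compute(data, flip_bit=(0, 3)))
    except ReplicaMismatch as err:
        print("safe stop:", err)
```

In this sketch a fault is detected only if it changes what a replica would send, which mirrors the assumption named in the description; a fault that never alters a message would pass unnoticed by this check.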
Subject
Computer Science
transient faults
detection
scientific parallel applications
silent data corruption
HPC
fault injection
Access level
open access
Terms of use
http://creativecommons.org/licenses/by-nc/4.0/
Repository
SEDICI (UNLP)
Institution
Universidad Nacional de La Plata
OAI identifier
oai:sedici.unlp.edu.ar:10915/81217

Publication date
2016-03-31
Publisher
Editorial de la Universidad Nacional de La Plata (EDULP)
Pages
77-90
Format
application/pdf
URL
http://sedici.unlp.edu.ar/handle/10915/81217
ISBN
978-987-4127-00-6
Related handle
10915/58554
License
Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
Repository contact
alira@sedici.unlp.edu.ar