A tool for detecting transient faults in execution of parallel scientific applications on multicore clusters

Authors
Montezanti, Diego Miguel; Rucci, Enzo; Rexachs del Rosario, Dolores; Luque Fadón, Emilio; Naiouf, Marcelo; De Giusti, Armando Eduardo
Publication year
2014
Language
English
Resource type
article
Status
published version
Description
Transient faults are becoming a critical concern in the design of current general-purpose multiprocessors. Because they can corrupt program outputs, their impact is greatest for long-running parallel scientific applications, where incorrect results force a costly re-launch of the execution from the beginning. This paper introduces the SMCV tool, which improves reliability for high-performance systems. SMCV replicates application processes and validates the contents of messages before they are sent, preventing errors from propagating to other processes and restricting the latency of detection and notification. To assess its utility, the overhead of SMCV is evaluated with three computationally intensive, representative parallel scientific applications. The results obtained demonstrate the efficiency of SMCV in detecting transient fault occurrences.
Facultad de Informática
Subject
Computer Science
transient fault
parallel scientific application
soft error detection tool
message content validation
Accessibility level
open access
Terms of use
http://creativecommons.org/licenses/by-nc/3.0/
Repository
SEDICI (UNLP)
Institution
Universidad Nacional de La Plata
OAI identifier
oai:sedici.unlp.edu.ar:10915/34544

dc.date.none.fl_str_mv 2014-04
dc.identifier.none.fl_str_mv http://sedici.unlp.edu.ar/handle/10915/34544
dc.relation.none.fl_str_mv info:eu-repo/semantics/altIdentifier/url/http://journal.info.unlp.edu.ar/wp-content/uploads/JCST-Apr14-5.pdf
info:eu-repo/semantics/altIdentifier/issn/1666-6038
dc.rights.none.fl_str_mv info:eu-repo/semantics/openAccess
http://creativecommons.org/licenses/by-nc/3.0/
Creative Commons Attribution-NonCommercial 3.0 Unported (CC BY-NC 3.0)
dc.format.none.fl_str_mv application/pdf
32-38