A tool for detecting transient faults in execution of parallel scientific applications on multicore clusters

Authors
Montezanti, Diego Miguel; Rucci, Enzo; Rexachs del Rosario, Dolores; Luque Fadón, Emilio; Naiouf, Marcelo; De Giusti, Armando Eduardo
Publication year
2013
Language
English
Resource type
conference document
Status
published version
Description
Transient faults are becoming a critical concern in the design of current general-purpose multiprocessors. Because they can corrupt program outputs, their impact is especially significant for long-running parallel scientific applications, given the high cost of relaunching execution from the beginning when results are incorrect. This paper introduces SMCV, a tool that improves reliability for high-performance systems. SMCV replicates application processes and validates the contents of messages before they are sent, preventing the propagation of errors to other processes and limiting the latency of detection and notification. To assess its utility, the overhead of SMCV is evaluated with three computationally intensive, representative parallel scientific applications. The results demonstrate that SMCV detects transient fault occurrences efficiently.
WPDP- XIII Workshop procesamiento distribuido y paralelo
Red de Universidades con Carreras en Informática (RedUNCI)
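Illustrative note: the validation idea summarized in the description (compute an outgoing message redundantly and compare the copies before sending) can be sketched in Python as below. This is a minimal sketch and is not taken from the paper. The names `validated_send`, `TransientFaultDetected`, `partial_sum`, and `faulty_partial_sum` are hypothetical, and the replica is modeled as a second function call rather than a replicated process as in SMCV.

```python
from typing import Callable, List


class TransientFaultDetected(RuntimeError):
    """Raised when the redundant computations disagree on a message."""


def validated_send(
    primary: Callable[[], List[float]],
    replica: Callable[[], List[float]],
    send: Callable[[List[float]], None],
) -> None:
    """Compute the message twice and send it only if both copies match."""
    msg_a = primary()
    msg_b = replica()
    if msg_a != msg_b:
        # The mismatch is caught before the message leaves this process,
        # so corrupted data never reaches the receiver.
        raise TransientFaultDetected("replicas disagree; message withheld")
    send(msg_a)


def partial_sum() -> List[float]:
    """Hypothetical deterministic computation that produces a message."""
    return [sum(range(1000)) / 2.0]


def faulty_partial_sum() -> List[float]:
    """Same computation with a simulated corruption of the result."""
    value = partial_sum()
    value[0] += 1.0  # stands in for a bit flip caused by a transient fault
    return value
```

Calling `validated_send(partial_sum, partial_sum, outbox.append)` with a list `outbox` appends the message, while passing `faulty_partial_sum` as the replica raises `TransientFaultDetected` and leaves `outbox` unchanged. Comparing at send time is what keeps an error from propagating to other processes in this sketch.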
Subject
Computer Science
Validation
transient fault
parallel scientific application
soft error detection tool
message content validation
Access level
open access
Terms of use
http://creativecommons.org/licenses/by-nc-sa/2.5/ar/
Repository
SEDICI (UNLP)
Institution
Universidad Nacional de La Plata
OAI identifier
oai:sedici.unlp.edu.ar:10915/31729

id SEDICI_2f16f1dc6630a8e213e0345ecc1eb5bd
oai_identifier_str oai:sedici.unlp.edu.ar:10915/31729
network_acronym_str SEDICI
repository_id_str 1329
network_name_str SEDICI (UNLP)
publishDate 2013
dc.date.none.fl_str_mv 2013-10
dc.type.none.fl_str_mv info:eu-repo/semantics/conferenceObject
info:eu-repo/semantics/publishedVersion
Conference object
http://purl.org/coar/resource_type/c_5794
info:ar-repo/semantics/documentoDeConferencia
format conferenceObject
status_str publishedVersion
dc.identifier.none.fl_str_mv http://sedici.unlp.edu.ar/handle/10915/31729
url http://sedici.unlp.edu.ar/handle/10915/31729
dc.language.none.fl_str_mv eng
language eng
dc.rights.none.fl_str_mv info:eu-repo/semantics/openAccess
http://creativecommons.org/licenses/by-nc-sa/2.5/ar/
Creative Commons Attribution-NonCommercial-ShareAlike 2.5 Argentina (CC BY-NC-SA 2.5)
dc.format.none.fl_str_mv application/pdf
dc.source.none.fl_str_mv reponame:SEDICI (UNLP)
instname:Universidad Nacional de La Plata
instacron:UNLP
repository.name.fl_str_mv SEDICI (UNLP) - Universidad Nacional de La Plata
repository.mail.fl_str_mv alira@sedici.unlp.edu.ar