A tool for detecting transient faults in execution of parallel scientific applications on multicore clusters

Authors
Montezanti, Diego Miguel; Rucci, Enzo; Dolores Rexachs; Luque, Emilio; Naiouf, Ricardo Marcelo; de Giusti, Armando Eduardo
Publication year
2014
Language
English
Resource type
article
Status
published version
Description
Transient faults are becoming a critical concern among current trends in the design of general-purpose multiprocessors. Because they can corrupt program outputs, their impact grows for long-running parallel scientific applications, given the high cost of relaunching execution from the beginning when results are incorrect. This paper introduces the SMCV tool, which improves reliability for high-performance systems. SMCV replicates application processes and validates the contents of the messages to be sent, preventing the propagation of errors to other processes and limiting detection and notification latency. To assess its utility, the overhead of the SMCV tool is evaluated with three computationally intensive, representative parallel scientific applications. The results demonstrate the efficiency of the SMCV tool in detecting transient fault occurrences.
Affiliation: Montezanti, Diego Miguel. Universidad Nacional de La Plata. Facultad de Informática. Instituto de Investigación en Informática LIDI; Argentina
Affiliation: Rucci, Enzo. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina. Universidad Nacional de La Plata. Facultad de Informática. Instituto de Investigación en Informática LIDI; Argentina
Affiliation: Dolores Rexachs. Universitat Autònoma de Barcelona; Spain
Affiliation: Luque, Emilio. Universitat Autònoma de Barcelona; Spain
Affiliation: Naiouf, Ricardo Marcelo. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina. Universidad Nacional de La Plata. Facultad de Informática. Instituto de Investigación en Informática LIDI; Argentina
Affiliation: de Giusti, Armando Eduardo. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina. Universidad Nacional de La Plata. Facultad de Informática. Instituto de Investigación en Informática LIDI; Argentina
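The validation step mentioned in the description (compare what each replica is about to send, and only let the message out if the contents agree) can be illustrated with a minimal sketch. This is not SMCV's code or API: the names `validated_send`, `flip_bit`, and `TransientFaultDetected` are hypothetical, the two replicas are represented simply as two byte buffers, and no real message-passing library is used.

```python
from typing import Callable


class TransientFaultDetected(Exception):
    """Raised when the two replicas disagree on an outgoing message."""


def validated_send(primary: bytes, replica: bytes,
                   send: Callable[[bytes], None]) -> None:
    """Send `primary` only if the replica produced identical contents."""
    if primary != replica:
        # Mismatch: one replica was likely corrupted, so stop here,
        # before the corrupted data can reach another process.
        raise TransientFaultDetected("replica message contents differ")
    send(primary)


def flip_bit(data: bytes, bit: int) -> bytes:
    """Return a copy of `data` with one bit inverted (simulated soft error)."""
    corrupted = bytearray(data)
    corrupted[bit // 8] ^= 1 << (bit % 8)
    return bytes(corrupted)


if __name__ == "__main__":
    outbox = []  # stands in for the network (assumption for this sketch)
    message = b"partial result: 3.14159"

    # Both replicas agree, so the message is delivered.
    validated_send(message, message, outbox.append)

    # One replica suffers a simulated bit flip, so nothing is sent.
    try:
        validated_send(message, flip_bit(message, 5), outbox.append)
    except TransientFaultDetected as err:
        print("fault detected:", err)
```

Comparing at send time means a corrupted value is caught, at the latest, when it is about to leave the process, which is the sense in which the description speaks of limiting detection latency and preventing propagation.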
Subject
Transient fault
parallel scientific application
soft error detection tool
message content validation
Access level
open access
Terms of use
https://creativecommons.org/licenses/by-nc-sa/2.5/ar/
Repository
CONICET Digital (CONICET)
Institution
Consejo Nacional de Investigaciones Científicas y Técnicas
OAI identifier
oai:ri.conicet.gov.ar:11336/32963

id CONICETDig_d31c378c6b53c8a328db57cface8dcd6
oai_identifier_str oai:ri.conicet.gov.ar:11336/32963
network_acronym_str CONICETDig
repository_id_str 3498
network_name_str CONICET Digital (CONICET)
dc.title.none.fl_str_mv A tool for detecting transient faults in execution of parallel scientific applications on multicore clusters
dc.creator.none.fl_str_mv Montezanti, Diego Miguel
Rucci, Enzo
Dolores Rexachs
Luque, Emilio
Naiouf, Ricardo Marcelo
de Giusti, Armando Eduardo
dc.subject.none.fl_str_mv Transient fault
parallel scientific application
soft error detection tool
message content validation
purl_subject.fl_str_mv https://purl.org/becyt/ford/1.2
https://purl.org/becyt/ford/1
publishDate 2014
dc.date.none.fl_str_mv 2014-04
dc.type.none.fl_str_mv info:eu-repo/semantics/article
info:eu-repo/semantics/publishedVersion
http://purl.org/coar/resource_type/c_6501
info:ar-repo/semantics/articulo
format article
status_str publishedVersion
dc.identifier.none.fl_str_mv http://hdl.handle.net/11336/32963
Montezanti, Diego Miguel; Dolores Rexachs; Rucci, Enzo; de Giusti, Armando Eduardo; Naiouf, Ricardo Marcelo; Luque, Emilio; et al.; A tool for detecting transient faults in execution of parallel scientific applications on multicore clusters; Universidad Nacional de La Plata. Facultad de Informática; Journal of computer science and technology; 14; 1; 4-2014; 32-38
1666-6038
CONICET Digital
CONICET
dc.language.none.fl_str_mv eng
dc.relation.none.fl_str_mv info:eu-repo/semantics/altIdentifier/url/http://journal.info.unlp.edu.ar/wp-content/uploads/JCST-Apr14-5.pdf
dc.rights.none.fl_str_mv info:eu-repo/semantics/openAccess
https://creativecommons.org/licenses/by-nc-sa/2.5/ar/
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv Universidad Nacional de La Plata. Facultad de Informática
dc.source.none.fl_str_mv reponame:CONICET Digital (CONICET)
instname:Consejo Nacional de Investigaciones Científicas y Técnicas
repository.name.fl_str_mv CONICET Digital (CONICET) - Consejo Nacional de Investigaciones Científicas y Técnicas
repository.mail.fl_str_mv dasensio@conicet.gov.ar; lcarlino@conicet.gov.ar