SMCV: a Methodology for Detecting Transient Faults in Multicore Clusters

Authors
Montezanti, Diego Miguel; Frati, Fernando Emmanuel; Rexachs, Dolores; Luquet, Emilio; Naiouf, Ricardo Marcelo; de Giusti, Armando Eduardo
Publication year
2012
Language
English
Resource type
article
Status
published version
Description
The performance of current processors is improved by increasing the integration scale. This brings a growing vulnerability to transient faults, whose impact increases on multicore clusters running large scientific parallel applications. The need to enhance the reliability of these systems, coupled with the high cost of rerunning an application from the beginning, motivates software strategies specific to the target systems. This paper introduces SMCV, a fully distributed technique that provides fault detection for message-passing parallel applications by validating the contents of messages before they are sent, which prevents errors from propagating to other processes, and by leveraging the intrinsic hardware redundancy of the multicore. SMCV achieves broad robustness against transient faults with reduced overhead, and strikes a trade-off between moderate detection latency and low additional workload.
Affiliation: Montezanti, Diego Miguel. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - La Plata; Argentina. Universidad Nacional de la Plata. Facultad de Informatica. Instituto de Investigación En Informatica Lidi; Argentina
Affiliation: Frati, Fernando Emmanuel. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - La Plata; Argentina. Universidad Nacional de la Plata. Facultad de Informatica. Instituto de Investigación En Informatica Lidi; Argentina
Affiliation: Rexachs, Dolores. Universitat Autònoma de Barcelona; Spain
Affiliation: Luquet, Emilio. Universitat Autònoma de Barcelona; Spain
Affiliation: Naiouf, Ricardo Marcelo. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - La Plata; Argentina. Universidad Nacional de la Plata. Facultad de Informatica. Instituto de Investigación En Informatica Lidi; Argentina
Affiliation: de Giusti, Armando Eduardo. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - La Plata; Argentina. Universidad Nacional de la Plata. Facultad de Informatica. Instituto de Investigación En Informatica Lidi; Argentina
Subject
Parallel scientific application
Multicore cluster
Transient fault
Soft error detection
Access level
open access
Terms of use
https://creativecommons.org/licenses/by-nc-sa/2.5/ar/
Repository
CONICET Digital (CONICET)
Institution
Consejo Nacional de Investigaciones Científicas y Técnicas
OAI identifier
oai:ri.conicet.gov.ar:11336/66780

id CONICETDig_2165c055f174ca7f47c6d51f7bcb1fbf
oai_identifier_str oai:ri.conicet.gov.ar:11336/66780
network_acronym_str CONICETDig
repository_id_str 3498
network_name_str CONICET Digital (CONICET)
dc.title.none.fl_str_mv SMCV: a Methodology for Detecting Transient Faults in Multicore Clusters
purl_subject.fl_str_mv https://purl.org/becyt/ford/1.2
https://purl.org/becyt/ford/1
publishDate 2012
dc.date.none.fl_str_mv 2012-12
dc.type.none.fl_str_mv info:eu-repo/semantics/article
info:eu-repo/semantics/publishedVersion
http://purl.org/coar/resource_type/c_6501
info:ar-repo/semantics/articulo
format article
status_str publishedVersion
dc.identifier.none.fl_str_mv http://hdl.handle.net/11336/66780
Montezanti, Diego Miguel; Frati, Fernando Emmanuel; Rexachs, Dolores; Luquet, Emilio; Naiouf, Ricardo Marcelo; et al.; SMCV: a Methodology for Detecting Transient Faults in Multicore Clusters; Centro Latinoamericano de Estudios en Informática; CLEI Electronic Journal; 15; 3; 12-2012; 1-11
0717-5000
CONICET Digital
CONICET
url http://hdl.handle.net/11336/66780
dc.language.none.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv info:eu-repo/semantics/altIdentifier/url/http://www.clei.org/cleiej-beta/index.php/cleiej/article/view/138
info:eu-repo/semantics/altIdentifier/doi/10.19153/cleiej.15.3.5
dc.rights.none.fl_str_mv info:eu-repo/semantics/openAccess
https://creativecommons.org/licenses/by-nc-sa/2.5/ar/
eu_rights_str_mv openAccess
rights_invalid_str_mv https://creativecommons.org/licenses/by-nc-sa/2.5/ar/
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv Centro Latinoamericano de Estudios en Informática
dc.source.none.fl_str_mv reponame:CONICET Digital (CONICET)
instname:Consejo Nacional de Investigaciones Científicas y Técnicas
repository.mail.fl_str_mv dasensio@conicet.gov.ar; lcarlino@conicet.gov.ar