SMCV: a Methodology for Detecting Transient Faults in Multicore Clusters

Authors
Montezanti, Diego Miguel; Frati, Fernando Emmanuel; Rexachs del Rosario, Dolores; Luquet, Emilio; Naiouf, Marcelo; De Giusti, Armando Eduardo
Publication year
2012
Language
English
Resource type
article
Status
published version
Description
The performance of current processors is improved largely by increasing the integration scale. This brings a growing vulnerability to transient faults, whose impact increases on multicore clusters running large scientific parallel applications. The need to enhance the reliability of these systems, coupled with the high cost of rerunning an application from the beginning, motivates software strategies designed specifically for the target systems. This paper introduces SMCV, a fully distributed technique that provides fault detection for message-passing parallel applications by validating the contents of messages before they are sent, which prevents errors from being transmitted to other processes, and by leveraging the intrinsic hardware redundancy of the multicore. SMCV achieves broad robustness against transient faults with reduced overhead, and strikes a trade-off between moderate detection latency and low additional workload.
Instituto de Investigación en Informática
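As a rough illustration of the idea in the description (validate a message's contents before sending it, using redundant execution), the following Python sketch runs the same computation in two replicas and forwards the result only when both agree. It is a minimal sketch under stated assumptions, not the authors' implementation: the names `validated_send`, `compute`, `send`, and `TransientFaultDetected` are hypothetical, and Python threads only stand in for replicas that would run on separate cores.

```python
from concurrent.futures import ThreadPoolExecutor


class TransientFaultDetected(Exception):
    """Raised when the two replicas disagree on a message's contents."""


def validated_send(compute, send, *args):
    """Run `compute` in two replicas and send the result only if both agree.

    `compute` and `send` are hypothetical callables supplied by the caller;
    they are not part of any real message-passing API.
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        first = pool.submit(compute, *args)
        second = pool.submit(compute, *args)
        a, b = first.result(), second.result()
    # Compare the two candidate messages before anything leaves the process.
    if a != b:
        raise TransientFaultDetected("replica outputs differ; message withheld")
    send(a)
    return a
```

A mismatch is detected only at the next send, which is one way to read the trade-off the description mentions: latency is bounded by the interval between messages, while the extra work is limited to one redundant computation and one comparison per message.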
Subject
Computer Engineering
Computer Science
Parallel scientific application
Multicore cluster
Transient fault
Soft error detection
Access level
open access
Terms of use
http://creativecommons.org/licenses/by-nc-sa/2.5/ar/
Repository
SEDICI (UNLP)
Institution
Universidad Nacional de La Plata
OAI identifier
oai:sedici.unlp.edu.ar:10915/96550

Publication date
2012-12
Format
application/pdf
1-11
URL
http://sedici.unlp.edu.ar/handle/10915/96550
Alternative identifiers
https://ri.conicet.gov.ar/11336/66780
http://www2.clei.org/cleiej/paper.php?id=250
ISSN: 0717-5000
DOI: 10.19153/cleiej.15.3.5
Handle: 11336/66780
License
Creative Commons Attribution-NonCommercial-ShareAlike 2.5 Argentina (CC BY-NC-SA 2.5)
Repository contact
alira@sedici.unlp.edu.ar