SEDAR: Soft Error Detection and Automatic Recovery in High Performance Computing Systems

Autores
Montezanti, Diego Miguel
Año de publicación
2020
Idioma
inglés
Tipo de recurso
reseña artículo
Estado
versión publicada
Descripción
In the context of high error rates, unreliable results and high verification costs, the aim of this thesis is to help scientists and programmers of parallel applications to provide reliability to their results, within a predictable time. To accomplish this goal, we have designed and developed the SEDAR (Soft Error Detection and Automatic Recovery) methodology, which provides tolerance to transient faults in systems consisting in message passing applications that run in multicore clusters. SEDAR is based on process replication and monitoring of messages to be sent and of local computation, taking advantage of the intrinsic hardware redundancy of the multicores. SEDAR provides three variants: detection and automatic relaunch from the beginning; automatic recovery, based on the storage of multiple system-level checkpoints (periodic or synchronized with events); and automatic recovery, based on a single safe application-level checkpoint. The main goal is the design of the methodology and the functional validation of its effectiveness to detect transient faults and automatically recover executions, using an analytical verification model; a SEDAR prototype is also implemented. From the tests carried out with this prototype, the temporal behavior is characterized, i.e. the overhead introduced by each variant. The flexibility to dynamically choose the most convenient alternative to adapt to system requirements (such as maximum allowed overhead or completion time) is also evidenced, showing that SEDAR is a viable and effective methodology to tolerate transient faults in HPC. Unlike specific strategies, which provide partial resilience for certain applications, at the cost of modifying them, SEDAR is essentially transparent and agnostic regarding the protected algorithm.
Resumen de la tesis presentada por el autor el 18 de marzo de 2020 para la obtención del título de Doctor en Ciencias Informáticas de la Universidad Nacional de La Plata.
Facultad de Informática
Materia
Ciencias Informáticas
Soft Error Detection and Automatic Recovery
Nivel de accesibilidad
acceso abierto
Condiciones de uso
http://creativecommons.org/licenses/by-nc/4.0/
Repositorio
SEDICI (UNLP)
Institución
Universidad Nacional de La Plata
OAI Identificador
oai:sedici.unlp.edu.ar:10915/108015

id SEDICI_6f7b4c6e81324fa116091e9a78f6e435
oai_identifier_str oai:sedici.unlp.edu.ar:10915/108015
network_acronym_str SEDICI
repository_id_str 1329
network_name_str SEDICI (UNLP)
spelling SEDAR: Soft Error Detection and Automatic Recovery in High Performance Computing SystemsMontezanti, Diego MiguelCiencias InformáticasSoft Error Detection and Automatic RecoveryIn the context of high error rates, unreliable results and high verification costs, the aim of this thesis is to help scientists and programmers of parallel applications to provide reliability to their results, within a predictable time. To accomplish this goal, we have designed and developed the SEDAR (Soft Error Detection and Automatic Recovery) methodology, which provides tolerance to transient faults in systems consisting in message passing applications that run in multicore clusters. SEDAR is based on process replication and monitoring of messages to be sent and of local computation, taking advantage of the intrinsic hardware redundancy of the multicores. SEDAR provides three variants: detection and automatic relaunch from the beginning; automatic recovery, based on the storage of multiple system-level checkpoints (periodic or synchronized with events); and automatic recovery, based on a single safe application-level checkpoint. The main goal is the design of the methodology and the functional validation of its effectiveness to detect transient faults and automatically recover executions, using an analytical verification model; a SEDAR prototype is also implemented. From the tests carried out with this prototype, the temporal behavior is characterized, i.e. the overhead introduced by each variant. The flexibility to dynamically choose the most convenient alternative to adapt to system requirements (such as maximum allowed overhead or completion time) is also evidenced, showing that SEDAR is a viable and effective methodology to tolerate transient faults in HPC. Unlike specific strategies, which provide partial resilience for certain applications, at the cost of modifying them, SEDAR is essentially transparent and agnostic regarding the protected algorithm.Resumen de la tesis presentada por el autor el 18 de marzo de 2020 para la obtención del título de Doctor en Ciencias Informáticas de la Universidad Nacional de La Plata.Facultad de Informática2020-10info:eu-repo/semantics/reviewinfo:eu-repo/semantics/publishedVersionRevisionhttp://purl.org/coar/resource_type/c_dcae04bcinfo:ar-repo/semantics/resenaArticuloapplication/pdf119-121http://sedici.unlp.edu.ar/handle/10915/108015enginfo:eu-repo/semantics/altIdentifier/issn/1666-6038info:eu-repo/semantics/altIdentifier/doi/10.24215/16666038.20.e14info:eu-repo/semantics/reference/hdl/10915/98816info:eu-repo/semantics/openAccesshttp://creativecommons.org/licenses/by-nc/4.0/Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)reponame:SEDICI (UNLP)instname:Universidad Nacional de La Platainstacron:UNLP2025-09-29T11:24:27Zoai:sedici.unlp.edu.ar:10915/108015Institucionalhttp://sedici.unlp.edu.ar/Universidad públicaNo correspondehttp://sedici.unlp.edu.ar/oai/snrdalira@sedici.unlp.edu.arArgentinaNo correspondeNo correspondeNo correspondeopendoar:13292025-09-29 11:24:27.703SEDICI (UNLP) - Universidad Nacional de La Platafalse
dc.title.none.fl_str_mv SEDAR: Soft Error Detection and Automatic Recovery in High Performance Computing Systems
title SEDAR: Soft Error Detection and Automatic Recovery in High Performance Computing Systems
spellingShingle SEDAR: Soft Error Detection and Automatic Recovery in High Performance Computing Systems
Montezanti, Diego Miguel
Ciencias Informáticas
Soft Error Detection and Automatic Recovery
title_short SEDAR: Soft Error Detection and Automatic Recovery in High Performance Computing Systems
title_full SEDAR: Soft Error Detection and Automatic Recovery in High Performance Computing Systems
title_fullStr SEDAR: Soft Error Detection and Automatic Recovery in High Performance Computing Systems
title_full_unstemmed SEDAR: Soft Error Detection and Automatic Recovery in High Performance Computing Systems
title_sort SEDAR: Soft Error Detection and Automatic Recovery in High Performance Computing Systems
dc.creator.none.fl_str_mv Montezanti, Diego Miguel
author Montezanti, Diego Miguel
author_facet Montezanti, Diego Miguel
author_role author
dc.subject.none.fl_str_mv Ciencias Informáticas
Soft Error Detection and Automatic Recovery
topic Ciencias Informáticas
Soft Error Detection and Automatic Recovery
dc.description.none.fl_txt_mv In the context of high error rates, unreliable results and high verification costs, the aim of this thesis is to help scientists and programmers of parallel applications to provide reliability to their results, within a predictable time. To accomplish this goal, we have designed and developed the SEDAR (Soft Error Detection and Automatic Recovery) methodology, which provides tolerance to transient faults in systems consisting in message passing applications that run in multicore clusters. SEDAR is based on process replication and monitoring of messages to be sent and of local computation, taking advantage of the intrinsic hardware redundancy of the multicores. SEDAR provides three variants: detection and automatic relaunch from the beginning; automatic recovery, based on the storage of multiple system-level checkpoints (periodic or synchronized with events); and automatic recovery, based on a single safe application-level checkpoint. The main goal is the design of the methodology and the functional validation of its effectiveness to detect transient faults and automatically recover executions, using an analytical verification model; a SEDAR prototype is also implemented. From the tests carried out with this prototype, the temporal behavior is characterized, i.e. the overhead introduced by each variant. The flexibility to dynamically choose the most convenient alternative to adapt to system requirements (such as maximum allowed overhead or completion time) is also evidenced, showing that SEDAR is a viable and effective methodology to tolerate transient faults in HPC. Unlike specific strategies, which provide partial resilience for certain applications, at the cost of modifying them, SEDAR is essentially transparent and agnostic regarding the protected algorithm.
Resumen de la tesis presentada por el autor el 18 de marzo de 2020 para la obtención del título de Doctor en Ciencias Informáticas de la Universidad Nacional de La Plata.
Facultad de Informática
description In the context of high error rates, unreliable results and high verification costs, the aim of this thesis is to help scientists and programmers of parallel applications to provide reliability to their results, within a predictable time. To accomplish this goal, we have designed and developed the SEDAR (Soft Error Detection and Automatic Recovery) methodology, which provides tolerance to transient faults in systems consisting in message passing applications that run in multicore clusters. SEDAR is based on process replication and monitoring of messages to be sent and of local computation, taking advantage of the intrinsic hardware redundancy of the multicores. SEDAR provides three variants: detection and automatic relaunch from the beginning; automatic recovery, based on the storage of multiple system-level checkpoints (periodic or synchronized with events); and automatic recovery, based on a single safe application-level checkpoint. The main goal is the design of the methodology and the functional validation of its effectiveness to detect transient faults and automatically recover executions, using an analytical verification model; a SEDAR prototype is also implemented. From the tests carried out with this prototype, the temporal behavior is characterized, i.e. the overhead introduced by each variant. The flexibility to dynamically choose the most convenient alternative to adapt to system requirements (such as maximum allowed overhead or completion time) is also evidenced, showing that SEDAR is a viable and effective methodology to tolerate transient faults in HPC. Unlike specific strategies, which provide partial resilience for certain applications, at the cost of modifying them, SEDAR is essentially transparent and agnostic regarding the protected algorithm.
publishDate 2020
dc.date.none.fl_str_mv 2020-10
dc.type.none.fl_str_mv info:eu-repo/semantics/review
info:eu-repo/semantics/publishedVersion
Revision
http://purl.org/coar/resource_type/c_dcae04bc
info:ar-repo/semantics/resenaArticulo
format review
status_str publishedVersion
dc.identifier.none.fl_str_mv http://sedici.unlp.edu.ar/handle/10915/108015
url http://sedici.unlp.edu.ar/handle/10915/108015
dc.language.none.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv info:eu-repo/semantics/altIdentifier/issn/1666-6038
info:eu-repo/semantics/altIdentifier/doi/10.24215/16666038.20.e14
info:eu-repo/semantics/reference/hdl/10915/98816
dc.rights.none.fl_str_mv info:eu-repo/semantics/openAccess
http://creativecommons.org/licenses/by-nc/4.0/
Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
eu_rights_str_mv openAccess
rights_invalid_str_mv http://creativecommons.org/licenses/by-nc/4.0/
Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
dc.format.none.fl_str_mv application/pdf
119-121
dc.source.none.fl_str_mv reponame:SEDICI (UNLP)
instname:Universidad Nacional de La Plata
instacron:UNLP
reponame_str SEDICI (UNLP)
collection SEDICI (UNLP)
instname_str Universidad Nacional de La Plata
instacron_str UNLP
institution UNLP
repository.name.fl_str_mv SEDICI (UNLP) - Universidad Nacional de La Plata
repository.mail.fl_str_mv alira@sedici.unlp.edu.ar
_version_ 1844616121754320896
score 13.070432