SEDAR: Soft Error Detection and Automatic Recovery in High Performance Computing Systems
- Authors
- Montezanti, Diego Miguel
- Publication year
- 2020
- Language
- English
- Resource type
- article review
- Status
- published version
- Description
In the context of high error rates, unreliable results and high verification costs, the aim of this thesis is to help scientists and programmers of parallel applications obtain reliable results within a predictable time. To accomplish this goal, we have designed and developed the SEDAR (Soft Error Detection and Automatic Recovery) methodology, which provides tolerance to transient faults in systems consisting of message-passing applications that run on multicore clusters. SEDAR is based on process replication and on monitoring both the messages to be sent and the local computation, taking advantage of the intrinsic hardware redundancy of multicores. SEDAR provides three variants: detection and automatic relaunch from the beginning; automatic recovery based on the storage of multiple system-level checkpoints (periodic or synchronized with events); and automatic recovery based on a single safe application-level checkpoint. The main goal is the design of the methodology and the functional validation of its effectiveness in detecting transient faults and automatically recovering executions, using an analytical verification model; a SEDAR prototype is also implemented. Tests carried out with this prototype characterize the temporal behavior, i.e. the overhead introduced by each variant. They also demonstrate the flexibility to dynamically choose the most convenient alternative to meet system requirements (such as maximum allowed overhead or completion time), showing that SEDAR is a viable and effective methodology for tolerating transient faults in HPC. Unlike specific strategies, which provide partial resilience for certain applications at the cost of modifying them, SEDAR is essentially transparent and agnostic regarding the protected algorithm.
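The replicate, compare and recover idea described in the abstract can be illustrated with a minimal single-process sketch. It is not the SEDAR prototype: the function `run_protected`, its parameters, the integer messages and the injected bit flip are all hypothetical, and taking a checkpoint after every validated message is a simplification loosely modeled on the single safe checkpoint variant.

```python
import copy

def run_protected(steps, state, fault_at=None):
    """Run every step on two replicas and validate each message before sending.

    steps    -- list of functions: state -> (new_state, outgoing_message)
    state    -- initial application state
    fault_at -- hypothetical: index of the step where a one-off bit flip
                corrupts replica B's integer message (simulated transient fault)
    Returns (sent_messages, number_of_recoveries).
    """
    sent, recoveries = [], 0
    checkpoint = (0, copy.deepcopy(state))      # last state known to be valid
    state_a, state_b = copy.deepcopy(state), copy.deepcopy(state)
    pending_fault = fault_at
    i = 0
    while i < len(steps):
        state_a, msg_a = steps[i](state_a)      # replica A
        state_b, msg_b = steps[i](state_b)      # replica B (redundant core)
        if pending_fault == i:
            msg_b ^= 1                          # flip the lowest bit once
            pending_fault = None                # transient: gone on re-execution
        if msg_a != msg_b:
            # Detection: the replicas disagree, so nothing is sent and both
            # replicas roll back to the last validated checkpoint.
            i, saved = checkpoint
            state_a, state_b = copy.deepcopy(saved), copy.deepcopy(saved)
            recoveries += 1
            continue
        sent.append(msg_a)                      # validated message leaves the node
        i += 1
        checkpoint = (i, copy.deepcopy(state_a))
    return sent, recoveries

# Three accumulation steps; each one "sends" the running total.
steps = [lambda s, k=k: (s + k, s + k) for k in (1, 2, 3)]
print(run_protected(steps, 0))               # ([1, 3, 6], 0)
print(run_protected(steps, 0, fault_at=1))   # ([1, 3, 6], 1)
```

In the second call the corrupted message is caught before it is sent, the step is re-executed from the checkpoint, and the delivered messages are identical to those of the fault-free run.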
Summary of the thesis presented by the author on March 18, 2020 to obtain the degree of Doctor in Computer Science (Doctor en Ciencias Informáticas) from the Universidad Nacional de La Plata.
Facultad de Informática
- Subject
- Computer Science
- Soft Error Detection and Automatic Recovery
- Accessibility level
- open access
- Terms of use
- http://creativecommons.org/licenses/by-nc/4.0/
- Repository
- Institution
- Universidad Nacional de La Plata
- OAI identifier
- oai:sedici.unlp.edu.ar:10915/108015
Full record metadata
id | SEDICI_6f7b4c6e81324fa116091e9a78f6e435 |
---|---|
oai_identifier_str | oai:sedici.unlp.edu.ar:10915/108015 |
network_acronym_str | SEDICI |
repository_id_str | 1329 |
network_name_str | SEDICI (UNLP) |
dc.title.none.fl_str_mv | SEDAR: Soft Error Detection and Automatic Recovery in High Performance Computing Systems |
title | SEDAR: Soft Error Detection and Automatic Recovery in High Performance Computing Systems |
dc.creator.none.fl_str_mv | Montezanti, Diego Miguel |
author | Montezanti, Diego Miguel |
author_role | author |
dc.subject.none.fl_str_mv | Computer Science; Soft Error Detection and Automatic Recovery |
topic | Computer Science; Soft Error Detection and Automatic Recovery |
publishDate | 2020 |
dc.date.none.fl_str_mv | 2020-10 |
dc.type.none.fl_str_mv | info:eu-repo/semantics/review; info:eu-repo/semantics/publishedVersion; Revision; http://purl.org/coar/resource_type/c_dcae04bc; info:ar-repo/semantics/resenaArticulo |
format | review |
status_str | publishedVersion |
dc.identifier.none.fl_str_mv | http://sedici.unlp.edu.ar/handle/10915/108015 |
url | http://sedici.unlp.edu.ar/handle/10915/108015 |
dc.language.none.fl_str_mv | eng |
language | eng |
dc.relation.none.fl_str_mv | info:eu-repo/semantics/altIdentifier/issn/1666-6038; info:eu-repo/semantics/altIdentifier/doi/10.24215/16666038.20.e14; info:eu-repo/semantics/reference/hdl/10915/98816 |
dc.rights.none.fl_str_mv | info:eu-repo/semantics/openAccess; http://creativecommons.org/licenses/by-nc/4.0/; Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) |
eu_rights_str_mv | openAccess |
rights_invalid_str_mv | http://creativecommons.org/licenses/by-nc/4.0/; Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) |
dc.format.none.fl_str_mv | application/pdf; 119-121 |
dc.source.none.fl_str_mv | reponame:SEDICI (UNLP); instname:Universidad Nacional de La Plata; instacron:UNLP |
reponame_str | SEDICI (UNLP) |
collection | SEDICI (UNLP) |
instname_str | Universidad Nacional de La Plata |
instacron_str | UNLP |
institution | UNLP |
repository.name.fl_str_mv | SEDICI (UNLP) - Universidad Nacional de La Plata |
repository.mail.fl_str_mv | alira@sedici.unlp.edu.ar |
_version_ | 1844616121754320896 |
score | 13.070432 |