Fault Tolerance in Multicore Clusters. Techniques to Balance Performance and Dependability

Authors
Meyer, Hugo
Publication year
2016
Language
English
Resource type
article review
Status
published version
Description
In High Performance Computing (HPC), the demand for more performance is met by increasing the number of components. The growing scale of HPC applications has brought an increase in the number of interruptions caused by hardware failures. The marked decrease in Mean Time Between Failures (MTBF) in current systems encourages research into suitable Fault Tolerance (FT) solutions that make it possible to guarantee the successful completion of parallel applications. When executing applications on HPC systems, we aim to improve performance despite the failures that may affect those systems. Our research focuses on analyzing and reducing the impact of scalable FT techniques based on rollback-recovery (e.g., uncoordinated checkpointing). Because message logging is normally the main source of overhead in uncoordinated checkpointing approaches, we concentrate in particular on analyzing and reducing the impact of current pessimistic receiver-based message logging techniques. Taking into account the advent of multicore machines, our main contributions aim to make efficient use of the parallel environment, considering the interaction between application processes and fault tolerance tasks. The main contributions of this research are described below.
Facultad de Informática
Subject
Computer Science
Fault tolerance
Parallel
Access level
open access
Terms of use
http://creativecommons.org/licenses/by/3.0/
Repository
SEDICI (UNLP)
Institution
Universidad Nacional de La Plata
OAI identifier
oai:sedici.unlp.edu.ar:10915/52386

id SEDICI_6decd3694b0187b444dc26369d75d907
oai_identifier_str oai:sedici.unlp.edu.ar:10915/52386
network_acronym_str SEDICI
repository_id_str 1329
network_name_str SEDICI (UNLP)
publishDate 2016
dc.date.none.fl_str_mv 2016-04
dc.type.none.fl_str_mv info:eu-repo/semantics/review
info:eu-repo/semantics/publishedVersion
Revision
http://purl.org/coar/resource_type/c_dcae04bc
info:ar-repo/semantics/resenaArticulo
format review
status_str publishedVersion
dc.identifier.none.fl_str_mv http://sedici.unlp.edu.ar/handle/10915/52386
url http://sedici.unlp.edu.ar/handle/10915/52386
dc.language.none.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv info:eu-repo/semantics/altIdentifier/url/http://journal.info.unlp.edu.ar/wp-content/uploads/2015/10/JCST-42-Thesis-Overview-1.pdf
info:eu-repo/semantics/altIdentifier/issn/1666-6038
dc.rights.none.fl_str_mv info:eu-repo/semantics/openAccess
http://creativecommons.org/licenses/by/3.0/
Creative Commons Attribution 3.0 Unported (CC BY 3.0)
dc.format.none.fl_str_mv application/pdf
59-60
dc.source.none.fl_str_mv reponame:SEDICI (UNLP)
instname:Universidad Nacional de La Plata
instacron:UNLP
repository.name.fl_str_mv SEDICI (UNLP) - Universidad Nacional de La Plata
repository.mail.fl_str_mv alira@sedici.unlp.edu.ar