Fault Tolerance in Multicore Clusters. Techniques to Balance Performance and Dependability
- Authors
- Meyer, Hugo
- Year of publication
- 2016
- Language
- English
- Resource type
- article review
- Status
- published version
- Description
- In High Performance Computing (HPC), the demand for more performance is satisfied by increasing the number of components. With the growing scale of HPC applications has come an increase in the number of interruptions caused by hardware failures. The remarkable decrease in Mean Time Between Failures (MTBF) in current systems motivates research into suitable Fault Tolerance (FT) solutions that make it possible to guarantee the successful completion of parallel applications. When executing applications on HPC systems, we aim to improve performance despite the failures that may affect those systems. Our research focuses on analyzing and reducing the impact of scalable FT techniques based on rollback-recovery (e.g. uncoordinated checkpointing). As message logging is normally the main source of overhead in uncoordinated checkpointing approaches, we concentrate in particular on current pessimistic receiver-based message logging techniques. Taking into account the advent of multicore machines, our main contributions aim to make efficient use of the parallel environment, considering the interaction between application processes and fault tolerance tasks. The main contributions of this research are described below.
Facultad de Informática
- Subject
- Computer Science; Fault tolerance; Parallel
- Access level
- open access
- Terms of use
- http://creativecommons.org/licenses/by/3.0/
- Repository
- SEDICI (UNLP)
- Institution
- Universidad Nacional de La Plata
- OAI identifier
- oai:sedici.unlp.edu.ar:10915/52386
Full record metadata
| Field | Value |
|---|---|
| id | SEDICI_6decd3694b0187b444dc26369d75d907 |
| oai_identifier_str | oai:sedici.unlp.edu.ar:10915/52386 |
| network_acronym_str | SEDICI |
| repository_id_str | 1329 |
| network_name_str | SEDICI (UNLP) |
| dc.title.none.fl_str_mv | Fault Tolerance in Multicore Clusters. Techniques to Balance Performance and Dependability |
| title | Fault Tolerance in Multicore Clusters. Techniques to Balance Performance and Dependability |
| title_short | Fault Tolerance in Multicore Clusters. Techniques to Balance Performance and Dependability |
| title_full | Fault Tolerance in Multicore Clusters. Techniques to Balance Performance and Dependability |
| title_fullStr | Fault Tolerance in Multicore Clusters. Techniques to Balance Performance and Dependability |
| title_full_unstemmed | Fault Tolerance in Multicore Clusters. Techniques to Balance Performance and Dependability |
| title_sort | Fault Tolerance in Multicore Clusters. Techniques to Balance Performance and Dependability |
| dc.creator.none.fl_str_mv | Meyer, Hugo |
| author | Meyer, Hugo |
| author_facet | Meyer, Hugo |
| author_role | author |
| dc.subject.none.fl_str_mv | Ciencias Informáticas Fault tolerance Parallel |
| topic | Ciencias Informáticas Fault tolerance Parallel |
| dc.description.none.fl_txt_mv | In High Performance Computing (HPC), the demand for more performance is satisfied by increasing the number of components. With the growing scale of HPC applications has come an increase in the number of interruptions caused by hardware failures. The remarkable decrease in Mean Time Between Failures (MTBF) in current systems motivates research into suitable Fault Tolerance (FT) solutions that make it possible to guarantee the successful completion of parallel applications. When executing applications on HPC systems, we aim to improve performance despite the failures that may affect those systems. Our research focuses on analyzing and reducing the impact of scalable FT techniques based on rollback-recovery (e.g. uncoordinated checkpointing). As message logging is normally the main source of overhead in uncoordinated checkpointing approaches, we concentrate in particular on current pessimistic receiver-based message logging techniques. Taking into account the advent of multicore machines, our main contributions aim to make efficient use of the parallel environment, considering the interaction between application processes and fault tolerance tasks. The main contributions of this research are described below. Facultad de Informática |
| description | In High Performance Computing (HPC), the demand for more performance is satisfied by increasing the number of components. With the growing scale of HPC applications has come an increase in the number of interruptions caused by hardware failures. The remarkable decrease in Mean Time Between Failures (MTBF) in current systems motivates research into suitable Fault Tolerance (FT) solutions that make it possible to guarantee the successful completion of parallel applications. When executing applications on HPC systems, we aim to improve performance despite the failures that may affect those systems. Our research focuses on analyzing and reducing the impact of scalable FT techniques based on rollback-recovery (e.g. uncoordinated checkpointing). As message logging is normally the main source of overhead in uncoordinated checkpointing approaches, we concentrate in particular on current pessimistic receiver-based message logging techniques. Taking into account the advent of multicore machines, our main contributions aim to make efficient use of the parallel environment, considering the interaction between application processes and fault tolerance tasks. The main contributions of this research are described below. |
| publishDate | 2016 |
| dc.date.none.fl_str_mv | 2016-04 |
| dc.type.none.fl_str_mv | info:eu-repo/semantics/review info:eu-repo/semantics/publishedVersion Revision http://purl.org/coar/resource_type/c_dcae04bc info:ar-repo/semantics/resenaArticulo |
| format | review |
| status_str | publishedVersion |
| dc.identifier.none.fl_str_mv | http://sedici.unlp.edu.ar/handle/10915/52386 |
| url | http://sedici.unlp.edu.ar/handle/10915/52386 |
| dc.language.none.fl_str_mv | eng |
| language | eng |
| dc.relation.none.fl_str_mv | info:eu-repo/semantics/altIdentifier/url/http://journal.info.unlp.edu.ar/wp-content/uploads/2015/10/JCST-42-Thesis-Overview-1.pdf info:eu-repo/semantics/altIdentifier/issn/1666-6038 |
| dc.rights.none.fl_str_mv | info:eu-repo/semantics/openAccess http://creativecommons.org/licenses/by/3.0/ Creative Commons Attribution 3.0 Unported (CC BY 3.0) |
| eu_rights_str_mv | openAccess |
| rights_invalid_str_mv | http://creativecommons.org/licenses/by/3.0/ Creative Commons Attribution 3.0 Unported (CC BY 3.0) |
| dc.format.none.fl_str_mv | application/pdf 59-60 |
| dc.source.none.fl_str_mv | reponame:SEDICI (UNLP) instname:Universidad Nacional de La Plata instacron:UNLP |
| reponame_str | SEDICI (UNLP) |
| collection | SEDICI (UNLP) |
| instname_str | Universidad Nacional de La Plata |
| instacron_str | UNLP |
| institution | UNLP |
| repository.name.fl_str_mv | SEDICI (UNLP) - Universidad Nacional de La Plata |
| repository.mail.fl_str_mv | alira@sedici.unlp.edu.ar |
| _version_ | 1846064017002463232 |
| score | 13.22299 |