Comparative Study of the Performance of the Classification Algorithms of the Apache Spark ML Library

Authors
Camele, Genaro; Hasperué, Waldo; Ronchetti, Franco; Quiroga, Facundo Manuel
Publication year
2021
Language
English
Resource type
conference paper
Status
published version
Description
Classification algorithms are widely used in many areas, including finance, education, security, and medicine. They are also used to support feature extraction techniques, which rely on a classification algorithm to determine the best subset of attributes that supports an acceptable prediction. Large amounts of data are now being collected; as a result, databases keep growing and distributed processing becomes a necessity. In this context, Spark, and in particular its Spark ML library, is one of the most widely used frameworks for performing classification tasks on large databases. Because some feature extraction techniques must execute a classification algorithm many times, with a different subset of attributes in each run, the performance of these algorithms should be known beforehand so that the overall feature extraction process completes in the shortest possible time. In this work, we carry out a comparative study of four Spark ML classification algorithms, measuring predictive power and execution time as a function of the number of attributes in the training dataset.
Workshop: WBDMD - Base de Datos y Minería de Datos
Red de Universidades con Carreras en Informática
Subject
Computer Science
Big Data
Machine learning
Classification Models
Apache Spark
Spark ML
Access level
open access
Terms of use
http://creativecommons.org/licenses/by-nc-sa/4.0/
Repository
SEDICI (UNLP)
Institution
Universidad Nacional de La Plata
OAI identifier
oai:sedici.unlp.edu.ar:10915/130348

Publication date
2021-10
URL
http://sedici.unlp.edu.ar/handle/10915/130348
Related identifiers
ISBN: 978-987-633-574-4
Referenced record (handle): 10915/129809
License
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
Format
application/pdf
Pages
311-320
Repository contact
alira@sedici.unlp.edu.ar