Comparative Study of the Performance of the Classification Algorithms of the Apache Spark ML Library
- Authors
- Camele, Genaro; Hasperué, Waldo; Ronchetti, Franco; Quiroga, Facundo Manuel
- Year of publication
- 2021
- Language
- English
- Resource type
- conference paper
- Status
- published version
- Description
- Classification algorithms are widely used in several areas: finance, education, security, medicine, and more. Another use of these algorithms is to support feature extraction techniques. These techniques use classification algorithms to determine the best subset of attributes that supports an acceptable prediction. Currently, a large amount of data is being collected; as a result, databases are becoming increasingly large and distributed processing is becoming a necessity. In this context, Spark, and in particular its Spark ML library, is one of the most widely used frameworks for performing classification tasks on large databases. Given that some feature extraction techniques need to execute a classification algorithm a significant number of times, with a different subset of attributes in each run, the performance of these algorithms should be known beforehand so that the overall feature extraction process is carried out in the shortest possible time. In this work, we carry out a comparative study of four Spark ML classification algorithms, measuring predictive power and execution times as a function of the number of attributes in the training dataset.
Workshop: WBDMD - Base de Datos y Minería de Datos
Red de Universidades con Carreras en Informática
- Subject
- Computer Science; Big Data; Machine learning; Classification Models; Apache Spark; Spark ML
- Access level
- open access
- Terms of use
- http://creativecommons.org/licenses/by-nc-sa/4.0/
- Repository
- Institution
- Universidad Nacional de La Plata
- OAI identifier
- oai:sedici.unlp.edu.ar:10915/130348
View the full metadata record
id | SEDICI_f234a5395ff1df9a25099dbd9098c1b1 |
---|---|
oai_identifier_str | oai:sedici.unlp.edu.ar:10915/130348 |
network_acronym_str | SEDICI |
repository_id_str | 1329 |
network_name_str | SEDICI (UNLP) |
dc.date.none.fl_str_mv | 2021-10 |
dc.type.none.fl_str_mv | info:eu-repo/semantics/conferenceObject; info:eu-repo/semantics/publishedVersion; Objeto de conferencia; http://purl.org/coar/resource_type/c_5794; info:ar-repo/semantics/documentoDeConferencia |
dc.identifier.none.fl_str_mv | http://sedici.unlp.edu.ar/handle/10915/130348 |
dc.relation.none.fl_str_mv | info:eu-repo/semantics/altIdentifier/isbn/978-987-633-574-4; info:eu-repo/semantics/reference/hdl/10915/129809 |
dc.rights.none.fl_str_mv | info:eu-repo/semantics/openAccess; http://creativecommons.org/licenses/by-nc-sa/4.0/; Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) |
dc.format.none.fl_str_mv | application/pdf; 311-320 |
repository.mail.fl_str_mv | alira@sedici.unlp.edu.ar |