Statistical analysis of the performance of four Apache Spark ML Algorithms

Authors: Camele, Genaro; Hasperué, Waldo; Ronchetti, Franco; Quiroga, Facundo Manuel
Year of publication: 2022
Language: English
Resource type: article
Status: published version
Description: Feature selection (FS) techniques generally require repeatedly training and evaluating models to assess the importance of each feature for a particular task. However, due to the increasing size of currently available databases, distributed processing has become a necessity for many tasks. In this context, the Apache Spark ML library is one of the most widely used libraries for performing classification and other tasks with large datasets. Therefore, knowing both the predictive performance and the efficiency of its main algorithms before applying an FS technique is crucial to planning computations and saving time. In this work, a comparative study of four Spark ML classification algorithms is carried out, statistically measuring execution times and predictive power as a function of the number of attributes of a colon cancer database. Results were statistically analyzed, showing that, although Random Forest and Naive Bayes are the algorithms with the shortest execution times, Support Vector Machine obtains models with the best predictive power. The study of the performance of these algorithms is of interest because they are applied to many different problems, such as classification of pathologies from epigenomic data, image classification, and prediction of computer attacks in network security problems.
Facultad de Informática
Subject: Computer Science
Big Data
Machine Learning
Classification Models
Apache Spark
Spark ML
Wilcoxon Test
Student’s T Test
Access level: open access
Terms of use: http://creativecommons.org/licenses/by-nc/4.0/
Repository
Institution: Universidad Nacional de La Plata
OAI identifier: oai:sedici.unlp.edu.ar:10915/146934


id	SEDICI_16fda5ba7be7d04fe5aec5c054a9a349
oai_identifier_str	oai:sedici.unlp.edu.ar:10915/146934
network_acronym_str	SEDICI
repository_id_str	1329
network_name_str	SEDICI (UNLP)
dc.title.none.fl_str_mv	Statistical analysis of the performance of four Apache Spark ML Algorithms Análisis estadístico del rendimiento de cuatro algoritmos de Apache Spark ML
title	Statistical analysis of the performance of four Apache Spark ML Algorithms
dc.creator.none.fl_str_mv	Camele, Genaro Hasperué, Waldo Ronchetti, Franco Quiroga, Facundo Manuel
author	Camele, Genaro
author_facet	Camele, Genaro Hasperué, Waldo Ronchetti, Franco Quiroga, Facundo Manuel
author_role	author
author2	Hasperué, Waldo Ronchetti, Franco Quiroga, Facundo Manuel
author2_role	author author author
dc.subject.none.fl_str_mv	Ciencias Informáticas Big Data Machine Learning Classification Models Apache Spark Spark ML Wilcoxon Test Student’s T Test Big Data Aprendizaje automático Modelos de clasificación Test de Wilcoxon Test T-Student
publishDate	2022
dc.date.none.fl_str_mv	2022-10-17
dc.type.none.fl_str_mv	info:eu-repo/semantics/article info:eu-repo/semantics/publishedVersion Articulo http://purl.org/coar/resource_type/c_6501 info:ar-repo/semantics/articulo
format	article
status_str	publishedVersion
dc.identifier.none.fl_str_mv	http://sedici.unlp.edu.ar/handle/10915/146934
url	http://sedici.unlp.edu.ar/handle/10915/146934
dc.language.none.fl_str_mv	eng
language	eng
dc.relation.none.fl_str_mv	info:eu-repo/semantics/altIdentifier/issn/1666-6038 info:eu-repo/semantics/altIdentifier/doi/10.24215/16666038.22.e14
dc.rights.none.fl_str_mv	info:eu-repo/semantics/openAccess http://creativecommons.org/licenses/by-nc/4.0/ Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
eu_rights_str_mv	openAccess
rights_invalid_str_mv	http://creativecommons.org/licenses/by-nc/4.0/ Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
dc.format.none.fl_str_mv	application/pdf
dc.source.none.fl_str_mv	reponame:SEDICI (UNLP) instname:Universidad Nacional de La Plata instacron:UNLP
reponame_str	SEDICI (UNLP)
collection	SEDICI (UNLP)
instname_str	Universidad Nacional de La Plata
instacron_str	UNLP
institution	UNLP
repository.name.fl_str_mv	SEDICI (UNLP) - Universidad Nacional de La Plata
repository.mail.fl_str_mv	alira@sedici.unlp.edu.ar
