An analysis of local and global solutions to address Big Data imbalanced classification: a case study with SMOTE preprocessing

Authors: Basgall, María José; Hasperué, Waldo; Naiouf, Marcelo; Fernández, Alberto; Herrera, Francisco
Publication year: 2019
Language: English
Resource type: conference paper
Status: published version
Description: Handling the huge amount of data that is continuously generated is an important challenge in the Machine Learning field. Traditional techniques must be adapted, or new ones created, and distributed technologies are needed to cope with the significant scalability constraints of the Big Data context. In many Big Data classification applications, some classes are highly underrepresented, leading to what is known as the imbalanced classification problem. In this scenario, learning algorithms are often biased towards the majority classes and treat minority ones as outliers or noise. Consequently, preprocessing techniques were developed to balance the class distribution. This can be achieved by removing majority instances (undersampling) or by creating minority examples (oversampling). Among the oversampling methods, one of the most widespread is the SMOTE algorithm, which creates artificial examples according to the neighborhood of each minority class instance. In this work, our objective is to analyze the behavior of SMOTE in Big Data as a function of some key aspects such as the oversampling degree, the neighborhood value and, especially, the type of distributed design (local vs. global).
Instituto de Investigación en Informática
Subject: Computer Science
big data
imbalanced classification
preprocessing techniques
SMOTE
scalability
Access level: open access
Terms of use: http://creativecommons.org/licenses/by-nc-sa/4.0/
Repository
Institution: Universidad Nacional de La Plata
OAI identifier: oai:sedici.unlp.edu.ar:10915/80384
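
The description mentions that SMOTE creates artificial examples from the neighborhood of each minority instance, and that the distributed design can be local or global. The following is a minimal, single-machine sketch of that idea, not the authors' implementation. The function names (`smote`, `smote_local`, `smote_global`) are hypothetical, and the reading of "local" (each data partition is oversampled using only its own points) versus "global" (neighbors are searched across all partitions) is an assumption made for illustration.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating a randomly chosen
    minority point toward one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority points")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Candidate neighbours: every minority point except the base itself.
        others = minority[:i] + minority[i + 1:]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


def smote_local(partitions, n_new_per_partition, k=5, seed=0):
    """Assumed 'local' design: each partition only sees its own points."""
    out = []
    for idx, part in enumerate(partitions):
        out.extend(smote(part, n_new_per_partition, k, seed + idx))
    return out


def smote_global(partitions, n_new, k=5, seed=0):
    """Assumed 'global' design: neighbours are searched over all partitions."""
    all_points = [p for part in partitions for p in part]
    return smote(all_points, n_new, k, seed)
```

In this sketch, `k` plays the role of the neighborhood value and `n_new` the oversampling degree. With well-separated partitions, the local variant can only produce points inside each partition's own region, while the global variant may interpolate between points that live in different partitions.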

Publication date: 2019-06
Pages: 75-85
Format: application/pdf
Handle: http://sedici.unlp.edu.ar/handle/10915/80384
ISBN: 978-3-030-27713-0
DOI (reference): 10.1007/978-3-030-27713-0
License: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)