An analysis of local and global solutions to address Big Data imbalanced classification: a case study with SMOTE preprocessing
- Authors
- Basgall, María José; Hasperué, Waldo; Naiouf, Marcelo; Fernández, Alberto; Herrera, Francisco
- Publication year
- 2019
- Language
- English
- Resource type
- conference paper
- Status
- published version
- Description
- Addressing the huge amount of data continuously generated is an important challenge in the Machine Learning field. The need to adapt traditional techniques or create new ones is evident. To do so, distributed technologies have to be used to deal with the significant scalability constraints of the Big Data context. In many Big Data classification applications, some classes are highly underrepresented, leading to what is known as the imbalanced classification problem. In this scenario, learning algorithms are often biased towards the majority classes, treating minority ones as outliers or noise. Consequently, preprocessing techniques to balance the class distribution were developed. This can be achieved by removing majority instances (undersampling) or by creating minority examples (oversampling). Among the oversampling methods, one of the most widespread is the SMOTE algorithm, which creates artificial examples according to the neighborhood of each minority class instance. In this work, our objective is to analyze the behavior of SMOTE in Big Data as a function of some key aspects such as the oversampling degree, the neighborhood value and, especially, the type of distributed design (local vs. global).
Instituto de Investigación en Informática
- Subject
- Computer Science; big data; imbalanced classification; preprocessing techniques; SMOTE; scalability
- Access level
- open access
- Terms of use
- http://creativecommons.org/licenses/by-nc-sa/4.0/
- Repository
- Institution
- Universidad Nacional de La Plata
- OAI identifier
- oai:sedici.unlp.edu.ar:10915/80384
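The description in this record characterizes SMOTE as creating artificial minority examples from the neighborhood of each minority instance, and names the oversampling degree, the neighborhood value and the local vs. global design as the aspects under study. The following is a minimal single-machine sketch of that idea in plain Python, not the authors' distributed implementation. The function names (`smote`, `smote_global`, `smote_local`) and parameters (`per_instance`, `k`) are hypothetical, and the reading of "local" as SMOTE run independently inside each data partition and "global" as SMOTE run over all minority instances at once is an assumption made for illustration.

```python
import math
import random


def smote(minority, per_instance=1, k=5, rng=None):
    """Return synthetic points built from a list of minority-class tuples.

    per_instance plays the role of the oversampling degree (synthetic
    points generated per original instance); k is the neighborhood value.
    """
    rng = rng or random.Random(0)
    if len(minority) < 2:
        return []  # no neighbor to interpolate with
    k = min(k, len(minority) - 1)
    synthetic = []
    for i, base in enumerate(minority):
        # Brute-force search: sort the other points by Euclidean distance.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        for _ in range(per_instance):
            neighbor = rng.choice(others[:k])
            gap = rng.random()  # position on the segment base -> neighbor
            synthetic.append(
                tuple(b + gap * (n - b) for b, n in zip(base, neighbor))
            )
    return synthetic


def smote_global(minority, per_instance=1, k=5, seed=0):
    """Assumed 'global' design: neighbors are searched over all the data."""
    return smote(minority, per_instance, k, random.Random(seed))


def smote_local(partitions, per_instance=1, k=5, seed=0):
    """Assumed 'local' design: each partition only sees its own instances."""
    rng = random.Random(seed)
    synthetic = []
    for part in partitions:
        synthetic.extend(smote(part, per_instance, k, rng))
    return synthetic


if __name__ == "__main__":
    points = [(0, 0), (1, 0), (0, 1), (1, 1),
              (10, 10), (11, 10), (10, 11), (11, 11)]
    print(len(smote_global(points, per_instance=2, k=3)))
    print(len(smote_local([points[:4], points[4:]], per_instance=2, k=3)))
```

Under these assumptions the two variants differ only in which instances are candidate neighbors: the local variant can never interpolate across partitions, whereas the global one can choose any of the k nearest minority instances in the whole set.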
id |
SEDICI_c30457cf205e13cc43e1168ba571d094 |
---|---|
oai_identifier_str |
oai:sedici.unlp.edu.ar:10915/80384 |
network_acronym_str |
SEDICI |
repository_id_str |
1329 |
network_name_str |
SEDICI (UNLP) |
dc.date.none.fl_str_mv |
2019-06 |
dc.type.none.fl_str_mv |
info:eu-repo/semantics/conferenceObject info:eu-repo/semantics/publishedVersion Conference object http://purl.org/coar/resource_type/c_5794 info:ar-repo/semantics/documentoDeConferencia |
format |
conferenceObject |
status_str |
publishedVersion |
dc.identifier.none.fl_str_mv |
http://sedici.unlp.edu.ar/handle/10915/80384 |
dc.language.none.fl_str_mv |
eng |
dc.relation.none.fl_str_mv |
info:eu-repo/semantics/altIdentifier/isbn/978-3-030-27713-0 info:eu-repo/semantics/reference/doi/10.1007/978-3-030-27713-0 |
dc.rights.none.fl_str_mv |
info:eu-repo/semantics/openAccess http://creativecommons.org/licenses/by-nc-sa/4.0/ Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) |
dc.format.none.fl_str_mv |
application/pdf 75-85 |
dc.source.none.fl_str_mv |
reponame:SEDICI (UNLP) instname:Universidad Nacional de La Plata instacron:UNLP |
repository.name.fl_str_mv |
SEDICI (UNLP) - Universidad Nacional de La Plata |
repository.mail.fl_str_mv |
alira@sedici.unlp.edu.ar |