SMOTE-BD: An Exact and Scalable Oversampling Method for Imbalanced Classification in Big Data

Autores
Basgall, María José; Hasperué, Waldo; Naiouf, Marcelo; Fernández, Alberto; Herrera, Francisco
Año de publicación
2018
Idioma
inglés
Tipo de recurso
documento de conferencia
Estado
versión publicada
Descripción
The volume of data in today’s applications has meant a change in the way Machine Learning issues are addressed. Indeed, the Big Data scenario involves scalability constraints that can only be achieved through intelligent model design and the use of distributed technologies. In this context, solutions based on the Spark platform have established themselves as a de facto standard. In this contribution, we focus on a very important framework within Big Data Analytics, namely classification with imbalanced datasets. The main characteristic of this problem is that one of the classes is underrepresented, and therefore it is usually more complex to find a model that identifies it correctly. For this reason, it is common to apply preprocessing techniques such as oversampling to balance the distribution of examples in classes. In this work we present SMOTE-BD, fully scalable preprocessing approach for imbalanced classification in Big Data. It is based on one of the most widespread preprocessing solutions for imbalanced classification, namely the SMOTE algorithm, which creates new synthetic instances according to the neighborhood of each example of the minority class. Our novel development is made to be independent of the number of partitions or processes created to achieve a higher degree of efficiency. Experiments conducted on different standard and Big Data datasets show the quality of the proposed design and implementation.
Materia
Ciencias de la Computación e Información
big data, imbalanced classification, preprocessing, SMOTE, spark
Nivel de accesibilidad
acceso abierto
Condiciones de uso
http://creativecommons.org/licenses/by-nc-sa/4.0/
Repositorio
CIC Digital (CICBA)
Institución
Comisión de Investigaciones Científicas de la Provincia de Buenos Aires
OAI Identificador
oai:digital.cic.gba.gob.ar:11746/8512

id CICBA_8dc08e8c6460a53928360c522fa3d4eb
oai_identifier_str oai:digital.cic.gba.gob.ar:11746/8512
network_acronym_str CICBA
repository_id_str 9441
network_name_str CIC Digital (CICBA)
spelling SMOTE-BD: An Exact and Scalable Oversampling Method for Imbalanced Classification in Big DataBasgall, María JoséHasperué, WaldoNaiouf, MarceloFernández, AlbertoHerrera, FranciscoCiencias de la Computación e Informaciónbig data, imbalanced classification, preprocessing, SMOTE, sparkThe volume of data in today’s applications has meant a change in the way Machine Learning issues are addressed. Indeed, the Big Data scenario involves scalability constraints that can only be achieved through intelligent model design and the use of distributed technologies. In this context, solutions based on the Spark platform have established themselves as a de facto standard. In this contribution, we focus on a very important framework within Big Data Analytics, namely classification with imbalanced datasets. The main characteristic of this problem is that one of the classes is underrepresented, and therefore it is usually more complex to find a model that identifies it correctly. For this reason, it is common to apply preprocessing techniques such as oversampling to balance the distribution of examples in classes. In this work we present SMOTE-BD, fully scalable preprocessing approach for imbalanced classification in Big Data. It is based on one of the most widespread preprocessing solutions for imbalanced classification, namely the SMOTE algorithm, which creates new synthetic instances according to the neighborhood of each example of the minority class. Our novel development is made to be independent of the number of partitions or processes created to achieve a higher degree of efficiency. Experiments conducted on different standard and Big Data datasets show the quality of the proposed design and implementation.2018info:eu-repo/semantics/conferenceObjectinfo:eu-repo/semantics/publishedVersionhttp://purl.org/coar/resource_type/c_5794info:ar-repo/semantics/documentoDeConferenciaapplication/pdfhttps://digital.cic.gba.gob.ar/handle/11746/8512enginfo:eu-repo/semantics/openAccesshttp://creativecommons.org/licenses/by-nc-sa/4.0/reponame:CIC Digital (CICBA)instname:Comisión de Investigaciones Científicas de la Provincia de Buenos Airesinstacron:CICBA2025-09-04T09:43:17Zoai:digital.cic.gba.gob.ar:11746/8512Institucionalhttp://digital.cic.gba.gob.arOrganismo científico-tecnológicoNo correspondehttp://digital.cic.gba.gob.ar/oai/snrdmarisa.degiusti@sedici.unlp.edu.arArgentinaNo correspondeNo correspondeNo correspondeopendoar:94412025-09-04 09:43:17.979CIC Digital (CICBA) - Comisión de Investigaciones Científicas de la Provincia de Buenos Airesfalse
dc.title.none.fl_str_mv SMOTE-BD: An Exact and Scalable Oversampling Method for Imbalanced Classification in Big Data
title SMOTE-BD: An Exact and Scalable Oversampling Method for Imbalanced Classification in Big Data
spellingShingle SMOTE-BD: An Exact and Scalable Oversampling Method for Imbalanced Classification in Big Data
Basgall, María José
Ciencias de la Computación e Información
big data, imbalanced classification, preprocessing, SMOTE, spark
title_short SMOTE-BD: An Exact and Scalable Oversampling Method for Imbalanced Classification in Big Data
title_full SMOTE-BD: An Exact and Scalable Oversampling Method for Imbalanced Classification in Big Data
title_fullStr SMOTE-BD: An Exact and Scalable Oversampling Method for Imbalanced Classification in Big Data
title_full_unstemmed SMOTE-BD: An Exact and Scalable Oversampling Method for Imbalanced Classification in Big Data
title_sort SMOTE-BD: An Exact and Scalable Oversampling Method for Imbalanced Classification in Big Data
dc.creator.none.fl_str_mv Basgall, María José
Hasperué, Waldo
Naiouf, Marcelo
Fernández, Alberto
Herrera, Francisco
author Basgall, María José
author_facet Basgall, María José
Hasperué, Waldo
Naiouf, Marcelo
Fernández, Alberto
Herrera, Francisco
author_role author
author2 Hasperué, Waldo
Naiouf, Marcelo
Fernández, Alberto
Herrera, Francisco
author2_role author
author
author
author
dc.subject.none.fl_str_mv Ciencias de la Computación e Información
big data, imbalanced classification, preprocessing, SMOTE, spark
topic Ciencias de la Computación e Información
big data, imbalanced classification, preprocessing, SMOTE, spark
dc.description.none.fl_txt_mv The volume of data in today’s applications has meant a change in the way Machine Learning issues are addressed. Indeed, the Big Data scenario involves scalability constraints that can only be achieved through intelligent model design and the use of distributed technologies. In this context, solutions based on the Spark platform have established themselves as a de facto standard. In this contribution, we focus on a very important framework within Big Data Analytics, namely classification with imbalanced datasets. The main characteristic of this problem is that one of the classes is underrepresented, and therefore it is usually more complex to find a model that identifies it correctly. For this reason, it is common to apply preprocessing techniques such as oversampling to balance the distribution of examples in classes. In this work we present SMOTE-BD, fully scalable preprocessing approach for imbalanced classification in Big Data. It is based on one of the most widespread preprocessing solutions for imbalanced classification, namely the SMOTE algorithm, which creates new synthetic instances according to the neighborhood of each example of the minority class. Our novel development is made to be independent of the number of partitions or processes created to achieve a higher degree of efficiency. Experiments conducted on different standard and Big Data datasets show the quality of the proposed design and implementation.
description The volume of data in today’s applications has meant a change in the way Machine Learning issues are addressed. Indeed, the Big Data scenario involves scalability constraints that can only be achieved through intelligent model design and the use of distributed technologies. In this context, solutions based on the Spark platform have established themselves as a de facto standard. In this contribution, we focus on a very important framework within Big Data Analytics, namely classification with imbalanced datasets. The main characteristic of this problem is that one of the classes is underrepresented, and therefore it is usually more complex to find a model that identifies it correctly. For this reason, it is common to apply preprocessing techniques such as oversampling to balance the distribution of examples in classes. In this work we present SMOTE-BD, fully scalable preprocessing approach for imbalanced classification in Big Data. It is based on one of the most widespread preprocessing solutions for imbalanced classification, namely the SMOTE algorithm, which creates new synthetic instances according to the neighborhood of each example of the minority class. Our novel development is made to be independent of the number of partitions or processes created to achieve a higher degree of efficiency. Experiments conducted on different standard and Big Data datasets show the quality of the proposed design and implementation.
publishDate 2018
dc.date.none.fl_str_mv 2018
dc.type.none.fl_str_mv info:eu-repo/semantics/conferenceObject
info:eu-repo/semantics/publishedVersion
http://purl.org/coar/resource_type/c_5794
info:ar-repo/semantics/documentoDeConferencia
format conferenceObject
status_str publishedVersion
dc.identifier.none.fl_str_mv https://digital.cic.gba.gob.ar/handle/11746/8512
url https://digital.cic.gba.gob.ar/handle/11746/8512
dc.language.none.fl_str_mv eng
language eng
dc.rights.none.fl_str_mv info:eu-repo/semantics/openAccess
http://creativecommons.org/licenses/by-nc-sa/4.0/
eu_rights_str_mv openAccess
rights_invalid_str_mv http://creativecommons.org/licenses/by-nc-sa/4.0/
dc.format.none.fl_str_mv application/pdf
dc.source.none.fl_str_mv reponame:CIC Digital (CICBA)
instname:Comisión de Investigaciones Científicas de la Provincia de Buenos Aires
instacron:CICBA
reponame_str CIC Digital (CICBA)
collection CIC Digital (CICBA)
instname_str Comisión de Investigaciones Científicas de la Provincia de Buenos Aires
instacron_str CICBA
institution CICBA
repository.name.fl_str_mv CIC Digital (CICBA) - Comisión de Investigaciones Científicas de la Provincia de Buenos Aires
repository.mail.fl_str_mv marisa.degiusti@sedici.unlp.edu.ar
_version_ 1842340411902787584
score 12.623145