SMOTE-BD: An Exact and Scalable Oversampling Method for Imbalanced Classification in Big Data

Authors: Basgall, María José; Hasperué, Waldo; Naiouf, Marcelo; Fernández, Alberto; Herrera, Francisco
Publication year: 2018
Language: English
Resource type: conference paper
Status: published version
Description: The volume of data in today’s applications has changed the way Machine Learning problems are addressed. Indeed, the Big Data scenario imposes scalability requirements that can only be met through intelligent model design and the use of distributed technologies. In this context, solutions based on the Spark platform have established themselves as a de facto standard. In this contribution, we focus on a very important problem within Big Data Analytics, namely classification with imbalanced datasets. The main characteristic of this problem is that one of the classes is underrepresented, and therefore it is usually harder to find a model that identifies it correctly. For this reason, it is common to apply preprocessing techniques such as oversampling to balance the distribution of examples across classes. In this work we present SMOTE-BD, a fully scalable preprocessing approach for imbalanced classification in Big Data. It is based on one of the most widespread preprocessing solutions for imbalanced classification, namely the SMOTE algorithm, which creates new synthetic instances according to the neighborhood of each example of the minority class. Our novel development is designed to be independent of the number of partitions or processes created to achieve a higher degree of efficiency. Experiments conducted on different standard and Big Data datasets show the quality of the proposed design and implementation.
Subject: Computer and Information Sciences
Keywords: big data, imbalanced classification, preprocessing, SMOTE, spark
Access level: open access
Terms of use: http://creativecommons.org/licenses/by-nc-sa/4.0/
Repository
Institution: Comisión de Investigaciones Científicas de la Provincia de Buenos Aires
OAI Identifier: oai:digital.cic.gba.gob.ar:11746/8512
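
The description says that SMOTE creates synthetic instances from the neighborhood of each minority-class example. A minimal single-machine sketch of that interpolation step follows. It illustrates classic SMOTE only and is not the distributed SMOTE-BD design, which this record does not detail. The function name `smote` and the parameters `n_new`, `k`, and `seed` are assumptions made for this example.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points built from minority-class vectors.

    Illustrative single-machine SMOTE; names and defaults are assumptions,
    not taken from the SMOTE-BD paper.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority examples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        # Pick a base example from the minority class.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Exact k nearest minority neighbours of the base (brute force).
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Interpolate at a random position on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote(minority, n_new=6, k=2)
```

Each synthetic point lies on a segment between two real minority examples, so it stays inside the region the minority class already occupies. The brute-force neighbour search is quadratic in the number of minority examples, which is the cost a distributed method has to address at Big Data scale.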


id	CICBA_8dc08e8c6460a53928360c522fa3d4eb
oai_identifier_str	oai:digital.cic.gba.gob.ar:11746/8512
network_acronym_str	CICBA
repository_id_str	9441
network_name_str	CIC Digital (CICBA)
dc.title.none.fl_str_mv	SMOTE-BD: An Exact and Scalable Oversampling Method for Imbalanced Classification in Big Data
dc.creator.none.fl_str_mv	Basgall, María José Hasperué, Waldo Naiouf, Marcelo Fernández, Alberto Herrera, Francisco
author	Basgall, María José
author_facet	Basgall, María José Hasperué, Waldo Naiouf, Marcelo Fernández, Alberto Herrera, Francisco
author_role	author
author2	Hasperué, Waldo Naiouf, Marcelo Fernández, Alberto Herrera, Francisco
author2_role	author author author author
dc.subject.none.fl_str_mv	Computer and Information Sciences; big data, imbalanced classification, preprocessing, SMOTE, spark
topic	Computer and Information Sciences; big data, imbalanced classification, preprocessing, SMOTE, spark
publishDate	2018
dc.date.none.fl_str_mv	2018
dc.type.none.fl_str_mv	info:eu-repo/semantics/conferenceObject info:eu-repo/semantics/publishedVersion http://purl.org/coar/resource_type/c_5794 info:ar-repo/semantics/documentoDeConferencia
format	conferenceObject
status_str	publishedVersion
dc.identifier.none.fl_str_mv	https://digital.cic.gba.gob.ar/handle/11746/8512
url	https://digital.cic.gba.gob.ar/handle/11746/8512
dc.language.none.fl_str_mv	eng
language	eng
dc.rights.none.fl_str_mv	info:eu-repo/semantics/openAccess http://creativecommons.org/licenses/by-nc-sa/4.0/
eu_rights_str_mv	openAccess
rights_invalid_str_mv	http://creativecommons.org/licenses/by-nc-sa/4.0/
dc.format.none.fl_str_mv	application/pdf
dc.source.none.fl_str_mv	reponame:CIC Digital (CICBA) instname:Comisión de Investigaciones Científicas de la Provincia de Buenos Aires instacron:CICBA
reponame_str	CIC Digital (CICBA)
collection	CIC Digital (CICBA)
instname_str	Comisión de Investigaciones Científicas de la Provincia de Buenos Aires
instacron_str	CICBA
institution	CICBA
repository.name.fl_str_mv	CIC Digital (CICBA) - Comisión de Investigaciones Científicas de la Provincia de Buenos Aires
repository.mail.fl_str_mv	marisa.degiusti@sedici.unlp.edu.ar
