SMOTE-BD: An Exact and Scalable Oversampling Method for Imbalanced Classification in Big Data
- Autores
- Basgall, María José; Hasperué, Waldo; Naiouf, Marcelo; Fernández, Alberto; Herrera, Francisco
- Año de publicación
- 2018
- Idioma
- inglés
- Tipo de recurso
- documento de conferencia
- Estado
- versión publicada
- Descripción
- The volume of data in today’s applications has meant a change in the way Machine Learning issues are addressed. Indeed, the Big Data scenario involves scalability constraints that can only be achieved through intelligent model design and the use of distributed technologies. In this context, solutions based on the Spark platform have established themselves as a de facto standard. In this contribution, we focus on a very important framework within Big Data Analytics, namely classification with imbalanced datasets. The main characteristic of this problem is that one of the classes is underrepresented, and therefore it is usually more complex to find a model that identifies it correctly. For this reason, it is common to apply preprocessing techniques such as oversampling to balance the distribution of examples in classes. In this work we present SMOTE-BD, fully scalable preprocessing approach for imbalanced classification in Big Data. It is based on one of the most widespread preprocessing solutions for imbalanced classification, namely the SMOTE algorithm, which creates new synthetic instances according to the neighborhood of each example of the minority class. Our novel development is made to be independent of the number of partitions or processes created to achieve a higher degree of efficiency. Experiments conducted on different standard and Big Data datasets show the quality of the proposed design and implementation.
- Materia
-
Ciencias de la Computación e Información
big data, imbalanced classification, preprocessing, SMOTE, spark - Nivel de accesibilidad
- acceso abierto
- Condiciones de uso
- http://creativecommons.org/licenses/by-nc-sa/4.0/
- Repositorio
- Institución
- Comisión de Investigaciones Científicas de la Provincia de Buenos Aires
- OAI Identificador
- oai:digital.cic.gba.gob.ar:11746/8512
Ver los metadatos del registro completo
id |
CICBA_8dc08e8c6460a53928360c522fa3d4eb |
---|---|
oai_identifier_str |
oai:digital.cic.gba.gob.ar:11746/8512 |
network_acronym_str |
CICBA |
repository_id_str |
9441 |
network_name_str |
CIC Digital (CICBA) |
spelling |
SMOTE-BD: An Exact and Scalable Oversampling Method for Imbalanced Classification in Big DataBasgall, María JoséHasperué, WaldoNaiouf, MarceloFernández, AlbertoHerrera, FranciscoCiencias de la Computación e Informaciónbig data, imbalanced classification, preprocessing, SMOTE, sparkThe volume of data in today’s applications has meant a change in the way Machine Learning issues are addressed. Indeed, the Big Data scenario involves scalability constraints that can only be achieved through intelligent model design and the use of distributed technologies. In this context, solutions based on the Spark platform have established themselves as a de facto standard. In this contribution, we focus on a very important framework within Big Data Analytics, namely classification with imbalanced datasets. The main characteristic of this problem is that one of the classes is underrepresented, and therefore it is usually more complex to find a model that identifies it correctly. For this reason, it is common to apply preprocessing techniques such as oversampling to balance the distribution of examples in classes. In this work we present SMOTE-BD, fully scalable preprocessing approach for imbalanced classification in Big Data. It is based on one of the most widespread preprocessing solutions for imbalanced classification, namely the SMOTE algorithm, which creates new synthetic instances according to the neighborhood of each example of the minority class. Our novel development is made to be independent of the number of partitions or processes created to achieve a higher degree of efficiency. Experiments conducted on different standard and Big Data datasets show the quality of the proposed design and implementation.2018info:eu-repo/semantics/conferenceObjectinfo:eu-repo/semantics/publishedVersionhttp://purl.org/coar/resource_type/c_5794info:ar-repo/semantics/documentoDeConferenciaapplication/pdfhttps://digital.cic.gba.gob.ar/handle/11746/8512enginfo:eu-repo/semantics/openAccesshttp://creativecommons.org/licenses/by-nc-sa/4.0/reponame:CIC Digital (CICBA)instname:Comisión de Investigaciones Científicas de la Provincia de Buenos Airesinstacron:CICBA2025-09-04T09:43:17Zoai:digital.cic.gba.gob.ar:11746/8512Institucionalhttp://digital.cic.gba.gob.arOrganismo científico-tecnológicoNo correspondehttp://digital.cic.gba.gob.ar/oai/snrdmarisa.degiusti@sedici.unlp.edu.arArgentinaNo correspondeNo correspondeNo correspondeopendoar:94412025-09-04 09:43:17.979CIC Digital (CICBA) - Comisión de Investigaciones Científicas de la Provincia de Buenos Airesfalse |
dc.title.none.fl_str_mv |
SMOTE-BD: An Exact and Scalable Oversampling Method for Imbalanced Classification in Big Data |
title |
SMOTE-BD: An Exact and Scalable Oversampling Method for Imbalanced Classification in Big Data |
spellingShingle |
SMOTE-BD: An Exact and Scalable Oversampling Method for Imbalanced Classification in Big Data Basgall, María José Ciencias de la Computación e Información big data, imbalanced classification, preprocessing, SMOTE, spark |
title_short |
SMOTE-BD: An Exact and Scalable Oversampling Method for Imbalanced Classification in Big Data |
title_full |
SMOTE-BD: An Exact and Scalable Oversampling Method for Imbalanced Classification in Big Data |
title_fullStr |
SMOTE-BD: An Exact and Scalable Oversampling Method for Imbalanced Classification in Big Data |
title_full_unstemmed |
SMOTE-BD: An Exact and Scalable Oversampling Method for Imbalanced Classification in Big Data |
title_sort |
SMOTE-BD: An Exact and Scalable Oversampling Method for Imbalanced Classification in Big Data |
dc.creator.none.fl_str_mv |
Basgall, María José Hasperué, Waldo Naiouf, Marcelo Fernández, Alberto Herrera, Francisco |
author |
Basgall, María José |
author_facet |
Basgall, María José Hasperué, Waldo Naiouf, Marcelo Fernández, Alberto Herrera, Francisco |
author_role |
author |
author2 |
Hasperué, Waldo Naiouf, Marcelo Fernández, Alberto Herrera, Francisco |
author2_role |
author author author author |
dc.subject.none.fl_str_mv |
Ciencias de la Computación e Información big data, imbalanced classification, preprocessing, SMOTE, spark |
topic |
Ciencias de la Computación e Información big data, imbalanced classification, preprocessing, SMOTE, spark |
dc.description.none.fl_txt_mv |
The volume of data in today’s applications has meant a change in the way Machine Learning issues are addressed. Indeed, the Big Data scenario involves scalability constraints that can only be achieved through intelligent model design and the use of distributed technologies. In this context, solutions based on the Spark platform have established themselves as a de facto standard. In this contribution, we focus on a very important framework within Big Data Analytics, namely classification with imbalanced datasets. The main characteristic of this problem is that one of the classes is underrepresented, and therefore it is usually more complex to find a model that identifies it correctly. For this reason, it is common to apply preprocessing techniques such as oversampling to balance the distribution of examples in classes. In this work we present SMOTE-BD, fully scalable preprocessing approach for imbalanced classification in Big Data. It is based on one of the most widespread preprocessing solutions for imbalanced classification, namely the SMOTE algorithm, which creates new synthetic instances according to the neighborhood of each example of the minority class. Our novel development is made to be independent of the number of partitions or processes created to achieve a higher degree of efficiency. Experiments conducted on different standard and Big Data datasets show the quality of the proposed design and implementation. |
description |
The volume of data in today’s applications has meant a change in the way Machine Learning issues are addressed. Indeed, the Big Data scenario involves scalability constraints that can only be achieved through intelligent model design and the use of distributed technologies. In this context, solutions based on the Spark platform have established themselves as a de facto standard. In this contribution, we focus on a very important framework within Big Data Analytics, namely classification with imbalanced datasets. The main characteristic of this problem is that one of the classes is underrepresented, and therefore it is usually more complex to find a model that identifies it correctly. For this reason, it is common to apply preprocessing techniques such as oversampling to balance the distribution of examples in classes. In this work we present SMOTE-BD, fully scalable preprocessing approach for imbalanced classification in Big Data. It is based on one of the most widespread preprocessing solutions for imbalanced classification, namely the SMOTE algorithm, which creates new synthetic instances according to the neighborhood of each example of the minority class. Our novel development is made to be independent of the number of partitions or processes created to achieve a higher degree of efficiency. Experiments conducted on different standard and Big Data datasets show the quality of the proposed design and implementation. |
publishDate |
2018 |
dc.date.none.fl_str_mv |
2018 |
dc.type.none.fl_str_mv |
info:eu-repo/semantics/conferenceObject info:eu-repo/semantics/publishedVersion http://purl.org/coar/resource_type/c_5794 info:ar-repo/semantics/documentoDeConferencia |
format |
conferenceObject |
status_str |
publishedVersion |
dc.identifier.none.fl_str_mv |
https://digital.cic.gba.gob.ar/handle/11746/8512 |
url |
https://digital.cic.gba.gob.ar/handle/11746/8512 |
dc.language.none.fl_str_mv |
eng |
language |
eng |
dc.rights.none.fl_str_mv |
info:eu-repo/semantics/openAccess http://creativecommons.org/licenses/by-nc-sa/4.0/ |
eu_rights_str_mv |
openAccess |
rights_invalid_str_mv |
http://creativecommons.org/licenses/by-nc-sa/4.0/ |
dc.format.none.fl_str_mv |
application/pdf |
dc.source.none.fl_str_mv |
reponame:CIC Digital (CICBA) instname:Comisión de Investigaciones Científicas de la Provincia de Buenos Aires instacron:CICBA |
reponame_str |
CIC Digital (CICBA) |
collection |
CIC Digital (CICBA) |
instname_str |
Comisión de Investigaciones Científicas de la Provincia de Buenos Aires |
instacron_str |
CICBA |
institution |
CICBA |
repository.name.fl_str_mv |
CIC Digital (CICBA) - Comisión de Investigaciones Científicas de la Provincia de Buenos Aires |
repository.mail.fl_str_mv |
marisa.degiusti@sedici.unlp.edu.ar |
_version_ |
1842340411902787584 |
score |
12.623145 |