Instance retrieval from non-labeled data as a strategy for automatic classification of imbalanced e-mail datasets

Authors
Fernández, Juan Manuel; Errecalde, Marcelo Luis
Publication year
2022
Language
English
Resource type
conference paper
Status
published version
Description
One of the main challenges in automatic e-mail classification arises when there is a relatively large number of classes and those classes are highly imbalanced. This happens even when unlabeled text collections are available, because manual labeling is costly. All automatic text classification strategies are, to a greater or lesser extent, sensitive to class imbalance. The most widely used approaches for learning from imbalanced datasets are resampling techniques, which either undersample or oversample the data. However, existing techniques still have unresolved problems. In this work we propose balancing the classes of the dataset by retrieving unlabeled instances (e-mails) that are similar to those of the minority classes. For the dataset used, this is shown to be a valid, viable, and competitive strategy compared with the resampling strategies currently used to learn from imbalanced e-mail datasets.
XIX Workshop base de datos y Minería de datos (WBDMD)
Red de Universidades con Carreras en Informática
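The balancing idea described in the abstract can be illustrated with a minimal sketch. This record does not say which text representation or similarity measure the authors use, so the bag-of-words cosine similarity below is an assumption, and the names `retrieve_for_minority`, `target_size`, and `min_similarity` are hypothetical. It is not the authors' implementation.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a deliberately crude tokenizer."""
    return text.lower().split()


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_for_minority(labeled, unlabeled, target_size, min_similarity=0.0):
    """Top up every class below target_size with similar unlabeled e-mails.

    labeled: list of (text, label) pairs; unlabeled: list of texts.
    Returns a new list; retrieved e-mails get the label of the class they
    matched (an assumed pseudo-label, which may be wrong).
    """
    by_class = {}
    for text, label in labeled:
        by_class.setdefault(label, []).append(text)

    augmented = list(labeled)
    pool = list(unlabeled)
    for label, texts in by_class.items():
        missing = target_size - len(texts)
        if missing <= 0:
            continue  # class is already large enough
        # Assumed class profile: summed term counts of its labeled e-mails.
        profile = Counter()
        for text in texts:
            profile.update(tokenize(text))
        scored = [(cosine(Counter(tokenize(t)), profile), t) for t in pool]
        scored = [(s, t) for s, t in scored if s > min_similarity]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        for _, text in scored[:missing]:
            pool.remove(text)  # each unlabeled e-mail is used at most once
            augmented.append((text, label))
    return augmented


# Toy data, invented for illustration only.
labeled = [
    ("exam schedule for the algebra course", "exams"),
    ("final exam dates and classrooms", "exams"),
    ("exam results are published", "exams"),
    ("scholarship application deadline", "scholarships"),
]
unlabeled = [
    "how do I submit my scholarship application",
    "when is the algebra exam",
    "lunch menu for friday",
]
balanced = retrieve_for_minority(labeled, unlabeled, target_size=2, min_similarity=0.1)
print(Counter(label for _, label in balanced))
```

Unlike oversampling, which duplicates or synthesizes minority examples, this kind of retrieval adds real e-mails to the minority class. The trade-off is that their labels are inferred from similarity rather than assigned by an annotator.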
Subject
Computer Science
imbalanced data
automatic classification
information retrieval
Access level
open access
Terms of use
http://creativecommons.org/licenses/by-nc-sa/4.0/
Repository
SEDICI (UNLP)
Institution
Universidad Nacional de La Plata
OAI identifier
oai:sedici.unlp.edu.ar:10915/149456

Publication date
2022-10
Handle
http://sedici.unlp.edu.ar/handle/10915/149456
ISBN
978-987-1364-31-2
Related record
info:eu-repo/semantics/reference/hdl/10915/149102
License
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
Format
application/pdf
Pages
415-425
Repository contact
alira@sedici.unlp.edu.ar