Instance retrieval from non-labeled data as a strategy for automatic classification of imbalanced e-mail datasets
- Authors
- Fernández, Juan Manuel; Errecalde, Marcelo Luis
- Publication year
- 2022
- Language
- English
- Resource type
- conference paper
- Status
- published version
- Description
- One of the main challenges in automatic email classification arises when the number of classes is relatively large and the classes are highly imbalanced. This happens even when unlabeled text collections are available, because manual labeling is costly. All automatic text classification strategies are, to a greater or lesser extent, sensitive to class imbalance. The most widely used approaches for learning from imbalanced datasets are resampling techniques, which either undersample or oversample the data. However, existing techniques still have open problems. In this work we present a new proposal that balances the classes of the dataset by retrieving unlabeled instances (e-mails) that are similar to those of the minority classes. The results show that, for the dataset used, this is a valid, viable, and competitive strategy compared with the resampling strategies currently used to learn from imbalanced email datasets.
XIX Workshop base de datos y Minería de datos (WBDMD)
Red de Universidades con Carreras en Informática
- Subject
- Ciencias Informáticas
imbalanced data
automatic classification
information retrieval
- Access level
- open access
- Terms of use
- http://creativecommons.org/licenses/by-nc-sa/4.0/
- Repository
- Institution
- Universidad Nacional de La Plata
- OAI identifier
- oai:sedici.unlp.edu.ar:10915/149456
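The balancing idea summarized in the description (retrieving unlabeled e-mails that resemble the minority classes) can be illustrated with a minimal sketch. This is not the authors' implementation: the bag-of-words representation, the cosine similarity, the best-match scoring, the step of assigning the minority label to the retrieved e-mails, and every name here (`vectorize`, `cosine`, `retrieve_for_minority`, the toy e-mails) are assumptions made only for illustration.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Lowercased bag-of-words term counts (a deliberately simple representation)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_for_minority(labeled, unlabeled, minority_label, k):
    """Return the k unlabeled texts most similar to the minority-class examples.

    Each unlabeled text is scored by its best cosine similarity to any
    labeled example of `minority_label`; texts scoring 0 are never returned.
    """
    seeds = [vectorize(text) for text, label in labeled if label == minority_label]
    scored = []
    for text in unlabeled:
        vec = vectorize(text)
        score = max((cosine(vec, seed) for seed in seeds), default=0.0)
        if score > 0.0:
            scored.append((score, text))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Hypothetical toy data: "billing" is the minority class with one labeled e-mail.
labeled = [
    ("please reset my password", "account"),
    ("cannot log in to my account", "account"),
    ("forgot my password again", "account"),
    ("invoice for the last payment", "billing"),
]
unlabeled = [
    "question about my invoice and payment date",
    "the login page rejects my password",
    "lunch menu for friday",
]

# Assumption of this sketch: retrieved e-mails are added with the minority label.
extra = retrieve_for_minority(labeled, unlabeled, "billing", k=1)
balanced = labeled + [(text, "billing") for text in extra]
```

In practice the retrieved instances would likely need a similarity threshold or a review step before being added to the training set, since mislabeled additions can hurt the minority class more than they help.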
Full record metadata
id | SEDICI_4bf5da82936866f6e6779d205a8afb8f |
---|---|
oai_identifier_str | oai:sedici.unlp.edu.ar:10915/149456 |
network_acronym_str | SEDICI |
repository_id_str | 1329 |
network_name_str | SEDICI (UNLP) |
dc.title.none.fl_str_mv | Instance retrieval from non-labeled data as a strategy for automatic classification of imbalanced e-mail datasets |
dc.creator.none.fl_str_mv | Fernández, Juan Manuel; Errecalde, Marcelo Luis |
author | Fernández, Juan Manuel |
author_facet | Fernández, Juan Manuel; Errecalde, Marcelo Luis |
author_role | author |
author2 | Errecalde, Marcelo Luis |
author2_role | author |
dc.subject.none.fl_str_mv | Ciencias Informáticas; imbalanced data; automatic classification; information retrieval |
topic | Ciencias Informáticas; imbalanced data; automatic classification; information retrieval |
publishDate | 2022 |
dc.date.none.fl_str_mv | 2022-10 |
dc.type.none.fl_str_mv | info:eu-repo/semantics/conferenceObject; info:eu-repo/semantics/publishedVersion; Objeto de conferencia; http://purl.org/coar/resource_type/c_5794; info:ar-repo/semantics/documentoDeConferencia |
format | conferenceObject |
status_str | publishedVersion |
dc.identifier.none.fl_str_mv | http://sedici.unlp.edu.ar/handle/10915/149456 |
url | http://sedici.unlp.edu.ar/handle/10915/149456 |
dc.language.none.fl_str_mv | eng |
language | eng |
dc.relation.none.fl_str_mv | info:eu-repo/semantics/altIdentifier/isbn/978-987-1364-31-2; info:eu-repo/semantics/reference/hdl/10915/149102 |
dc.rights.none.fl_str_mv | info:eu-repo/semantics/openAccess; http://creativecommons.org/licenses/by-nc-sa/4.0/; Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) |
eu_rights_str_mv | openAccess |
rights_invalid_str_mv | http://creativecommons.org/licenses/by-nc-sa/4.0/; Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) |
dc.format.none.fl_str_mv | application/pdf; 415-425 |
dc.source.none.fl_str_mv | reponame:SEDICI (UNLP); instname:Universidad Nacional de La Plata; instacron:UNLP |
reponame_str | SEDICI (UNLP) |
collection | SEDICI (UNLP) |
instname_str | Universidad Nacional de La Plata |
instacron_str | UNLP |
institution | UNLP |
repository.name.fl_str_mv | SEDICI (UNLP) - Universidad Nacional de La Plata |
repository.mail.fl_str_mv | alira@sedici.unlp.edu.ar |