Instance retrieval from non-labeled data as a strategy for automatic classification of imbalanced e-mail datasets

Authors: Fernández, Juan Manuel; Errecalde, Marcelo Luis
Year of publication: 2022
Language: English
Resource type: conference paper
Status: published version
Description: One of the main challenges in automatic email classification arises when the number of classes is relatively large and the classes are highly imbalanced. This happens even when unlabeled text collections are available, because manual labeling is costly. All automatic text classification strategies are, to a greater or lesser extent, sensitive to class imbalance. The most widely used approaches for learning from imbalanced datasets consist of resampling techniques, either undersampling or oversampling the datasets. However, existing techniques still have unresolved problems. This work presents a new proposal that balances the classes of the dataset by retrieving unlabeled instances (e-mails) that are similar to those of the minority classes. It is shown that, for the dataset used, this is a valid, viable and competitive strategy compared with the resampling strategies currently used to learn from imbalanced email datasets.
XIX Workshop base de datos y Minería de datos (WBDMD)
Red de Universidades con Carreras en Informática
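
The balancing idea summarized in the description can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state which text representation, similarity measure, or retrieval cut-off the paper uses, so the bag-of-words cosine similarity, the class centroids, the `min_sim` threshold, and the names `tokenize`, `cosine`, and `balance_by_retrieval` are all assumptions made for illustration, and the e-mails are invented toy data.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberate simplification)."""
    return text.lower().split()


def cosine(a, b):
    """Cosine similarity between two term-count Counters."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def balance_by_retrieval(labeled, unlabeled, min_sim=0.3):
    """Grow minority classes with similar unlabeled e-mails.

    labeled: list of (text, label) pairs; unlabeled: list of texts.
    min_sim is a hypothetical cut-off, not a value from the paper.
    """
    counts = Counter(label for _, label in labeled)
    target = max(counts.values())  # size of the largest class

    # Assumption: each class is summarized by its summed term counts.
    centroids = {}
    for text, label in labeled:
        centroids.setdefault(label, Counter()).update(tokenize(text))

    augmented = list(labeled)
    used = set()  # indices of unlabeled e-mails already assigned
    # Serve the smallest classes first.
    for label, size in sorted(counts.items(), key=lambda kv: kv[1]):
        needed = target - size
        if needed <= 0:
            continue
        scored = sorted(
            (
                (cosine(Counter(tokenize(text)), centroids[label]), i)
                for i, text in enumerate(unlabeled)
                if i not in used
            ),
            reverse=True,
        )
        for sim, i in scored[:needed]:
            if sim < min_sim:
                break  # remaining candidates are even less similar
            augmented.append((unlabeled[i], label))
            used.add(i)
    return augmented


# Toy data, invented for this sketch.
labeled = [
    ("exam date schedule", "exams"),
    ("exam results grades", "exams"),
    ("final exam room", "exams"),
    ("scholarship application form", "scholarship"),
]
unlabeled = [
    "scholarship form deadline",
    "scholarship application requirements",
    "lunch menu today",
    "exam schedule change",
]

augmented = balance_by_retrieval(labeled, unlabeled)
class_sizes = Counter(label for _, label in augmented)
```

In this toy run the minority class is filled up to the size of the majority class using only the two unlabeled messages that share vocabulary with it; unrelated messages fall below the threshold and are left out. Retrieved instances inherit the label of the class they were retrieved for, so a real system would need to weigh the label noise this can introduce.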
Subject: Computer Science
imbalanced data
automatic classification
information retrieval
Access level: open access
Terms of use: http://creativecommons.org/licenses/by-nc-sa/4.0/
Repository
Institution: Universidad Nacional de La Plata
OAI identifier: oai:sedici.unlp.edu.ar:10915/149456

id	SEDICI_4bf5da82936866f6e6779d205a8afb8f
oai_identifier_str	oai:sedici.unlp.edu.ar:10915/149456
network_acronym_str	SEDICI
repository_id_str	1329
network_name_str	SEDICI (UNLP)
publishDate	2022
dc.date.none.fl_str_mv	2022-10
dc.type.none.fl_str_mv	info:eu-repo/semantics/conferenceObject info:eu-repo/semantics/publishedVersion Objeto de conferencia http://purl.org/coar/resource_type/c_5794 info:ar-repo/semantics/documentoDeConferencia
format	conferenceObject
status_str	publishedVersion
dc.identifier.none.fl_str_mv	http://sedici.unlp.edu.ar/handle/10915/149456
url	http://sedici.unlp.edu.ar/handle/10915/149456
dc.language.none.fl_str_mv	eng
language	eng
dc.relation.none.fl_str_mv	info:eu-repo/semantics/altIdentifier/isbn/978-987-1364-31-2 info:eu-repo/semantics/reference/hdl/10915/149102
dc.rights.none.fl_str_mv	info:eu-repo/semantics/openAccess http://creativecommons.org/licenses/by-nc-sa/4.0/ Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
eu_rights_str_mv	openAccess
rights_invalid_str_mv	http://creativecommons.org/licenses/by-nc-sa/4.0/ Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
dc.format.none.fl_str_mv	application/pdf 415-425
dc.source.none.fl_str_mv	reponame:SEDICI (UNLP) instname:Universidad Nacional de La Plata instacron:UNLP
reponame_str	SEDICI (UNLP)
collection	SEDICI (UNLP)
instname_str	Universidad Nacional de La Plata
instacron_str	UNLP
institution	UNLP
repository.name.fl_str_mv	SEDICI (UNLP) - Universidad Nacional de La Plata
repository.mail.fl_str_mv	alira@sedici.unlp.edu.ar
