Outlier Mining Methods Based on Graph Structure Analysis

Authors
Amil, Pablo; Almeira, Nahuel; Masoller, Cristina
Publication year
2019
Language
English
Resource type
article
Status
published version
Description
Outlier detection in high-dimensional datasets is a fundamental and challenging problem across disciplines that also has practical implications, as removing outliers from the training set improves the performance of machine learning algorithms. While many outlier mining algorithms have been proposed in the literature, they tend to be valid or efficient for specific types of datasets (time series, images, videos, etc.). Here we propose two methods that can be applied to generic datasets, as long as there is a meaningful measure of distance between pairs of elements of the dataset. Both methods start by defining a graph in which the nodes are the elements of the dataset and the links are weighted by the distances between the nodes. The first method then assigns an outlier score based on the percolation (i.e., the fragmentation) of the graph. The second method uses the popular IsoMap non-linear dimensionality reduction algorithm and assigns an outlier score by comparing the geodesic distances with the distances in the reduced space. We test these algorithms on real and synthetic datasets and show that they either outperform or perform on par with other popular outlier detection methods. A main advantage of the percolation method is that it is parameter-free and therefore does not require any training; the IsoMap method, on the other hand, has two integer parameters, and when they are appropriately selected, the method performs similarly to or better than all the other methods tested.
Affiliation: Amil, Pablo. Universitat Politècnica de Catalunya; Spain
Affiliation: Almeira, Nahuel. Universidad Nacional de Córdoba. Facultad de Matemática, Astronomía y Física. Sección Física. Grupo de Teoría de la Materia Condensada; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Córdoba. Instituto de Física Enrique Gaviola. Universidad Nacional de Córdoba. Instituto de Física Enrique Gaviola; Argentina
Affiliation: Masoller, Cristina. Universitat Politècnica de Catalunya; Spain
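Illustrative sketch (not part of the repository record): the percolation idea summarized in the description can be made concrete with a few lines of standard-library Python. The abstract does not state the exact scoring rule, so the rule used here is an assumption: links of the complete distance graph are removed from longest to shortest, and each element is scored by the weight of the link whose removal first separates it from the largest connected component. The names `connected_components` and `percolation_scores` are hypothetical, Euclidean distance stands in for the "meaningful measure of distance", and ties between equally large components are broken arbitrarily. The approach recomputes components after every removal, so it is only suitable for small examples.

```python
import math
from itertools import combinations


def connected_components(n, adj):
    """Return the connected components of a graph given as adjacency sets."""
    seen, comps = set(), []
    for start in range(n):
        if start in seen:
            continue
        comp, stack = {start}, [start]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v not in comp:
                    comp.add(v)
                    stack.append(v)
        seen |= comp
        comps.append(comp)
    return comps


def percolation_scores(points):
    """Assumed scoring rule: weight of the link whose removal first
    detaches a node from the largest component."""
    n = len(points)
    # Complete graph: one weighted link per pair, sorted longest first.
    edges = sorted(
        ((math.dist(points[i], points[j]), i, j)
         for i, j in combinations(range(n), 2)),
        reverse=True,
    )
    adj = [set(range(n)) - {i} for i in range(n)]
    scores = [None] * n
    for weight, i, j in edges:
        adj[i].discard(j)
        adj[j].discard(i)
        # Ties between equally large components are broken arbitrarily.
        giant = max(connected_components(n, adj), key=len)
        for node in range(n):
            if scores[node] is None and node not in giant:
                scores[node] = weight
    # Nodes never detached receive the weight of the shortest link.
    last_weight = edges[-1][0] if edges else 0.0
    return [last_weight if s is None else s for s in scores]


# Four points on a unit square plus one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = percolation_scores(points)
print(scores)
```

In this toy dataset the distant point detaches from the rest while the links are still long, so it receives the highest score, whereas the four clustered points only separate once the unit-length links are removed. No threshold or neighbourhood size has to be chosen, which is consistent with the parameter-free property claimed in the description.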
Subject
ANOMALY DETECTION
COMPLEX NETWORKS
MACHINE LEARNING
OUTLIER MINING
PERCOLATION
SUPERVISED LEARNING
UNSUPERVISED LEARNING
Access level
open access
Terms of use
https://creativecommons.org/licenses/by/2.5/ar/
Repository
CONICET Digital (CONICET)
Institution
Consejo Nacional de Investigaciones Científicas y Técnicas
OAI identifier
oai:ri.conicet.gov.ar:11336/125807

id CONICETDig_8a9606b969e9aeec91699ba0f666402a
oai_identifier_str oai:ri.conicet.gov.ar:11336/125807
network_acronym_str CONICETDig
repository_id_str 3498
network_name_str CONICET Digital (CONICET)
dc.title.none.fl_str_mv Outlier Mining Methods Based on Graph Structure Analysis
dc.creator.none.fl_str_mv Amil, Pablo
Almeira, Nahuel
Masoller, Cristina
dc.subject.none.fl_str_mv ANOMALY DETECTION
COMPLEX NETWORKS
MACHINE LEARNING
OUTLIER MINING
PERCOLATION
SUPERVISED LEARNING
UNSUPERVISED LEARNING
purl_subject.fl_str_mv https://purl.org/becyt/ford/1.2
https://purl.org/becyt/ford/1
publishDate 2019
dc.date.none.fl_str_mv 2019-11-26
dc.type.none.fl_str_mv info:eu-repo/semantics/article
info:eu-repo/semantics/publishedVersion
http://purl.org/coar/resource_type/c_6501
info:ar-repo/semantics/articulo
format article
status_str publishedVersion
dc.identifier.none.fl_str_mv http://hdl.handle.net/11336/125807
Amil, Pablo; Almeira, Nahuel; Masoller, Cristina; Outlier Mining Methods Based on Graph Structure Analysis; Frontiers Media S.A.; Frontiers in Physics; 7; 194; 26-11-2019; 1-11
2296-424X
CONICET Digital
CONICET
url http://hdl.handle.net/11336/125807
dc.language.none.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv info:eu-repo/semantics/altIdentifier/url/https://www.frontiersin.org/article/10.3389/fphy.2019.00194/full
info:eu-repo/semantics/altIdentifier/doi/10.3389/fphy.2019.00194
dc.rights.none.fl_str_mv info:eu-repo/semantics/openAccess
https://creativecommons.org/licenses/by/2.5/ar/
eu_rights_str_mv openAccess
rights_invalid_str_mv https://creativecommons.org/licenses/by/2.5/ar/
dc.format.none.fl_str_mv application/pdf
application/pdf
dc.publisher.none.fl_str_mv Frontiers Media S.A.
publisher.none.fl_str_mv Frontiers Media S.A.
dc.source.none.fl_str_mv reponame:CONICET Digital (CONICET)
instname:Consejo Nacional de Investigaciones Científicas y Técnicas
reponame_str CONICET Digital (CONICET)
collection CONICET Digital (CONICET)
instname_str Consejo Nacional de Investigaciones Científicas y Técnicas
repository.name.fl_str_mv CONICET Digital (CONICET) - Consejo Nacional de Investigaciones Científicas y Técnicas
repository.mail.fl_str_mv dasensio@conicet.gov.ar; lcarlino@conicet.gov.ar