Outlier Mining Methods Based on Graph Structure Analysis
- Authors
- Amil, Pablo; Almeira, Nahuel; Masoller, Cristina
- Publication year
- 2019
- Language
- English
- Resource type
- article
- Status
- published version
- Description
- Outlier detection in high-dimensional datasets is a fundamental and challenging problem across disciplines that also has practical implications, as removing outliers from the training set improves the performance of machine learning algorithms. While many outlier mining algorithms have been proposed in the literature, they tend to be valid or efficient only for specific types of datasets (time series, images, videos, etc.). Here we propose two methods that can be applied to generic datasets, as long as there is a meaningful measure of distance between pairs of elements of the dataset. Both methods start by defining a graph, where the nodes are the elements of the dataset and the links have associated weights that are the distances between the nodes. The first method then assigns an outlier score based on the percolation (i.e., the fragmentation) of the graph. The second method uses the popular IsoMap non-linear dimensionality reduction algorithm and assigns an outlier score by comparing the geodesic distances with the distances in the reduced space. We test these algorithms on real and synthetic datasets and show that they either outperform or perform on par with other popular outlier detection methods. A main advantage of the percolation method is that it is parameter-free and therefore does not require any training; on the other hand, the IsoMap method has two integer parameters, and when they are appropriately selected, the method performs similarly to or better than all the other methods tested.
Affiliation: Amil, Pablo. Universitat Politecnica de Catalunya; Spain
Affiliation: Almeira, Nahuel. Universidad Nacional de Córdoba. Facultad de Matemática, Astronomia y Física. Sección Física. Grupo de Teoria de la Materia Condensada; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Córdoba. Instituto de Física Enrique Gaviola. Universidad Nacional de Córdoba. Instituto de Física Enrique Gaviola; Argentina
Affiliation: Masoller, Cristina. Universitat Politecnica de Catalunya; Spain
- Subject
ANOMALY DETECTION
COMPLEX NETWORKS
MACHINE LEARNING
OUTLIER MINING
PERCOLATION
SUPERVISED LEARNING
UNSUPERVISED LEARNING
- Access level
- open access
- Terms of use
- https://creativecommons.org/licenses/by/2.5/ar/
- Repository
- Institution
- Consejo Nacional de Investigaciones Científicas y Técnicas
- OAI identifier
- oai:ri.conicet.gov.ar:11336/125807
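The percolation idea in the description can be illustrated with a short sketch. This is one plausible reading of the abstract, not the authors' implementation: it assumes Euclidean distance, removes links from longest to shortest, and scores each node with the length of the link whose removal first separated it from the largest connected component. The function names `largest_component` and `percolation_scores` are hypothetical, and ties between equally sized components are broken arbitrarily.

```python
import math
from itertools import combinations


def largest_component(n, edges):
    """Return the node set of the largest connected component."""
    neighbours = {i: set() for i in range(n)}
    for i, j in edges:
        neighbours[i].add(j)
        neighbours[j].add(i)
    seen, best = set(), set()
    for start in range(n):
        if start in seen:
            continue
        # Depth-first traversal of the component containing `start`.
        component, stack = {start}, [start]
        while stack:
            node = stack.pop()
            for nxt in neighbours[node] - component:
                component.add(nxt)
                stack.append(nxt)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


def percolation_scores(points):
    """Assumed scoring rule: the length of the link whose removal
    first detaches a node from the largest component."""
    n = len(points)
    if n < 2:
        return [0.0] * n
    # Complete graph: every pair of elements is linked, weighted by distance.
    dist = {(i, j): math.dist(points[i], points[j])
            for i, j in combinations(range(n), 2)}
    remaining = set(dist)
    scores = [None] * n
    # Fragment the graph by removing links from longest to shortest.
    for edge in sorted(dist, key=dist.get, reverse=True):
        remaining.discard(edge)
        giant = largest_component(n, remaining)
        for node in range(n):
            if scores[node] is None and node not in giant:
                scores[node] = dist[edge]
    # The last surviving node never detaches; give it the shortest length.
    shortest = min(dist.values())
    return [shortest if s is None else s for s in scores]


if __name__ == "__main__":
    # Toy data (made up for illustration): a 3x3 grid plus one distant point.
    grid = [(float(x), float(y)) for x in range(3) for y in range(3)]
    scores = percolation_scores(grid + [(10.0, 10.0)])
    print(scores.index(max(scores)))  # index of the highest-scoring point
```

The sketch recomputes the components after every removal, which is simple to follow but only practical for small datasets. No threshold or neighbourhood size has to be chosen, which matches the parameter-free property the description attributes to the percolation method.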
View the full record metadata
id |
CONICETDig_8a9606b969e9aeec91699ba0f666402a |
---|---|
oai_identifier_str |
oai:ri.conicet.gov.ar:11336/125807 |
network_acronym_str |
CONICETDig |
repository_id_str |
3498 |
network_name_str |
CONICET Digital (CONICET) |
dc.title.none.fl_str_mv |
Outlier Mining Methods Based on Graph Structure Analysis |
dc.creator.none.fl_str_mv |
Amil, Pablo; Almeira, Nahuel; Masoller, Cristina |
dc.subject.none.fl_str_mv |
ANOMALY DETECTION; COMPLEX NETWORKS; MACHINE LEARNING; OUTLIER MINING; PERCOLATION; SUPERVISED LEARNING; UNSUPERVISED LEARNING |
purl_subject.fl_str_mv |
https://purl.org/becyt/ford/1.2 https://purl.org/becyt/ford/1 |
publishDate |
2019 |
dc.date.none.fl_str_mv |
2019-11-26 |
dc.type.none.fl_str_mv |
info:eu-repo/semantics/article info:eu-repo/semantics/publishedVersion http://purl.org/coar/resource_type/c_6501 info:ar-repo/semantics/articulo |
format |
article |
status_str |
publishedVersion |
dc.identifier.none.fl_str_mv |
http://hdl.handle.net/11336/125807 Amil, Pablo; Almeira, Nahuel; Masoller, Cristina; Outlier Mining Methods Based on Graph Structure Analysis; Frontiers Media S.A.; Frontiers in Physics; 7; 194; 26-11-2019; 1-11 2296-424X CONICET Digital CONICET |
url |
http://hdl.handle.net/11336/125807 |
dc.language.none.fl_str_mv |
eng |
dc.relation.none.fl_str_mv |
info:eu-repo/semantics/altIdentifier/url/https://www.frontiersin.org/article/10.3389/fphy.2019.00194/full info:eu-repo/semantics/altIdentifier/doi/10.3389/fphy.2019.00194 |
dc.rights.none.fl_str_mv |
info:eu-repo/semantics/openAccess https://creativecommons.org/licenses/by/2.5/ar/ |
dc.format.none.fl_str_mv |
application/pdf |
dc.publisher.none.fl_str_mv |
Frontiers Media S.A. |
dc.source.none.fl_str_mv |
reponame:CONICET Digital (CONICET) instname:Consejo Nacional de Investigaciones Científicas y Técnicas |
repository.mail.fl_str_mv |
dasensio@conicet.gov.ar; lcarlino@conicet.gov.ar |