Clustering gene expression data with a penalized graph-based metric
- Authors
- Baya, Ariel Emilio; Granitto, Pablo Miguel
- Year of publication
- 2011
- Language
- English
- Resource type
- article
- Status
- published version
- Description
- Background: The search for cluster structure in microarray datasets is a basic problem for the so-called “-omic sciences”. A difficult problem in clustering is how to handle data with a manifold structure, i.e. data that is not shaped as compact clouds of points but instead forms arbitrary shapes or paths embedded in a high-dimensional space, as could be the case for some gene expression datasets. Results: In this work we introduce the Penalized k-Nearest-Neighbor-Graph (PKNNG) based metric, a new tool for evaluating distances in such cases. The new metric can be used in combination with most clustering algorithms. The PKNNG metric is based on a two-step procedure: first it constructs the k-Nearest-Neighbor-Graph of the dataset of interest using a low k-value, and then it adds edges with a highly penalized weight to connect the subgraphs produced by the first step. We discuss several possible schemes for connecting the different subgraphs, as well as penalization functions. We show clustering results on several public gene expression datasets and simulated artificial problems to evaluate the behavior of the new metric. Conclusions: In all cases the PKNNG metric shows promising clustering results. The use of the PKNNG metric can improve the performance of commonly used pairwise-distance-based clustering methods to the level of more advanced algorithms. A great advantage of the new procedure is that researchers do not need to learn a new method: they can simply compute distances with the PKNNG metric and then, for example, use hierarchical clustering to produce an accurate and highly interpretable dendrogram of their high-dimensional data.
Affiliation: Baya, Ariel Emilio. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Rosario. Centro Internacional Franco Argentino de Ciencias de la Información y Sistemas; Argentina. Universidad Nacional de Rosario; Argentina
Affiliation: Granitto, Pablo Miguel. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Rosario. Centro Internacional Franco Argentino de Ciencias de la Información y Sistemas; Argentina. Universidad Nacional de Rosario; Argentina
- Subject
- CLUSTERING; ISOMAP; GENE EXPRESSION
- Access level
- open access
- Terms of use
- https://creativecommons.org/licenses/by-nc-sa/2.5/ar/
- Repository
- Institution
- Consejo Nacional de Investigaciones Científicas y Técnicas
- OAI identifier
- oai:ri.conicet.gov.ar:11336/15184
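The two-step procedure summarized in the description can be illustrated with a minimal Python sketch. This is not the authors' implementation. The record does not say which connection scheme or penalization function is used, so the sketch assumes one option for each: every pair of disconnected subgraphs is linked through its closest pair of points, and the new edge of length d gets the weight d * exp(d / mu), with mu set to the mean k-NN edge length. The function name `pknng_distances`, the parameters `k` and `mu`, and the choice of average linkage are all hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components, shortest_path
from scipy.spatial.distance import cdist, squareform


def pknng_distances(X, k=3, mu=None):
    """Penalized k-NN-graph distances (illustrative sketch, assumed details)."""
    n = X.shape[0]
    D = cdist(X, X)  # Euclidean distances between all points

    # Step 1: symmetric k-NN graph with a low k (0 means "no edge").
    # Assumes no duplicate points, so each point is its own nearest neighbor.
    nn = np.argsort(D, axis=1)[:, 1:k + 1]
    rows = np.repeat(np.arange(n), k)
    W = np.zeros((n, n))
    W[rows, nn.ravel()] = D[rows, nn.ravel()]
    W = np.maximum(W, W.T)

    # Step 2: link the disconnected subgraphs with heavily penalized edges.
    n_comp, comp = connected_components(csr_matrix(W), directed=False)
    if mu is None:
        mu = W[W > 0].mean()  # assumed scale: mean k-NN edge length
    for a in range(n_comp):
        for b in range(a + 1, n_comp):
            ia, ib = np.where(comp == a)[0], np.where(comp == b)[0]
            sub = D[np.ix_(ia, ib)]
            i, j = np.unravel_index(np.argmin(sub), sub.shape)
            d = sub[i, j]
            # Assumed exponential penalty on the connecting edge.
            W[ia[i], ib[j]] = W[ib[j], ia[i]] = d * np.exp(d / mu)

    # The metric is the shortest-path length in the penalized graph.
    P = shortest_path(csr_matrix(W), method="D", directed=False)
    return (P + P.T) / 2  # remove tiny floating-point asymmetries


# Toy data: two parallel chains of points that the k-NN graph keeps apart.
t = np.linspace(0.0, 10.0, 30)
X = np.vstack([np.column_stack([t, np.zeros(30)]),
               np.column_stack([t, np.full(30, 3.0)])])
P = pknng_distances(X, k=3)
Z = linkage(squareform(P, checks=False), method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
```

Because the result is an ordinary pairwise distance matrix, it can be passed to any distance-based clustering routine; hierarchical clustering is used here only because the description mentions it as an example.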
Full record metadata
id |
CONICETDig_bb095ca2d376be60a5ecce4e99eed947 |
---|---|
oai_identifier_str |
oai:ri.conicet.gov.ar:11336/15184 |
network_acronym_str |
CONICETDig |
repository_id_str |
3498 |
network_name_str |
CONICET Digital (CONICET) |
dc.title.none.fl_str_mv |
Clustering gene expression data with a penalized graph-based metric |
dc.creator.none.fl_str_mv |
Baya, Ariel Emilio Granitto, Pablo Miguel |
author |
Baya, Ariel Emilio |
author_facet |
Baya, Ariel Emilio Granitto, Pablo Miguel |
author_role |
author |
author2 |
Granitto, Pablo Miguel |
author2_role |
author |
dc.subject.none.fl_str_mv |
CLUSTERING ISOMAP GENE EXPRESSION |
topic |
CLUSTERING ISOMAP GENE EXPRESSION |
purl_subject.fl_str_mv |
https://purl.org/becyt/ford/1.2 https://purl.org/becyt/ford/1 |
dc.description.none.fl_txt_mv |
Background: The search for cluster structure in microarray datasets is a basic problem for the so-called “-omic sciences”. A difficult problem in clustering is how to handle data with a manifold structure, i.e. data that is not shaped as compact clouds of points but instead forms arbitrary shapes or paths embedded in a high-dimensional space, as could be the case for some gene expression datasets. Results: In this work we introduce the Penalized k-Nearest-Neighbor-Graph (PKNNG) based metric, a new tool for evaluating distances in such cases. The new metric can be used in combination with most clustering algorithms. The PKNNG metric is based on a two-step procedure: first it constructs the k-Nearest-Neighbor-Graph of the dataset of interest using a low k-value, and then it adds edges with a highly penalized weight to connect the subgraphs produced by the first step. We discuss several possible schemes for connecting the different subgraphs, as well as penalization functions. We show clustering results on several public gene expression datasets and simulated artificial problems to evaluate the behavior of the new metric. Conclusions: In all cases the PKNNG metric shows promising clustering results. The use of the PKNNG metric can improve the performance of commonly used pairwise-distance-based clustering methods to the level of more advanced algorithms. A great advantage of the new procedure is that researchers do not need to learn a new method: they can simply compute distances with the PKNNG metric and then, for example, use hierarchical clustering to produce an accurate and highly interpretable dendrogram of their high-dimensional data. Affiliation: Baya, Ariel Emilio. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Rosario. Centro Internacional Franco Argentino de Ciencias de la Información y Sistemas; Argentina. Universidad Nacional de Rosario; Argentina. Affiliation: Granitto, Pablo Miguel. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Rosario. Centro Internacional Franco Argentino de Ciencias de la Información y Sistemas; Argentina. Universidad Nacional de Rosario; Argentina |
publishDate |
2011 |
dc.date.none.fl_str_mv |
2011-01 |
dc.type.none.fl_str_mv |
info:eu-repo/semantics/article info:eu-repo/semantics/publishedVersion http://purl.org/coar/resource_type/c_6501 info:ar-repo/semantics/articulo |
format |
article |
status_str |
publishedVersion |
dc.identifier.none.fl_str_mv |
http://hdl.handle.net/11336/15184 Baya, Ariel Emilio; Granitto, Pablo Miguel; Clustering gene expression data with a penalized graph-based metric; Biomed Central; Bmc Bioinformatics; 12; 2; 1-2011; 1-18 1471-2105 |
url |
http://hdl.handle.net/11336/15184 |
identifier_str_mv |
Baya, Ariel Emilio; Granitto, Pablo Miguel; Clustering gene expression data with a penalized graph-based metric; Biomed Central; Bmc Bioinformatics; 12; 2; 1-2011; 1-18 1471-2105 |
dc.language.none.fl_str_mv |
eng |
language |
eng |
dc.relation.none.fl_str_mv |
info:eu-repo/semantics/altIdentifier/doi/10.1186/1471-2105-12-2 info:eu-repo/semantics/altIdentifier/url/https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-12-2 info:eu-repo/semantics/altIdentifier/url/https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3023695/ |
dc.rights.none.fl_str_mv |
info:eu-repo/semantics/openAccess https://creativecommons.org/licenses/by-nc-sa/2.5/ar/ |
eu_rights_str_mv |
openAccess |
rights_invalid_str_mv |
https://creativecommons.org/licenses/by-nc-sa/2.5/ar/ |
dc.format.none.fl_str_mv |
application/pdf application/pdf application/pdf application/pdf |
dc.publisher.none.fl_str_mv |
Biomed Central |
publisher.none.fl_str_mv |
Biomed Central |
dc.source.none.fl_str_mv |
reponame:CONICET Digital (CONICET) instname:Consejo Nacional de Investigaciones Científicas y Técnicas |
reponame_str |
CONICET Digital (CONICET) |
collection |
CONICET Digital (CONICET) |
instname_str |
Consejo Nacional de Investigaciones Científicas y Técnicas |
repository.name.fl_str_mv |
CONICET Digital (CONICET) - Consejo Nacional de Investigaciones Científicas y Técnicas |
repository.mail.fl_str_mv |
dasensio@conicet.gov.ar; lcarlino@conicet.gov.ar |