Assessing dissimilarity measures for sample-based hierarchical clustering of RNA sequencing data using plasmode datasets
- Authors
- Reeb, Pablo D.; Bramardi, Sergio Jorge; Steibel, Juan P.
- Year of publication
- 2015
- Language
- English
- Resource type
- article
- Status
- published version
- Description
- Sample- and gene-based hierarchical cluster analyses have been widely adopted as tools for exploring gene expression data in high-throughput experiments. Gene expression values (read counts) generated by RNA sequencing technology (RNA-seq) are discrete variables with special statistical properties, such as over-dispersion and right-skewness. Additionally, read counts are subject to technological artifacts such as differences in sequencing depth. This poses a challenge for finding distance measures suitable for hierarchical clustering. Normalization and transformation procedures have been proposed to favor the use of Euclidean and correlation-based distances. Novel model-based dissimilarities that account for RNA-seq data characteristics have also been proposed. The adequacy of dissimilarity measures has been assessed using parametric simulations or exemplar datasets, which may limit the scope of the conclusions. Here, we propose the simulation of realistic conditions through the creation of plasmode datasets to assess the adequacy of dissimilarity measures for sample-based hierarchical clustering of RNA-seq data. Consistent results were obtained using plasmode datasets based on RNA-seq experiments conducted under widely different conditions. Dissimilarity measures based on Euclidean distance that only considered data normalization or data standardization did not reliably represent the expected hierarchical structure. Conversely, using a Poisson-based dissimilarity, a rank-correlation-based dissimilarity, or an appropriate data transformation resulted in dendrograms that resembled the expected hierarchical structure. Plasmode datasets can be generated for a wide range of scenarios upon which dissimilarity measures can be evaluated for sample-based hierarchical clustering analysis.
We showed different ways of generating such plasmodes and applied them to the problem of selecting a suitable dissimilarity measure. We report several satisfactory measures; the choice of a particular measure may depend on its availability in the preferred software pipeline.
Facultad de Ciencias Agrarias y Forestales
- Subject
- Agricultural Sciences (Ciencias Agrarias)
RNA
Transcriptome
Novo transcriptome
- Accessibility level
- open access
- Terms of use
- http://creativecommons.org/licenses/by/4.0/
- Repository
- Institution
- Universidad Nacional de La Plata
- OAI identifier
- oai:sedici.unlp.edu.ar:10915/86218
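The description above contrasts Euclidean distance on raw or merely normalized counts with rank-correlation-based dissimilarities and transformed data. The following is a minimal Python sketch of that contrast, not the authors' implementation: the toy counts, the function names, and the log2 counts-per-million style transform with a 0.5 pseudo-count are all assumptions made for illustration, and the Poisson-based dissimilarity mentioned in the description is not shown.

```python
import math


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_dissimilarity(x, y):
    """1 - Spearman correlation: 0 when both samples rank genes identically."""
    return 1.0 - pearson(ranks(x), ranks(y))


def euclidean(x, y):
    """Plain Euclidean distance between two samples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def log_cpm(counts):
    """Hypothetical log2 counts-per-million style transform with a 0.5 pseudo-count."""
    total = sum(counts)
    return [math.log2((c + 0.5) / (total + 1.0) * 1e6) for c in counts]


# Made-up read counts for five genes (illustration only).
sample_a = [10, 200, 35, 0, 80]
sample_a_deep = [c * 10 for c in sample_a]  # same profile, 10x sequencing depth
sample_b = [150, 20, 5, 60, 90]             # different expression profile

if __name__ == "__main__":
    print("raw Euclidean      a vs a_deep:", euclidean(sample_a, sample_a_deep))
    print("raw Euclidean      a vs b     :", euclidean(sample_a, sample_b))
    print("1 - Spearman       a vs a_deep:", spearman_dissimilarity(sample_a, sample_a_deep))
    print("1 - Spearman       a vs b     :", spearman_dissimilarity(sample_a, sample_b))
    print("log-CPM Euclidean  a vs a_deep:", euclidean(log_cpm(sample_a), log_cpm(sample_a_deep)))
    print("log-CPM Euclidean  a vs b     :", euclidean(log_cpm(sample_a), log_cpm(sample_b)))
```

In this toy case, Euclidean distance on raw counts places `sample_a` closer to `sample_b` than to its own deeper-sequenced copy, whereas the rank-based dissimilarity and the transformed data do not. A matrix of such pairwise dissimilarities would then be passed to a hierarchical clustering routine to build the dendrogram.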
View the full record metadata
id |
SEDICI_00b10573af411468ed703e70788dae7f |
---|---|
oai_identifier_str |
oai:sedici.unlp.edu.ar:10915/86218 |
network_acronym_str |
SEDICI |
repository_id_str |
1329 |
network_name_str |
SEDICI (UNLP) |
dc.title.none.fl_str_mv |
Assessing dissimilarity measures for sample-based hierarchical clustering of RNA sequencing data using plasmode datasets |
dc.creator.none.fl_str_mv |
Reeb, Pablo D.; Bramardi, Sergio Jorge; Steibel, Juan P. |
dc.subject.none.fl_str_mv |
Ciencias Agrarias; RNA; Transcriptome; Novo transcriptome |
publishDate |
2015 |
dc.date.none.fl_str_mv |
2015 |
dc.type.none.fl_str_mv |
info:eu-repo/semantics/article; info:eu-repo/semantics/publishedVersion; Articulo; http://purl.org/coar/resource_type/c_6501; info:ar-repo/semantics/articulo |
format |
article |
status_str |
publishedVersion |
dc.identifier.none.fl_str_mv |
http://sedici.unlp.edu.ar/handle/10915/86218 |
url |
http://sedici.unlp.edu.ar/handle/10915/86218 |
dc.language.none.fl_str_mv |
eng |
language |
eng |
dc.relation.none.fl_str_mv |
info:eu-repo/semantics/altIdentifier/issn/1932-6203; info:eu-repo/semantics/altIdentifier/doi/10.1371/journal.pone.0132310 |
dc.rights.none.fl_str_mv |
info:eu-repo/semantics/openAccess; http://creativecommons.org/licenses/by/4.0/; Creative Commons Attribution 4.0 International (CC BY 4.0) |
eu_rights_str_mv |
openAccess |
rights_invalid_str_mv |
http://creativecommons.org/licenses/by/4.0/; Creative Commons Attribution 4.0 International (CC BY 4.0) |
dc.format.none.fl_str_mv |
application/pdf |
dc.source.none.fl_str_mv |
reponame:SEDICI (UNLP); instname:Universidad Nacional de La Plata; instacron:UNLP |
reponame_str |
SEDICI (UNLP) |
collection |
SEDICI (UNLP) |
instname_str |
Universidad Nacional de La Plata |
instacron_str |
UNLP |
institution |
UNLP |
repository.name.fl_str_mv |
SEDICI (UNLP) - Universidad Nacional de La Plata |
repository.mail.fl_str_mv |
alira@sedici.unlp.edu.ar |