Assessing dissimilarity measures for sample-based hierarchical clustering of RNA sequencing data using plasmode datasets

Autores
Reeb, Pablo D.; Bramardi, Sergio Jorge; Steibel, Juan P.
Año de publicación
2015
Idioma
inglés
Tipo de recurso
artículo
Estado
versión publicada
Descripción
Sample- and gene-based hierarchical cluster analyses have been widely adopted as tools for exploring gene expression data in high-throughput experiments. Gene expression values (read counts) generated by RNA sequencing technology (RNA-seq) are discrete variables with special statistical properties, such as over-dispersion and right-skewness. Additionally, read counts are subject to technology artifacts as differences in sequencing depth. This possesses a challenge to finding distance measures suitable for hierarchical clustering. Normalization and transformation procedures have been proposed to favor the use of Euclidean and correlation based distances. Additionally, novel model-based dissimilarities that account for RNA-seq data characteristics have also been proposed. Adequacy of dissimilarity measures has been assessed using parametric simulations or exemplar datasets that may limit the scope of the conclusions. Here, we propose the simulation of realistic conditions through creation of plasmode datasets, to assess the adequacy of dissimilarity measures for sample-based hierarchical clustering of RNA-seq data. Consistent results were obtained using plasmode datasets based on RNA-seq experiments conducted under widely different conditions. Dissimilarity measures based on Euclidean distance that only considered data normalization or data standardization were not reliable to represent the expected hierarchical structure. Conversely, using either a Poisson-based dissimilarity or a rank correlation based dissimilarity or an appropriate data transformation, resulted in dendrograms that resemble the expected hierarchical structure. Plasmode datasets can be generated for a wide range of scenarios upon which dissimilarity measures can be evaluated for sample-based hierarchical clustering analysis. We showed different ways of generating such plasmodes and applied them to the problem of selecting a suitable dissimilarity measure.We report several measures that are satisfactory and the choice of a particular measure may rely on the availability on the software pipeline of preference.
Facultad de Ciencias Agrarias y Forestales
Materia
Ciencias Agrarias
RNA
Transcriptome
Novo transcriptome
Nivel de accesibilidad
acceso abierto
Condiciones de uso
http://creativecommons.org/licenses/by/4.0/
Repositorio
SEDICI (UNLP)
Institución
Universidad Nacional de La Plata
OAI Identificador
oai:sedici.unlp.edu.ar:10915/86218

id SEDICI_00b10573af411468ed703e70788dae7f
oai_identifier_str oai:sedici.unlp.edu.ar:10915/86218
network_acronym_str SEDICI
repository_id_str 1329
network_name_str SEDICI (UNLP)
spelling Assessing dissimilarity measures for sample-based hierarchical clustering of RNA sequencing data using plasmode datasetsReeb, Pablo D.Bramardi, Sergio JorgeSteibel, Juan P.Ciencias AgrariasRNATranscriptomeNovo transcriptomeSample- and gene-based hierarchical cluster analyses have been widely adopted as tools for exploring gene expression data in high-throughput experiments. Gene expression values (read counts) generated by RNA sequencing technology (RNA-seq) are discrete variables with special statistical properties, such as over-dispersion and right-skewness. Additionally, read counts are subject to technology artifacts as differences in sequencing depth. This possesses a challenge to finding distance measures suitable for hierarchical clustering. Normalization and transformation procedures have been proposed to favor the use of Euclidean and correlation based distances. Additionally, novel model-based dissimilarities that account for RNA-seq data characteristics have also been proposed. Adequacy of dissimilarity measures has been assessed using parametric simulations or exemplar datasets that may limit the scope of the conclusions. Here, we propose the simulation of realistic conditions through creation of plasmode datasets, to assess the adequacy of dissimilarity measures for sample-based hierarchical clustering of RNA-seq data. Consistent results were obtained using plasmode datasets based on RNA-seq experiments conducted under widely different conditions. Dissimilarity measures based on Euclidean distance that only considered data normalization or data standardization were not reliable to represent the expected hierarchical structure. Conversely, using either a Poisson-based dissimilarity or a rank correlation based dissimilarity or an appropriate data transformation, resulted in dendrograms that resemble the expected hierarchical structure. Plasmode datasets can be generated for a wide range of scenarios upon which dissimilarity measures can be evaluated for sample-based hierarchical clustering analysis. We showed different ways of generating such plasmodes and applied them to the problem of selecting a suitable dissimilarity measure.We report several measures that are satisfactory and the choice of a particular measure may rely on the availability on the software pipeline of preference.Facultad de Ciencias Agrarias y Forestales2015info:eu-repo/semantics/articleinfo:eu-repo/semantics/publishedVersionArticulohttp://purl.org/coar/resource_type/c_6501info:ar-repo/semantics/articuloapplication/pdfhttp://sedici.unlp.edu.ar/handle/10915/86218enginfo:eu-repo/semantics/altIdentifier/issn/1932-6203info:eu-repo/semantics/altIdentifier/doi/10.1371/journal.pone.0132310info:eu-repo/semantics/openAccesshttp://creativecommons.org/licenses/by/4.0/Creative Commons Attribution 4.0 International (CC BY 4.0)reponame:SEDICI (UNLP)instname:Universidad Nacional de La Platainstacron:UNLP2025-09-10T12:19:34Zoai:sedici.unlp.edu.ar:10915/86218Institucionalhttp://sedici.unlp.edu.ar/Universidad públicaNo correspondehttp://sedici.unlp.edu.ar/oai/snrdalira@sedici.unlp.edu.arArgentinaNo correspondeNo correspondeNo correspondeopendoar:13292025-09-10 12:19:34.553SEDICI (UNLP) - Universidad Nacional de La Platafalse
dc.title.none.fl_str_mv Assessing dissimilarity measures for sample-based hierarchical clustering of RNA sequencing data using plasmode datasets
title Assessing dissimilarity measures for sample-based hierarchical clustering of RNA sequencing data using plasmode datasets
spellingShingle Assessing dissimilarity measures for sample-based hierarchical clustering of RNA sequencing data using plasmode datasets
Reeb, Pablo D.
Ciencias Agrarias
RNA
Transcriptome
Novo transcriptome
title_short Assessing dissimilarity measures for sample-based hierarchical clustering of RNA sequencing data using plasmode datasets
title_full Assessing dissimilarity measures for sample-based hierarchical clustering of RNA sequencing data using plasmode datasets
title_fullStr Assessing dissimilarity measures for sample-based hierarchical clustering of RNA sequencing data using plasmode datasets
title_full_unstemmed Assessing dissimilarity measures for sample-based hierarchical clustering of RNA sequencing data using plasmode datasets
title_sort Assessing dissimilarity measures for sample-based hierarchical clustering of RNA sequencing data using plasmode datasets
dc.creator.none.fl_str_mv Reeb, Pablo D.
Bramardi, Sergio Jorge
Steibel, Juan P.
author Reeb, Pablo D.
author_facet Reeb, Pablo D.
Bramardi, Sergio Jorge
Steibel, Juan P.
author_role author
author2 Bramardi, Sergio Jorge
Steibel, Juan P.
author2_role author
author
dc.subject.none.fl_str_mv Ciencias Agrarias
RNA
Transcriptome
Novo transcriptome
topic Ciencias Agrarias
RNA
Transcriptome
Novo transcriptome
dc.description.none.fl_txt_mv Sample- and gene-based hierarchical cluster analyses have been widely adopted as tools for exploring gene expression data in high-throughput experiments. Gene expression values (read counts) generated by RNA sequencing technology (RNA-seq) are discrete variables with special statistical properties, such as over-dispersion and right-skewness. Additionally, read counts are subject to technology artifacts as differences in sequencing depth. This possesses a challenge to finding distance measures suitable for hierarchical clustering. Normalization and transformation procedures have been proposed to favor the use of Euclidean and correlation based distances. Additionally, novel model-based dissimilarities that account for RNA-seq data characteristics have also been proposed. Adequacy of dissimilarity measures has been assessed using parametric simulations or exemplar datasets that may limit the scope of the conclusions. Here, we propose the simulation of realistic conditions through creation of plasmode datasets, to assess the adequacy of dissimilarity measures for sample-based hierarchical clustering of RNA-seq data. Consistent results were obtained using plasmode datasets based on RNA-seq experiments conducted under widely different conditions. Dissimilarity measures based on Euclidean distance that only considered data normalization or data standardization were not reliable to represent the expected hierarchical structure. Conversely, using either a Poisson-based dissimilarity or a rank correlation based dissimilarity or an appropriate data transformation, resulted in dendrograms that resemble the expected hierarchical structure. Plasmode datasets can be generated for a wide range of scenarios upon which dissimilarity measures can be evaluated for sample-based hierarchical clustering analysis. We showed different ways of generating such plasmodes and applied them to the problem of selecting a suitable dissimilarity measure.We report several measures that are satisfactory and the choice of a particular measure may rely on the availability on the software pipeline of preference.
Facultad de Ciencias Agrarias y Forestales
description Sample- and gene-based hierarchical cluster analyses have been widely adopted as tools for exploring gene expression data in high-throughput experiments. Gene expression values (read counts) generated by RNA sequencing technology (RNA-seq) are discrete variables with special statistical properties, such as over-dispersion and right-skewness. Additionally, read counts are subject to technology artifacts as differences in sequencing depth. This possesses a challenge to finding distance measures suitable for hierarchical clustering. Normalization and transformation procedures have been proposed to favor the use of Euclidean and correlation based distances. Additionally, novel model-based dissimilarities that account for RNA-seq data characteristics have also been proposed. Adequacy of dissimilarity measures has been assessed using parametric simulations or exemplar datasets that may limit the scope of the conclusions. Here, we propose the simulation of realistic conditions through creation of plasmode datasets, to assess the adequacy of dissimilarity measures for sample-based hierarchical clustering of RNA-seq data. Consistent results were obtained using plasmode datasets based on RNA-seq experiments conducted under widely different conditions. Dissimilarity measures based on Euclidean distance that only considered data normalization or data standardization were not reliable to represent the expected hierarchical structure. Conversely, using either a Poisson-based dissimilarity or a rank correlation based dissimilarity or an appropriate data transformation, resulted in dendrograms that resemble the expected hierarchical structure. Plasmode datasets can be generated for a wide range of scenarios upon which dissimilarity measures can be evaluated for sample-based hierarchical clustering analysis. We showed different ways of generating such plasmodes and applied them to the problem of selecting a suitable dissimilarity measure.We report several measures that are satisfactory and the choice of a particular measure may rely on the availability on the software pipeline of preference.
publishDate 2015
dc.date.none.fl_str_mv 2015
dc.type.none.fl_str_mv info:eu-repo/semantics/article
info:eu-repo/semantics/publishedVersion
Articulo
http://purl.org/coar/resource_type/c_6501
info:ar-repo/semantics/articulo
format article
status_str publishedVersion
dc.identifier.none.fl_str_mv http://sedici.unlp.edu.ar/handle/10915/86218
url http://sedici.unlp.edu.ar/handle/10915/86218
dc.language.none.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv info:eu-repo/semantics/altIdentifier/issn/1932-6203
info:eu-repo/semantics/altIdentifier/doi/10.1371/journal.pone.0132310
dc.rights.none.fl_str_mv info:eu-repo/semantics/openAccess
http://creativecommons.org/licenses/by/4.0/
Creative Commons Attribution 4.0 International (CC BY 4.0)
eu_rights_str_mv openAccess
rights_invalid_str_mv http://creativecommons.org/licenses/by/4.0/
Creative Commons Attribution 4.0 International (CC BY 4.0)
dc.format.none.fl_str_mv application/pdf
dc.source.none.fl_str_mv reponame:SEDICI (UNLP)
instname:Universidad Nacional de La Plata
instacron:UNLP
reponame_str SEDICI (UNLP)
collection SEDICI (UNLP)
instname_str Universidad Nacional de La Plata
instacron_str UNLP
institution UNLP
repository.name.fl_str_mv SEDICI (UNLP) - Universidad Nacional de La Plata
repository.mail.fl_str_mv alira@sedici.unlp.edu.ar
_version_ 1842904186497269760
score 12.993085