Assessing dissimilarity measures for sample-based hierarchical clustering of RNA sequencing data using plasmode datasets

Autores: Reeb, Pablo D.; Bramardi, Sergio Jorge; Steibel, Juan P.
Año de publicación: 2015
Idioma: inglés
Tipo de recurso: artículo
Estado: versión publicada
Descripción: Sample- and gene-based hierarchical cluster analyses have been widely adopted as tools for exploring gene expression data in high-throughput experiments. Gene expression values (read counts) generated by RNA sequencing technology (RNA-seq) are discrete variables with special statistical properties, such as over-dispersion and right-skewness. Additionally, read counts are subject to technology artifacts as differences in sequencing depth. This possesses a challenge to finding distance measures suitable for hierarchical clustering. Normalization and transformation procedures have been proposed to favor the use of Euclidean and correlation based distances. Additionally, novel model-based dissimilarities that account for RNA-seq data characteristics have also been proposed. Adequacy of dissimilarity measures has been assessed using parametric simulations or exemplar datasets that may limit the scope of the conclusions. Here, we propose the simulation of realistic conditions through creation of plasmode datasets, to assess the adequacy of dissimilarity measures for sample-based hierarchical clustering of RNA-seq data. Consistent results were obtained using plasmode datasets based on RNA-seq experiments conducted under widely different conditions. Dissimilarity measures based on Euclidean distance that only considered data normalization or data standardization were not reliable to represent the expected hierarchical structure. Conversely, using either a Poisson-based dissimilarity or a rank correlation based dissimilarity or an appropriate data transformation, resulted in dendrograms that resemble the expected hierarchical structure. Plasmode datasets can be generated for a wide range of scenarios upon which dissimilarity measures can be evaluated for sample-based hierarchical clustering analysis. We showed different ways of generating such plasmodes and applied them to the problem of selecting a suitable dissimilarity measure.We report several measures that are satisfactory and the choice of a particular measure may rely on the availability on the software pipeline of preference.
Facultad de Ciencias Agrarias y Forestales
Materia: Ciencias Agrarias
RNA
Transcriptome
Novo transcriptome
Nivel de accesibilidad: acceso abierto
Condiciones de uso: http://creativecommons.org/licenses/by/4.0/
Repositorio
Institución: Universidad Nacional de La Plata
OAI Identificador: oai:sedici.unlp.edu.ar:10915/86218

Acceder

id	SEDICI_00b10573af411468ed703e70788dae7f
oai_identifier_str	oai:sedici.unlp.edu.ar:10915/86218
network_acronym_str	SEDICI
repository_id_str	1329
network_name_str	SEDICI (UNLP)
spelling	Assessing dissimilarity measures for sample-based hierarchical clustering of RNA sequencing data using plasmode datasetsReeb, Pablo D.Bramardi, Sergio JorgeSteibel, Juan P.Ciencias AgrariasRNATranscriptomeNovo transcriptomeSample- and gene-based hierarchical cluster analyses have been widely adopted as tools for exploring gene expression data in high-throughput experiments. Gene expression values (read counts) generated by RNA sequencing technology (RNA-seq) are discrete variables with special statistical properties, such as over-dispersion and right-skewness. Additionally, read counts are subject to technology artifacts as differences in sequencing depth. This possesses a challenge to finding distance measures suitable for hierarchical clustering. Normalization and transformation procedures have been proposed to favor the use of Euclidean and correlation based distances. Additionally, novel model-based dissimilarities that account for RNA-seq data characteristics have also been proposed. Adequacy of dissimilarity measures has been assessed using parametric simulations or exemplar datasets that may limit the scope of the conclusions. Here, we propose the simulation of realistic conditions through creation of plasmode datasets, to assess the adequacy of dissimilarity measures for sample-based hierarchical clustering of RNA-seq data. Consistent results were obtained using plasmode datasets based on RNA-seq experiments conducted under widely different conditions. Dissimilarity measures based on Euclidean distance that only considered data normalization or data standardization were not reliable to represent the expected hierarchical structure. Conversely, using either a Poisson-based dissimilarity or a rank correlation based dissimilarity or an appropriate data transformation, resulted in dendrograms that resemble the expected hierarchical structure. Plasmode datasets can be generated for a wide range of scenarios upon which dissimilarity measures can be evaluated for sample-based hierarchical clustering analysis. We showed different ways of generating such plasmodes and applied them to the problem of selecting a suitable dissimilarity measure.We report several measures that are satisfactory and the choice of a particular measure may rely on the availability on the software pipeline of preference.Facultad de Ciencias Agrarias y Forestales2015info:eu-repo/semantics/articleinfo:eu-repo/semantics/publishedVersionArticulohttp://purl.org/coar/resource_type/c_6501info:ar-repo/semantics/articuloapplication/pdfhttp://sedici.unlp.edu.ar/handle/10915/86218enginfo:eu-repo/semantics/altIdentifier/issn/1932-6203info:eu-repo/semantics/altIdentifier/doi/10.1371/journal.pone.0132310info:eu-repo/semantics/openAccesshttp://creativecommons.org/licenses/by/4.0/Creative Commons Attribution 4.0 International (CC BY 4.0)reponame:SEDICI (UNLP)instname:Universidad Nacional de La Platainstacron:UNLP2026-05-27T11:11:07Zoai:sedici.unlp.edu.ar:10915/86218Institucionalhttp://sedici.unlp.edu.ar/Universidad públicaNo correspondehttp://sedici.unlp.edu.ar/oai/snrdalira@sedici.unlp.edu.arArgentinaNo correspondeNo correspondeNo correspondeopendoar:13292026-05-27 11:11:07.69SEDICI (UNLP) - Universidad Nacional de La Platafalse
dc.title.none.fl_str_mv	Assessing dissimilarity measures for sample-based hierarchical clustering of RNA sequencing data using plasmode datasets
title	Assessing dissimilarity measures for sample-based hierarchical clustering of RNA sequencing data using plasmode datasets
spellingShingle	Assessing dissimilarity measures for sample-based hierarchical clustering of RNA sequencing data using plasmode datasets Reeb, Pablo D. Ciencias Agrarias RNA Transcriptome Novo transcriptome
title_short	Assessing dissimilarity measures for sample-based hierarchical clustering of RNA sequencing data using plasmode datasets
title_full	Assessing dissimilarity measures for sample-based hierarchical clustering of RNA sequencing data using plasmode datasets
title_fullStr	Assessing dissimilarity measures for sample-based hierarchical clustering of RNA sequencing data using plasmode datasets
title_full_unstemmed	Assessing dissimilarity measures for sample-based hierarchical clustering of RNA sequencing data using plasmode datasets
title_sort	Assessing dissimilarity measures for sample-based hierarchical clustering of RNA sequencing data using plasmode datasets
dc.creator.none.fl_str_mv	Reeb, Pablo D. Bramardi, Sergio Jorge Steibel, Juan P.
author	Reeb, Pablo D.
author_facet	Reeb, Pablo D. Bramardi, Sergio Jorge Steibel, Juan P.
author_role	author
author2	Bramardi, Sergio Jorge Steibel, Juan P.
author2_role	author author
dc.subject.none.fl_str_mv	Ciencias Agrarias RNA Transcriptome Novo transcriptome
topic	Ciencias Agrarias RNA Transcriptome Novo transcriptome
dc.description.none.fl_txt_mv	Sample- and gene-based hierarchical cluster analyses have been widely adopted as tools for exploring gene expression data in high-throughput experiments. Gene expression values (read counts) generated by RNA sequencing technology (RNA-seq) are discrete variables with special statistical properties, such as over-dispersion and right-skewness. Additionally, read counts are subject to technology artifacts as differences in sequencing depth. This possesses a challenge to finding distance measures suitable for hierarchical clustering. Normalization and transformation procedures have been proposed to favor the use of Euclidean and correlation based distances. Additionally, novel model-based dissimilarities that account for RNA-seq data characteristics have also been proposed. Adequacy of dissimilarity measures has been assessed using parametric simulations or exemplar datasets that may limit the scope of the conclusions. Here, we propose the simulation of realistic conditions through creation of plasmode datasets, to assess the adequacy of dissimilarity measures for sample-based hierarchical clustering of RNA-seq data. Consistent results were obtained using plasmode datasets based on RNA-seq experiments conducted under widely different conditions. Dissimilarity measures based on Euclidean distance that only considered data normalization or data standardization were not reliable to represent the expected hierarchical structure. Conversely, using either a Poisson-based dissimilarity or a rank correlation based dissimilarity or an appropriate data transformation, resulted in dendrograms that resemble the expected hierarchical structure. Plasmode datasets can be generated for a wide range of scenarios upon which dissimilarity measures can be evaluated for sample-based hierarchical clustering analysis. We showed different ways of generating such plasmodes and applied them to the problem of selecting a suitable dissimilarity measure.We report several measures that are satisfactory and the choice of a particular measure may rely on the availability on the software pipeline of preference. Facultad de Ciencias Agrarias y Forestales
description	Sample- and gene-based hierarchical cluster analyses have been widely adopted as tools for exploring gene expression data in high-throughput experiments. Gene expression values (read counts) generated by RNA sequencing technology (RNA-seq) are discrete variables with special statistical properties, such as over-dispersion and right-skewness. Additionally, read counts are subject to technology artifacts as differences in sequencing depth. This possesses a challenge to finding distance measures suitable for hierarchical clustering. Normalization and transformation procedures have been proposed to favor the use of Euclidean and correlation based distances. Additionally, novel model-based dissimilarities that account for RNA-seq data characteristics have also been proposed. Adequacy of dissimilarity measures has been assessed using parametric simulations or exemplar datasets that may limit the scope of the conclusions. Here, we propose the simulation of realistic conditions through creation of plasmode datasets, to assess the adequacy of dissimilarity measures for sample-based hierarchical clustering of RNA-seq data. Consistent results were obtained using plasmode datasets based on RNA-seq experiments conducted under widely different conditions. Dissimilarity measures based on Euclidean distance that only considered data normalization or data standardization were not reliable to represent the expected hierarchical structure. Conversely, using either a Poisson-based dissimilarity or a rank correlation based dissimilarity or an appropriate data transformation, resulted in dendrograms that resemble the expected hierarchical structure. Plasmode datasets can be generated for a wide range of scenarios upon which dissimilarity measures can be evaluated for sample-based hierarchical clustering analysis. We showed different ways of generating such plasmodes and applied them to the problem of selecting a suitable dissimilarity measure.We report several measures that are satisfactory and the choice of a particular measure may rely on the availability on the software pipeline of preference.
publishDate	2015
dc.date.none.fl_str_mv	2015
dc.type.none.fl_str_mv	info:eu-repo/semantics/article info:eu-repo/semantics/publishedVersion Articulo http://purl.org/coar/resource_type/c_6501 info:ar-repo/semantics/articulo
format	article
status_str	publishedVersion
dc.identifier.none.fl_str_mv	http://sedici.unlp.edu.ar/handle/10915/86218
url	http://sedici.unlp.edu.ar/handle/10915/86218
dc.language.none.fl_str_mv	eng
language	eng
dc.relation.none.fl_str_mv	info:eu-repo/semantics/altIdentifier/issn/1932-6203 info:eu-repo/semantics/altIdentifier/doi/10.1371/journal.pone.0132310
dc.rights.none.fl_str_mv	info:eu-repo/semantics/openAccess http://creativecommons.org/licenses/by/4.0/ Creative Commons Attribution 4.0 International (CC BY 4.0)
eu_rights_str_mv	openAccess
rights_invalid_str_mv	http://creativecommons.org/licenses/by/4.0/ Creative Commons Attribution 4.0 International (CC BY 4.0)
dc.format.none.fl_str_mv	application/pdf
dc.source.none.fl_str_mv	reponame:SEDICI (UNLP) instname:Universidad Nacional de La Plata instacron:UNLP
reponame_str	SEDICI (UNLP)
collection	SEDICI (UNLP)
instname_str	Universidad Nacional de La Plata
instacron_str	UNLP
institution	UNLP
repository.name.fl_str_mv	SEDICI (UNLP) - Universidad Nacional de La Plata
repository.mail.fl_str_mv	alira@sedici.unlp.edu.ar
_version_	1866371630606843904
score	13.468372

Assessing dissimilarity measures for sample-based hierarchical clustering of RNA sequencing data using plasmode datasets

Publicaciones similares