Comprehensive Analysis of the Influence of Technical and Biological Variations on De Novo Assembly of RNA-Seq Datasets

Authors
González, Sergio Alberto; Rivarola, Máximo; Ribone, Andrés Ignacio; Lew, Sergio Eduardo; Paniego, Norma Beatriz
Publication year
2024
Language
English
Resource type
article
Status
published version
Description
De novo assembly of transcriptomes from species without a reference genome remains a common problem in functional genomics. While methods and algorithms for transcriptome assembly are continually being developed and published, the quality of de novo assemblies using short reads depends on the complexity of the transcriptome and is limited by several types of errors. One problem to overcome is the research gap regarding the best method to use in each study to obtain a high-quality de novo assembly. Currently, there are no established protocols for solving the assembly problem considering the transcriptome complexity. In addition, the accuracy of quality metrics used to evaluate assemblies remains unclear. In this study, we investigate and discuss how different variables accounting for the complexity of RNA-Seq data influence assembly results independently of the software used. For this purpose, we simulated transcriptomic short-read sequence datasets from high-quality full-length predicted transcript models with varying degrees of complexity. Subsequently, we conducted de novo assemblies using different assembly programs, and compared and classified the results using both reference-dependent and reference-independent metrics. These metrics were assessed both individually and combined through multivariate analysis. The degree of alternative splicing and the fragment size of the paired-end reads were identified as the variables with the greatest influence on the assembly results. Moreover, read length and fragment size had different influences on the reconstruction of longer and shorter transcripts. These results underscore the importance of understanding the composition of the transcriptome under study, and making experimental design decisions related to the need to work with reads and fragments of different sizes. In addition, the choice of assembly software will positively impact the final assembly outcome. This selection will affect the completeness of represented genes and assembled isoforms, as well as contribute to error reduction.
Affiliation: González, Sergio Alberto. Instituto Nacional de Tecnología Agropecuaria. Centro de Investigación en Ciencias Veterinarias y Agronómicas. Instituto de Agrobiotecnología y Biología Molecular. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Parque Centenario. Instituto de Agrobiotecnología y Biología Molecular; Argentina
Affiliation: Rivarola, Máximo. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina
Affiliation: Ribone, Andrés Ignacio. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina
Affiliation: Lew, Sergio Eduardo. Universidad de Buenos Aires. Facultad de Ingeniería. Instituto de Ingeniería Biomédica; Argentina
Affiliation: Paniego, Norma Beatriz. Instituto Nacional de Tecnología Agropecuaria. Centro de Investigación en Ciencias Veterinarias y Agronómicas. Instituto de Agrobiotecnología y Biología Molecular. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Parque Centenario. Instituto de Agrobiotecnología y Biología Molecular; Argentina
Subject
DE NOVO ASSEMBLY
SHORT READS
TRANSCRIPTOMICS
EVALUATION METRICS
Access level
open access
Terms of use
https://creativecommons.org/licenses/by-nc/2.5/ar/
Repository
CONICET Digital (CONICET)
Institution
Consejo Nacional de Investigaciones Científicas y Técnicas
OAI identifier
oai:ri.conicet.gov.ar:11336/266240

id CONICETDig_84fac2b67fa60d0fe5f7de614ac07ca6
oai_identifier_str oai:ri.conicet.gov.ar:11336/266240
network_acronym_str CONICETDig
repository_id_str 3498
network_name_str CONICET Digital (CONICET)
dc.title.none.fl_str_mv Comprehensive Analysis of the Influence of Technical and Biological Variations on De Novo Assembly of RNA-Seq Datasets
dc.creator.none.fl_str_mv González, Sergio Alberto
Rivarola, Máximo
Ribone, Andrés Ignacio
Lew, Sergio Eduardo
Paniego, Norma Beatriz
dc.subject.none.fl_str_mv DE NOVO ASSEMBLY
SHORT READS
TRANSCRIPTOMICS
EVALUATION METRICS
purl_subject.fl_str_mv https://purl.org/becyt/ford/1.2
https://purl.org/becyt/ford/1
publishDate 2024
dc.date.none.fl_str_mv 2024-12
dc.type.none.fl_str_mv info:eu-repo/semantics/article
info:eu-repo/semantics/publishedVersion
http://purl.org/coar/resource_type/c_6501
info:ar-repo/semantics/articulo
format article
status_str publishedVersion
dc.identifier.none.fl_str_mv http://hdl.handle.net/11336/266240
González, Sergio Alberto; Rivarola, Máximo; Ribone, Andrés Ignacio; Lew, Sergio Eduardo; Paniego, Norma Beatriz; Comprehensive Analysis of the Influence of Technical and Biological Variations on De Novo Assembly of RNA-Seq Datasets; SAGE Publications; Bioinformatics and Biology Insights; 18; 12-2024; 1-13
1177-9322
1177-9322
CONICET Digital
CONICET
dc.language.none.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv info:eu-repo/semantics/altIdentifier/url/https://journals.sagepub.com/doi/epub/10.1177/11779322241274957
info:eu-repo/semantics/altIdentifier/doi/10.1177/11779322241274957
dc.rights.none.fl_str_mv info:eu-repo/semantics/openAccess
https://creativecommons.org/licenses/by-nc/2.5/ar/
dc.format.none.fl_str_mv application/pdf
application/pdf
dc.publisher.none.fl_str_mv SAGE Publications
dc.source.none.fl_str_mv reponame:CONICET Digital (CONICET)
instname:Consejo Nacional de Investigaciones Científicas y Técnicas
repository.name.fl_str_mv CONICET Digital (CONICET) - Consejo Nacional de Investigaciones Científicas y Técnicas
repository.mail.fl_str_mv dasensio@conicet.gov.ar; lcarlino@conicet.gov.ar