Comprehensive Analysis of the Influence of Technical and Biological Variations on De Novo Assembly of RNA-Seq Datasets

Authors: Gonzalez, Sergio Alberto; Rivarola, Maximo Lisandro; Ribone, Andrés Ignacio; Lew, Sergio Eduardo; Paniego, Norma Beatriz
Publication year: 2024
Language: English
Resource type: article
Status: published version
Description: De novo assembly of transcriptomes from species without a reference genome remains a common problem in functional genomics. While methods and algorithms for transcriptome assembly are continually being developed and published, the quality of de novo assemblies using short reads depends on the complexity of the transcriptome and is limited by several types of errors. One problem to overcome is the research gap regarding the best method to use in each study to obtain a high-quality de novo assembly. Currently, there are no established protocols for solving the assembly problem considering the transcriptome complexity. In addition, the accuracy of quality metrics used to evaluate assemblies remains unclear. In this study, we investigate and discuss how different variables accounting for the complexity of RNA-Seq data influence assembly results independently of the software used. For this purpose, we simulated transcriptomic short-read sequence datasets from high-quality full-length predicted transcript models with varying degrees of complexity. Subsequently, we conducted de novo assemblies using different assembly programs, and compared and classified the results using both reference-dependent and reference-independent metrics. These metrics were assessed both individually and combined through multivariate analysis. The degree of alternative splicing and the fragment size of the paired-end reads were identified as the variables with the greatest influence on the assembly results. Moreover, read length and fragment size had different influences on the reconstruction of longer and shorter transcripts. These results underscore the importance of understanding the composition of the transcriptome under study, and making experimental design decisions related to the need to work with reads and fragments of different sizes. In addition, the choice of assembly software will positively impact the final assembly outcome. This selection will affect the completeness of represented genes and assembled isoforms, as well as contribute to error reduction.
Instituto de Biotecnología
Affiliation: Gonzalez, Sergio Alberto. Instituto Nacional de Tecnología Agropecuaria (INTA). Instituto de Agrobiotecnología y Biología Molecular; Argentina
Affiliation: Gonzalez, Sergio Alberto. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina
Affiliation: Rivarola, Maximo Lisandro. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina
Affiliation: Ribone, Andrés Ignacio. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina
Affiliation: Lew, Sergio Eduardo. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina
Affiliation: Lew, Sergio Eduardo. Universidad de Buenos Aires. Facultad de Ingeniería. Instituto de Ingeniería Biomédica; Argentina
Affiliation: Paniego, Norma Beatriz. Instituto Nacional de Tecnología Agropecuaria (INTA). Instituto de Agrobiotecnología y Biología Molecular; Argentina
Affiliation: Paniego, Norma Beatriz. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina
Source: Bioinformatics and Biology Insights 18 : 1-13 (2024)
Subjects: RNA
Transcriptomics
Genetics
Simulation Models
De Novo Assembly
Access level: open access
Terms of use: http://creativecommons.org/licenses/by-nc-sa/4.0/
Repository
Institution: Instituto Nacional de Tecnología Agropecuaria
OAI identifier: oai:localhost:20.500.12123/21747

id	INTADig_c3ce9e6789b6e47e92d696cc5184ae56
oai_identifier_str	oai:localhost:20.500.12123/21747
network_acronym_str	INTADig
repository_id_str	l
network_name_str	INTA Digital (INTA)
dc.title.none.fl_str_mv	Comprehensive Analysis of the Influence of Technical and Biological Variations on De Novo Assembly of RNA-Seq Datasets
dc.creator.none.fl_str_mv	Gonzalez, Sergio Alberto; Rivarola, Maximo Lisandro; Ribone, Andrés Ignacio; Lew, Sergio Eduardo; Paniego, Norma Beatriz
author	Gonzalez, Sergio Alberto
author_role	author
author2	Rivarola, Maximo Lisandro; Ribone, Andrés Ignacio; Lew, Sergio Eduardo; Paniego, Norma Beatriz
author2_role	author author author author
dc.subject.none.fl_str_mv	RNA; Transcriptomics; Genetics; Simulation Models; De Novo Assembly
publishDate	2024
dc.date.none.fl_str_mv	2024-12 2025-03-20T12:21:22Z 2025-03-20T12:21:22Z
dc.type.none.fl_str_mv	info:eu-repo/semantics/article info:eu-repo/semantics/publishedVersion http://purl.org/coar/resource_type/c_6501 info:ar-repo/semantics/articulo
format	article
status_str	publishedVersion
dc.identifier.none.fl_str_mv	http://hdl.handle.net/20.500.12123/21747 https://journals.sagepub.com/doi/full/10.1177/11779322241274957 1177-9322 https://doi.org/10.1177/11779322241274957
url	http://hdl.handle.net/20.500.12123/21747 https://journals.sagepub.com/doi/full/10.1177/11779322241274957 https://doi.org/10.1177/11779322241274957
identifier_str_mv	1177-9322
dc.language.none.fl_str_mv	eng
language	eng
dc.rights.none.fl_str_mv	info:eu-repo/semantics/openAccess http://creativecommons.org/licenses/by-nc-sa/4.0/ Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
eu_rights_str_mv	openAccess
rights_invalid_str_mv	http://creativecommons.org/licenses/by-nc-sa/4.0/ Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
dc.format.none.fl_str_mv	application/pdf
dc.publisher.none.fl_str_mv	Sage Publications
dc.source.none.fl_str_mv	Bioinformatics and Biology Insights 18 : 1-13 (2024) reponame:INTA Digital (INTA) instname:Instituto Nacional de Tecnología Agropecuaria
reponame_str	INTA Digital (INTA)
collection	INTA Digital (INTA)
instname_str	Instituto Nacional de Tecnología Agropecuaria
repository.name.fl_str_mv	INTA Digital (INTA) - Instituto Nacional de Tecnología Agropecuaria
repository.mail.fl_str_mv	tripaldi.nicolas@inta.gob.ar
