K-mer counting and curated libraries drive efficient annotation of repeats in plant genomes

Authors: Contreras-Moreira, Bruno; Filippi, Carla Valeria; Naamati, Guy; García Girón, Carlos; Allen, James E.; Flicek, Paul
Publication year: 2021
Language: English
Resource type: article
Status: published version
Description: The annotation of repetitive sequences within plant genomes can help in the interpretation of observed phenotypes. Moreover, repeat masking is required for tasks such as whole-genome alignment, promoter analysis, or pangenome exploration. Although homology-based annotation methods are computationally expensive, k-mer strategies for masking are orders of magnitude faster. Here, we benchmarked a two-step approach, where repeats were first called by k-mer counting and then annotated by comparison to curated libraries. This hybrid protocol was tested on 20 plant genomes from Ensembl, with the k-mer-based Repeat Detector (Red) and two repeat libraries (REdat, last updated in 2013, and nrTEplants, curated for this work). Custom libraries produced by RepeatModeler were also tested. We obtained repeated genome fractions that matched those reported in the literature but with shorter repeated elements than those produced directly by sequence homology. Inspection of the masked regions that overlapped genes revealed no preference for specific protein domains. Most Red-masked sequences could be successfully classified by sequence similarity, with the complete protocol taking less than 2 h on a desktop Linux box. A guide to curating your own repeat libraries and the scripts for masking and annotating plant genomes can be obtained at https://github.com/Ensembl/plant-scripts.
Instituto de Biotecnología
Affiliation: Contreras-Moreira, Bruno. European Bioinformatics Institute. European Molecular Biology Laboratory; United Kingdom
Affiliation: Filippi, Carla Valeria. Instituto Nacional de Tecnología Agropecuaria (INTA). Instituto de Agrobiotecnología y Biología Molecular (IABIMO); Argentina
Affiliation: Filippi, Carla Valeria. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina
Affiliation: Filippi, Carla Valeria. European Bioinformatics Institute. European Molecular Biology Laboratory; United Kingdom
Affiliation: Naamati, Guy. European Bioinformatics Institute. European Molecular Biology Laboratory; United Kingdom
Affiliation: García Girón, Carlos. European Bioinformatics Institute. European Molecular Biology Laboratory; United Kingdom
Affiliation: Allen, James E. European Bioinformatics Institute. European Molecular Biology Laboratory; United Kingdom
Affiliation: Flicek, Paul. European Bioinformatics Institute. European Molecular Biology Laboratory; United Kingdom
Source: The Plant Genome 14 (3) : e20143 (November 2021)
Subjects: Genomes; Plant Genetics; Genetics
Access level: open access
Terms of use: http://creativecommons.org/licenses/by-nc-sa/4.0/
Repository
Institution: Instituto Nacional de Tecnología Agropecuaria
OAI identifier: oai:localhost:20.500.12123/10882


Dates: 2021-12-10T13:45:33Z; 2021-09
Identifiers: http://hdl.handle.net/20.500.12123/10882; https://doi.org/10.1002/tpg2.20143; https://acsess.onlinelibrary.wiley.com/doi/full/10.1002/tpg2.20143; ISSN 1940-3372
Publisher: Wiley
