K-mer counting and curated libraries drive efficient annotation of repeats in plant genomes

Authors
Contreras-Moreira, Bruno; Filippi, Carla Valeria; Naamati, Guy; García Girón, Carlos; Allen, James E.; Flicek, Paul
Publication year
2021
Language
English
Resource type
article
Status
published version
Description
The annotation of repetitive sequences within plant genomes can help in the interpretation of observed phenotypes. Moreover, repeat masking is required for tasks such as whole-genome alignment, promoter analysis, or pangenome exploration. Although homology-based annotation methods are computationally expensive, k-mer strategies for masking are orders of magnitude faster. Here, we benchmarked a two-step approach, where repeats were first called by k-mer counting and then annotated by comparison to curated libraries. This hybrid protocol was tested on 20 plant genomes from Ensembl, with the k-mer-based Repeat Detector (Red) and two repeat libraries (REdat, last updated in 2013, and nrTEplants, curated for this work). Custom libraries produced by RepeatModeler were also tested. We obtained repeated genome fractions that matched those reported in the literature but with shorter repeated elements than those produced directly by sequence homology. Inspection of the masked regions that overlapped genes revealed no preference for specific protein domains. Most Red-masked sequences could be successfully classified by sequence similarity, with the complete protocol taking less than 2 h on a desktop Linux box. A guide to curating your own repeat libraries and the scripts for masking and annotating plant genomes can be obtained at https://github.com/Ensembl/plant-scripts.
Instituto de Biotecnología
Affiliation: Contreras-Moreira, Bruno. European Bioinformatics Institute. European Molecular Biology Laboratory; United Kingdom
Affiliation: Filippi, Carla Valeria. Instituto Nacional de Tecnología Agropecuaria (INTA). Instituto de Agrobiotecnología y Biología Molecular (IABIMO); Argentina
Affiliation: Filippi, Carla Valeria. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina
Affiliation: Filippi, Carla Valeria. European Bioinformatics Institute. European Molecular Biology Laboratory; United Kingdom
Affiliation: Naamati, Guy. European Bioinformatics Institute. European Molecular Biology Laboratory; United Kingdom
Affiliation: García Girón, Carlos. European Bioinformatics Institute. European Molecular Biology Laboratory; United Kingdom
Affiliation: Allen, James E. European Bioinformatics Institute. European Molecular Biology Laboratory; United Kingdom
Affiliation: Flicek, Paul. European Bioinformatics Institute. European Molecular Biology Laboratory; United Kingdom
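The first step named in the abstract, calling repeats by k-mer counting, can be illustrated with a toy sketch. This is not Red's algorithm and not code from the plant-scripts repository. The function names, the value of k, and the fixed count threshold are all assumptions chosen for illustration. The sketch counts overlapping k-mers in one sequence and soft-masks (lowercases) every position covered by a k-mer that occurs at least `min_count` times.

```python
from collections import Counter


def count_kmers(seq, k):
    """Count every overlapping k-mer in a DNA string (case-insensitive)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def soft_mask(seq, k=4, min_count=3):
    """Lowercase positions covered by a k-mer seen at least min_count times.

    Hypothetical helper: a fixed threshold stands in for the statistical
    model that a real repeat detector would use.
    """
    upper = seq.upper()
    counts = count_kmers(upper, k)
    masked = [False] * len(upper)
    for i in range(len(upper) - k + 1):
        if counts[upper[i:i + k]] >= min_count:
            for j in range(i, i + k):
                masked[j] = True
    return "".join(c.lower() if m else c for c, m in zip(upper, masked))


def masked_fraction(masked_seq):
    """Fraction of the sequence that is soft-masked (lowercase)."""
    if not masked_seq:
        return 0.0
    return sum(c.islower() for c in masked_seq) / len(masked_seq)


if __name__ == "__main__":
    toy = "ACGTACGTACGTTTGCA"
    result = soft_mask(toy, k=4, min_count=3)
    print(result, round(masked_fraction(result), 2))
```

In the protocol described above, the masked intervals would then be compared against a curated repeat library to assign a classification. That second step is not shown here.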
Source
The Plant Genome 14 (3) : e20143 (November 2021)
Subject
Genomes
Plant Genetics
Genetics
Access level
open access
Terms of use
http://creativecommons.org/licenses/by-nc-sa/4.0/
Repository
INTA Digital (INTA)
Institution
Instituto Nacional de Tecnología Agropecuaria
OAI identifier
oai:localhost:20.500.12123/10882

Publisher
Wiley
Format
application/pdf
Dates
2021-09
2021-12-10T13:45:33Z
Identifiers
http://hdl.handle.net/20.500.12123/10882
https://doi.org/10.1002/tpg2.20143
https://acsess.onlinelibrary.wiley.com/doi/full/10.1002/tpg2.20143
ISSN 1940-3372
License
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
Repository contact
tripaldi.nicolas@inta.gob.ar