K-mer counting and curated libraries drive efficient annotation of repeats in plant genomes
- Authors
- Contreras-Moreira, Bruno; Filippi, Carla Valeria; Naamati, Guy; García Girón, Carlos; Allen, James E.; Flicek, Paul
- Publication year
- 2021
- Language
- English
- Resource type
- article
- Status
- published version
- Description
- The annotation of repetitive sequences within plant genomes can help in the interpretation of observed phenotypes. Moreover, repeat masking is required for tasks such as whole-genome alignment, promoter analysis, or pangenome exploration. Although homology-based annotation methods are computationally expensive, k-mer strategies for masking are orders of magnitude faster. Here, we benchmarked a two-step approach, where repeats were first called by k-mer counting and then annotated by comparison to curated libraries. This hybrid protocol was tested on 20 plant genomes from Ensembl, with the k-mer-based Repeat Detector (Red) and two repeat libraries (REdat, last updated in 2013, and nrTEplants, curated for this work). Custom libraries produced by RepeatModeler were also tested. We obtained repeated genome fractions that matched those reported in the literature but with shorter repeated elements than those produced directly by sequence homology. Inspection of the masked regions that overlapped genes revealed no preference for specific protein domains. Most Red-masked sequences could be successfully classified by sequence similarity, with the complete protocol taking less than 2 h on a desktop Linux box. A guide to curating your own repeat libraries and the scripts for masking and annotating plant genomes can be obtained at https://github.com/Ensembl/plant-scripts.
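The first step of the two-step approach described above (calling repeats by k-mer counting) can be illustrated with a minimal sketch. This is a toy illustration only, not the algorithm implemented in Red: the function names `mask_repeats` and `masked_fraction`, the k-mer length and the count threshold are all assumptions made for this example. The second step, annotating the masked regions against a curated library, is not shown.

```python
from collections import Counter


def mask_repeats(seq: str, k: int = 4, min_count: int = 3) -> str:
    """Soft-mask (lowercase) every base covered by a k-mer seen >= min_count times.

    Hypothetical helper for illustration; real tools use far larger k
    and more elaborate scoring than a fixed count threshold.
    """
    seq = seq.upper()
    # Enumerate all overlapping k-mers and count their occurrences.
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(kmers)
    masked = [False] * len(seq)
    for i, kmer in enumerate(kmers):
        if counts[kmer] >= min_count:
            # Flag every position spanned by this frequent k-mer.
            for j in range(i, i + k):
                masked[j] = True
    return "".join(b.lower() if m else b for b, m in zip(seq, masked))


def masked_fraction(masked_seq: str) -> float:
    """Fraction of the sequence that is soft-masked (lowercase)."""
    if not masked_seq:
        return 0.0
    return sum(c.islower() for c in masked_seq) / len(masked_seq)


if __name__ == "__main__":
    example = "ACGTACGTACGTACGTGGCCATTA"  # tandem ACGT repeat plus a unique tail
    result = mask_repeats(example)
    print(result, round(masked_fraction(result), 3))
```

Lowercase soft-masking keeps the original sequence recoverable, which is why it is a common output convention for repeat maskers; the masked fraction computed here corresponds loosely to the "repeated genome fraction" mentioned in the abstract.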
Instituto de Biotecnología
Affiliation: Contreras-Moreira, Bruno. European Bioinformatics Institute. European Molecular Biology Laboratory; United Kingdom
Affiliation: Filippi, Carla Valeria. Instituto Nacional de Tecnología Agropecuaria (INTA). Instituto de Agrobiotecnología y Biología Molecular (IABIMO); Argentina
Affiliation: Filippi, Carla Valeria. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina
Affiliation: Filippi, Carla Valeria. European Bioinformatics Institute. European Molecular Biology Laboratory; United Kingdom
Affiliation: Naamati, Guy. European Bioinformatics Institute. European Molecular Biology Laboratory; United Kingdom
Affiliation: García Girón, Carlos. European Bioinformatics Institute. European Molecular Biology Laboratory; United Kingdom
Affiliation: Allen, James E. European Bioinformatics Institute. European Molecular Biology Laboratory; United Kingdom
Affiliation: Flicek, Paul. European Bioinformatics Institute. European Molecular Biology Laboratory; United Kingdom
- Source
- The Plant Genome 14 (3) : e20143 (November 2021)
- Subject
- Genomes; Plant Genetics; Genetics
- Access level
- open access
- Terms of use
- http://creativecommons.org/licenses/by-nc-sa/4.0/
- Repository
- Institution
- Instituto Nacional de Tecnología Agropecuaria
- OAI identifier
- oai:localhost:20.500.12123/10882
id | INTADig_56fd39354e288c10d0ff5e8988205cf8 |
---|---|
oai_identifier_str | oai:localhost:20.500.12123/10882 |
network_acronym_str | INTADig |
repository_id_str | l |
network_name_str | INTA Digital (INTA) |
dc.title.none.fl_str_mv | K-mer counting and curated libraries drive efficient annotation of repeats in plant genomes |
dc.creator.none.fl_str_mv | Contreras-Moreira, Bruno; Filippi, Carla Valeria; Naamati, Guy; García Girón, Carlos; Allen, James E.; Flicek, Paul |
author | Contreras-Moreira, Bruno |
author_role | author |
author2 | Filippi, Carla Valeria; Naamati, Guy; García Girón, Carlos; Allen, James E.; Flicek, Paul |
author2_role | author; author; author; author; author |
dc.subject.none.fl_str_mv | Genomes; Plant Genetics; Genetics |
publishDate | 2021 |
dc.date.none.fl_str_mv | 2021-12-10T13:45:33Z; 2021-12-10T13:45:33Z; 2021-09 |
dc.type.none.fl_str_mv | info:eu-repo/semantics/article; info:eu-repo/semantics/publishedVersion; http://purl.org/coar/resource_type/c_6501; info:ar-repo/semantics/articulo |
format | article |
status_str | publishedVersion |
dc.identifier.none.fl_str_mv | http://hdl.handle.net/20.500.12123/10882; https://acsess.onlinelibrary.wiley.com/doi/full/10.1002/tpg2.20143; 1940-3372; https://doi.org/10.1002/tpg2.20143 |
url | http://hdl.handle.net/20.500.12123/10882; https://acsess.onlinelibrary.wiley.com/doi/full/10.1002/tpg2.20143; https://doi.org/10.1002/tpg2.20143 |
identifier_str_mv | 1940-3372 |
dc.language.none.fl_str_mv | eng |
language | eng |
dc.rights.none.fl_str_mv | info:eu-repo/semantics/openAccess; http://creativecommons.org/licenses/by-nc-sa/4.0/; Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) |
eu_rights_str_mv | openAccess |
rights_invalid_str_mv | http://creativecommons.org/licenses/by-nc-sa/4.0/; Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) |
dc.format.none.fl_str_mv | application/pdf |
dc.publisher.none.fl_str_mv | Wiley |
publisher.none.fl_str_mv | Wiley |
dc.source.none.fl_str_mv | The Plant Genome 14 (3) : e20143 (November 2021); reponame:INTA Digital (INTA); instname:Instituto Nacional de Tecnología Agropecuaria |
reponame_str | INTA Digital (INTA) |
collection | INTA Digital (INTA) |
instname_str | Instituto Nacional de Tecnología Agropecuaria |
repository.name.fl_str_mv | INTA Digital (INTA) - Instituto Nacional de Tecnología Agropecuaria |
repository.mail.fl_str_mv | tripaldi.nicolas@inta.gob.ar |
_version_ | 1842341392898064384 |
score | 12.623145 |