Relative performance of cluster algorithms and validation indices in maize genome-wide structure patterns

Authors
Videla, María Eugenia; Iglesias, Juliana; Bruno, Cecilia Inés
Year of publication
2021
Language
English
Resource type
article
Status
accepted version
Description
A number of clustering algorithms are available to depict population genetic structure (PGS) with genomic data; however, there is no consensus on which methods perform best. We conducted a simulation study of three PGS scenarios with k = 2, 5 and 10 subpopulations, recreating several maize genomes as a model, to: (1) compare three well-known clustering methods: UPGMA, k-means, and a Bayesian method (BM); (2) assess four internal validation indices: CH, Connectivity, Dunn and Silhouette, to determine a reliable number of groups defining a PGS; and (3) estimate the misclassification rate for each validation index. Moreover, a publicly available maize dataset was used to illustrate the outcomes of our simulation. BM was the best method to classify individuals in all tested scenarios, with no assignment errors. Conversely, UPGMA had the highest misclassification rate. In scenarios with 5 and 10 subpopulations, the CH and Connectivity indices showed the greatest underestimation of the number of groups for all clustering algorithms. The Dunn and Silhouette indices performed best with BM. Nevertheless, since Silhouette measures the degree of confidence in cluster assignment, and BM measures the probability of cluster membership, these results should be considered with caution. In this study, BM proved efficient at depicting the PGS in both simulated and real maize datasets. This study offers a robust alternative to unveil the existing PGS, thereby facilitating population studies and breeding strategies in maize programs. Moreover, the present findings may have implications for other crop species.
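As a rough illustration of two ideas named in the abstract, choosing the number of groups with an internal validation index and estimating a misclassification rate, the following minimal sketch may help. It is not the authors' code: the toy 2-D points, the function names, and the use of Euclidean distance are assumptions made only for this example, and it covers only average-linkage (UPGMA) clustering with the Silhouette index, not k-means, the Bayesian method, or the other indices.

```python
from itertools import permutations
from math import dist


def upgma_labels(points, k):
    """Average-linkage (UPGMA) agglomerative clustering, stopped at k clusters."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # UPGMA distance: mean of all pairwise distances between clusters.
                total = sum(dist(points[i], points[j])
                            for i in clusters[a] for j in clusters[b])
                d = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)  # b > a, so index a stays valid
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def mean_silhouette(points, labels):
    """Mean silhouette width; assumes at least two clusters."""
    groups = {}
    for i, label in enumerate(labels):
        groups.setdefault(label, []).append(i)
    total = 0.0
    for i, label in enumerate(labels):
        own = groups[label]
        if len(own) == 1:
            continue  # a singleton contributes a silhouette of 0
        a = sum(dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(sum(dist(points[i], points[j]) for j in other) / len(other)
                for other_label, other in groups.items() if other_label != label)
        total += (b - a) / max(a, b)
    return total / len(points)


def misclassification_rate(true_labels, predicted):
    """Share of wrongly assigned individuals under the best label matching."""
    k = max(max(true_labels), max(predicted)) + 1
    fewest = min(sum(perm[p] != t for p, t in zip(predicted, true_labels))
                 for perm in permutations(range(k)))
    return fewest / len(true_labels)


# Hypothetical toy data: three well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 0), (10, 1), (11, 0), (11, 1),
          (0, 10), (0, 11), (1, 10), (1, 11)]
truth = [0] * 4 + [1] * 4 + [2] * 4

# Score each candidate number of groups and keep the one with the highest index.
scores = {k: mean_silhouette(points, upgma_labels(points, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
error = misclassification_rate(truth, upgma_labels(points, best_k))
print(best_k, error)
```

Because cluster labels are arbitrary, the misclassification rate is taken over the best one-to-one relabelling of the predicted groups; trying every permutation is only practical for small numbers of groups such as those in this sketch.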
EEA Pergamino
Affiliation: Videla, María Eugenia. Universidad Nacional de Córdoba. Facultad de Ciencias Agropecuarias. Estadística y Biometría; Argentina
Affiliation: Videla, María Eugenia. Consejo Nacional de Investigaciones Científicas y Tecnológicas. Unidad de Fitopatología y Modelización Agrícola (UFyMA-CONICET); Argentina
Affiliation: Videla, María Eugenia. Universidad Nacional de Villa María; Argentina
Affiliation: Iglesias, Juliana. Instituto Nacional de Tecnología Agropecuaria (INTA). Estación Experimental Agropecuaria Pergamino. Departamento de Maíz; Argentina
Affiliation: Iglesias, Juliana. Universidad Nacional del Noroeste de la Provincia de Buenos Aires. Escuela de Agrarias, Naturales y Ambientales; Argentina
Affiliation: Bruno, Cecilia. Universidad Nacional de Córdoba. Facultad de Ciencias Agropecuarias. Estadística y Biometría; Argentina
Affiliation: Bruno, Cecilia. Consejo Nacional de Investigaciones Científicas y Tecnológicas. Unidad de Fitopatología y Modelización Agrícola (UFyMA-CONICET); Argentina
Source
Euphytica 217 (10) : 195 (October 2021)
Subject
Maíz
Genética de Poblaciones
Genomas
Mejoramiento Genético
Maize
Population Genetics
Genomes
Genetic Improvement
Unsupervised Learning
Population Genetic Structure
Multivariate Technique
Outcome Misclassification
Access level
restricted access
Terms of use
Repository
INTA Digital (INTA)
Institution
Instituto Nacional de Tecnología Agropecuaria
OAI identifier
oai:localhost:20.500.12123/11153

dc.date.none.fl_str_mv 2021-09
2022-02-15T14:34:44Z
2022-02-15T14:34:44Z
dc.type.none.fl_str_mv info:eu-repo/semantics/article
info:eu-repo/semantics/acceptedVersion
http://purl.org/coar/resource_type/c_6501
info:ar-repo/semantics/articulo
dc.identifier.none.fl_str_mv http://hdl.handle.net/20.500.12123/11153
https://link.springer.com/article/10.1007/s10681-021-02926-5
1573-5060 (online)
0014-2336
https://doi.org/10.1007/s10681-021-02926-5
dc.language.none.fl_str_mv eng
dc.relation.none.fl_str_mv info:eu-repograntAgreement/INTA/2019-PE-E6-I114-001/2019-PE-E6-I114-001/AR./Caracterización de la diversidad genética de plantas, animales y microorganismos mediante herramientas de genómica aplicada.
dc.rights.none.fl_str_mv info:eu-repo/semantics/restrictedAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv Springer Nature
dc.source.none.fl_str_mv Euphytica 217 (10) : 195 (October 2021)
repository.mail.fl_str_mv tripaldi.nicolas@inta.gob.ar