Relative performance of cluster algorithms and validation indices in maize genome-wide structure patterns

Authors: Videla, María Eugenia; Iglesias, Juliana; Bruno, Cecilia Inés
Publication year: 2021
Language: English
Resource type: article
Status: accepted version
Description: A number of clustering algorithms are available to depict population genetic structure (PGS) with genomic data; however, there is no consensus on which methods perform best. We conducted a simulation study of three PGS scenarios with k = 2, 5 and 10 subpopulations, recreating several maize genomes as a model, to: (1) compare three well-known clustering methods: UPGMA, k-means, and a Bayesian method (BM); (2) assess four internal validation indices (CH, Connectivity, Dunn and Silhouette) for determining a reliable number of groups defining a PGS; and (3) estimate the misclassification rate for each validation index. Moreover, a publicly available maize dataset was used to illustrate the outcomes of our simulation. BM was the best method for classifying individuals in all tested scenarios, with no assignment errors. Conversely, UPGMA had the highest misclassification rate. In scenarios with 5 and 10 subpopulations, the CH and Connectivity indices showed the greatest underestimation of the number of groups for all clustering algorithms. The Dunn and Silhouette indices performed best with BM. Nevertheless, since Silhouette measures the degree of confidence in cluster assignment, and BM measures the probability of cluster membership, these results should be considered with caution. In this study, BM proved efficient at depicting the PGS in both simulated and real maize datasets. This study offers a robust alternative for unveiling the existing PGS, thereby facilitating population studies and breeding strategies in maize programs. Moreover, the present findings may have implications for other crop species.
EEA Pergamino
Affiliation: Videla, María Eugenia. Universidad Nacional de Córdoba. Facultad de Ciencias Agropecuarias. Estadística y Biometría; Argentina
Affiliation: Videla, María Eugenia. Consejo Nacional de Investigaciones Científicas y Tecnológicas. Unidad de Fitopatología y Modelización Agrícola (UFyMA-CONICET); Argentina
Affiliation: Videla, María Eugenia. Universidad Nacional de Villa María; Argentina
Affiliation: Iglesias, Juliana. Instituto Nacional de Tecnología Agropecuaria (INTA). Estación Experimental Agropecuaria Pergamino. Departamento de Maíz; Argentina
Affiliation: Iglesias, Juliana. Universidad Nacional del Noroeste de la Provincia de Buenos Aires. Escuela de Agrarias, Naturales y Ambientales; Argentina
Affiliation: Bruno, Cecilia. Universidad Nacional de Córdoba. Facultad de Ciencias Agropecuarias. Estadística y Biometría; Argentina
Affiliation: Bruno, Cecilia. Consejo Nacional de Investigaciones Científicas y Tecnológicas. Unidad de Fitopatología y Modelización Agrícola (UFyMA-CONICET); Argentina
Source: Euphytica 217 (10) : 195 (October 2021)
Subject: Maize
Population Genetics
Genomes
Genetic Improvement
Unsupervised Learning
Population Genetic Structure
Multivariate Technique
Outcome Misclassification
Access level: restricted access
Terms of use
Repository
Institution: Instituto Nacional de Tecnología Agropecuaria
OAI identifier: oai:localhost:20.500.12123/11153


id	INTADig_cd37f8be32803239259b5f7f8b104373
oai_identifier_str	oai:localhost:20.500.12123/11153
network_acronym_str	INTADig
repository_id_str	l
network_name_str	INTA Digital (INTA)
publishDate	2021
dc.date.none.fl_str_mv	2021-09 2022-02-15T14:34:44Z 2022-02-15T14:34:44Z
dc.type.none.fl_str_mv	info:eu-repo/semantics/article info:eu-repo/semantics/acceptedVersion http://purl.org/coar/resource_type/c_6501 info:ar-repo/semantics/articulo
format	article
status_str	acceptedVersion
dc.identifier.none.fl_str_mv	http://hdl.handle.net/20.500.12123/11153 https://link.springer.com/article/10.1007/s10681-021-02926-5 1573-5060 (online) 0014-2336 https://doi.org/10.1007/s10681-021-02926-5
url	http://hdl.handle.net/20.500.12123/11153 https://link.springer.com/article/10.1007/s10681-021-02926-5 https://doi.org/10.1007/s10681-021-02926-5
identifier_str_mv	1573-5060 (online) 0014-2336
dc.language.none.fl_str_mv	eng
language	eng
dc.relation.none.fl_str_mv	info:eu-repograntAgreement/INTA/2019-PE-E6-I114-001/2019-PE-E6-I114-001/AR./Caracterización de la diversidad genética de plantas, animales y microorganismos mediante herramientas de genómica aplicada.
dc.rights.none.fl_str_mv	info:eu-repo/semantics/restrictedAccess
eu_rights_str_mv	restrictedAccess
dc.format.none.fl_str_mv	application/pdf
dc.publisher.none.fl_str_mv	Springer Nature
publisher.none.fl_str_mv	Springer Nature
dc.source.none.fl_str_mv	Euphytica 217 (10) : 195 (October 2021) reponame:INTA Digital (INTA) instname:Instituto Nacional de Tecnología Agropecuaria
reponame_str	INTA Digital (INTA)
collection	INTA Digital (INTA)
instname_str	Instituto Nacional de Tecnología Agropecuaria
repository.name.fl_str_mv	INTA Digital (INTA) - Instituto Nacional de Tecnología Agropecuaria
repository.mail.fl_str_mv	tripaldi.nicolas@inta.gob.ar
