Handling missing values in trait data

Authors
Johnson, Thomas F.; Isaac, Nick J. B.; Paviolo, Agustin Javier; González Suárez, Manuela
Publication year
2021
Language
English
Resource type
article
Status
published version
Description
Aim: Trait data are widely used in ecological and evolutionary phylogenetic comparative studies, but often values are not available for all species of interest. Traditionally, researchers have excluded species without data from analyses, but estimation of missing values using imputation has been proposed as a better approach. However, imputation methods have largely been designed for randomly missing data, whereas trait data are often not missing at random (e.g., more data for bigger species). Here, we evaluate the performance of approaches for handling missing values when considering biased datasets.
Location: Any.
Time period: Any.
Major taxa studied: Any.
Methods: We simulated continuous traits and separate response variables to test the performance of nine imputation methods and complete-case analysis (excluding missing values from the dataset) under biased missing data scenarios. We characterized performance by estimating the error in imputed trait values (deviation from the true value) and inferred trait–response relationships (deviation from the true relationship between a trait and response).
Results: Generally, Rphylopars imputation produced the most accurate estimate of missing values and best preserved the response–trait slope. However, estimates of missing data were still inaccurate, even with only 5% of values missing. Under severe biases, errors were high with every approach. Imputation was not always the best option, with complete-case analysis frequently outperforming Mice imputation and, to a lesser degree, BHPMF imputation. Mice, a popular approach, performed poorly when the response variable was excluded from the imputation model.
Main conclusions: Imputation can handle missing data effectively in some conditions but is not always the best solution. None of the methods we tested could deal effectively with severe biases, which can be common in trait datasets. We recommend rigorous data checking for biases before and after imputation and propose variables that can assist researchers working with incomplete datasets to detect data biases and minimize errors.
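The two performance measures named in the abstract (error in imputed trait values, and deviation of the inferred trait–response slope from the true slope) can be illustrated with a small simulation. The following is only a minimal sketch under stated assumptions, not the authors' simulation design: the function names, the effect sizes, the logistic missingness rule tied to body size, and the use of mean imputation as a naive stand-in are all hypothetical choices made for illustration, and none of Rphylopars, Mice, or BHPMF is reproduced here.

```python
import math
import random
import statistics


def ols_slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def simulate(n=2000, true_slope=0.5, frac_missing=0.3, bias=2.0, seed=1):
    """Simulate body size, a continuous trait, a response, and a biased missingness mask.

    All parameter values are illustrative assumptions, not taken from the paper.
    """
    rng = random.Random(seed)
    size = [rng.gauss(0.0, 1.0) for _ in range(n)]            # standardized body size
    trait = [0.8 * s + rng.gauss(0.0, 0.6) for s in size]     # trait correlated with size
    response = [true_slope * t + rng.gauss(0.0, 0.5) for t in trait]
    # Biased missingness: smaller species are more likely to lack a trait value.
    # With bias = 0 every species is missing with probability frac_missing.
    missing = [
        rng.random() < min(1.0, 2.0 * frac_missing / (1.0 + math.exp(bias * s)))
        for s in size
    ]
    return size, trait, response, missing


def evaluate(trait, response, missing, true_slope=0.5):
    """Return the two error measures for mean imputation and complete-case analysis."""
    observed = [t for t, m in zip(trait, missing) if not m]
    hidden = [t for t, m in zip(trait, missing) if m]          # true values, known only in simulation
    observed_response = [r for r, m in zip(response, missing) if not m]

    # Error in imputed values: every missing trait is filled with the observed mean.
    fill = statistics.fmean(observed)
    rmse = math.sqrt(statistics.fmean([(fill - t) ** 2 for t in hidden]))

    # Error in the trait-response relationship under complete-case analysis.
    cc_slope = ols_slope(observed, observed_response)
    return {
        "imputation_rmse": rmse,
        "complete_case_slope": cc_slope,
        "slope_error": cc_slope - true_slope,
    }


if __name__ == "__main__":
    size, trait, response, missing = simulate()
    print(evaluate(trait, response, missing))
```

Re-running `simulate` with different values of `bias` and `frac_missing` gives a rough feel for how the strength of the bias and the amount of missing data change both measures; a real comparison would substitute the imputation methods named in the abstract for the mean-fill step.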
Affiliation: Johnson, Thomas F. University of Reading; United Kingdom
Affiliation: Isaac, Nick J. B. Centre For Ecology And Hydrology; United Kingdom
Affiliation: Paviolo, Agustin Javier. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Nordeste. Instituto de Biología Subtropical. Instituto de Biología Subtropical - Nodo Puerto Iguazú | Universidad Nacional de Misiones. Instituto de Biología Subtropical. Instituto de Biología Subtropical - Nodo Puerto Iguazú; Argentina. Centro de Investigaciones del Bosque Atlántico; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Nordeste; Argentina
Affiliation: González Suárez, Manuela. University of Reading; United Kingdom
Subject
BHPMF
FUNCTIONAL TRAIT
IMPUTATION
LIFE-HISTORY TRAIT
MAR
MCAR
MISSING DATA
MNAR
MULTIPLE IMPUTATION CHAINED EQUATIONS
RPHYLOPARS
Access level
open access
Terms of use
https://creativecommons.org/licenses/by/2.5/ar/
Repository
CONICET Digital (CONICET)
Institution
Consejo Nacional de Investigaciones Científicas y Técnicas
OAI identifier
oai:ri.conicet.gov.ar:11336/168014

purl_subject.fl_str_mv https://purl.org/becyt/ford/1.6
https://purl.org/becyt/ford/1
dc.date.none.fl_str_mv 2021-01
dc.type.none.fl_str_mv info:eu-repo/semantics/article
info:eu-repo/semantics/publishedVersion
http://purl.org/coar/resource_type/c_6501
info:ar-repo/semantics/articulo
dc.identifier.none.fl_str_mv http://hdl.handle.net/11336/168014
Johnson, Thomas F.; Isaac, Nick J. B.; Paviolo, Agustin Javier; González Suárez, Manuela; Handling missing values in trait data; Wiley Blackwell Publishing, Inc; Global Ecology and Biogeography; 30; 1; 1-2021; 51-62
1466-822X
CONICET Digital
CONICET
dc.relation.none.fl_str_mv info:eu-repo/semantics/altIdentifier/doi/10.1111/geb.13185
info:eu-repo/semantics/altIdentifier/url/https://onlinelibrary.wiley.com/doi/10.1111/geb.13185
dc.format.none.fl_str_mv application/pdf
application/pdf
dc.publisher.none.fl_str_mv Wiley Blackwell Publishing, Inc
repository.mail.fl_str_mv dasensio@conicet.gov.ar; lcarlino@conicet.gov.ar