Marriage between variable selection and prediction methods to model plant disease risk

Authors
Suarez, Franco; Bruno, Cecilia; Kurina Giannini, Franca; Gimenez, Maria; Rodriguez Pardina, Patricia; Balzarini, Mónica Graciela
Publication year
2023
Language
English
Resource type
article
Status
published version
Description
Predicting the risk of a disease in a pathosystem from a set of climatic variables usually requires handling a large number of input variables, many of which are irrelevant and/or redundant. Building linear predictive models entails not only dimensionality issues but also the negative impact of multicollinearity. Several feature selection methods have proved to be efficient in both linear and non-linear models, regardless of those issues. However, in a machine learning (ML) context, these feature selection methods need to be evaluated when embedded into the model-fitting algorithm in order to obtain the greatest accuracy. The aim of this work was to assess different combinations of variable selection methods with linear and non-linear predictors to fit climate-based models that predict the occurrence of a disease in a pathosystem. Four selection methods were compared: stepwise selection, which is frequently used in linear models, combined with variance inflation factor (VIF) and p-value statistical criteria (Step+VIF+Pv), and three methods commonly used in ML: filter (F), genetic algorithm (GA), and Boruta (B). The disease risk predictors were constructed with a logistic linear regression model (LR) and the random forest (RF) algorithm, using all the available variables and the subsets of variables selected by each feature selection method. Data from three pathosystems were processed: two involving Begomovirus, one in common bean (Phaseolus vulgaris L.) and the other in soybean (Glycine max), and a third involving Mal de Rio Cuarto virus in maize (Zea mays L.). The data sets differed in sample size and number of variables. The accuracy of RF prediction did not vary among feature selection methods. When used to reduce the model, Step+VIF+Pv outperformed the other feature selection methods in fitting LR. Our results suggest that an appropriate pairing of variable selection and prediction models would improve the modeling of plant disease risk.
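To illustrate the multicollinearity screening that the description associates with the VIF criterion, here is a minimal sketch of VIF-based pruning in plain Python. It is not the authors' procedure: the threshold of 10, the function names, the variable names, and the synthetic data are all assumptions made for illustration. The sketch relies on the identity that each predictor's VIF equals the corresponding diagonal entry of the inverse of the predictors' correlation matrix.

```python
import random


def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equal-length numeric columns."""
    n = len(columns[0])
    means = [sum(c) / n for c in columns]
    centered = [[v - m for v in c] for c, m in zip(columns, means)]
    norms = [sum(v * v for v in c) ** 0.5 for c in centered]
    p = len(columns)
    return [
        [
            sum(a * b for a, b in zip(centered[i], centered[j])) / (norms[i] * norms[j])
            for j in range(p)
        ]
        for i in range(p)
    ]


def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting (fails on exactly singular input)."""
    p = len(matrix)
    aug = [
        list(row) + [1.0 if i == j else 0.0 for j in range(p)]
        for i, row in enumerate(matrix)
    ]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(p):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[p:] for row in aug]


def vif(columns):
    """VIF of each predictor: the diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[i][i] for i in range(len(columns))]


def prune_by_vif(data, threshold=10.0):
    """Drop the predictor with the highest VIF until all VIFs are <= threshold.

    `data` maps a predictor name to its column; threshold=10 is an assumed cutoff.
    """
    kept = dict(data)
    while len(kept) > 1:
        names = list(kept)
        scores = vif([kept[name] for name in names])
        worst = max(range(len(names)), key=lambda i: scores[i])
        if scores[worst] <= threshold:
            break
        del kept[names[worst]]
    return list(kept)


# Synthetic climate predictors (hypothetical): temp_max is nearly a linear
# function of temp_mean, while rainfall is generated independently.
random.seed(0)
temp_mean = [random.gauss(20.0, 3.0) for _ in range(200)]
temp_max = [t + 5.0 + random.gauss(0.0, 0.3) for t in temp_mean]
rainfall = [random.gauss(80.0, 25.0) for _ in range(200)]

selected = prune_by_vif(
    {"temp_mean": temp_mean, "temp_max": temp_max, "rainfall": rainfall}
)
print(selected)
```

In this toy data set the two temperature variables are almost collinear, so one of them is dropped and rainfall is retained. The retained subset could then be passed to a stepwise or p-value step before fitting a logistic regression, which is outside the scope of this sketch.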
Instituto de Patología Vegetal
Affiliation: Suarez, Franco. Universidad Nacional de Córdoba. Facultad de Ciencias Agropecuarias. Cátedra de Estadística y Biometría; Argentina
Affiliation: Suarez, Franco. Consejo Nacional de Investigaciones Científicas y Técnicas. Unidad de Fitopatología y Modelización Agrícola (UFyMA); Argentina
Affiliation: Suarez, Franco. Instituto Nacional de Tecnología Agropecuaria (INTA). Instituto de Patología Vegetal; Argentina
Affiliation: Bruno, Cecilia. Universidad Nacional de Córdoba. Facultad de Ciencias Agropecuarias. Cátedra de Estadística y Biometría; Argentina
Affiliation: Bruno, Cecilia. Consejo Nacional de Investigaciones Científicas y Técnicas. Unidad de Fitopatología y Modelización Agrícola (UFyMA); Argentina
Affiliation: Bruno, Cecilia. Instituto Nacional de Tecnología Agropecuaria (INTA). Instituto de Patología Vegetal; Argentina
Affiliation: Kurina Giannini, Franca. Aarhus Universitet. institut for agroøkologi. Jornær sektioner; Denmark
Affiliation: Gimenez, Maria De La Paz. Instituto Nacional de Tecnología Agropecuaria (INTA). Instituto de Patología Vegetal; Argentina
Affiliation: Gimenez, Maria De La Paz. Consejo Nacional de Investigaciones Científicas y Técnicas. Unidad de Fitopatología y Modelización Agrícola (UFyMA); Argentina
Affiliation: Rodriguez Pardina, Patricia. Instituto Nacional de Tecnología Agropecuaria (INTA). Instituto de Patología Vegetal; Argentina
Affiliation: Rodriguez Pardina, Patricia. Consejo Nacional de Investigaciones Científicas y Técnicas. Unidad de Fitopatología y Modelización Agrícola (UFyMA); Argentina
Affiliation: Balzarini, Mónica Graciela. Universidad Nacional de Córdoba. Facultad de Ciencias Agropecuarias. Cátedra de Estadística y Biometría; Argentina
Affiliation: Balzarini, Mónica Graciela. Consejo Nacional de Investigaciones Científicas y Técnicas. Unidad de Fitopatología y Modelización Agrícola (UFyMA); Argentina
Affiliation: Balzarini, Mónica Graciela. Instituto Nacional de Tecnología Agropecuaria (INTA). Instituto de Patología Vegetal; Argentina
Source
European Journal of Agronomy 151: 126995 (November 2023)
Subject
Multicollinearity
Plant Diseases
Logistic Regression
Random Forest
Feature Selection
Prediction Models
Pathosystems
Access level
restricted access
Terms of use
http://creativecommons.org/licenses/by-nc-sa/4.0/
Repository
INTA Digital (INTA)
Institution
Instituto Nacional de Tecnología Agropecuaria
OAI identifier
oai:localhost:20.500.12123/15634

id INTADig_1186c6a02b7f13e7ee4317bb2b723eed
oai_identifier_str oai:localhost:20.500.12123/15634
network_acronym_str INTADig
repository_id_str l
network_name_str INTA Digital (INTA)
publishDate 2023
dc.date.none.fl_str_mv 2023-10-23T10:21:16Z
2023-10-23T10:21:16Z
2023-10-11
dc.type.none.fl_str_mv info:eu-repo/semantics/article
info:eu-repo/semantics/publishedVersion
http://purl.org/coar/resource_type/c_6501
info:ar-repo/semantics/articulo
format article
status_str publishedVersion
dc.identifier.none.fl_str_mv http://hdl.handle.net/20.500.12123/15634
https://www.sciencedirect.com/science/article/pii/S1161030123002630
1161-0301
https://doi.org/10.1016/j.eja.2023.126995
dc.language.none.fl_str_mv eng
language eng
dc.rights.none.fl_str_mv info:eu-repo/semantics/restrictedAccess
http://creativecommons.org/licenses/by-nc-sa/4.0/
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv Elsevier
repository.mail.fl_str_mv tripaldi.nicolas@inta.gob.ar