FDR2-BD: A fast data reduction recommendation tool for tabular big data classification problems

Authors
Basgall, María José; Naiouf, Ricardo Marcelo; Fernández, Alberto
Publication year
2021
Language
English
Resource type
article
Status
published version
Description
This paper presents FDR2-BD, a methodological data condensation approach for reducing tabular big datasets in classification problems. The key idea is to analyze the data in two directions (vertical and horizontal), combining feature selection, which generates dense clusters of data, with uniform sampling reduction, which keeps only a few representative samples from each problem area. Its main advantage is that the model’s predictive quality is kept within a range determined by a user-defined threshold. Its robustness rests on a hyper-parametrization process that takes all the data into account by following a k-fold procedure. It is also fast and scalable, relying on the fully optimized parallel operations provided by Apache Spark. An extensive experimental study is carried out over 25 big datasets with different characteristics. In most cases, the reduction percentages obtained are above 95%, outperforming state-of-the-art solutions such as FCNN_MR, which barely reach 70%. The most promising outcome is that the representativeness of the original data is maintained, with predictive quality values within about 1% of the baseline.
Affiliation: Basgall, María José. Universidad de Granada; Spain. Universidad Nacional de La Plata. Facultad de Informática. Instituto de Investigación en Informática Lidi; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - La Plata; Argentina
Affiliation: Naiouf, Ricardo Marcelo. Universidad Nacional de La Plata. Facultad de Informática. Instituto de Investigación en Informática Lidi; Argentina
Affiliation: Fernández, Alberto. Universidad de Granada; Spain
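The horizontal (sampling) step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a toy nearest-centroid classifier in plain Python in place of the models and Apache Spark operations used in the paper, and it omits feature selection and the k-fold procedure. The function names, the candidate fractions, and the 0.01 default threshold are all hypothetical choices made for illustration.

```python
import random
from collections import defaultdict


def stratified_uniform_sample(rows, labels, fraction, rng):
    """Keep about `fraction` of each class, chosen uniformly at random."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    kept = []
    for idx in by_class.values():
        n = max(1, round(len(idx) * fraction))  # never drop a class entirely
        kept.extend(rng.sample(idx, n))
    kept.sort()
    return [rows[i] for i in kept], [labels[i] for i in kept]


def fit_centroids(rows, labels):
    """Toy stand-in classifier: one mean vector per class."""
    sums, counts = {}, defaultdict(int)
    for x, y in zip(rows, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
        for j, v in enumerate(x):
            sums[y][j] += v
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def accuracy(centroids, rows, labels):
    """Fraction of rows whose nearest centroid carries the true label."""
    def predict(x):
        return min(
            centroids,
            key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])),
        )
    return sum(predict(x) == y for x, y in zip(rows, labels)) / len(rows)


def recommend_fraction(train_x, train_y, test_x, test_y,
                       threshold=0.01,
                       fractions=(0.5, 0.2, 0.1, 0.05, 0.01),
                       seed=0):
    """Return the smallest tried fraction whose accuracy loss stays within
    `threshold` of the full-data baseline, together with that baseline."""
    rng = random.Random(seed)
    baseline = accuracy(fit_centroids(train_x, train_y), test_x, test_y)
    best = 1.0  # fall back to "no reduction" if every fraction fails
    for f in fractions:  # candidate fractions, largest first
        sx, sy = stratified_uniform_sample(train_x, train_y, f, rng)
        acc = accuracy(fit_centroids(sx, sy), test_x, test_y)
        if baseline - acc > threshold:
            break  # quality left the user's range; stop shrinking
        best = f
    return best, baseline
```

A returned fraction of 0.05 would correspond to a 95% reduction in the number of rows. A fuller sketch would score each candidate with k-fold validation rather than a single train/test split, as the abstract describes.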
Subject
APACHE SPARK
BIG DATA
CLASSIFICATION
DATA REDUCTION
PREPROCESSING TECHNIQUES
Access level
open access
Terms of use
https://creativecommons.org/licenses/by/2.5/ar/
Repository
CONICET Digital (CONICET)
Institution
Consejo Nacional de Investigaciones Científicas y Técnicas
OAI identifier
oai:ri.conicet.gov.ar:11336/150370

id CONICETDig_a94eca3398b0f8c54fa36acada5ed51d
oai_identifier_str oai:ri.conicet.gov.ar:11336/150370
network_acronym_str CONICETDig
repository_id_str 3498
network_name_str CONICET Digital (CONICET)
dc.title.none.fl_str_mv FDR2-BD: A fast data reduction recommendation tool for tabular big data classification problems
dc.creator.none.fl_str_mv Basgall, María José
Naiouf, Ricardo Marcelo
Fernández, Alberto
dc.subject.none.fl_str_mv APACHE SPARK
BIG DATA
CLASSIFICATION
DATA REDUCTION
PREPROCESSING TECHNIQUES
purl_subject.fl_str_mv https://purl.org/becyt/ford/1.2
https://purl.org/becyt/ford/1
publishDate 2021
dc.date.none.fl_str_mv 2021-08
dc.type.none.fl_str_mv info:eu-repo/semantics/article
info:eu-repo/semantics/publishedVersion
http://purl.org/coar/resource_type/c_6501
info:ar-repo/semantics/articulo
format article
status_str publishedVersion
dc.identifier.none.fl_str_mv http://hdl.handle.net/11336/150370
Basgall, María José; Naiouf, Ricardo Marcelo; Fernández, Alberto; FDR2-BD: A fast data reduction recommendation tool for tabular big data classification problems; Molecular Diversity Preservation International; Electronics; 10; 15; 8-2021; 1-19
2079-9292
CONICET Digital
CONICET
dc.language.none.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv info:eu-repo/semantics/altIdentifier/url/https://www.mdpi.com/2079-9292/10/15/1757
info:eu-repo/semantics/altIdentifier/doi/10.3390/electronics10151757
dc.rights.none.fl_str_mv info:eu-repo/semantics/openAccess
https://creativecommons.org/licenses/by/2.5/ar/
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv Molecular Diversity Preservation International
dc.source.none.fl_str_mv reponame:CONICET Digital (CONICET)
instname:Consejo Nacional de Investigaciones Científicas y Técnicas
repository.name.fl_str_mv CONICET Digital (CONICET) - Consejo Nacional de Investigaciones Científicas y Técnicas
repository.mail.fl_str_mv dasensio@conicet.gov.ar; lcarlino@conicet.gov.ar