FDR²-BD: A Fast Data Reduction Recommendation Tool for Tabular Big Data Classification Problems

Authors
Basgall, María; Naiouf, Marcelo; Fernández, Alberto
Publication year
2021
Language
English
Resource type
Article
Status
Published version
Description
In this paper, a methodological data condensation approach for reducing tabular big datasets in classification problems is presented, named FDR²-BD. The key to our proposal is to analyze data in a dual way (vertical and horizontal), so as to provide a smart combination of feature selection, which generates dense clusters of data, and uniform sampling reduction, which keeps only a few representative samples from each problem area. Its main advantage is that the model's predictive quality is kept within a range determined by a user-defined threshold. Its robustness is built on a hyper-parametrization process in which all data are taken into consideration by following a k-fold procedure. Another significant capability is its speed and scalability, obtained by using the fully optimized parallel operations provided by Apache Spark. An extensive experimental study is performed over 25 big datasets with different characteristics. In most cases, the obtained reduction percentages are above 95%, thus outperforming state-of-the-art solutions such as FCNN_MR, which barely reaches 70%. The most promising outcome is that the representativeness of the original data is maintained, with prediction quality values within about 1% of the baseline.
Instituto de Investigación en Informática
Subject
Computer Science
Big data
Data reduction
Classification
Preprocessing techniques
Apache Spark
Access level
Open access
Terms of use
http://creativecommons.org/licenses/by/4.0/
Repository
SEDICI (UNLP)
Institution
Universidad Nacional de La Plata
OAI Identifier
oai:sedici.unlp.edu.ar:10915/125448

URL
http://sedici.unlp.edu.ar/handle/10915/125448
Journal article
https://www.mdpi.com/2079-9292/10/15/1757
ISSN
2079-9292
DOI
10.3390/electronics10151757
Format
application/pdf
License
Creative Commons Attribution 4.0 International (CC BY 4.0)
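
The description above outlines a dual (vertical and horizontal) reduction scheme: feature selection to drop uninformative columns, then uniform per-class sampling to keep a small representative subset, accepted only if predictive quality stays within a user-defined threshold of the baseline. The snippet below is a minimal, hypothetical PySpark sketch of that idea, not the authors' FDR²-BD implementation; the dataset path, column names, feature selector, classifier, sampling fraction, and threshold value are all illustrative assumptions.

```python
# Minimal, hypothetical sketch of a vertical + horizontal data reduction check
# in PySpark. This is NOT the authors' FDR²-BD code; the path, column names,
# the feature selector, the classifier, and all thresholds are assumptions.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, UnivariateFeatureSelector
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.appName("dual-reduction-sketch").getOrCreate()

# Assumed input: a tabular dataset with a numeric "label" column and numeric features.
df = spark.read.parquet("hdfs:///data/tabular_dataset.parquet")  # hypothetical path
feature_cols = [c for c in df.columns if c != "label"]
assembled = VectorAssembler(inputCols=feature_cols, outputCol="features").transform(df)

# Hold out a test split; everything below is fitted on the training split only.
# (FDR²-BD instead tunes its reduction settings with a k-fold procedure.)
train, test = assembled.randomSplit([0.8, 0.2], seed=42)

# Vertical reduction: keep only the most informative features (ANOVA F-test here).
selector = (UnivariateFeatureSelector(featuresCol="features", outputCol="selected",
                                      labelCol="label", selectionMode="numTopFeatures")
            .setFeatureType("continuous").setLabelType("categorical")
            .setSelectionThreshold(5))  # assumed number of features to keep
sel_model = selector.fit(train)
train_v, test_v = sel_model.transform(train), sel_model.transform(test)

# Horizontal reduction: uniform sampling per class, keeping e.g. 5% of each class.
fractions = {r["label"]: 0.05 for r in train_v.select("label").distinct().collect()}
train_reduced = train_v.sampleBy("label", fractions=fractions, seed=42)

evaluator = MulticlassClassificationEvaluator(labelCol="label", metricName="accuracy")

def fit_and_score(train_df, test_df, features_col):
    """Train a simple classifier and return its test accuracy."""
    model = DecisionTreeClassifier(labelCol="label", featuresCol=features_col).fit(train_df)
    return evaluator.evaluate(model.transform(test_df))

baseline = fit_and_score(train, test, "features")             # all rows, all features
condensed = fit_and_score(train_reduced, test_v, "selected")  # reduced rows and features

delta = 0.01  # user threshold: accept the reduction if quality drops by at most ~1 point
print(f"baseline={baseline:.4f} condensed={condensed:.4f} "
      f"recommend_reduction={baseline - condensed <= delta}")
```

In the paper, the reduction percentage and retained features are chosen through a k-fold hyper-parametrization over all of the data, and the evaluation spans 25 big datasets; the single holdout split and accuracy metric above are simplifications for brevity.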