FDR²-BD: A Fast Data Reduction Recommendation Tool for Tabular Big Data Classification Problems

Authors
Basgall, María; Naiouf, Marcelo; Fernández, Alberto
Publication year
2021
Language
English
Resource type
Article
Status
Published version
Description
In this paper, a methodological data condensation approach for reducing tabular big datasets in classification problems is presented, named FDR²-BD. The key to our proposal is to analyze data in a dual way (vertical and horizontal), so as to provide a smart combination of feature selection, which generates dense clusters of data, and uniform sampling reduction, which keeps only a few representative samples from each problem area. Its main advantage is that the model's predictive quality is kept within a range determined by a user-defined threshold. Its robustness is built on a hyper-parametrization process in which all data are taken into consideration by following a k-fold procedure. Another significant capability is its speed and scalability, obtained by using the fully optimized parallel operations provided by Apache Spark. An extensive experimental study is performed over 25 big datasets with different characteristics. In most cases, the obtained reduction percentages are above 95%, thus outperforming state-of-the-art solutions such as FCNN_MR, which barely reaches 70%. The most promising outcome is that the representativeness of the original data is maintained, with prediction quality values within about 1% of the baseline.
Instituto de Investigación en Informática
Subject
Computer Science
Big data
Data reduction
Classification
Preprocessing techniques
Apache Spark
Access level
Open access
Terms of use
http://creativecommons.org/licenses/by/4.0/
Repository
SEDICI (UNLP)
Institution
Universidad Nacional de La Plata
OAI Identifier
oai:sedici.unlp.edu.ar:10915/125448

URL
http://sedici.unlp.edu.ar/handle/10915/125448
Journal article
https://www.mdpi.com/2079-9292/10/15/1757
ISSN
2079-9292
DOI
10.3390/electronics10151757
Format
application/pdf
License
Creative Commons Attribution 4.0 International (CC BY 4.0)
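
The description above outlines a dual (vertical and horizontal) reduction scheme: feature selection to drop uninformative columns, then uniform per-class sampling to keep a small representative subset, accepted only if predictive quality stays within a user-defined threshold of the baseline. The snippet below is a minimal, hypothetical PySpark sketch of that idea, not the authors' FDR²-BD implementation; the dataset path, column names, feature selector, classifier, sampling fraction, and threshold value are all illustrative assumptions.

```python
# Minimal, hypothetical sketch of a vertical + horizontal data reduction check
# in PySpark. This is NOT the authors' FDR²-BD code; the path, column names,
# the feature selector, the classifier, and all thresholds are assumptions.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, UnivariateFeatureSelector
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.appName("dual-reduction-sketch").getOrCreate()

# Assumed input: a tabular dataset with a numeric "label" column and numeric features.
df = spark.read.parquet("hdfs:///data/tabular_dataset.parquet")  # hypothetical path
feature_cols = [c for c in df.columns if c != "label"]
assembled = VectorAssembler(inputCols=feature_cols, outputCol="features").transform(df)

# Hold out a test split; everything below is fitted on the training split only.
# (FDR²-BD instead tunes its reduction settings with a k-fold procedure.)
train, test = assembled.randomSplit([0.8, 0.2], seed=42)

# Vertical reduction: keep only the most informative features (ANOVA F-test here).
selector = (UnivariateFeatureSelector(featuresCol="features", outputCol="selected",
                                      labelCol="label", selectionMode="numTopFeatures")
            .setFeatureType("continuous").setLabelType("categorical")
            .setSelectionThreshold(5))  # assumed number of features to keep
sel_model = selector.fit(train)
train_v, test_v = sel_model.transform(train), sel_model.transform(test)

# Horizontal reduction: uniform sampling per class, keeping e.g. 5% of each class.
fractions = {r["label"]: 0.05 for r in train_v.select("label").distinct().collect()}
train_reduced = train_v.sampleBy("label", fractions=fractions, seed=42)

evaluator = MulticlassClassificationEvaluator(labelCol="label", metricName="accuracy")

def fit_and_score(train_df, test_df, features_col):
    """Train a simple classifier and return its test accuracy."""
    model = DecisionTreeClassifier(labelCol="label", featuresCol=features_col).fit(train_df)
    return evaluator.evaluate(model.transform(test_df))

baseline = fit_and_score(train, test, "features")             # all rows, all features
condensed = fit_and_score(train_reduced, test_v, "selected")  # reduced rows and features

delta = 0.01  # user threshold: accept the reduction if quality drops by at most ~1 point
print(f"baseline={baseline:.4f} condensed={condensed:.4f} "
      f"recommend_reduction={baseline - condensed <= delta}")
```

In the paper, the reduction percentage and retained features are chosen through a k-fold hyper-parametrization over all of the data, and the evaluation spans 25 big datasets; the single holdout split and accuracy metric above are simplifications for brevity.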