FDR²-BD: A Fast Data Reduction Recommendation Tool for Tabular Big Data Classification Problems
- Authors
- Basgall, María; Naiouf, Marcelo; Fernández, Alberto
- Publication year
- 2021
- Language
- English
- Resource type
- article
- Status
- published version
- Description
- In this paper, a methodological data condensation approach for reducing tabular big datasets in classification problems is presented, named FDR²-BD. The key of our proposal is to analyze data in a dual way (vertical and horizontal), so as to provide a smart combination between feature selection to generate dense clusters of data and uniform sampling reduction to keep only a few representative samples from each problem area. Its main advantage is allowing the model’s predictive quality to be kept in a range determined by a user’s threshold. Its robustness is built on a hyper-parametrization process, in which all data are taken into consideration by following a k-fold procedure. Another significant capability is being fast and scalable by using fully optimized parallel operations provided by Apache Spark. An extensive experimental study is performed over 25 big datasets with different characteristics. In most cases, the obtained reduction percentages are above 95%, thus outperforming state-of-the-art solutions such as FCNN_MR that barely reach 70%. The most promising outcome is maintaining the representativeness of the original data information, with quality prediction values around 1% of the baseline.
Instituto de Investigación en Informática
- Subject
- Computer Science; Big data; Data reduction; Classification; Preprocessing techniques; Apache Spark
- Accessibility level
- open access
- Terms of use
- http://creativecommons.org/licenses/by/4.0/
- Repository
- Institution
- Universidad Nacional de La Plata
- OAI identifier
- oai:sedici.unlp.edu.ar:10915/125448
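
The description above outlines a reduction that keeps only a few uniformly sampled instances while holding predictive quality within a user-defined threshold, checked through a k-fold procedure. The following is a minimal single-machine sketch of that one idea, not the authors' implementation: it omits the feature selection step and Apache Spark entirely, samples per class rather than per cluster, and uses a nearest-centroid classifier as a cheap stand-in model. All function names, the candidate fractions, and the synthetic data are assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(rows, fraction, rng):
    """Uniformly sample `fraction` of the rows within each class (at least one per class)."""
    by_class = defaultdict(list)
    for features, label in rows:
        by_class[label].append((features, label))
    sample = []
    for members in by_class.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

def centroid_accuracy(train, test):
    """Accuracy of a nearest-centroid classifier (stand-in model, an assumption)."""
    sums, counts = {}, defaultdict(int)
    for features, label in train:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] += 1
    centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}

    def predict(x):
        return min(centroids,
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))

    return sum(predict(f) == y for f, y in test) / len(test)

def kfold_score(rows, fraction, k, seed):
    """Mean k-fold accuracy when each training fold is reduced to `fraction` of its rows."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    scores = []
    for i in range(k):
        test = shuffled[i::k]
        train = [r for j, r in enumerate(shuffled) if j % k != i]
        scores.append(centroid_accuracy(stratified_sample(train, fraction, rng), test))
    return sum(scores) / k

def recommend_fraction(rows, fractions, threshold, k=5, seed=0):
    """Smallest candidate fraction whose score stays within `threshold` of the baseline."""
    baseline = kfold_score(rows, 1.0, k, seed)
    for fraction in sorted(fractions):
        if baseline - kfold_score(rows, fraction, k, seed) <= threshold:
            return fraction, baseline
    return 1.0, baseline

# Hypothetical toy data: two well-separated classes in two dimensions.
data_rng = random.Random(42)
rows = [((data_rng.random(), data_rng.random()), 0) for _ in range(200)]
rows += [((10 + data_rng.random(), 10 + data_rng.random()), 1) for _ in range(200)]

fraction, baseline = recommend_fraction(rows, [0.5, 0.1, 0.01], threshold=0.01)
print(fraction, baseline)
```

Searching candidate fractions from smallest to largest returns the most aggressive reduction that still meets the threshold. On real data the acceptable fraction would depend on the model, the dataset, and the threshold chosen.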
View the full record metadata
id |
SEDICI_9b431da0ea7dc3817a5cfc0c56276692 |
---|---|
oai_identifier_str |
oai:sedici.unlp.edu.ar:10915/125448 |
network_acronym_str |
SEDICI |
repository_id_str |
1329 |
network_name_str |
SEDICI (UNLP) |
dc.title.none.fl_str_mv |
FDR²-BD: A Fast Data Reduction Recommendation Tool for Tabular Big Data Classification Problems |
dc.creator.none.fl_str_mv |
Basgall, María Naiouf, Marcelo Fernández, Alberto |
dc.subject.none.fl_str_mv |
Ciencias Informáticas Big data Data reduction Classification Preprocessing techniques Apache Spark |
dc.description.none.fl_txt_mv |
In this paper, a methodological data condensation approach for reducing tabular big datasets in classification problems is presented, named FDR²-BD. The key of our proposal is to analyze data in a dual way (vertical and horizontal), so as to provide a smart combination between feature selection to generate dense clusters of data and uniform sampling reduction to keep only a few representative samples from each problem area. Its main advantage is allowing the model’s predictive quality to be kept in a range determined by a user’s threshold. Its robustness is built on a hyper-parametrization process, in which all data are taken into consideration by following a k-fold procedure. Another significant capability is being fast and scalable by using fully optimized parallel operations provided by Apache Spark. An extensive experimental study is performed over 25 big datasets with different characteristics. In most cases, the obtained reduction percentages are above 95%, thus outperforming state-of-the-art solutions such as FCNN_MR that barely reach 70%. The most promising outcome is maintaining the representativeness of the original data information, with quality prediction values around 1% of the baseline. Instituto de Investigación en Informática |
dc.date.none.fl_str_mv |
2021 |
dc.type.none.fl_str_mv |
info:eu-repo/semantics/article info:eu-repo/semantics/publishedVersion Articulo http://purl.org/coar/resource_type/c_6501 info:ar-repo/semantics/articulo |
dc.identifier.none.fl_str_mv |
http://sedici.unlp.edu.ar/handle/10915/125448 |
dc.language.none.fl_str_mv |
eng |
dc.relation.none.fl_str_mv |
info:eu-repo/semantics/altIdentifier/url/https://www.mdpi.com/2079-9292/10/15/1757 info:eu-repo/semantics/altIdentifier/issn/2079-9292 info:eu-repo/semantics/altIdentifier/doi/10.3390/electronics10151757 |
dc.rights.none.fl_str_mv |
info:eu-repo/semantics/openAccess http://creativecommons.org/licenses/by/4.0/ Creative Commons Attribution 4.0 International (CC BY 4.0) |
dc.format.none.fl_str_mv |
application/pdf |
dc.source.none.fl_str_mv |
reponame:SEDICI (UNLP) instname:Universidad Nacional de La Plata instacron:UNLP |
repository.name.fl_str_mv |
SEDICI (UNLP) - Universidad Nacional de La Plata |
repository.mail.fl_str_mv |
alira@sedici.unlp.edu.ar |