FDR2-BD: A fast data reduction recommendation tool for tabular big data classification problems

Authors: Basgall, María José; Naiouf, Ricardo Marcelo; Fernández, Alberto
Publication year: 2021
Language: English
Resource type: article
Status: published version
Description: In this paper, a methodological data condensation approach for reducing tabular big datasets in classification problems is presented, named FDR2-BD. The key of our proposal is to analyze data in a dual way (vertical and horizontal), so as to provide a smart combination between feature selection to generate dense clusters of data and uniform sampling reduction to keep only a few representative samples from each problem area. Its main advantage is allowing the model’s predictive quality to be kept in a range determined by a user’s threshold. Its robustness is built on a hyper-parametrization process, in which all data are taken into consideration by following a k-fold procedure. Another significant capability is being fast and scalable by using fully optimized parallel operations provided by Apache Spark. An extensive experimental study is performed over 25 big datasets with different characteristics. In most cases, the obtained reduction percentages are above 95%, thus outperforming state-of-the-art solutions such as FCNN_MR that barely reach 70%. The most promising outcome is maintaining the representativeness of the original data information, with quality prediction values around 1% of the baseline.
Affiliation: Basgall, María José. Universidad de Granada; Spain. Universidad Nacional de La Plata. Facultad de Informática. Instituto de Investigación en Informática Lidi; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - La Plata; Argentina
Affiliation: Naiouf, Ricardo Marcelo. Universidad Nacional de La Plata. Facultad de Informática. Instituto de Investigación en Informática Lidi; Argentina
Affiliation: Fernández, Alberto. Universidad de Granada; Spain
Subject: APACHE SPARK
BIG DATA
CLASSIFICATION
DATA REDUCTION
PREPROCESSING TECHNIQUES
Access level: open access
Terms of use: https://creativecommons.org/licenses/by/2.5/ar/
Repository
Institution: Consejo Nacional de Investigaciones Científicas y Técnicas
OAI identifier: oai:ri.conicet.gov.ar:11336/150370
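
The reduction idea summarized in the description (uniform sampling that keeps a few representatives, with predictive quality held within a user threshold) can be illustrated with a minimal sketch. This is not the authors' implementation: FDR2-BD runs on Apache Spark and also performs feature selection and k-fold hyper-parametrization, none of which is reproduced here. The function names `uniform_sample_per_class` and `recommend_reduction`, the per-class stratification, and all quality numbers are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def uniform_sample_per_class(labels, fraction, seed=0):
    """Return sorted row indices keeping about `fraction` of each class (at least one row)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    kept = []
    for indices in by_class.values():
        n_keep = max(1, round(len(indices) * fraction))
        kept.extend(rng.sample(indices, n_keep))
    return sorted(kept)


def recommend_reduction(baseline_quality, quality_by_fraction, threshold):
    """Pick the smallest kept fraction whose quality loss stays within `threshold`.

    Returns None when no candidate satisfies the threshold.
    """
    acceptable = [
        fraction
        for fraction, quality in quality_by_fraction.items()
        if baseline_quality - quality <= threshold
    ]
    return min(acceptable) if acceptable else None


# Toy data: 80 rows of class "a" and 20 rows of class "b".
labels = ["a"] * 80 + ["b"] * 20
kept = uniform_sample_per_class(labels, fraction=0.05)

# Placeholder quality values (assumed, not results from the paper).
choice = recommend_reduction(0.90, {0.50: 0.90, 0.05: 0.895, 0.01: 0.85}, threshold=0.01)
```

Here `kept` holds the indices of the retained rows, and `choice` is the most aggressive candidate fraction whose assumed quality loss does not exceed the threshold. In a real pipeline the quality values would come from training and evaluating a classifier on each reduced dataset.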


id	CONICETDig_a94eca3398b0f8c54fa36acada5ed51d
oai_identifier_str	oai:ri.conicet.gov.ar:11336/150370
network_acronym_str	CONICETDig
repository_id_str	3498
network_name_str	CONICET Digital (CONICET)
dc.title.none.fl_str_mv	FDR2-BD: A fast data reduction recommendation tool for tabular big data classification problems
dc.creator.none.fl_str_mv	Basgall, María José Naiouf, Ricardo Marcelo Fernández, Alberto
dc.subject.none.fl_str_mv	APACHE SPARK BIG DATA CLASSIFICATION DATA REDUCTION PREPROCESSING TECHNIQUES
purl_subject.fl_str_mv	https://purl.org/becyt/ford/1.2 https://purl.org/becyt/ford/1
publishDate	2021
dc.date.none.fl_str_mv	2021-08
dc.type.none.fl_str_mv	info:eu-repo/semantics/article info:eu-repo/semantics/publishedVersion http://purl.org/coar/resource_type/c_6501 info:ar-repo/semantics/articulo
format	article
status_str	publishedVersion
dc.identifier.none.fl_str_mv	http://hdl.handle.net/11336/150370 Basgall, María José; Naiouf, Ricardo Marcelo; Fernández, Alberto; FDR2-BD: A fast data reduction recommendation tool for tabular big data classification problems; Molecular Diversity Preservation International; Electronics; 10; 15; 8-2021; 1-19 2079-9292 CONICET Digital CONICET
url	http://hdl.handle.net/11336/150370
dc.language.none.fl_str_mv	eng
language	eng
dc.relation.none.fl_str_mv	info:eu-repo/semantics/altIdentifier/url/https://www.mdpi.com/2079-9292/10/15/1757 info:eu-repo/semantics/altIdentifier/doi/10.3390/electronics10151757
dc.rights.none.fl_str_mv	info:eu-repo/semantics/openAccess https://creativecommons.org/licenses/by/2.5/ar/
eu_rights_str_mv	openAccess
rights_invalid_str_mv	https://creativecommons.org/licenses/by/2.5/ar/
dc.format.none.fl_str_mv	application/pdf application/pdf
dc.publisher.none.fl_str_mv	Molecular Diversity Preservation International
dc.source.none.fl_str_mv	reponame:CONICET Digital (CONICET) instname:Consejo Nacional de Investigaciones Científicas y Técnicas
repository.mail.fl_str_mv	dasensio@conicet.gov.ar; lcarlino@conicet.gov.ar
