Distant Supervised Construction and Evaluation of a Novel Dataset of Emotion-Tagged Social Media Comments in Spanish

Autores: Tessore, Juan Pablo; Esnaola, Leonardo Martín; Lanzarini, Laura Cristina; Baldassarri, Sandra Silvia
Año de publicación: 2021
Idioma: inglés
Tipo de recurso: artículo
Estado: versión publicada
Descripción: Tagged language resources are an essential requirement for developing machine-learning text-based classifiers. However, manual tagging is extremely time consuming and the resulting datasets are rather small, containing only a few thousand samples. Basic emotion datasets are particularly difficult to classify manually because categorization is prone to subjectivity, and thus, redundant classification is required to validate the assigned tag. Even though, in recent years, the amount of emotion-tagged text datasets in Spanish has been growing, it cannot be compared with the number, size, and quality of the datasets in English. Quality is a particularly concerning issue, as not many datasets in Spanish included a validation step in the construction process. In this article, a dataset of social media comments in Spanish is compiled, selected, filtered, and presented. A sample of the dataset is reclassified by a group of psychologists and validated using the Fleiss Kappa interrater agreement measure. Error analysis is performed by using the Sentic Computing tool BabelSenticNet. Results indicate that the agreement between the human raters and the automatically acquired tag is moderate, similar to other manually tagged datasets, with the advantages that the presented dataset contains several hundreds of thousands of tagged comments and it does not require extensive manual tagging. The agreement measured between human raters is very similar to the one between human raters and the original tag. Every measure presented is in the moderate agreement zone and, as such, suitable for training classification algorithms in sentiment analysis field.
Fil: Tessore, Juan Pablo. Universidad Nacional del Noroeste de la Pcia.de Bs.as.. Escuela de Tecnologia. Instituto de Investigacion y Transferencia En Tecnologia. - Comision de Investigaciones Cientificas de la Provincia de Buenos Aires. Instituto de Investigacion y Transferencia En Tecnologia.; Argentina
Fil: Esnaola, Leonardo Martín. Universidad Nacional del Noroeste de la Pcia.de Bs.as.. Escuela de Tecnologia. Instituto de Investigacion y Transferencia En Tecnologia. - Comision de Investigaciones Cientificas de la Provincia de Buenos Aires. Instituto de Investigacion y Transferencia En Tecnologia.; Argentina
Fil: Lanzarini, Laura Cristina. Universidad Nacional de La Plata. Facultad de Informática. Instituto de Investigación en Informática Lidi; Argentina
Fil: Baldassarri, Sandra Silvia. Universidad de Zaragoza; España
Materia: DATASET CONSTRUCTION
DATASET VALIDATION
FACEBOOK
SENTIMENT ANALYSIS
TEXT MINING
Nivel de accesibilidad: acceso abierto
Condiciones de uso: https://creativecommons.org/licenses/by-nc-sa/2.5/ar/
Repositorio
Institución: Consejo Nacional de Investigaciones Científicas y Técnicas
OAI Identificador: oai:ri.conicet.gov.ar:11336/215171

Acceder

id	CONICETDig_ee05b040152510b70838578d7b43f34f
oai_identifier_str	oai:ri.conicet.gov.ar:11336/215171
network_acronym_str	CONICETDig
repository_id_str	3498
network_name_str	CONICET Digital (CONICET)
spelling	Distant Supervised Construction and Evaluation of a Novel Dataset of Emotion-Tagged Social Media Comments in SpanishTessore, Juan PabloEsnaola, Leonardo MartínLanzarini, Laura CristinaBaldassarri, Sandra SilviaDATASET CONSTRUCTIONDATASET VALIDATIONFACEBOOKSENTIMENT ANALYSISTEXT MININGhttps://purl.org/becyt/ford/2.2https://purl.org/becyt/ford/2Tagged language resources are an essential requirement for developing machine-learning text-based classifiers. However, manual tagging is extremely time consuming and the resulting datasets are rather small, containing only a few thousand samples. Basic emotion datasets are particularly difficult to classify manually because categorization is prone to subjectivity, and thus, redundant classification is required to validate the assigned tag. Even though, in recent years, the amount of emotion-tagged text datasets in Spanish has been growing, it cannot be compared with the number, size, and quality of the datasets in English. Quality is a particularly concerning issue, as not many datasets in Spanish included a validation step in the construction process. In this article, a dataset of social media comments in Spanish is compiled, selected, filtered, and presented. A sample of the dataset is reclassified by a group of psychologists and validated using the Fleiss Kappa interrater agreement measure. Error analysis is performed by using the Sentic Computing tool BabelSenticNet. Results indicate that the agreement between the human raters and the automatically acquired tag is moderate, similar to other manually tagged datasets, with the advantages that the presented dataset contains several hundreds of thousands of tagged comments and it does not require extensive manual tagging. The agreement measured between human raters is very similar to the one between human raters and the original tag. Every measure presented is in the moderate agreement zone and, as such, suitable for training classification algorithms in sentiment analysis field.Fil: Tessore, Juan Pablo. Universidad Nacional del Noroeste de la Pcia.de Bs.as.. Escuela de Tecnologia. Instituto de Investigacion y Transferencia En Tecnologia. - Comision de Investigaciones Cientificas de la Provincia de Buenos Aires. Instituto de Investigacion y Transferencia En Tecnologia.; ArgentinaFil: Esnaola, Leonardo Martín. Universidad Nacional del Noroeste de la Pcia.de Bs.as.. Escuela de Tecnologia. Instituto de Investigacion y Transferencia En Tecnologia. - Comision de Investigaciones Cientificas de la Provincia de Buenos Aires. Instituto de Investigacion y Transferencia En Tecnologia.; ArgentinaFil: Lanzarini, Laura Cristina. Universidad Nacional de La Plata. Facultad de Informática. Instituto de Investigación en Informática Lidi; ArgentinaFil: Baldassarri, Sandra Silvia. Universidad de Zaragoza; EspañaSpringer2021-01info:eu-repo/semantics/articleinfo:eu-repo/semantics/publishedVersionhttp://purl.org/coar/resource_type/c_6501info:ar-repo/semantics/articuloapplication/pdfapplication/pdfapplication/pdfapplication/pdfapplication/pdfapplication/pdfhttp://hdl.handle.net/11336/215171Tessore, Juan Pablo; Esnaola, Leonardo Martín; Lanzarini, Laura Cristina; Baldassarri, Sandra Silvia; Distant Supervised Construction and Evaluation of a Novel Dataset of Emotion-Tagged Social Media Comments in Spanish; Springer; Cognitive Computation; 14; 1; 1-2021; 407-4241866-99561866-9964CONICET DigitalCONICETenginfo:eu-repo/semantics/altIdentifier/doi/10.1007/s12559-020-09800-xinfo:eu-repo/semantics/openAccesshttps://creativecommons.org/licenses/by-nc-sa/2.5/ar/reponame:CONICET Digital (CONICET)instname:Consejo Nacional de Investigaciones Científicas y Técnicas2026-05-27T14:30:14Zoai:ri.conicet.gov.ar:11336/215171instacron:CONICETInstitucionalhttp://ri.conicet.gov.ar/Organismo científico-tecnológicoNo correspondehttp://ri.conicet.gov.ar/oai/requestdasensio@conicet.gov.ar; lcarlino@conicet.gov.arArgentinaNo correspondeNo correspondeNo correspondeopendoar:34982026-05-27 14:30:14.389CONICET Digital (CONICET) - Consejo Nacional de Investigaciones Científicas y Técnicasfalse
dc.title.none.fl_str_mv	Distant Supervised Construction and Evaluation of a Novel Dataset of Emotion-Tagged Social Media Comments in Spanish
title	Distant Supervised Construction and Evaluation of a Novel Dataset of Emotion-Tagged Social Media Comments in Spanish
spellingShingle	Distant Supervised Construction and Evaluation of a Novel Dataset of Emotion-Tagged Social Media Comments in Spanish Tessore, Juan Pablo DATASET CONSTRUCTION DATASET VALIDATION FACEBOOK SENTIMENT ANALYSIS TEXT MINING
title_short	Distant Supervised Construction and Evaluation of a Novel Dataset of Emotion-Tagged Social Media Comments in Spanish
title_full	Distant Supervised Construction and Evaluation of a Novel Dataset of Emotion-Tagged Social Media Comments in Spanish
title_fullStr	Distant Supervised Construction and Evaluation of a Novel Dataset of Emotion-Tagged Social Media Comments in Spanish
title_full_unstemmed	Distant Supervised Construction and Evaluation of a Novel Dataset of Emotion-Tagged Social Media Comments in Spanish
title_sort	Distant Supervised Construction and Evaluation of a Novel Dataset of Emotion-Tagged Social Media Comments in Spanish
dc.creator.none.fl_str_mv	Tessore, Juan Pablo Esnaola, Leonardo Martín Lanzarini, Laura Cristina Baldassarri, Sandra Silvia
author	Tessore, Juan Pablo
author_facet	Tessore, Juan Pablo Esnaola, Leonardo Martín Lanzarini, Laura Cristina Baldassarri, Sandra Silvia
author_role	author
author2	Esnaola, Leonardo Martín Lanzarini, Laura Cristina Baldassarri, Sandra Silvia
author2_role	author author author
dc.subject.none.fl_str_mv	DATASET CONSTRUCTION DATASET VALIDATION FACEBOOK SENTIMENT ANALYSIS TEXT MINING
topic	DATASET CONSTRUCTION DATASET VALIDATION FACEBOOK SENTIMENT ANALYSIS TEXT MINING
purl_subject.fl_str_mv	https://purl.org/becyt/ford/2.2 https://purl.org/becyt/ford/2
dc.description.none.fl_txt_mv	Tagged language resources are an essential requirement for developing machine-learning text-based classifiers. However, manual tagging is extremely time consuming and the resulting datasets are rather small, containing only a few thousand samples. Basic emotion datasets are particularly difficult to classify manually because categorization is prone to subjectivity, and thus, redundant classification is required to validate the assigned tag. Even though, in recent years, the amount of emotion-tagged text datasets in Spanish has been growing, it cannot be compared with the number, size, and quality of the datasets in English. Quality is a particularly concerning issue, as not many datasets in Spanish included a validation step in the construction process. In this article, a dataset of social media comments in Spanish is compiled, selected, filtered, and presented. A sample of the dataset is reclassified by a group of psychologists and validated using the Fleiss Kappa interrater agreement measure. Error analysis is performed by using the Sentic Computing tool BabelSenticNet. Results indicate that the agreement between the human raters and the automatically acquired tag is moderate, similar to other manually tagged datasets, with the advantages that the presented dataset contains several hundreds of thousands of tagged comments and it does not require extensive manual tagging. The agreement measured between human raters is very similar to the one between human raters and the original tag. Every measure presented is in the moderate agreement zone and, as such, suitable for training classification algorithms in sentiment analysis field. Fil: Tessore, Juan Pablo. Universidad Nacional del Noroeste de la Pcia.de Bs.as.. Escuela de Tecnologia. Instituto de Investigacion y Transferencia En Tecnologia. - Comision de Investigaciones Cientificas de la Provincia de Buenos Aires. Instituto de Investigacion y Transferencia En Tecnologia.; Argentina Fil: Esnaola, Leonardo Martín. Universidad Nacional del Noroeste de la Pcia.de Bs.as.. Escuela de Tecnologia. Instituto de Investigacion y Transferencia En Tecnologia. - Comision de Investigaciones Cientificas de la Provincia de Buenos Aires. Instituto de Investigacion y Transferencia En Tecnologia.; Argentina Fil: Lanzarini, Laura Cristina. Universidad Nacional de La Plata. Facultad de Informática. Instituto de Investigación en Informática Lidi; Argentina Fil: Baldassarri, Sandra Silvia. Universidad de Zaragoza; España
description	Tagged language resources are an essential requirement for developing machine-learning text-based classifiers. However, manual tagging is extremely time consuming and the resulting datasets are rather small, containing only a few thousand samples. Basic emotion datasets are particularly difficult to classify manually because categorization is prone to subjectivity, and thus, redundant classification is required to validate the assigned tag. Even though, in recent years, the amount of emotion-tagged text datasets in Spanish has been growing, it cannot be compared with the number, size, and quality of the datasets in English. Quality is a particularly concerning issue, as not many datasets in Spanish included a validation step in the construction process. In this article, a dataset of social media comments in Spanish is compiled, selected, filtered, and presented. A sample of the dataset is reclassified by a group of psychologists and validated using the Fleiss Kappa interrater agreement measure. Error analysis is performed by using the Sentic Computing tool BabelSenticNet. Results indicate that the agreement between the human raters and the automatically acquired tag is moderate, similar to other manually tagged datasets, with the advantages that the presented dataset contains several hundreds of thousands of tagged comments and it does not require extensive manual tagging. The agreement measured between human raters is very similar to the one between human raters and the original tag. Every measure presented is in the moderate agreement zone and, as such, suitable for training classification algorithms in sentiment analysis field.
publishDate	2021
dc.date.none.fl_str_mv	2021-01
dc.type.none.fl_str_mv	info:eu-repo/semantics/article info:eu-repo/semantics/publishedVersion http://purl.org/coar/resource_type/c_6501 info:ar-repo/semantics/articulo
format	article
status_str	publishedVersion
dc.identifier.none.fl_str_mv	http://hdl.handle.net/11336/215171 Tessore, Juan Pablo; Esnaola, Leonardo Martín; Lanzarini, Laura Cristina; Baldassarri, Sandra Silvia; Distant Supervised Construction and Evaluation of a Novel Dataset of Emotion-Tagged Social Media Comments in Spanish; Springer; Cognitive Computation; 14; 1; 1-2021; 407-424 1866-9956 1866-9964 CONICET Digital CONICET
url	http://hdl.handle.net/11336/215171
identifier_str_mv	Tessore, Juan Pablo; Esnaola, Leonardo Martín; Lanzarini, Laura Cristina; Baldassarri, Sandra Silvia; Distant Supervised Construction and Evaluation of a Novel Dataset of Emotion-Tagged Social Media Comments in Spanish; Springer; Cognitive Computation; 14; 1; 1-2021; 407-424 1866-9956 1866-9964 CONICET Digital CONICET
dc.language.none.fl_str_mv	eng
language	eng
dc.relation.none.fl_str_mv	info:eu-repo/semantics/altIdentifier/doi/10.1007/s12559-020-09800-x
dc.rights.none.fl_str_mv	info:eu-repo/semantics/openAccess https://creativecommons.org/licenses/by-nc-sa/2.5/ar/
eu_rights_str_mv	openAccess
rights_invalid_str_mv	https://creativecommons.org/licenses/by-nc-sa/2.5/ar/
dc.format.none.fl_str_mv	application/pdf application/pdf application/pdf application/pdf application/pdf application/pdf
dc.publisher.none.fl_str_mv	Springer
publisher.none.fl_str_mv	Springer
dc.source.none.fl_str_mv	reponame:CONICET Digital (CONICET) instname:Consejo Nacional de Investigaciones Científicas y Técnicas
reponame_str	CONICET Digital (CONICET)
collection	CONICET Digital (CONICET)
instname_str	Consejo Nacional de Investigaciones Científicas y Técnicas
repository.name.fl_str_mv	CONICET Digital (CONICET) - Consejo Nacional de Investigaciones Científicas y Técnicas
repository.mail.fl_str_mv	dasensio@conicet.gov.ar; lcarlino@conicet.gov.ar
_version_	1866373002485039104
score	13.143419

Distant Supervised Construction and Evaluation of a Novel Dataset of Emotion-Tagged Social Media Comments in Spanish

Publicaciones similares