Distant Supervised Construction and Evaluation of a Novel Dataset of Emotion-Tagged Social Media Comments in Spanish

Autores: Tessore, Juan Pablo; Esnaola, Leonardo; Lanzarini, Laura Cristina; Baldassarri, Sandra
Año de publicación: 2021
Idioma: inglés
Tipo de recurso: artículo
Estado: versión publicada
Descripción: Tagged language resources are an essential requirement for developing machine-learning text-based classifiers. However, manual tagging is extremely time consuming and the resulting datasets are rather small, containing only a few thousand samples. Basic emotion datasets are particularly difficult to classify manually because categorization is prone to subjectivity, and thus, redundant classification is required to validate the assigned tag. Even though, in recent years, the amount of emotion-tagged text datasets in Spanish has been growing, it cannot be compared with the number, size, and quality of the datasets in English. Quality is a particularly concerning issue, as not many datasets in Spanish included a validation step in the construction process. In this article, a dataset of social media comments in Spanish is compiled, selected, filtered, and presented. A sample of the dataset is reclassified by a group of psychologists and validated using the Fleiss Kappa interrater agreement measure. Error analysis is performed by using the Sentic Computing tool BabelSenticNet. Results indicate that the agreement between the human raters and the automatically acquired tag is moderate, similar to other manually tagged datasets, with the advantages that the presented dataset contains several hundreds of thousands of tagged comments and it does not require extensive manual tagging. The agreement measured between human raters is very similar to the one between human raters and the original tag. Every measure presented is in the moderate agreement zone and, as such, suitable for training classification algorithms in sentiment analysis field.
Instituto de Investigación en Informática
Materia: Informática
Sentiment analysis
Dataset construction
Dataset validation
Facebook
Text mining
Nivel de accesibilidad: acceso abierto
Condiciones de uso: http://creativecommons.org/licenses/by/4.0/
Repositorio
Institución: Universidad Nacional de La Plata
OAI Identificador: oai:sedici.unlp.edu.ar:10915/138899

Acceder

id	SEDICI_18b2a1d02569d1683455a07337a9ee27
oai_identifier_str	oai:sedici.unlp.edu.ar:10915/138899
network_acronym_str	SEDICI
repository_id_str	1329
network_name_str	SEDICI (UNLP)
spelling	Distant Supervised Construction and Evaluation of a Novel Dataset of Emotion-Tagged Social Media Comments in SpanishTessore, Juan PabloEsnaola, LeonardoLanzarini, Laura CristinaBaldassarri, SandraInformáticaSentiment analysisDataset constructionDataset validationFacebookText miningTagged language resources are an essential requirement for developing machine-learning text-based classifiers. However, manual tagging is extremely time consuming and the resulting datasets are rather small, containing only a few thousand samples. Basic emotion datasets are particularly difficult to classify manually because categorization is prone to subjectivity, and thus, redundant classification is required to validate the assigned tag. Even though, in recent years, the amount of emotion-tagged text datasets in Spanish has been growing, it cannot be compared with the number, size, and quality of the datasets in English. Quality is a particularly concerning issue, as not many datasets in Spanish included a validation step in the construction process. In this article, a dataset of social media comments in Spanish is compiled, selected, filtered, and presented. A sample of the dataset is reclassified by a group of psychologists and validated using the Fleiss Kappa interrater agreement measure. Error analysis is performed by using the Sentic Computing tool BabelSenticNet. Results indicate that the agreement between the human raters and the automatically acquired tag is moderate, similar to other manually tagged datasets, with the advantages that the presented dataset contains several hundreds of thousands of tagged comments and it does not require extensive manual tagging. The agreement measured between human raters is very similar to the one between human raters and the original tag. Every measure presented is in the moderate agreement zone and, as such, suitable for training classification algorithms in sentiment analysis field.Instituto de Investigación en Informática2021-01-18info:eu-repo/semantics/articleinfo:eu-repo/semantics/publishedVersionArticulohttp://purl.org/coar/resource_type/c_6501info:ar-repo/semantics/articuloapplication/pdf1-18http://sedici.unlp.edu.ar/handle/10915/138899enginfo:eu-repo/semantics/altIdentifier/issn/1866-9956info:eu-repo/semantics/altIdentifier/issn/1866-9964info:eu-repo/semantics/altIdentifier/doi/10.1007/s12559-020-09800-xinfo:eu-repo/semantics/openAccesshttp://creativecommons.org/licenses/by/4.0/Creative Commons Attribution 4.0 International (CC BY 4.0)reponame:SEDICI (UNLP)instname:Universidad Nacional de La Platainstacron:UNLP2026-05-27T11:26:03Zoai:sedici.unlp.edu.ar:10915/138899Institucionalhttp://sedici.unlp.edu.ar/Universidad públicaNo correspondehttp://sedici.unlp.edu.ar/oai/snrdalira@sedici.unlp.edu.arArgentinaNo correspondeNo correspondeNo correspondeopendoar:13292026-05-27 11:26:03.541SEDICI (UNLP) - Universidad Nacional de La Platafalse
dc.title.none.fl_str_mv	Distant Supervised Construction and Evaluation of a Novel Dataset of Emotion-Tagged Social Media Comments in Spanish
title	Distant Supervised Construction and Evaluation of a Novel Dataset of Emotion-Tagged Social Media Comments in Spanish
spellingShingle	Distant Supervised Construction and Evaluation of a Novel Dataset of Emotion-Tagged Social Media Comments in Spanish Tessore, Juan Pablo Informática Sentiment analysis Dataset construction Dataset validation Facebook Text mining
title_short	Distant Supervised Construction and Evaluation of a Novel Dataset of Emotion-Tagged Social Media Comments in Spanish
title_full	Distant Supervised Construction and Evaluation of a Novel Dataset of Emotion-Tagged Social Media Comments in Spanish
title_fullStr	Distant Supervised Construction and Evaluation of a Novel Dataset of Emotion-Tagged Social Media Comments in Spanish
title_full_unstemmed	Distant Supervised Construction and Evaluation of a Novel Dataset of Emotion-Tagged Social Media Comments in Spanish
title_sort	Distant Supervised Construction and Evaluation of a Novel Dataset of Emotion-Tagged Social Media Comments in Spanish
dc.creator.none.fl_str_mv	Tessore, Juan Pablo Esnaola, Leonardo Lanzarini, Laura Cristina Baldassarri, Sandra
author	Tessore, Juan Pablo
author_facet	Tessore, Juan Pablo Esnaola, Leonardo Lanzarini, Laura Cristina Baldassarri, Sandra
author_role	author
author2	Esnaola, Leonardo Lanzarini, Laura Cristina Baldassarri, Sandra
author2_role	author author author
dc.subject.none.fl_str_mv	Informática Sentiment analysis Dataset construction Dataset validation Facebook Text mining
topic	Informática Sentiment analysis Dataset construction Dataset validation Facebook Text mining
dc.description.none.fl_txt_mv	Tagged language resources are an essential requirement for developing machine-learning text-based classifiers. However, manual tagging is extremely time consuming and the resulting datasets are rather small, containing only a few thousand samples. Basic emotion datasets are particularly difficult to classify manually because categorization is prone to subjectivity, and thus, redundant classification is required to validate the assigned tag. Even though, in recent years, the amount of emotion-tagged text datasets in Spanish has been growing, it cannot be compared with the number, size, and quality of the datasets in English. Quality is a particularly concerning issue, as not many datasets in Spanish included a validation step in the construction process. In this article, a dataset of social media comments in Spanish is compiled, selected, filtered, and presented. A sample of the dataset is reclassified by a group of psychologists and validated using the Fleiss Kappa interrater agreement measure. Error analysis is performed by using the Sentic Computing tool BabelSenticNet. Results indicate that the agreement between the human raters and the automatically acquired tag is moderate, similar to other manually tagged datasets, with the advantages that the presented dataset contains several hundreds of thousands of tagged comments and it does not require extensive manual tagging. The agreement measured between human raters is very similar to the one between human raters and the original tag. Every measure presented is in the moderate agreement zone and, as such, suitable for training classification algorithms in sentiment analysis field. Instituto de Investigación en Informática
description	Tagged language resources are an essential requirement for developing machine-learning text-based classifiers. However, manual tagging is extremely time consuming and the resulting datasets are rather small, containing only a few thousand samples. Basic emotion datasets are particularly difficult to classify manually because categorization is prone to subjectivity, and thus, redundant classification is required to validate the assigned tag. Even though, in recent years, the amount of emotion-tagged text datasets in Spanish has been growing, it cannot be compared with the number, size, and quality of the datasets in English. Quality is a particularly concerning issue, as not many datasets in Spanish included a validation step in the construction process. In this article, a dataset of social media comments in Spanish is compiled, selected, filtered, and presented. A sample of the dataset is reclassified by a group of psychologists and validated using the Fleiss Kappa interrater agreement measure. Error analysis is performed by using the Sentic Computing tool BabelSenticNet. Results indicate that the agreement between the human raters and the automatically acquired tag is moderate, similar to other manually tagged datasets, with the advantages that the presented dataset contains several hundreds of thousands of tagged comments and it does not require extensive manual tagging. The agreement measured between human raters is very similar to the one between human raters and the original tag. Every measure presented is in the moderate agreement zone and, as such, suitable for training classification algorithms in sentiment analysis field.
publishDate	2021
dc.date.none.fl_str_mv	2021-01-18
dc.type.none.fl_str_mv	info:eu-repo/semantics/article info:eu-repo/semantics/publishedVersion Articulo http://purl.org/coar/resource_type/c_6501 info:ar-repo/semantics/articulo
format	article
status_str	publishedVersion
dc.identifier.none.fl_str_mv	http://sedici.unlp.edu.ar/handle/10915/138899
url	http://sedici.unlp.edu.ar/handle/10915/138899
dc.language.none.fl_str_mv	eng
language	eng
dc.relation.none.fl_str_mv	info:eu-repo/semantics/altIdentifier/issn/1866-9956 info:eu-repo/semantics/altIdentifier/issn/1866-9964 info:eu-repo/semantics/altIdentifier/doi/10.1007/s12559-020-09800-x
dc.rights.none.fl_str_mv	info:eu-repo/semantics/openAccess http://creativecommons.org/licenses/by/4.0/ Creative Commons Attribution 4.0 International (CC BY 4.0)
eu_rights_str_mv	openAccess
rights_invalid_str_mv	http://creativecommons.org/licenses/by/4.0/ Creative Commons Attribution 4.0 International (CC BY 4.0)
dc.format.none.fl_str_mv	application/pdf 1-18
dc.source.none.fl_str_mv	reponame:SEDICI (UNLP) instname:Universidad Nacional de La Plata instacron:UNLP
reponame_str	SEDICI (UNLP)
collection	SEDICI (UNLP)
instname_str	Universidad Nacional de La Plata
instacron_str	UNLP
institution	UNLP
repository.name.fl_str_mv	SEDICI (UNLP) - Universidad Nacional de La Plata
repository.mail.fl_str_mv	alira@sedici.unlp.edu.ar
_version_	1866371868425977856
score	13.143419

Distant Supervised Construction and Evaluation of a Novel Dataset of Emotion-Tagged Social Media Comments in Spanish

Publicaciones similares