A Spanish text corpus for the author profiling task

Authors
Villegas, María Paula; Garciarena Ucelay, María José; Errecalde, Marcelo Luis; Cagnina, Leticia
Publication year
2014
Language
English
Resource type
conference paper
Status
published version
Description
Author Profiling is the task of predicting characteristics of the author of a text, such as age, gender, personality, or native language. The task is of growing importance because of its potential applications in security, crime and marketing, among others. One of the main difficulties in this field is the lack of reliable text collections (corpora) to train and test automatically derived classifiers, particularly for specific languages such as Spanish. Although some recent data sets were generated for the PAN competitions, those documents contain a lot of “noise”, which prevents researchers from drawing more general conclusions about the task when more formal documents are used. In this context, this work proposes and describes SpanText, a collection of formal texts in Spanish which is, as far as we know, the first collection with these characteristics for the author profiling task. In addition, an experimental study clearly establishes the difference in performance obtained with formal and informal texts and opens interesting research lines toward a deeper understanding of the particularities that each type of document poses for the author profiling task.
XI Workshop Bases de Datos y Minería de Datos
Red de Universidades con Carreras de Informática (RedUNCI)
Subject
Computer Science
author profiling
natural language processing
Spanish text corpus
Access level
open access
Terms of use
http://creativecommons.org/licenses/by-nc-sa/2.5/ar/
Repository
SEDICI (UNLP)
Institution
Universidad Nacional de La Plata
OAI identifier
oai:sedici.unlp.edu.ar:10915/42290

id SEDICI_60511b35f800c0db56dd56726e2a453f
oai_identifier_str oai:sedici.unlp.edu.ar:10915/42290
network_acronym_str SEDICI
repository_id_str 1329
network_name_str SEDICI (UNLP)
dc.title.none.fl_str_mv A Spanish text corpus for the author profiling task
dc.creator.none.fl_str_mv Villegas, María Paula
Garciarena Ucelay, María José
Errecalde, Marcelo Luis
Cagnina, Leticia
dc.subject.none.fl_str_mv Computer Science
author profiling
natural language processing
Spanish text corpus
dc.description.none.fl_txt_mv Author Profiling is the task of predicting characteristics of the author of a text, such as age, gender, personality, or native language. The task is of growing importance because of its potential applications in security, crime and marketing, among others. One of the main difficulties in this field is the lack of reliable text collections (corpora) to train and test automatically derived classifiers, particularly for specific languages such as Spanish. Although some recent data sets were generated for the PAN competitions, those documents contain a lot of “noise”, which prevents researchers from drawing more general conclusions about the task when more formal documents are used. In this context, this work proposes and describes SpanText, a collection of formal texts in Spanish which is, as far as we know, the first collection with these characteristics for the author profiling task. In addition, an experimental study clearly establishes the difference in performance obtained with formal and informal texts and opens interesting research lines toward a deeper understanding of the particularities that each type of document poses for the author profiling task.
XI Workshop Bases de Datos y Minería de Datos
Red de Universidades con Carreras de Informática (RedUNCI)
dc.date.none.fl_str_mv 2014-10
dc.type.none.fl_str_mv info:eu-repo/semantics/conferenceObject
info:eu-repo/semantics/publishedVersion
Conference object
http://purl.org/coar/resource_type/c_5794
info:ar-repo/semantics/documentoDeConferencia
dc.identifier.none.fl_str_mv http://sedici.unlp.edu.ar/handle/10915/42290
dc.language.none.fl_str_mv eng
dc.rights.none.fl_str_mv info:eu-repo/semantics/openAccess
http://creativecommons.org/licenses/by-nc-sa/2.5/ar/
Creative Commons Attribution-NonCommercial-ShareAlike 2.5 Argentina (CC BY-NC-SA 2.5)
dc.format.none.fl_str_mv application/pdf
dc.source.none.fl_str_mv reponame:SEDICI (UNLP)
instname:Universidad Nacional de La Plata
instacron:UNLP
repository.name.fl_str_mv SEDICI (UNLP) - Universidad Nacional de La Plata
repository.mail.fl_str_mv alira@sedici.unlp.edu.ar