A Spanish text corpus for the author profiling task
- Authors
- Villegas, María Paula; Garciarena Ucelay, María José; Errecalde, Marcelo Luis; Cagnina, Leticia
- Publication year
- 2014
- Language
- English
- Resource type
- conference paper
- Status
- published version
- Description
- Author Profiling is the task of predicting characteristics of the author of a text, such as age, gender, personality, and native language. The task is of growing importance because of its potential applications in security, crime, and marketing, among others. One of the main difficulties in this field is the lack of reliable text collections (corpora) for training and testing automatically derived classifiers, particularly for specific languages such as Spanish. Although some recent data sets were generated for the PAN competitions, those documents contain a lot of “noise” that prevents researchers from drawing more general conclusions about how the task behaves when more formal documents are used. In this context, this work proposes and describes SpanText, a collection of formal texts in Spanish which is, as far as we know, the first collection with these characteristics for the author profiling task. In addition, an experimental study clearly establishes the difference in performance obtained with formal and informal texts, and opens interesting lines of research toward a deeper understanding of the particularities that each type of document poses for the author profiling task.
XI Workshop Bases de Datos y Minería de Datos
Red de Universidades con Carreras de Informática (RedUNCI)
- Subject
- Computer Science (Ciencias Informáticas)
author profiling
natural processing language
Spanish text corpus
- Access level
- open access
- Terms of use
- http://creativecommons.org/licenses/by-nc-sa/2.5/ar/
- Repository
- Institution
- Universidad Nacional de La Plata
- OAI identifier
- oai:sedici.unlp.edu.ar:10915/42290
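The OAI identifier above follows the usual `oai:<namespace>:<local id>` layout, so it can be split mechanically. The sketch below is a minimal illustration, not part of the record itself: the function names are hypothetical, the handle-URL pattern is inferred from this record's own URL field, and the base URL passed to `get_record_url` is an assumption that should be checked against the repository before use. The query parameters (`verb`, `identifier`, `metadataPrefix`) are the standard OAI-PMH `GetRecord` arguments. No network request is made.

```python
from urllib.parse import urlencode


def parse_oai_identifier(oai_id: str) -> dict:
    """Split 'oai:<namespace>:<local id>' into its two variable parts."""
    # maxsplit=2 keeps any further ':' characters inside the local id.
    scheme, namespace, local_id = oai_id.split(":", 2)
    if scheme != "oai":
        raise ValueError(f"not an OAI identifier: {oai_id!r}")
    return {"namespace": namespace, "local_id": local_id}


def handle_url(oai_id: str) -> str:
    """Build the landing-page URL.

    Assumption: the repository serves records at
    http://<namespace>/handle/<local id>, as this record's URL field suggests.
    """
    parts = parse_oai_identifier(oai_id)
    return f"http://{parts['namespace']}/handle/{parts['local_id']}"


def get_record_url(base_url: str, oai_id: str, metadata_prefix: str = "oai_dc") -> str:
    """Build an OAI-PMH GetRecord request URL (the URL only, no request is sent)."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": oai_id,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


if __name__ == "__main__":
    oai_id = "oai:sedici.unlp.edu.ar:10915/42290"
    print(parse_oai_identifier(oai_id))
    print(handle_url(oai_id))
    # Assumed base URL; verify it against the repository's own documentation.
    print(get_record_url("http://sedici.unlp.edu.ar/oai/snrd", oai_id))
```

Keeping URL construction separate from any fetching step makes the identifier handling easy to check offline before harvesting anything.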
Full record metadata
id |
SEDICI_60511b35f800c0db56dd56726e2a453f |
---|---|
oai_identifier_str |
oai:sedici.unlp.edu.ar:10915/42290 |
network_acronym_str |
SEDICI |
repository_id_str |
1329 |
network_name_str |
SEDICI (UNLP) |
dc.title.none.fl_str_mv |
A Spanish text corpus for the author profiling task |
title |
A Spanish text corpus for the author profiling task |
dc.creator.none.fl_str_mv |
Villegas, María Paula Garciarena Ucelay, María José Errecalde, Marcelo Luis Cagnina, Leticia |
author |
Villegas, María Paula |
author_facet |
Villegas, María Paula Garciarena Ucelay, María José Errecalde, Marcelo Luis Cagnina, Leticia |
author_role |
author |
author2 |
Garciarena Ucelay, María José Errecalde, Marcelo Luis Cagnina, Leticia |
author2_role |
author author author |
dc.subject.none.fl_str_mv |
Ciencias Informáticas author profiling natural processing language Spanish text corpus |
topic |
Ciencias Informáticas author profiling natural processing language Spanish text corpus |
publishDate |
2014 |
dc.date.none.fl_str_mv |
2014-10 |
dc.type.none.fl_str_mv |
info:eu-repo/semantics/conferenceObject info:eu-repo/semantics/publishedVersion Objeto de conferencia http://purl.org/coar/resource_type/c_5794 info:ar-repo/semantics/documentoDeConferencia |
format |
conferenceObject |
status_str |
publishedVersion |
dc.identifier.none.fl_str_mv |
http://sedici.unlp.edu.ar/handle/10915/42290 |
url |
http://sedici.unlp.edu.ar/handle/10915/42290 |
dc.language.none.fl_str_mv |
eng |
language |
eng |
dc.rights.none.fl_str_mv |
info:eu-repo/semantics/openAccess http://creativecommons.org/licenses/by-nc-sa/2.5/ar/ Creative Commons Attribution-NonCommercial-ShareAlike 2.5 Argentina (CC BY-NC-SA 2.5) |
eu_rights_str_mv |
openAccess |
rights_invalid_str_mv |
http://creativecommons.org/licenses/by-nc-sa/2.5/ar/ Creative Commons Attribution-NonCommercial-ShareAlike 2.5 Argentina (CC BY-NC-SA 2.5) |
dc.format.none.fl_str_mv |
application/pdf |
dc.source.none.fl_str_mv |
reponame:SEDICI (UNLP) instname:Universidad Nacional de La Plata instacron:UNLP |
reponame_str |
SEDICI (UNLP) |
collection |
SEDICI (UNLP) |
instname_str |
Universidad Nacional de La Plata |
instacron_str |
UNLP |
institution |
UNLP |
repository.name.fl_str_mv |
SEDICI (UNLP) - Universidad Nacional de La Plata |
repository.mail.fl_str_mv |
alira@sedici.unlp.edu.ar |