Towards Information Quality Assurance in Spanish: Wikipedia

Authors
Ferretti, Edgardo; Soria, Matías; Pérez Casseignau, Sebastián; Pohn, Lian; Urquiza, Guido; Gómez, Sergio Alejandro; Errecalde, Marcelo Luis
Publication year
2017
Language
English
Resource type
article
Status
published version
Description
Featured Articles (FA) are considered the best articles that Wikipedia has to offer, and in recent years researchers have found it interesting to analyze whether and how they can be distinguished from “ordinary” articles. Likewise, identifying which issues must be improved or fixed in ordinary articles to raise their quality is a recent key research trend. Most approaches developed to address these information quality problems have been proposed for the English Wikipedia. However, few efforts have targeted the Spanish Wikipedia, even though Spanish is one of the most spoken languages in the world by number of native speakers. In this respect, we present a breakdown of Spanish Wikipedia’s quality flaw structure. In addition, we carry out studies with three different corpora to automatically assess information quality in Spanish Wikipedia, where FA identification is evaluated as a binary classification task. Our evaluation in a unified setting allows the performance achieved by our approach on the Spanish version to be compared with that on the English version. The best results show that FA identification in Spanish can be performed with an F1 score of 0.88 using a document model consisting of only twenty-six features and a Support Vector Machine as the classification algorithm.
Facultad de Informática
Subject
Computer Science
featured article identification
information quality
quality flaws prediction
Wikipedia
Access level
open access
Terms of use
http://creativecommons.org/licenses/by/3.0/
Repository
SEDICI (UNLP)
Institution
Universidad Nacional de La Plata
OAI identifier
oai:sedici.unlp.edu.ar:10915/59979
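
The description above reports an F1 score for a binary task (featured versus ordinary articles). As a reading aid only, the following minimal sketch shows how F1 is computed for the positive class from confusion counts. The function name and the example counts are hypothetical and are not taken from the paper; they were chosen only so that the result equals 0.88.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Return F1 for the positive class from confusion counts.

    tp: positives correctly identified (e.g. featured articles found)
    fp: negatives wrongly labeled positive
    fn: positives that were missed
    F1 is the harmonic mean of precision and recall, which simplifies
    to 2*tp / (2*tp + fp + fn).
    """
    denom = 2 * tp + fp + fn
    if denom == 0:
        # No positive examples and no positive predictions: define F1 as 0.
        return 0.0
    return 2 * tp / denom


if __name__ == "__main__":
    # Hypothetical counts, NOT results from the paper.
    print(f1_score(tp=88, fp=12, fn=12))
```

With these illustrative counts, precision and recall are both 88/100, so their harmonic mean is also 0.88.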

dc.date.none.fl_str_mv 2017-04
dc.type.none.fl_str_mv info:eu-repo/semantics/article
info:eu-repo/semantics/publishedVersion
Articulo
http://purl.org/coar/resource_type/c_6501
info:ar-repo/semantics/articulo
dc.identifier.none.fl_str_mv http://sedici.unlp.edu.ar/handle/10915/59979
dc.language.none.fl_str_mv eng
dc.relation.none.fl_str_mv info:eu-repo/semantics/altIdentifier/url/http://journal.info.unlp.edu.ar/wp-content/uploads/2017/05/JCST-44-Paper-4.pdf
info:eu-repo/semantics/altIdentifier/issn/1666-6038
dc.rights.none.fl_str_mv info:eu-repo/semantics/openAccess
http://creativecommons.org/licenses/by/3.0/
Creative Commons Attribution 3.0 Unported (CC BY 3.0)
dc.format.none.fl_str_mv application/pdf
29-36
dc.source.none.fl_str_mv reponame:SEDICI (UNLP)
instname:Universidad Nacional de La Plata
instacron:UNLP
repository.name.fl_str_mv SEDICI (UNLP) - Universidad Nacional de La Plata
repository.mail.fl_str_mv alira@sedici.unlp.edu.ar