Towards Information Quality Assurance in Spanish: Wikipedia
- Authors
- Ferretti, Edgardo; Soria, Matías; Pérez Casseignau, Sebastián; Pohn, Lian; Urquiza, Guido; Gómez, Sergio Alejandro; Errecalde, Marcelo
- Publication year
- 2017
- Language
- English
- Resource type
- article
- Status
- published version
- Description
- Featured Articles (FA) are considered the best articles that Wikipedia has to offer, and in recent years researchers have taken an interest in analyzing whether and how they can be distinguished from “ordinary” articles. Likewise, identifying which issues must be improved or fixed in ordinary articles to raise their quality is a key recent research trend. Most of the approaches developed to address these information quality problems have been proposed for the English Wikipedia. However, little work has been done on the Spanish Wikipedia, even though Spanish is one of the most widely spoken languages in the world by number of native speakers. In this respect, we present a breakdown of Spanish Wikipedia’s quality flaw structure. In addition, we carry out studies with three different corpora to automatically assess information quality in Spanish Wikipedia, where FA identification is evaluated as a binary classification task. Our evaluation in a unified setting makes it possible to compare the performance achieved by our approach on the Spanish version with that on the English version. The best results show that FA identification in Spanish can be performed with an F1 score of 0.88 using a document model consisting of only twenty-six features and a Support Vector Machine as the classification algorithm.
- Subject
- Computer and Information Sciences; featured article identification; information quality; quality flaws prediction; Wikipedia
- Accessibility level
- open access
- Terms of use
- http://creativecommons.org/licenses/by/4.0/
- Repository
- CIC Digital (CICBA)
- Institution
- Comisión de Investigaciones Científicas de la Provincia de Buenos Aires
- OAI identifier
- oai:digital.cic.gba.gob.ar:11746/5668
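The description above treats FA identification as a binary classification task and reports its quality as an F1 score. As a minimal sketch of how such a score is computed (this is not the authors' code; the labels `FA` and `non-FA`, the function names, and the toy predictions are all hypothetical and unrelated to the reported 0.88), F1 is the harmonic mean of precision and recall on the positive class:

```python
def confusion_counts(gold, predicted, positive="FA"):
    """Count true positives, false positives and false negatives.

    `positive` is the label treated as the target class; "FA" is a
    hypothetical label name chosen for this illustration.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        if p == positive and g == positive:
            tp += 1  # predicted featured, really featured
        elif p == positive:
            fp += 1  # predicted featured, actually ordinary
        elif g == positive:
            fn += 1  # featured article the classifier missed
    return tp, fp, fn


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall; 0.0 when undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy data invented for this example, not taken from the paper.
    gold = ["FA", "FA", "FA", "FA", "non-FA", "non-FA", "non-FA", "non-FA"]
    pred = ["FA", "FA", "FA", "non-FA", "FA", "non-FA", "non-FA", "non-FA"]
    tp, fp, fn = confusion_counts(gold, pred)
    print(f"F1 = {f1_score(tp, fp, fn):.2f}")  # prints "F1 = 0.75"
```

Because F1 ignores true negatives, it stays informative when ordinary articles greatly outnumber featured ones, which is one common reason for preferring it over plain accuracy in this kind of task.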
Full record metadata

| Field | Value |
|---|---|
| id | CICBA_db22c32439b9c553ff557b14731dcae1 |
| oai_identifier_str | oai:digital.cic.gba.gob.ar:11746/5668 |
| network_acronym_str | CICBA |
| repository_id_str | 9441 |
| network_name_str | CIC Digital (CICBA) |
| dc.title.none.fl_str_mv | Towards Information Quality Assurance in Spanish: Wikipedia |
| title | Towards Information Quality Assurance in Spanish: Wikipedia |
| title_short | Towards Information Quality Assurance in Spanish: Wikipedia |
| title_full | Towards Information Quality Assurance in Spanish: Wikipedia |
| title_fullStr | Towards Information Quality Assurance in Spanish: Wikipedia |
| title_full_unstemmed | Towards Information Quality Assurance in Spanish: Wikipedia |
| title_sort | Towards Information Quality Assurance in Spanish: Wikipedia |
| dc.creator.none.fl_str_mv | Ferretti, Edgardo; Soria, Matías; Pérez Casseignau, Sebastián; Pohn, Lian; Urquiza, Guido; Gómez, Sergio Alejandro; Errecalde, Marcelo |
| author | Ferretti, Edgardo |
| author_facet | Ferretti, Edgardo; Soria, Matías; Pérez Casseignau, Sebastián; Pohn, Lian; Urquiza, Guido; Gómez, Sergio Alejandro; Errecalde, Marcelo |
| author_role | author |
| author2 | Soria, Matías; Pérez Casseignau, Sebastián; Pohn, Lian; Urquiza, Guido; Gómez, Sergio Alejandro; Errecalde, Marcelo |
| author2_role | author; author; author; author; author; author |
| dc.subject.none.fl_str_mv | Computer and Information Sciences; featured article identification; information quality; quality flaws prediction; Wikipedia |
| topic | Computer and Information Sciences; featured article identification; information quality; quality flaws prediction; Wikipedia |
| publishDate | 2017 |
| dc.date.none.fl_str_mv | 2017-04 |
| dc.type.none.fl_str_mv | info:eu-repo/semantics/article; info:eu-repo/semantics/publishedVersion; http://purl.org/coar/resource_type/c_6501; info:ar-repo/semantics/articulo |
| format | article |
| status_str | publishedVersion |
| dc.identifier.none.fl_str_mv | https://digital.cic.gba.gob.ar/handle/11746/5668 |
| url | https://digital.cic.gba.gob.ar/handle/11746/5668 |
| dc.language.none.fl_str_mv | eng |
| language | eng |
| dc.relation.none.fl_str_mv | info:eu-repo/semantics/altIdentifier/issn/1666-6038 |
| dc.rights.none.fl_str_mv | info:eu-repo/semantics/openAccess; http://creativecommons.org/licenses/by/4.0/ |
| eu_rights_str_mv | openAccess |
| rights_invalid_str_mv | http://creativecommons.org/licenses/by/4.0/ |
| dc.format.none.fl_str_mv | application/pdf |
| dc.source.none.fl_str_mv | reponame:CIC Digital (CICBA); instname:Comisión de Investigaciones Científicas de la Provincia de Buenos Aires; instacron:CICBA |
| reponame_str | CIC Digital (CICBA) |
| collection | CIC Digital (CICBA) |
| instname_str | Comisión de Investigaciones Científicas de la Provincia de Buenos Aires |
| instacron_str | CICBA |
| institution | CICBA |
| repository.name.fl_str_mv | CIC Digital (CICBA) - Comisión de Investigaciones Científicas de la Provincia de Buenos Aires |
| repository.mail.fl_str_mv | marisa.degiusti@sedici.unlp.edu.ar |
| _version_ | 1844618617709133824 |
| score | 13.070432 |