Automatically Assessing the Need of Additional Citations for Information Quality Verification in Wikipedia Articles

Authors
Bazán Pereyra, Gerónimo; Cuello, Carolina; Capodici, Gianfranco; Jofré, Vanessa; Ferretti, Edgardo; Errecalde, Marcelo Luis
Publication year
2019
Language
English
Resource type
conference paper
Status
published version
Description
Quality flaw prediction in Wikipedia is an ongoing research trend. In this work we tackle the problem of automatically assessing whether an article needs additional citations to verify its content: the so-called Refimprove quality flaw. This information quality flaw ranks among the five most frequent flaws and accounts for 12.4% of the flawed articles in the English Wikipedia. Under-bagged decision trees, biased-SVM, and centroid-based balanced SVM, three different state-of-the-art approaches, were evaluated with the aim of handling the existing imbalance between the number of articles tagged as flawed and the remaining untagged documents in Wikipedia, which can help in the learning stage of the algorithms. A uniformly sampled balanced SVM classifier was also evaluated as a baseline. The results showed that under-bagged decision trees with the min rule as aggregation method perform best, achieving an F1 score of 0.96 on the test corpus from the 1st International Competition on Quality Flaw Prediction in Wikipedia, a well-known uniform evaluation corpus in this research field. Likewise, biased-SVM achieved an F1 score that outperforms previously published results.
II Track de Gobierno Digital y Ciudades Inteligentes.
Red de Universidades con Carreras en Informática
Subject
Computer Science
Wikipedia
Information Quality
Quality Flaws Prediction
Refimprove Flaw
Access level
open access
Terms of use
http://creativecommons.org/licenses/by-nc-sa/4.0/
Repository
SEDICI (UNLP)
Institution
Universidad Nacional de La Plata
OAI identifier
oai:sedici.unlp.edu.ar:10915/90453

dc.date.none.fl_str_mv 2019-10
dc.identifier.none.fl_str_mv http://sedici.unlp.edu.ar/handle/10915/90453
dc.relation.none.fl_str_mv info:eu-repo/semantics/altIdentifier/isbn/978-987-688-377-1
info:eu-repo/semantics/reference/hdl/10915/90359
dc.rights.none.fl_str_mv info:eu-repo/semantics/openAccess
http://creativecommons.org/licenses/by-nc-sa/4.0/
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
dc.format.none.fl_str_mv application/pdf
42-51
repository.mail.fl_str_mv alira@sedici.unlp.edu.ar