A web platform for collaborative semi-automatic OCR post-processing

Authors
Mechaca C., Ana L.; Marmanillo, Walter G.; Xamena, Eduardo; Ramirez-Orta, Juan; Maguitman, Ana Gabriela; Milios, Evangelos E.
Publication year
2021
Language
English
Resource type
conference paper
Status
published version
Description
Digital Humanities researchers often use software to find non-trivial relationships among characters in historical texts. The source texts usually come from OCR-acquired volumes, which contain many errors. This work describes the development of a web platform for OCR post-processing and ground-truth generation. The platform uses machine learning to accurately predict the correct text from noisy OCR strings. The method relies on transformer-based, character-level denoising language models. An active learning workflow is proposed: users feed their corrections back to the platform, which generates new annotated data for re-training the underlying machine learning correction models.
Sociedad Argentina de Informática e Investigación Operativa
Subject
Computer Science
OCR Post-processing
Digital Humanities
Language Models
Access level
open access
Terms of use
http://creativecommons.org/licenses/by-nc-sa/3.0/
Repository
SEDICI (UNLP)
Institution
Universidad Nacional de La Plata
OAI identifier
oai:sedici.unlp.edu.ar:10915/140119

id SEDICI_77b14ce776c8812afab6371513b02198
repository_id_str 1329
publishDate 2021
dc.date.none.fl_str_mv 2021-10
dc.type.none.fl_str_mv info:eu-repo/semantics/conferenceObject
info:eu-repo/semantics/publishedVersion
Conference object
http://purl.org/coar/resource_type/c_5794
info:ar-repo/semantics/documentoDeConferencia
format conferenceObject
status_str publishedVersion
dc.identifier.none.fl_str_mv http://sedici.unlp.edu.ar/handle/10915/140119
dc.language.none.fl_str_mv eng
dc.relation.none.fl_str_mv info:eu-repo/semantics/altIdentifier/url/http://50jaiio.sadio.org.ar/pdfs/agranda/AGRANDA-02.pdf
info:eu-repo/semantics/altIdentifier/issn/2683-8966
dc.rights.none.fl_str_mv info:eu-repo/semantics/openAccess
http://creativecommons.org/licenses/by-nc-sa/3.0/
Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
dc.format.none.fl_str_mv application/pdf
11-14
dc.source.none.fl_str_mv reponame:SEDICI (UNLP)
instname:Universidad Nacional de La Plata
instacron:UNLP
repository.name.fl_str_mv SEDICI (UNLP) - Universidad Nacional de La Plata
repository.mail.fl_str_mv alira@sedici.unlp.edu.ar