A web platform for collaborative semi-automatic OCR post-processing
- Authors
- Mechaca C., Ana L.; Marmanillo, Walter G.; Xamena, Eduardo; Ramirez-Orta, Juan; Maguitman, Ana Gabriela; Milios, Evangelos E.
- Publication year
- 2021
- Language
- English
- Resource type
- conference document
- Status
- published version
- Description
- Digital Humanities researchers often make use of software that helps them in the task of finding non-trivial relationships among characters in historical text. Usually, the source texts that contain such information come from OCR acquired volumes, carrying high amounts of errors within them. This work explains the development of a web platform for the task of OCR post-processing and ground-truth generation. This platform employs machine learning to predict the correct texts accurately from OCR noisy strings. The method used for this task involves transformers for character-based denoising language models. An active learning workflow is proposed, as the users can feed their corrections to the platform, generating new annotated data for re-training the underlying machine learning correction models.
Sociedad Argentina de Informática e Investigación Operativa
- Subject
- Ciencias Informáticas (Computer Science); OCR Post-processing; Digital Humanities; Language Models
- Accessibility level
- open access
- Terms of use
- http://creativecommons.org/licenses/by-nc-sa/3.0/
- Repository
- Institution
- Universidad Nacional de La Plata
- OAI identifier
- oai:sedici.unlp.edu.ar:10915/140119
Full record metadata
id | SEDICI_77b14ce776c8812afab6371513b02198 |
---|---|
oai_identifier_str | oai:sedici.unlp.edu.ar:10915/140119 |
network_acronym_str | SEDICI |
repository_id_str | 1329 |
network_name_str | SEDICI (UNLP) |
dc.title.none.fl_str_mv | A web platform for collaborative semi-automatic OCR post-processing |
title | A web platform for collaborative semi-automatic OCR post-processing |
spellingShingle | A web platform for collaborative semi-automatic OCR post-processing; Mechaca C., Ana L.; Ciencias Informáticas; OCR Post-processing; Digital Humanities; Language Models |
title_short | A web platform for collaborative semi-automatic OCR post-processing |
title_full | A web platform for collaborative semi-automatic OCR post-processing |
title_fullStr | A web platform for collaborative semi-automatic OCR post-processing |
title_full_unstemmed | A web platform for collaborative semi-automatic OCR post-processing |
title_sort | A web platform for collaborative semi-automatic OCR post-processing |
dc.creator.none.fl_str_mv | Mechaca C., Ana L.; Marmanillo, Walter G.; Xamena, Eduardo; Ramirez-Orta, Juan; Maguitman, Ana Gabriela; Milios, Evangelos E. |
author | Mechaca C., Ana L. |
author_facet | Mechaca C., Ana L.; Marmanillo, Walter G.; Xamena, Eduardo; Ramirez-Orta, Juan; Maguitman, Ana Gabriela; Milios, Evangelos E. |
author_role | author |
author2 | Marmanillo, Walter G.; Xamena, Eduardo; Ramirez-Orta, Juan; Maguitman, Ana Gabriela; Milios, Evangelos E. |
author2_role | author; author; author; author; author |
dc.subject.none.fl_str_mv | Ciencias Informáticas; OCR Post-processing; Digital Humanities; Language Models |
topic | Ciencias Informáticas; OCR Post-processing; Digital Humanities; Language Models |
dc.description.none.fl_txt_mv | Digital Humanities researchers often make use of software that helps them in the task of finding non-trivial relationships among characters in historical text. Usually, the source texts that contain such information come from OCR acquired volumes, carrying high amounts of errors within them. This work explains the development of a web platform for the task of OCR post-processing and ground-truth generation. This platform employs machine learning to predict the correct texts accurately from OCR noisy strings. The method used for this task involves transformers for character-based denoising language models. An active learning workflow is proposed, as the users can feed their corrections to the platform, generating new annotated data for re-training the underlying machine learning correction models.; Sociedad Argentina de Informática e Investigación Operativa |
description | Digital Humanities researchers often make use of software that helps them in the task of finding non-trivial relationships among characters in historical text. Usually, the source texts that contain such information come from OCR acquired volumes, carrying high amounts of errors within them. This work explains the development of a web platform for the task of OCR post-processing and ground-truth generation. This platform employs machine learning to predict the correct texts accurately from OCR noisy strings. The method used for this task involves transformers for character-based denoising language models. An active learning workflow is proposed, as the users can feed their corrections to the platform, generating new annotated data for re-training the underlying machine learning correction models. |
publishDate | 2021 |
dc.date.none.fl_str_mv | 2021-10 |
dc.type.none.fl_str_mv | info:eu-repo/semantics/conferenceObject; info:eu-repo/semantics/publishedVersion; Objeto de conferencia; http://purl.org/coar/resource_type/c_5794; info:ar-repo/semantics/documentoDeConferencia |
format | conferenceObject |
status_str | publishedVersion |
dc.identifier.none.fl_str_mv | http://sedici.unlp.edu.ar/handle/10915/140119 |
url | http://sedici.unlp.edu.ar/handle/10915/140119 |
dc.language.none.fl_str_mv | eng |
language | eng |
dc.relation.none.fl_str_mv | info:eu-repo/semantics/altIdentifier/url/http://50jaiio.sadio.org.ar/pdfs/agranda/AGRANDA-02.pdf; info:eu-repo/semantics/altIdentifier/issn/2683-8966 |
dc.rights.none.fl_str_mv | info:eu-repo/semantics/openAccess; http://creativecommons.org/licenses/by-nc-sa/3.0/; Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0) |
eu_rights_str_mv | openAccess |
rights_invalid_str_mv | http://creativecommons.org/licenses/by-nc-sa/3.0/; Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0) |
dc.format.none.fl_str_mv | application/pdf; 11-14 |
dc.source.none.fl_str_mv | reponame:SEDICI (UNLP); instname:Universidad Nacional de La Plata; instacron:UNLP |
reponame_str | SEDICI (UNLP) |
collection | SEDICI (UNLP) |
instname_str | Universidad Nacional de La Plata |
instacron_str | UNLP |
institution | UNLP |
repository.name.fl_str_mv | SEDICI (UNLP) - Universidad Nacional de La Plata |
repository.mail.fl_str_mv | alira@sedici.unlp.edu.ar |
_version_ | 1844616234431152128 |
score | 13.070432 |