Language modeling tools for massive historical OCR post-processing

Autores
Xamena, Eduardo; Maguitman, Ana Gabriela
Año de publicación
2020
Idioma
inglés
Tipo de recurso
documento de conferencia
Estado
versión publicada
Descripción
Upon these days, there is a large number of available historical documentary collections that have not been exploited to extract information. Many efforts are being made to digitize these volumes and make them available for digital platforms. However, various obstacles appear in the task of processing their content. Due to the deterioration of documents and other factors such as the different dialects and language variants, the quality of the digitizations is usually low. By means of NLP tools it is possible to increase the quality of texts. The current proposal consists in the employment of NLP tools, particularly neural language models, for processing the output of different OCR mechanisms. Important improvements in the quality of the texts are expected, as this has been the case in many related tasks. The ultimate purpose of this work is the use of the resulting digitized texts in information retrieval (IR) and information extraction (IE) platforms.
Sociedad Argentina de Informática
Materia
Ciencias Informáticas
OCR post-processing
Neural language models
Information retrieval.
Nivel de accesibilidad
acceso abierto
Condiciones de uso
http://creativecommons.org/licenses/by-nc-sa/3.0/
Repositorio
SEDICI (UNLP)
Institución
Universidad Nacional de La Plata
OAI Identificador
oai:sedici.unlp.edu.ar:10915/116420

id SEDICI_dd58d3867d0166052eedcaf819257744
oai_identifier_str oai:sedici.unlp.edu.ar:10915/116420
network_acronym_str SEDICI
repository_id_str 1329
network_name_str SEDICI (UNLP)
spelling Language modeling tools for massive historical OCR post-processingXamena, EduardoMaguitman, Ana GabrielaCiencias InformáticasOCR post-processingNeural language modelsInformation retrieval.Upon these days, there is a large number of available historical documentary collections that have not been exploited to extract information. Many efforts are being made to digitize these volumes and make them available for digital platforms. However, various obstacles appear in the task of processing their content. Due to the deterioration of documents and other factors such as the different dialects and language variants, the quality of the digitizations is usually low. By means of NLP tools it is possible to increase the quality of texts. The current proposal consists in the employment of NLP tools, particularly neural language models, for processing the output of different OCR mechanisms. Important improvements in the quality of the texts are expected, as this has been the case in many related tasks. The ultimate purpose of this work is the use of the resulting digitized texts in information retrieval (IR) and information extraction (IE) platforms.Sociedad Argentina de Informática2020-10info:eu-repo/semantics/conferenceObjectinfo:eu-repo/semantics/publishedVersionObjeto de conferenciahttp://purl.org/coar/resource_type/c_5794info:ar-repo/semantics/documentoDeConferenciaapplication/pdf125-128http://sedici.unlp.edu.ar/handle/10915/116420enginfo:eu-repo/semantics/altIdentifier/url/http://49jaiio.sadio.org.ar/pdfs/agranda/AGRANDA-15.pdfinfo:eu-repo/semantics/altIdentifier/issn/2683-8966info:eu-repo/semantics/openAccesshttp://creativecommons.org/licenses/by-nc-sa/3.0/Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)reponame:SEDICI (UNLP)instname:Universidad Nacional de La Platainstacron:UNLP2025-09-29T11:27:14Zoai:sedici.unlp.edu.ar:10915/116420Institucionalhttp://sedici.unlp.edu.ar/Universidad públicaNo correspondehttp://sedici.unlp.edu.ar/oai/snrdalira@sedici.unlp.edu.arArgentinaNo correspondeNo correspondeNo correspondeopendoar:13292025-09-29 11:27:14.676SEDICI (UNLP) - Universidad Nacional de La Platafalse
dc.title.none.fl_str_mv Language modeling tools for massive historical OCR post-processing
title Language modeling tools for massive historical OCR post-processing
spellingShingle Language modeling tools for massive historical OCR post-processing
Xamena, Eduardo
Ciencias Informáticas
OCR post-processing
Neural language models
Information retrieval.
title_short Language modeling tools for massive historical OCR post-processing
title_full Language modeling tools for massive historical OCR post-processing
title_fullStr Language modeling tools for massive historical OCR post-processing
title_full_unstemmed Language modeling tools for massive historical OCR post-processing
title_sort Language modeling tools for massive historical OCR post-processing
dc.creator.none.fl_str_mv Xamena, Eduardo
Maguitman, Ana Gabriela
author Xamena, Eduardo
author_facet Xamena, Eduardo
Maguitman, Ana Gabriela
author_role author
author2 Maguitman, Ana Gabriela
author2_role author
dc.subject.none.fl_str_mv Ciencias Informáticas
OCR post-processing
Neural language models
Information retrieval.
topic Ciencias Informáticas
OCR post-processing
Neural language models
Information retrieval.
dc.description.none.fl_txt_mv Upon these days, there is a large number of available historical documentary collections that have not been exploited to extract information. Many efforts are being made to digitize these volumes and make them available for digital platforms. However, various obstacles appear in the task of processing their content. Due to the deterioration of documents and other factors such as the different dialects and language variants, the quality of the digitizations is usually low. By means of NLP tools it is possible to increase the quality of texts. The current proposal consists in the employment of NLP tools, particularly neural language models, for processing the output of different OCR mechanisms. Important improvements in the quality of the texts are expected, as this has been the case in many related tasks. The ultimate purpose of this work is the use of the resulting digitized texts in information retrieval (IR) and information extraction (IE) platforms.
Sociedad Argentina de Informática
description Upon these days, there is a large number of available historical documentary collections that have not been exploited to extract information. Many efforts are being made to digitize these volumes and make them available for digital platforms. However, various obstacles appear in the task of processing their content. Due to the deterioration of documents and other factors such as the different dialects and language variants, the quality of the digitizations is usually low. By means of NLP tools it is possible to increase the quality of texts. The current proposal consists in the employment of NLP tools, particularly neural language models, for processing the output of different OCR mechanisms. Important improvements in the quality of the texts are expected, as this has been the case in many related tasks. The ultimate purpose of this work is the use of the resulting digitized texts in information retrieval (IR) and information extraction (IE) platforms.
publishDate 2020
dc.date.none.fl_str_mv 2020-10
dc.type.none.fl_str_mv info:eu-repo/semantics/conferenceObject
info:eu-repo/semantics/publishedVersion
Objeto de conferencia
http://purl.org/coar/resource_type/c_5794
info:ar-repo/semantics/documentoDeConferencia
format conferenceObject
status_str publishedVersion
dc.identifier.none.fl_str_mv http://sedici.unlp.edu.ar/handle/10915/116420
url http://sedici.unlp.edu.ar/handle/10915/116420
dc.language.none.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv info:eu-repo/semantics/altIdentifier/url/http://49jaiio.sadio.org.ar/pdfs/agranda/AGRANDA-15.pdf
info:eu-repo/semantics/altIdentifier/issn/2683-8966
dc.rights.none.fl_str_mv info:eu-repo/semantics/openAccess
http://creativecommons.org/licenses/by-nc-sa/3.0/
Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
eu_rights_str_mv openAccess
rights_invalid_str_mv http://creativecommons.org/licenses/by-nc-sa/3.0/
Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0)
dc.format.none.fl_str_mv application/pdf
125-128
dc.source.none.fl_str_mv reponame:SEDICI (UNLP)
instname:Universidad Nacional de La Plata
instacron:UNLP
reponame_str SEDICI (UNLP)
collection SEDICI (UNLP)
instname_str Universidad Nacional de La Plata
instacron_str UNLP
institution UNLP
repository.name.fl_str_mv SEDICI (UNLP) - Universidad Nacional de La Plata
repository.mail.fl_str_mv alira@sedici.unlp.edu.ar
_version_ 1844616150181216256
score 13.070432