Language modeling tools for massive historical OCR post-processing
- Authors
- Xamena, Eduardo; Maguitman, Ana Gabriela
- Year of publication
- 2020
- Language
- English
- Resource type
- conference paper
- Status
- published version
- Description
- Nowadays, a large number of historical documentary collections are available but have not yet been exploited for information extraction. Many efforts are under way to digitize these volumes and make them available on digital platforms. However, several obstacles arise when processing their content. Because of document deterioration and other factors, such as differing dialects and language variants, the quality of the digitizations is usually low. NLP tools can improve the quality of these texts. This proposal applies NLP tools, particularly neural language models, to process the output of different OCR mechanisms. Significant improvements in text quality are expected, as has been the case in many related tasks. The ultimate purpose of this work is to use the resulting digitized texts in information retrieval (IR) and information extraction (IE) platforms.
Sociedad Argentina de Informática
- Subject
- Computer Science
OCR post-processing
Neural language models
Information retrieval
- Access level
- open access
- Terms of use
- http://creativecommons.org/licenses/by-nc-sa/3.0/
- Repository
- Institution
- Universidad Nacional de La Plata
- OAI identifier
- oai:sedici.unlp.edu.ar:10915/116420
id |
SEDICI_dd58d3867d0166052eedcaf819257744 |
---|---|
network_acronym_str |
SEDICI |
repository_id_str |
1329 |
network_name_str |
SEDICI (UNLP) |
dc.title.none.fl_str_mv |
Language modeling tools for massive historical OCR post-processing |
dc.creator.none.fl_str_mv |
Xamena, Eduardo; Maguitman, Ana Gabriela |
dc.subject.none.fl_str_mv |
Computer Science; OCR post-processing; Neural language models; Information retrieval |
publishDate |
2020 |
dc.date.none.fl_str_mv |
2020-10 |
dc.type.none.fl_str_mv |
info:eu-repo/semantics/conferenceObject info:eu-repo/semantics/publishedVersion Conference object http://purl.org/coar/resource_type/c_5794 info:ar-repo/semantics/documentoDeConferencia |
format |
conferenceObject |
status_str |
publishedVersion |
dc.identifier.none.fl_str_mv |
http://sedici.unlp.edu.ar/handle/10915/116420 |
dc.language.none.fl_str_mv |
eng |
dc.relation.none.fl_str_mv |
info:eu-repo/semantics/altIdentifier/url/http://49jaiio.sadio.org.ar/pdfs/agranda/AGRANDA-15.pdf info:eu-repo/semantics/altIdentifier/issn/2683-8966 |
dc.rights.none.fl_str_mv |
info:eu-repo/semantics/openAccess http://creativecommons.org/licenses/by-nc-sa/3.0/ Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported (CC BY-NC-SA 3.0) |
dc.format.none.fl_str_mv |
application/pdf 125-128 |
dc.source.none.fl_str_mv |
reponame:SEDICI (UNLP) instname:Universidad Nacional de La Plata instacron:UNLP |
repository.name.fl_str_mv |
SEDICI (UNLP) - Universidad Nacional de La Plata |
repository.mail.fl_str_mv |
alira@sedici.unlp.edu.ar |