Digital surveillance in Latin American diseases outbreaks: information extraction from a novel Spanish corpus

Authors
Dellanzo, Antonella; Cotik, Viviana Erica; Lozano Barriga, Daniel Yunior; Mollapaza Apaza, Jonathan Jimmy; Palomino, Daniel; Schiaffino, Fernando; Yanque Aliaga, Alexander; Ochoa Luna, José
Publication year
2022
Language
English
Resource type
article
Status
published version
Description
Background: In order to detect threats to public health and to be well-prepared for endemic and pandemic illness outbreaks, countries usually rely on event-based surveillance (EBS) and indicator-based surveillance systems. Event-based surveillance systems are key components of early warning systems and focus on fast capturing of data to detect threat signals through channels other than traditional surveillance. In this study, we develop Natural Language Processing tools that can be used within EBS systems. In particular, we focus on information extraction techniques that enable digital surveillance to monitor Internet data and social media. Results: We created an annotated Spanish corpus from ProMED-mail health reports regarding disease outbreaks in Latin America. The corpus has been used to train algorithms for two information extraction tasks: named entity recognition and relation extraction. The algorithms, based on deep learning and rules, have been applied to recognize diseases, hosts, and geographical locations where a disease is occurring, among other entities and relations. In addition, an in-depth analysis of micro-average F1 metrics shows the suitability of our approaches for both tasks. Conclusions: The annotated corpus and algorithms presented could leverage the development of automated tools for extracting information from news and health reports written in Spanish. Moreover, this framework could be useful within EBS systems to support the early detection of Latin American disease outbreaks.
Affiliation: Dellanzo, Antonella. Universidad de Buenos Aires. Facultad de Ciencias Exactas y Naturales. Departamento de Computación; Argentina
Affiliation: Cotik, Viviana Erica. Universidad de Buenos Aires. Facultad de Ciencias Exactas y Naturales. Departamento de Computación; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Ciudad Universitaria. Instituto de Investigación en Ciencias de la Computación. Universidad de Buenos Aires. Facultad de Ciencias Exactas y Naturales. Instituto de Investigación en Ciencias de la Computación; Argentina
Affiliation: Lozano Barriga, Daniel Yunior. Universidad Católica San Pablo; Peru
Affiliation: Mollapaza Apaza, Jonathan Jimmy. Universidad Católica San Pablo; Peru
Affiliation: Palomino, Daniel. Universidad Católica San Pablo; Peru
Affiliation: Schiaffino, Fernando. Universidad de Buenos Aires. Facultad de Filosofía y Letras; Argentina
Affiliation: Yanque Aliaga, Alexander. Universidad Católica San Pablo; Peru
Affiliation: Ochoa Luna, José. Universidad Católica San Pablo; Peru
Subject
DIGITAL SURVEILLANCE
DISEASES OUTBREAKS
EVENT-BASED SURVEILLANCE
NAMED ENTITY RECOGNITION
PROMED-MAIL
RELATION EXTRACTION
SPANISH CORPUS
Access level
open access
Terms of use
https://creativecommons.org/licenses/by/2.5/ar/
Repository
CONICET Digital (CONICET)
Institution
Consejo Nacional de Investigaciones Científicas y Técnicas
OAI identifier
oai:ri.conicet.gov.ar:11336/217703

id CONICETDig_ad2c246ef8833e857325fcffbbe376d9
oai_identifier_str oai:ri.conicet.gov.ar:11336/217703
network_acronym_str CONICETDig
repository_id_str 3498
network_name_str CONICET Digital (CONICET)
dc.title.none.fl_str_mv Digital surveillance in Latin American diseases outbreaks: information extraction from a novel Spanish corpus
dc.creator.none.fl_str_mv Dellanzo, Antonella
Cotik, Viviana Erica
Lozano Barriga, Daniel Yunior
Mollapaza Apaza, Jonathan Jimmy
Palomino, Daniel
Schiaffino, Fernando
Yanque Aliaga, Alexander
Ochoa Luna, José
dc.subject.none.fl_str_mv DIGITAL SURVEILLANCE
DISEASES OUTBREAKS
EVENT-BASED SURVEILLANCE
NAMED ENTITY RECOGNITION
PROMED-MAIL
RELATION EXTRACTION
SPANISH CORPUS
purl_subject.fl_str_mv https://purl.org/becyt/ford/1.2
https://purl.org/becyt/ford/1
publishDate 2022
dc.date.none.fl_str_mv 2022-12
dc.type.none.fl_str_mv info:eu-repo/semantics/article
info:eu-repo/semantics/publishedVersion
http://purl.org/coar/resource_type/c_6501
info:ar-repo/semantics/articulo
format article
status_str publishedVersion
dc.identifier.none.fl_str_mv http://hdl.handle.net/11336/217703
Dellanzo, Antonella; Cotik, Viviana Erica; Lozano Barriga, Daniel Yunior; Mollapaza Apaza, Jonathan Jimmy; Palomino, Daniel; et al.; Digital surveillance in Latin American diseases outbreaks: information extraction from a novel Spanish corpus; BioMed Central; BMC Bioinformatics; 23; 1; 12-2022; 1-22
1471-2105
CONICET Digital
CONICET
url http://hdl.handle.net/11336/217703
dc.language.none.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv info:eu-repo/semantics/altIdentifier/doi/10.1186/s12859-022-05094-y
dc.rights.none.fl_str_mv info:eu-repo/semantics/openAccess
https://creativecommons.org/licenses/by/2.5/ar/
eu_rights_str_mv openAccess
rights_invalid_str_mv https://creativecommons.org/licenses/by/2.5/ar/
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv BioMed Central
publisher.none.fl_str_mv BioMed Central
dc.source.none.fl_str_mv reponame:CONICET Digital (CONICET)
instname:Consejo Nacional de Investigaciones Científicas y Técnicas
reponame_str CONICET Digital (CONICET)
collection CONICET Digital (CONICET)
instname_str Consejo Nacional de Investigaciones Científicas y Técnicas
repository.name.fl_str_mv CONICET Digital (CONICET) - Consejo Nacional de Investigaciones Científicas y Técnicas
repository.mail.fl_str_mv dasensio@conicet.gov.ar; lcarlino@conicet.gov.ar