Digital surveillance in Latin American diseases outbreaks: information extraction from a novel Spanish corpus
- Authors
- Dellanzo, Antonella; Cotik, Viviana Erica; Lozano Barriga, Daniel Yunior; Mollapaza Apaza, Jonathan Jimmy; Palomino, Daniel; Schiaffino, Fernando; Yanque Aliaga, Alexander; Ochoa Luna, José
- Publication year
- 2022
- Language
- English
- Resource type
- article
- Status
- published version
- Description
- Background: In order to detect threats to public health and to be well-prepared for endemic and pandemic illness outbreaks, countries usually rely on event-based surveillance (EBS) and indicator-based surveillance systems. Event-based surveillance systems are key components of early warning systems and focus on fast capturing of data to detect threat signals through channels other than traditional surveillance. In this study, we develop Natural Language Processing tools that can be used within EBS systems. In particular, we focus on information extraction techniques that enable digital surveillance to monitor Internet data and social media. Results: We created an annotated Spanish corpus from ProMED-mail health reports regarding disease outbreaks in Latin America. The corpus has been used to train algorithms for two information extraction tasks: named entity recognition and relation extraction. The algorithms, based on deep learning and rules, have been applied to recognize diseases, hosts, and geographical locations where a disease is occurring, among other entities and relations. In addition, an in-depth analysis of micro-average F1 metrics shows the suitability of our approaches for both tasks. Conclusions: The annotated corpus and algorithms presented could leverage the development of automated tools for extracting information from news and health reports written in Spanish. Moreover, this framework could be useful within EBS systems to support the early detection of Latin American disease outbreaks.
Affiliation: Dellanzo, Antonella. Universidad de Buenos Aires. Facultad de Ciencias Exactas y Naturales. Departamento de Computación; Argentina
Affiliation: Cotik, Viviana Erica. Universidad de Buenos Aires. Facultad de Ciencias Exactas y Naturales. Departamento de Computación; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Ciudad Universitaria. Instituto de Investigación en Ciencias de la Computación. Universidad de Buenos Aires. Facultad de Ciencias Exactas y Naturales. Instituto de Investigación en Ciencias de la Computación; Argentina
Affiliation: Lozano Barriga, Daniel Yunior. Universidad Católica San Pablo; Perú
Affiliation: Mollapaza Apaza, Jonathan Jimmy. Universidad Católica San Pablo; Perú
Affiliation: Palomino, Daniel. Universidad Católica San Pablo; Perú
Affiliation: Schiaffino, Fernando. Universidad de Buenos Aires. Facultad de Filosofía y Letras; Argentina
Affiliation: Yanque Aliaga, Alexander. Universidad Católica San Pablo; Perú
Affiliation: Ochoa Luna, José. Universidad Católica San Pablo; Perú
- Subject
- DIGITAL SURVEILLANCE
- DISEASES OUTBREAKS
- EVENT-BASED SURVEILLANCE
- NAMED ENTITY RECOGNITION
- PROMED-MAIL
- RELATION EXTRACTION
- SPANISH CORPUS
- Access level
- open access
- Terms of use
- https://creativecommons.org/licenses/by/2.5/ar/
- Repository
- Institution
- Consejo Nacional de Investigaciones Científicas y Técnicas
- OAI identifier
- oai:ri.conicet.gov.ar:11336/217703
Full metadata record
id | CONICETDig_ad2c246ef8833e857325fcffbbe376d9 |
---|---|
oai_identifier_str | oai:ri.conicet.gov.ar:11336/217703 |
network_acronym_str | CONICETDig |
repository_id_str | 3498 |
network_name_str | CONICET Digital (CONICET) |
dc.title.none.fl_str_mv | Digital surveillance in Latin American diseases outbreaks: information extraction from a novel Spanish corpus |
dc.creator.none.fl_str_mv | Dellanzo, Antonella; Cotik, Viviana Erica; Lozano Barriga, Daniel Yunior; Mollapaza Apaza, Jonathan Jimmy; Palomino, Daniel; Schiaffino, Fernando; Yanque Aliaga, Alexander; Ochoa Luna, José |
dc.subject.none.fl_str_mv | DIGITAL SURVEILLANCE; DISEASES OUTBREAKS; EVENT-BASED SURVEILLANCE; NAMED ENTITY RECOGNITION; PROMED-MAIL; RELATION EXTRACTION; SPANISH CORPUS |
purl_subject.fl_str_mv | https://purl.org/becyt/ford/1.2 https://purl.org/becyt/ford/1 |
publishDate | 2022 |
dc.date.none.fl_str_mv | 2022-12 |
dc.type.none.fl_str_mv | info:eu-repo/semantics/article info:eu-repo/semantics/publishedVersion http://purl.org/coar/resource_type/c_6501 info:ar-repo/semantics/articulo |
format | article |
status_str | publishedVersion |
dc.identifier.none.fl_str_mv | http://hdl.handle.net/11336/217703 Dellanzo, Antonella; Cotik, Viviana Erica; Lozano Barriga, Daniel Yunior; Mollapaza Apaza, Jonathan Jimmy; Palomino, Daniel; et al.; Digital surveillance in Latin American diseases outbreaks: information extraction from a novel Spanish corpus; BioMed Central; BMC Bioinformatics; 23; 1; 12-2022; 1-22 1471-2105 CONICET Digital CONICET |
dc.language.none.fl_str_mv | eng |
dc.relation.none.fl_str_mv | info:eu-repo/semantics/altIdentifier/doi/10.1186/s12859-022-05094-y |
dc.rights.none.fl_str_mv | info:eu-repo/semantics/openAccess https://creativecommons.org/licenses/by/2.5/ar/ |
dc.format.none.fl_str_mv | application/pdf |
dc.publisher.none.fl_str_mv | BioMed Central |
dc.source.none.fl_str_mv | reponame:CONICET Digital (CONICET) instname:Consejo Nacional de Investigaciones Científicas y Técnicas |
repository.name.fl_str_mv | CONICET Digital (CONICET) - Consejo Nacional de Investigaciones Científicas y Técnicas |
repository.mail.fl_str_mv | dasensio@conicet.gov.ar; lcarlino@conicet.gov.ar |