A labeled medical records corpus for the timely detection of rare diseases using machine learning approaches

Authors
Rolando, Matias; Raggio, Victor; Naya, Hugo; Spangenberg, Lucia; Cagnina, Leticia Cecilia
Publication year
2025
Language
English
Resource type
article
Status
published version
Description
Rare diseases (RDs) are a group of pathologies that individually affect fewer than 1 in 2,000 people but collectively affect around 7% of the world’s population. Most of them affect children, are chronic and progressive, and have no specific treatment. RD patients face diagnostic challenges: an average time to diagnosis of 5 years, multiple specialist visits, and invasive procedures. This ‘diagnostic odyssey’ can be detrimental to their health. Machine learning (ML) has the potential to improve healthcare by providing more personalized and accurate patient management, diagnoses and, in some cases, treatments. Leveraging the MIMIC-III database and additional medical notes from other sources, such as in-house data, PubMed, and ChatGPT, we propose a labeled dataset for early RD detection in hospital settings. We validated the proposed resource by applying various supervised ML methods, including logistic regression, decision trees, support vector machines (SVM), deep learning methods (LSTM and CNN), and Transformers (BERT), achieving a 92.7% F-measure and a 96% AUC using SVM. These findings highlight the potential of ML to redirect RD patients towards more accurate diagnostic pathways and present a corpus that can be used for future development and refinement.
Affiliation: Rolando, Matias. Instituto Pasteur de Montevideo; Uruguay
Affiliation: Raggio, Victor. Universidad de la República; Uruguay
Affiliation: Naya, Hugo. Universidad de la República; Uruguay. Instituto Pasteur de Montevideo; Uruguay
Affiliation: Spangenberg, Lucia. Universidad de la República; Uruguay. Instituto Pasteur de Montevideo; Uruguay
Affiliation: Cagnina, Leticia Cecilia. Universidad Nacional de San Luis. Facultad de Ciencias Físico Matemáticas y Naturales. Departamento de Informática. Laboratorio Investigación y Desarrollo en Inteligencia Computacional; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - San Luis; Argentina
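The description reports results as an F-measure and an AUC. As a reading aid, the following is a minimal sketch of how these two metrics are commonly computed for a binary classifier. The labels, scores, 0.5 decision threshold, and function names are invented for illustration; they are assumptions, not data or code from the paper.

```python
# Illustrative only: toy labels and scores, NOT data from the corpus.

def f_measure(y_true, y_pred):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true, scores):
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs in which the positive example receives
    the higher score; ties count as half a win."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical example: 1 = rare-disease note, 0 = other note.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.2, 0.7]  # classifier confidence
y_pred = [1 if s >= 0.5 else 0 for s in scores]    # assumed 0.5 threshold

print("F-measure:", f_measure(y_true, y_pred))
print("AUC:", roc_auc(y_true, scores))
```

The F-measure depends on the chosen decision threshold, whereas the AUC is computed from the ranking of the scores alone, which is one reason the two are often reported together.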
Subject
RARE DISEASES
MACHINE LEARNING
ARTIFICIAL CORPUS
Access level
open access
Terms of use
https://creativecommons.org/licenses/by-nc-nd/2.5/ar/
Repository
CONICET Digital (CONICET)
Institution
Consejo Nacional de Investigaciones Científicas y Técnicas
OAI identifier
oai:ri.conicet.gov.ar:11336/273229

id CONICETDig_8e1f06dfaba8bf97be61fd27e88a7981
oai_identifier_str oai:ri.conicet.gov.ar:11336/273229
network_acronym_str CONICETDig
repository_id_str 3498
network_name_str CONICET Digital (CONICET)
dc.title.none.fl_str_mv A labeled medical records corpus for the timely detection of rare diseases using machine learning approaches
dc.creator.none.fl_str_mv Rolando, Matias
Raggio, Victor
Naya, Hugo
Spangenberg, Lucia
Cagnina, Leticia Cecilia
dc.subject.none.fl_str_mv RARE DISEASES
MACHINE LEARNING
ARTIFICIAL CORPUS
purl_subject.fl_str_mv https://purl.org/becyt/ford/1.2
https://purl.org/becyt/ford/1
publishDate 2025
dc.date.none.fl_str_mv 2025-02
dc.type.none.fl_str_mv info:eu-repo/semantics/article
info:eu-repo/semantics/publishedVersion
http://purl.org/coar/resource_type/c_6501
info:ar-repo/semantics/articulo
format article
status_str publishedVersion
dc.identifier.none.fl_str_mv http://hdl.handle.net/11336/273229
Rolando, Matias; Raggio, Victor; Naya, Hugo; Spangenberg, Lucia; Cagnina, Leticia Cecilia; A labeled medical records corpus for the timely detection of rare diseases using machine learning approaches; Nature; Scientific Reports; 15; 1; 2-2025; 1-10
2045-2322
CONICET Digital
CONICET
url http://hdl.handle.net/11336/273229
dc.language.none.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv info:eu-repo/semantics/altIdentifier/url/https://www.nature.com/articles/s41598-025-90450-0
info:eu-repo/semantics/altIdentifier/doi/10.1038/s41598-025-90450-0
dc.rights.none.fl_str_mv info:eu-repo/semantics/openAccess
https://creativecommons.org/licenses/by-nc-nd/2.5/ar/
eu_rights_str_mv openAccess
rights_invalid_str_mv https://creativecommons.org/licenses/by-nc-nd/2.5/ar/
dc.format.none.fl_str_mv application/pdf
application/pdf
dc.publisher.none.fl_str_mv Nature
publisher.none.fl_str_mv Nature
dc.source.none.fl_str_mv reponame:CONICET Digital (CONICET)
instname:Consejo Nacional de Investigaciones Científicas y Técnicas
reponame_str CONICET Digital (CONICET)
collection CONICET Digital (CONICET)
instname_str Consejo Nacional de Investigaciones Científicas y Técnicas
repository.name.fl_str_mv CONICET Digital (CONICET) - Consejo Nacional de Investigaciones Científicas y Técnicas
repository.mail.fl_str_mv dasensio@conicet.gov.ar; lcarlino@conicet.gov.ar