A labeled medical records corpus for the timely detection of rare diseases using machine learning approaches
- Authors
- Rolando, Matias; Raggio, Victor; Naya, Hugo; Spangenberg, Lucia; Cagnina, Leticia Cecilia
- Publication year
- 2025
- Language
- English
- Resource type
- article
- Status
- published version
- Description
- Rare diseases (RDs) are a group of pathologies that individually affect fewer than 1 in 2000 people but collectively impact around 7% of the world’s population. Most of them affect children, are chronic and progressive, and have no specific treatment. RD patients face diagnostic challenges, with an average diagnosis time of 5 years, multiple specialist visits, and invasive procedures. This ‘diagnostic odyssey’ can be detrimental to their health. Machine learning (ML) has the potential to improve healthcare by providing more personalized and accurate patient management, diagnoses, and, in some cases, treatments. Leveraging the MIMIC-III database and additional medical notes from different sources such as in-house data, PubMed, and ChatGPT, we propose a labeled dataset for early RD detection in hospital settings. Applying various supervised ML methods, including logistic regression, decision trees, support vector machines (SVM), deep learning methods (LSTM and CNN), and Transformers (BERT), we validated the use of the proposed resource, achieving a 92.7% F-measure and a 96% AUC using SVM. These findings highlight the potential of ML in redirecting RD patients towards more accurate diagnostic pathways and present a corpus that can be used for future development and refinements.
Affiliation: Rolando, Matias. Instituto Pasteur de Montevideo; Uruguay
Affiliation: Raggio, Victor. Universidad de la República; Uruguay
Affiliation: Naya, Hugo. Universidad de la República; Uruguay. Instituto Pasteur de Montevideo; Uruguay
Affiliation: Spangenberg, Lucia. Universidad de la República; Uruguay. Instituto Pasteur de Montevideo; Uruguay
Affiliation: Cagnina, Leticia Cecilia. Universidad Nacional de San Luis. Facultad de Ciencias Físico Matemáticas y Naturales. Departamento de Informática. Laboratorio Investigación y Desarrollo en Inteligencia Computacional; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - San Luis; Argentina
- Subject
- RARE DISEASES; MACHINE LEARNING; ARTIFICIAL CORPUS
- Accessibility level
- open access
- Terms of use
- https://creativecommons.org/licenses/by-nc-nd/2.5/ar/
- Repository
- CONICET Digital (CONICET)
- Institution
- Consejo Nacional de Investigaciones Científicas y Técnicas
- OAI identifier
- oai:ri.conicet.gov.ar:11336/273229
Full record metadata:
| id | CONICETDig_8e1f06dfaba8bf97be61fd27e88a7981 |
|---|---|
| oai_identifier_str | oai:ri.conicet.gov.ar:11336/273229 |
| network_acronym_str | CONICETDig |
| repository_id_str | 3498 |
| network_name_str | CONICET Digital (CONICET) |
| dc.title.none.fl_str_mv | A labeled medical records corpus for the timely detection of rare diseases using machine learning approaches |
| title | A labeled medical records corpus for the timely detection of rare diseases using machine learning approaches |
| spellingShingle | A labeled medical records corpus for the timely detection of rare diseases using machine learning approaches; Rolando, Matias; RARE DISEASES; MACHINE LEARNING; ARTIFICIAL CORPUS |
| title_short | A labeled medical records corpus for the timely detection of rare diseases using machine learning approaches |
| title_full | A labeled medical records corpus for the timely detection of rare diseases using machine learning approaches |
| title_fullStr | A labeled medical records corpus for the timely detection of rare diseases using machine learning approaches |
| title_full_unstemmed | A labeled medical records corpus for the timely detection of rare diseases using machine learning approaches |
| title_sort | A labeled medical records corpus for the timely detection of rare diseases using machine learning approaches |
| dc.creator.none.fl_str_mv | Rolando, Matias; Raggio, Victor; Naya, Hugo; Spangenberg, Lucia; Cagnina, Leticia Cecilia |
| author | Rolando, Matias |
| author_facet | Rolando, Matias; Raggio, Victor; Naya, Hugo; Spangenberg, Lucia; Cagnina, Leticia Cecilia |
| author_role | author |
| author2 | Raggio, Victor; Naya, Hugo; Spangenberg, Lucia; Cagnina, Leticia Cecilia |
| author2_role | author author author author |
| dc.subject.none.fl_str_mv | RARE DISEASES; MACHINE LEARNING; ARTIFICIAL CORPUS |
| topic | RARE DISEASES; MACHINE LEARNING; ARTIFICIAL CORPUS |
| purl_subject.fl_str_mv | https://purl.org/becyt/ford/1.2 https://purl.org/becyt/ford/1 |
| dc.description.none.fl_txt_mv | Rare diseases (RDs) are a group of pathologies that individually affect fewer than 1 in 2000 people but collectively impact around 7% of the world’s population. Most of them affect children, are chronic and progressive, and have no specific treatment. RD patients face diagnostic challenges, with an average diagnosis time of 5 years, multiple specialist visits, and invasive procedures. This ‘diagnostic odyssey’ can be detrimental to their health. Machine learning (ML) has the potential to improve healthcare by providing more personalized and accurate patient management, diagnoses, and, in some cases, treatments. Leveraging the MIMIC-III database and additional medical notes from different sources such as in-house data, PubMed, and ChatGPT, we propose a labeled dataset for early RD detection in hospital settings. Applying various supervised ML methods, including logistic regression, decision trees, support vector machines (SVM), deep learning methods (LSTM and CNN), and Transformers (BERT), we validated the use of the proposed resource, achieving a 92.7% F-measure and a 96% AUC using SVM. These findings highlight the potential of ML in redirecting RD patients towards more accurate diagnostic pathways and present a corpus that can be used for future development and refinements. Affiliation: Rolando, Matias. Instituto Pasteur de Montevideo; Uruguay. Affiliation: Raggio, Victor. Universidad de la República; Uruguay. Affiliation: Naya, Hugo. Universidad de la República; Uruguay. Instituto Pasteur de Montevideo; Uruguay. Affiliation: Spangenberg, Lucia. Universidad de la República; Uruguay. Instituto Pasteur de Montevideo; Uruguay. Affiliation: Cagnina, Leticia Cecilia. Universidad Nacional de San Luis. Facultad de Ciencias Físico Matemáticas y Naturales. Departamento de Informática. Laboratorio Investigación y Desarrollo en Inteligencia Computacional; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - San Luis; Argentina |
| description | Rare diseases (RDs) are a group of pathologies that individually affect fewer than 1 in 2000 people but collectively impact around 7% of the world’s population. Most of them affect children, are chronic and progressive, and have no specific treatment. RD patients face diagnostic challenges, with an average diagnosis time of 5 years, multiple specialist visits, and invasive procedures. This ‘diagnostic odyssey’ can be detrimental to their health. Machine learning (ML) has the potential to improve healthcare by providing more personalized and accurate patient management, diagnoses, and, in some cases, treatments. Leveraging the MIMIC-III database and additional medical notes from different sources such as in-house data, PubMed, and ChatGPT, we propose a labeled dataset for early RD detection in hospital settings. Applying various supervised ML methods, including logistic regression, decision trees, support vector machines (SVM), deep learning methods (LSTM and CNN), and Transformers (BERT), we validated the use of the proposed resource, achieving a 92.7% F-measure and a 96% AUC using SVM. These findings highlight the potential of ML in redirecting RD patients towards more accurate diagnostic pathways and present a corpus that can be used for future development and refinements. |
| publishDate | 2025 |
| dc.date.none.fl_str_mv | 2025-02 |
| dc.type.none.fl_str_mv | info:eu-repo/semantics/article info:eu-repo/semantics/publishedVersion http://purl.org/coar/resource_type/c_6501 info:ar-repo/semantics/articulo |
| format | article |
| status_str | publishedVersion |
| dc.identifier.none.fl_str_mv | http://hdl.handle.net/11336/273229 Rolando, Matias; Raggio, Victor; Naya, Hugo; Spangenberg, Lucia; Cagnina, Leticia Cecilia; A labeled medical records corpus for the timely detection of rare diseases using machine learning approaches; Nature; Scientific Reports; 15; 1; 2-2025; 1-10 2045-2322 CONICET Digital CONICET |
| url | http://hdl.handle.net/11336/273229 |
| identifier_str_mv | Rolando, Matias; Raggio, Victor; Naya, Hugo; Spangenberg, Lucia; Cagnina, Leticia Cecilia; A labeled medical records corpus for the timely detection of rare diseases using machine learning approaches; Nature; Scientific Reports; 15; 1; 2-2025; 1-10 2045-2322 CONICET Digital CONICET |
| dc.language.none.fl_str_mv | eng |
| language | eng |
| dc.relation.none.fl_str_mv | info:eu-repo/semantics/altIdentifier/url/https://www.nature.com/articles/s41598-025-90450-0 info:eu-repo/semantics/altIdentifier/doi/10.1038/s41598-025-90450-0 |
| dc.rights.none.fl_str_mv | info:eu-repo/semantics/openAccess https://creativecommons.org/licenses/by-nc-nd/2.5/ar/ |
| eu_rights_str_mv | openAccess |
| rights_invalid_str_mv | https://creativecommons.org/licenses/by-nc-nd/2.5/ar/ |
| dc.format.none.fl_str_mv | application/pdf application/pdf |
| dc.publisher.none.fl_str_mv | Nature |
| publisher.none.fl_str_mv | Nature |
| dc.source.none.fl_str_mv | reponame:CONICET Digital (CONICET) instname:Consejo Nacional de Investigaciones Científicas y Técnicas |
| reponame_str | CONICET Digital (CONICET) |
| collection | CONICET Digital (CONICET) |
| instname_str | Consejo Nacional de Investigaciones Científicas y Técnicas |
| repository.name.fl_str_mv | CONICET Digital (CONICET) - Consejo Nacional de Investigaciones Científicas y Técnicas |
| repository.mail.fl_str_mv | dasensio@conicet.gov.ar; lcarlino@conicet.gov.ar |
| _version_ | 1847426980459315200 |
| score | 13.10058 |
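
The description reports classifier quality as an F-measure and an AUC. As a reading aid, the following is a minimal sketch of how those two metrics are commonly computed for a binary task (here, label 1 standing for a rare-disease note). It is not the authors' evaluation code: the function names `f_measure` and `roc_auc`, the 0.5 decision threshold, and the toy labels and scores are all assumptions made up for illustration and do not come from the paper or its corpus.

```python
from typing import Sequence


def f_measure(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 score: harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined, so report 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data (NOT from the paper): 1 = rare-disease note, 0 = other note.
labels = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.2, 0.7]  # made-up classifier scores
preds = [1 if s >= 0.5 else 0 for s in scores]     # assumed 0.5 decision threshold

print("F-measure:", f_measure(labels, preds))
print("AUC:", roc_auc(labels, scores))
```

The pairwise AUC above is quadratic in the number of examples, which is fine for a small illustration; a rank-based computation would be the usual choice for a full corpus.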