Dataset of clinical cases, images, image labels and captions from open access case reports from PubMed Central (1990–2023)

Authors
Nievas Offidani, Mauro Andrés; Delrieux, Claudio Augusto
Publication year
2023
Language
English
Resource type
article
Status
published version
Description
This paper details the acquisition, structure and preprocessing of the MultiCaRe Dataset, a multimodal case report dataset which contains data from 75,382 open access PubMed Central articles spanning the period from 1990 to 2023. The dataset includes 96,428 clinical cases, 135,596 images, and their corresponding labels and captions. Data extraction was performed using different APIs and packages such as Biopython, requests, BeautifulSoup, the BioC API for PMC and the Europe PMC RESTful API. Image labels were created based on the contents of their corresponding captions, using Spark NLP for Healthcare and manual annotations. Images were preprocessed with OpenCV to remove borders and split figures containing multiple images. Data were analyzed and described, and a subset was randomly selected for quality assessment. The dataset's structure allows for seamless integration of different types of data, making it a valuable resource for training or fine-tuning medical language, computer vision or multimodal models.
Affiliation: Nievas Offidani, Mauro Andrés. Universidad Nacional del Sur. Departamento de Ingeniería Eléctrica y de Computadoras; Argentina
Affiliation: Delrieux, Claudio Augusto. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Bahía Blanca. Instituto de Ciencias e Ingeniería de la Computación. Universidad Nacional del Sur. Departamento de Ciencias e Ingeniería de la Computación. Instituto de Ciencias e Ingeniería de la Computación; Argentina. Universidad Nacional del Sur. Departamento de Ingeniería Eléctrica y de Computadoras; Argentina
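The description mentions removing borders from images during preprocessing. The following is only an illustrative sketch of that idea and not the authors' pipeline: it works on a grayscale image stored as plain Python lists rather than OpenCV arrays, and the names `content_bounds` and `remove_border`, the white background value and the tolerance are all assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical representation: rows of grayscale pixels, 0 (black) to 255 (white).
Image = List[List[int]]


def content_bounds(image: Image, background: int = 255,
                   tolerance: int = 10) -> Tuple[int, int, int, int]:
    """Return (top, bottom, left, right) of the non-background region.

    bottom and right are exclusive. If every pixel is background,
    the bounds of the whole image are returned.
    """
    def is_content(pixel: int) -> bool:
        # A pixel counts as content when it differs enough from the background.
        return abs(pixel - background) > tolerance

    height, width = len(image), len(image[0])
    rows = [r for r in range(height) if any(is_content(p) for p in image[r])]
    cols = [c for c in range(width) if any(is_content(row[c]) for row in image)]
    if not rows:
        return 0, height, 0, width
    return rows[0], rows[-1] + 1, cols[0], cols[-1] + 1


def remove_border(image: Image, background: int = 255,
                  tolerance: int = 10) -> Image:
    """Crop away uniform background rows and columns around the content."""
    top, bottom, left, right = content_bounds(image, background, tolerance)
    return [row[left:right] for row in image[top:bottom]]


# Toy example: a 2x3 block of dark pixels surrounded by a white border.
sample = [
    [255, 255, 255, 255, 255],
    [255, 10, 20, 30, 255],
    [255, 40, 50, 60, 255],
    [255, 255, 255, 255, 255],
]
cropped = remove_border(sample)
```

With array libraries the same bounding-box crop is usually expressed as a mask followed by a slice; the list-based version above is chosen only to keep the sketch free of dependencies.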
Subject
Multimodal
Medical
Healthcare
Radiology
Access level
open access
Terms of use
https://creativecommons.org/licenses/by/2.5/ar/
Repository
CONICET Digital (CONICET)
Institution
Consejo Nacional de Investigaciones Científicas y Técnicas
OAI identifier
oai:ri.conicet.gov.ar:11336/278085
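
The OAI identifier above can be used to request this record over OAI-PMH. The sketch below only builds the request URL and sends nothing over the network. It assumes the endpoint listed in this record's index data (`http://ri.conicet.gov.ar/oai/request`) is still valid and uses the `oai_dc` metadata format, which the OAI-PMH specification requires every repository to support. The function name `build_get_record_url` is made up for this example.

```python
from urllib.parse import urlencode

# Endpoint and identifier as they appear in this record.
OAI_BASE_URL = "http://ri.conicet.gov.ar/oai/request"
OAI_IDENTIFIER = "oai:ri.conicet.gov.ar:11336/278085"


def build_get_record_url(base_url: str, identifier: str,
                         metadata_prefix: str = "oai_dc") -> str:
    """Return an OAI-PMH GetRecord request URL for a single record."""
    query = urlencode({
        "verb": "GetRecord",                # OAI-PMH verb for fetching one record
        "identifier": identifier,           # the record's OAI identifier
        "metadataPrefix": metadata_prefix,  # requested metadata format
    })
    return f"{base_url}?{query}"


url = build_get_record_url(OAI_BASE_URL, OAI_IDENTIFIER)
```

Fetching the resulting URL with any HTTP client should return an XML response containing the Dublin Core fields for this record, provided the repository still exposes the endpoint.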

id CONICETDig_bce811afd01d7a63cfeeee049f89db47
oai_identifier_str oai:ri.conicet.gov.ar:11336/278085
network_acronym_str CONICETDig
repository_id_str 3498
network_name_str CONICET Digital (CONICET)
repository URL http://ri.conicet.gov.ar/
OAI-PMH endpoint http://ri.conicet.gov.ar/oai/request
OpenDOAR identifier opendoar:3498
dc.title.none.fl_str_mv Dataset of clinical cases, images, image labels and captions from open access case reports from PubMed Central (1990–2023)
dc.creator.none.fl_str_mv Nievas Offidani, Mauro Andrés
Delrieux, Claudio Augusto
author Nievas Offidani, Mauro Andrés
author_facet Nievas Offidani, Mauro Andrés
Delrieux, Claudio Augusto
author_role author
author2 Delrieux, Claudio Augusto
author2_role author
dc.subject.none.fl_str_mv Multimodal
Medical
Healthcare
Radiology
topic Multimodal
Medical
Healthcare
Radiology
purl_subject.fl_str_mv https://purl.org/becyt/ford/1.2
https://purl.org/becyt/ford/1
publishDate 2023
dc.date.none.fl_str_mv 2023-12-23
dc.type.none.fl_str_mv info:eu-repo/semantics/article
info:eu-repo/semantics/publishedVersion
http://purl.org/coar/resource_type/c_6501
info:ar-repo/semantics/articulo
format article
status_str publishedVersion
dc.identifier.none.fl_str_mv http://hdl.handle.net/11336/278085
Nievas Offidani, Mauro Andrés; Delrieux, Claudio Augusto; Dataset of clinical cases, images, image labels and captions from open access case reports from PubMed Central (1990–2023); Elsevier; Data in Brief; 52; 110008; 23-12-2023; 1-10
2352-3409
CONICET Digital
CONICET
url http://hdl.handle.net/11336/278085
dc.language.none.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv info:eu-repo/semantics/altIdentifier/url/https://linkinghub.elsevier.com/retrieve/pii/S2352340923010351
info:eu-repo/semantics/altIdentifier/doi/10.1016/j.dib.2023.110008
dc.rights.none.fl_str_mv info:eu-repo/semantics/openAccess
https://creativecommons.org/licenses/by/2.5/ar/
eu_rights_str_mv openAccess
rights_invalid_str_mv https://creativecommons.org/licenses/by/2.5/ar/
dc.format.none.fl_str_mv application/pdf
application/pdf
dc.publisher.none.fl_str_mv Elsevier
publisher.none.fl_str_mv Elsevier
dc.source.none.fl_str_mv reponame:CONICET Digital (CONICET)
instname:Consejo Nacional de Investigaciones Científicas y Técnicas
reponame_str CONICET Digital (CONICET)
collection CONICET Digital (CONICET)
instname_str Consejo Nacional de Investigaciones Científicas y Técnicas
repository.name.fl_str_mv CONICET Digital (CONICET) - Consejo Nacional de Investigaciones Científicas y Técnicas
repository.mail.fl_str_mv dasensio@conicet.gov.ar; lcarlino@conicet.gov.ar