Dataset of clinical cases, images, image labels and captions from open access case reports from PubMed Central (1990–2023)
- Authors
- Nievas Offidani, Mauro Andrés; Delrieux, Claudio Augusto
- Publication year
- 2023
- Language
- English
- Resource type
- article
- Status
- published version
- Description
- This paper details the acquisition, structure and preprocessing of the MultiCaRe Dataset, a multimodal case report dataset which contains data from 75,382 open access PubMed Central articles spanning the period from 1990 to 2023. The dataset includes 96,428 clinical cases, 135,596 images, and their corresponding labels and captions. Data extraction was performed using different APIs and packages such as Biopython, requests, Beautifulsoup, BioC API for PMC and EuropePMC RESTful API. Image labels were created based on the contents of their corresponding captions, by using Spark NLP for Healthcare and manual annotations. Images were preprocessed with OpenCV in order to remove borders and split figures containing multiple images, data were analyzed and described, and a subset was randomly selected for quality assessment. The dataset's structure allows for seamless integration of different types of data, making it a valuable resource for training or fine-tuning medical language, computer vision or multi-modal models.
Affiliation: Nievas Offidani, Mauro Andrés. Universidad Nacional del Sur. Departamento de Ingeniería Eléctrica y de Computadoras; Argentina
Affiliation: Delrieux, Claudio Augusto. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Bahía Blanca. Instituto de Ciencias e Ingeniería de la Computación. Universidad Nacional del Sur. Departamento de Ciencias e Ingeniería de la Computación. Instituto de Ciencias e Ingeniería de la Computación; Argentina. Universidad Nacional del Sur. Departamento de Ingeniería Eléctrica y de Computadoras; Argentina
- Subject
- Multimodal; Medical; Healthcare; Radiology
- Accessibility level
- open access
- Terms of use
- https://creativecommons.org/licenses/by/2.5/ar/
- Repository
- CONICET Digital (CONICET)
- Institution
- Consejo Nacional de Investigaciones Científicas y Técnicas
- OAI identifier
- oai:ri.conicet.gov.ar:11336/278085
Full record metadata
| id | CONICETDig_bce811afd01d7a63cfeeee049f89db47 |
|---|---|
| oai_identifier_str | oai:ri.conicet.gov.ar:11336/278085 |
| network_acronym_str | CONICETDig |
| repository_id_str | 3498 |
| network_name_str | CONICET Digital (CONICET) |
| spelling | Dataset of clinical cases, images, image labels and captions from open access case reports from PubMed Central (1990–2023) Nievas Offidani, Mauro Andrés Delrieux, Claudio Augusto Multimodal Medical Healthcare Radiology https://purl.org/becyt/ford/1.2 https://purl.org/becyt/ford/1 This paper details the acquisition, structure and preprocessing of the MultiCaRe Dataset, a multimodal case report dataset which contains data from 75,382 open access PubMed Central articles spanning the period from 1990 to 2023. The dataset includes 96,428 clinical cases, 135,596 images, and their corresponding labels and captions. Data extraction was performed using different APIs and packages such as Biopython, requests, Beautifulsoup, BioC API for PMC and EuropePMC RESTful API. Image labels were created based on the contents of their corresponding captions, by using Spark NLP for Healthcare and manual annotations. Images were preprocessed with OpenCV in order to remove borders and split figures containing multiple images, data were analyzed and described, and a subset was randomly selected for quality assessment. The dataset's structure allows for seamless integration of different types of data, making it a valuable resource for training or fine-tuning medical language, computer vision or multi-modal models. Fil: Nievas Offidani, Mauro Andrés. Universidad Nacional del Sur. Departamento de Ingeniería Eléctrica y de Computadoras; Argentina Fil: Delrieux, Claudio Augusto. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Bahía Blanca. Instituto de Ciencias e Ingeniería de la Computación. Universidad Nacional del Sur. Departamento de Ciencias e Ingeniería de la Computación. Instituto de Ciencias e Ingeniería de la Computación; Argentina. Universidad Nacional del Sur. Departamento de Ingeniería Eléctrica y de Computadoras; Argentina Elsevier 2023-12-23 info:eu-repo/semantics/article info:eu-repo/semantics/publishedVersion http://purl.org/coar/resource_type/c_6501 info:ar-repo/semantics/articulo application/pdf application/pdf http://hdl.handle.net/11336/278085 Nievas Offidani, Mauro Andrés; Delrieux, Claudio Augusto; Dataset of clinical cases, images, image labels and captions from open access case reports from PubMed Central (1990–2023); Elsevier; Data in Brief; 52; 110008; 23-12-2023; 1-10 2352-3409 CONICET Digital CONICET eng info:eu-repo/semantics/altIdentifier/url/https://linkinghub.elsevier.com/retrieve/pii/S2352340923010351 info:eu-repo/semantics/altIdentifier/doi/10.1016/j.dib.2023.110008 info:eu-repo/semantics/openAccess https://creativecommons.org/licenses/by/2.5/ar/ reponame:CONICET Digital (CONICET) instname:Consejo Nacional de Investigaciones Científicas y Técnicas 2025-12-23T14:23:31Z oai:ri.conicet.gov.ar:11336/278085 instacron:CONICET Institucional http://ri.conicet.gov.ar/ Organismo científico-tecnológico No corresponde http://ri.conicet.gov.ar/oai/request dasensio@conicet.gov.ar; lcarlino@conicet.gov.ar Argentina No corresponde No corresponde No corresponde opendoar:3498 2025-12-23 14:23:32.22 CONICET Digital (CONICET) - Consejo Nacional de Investigaciones Científicas y Técnicas false |
| dc.title.none.fl_str_mv | Dataset of clinical cases, images, image labels and captions from open access case reports from PubMed Central (1990–2023) |
| title | Dataset of clinical cases, images, image labels and captions from open access case reports from PubMed Central (1990–2023) |
| spellingShingle | Dataset of clinical cases, images, image labels and captions from open access case reports from PubMed Central (1990–2023) Nievas Offidani, Mauro Andrés Multimodal Medical Healthcare Radiology |
| title_short | Dataset of clinical cases, images, image labels and captions from open access case reports from PubMed Central (1990–2023) |
| title_full | Dataset of clinical cases, images, image labels and captions from open access case reports from PubMed Central (1990–2023) |
| title_fullStr | Dataset of clinical cases, images, image labels and captions from open access case reports from PubMed Central (1990–2023) |
| title_full_unstemmed | Dataset of clinical cases, images, image labels and captions from open access case reports from PubMed Central (1990–2023) |
| title_sort | Dataset of clinical cases, images, image labels and captions from open access case reports from PubMed Central (1990–2023) |
| dc.creator.none.fl_str_mv | Nievas Offidani, Mauro Andrés Delrieux, Claudio Augusto |
| author | Nievas Offidani, Mauro Andrés |
| author_facet | Nievas Offidani, Mauro Andrés Delrieux, Claudio Augusto |
| author_role | author |
| author2 | Delrieux, Claudio Augusto |
| author2_role | author |
| dc.subject.none.fl_str_mv | Multimodal Medical Healthcare Radiology |
| topic | Multimodal Medical Healthcare Radiology |
| purl_subject.fl_str_mv | https://purl.org/becyt/ford/1.2 https://purl.org/becyt/ford/1 |
| dc.description.none.fl_txt_mv | This paper details the acquisition, structure and preprocessing of the MultiCaRe Dataset, a multimodal case report dataset which contains data from 75,382 open access PubMed Central articles spanning the period from 1990 to 2023. The dataset includes 96,428 clinical cases, 135,596 images, and their corresponding labels and captions. Data extraction was performed using different APIs and packages such as Biopython, requests, Beautifulsoup, BioC API for PMC and EuropePMC RESTful API. Image labels were created based on the contents of their corresponding captions, by using Spark NLP for Healthcare and manual annotations. Images were preprocessed with OpenCV in order to remove borders and split figures containing multiple images, data were analyzed and described, and a subset was randomly selected for quality assessment. The dataset's structure allows for seamless integration of different types of data, making it a valuable resource for training or fine-tuning medical language, computer vision or multi-modal models. Fil: Nievas Offidani, Mauro Andrés. Universidad Nacional del Sur. Departamento de Ingeniería Eléctrica y de Computadoras; Argentina Fil: Delrieux, Claudio Augusto. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Bahía Blanca. Instituto de Ciencias e Ingeniería de la Computación. Universidad Nacional del Sur. Departamento de Ciencias e Ingeniería de la Computación. Instituto de Ciencias e Ingeniería de la Computación; Argentina. Universidad Nacional del Sur. Departamento de Ingeniería Eléctrica y de Computadoras; Argentina |
| description | This paper details the acquisition, structure and preprocessing of the MultiCaRe Dataset, a multimodal case report dataset which contains data from 75,382 open access PubMed Central articles spanning the period from 1990 to 2023. The dataset includes 96,428 clinical cases, 135,596 images, and their corresponding labels and captions. Data extraction was performed using different APIs and packages such as Biopython, requests, Beautifulsoup, BioC API for PMC and EuropePMC RESTful API. Image labels were created based on the contents of their corresponding captions, by using Spark NLP for Healthcare and manual annotations. Images were preprocessed with OpenCV in order to remove borders and split figures containing multiple images, data were analyzed and described, and a subset was randomly selected for quality assessment. The dataset's structure allows for seamless integration of different types of data, making it a valuable resource for training or fine-tuning medical language, computer vision or multi-modal models. |
| publishDate | 2023 |
| dc.date.none.fl_str_mv | 2023-12-23 |
| dc.type.none.fl_str_mv | info:eu-repo/semantics/article info:eu-repo/semantics/publishedVersion http://purl.org/coar/resource_type/c_6501 info:ar-repo/semantics/articulo |
| format | article |
| status_str | publishedVersion |
| dc.identifier.none.fl_str_mv | http://hdl.handle.net/11336/278085 Nievas Offidani, Mauro Andrés; Delrieux, Claudio Augusto; Dataset of clinical cases, images, image labels and captions from open access case reports from PubMed Central (1990–2023); Elsevier; Data in Brief; 52; 110008; 23-12-2023; 1-10 2352-3409 CONICET Digital CONICET |
| url | http://hdl.handle.net/11336/278085 |
| identifier_str_mv | Nievas Offidani, Mauro Andrés; Delrieux, Claudio Augusto; Dataset of clinical cases, images, image labels and captions from open access case reports from PubMed Central (1990–2023); Elsevier; Data in Brief; 52; 110008; 23-12-2023; 1-10 2352-3409 CONICET Digital CONICET |
| dc.language.none.fl_str_mv | eng |
| language | eng |
| dc.relation.none.fl_str_mv | info:eu-repo/semantics/altIdentifier/url/https://linkinghub.elsevier.com/retrieve/pii/S2352340923010351 info:eu-repo/semantics/altIdentifier/doi/10.1016/j.dib.2023.110008 |
| dc.rights.none.fl_str_mv | info:eu-repo/semantics/openAccess https://creativecommons.org/licenses/by/2.5/ar/ |
| eu_rights_str_mv | openAccess |
| rights_invalid_str_mv | https://creativecommons.org/licenses/by/2.5/ar/ |
| dc.format.none.fl_str_mv | application/pdf application/pdf |
| dc.publisher.none.fl_str_mv | Elsevier |
| publisher.none.fl_str_mv | Elsevier |
| dc.source.none.fl_str_mv | reponame:CONICET Digital (CONICET) instname:Consejo Nacional de Investigaciones Científicas y Técnicas |
| reponame_str | CONICET Digital (CONICET) |
| collection | CONICET Digital (CONICET) |
| instname_str | Consejo Nacional de Investigaciones Científicas y Técnicas |
| repository.name.fl_str_mv | CONICET Digital (CONICET) - Consejo Nacional de Investigaciones Científicas y Técnicas |
| repository.mail.fl_str_mv | dasensio@conicet.gov.ar; lcarlino@conicet.gov.ar |
| _version_ | 1852335657380216832 |
| score | 12.952241 |
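
The OAI identifier and the repository's OAI request endpoint both appear in the record above, so a harvester could in principle ask for this single record with a standard OAI-PMH `GetRecord` request. The following is a minimal sketch of how such a URL might be assembled. The helper name `build_get_record_url` is hypothetical, the choice of the `oai_dc` metadata prefix is an assumption (it is the baseline format in OAI-PMH, but the formats this repository actually serves are not stated here), and the sketch only builds the URL without contacting the server.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Values copied from the record above.
OAI_BASE_URL = "http://ri.conicet.gov.ar/oai/request"
OAI_IDENTIFIER = "oai:ri.conicet.gov.ar:11336/278085"


def build_get_record_url(base_url, identifier, metadata_prefix="oai_dc"):
    """Return an OAI-PMH GetRecord URL for a single record.

    Hypothetical helper: `metadata_prefix` defaults to "oai_dc" as an
    assumption about which format a caller would want.
    """
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


url = build_get_record_url(OAI_BASE_URL, OAI_IDENTIFIER)

# Decode the query string again to confirm the parameters round-trip.
params = parse_qs(urlsplit(url).query)
print(url)
print(params["identifier"][0])
```

The identifier contains `:` and `/`, which `urlencode` percent-escapes; decoding the query string afterwards recovers the original identifier unchanged.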