Pooch: A friend to fetch your data files

Authors
Uieda, Leonardo; Soler, Santiago Rubén; Rampin, Rémi; Kemenade, Hugo van; Turk, Matthew; Shapero, Daniel; Banihirwe, Anderson; Leeman, John
Publication year
2020
Language
English
Resource type
article
Status
published version
Description
Scientific software is usually created to acquire, analyze, model, and visualize data. As such, many software libraries include sample datasets in their distributions for use in documentation, tests, benchmarks, and workshops. A common approach is to include smaller datasets in the GitHub repository directly and package them with the source and binary distributions (e.g., scikit-learn (Pedregosa et al., 2011) and scikit-image (Van der Walt et al., 2014) do this). As data files increase in size, it becomes unfeasible to store them in GitHub repositories. Thus, larger datasets require writing code to download the files from a remote server to the user’s computer. The same problem is faced by scientists using version control to manage their research projects. While downloading a data file over HTTPS can be done easily with modern Python libraries, it is not trivial to manage a set of files, keep them updated, and check for corruption. For example, scikit-learn (Pedregosa et al., 2011), Cartopy (Met Office, n.d.), and PyVista (Sullivan & Kaszynski, 2019) all include code dedicated to this particular task. Instead of scientists and library authors recreating the same code, it would be best to have a minimalistic and easy to set up tool for fetching and maintaining data files.
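The file-management task the abstract describes (download a file once, reuse the local copy, and detect corruption) can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not Pooch's API: the helper names `sha256_of` and `fetch`, the `.part` suffix, and the choice of SHA-256 are all hypothetical.

```python
import hashlib
import shutil
import urllib.request
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as stream:
        for chunk in iter(lambda: stream.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch(url, known_hash, cache_dir):
    """Return a local path for `url`, downloading only when needed.

    Hypothetical helper for illustration; it is not part of Pooch.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / url.rsplit("/", 1)[-1]

    # Reuse the cached copy only if its hash still matches.
    if target.exists() and sha256_of(target) == known_hash:
        return target

    # Download under a temporary name so a partial file never looks valid.
    partial = target.with_name(target.name + ".part")
    with urllib.request.urlopen(url) as response, open(partial, "wb") as out:
        shutil.copyfileobj(response, out)

    # Reject the download if it does not match the expected hash.
    if sha256_of(partial) != known_hash:
        partial.unlink()
        raise ValueError(f"hash mismatch for {url}")

    partial.replace(target)
    return target
```

Calling `fetch(url, known_hash, cache_dir)` a second time returns the cached path without touching the network, while a corrupted or outdated local copy fails the hash check and is downloaded again. Even this small sketch has to handle caching, partial downloads, and verification, which is the kind of repeated effort the abstract argues a shared tool should absorb.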
Affiliation: Uieda, Leonardo. University of Liverpool; United Kingdom
Affiliation: Soler, Santiago Rubén. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - San Juan; Argentina. Universidad Nacional de San Juan. Facultad de Ciencias Exactas, Físicas y Naturales. Instituto Geofísico Sismológico Volponi; Argentina
Affiliation: Rampin, Rémi. University of New York; United States
Affiliation: Kemenade, Hugo van. Not specified
Affiliation: Turk, Matthew. University of Illinois. Urbana - Champaign; United States
Affiliation: Shapero, Daniel. University of Washington; United States
Affiliation: Banihirwe, Anderson. National Center for Atmospheric Research; United States
Affiliation: Leeman, John. Leeman Geophysical; United States
Subject
OPEN SOURCE
PYTHON
DATA
JOSS
Access level
open access
Terms of use
https://creativecommons.org/licenses/by/2.5/ar/
Repository
CONICET Digital (CONICET)
Institution
Consejo Nacional de Investigaciones Científicas y Técnicas
OAI identifier
oai:ri.conicet.gov.ar:11336/156774

id CONICETDig_0906ba4e143b6e79ba0a92c037427b37
oai_identifier_str oai:ri.conicet.gov.ar:11336/156774
network_acronym_str CONICETDig
repository_id_str 3498
network_name_str CONICET Digital (CONICET)
dc.title.none.fl_str_mv Pooch: A friend to fetch your data files
dc.creator.none.fl_str_mv Uieda, Leonardo
Soler, Santiago Rubén
Rampin, Rémi
Kemenade, Hugo van
Turk, Matthew
Shapero, Daniel
Banihirwe, Anderson
Leeman, John
dc.subject.none.fl_str_mv OPEN SOURCE
PYTHON
DATA
JOSS
purl_subject.fl_str_mv https://purl.org/becyt/ford/1.2
https://purl.org/becyt/ford/1
publishDate 2020
dc.date.none.fl_str_mv 2020-01
dc.type.none.fl_str_mv info:eu-repo/semantics/article
info:eu-repo/semantics/publishedVersion
http://purl.org/coar/resource_type/c_6501
info:ar-repo/semantics/articulo
format article
status_str publishedVersion
dc.identifier.none.fl_str_mv http://hdl.handle.net/11336/156774
Uieda, Leonardo; Soler, Santiago Rubén; Rampin, Rémi; Kemenade, Hugo van; Turk, Matthew; et al.; Pooch: A friend to fetch your data files; Journal of Open Source Software; Journal of Open Source Software; 5; 45; 1-2020; 1-3
2475-9066
CONICET Digital
CONICET
dc.language.none.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv info:eu-repo/semantics/altIdentifier/url/https://joss.theoj.org/papers/10.21105/joss.01943
info:eu-repo/semantics/altIdentifier/doi/10.21105/joss.01943
dc.rights.none.fl_str_mv info:eu-repo/semantics/openAccess
https://creativecommons.org/licenses/by/2.5/ar/
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv Journal of Open Source Software
dc.source.none.fl_str_mv reponame:CONICET Digital (CONICET)
instname:Consejo Nacional de Investigaciones Científicas y Técnicas
repository.name.fl_str_mv CONICET Digital (CONICET) - Consejo Nacional de Investigaciones Científicas y Técnicas
repository.mail.fl_str_mv dasensio@conicet.gov.ar; lcarlino@conicet.gov.ar