A standardized reference data set for vertebrate taxon name resolution

Authors
Zermoglio, Paula Florencia; Guralnick, Robert P.; Wieczorek, John R.
Publication year
2016
Language
English
Resource type
article
Status
published version
Description
Taxonomic names associated with digitized biocollections labels have flooded into repositories such as GBIF, iDigBio and VertNet. The names on these labels are often misspelled, out of date, or present other problems, as they were often captured only once during accessioning of specimens, or have a history of label changes without clear provenance. Before records are reliably usable in research, it is critical that these issues be addressed. However, still missing is an assessment of the scope of the problem, the effort needed to solve it, and a way to improve effectiveness of tools developed to aid the process. We present a carefully human-vetted analysis of 1000 verbatim scientific names taken at random from those published via the data aggregator VertNet, providing the first rigorously reviewed, reference validation data set. In addition to characterizing formatting problems, human vetting focused on detecting misspelling, synonymy, and the incorrect use of Darwin Core. Our results reveal a sobering view of the challenge ahead, as less than 47% of name strings were found to be currently valid. More optimistically, nearly 97% of name combinations could be resolved to a currently valid name, suggesting that computer-aided approaches may provide feasible means to improve digitized content. Finally, we associated names back to biocollections records and fit logistic models to test potential drivers of issues. A set of candidate variables (geographic region, year collected, higher-level clade, and the institutional digitally accessible data volume) and their 2-way interactions all predict the probability of records having taxon name issues, based on model selection approaches. We strongly encourage further experiments to use this reference data set as a means to compare automated or computer-aided taxon name tools for their ability to resolve and improve the existing wealth of legacy data.
Affiliation: Zermoglio, Paula Florencia. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Ciudad Universitaria. Instituto de Ecología, Genética y Evolución de Buenos Aires. Universidad de Buenos Aires. Facultad de Ciencias Exactas y Naturales. Instituto de Ecología, Genética y Evolución de Buenos Aires; Argentina. Université François Rabelais; France
Affiliation: Guralnick, Robert P. University of Florida; United States
Affiliation: Wieczorek, John R. University of California at Berkeley; United States
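The abstract proposes using the reference data set to compare name-resolution tools. A minimal sketch of that kind of comparison follows, assuming the reference set can be loaded as a mapping from verbatim name to vetted valid name. The three entries, the `VALID_NAMES` list, and both function names are hypothetical illustrations, not rows or code from the published data set, and the string-similarity resolver is a stand-in for whatever tool is being evaluated.

```python
import difflib
from typing import Callable, Dict, Optional

# Illustrative entries only; these are NOT rows from the published data set.
REFERENCE: Dict[str, str] = {
    "Puma concolor": "Puma concolor",   # already a valid name
    "Puma concolour": "Puma concolor",  # misspelling
    "Felis concolor": "Puma concolor",  # synonym
}

# Hypothetical list of currently valid names known to the toy resolver.
VALID_NAMES = ["Puma concolor", "Lynx rufus"]


def fuzzy_resolver(verbatim: str) -> Optional[str]:
    """Toy resolver: return the closest valid name by string similarity, if any."""
    matches = difflib.get_close_matches(verbatim, VALID_NAMES, n=1, cutoff=0.9)
    return matches[0] if matches else None


def score_resolver(
    reference: Dict[str, str],
    resolve: Callable[[str], Optional[str]],
) -> float:
    """Fraction of verbatim names the resolver maps to the vetted valid name."""
    if not reference:
        raise ValueError("reference set is empty")
    hits = sum(
        1 for verbatim, valid in reference.items() if resolve(verbatim) == valid
    )
    return hits / len(reference)


if __name__ == "__main__":
    print(f"resolved correctly: {score_resolver(REFERENCE, fuzzy_resolver):.2f}")
```

In this toy setup the similarity-based resolver recovers the misspelling but not the synonym, which is the sort of difference between tools that a vetted reference set is meant to expose.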
Subject
BIOCOLLECTIONS
DATA CURATION
FITNESS FOR USE
GOLD STANDARD
TAXON NAMES
VALIDATION
VERTEBRATES
VERTNET
Access level
open access
Terms of use
https://creativecommons.org/licenses/by-nc-sa/2.5/ar/
Repository
CONICET Digital (CONICET)
Institution
Consejo Nacional de Investigaciones Científicas y Técnicas
OAI identifier
oai:ri.conicet.gov.ar:11336/60277

id CONICETDig_77f36e04a646fe4c17d1f8484f152da9
oai_identifier_str oai:ri.conicet.gov.ar:11336/60277
network_acronym_str CONICETDig
repository_id_str 3498
network_name_str CONICET Digital (CONICET)
purl_subject.fl_str_mv https://purl.org/becyt/ford/1.6
https://purl.org/becyt/ford/1
publishDate 2016
dc.date.none.fl_str_mv 2016-01
dc.type.none.fl_str_mv info:eu-repo/semantics/article
info:eu-repo/semantics/publishedVersion
http://purl.org/coar/resource_type/c_6501
info:ar-repo/semantics/articulo
format article
status_str publishedVersion
dc.identifier.none.fl_str_mv http://hdl.handle.net/11336/60277
Zermoglio, Paula Florencia; Guralnick, Robert P.; Wieczorek, John R.; A standardized reference data set for vertebrate taxon name resolution; Public Library of Science; Plos One; 11; 1; 1-2016; 1-20; e0146894
1932-6203
CONICET Digital
CONICET
url http://hdl.handle.net/11336/60277
dc.language.none.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv info:eu-repo/semantics/altIdentifier/doi/10.1371/journal.pone.0146894
info:eu-repo/semantics/altIdentifier/url/https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0146894
dc.rights.none.fl_str_mv info:eu-repo/semantics/openAccess
https://creativecommons.org/licenses/by-nc-sa/2.5/ar/
eu_rights_str_mv openAccess
rights_invalid_str_mv https://creativecommons.org/licenses/by-nc-sa/2.5/ar/
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv Public Library of Science
dc.source.none.fl_str_mv reponame:CONICET Digital (CONICET)
instname:Consejo Nacional de Investigaciones Científicas y Técnicas
repository.mail.fl_str_mv dasensio@conicet.gov.ar; lcarlino@conicet.gov.ar