A standardized reference data set for vertebrate taxon name resolution

Authors: Zermoglio, Paula Florencia; Guralnick, Robert P.; Wieczorek, John R.
Year of publication: 2016
Language: English
Resource type: article
Status: published version
Description: Taxonomic names associated with digitized biocollections labels have flooded into repositories such as GBIF, iDigBio and VertNet. The names on these labels are often misspelled, out of date, or present other problems, as they were often captured only once during accessioning of specimens, or have a history of label changes without clear provenance. Before records are reliably usable in research, it is critical that these issues be addressed. However, still missing is an assessment of the scope of the problem, the effort needed to solve it, and a way to improve effectiveness of tools developed to aid the process. We present a carefully human-vetted analysis of 1000 verbatim scientific names taken at random from those published via the data aggregator VertNet, providing the first rigorously reviewed, reference validation data set. In addition to characterizing formatting problems, human vetting focused on detecting misspelling, synonymy, and the incorrect use of Darwin Core. Our results reveal a sobering view of the challenge ahead, as less than 47% of name strings were found to be currently valid. More optimistically, nearly 97% of name combinations could be resolved to a currently valid name, suggesting that computer-aided approaches may provide feasible means to improve digitized content. Finally, we associated names back to biocollections records and fit logistic models to test potential drivers of issues. A set of candidate variables (geographic region, year collected, higher-level clade, and the institutional digitally accessible data volume) and their 2-way interactions all predict the probability of records having taxon name issues, based on model selection approaches. We strongly encourage further experiments to use this reference data set as a means to compare automated or computer-aided taxon name tools for their ability to resolve and improve the existing wealth of legacy data.
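The comparison of name-resolution tools that the description encourages could be sketched roughly as follows. This is a minimal illustration, not the authors' method: the four reference rows, the `SYNONYMS` table, the `toy_resolver` function, and the `score` helper are all hypothetical stand-ins, and the rows are not taken from the published data set. It computes the two kinds of figures the description reports, the share of verbatim strings that are already valid and the share a tool resolves to the vetted valid name.

```python
import difflib
from typing import Callable, Dict, Optional

# Hypothetical toy reference: verbatim name string -> human-vetted valid name.
# Illustrative rows only; they are NOT taken from the published data set.
REFERENCE: Dict[str, str] = {
    "Canis lupus": "Canis lupus",       # already valid
    "Felis concolor": "Puma concolor",  # synonym
    "Puma concolr": "Puma concolor",    # misspelling
    "Canis lupis": "Canis lupus",       # misspelling
}

VALID_NAMES = sorted(set(REFERENCE.values()))
SYNONYMS = {"Felis concolor": "Puma concolor"}  # assumed lookup table


def toy_resolver(verbatim: str) -> Optional[str]:
    """Stand-in for a real tool: exact match, synonym lookup, then fuzzy match."""
    if verbatim in VALID_NAMES:
        return verbatim
    if verbatim in SYNONYMS:
        return SYNONYMS[verbatim]
    close = difflib.get_close_matches(verbatim, VALID_NAMES, n=1, cutoff=0.85)
    return close[0] if close else None


def score(reference: Dict[str, str],
          resolve: Callable[[str], Optional[str]]) -> Dict[str, float]:
    """Compare a resolver's output against the vetted reference names."""
    total = len(reference)
    valid_as_is = sum(1 for verbatim, valid in reference.items() if verbatim == valid)
    resolved = sum(1 for verbatim, valid in reference.items() if resolve(verbatim) == valid)
    return {
        "total": total,
        "valid_as_is": valid_as_is / total,  # share of strings valid without any fix
        "resolved": resolved / total,        # share the tool maps to the vetted name
    }


if __name__ == "__main__":
    print(score(REFERENCE, toy_resolver))
```

Any real tool can be plugged in by wrapping it in a function with the same signature as `toy_resolver`; scoring every tool against the same reference rows is what makes the results comparable.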
Affiliation: Zermoglio, Paula Florencia. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Ciudad Universitaria. Instituto de Ecología, Genética y Evolución de Buenos Aires. Universidad de Buenos Aires. Facultad de Ciencias Exactas y Naturales. Instituto de Ecología, Genética y Evolución de Buenos Aires; Argentina. Université François Rabelais; France
Affiliation: Guralnick, Robert P. University of Florida; United States
Affiliation: Wieczorek, John R. University of California at Berkeley; United States
Subject: BIOCOLLECTIONS
DATA CURATION
FITNESS FOR USE
GOLD STANDARD
TAXON NAMES
VALIDATION
VERTEBRATES
VERTNET
Access level: open access
Terms of use: https://creativecommons.org/licenses/by-nc-sa/2.5/ar/
Repository
Institution: Consejo Nacional de Investigaciones Científicas y Técnicas
OAI identifier: oai:ri.conicet.gov.ar:11336/60277

id	CONICETDig_77f36e04a646fe4c17d1f8484f152da9
oai_identifier_str	oai:ri.conicet.gov.ar:11336/60277
network_acronym_str	CONICETDig
repository_id_str	3498
network_name_str	CONICET Digital (CONICET)
dc.title.none.fl_str_mv	A standardized reference data set for vertebrate taxon name resolution
dc.creator.none.fl_str_mv	Zermoglio, Paula Florencia Guralnick, Robert P. Wieczorek, John R.
dc.subject.none.fl_str_mv	BIOCOLLECTIONS DATA CURATION FITNESS FOR USE GOLD STANDARD TAXON NAMES VALIDATION VERTEBRATES VERTNET
purl_subject.fl_str_mv	https://purl.org/becyt/ford/1.6 https://purl.org/becyt/ford/1
publishDate	2016
dc.date.none.fl_str_mv	2016-01
dc.type.none.fl_str_mv	info:eu-repo/semantics/article info:eu-repo/semantics/publishedVersion http://purl.org/coar/resource_type/c_6501 info:ar-repo/semantics/articulo
format	article
status_str	publishedVersion
dc.identifier.none.fl_str_mv	http://hdl.handle.net/11336/60277 Zermoglio, Paula Florencia; Guralnick, Robert P.; Wieczorek, John R.; A standardized reference data set for vertebrate taxon name resolution; Public Library of Science; Plos One; 11; 1; 1-2016; 1-20; e0146894 1932-6203 CONICET Digital CONICET
url	http://hdl.handle.net/11336/60277
identifier_str_mv	Zermoglio, Paula Florencia; Guralnick, Robert P.; Wieczorek, John R.; A standardized reference data set for vertebrate taxon name resolution; Public Library of Science; Plos One; 11; 1; 1-2016; 1-20; e0146894 1932-6203 CONICET Digital CONICET
dc.language.none.fl_str_mv	eng
language	eng
dc.relation.none.fl_str_mv	info:eu-repo/semantics/altIdentifier/doi/10.1371/journal.pone.0146894 info:eu-repo/semantics/altIdentifier/url/https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0146894
dc.rights.none.fl_str_mv	info:eu-repo/semantics/openAccess https://creativecommons.org/licenses/by-nc-sa/2.5/ar/
eu_rights_str_mv	openAccess
rights_invalid_str_mv	https://creativecommons.org/licenses/by-nc-sa/2.5/ar/
dc.format.none.fl_str_mv	application/pdf application/pdf
dc.publisher.none.fl_str_mv	Public Library of Science
publisher.none.fl_str_mv	Public Library of Science
dc.source.none.fl_str_mv	reponame:CONICET Digital (CONICET) instname:Consejo Nacional de Investigaciones Científicas y Técnicas
reponame_str	CONICET Digital (CONICET)
collection	CONICET Digital (CONICET)
instname_str	Consejo Nacional de Investigaciones Científicas y Técnicas
repository.name.fl_str_mv	CONICET Digital (CONICET) - Consejo Nacional de Investigaciones Científicas y Técnicas
repository.mail.fl_str_mv	dasensio@conicet.gov.ar; lcarlino@conicet.gov.ar
