Heaps' Law and Heaps functions in tagged texts: Evidences of their linguistic relevance

Authors: Chacoma, Andrés Alberto; Zanette, Damian Horacio
Publication year: 2020
Language: English
Resource type: article
Status: published version
Description: We study the relationship between vocabulary size and text length in a corpus of 75 literary works in English, authored by six writers, distinguishing between the contributions of three grammatical classes (or 'tags,' namely, nouns, verbs and others), and analyse the progressive appearance of new words of each tag along each individual text. We find that, as prescribed by Heaps' Law, vocabulary sizes and text lengths follow a well-defined power-law relation. Meanwhile, the appearance of new words in each text does not obey a power law, and is on the whole well described by the average of random shufflings of the text. Deviations from this average, however, are statistically significant and show systematic trends across the corpus. Specifically, we find that the appearance of new words along each text is predominantly retarded with respect to the average of random shufflings. Moreover, different tags add systematically distinct contributions to this tendency, with verbs and others being respectively more and less retarded than the mean trend, and nouns following instead the overall mean. These statistical systematicities are likely to point to the existence of linguistically relevant information stored in the different variants of Heaps' Law, a feature that is still in need of extensive assessment.
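The two measurements named in the description can be illustrated with a minimal Python sketch. It is not the authors' code: the function names (`heaps_function`, `shuffled_average`, `deviation`, `fit_power_law`), the number of shufflings, and the plain least-squares fit in log-log space are all assumptions made for illustration, and the sketch ignores grammatical tags. A negative deviation means that new words appear later in the real text than in its shuffled versions.

```python
import math
import random


def heaps_function(tokens):
    """Return V(n) for n = 1..N: distinct words among the first n tokens."""
    seen = set()
    growth = []
    for tok in tokens:
        seen.add(tok)
        growth.append(len(seen))
    return growth


def shuffled_average(tokens, n_shuffles=200, seed=0):
    """Average V(n) over random shufflings of the same tokens (assumed count)."""
    rng = random.Random(seed)
    work = list(tokens)
    total = [0] * len(work)
    for _ in range(n_shuffles):
        rng.shuffle(work)
        for i, v in enumerate(heaps_function(work)):
            total[i] += v
    return [t / n_shuffles for t in total]


def deviation(tokens, n_shuffles=200, seed=0):
    """Real V(n) minus the shuffled average; negative values mean delay."""
    real = heaps_function(tokens)
    avg = shuffled_average(tokens, n_shuffles, seed)
    return [r - a for r, a in zip(real, avg)]


def fit_power_law(lengths, vocab_sizes):
    """Least-squares fit of log V = log K + alpha * log N across texts.

    Returns (K, alpha). This is one common way to estimate a Heaps
    exponent, not necessarily the procedure used in the article.
    """
    xs = [math.log(n) for n in lengths]
    ys = [math.log(v) for v in vocab_sizes]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    alpha = sxy / sxx
    return math.exp(my - alpha * mx), alpha


if __name__ == "__main__":
    text = "a a a a b c".split()
    print(heaps_function(text))
    print(deviation(text))
    print(fit_power_law([100, 400, 1600], [20, 40, 80]))
```

To study one grammatical class at a time, a variant could take (word, tag) pairs and count only new words carrying a given tag, while still indexing positions over the full text; how the article defines this precisely is not stated in this record.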
Affiliation: Chacoma, Andrés Alberto. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Córdoba. Instituto de Física Enrique Gaviola. Universidad Nacional de Córdoba. Instituto de Física Enrique Gaviola; Argentina
Affiliation: Zanette, Damian Horacio. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Patagonia Norte; Argentina. Comisión Nacional de Energía Atómica. Gerencia del Área Investigaciones y Aplicaciones no Nucleares; Argentina
Subject: GRAMMATICAL CLASSES
HEAPS' LAW
LANGUAGE REGULARITIES
STATISTICAL ANOMALIES
TAGGED TEXTS
Access level: open access
Terms of use: https://creativecommons.org/licenses/by/2.5/ar/
Repository
Institution: Consejo Nacional de Investigaciones Científicas y Técnicas
OAI identifier: oai:ri.conicet.gov.ar:11336/137044

id	CONICETDig_b29ec5e593dac4ca8a836da01bc37844
oai_identifier_str	oai:ri.conicet.gov.ar:11336/137044
network_acronym_str	CONICETDig
repository_id_str	3498
network_name_str	CONICET Digital (CONICET)
dc.title.none.fl_str_mv	Heaps' Law and Heaps functions in tagged texts: Evidences of their linguistic relevance
dc.creator.none.fl_str_mv	Chacoma, Andrés Alberto Zanette, Damian Horacio
author	Chacoma, Andrés Alberto
author_role	author
author2	Zanette, Damian Horacio
author2_role	author
dc.subject.none.fl_str_mv	GRAMMATICAL CLASSES HEAPS' LAW LANGUAGE REGULARITIES STATISTICAL ANOMALIES TAGGED TEXTS
purl_subject.fl_str_mv	https://purl.org/becyt/ford/1.3 https://purl.org/becyt/ford/1
publishDate	2020
dc.date.none.fl_str_mv	2020-03
dc.type.none.fl_str_mv	info:eu-repo/semantics/article info:eu-repo/semantics/publishedVersion http://purl.org/coar/resource_type/c_6501 info:ar-repo/semantics/articulo
format	article
status_str	publishedVersion
dc.identifier.none.fl_str_mv	http://hdl.handle.net/11336/137044 Chacoma, Andrés Alberto; Zanette, Damian Horacio; Heaps' Law and Heaps functions in tagged texts: Evidences of their linguistic relevance; The Royal Society; Royal Society Open Science; 7; 3; 3-2020; 1-15 2054-5703 CONICET Digital CONICET
url	http://hdl.handle.net/11336/137044
dc.language.none.fl_str_mv	eng
language	eng
dc.relation.none.fl_str_mv	info:eu-repo/semantics/altIdentifier/url/https://royalsocietypublishing.org/doi/10.1098/rsos.200008 info:eu-repo/semantics/altIdentifier/doi/10.1098/rsos.200008
dc.rights.none.fl_str_mv	info:eu-repo/semantics/openAccess https://creativecommons.org/licenses/by/2.5/ar/
eu_rights_str_mv	openAccess
rights_invalid_str_mv	https://creativecommons.org/licenses/by/2.5/ar/
dc.format.none.fl_str_mv	application/pdf application/pdf
dc.publisher.none.fl_str_mv	The Royal Society
publisher.none.fl_str_mv	The Royal Society
dc.source.none.fl_str_mv	reponame:CONICET Digital (CONICET) instname:Consejo Nacional de Investigaciones Científicas y Técnicas
reponame_str	CONICET Digital (CONICET)
collection	CONICET Digital (CONICET)
instname_str	Consejo Nacional de Investigaciones Científicas y Técnicas
repository.name.fl_str_mv	CONICET Digital (CONICET) - Consejo Nacional de Investigaciones Científicas y Técnicas
repository.mail.fl_str_mv	dasensio@conicet.gov.ar; lcarlino@conicet.gov.ar