Heaps' Law and Heaps functions in tagged texts: Evidences of their linguistic relevance
- Authors
- Chacoma, Andrés Alberto; Zanette, Damian Horacio
- Publication year
- 2020
- Language
- English
- Resource type
- article
- Status
- published version
- Description
- We study the relationship between vocabulary size and text length in a corpus of 75 literary works in English, authored by six writers, distinguishing between the contributions of three grammatical classes (or 'tags,' namely, nouns, verbs and others), and analyse the progressive appearance of new words of each tag along each individual text. We find that, as prescribed by Heaps' Law, vocabulary sizes and text lengths follow a well-defined power-law relation. Meanwhile, the appearance of new words in each text does not obey a power law, and is on the whole well described by the average of random shufflings of the text. Deviations from this average, however, are statistically significant and show systematic trends across the corpus. Specifically, we find that the appearance of new words along each text is predominantly retarded with respect to the average of random shufflings. Moreover, different tags add systematically distinct contributions to this tendency, with verbs and others being respectively more and less retarded than the mean trend, and nouns following instead the overall mean. These statistical systematicities are likely to point to the existence of linguistically relevant information stored in the different variants of Heaps' Law, a feature that is still in need of extensive assessment.
Affiliation: Chacoma, Andrés Alberto. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Córdoba. Instituto de Física Enrique Gaviola. Universidad Nacional de Córdoba. Instituto de Física Enrique Gaviola; Argentina
Affiliation: Zanette, Damian Horacio. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Patagonia Norte; Argentina. Comisión Nacional de Energía Atómica. Gerencia del Área Investigaciones y Aplicaciones no Nucleares; Argentina
- Subject
- GRAMMATICAL CLASSES; HEAPS' LAW; LANGUAGE REGULARITIES; STATISTICAL ANOMALIES; TAGGED TEXTS
- Access level
- open access
- Terms of use
- https://creativecommons.org/licenses/by/2.5/ar/
- Repository
- Institution
- Consejo Nacional de Investigaciones Científicas y Técnicas
- OAI identifier
- oai:ri.conicet.gov.ar:11336/137044
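The comparison described in the abstract, between the progressive appearance of new words along a text and the average over random shufflings of that text, can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `heaps_function` and `shuffled_average`, the number of shufflings, the whitespace tokenization and the toy sentence are all assumptions made for illustration, and the grammatical tagging step (nouns, verbs, others) is left out.

```python
import random
from typing import List, Sequence


def heaps_function(tokens: Sequence[str]) -> List[int]:
    """Number of distinct words seen among the first n tokens, for n = 1..N."""
    seen = set()
    curve = []
    for tok in tokens:
        seen.add(tok)
        curve.append(len(seen))
    return curve


def shuffled_average(tokens: Sequence[str], n_shuffles: int = 200, seed: int = 0) -> List[float]:
    """Average Heaps function over random shufflings of the same tokens.

    n_shuffles and seed are illustrative choices, not values from the paper.
    """
    rng = random.Random(seed)
    work = list(tokens)
    total = [0.0] * len(work)
    for _ in range(n_shuffles):
        rng.shuffle(work)
        for i, v in enumerate(heaps_function(work)):
            total[i] += v
    return [t / n_shuffles for t in total]


# Toy text (hypothetical); a real analysis would use a full tokenized work.
tokens = "the cat sat on the mat and the dog sat on the log".split()
observed = heaps_function(tokens)
expected = shuffled_average(tokens)

# With this sign convention, negative values mean that new words appear
# later in the real text than in the shuffled average.
deviation = [o - e for o, e in zip(observed, expected)]
```

Both curves start at 1 and end at the full vocabulary size, so the deviation vanishes at the end of the text; any systematic difference shows up in between.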
id | CONICETDig_b29ec5e593dac4ca8a836da01bc37844 |
---|---|
oai_identifier_str | oai:ri.conicet.gov.ar:11336/137044 |
network_acronym_str | CONICETDig |
repository_id_str | 3498 |
network_name_str | CONICET Digital (CONICET) |
dc.title.none.fl_str_mv | Heaps' Law and Heaps functions in tagged texts: Evidences of their linguistic relevance |
dc.creator.none.fl_str_mv | Chacoma, Andrés Alberto; Zanette, Damian Horacio |
author | Chacoma, Andrés Alberto |
author_role | author |
author2 | Zanette, Damian Horacio |
author2_role | author |
dc.subject.none.fl_str_mv | GRAMMATICAL CLASSES; HEAPS' LAW; LANGUAGE REGULARITIES; STATISTICAL ANOMALIES; TAGGED TEXTS |
purl_subject.fl_str_mv | https://purl.org/becyt/ford/1.3 https://purl.org/becyt/ford/1 |
publishDate | 2020 |
dc.date.none.fl_str_mv | 2020-03 |
dc.type.none.fl_str_mv | info:eu-repo/semantics/article info:eu-repo/semantics/publishedVersion http://purl.org/coar/resource_type/c_6501 info:ar-repo/semantics/articulo |
format | article |
status_str | publishedVersion |
dc.identifier.none.fl_str_mv | http://hdl.handle.net/11336/137044 Chacoma, Andrés Alberto; Zanette, Damian Horacio; Heaps' Law and Heaps functions in tagged texts: Evidences of their linguistic relevance; The Royal Society; Royal Society Open Science; 7; 3; 3-2020; 1-15 2054-5703 CONICET Digital CONICET |
url | http://hdl.handle.net/11336/137044 |
dc.language.none.fl_str_mv | eng |
dc.relation.none.fl_str_mv | info:eu-repo/semantics/altIdentifier/url/https://royalsocietypublishing.org/doi/10.1098/rsos.200008 info:eu-repo/semantics/altIdentifier/doi/10.1098/rsos.200008 |
dc.rights.none.fl_str_mv | info:eu-repo/semantics/openAccess https://creativecommons.org/licenses/by/2.5/ar/ |
eu_rights_str_mv | openAccess |
dc.format.none.fl_str_mv | application/pdf |
dc.publisher.none.fl_str_mv | The Royal Society |
dc.source.none.fl_str_mv | reponame:CONICET Digital (CONICET) instname:Consejo Nacional de Investigaciones Científicas y Técnicas |
reponame_str | CONICET Digital (CONICET) |
instname_str | Consejo Nacional de Investigaciones Científicas y Técnicas |
repository.name.fl_str_mv | CONICET Digital (CONICET) - Consejo Nacional de Investigaciones Científicas y Técnicas |
repository.mail.fl_str_mv | dasensio@conicet.gov.ar; lcarlino@conicet.gov.ar |
_version_ | 1844614449403527168 |
score | 13.070432 |