Towards the quantification of the semantic information encoded in written language
- Authors
- Montemurro, Marcelo Alejandro; Zanette, Damian Horacio
- Year of publication
- 2010
- Language
- English
- Resource type
- article
- Status
- published version
- Description
- Written language is a complex communication signal capable of conveying information encoded in the form of ordered sequences of words. Beyond the local order ruled by grammar, semantic and thematic structures affect long-range patterns in word usage. Here, we show that a direct application of information theory quantifies the relationship between the statistical distribution of words and the semantic content of the text. We show that there is a characteristic scale, of roughly a few thousand words, which establishes the typical size of the most informative segments in written language. Moreover, we find that the words that contribute most to the overall information are the ones most closely associated with the main subjects and topics of the text. This scenario can be explained by a model of word usage that assumes that words are distributed along the text in domains of a characteristic size where their frequency is higher than elsewhere. Our conclusions are based on the analysis of a large database of written language, diverse in subjects and styles, and thus are likely to be applicable to general language sequences encoding complex information.
Affiliation: Montemurro, Marcelo Alejandro. University of Manchester; United Kingdom
Affiliation: Zanette, Damian Horacio. Comisión Nacional de Energía Atómica. Gerencia del Área de Investigación y Aplicaciones No Nucleares. Gerencia de Física (Centro Atómico Bariloche); Argentina. Comisión Nacional de Energía Atómica. Gerencia del Área de Energía Nuclear. Instituto Balseiro; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Patagonia Norte; Argentina
- Subject
- COMPLEX COMMUNICATION; INFORMATION THEORY; NATURAL LANGUAGE
- Access level
- open access
- Terms of use
- https://creativecommons.org/licenses/by-nc-sa/2.5/ar/
- Repository
- Institution
- Consejo Nacional de Investigaciones Científicas y Técnicas
- OAI identifier
- oai:ri.conicet.gov.ar:11336/125163
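The Description above refers to applying information theory to the way words are distributed along a text. A minimal sketch of one way such a measure could be computed is given here, under stated assumptions: the text is cut into equal-sized parts, the mutual information between word identity and part index is estimated from raw counts, and the average value for randomly shuffled copies of the text is subtracted as a baseline. The function names (`mutual_information`, `information_gain`), the plug-in estimator, and the shuffle baseline are illustrative choices, not taken from this record, and may differ from the procedure used in the article.

```python
import math
import random
from collections import Counter


def mutual_information(words, n_parts):
    """Plug-in estimate (in bits) of the mutual information between
    word identity and the index of the part in which a word occurs."""
    part_len = len(words) // n_parts
    if part_len == 0:
        raise ValueError("n_parts must not exceed the number of words")
    words = words[: part_len * n_parts]  # drop the remainder so parts are equal
    total = len(words)
    word_counts = Counter(words)
    p_part = part_len / total  # every part has the same probability
    mi = 0.0
    for j in range(n_parts):
        part = words[j * part_len:(j + 1) * part_len]
        for word, count in Counter(part).items():
            p_joint = count / total
            p_word = word_counts[word] / total
            mi += p_joint * math.log2(p_joint / (p_word * p_part))
    return mi


def information_gain(words, n_parts, n_shuffles=20, seed=0):
    """Mutual information of the real word order minus the average over
    shuffled copies (assumed baseline that removes finite-size bias)."""
    rng = random.Random(seed)
    real = mutual_information(words, n_parts)
    baseline = 0.0
    for _ in range(n_shuffles):
        shuffled = list(words)
        rng.shuffle(shuffled)
        baseline += mutual_information(shuffled, n_parts)
    return real - baseline / n_shuffles


if __name__ == "__main__":
    # Toy text: two "topics", each confined to one half of the sequence.
    clustered = ["a"] * 50 + ["b"] * 50
    print(mutual_information(clustered, 2))
    print(information_gain(clustered, 2))
```

Under these assumptions, evaluating `information_gain` over a range of `n_parts` values for a long tokenized text would be one way to look for the part size at which the measure peaks, which is the kind of characteristic scale the Description mentions.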
Full record metadata
Field | Value |
---|---|
id | CONICETDig_1509542a422375e3c4e00af0157a2042 |
oai_identifier_str | oai:ri.conicet.gov.ar:11336/125163 |
network_acronym_str | CONICETDig |
repository_id_str | 3498 |
network_name_str | CONICET Digital (CONICET) |
dc.title.none.fl_str_mv | Towards the quantification of the semantic information encoded in written language |
dc.creator.none.fl_str_mv | Montemurro, Marcelo Alejandro; Zanette, Damian Horacio |
dc.subject.none.fl_str_mv | COMPLEX COMMUNICATION; INFORMATION THEORY; NATURAL LANGUAGE |
purl_subject.fl_str_mv | https://purl.org/becyt/ford/1.3 https://purl.org/becyt/ford/1 |
publishDate | 2010 |
dc.date.none.fl_str_mv | 2010-02 |
dc.type.none.fl_str_mv | info:eu-repo/semantics/article info:eu-repo/semantics/publishedVersion http://purl.org/coar/resource_type/c_6501 info:ar-repo/semantics/articulo |
format | article |
status_str | publishedVersion |
dc.identifier.none.fl_str_mv | http://hdl.handle.net/11336/125163 Montemurro, Marcelo Alejandro; Zanette, Damian Horacio; Towards the quantification of the semantic information encoded in written language; World Scientific; Advances In Complex Systems; 13; 2; 2-2010; 135-153 0219-5259 CONICET Digital CONICET |
dc.language.none.fl_str_mv | eng |
dc.relation.none.fl_str_mv | info:eu-repo/semantics/altIdentifier/arxiv/http://arxiv.org/abs/0907.1558 info:eu-repo/semantics/altIdentifier/url/https://www.worldscientific.com/doi/abs/10.1142/S0219525910002530 info:eu-repo/semantics/altIdentifier/doi/10.1142/S0219525910002530 |
dc.rights.none.fl_str_mv | info:eu-repo/semantics/openAccess https://creativecommons.org/licenses/by-nc-sa/2.5/ar/ |
dc.format.none.fl_str_mv | application/pdf |
dc.publisher.none.fl_str_mv | World Scientific |
dc.source.none.fl_str_mv | reponame:CONICET Digital (CONICET) instname:Consejo Nacional de Investigaciones Científicas y Técnicas |
repository.name.fl_str_mv | CONICET Digital (CONICET) - Consejo Nacional de Investigaciones Científicas y Técnicas |
repository.mail.fl_str_mv | dasensio@conicet.gov.ar; lcarlino@conicet.gov.ar |