Towards the quantification of the semantic information encoded in written language

Authors
Montemurro, Marcelo Alejandro; Zanette, Damian Horacio
Publication year
2010
Language
English
Resource type
article
Status
published version
Description
Written language is a complex communication signal capable of conveying information encoded in the form of ordered sequences of words. Beyond the local order ruled by grammar, semantic and thematic structures affect long-range patterns in word usage. Here, we show that a direct application of information theory quantifies the relationship between the statistical distribution of words and the semantic content of the text. We show that there is a characteristic scale of roughly a few thousand words, which establishes the typical size of the most informative segments in written language. Moreover, we find that the words whose contributions to the overall information are largest are the ones most closely associated with the main subjects and topics of the text. This scenario can be explained by a model of word usage that assumes that words are distributed along the text in domains of a characteristic size where their frequency is higher than elsewhere. Our conclusions are based on the analysis of a large database of written language, diverse in subjects and styles, and thus are likely to be applicable to general language sequences encoding complex information.
Affiliation: Montemurro, Marcelo Alejandro. University of Manchester; United Kingdom
Affiliation: Zanette, Damian Horacio. Comisión Nacional de Energía Atómica. Gerencia del Área de Investigación y Aplicaciones No Nucleares. Gerencia de Física (Centro Atómico Bariloche); Argentina. Comisión Nacional de Energía Atómica. Gerencia del Área de Energía Nuclear. Instituto Balseiro; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Patagonia Norte; Argentina
Subject
COMPLEX COMMUNICATION
INFORMATION THEORY
NATURAL LANGUAGE
Access level
open access
Terms of use
https://creativecommons.org/licenses/by-nc-sa/2.5/ar/
Repository
CONICET Digital (CONICET)
Institution
Consejo Nacional de Investigaciones Científicas y Técnicas
OAI identifier
oai:ri.conicet.gov.ar:11336/125163

id CONICETDig_1509542a422375e3c4e00af0157a2042
oai_identifier_str oai:ri.conicet.gov.ar:11336/125163
network_acronym_str CONICETDig
repository_id_str 3498
network_name_str CONICET Digital (CONICET)
purl_subject.fl_str_mv https://purl.org/becyt/ford/1.3
https://purl.org/becyt/ford/1
publishDate 2010
dc.date.none.fl_str_mv 2010-02
dc.type.none.fl_str_mv info:eu-repo/semantics/article
info:eu-repo/semantics/publishedVersion
http://purl.org/coar/resource_type/c_6501
info:ar-repo/semantics/articulo
format article
status_str publishedVersion
dc.identifier.none.fl_str_mv http://hdl.handle.net/11336/125163
Montemurro, Marcelo Alejandro; Zanette, Damian Horacio; Towards the quantification of the semantic information encoded in written language; World Scientific; Advances In Complex Systems; 13; 2; 2-2010; 135-153
0219-5259
CONICET Digital
CONICET
url http://hdl.handle.net/11336/125163
dc.language.none.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv info:eu-repo/semantics/altIdentifier/arxiv/http://arxiv.org/abs/0907.1558
info:eu-repo/semantics/altIdentifier/url/https://www.worldscientific.com/doi/abs/10.1142/S0219525910002530
info:eu-repo/semantics/altIdentifier/doi/10.1142/S0219525910002530
dc.rights.none.fl_str_mv info:eu-repo/semantics/openAccess
https://creativecommons.org/licenses/by-nc-sa/2.5/ar/
eu_rights_str_mv openAccess
rights_invalid_str_mv https://creativecommons.org/licenses/by-nc-sa/2.5/ar/
dc.format.none.fl_str_mv application/pdf
application/pdf
application/pdf
dc.publisher.none.fl_str_mv World Scientific
publisher.none.fl_str_mv World Scientific
dc.source.none.fl_str_mv reponame:CONICET Digital (CONICET)
instname:Consejo Nacional de Investigaciones Científicas y Técnicas
reponame_str CONICET Digital (CONICET)
collection CONICET Digital (CONICET)
instname_str Consejo Nacional de Investigaciones Científicas y Técnicas
repository.name.fl_str_mv CONICET Digital (CONICET) - Consejo Nacional de Investigaciones Científicas y Técnicas
repository.mail.fl_str_mv dasensio@conicet.gov.ar; lcarlino@conicet.gov.ar