Information Approach to Co-occurrence of Words in Written Language

Authors
Hernández Lahme, Damián Gabriel
Publication year
2015
Language
English
Resource type
article
Status
published version
Description
In this paper we study the distribution of words across the different parts of a book using tools from information theory. In particular, the mutual information between words in the text and parts of the text is compared with the mutual information of a shuffled version of the book. This analysis allows us to extract not only relevant words of the text but also relationships between the different words, such as co-occurrence and repulsion between them. With the connections due to co-occurrence of words, we show how to construct a network that reflects the semantic organization of the book. This method can be applied to other types of sequences, measuring the relations between the different symbols that compose such sequences.
Affiliation: Hernández Lahme, Damián Gabriel. Comisión Nacional de Energía Atómica. Gerencia del Área de Energía Nuclear. Instituto Balseiro; Argentina. Comisión Nacional de Energía Atómica. Centro Atómico Bariloche; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina
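The comparison described in the abstract (information between words and parts of the text, measured against a shuffled version of the same text) can be illustrated with a minimal Python sketch. This is not the paper's own code or exact estimator: the function names, the split into equal-size parts, the per-word decomposition p(w) * KL(p(part|w) || p(part)), and the number of shuffles are all assumptions chosen for illustration.

```python
import math
import random
from collections import Counter


def part_counts(tokens, n_parts):
    """Count, for each word, its occurrences in each of n_parts equal-size parts."""
    counts = {}
    size = len(tokens) / n_parts
    for i, tok in enumerate(tokens):
        j = min(int(i / size), n_parts - 1)  # index of the part holding token i
        counts.setdefault(tok, [0] * n_parts)[j] += 1
    return counts


def word_information(tokens, n_parts):
    """Per-word term of I(word; part) in bits: p(w) * KL(p(part|w) || p(part))."""
    total = len(tokens)
    counts = part_counts(tokens, n_parts)
    part_totals = [0] * n_parts
    for c in counts.values():
        for j, n in enumerate(c):
            part_totals[j] += n
    info = {}
    for w, c in counts.items():
        n_w = sum(c)
        kl = 0.0
        for j, n in enumerate(c):
            if n:  # empty cells contribute nothing to the sum
                kl += (n / n_w) * math.log2((n / n_w) / (part_totals[j] / total))
        info[w] = (n_w / total) * kl
    return info


def excess_information(tokens, n_parts, n_shuffles=20, seed=0):
    """Information of each word minus its average over shuffled copies of the text.

    Assumption: the shuffled baseline is estimated by averaging n_shuffles
    random permutations of the token sequence.
    """
    rng = random.Random(seed)
    real = word_information(tokens, n_parts)
    baseline = Counter()
    shuffled = list(tokens)
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        for w, v in word_information(shuffled, n_parts).items():
            baseline[w] += v / n_shuffles
    return {w: real[w] - baseline[w] for w in real}


# Toy text: "whale" is confined to the first half, "the" is spread evenly.
toy_tokens = ["whale", "the"] * 50 + ["ship", "the"] * 50
toy_excess = excess_information(toy_tokens, n_parts=2)
```

In this toy text, a word confined to one part ("whale") keeps a large excess over the shuffled baseline, while an evenly spread word ("the") does not. Ranking words by this excess is one plausible way to surface the "relevant words" the abstract mentions.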
Subject
INFORMATION
CO-OCCURRENCE
WORDS
LANGUAGE
Access level
open access
Terms of use
https://creativecommons.org/licenses/by-nc-sa/2.5/ar/
Repository
CONICET Digital (CONICET)
Institution
Consejo Nacional de Investigaciones Científicas y Técnicas
OAI identifier
oai:ri.conicet.gov.ar:11336/54970

Subject classification
https://purl.org/becyt/ford/1.3
https://purl.org/becyt/ford/1
Publication date
2015-06
Handle
http://hdl.handle.net/11336/54970
Citation
Hernández Lahme, Damián Gabriel; Information Approach to Co-occurrence of Words in Written Language; Complex Systems Publications; Complex systems; 24; 2; 6-2015; 1-21
ISSN
0891-2513
Related URL
http://www.complex-systems.com/abstracts/v24_i02_a03/
Format
application/pdf
Publisher
Complex Systems Publications