The Dependence on Frequency of Word Embedding Similarity Measures

Authors
Valentini, Francisco Tomás; Fernandez Slezak, Diego; Altszyler Lemcovich, Edgar Jaim
Publication year
2022
Language
English
Resource type
article
Status
published version
Description
Recent research has shown that static word embeddings can encode word frequency information. However, little has been studied about this phenomenon and its effects on downstream tasks. In the present work, we systematically study the association between frequency and semantic similarity in several static word embeddings. We find that Skip-gram, GloVe and FastText embeddings tend to produce higher semantic similarity between high-frequency words than between other frequency combinations. We show that the association between frequency and similarity also appears when words are randomly shuffled. This proves that the patterns found are not due to real semantic associations present in the texts, but are an artifact produced by the word embeddings. Finally, we provide an example of how word frequency can strongly impact the measurement of gender bias with embedding-based metrics. In particular, we carry out a controlled experiment that shows that biases can even change sign or reverse their order by manipulating word frequencies.
Affiliation: Valentini, Francisco Tomás. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Ciudad Universitaria. Instituto de Investigación en Ciencias de la Computación. Universidad de Buenos Aires. Facultad de Ciencias Exactas y Naturales. Instituto de Investigación en Ciencias de la Computación; Argentina
Affiliation: Fernandez Slezak, Diego. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Ciudad Universitaria. Instituto de Investigación en Ciencias de la Computación. Universidad de Buenos Aires. Facultad de Ciencias Exactas y Naturales. Instituto de Investigación en Ciencias de la Computación; Argentina
Affiliation: Altszyler Lemcovich, Edgar Jaim. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Ciudad Universitaria. Instituto de Investigación en Ciencias de la Computación. Universidad de Buenos Aires. Facultad de Ciencias Exactas y Naturales. Instituto de Investigación en Ciencias de la Computación; Argentina
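The description mentions three ingredients: similarity between embeddings, a control in which the words of the corpus are randomly shuffled, and an embedding-based bias metric. The following is a minimal sketch of those ingredients, not the authors' implementation. The function names, the use of cosine similarity, and the difference-of-means bias score are assumptions made for illustration; the record does not specify which metrics the paper uses.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def shuffle_corpus(tokens, seed=0):
    """Return a randomly reordered copy of the token list.

    Shuffling keeps each word's frequency unchanged but destroys the
    co-occurrence structure, so embeddings trained on the result cannot
    reflect real semantic associations.
    """
    rng = random.Random(seed)
    shuffled = list(tokens)
    rng.shuffle(shuffled)
    return shuffled


def similarity_bias(word_vec, group_a_vecs, group_b_vecs):
    """Hypothetical bias score: mean similarity to group A minus group B."""
    sim_a = sum(cosine_similarity(word_vec, a) for a in group_a_vecs) / len(group_a_vecs)
    sim_b = sum(cosine_similarity(word_vec, b) for b in group_b_vecs) / len(group_b_vecs)
    return sim_a - sim_b
```

In a setup like this, one would train embeddings on both the original and the shuffled corpus and compare similarities across frequency bins; any pattern that survives shuffling cannot come from the meaning of the text.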
Subject
NATURAL LANGUAGE PROCESSING
WORD EMBEDDINGS
WORD FREQUENCY
WORD SIMILARITY
Access level
open access
Terms of use
https://creativecommons.org/licenses/by-nc-sa/2.5/ar/
Repository
CONICET Digital (CONICET)
Institution
Consejo Nacional de Investigaciones Científicas y Técnicas
OAI identifier
oai:ri.conicet.gov.ar:11336/218015

purl_subject.fl_str_mv https://purl.org/becyt/ford/1.2
https://purl.org/becyt/ford/1
publishDate 2022
dc.date.none.fl_str_mv 2022-11
dc.type.none.fl_str_mv info:eu-repo/semantics/article
info:eu-repo/semantics/publishedVersion
http://purl.org/coar/resource_type/c_6501
info:ar-repo/semantics/articulo
format article
status_str publishedVersion
dc.identifier.none.fl_str_mv http://hdl.handle.net/11336/218015
Valentini, Francisco Tomás; Fernandez Slezak, Diego; Altszyler Lemcovich, Edgar Jaim; The Dependence on Frequency of Word Embedding Similarity Measures; Cornell University; ArXiv.org; 11-2022; 1-10
2331-8422
CONICET Digital
CONICET
url http://hdl.handle.net/11336/218015
dc.language.none.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv info:eu-repo/semantics/altIdentifier/doi/10.48550/arXiv.2211.08203
info:eu-repo/semantics/altIdentifier/url/https://arxiv.org/abs/2211.08203
dc.rights.none.fl_str_mv info:eu-repo/semantics/openAccess
https://creativecommons.org/licenses/by-nc-sa/2.5/ar/
dc.format.none.fl_str_mv application/pdf
application/pdf
dc.publisher.none.fl_str_mv Cornell University
dc.source.none.fl_str_mv reponame:CONICET Digital (CONICET)
instname:Consejo Nacional de Investigaciones Científicas y Técnicas
repository.mail.fl_str_mv dasensio@conicet.gov.ar; lcarlino@conicet.gov.ar