Disjoint Semi-supervised Spanish Verb Sense Disambiguation Using Word Embeddings

Authors
Cardellino, Cristian; Alonso i Alemany, Laura
Year of publication
2017
Language
English
Resource type
conference paper
Status
published version
Description
This work explores the use of word embeddings, also known as word vectors, trained on Spanish corpora as features for Spanish verb sense disambiguation (VSD). This learning technique is known as disjoint semi-supervised learning [1]: an unsupervised algorithm is first trained on unlabeled data on its own, and its results (i.e. the word embeddings) are then fed to a supervised classifier. Throughout this paper we test two hypotheses: (i) representations of training instances based on word embeddings improve the performance of supervised models for VSD, in contrast to more standard feature engineering techniques based on information taken from the training data; (ii) using word embeddings trained on a specific domain, in this case the same domain the labeled data is gathered from, has a positive impact on the model's performance compared to general-domain word embeddings. The performance of a model is measured not only with standard metrics (e.g. accuracy or precision/recall) but also by analyzing the learning curve to assess the model's tendency to overfit the available data. Measuring this overfitting tendency is important because only a small amount of labeled data is available, so we need models that generalize better to the VSD problem. For the task we use SenSem [2], a corpus and lexicon of sense-disambiguated Spanish and Catalan verbs, as our base resource for experimentation.
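The pipeline described in the abstract (unsupervised embedding training on unlabeled text, followed by a supervised classifier over embedding-based features, evaluated with a learning curve) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy sentences, the mean-of-context-window featurization, and the logistic regression classifier are assumptions made here for the example, using gensim (>= 4.0) and scikit-learn.

```python
# Minimal sketch of disjoint semi-supervised VSD (illustrative only, not the
# authors' code). Assumes gensim >= 4.0 and scikit-learn; toy data stands in
# for an unlabeled Spanish corpus and for SenSem-style sense annotations.
import numpy as np
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Step 1 (unsupervised): train word embeddings on unlabeled Spanish sentences.
unlabeled_sentences = [
    ["el", "banco", "presta", "dinero"],
    ["ella", "se", "sienta", "en", "el", "banco"],
]  # in practice, a large unlabeled corpus (general or domain-specific)
emb = Word2Vec(unlabeled_sentences, vector_size=50, window=2, min_count=1)

def featurize(context_tokens):
    """Represent a verb occurrence as the mean embedding of its context window
    (one possible embedding-based representation, assumed for this sketch)."""
    vecs = [emb.wv[t] for t in context_tokens if t in emb.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(emb.vector_size)

# Step 2 (supervised): labeled verb occurrences (context window -> sense tag);
# toy stand-ins for SenSem annotations, repeated only to have enough samples.
labeled = [
    (["banco", "presta", "dinero"], "sense_1"),
    (["sienta", "en", "banco"], "sense_2"),
] * 10
X = np.vstack([featurize(ctx) for ctx, _ in labeled])
y = [sense for _, sense in labeled]

clf = LogisticRegression(max_iter=1000)

# Learning curve: train on growing fractions of the data and compare training
# vs. validation scores to gauge the tendency to overfit the small dataset.
sizes, train_scores, val_scores = learning_curve(
    clf, X, y, cv=3, train_sizes=np.linspace(0.5, 1.0, 3),
    shuffle=True, random_state=0)
print("train sizes:", sizes)
print("mean train accuracy:", train_scores.mean(axis=1))
print("mean validation accuracy:", val_scores.mean(axis=1))
```

Swapping the unlabeled corpus between a domain-specific and a general-domain collection is the variable that hypothesis (ii) in the abstract examines.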
Publisher
Sociedad Argentina de Informática e Investigación Operativa
Subject
Computer Science
word embeddings
disjoint semisupervised learning
verb sense disambiguation
Access level
open access
Terms of use
http://creativecommons.org/licenses/by-sa/4.0/
Repository
SEDICI (UNLP)
Institution
Universidad Nacional de La Plata
OAI Identifier
oai:sedici.unlp.edu.ar:10915/65941

Identifier
http://sedici.unlp.edu.ar/handle/10915/65941
Alternative identifier
http://www.clei2017-46jaiio.sadio.org.ar/sites/default/files/Mem/ASAI/asai-05.pdf
ISSN
2451-7585
Publication date
2017-09
Format
application/pdf
Pages
26-34
License
Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)