Learning the costs for a string edit distance-based similarity measure for abbreviated language

Autores: Alonso i Alemany, Laura
Año de publicación: 2010
Idioma: inglés
Tipo de recurso: documento de conferencia
Estado: versión publicada
Descripción: We present work in progress on word normalization for user-generated content. The approach is simple and helps in reducing the amount of manual annotation characteristic of more classical approaches. First, ortographic variants of a word, mostly abbreviations, are grouped together. From these manually grouped examples, we learn an automated classifier that, given a previously unseen word, determines whether it is an ortographic variant of a known word or an entirely new word. To do that, we calculate the similarity between the unseen word and all known words, and classify the new word as an ortographic variant of its most similar word. The classifier applies a string similarity measure based on the Levenshtein edit distance. To improve the accuracy of this measure, we assign edit operations an error-based cost. This scheme of cost assigning aims to maximize the distance between similar strings that are variants of different words. This custom similarity measure achieves an accuracy of .68, an important improvement if we compare it with the .54 obtained by the Levenshtein distance.
Sociedad Argentina de Informática e Investigación Operativa
Materia: Ciencias Informáticas
Natural Language Processing
String Edit Distances
Nivel de accesibilidad: acceso abierto
Condiciones de uso: http://creativecommons.org/licenses/by-nc-sa/4.0/
Repositorio
Institución: Universidad Nacional de La Plata
OAI Identificador: oai:sedici.unlp.edu.ar:10915/152590

Acceder

id	SEDICI_3e8f88470859ed2be9e97523b97f85b8
oai_identifier_str	oai:sedici.unlp.edu.ar:10915/152590
network_acronym_str	SEDICI
repository_id_str	1329
network_name_str	SEDICI (UNLP)
spelling	Learning the costs for a string edit distance-based similarity measure for abbreviated languageAlonso i Alemany, LauraCiencias InformáticasNatural Language ProcessingString Edit DistancesWe present work in progress on word normalization for user-generated content. The approach is simple and helps in reducing the amount of manual annotation characteristic of more classical approaches. First, ortographic variants of a word, mostly abbreviations, are grouped together. From these manually grouped examples, we learn an automated classifier that, given a previously unseen word, determines whether it is an ortographic variant of a known word or an entirely new word. To do that, we calculate the similarity between the unseen word and all known words, and classify the new word as an ortographic variant of its most similar word. The classifier applies a string similarity measure based on the Levenshtein edit distance. To improve the accuracy of this measure, we assign edit operations an error-based cost. This scheme of cost assigning aims to maximize the distance between similar strings that are variants of different words. This custom similarity measure achieves an accuracy of .68, an important improvement if we compare it with the .54 obtained by the Levenshtein distance.Sociedad Argentina de Informática e Investigación Operativa2010info:eu-repo/semantics/conferenceObjectinfo:eu-repo/semantics/publishedVersionObjeto de conferenciahttp://purl.org/coar/resource_type/c_5794info:ar-repo/semantics/documentoDeConferenciaapplication/pdf72-81http://sedici.unlp.edu.ar/handle/10915/152590enginfo:eu-repo/semantics/altIdentifier/url/http://39jaiio.sadio.org.ar/sites/default/files/39jaiio-asai-07.pdfinfo:eu-repo/semantics/altIdentifier/issn/1850-2784info:eu-repo/semantics/openAccesshttp://creativecommons.org/licenses/by-nc-sa/4.0/Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)reponame:SEDICI (UNLP)instname:Universidad Nacional de La Platainstacron:UNLP2026-02-26T11:26:31Zoai:sedici.unlp.edu.ar:10915/152590Institucionalhttp://sedici.unlp.edu.ar/Universidad públicaNo correspondehttp://sedici.unlp.edu.ar/oai/snrdalira@sedici.unlp.edu.arArgentinaNo correspondeNo correspondeNo correspondeopendoar:13292026-02-26 11:26:31.627SEDICI (UNLP) - Universidad Nacional de La Platafalse
dc.title.none.fl_str_mv	Learning the costs for a string edit distance-based similarity measure for abbreviated language
title	Learning the costs for a string edit distance-based similarity measure for abbreviated language
spellingShingle	Learning the costs for a string edit distance-based similarity measure for abbreviated language Alonso i Alemany, Laura Ciencias Informáticas Natural Language Processing String Edit Distances
title_short	Learning the costs for a string edit distance-based similarity measure for abbreviated language
title_full	Learning the costs for a string edit distance-based similarity measure for abbreviated language
title_fullStr	Learning the costs for a string edit distance-based similarity measure for abbreviated language
title_full_unstemmed	Learning the costs for a string edit distance-based similarity measure for abbreviated language
title_sort	Learning the costs for a string edit distance-based similarity measure for abbreviated language
dc.creator.none.fl_str_mv	Alonso i Alemany, Laura
author	Alonso i Alemany, Laura
author_facet	Alonso i Alemany, Laura
author_role	author
dc.subject.none.fl_str_mv	Ciencias Informáticas Natural Language Processing String Edit Distances
topic	Ciencias Informáticas Natural Language Processing String Edit Distances
dc.description.none.fl_txt_mv	We present work in progress on word normalization for user-generated content. The approach is simple and helps in reducing the amount of manual annotation characteristic of more classical approaches. First, ortographic variants of a word, mostly abbreviations, are grouped together. From these manually grouped examples, we learn an automated classifier that, given a previously unseen word, determines whether it is an ortographic variant of a known word or an entirely new word. To do that, we calculate the similarity between the unseen word and all known words, and classify the new word as an ortographic variant of its most similar word. The classifier applies a string similarity measure based on the Levenshtein edit distance. To improve the accuracy of this measure, we assign edit operations an error-based cost. This scheme of cost assigning aims to maximize the distance between similar strings that are variants of different words. This custom similarity measure achieves an accuracy of .68, an important improvement if we compare it with the .54 obtained by the Levenshtein distance. Sociedad Argentina de Informática e Investigación Operativa
description	We present work in progress on word normalization for user-generated content. The approach is simple and helps in reducing the amount of manual annotation characteristic of more classical approaches. First, ortographic variants of a word, mostly abbreviations, are grouped together. From these manually grouped examples, we learn an automated classifier that, given a previously unseen word, determines whether it is an ortographic variant of a known word or an entirely new word. To do that, we calculate the similarity between the unseen word and all known words, and classify the new word as an ortographic variant of its most similar word. The classifier applies a string similarity measure based on the Levenshtein edit distance. To improve the accuracy of this measure, we assign edit operations an error-based cost. This scheme of cost assigning aims to maximize the distance between similar strings that are variants of different words. This custom similarity measure achieves an accuracy of .68, an important improvement if we compare it with the .54 obtained by the Levenshtein distance.
publishDate	2010
dc.date.none.fl_str_mv	2010
dc.type.none.fl_str_mv	info:eu-repo/semantics/conferenceObject info:eu-repo/semantics/publishedVersion Objeto de conferencia http://purl.org/coar/resource_type/c_5794 info:ar-repo/semantics/documentoDeConferencia
format	conferenceObject
status_str	publishedVersion
dc.identifier.none.fl_str_mv	http://sedici.unlp.edu.ar/handle/10915/152590
url	http://sedici.unlp.edu.ar/handle/10915/152590
dc.language.none.fl_str_mv	eng
language	eng
dc.relation.none.fl_str_mv	info:eu-repo/semantics/altIdentifier/url/http://39jaiio.sadio.org.ar/sites/default/files/39jaiio-asai-07.pdf info:eu-repo/semantics/altIdentifier/issn/1850-2784
dc.rights.none.fl_str_mv	info:eu-repo/semantics/openAccess http://creativecommons.org/licenses/by-nc-sa/4.0/ Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
eu_rights_str_mv	openAccess
rights_invalid_str_mv	http://creativecommons.org/licenses/by-nc-sa/4.0/ Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)
dc.format.none.fl_str_mv	application/pdf 72-81
dc.source.none.fl_str_mv	reponame:SEDICI (UNLP) instname:Universidad Nacional de La Plata instacron:UNLP
reponame_str	SEDICI (UNLP)
collection	SEDICI (UNLP)
instname_str	Universidad Nacional de La Plata
instacron_str	UNLP
institution	UNLP
repository.name.fl_str_mv	SEDICI (UNLP) - Universidad Nacional de La Plata
repository.mail.fl_str_mv	alira@sedici.unlp.edu.ar
_version_	1858282340470489088
score	12.665996

Learning the costs for a string edit distance-based similarity measure for abbreviated language

Publicaciones similares