Flexible Detection of Similar DOM elements
- Authors
- Grigera, Julián; Gardey, Juan Cruz; Rossi, Gustavo Héctor; Garrido, Alejandra
- Year of publication
- 2021
- Language
- English
- Resource type
- conference paper
- Status
- published version
- Description
- Several web-related research fields require detecting similarity between DOM elements. In information extraction, many approaches have emerged to extract structured data from web documents, most of which compare sample documents to infer their underlying structure. Other fields, such as web augmentation and transcoding, also need to analyze structural similarity, but on UI components whose structures are smaller than full documents, which makes the algorithms generally used in information extraction unsuitable for them. Instead, approaches in these fields tend to rely on the DOM elements’ location, which is not robust to structural changes in the document and cannot locate similar elements placed in different positions. In this paper we present two flexible algorithms that measure similarity between DOM elements using a mixed approach that considers both the elements’ location and their inner structure, together with a wrapper induction technique. We evaluated our algorithms against other known approaches in the literature by comparing how they cluster a dataset of 1200+ DOM elements, using a manual clustering as ground truth. Results show that both proposed algorithms outperform all the baselines. The proposed algorithms run in linear time, so they are faster than most approaches that analyze structural similarity.
- Subject
- Computer and Information Sciences; Information Extraction; Web Adaptation; DOM
- Access level
- open access
- Terms of use
- http://creativecommons.org/licenses/by-nc-sa/4.0/
- Repository
- CIC Digital (CICBA)
- Institution
- Comisión de Investigaciones Científicas de la Provincia de Buenos Aires
- OAI identifier
- oai:digital.cic.gba.gob.ar:11746/12098
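The description in this record mentions a mixed approach that weighs a DOM element's location against its inner structure. The sketch below illustrates that general idea only; it is not the authors' algorithms, which this record does not spell out. The function names, the equal weighting (`weight=0.5`), the prefix-based location score, and the Jaccard score over descendant tag paths are all assumptions chosen for illustration, and the wrapper induction step is not covered.

```python
import xml.etree.ElementTree as ET


def location_similarity(path_a, path_b):
    """Fraction of aligned steps (counted from the root) whose tag names match."""
    longest = max(len(path_a), len(path_b))
    if longest == 0:
        return 1.0
    matches = sum(1 for a, b in zip(path_a, path_b) if a == b)
    return matches / longest


def inner_paths(element, prefix=()):
    """Set of relative tag paths for every descendant of `element`."""
    paths = set()
    for child in element:
        path = prefix + (child.tag,)
        paths.add(path)
        paths |= inner_paths(child, path)
    return paths


def structure_similarity(elem_a, elem_b):
    """Jaccard index over the two elements' descendant tag paths."""
    a, b = inner_paths(elem_a), inner_paths(elem_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity(path_a, elem_a, path_b, elem_b, weight=0.5):
    """Weighted mix of location and inner-structure similarity (hypothetical weighting)."""
    return (weight * location_similarity(path_a, path_b)
            + (1 - weight) * structure_similarity(elem_a, elem_b))


# Toy document: two list items with the same layout, plus an unrelated <div>.
doc = ET.fromstring(
    "<body>"
    "<ul><li><a><img/></a><span/></li><li><a><img/></a><span/></li></ul>"
    "<div><p/></div>"
    "</body>"
)
first_item, second_item = doc.findall("./ul/li")
other = doc.find("./div")
li_path = ("body", "ul", "li")
div_path = ("body", "div")

same_kind = similarity(li_path, first_item, li_path, second_item)
different_kind = similarity(li_path, first_item, div_path, other)
print(same_kind, different_kind)
```

In this toy document the two list items share both their path and their inner layout, so they score higher than the list item compared with the `<div>`. A real implementation would also need to handle attributes, text nodes, and malformed HTML, which this sketch ignores.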
Full record metadata
Field | Value |
---|---|
id | CICBA_d4c03884dc928da8fe76576ec882da01 |
oai_identifier_str | oai:digital.cic.gba.gob.ar:11746/12098 |
network_acronym_str | CICBA |
repository_id_str | 9441 |
network_name_str | CIC Digital (CICBA) |
dc.title.none.fl_str_mv | Flexible Detection of Similar DOM elements |
title | Flexible Detection of Similar DOM elements |
title_short | Flexible Detection of Similar DOM elements |
title_full | Flexible Detection of Similar DOM elements |
title_fullStr | Flexible Detection of Similar DOM elements |
title_full_unstemmed | Flexible Detection of Similar DOM elements |
title_sort | Flexible Detection of Similar DOM elements |
dc.creator.none.fl_str_mv | Grigera, Julián; Gardey, Juan Cruz; Rossi, Gustavo Héctor; Garrido, Alejandra |
author | Grigera, Julián |
author_facet | Grigera, Julián; Gardey, Juan Cruz; Rossi, Gustavo Héctor; Garrido, Alejandra |
author_role | author |
author2 | Gardey, Juan Cruz; Rossi, Gustavo Héctor; Garrido, Alejandra |
author2_role | author; author; author |
dc.subject.none.fl_str_mv | Computer and Information Sciences; Information Extraction; Web Adaptation; DOM |
topic | Computer and Information Sciences; Information Extraction; Web Adaptation; DOM |
publishDate | 2021 |
dc.date.none.fl_str_mv | 2021 |
dc.type.none.fl_str_mv | info:eu-repo/semantics/conferenceObject; info:eu-repo/semantics/publishedVersion; http://purl.org/coar/resource_type/c_5794; info:ar-repo/semantics/documentoDeConferencia |
format | conferenceObject |
status_str | publishedVersion |
dc.identifier.none.fl_str_mv | https://digital.cic.gba.gob.ar/handle/11746/12098 |
url | https://digital.cic.gba.gob.ar/handle/11746/12098 |
dc.language.none.fl_str_mv | eng |
language | eng |
dc.relation.none.fl_str_mv | info:eu-repo/semantics/altIdentifier/doi/10.1007/978-3-031-24197-0_10; info:eu-repo/semantics/altIdentifier/isbn/978-3-031-24197-0 |
dc.rights.none.fl_str_mv | info:eu-repo/semantics/openAccess; http://creativecommons.org/licenses/by-nc-sa/4.0/ |
eu_rights_str_mv | openAccess |
rights_invalid_str_mv | http://creativecommons.org/licenses/by-nc-sa/4.0/ |
dc.format.none.fl_str_mv | application/pdf |
dc.source.none.fl_str_mv | reponame:CIC Digital (CICBA); instname:Comisión de Investigaciones Científicas de la Provincia de Buenos Aires; instacron:CICBA |
reponame_str | CIC Digital (CICBA) |
collection | CIC Digital (CICBA) |
instname_str | Comisión de Investigaciones Científicas de la Provincia de Buenos Aires |
instacron_str | CICBA |
institution | CICBA |
repository.name.fl_str_mv | CIC Digital (CICBA) - Comisión de Investigaciones Científicas de la Provincia de Buenos Aires |
repository.mail.fl_str_mv | marisa.degiusti@sedici.unlp.edu.ar |
_version_ | 1844618581427355648 |
score | 13.070432 |