Web scraping by end users

Authors
Tacuri, Alex; Firmenich, Sergio; Fernández, Alejandro; Riva, María Florencia; Urbieta, Matías; Rossi, Gustavo Héctor
Publication year
2025
Language
English
Resource type
article
Status
published version
Description
Scraping has been studied from various perspectives, including automatic and AI-based approaches, and is supported by a wide range of programming libraries that expedite development. As the volume of available web content increases, it becomes increasingly challenging to anticipate end-user requirements regarding what, how, and when to extract data from the web. This challenge is compounded when integrating data from multiple websites, particularly when websites’ search engines dynamically retrieve unavailable data via permanent links. Complex scraping processes such as these are difficult to develop using general-purpose programming languages and are challenging to automate with AI-based approaches. Controllability is a crucial aspect of scraping: that is, how end users can make decisions during the scraper specification process, understand the information sources, and understand how the data are ultimately extracted, compiled, and formatted for output. In response, our study presents an innovative end-user approach for specifying scrapers that focuses on seamlessly integrating data from multiple sources. Through this approach and its supporting toolset, we aim to give users greater control and transparency over the extraction, integration, and formatting of data, thereby addressing key concerns in web scraping. An evaluation of the approach and toolset yielded promising results.
Subject
Computer and Information Sciences
Web mining
End-user computing
Human computer interaction
User centered design
Web scraping
Data integration
Scraper specification
Web data extraction
Access level
open access
Terms of use
http://creativecommons.org/licenses/by/4.0/
Repository
CIC Digital (CICBA)
Institution
Comisión de Investigaciones Científicas de la Provincia de Buenos Aires
OAI identifier
oai:digital.cic.gba.gob.ar:11746/12582

Publication date
2025-11-25
Handle
https://digital.cic.gba.gob.ar/handle/11746/12582
Related identifiers
ISSN: 2169-3536
DOI: 10.1109/access.2025.3636662
Format
application/pdf