Web scraping by end users

Authors: Tacuri, Alex; Firmenich, Sergio; Fernández, Alejandro; Riva, María Florencia; Urbieta, Matías; Rossi, Gustavo Héctor
Publication year: 2025
Language: English
Resource type: article
Status: published version
Description: Scraping is a topic studied from various perspectives, encompassing automatic and AI-based approaches and a wide range of programming libraries that expedite development. As the volume of available web content increases, it becomes increasingly challenging to anticipate end-user requirements regarding what, how, and when to extract data from the web. This challenge is compounded when integrating data from multiple websites, particularly when websites’ search engines dynamically retrieve unavailable data via permanent links. Complex scraping processes such as these are difficult to develop using general-purpose programming languages and challenging to automate with AI-based approaches. Controllability is a crucial aspect of scraping: that is, how end users can make decisions during the scraper specification process and understand both the information sources and how the data are ultimately extracted, compiled, and formatted for output. In response, our study presents an innovative end-user approach for specifying scrapers that focuses on seamlessly integrating data from multiple sources. Through this approach and its supporting toolset, we aim to provide users with greater control and transparency over the extraction, integration, and formatting of data, thereby addressing the key concerns in web scraping. The approach and toolset were evaluated and yielded promising results.
Subject: Computer and Information Sciences
Web mining
End-user computing
Human computer interaction
User centered design
Web scraping
Data integration
Scraper specification
Web data extraction
Access level: open access
Terms of use: http://creativecommons.org/licenses/by/4.0/
Repository
Institution: Comisión de Investigaciones Científicas de la Provincia de Buenos Aires
OAI identifier: oai:digital.cic.gba.gob.ar:11746/12582

id	CICBA_b69d6f79ee741643c018b51db4c82c2e
oai_identifier_str	oai:digital.cic.gba.gob.ar:11746/12582
network_acronym_str	CICBA
repository_id_str	9441
network_name_str	CIC Digital (CICBA)
dc.title.none.fl_str_mv	Web scraping by end users
dc.creator.none.fl_str_mv	Tacuri, Alex; Firmenich, Sergio; Fernández, Alejandro; Riva, María Florencia; Urbieta, Matías; Rossi, Gustavo Héctor
author	Tacuri, Alex
author_facet	Tacuri, Alex; Firmenich, Sergio; Fernández, Alejandro; Riva, María Florencia; Urbieta, Matías; Rossi, Gustavo Héctor
author_role	author
author2	Firmenich, Sergio; Fernández, Alejandro; Riva, María Florencia; Urbieta, Matías; Rossi, Gustavo Héctor
author2_role	author; author; author; author; author
dc.subject.none.fl_str_mv	Computer and Information Sciences; Web mining; End-user computing; Human computer interaction; User centered design; Web scraping; Data integration; Scraper specification; Web data extraction
topic	Computer and Information Sciences; Web mining; End-user computing; Human computer interaction; User centered design; Web scraping; Data integration; Scraper specification; Web data extraction
publishDate	2025
dc.date.none.fl_str_mv	2025-11-25
dc.type.none.fl_str_mv	info:eu-repo/semantics/article info:eu-repo/semantics/publishedVersion http://purl.org/coar/resource_type/c_6501 info:ar-repo/semantics/articulo
format	article
status_str	publishedVersion
dc.identifier.none.fl_str_mv	https://digital.cic.gba.gob.ar/handle/11746/12582
url	https://digital.cic.gba.gob.ar/handle/11746/12582
dc.language.none.fl_str_mv	eng
language	eng
dc.relation.none.fl_str_mv	info:eu-repo/semantics/altIdentifier/issn/2169-3536 info:eu-repo/semantics/altIdentifier/doi/10.1109/access.2025.3636662
dc.rights.none.fl_str_mv	info:eu-repo/semantics/openAccess http://creativecommons.org/licenses/by/4.0/
eu_rights_str_mv	openAccess
rights_invalid_str_mv	http://creativecommons.org/licenses/by/4.0/
dc.format.none.fl_str_mv	application/pdf
dc.source.none.fl_str_mv	reponame:CIC Digital (CICBA) instname:Comisión de Investigaciones Científicas de la Provincia de Buenos Aires instacron:CICBA
reponame_str	CIC Digital (CICBA)
collection	CIC Digital (CICBA)
instname_str	Comisión de Investigaciones Científicas de la Provincia de Buenos Aires
instacron_str	CICBA
institution	CICBA
repository.name.fl_str_mv	CIC Digital (CICBA) - Comisión de Investigaciones Científicas de la Provincia de Buenos Aires
repository.mail.fl_str_mv	marisa.degiusti@sedici.unlp.edu.ar
