Web scraping by end users
- Authors
- Tacuri, Alex; Firmenich, Sergio; Fernández, Alejandro; Riva, María Florencia; Urbieta, Matías; Rossi, Gustavo Héctor
- Publication year
- 2025
- Language
- English
- Resource type
- article
- Status
- published version
- Description
- Scraping is a topic studied from various perspectives, encompassing automatic and AI-based approaches and a wide range of programming libraries that expedite development. As the volume of available web content increases, it becomes increasingly challenging to anticipate end-user requirements regarding what, how, and when to extract data from the web. This challenge is compounded when integrating data from multiple websites, particularly when websites’ search engines dynamically retrieve unavailable data via permanent links. Complex scraping processes such as these are difficult to develop using general-purpose programming languages and are challenging to automate with AI-based approaches. Controllability is a crucial aspect of scraping, that is, how end users can make decisions during the scraper specification process and understand both the information sources and how the data are ultimately extracted, compiled, and formatted for output. In response, our study presents an innovative end-user approach for specifying scrapers that focuses on seamlessly integrating data from multiple sources. Through this approach and its supporting toolset, we aim to provide users with greater control and transparency over the extraction, integration, and formatting of data, thereby addressing the key concerns in web scraping. The approach and toolset were evaluated and yielded promising results.
- Subject
- Computer and Information Sciences; Web mining; End-user computing; Human computer interaction; User centered design; Web scraping; Data integration; Scraper specification; Web data extraction
- Access level
- open access
- Terms of use
- http://creativecommons.org/licenses/by/4.0/
- Repository
- CIC Digital (CICBA)
- Institution
- Comisión de Investigaciones Científicas de la Provincia de Buenos Aires
- OAI identifier
- oai:digital.cic.gba.gob.ar:11746/12582
View the full record metadata
| id | CICBA_b69d6f79ee741643c018b51db4c82c2e |
|---|---|
| oai_identifier_str | oai:digital.cic.gba.gob.ar:11746/12582 |
| network_acronym_str | CICBA |
| repository_id_str | 9441 |
| network_name_str | CIC Digital (CICBA) |
| dc.title.none.fl_str_mv | Web scraping by end users |
| title | Web scraping by end users |
| spellingShingle | Web scraping by end users; Tacuri, Alex; Computer and Information Sciences; Web mining; End-user computing; Human computer interaction; User centered design; Web scraping; Data integration; Scraper specification; Web data extraction |
| title_short | Web scraping by end users |
| title_full | Web scraping by end users |
| title_fullStr | Web scraping by end users |
| title_full_unstemmed | Web scraping by end users |
| title_sort | Web scraping by end users |
| dc.creator.none.fl_str_mv | Tacuri, Alex; Firmenich, Sergio; Fernández, Alejandro; Riva, María Florencia; Urbieta, Matías; Rossi, Gustavo Héctor |
| author | Tacuri, Alex |
| author_facet | Tacuri, Alex; Firmenich, Sergio; Fernández, Alejandro; Riva, María Florencia; Urbieta, Matías; Rossi, Gustavo Héctor |
| author_role | author |
| author2 | Firmenich, Sergio; Fernández, Alejandro; Riva, María Florencia; Urbieta, Matías; Rossi, Gustavo Héctor |
| author2_role | author; author; author; author; author |
| dc.subject.none.fl_str_mv | Computer and Information Sciences; Web mining; End-user computing; Human computer interaction; User centered design; Web scraping; Data integration; Scraper specification; Web data extraction |
| topic | Computer and Information Sciences; Web mining; End-user computing; Human computer interaction; User centered design; Web scraping; Data integration; Scraper specification; Web data extraction |
| dc.description.none.fl_txt_mv | Scraping is a topic studied from various perspectives, encompassing automatic and AI-based approaches and a wide range of programming libraries that expedite development. As the volume of available web content increases, it becomes increasingly challenging to anticipate end-user requirements regarding what, how, and when to extract data from the web. This challenge is compounded when integrating data from multiple websites, particularly when websites’ search engines dynamically retrieve unavailable data via permanent links. Complex scraping processes such as these are difficult to develop using general-purpose programming languages and are challenging to automate with AI-based approaches. Controllability is a crucial aspect of scraping, that is, how end users can make decisions during the scraper specification process and understand both the information sources and how the data are ultimately extracted, compiled, and formatted for output. In response, our study presents an innovative end-user approach for specifying scrapers that focuses on seamlessly integrating data from multiple sources. Through this approach and its supporting toolset, we aim to provide users with greater control and transparency over the extraction, integration, and formatting of data, thereby addressing the key concerns in web scraping. The approach and toolset were evaluated and yielded promising results. |
| publishDate | 2025 |
| dc.date.none.fl_str_mv | 2025-11-25 |
| dc.type.none.fl_str_mv | info:eu-repo/semantics/article; info:eu-repo/semantics/publishedVersion; http://purl.org/coar/resource_type/c_6501; info:ar-repo/semantics/articulo |
| format | article |
| status_str | publishedVersion |
| dc.identifier.none.fl_str_mv | https://digital.cic.gba.gob.ar/handle/11746/12582 |
| url | https://digital.cic.gba.gob.ar/handle/11746/12582 |
| dc.language.none.fl_str_mv | eng |
| language | eng |
| dc.relation.none.fl_str_mv | info:eu-repo/semantics/altIdentifier/issn/2169-3536; info:eu-repo/semantics/altIdentifier/doi/10.1109/access.2025.3636662 |
| dc.rights.none.fl_str_mv | info:eu-repo/semantics/openAccess; http://creativecommons.org/licenses/by/4.0/ |
| eu_rights_str_mv | openAccess |
| rights_invalid_str_mv | http://creativecommons.org/licenses/by/4.0/ |
| dc.format.none.fl_str_mv | application/pdf |
| dc.source.none.fl_str_mv | reponame:CIC Digital (CICBA); instname:Comisión de Investigaciones Científicas de la Provincia de Buenos Aires; instacron:CICBA |
| reponame_str | CIC Digital (CICBA) |
| collection | CIC Digital (CICBA) |
| instname_str | Comisión de Investigaciones Científicas de la Provincia de Buenos Aires |
| instacron_str | CICBA |
| institution | CICBA |
| repository.name.fl_str_mv | CIC Digital (CICBA) - Comisión de Investigaciones Científicas de la Provincia de Buenos Aires |
| repository.mail.fl_str_mv | marisa.degiusti@sedici.unlp.edu.ar |
| _version_ | 1851853400630624256 |
| score | 13.176297 |