Negative Impact of Bots on Digital Repository Usage Statistics: A Case Analysis and Applied Strategy
- Authors
- Bértoli, Rafael; Lira, Ariel Jorge
- Publication year
- 2025
- Language
- English
- Resource type
- conference paper
- Status
- published version
- Description
- Bot detection has been a necessity for web services since the internet became widely adopted. Bots crawl the information available on the web for various purposes, such as building search engines, SEO analysis, or, more recently, training artificial intelligence models. Repositories are particularly attractive to these agents because they offer high-quality, controlled information described by curated metadata. Whatever their purpose, multiple bots continuously harvest the repository, which causes occasional downtime and distorts the usage logs used to generate statistical reports; those reports are meant to measure the real impact of the preserved and published works. This paper presents the experience of the SEDICI repository in analyzing and cleaning 13 years' worth of collected logs. Bots are first detected by analyzing their behavior, such as excessive usage, anomalous usage patterns, scans, and attacks. An AI model is then used to flag bots whose subtler behavior was not caught by that analysis. Applying these strategies made it possible to eliminate over 50 million usage records originating from atypical bots that systematically and recurrently access the repository.
English translation of "Impacto negativo de bots en estadísticas de uso de repositorios digitales: análisis de un caso y estrategia aplicada" (see "Related documents").
Dirección PREBI-SEDICI
- Subject
- Computer science; bots; machine learning; repositories; usage statistics
- Access level
- open access
- Terms of use
- http://creativecommons.org/licenses/by-nc-sa/4.0/
- Repository
- SEDICI (UNLP)
- Institution
- Universidad Nacional de La Plata
- OAI identifier
- oai:sedici.unlp.edu.ar:10915/188909
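The description above names "excessive usage" as one of the behaviors used to detect bots. The record does not say how the authors measure it, so the following is only a minimal sketch of one common approach: a per-client sliding-window request counter. The function names, the `(client_id, timestamp)` event shape, and the default threshold and window are all assumptions made for illustration, not details taken from the paper.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


def flag_excessive_clients(events, max_requests=100, window=timedelta(minutes=1)):
    """Return the clients that made more than max_requests within any window.

    events is an iterable of (client_id, timestamp) pairs. This shape, the
    threshold, and the window length are hypothetical illustration values.
    """
    recent = defaultdict(deque)  # client_id -> timestamps still inside the window
    flagged = set()
    for client, ts in sorted(events, key=lambda e: e[1]):
        queue = recent[client]
        queue.append(ts)
        # Discard timestamps that have fallen out of the sliding window.
        while ts - queue[0] >= window:
            queue.popleft()
        if len(queue) > max_requests:
            flagged.add(client)
    return flagged


def drop_flagged(events, flagged):
    """Keep only the usage events whose client was not flagged."""
    return [event for event in events if event[0] not in flagged]
```

A counter like this only catches the loudest clients. Per the description, bots with subtler behavior are handled in a separate step by an AI model, which this sketch does not attempt to reproduce.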
Full record metadata
| id | SEDICI_aebc191633426691f0e5933f4740acd7 |
|---|---|
| oai_identifier_str | oai:sedici.unlp.edu.ar:10915/188909 |
| network_acronym_str | SEDICI |
| repository_id_str | 1329 |
| network_name_str | SEDICI (UNLP) |
| spelling | Negative Impact of Bots on Digital Repository Usage Statistics: A Case Analysis and Applied Strategy; Bértoli, Rafael; Lira, Ariel Jorge; Informática; bots; machine learning; repositories; usage statistics; Bot detection in web services has been a necessity since the massification of the internet. These agents crawl the information available on the web for various purposes, such as search engine development, SEO analysis, or, recently, for training artificial intelligence models. Repositories are particularly interesting to these agents because they offer high-quality controlled information described by curated metadata. Regardless of its objective, the repository is continuously being harvested by multiple bots, which causes occasional downtime and alterations in the usage logs used for generating statistical reports, which serve to measure the real impact of the preserved and published works. This paper presents the experience of the SEDICI repository in analyzing and cleaning 13 years' worth of collected logs. Detection is performed by analyzing bot behavior, such as excessive usage, anomalous usage patterns, scans, and attacks. Then, an AI model is used to flag bots with subtle behavior not otherwise identified as such. By applying these strategies, it was possible to eliminate over 50 million usage records originating from atypical bots that systematically and recurrently access the repository. Traducción al inglés de "Impacto negativo de bots en estadísticas de uso de repositorios digitales: análisis de un caso y estrategia aplicada" (ver "Documentos relacionados"). Dirección PREBI-SEDICI; 2025-10-09; info:eu-repo/semantics/conferenceObject; info:eu-repo/semantics/publishedVersion; Objeto de conferencia; http://purl.org/coar/resource_type/c_5794; info:ar-repo/semantics/documentoDeConferencia; application/pdf; http://sedici.unlp.edu.ar/handle/10915/188909; eng; info:eu-repo/semantics/reference/url/https://sedici.unlp.edu.ar/handle/10915/181804; info:eu-repo/semantics/openAccess; http://creativecommons.org/licenses/by-nc-sa/4.0/; Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0); reponame:SEDICI (UNLP); instname:Universidad Nacional de La Plata; instacron:UNLP; 2025-12-23T11:54:08Z; oai:sedici.unlp.edu.ar:10915/188909; Institucional; http://sedici.unlp.edu.ar/; Universidad pública; No corresponde; http://sedici.unlp.edu.ar/oai/snrd; alira@sedici.unlp.edu.ar; Argentina; No corresponde; No corresponde; No corresponde; opendoar:1329; 2025-12-23 11:54:09.008; SEDICI (UNLP) - Universidad Nacional de La Plata; false |
| dc.title.none.fl_str_mv | Negative Impact of Bots on Digital Repository Usage Statistics: A Case Analysis and Applied Strategy |
| title | Negative Impact of Bots on Digital Repository Usage Statistics: A Case Analysis and Applied Strategy |
| spellingShingle | Negative Impact of Bots on Digital Repository Usage Statistics: A Case Analysis and Applied Strategy; Bértoli, Rafael; Informática; bots; machine learning; repositories; usage statistics |
| title_short | Negative Impact of Bots on Digital Repository Usage Statistics: A Case Analysis and Applied Strategy |
| title_full | Negative Impact of Bots on Digital Repository Usage Statistics: A Case Analysis and Applied Strategy |
| title_fullStr | Negative Impact of Bots on Digital Repository Usage Statistics: A Case Analysis and Applied Strategy |
| title_full_unstemmed | Negative Impact of Bots on Digital Repository Usage Statistics: A Case Analysis and Applied Strategy |
| title_sort | Negative Impact of Bots on Digital Repository Usage Statistics: A Case Analysis and Applied Strategy |
| dc.creator.none.fl_str_mv | Bértoli, Rafael; Lira, Ariel Jorge |
| author | Bértoli, Rafael |
| author_facet | Bértoli, Rafael; Lira, Ariel Jorge |
| author_role | author |
| author2 | Lira, Ariel Jorge |
| author2_role | author |
| dc.subject.none.fl_str_mv | Informática; bots; machine learning; repositories; usage statistics |
| topic | Informática; bots; machine learning; repositories; usage statistics |
| dc.description.none.fl_txt_mv | Bot detection in web services has been a necessity since the massification of the internet. These agents crawl the information available on the web for various purposes, such as search engine development, SEO analysis, or, recently, for training artificial intelligence models. Repositories are particularly interesting to these agents because they offer high-quality controlled information described by curated metadata. Regardless of its objective, the repository is continuously being harvested by multiple bots, which causes occasional downtime and alterations in the usage logs used for generating statistical reports, which serve to measure the real impact of the preserved and published works. This paper presents the experience of the SEDICI repository in analyzing and cleaning 13 years' worth of collected logs. Detection is performed by analyzing bot behavior, such as excessive usage, anomalous usage patterns, scans, and attacks. Then, an AI model is used to flag bots with subtle behavior not otherwise identified as such. By applying these strategies, it was possible to eliminate over 50 million usage records originating from atypical bots that systematically and recurrently access the repository. Traducción al inglés de "Impacto negativo de bots en estadísticas de uso de repositorios digitales: análisis de un caso y estrategia aplicada" (ver "Documentos relacionados"). Dirección PREBI-SEDICI |
| description | Bot detection in web services has been a necessity since the massification of the internet. These agents crawl the information available on the web for various purposes, such as search engine development, SEO analysis, or, recently, for training artificial intelligence models. Repositories are particularly interesting to these agents because they offer high-quality controlled information described by curated metadata. Regardless of its objective, the repository is continuously being harvested by multiple bots, which causes occasional downtime and alterations in the usage logs used for generating statistical reports, which serve to measure the real impact of the preserved and published works. This paper presents the experience of the SEDICI repository in analyzing and cleaning 13 years' worth of collected logs. Detection is performed by analyzing bot behavior, such as excessive usage, anomalous usage patterns, scans, and attacks. Then, an AI model is used to flag bots with subtle behavior not otherwise identified as such. By applying these strategies, it was possible to eliminate over 50 million usage records originating from atypical bots that systematically and recurrently access the repository. |
| publishDate | 2025 |
| dc.date.none.fl_str_mv | 2025-10-09 |
| dc.type.none.fl_str_mv | info:eu-repo/semantics/conferenceObject; info:eu-repo/semantics/publishedVersion; Objeto de conferencia; http://purl.org/coar/resource_type/c_5794; info:ar-repo/semantics/documentoDeConferencia |
| format | conferenceObject |
| status_str | publishedVersion |
| dc.identifier.none.fl_str_mv | http://sedici.unlp.edu.ar/handle/10915/188909 |
| url | http://sedici.unlp.edu.ar/handle/10915/188909 |
| dc.language.none.fl_str_mv | eng |
| language | eng |
| dc.relation.none.fl_str_mv | info:eu-repo/semantics/reference/url/https://sedici.unlp.edu.ar/handle/10915/181804 |
| dc.rights.none.fl_str_mv | info:eu-repo/semantics/openAccess; http://creativecommons.org/licenses/by-nc-sa/4.0/; Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) |
| eu_rights_str_mv | openAccess |
| rights_invalid_str_mv | http://creativecommons.org/licenses/by-nc-sa/4.0/; Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) |
| dc.format.none.fl_str_mv | application/pdf |
| dc.source.none.fl_str_mv | reponame:SEDICI (UNLP); instname:Universidad Nacional de La Plata; instacron:UNLP |
| reponame_str | SEDICI (UNLP) |
| collection | SEDICI (UNLP) |
| instname_str | Universidad Nacional de La Plata |
| instacron_str | UNLP |
| institution | UNLP |
| repository.name.fl_str_mv | SEDICI (UNLP) - Universidad Nacional de La Plata |
| repository.mail.fl_str_mv | alira@sedici.unlp.edu.ar |
| _version_ | 1852334852161929216 |
| score | 12.952241 |