CLIRudit: Cross-Lingual Information Retrieval of Scientific Documents

Authors: Valentini, Francisco Tomás; Kozlowski, Diego; Lariviere, Vincent
Publication year: 2025
Language: English
Resource type: article
Status: published version
Description: Cross-lingual information retrieval (CLIR) consists of finding relevant documents in a language that differs from the language of the queries. This paper presents CLIRudit, a new dataset created to evaluate cross-lingual academic search, focusing on English queries and French documents. The dataset is built using bilingual article metadata from Érudit, a Canadian publishing platform, and is designed to represent scenarios in which researchers search for scholarly content in languages other than English. We perform a comprehensive benchmarking of different zero-shot first-stage retrieval methods on the dataset, including dense and sparse retrievers, query and document machine translation, and state-of-the-art multilingual retrievers. Our results show that large dense retrievers, not necessarily trained for the cross-lingual retrieval task, can achieve zero-shot performance comparable to using ground-truth human translations, without the need for machine translation. Sparse retrievers, such as BM25 or SPLADE, combined with document translation, show competitive results, providing an efficient alternative to large dense models. This research advances the understanding of cross-lingual academic information retrieval and provides a framework that others can use to build comparable datasets across different languages and disciplines. By making the dataset and code publicly available, we aim to facilitate further research that will help make scientific knowledge more accessible across language barriers.
Affiliation: Valentini, Francisco Tomás. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Ciudad Universitaria. Instituto de Investigación en Ciencias de la Computación. Universidad de Buenos Aires. Facultad de Ciencias Exactas y Naturales. Instituto de Investigación en Ciencias de la Computación; Argentina
Affiliation: Kozlowski, Diego. University of Montreal; Canada
Affiliation: Lariviere, Vincent. University of Montreal; Canada
Subject: CROSS-LINGUAL INFORMATION RETRIEVAL
ACADEMIC SEARCH
MULTILINGUAL EMBEDDINGS
MACHINE TRANSLATION
EVALUATION RESOURCES
Access level: open access
Terms of use: https://creativecommons.org/licenses/by-nc-sa/2.5/ar/
Repository
Institution: Consejo Nacional de Investigaciones Científicas y Técnicas
OAI identifier: oai:ri.conicet.gov.ar:11336/274494

purl_subject.fl_str_mv	https://purl.org/becyt/ford/1.2 https://purl.org/becyt/ford/1
publishDate	2025
dc.date.none.fl_str_mv	2025-04
dc.type.none.fl_str_mv	info:eu-repo/semantics/article info:eu-repo/semantics/publishedVersion http://purl.org/coar/resource_type/c_6501 info:ar-repo/semantics/articulo
format	article
status_str	publishedVersion
dc.identifier.none.fl_str_mv	http://hdl.handle.net/11336/274494 Valentini, Francisco Tomás; Kozlowski, Diego; Lariviere, Vincent; CLIRudit: Cross-Lingual Information Retrieval of Scientific Documents; Cornell University; arXiv.org; 4-2025; 1-12 2331-8422 CONICET Digital CONICET
url	http://hdl.handle.net/11336/274494
identifier_str_mv	Valentini, Francisco Tomás; Kozlowski, Diego; Lariviere, Vincent; CLIRudit: Cross-Lingual Information Retrieval of Scientific Documents; Cornell University; arXiv.org; 4-2025; 1-12 2331-8422 CONICET Digital CONICET
dc.language.none.fl_str_mv	eng
language	eng
dc.relation.none.fl_str_mv	info:eu-repo/semantics/altIdentifier/url/https://arxiv.org/abs/2504.16264 info:eu-repo/semantics/altIdentifier/doi/10.48550/arXiv.2504.16264
dc.rights.none.fl_str_mv	info:eu-repo/semantics/openAccess https://creativecommons.org/licenses/by-nc-sa/2.5/ar/
eu_rights_str_mv	openAccess
rights_invalid_str_mv	https://creativecommons.org/licenses/by-nc-sa/2.5/ar/
dc.format.none.fl_str_mv	application/pdf application/pdf
dc.publisher.none.fl_str_mv	Cornell University
publisher.none.fl_str_mv	Cornell University
dc.source.none.fl_str_mv	reponame:CONICET Digital (CONICET) instname:Consejo Nacional de Investigaciones Científicas y Técnicas
reponame_str	CONICET Digital (CONICET)
collection	CONICET Digital (CONICET)
instname_str	Consejo Nacional de Investigaciones Científicas y Técnicas
repository.name.fl_str_mv	CONICET Digital (CONICET) - Consejo Nacional de Investigaciones Científicas y Técnicas
repository.mail.fl_str_mv	dasensio@conicet.gov.ar; lcarlino@conicet.gov.ar