CLIRudit: Cross-Lingual Information Retrieval of Scientific Documents

Authors
Valentini, Francisco Tomás; Kozlowski, Diego; Lariviere, Vincent
Publication year
2025
Language
English
Resource type
article
Status
published version
Description
Cross-lingual information retrieval (CLIR) consists of finding relevant documents in a language different from that of the queries. This paper presents CLIRudit, a new dataset created to evaluate cross-lingual academic search, focusing on English queries and French documents. The dataset is built using bilingual article metadata from Érudit, a Canadian publishing platform, and is designed to represent scenarios in which researchers search for scholarly content in languages other than English. We benchmark a range of zero-shot first-stage retrieval methods on the dataset, including dense and sparse retrievers, query and document machine translation, and state-of-the-art multilingual retrievers. Our results show that large dense retrievers, not necessarily trained for the cross-lingual retrieval task, can achieve zero-shot performance comparable to using ground-truth human translations, without the need for machine translation. Sparse retrievers, such as BM25 or SPLADE, combined with document translation, show competitive results, providing an efficient alternative to large dense models. This research advances the understanding of cross-lingual academic information retrieval and provides a framework that others can use to build comparable datasets across different languages and disciplines. By making the dataset and code publicly available, we aim to facilitate further research that will help make scientific knowledge more accessible across language barriers.
Affiliation: Valentini, Francisco Tomás. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Ciudad Universitaria. Instituto de Investigación en Ciencias de la Computación. Universidad de Buenos Aires. Facultad de Ciencias Exactas y Naturales. Instituto de Investigación en Ciencias de la Computación; Argentina
Affiliation: Kozlowski, Diego. University of Montreal; Canada
Affiliation: Lariviere, Vincent. University of Montreal; Canada
Subjects
CROSS-LINGUAL INFORMATION RETRIEVAL
ACADEMIC SEARCH
MULTILINGUAL EMBEDDINGS
MACHINE TRANSLATION
EVALUATION RESOURCES
Access level
open access
Terms of use
https://creativecommons.org/licenses/by-nc-sa/2.5/ar/
Repository
CONICET Digital (CONICET)
Institution
Consejo Nacional de Investigaciones Científicas y Técnicas
OAI identifier
oai:ri.conicet.gov.ar:11336/274494

id CONICETDig_23c04c358f34dfa0616ec9258ab1163b
oai_identifier_str oai:ri.conicet.gov.ar:11336/274494
network_acronym_str CONICETDig
repository_id_str 3498
network_name_str CONICET Digital (CONICET)
dc.title.none.fl_str_mv CLIRudit: Cross-Lingual Information Retrieval of Scientific Documents
dc.creator.none.fl_str_mv Valentini, Francisco Tomás
Kozlowski, Diego
Lariviere, Vincent
dc.subject.none.fl_str_mv CROSS-LINGUAL INFORMATION RETRIEVAL
ACADEMIC SEARCH
MULTILINGUAL EMBEDDINGS
MACHINE TRANSLATION
EVALUATION RESOURCES
purl_subject.fl_str_mv https://purl.org/becyt/ford/1.2
https://purl.org/becyt/ford/1
publishDate 2025
dc.date.none.fl_str_mv 2025-04
dc.type.none.fl_str_mv info:eu-repo/semantics/article
info:eu-repo/semantics/publishedVersion
http://purl.org/coar/resource_type/c_6501
info:ar-repo/semantics/articulo
format article
status_str publishedVersion
dc.identifier.none.fl_str_mv http://hdl.handle.net/11336/274494
Valentini, Francisco Tomás; Kozlowski, Diego; Lariviere, Vincent; CLIRudit: Cross-Lingual Information Retrieval of Scientific Documents; Cornell University; arXiv.org; 4-2025; 1-12
2331-8422
CONICET Digital
CONICET
dc.language.none.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv info:eu-repo/semantics/altIdentifier/url/https://arxiv.org/abs/2504.16264
info:eu-repo/semantics/altIdentifier/doi/10.48550/arXiv.2504.16264
dc.rights.none.fl_str_mv info:eu-repo/semantics/openAccess
https://creativecommons.org/licenses/by-nc-sa/2.5/ar/
dc.format.none.fl_str_mv application/pdf
application/pdf
dc.publisher.none.fl_str_mv Cornell University
publisher.none.fl_str_mv Cornell University
dc.source.none.fl_str_mv reponame:CONICET Digital (CONICET)
instname:Consejo Nacional de Investigaciones Científicas y Técnicas
repository.name.fl_str_mv CONICET Digital (CONICET) - Consejo Nacional de Investigaciones Científicas y Técnicas
repository.mail.fl_str_mv dasensio@conicet.gov.ar; lcarlino@conicet.gov.ar