CLIRudit: Cross-Lingual Information Retrieval of Scientific Documents
- Authors
- Valentini, Francisco Tomás; Kozlowski, Diego; Lariviere, Vincent
- Publication year
- 2025
- Language
- English
- Resource type
- article
- Status
- published version
- Description
- Cross-lingual information retrieval (CLIR) consists in finding relevant documents in a language that differs from the language of the queries. This paper presents CLIRudit, a new dataset created to evaluate cross-lingual academic search, focusing on English queries and French documents. The dataset is built using bilingual article metadata from Érudit, a Canadian publishing platform, and is designed to represent scenarios in which researchers search for scholarly content in languages other than English. We perform a comprehensive benchmarking of different zero-shot first-stage retrieval methods on the dataset, including dense and sparse retrievers, query and document machine translation, and state-of-the-art multilingual retrievers. Our results show that large dense retrievers, not necessarily trained for the cross-lingual retrieval task, can achieve zero-shot performance comparable to using ground truth human translations, without the need for machine translation. Sparse retrievers, such as BM25 or SPLADE, combined with document translation, show competitive results, providing an efficient alternative to large dense models. This research advances the understanding of cross-lingual academic information retrieval and provides a framework that others can use to build comparable datasets across different languages and disciplines. By making the dataset and code publicly available, we aim to facilitate further research that will help make scientific knowledge more accessible across language barriers.
Affiliation: Valentini, Francisco Tomás. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Ciudad Universitaria. Instituto de Investigación en Ciencias de la Computación. Universidad de Buenos Aires. Facultad de Ciencias Exactas y Naturales. Instituto de Investigación en Ciencias de la Computación; Argentina
Affiliation: Kozlowski, Diego. University of Montreal; Canada
Affiliation: Lariviere, Vincent. University of Montreal; Canada
- Subject
- CROSS-LINGUAL INFORMATION RETRIEVAL
ACADEMIC SEARCH
MULTILINGUAL EMBEDDINGS
MACHINE TRANSLATION
EVALUATION RESOURCES
- Access level
- open access
- Terms of use
- https://creativecommons.org/licenses/by-nc-sa/2.5/ar/
- Repository
- CONICET Digital (CONICET)
- Institution
- Consejo Nacional de Investigaciones Científicas y Técnicas
- OAI identifier
- oai:ri.conicet.gov.ar:11336/274494
| id |
CONICETDig_23c04c358f34dfa0616ec9258ab1163b |
|---|---|
| oai_identifier_str |
oai:ri.conicet.gov.ar:11336/274494 |
| network_acronym_str |
CONICETDig |
| repository_id_str |
3498 |
| network_name_str |
CONICET Digital (CONICET) |
| dc.title.none.fl_str_mv |
CLIRudit: Cross-Lingual Information Retrieval of Scientific Documents |
| title |
CLIRudit: Cross-Lingual Information Retrieval of Scientific Documents |
| spellingShingle |
CLIRudit: Cross-Lingual Information Retrieval of Scientific Documents Valentini, Francisco Tomás CROSS-LINGUAL INFORMATION RETRIEVAL ACADEMIC SEARCH MULTILINGUAL EMBEDDINGS MACHINE TRANSLATION EVALUATION RESOURCES |
| title_short |
CLIRudit: Cross-Lingual Information Retrieval of Scientific Documents |
| title_full |
CLIRudit: Cross-Lingual Information Retrieval of Scientific Documents |
| title_fullStr |
CLIRudit: Cross-Lingual Information Retrieval of Scientific Documents |
| title_full_unstemmed |
CLIRudit: Cross-Lingual Information Retrieval of Scientific Documents |
| title_sort |
CLIRudit: Cross-Lingual Information Retrieval of Scientific Documents |
| dc.creator.none.fl_str_mv |
Valentini, Francisco Tomás Kozlowski, Diego Lariviere, Vincent |
| author |
Valentini, Francisco Tomás |
| author_facet |
Valentini, Francisco Tomás Kozlowski, Diego Lariviere, Vincent |
| author_role |
author |
| author2 |
Kozlowski, Diego Lariviere, Vincent |
| author2_role |
author author |
| dc.subject.none.fl_str_mv |
CROSS-LINGUAL INFORMATION RETRIEVAL ACADEMIC SEARCH MULTILINGUAL EMBEDDINGS MACHINE TRANSLATION EVALUATION RESOURCES |
| topic |
CROSS-LINGUAL INFORMATION RETRIEVAL ACADEMIC SEARCH MULTILINGUAL EMBEDDINGS MACHINE TRANSLATION EVALUATION RESOURCES |
| purl_subject.fl_str_mv |
https://purl.org/becyt/ford/1.2 https://purl.org/becyt/ford/1 |
| dc.description.none.fl_txt_mv |
Cross-lingual information retrieval (CLIR) consists in finding relevant documents in a language that differs from the language of the queries. This paper presents CLIRudit, a new dataset created to evaluate cross-lingual academic search, focusing on English queries and French documents. The dataset is built using bilingual article metadata from Érudit, a Canadian publishing platform, and is designed to represent scenarios in which researchers search for scholarly content in languages other than English. We perform a comprehensive benchmarking of different zero-shot first-stage retrieval methods on the dataset, including dense and sparse retrievers, query and document machine translation, and state-of-the-art multilingual retrievers. Our results show that large dense retrievers, not necessarily trained for the cross-lingual retrieval task, can achieve zero-shot performance comparable to using ground truth human translations, without the need for machine translation. Sparse retrievers, such as BM25 or SPLADE, combined with document translation, show competitive results, providing an efficient alternative to large dense models. This research advances the understanding of cross-lingual academic information retrieval and provides a framework that others can use to build comparable datasets across different languages and disciplines. By making the dataset and code publicly available, we aim to facilitate further research that will help make scientific knowledge more accessible across language barriers. Fil: Valentini, Francisco Tomás. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Ciudad Universitaria. Instituto de Investigación en Ciencias de la Computación. Universidad de Buenos Aires. Facultad de Ciencias Exactas y Naturales. Instituto de Investigación en Ciencias de la Computación; Argentina Fil: Kozlowski, Diego. University of Montreal; Canadá Fil: Lariviere, Vincent. University of Montreal; Canadá |
| description |
Cross-lingual information retrieval (CLIR) consists in finding relevant documents in a language that differs from the language of the queries. This paper presents CLIRudit, a new dataset created to evaluate cross-lingual academic search, focusing on English queries and French documents. The dataset is built using bilingual article metadata from Érudit, a Canadian publishing platform, and is designed to represent scenarios in which researchers search for scholarly content in languages other than English. We perform a comprehensive benchmarking of different zero-shot first-stage retrieval methods on the dataset, including dense and sparse retrievers, query and document machine translation, and state-of-the-art multilingual retrievers. Our results show that large dense retrievers, not necessarily trained for the cross-lingual retrieval task, can achieve zero-shot performance comparable to using ground truth human translations, without the need for machine translation. Sparse retrievers, such as BM25 or SPLADE, combined with document translation, show competitive results, providing an efficient alternative to large dense models. This research advances the understanding of cross-lingual academic information retrieval and provides a framework that others can use to build comparable datasets across different languages and disciplines. By making the dataset and code publicly available, we aim to facilitate further research that will help make scientific knowledge more accessible across language barriers. |
| publishDate |
2025 |
| dc.date.none.fl_str_mv |
2025-04 |
| dc.type.none.fl_str_mv |
info:eu-repo/semantics/article info:eu-repo/semantics/publishedVersion http://purl.org/coar/resource_type/c_6501 info:ar-repo/semantics/articulo |
| format |
article |
| status_str |
publishedVersion |
| dc.identifier.none.fl_str_mv |
http://hdl.handle.net/11336/274494 Valentini, Francisco Tomás; Kozlowski, Diego; Lariviere, Vincent; CLIRudit: Cross-Lingual Information Retrieval of Scientific Documents; Cornell University; arXiv.org; 4-2025; 1-12 2331-8422 CONICET Digital CONICET |
| url |
http://hdl.handle.net/11336/274494 |
| identifier_str_mv |
Valentini, Francisco Tomás; Kozlowski, Diego; Lariviere, Vincent; CLIRudit: Cross-Lingual Information Retrieval of Scientific Documents; Cornell University; arXiv.org; 4-2025; 1-12 2331-8422 CONICET Digital CONICET |
| dc.language.none.fl_str_mv |
eng |
| language |
eng |
| dc.relation.none.fl_str_mv |
info:eu-repo/semantics/altIdentifier/url/https://arxiv.org/abs/2504.16264 info:eu-repo/semantics/altIdentifier/doi/10.48550/arXiv.2504.16264 |
| dc.rights.none.fl_str_mv |
info:eu-repo/semantics/openAccess https://creativecommons.org/licenses/by-nc-sa/2.5/ar/ |
| eu_rights_str_mv |
openAccess |
| rights_invalid_str_mv |
https://creativecommons.org/licenses/by-nc-sa/2.5/ar/ |
| dc.format.none.fl_str_mv |
application/pdf application/pdf |
| dc.publisher.none.fl_str_mv |
Cornell University |
| publisher.none.fl_str_mv |
Cornell University |
| dc.source.none.fl_str_mv |
reponame:CONICET Digital (CONICET) instname:Consejo Nacional de Investigaciones Científicas y Técnicas |
| reponame_str |
CONICET Digital (CONICET) |
| collection |
CONICET Digital (CONICET) |
| instname_str |
Consejo Nacional de Investigaciones Científicas y Técnicas |
| repository.name.fl_str_mv |
CONICET Digital (CONICET) - Consejo Nacional de Investigaciones Científicas y Técnicas |
| repository.mail.fl_str_mv |
dasensio@conicet.gov.ar; lcarlino@conicet.gov.ar |
| _version_ |
1848597480409661440 |
| score |
12.976206 |
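
The `dc.relation.none.fl_str_mv` field above packs the article's alternate identifiers into `info:eu-repo/semantics/altIdentifier/<scheme>/<value>` tokens separated by spaces. As a minimal sketch of how such a value could be unpacked, the snippet below splits the field into a scheme-to-value mapping. The function name `parse_alt_identifiers` is hypothetical and not part of any repository software; the input string is copied from the record, and the token layout is assumed from what the record shows rather than taken from a specification.

```python
# Hypothetical helper (not part of any repository software): unpack
# "info:eu-repo/semantics/altIdentifier/<scheme>/<value>" tokens.
PREFIX = "info:eu-repo/semantics/altIdentifier/"


def parse_alt_identifiers(field_value: str) -> dict[str, str]:
    """Map each identifier scheme (e.g. 'url', 'doi') to its value."""
    result = {}
    for token in field_value.split():
        if not token.startswith(PREFIX):
            continue  # skip anything that is not an altIdentifier token
        # Split only on the first "/" so URLs and DOIs keep their slashes.
        scheme, _, value = token[len(PREFIX):].partition("/")
        result[scheme] = value
    return result


# Field value copied from the dc.relation.none.fl_str_mv row of the record.
dc_relation = (
    "info:eu-repo/semantics/altIdentifier/url/https://arxiv.org/abs/2504.16264 "
    "info:eu-repo/semantics/altIdentifier/doi/10.48550/arXiv.2504.16264"
)

identifiers = parse_alt_identifiers(dc_relation)
print(identifiers["doi"])
print(identifiers["url"])
```

Splitting on whitespace works here because neither identifier contains spaces; a field whose values can contain spaces would need a different delimiter.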