Distributed search based on self-indexed compressed text

Authors: Arroyuelo, Diego; Gil Costa, Graciela Verónica; González, Senén; Marín, Mauricio; Oyarzún, Mauricio
Publication year: 2012
Language: English
Resource type: article
Status: published version
Description: Query response times within a fraction of a second in Web search engines are feasible due to the use of indexing and caching techniques, which are devised for large text collections partitioned and replicated into a set of distributed-memory processors. This paper proposes an alternative query processing method for this setting, which is based on a combination of self-indexed compressed text and posting list caching. We show that a text self-index (i.e., an index that compresses the text and is able to extract arbitrary parts of it) can be competitive with an inverted index if we consider the whole query process, which includes index decompression, ranking and snippet extraction time. The advantage is that within the space of the compressed document collection, one can carry out posting list generation, document ranking and snippet extraction. This significantly reduces the total number of processors involved in the solution of queries. Alternatively, for the same amount of hardware, the performance of the proposed strategy is better than that of the classical approach based on treating inverted indexes and corresponding documents as two separate entities in terms of processors and memory space.
Affiliation: Arroyuelo, Diego. Yahoo! Research Latin America; Chile
Affiliation: Gil Costa, Graciela Verónica. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - San Luis; Argentina. Universidad Nacional de San Luis; Argentina. Yahoo! Research Latin America; Chile
Affiliation: González, Senén. Yahoo! Research Latin America; Chile
Affiliation: Marín, Mauricio. Universidad de Santiago de Chile; Chile. Yahoo! Research Latin America; Chile
Affiliation: Oyarzún, Mauricio. Universidad de Santiago de Chile; Chile
Subject: QUERY PROCESSING
SELF-INDEXED COMPRESSED TEXT
SNIPPET EXTRACTION
WAVELET TREES
WEB SEARCH ENGINES
Access level: open access
Terms of use: https://creativecommons.org/licenses/by-nc-nd/2.5/ar/
Repository
Institution: Consejo Nacional de Investigaciones Científicas y Técnicas
OAI identifier: oai:ri.conicet.gov.ar:11336/197197

id	CONICETDig_be3053ac1949b2171003e30378717205
oai_identifier_str	oai:ri.conicet.gov.ar:11336/197197
network_acronym_str	CONICETDig
repository_id_str	3498
network_name_str	CONICET Digital (CONICET)
purl_subject.fl_str_mv	https://purl.org/becyt/ford/1.2 https://purl.org/becyt/ford/1
dc.date.none.fl_str_mv	2012-03
dc.type.none.fl_str_mv	info:eu-repo/semantics/article info:eu-repo/semantics/publishedVersion http://purl.org/coar/resource_type/c_6501 info:ar-repo/semantics/articulo
dc.identifier.none.fl_str_mv	http://hdl.handle.net/11336/197197 Arroyuelo, Diego; Gil Costa, Graciela Verónica; González, Senén; Marín, Mauricio; Oyarzún, Mauricio; Distributed search based on self-indexed compressed text; Pergamon-Elsevier Science Ltd; Information Processing & Management; 48; 5; 3-2012; 819-827 0306-4573 CONICET Digital CONICET
url	http://hdl.handle.net/11336/197197
dc.language.none.fl_str_mv	eng
dc.relation.none.fl_str_mv	info:eu-repo/semantics/altIdentifier/url/http://www.sciencedirect.com/science/article/pii/S0306457311000094 info:eu-repo/semantics/altIdentifier/doi/10.1016/j.ipm.2011.01.008
dc.rights.none.fl_str_mv	info:eu-repo/semantics/openAccess https://creativecommons.org/licenses/by-nc-nd/2.5/ar/
dc.format.none.fl_str_mv	application/pdf application/pdf application/pdf
dc.publisher.none.fl_str_mv	Pergamon-Elsevier Science Ltd
dc.source.none.fl_str_mv	reponame:CONICET Digital (CONICET) instname:Consejo Nacional de Investigaciones Científicas y Técnicas
repository.mail.fl_str_mv	dasensio@conicet.gov.ar; lcarlino@conicet.gov.ar