Decoding the structure of the WWW: a comparative analysis of web crawls

Authors
Serrano, Maria Angeles; Maguitman, Ana Gabriela; Boguña, Marian; Fortunato, Santo; Vespignani, Alessandro
Publication year
2007
Language
English
Resource type
article
Status
published version
Description
The understanding of the immense and intricate topological structure of the World Wide Web (WWW) is a major scientific and technological challenge. This has been recently tackled by characterizing the properties of its representative graphs, in which vertices and directed edges are identified with Web pages and hyperlinks, respectively. Data gathered in large-scale crawls have been analyzed by several groups resulting in a general picture of the WWW that encompasses many of the complex properties typical of rapidly evolving networks. In this article, we report a detailed statistical analysis of the topological properties of four different WWW graphs obtained with different crawlers. We find that, despite the very large size of the samples, the statistical measures characterizing these graphs differ quantitatively, and in some cases qualitatively, depending on the domain analyzed and the crawl used for gathering the data. This spurs the issue of the presence of sampling biases and structural differences of Web crawls that might induce properties not representative of the actual global underlying graph. In short, the stability of the widely accepted statistical description of the Web is called into question. In order to provide a more accurate characterization of the Web graph, we study statistical measures beyond the degree distribution, such as degree-degree correlation functions or the statistics of reciprocal connections. The latter appears to enclose the relevant correlations of the WWW graph and carry most of the topological…
Affiliation: Serrano, Maria Angeles. Indiana University; United States. Institute for Scientific Interchange; Italy
Affiliation: Maguitman, Ana Gabriela. Universidad Nacional del Sur; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Bahía Blanca; Argentina
Affiliation: Boguña, Marian. Universitat de Barcelona; Spain
Affiliation: Fortunato, Santo. Institute for Scientific Interchange; Italy. Indiana University; United States
Affiliation: Vespignani, Alessandro. Institute for Scientific Interchange; Italy. Indiana University; United States
Subject
World Wide Web
Access level
open access
Terms of use
https://creativecommons.org/licenses/by-nc-sa/2.5/ar/
Repository
CONICET Digital (CONICET)
Institution
Consejo Nacional de Investigaciones Científicas y Técnicas
OAI identifier
oai:ri.conicet.gov.ar:11336/81668

id CONICETDig_c5439c237260b534efc055a9ef543720
oai_identifier_str oai:ri.conicet.gov.ar:11336/81668
network_acronym_str CONICETDig
repository_id_str 3498
network_name_str CONICET Digital (CONICET)
dc.title.none.fl_str_mv Decoding the structure of the WWW: a comparative analysis of web crawls
dc.creator.none.fl_str_mv Serrano, Maria Angeles
Maguitman, Ana Gabriela
Boguña, Marian
Fortunato, Santo
Vespignani, Alessandro
dc.subject.none.fl_str_mv World Wide Web
purl_subject.fl_str_mv https://purl.org/becyt/ford/2.2
https://purl.org/becyt/ford/2
publishDate 2007
dc.date.none.fl_str_mv 2007-08
dc.type.none.fl_str_mv info:eu-repo/semantics/article
info:eu-repo/semantics/publishedVersion
http://purl.org/coar/resource_type/c_6501
info:ar-repo/semantics/articulo
format article
status_str publishedVersion
dc.identifier.none.fl_str_mv http://hdl.handle.net/11336/81668
Serrano, Maria Angeles; Maguitman, Ana Gabriela; Boguña, Marian; Fortunato, Santo; Vespignani, Alessandro; Decoding the structure of the WWW: a comparative analysis of web crawls; Association for Computing Machinery; ACM Transactions on the Web; 1; 2; 8-2007; 1131-1155
1559-1131
CONICET Digital
CONICET
url http://hdl.handle.net/11336/81668
dc.language.none.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv info:eu-repo/semantics/altIdentifier/url/https://dl.acm.org/citation.cfm?id=1255438.1255442
info:eu-repo/semantics/altIdentifier/doi/10.1145/1255438.1255442
dc.rights.none.fl_str_mv info:eu-repo/semantics/openAccess
https://creativecommons.org/licenses/by-nc-sa/2.5/ar/
eu_rights_str_mv openAccess
rights_invalid_str_mv https://creativecommons.org/licenses/by-nc-sa/2.5/ar/
dc.format.none.fl_str_mv application/pdf
application/pdf
dc.publisher.none.fl_str_mv Association for Computing Machinery
publisher.none.fl_str_mv Association for Computing Machinery
dc.source.none.fl_str_mv reponame:CONICET Digital (CONICET)
instname:Consejo Nacional de Investigaciones Científicas y Técnicas
reponame_str CONICET Digital (CONICET)
collection CONICET Digital (CONICET)
instname_str Consejo Nacional de Investigaciones Científicas y Técnicas
repository.name.fl_str_mv CONICET Digital (CONICET) - Consejo Nacional de Investigaciones Científicas y Técnicas
repository.mail.fl_str_mv dasensio@conicet.gov.ar; lcarlino@conicet.gov.ar