Dongsheng Che

"Application of Efficient External Memory Algorithms to Simulated Web Graphs"

Abstract:

The Web graph is a graph of the World-Wide Web (WWW), with Web pages represented by nodes and hyperlinks represented by directed edges. In the past decade, the WWW has spawned a sharing and dissemination of information on an unprecedented scale, and since then much research has focused on crawling strategies used by search engines. Recent study has shown that breadth first search crawling, one of the current crawling strategies, yields high quality pages. However, very little has been done on increasing the overall download rate when using breadth-first search crawling in the face of the "massive" character of the Web graph. Problems for massive data sets can be solved either by storing data sets in a huge main memory, or by storing data sets in external memory but with I/O efficient techniques. The goal of our research is to study how to reduce breadth-first search time for crawling by using I/O efficient techniques. We used data structures provided by LEDA-SM to store massive data sets of our simulated Web graphs, and run BFS on these generated graphs. The simulated Web graphs share important properties with the real Web graph, i.e., the degree distributions follow the same power laws, and random-start BFS traversals exhibit sharply bimodal behavior. The results indicate that the LEDA-SM system is useful for the Web graph computations, especially on machines with modest amounts of main memory.