Track: Search: Crawlers
IRLbot: Scaling to 6 Billion Pages and Beyond
- Hsin-Tsang Lee (Texas A&M University)
- Derek Leonard (Texas A&M University)
- Xiaoming Wang (Texas A&M University)
- Dmitri Loguinov (Texas A&M University)
This paper shares our experience in designing a web crawler that can download billions of pages using a single-server implementation, and it models the crawler's performance. We show that with the quadratically increasing complexity of verifying URL uniqueness, BFS crawl order, and fixed per-host rate-limiting, current crawling algorithms cannot effectively cope with the sheer volume of URLs generated in large crawls, highly branching spam, legitimate multi-million-page blog sites, and infinite loops created by server-side scripts. We offer a set of techniques for dealing with these issues and test their performance in an implementation we call IRLbot. In our recent experiment that lasted 41 days, IRLbot running on a single server successfully crawled 6.3 billion valid HTML pages (7.6 billion connection requests) and sustained an average download rate of 319 Mb/s (1,789 pages/s). Unlike our prior experiments with algorithms proposed in related work, this version of IRLbot did not experience any bottlenecks and successfully handled content from over 117 million hosts, parsed out 394 billion links, and discovered a subset of the web graph with 41 billion unique nodes.
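As a rough consistency check on the figures quoted above, the short Python sketch below recomputes the average page rate from the page count and crawl duration, and estimates the implied average transfer per page. It assumes that the quoted bandwidth is in megabits per second with a decimal mega prefix, and that the rounded abstract figures are close enough for a sanity check. The function names are illustrative and do not come from the paper.

```python
SECONDS_PER_DAY = 86_400


def pages_per_second(pages: float, days: float) -> float:
    """Average page rate over a crawl of the given length."""
    return pages / (days * SECONDS_PER_DAY)


def bytes_per_page(megabits_per_second: float, page_rate: float) -> float:
    """Average bytes transferred per page (assumes 1 Mb = 10**6 bits)."""
    return megabits_per_second * 1e6 / 8 / page_rate


# Figures taken from the abstract (rounded as published).
rate = pages_per_second(6.3e9, 41)
size = bytes_per_page(319, 1789)

print(f"recomputed rate: {rate:.0f} pages/s (abstract quotes 1,789)")
print(f"implied transfer: {size / 1000:.1f} kB per page")
```

The recomputed rate lands within about one percent of the quoted 1,789 pages/s, which is consistent with the page count and duration having been rounded. The per-page figure is only an upper-bound style estimate, since it attributes all bandwidth to the valid HTML pages and ignores the remaining connection requests.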