Designing a Robust Web Crawler - From Requirements to Implementation

What is a web crawler?

A web crawler is a bot that downloads and indexes content from all over the internet. The goal of such a bot is to learn what every page on the web is about, so that the information can be retrieved when it is needed. - Cloudflare


We need to overcome a few obstacles while designing our web crawler: staying polite to the servers we crawl, the DNS resolution bottleneck, and detecting content and URLs we have already seen.


Let's list the steps required to crawl websites and build a component for each step.


Designing the URL Frontier

Before discussing the design of the URL frontier, let's discuss why we don't simply use a standard queue. A standard queue is FIFO, which is unacceptable for crawling because, within a page, most URLs point to other pages hosted on the same server.

This can cause two types of problems: we end up sending a burst of requests to the same host in a short period of time, which is impolite and can look like a denial-of-service attack, and a plain FIFO queue gives us no way to prioritize URLs that are more important or change more frequently.

Note: This implementation of the URL frontier is based on the implementation in the Mercator web crawler.

The components of the URL frontier are: a prioritizer, a set of front queues (one per priority level), a set of back queues (one per host, to enforce politeness), a table that maps each host to its back queue, and a heap that tells worker threads when each host can next be contacted.

How the URL frontier works: every incoming URL is assigned a priority and appended to the corresponding front queue. Whenever a back queue needs to be refilled, URLs are pulled from the front queues (biased toward higher priorities) and routed, via the host-to-queue table, to the back queue responsible for their host. A worker thread asks the heap for the back queue whose host can be contacted next, pops a URL from it, fetches the page, and then re-inserts that queue into the heap with a timestamp far enough in the future to guarantee a polite delay between requests to the same host. A small code sketch of this scheme follows the diagram below.

[Figure: Diagram of the URL Frontier]
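To make the scheme concrete, here is a minimal, in-memory Python sketch of such a frontier. The names (URLFrontier, add, next_url), the three priority levels, and the one-second politeness delay are assumptions chosen for the example; a production frontier like Mercator's keeps its queues on disk and refills back queues more selectively.

```python
import heapq
import time
from collections import defaultdict, deque
from urllib.parse import urlparse

class URLFrontier:
    """Minimal in-memory sketch of a Mercator-style URL frontier.

    Front queues handle priority; back queues (one per host) handle
    politeness. A heap of (next_allowed_time, host) tells workers which
    host may be contacted next.
    """

    def __init__(self, num_priorities=3, politeness_delay=1.0):
        self.front_queues = [deque() for _ in range(num_priorities)]
        self.back_queues = defaultdict(deque)   # host -> pending URLs
        self.heap = []                          # (next_allowed_time, host)
        self.politeness_delay = politeness_delay

    def add(self, url, priority=0):
        """Prioritizer: route a new URL to the front queue for its priority."""
        self.front_queues[priority].append(url)

    def _refill_back_queues(self):
        """Move URLs from the front queues into per-host back queues."""
        for queue in self.front_queues:          # biased toward priority 0 first
            while queue:
                url = queue.popleft()
                host = urlparse(url).netloc
                if not self.back_queues[host]:
                    heapq.heappush(self.heap, (time.time(), host))
                self.back_queues[host].append(url)

    def next_url(self):
        """Return the next URL whose host may be politely contacted."""
        self._refill_back_queues()
        while self.heap:
            next_time, host = heapq.heappop(self.heap)
            if not self.back_queues[host]:
                continue                         # queue drained; drop the entry
            wait = next_time - time.time()
            if wait > 0:
                time.sleep(wait)                 # enforce the per-host delay
            url = self.back_queues[host].popleft()
            if self.back_queues[host]:
                heapq.heappush(self.heap, (time.time() + self.politeness_delay, host))
            return url
        return None                              # frontier is empty
```

A worker thread would simply call next_url() in a loop; the heap guarantees it never hits the same host twice within the politeness delay.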


Overcoming the bottleneck of DNS Resolution

DNS resolution is the process of finding the IP address of a web server from its URL. In this process, we make requests to the Domain Name System (DNS) and get the IP address in response. We won't go into the details of how DNS resolution works, but it can take multiple round trips before we get the IP address, so a single lookup can take a few seconds. This is unacceptable since we need to index hundreds of pages per second.

We can solve this bottleneck by caching DNS results. We keep a local mapping from hostname to IP address (expiring entries after the record's TTL), so repeated URLs from the same host never trigger another lookup, and we run the remaining lookups from multiple resolver threads so that a slow response doesn't stall the whole crawl.

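As a rough sketch, a caching resolver could look like the Python snippet below. The class name CachingResolver, the fixed five-minute TTL, and the use of socket.gethostbyname are assumptions made for illustration; a real crawler would respect the TTL of each DNS record and issue uncached lookups asynchronously.

```python
import socket
import threading
import time

class CachingResolver:
    """Sketch of a thread-safe DNS cache (hostname -> IP with expiry)."""

    def __init__(self, ttl=300.0):
        self.ttl = ttl            # assumption: every entry expires after ttl seconds
        self.cache = {}           # hostname -> (ip, expires_at)
        self.lock = threading.Lock()

    def resolve(self, hostname):
        now = time.time()
        with self.lock:
            entry = self.cache.get(hostname)
            if entry and entry[1] > now:
                return entry[0]                  # cache hit: no network round trip
        ip = socket.gethostbyname(hostname)      # cache miss: ask the OS resolver
        with self.lock:
            self.cache[hostname] = (ip, now + self.ttl)
        return ip
```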

Fetching content from the web page

We will fetch the web page corresponding to a given URL using the appropriate network protocol. There can be multiple protocols, such as HTTP, FTP, etc.
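A minimal HTTP(S) fetcher might look like the following sketch, built only on Python's standard library. The user-agent string and the ten-second timeout are placeholders, and other protocols such as FTP would need their own fetch modules.

```python
from urllib import request
from urllib.error import URLError

def fetch(url, timeout=10):
    """Download the page body for a URL over HTTP(S)."""
    req = request.Request(url, headers={"User-Agent": "MyCrawler/0.1"})  # placeholder bot name
    try:
        with request.urlopen(req, timeout=timeout) as resp:
            return resp.read()
    except URLError:
        return None   # treat unreachable pages as empty; a real crawler would retry later
```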


Designing the ‘Content Seen’ Module

With the naive approach, we store the entire content of every web page (either in a cache or in secondary storage), and whenever we fetch a new page we compare its content against everything stored so far.

This approach is simple, but it consumes a lot of time and memory. So, we won’t be using this.

Instead, we use a structure called the document fingerprint set that stores a 64-bit checksum of the contents of each web page. Fingerprints offer provably strong probabilistic guarantees that two different strings will not have the same fingerprint.

It also consumes much less memory and time than storing full page contents.

The ‘content seen’ module in Mercator uses Broder’s implementation of Rabin’s fingerprinting algorithm.

Like all other modules, we will have multiple worker threads to process web pages in parallel.
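Here is a simplified sketch of the ‘content seen’ check. Mercator uses Rabin fingerprints; purely to keep the example short, the sketch below substitutes the first 64 bits of a SHA-256 digest as the fingerprint, which serves the same purpose of making collisions between different documents extremely unlikely.

```python
import hashlib

def fingerprint64(content: bytes) -> int:
    """64-bit document fingerprint (truncated SHA-256 standing in for Rabin)."""
    return int.from_bytes(hashlib.sha256(content).digest()[:8], "big")

class ContentSeen:
    """Document fingerprint set: have we already processed this exact content?"""

    def __init__(self):
        self.fingerprints = set()

    def seen(self, content: bytes) -> bool:
        fp = fingerprint64(content)
        if fp in self.fingerprints:
            return True           # duplicate content: skip indexing
        self.fingerprints.add(fp) # new content: remember its fingerprint
        return False
```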


Designing the ‘URL Seen’ Test

This one seems very similar to the ‘content seen’ test: we could calculate a fingerprint for each URL and compare it with the fingerprints of all URLs seen so far.

While this approach works, storing a full fingerprint for every URL ever seen consumes a lot of memory. A more space-efficient alternative is a probabilistic data structure called a Bloom filter, which can occasionally report false positives but never misses a URL it has actually seen.

We won’t go into much detail about how Bloom filters work, but here’s a short explanation:

It uses a large bit vector (only contains 0s and 1s). An element is added to the set by computing n hash functions of the element and setting the corresponding bits. An element is deemed to be in the set if the bits at all n of the element’s hash locations are set.

By using a Bloom filter we get constant-time membership checks with a small, fixed memory footprint. The trade-off is a small probability of false positives: a new URL may occasionally be reported as already seen and therefore skipped, which is an acceptable cost for a crawler.

If the URL has not been seen, we add it to the URL frontier and repeat the process.
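For illustration, here is a small Bloom filter sketch. The bit-vector size, the four hash functions, and the use of slices of a SHA-256 digest as those hash functions are assumptions; in practice the parameters are sized from the expected number of URLs and the acceptable false-positive rate.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter for the 'URL seen' test."""

    def __init__(self, num_bits=1 << 24, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)      # the large bit vector

    def _positions(self, url: str):
        """Derive num_hashes bit positions from slices of one SHA-256 digest."""
        digest = hashlib.sha256(url.encode()).digest()
        for i in range(self.num_hashes):
            chunk = digest[i * 4:(i + 1) * 4]     # 4 bytes per simulated hash function
            yield int.from_bytes(chunk, "big") % self.num_bits

    def add(self, url: str):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)  # set the corresponding bits

    def might_contain(self, url: str) -> bool:
        # In the set only if ALL of the element's hash locations are set.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url))
```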

Now that we are done, let’s look at the final architecture of our web crawler.

[Figure: Final architecture of the web crawler]
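To tie everything together, here is a rough sketch of the overall crawl loop that reuses the URLFrontier, CachingResolver, ContentSeen, BloomFilter, and fetch sketches from the previous sections. The extract_links helper (a naive regex) and the max_pages limit are assumptions made for the example; a real crawler would use a proper HTML parser and run this loop across many worker threads and machines.

```python
import re
from urllib.parse import urljoin, urlparse

def extract_links(base_url: str, html: bytes):
    """Naive link extraction; a real crawler would use a proper HTML parser."""
    hrefs = re.findall(rb'href="([^"]+)"', html)
    return [urljoin(base_url, h.decode(errors="ignore")) for h in hrefs]

def crawl(seed_urls, max_pages=100):
    frontier = URLFrontier()         # from the URL frontier sketch
    resolver = CachingResolver()     # from the DNS sketch
    content_seen = ContentSeen()     # from the content-seen sketch
    url_seen = BloomFilter()         # from the URL-seen sketch

    for url in seed_urls:
        url_seen.add(url)
        frontier.add(url)

    pages = 0
    while pages < max_pages:
        url = frontier.next_url()
        if url is None:
            break                                 # frontier exhausted
        try:
            resolver.resolve(urlparse(url).hostname or "")  # warm the DNS cache
        except OSError:
            continue                              # host does not resolve; skip it
        html = fetch(url)                         # from the fetcher sketch
        if html is None or content_seen.seen(html):
            continue                              # failed fetch or duplicate content
        pages += 1
        # ... hand the page content to the indexer / document store here ...
        for link in extract_links(url, html):
            if not url_seen.might_contain(link):  # 'URL seen' test
                url_seen.add(link)
                frontier.add(link)
```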