The last few weeks were really busy and exciting. Until now I could not reveal any details of what we were up to, but a few weeks ago the project went public and online: please feel free to visit the Erdős WebGraph project.
About two months ago we were asked to create a graph of domains in which the directed edges represent hyperlinks. Our first attempt was to install several of the publicly available crawlers, but they all had indexing features that we did not need, and none of them seemed lightweight enough for our extremely limited hardware capacity. Another inconvenience arose from the almost complete lack of documentation for this software. Finally we decided to implement our own crawler.
One of the most important decisions was the choice of language. We needed a language that had a rich collection of libraries we could use, supported a high level of abstraction while coding, and still remained fast enough. Our choice quickly fell on Python. It had readily available libraries for ssh connections, http connections, url handling (normalization) and, most importantly, HTML parsing. With that in hand, we could quickly and effectively implement our rather complex parallelized crawling architecture. Despite being an interpreted language, well-written Python code can run pretty fast (in some cases C++ is less than 6 times faster than Python), and the C API allows a smooth transition from a Python implementation to a pure C++ implementation once it becomes necessary.
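To give a feel for how little code the core step needs, here is a minimal sketch of turning one downloaded page into domain-to-domain edges. It is not our crawler's code: it is written for Python 3's standard library, and the names `LinkCollector` and `domain_edges` are made up for illustration. It also assumes that links staying on the same domain are not interesting for the graph.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def domain_edges(page_url, html):
    """Return the set of (source_domain, target_domain) pairs found in html."""
    collector = LinkCollector()
    collector.feed(html)
    source = urlsplit(page_url).hostname
    edges = set()
    for href in collector.hrefs:
        # Resolve relative links against the page they were found on.
        target = urlsplit(urljoin(page_url, href)).hostname
        # Skip links without a host (mailto:, javascript:) and same-domain links.
        if target and target != source:
            edges.add((source, target))
    return edges


sample = '<a href="/about">About</a> <a href="http://Other.example.org/x">Other</a>'
print(domain_edges("http://example.com/index.html", sample))
# {('example.com', 'other.example.org')}
```

The `hostname` attribute lowercases the host, which takes care of one small piece of url normalization for free.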
The current infrastructure consists of 12 crawling machines (Intel Core2 Quad CPU Q6700 @ 2.66GHz, 4 GB of memory each) sharing a common 1 Gb Internet connection, and a 16-core MySQL server (Intel Xeon X5550 @ 2.67GHz) with 16 GB of RAM. Each crawling machine runs 4 threads of the crawler. Currently we are monitoring performance to decide how far we can push the SQL server before it becomes a bottleneck, and so far it seems that with small changes it will be possible to double the number of threads on each machine. With the infrastructure described above we are currently downloading and processing 3 million urls each day, and still have a good amount of unused system resources. (Edit: since this blog entry was written we have improved that to 300 million a week.)
Enough said about the pure numbers. Let's get to the gory details of Internet chaos... Most of the pitfalls in implementing a crawler result from simple human nature: some people are plain stupid, yet extremely creative at making mistakes; others are pure evil and deliberately deceptive. 🙂 Here is a list of the most annoying problems our crawler came across:
The most common problem is session IDs stored in GET variables. When we started the project I did expect that we would come across them, but I never expected to see so many. I still don't understand why anyone would use that ancient solution rooted in the pre-cookiestoric '90s. It was a viable option back then, when not all browsers supported cookies and those that did had them turned off by default, but that problem seemed so 15 years ago to me. Anyhow... one has to deal with it, but it's far from simple: sane people use GET variables exclusively to pass variables to the server that invoke server-side changes in a web application (and even for that a POST variable should be preferred). Less sane people use them to refer to pages. Can you guess how many pages have urls like index.php?page=xxxxx ? (For example, the WordPress blog engine I use does that too... unfortunately... I really should change that...) And yet others still use them for session IDs. So which GET variables should you keep, and which ones should you throw away? Of course you could just throw away GET variables like SessionID, SID, or ID in general. But then again, ID may be referring to a page ID... see how it gets tricky? And of course there is the occasional way-too-creative web designer who refers to the session ID as "h" or "kitty" or by some other unforeseeable random sequence of characters. 🙂 And don't even think of searching for hexadecimal numbers as values: session IDs take many forms, and page IDs may happen to be hexadecimal numbers too. Now let's suppose you have just found the perfect solution to filtering session IDs. You'd still be in trouble... we came across a website using one single GET variable that encoded both the requested page and the session ID.
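For illustration, the naive name-based filter could look roughly like the sketch below. It is not what our crawler does: `SESSION_KEYS` is a made-up blacklist and `strip_session_params` is a hypothetical helper, written against Python 3's `urllib.parse`. It deliberately leaves `id` alone, because that may well be a page ID.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical blacklist of parameter names; real sites use many more.
SESSION_KEYS = {"sid", "sessionid", "session_id", "phpsessid", "jsessionid"}


def strip_session_params(url):
    """Drop GET variables whose name looks like a session ID."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_KEYS
    ]
    # Sort the remaining parameters so that equivalent urls compare equal
    # (an assumption: a few sites do depend on parameter order).
    kept.sort()
    # The fragment is dropped because it is never sent to the server.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


print(strip_session_params("http://example.com/index.php?page=42&PHPSESSID=abc123"))
# http://example.com/index.php?page=42
print(strip_session_params("http://example.com/?id=7&kitty=deadbeef"))
# http://example.com/?id=7&kitty=deadbeef
```

The second call shows exactly the weakness described above: a session ID hiding behind a name like "kitty" passes straight through.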
Then there are those wonderful automatically generated urls. The beauty of them lies in the virtually infinite number of combinations you can create, each referring to a different page. The most annoying kind is generated by one of the most popular CMS engines on the market. What it does is ignore everything between the domain and the last directory. So for that engine http://example.com/appletree/ is essentially the same as http://example.com/asdgaregregerb/appletree/ and http://example.com/orange/microsoft/google/appletree/ and so on. Now that wouldn't be such a big problem if there wasn't a single wrongly formatted url on the page. But there is! There always is! Let's say the editor of http://example.com/appletree/ adds a link to the page http://example.com/orangetree/, but he forgets to add the leading slash, ending up with a link to http://example.com/appletree/orangetree/. The page loads, the content is right. No reason to correct it. But then there is a similar mistake on the orangetree page, and there we go with an infinite loop generating longer and longer urls pointing to valid pages with valid content. Even more annoying are the urls that belong to the filtering parameters of the search features of online stores, for example /80-160GB/ssd/samsung/ or /price--120-200///screen/10-15inch/ and the like. The crazy thing about these is that they are all valid urls, with valid pages behind them, containing valid and unique information. And that's how the crawler ends up with urls from one single domain flooding the database.
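The missing leading slash is easy to reproduce with Python 3's `urljoin`, and a crude guard against the resulting loop fits in a few lines. The guard is only a sketch under assumptions: `looks_like_trap`, `MAX_DEPTH` and `MAX_REPEATS` are made-up names and limits, not the rule our crawler uses.

```python
from urllib.parse import urljoin, urlsplit

# Without the leading slash the link resolves relative to the current directory.
print(urljoin("http://example.com/appletree/", "orangetree/"))
# http://example.com/appletree/orangetree/
print(urljoin("http://example.com/appletree/", "/orangetree/"))
# http://example.com/orangetree/

MAX_DEPTH = 8    # hypothetical limit on the number of path segments
MAX_REPEATS = 2  # hypothetical limit on how often one segment may occur


def looks_like_trap(url):
    """Flag urls whose path is very deep or keeps repeating the same segment."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if len(segments) > MAX_DEPTH:
        return True
    return any(segments.count(s) > MAX_REPEATS for s in set(segments))


print(looks_like_trap("http://example.com/appletree/orangetree/"))
# False
print(looks_like_trap(
    "http://example.com/appletree/orangetree/appletree/orangetree/appletree/orangetree/"
))
# True
```

A check like this catches the appletree/orangetree loop, but it does nothing about the online store filters, where every url is short and every segment is different.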
The real problem with both session IDs in GET variables and an infinite number of valid urls on one domain is how they interact with the politeness policy of the crawler. When a domain floods the database, the crawlers receive an increasing number of jobs from the same domain. The politeness policy will first postpone these urls in an attempt to be polite. As the number of urls increases, it eventually gets overloaded with postponed urls and comes to the conclusion that the domain must have too many urls, so it must belong to some popular service, which in turn can endure more requests. Soon enough the crawler will load several urls per second from the same domain in an attempt to catch up with the postponed jobs. What makes it worse is that we have 48 crawlers, each facing the same problem and each pestering the same domain with requests. What you end up with is the virtual server of some poor guy under a DoS attack 🙂 And he kind of deserves it... (Of course we have solved this problem, but we did accidentally DoS a few sites at first... sorry about that!)
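One way to avoid this failure mode is to make the per-domain delay a hard floor that never relaxes, however long the backlog gets. The sketch below shows the idea only; it is not a description of how we actually fixed it. `DomainThrottle` is a hypothetical class, and it keeps its state in one process, whereas with 48 crawlers the state would have to live somewhere shared, for example in the central database.

```python
import time


class DomainThrottle:
    """Enforce a fixed minimum delay between requests to the same domain."""

    def __init__(self, min_delay=1.0, clock=time.monotonic):
        self.min_delay = min_delay
        self.clock = clock       # injectable so the class can be tested
        self.next_allowed = {}   # domain -> earliest time of the next request

    def try_acquire(self, domain):
        """Return True and reserve a slot if the domain may be fetched now."""
        now = self.clock()
        if now < self.next_allowed.get(domain, float("-inf")):
            # Too early: the caller postpones the job, no matter how many
            # jobs for this domain are already waiting.
            return False
        self.next_allowed[domain] = now + self.min_delay
        return True


throttle = DomainThrottle(min_delay=2.0)
print(throttle.try_acquire("example.com"))  # True: first request to this domain
print(throttle.try_acquire("example.com"))  # False: still inside the 2 second delay
print(throttle.try_acquire("example.org"))  # True: other domains are unaffected
```

With a floor like this, a flooded domain simply builds up a long queue instead of receiving several requests per second.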
Finally, on one ill-fated evening the crawlers also came across a huge HTML page. It was the list of all possible combinations of holiday opportunities on the website of a travel agency. The list resulted in an automatically generated HTML file of a few hundred megabytes. When the crawler tried to parse the HTML, it simply filled up the memory and crashed. That of course meant that the results were never returned to the SQL server, which then sent the same page to another unlucky crawler for processing. Good thing we had a script that restarted all missing crawlers, otherwise all of them would have been down by the next morning. The funny part is that all major browsers (including MSIE, Firefox, Chrome, Safari, Opera) crashed when we tried to load that page... so why is it even there?!
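A simple defence is to stop reading once a page grows past some limit, before it ever reaches the parser. Here is a minimal sketch, assuming a made-up cap `MAX_BYTES` and a hypothetical helper `read_capped`. It works on any binary file-like object, so it is demonstrated on `io.BytesIO`; the response object returned by `urllib.request.urlopen` offers the same `read(n)` method.

```python
import io

MAX_BYTES = 5 * 1024 * 1024  # hypothetical cap: 5 MB per page


def read_capped(stream, max_bytes=MAX_BYTES, chunk_size=64 * 1024):
    """Read a binary stream in chunks; return None if it exceeds max_bytes."""
    chunks = []
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break  # end of stream
        total += len(chunk)
        if total > max_bytes:
            return None  # too large: give up instead of buffering everything
        chunks.append(chunk)
    return b"".join(chunks)


print(read_capped(io.BytesIO(b"<html></html>")))
# b'<html></html>'
print(read_capped(io.BytesIO(b"x" * 100), max_bytes=10))
# None
```

When the function returns `None`, the job should still be reported back as failed, so that the same url is not handed to the next crawler.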
Conclusion: writing a crawler isn't that easy. Just the pure chaos in the way people use urls is enough to confuse bots. So if you happen to be a webmaster or website designer, my advice for you is to take extra care in designing your url structure and communication patterns, and never underestimate the importance of Search Engine Optimization (SEO). Most search engines (including Google and Bing) prefer well-designed urls that contain ONLY relevant information about the page they refer to. Multiple equivalent urls pointing to the same page should also be avoided, as the crawler may not realize that they are the same page, and the links pointing to that page get split between the urls, decreasing your page rank. DO NOT rely on the self-identification of robots, as many of them may identify themselves as standard browsers. The reason for that: some people try to deceive robots as part of black hat SEO (something you should never be involved in).