Yahoo Slurp Makes a Mess
Wednesday, June 27th, 2007
For months we’ve been carefully watching how the various bots consume Urbanspoon. We enticed them inside with fresh content, well-constructed pages, and sitemaps. Despite our efforts, until quite recently Yahoo Slurp didn’t have much of an appetite for Urbanspoon. Instead of digging in and indexing the whole site, Yahoo Slurp preferred to nibble around the edges.
That is, until June 16th. Notice anything odd?
Yahoo Slurp Requests to Urbanspoon.com
Someone flipped a switch down there in Sunnyvale and the Yahoo Slurp bot suddenly decided that it loved Urbanspoon.
For comparison, check out the Metamucil-like regularity of the Google bot:
Google Bot Requests to Urbanspoon.com
Let’s dig in and take a closer look at those two bots. Ready… fight!
| | Yahoo Slurp (June 16-18) | Google Bot (June 16-18) |
|---|---|---|
| Total hits | 194,464 | 41,941 |
| Pages | 120,076 (38% dups) | 41,332 (1.4% dups) |
| Violations of robots.txt | 32 | 27 |
| Restaurant pages | 85,396 | 22,999 |
| New York restaurants | 26,599 | 6,573 |
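For anyone who wants to check the arithmetic, here is a minimal sketch of how the dup percentages fall out of the hit and page counts. It assumes the dup rate is simply (total hits minus distinct pages) divided by total hits, which matches the numbers above; the function name `dup_rate` is made up for illustration.

```python
def dup_rate(total_hits: int, distinct_pages: int) -> float:
    """Fraction of requests that re-fetched a page the bot had already requested.

    Assumption: every hit beyond the first request for a page counts as a duplicate.
    """
    return (total_hits - distinct_pages) / total_hits


# Figures taken from the comparison for June 16-18.
yahoo = dup_rate(194_464, 120_076)
google = dup_rate(41_941, 41_332)

print(f"Yahoo Slurp dup rate: {yahoo:.2%}")
print(f"Google Bot dup rate:  {google:.2%}")
```

By this measure Slurp's duplicate rate comes out at roughly 38%, against well under 2% for Google.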
Yahoo Slurp Duplicates
Yahoo Slurp requested many pages more than once. In fact, Yahoo Slurp was unable to resist certain pages, compulsively returning to them again and again. Here are the pages that the bot seemed to find tastiest:
| # of requests | page |
|---|---|
| 419 | /robots.txt |
| 294 | /choose |
| 273 | / |
| 21 | /c/3/New-York.html |
| 19 | /c/5/Los-Angeles.html |
| 16 | /a/3/New-York-at-night.html |
| 15 | /c/1/Seattle.html |
| 14 | /c/2/Chicago.html |
| 11 | /u/create (and this page is blocked via robots.txt!) |
I’ll spare you the other 73,306 duplicates requested by Yahoo Slurp.
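If you want to run the same tally on your own logs, something like the sketch below would do it. It is not the script we actually used; it assumes Apache-style Combined Log Format lines and assumes that the bot can be recognized by the substring "Slurp" in the user-agent field. The sample lines, `count_bot_requests`, and `duplicate_summary` are all invented for illustration.

```python
import re
from collections import Counter

# Pulls the request path and the user agent out of a Combined Log Format line.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def count_bot_requests(lines, agent_marker="Slurp"):
    """Count requests per path for lines whose user agent contains agent_marker."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and agent_marker in match.group("agent"):
            counts[match.group("path")] += 1
    return counts


def duplicate_summary(counts):
    """Return (total hits, distinct pages, duplicate hits) for a Counter of paths."""
    total = sum(counts.values())
    distinct = len(counts)
    return total, distinct, total - distinct


# Made-up sample lines, not real Urbanspoon log data.
SLURP = "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
GOOGLE = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
sample = [
    f'10.0.0.1 - - [16/Jun/2007:00:00:01 -0700] "GET /robots.txt HTTP/1.0" 200 120 "-" "{SLURP}"',
    f'10.0.0.1 - - [16/Jun/2007:00:00:02 -0700] "GET /choose HTTP/1.0" 200 5120 "-" "{SLURP}"',
    f'10.0.0.1 - - [16/Jun/2007:00:00:03 -0700] "GET /robots.txt HTTP/1.0" 200 120 "-" "{SLURP}"',
    f'10.0.0.2 - - [16/Jun/2007:00:00:04 -0700] "GET / HTTP/1.1" 200 9000 "-" "{GOOGLE}"',
    f'10.0.0.1 - - [16/Jun/2007:00:00:05 -0700] "GET / HTTP/1.0" 200 9000 "-" "{SLURP}"',
]

counts = count_bot_requests(sample)
for path, hits in counts.most_common():
    print(hits, path)
print(duplicate_summary(counts))
```

Sorting with `most_common()` gives the same kind of "tastiest pages" ranking as the table above, and the summary tuple feeds straight into a dup-rate calculation.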
Directory Crawling
Strangely, the Yahoo Slurp bot likes to explore the directories leading up to each page. For example, in addition to indexing our Sitka & Spruce page, Yahoo Slurp also tried to hit each of the directories leading up to that page:
/r/1/1084/Seattle/Eastlake-Lake-Union/Sitka-Spruce.html
/r/1/1084/Seattle/Eastlake-Lake-Union/
/r/1/1084/
/r/1/
/r/
Those URLs aren’t linked anywhere from our site. Each of them (correctly) redirects elsewhere. Why did Yahoo choose to crawl them?
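To spot this kind of directory walking in a log, one approach is to generate every ancestor directory of a known page and see which of them the bot requested. The sketch below is only an illustration: `parent_directories` and `directory_walk_hits` are hypothetical helpers, and the sketch lists every ancestor, so it also yields `/r/1/1084/Seattle/`, which is not in the list above.

```python
def parent_directories(path: str) -> list[str]:
    """List every ancestor directory of a page URL path, deepest first."""
    segments = path.strip("/").split("/")
    directories = segments[:-1]  # drop the final segment, which is the page itself
    return [
        "/" + "/".join(directories[:depth]) + "/"
        for depth in range(len(directories), 0, -1)
    ]


def directory_walk_hits(page: str, requested_paths: set[str]) -> list[str]:
    """Return the ancestor directories of page that appear among the requested paths."""
    return [d for d in parent_directories(page) if d in requested_paths]


page = "/r/1/1084/Seattle/Eastlake-Lake-Union/Sitka-Spruce.html"

# The directory requests reported in the post for this page.
requested = {
    "/r/1/1084/Seattle/Eastlake-Lake-Union/",
    "/r/1/1084/",
    "/r/1/",
    "/r/",
}

for directory in directory_walk_hits(page, requested):
    print(directory)
```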
Yahoo Slurp - A Sloppy Eater
We’re quite flattered by the attention, but Yahoo’s Slurp bot made a bit of a mess. I can forgive the robots.txt violations, since other bots share this transgression. The directory walking is bizarre, but it won’t hurt our search engine results, thanks to our clever defensive redirects.
The 38% dup rate is just plain sloppy. Really, this is not how we want to spend our precious CPU cycles and bandwidth. I’ve written a few indexing systems myself and I know that these problems are challenging, but the market leader seems to have solved them nicely.
It remains to be seen if Yahoo’s aggressive indexing will lead to a commensurate increase in traffic from Yahoo. Stay tuned!