Yahoo Slurp Makes a Mess

Wednesday, June 27th, 2007

For months we’ve been carefully watching how the various bots consume Urbanspoon. We enticed them inside with fresh content, well-constructed pages, and sitemaps. Despite our efforts, until quite recently Yahoo Slurp didn’t have much of an appetite for Urbanspoon. Instead of digging in and indexing the whole site, Yahoo Slurp preferred to nibble around the edges.

That is, until June 16th. Notice anything odd?

[Chart: Yahoo Slurp Requests to Urbanspoon.com]

Someone flipped a switch down there in Sunnyvale and the Yahoo Slurp bot suddenly decided that it loved Urbanspoon.

For comparison, check out the Metamucil-like regularity of the Google bot:

[Chart: Google Bot Requests to Urbanspoon.com]

Let’s dig in and take a closer look at those two bots. Ready… fight!

                          Yahoo Slurp (June 16-18)    Google Bot (June 16-18)
Total hits                194,464                     41,941
Pages                     120,076 (38% dups)          41,332 (1.4% dups)
Violations of robots.txt  32                          27

Restaurant pages          85,396                      22,999
Neighborhood pages        1,008                       1,366
Cuisine pages             995                         980

New York restaurants      26,599                      6,573
LA restaurants            23,002                      4,659
SF restaurants            20,152                      2,751
Seattle restaurants       7,817                       2,584
Boston restaurants        7,124                       1,987
Chicago restaurants       534                         2,365
DC restaurants            109                         2,245
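
The dup percentages follow directly from the hit and page counts. As a quick sanity check, here is a minimal Python sketch. It assumes "dups" means every request beyond the first for a given page, divided by total hits; the function name is ours, purely for illustration.

```python
def duplicate_rate(total_hits: int, unique_pages: int) -> float:
    """Fraction of requests that re-fetched a page the bot had already requested.

    Assumption: dups = (total hits - distinct pages) / total hits.
    """
    if total_hits <= 0:
        raise ValueError("total_hits must be positive")
    return (total_hits - unique_pages) / total_hits


# Figures from the June 16-18 comparison.
yahoo = duplicate_rate(194_464, 120_076)
google = duplicate_rate(41_941, 41_332)

print(f"Yahoo Slurp: {yahoo:.2%}")
print(f"Google Bot:  {google:.2%}")
```

Under that assumption the rates come out to roughly 38% for Yahoo Slurp and about 1.45% for the Google bot, in line with the figures quoted.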

Yahoo Slurp Duplicates

Yahoo Slurp requested many pages more than once. In fact, Yahoo Slurp was unable to resist certain pages, compulsively returning to them again and again. Here are the pages that the bot seemed to find tastiest:

# of requests  Page
419            /robots.txt
294            /choose
273            /
21             /c/3/New-York.html
19             /c/5/Los-Angeles.html
16             /a/3/New-York-at-night.html
15             /c/1/Seattle.html
14             /c/2/Chicago.html
11             /u/create (and this page is blocked via robots.txt!)

I’ll spare you the other 73,306 duplicates requested by Yahoo Slurp.
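
If you want to run a similar tally on your own logs, here is a rough sketch of one way to do it in Python. It is not the script behind the numbers in this post. It assumes the access log uses the Apache "combined" format and that the substring "Slurp" in the user-agent is enough to identify the bot. The regex, the function names, and the sample log lines are all illustrative.

```python
import re
from collections import Counter

# Assumed log layout: Apache "combined" format, e.g.
# 192.0.2.1 - - [16/Jun/2007:00:00:01 -0700] "GET /choose HTTP/1.0" 200 512 "-" "agent string"
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def count_requests(lines, agent_substring):
    """Count requests per path for user-agents containing agent_substring."""
    counts = Counter()
    needle = agent_substring.lower()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and needle in match.group("agent").lower():
            counts[match.group("path")] += 1
    return counts


def duplicates(counts):
    """Paths requested more than once, most requested first."""
    return [(path, n) for path, n in counts.most_common() if n > 1]


# Made-up sample lines, only to show the shape of the input.
SLURP = "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
sample = [
    f'192.0.2.1 - - [16/Jun/2007:00:00:01 -0700] "GET /robots.txt HTTP/1.0" 200 120 "-" "{SLURP}"',
    f'192.0.2.1 - - [16/Jun/2007:00:00:02 -0700] "GET /choose HTTP/1.0" 200 512 "-" "{SLURP}"',
    f'192.0.2.1 - - [16/Jun/2007:00:00:09 -0700] "GET /robots.txt HTTP/1.0" 200 120 "-" "{SLURP}"',
    '192.0.2.2 - - [16/Jun/2007:00:00:10 -0700] "GET /choose HTTP/1.1" 200 512 "-" "SomeBrowser/1.0"',
]

slurp_counts = count_requests(sample, "Slurp")
print(duplicates(slurp_counts))
```

Sorting by count with `most_common()` gives the same kind of "tastiest pages" list, and the total number of requests minus the number of distinct paths gives the overall duplicate count.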

Directory Crawling

Strangely, the Yahoo Slurp bot likes to explore the directories leading up to each page. For example, in addition to indexing our Sitka & Spruce page, Yahoo Slurp also tried to hit the directories above it:

/r/1/1084/Seattle/Eastlake-Lake-Union/Sitka-Spruce.html
/r/1/1084/Seattle/Eastlake-Lake-Union/
/r/1/1084/
/r/1/
/r/

Those URLs aren’t linked anywhere from our site. Each of them (correctly) redirects elsewhere. Why did Yahoo choose to crawl them?
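
One way to spot this behavior in a log is to generate the ancestor directories of each known page and check which of them a bot actually requested. Here is a minimal Python sketch with hypothetical function names. Note that it produces every ancestor, including /r/1/1084/Seattle/, which does not appear in the list of requests above.

```python
def ancestor_directories(path: str) -> list[str]:
    """Directory prefixes of a page's URL path, deepest first, excluding "/".

    Assumes path points at a page (its last component is a file name).
    """
    segments = path.split("/")[1:-1]  # drop the leading "" and the page itself
    return ["/" + "/".join(segments[:i]) + "/" for i in range(len(segments), 0, -1)]


def directory_probes(requested_paths, known_pages):
    """Requested paths that are ancestor directories of some known page."""
    ancestors = {d for page in known_pages for d in ancestor_directories(page)}
    return sorted(p for p in set(requested_paths) if p in ancestors)


page = "/r/1/1084/Seattle/Eastlake-Lake-Union/Sitka-Spruce.html"
for directory in ancestor_directories(page):
    print(directory)
```

Feeding `directory_probes` the paths a bot requested, along with the list of real pages from a sitemap, would flag the directory walking without needing to know the redirect targets.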

Yahoo Slurp - A Sloppy Eater

We’re quite flattered by the attention, but Yahoo’s Slurp bot made a bit of a mess. I can forgive the robots.txt violations, since other bots share this transgression. The directory walking is bizarre, but it won’t hurt our search engine results, thanks to our clever defensive redirects.
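
For the curious, robots.txt violations can be counted with Python's standard `urllib.robotparser` module. The sketch below is an example under stated assumptions, not how the numbers in this post were produced. The robots.txt content is hypothetical (the only rule known from this post is that /u/create is blocked), and the helper name is made up.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the real file may contain other rules.
ROBOTS_TXT = """\
User-agent: *
Disallow: /u/create
"""


def robots_violations(paths, robots_txt, user_agent):
    """Return the requested paths that robots.txt disallows for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [path for path in paths if not parser.can_fetch(user_agent, path)]


requested = ["/choose", "/u/create", "/", "/c/1/Seattle.html"]
print(robots_violations(requested, ROBOTS_TXT, "Slurp"))
```

Running each bot's requested paths through a check like this gives a per-bot violation count that can be compared side by side.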

The 38% dup rate is just plain sloppy. Really, this is not how we want to spend our precious CPU cycles and bandwidth. I’ve written a few indexing systems myself and I know that these problems are challenging, but the market leader seems to have solved them nicely: the Google bot’s dup rate over the same three days was 1.4%.

It remains to be seen if Yahoo’s aggressive indexing will lead to a commensurate increase in traffic from Yahoo. Stay tuned!