Archive for January, 2007

Google in Action and Other Graphs

Tuesday, January 23rd, 2007

In my endless quest to learn more about Urbanspoon’s explosive growth, I put together a script to generate graphs illustrating various aspects of our traffic. There are many interesting questions that we can now answer:

  • Which cities are getting the most traffic from search engines?
  • How long does it take Google to index a new Urbanspoon city?
  • etc.

The script periodically crunches our logs offline and creates graphs using the excellent (but cryptic) rrdtool. The graphs are regenerated every hour. awstats is nice, but sometimes you have to dig in and get your hands dirty. We also use munin to keep an eye on our hardware.
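To give a flavor of the log-crunching step, here is a minimal sketch in Ruby. It is not the actual Urbanspoon script: it assumes Apache combined-format access logs, and the names ROBOT_PATTERN, LOG_LINE and robot_hits_per_minute are made up for illustration. It only counts robot hits per minute; feeding the counts to rrdtool is left out.

```ruby
# Hypothetical robot pattern, for illustration only.
ROBOT_PATTERN = /\b(Googlebot|Slurp|msnbot)\b/i

# Assumes combined log format: captures the timestamp down to the minute
# and the last quoted field on the line, which is the User-Agent.
LOG_LINE = /\[(\d{2}\/\w{3}\/\d{4}:\d{2}:\d{2}):\d{2} [^\]]+\].*"([^"]*)"\s*\z/

# Returns a nested hash: robot name (lowercased) => minute => hit count.
def robot_hits_per_minute(lines)
  counts = Hash.new { |hash, robot| hash[robot] = Hash.new(0) }
  lines.each do |line|
    match = LOG_LINE.match(line)
    next unless match                  # skip lines we cannot parse
    minute, agent = match[1], match[2]
    robot = agent[ROBOT_PATTERN, 1]
    next unless robot                  # skip ordinary browsers
    counts[robot.downcase][minute] += 1
  end
  counts
end

# Example: print one line per robot per minute.
# ("access.log" is a placeholder path.)
if File.exist?("access.log")
  robot_hits_per_minute(File.foreach("access.log")).each do |robot, minutes|
    minutes.sort.each { |minute, hits| puts "#{robot} #{minute} #{hits}" }
  end
end
```

Bucketing by minute matches the Y axis of the graphs below (pages per minute); a real script would also need to remember where it left off in the log between runs.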

GoogleBot

Below is a recent snapshot of GoogleBot crawling Urbanspoon. Green is Seattle, blue is Chicago, and red is New York. X axis is time, Y axis is pages per minute. I’ve removed the Y labels to obscure our actual numbers.

Notice the flat tops on each bulge of GoogleBot traffic: GoogleBot caps its crawl rate at a certain number of pages per minute. Over the past few weeks Google has been gradually ramping up the rate at which it crawls Urbanspoon. Perhaps they looked at our response times and concluded that our site can handle it. Also note that GoogleBot is running out of Seattle pages to crawl. Strangely, it tends to go to sleep around midnight PST.

Not everyone at Google is so polite. For a brief period last week Google’s mobile crawler was hitting our site with over 7000 requests per hour, or roughly two requests per second.

Other Robots

GoogleBot hits us far more than any other robot. To put this in perspective, here is a graph of noticeable robots hitting Urbanspoon recently. GoogleBot dominates. Maybe Yahoo should just throw in the towel and start using Nutch for their crawls.

Traffic

Our referral rate from Google is increasing rapidly but not uniformly. For example, the graph below indicates that we have more work to do in Chicago:

emacs dotfiles 2007-01-20

Saturday, January 20th, 2007

It’s time for another dotfile release. This release includes some fixes for emacs 22, and a significant improvement in abtags. Download the dotfiles here:

Adam’s Emacs Dotfiles

From the changelog:

2007-01-20
- completion fixes for emacs 22 compat
- changed nxml indent to 2
- mapped html/sgml to nxml mode
- *.rake => ruby mode
- abtags auto-reloads TAGS files now
- finally tracked down and fixed pesky loaddefs issue

Turn Off Rails Sessions for Robots

Monday, January 8th, 2007

Urbanspoon is already attracting a sizable amount of traffic, and we expect our numbers to grow rapidly now that we’ve launched Chicago and New York. Urbanspoon is regularly crawled by a large number of robots seeking to index our site.

Some of our pages squirrel information away inside the Rails session. For example, we keep track of recently visited restaurants so that we can guide users back to those restaurants when they return. This is handy if, for example, you always order pizza from one or two restaurants.

Imagine if Googlebot crawled each of our 35,000 restaurants each day. Each time the bot hits a restaurant we would attempt to record a “restaurant visit” in the session. Since robots generally don’t use cookies, that would create 35,000 useless sessions each day. Wouldn’t it be nice to suppress these sessions entirely?

I wrote a helper function called is_megatron? to detect if a request’s User-Agent indicates that the request is from a robot. The regular expression catches most of the bot traffic that hits our site:

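class Util
  # Returns the index of the match (truthy) if the User-Agent looks like
  # a known robot, or nil otherwise.
  def self.is_megatron?(user_agent)
    user_agent =~ /\b(Baidu|Gigabot|Googlebot|libwww-perl|lwp-trivial|msnbot|SiteUptime|Slurp|WordPress|ZIBB|ZyBorg)\b/i
  end
end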
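Here is a quick way to try the helper outside of Rails. The User-Agent strings are sample values picked for illustration, not lines taken from our logs. Note that =~ returns the position of the match or nil rather than true or false, which is all a condition needs.

```ruby
class Util
  def self.is_megatron?(user_agent)
    user_agent =~ /\b(Baidu|Gigabot|Googlebot|libwww-perl|lwp-trivial|msnbot|SiteUptime|Slurp|WordPress|ZIBB|ZyBorg)\b/i
  end
end

# Sample User-Agent strings (illustrative only).
googlebot = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
slurp     = "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
firefox   = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.1) Gecko/20061204 Firefox/2.0.0.1"

[googlebot, slurp, firefox].each do |agent|
  # The match result is an Integer or nil, so test it for truthiness.
  label = Util.is_megatron?(agent) ? "robot" : "human"
  puts "#{label}: #{agent}"
end
```

The first two strings are reported as robots and the Firefox string as human. A robot that sends a browser-like User-Agent will slip through, which is why the pattern only catches most of the bot traffic.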

If we determine that a request appears to be from a robot, we simply disable session support for the current request:

class ApplicationController < ActionController::Base
  # turn off sessions if this is a request from a robot
  session :off, :if => proc { |request| Util.is_megatron?(request.user_agent) }

  …
end