Teleport – opinionated server setup with Ruby

I just released the first version of Teleport, a Ruby gem that makes it easy to set up Ubuntu servers. It’s conceptually based on work that I first did at Urbanspoon, then radically updated for Cubeduel and then Dwellable. Details here:

https://github.com/rglabs/teleport

Here’s a sample Telfile:

user "admin"
ruby "1.9.2"
apt "deb http://downloads-distro.mongodb.org/repo/ubuntu-upstart dist 10gen", :key => "7F0CAB10"    
role :app, :packages => [:memcached]
role :db, :packages => [:mongodb-10gen]
server "server_app1", :role => :app
server "server_db1", :role => :db    
packages [:atop, :emacs, :gcc]
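
Reading it top to bottom: create an admin user, install Ruby 1.9.2, add the 10gen apt source (with its signing key) so MongoDB can be installed, define an app role and a db role with their own packages, assign each server to a role, and install a few packages everywhere. Point Teleport at one of those servers and it brings the box in line with the Telfile.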

MySQL query logging with mysql-proxy and Lua

Wow, that’s a long title. I just upgraded my wp theme to a wider layout, and I’m wallowing in the available space.

Anyway, this week I was tracking down a tricky mysql CPU issue and I wanted to take a closer look at what mysql was working on. Initially, I experimented with mk-query-digest --processlist, which was completely awesome but couldn’t really give me the full picture because it polls. Finally I decided that I wanted to log all the queries going through mysql. This can be done with mysql-proxy, but it took me a while to put all the pieces together.

Here’s what I ended up doing:

  • Install mysql-proxy. This was relatively easy, though I had to use an older version on our production machine. Don’t forget to edit /etc/default/mysql-proxy if you’re running on Debian/Ubuntu!
  • Create a lua script to log all the queries. There were a few different scripts in the repository, but I ended up crafting my own. My version has a few important features:
    • No buffering. This is important when your server is running with many threads.
    • Tab delimited, with whitespace normalization. Duh.
    • Logs connect/disconnect. Handy for tracking down cron jobs gone wild.
    • Logs query execution time. I wasn’t looking for long running queries, just general load issues. Note that queries are written to the log when they complete.

    Here’s the lua code:

    
    local fh = io.open("/tmp/mysql_proxy.log", "a+")
    fh:setvbuf("no")
    
    -- write one tab-delimited line: timestamp, thread id, query time, text
    function log_line(s, tm)
       if tm == nil then
          tm = 0
       end
       s = string.gsub(s, "[\n\t ]+", " ")
       s = string.format("%s\t%d\t%d\t%s\n", os.date("%Y-%m-%d:%H:%M:%S"),
                         proxy.connection.server.thread_id, tm, s)
       fh:write(s)
    end
    
    function read_handshake(auth)
       if auth then
          log_line("CONNECT " .. auth.client_addr, nil) -- for 0.6.1
       else
          log_line("CONNECT " .. proxy.connection.client.src.name, nil)
       end
    end
    
    function read_query(packet)
       -- only intercept COM_QUERY packets; strip the leading command byte
       if string.byte(packet) == proxy.COM_QUERY then
          proxy.queries:append(1, string.sub(packet, 2))
       end
    end
    
    function read_query_result(i)
       log_line(i.query, i.query_time)
    end
    
    function disconnect_client()
       log_line("DISCONNECT", nil)
    end
    
  • There’s a great recipe for turning on the proxy with an iptables rule that redirects external connections from port 3306 to port 4040. I wrote a script to turn the rule on and off. Without the script, I kept installing the rule twice and confusing myself. Here’s the script:
    
    #!/usr/bin/ruby -w
    
    # turn mysql-proxy iptables rule on/off
    
    def run(command)
      system(command)
      raise "#{command} failed : #{$?}" if $? != 0
    end
    
    if `whoami`.strip != 'root'
      puts "must be run as root"
      exit 1
    end
    
    # is it on or off?
    rules = `iptables -t nat -L`
    is_on = rules =~ /^REDIRECT .* 4040/
    puts "mysql_proxy_iptables is currently #{is_on ? 'ON' : 'OFF'}."
    
    case ARGV.first
    when "on"
      if !is_on
        puts "Turning on..."
        run "iptables -t nat -I PREROUTING ! -s 127.0.0.1 -p tcp --dport mysql -j REDIRECT --to-ports 4040"
      end
    when "off"
      if is_on
        puts "Turning off..."    
        run "iptables -t nat -D PREROUTING ! -s 127.0.0.1 -p tcp --dport mysql -j REDIRECT --to-ports 4040"
      end
    else
      puts "Usage: mysql_proxy_iptables [on|off]"
      exit 1
    end
    
  • Note that existing connections won’t go through the proxy – the iptables rule only applies to new connections. You’ll have to restart Rails to use the proxy.

Here’s some sample output:

2009-11-25:14:02:18	2470	0	CONNECT 127.0.0.1:55509
2009-11-25:14:02:18	2470	253	select @@version_comment limit 1
2009-11-25:14:02:20	2470	365	SELECT DATABASE()
2009-11-25:14:02:20	2470	547	show databases
2009-11-25:14:02:20	2470	100371	show tables
2009-11-25:14:02:22	2470	1147	select * from user
2009-11-25:14:02:40	2470	489	select * from event
2009-11-25:14:02:42	2470	19067	select * from db
2009-11-25:14:02:46	2470	0	DISCONNECT
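
Once the log is in place, it’s easy to slice. For example, here’s a small Ruby sketch (mine, not part of the original setup) that groups queries by shape and totals their reported query time, assuming the tab-delimited format above:

# rough sketch: aggregate /tmp/mysql_proxy.log by query "shape"
# columns are: timestamp, thread id, query time, query text
totals = Hash.new(0)
counts = Hash.new(0)
File.foreach("/tmp/mysql_proxy.log") do |line|
  _time, _thread_id, query_time, query = line.chomp.split("\t", 4)
  next if query.nil? || query =~ /^(CONNECT|DISCONNECT)/
  # strip literals so similar queries group together
  shape = query.gsub(/'[^']*'/, "'S'").gsub(/\b\d+\b/, "N")
  totals[shape] += query_time.to_i
  counts[shape] += 1
end
totals.sort_by { |shape, total| -total }.first(20).each do |shape, total|
  printf("%12d %8d  %s\n", total, counts[shape], shape[0, 80])
end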

Ruby Nearest Neighbor (fast kdtree gem)

UPDATED 10/17/12: kdtree 0.3 just released. This blog post is out of date. Check out kdtree on github for the latest.

At Urbanspoon, we often have to find restaurants that are close to a given lat/lon. This general problem is called Nearest Neighbor, and we’ve solved it in a variety of different ways as the company grew and our requirements changed. Read on.

Attempt #1: naive db query

Back when we only covered Seattle, we used a simple db query to find the nearest restaurants. We would find the closest restaurants with manhattan distance, then sort by great circle distance. Pseudocode:

list = select * from table order by manhattan(lat,lon) limit 50
list = list.sort_by { |i| great_circle(i,lat,lon) }
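
For reference, here’s roughly what those two helpers compute – a sketch in plain Ruby, not the actual Urbanspoon code (in the real query the manhattan part lived in SQL):

# manhattan distance in degrees - cheap, and fine for a rough ordering
def manhattan(i, lat, lon)
  (i.lat - lat).abs + (i.lon - lon).abs
end

# great circle distance in miles via the haversine formula
def great_circle(i, lat, lon)
  radius = 3959.0 # mean earth radius in miles
  dlat = (i.lat - lat) * Math::PI / 180
  dlon = (i.lon - lon) * Math::PI / 180
  a = Math.sin(dlat / 2) ** 2 +
      Math.cos(lat * Math::PI / 180) * Math.cos(i.lat * Math::PI / 180) *
      Math.sin(dlon / 2) ** 2
  2 * radius * Math.asin(Math.sqrt(a))
end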

Sadly, this was really slow and didn’t scale beyond 10,000 rows. It worked great for Seattle, but New York alone has 24k restaurants.

Attempt #2: db query w/static box

Our second attempt was the same as #1, but we bounded the query with a box of size 0.02 degrees. Because the db only has to sift through a tiny fraction of the data, we get a huge improvement in performance. An index on lat/lon is required. Pseudocode:

add_index :table, [:lat, :lon]

list = select * from table where box(lat, lon, 0.02)
   order by manhattan(lat,lon) limit 50
list = list.sort_by { |i| great_circle(i,lat,lon) }
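
For the curious, box() is nothing fancy – just a pair of BETWEEN clauses on the indexed columns. A sketch (the helper and its conditions-array style are mine):

# SQL fragment for a square box around (lat, lon); size is the
# half-width of the box in degrees
def box(lat, lon, size)
  ["lat BETWEEN ? AND ? AND lon BETWEEN ? AND ?",
   lat - size, lat + size, lon - size, lon + size]
end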

But that raises the question – how big should the box be? The box should be big enough to snag at least 50 restaurants, but not so big that we accidentally have to examine 1,000. 0.02 degrees might work great inside a city, but it’s totally inappropriate for less dense areas.

Attempt #3: db query w/ dynamic box

I decided to pre-calculate a density map for each of our metropolitan areas. Each city was mapped onto a 200×200 grid, and restaurant counts were placed into each cell of the grid. This was expensive, but the map can be cached and recalculated once per day.

Given a lat/lon, we use the grid to figure out how big the box needs to be to snag the desired number of restaurants. The code is pretty similar:

list = select * from table where box(lat, lon, density(lat, lon, 50))
   order by manhattan(lat,lon) limit 50
list = list.sort_by { |i| great_circle(i,lat,lon) }
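
To make the grid concrete, here’s a heavily simplified sketch of what density() might look like – my own reconstruction, not the production code. It grows a square of grid cells around the query point until the counts add up to the desired number of restaurants, then converts that radius back into degrees:

GRID = 200

# grid is a city's GRID x GRID array of restaurant counts, and
# (min_lat, min_lon, max_lat, max_lon) are that city's bounds.
# Returns a box size in degrees that should cover ~want restaurants.
def density(grid, min_lat, min_lon, max_lat, max_lon, lat, lon, want)
  cell_lat = (max_lat - min_lat) / GRID
  cell_lon = (max_lon - min_lon) / GRID
  row = [[((lat - min_lat) / cell_lat).to_i, 0].max, GRID - 1].min
  col = [[((lon - min_lon) / cell_lon).to_i, 0].max, GRID - 1].min

  # grow a square neighborhood of cells until it holds enough restaurants
  (0...GRID).each do |radius|
    count = 0
    ([row - radius, 0].max..[row + radius, GRID - 1].min).each do |r|
      ([col - radius, 0].max..[col + radius, GRID - 1].min).each do |c|
        count += grid[r][c]
      end
    end
    return (radius + 1) * [cell_lat, cell_lon].max if count >= want
  end
  [max_lat - min_lat, max_lon - min_lon].max # fall back to the whole city
end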

My grid worked great, but it was complicated and a memory hog. Also, because we use a per-city grid it tends to fall apart for queries that are far away from cities. Sometimes it’s difficult to figure out which city a lat/lon should map into.

This is especially annoying for folks that want to use Urbanspoon out on the road.

Attempt #4: much head scratching

What I really wanted was a search that worked across our entire data set, without requiring a bounding box or a per-city constraint. I tried a few things that didn’t work:

  • mysql spatial extensions were a bit faster, but I still needed the bounding box, which was the cause of many of our problems.
  • sphinx lat/lon queries. We already use sphinx for our fulltext searches, so it was easy to experiment with its geo support. Sadly, performance was much slower than the db box query.

I was getting desperate.

Attempt #5: kd tree?

At various times along this winding road, I considered using a kd tree. A kd tree is a data structure that recursively partitions the world in order to rapidly answer nearest neighbor queries. A generic kd tree can support any number of dimensions, and can return either the nearest neighbor or a set of N nearest neighbors.

There were many stumbling blocks. My ruby implementation was slow and ate a lot of memory. I could’ve created one tree per city, but that wouldn’t have solved the basic problem. There were some very powerful C implementations floating around (see libkdtree), but they were difficult to incorporate into our Rails app. Also, strangely, the C implementations still seemed slow, probably because they were written to be totally generic.

Perhaps I could get better results on my own. It was time for some major hacking…

Introducing the kdtree gem.

I created a kdtree gem. It’s very specific to the problem I was trying to solve – two dimensional nearest neighbor searches that run in front of a db. Check out the performance on my aging 2.4GHz AMD machine, using a tree with 1 million points:

build 10.5s
nearest point 0.000009s
nearest 5 points 0.000011s
nearest 50 points 0.000031s
nearest 255 points 0.000140s

The API is very simple:

  • KDTree.new(points) – construct a new tree. Each point should be of the form [x, y, id], where x/y are floats and id is an int. Not a string, not an object, just an int.
  • kd.nearest(x, y) – find the nearest point. Returns an id.
  • kd.nearestk(x, y, k) – find the nearest k points. Returns an array of ids.
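
Putting that together, a minimal usage sketch (the points and ids are made up):

require 'kdtree'

# [x, y, id] triples - lat/lon plus a restaurant id
points = [
  [47.61, -122.33, 1],
  [47.66, -122.31, 2],
  [40.75,  -73.99, 3]
]
kd = KDTree.new(points)

kd.nearest(47.6, -122.3)     # => 1
kd.nearestk(47.6, -122.3, 2) # => ids of the two closest points, e.g. [1, 2]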

Also, I made it possible to persist the tree to disk and load it later. That way you can calculate the tree offline and load it quickly at some future point. Loading a persisted tree w/ 1 million points takes less than a second, as opposed to the 10.5 second startup time shown above. For example:

File.open("treefile", "w") { |f| kd.persist(f) }
... later ...
kd2 = File.open("treefile") { |f| KDTree.new(f) }

Caveats and limitations:

  • No editing allowed! Once you construct a tree you’re stuck with it.
  • The tree is stored in one big memory block, 20 bytes per point. A tree with one million points will allocate a single 19mb block to store its nodes.
  • Persisted trees are architecture dependent, and may not work across different machines due to endian issues.
  • nearestk is limited to 255 results, again due to my laziness.

This is my first attempt at writing a gem, and I’m sure I’ve messed it up badly. I’m releasing this under the MIT License. Download it here:

UPDATED 10/17/12: kdtree 0.3 just released. This blog post is out of date. Check out kdtree on github for the latest.

Text Wizardry: Ten Commands

Do you process log files, spreadsheets, or XML as part of your engineering work? You too can become a text wizard by mastering these ten commands. Learn them one by one, then mix and match for best results.

1. cat

First, you need to get some text in your shell. Use cat to output a text file.

$ cat access.log

1.2.3.4 urbanspoon.com - [15/Aug/2008:06:25:50 +0000] "GET /r/16/
1.2.3.4 urbanspoon.com - [15/Aug/2008:06:25:51 +0000] "GET /r/16/
1.2.3.4 urbanspoon.com - [15/Aug/2008:06:25:51 +0000] "POST /e/vo
1.2.3.4 urbanspoon.com - [15/Aug/2008:06:25:51 +0000] "GET /r/15/
1.2.3.4 urbanspoon.com - [15/Aug/2008:06:25:51 +0000] "GET /m/r/1

2. grep/zgrep

Filters text and picks out just the bits that you care about. zgrep does the same, but works on files that might be compressed. That means that you can run zgrep on a bunch of files, including both compressed and uncompressed data.

Arguments that you care about

  • -i : case insensitive
  • -v : show lines that DON’T match
  • -s : ignore errors (useful in conjunction with find/xargs)
  • -E : search for ‘extended’ regular expressions. Necessary for all but the most trivial greps.
  • -F : search for text, not a regex
  • -o : only print the matching text, not the whole line
  • -n : include line numbers, useful for emacs
  • -c : show a count of matches, not the matches
  • -h : suppress filename
  • -R : grep files and directories recursively. Also see find, down below.
$ cat access.log | grep 50

1.2.3.4 urbanspoon.com - [15/Aug/2008:06:25:50 +0000] "GET /r/16/
1.2.3.4 urbanspoon.com - [15/Aug/2008:06:26:50 +0000] "GET /r/4/5
1.2.3.4 urbanspoon.com - [15/Aug/2008:06:26:50 +0000] "GET /r/4/5
1.2.3.4 urbanspoon.com - [15/Aug/2008:06:26:50 +0000] "GET /fn/52
1.2.3.4 urbanspoon.com - [15/Aug/2008:06:27:50 +0000] "GET /m/ci/
$ cat access.log | grep -Eo "Aug/[0-9]+"

Aug/2008
Aug/2008
Aug/2008
Aug/2008
Aug/2008
$ cat access.log | grep -v GET

1.2.3.4 urbanspoon.com - [15/Aug/2008:06:25:51 +0000] "POST /e/vo
1.2.3.4 urbanspoon.com - [15/Aug/2008:06:25:57 +0000] "POST /e/ku
1.2.3.4 urbanspoon.com - [15/Aug/2008:06:26:26 +0000] "POST /u/ac
1.2.3.4 urbanspoon.com - [15/Aug/2008:06:26:26 +0000] "POST /u/ac
1.2.3.4 urbanspoon.com - [15/Aug/2008:06:26:30 +0000] "POST /u/ac

3. sort

Sorts lines of text. Sort is fast, flexible, and well designed. It also contains possibly the most well-thought-out option ever created in the history of command line tools: -n.

Arguments that you care about

  • -n : sort numerically (you’ll be hearing more about this later)
  • -r : reverse the sort order
  • -k : sort based on delimited columns, not just lines. Useful for spreadsheets.
$ cat access.log | grep -Eo 'GET [^"]+' | sort

GET /a/1/Seattle-at-night.html HTTP/1.0
GET /about HTTP/1.0
GET /b/favorites/1/4845/Seattle-restaurants HTTP/1.0
GET /b/favorites/3/245/New-York-restaurants HTTP/1.0
GET /blog/ HTTP/1.0
$ cat access.log | grep -Eo 'GET [^"]+' | sort -r

GET /zip/8/77539/Houston-restaurants.html HTTP/1.0
GET /zip/8/77406/Houston-restaurants.html?sort=2 HTTP/1.0
GET /w/feed/reviews/3/New-York/rss.xml HTTP/1.0
GET /u/friends/29012 HTTP/1.0
GET /u/favorites/29012 HTTP/1.0

4. uniq

Omits adjacent repeated lines, which is why you usually sort first. By itself this is marginally useful, but with -c and sort, it’s magic.

Arguments that you care about

  • -c : show number of occurrences

For example, here’s a poor man’s google analytics:

$ cat access.log | grep -Eo 'GET [^"]+' | sort | uniq -c | sort -nr

    183 GET /m/u/friends HTTP/1.0
     60 GET / HTTP/1.0
     42 GET /m/ci/6 HTTP/1.0
     40 GET /r/15/191259/restaurant/Southeast/Peking-Express
     39 GET /m/u/suggest_city HTTP/1.0

5. cut

Incredibly useful for slicing and dicing lines. Cut by delimiter or position. cut uses TAB as its default delimiter, which is handy because I can never remember how to get a TAB character into bash.

Arguments that you care about

  • -d : set the delimiter
  • -f : pick the fields to output. -f1,3 outputs fields 1 & 3. -f1-3 outputs fields 1, 2 & 3.
  • -c : or, pick the chars to output

Here’s how you can get the dates from your log file, if you ever wanted to do such a thing. Note the use of quotes to surround the space character, which I’m using as a delimiter.

$ cat access.log | cut -d" " -f4-5

[15/Aug/2008:06:25:50 +0000]
[15/Aug/2008:06:25:51 +0000]
[15/Aug/2008:06:25:51 +0000]
[15/Aug/2008:06:25:51 +0000]
[15/Aug/2008:06:25:51 +0000]

Want to see a breakdown by hour? Just cut out the hour, sort and uniq -c:

$ cat access.log |  cut -d":" -f2 | sort | uniq -c

  17798 14
  22160 15
  26415 16
  23181 17
  30535 18

6. wc

Counts lines. Actually, it counts lots of things, but mostly I use it for lines. When combined with sort and uniq, you can get a count of unique things. For example, here’s the number of unique user agents in my log file:

Arguments that you care about

  • -l : just show the line count
$ cat access.log | cut -d'"' -f6 | sort | uniq | wc -l

8539

7. head/tail

Given some incoming text, just look at the start or end of the file. This is useful if you have lots of data but only want to look at a subset. The examples aren’t that interesting, but here’s how you use them:

Show the first 100 lines:

$ cat access.log | head -100

Show the last 100 lines:

$ cat access.log | tail -100

Tail also has a completely different mode where it “watches” a file and outputs lines that are written to the end. You can “tail your log file” to see what’s happening on your server:

$ tail -f access.log

Want to see if anyone is hitting that new page? Combine tail -f with grep:

$ tail -f access.log | grep cool_new_feature

8. find

Walks your disk and prints out file names. Again, not particularly useful by itself, but tremendously useful when combined with xargs (below) and grep. Unfortunately, find has a somewhat confusing command line syntax. Here are the two flavors you should learn:

Find all files and directories:

$ find

.
./Makefile.in
./config.log
./COPYING
./COMPILING

Find files only, not directories. Note that “.” means “the current directory”. So, we’re really saying “find all files in the current directory”.

$ find . -type f

./Makefile.in
./config.log
./COPYING
./COMPILING
./autogen.sh

Combine with grep to find c++ files.

$ find | grep cpp

./src/tankai/TankAIComputerTarget.cpp
./src/tankai/TankAIAdder.cpp
./src/tankai/TankAIComputer.cpp
./src/tankai/TankAIStrings.cpp
./src/tankai/TankAIComputerBuyer.cpp

9. xargs

This is a weird one, but it’s essential if you want to become a true text wizard. Basically, xargs takes a bunch of text lines and sends them as command line arguments to some other command. Why would you want this? Here are a few ideas off the top of my head:

Grep for some text in your .cpp files, but don’t grep inside the .o files

$ find | grep cpp | xargs grep strcpy

Download urls listed in a text file

$ cat urls.txt | xargs curl

Zip up your files, but exclude subversion cruft

$ find | grep -v .svn | xargs zip source.zip

10. less

Less is a text viewer. Use it to page through a file or the output from a command. You can search the contents or jump around with hotkeys. Here are the keys I use most often:

  • space : next page
  • b: previous page
  • < : jump to start of file
  • > : jump to end of file
  • / : search forward for regex
  • ? : search backward for regex

This is especially useful if you’re building up a complicated string of commands. Add less to the end of each sequence to preview the results:

$ cat access.log | less
$ cat access.log | grep -Eo 'GET [^"]+' | less
$ cat access.log | grep -Eo 'GET [^"]+' | sort | less
$ cat access.log | grep -Eo 'GET [^"]+' | sort | uniq -c | less
...

What next?

The true power of the command line comes not from any individual command, but from knowing how to chain them together. We’ve really just scratched the surface here. Once you’ve mastered these commands, here are a few others to check out:

  • convert – image processing
  • curl – fetch web pages
  • dc – calculator
  • strings – pull strings out of a binary file
  • sum – checksum a file
  • watch – run something every few seconds and look for changes
  • xxd – hex dump a file

You’ll also want to learn a bit about redirection, bash’s command line history, and the wonders of ctrl-r.

Google Chart Tips for Ruby Hackers

Recently we’ve been experimenting with Google Charts on Urbanspoon. Their API is well designed and easy to use, but it’s still nontrivial to produce good looking graphs for arbitrary data. Here are some suggestions for my fellow ruby hackers:

1. Nice Numbers for Graph Labels

The classic “Nice Numbers for Graph Labels” Graphics Gem by Paul Heckbert will generate a series of good looking axis labels given a min and max value. It works with floats as well as integers.


[Chart: automatic “nice labels” on the y axis]

I ported it to Ruby:

# From the "Nice Numbers for Graph Labels" graphics gem by Paul
# Heckbert
def nicenum(x, round)
  expv = Math.log10(x).floor.to_f
  f = x / (10 ** expv)
  if round
    if f < 1.5
      nf = 1
    elsif f < 3
      nf = 2
    elsif f < 7
      nf = 5
    else
      nf = 10
    end
  else
    if f <= 1
      nf = 1
    elsif f <= 2
      nf = 2
    elsif f <= 5
      nf = 5
    else
      nf = 10
    end
  end
  nf * (10 ** expv)
end

def loose_label(options = {})
  min, max = options[:min], options[:max]
  ticks = options[:ticks] || 5
  
  range = nicenum(max - min, false)
  d = nicenum(range / (ticks - 1), true)
  
  {
    :min => (min / d).floor * d,
    :max => (max / d).ceil * d,
    :increment => d
  }
end

For example, if your data set ranges from 23-65 and you want to have
five axis labels, you could do something like this:

puts loose_label(:min => 23, :max => 65, :ticks => 5).inspect

and it would suggest this for your axis labels:

{ :min => 20.0, :max => 70.0, :increment => 10.0 }

To generate the actual labels, use something like the code below. Again, this is cribbed from the original Graphics Gem:

loose = loose_label(:min => 23, :max => 65, :ticks => 5)
ymin, ymax = loose[:min], loose[:max]
d = loose[:increment]
nfrac = -Math.log10(d).floor
nfrac = 0 if nfrac < 0
ylabels = []
i = ymin
while i < ymax + 0.5 * d
  ylabels << sprintf("%.#{nfrac}f", i)
  i += d
end
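
For the 23–65 example, that yields ylabels == ["20", "30", "40", "50", "60", "70"], ready to be passed along as axis labels.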

2. Add a trailing average

Here’s some code to calculate a trailing average from the previous 7 data points. The initial segment of the trailing average is calculated by averaging the data available up to that point.

trailing = 7
sum = 0.0
tdata = []
data.each_with_index do |i, index|
  count = nil
  sum += i
  if index < trailing
    count = index + 1
  else
    count = trailing
    sum -= data[index - trailing]
  end
  avg = (sum / count).to_i
  tdata << avg
end
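
As a quick sanity check, running this over data = (1..10).to_a produces tdata == [1, 1, 2, 2, 3, 3, 4, 5, 6, 7].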

3. Use the golden ratio

The human eye finds a certain aspect ratio naturally appealing. Namely, The Golden Ratio. If I have enough space to work with, I want my graphs to use that aspect ratio by default. That’s why I set up my API like this:

GOLDEN = 1.61803399

def chart(options = {})
  # calculate width/height
  width = options.delete(:width) || 300
  height = options.delete(:height) || (width / GOLDEN)
  ...

4. Consider using gchartrb

gchartrb is a ruby gem that wraps the Google Charts API. I haven’t used it personally but it looks great.

ActiveRecord Table Transform (or, how to write to the db 27,000 times)

At Urbanspoon, we use pretty urls for our pages to make them more palatable to users and search engines. Here’s an example:

http://www.urbanspoon.com/r/1/55069/Seattle/Fremont/Baguette-Box.html

These beautiful urls are slightly expensive to generate, since we have to “prettify” text by stripping whitespace and replacing accent characters. A few weeks back, I finally bit the bullet and started caching our pretty urls in the db instead of in memory. I lazily populate the url column for each restaurant, so we’re gradually filling in the data as users hit the server.

Then I dug into the code that generates our sitemap. For the uninitiated, a sitemap is an XML file describing every page on the server. Naturally, in order to generate this file we have to write out the pretty urls for each restaurant.

Of our ~100,000 restaurants, roughly 27,000 hadn’t yet cached their pretty urls in the db. I naively used my lazy pretty url generator, which ended up sending 27,000 individual writes to the db. It took approximately 9 MINUTES to complete, with the CPU pegged the entire time.

It would be much better to do something like the following:

  1. Create a temp table with (id, url).
  2. Bulk insert to populate the temp table.
  3. Update the restaurants table from the temp table.

I implemented my new scheme and the running time went from 9 minutes to 24 SECONDS. I liked this approach so much I decided to generalize it as ActiveRecord::Base.transform. Sample usage:

# if users don't have names, give them a random one
NAMES = ['Adam', 'Ethan', 'Patrick']
User.transform(:name, :conditions => 'name is null') do |i|
  i.name = NAMES[rand(NAMES.length)]
end

This will use a bulk transform to update all users at once instead of each user individually.

Note that this has only been tested with MySQL, and is unlikely to work out of the box with other databases. Check it out:

# helper for quickly transforming an entire table using a temp table,
# a bulk insert, and an update
class ActiveRecord::Base
  def self.transform(cols, options = {})
    # pull our options out so the rest can be handed straight to find
    temp_name = options.delete(:temp_name) || "temp_transform_table"
    temp_options = options.delete(:temp_options) || "character set utf8 collate utf8_general_ci"

    # munge cols into real column objects
    cols = [cols] if !cols.is_a?(Array)
    cols = cols.map { |i| i.to_s }

    cols.delete("id")
    cols.unshift("id")

    cols = cols.map { |i| columns_hash[i] || raise("column #{i} not found") }

    # load/transform
    rows = find(:all, options)
    return if rows.empty?
    rows.each { |i| yield(i) }

    # create the temp table
    cols_create = cols.map { |i| "#{i.name} #{i.sql_type}" }
    connection.execute("CREATE TEMPORARY TABLE #{temp_name} (#{cols_create.join(',')}) #{temp_options}")
    
    # bulk insert
    data = rows.map do |r|
      values = cols.map { |c| connection.quote(r[c.name], c) }
      "(#{values.join(',')})"
    end
    connection.execute("INSERT INTO #{temp_name} values #{data.join(',')}")

    # save
    cols_equal = cols.map { |i| "#{table_name}.#{i.name} = #{temp_name}.#{i.name}" }
    connection.execute("UPDATE #{table_name}, #{temp_name} SET #{cols_equal[1..-1].join(', ')} WHERE #{cols_equal.first}")
    
    connection.execute("DROP TEMPORARY TABLE #{temp_name}")
  end
end

Yahoo Slurp Makes a Mess

For months we’ve been carefully watching how the various bots consume Urbanspoon. We enticed them inside with fresh content, well constructed pages, and sitemaps. Despite our efforts, until quite recently Yahoo Slurp didn’t have much of an appetite for Urbanspoon. Instead of digging in and indexing the whole site, Yahoo Slurp preferred to nibble around the edges.

That is, until June 16th. Notice anything odd?

[Chart: Yahoo Slurp requests to Urbanspoon.com]

Someone flipped a switch down there in Sunnyvale and the Yahoo Slurp bot suddenly decided that it loved Urbanspoon.

For comparison, check out the Metamucil-like regularity of the Google bot:

[Chart: Google Bot requests to Urbanspoon.com]

Let’s dig in and take a closer look at those two bots. Ready… fight!

                         Yahoo Slurp (June 16-18)    Google Bot (June 16-18)
total hits               194,464                     41,941
pages                    120,076 (38% dups)          41,332 (1.4% dups)
robots.txt violations    32                          27

restaurant pages         85,396                      22,999
neighborhood pages       1,008                       1,366
cuisine pages            995                         980

New York restaurants     26,599                      6,573
LA restaurants           23,002                      4,659
SF restaurants           20,152                      2,751
Seattle restaurants      7,817                       2,584
Boston restaurants       7,124                       1,987
Chicago restaurants      534                         2,365
DC restaurants           109                         2,245

Yahoo Slurp Duplicates

Yahoo Slurp requested many pages more than once. In fact, Yahoo Slurp was unable to resist certain pages, compulsively returning to them again and again. Here are the pages that the bot seemed to find tastiest:

requests  page
     419  /robots.txt
     294  /choose
     273  /
      21  /c/3/New-York.html
      19  /c/5/Los-Angeles.html
      16  /a/3/New-York-at-night.html
      15  /c/1/Seattle.html
      14  /c/2/Chicago.html
      11  /u/create (and this page is blocked via robots.txt!)

I’ll spare you the other 73,306 duplicates requested by Yahoo Slurp.

Directory Crawling

Strangely, the Yahoo Slurp bot likes to explore the directories leading up to each page. For example, in addition to indexing our Sitka & Spruce page, Yahoo Slurp also tried to hit each of the directories leading up to that page:

/r/1/1084/Seattle/Eastlake-Lake-Union/Sitka-Spruce.html
/r/1/1084/Seattle/Eastlake-Lake-Union/
/r/1/1084/
/r/1/
/r/

Those URLs aren’t linked anywhere from our site. Each of them (correctly) redirects elsewhere. Why did Yahoo choose to crawl them?

Yahoo Slurp – A Sloppy Eater

We’re quite flattered by the attention, but Yahoo’s Slurp bot made a bit of a mess. I can forgive the robots.txt violations, since other bots share this transgression. The directory walking thing is bizarre, but won’t hurt our search engine results due to our clever defensive redirects.

The 38% dup rate is just plain sloppy. Really, this is not how we want to spend our precious CPU cycles and bandwidth. I’ve written a few indexing systems myself and I know that these problems are challenging, but the market leader seems to have solved them nicely.

It remains to be seen if Yahoo’s aggressive indexing will lead to a commensurate increase in traffic from Yahoo. Stay tuned!