Archive for the 'Uncategorized' Category

« Previous Entries

Google Chart Tips for Ruby Hackers

Wednesday, April 2nd, 2008

Recently we’ve been experimenting with Google Charts on Urbanspoon. Their API is well designed and easy to use, but it’s still nontrivial to produce good looking graphs for arbitrary data. Here are some suggestions for my fellow ruby hackers:

1. Nice Numbers for Graph Labels

The classic “Nice Numbers for Graph Labels” Graphics Gem by Paul Heckbert will generate a series of good looking axis labels given a min and max value. It works with floats as well as integers.




Automatic “nice labels” on the y axis

I ported it to Ruby:

# From the "Nice Numbers for Graph Labels" graphics gem by Paul
# Heckbert
def nicenum(x, round)
  expv = Math.log10(x).floor.to_f
  f = x / (10 ** expv)
  if round
    if f < 1.5
      nf = 1
    elsif f < 3
      nf = 2
    elsif f < 7
      nf = 5
    else
      nf = 10
    end
  else
    if f <= 1
      nf = 1
    elsif f <= 2
      nf = 2
    elsif f <= 5
      nf = 5
    else
      nf = 10
    end
  end
  nf * (10 ** expv)
end

def loose_label(options = {})
  min, max = options[:min], options[:max]
  ticks = options[:ticks] || 5

  range = nicenum(max - min, false);
  d = nicenum(range / (ticks - 1), true);

  {
    :min => (min / d).floor * d,
    :max => (max / d).ceil * d,
    :increment => d
  }
end

For example, if your data set ranges from 23-65 and you want to have
five axis labels, you could do something like this:

puts loose_label(:min => 23, :max => 65, :ticks => 5).inspect

and it would suggest this for your axis labels:

{ :min => 20.0, :max => 70.0, :increment => 10.0 }

To generate the actual labels, use something like the code below. Again, this is cribbed from the original Graphics Gem:

loose = loose_label(:min => 23, :max => 65, :ticks => 5)
ymin, ymax = loose[:min], loose[:max]
d = loose[:increment]
nfrac = -Math.log10(d).floor
nfrac = 0 if nfrac < 0
ylabels = []
i = ymin
while i < ymax + 0.5 * d
  ylabels << sprintf("%.#{nfrac}f", i)
  i += d
end

2. Add a trailing average

Here’s some code to calculate a trailing average from the previous 7 data points. The initial segment of the trailing average is calculated by averaging the data available up to that point.

trailing = 7
sum = 0.0
tdata = []
data.each_with_index do |i, index|
  count = nil
  sum += i
  if index < trailing
    count = index + 1
  else
    count = trailing
    sum -= data[index - trailing]
  end
  avg = (sum / count).to_i
  tdata << avg
end

3. Use the golden ratio

The human eye finds a certain aspect ratio naturally appealing. Namely, The Golden Ratio. If I have enough space to work with, I want my graphs to use that aspect ratio by default. That’s why I set up my api like this:

GOLDEN = 1.61803399

def chart(options = {})
  # calculate width/height
  width = options.delete(:width) || 300
  height = options.delete(:height) || (width / GOLDEN)
  …

4. Consider using gchartrb

gchartrb is a ruby gem that wraps the Google Charts API. I haven’t used it personally but it looks great.

ActiveRecord Table Transform (or, how to write to the db 27,000 times)

Tuesday, July 31st, 2007

At Urbanspoon, we use pretty urls for our pages to make them more palatable to users and search engines. Here’s an example:

http://www.urbanspoon.com/r/1/55069/Seattle/Fremont/Baguette-Box.html

These beautiful urls are slightly expensive to generate, since we have to “prettify” text by stripping whitespace and replacing accent characters. A few weeks back, I finally bit the bullet and started caching our pretty urls in the db instead of in memory. I lazily populate the url column for each restaurant, so we’re gradually filling in the data as users hit the server.

Then I dug into the code that generates our sitemap. For the uninitiated, a sitemap is an XML file describing every page on the server. Naturally, in order to generate this file we have to write out the pretty urls for each restaurant.

Of our ~100,000 restaurants, approximately ~27,000 hadn’t yet cached their pretty urls in the db. I naively used my lazy pretty url generator, which ended up sending 27,000 individual writes to the db. It took approximately 9 MINUTES to complete, with the CPU pegged the entire time.

It would be much better to do something like the following:

  1. Create a temp table with (id, url).
  2. Bulk insert to populate the temp table.
  3. Update the restaurants table from the temp table.

I implemented my new scheme and running time went from 9 minutes to 24 SECONDS. I liked this approach so much I decided to generalize it as ActiveRecord::Base.transform. Sample usage:

# if users don’t have names, give them a random one
NAMES = [‘Adam’, ‘Ethan’, ‘Patrick’]
User.transform(:name, :conditions => ‘name is null’).each do |i|
  i.name = NAMES[rand * NAMES.length]
end

This will use a bulk transform to update all users at once instead of each user individually.

Note that this has only been tested with Mysql, and is unlikely to work out of the box with other databases. Check it out:

# helper for quickly transforming an entire table using a temp table,
# a bulk insert, and an update
class ActiveRecord::Base
  def self.transform(cols, options = {})
    temp_name = options[:temp_name] || "temp_transform_table"
    temp_options = options[:temp_options] || "character set utf8 collate utf8_general_ci"

    # munge cols into real column objects
    cols = [cols] if !cols.is_a?(Array)
    cols = cols.map { |i| i.to_s }

    cols.delete("id")
    cols.unshift("id")

    cols = cols.map { |i| columns_hash[i] || raise("column #{i} not found") }

    # load/transform
    rows = find(:all, options)
    return if rows.empty?
    rows.each { |i| yield(i) }

    # create the temp table
    cols_create = cols.map { |i| "#{i.name} #{i.sql_type}" }
    connection.execute("CREATE TEMPORARY TABLE #{temp_name} (#{cols_create.join(’,')}) #{temp_options}")

    # bulk insert
    data = rows.map do |r|
      values = cols.map { |c| connection.quote(r[c.name], c) }
      "(#{values.join(’,')})"
    end
    connection.execute("INSERT INTO #{temp_name} values #{data.join(’,')}")

    # save
    cols_equal = cols.map { |i| "#{table_name}.#{i.name} = #{temp_name}.#{i.name}" }
    connection.execute("UPDATE #{table_name}, #{temp_name} SET #{cols_equal[1..-1].join(’, ‘)} WHERE #{cols_equal.first}")

    connection.execute("DROP TEMPORARY TABLE #{temp_name}")
  end
end

Yahoo Slurp Makes a Mess

Wednesday, June 27th, 2007

For months we’ve been carefully watching how the various bots consume Urbanspoon. We enticed them inside with fresh content, well constructed pages, and sitemaps. Despite our efforts, until quite recently Yahoo Slurp didn’t have much of an appetite for Urbanspoon. Instead of digging in and indexing the whole site, Yahoo Slurp preferred to nibble around the edges.

That is, until June 16th. Notice anything odd?

Yahoo Slurp Requests to Urbanspoon.com

Someone flipped a switch down there in Sunnyvale and the Yahoo Slurp bot suddenly decided that it loved Urbanspoon.

For comparison, check out the Metamucil-like regularity of the Google bot:

Google Bot Requests to Urbanspoon.com

Let’s dig in and take a closer look at those two bots. Ready… fight!

Yahoo Slurp (June 16-18) Google Bot (June 16-18)
194,464 total hits
120,076 pages (38% dups)
32 violations of robots.txt

85,396 restaurant pages
1,008 neighborhood pages
995 cuisine pages

26,599 New York restaurants
23,002 LA restaurants
20,152 SF restaurants
7,817 Seattle restaurants
7,124 Boston restaurants
534 Chicago restaurants
109 DC restaurants

41,941 total hits
41,332 pages (1.4% dups)
27 violations of robots.txt

22,999 restaurant pages
1,366 neighborhood pages
980 cuisines

6,573 New York restaurants
4,659 LA restaurants
2,751 SF restaurants
2,584 Seattle restaurants
1,987 Boston restaurants
2,365 Chicago restaurants
2,245 DC restaurants

Yahoo Slurp Duplicates

Yahoo Slurp requested many pages more than once. In fact, Yahoo Slurp was unable to resist certain pages, compulsively returning to them again and again. Here are the pages that the bot seemed to find tastiest:

# of requests page
419 /robots.txt
294 /choose
273 /
21 /c/3/New-York.html
19 /c/5/Los-Angeles.html
16 /a/3/New-York-at-night.html
15 /c/1/Seattle.html
14 /c/2/Chicago.html
11 /u/create (and this page is blocked via robots.txt!)

I’ll spare you the other 73,306 duplicates requested by Yahoo Slurp.

Directory Crawling

Strangely, the Yahoo Slurp bot likes to explore the directories leading up to each page. For example, in addition to indexing our Sitka & Spruce page, Yahoo Slurp also tried to hit each of the directories leading up to that page:

/r/1/1084/Seattle/Eastlake-Lake-Union/Sitka-Spruce.html
/r/1/1084/Seattle/Eastlake-Lake-Union/
/r/1/1084/
/r/1/
/r/

Those URLs aren’t linked anywhere from our site. Each of them (correctly) redirects elsewhere. Why did Yahoo choose to crawl them?

Yahoo Slurp - A Sloppy Eater

We’re quite flattered by the attention, but Yahoo’s Slurp bot made a bit of a mess. I can forgive the robots.txt violations, since other bots share this transgression. The directory walking thing is bizarre, but won’t hurt our search engine results due to our clever defensive redirects.

The 38% dup rate is just plain sloppy. Really, this is not how we want to spend our precious CPU cycles and bandwidth. I’ve written a few indexing systems myself and I know that these problems are challenging, but the market leader seems to have solved them nicely.

It remains to be seen if Yahoo’s aggressive indexing will lead to a commensurate increase in traffic from Yahoo. Stay tuned!

Hashiness

Sunday, April 8th, 2007

Back in college, the killer Intro to CS class used a home grown object-oriented version of Pascal. It was a bit like Borland’s Pascal, except it ran on Solaris and the IDE was about 100x slower. We quickly covered some programming fundamentals, then dutifully moved on to inheritance and polymorphism. One particularly grueling assignment involved writing a linked list where the nodes used polymorphism instead of conditionals.

Turns out that in the real world you never want to write a linked list with polymorphism, but the lesson obviously struck a nerve. Since then I’ve pretty much used objected oriented languages exclusively. For the purposes of this blog post I’m skipping over our flirtations with C and optimized MMX instructions at Strangeberry, as well as a nightmarish Perl project at Jobster.

I can finally report that my long honeymoon with OOP is coming to an end. These days, I use the Hash much more often than I use the Class. The Hash has a certain appeal that it’s hard to resist.

The Rise of the Hash

Maybe it’s because I spend so much of my time these days on data manipulation and deployment scripts. Maybe it’s frustration with poor design. Maybe it’s because machines have simply gotten fast enough to enable my inherent laziness. Maybe it’s because of YAML.

Whatever the reason, I simply love Hash tables. I can’t get enough of them. I use Hash tables for complex method arguments, just like the rest of Rails. I’ll use a Hash table instead of a Class for as long as possible, right up until I absolutely need to add a method. Hash tables are so syntactically light and malleable, it feels like a real sacrifice to switch to a full blown Class.

Ruby’s Hash

I especially love the somewhat obscure Ruby feature that allows you to attach a block to Hash. The block gets called whenever a new element is created. I talk about this a bit in my Ruby at 60 post, but I wanted to give a different example here. I often find that I need to partition a data set based on a key. Imagine a set of employees, each with a role:

employees =
  [
   { :role => ‘ceo’, :name => ‘Mr. Burns’ },
   { :role => ‘underling’, :name => ‘Smithers’ },
   { :role => ’slave’, :name => ‘Homer’ },
   { :role => ’slave’, :name => ‘Lenny’ },
   { :role => ’slave’, :name => ‘Carl’ },
   …
  ]

The following snippet breaks up the employees by role:

by_role = Hash.new { |hash, key| hash[key] = [] }
employees.each { |i| by_role[i[:role]] << i }

We use techniques like this all over the place for managing data sets that aren’t in a database.

Hashiness

Using Hash tables allows me to experience a sense of satisfaction that I call Hashiness.

The authors of ActiveRecord clearly were going for Hashiness. Ditto for sessions/params in Rails. There are oodles of Classes in Rails that masquerade as Hashes. Many of the essential Rails methods take a Hash as a parameter, and I’ve lifted this pattern for my own work.

I claim that Hashiness makes me a more productive engineer, and makes Urbanspoon a better product. Are you getting enough Hashiness?

Rails expire_fragment(regex) Considered Harmful

Sunday, February 4th, 2007

Recently I discovered that one of our Urbanspoon actions was taking nearly two seconds to complete. This particular page stuck out like a sore thumb once I start crunching the numbers contained in our production log file. The slowness was especially puzzling because the action in question seemed to be one of the simplest actions in the entire application. I quickly determined that:

  • The slowdown wasn’t in page rendering.
  • The slowdown wasn’t coming from the db.
  • I couldn’t make it happen on my dev box.
  • It occurred even if the production machine wasn’t busy.

So, where was the problem?

In my previous Ruby at 60 post, I mentioned that Ruby’s dynamic nature can make it hard to figure out what’s happening beneath the covers. This debugging session was a perfect example. Over the course of an hour or two, I laboriously inserted benchmarking code into various bits of Rails. First validation. Then ActiveRecord callbacks. Maybe the problem was inside the logger? Unfortunately, it’s not easy to find all the major pieces of Rails that affect each request.

Eventually I was able to trace the slowdown to the following line of code in one of our cache sweepers:

expire_fragment(/base\/xyz.*/)

At the moment we use File-based Fragment Caching to speed up Urbanspoon. Some of our pages have a cached section at the top (base/xyz_top), a dynamic section in the middle, and another cached section at the bottom (xyz_bottom). When the underlying data changes, a cache sweeper would jump in and expire the two cached sections using a regex, base/xyz.*.

Silly me, for some reason I thought that FileStore would expire the regex as follows:

  1. Look in the base directory.
  2. Find all files that match xyz.*
  3. Delete them.

I couldn’t be more wrong. Instead, the code in UnthreadedFileStore works more like this:

  1. Iterate every single file in the fragment cache.
  2. Delete files which match base/xyz.*

Our production server’s fragment cache usually contains in excess of 5,000 cached fragment files. Every time this action was invoked we were iterating all of them. Ouch!

The bug was easy to fix - simply replace the regex with two separate calls to expire_fragment, one for the top fragment and one for the bottom. Somewhere in the back of my mind I knew I’d have to make this change eventually, since we’ll be switching to memcached in the not-so-distant future. I just didn’t anticipate the fire drill.

Anyway, take my advice. Avoid expire_fragment(regex). It’s seductive if you have multiple fragments to expire, but it’ll cost you in the long run.

Google in Action and Other Graphs

Tuesday, January 23rd, 2007

In my endless quest to learn more about Urbanspoon’s explosive growth, I put together a script to generate graphs illustrating various aspects of our traffic. There are many interesting questions that we can now answer:

  • Which cities are getting the most traffic from search engines?
  • How long does it take google to index a new Urbanspoon city?
  • etc.

I whipped up a script that periodically crunches our logs offline and creates graphs using the excellent (but cryptic) rrdtool. The graphs are generated on the hour. awstats is nice, but sometimes you have to dig in and get your hands dirty. We also use munin to keep an eye on our hardware.

GoogleBot

Below is a recent snapshot of GoogleBot crawling Urbanspoon. Green is Seattle, blue is Chicago, and red is New York. X axis is time, Y axis is pages per minute. I’ve removed the Y labels to obscure our actual numbers.

Notice the flat tops on each bulge of GoogleBot traffic - GoogleBot caps its crawl rate at a certain number of pages per minute. Over the past few weeks they’ve been gradually ramping up the rate at which they crawl Urbanspoon. Perhaps they looked at our response times and concluded that our site can handle it. Also note that they’re running out of Seattle pages to crawl. Strangely, GoogleBot tends to go to sleep around midnight PST.

Not everyone at Google is so polite. For a brief period last week Google’s mobile crawler was hitting our site with over 7000 requests per hour.

Other Robots

GoogleBot hits us far more than any other robot. To put this in perspective, here is a graph of noticeable robots hitting Urbanspoon recently. GoogleBot dominates. Maybe Yahoo should just throw in the towel and start using Nutch for their crawls.

Traffic

Our referral rate from Google is increasing rapidly but not uniformly. For example, the graph below indicates that we have more work to do in Chicago:

emacs dotfiles 2007-01-20

Saturday, January 20th, 2007

It’s time for another dotfile release. This release includes some fixes for emacs 22, and a significant improvement in abtags. Download the dotfiles here:

Adam’s Emacs Dotfiles

From the changelog:

2007-01-20
- completion fixes for emacs 22 compat
- changed nxml indent to 2
- mapped html/sgml to nxml mode
- *.rake => ruby mode
- abtags auto-reloads TAGS files now
- finally tracked down and fixed pesky loaddefs issue

Turn Off Rails Sessions for Robots

Monday, January 8th, 2007

Urbanspoon is already attracting a sizable amount of traffic, and we expect our numbers to grow rapidly now that we’ve launched Chicago and New York. Urbanspoon is regularly crawled by a large number of robots seeking to index our site.

Some of our pages squirrel information away inside the Rails session. For example, we keep track of recently visited restaurants so that we can guide users back to those restaurants when they return. This is handy if, for example, you always order pizza from one or two restaurants.

Imagine if Googlebot crawled each of our 35,000 restaurants each day. Each time the bot hits a restaurant we would attempt to record a “restaurant visit” in the session. Since robots generally don’t use cookies, that would create 35,000 useless sessions each day. Wouldn’t it be nice to suppress these sessions entirely?

I wrote a helper function called is_megatron? to detect if a request’s User-Agent indicates that the request is from a robot. The regular expression catches most of the bot traffic that hits our site:

class Util
  def Util.is_megatron?(user_agent)
    user_agent =~ /\b(Baidu|Gigabot|Googlebot|libwww-perl|lwp-trivial|msnbot|SiteUptime|Slurp|WordPress|ZIBB|ZyBorg)\b/i
  end
end

If we determine that a request appears to be from a robot, we simply disable session support for the current request:

class ApplicationController < ActionController::Base
  # turn off sessions if this is a request from a robot
  session :off, :if => proc { |request| Util.is_megatron?(request.user_agent) }

  …
end

Gupta-Sproull Antialiased Lines

Wednesday, November 29th, 2006

Six months ago, I managed to generate the following table. It’s not remotely relevant to Urbanspoon.

{ 0xc7, 0xc6, 0xc2, 0xbc,
  0xb3, 0xa9, 0x9c, 0x8e,
  0x7f, 0x70, 0x62, 0x54,
  0x46, 0x3a, 0x2f, 0x25,
  0x1c, 0x15, 0x0e, 0x09,
  0x05, 0x03, 0x01, 0x00 }

This is the coverage table for rendering antialiased lines using the Gupta-Sproull algorithm. Apparently, this little table is included in their original paper, first published in 1981. I couldn’t find the coverage table online. I eventually bought a copy of the paper from the ACM for $10, but it didn’t include the damned table!

Finally, I bit the bullet, sat down and read the paper. Then I hacked together a Java class to generate the table. I wasted two full days of my short life on those 24 numbers.

You’ll thank me later.

I spent two weeks trying to efficiently render antialiased lines as part of a consulting project for a consumer electronics company. During those two weeks I was completely obsessed with line rendering. I raved about it to my friends, long after they’d stopped listening. I dreamed of lines. I learned a lot about line rendering, most of which I’ve forgotten.

I eventually implemented Wu lines with a Gupta-Sproull style coverage table. In addition to the body pixel coverage table above, I also calculated an endpoint coverage table. Divide a pixel into a 16×16 grid. Move the fractional line endpoint to the nearest grid location, round the slope to the nearest 1/16, and then perform a lookup in the table. The lookup returns the coverage for the 6 pixels surrounding the endpoint. For those of you following along at home, the table is 16*16*16*6 bytes, or 24k.

The original paper dealt with integer (not fractional) endpoints. The authors suggested that it might be possible to support fractional endpoints by generating a table similar to the one I created, but that it would be too big for practical use. That was in 1981. Here in 2006, even portable consumer electronics devices can spare 24k. The Gupta-Sproull lines were much faster than agg’s lines, and looked just as good.

I came up with a great hack to generate the table. I drew some really fat lines in Java and then manually measured the coverage! A picture is worth a thousand words:

The moral of the story? A mediocre pixel tweaker like me needs to piggyback on 25 years of Moore’s Law in order to implement a basic line rendering algorithm.

Complex SQL Sorts with Rails/ActiveRecord

Saturday, November 18th, 2006

On Urbanspoon, we often need to efficiently sort a subset of records that were retrieved using a different sort order. For example, our Most Popular Restaurants in Seattle page first selects the top 100 restaurants, then allows the user to sort by name or price.

We need to perform a sort, then perform a secondary sort on a subset of the results. Here is an example of a two step sort in action:


unsorted

sort by popularity

sort by price, but
only the first 100

I tried various implementations:

  • Perform the second sort in Ruby. This is inelegant, inefficient, and impractical for expensive sorts like distance in miles.
  • Use a subselect to get the ids in your :conditions. Unfortunately, Mysql doesn’t support LIMIT in subselects. This also breaks sql_calc_found_rows, which we use with some of our complicated sorts.
  • First select the ids, then manually construct the :conditions from the ids.

After several fumbling attempts, I eventually settled on the last approach. This technique requires an additional query, but it let’s you use things like ActiveRecord’s :include, which is essential for some of our pages.

1. Perform the first sort with :limit, and grab the IDs

First, perform the sort with :limit and grab the IDs. For example:

r = Restaurant.find(:all,
                     :order => :popularity,
                     :limit => 100)
ids = r.map { |i| i.id }

Better yet, let’s just select the ids instead of populating the entire restaurant object:

r = Restaurant.find(:all,
                     :select => ‘id’,
                     :order => :popularity,
                     :limit => 100)
ids = r.map { |i| i.id }

If you’re a freak like me, you might want to get rid of some of that ActiveRecord overhead. Why bother creating those Restaurant objects at all? We can avoid creating Restaurant objects if we write a bit of SQL:

r = ActiveRecord::Base.connection.select_all("select id from restaurants order by popularity limit 100")
ids = r.map { |i| i[‘id’] }

Hm. There must be a better way. Can we use ActiveRecord to write the SQL, but avoid creating the restaurant objects? You bet, if we call send on ActiveRecord::Base’s construct_finder_sql method. This is perfect for my purposes, because my SQL skills are pretty weak. I can use ActiveRecord to write the SQL, but avoid the unnecessary overhead of creating all those objects.

options = {
  :select => ‘id’,
  :order => :popularity,
  :limit => 100
}
sql = Restaurant.send(:construct_finder_sql, options)
r = ActiveRecord::Base.connection.select_all(sql)
ids = r.map { |i| i[‘id’] }

2. Sort the subset

Now that we have the IDs of our subset, we can sort it using ActiveRecord:

Restaurant.find(:all,
                 :conditions => "id in (#{ids.join(’,')})",
                 :order => ‘price’)

We can also take advantage of :include to populate our objects with everything we need.

You’ll Like It

The example above is a bit contrived, but I needed this technique to efficiently render many of the pages on Urbanspoon. Enjoy!