Archive for October, 2009

Ruby Nearest Neighbor (fast kdtree gem)

Thursday, October 22nd, 2009

At Urbanspoon, we often have to find restaurants that are close to a given lat/lon. This general problem is called Nearest Neighbor, and we’ve solved it in a variety of different ways as the company grew and our requirements changed. Read on.

Attempt #1: naive db query

Back when we only covered Seattle, we used a simple db query to find the nearest restaurants. We would find the closest restaurants with manhattan distance, then sort by great circle distance. Pseudocode:

list = select * from table order by manhattan(lat,lon) limit 50
list = list.sort_by { |i| great_circle(i,lat,lon) }

Sadly, this was really slow and didn’t scale beyond 10,000 rows. It worked great for Seattle, but New York alone has 24k restaurants.

Attempt #2: db query w/static box

Our second attempt was the same as #1, but we bounded the query with a box of size 0.02 degrees. Because the db only has to sift through a tiny fraction of the data, we get a huge improvement in performance. An index on lat/lon is required. Pseudocode:

add_index :table, [:lat, :lon]
list = select * from table where box(lat, lon, 0.02)
   order by manhattan(lat,lon) limit 50
list = list.sort_by { |i| great_circle(i,lat,lon) }

But that begs the question – how big should the box be? The box should be big enough to snag at least 50 restaurants, but not so big that we accidentally have to examine 1,000. 0.02 degrees might work great inside a city, but it’s totally inappropriate for less dense areas.

Attempt #3: db query w/ dynamic box

I decided to pre-calculate a density map for each of our metropolitan areas. Each city was mapped onto a 200×200 grid, and restaurant counts were placed into each cell of the grid. This was expensive, but the map can be cached and recalculated once per day.

Given a lat/lon, we use the grid to figure out how big the box needs to be to snag the desired number of restaurants. The code is pretty similar:

list = select * from table where box(lat, lon, density(lat, lon, 50))
   order by manhattan(lat,lon) limit 50
list = list.sort_by { |i| great_circle(i,lat,lon) }

My grid worked great, but it was complicated and a memory hog. Also, because we use a per-city grid it tends to fall apart for queries that are far away from cities. Sometimes it’s difficult to figure out which city a lat/lon should map into.

This is especially annoying for folks that want to use Urbanspoon out on the road.

Attempt #4: much head scratching

What I really wanted was a search that worked across our entire data set, without requiring a bounding box or a per-city constraint. I tried a few things that didn’t work:

  • mysql spatial extensions were a bit faster, but I still needed the bounding box, which was the cause of many of our problems.
  • sphinx lat/lon queries. We already use sphinx for our fulltext searches, so it was easy to experiment with its geo support. Sadly, performance was much slower than the db box query.

I was getting desperate.

Attempt #5: kd tree?

At various times along this winding road, I considered using a kd tree. A kd tree is a data structure that recursively partitions the world in order to rapidly answer nearest neighbor queries. A generic kd tree can support any number of dimensions, and can return either the nearest neighbor or a set of N nearest neighbors.

There were many stumbling blocks. My ruby implementation was slow and ate a lot of memory. I could’ve created one tree per city, but that wouldn’t have solved the basic problem. There were some very powerful C implementations floating around (see libkdtree), but they were difficult to incorporate into our Rails app. Also, strangely, the C implementations still seemed slow, probably because they were written to be totally generic.

Perhaps I could get better results on my own. It was time for some major hacking…

Introducing the kdtree gem.

I created a kdtree gem. It’s very specific to the problem I was trying to solve – two dimensional nearest neighbor searches that run in front of a db. Check out the performance on my aging 2.4ghz AMD machine. using a tree with 1 million points:

build 10.5s
nearest point 0.000009s
nearest 5 points 0.000011s
nearest 50 points 0.000031s
nearest 255 points 0.000140s

The API is very simple:

  • KDTree.new(points) – construct a new tree. Each point should be of the form [x, y, id], where x/y are floats and id is an int. Not a string, not an object, just an int.
  • kd.nearest(x, y) – find the nearest point. Returns an id.
  • kd.nearestk(x, y, k) – find the nearest k points. Returns an array of ids.

Also, I made it possible to persist the tree to disk and load it later. That way you can calculate the tree offline and load it quickly at some future point. Loading a persisted tree w/ 1 millions points takes less than a second, as opposed to the 10.5 second startup time shown above. For example:

File.open("treefile", "w") { |f| kd.persist(f) }
... later ...
kd2 = File.open("treefile") { |f| KDTree.new(f) }

Caveats and limitations:

  • Not thread safe. In fact, due to my laziness it uses a single static block for storing results. You should only use one kdtree at a time!
  • No editing allowed! Once you construct a tree you’re stuck with it.
  • The tree is stored in one big memory block, 20 bytes per point. A tree with one million points will allocate a single 19mb block to store its nodes.
  • Persisted trees are architecture dependent, and may not work across different machines due to endian issues.
  • nearestk is limited to 255 results, again due to my laziness.
  • Tested on Mac & Linux, w/ Ruby 1.8.5-1.8.7.

This is my first attempt at writing a gem, and I’m sure I’ve messed it up badly. I’m releasing this under the MIT License. Download it here:

kdtree-0.1.gem
kdtree-0.1.tar.gz (source)

Feedback welcome.