Ruby Nearest Neighbor (fast kdtree gem)

UPDATED 10/17/12: kdtree 0.3 just released. This blog post is out of date. Check out kdtree on github for the latest.

At Urbanspoon, we often have to find restaurants that are close to a given lat/lon. This general problem is called Nearest Neighbor, and we’ve solved it in a variety of different ways as the company grew and our requirements changed. Read on.

Attempt #1: naive db query

Back when we only covered Seattle, we used a simple db query to find the nearest restaurants. We would find the closest restaurants with manhattan distance, then sort by great circle distance. Pseudocode:

list = select * from table order by manhattan(lat,lon) limit 50
list = list.sort_by { |i| great_circle(i,lat,lon) }

Sadly, this was really slow and didn’t scale beyond 10,000 rows. It worked great for Seattle, but New York alone has 24k restaurants.

Attempt #2: db query w/static box

Our second attempt was the same as #1, but we bounded the query with a box of size 0.02 degrees. Because the db only has to sift through a tiny fraction of the data, we get a huge improvement in performance. An index on lat/lon is required. Pseudocode:

add_index :table, [:lat, :lon]

list = select * from table where box(lat, lon, 0.02)
   order by manhattan(lat,lon) limit 50
list = list.sort_by { |i| great_circle(i,lat,lon) }

But that begs the question – how big should the box be? The box should be big enough to snag at least 50 restaurants, but not so big that we accidentally have to examine 1,000. 0.02 degrees might work great inside a city, but it’s totally inappropriate for less dense areas.

Attempt #3: db query w/ dynamic box

I decided to pre-calculate a density map for each of our metropolitan areas. Each city was mapped onto a 200×200 grid, and restaurant counts were placed into each cell of the grid. This was expensive, but the map can be cached and recalculated once per day.

Given a lat/lon, we use the grid to figure out how big the box needs to be to snag the desired number of restaurants. The code is pretty similar:

list = select * from table where box(lat, lon, density(lat, lon, 50))
   order by manhattan(lat,lon) limit 50
list = list.sort_by { |i| great_circle(i,lat,lon) }

My grid worked great, but it was complicated and a memory hog. Also, because we use a per-city grid it tends to fall apart for queries that are far away from cities. Sometimes it’s difficult to figure out which city a lat/lon should map into.

This is especially annoying for folks that want to use Urbanspoon out on the road.

Attempt #4: much head scratching

What I really wanted was a search that worked across our entire data set, without requiring a bounding box or a per-city constraint. I tried a few things that didn’t work:

  • mysql spatial extensions were a bit faster, but I still needed the bounding box, which was the cause of many of our problems.
  • sphinx lat/lon queries. We already use sphinx for our fulltext searches, so it was easy to experiment with its geo support. Sadly, performance was much slower than the db box query.

I was getting desperate.

Attempt #5: kd tree?

At various times along this winding road, I considered using a kd tree. A kd tree is a data structure that recursively partitions the world in order to rapidly answer nearest neighbor queries. A generic kd tree can support any number of dimensions, and can return either the nearest neighbor or a set of N nearest neighbors.

There were many stumbling blocks. My ruby implementation was slow and ate a lot of memory. I could’ve created one tree per city, but that wouldn’t have solved the basic problem. There were some very powerful C implementations floating around (see libkdtree), but they were difficult to incorporate into our Rails app. Also, strangely, the C implementations still seemed slow, probably because they were written to be totally generic.

Perhaps I could get better results on my own. It was time for some major hacking…

Introducing the kdtree gem.

I created a kdtree gem. It’s very specific to the problem I was trying to solve – two dimensional nearest neighbor searches that run in front of a db. Check out the performance on my aging 2.4ghz AMD machine. using a tree with 1 million points:

build 10.5s
nearest point 0.000009s
nearest 5 points 0.000011s
nearest 50 points 0.000031s
nearest 255 points 0.000140s

The API is very simple:

  • – construct a new tree. Each point should be of the form [x, y, id], where x/y are floats and id is an int. Not a string, not an object, just an int.
  • kd.nearest(x, y) – find the nearest point. Returns an id.
  • kd.nearestk(x, y, k) – find the nearest k points. Returns an array of ids.

Also, I made it possible to persist the tree to disk and load it later. That way you can calculate the tree offline and load it quickly at some future point. Loading a persisted tree w/ 1 millions points takes less than a second, as opposed to the 10.5 second startup time shown above. For example:"treefile", "w") { |f| kd.persist(f) }
... later ...
kd2 ="treefile") { |f| }

Caveats and limitations:

  • No editing allowed! Once you construct a tree you’re stuck with it.
  • The tree is stored in one big memory block, 20 bytes per point. A tree with one million points will allocate a single 19mb block to store its nodes.
  • Persisted trees are architecture dependent, and may not work across different machines due to endian issues.
  • nearestk is limited to 255 results, again due to my laziness.

This is my first attempt at writing a gem, and I’m sure I’ve messed it up badly. I’m releasing this under the MIT License. Download it here:

UPDATED 10/17/12: kdtree 0.3 just released. This blog post is out of date. Check out kdtree on github for the latest.

8 thoughts on “Ruby Nearest Neighbor (fast kdtree gem)

  1. Pingback: KD-Trees and MagLev « Maglevity

  2. Is it just me, or is your kdtree_nearest0 function buggy? See this line:

    float ad = (depth % 2) ? (x – n->x) : (y – n->y);

    Shouldn’t the values of the ternary operator be swapped? The kd-tree was constructed with depth 0 being the x-axis, 1 being y, etc.

  3. Very nice article, congrats!!! I recently came back to be interested on DataStructures and efficient Algorithms, and I was really curious to see how persist a KDTree (I need it to index spatial points, as you did, even if for a different domain).
    My question is: what’s the use case? I mean, when performing the searches, you read the whole index then perform the search, or you are able to search directly from the file?
    Many thanks in advance!

  4. First of all, I like your blog and how you share your experience.

    I’m working in a company like yours, we work with hotels instead of restaurants and we find the same problems that I read here.

    I was searching some info about kdtress for storing points and I found this post, I was trying to save 500k hotel images points descriptors (SURF) inside a DB using a KDTree for easy searching them, now we are only able to know if we have similar images inside a hotel set of images and get the best ones, we usually get 50/150 repeated photos of a hotel from 13 hotel databases services and we select the 20 best ones and not repeated using SURF and a c++ application I wrote, works fine but now I want to store the descriptors inside a kdtree, for some experiments I want to do. Your post is very useful for me (And the blog :P).

    About the point problem we search hotels within a circle, and then within polygons inside the circle, so we can get fast hotels within a very precise and defined region in a small db with not more than 80k rows and 3k polygons:

    This is the code I wrote in javascript to test the theory:

    In some months we are going to have more rows so I will think about working with KDTress to optimize the search times.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s