<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>gurge.com</title>
	<atom:link href="http://gurge.com/blog/feed/" rel="self" type="application/rss+xml" />
	<link>http://gurge.com/blog</link>
	<description></description>
	<lastBuildDate>Wed, 25 Nov 2009 22:03:54 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.6</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>mysql query logging with mysql proxy and lua</title>
		<link>http://gurge.com/blog/2009/11/25/mysql-query-logging-with-mysql-proxy-and-lua/</link>
		<comments>http://gurge.com/blog/2009/11/25/mysql-query-logging-with-mysql-proxy-and-lua/#comments</comments>
		<pubDate>Wed, 25 Nov 2009 21:33:56 +0000</pubDate>
		<dc:creator>amd</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://gurge.com/blog/?p=90</guid>
		<description><![CDATA[Wow, that&#8217;s a long title. I just upgraded my wp theme to a wider layout, and I&#8217;m wallowing in the available space.
Anyway, this week I was tracking down a tricky mysql CPU issue and I wanted to take a closer look at what mysql was working on. Initially, I experimented with mk-query-digest --processlist, which was [...]]]></description>
			<content:encoded><![CDATA[<p>Wow, that&#8217;s a long title. I just upgraded my wp theme to a wider layout, and I&#8217;m wallowing in the available space.</p>
<p>Anyway, this week I was tracking down a tricky mysql CPU issue and I wanted to take a closer look at what mysql was working on. Initially, I experimented with <a href="http://www.maatkit.org/doc/mk-query-digest.html">mk-query-digest</a> --processlist, which was completely awesome but couldn&#8217;t really give me the full picture because it polls. Finally I decided that I wanted to log all the queries going through mysql. This can easily be done with mysql-proxy, but it took me a while to put all the pieces together.</p>
<p>Here&#8217;s what I ended up doing:</p>
<ul>
<li>Install mysql-proxy. This was relatively easy, though I had to use an older version on our production machine. Don&#8217;t forget to edit /etc/default/mysql-proxy if you&#8217;re running on Debian/Ubuntu!</li>
<li>Create a lua script to log all the queries. There were a few different scripts in the <a href="http://forge.mysql.com/tools/search.php?t=tag&#038;k=mysqlproxy">repository</a>, but I ended up crafting my own. My version has a few important features:
<ul>
<li>No buffering. This is important when your server is running with many threads.</li>
<li>Tab delimited, with whitespace normalization. Duh.</li>
<li>Logs connect/disconnect. Handy for tracking down cron jobs gone wild.</li>
<li>Logs query execution time. I wasn&#8217;t looking for long running queries, just general load issues. Note that queries are written to the log <b style="color:red">when they complete</b>.</li>
</ul>
<p>Here&#8217;s the lua code:</p>
<pre class="emacs">

<span class="keyword">local</span> fh = io.open(<span class="string">"/tmp/mysql_proxy.log"</span>, <span class="string">"a+"</span>)
fh:setvbuf(<span class="string">"no"</span>)

<span class="keyword">function</span> <span class="function-name">log_line</span>(s, tm)
   <span class="keyword">if</span> tm == <span class="keyword">nil</span> <span class="keyword">then</span>
      tm = 0
   <span class="keyword">end</span>
   s = string.gsub(s, <span class="string">"[\n\t ]+"</span>, <span class="string">" "</span>)
   s = string.format(<span class="string">"%s\t%d\t%d\t%s\n"</span>, os.date(<span class="string">"%Y-%m-%d:%H:%M:%S"</span>),
                     proxy.connection.server.thread_id, tm, s)
   fh:write(s)
<span class="keyword">end</span>

<span class="keyword">function</span> <span class="function-name">read_handshake</span>(auth)
   <span class="keyword">if</span> auth <span class="keyword">then</span>
      log_line(<span class="string">"CONNECT "</span> .. auth.client_addr, <span class="keyword">nil</span>) <span class="comment-delimiter">--</span><span class="comment"> for 0.6.1
</span>   <span class="keyword">else</span>
      log_line(<span class="string">"CONNECT "</span> .. proxy.connection.client.src.name, <span class="keyword">nil</span>)
   <span class="keyword">end</span>
<span class="keyword">end</span>

<span class="keyword">function</span> <span class="function-name">read_query</span>(packet)
   <span class="keyword">if</span> string.byte(packet) == proxy.COM_QUERY <span class="keyword">then</span>
      proxy.queries:append(1, string.sub(packet, 2))
   <span class="keyword">end</span>
<span class="keyword">end</span>

<span class="keyword">function</span> <span class="function-name">read_query_result</span>(i)
   log_line(i.query, i.query_time)
<span class="keyword">end</span>

<span class="keyword">function</span> <span class="function-name">disconnect_client</span>()
   log_line(<span class="string">"DISCONNECT"</span>, <span class="keyword">nil</span>)
<span class="keyword">end</span>
</pre>
</li>
<li>There&#8217;s a <a href="http://dev.mysql.com/tech-resources/articles/proxy-gettingstarted.html">great recipe for turning on the proxy with an iptables rule</a> that redirects external connections from port 3306 to port 4040. I wrote a script to turn the rule on and off. Without the script, I kept installing the rule twice and confusing myself. Here&#8217;s the script:
<pre class="emacs">

<span class="comment-delimiter">#</span><span class="comment">!/usr/bin/ruby -w
</span>
<span class="comment-delimiter"># </span><span class="comment">turn mysql-proxy iptables rule on/off
</span>
<span class="keyword">def</span> <span class="function-name">run</span>(command)
  system(command)
  <span class="keyword">raise</span> <span class="string">"</span><span class="variable-name">#{command}</span><span class="string"> failed : </span><span class="variable-name">#{$?}</span><span class="string">"</span> <span class="keyword">if</span> <span class="variable-name">$?</span> != 0
<span class="keyword">end</span>

<span class="keyword">if</span> <span class="string">`whoami`</span>.strip != <span class="string">'root'</span>
  puts <span class="string">"must be run as root"</span>
  exit 1
<span class="keyword">end</span>

<span class="comment-delimiter"># </span><span class="comment">is it on or off?
</span>rules = <span class="string">`iptables -t nat -L`</span>
is_on = rules =~ <span class="string">/^REDIRECT .* 4040/</span>
puts <span class="string">"mysql_proxy_iptables is currently </span><span class="variable-name">#{is_on ? 'ON' : 'OFF'}</span><span class="string">."</span>

<span class="keyword">case</span> <span class="type">ARGV</span>.first
<span class="keyword">when</span> <span class="string">"on"</span>
  <span class="keyword">if</span> !is_on
    puts <span class="string">"Turning on..."</span>
    run <span class="string">"iptables -t nat -I PREROUTING ! -s 127.0.0.1 -p tcp --dport mysql -j REDIRECT --to-ports 4040"</span>
  <span class="keyword">end</span>
<span class="keyword">when</span> <span class="string">"off"</span>
  <span class="keyword">if</span> is_on
    puts <span class="string">"Turning off..."</span>
    run <span class="string">"iptables -t nat -D PREROUTING ! -s 127.0.0.1 -p tcp --dport mysql -j REDIRECT --to-ports 4040"</span>
  <span class="keyword">end</span>
<span class="keyword">else</span>
  puts <span class="string">"Usage: mysql_proxy_iptables [on|off]"</span>
  exit 1
<span class="keyword">end</span>
</pre>
</li>
<li>Note that <b style="color:red">existing connections won&#8217;t go through the proxy</b> &#8211; the iptables rule only applies to new connections. You&#8217;ll have to restart rails to use the proxy.</li>
</ul>
<p>Here&#8217;s some sample output:</p>
<pre>
2009-11-25:14:02:18	2470	0	CONNECT 127.0.0.1:55509
2009-11-25:14:02:18	2470	253	select @@version_comment limit 1
2009-11-25:14:02:20	2470	365	SELECT DATABASE()
2009-11-25:14:02:20	2470	547	show databases
2009-11-25:14:02:20	2470	100371	show tables
2009-11-25:14:02:22	2470	1147	select * from user
2009-11-25:14:02:40	2470	489	select * from event
2009-11-25:14:02:42	2470	19067	select * from db
2009-11-25:14:02:46	2470	0	DISCONNECT
</pre>
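<p>Because the log is tab delimited, it is easy to slice up afterwards. Here is a rough Ruby sketch, not part of the original setup, that totals the time column per thread id. The sample lines are copied from the output above; the method name and the decision to skip the CONNECT/DISCONNECT lines are assumptions made for illustration, and the time column is left in whatever unit mysql-proxy reports.</p>

```ruby
# Hypothetical post-processing sketch: total the time column per thread id.
# The sample lines mirror the tab-delimited format written by log_line.
LOG = [
  "2009-11-25:14:02:18\t2470\t0\tCONNECT 127.0.0.1:55509",
  "2009-11-25:14:02:18\t2470\t253\tselect @@version_comment limit 1",
  "2009-11-25:14:02:20\t2470\t100371\tshow tables",
  "2009-11-25:14:02:46\t2470\t0\tDISCONNECT"
].join("\n")

def time_per_thread(text)
  totals = Hash.new(0)
  text.each_line do |line|
    # columns: timestamp, thread id, elapsed time, query text
    _stamp, thread_id, elapsed, query = line.chomp.split("\t", 4)
    next if query.nil? || query.start_with?("CONNECT", "DISCONNECT")
    totals[thread_id.to_i] += elapsed.to_i
  end
  totals
end

time_per_thread(LOG).each { |thread_id, elapsed| puts "#{thread_id}\t#{elapsed}" }
```

<p>The same loop could just as easily group by query text to find the statements that eat the most time.</p>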
]]></content:encoded>
			<wfw:commentRss>http://gurge.com/blog/2009/11/25/mysql-query-logging-with-mysql-proxy-and-lua/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Ruby Nearest Neighbor (fast kdtree gem)</title>
		<link>http://gurge.com/blog/2009/10/22/ruby-nearest-neighbor-fast-kdtree-gem/</link>
		<comments>http://gurge.com/blog/2009/10/22/ruby-nearest-neighbor-fast-kdtree-gem/#comments</comments>
		<pubDate>Thu, 22 Oct 2009 20:53:19 +0000</pubDate>
		<dc:creator>amd</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://gurge.com/blog/?p=38</guid>
		<description><![CDATA[At Urbanspoon, we often have to find restaurants that are close to a given lat/lon. This general problem is called Nearest Neighbor, and we&#8217;ve solved it in a variety of different ways as the company grew and our requirements changed. Read on.
Attempt #1: naive db query
Back when we only covered Seattle, we used a simple [...]]]></description>
			<content:encoded><![CDATA[<p>At Urbanspoon, we often have to find restaurants that are close to a given lat/lon. This general problem is called <a href="http://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm">Nearest Neighbor</a>, and we&#8217;ve solved it in a variety of different ways as the company grew and our requirements changed. Read on.</p>
<h4>Attempt #1: naive db query</h4>
<p>Back when we only covered Seattle, we used a simple db query to find the nearest restaurants. We would find the closest restaurants with manhattan distance, then sort by great circle distance. Pseudocode:</p>
<pre class="code">
list = select * from table order by manhattan(lat,lon) limit 50
list = list.sort_by { |i| great_circle(i,lat,lon) }
</pre>
<p>Sadly, this was really slow and didn&#8217;t scale beyond 10,000 rows. It worked great for Seattle, but New York alone has 24k restaurants.</p>
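<p>The <i>manhattan</i> and <i>great_circle</i> helpers in the pseudocode are not shown in the post. A minimal Ruby sketch of what they might look like follows; the signatures, the kilometre units, and the use of the haversine formula are all assumptions. In the real query the manhattan part would run inside the database.</p>

```ruby
# Hypothetical helpers for the pseudocode above; names and units are assumptions.
EARTH_RADIUS_KM = 6371.0

# Cheap ranking metric: sum of absolute differences, in degrees.
def manhattan(lat1, lon1, lat2, lon2)
  (lat1 - lat2).abs + (lon1 - lon2).abs
end

# Great circle distance in km, using the haversine formula.
def great_circle(lat1, lon1, lat2, lon2)
  rad = Math::PI / 180
  dlat = (lat2 - lat1) * rad
  dlon = (lon2 - lon1) * rad
  a = Math.sin(dlat / 2)**2 +
      Math.cos(lat1 * rad) * Math.cos(lat2 * rad) * Math.sin(dlon / 2)**2
  2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a))
end

# Seattle to New York, roughly
puts great_circle(47.61, -122.33, 40.71, -74.01).round
```

<p>Manhattan distance is cheap enough to use for the first cut, and the more expensive great circle math only has to run on the 50 survivors.</p>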
<h4>Attempt #2: db query w/static box</h4>
<p>Our second attempt was the same as #1, but we bounded the query with a box of size 0.02 degrees. Because the db only has to sift through a tiny fraction of the data, we get a huge improvement in performance. An index on lat/lon is required. Pseudocode:</p>
<pre class="code">
add_index :table, [:lat, :lon]
</pre>
<pre>
list = select * from table where box(lat, lon, 0.02)
   order by manhattan(lat,lon) limit 50
list = list.sort_by { |i| great_circle(i,lat,lon) }
</pre>
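<p>The <i>box(lat, lon, 0.02)</i> call is shorthand. One way it could expand into SQL is sketched below in Ruby; the function name, the column names, and the reading of 0.02 as the full width of the box are assumptions. For scale, a degree of latitude is about 111 km, so a 0.02 degree box is a little over 2 km tall.</p>

```ruby
# Hypothetical expansion of box(lat, lon, size) into a SQL condition.
# Assumes "size" is the full width of the box in degrees.
def box_condition(lat, lon, size)
  half = size / 2.0
  format("lat BETWEEN %.4f AND %.4f AND lon BETWEEN %.4f AND %.4f",
         lat - half, lat + half, lon - half, lon + half)
end

puts box_condition(47.61, -122.33, 0.02)
```

<p>Range conditions like these are what let the lat/lon index do its job, since the db can skip every row outside the box.</p>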
<p>But that raises the question &#8211; how big should the box be? The box should be big enough to snag at least 50 restaurants, but not so big that we accidentally have to examine 1,000. 0.02 degrees might work great inside a city, but it&#8217;s totally inappropriate for less dense areas.</p>
<h4>Attempt #3: db query w/ dynamic box</h4>
<p>I decided to pre-calculate a density map for each of our metropolitan areas. Each city was mapped onto a 200&#215;200 grid, and restaurant counts were placed into each cell of the grid. This was expensive, but the map can be cached and recalculated once per day.</p>
<p>Given a lat/lon, we use the grid to figure out how big the box needs to be to snag the desired number of restaurants. The code is pretty similar:</p>
<pre>
list = select * from table where box(lat, lon, density(lat, lon, 50))
   order by manhattan(lat,lon) limit 50
list = list.sort_by { |i| great_circle(i,lat,lon) }
</pre>
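<p>The post does not show the density map itself. As a rough idea of how such a grid could answer &#8220;how big should the box be&#8221;, here is a small Ruby sketch. The class name, constructor arguments, and the grow-a-window strategy are all hypothetical, and the returned size is only an estimate because the window is centered on a cell, not on the query point.</p>

```ruby
# Hypothetical density map: count points per grid cell, then grow a square
# window around the query cell until it holds enough of them.
class DensityGrid
  def initialize(points, min_lat, min_lon, cell_size, cells)
    @min_lat, @min_lon, @cell_size, @cells = min_lat, min_lon, cell_size, cells
    @counts = Array.new(cells) { Array.new(cells, 0) }
    points.each do |lat, lon|
      row, col = cell_for(lat, lon)
      @counts[row][col] += 1
    end
  end

  # Approximate box size in degrees that should hold at least "wanted" points.
  def box_size(lat, lon, wanted)
    row, col = cell_for(lat, lon)
    (0...@cells).each do |radius|
      return (2 * radius + 1) * @cell_size if count_window(row, col, radius) >= wanted
    end
    @cells * @cell_size # not enough points anywhere: fall back to the whole grid
  end

  private

  # Map a lat/lon onto a [row, col] cell, clamping points outside the grid.
  def cell_for(lat, lon)
    [((lat - @min_lat) / @cell_size).floor.clamp(0, @cells - 1),
     ((lon - @min_lon) / @cell_size).floor.clamp(0, @cells - 1)]
  end

  # Sum the counts in the square window of the given radius around a cell.
  def count_window(row, col, radius)
    rows = ([row - radius, 0].max..[row + radius, @cells - 1].min)
    cols = ([col - radius, 0].max..[col + radius, @cells - 1].min)
    rows.sum { |r| cols.sum { |c| @counts[r][c] } }
  end
end

grid = DensityGrid.new([[0.05, 0.05], [0.15, 0.05], [0.95, 0.95]], 0.0, 0.0, 0.1, 10)
puts grid.box_size(0.05, 0.05, 2)
```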
<p>My grid worked great, but it was complicated and a memory hog. Also, because we use a per-city grid it tends to fall apart for queries that are far away from cities. Sometimes it&#8217;s difficult to figure out which city a lat/lon should map into.</p>
<p>This is especially annoying for folks that want to use Urbanspoon out on the road.</p>
<h4>Attempt #4: much head scratching</h4>
<p>What I really wanted was a search that worked across our entire data set, without requiring a bounding box or a per-city constraint. I tried a few things that didn&#8217;t work:</p>
<ul>
<li>mysql <b>spatial extensions</b> were a bit faster, but I still needed the bounding box, which was the cause of many of our problems.</li>
<li><b>sphinx</b> lat/lon queries. We already use sphinx for our fulltext searches, so it was easy to experiment with its geo support. Sadly, performance was much slower than the db box query.</li>
</ul>
<p>I was getting desperate.</p>
<h4>Attempt #5: kd tree?</h4>
<p>At various times along this winding road, I considered using a <a href="http://en.wikipedia.org/wiki/Kd-tree">kd tree</a>. A kd tree is a data structure that recursively partitions the world in order to rapidly answer nearest neighbor queries. A generic kd tree can support any number of dimensions, and can return either the nearest neighbor or a set of N nearest neighbors.</p>
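<p>To make the idea concrete, here is a tiny two dimensional kd tree in plain Ruby. It is only a sketch of the general technique (split on x, then y, then x again, and prune branches that cannot hold a closer point). It is not the implementation discussed in this post, and the names are made up.</p>

```ruby
# Minimal 2D kd tree sketch. Points are [x, y, id]; distances are squared
# Euclidean. Illustration only, not tuned for speed or memory.
Node = Struct.new(:point, :axis, :left, :right)

# Split on the median, alternating between x (axis 0) and y (axis 1).
def build(points, depth = 0)
  return nil if points.empty?
  axis = depth % 2
  sorted = points.sort_by { |pt| pt[axis] }
  mid = sorted.size / 2
  Node.new(sorted[mid], axis,
           build(sorted[0...mid], depth + 1),
           build(sorted[(mid + 1)..-1], depth + 1))
end

def dist2(pt, x, y)
  (pt[0] - x)**2 + (pt[1] - y)**2
end

def nearest(node, x, y, best = nil)
  return best if node.nil?
  best = [best, node.point].compact.min_by { |pt| dist2(pt, x, y) }
  delta = [x, y][node.axis] - node.point[node.axis]
  near, far = delta.negative? ? [node.left, node.right] : [node.right, node.left]
  best = nearest(near, x, y, best)
  # Only cross the splitting line if the other side could hold a closer point.
  best = nearest(far, x, y, best) if dist2(best, x, y) >= delta**2
  best
end

POINTS = [[2, 3, 1], [5, 4, 2], [9, 6, 3], [4, 7, 4], [8, 1, 5], [7, 2, 6]]
tree = build(POINTS)
p nearest(tree, 9, 2)
```

<p>The pruning step is where the speed comes from: most queries only visit a handful of nodes.</p>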
<p>There were many stumbling blocks. My ruby implementation was slow and ate a lot of memory. I could&#8217;ve created one tree per city, but that wouldn&#8217;t have solved the basic problem. There were some very powerful C implementations floating around (see <a href="http://libkdtree.alioth.debian.org/">libkdtree</a>), but they were difficult to incorporate into our Rails app. Also, strangely, the C implementations still seemed slow, probably because they were written to be totally generic.</p>
<p>Perhaps I could get better results on my own. It was time for some major hacking&#8230;</p>
<h4>Introducing the kdtree gem.</h4>
<p>I created a kdtree gem. It&#8217;s very specific to the problem I was trying to solve &#8211; two dimensional nearest neighbor searches that run in front of a db. Check out the performance on my aging 2.4ghz AMD machine, using a tree with 1 million points:</p>
<table>
<tr>
<td style="width:150px">build</td>
<td>10.5s</td>
</tr>
<tr>
<td>nearest point</td>
<td>0.000009s</td>
</tr>
<tr>
<td>nearest 5 points</td>
<td>0.000011s</td>
</tr>
<tr>
<td>nearest 50 points</td>
<td>0.000031s</td>
</tr>
<tr>
<td>nearest 255 points</td>
<td>0.000140s</td>
</tr>
</table>
<p>The API is very simple:</p>
<ul>
<li><b>KDTree.new(points)</b> &#8211; construct a new tree. Each point should be of the form <i>[x, y, id]</i>, where <i>x/y</i> are floats and <i>id</i> is an int. Not a string, not an object, just an int.</li>
<li><b>kd.nearest(x, y)</b> &#8211; find the nearest point. Returns an id.</li>
<li><b>kd.nearestk(x, y, k)</b> &#8211; find the nearest <i>k</i> points. Returns an array of ids.</li>
</ul>
<p>Also, I made it possible to <b>persist</b> the tree to disk and load it later. That way you can calculate the tree offline and load it quickly at some future point. Loading a persisted tree w/ 1 million points takes less than a second, as opposed to the 10.5 second startup time shown above. For example:</p>
<pre>
File.open("treefile", "w") { |f| kd.persist(f) }
... later ...
kd2 = File.open("treefile") { |f| KDTree.new(f) }
</pre>
<p>Caveats and limitations:</p>
<ul>
<li><b>Not thread safe</b>. In fact, due to my laziness it uses a single static block for storing results. You should only use one kdtree at a time!</li>
<li>No <b>editing</b> allowed! Once you construct a tree you&#8217;re stuck with it.</li>
<li>The tree is stored in <b>one big memory block</b>, 20 bytes per point. A tree with one million points will allocate a single 19mb block to store its nodes.</li>
<li>Persisted trees are <b>architecture dependent</b>, and may not work across different machines due to endian issues.</li>
<li>nearestk is limited to <b>255 results</b>, again due to my laziness.</li>
<li>Tested on <b>Mac &amp; Linux, w/ Ruby 1.8.5-1.8.7</b>.</li>
</ul>
<p>This is my first attempt at writing a gem, and I&#8217;m sure I&#8217;ve messed it up badly. I&#8217;m releasing this under the MIT License. Download it here:</p>
<p><a href="http://www.gurge.com/blogi/kdtree-0.1.gem">kdtree-0.1.gem</a><br />
<a href="http://www.gurge.com/blogi/kdtree-0.1.tar.gz">kdtree-0.1.tar.gz</a> (source)</p>
<p>Feedback welcome.</p>
]]></content:encoded>
			<wfw:commentRss>http://gurge.com/blog/2009/10/22/ruby-nearest-neighbor-fast-kdtree-gem/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Text Wizardry : Ten Commands</title>
		<link>http://gurge.com/blog/2008/08/18/text-wizardry-ten-commands/</link>
		<comments>http://gurge.com/blog/2008/08/18/text-wizardry-ten-commands/#comments</comments>
		<pubDate>Mon, 18 Aug 2008 21:40:16 +0000</pubDate>
		<dc:creator>amd</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://gurge.com/blog/2008/08/18/text-wizardry-ten-commands/</guid>
		<description><![CDATA[Do you process log files, spreadsheets, or XML as part of your engineering work? You too can become a text wizard by mastering these ten commands. Learn them one by one, then mix and match for best results.
1. cat
First, you need to get some text in your shell. Use cat to output a text file.
$ [...]]]></description>
			<content:encoded><![CDATA[<p>Do you process log files, spreadsheets, or XML as part of your engineering work? You too can become a text wizard by mastering these ten commands. Learn them one by one, then mix and match for best results.</p>
<h3 style="color:red">1. cat</h3>
<p>First, you need to get some text in your shell. Use cat to output a text file.</p>
<pre>$ cat access.log

1.2.3.4 urbanspoon.com - [15/Aug/2008:06:25:50 +0000] "GET /r/16/
1.2.3.4 urbanspoon.com - [15/Aug/2008:06:25:51 +0000] "GET /r/16/
1.2.3.4 urbanspoon.com - [15/Aug/2008:06:25:51 +0000] "POST /e/vo
1.2.3.4 urbanspoon.com - [15/Aug/2008:06:25:51 +0000] "GET /r/15/
1.2.3.4 urbanspoon.com - [15/Aug/2008:06:25:51 +0000] "GET /m/r/1
</pre>
<h3 style="color:red">2. grep/zgrep</h3>
<p>Filters text and picks out just the bits that you care about. zgrep does the same, but works on files that might be compressed. That means that you can run zgrep on a bunch of files, including both compressed and uncompressed data.</p>
<h4>Arguments that you care about</h4>
<ul>
<li>-i : case insensitive</li>
<li>-v : show lines that DON&#8217;T match</li>
<li>-s : ignore errors (useful in conjunction with find/xargs)</li>
<li>-E : search for &#8216;extended&#8217; regular expressions. Necessary for all but the most trivial greps.</li>
<li>-F : search for text, not a regex</li>
<li>-o : only print the matching text, not the whole line</li>
<li>-n : include line numbers, useful for emacs</li>
<li>-c : show a count of matches, not the matches</li>
<li>-h : suppress filename</li>
<li>-R : grep files and directories recursively. Also see find, down below.</li>
</ul>
<pre>$ cat access.log | grep 50

1.2.3.4 urbanspoon.com - [15/Aug/2008:06:25:50 +0000] "GET /r/16/
1.2.3.4 urbanspoon.com - [15/Aug/2008:06:26:50 +0000] "GET /r/4/5
1.2.3.4 urbanspoon.com - [15/Aug/2008:06:26:50 +0000] "GET /r/4/5
1.2.3.4 urbanspoon.com - [15/Aug/2008:06:26:50 +0000] "GET /fn/52
1.2.3.4 urbanspoon.com - [15/Aug/2008:06:27:50 +0000] "GET /m/ci/
</pre>
<pre>$ cat access.log | grep -Eo "Aug/[0-9]+"

Aug/2008
Aug/2008
Aug/2008
Aug/2008
Aug/2008
</pre>
<pre>$ cat access.log | grep -v GET

1.2.3.4 urbanspoon.com - [15/Aug/2008:06:25:51 +0000] "POST /e/vo
1.2.3.4 urbanspoon.com - [15/Aug/2008:06:25:57 +0000] "POST /e/ku
1.2.3.4 urbanspoon.com - [15/Aug/2008:06:26:26 +0000] "POST /u/ac
1.2.3.4 urbanspoon.com - [15/Aug/2008:06:26:26 +0000] "POST /u/ac
1.2.3.4 urbanspoon.com - [15/Aug/2008:06:26:30 +0000] "POST /u/ac
</pre>
<h3 style="color:red">3. sort</h3>
<p>Sorts lines of text. Sort is fast, flexible, and well designed. It also contains possibly the most well-thought out option ever created in the history of command line tools: <b style="font-size: 1.2em">-n</b>.</p>
<h4>Arguments that you care about</h4>
<ul>
<li>-n : sort numerically (you&#8217;ll be hearing more about this later)</li>
<li>-r : reverse the sort order</li>
<li>-k : sort based on delimited columns, not just lines. Useful for spreadsheets.</li>
</ul>
<pre>$ cat access.log | grep -Eo 'GET [^"]+' | sort

GET /a/1/Seattle-at-night.html HTTP/1.0
GET /about HTTP/1.0
GET /b/favorites/1/4845/Seattle-restaurants HTTP/1.0
GET /b/favorites/3/245/New-York-restaurants HTTP/1.0
GET /blog/ HTTP/1.0
</pre>
<pre>$ cat access.log | grep -Eo 'GET [^"]+' | sort -r

GET /zip/8/77539/Houston-restaurants.html HTTP/1.0
GET /zip/8/77406/Houston-restaurants.html?sort=2 HTTP/1.0
GET /w/feed/reviews/3/New-York/rss.xml HTTP/1.0
GET /u/friends/29012 HTTP/1.0
GET /u/favorites/29012 HTTP/1.0
</pre>
<h3 style="color:red">4. uniq</h3>
<p>Omits repeated lines. By itself this is marginally useful, but with <b style="font-size: 1.2em">-c</b> and sort, it&#8217;s magic.</p>
<h4>Arguments that you care about</h4>
<ul>
<li>-c : show number of occurrences</li>
</ul>
<p>For example, here&#8217;s a poor man&#8217;s google analytics:</p>
<pre>$ cat access.log | grep -Eo 'GET [^"]+' | sort | uniq -c | sort -nr

    183 GET /m/u/friends HTTP/1.0
     60 GET / HTTP/1.0
     42 GET /m/ci/6 HTTP/1.0
     40 GET /r/15/191259/restaurant/Southeast/Peking-Express
     39 GET /m/u/suggest_city HTTP/1.0
</pre>
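<p>If it helps to see what the pipeline is doing step by step, here is a rough Ruby equivalent. The sample log lines and the method name are invented for illustration. Unlike grep -o, it only keeps the first match on each line.</p>

```ruby
# Rough Ruby equivalent of:
#   grep -Eo 'GET [^"]+' | sort | uniq -c | sort -nr
# The sample lines are made up for illustration.
REQUESTS = [
  '1.2.3.4 urbanspoon.com - [15/Aug/2008:06:25:50 +0000] "GET /about HTTP/1.0"',
  '1.2.3.4 urbanspoon.com - [15/Aug/2008:06:25:51 +0000] "GET / HTTP/1.0"',
  '1.2.3.4 urbanspoon.com - [15/Aug/2008:06:25:51 +0000] "POST /e/example HTTP/1.0"',
  '1.2.3.4 urbanspoon.com - [15/Aug/2008:06:25:52 +0000] "GET / HTTP/1.0"'
]

def request_counts(lines)
  counts = Hash.new(0)
  lines.each do |line|
    request = line[/GET [^"]+/] # like grep -o: keep only the matching text
    counts[request] += 1 if request
  end
  # like sort -nr: biggest count first, ties broken by request text
  counts.sort_by { |request, count| [-count, request] }
end

request_counts(REQUESTS).each { |request, count| printf("%7d %s\n", count, request) }
```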
<h3 style="color:red">5. cut</h3>
<p>Incredibly useful for slicing and dicing lines. Cut by delimiter or position. cut uses TAB as its default delimiter, which is handy because I can never remember how to get a TAB character into bash.</p>
<h4>Arguments that you care about</h4>
<ul>
<li>-d : set the delimiter</li>
<li>-f : pick the fields to output. -f1,3 outputs fields 1 &#038; 3. -f1-3 Outputs fields 1, 2 &#038; 3.</li>
<li>-c : or, pick the chars to output</li>
</ul>
<p>Here&#8217;s how you can get the dates from your log file, if you ever wanted to do such a thing. Note the use of quotes to surround the space character, which I&#8217;m using as a delimiter.</p>
<pre>$ cat access.log | cut -d" " -f4-5

[15/Aug/2008:06:25:50 +0000]
[15/Aug/2008:06:25:51 +0000]
[15/Aug/2008:06:25:51 +0000]
[15/Aug/2008:06:25:51 +0000]
[15/Aug/2008:06:25:51 +0000]
</pre>
<p>Want to see a breakdown by hour? Just cut out the hour, sort and uniq -c:</p>
<pre>$ cat access.log |  cut -d":" -f2 | sort | uniq -c

  17798 14
  22160 15
  26415 16
  23181 17
  30535 18
</pre>
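<p>The hour trick works because the first colon on each line sits inside the timestamp, so field 2 is the hour. A rough Ruby version of the same breakdown, with invented sample lines and a made-up method name, might look like this:</p>

```ruby
# Rough Ruby equivalent of: cut -d":" -f2 | sort | uniq -c
# Sample lines are invented. Field 2 is the hour only because the first
# colon on each line is the one inside the timestamp.
HOUR_SAMPLE = [
  '1.2.3.4 urbanspoon.com - [15/Aug/2008:14:25:50 +0000] "GET /r/16/',
  '1.2.3.4 urbanspoon.com - [15/Aug/2008:14:59:01 +0000] "GET /about',
  '1.2.3.4 urbanspoon.com - [15/Aug/2008:15:00:12 +0000] "POST /e/vo'
]

def hits_per_hour(lines)
  lines.map { |line| line.split(":")[1] }.tally.sort.to_h
end

hits_per_hour(HOUR_SAMPLE).each { |hour, count| printf("%7d %s\n", count, hour) }
```

<p>A log line with an extra colon before the timestamp (an IPv6 address, say) would throw both versions off.</p>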
<h3 style="color:red">6. wc</h3>
<p>Counts lines. Actually, it counts lots of things, but mostly I use it for lines. When combined with sort and uniq, you can get a count of unique things. (uniq only collapses adjacent duplicates, so sort first.) For example, here is the number of unique user agents in my log file:</p>
<h4>Arguments that you care about</h4>
<ul>
<li>-l : just show the line count</li>
</ul>
<pre>$ cat access.log | cut -d'"' -f6 | sort | uniq | wc -l

8539
</pre>
<h3 style="color:red">7. head/tail</h3>
<p>Given some incoming text, just look at the start or end of the file. This is useful if you have lots of data but only want to look at a subset. The examples aren&#8217;t that interesting, but here&#8217;s how you use them:</p>
<p>Show the first 100 lines:</p>
<pre>$ cat access.log | head -100</pre>
<p>Show the last 100 lines:</p>
<pre>$ cat access.log | tail -100</pre>
<p>Tail also has a completely different mode where it &#8220;watches&#8221; a file and outputs lines that are written to the end. You can &#8220;tail your log file&#8221; to see what&#8217;s happening on your server:</p>
<pre>$ tail -f access.log</pre>
<p>Want to see if anyone is hitting that new page? Combine tail -f with grep:</p>
<pre>$ tail -f access.log | grep cool_new_feature</pre>
<h3 style="color:red">8. find</h3>
<p>Walks your disk and prints out file names. Again, not particularly useful by itself, but tremendously useful when combined with xargs (below) and grep. Unfortunately, find has a somewhat confusing command line syntax. Here are the two flavors you should learn:</p>
<p>Find all files and directories:</p>
<pre>$ find

.
./Makefile.in
./config.log
./COPYING
./COMPILING
</pre>
<p>Find files only, not directories. Note that &#8220;.&#8221; means &#8220;the current directory&#8221;. So, we&#8217;re really saying &#8220;find all files in the current directory and everything below it&#8221;.</p>
<pre>$ find . -type f

./Makefile.in
./config.log
./COPYING
./COMPILING
./autogen.sh
</pre>
<p>Combine with grep to find c++ files.</p>
<pre>$ find | grep cpp

./src/tankai/TankAIComputerTarget.cpp
./src/tankai/TankAIAdder.cpp
./src/tankai/TankAIComputer.cpp
./src/tankai/TankAIStrings.cpp
./src/tankai/TankAIComputerBuyer.cpp
</pre>
<h3 style="color:red">9. xargs</h3>
<p>This is a weird one, but it&#8217;s essential if you want to become a true text wizard. Basically, xargs takes a bunch of text lines and sends them as command line arguments to some other command. Why would you want this? Here are a few ideas off the top of my head:</p>
<p>Grep for some text in your .cpp files, but don&#8217;t grep inside the .o files</p>
<pre>$ find | grep cpp | xargs grep strcpy</pre>
<p>Download urls listed in a text file</p>
<pre>$ cat urls.txt | xargs curl</pre>
<p>Zip up your files, but exclude subversion cruft</p>
<pre>$ find | grep -v .svn | xargs zip source.zip</pre>
<h3 style="color:red">10. less</h3>
<p>Less is a text viewer. Use it to page through a file or the output from a command. You can search the contents or jump around with hotkeys. Here are the keys I use most often:</p>
<ul>
<li>space : next page</li>
<li>b: previous page</li>
<li>&lt; : jump to start of file</li>
<li>&gt; : jump to end of file</li>
<li>/ : search forward for regex</li>
<li>? : search backward for regex</li>
</ul>
<p>This is especially useful if you&#8217;re building up a complicated string of commands. Add less to the end of each sequence to preview the results:</p>
<pre>$ cat access.log | less
$ cat access.log | grep -Eo 'GET [^"]+' | less
$ cat access.log | grep -Eo 'GET [^"]+' | sort | less
$ cat access.log | grep -Eo 'GET [^"]+' | sort | uniq -c | less
...
</pre>
<h3 style="color:red">What next?</h3>
<p>The true power of the command line comes not from any individual command, but from knowing how to chain them together. We&#8217;ve really just scratched the surface here. Once you&#8217;ve mastered these commands, here are a few others to check out:</p>
<ul>
<li><b>convert</b> &#8211; image processing</li>
<li><b>curl</b> &#8211; fetch web pages</li>
<li><b>dc</b> &#8211; calculator</li>
<li><b>strings</b> &#8211; pull strings out of a binary file</li>
<li><b>sum</b> &#8211; checksum a file</li>
<li><b>watch</b> &#8211; run something every few seconds and look for changes</li>
<li><b>xxd</b> &#8211; hex dump a file</li>
</ul>
<p>You&#8217;ll also want to learn a bit about <a href="http://tldp.org/HOWTO/Bash-Prog-Intro-HOWTO-3.html">redirection</a>, bash&#8217;s command line <a href="http://www.talug.org/events/20030709/cmdline_history.html">history</a>, and the wonders of <b>ctrl-r</b>.</p>
]]></content:encoded>
			<wfw:commentRss>http://gurge.com/blog/2008/08/18/text-wizardry-ten-commands/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Google Chart Tips for Ruby Hackers</title>
		<link>http://gurge.com/blog/2008/04/02/google-chart-tips-for-ruby-hackers/</link>
		<comments>http://gurge.com/blog/2008/04/02/google-chart-tips-for-ruby-hackers/#comments</comments>
		<pubDate>Wed, 02 Apr 2008 20:23:52 +0000</pubDate>
		<dc:creator>amd</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://gurge.com/blog/2008/04/02/google-chart-tips-for-ruby-hackers/</guid>
		<description><![CDATA[Recently we&#8217;ve been experimenting with Google Charts on Urbanspoon. Their API is well designed and easy to use, but it&#8217;s still nontrivial to produce good looking graphs for arbitrary data. Here are some suggestions for my fellow ruby hackers:
1. Nice Numbers for Graph Labels
The classic &#8220;Nice Numbers for Graph Labels&#8221; Graphics Gem by Paul Heckbert [...]]]></description>
			<content:encoded><![CDATA[<p>Recently we&#8217;ve been experimenting with <a href="http://code.google.com/apis/chart/">Google Charts</a> on Urbanspoon. Their API is well designed and easy to use, but it&#8217;s still nontrivial to produce good looking graphs for arbitrary data. Here are some suggestions for my fellow ruby hackers:</p>
<h3>1. Nice Numbers for Graph Labels</h3>
<p>The classic &#8220;Nice Numbers for Graph Labels&#8221; Graphics Gem by Paul Heckbert will generate a series of good looking axis labels given a min and max value. It works with floats as well as integers.</p>
<p><center><br />
  <img src='http://chart.apis.google.com/chart?chs=300x185&#038;chxl=0%3A%7C%7C1%3A%7C0.2%7C0.4%7C0.6%7C0.8%7C1.0&#038;chxt=x%2Cy&#038;chg=0%2C25&#038;chd=e%3ADzKfQ.XPdLixn.swxF054M67&#038;cht=lc&#038;chxp=0%2C' width='300' height='185.410196481719' /><br/><br />
  <i>Automatic &#8220;nice labels&#8221; on the y axis</i><br />
</center></p>
<p>I ported it to Ruby:</p>
<pre class="emacs">
<span class="comment-delimiter"># </span><span class="comment">From the &quot;Nice Numbers for Graph Labels&quot; graphics gem by Paul
# Heckbert
</span><span class="keyword">def</span> <span class="function-name">nicenum</span>(x, round)
  expv = <span class="type">Math</span>.log10(x).floor.to_f
  f = x / (10 ** expv)
  <span class="keyword">if</span> round
    <span class="keyword">if</span> f &lt; 1.5
      nf = 1
    <span class="keyword">elsif</span> f &lt; 3
      nf = 2
    <span class="keyword">elsif</span> f &lt; 7
      nf = 5
    <span class="keyword">else</span>
      nf = 10
    <span class="keyword">end</span>
  <span class="keyword">else</span>
    <span class="keyword">if</span> f &lt;= 1
      nf = 1
    <span class="keyword">elsif</span> f &lt;= 2
      nf = 2
    <span class="keyword">elsif</span> f &lt;= 5
      nf = 5
    <span class="keyword">else</span>
      nf = 10
    <span class="keyword">end</span>
  <span class="keyword">end</span>
  nf * (10 ** expv)
<span class="keyword">end</span>

<span class="keyword">def</span> <span class="function-name">loose_label</span>(options = {})
  min, max = options[<span class="constant">:min</span>], options[<span class="constant">:max</span>]
  ticks = options[<span class="constant">:ticks</span>] || 5

  range = nicenum(max - min, <span class="variable-name">false</span>)
  d = nicenum(range / (ticks - 1), <span class="variable-name">true</span>)

  {
    <span class="constant">:min</span> =&gt; (min / d).floor * d,
    <span class="constant">:max</span> =&gt; (max / d).ceil * d,
    <span class="constant">:increment</span> =&gt; d
  }
<span class="keyword">end</span></pre>
<p>For example, if your data set ranges from 23-65 and you want to have<br />
five axis labels, you could do something like this:</p>
<pre class="emacs">
puts loose_label(<span class="constant">:min</span> =&gt; 23, <span class="constant">:max</span> =&gt; 65, <span class="constant">:ticks</span> =&gt; 5).inspect
</pre>
<p>and it would suggest this for your axis labels:</p>
<p>{ :min =&gt; 20.0, :max =&gt; 70.0, :increment =&gt; 10.0 }</p>
<p>To generate the actual labels, use something like the code below. Again, this is cribbed from the original Graphics Gem:</p>
<pre class="emacs">
loose = loose_label(<span class="constant">:min</span> =&gt; 23, <span class="constant">:max</span> =&gt; 65, <span class="constant">:ticks</span> =&gt; 5)
ymin, ymax = loose[<span class="constant">:min</span>], loose[<span class="constant">:max</span>]
d = loose[<span class="constant">:increment</span>]
nfrac = -<span class="type">Math</span>.log10(d).floor
nfrac = 0 <span class="keyword">if</span> nfrac &lt; 0
ylabels = []
i = ymin
<span class="keyword">while</span> i &lt; ymax + 0.5 * d
  ylabels &lt;&lt; sprintf(<span class="string">&quot;%.</span><span class="variable-name">#{nfrac}</span><span class="string">f&quot;</span>, i)
  i += d
<span class="keyword">end</span>
</pre>
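<p>If you need the labels in more than one place, the loop above can be folded into a small helper. This is only a sketch: <b>axis_labels</b> is a name made up for illustration, and it steps by an integer count instead of accumulating floats, which is a small departure from the Graphics Gem version.</p>

```ruby
# Hypothetical helper (not part of the original Graphics Gem code):
# format axis labels from ymin to ymax in steps of d.
def axis_labels(ymin, ymax, d)
  # digits after the decimal point needed to show one increment
  nfrac = [-(Math.log10(d).floor), 0].max
  # number of increments between the two ends of the axis
  steps = ((ymax - ymin) / d.to_f).round
  (0..steps).map { |k| format("%.#{nfrac}f", ymin + k * d) }
end

p axis_labels(20.0, 70.0, 10.0)
```

<p>Multiplying by the step index keeps rounding error from building up when the increment is something like 0.1.</p>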
<h3>2. Add a trailing average</h3>
<p>Here&#8217;s some code to calculate a trailing average from the previous 7 data points. The initial segment of the trailing average is calculated by averaging the data available up to that point.</p>
<pre class="emacs">
trailing = 7
sum = 0.0
tdata = []
data.each_with_index <span class="keyword">do</span> |i, index|
  count = <span class="variable-name">nil</span>
  sum += i
  <span class="keyword">if</span> index &lt; trailing
    count = index + 1
  <span class="keyword">else</span>
    count = trailing
    sum -= data[index - trailing]
  <span class="keyword">end</span>
  avg = (sum / count).to_i
  tdata &lt;&lt; avg
<span class="keyword">end</span>
</pre>
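<p>The same idea can be wrapped in a function so the window size is a parameter. This is a sketch under a couple of assumptions: <b>trailing_average</b> is a hypothetical name, and unlike the snippet above it returns floats rather than truncating to integers.</p>

```ruby
# Hypothetical helper: trailing average over the last `window` points.
# Early points are averaged over however many values exist so far.
def trailing_average(data, window = 7)
  sum = 0.0
  data.each_with_index.map do |value, index|
    sum += value
    # once the window is full, drop the value that just fell out of it
    sum -= data[index - window] if index >= window
    count = [index + 1, window].min
    sum / count
  end
end

p trailing_average([1, 2, 3, 4, 5], 3)
```

<p>Keeping a running sum means each point costs one addition and one subtraction, no matter how wide the window is.</p>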
<h3>3. Use the golden ratio</h3>
<p>The human eye finds a certain aspect ratio naturally appealing: <a href="http://en.wikipedia.org/wiki/Golden_ratio">the golden ratio</a>. If I have enough space to work with, I want my graphs to use that aspect ratio by default. That&#8217;s why I set up my API like this:</p>
<pre class="emacs">
<span class="type">GOLDEN</span> = 1.61803399

<span class="keyword">def</span> <span class="function-name">chart</span>(options = {})
  <span class="comment-delimiter"># </span><span class="comment">calculate width/height
</span>  width = options.delete(<span class="constant">:width</span>) || 300
  height = options.delete(<span class="constant">:height</span>) || (width / <span class="type">GOLDEN</span>)
  ...
</pre>
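<p>Here&#8217;s a tiny standalone version of just the sizing logic, as a sketch. The <b>chart_size</b> helper is hypothetical, and rounding the height to a whole pixel is my assumption about what a chart API would want.</p>

```ruby
GOLDEN = 1.61803399

# Hypothetical helper: return [width, height], defaulting the height
# to width / GOLDEN when the caller doesn't supply one.
def chart_size(options = {})
  width = options[:width] || 300
  height = options[:height] || (width / GOLDEN).round
  [width, height]
end

p chart_size
p chart_size(:width => 400, :height => 270)
```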
<h3>4. Consider using gchartrb</h3>
<p><a href="http://code.google.com/p/gchartrb/">gchartrb</a> is a ruby gem that wraps the Google Charts API. I haven&#8217;t used it personally but it looks great.</p>
]]></content:encoded>
			<wfw:commentRss>http://gurge.com/blog/2008/04/02/google-chart-tips-for-ruby-hackers/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>ActiveRecord Table Transform (or, how to write to the db 27,000 times)</title>
		<link>http://gurge.com/blog/2007/07/31/activerecord-table-transform-or-how-to-write-to-the-db-27000-times/</link>
		<comments>http://gurge.com/blog/2007/07/31/activerecord-table-transform-or-how-to-write-to-the-db-27000-times/#comments</comments>
		<pubDate>Wed, 01 Aug 2007 00:08:47 +0000</pubDate>
		<dc:creator>amd</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://gurge.com/blog/2007/07/31/activerecord-table-transform-or-how-to-write-to-the-db-27000-times/</guid>
		<description><![CDATA[At Urbanspoon, we use pretty urls for our pages to make them more palatable to users and search engines. Here&#8217;s an example:
http://www.urbanspoon.com/r/1/55069/Seattle/Fremont/Baguette-Box.html
These beautiful urls are slightly expensive to generate, since we have to &#8220;prettify&#8221; text by stripping whitespace and replacing accent characters. A few weeks back, I finally bit the bullet and started caching our [...]]]></description>
			<content:encoded><![CDATA[<p>At <a href="http://www.urbanspoon.com">Urbanspoon</a>, we use pretty urls for our pages to make them more palatable to users and search engines. Here&#8217;s an example:</p>
<p><a href="http://www.urbanspoon.com/r/1/55069/Seattle/Fremont/Baguette-Box.html">http://www.urbanspoon.com/r/1/55069/Seattle/Fremont/Baguette-Box.html</a></p>
<p>These beautiful urls are slightly expensive to generate, since we have to &#8220;prettify&#8221; text by stripping whitespace and replacing accent characters. A few weeks back, I finally bit the bullet and started caching our pretty urls in the db instead of in memory. I lazily populate the url column for each restaurant, so we&#8217;re gradually filling in the data as users hit the server.</p>
<p>Then I dug into the code that generates our sitemap. For the uninitiated, a <a href="http://www.sitemaps.org/">sitemap</a> is an XML file describing every page on the server. Naturally, in order to generate this file we have to write out the pretty urls for each restaurant.</p>
<p>Of our roughly 100,000 restaurants, about 27,000 hadn&#8217;t yet cached their pretty urls in the db. I naively used my lazy pretty url generator, which ended up sending 27,000 individual writes to the db. It took approximately <span style="color: red; font-size: 110%; font-weight: bold">9 MINUTES</span> to complete, with the CPU pegged the entire time.</p>
<p>It would be much better to do something like the following:</p>
<ol>
<li>Create a temp table with (id, url).</li>
<li>Bulk insert to populate the temp table.</li>
<li>Update the restaurants table from the temp table.</li>
</ol>
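<p>The three steps above boil down to three SQL statements. As a rough sketch, here they are built as plain strings in Ruby with no database attached. Everything here is an assumption for illustration: the <b>bulk_update_sql</b> helper, the column types, and especially the naive quoting, which real code should leave to the database adapter.</p>

```ruby
# Hypothetical helper: build the three statements for a bulk update of one
# column. rows is a list of [id, value] pairs.
# NOTE: the quoting is deliberately naive (single quotes are doubled);
# use the adapter's own quoting in real code.
def bulk_update_sql(table, column, rows, temp = "temp_transform_table")
  values = rows.map do |id, value|
    "(#{Integer(id)},'#{value.to_s.gsub("'", "''")}')"
  end
  [
    # 1. temp table with (id, column)
    "CREATE TEMPORARY TABLE #{temp} (id int, #{column} varchar(255))",
    # 2. one bulk insert instead of thousands of single-row writes
    "INSERT INTO #{temp} VALUES #{values.join(',')}",
    # 3. update the real table by joining against the temp table
    "UPDATE #{table}, #{temp} SET #{table}.#{column} = #{temp}.#{column} " \
      "WHERE #{table}.id = #{temp}.id"
  ]
end

puts bulk_update_sql("restaurants", "url", [[1, "a"], [2, "b's"]])
```

<p>The win comes from step 2: the server sees one statement carrying all the rows instead of one round trip per row.</p>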
<p>I implemented my new scheme and running time went from 9 minutes to <span style="color: red; font-size: 110%; font-weight: bold">24 SECONDS</span>. I liked this approach so much I decided to generalize it as ActiveRecord::Base.transform. Sample usage:</p>
<pre class="emacs">
<span class="comment-delimiter"># </span><span class="comment">if users don't have names, give them a random one
</span><span class="type">NAMES</span> = [<span class="string">'Adam'</span>, <span class="string">'Ethan'</span>, <span class="string">'Patrick'</span>]
<span class="type">User</span>.transform(<span class="constant">:name</span>, <span class="constant">:conditions</span> =&gt; <span class="string">'name is null'</span>) <span class="keyword">do</span> |i|
  i.name = <span class="type">NAMES</span>[rand(<span class="type">NAMES</span>.length)]
<span class="keyword">end</span>
</pre>
<p>This will use a bulk transform to update all users at once instead of each user individually.</p>
<p>Note that this has only been tested with MySQL, and is unlikely to work out of the box with other databases. Check it out:</p>
<pre class="emacs">
<span class="comment-delimiter"># </span><span class="comment">helper for quickly transforming an entire table using a temp table,
</span><span class="comment-delimiter"># </span><span class="comment">a bulk insert, and an update
</span><span class="keyword">class</span> <span class="type">ActiveRecord</span>::<span class="type">Base</span>
  <span class="keyword">def</span> <span class="function-name">self.transform</span>(cols, options = {})
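    <span class="comment-delimiter"># </span><span class="comment">remove our own options so find(:all, options) only sees finder options
</span>    temp_name = options.delete(<span class="constant">:temp_name</span>) || <span class="string">&quot;temp_transform_table&quot;</span>
    temp_options = options.delete(<span class="constant">:temp_options</span>) || <span class="string">&quot;character set utf8 collate utf8_general_ci&quot;</span>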

    <span class="comment-delimiter"># </span><span class="comment">munge cols into real column objects
</span>    cols = [cols] <span class="keyword">if</span> !cols.is_a?(<span class="type">Array</span>)
    cols = cols.map { |i| i.to_s }

    cols.delete(<span class="string">&quot;id&quot;</span>)
    cols.unshift(<span class="string">&quot;id&quot;</span>)

    cols = cols.map { |i| columns_hash[i] || <span class="keyword">raise</span>(<span class="string">&quot;column </span><span class="variable-name">#{i}</span><span class="string"> not found&quot;</span>) }

    <span class="comment-delimiter"># </span><span class="comment">load/transform
</span>    rows = find(<span class="constant">:all</span>, options)
    <span class="keyword">return</span> <span class="keyword">if</span> rows.empty?
    rows.each { |i| <span class="keyword">yield</span>(i) }

    <span class="comment-delimiter"># </span><span class="comment">create the temp table
</span>    cols_create = cols.map { |i| <span class="string">&quot;</span><span class="variable-name">#{i.name}</span><span class="string"> </span><span class="variable-name">#{i.sql_type}</span><span class="string">&quot;</span> }
    connection.execute(<span class="string">&quot;CREATE TEMPORARY TABLE </span><span class="variable-name">#{temp_name}</span><span class="string"> (</span><span class="variable-name">#{cols_create.join(',')}</span><span class="string">) </span><span class="variable-name">#{temp_options}</span><span class="string">&quot;</span>)

    <span class="comment-delimiter"># </span><span class="comment">bulk insert
</span>    data = rows.map <span class="keyword">do</span> |r|
      values = cols.map { |c| connection.quote(r[c.name], c) }
      <span class="string">&quot;(</span><span class="variable-name">#{values.join(',')}</span><span class="string">)&quot;</span>
    <span class="keyword">end</span>
    connection.execute(<span class="string">&quot;INSERT INTO </span><span class="variable-name">#{temp_name}</span><span class="string"> values </span><span class="variable-name">#{data.join(',')}</span><span class="string">&quot;</span>)

    <span class="comment-delimiter"># </span><span class="comment">save
</span>    cols_equal = cols.map { |i| <span class="string">&quot;</span><span class="variable-name">#{table_name}</span><span class="string">.</span><span class="variable-name">#{i.name}</span><span class="string"> = </span><span class="variable-name">#{temp_name}</span><span class="string">.</span><span class="variable-name">#{i.name}</span><span class="string">&quot;</span> }
    connection.execute(<span class="string">&quot;UPDATE </span><span class="variable-name">#{table_name}</span><span class="string">, </span><span class="variable-name">#{temp_name}</span><span class="string"> SET </span><span class="variable-name">#{cols_equal[1..-1].join(', ')}</span><span class="string"> WHERE </span><span class="variable-name">#{cols_equal.first}</span><span class="string">&quot;</span>)

    connection.execute(<span class="string">&quot;DROP TEMPORARY TABLE </span><span class="variable-name">#{temp_name}</span><span class="string">&quot;</span>)
  <span class="keyword">end</span>
<span class="keyword">end</span>
</pre>
]]></content:encoded>
			<wfw:commentRss>http://gurge.com/blog/2007/07/31/activerecord-table-transform-or-how-to-write-to-the-db-27000-times/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Yahoo Slurp Makes a Mess</title>
		<link>http://gurge.com/blog/2007/06/27/yahoo-slurp-makes-a-mess/</link>
		<comments>http://gurge.com/blog/2007/06/27/yahoo-slurp-makes-a-mess/#comments</comments>
		<pubDate>Thu, 28 Jun 2007 04:06:13 +0000</pubDate>
		<dc:creator>amd</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://gurge.com/blog/2007/06/27/yahoo-slurp-makes-a-mess/</guid>
		<description><![CDATA[For months we&#8217;ve been carefully watching how the various bots consume Urbanspoon. We enticed them inside with fresh content, well constructed pages, and sitemaps. Despite our efforts, until quite recently Yahoo Slurp didn&#8217;t have much of an appetite for Urbanspoon. Instead of digging in and indexing the whole site, Yahoo Slurp preferred to nibble around [...]]]></description>
			<content:encoded><![CDATA[<p>For months we&#8217;ve been carefully watching how the various bots consume <a href="http://www.urbanspoon.com">Urbanspoon</a>. We enticed them inside with fresh content, well-constructed pages, and sitemaps. Despite our efforts, until quite recently Yahoo Slurp didn&#8217;t have much of an appetite for Urbanspoon. Instead of digging in and indexing the whole site, Yahoo Slurp preferred to nibble around the edges.</p>
<p>That is, until June 16th. Notice anything odd?</p>
<p align="center">
<b>Yahoo Slurp Requests to Urbanspoon.com</b><br/><br />
<img src="/blogi/yahoo-month.jpg" width="400" height="270"/>
</p>
<p>Someone flipped a switch down there in Sunnyvale and the Yahoo Slurp bot suddenly decided that it loved Urbanspoon.</p>
<p>For comparison, check out the Metamucil-like regularity of the Google bot:</p>
<p align="center">
<b>Google Bot Requests to Urbanspoon.com</b><br/><br />
<img src="/blogi/google-month.jpg" width="400" height="270"/>
</p>
<p>Let&#8217;s dig in and take a closer look at those two bots. Ready&#8230; fight!</p>
<table width="90%">
<tr>
<th width="50%">Yahoo Slurp (June 16-18)</th>
<th>Google Bot (June 16-18)</th>
</tr>
<tr>
<td>
<p>194,464 total hits<br />
120,076 pages (38% dups)<br />
32 violations of robots.txt</p>
<p>85,396 restaurant pages<br />
1,008 neighborhood pages<br />
995 cuisine pages</p>
<p>26,599 New York restaurants<br />
23,002 LA restaurants<br />
20,152 SF restaurants<br />
7,817 Seattle restaurants<br />
7,124 Boston restaurants<br />
534 Chicago restaurants<br />
109 DC restaurants</p>
</td>
<td>
<p>41,941 total hits<br />
41,332 pages (1.4% dups)<br />
27 violations of robots.txt</p>
<p>22,999 restaurant pages<br />
1,366 neighborhood pages<br />
980 cuisine pages</p>
<p>6,573 New York restaurants<br />
4,659 LA restaurants<br />
2,751 SF restaurants<br />
2,584 Seattle restaurants<br />
1,987 Boston restaurants<br />
2,365 Chicago restaurants<br />
2,245 DC restaurants</p>
</td>
</tr>
</table>
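<p>The dup percentages in the table are just total hits minus distinct pages, divided by total hits. A quick sketch of the arithmetic (the <b>dup_rate</b> helper name is made up for illustration):</p>

```ruby
# Hypothetical helper: percentage of hits that were repeat requests.
def dup_rate(total_hits, distinct_pages)
  100.0 * (total_hits - distinct_pages) / total_hits
end

puts format("Yahoo Slurp: %.2f%%", dup_rate(194_464, 120_076))
puts format("Google Bot:  %.2f%%", dup_rate(41_941, 41_332))
```

<p>Yahoo Slurp works out to a bit over 38%. Google Bot comes to roughly 1.45%, which appears in the table as 1.4%.</p>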
<h4>Yahoo Slurp Duplicates</h4>
<p>Yahoo Slurp requested many pages more than once. In fact, Yahoo Slurp was unable to resist certain pages, compulsively returning to them again and again. Here are the pages that the bot seemed to find tastiest:</p>
<table>
<tr>
<th width="150"># of requests</th>
<th>page</th>
</tr>
<tr>
<td>419</td>
<td>/robots.txt</td>
</tr>
<tr>
<td>294</td>
<td>/choose</td>
</tr>
<tr>
<td>273</td>
<td>/</td>
</tr>
<tr>
<td>21</td>
<td>/c/3/New-York.html</td>
</tr>
<tr>
<td>19</td>
<td>/c/5/Los-Angeles.html</td>
</tr>
<tr>
<td>16</td>
<td>/a/3/New-York-at-night.html</td>
</tr>
<tr>
<td>15</td>
<td>/c/1/Seattle.html</td>
</tr>
<tr>
<td>14</td>
<td>/c/2/Chicago.html</td>
</tr>
<tr>
<td>11</td>
<td>/u/create (and this page is blocked via robots.txt!)</td>
</tr>
</table>
<p>I&#8217;ll spare you the other 73,306 duplicates requested by Yahoo Slurp.</p>
<h4>Directory Crawling</h4>
<p>Strangely, the Yahoo Slurp bot likes to explore the directories leading up to each page. For example, in addition to indexing our <a href="http://www.urbanspoon.com/r/1/1084/Seattle/Eastlake-Lake-Union/Sitka-Spruce.html">Sitka &#038; Spruce</a> page, Yahoo Slurp also tried to hit each of the directories leading up to that page:</p>
<p>/r/1/1084/Seattle/Eastlake-Lake-Union/Sitka-Spruce.html<br />
/r/1/1084/Seattle/Eastlake-Lake-Union/<br />
/r/1/1084/<br />
/r/1/<br />
/r/</p>
<p>Those URLs aren&#8217;t linked anywhere from our site. Each of them (correctly) redirects elsewhere. Why did Yahoo choose to crawl them?</p>
<h4>Yahoo Slurp &#8211; A Sloppy Eater</h4>
<p>We&#8217;re quite flattered by the attention, but Yahoo&#8217;s Slurp bot made a bit of a mess. I can forgive the robots.txt violations, since other bots share this transgression. The directory walking thing is bizarre, but won&#8217;t hurt our search engine results due to our clever defensive redirects.</p>
<p>The 38% dup rate is just plain sloppy. Really, this is not how we want to spend our precious CPU cycles and bandwidth. I&#8217;ve written a few indexing systems myself and I know that these problems are challenging, but the <a href="http://www.google.com">market leader</a> seems to have solved them nicely.</p>
<p>It remains to be seen if Yahoo&#8217;s aggressive indexing will lead to a commensurate increase in traffic from Yahoo. Stay tuned!</p>
]]></content:encoded>
			<wfw:commentRss>http://gurge.com/blog/2007/06/27/yahoo-slurp-makes-a-mess/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Hashiness</title>
		<link>http://gurge.com/blog/2007/04/08/hashiness/</link>
		<comments>http://gurge.com/blog/2007/04/08/hashiness/#comments</comments>
		<pubDate>Mon, 09 Apr 2007 03:31:09 +0000</pubDate>
		<dc:creator>amd</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://gurge.com/blog/2007/04/08/hashiness/</guid>
		<description><![CDATA[Back in college, the killer Intro to CS class used a home grown object-oriented version of Pascal. It was a bit like Borland&#8217;s Pascal, except it ran on Solaris and the IDE was about 100x slower. We quickly covered some programming fundamentals, then dutifully moved on to inheritance and polymorphism. One particularly grueling assignment involved [...]]]></description>
			<content:encoded><![CDATA[<p>Back in college, the killer <a href="http://www.cs.brown.edu/courses/cs015/">Intro to CS</a> class used a home grown object-oriented version of Pascal. It was a bit like Borland&#8217;s Pascal, except it ran on Solaris and the IDE was about 100x slower. We quickly covered some programming fundamentals, then dutifully moved on to inheritance and polymorphism. One particularly grueling assignment involved writing a linked list where the nodes used polymorphism instead of conditionals.</p>
<p>Turns out that in the real world you never want to write a linked list with polymorphism, but the lesson obviously struck a nerve. Since then I&#8217;ve pretty much used object-oriented languages exclusively. For the purposes of this blog post I&#8217;m skipping over our flirtations with C and optimized MMX instructions at Strangeberry, as well as a nightmarish Perl project at Jobster.</p>
<p>I can finally report that my long honeymoon with OOP is coming to an end. These days, I use the Hash much more often than I use the Class. The Hash has a certain appeal that&#8217;s hard to resist.</p>
<p><b>The Rise of the Hash</b></p>
<p>Maybe it&#8217;s because I spend so much of my time these days on data manipulation and deployment scripts. Maybe it&#8217;s frustration with <a href="http://gurge.com/blog/2006/08/15/laziness-part-two-the-6000-line-hashtable/">poor design</a>. Maybe it&#8217;s because machines have simply gotten fast enough to enable my inherent laziness. Maybe it&#8217;s because of <a href="http://www.yaml.org/">YAML</a>.</p>
<p>Whatever the reason, I simply love Hash tables. I can&#8217;t get enough of them. I use Hash tables for complex method arguments, just like the rest of Rails. I&#8217;ll use a Hash table instead of a Class for as long as possible, right up until I absolutely need to add a method. Hash tables are so syntactically light and malleable, it feels like a real sacrifice to switch to a full-blown Class.</p>
<p><b>Ruby&#8217;s Hash</b></p>
<p>I especially love the somewhat obscure Ruby feature that allows you to attach a block to Hash.new. The block gets called whenever you look up a key that isn&#8217;t in the Hash yet, which makes it a handy way to create new elements on demand. I talk about this a bit in my <a href="http://gurge.com/blog/2006/10/16/ruby-at-60/">Ruby at 60</a> post, but I wanted to give a different example here. I often find that I need to partition a data set based on a key. Imagine a set of employees, each with a role:</p>
<pre class="emacs">
employees =
  [
   { <span class="constant">:role</span> =&gt; <span class="string">'ceo'</span>, <span class="constant">:name</span> =&gt; <span class="string">'Mr. Burns'</span> },
   { <span class="constant">:role</span> =&gt; <span class="string">'underling'</span>, <span class="constant">:name</span> =&gt; <span class="string">'Smithers'</span> },
   { <span class="constant">:role</span> =&gt; <span class="string">'slave'</span>, <span class="constant">:name</span> =&gt; <span class="string">'Homer'</span> },
   { <span class="constant">:role</span> =&gt; <span class="string">'slave'</span>, <span class="constant">:name</span> =&gt; <span class="string">'Lenny'</span> },
   { <span class="constant">:role</span> =&gt; <span class="string">'slave'</span>, <span class="constant">:name</span> =&gt; <span class="string">'Carl'</span> },
   ...
  ]
</pre>
<p>The following snippet breaks up the employees by role:</p>
<pre class="emacs">
by_role = <span class="type">Hash</span>.new { |hash, key| hash[key] = [] }
employees.each { |i| by_role[i[<span class="constant">:role</span>]] &lt;&lt; i }
</pre>
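<p>Here is the same partitioning as one runnable sketch, with two extras that aren&#8217;t in the snippet above: a comparison against Enumerable#group_by (assuming a Ruby new enough to have it), and a reminder that merely reading a missing key also runs the block. The variable names are made up for illustration.</p>

```ruby
employees = [
  { :role => 'ceo', :name => 'Mr. Burns' },
  { :role => 'underling', :name => 'Smithers' },
  { :role => 'slave', :name => 'Homer' },
  { :role => 'slave', :name => 'Lenny' },
  { :role => 'slave', :name => 'Carl' }
]

# The block runs on a lookup of a missing key and stores a fresh array.
by_role = Hash.new { |hash, key| hash[key] = [] }
employees.each { |e| by_role[e[:role]] << e }

slave_names = by_role['slave'].map { |e| e[:name] }

# group_by builds the same partition in one call.
same_as_group_by = (by_role == employees.group_by { |e| e[:role] })

# Gotcha: just reading a missing key creates an (empty) entry for it.
by_role['intern']
created_on_read = by_role.key?('intern')

p slave_names
p same_as_group_by
p created_on_read
```

<p>If the read-creates-an-entry behavior is a problem, check with <b>key?</b> or use <b>fetch</b> before indexing.</p>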
<p>We use techniques like this all over the place for managing data sets that aren&#8217;t in a database.</p>
<p><b>Hashiness</b></p>
<p>Using Hash tables allows me to experience a sense of satisfaction that I call Hashiness.</p>
<p>The authors of ActiveRecord clearly were going for Hashiness. Ditto for sessions/params in Rails. There are oodles of Classes in Rails that masquerade as Hashes. Many of the essential Rails methods take a Hash as a parameter, and I&#8217;ve lifted this pattern for my own work.</p>
<p>I claim that Hashiness makes me a more productive engineer, and makes <a href="http://www.urbanspoon.com">Urbanspoon</a> a better product. Are you getting enough Hashiness?</p>
]]></content:encoded>
			<wfw:commentRss>http://gurge.com/blog/2007/04/08/hashiness/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Rails expire_fragment(regex) Considered Harmful</title>
		<link>http://gurge.com/blog/2007/02/04/rails-expire_fragmentregex-considered-harmful/</link>
		<comments>http://gurge.com/blog/2007/02/04/rails-expire_fragmentregex-considered-harmful/#comments</comments>
		<pubDate>Sun, 04 Feb 2007 21:57:13 +0000</pubDate>
		<dc:creator>amd</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://gurge.com/blog/2007/02/04/rails-expire_fragmentregex-considered-harmful/</guid>
		<description><![CDATA[Recently I discovered that one of our Urbanspoon actions was taking nearly two seconds to complete. This particular page stuck out like a sore thumb once I start crunching the numbers contained in our production log file. The slowness was especially puzzling because the action in question seemed to be one of the simplest actions [...]]]></description>
			<content:encoded><![CDATA[<p>Recently I discovered that one of our <a href="http://www.urbanspoon.com">Urbanspoon</a> actions was taking nearly two seconds to complete. This particular page stuck out like a sore thumb once I started crunching the numbers contained in our production log file. The slowness was especially puzzling because the action in question seemed to be one of the simplest actions in the entire application. I quickly determined that:</p>
<ul>
<li>The slowdown wasn&#8217;t in page rendering.</li>
<li>The slowdown wasn&#8217;t coming from the db.</li>
<li>I couldn&#8217;t make it happen on my dev box.</li>
<li>It occurred even if the production machine wasn&#8217;t busy.</li>
</ul>
<p>So, where was the problem?</p>
<p>In my previous <a href="http://gurge.com/blog/2006/10/16/ruby-at-60/">Ruby at 60</a> post, I mentioned that Ruby&#8217;s dynamic nature can make it hard to figure out what&#8217;s happening beneath the covers. This debugging session was a perfect example. Over the course of an hour or two, I laboriously inserted benchmarking code into various bits of Rails. First validation. Then ActiveRecord callbacks. Maybe the problem was inside the logger? Unfortunately, it&#8217;s not easy to find all the major pieces of Rails that affect each request.</p>
<p>Eventually I was able to trace the slowdown to the following line of code in one of our cache sweepers:</p>
<p><b style="color:red">expire_fragment(/base\/xyz.*/)</b></p>
<p>At the moment we use <a href="http://api.rubyonrails.org/classes/ActionController/Caching/Fragments.html">File-based Fragment Caching</a> to speed up Urbanspoon. Some of our pages have a cached section at the top (<b>base/xyz_top</b>), a dynamic section in the middle, and another cached section at the bottom (<b>base/xyz_bottom</b>). When the underlying data changes, a cache sweeper would jump in and expire the two cached sections using a regex, <b>base/xyz.*</b>.</p>
<p>Silly me, for some reason I thought that FileStore would expire the regex as follows:</p>
<ol>
<li>Look in the <b>base</b> directory.</li>
<li>Find all files that match <b>xyz.*</b></li>
<li>Delete them.</li>
</ol>
<p>I couldn&#8217;t have been more wrong. Instead, the code in UnthreadedFileStore works more like this:</p>
<ol>
<li>Iterate every single file in the fragment cache.</li>
<li>Delete files which match <b>base/xyz.*</b></li>
</ol>
<p>Our production server&#8217;s fragment cache usually contains in excess of 5,000 cached fragment files. Every time this action was invoked we were iterating all of them. Ouch!</p>
<p>The bug was easy to fix &#8211; simply replace the regex with two separate calls to expire_fragment, one for the top fragment and one for the bottom. Somewhere in the back of my mind I knew I&#8217;d have to make this change eventually, since we&#8217;ll be switching to memcached in the not-so-distant future. I just didn&#8217;t anticipate the fire drill.</p>
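<p>To make the cost difference concrete, here&#8217;s a rough simulation using plain files in a temp directory. This is not the Rails code: <b>expire_by_regex</b>, <b>expire_by_name</b>, <b>build_cache</b> and the <b>.cache</b> file naming are all assumptions made up for the sketch.</p>

```ruby
require 'find'
require 'fileutils'
require 'tmpdir'

# Regex-style expiry: visits every file under root and deletes the ones
# whose relative path matches. Returns how many files it had to look at.
def expire_by_regex(root, regex)
  visited = 0
  matches = []
  Find.find(root) do |path|
    next unless File.file?(path)
    visited += 1
    relative = path[(root.length + 1)..-1]
    matches << path if relative =~ regex
  end
  matches.each { |path| File.delete(path) }
  visited
end

# Direct expiry: touches only the one named fragment.
def expire_by_name(root, name)
  path = File.join(root, "#{name}.cache")
  File.delete(path) if File.exist?(path)
end

# Fill a fake fragment cache: two xyz fragments plus `extra` unrelated ones.
def build_cache(root, extra)
  FileUtils.mkdir_p(File.join(root, 'base'))
  names = %w[base/xyz_top base/xyz_bottom] +
          (1..extra).map { |n| "base/other_#{n}" }
  names.each do |name|
    File.open(File.join(root, "#{name}.cache"), 'w') { |f| f.write('x') }
  end
end

Dir.mktmpdir do |root|
  build_cache(root, 5000)
  puts "regex expiry looked at #{expire_by_regex(root, /base\/xyz.*/)} files"
end
```

<p>The regex version has to look at every file in the cache to find two, while two calls to the direct version never look at anything else.</p>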
<p>Anyway, take my advice. Avoid expire_fragment(regex). It&#8217;s seductive if you have multiple fragments to expire, but it&#8217;ll cost you in the long run.</p>
]]></content:encoded>
			<wfw:commentRss>http://gurge.com/blog/2007/02/04/rails-expire_fragmentregex-considered-harmful/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Google in Action and Other Graphs</title>
		<link>http://gurge.com/blog/2007/01/23/google-in-action-and-other-graphs/</link>
		<comments>http://gurge.com/blog/2007/01/23/google-in-action-and-other-graphs/#comments</comments>
		<pubDate>Tue, 23 Jan 2007 18:51:28 +0000</pubDate>
		<dc:creator>amd</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://gurge.com/blog/2007/01/23/google-in-action-and-other-graphs/</guid>
		<description><![CDATA[In my endless quest to learn more about Urbanspoon&#8217;s explosive growth, I put together a script to generate graphs illustrating various aspects of our traffic. There are many interesting questions that we can now answer:

Which cities are getting the most traffic from search engines?
How long does it take google to index a new Urbanspoon city?
etc.

I [...]]]></description>
			<content:encoded><![CDATA[<p>In my endless quest to learn more about <a href="http://www.urbanspoon.com">Urbanspoon&#8217;s</a> explosive growth, I put together a script to generate graphs illustrating various aspects of our traffic. There are many interesting questions that we can now answer:</p>
<ul>
<li>Which cities are getting the most traffic from search engines?</li>
<li>How long does it take Google to index a new Urbanspoon city?</li>
<li>etc.</li>
</ul>
<p>I whipped up a script that periodically crunches our logs offline and creates graphs using the excellent (but cryptic) <a href="http://oss.oetiker.ch/rrdtool/">rrdtool</a>. The graphs are generated on the hour. <a href="http://awstats.sourceforge.net/">awstats</a> is nice, but sometimes you have to dig in and get your hands dirty. We also use <a href="http://munin.projects.linpro.no/">munin</a> to keep an eye on our hardware. </p>
<h4>GoogleBot</h4>
<p>Below is a recent snapshot of GoogleBot crawling Urbanspoon. Green is <a href="http://www.urbanspoon.com/c/1/Seattle.html">Seattle</a>, blue is <a href="http://www.urbanspoon.com/c/2/Chicago.html">Chicago</a>, and red is <a href="http://www.urbanspoon.com/c/3/New-York.html">New York</a>. X axis is time, Y axis is pages per minute. I&#8217;ve removed the Y labels to obscure our actual numbers.<br />
<center><img src="/blogi/google_bot-week.jpg" width="400" height="273"/></center></p>
<p>Notice the flat tops on each bulge of GoogleBot traffic &#8211; GoogleBot caps its crawl rate at a certain number of pages per minute. Over the past few weeks they&#8217;ve been gradually ramping up the rate at which they crawl Urbanspoon. Perhaps they looked at our response times and concluded that our site can handle it. Also note that they&#8217;re running out of Seattle pages to crawl. Strangely, GoogleBot tends to go to sleep around midnight PST.</p>
<p>Not everyone at Google is so polite. For a brief period last week Google&#8217;s mobile crawler was hitting our site with over 7000 requests per hour.</p>
<h4>Other Robots</h4>
<p>GoogleBot hits us far more than any other robot. To put this in perspective, here is a graph of noticeable robots hitting Urbanspoon recently. GoogleBot dominates. Maybe Yahoo should just throw in the towel and start using <a href="http://lucene.apache.org/nutch/">Nutch</a> for their crawls.<br />
<center><img src="/blogi/robots-week.jpg" width="309" height="233"/></center></p>
<h4>Traffic</h4>
<p>Our referral rate from Google is increasing rapidly but not uniformly. For example, the graph below indicates that we have more work to do in <a href="http://www.urbanspoon.com/c/2/Chicago.html">Chicago</a>:</p>
<p><center><img src="/blogi/google_ref-week.jpg" width="400" height="282"/></center></p>
]]></content:encoded>
			<wfw:commentRss>http://gurge.com/blog/2007/01/23/google-in-action-and-other-graphs/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>emacs dotfiles 2007-01-20</title>
		<link>http://gurge.com/blog/2007/01/20/emacs-dotfiles-2007-01-20/</link>
		<comments>http://gurge.com/blog/2007/01/20/emacs-dotfiles-2007-01-20/#comments</comments>
		<pubDate>Sat, 20 Jan 2007 21:15:26 +0000</pubDate>
		<dc:creator>amd</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://gurge.com/blog/2007/01/20/emacs-dotfiles-2007-01-20/</guid>
		<description><![CDATA[It&#8217;s time for another dotfile release. This release includes some fixes for emacs 22, and a significant improvement in abtags. Download the dotfiles here:
Adam&#8217;s Emacs Dotfiles
From the changelog:
2007-01-20
- completion fixes for emacs 22 compat
- changed nxml indent to 2
- mapped html/sgml to nxml mode
- *.rake => ruby mode
- abtags auto-reloads TAGS files now
- finally tracked [...]]]></description>
			<content:encoded><![CDATA[<p>It&#8217;s time for another dotfile release. This release includes some fixes for emacs 22, and a significant improvement in abtags. Download the dotfiles here:</p>
<p><a href="/amd/emacs/">Adam&#8217;s Emacs Dotfiles</a></p>
<p>From the changelog:</p>
<p><b>2007-01-20</b><br />
- completion fixes for emacs 22 compat<br />
- changed nxml indent to 2<br />
- mapped html/sgml to nxml mode<br />
- *.rake => ruby mode<br />
- abtags auto-reloads TAGS files now<br />
- finally tracked down and fixed pesky loaddefs issue</p>
]]></content:encoded>
			<wfw:commentRss>http://gurge.com/blog/2007/01/20/emacs-dotfiles-2007-01-20/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
