Do you process log files, spreadsheets, or XML as part of your engineering work? You too can become a text wizard by mastering these ten commands. Learn them one by one, then mix and match for best results.
1. cat
First, you need to get some text in your shell. Use cat to output a text file.
$ cat access.log
1.2.3.4 urbanspoon.com - [15/Aug/2008:06:25:50 +0000] "GET /r/16/
1.2.3.4 urbanspoon.com - [15/Aug/2008:06:25:51 +0000] "GET /r/16/
1.2.3.4 urbanspoon.com - [15/Aug/2008:06:25:51 +0000] "POST /e/vo
1.2.3.4 urbanspoon.com - [15/Aug/2008:06:25:51 +0000] "GET /r/15/
1.2.3.4 urbanspoon.com - [15/Aug/2008:06:25:51 +0000] "GET /m/r/1
2. grep/zgrep
Filters text and picks out just the bits that you care about. zgrep does the same, but also works on compressed files. That means you can run zgrep over a bunch of files at once, even when some are compressed and some are not.
Arguments that you care about
- -i : case insensitive
- -v : show lines that DON’T match
- -s : ignore errors (useful in conjunction with find/xargs)
- -E : search for ‘extended’ regular expressions. Necessary for all but the most trivial greps.
- -F : search for text, not a regex
- -o : only print the matching text, not the whole line
- -n : include line numbers, useful for emacs
- -c : show a count of matches, not the matches
- -h : suppress filename
- -R : grep files and directories recursively. Also see find, down below.
$ cat access.log | grep 50
1.2.3.4 urbanspoon.com - [15/Aug/2008:06:25:50 +0000] "GET /r/16/
1.2.3.4 urbanspoon.com - [15/Aug/2008:06:26:50 +0000] "GET /r/4/5
1.2.3.4 urbanspoon.com - [15/Aug/2008:06:26:50 +0000] "GET /r/4/5
1.2.3.4 urbanspoon.com - [15/Aug/2008:06:26:50 +0000] "GET /fn/52
1.2.3.4 urbanspoon.com - [15/Aug/2008:06:27:50 +0000] "GET /m/ci/
$ cat access.log | grep -Eo "Aug/[0-9]+"
Aug/2008
Aug/2008
Aug/2008
Aug/2008
Aug/2008
$ cat access.log | grep -v GET
1.2.3.4 urbanspoon.com - [15/Aug/2008:06:25:51 +0000] "POST /e/vo
1.2.3.4 urbanspoon.com - [15/Aug/2008:06:25:57 +0000] "POST /e/ku
1.2.3.4 urbanspoon.com - [15/Aug/2008:06:26:26 +0000] "POST /u/ac
1.2.3.4 urbanspoon.com - [15/Aug/2008:06:26:26 +0000] "POST /u/ac
1.2.3.4 urbanspoon.com - [15/Aug/2008:06:26:30 +0000] "POST /u/ac
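The flags that the examples above don't exercise are easier to see on a tiny file. Here's a minimal sketch, assuming a made-up sample.log (the file name and its four lines are invented for illustration, not taken from the real access log):

```shell
# Build a small, hypothetical log file to experiment on.
tmp=$(mktemp -d)
cat > "$tmp/sample.log" <<'EOF'
GET /about
POST /e/vote
get /blog/
GET /a/1/index.html
EOF

# -c counts matching lines; -i ignores case, so the lowercase "get" counts too.
get_count=$(grep -ci 'get' "$tmp/sample.log")

# -n prefixes each matching line with its line number.
post_line=$(grep -n 'POST' "$tmp/sample.log")

# As a regex, "." matches any character, so every line matches.
any_char=$(grep -c '.' "$tmp/sample.log")

# With -F the pattern is plain text, so only lines with a literal dot match.
literal_dot=$(grep -cF '.' "$tmp/sample.log")

echo "$get_count | $post_line | $any_char | $literal_dot"
rm -rf "$tmp"
```

The last two commands are the reason to reach for -F whenever you're searching for file names, IP addresses, or anything else full of dots.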
3. sort
Sorts lines of text. Sort is fast, flexible, and well designed. It also contains possibly the most well-thought-out option ever created in the history of command line tools: -n.
Arguments that you care about
- -n : sort numerically (you’ll be hearing more about this later)
- -r : reverse the sort order
- -k : sort based on delimited columns, not just lines. Useful for spreadsheets.
$ cat access.log | grep -Eo 'GET [^"]+' | sort
GET /a/1/Seattle-at-night.html HTTP/1.0
GET /about HTTP/1.0
GET /b/favorites/1/4845/Seattle-restaurants HTTP/1.0
GET /b/favorites/3/245/New-York-restaurants HTTP/1.0
GET /blog/ HTTP/1.0
$ cat access.log | grep -Eo 'GET [^"]+' | sort -r
GET /zip/8/77539/Houston-restaurants.html HTTP/1.0
GET /zip/8/77406/Houston-restaurants.html?sort=2 HTTP/1.0
GET /w/feed/reviews/3/New-York/rss.xml HTTP/1.0
GET /u/friends/29012 HTTP/1.0
GET /u/favorites/29012 HTTP/1.0
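Neither -n nor -k shows up in the examples above, so here's a rough sketch of both on three invented lines of "hits page" data. The -t flag, which tells sort what the column delimiter is, isn't in the list above; I'm adding it here because -k is hard to demonstrate without it.

```shell
# Hypothetical data: a hit count and a page, separated by a single space.
data='9 /about
100 /blog/
25 /'

# Plain sort compares text character by character, so "100" lands before "25".
lexical=$(printf '%s\n' "$data" | sort)

# -n compares the leading number as a number.
numeric=$(printf '%s\n' "$data" | sort -n)

# -t sets the column delimiter; -k2,2 sorts on the second column only.
by_page=$(printf '%s\n' "$data" | sort -t' ' -k2,2)

printf '%s\n---\n%s\n---\n%s\n' "$lexical" "$numeric" "$by_page"
```

The first block of output starts with 100, the second with 9, and the third is ordered by page rather than by count.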
4. uniq
Omits repeated lines. By itself this is marginally useful, but with -c and sort, it’s magic.
Arguments that you care about
- -c : show number of occurrences
For example, here’s a poor man’s Google Analytics:
$ cat access.log | grep -Eo 'GET [^"]+' | sort | uniq -c | sort -nr
183 GET /m/u/friends HTTP/1.0
60 GET / HTTP/1.0
42 GET /m/ci/6 HTTP/1.0
40 GET /r/15/191259/restaurant/Southeast/Peking-Express
39 GET /m/u/suggest_city HTTP/1.0
5. cut
Incredibly useful for slicing and dicing lines. Cut by delimiter or position. cut uses TAB as its default delimiter, which is handy because I can never remember how to get a TAB character into bash.
Arguments that you care about
- -d : set the delimiter
- -f : pick the fields to output. -f1,3 outputs fields 1 & 3. -f1-3 outputs fields 1, 2 & 3.
- -c : or, pick the chars to output
Here’s how you can get the dates from your log file, if you ever wanted to do such a thing. Note the use of quotes to surround the space character, which I’m using as a delimiter.
$ cat access.log | cut -d" " -f4-5
[15/Aug/2008:06:25:50 +0000]
[15/Aug/2008:06:25:51 +0000]
[15/Aug/2008:06:25:51 +0000]
[15/Aug/2008:06:25:51 +0000]
[15/Aug/2008:06:25:51 +0000]
Want to see a breakdown by hour? Just cut out the hour, sort and uniq -c:
$ cat access.log | cut -d":" -f2 | sort | uniq -c
17798 14
22160 15
26415 16
23181 17
30535 18
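The log examples only use -d and a field range, so here's a small sketch of the other forms: the default TAB delimiter, a field list, and -c. The restaurant record is invented for illustration, and printf is just one convenient way to produce a real TAB.

```shell
# A hypothetical tab-separated record: name, city, rating.
# printf turns \t into a real TAB character.
row=$(printf 'Peking Express\tSeattle\t4')

# No -d needed: TAB is the default delimiter. -f1,3 keeps fields 1 and 3.
name_rating=$(printf '%s\n' "$row" | cut -f1,3)

# -c selects by character position instead of by field.
first_six=$(printf '%s\n' "$row" | cut -c1-6)

# For any other delimiter, pass it with -d.
second=$(printf 'a,b,c\n' | cut -d, -f2)

printf '%s\n%s\n%s\n' "$name_rating" "$first_six" "$second"
```

Note that the space inside "Peking Express" survives the first cut, because only TABs separate fields there.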
6. wc
Counts lines. Actually, it counts lots of things, but mostly I use it for lines. When combined with sort and uniq, you can get a count of unique things.
Arguments that you care about
- -l : just show the line count
For example, here is the number of unique user agents in my log file. The sort matters, because uniq only collapses repeated lines when they are next to each other.
$ cat access.log | cut -d'"' -f6 | sort | uniq | wc -l
8539
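One thing worth checking for yourself when counting unique things: uniq only drops a repeated line when it directly follows its twin. Here's a minimal sketch with three made-up user agents. The tr at the end is only there because some versions of wc pad the count with spaces.

```shell
# Hypothetical user agents, with a repeat that is not adjacent.
agents='Mozilla
curl
Mozilla'

# uniq alone keeps all three lines, because the duplicates are not neighbors.
unsorted=$(printf '%s\n' "$agents" | uniq | wc -l | tr -d ' ')

# Sorting first puts the duplicates side by side, so uniq can collapse them.
sorted=$(printf '%s\n' "$agents" | sort | uniq | wc -l | tr -d ' ')

echo "without sort: $unsorted, with sort: $sorted"
```

So if you want a true count of distinct values, put a sort in front of the uniq.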
7. head/tail
Given some incoming text, just look at the start or end of the file. This is useful if you have lots of data but only want to look at a subset. The examples aren’t that interesting, but here’s how you use them:
Show the first 100 lines:
$ cat access.log | head -n 100
Show the last 100 lines:
$ cat access.log | tail -n 100
Tail also has a completely different mode where it “watches” a file and outputs lines that are written to the end. You can “tail your log file” to see what’s happening on your server:
$ tail -f access.log
Want to see if anyone is hitting that new page? Combine tail -f with grep:
$ tail -f access.log | grep cool_new_feature
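head and tail also chain together nicely when you want a slice from the middle of a file. A quick sketch, using ten numbered lines as a stand-in for a real log (the data is invented; -n is the portable way to spell the line count):

```shell
# Ten numbered lines stand in for a real log file.
lines=$(printf '%s\n' 1 2 3 4 5 6 7 8 9 10)

# The first three and the last three lines.
first=$(printf '%s\n' "$lines" | head -n 3)
last=$(printf '%s\n' "$lines" | tail -n 3)

# Chain them for a slice: the last 2 of the first 5 lines, i.e. lines 4 and 5.
middle=$(printf '%s\n' "$lines" | head -n 5 | tail -n 2)

printf '%s\n---\n%s\n---\n%s\n' "$first" "$last" "$middle"
```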
8. find
Walks your disk and prints out file names. Again, not particularly useful by itself, but tremendously useful when combined with xargs (below) and grep. Unfortunately, find has a somewhat confusing command line syntax. Here are the two flavors you should learn:
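Find all files and directories. GNU find starts from the current directory when you don’t give it a path; other versions may insist on one: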
$ find
.
./Makefile.in
./config.log
./COPYING
./COMPILING
Find files only, not directories. Note that “.” means “the current directory”. So, we’re really saying “find all files in the current directory”.
$ find . -type f
./Makefile.in
./config.log
./COPYING
./COMPILING
./autogen.sh
Combine with grep to find C++ files.
$ find | grep cpp
./src/tankai/TankAIComputerTarget.cpp
./src/tankai/TankAIAdder.cpp
./src/tankai/TankAIComputer.cpp
./src/tankai/TankAIStrings.cpp
./src/tankai/TankAIComputerBuyer.cpp
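Piping find into grep matches "cpp" anywhere in the path, which is usually fine but occasionally catches more than you meant. If that bites you, find has its own -name test. It isn't one of the two flavors above, so treat this as an optional extra. The directory and file names here are made up for the sketch.

```shell
# Build a throwaway tree; every name here is invented for illustration.
tmp=$(mktemp -d)
mkdir -p "$tmp/src" "$tmp/cppdocs"
touch "$tmp/src/Tank.cpp" "$tmp/src/Tank.o" "$tmp/cppdocs/notes.txt"

# grep matches "cpp" anywhere in the path, so cppdocs/notes.txt slips through.
loose=$(cd "$tmp" && find . -type f | grep cpp | sort)

# -name tests only the file name itself, against a shell-style pattern.
strict=$(cd "$tmp" && find . -type f -name '*.cpp')

printf '%s\n---\n%s\n' "$loose" "$strict"
rm -rf "$tmp"
```

Quote the pattern so the shell hands '*.cpp' to find untouched instead of expanding it first.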
9. xargs
This is a weird one, but it’s essential if you want to become a true text wizard. Basically, xargs takes a bunch of text lines and sends them as command line arguments to some other command. Why would you want this? Here are a few ideas off the top of my head:
Grep for some text in your .cpp files, but don’t grep inside the .o files:
$ find | grep cpp | xargs grep strcpy
Download URLs listed in a text file:
$ cat urls.txt | xargs curl
Zip up your files, but exclude subversion cruft. The backslash makes the dot literal, so only paths that really contain “.svn” are dropped:
$ find | grep -v '\.svn' | xargs zip source.zip
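One caveat: by default xargs splits its input on whitespace, so a file name with a space in it turns into two bogus arguments. A common workaround, sketched here under the assumption that your find and xargs support -print0 and -0 (the GNU and BSD versions do), is to separate names with NUL bytes instead. The file names and contents are invented, and -name and grep -l are extras not covered above.

```shell
# Throwaway files; the names and contents are invented for this sketch.
tmp=$(mktemp -d)
printf 'strcpy(dst, src);\n' > "$tmp/my file.cpp"
printf 'int main() {}\n' > "$tmp/main.cpp"

# -print0 ends each name with a NUL byte and xargs -0 splits on NUL,
# so the space in "my file.cpp" does not break the name into two arguments.
# grep -l prints only the names of the files that contain a match.
hits=$(cd "$tmp" && find . -name '*.cpp' -print0 | xargs -0 grep -l strcpy)

echo "$hits"
rm -rf "$tmp"
```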
10. less
Less is a text viewer. Use it to page through a file or the output from a command. You can search the contents or jump around with hotkeys. Here are the keys I use most often:
- space : next page
- b : previous page
- < : jump to start of file
- > : jump to end of file
- / : search forward for regex
- ? : search backward for regex
This is especially useful if you’re building up a complicated string of commands. Add less to the end of each sequence to preview the results:
$ cat access.log | less
$ cat access.log | grep -Eo 'GET [^"]+' | less
$ cat access.log | grep -Eo 'GET [^"]+' | sort | less
$ cat access.log | grep -Eo 'GET [^"]+' | sort | uniq -c | less
...
What next?
The true power of the command line comes not from any individual command, but from knowing how to chain them together. We’ve really just scratched the surface here. Once you’ve mastered these commands, here are a few others to check out:
- convert – image processing
- curl – fetch web pages
- dc – calculator
- strings – pull strings out of a binary file
- sum – checksum a file
- watch – run something every few seconds and look for changes
- xxd – hex dump a file
You’ll also want to learn a bit about redirection, bash’s command line history, and the wonders of ctrl-r.