On Using Awk Instead of Python
August 29, 2019
A Software Niche
Python is widely used as a language for one-off scripts that deal with text files. As an interpreted, dynamically typed language, Python has extremely low development and deployment overhead, making it perfect for automating tasks that would otherwise be done manually, or which are done as part of a manual process. This is perhaps part of the reason why Python is so popular in data science workflows. While the upfront complexity of transforming an input into a usable data table can be high, the number of “transactions”, or processing batches, is relatively low. Therefore, data science pre-processing problems benefit from a language that offers high productivity up front, with a low amortized performance penalty.
While Python, and to a lesser extent Ruby, have justly taken much of the mind share for this kind of problem, I want to promote an older, perhaps cooler alternative. Many developers likely know awk as the inscrutable answer to StackOverflow questions such as “Shell command to sum integers, one per line?” and “Find and kill a process in one line using bash and regex”. If you’re anything like me, you’ve probably copied and pasted more than a few awk one-liners without understanding the tool in any detail.
I recently took the time to learn awk in some more detail, and found that it is not only powerful, but extremely approachable. My goal with this post is not to promote awk over Python, even for the very specific kind of workload that we’ll consider. Rather, I want to introduce people to some of awk’s power, and show that most of what comes naturally in Python is just as natural, if not more so, in awk.
A Motivating Example
Let’s use the problem I was solving as an example. I replicate my org file TODOs to a server, from which I want to run a daily cron job summarizing the task status changes. For example, if yesterday a task was marked as TODO, but today it is DONE, I want that task listed in an email summarizing the things I got done yesterday. In order to access historical information on the status of the org files, rather than put them in version control and inspect the history, I materialize the current status with a separate daily cron job, which writes rows to a .tsv file. So the architecture we have is two separate cron jobs, each using an awk script.
Let’s look at each problem, and compare a strawman Python solution with my awk script.
Dealing with row-based data
The first script needs to take org-mode headlines, and transform them into structured data. An org-mode headline might look like this:
** TODO separate books for donation :moving:
   |--| |-------------------------|
  keyword          heading
We want to extract the heading, and treat that as a stable identifier. The TODO state keyword, as well as the file and the current date, will be stored in separate columns. The first task is to grep for all of the headings. I think even the most dedicated Python jockey would tend to reach for command line tools here:
find ~/org -name '[^.]*.org' -type f \
-exec egrep '\* (TODO|DONE|CANCELLED|NEXT|WAIT|HOLD)' {} + | \
This will produce a line of output per match, prepended with the name of the file. We can now pipe this stream through a processing script.
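For example, a single line of this stream might look like the following (the heading here is made up, but the shape is what matters):
work.org:** TODO review the budget spreadsheet :finance: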
Python version
In Python, you might write something like this:
import sys
import datetime

now = datetime.datetime.now()
date = "{}-{:02}-{:02}".format(now.year, now.month, now.day)

for line in sys.stdin.readlines():
    words = line.strip().split(" ")
    words[0] = words[0].replace(":", " ")
    heading = " ".join(words[2:])
    output = "{}\t{}\t{}\t{}".format(heading, words[0], words[1], date)
    print(output)
and use it by piping into the python
interpreter.
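Concretely, assuming the script above is saved as todo_extract.py (the script and output file names here are hypothetical), the first cron job could run:
find ~/org -name '[^.]*.org' -type f \
    -exec egrep '\* (TODO|DONE|CANCELLED|NEXT|WAIT|HOLD)' {} + | \
    python3 todo_extract.py >> ~/todo-status.tsv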
Awk version
BEGIN {
    "date \"+%Y-%m-%d\"" | getline date
}
The first part is the BEGIN block, which is optional. If specified, it will run before the rest of the script. Conveniently, any variables defined here will be in scope for the per-line portion of the script. In this case we are using the cmd | getline form to precompute today’s date. This stores the output of cmd in the variable named by the argument of getline.
Next is the main body of the script. This will run once per line.
awk
has already split the line by the field separator (by
default, any whitespace), and stored the parts in the numbered variables
$1, $2, $3
and so on.
{
    sub(":", " ", $1);
    for (i = 3; i <= NF; i++) {
        printf("%s ", $i)
    }
    printf("\t%s\t%s\t%s\n", $1, $2, date)
}
The first thing we do is an in-place substitution, using the sub function. This replaces the first instance of ":" with a space in the string $1. In our case, $1 contains a string like work.org:**. Next, we write out the whole headline after the TODO keyword, by iterating over every field from $3 onward. The variable NF contains the number of fields in the current row. Finally, we use the date variable we defined in the BEGIN block to write the rest of the row. We end up with a row like this:
separate books for donation\tpersonal.org\t**\tTODO\t2019-07-21
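The awk version is invoked the same way. If the script above lives in a file called todo_extract.awk (again, a hypothetical name), the tail of the pipeline simply becomes:
... | awk -f todo_extract.awk >> ~/todo-status.tsv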
Comparison
The Python version of the script uses the expressive standard library, and comes out very succinct. Its higher character count is mainly down to better readability, and to performing string copies rather than in-place modification. List slicing in particular makes it nice to write: " ".join(words[2:]) looks better than the for loop that awk requires.
On the other hand, writing that for
loop really
isn’t that much trouble. What strikes me most is that the
awk
version, while it reads differently than the
Python, has most of the same semantics that make Python appealing to
write: easy use of global variables, solid string manipulation
primitives, and simple access to system resources like the current
time.
So far we are still in the realm of what I always knew
awk
was good for—munging rows of text into other rows of
text. Next we’ll take a look at a task for which I definitely would have
used Python in the past.
Stringly- and dictly-typed programming
Now that we have per-day snapshots of what my org headings look like, it’s time to process the data into a daily report. There are two requirements for the report:
- It needs to show the number of tasks in the previous day that were marked DONE from a non-DONE state, and list each one.
- It should list the five most stale tasks which remain undone—that is, those tasks whose TODO status has not changed in the longest period.
These scripts are more involved, so I’ll discuss them with the corresponding sections interleaved, to highlight the similarities.
Preamble
In both scripts, we precompute the current day’s string representation. Awk uses a pair of built-in functions to format the date string:
BEGIN {
    now = systime();
    today = strftime("%Y-%m-%d", now);
}
In Python, we also precompute the date. We have to set up our container data structures as well: a bunch of dictionaries and one set.
import sys
import datetime
now = datetime.datetime.now()
date = now.strftime("%Y-%m-%d")

headlines = set([])
first_appearance = {}
latest_appearance = {}
previous_status = {}
current_file = {}
latest_status = {}
days_in_state = {}
Per-line processing
We can do everything we want in a single pass of the input data. For each line, we need to do the following:
- Make a note of the latest date we have seen it. Since the input is sorted by day, this is just the date of the current row. We’ll use this to filter out TODOs that are no longer current.
- Store the current status and the previous day’s status.
- If the current status is equal to the previous day’s, increment the number of days it has been in that status.
- Store the file where the TODO is found on that day.
In Python, these steps look like so:
for line in sys.stdin.readlines():
    words = line.strip().split("\t")
    h = words[0]
    headlines.add(h)
    latest_appearance[h] = words[4]

    status = words[3]
    previous_status[h] = latest_status[h] if h in latest_status else ""
    latest_status[h] = status
    if h in days_in_state and status == previous_status[h]:
        days_in_state[h] = days_in_state[h] + 1
    else:
        days_in_state[h] = 1

    current_file[h] = words[1]
The awk
version is strikingly similar. The only data
structure on offer is the versatile associative array. These take the
place of both dictionaries and lists. They also allow us to model a set
by simply using the number 1
(or any other value) as the
value for each key we want to store.
{
    h = $1;
    headlines[h] = 1;
    latest_appearance[h] = $5;

    status = $4;
    previous_status[h] = latest_status[h];
    latest_status[h] = status;
    if (status == previous_status[h]) {
        days_in_state[h]++;
    } else {
        days_in_state[h] = 1;
    }

    current_file[h] = $2;
}
Even though we didn’t declare any of these arrays in the preamble,
they do “the right thing” for missing values. For example, in the first
instance of a headline, latest_status[h]
is the empty
string. awk
has written the Python version’s
if
expression for us. Similarly,
days_in_state[h]++
increments the missing value to 1, as
desired.
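You can see this behavior in isolation with a minimal sketch (the variable names are illustrative); since the program is all BEGIN, awk runs it without reading any input:
BEGIN {
    # A missing array value reads as the empty string in a string context...
    print "[" latest_status["x"] "]";   # prints []
    # ...and as 0 in a numeric context, so ++ takes it to 1.
    days_in_state["x"]++;
    print days_in_state["x"];           # prints 1
}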
Aggregations
After looping over all the per-day records, we need to materialize a few aggregates of the data.
- For each headline, if its latest status is DONE and its previous status was NEXT or TODO, it is added to the list of tasks completed today.
- The incomplete headlines are sorted by the number of days they have been in their current state, and the top 5 are added to the list of most stale TODOs.
Once again, this is code that Python is perfectly suited for. In
particular, the list manipulation functions filter
and
sorted
make it very easy to express the calculation for the
five most stale TODOs.
donelist = []
for h in headlines:
    if latest_appearance[h] == date:
        if previous_status[h] in ["NEXT", "TODO"] and latest_status[h] == "DONE":
            donelist.append(h)
print("{} tasks done:\n\t".format(len(donelist)), end="")
print("\n\t".join(donelist))

todos = filter(lambda h: latest_status[h] in ["NEXT", "TODO"], headlines)
stalest_todos = sorted(todos, key=lambda h: days_in_state[h], reverse=True)[:5]

print("Top 5 stalest tasks:")
for h in stalest_todos:
    print("\t- {}".format(h))
The awk version is somewhat more ungainly. However, it still reads broadly similarly to the Python.
END {
    delete done_yesterday;
    for (h in headlines) {
        if (latest_appearance[h] == today) {
            prev = previous_status[h];
            curr = latest_status[h];
            if ((prev == "TODO" || prev == "NEXT") && curr == "DONE") {
                done_yesterday[h] = 1;
            }
            if (curr == "TODO" || curr == "NEXT") {
                todos[h] = days_in_state[h];
            }
        }
    }

    printf("Tasks completed yesterday: %s\n", length(done_yesterday));
    for (h in done_yesterday) {
        printf("\t%s", h);
    }
    printf("\n\n");

    asorti(todos, stalest, "@val_num_desc")
    printf("Top 5 most stale TODO tasks:\n")
    for (i = 1; i <= 5; i++) {
        h = stalest[i];
        printf("\t[[file:%s::*%s][%s]]\n", current_file[h], h, h);
    }
}
The ugliest wart here is the line delete done_yesterday. This is required for length(done_yesterday) to return 0 if we never evaluate done_yesterday[h] = 1. awk doesn’t require you to declare your variables, but it also doesn’t have a facility for assigning a variable to an empty array! This is a fairly serious gotcha, in my opinion.
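Here is the idiom in miniature, as a sketch (gawk; the array name is illustrative):
BEGIN {
    delete done;              # ensures done exists as an (empty) array
    print length(done);       # prints 0: no element was ever assigned
    for (h in done)           # loop body never runs
        print h;
}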
We also see an example of sorting arrays. While terse, sorting in awk is rather unnatural compared to the Python. The asorti function sorts the indices of its first argument, the (associative) array todos. The second argument tells asorti to copy the result into stalest, initializing it. The third argument, @val_num_desc, is a sigil describing how to sort the indices. asorti can take a variety of traversal order specifiers in this argument, or a user-provided function. We populate todos with the value of days_in_state in the loop, so this function call sorts the headlines by their staleness, numerically, in descending order.
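To see the same call in isolation, here is a minimal gawk sketch with made-up data:
BEGIN {
    staleness["write report"] = 12;
    staleness["file taxes"] = 45;
    staleness["fix typo"] = 3;
    n = asorti(staleness, by_staleness, "@val_num_desc");
    for (i = 1; i <= n; i++)
        print by_staleness[i];   # file taxes, write report, fix typo
}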
Comparison and Conclusion
Some of the more obvious differences between awk and Python are the most relevant to their applicability to this sort of problem. Python is a general purpose programming language, which means that one has to set up the boilerplate to read from STDIN for every script. On the other hand, its standard library offers basic data processing functions that are substantially more ergonomic than awk’s. Sorting, filtering, and joining strings are all more readable and less error-prone in Python.
Nevertheless, awk is competitive with Python when it comes to its bread and butter: munging data in and out of associative data structures. It’s frankly embarrassing how many problems are most naturally solved by an awkward series of intermediate hash maps. For my money, awk beats Python when it comes to this under-appreciated and under-discussed programming paradigm. Its handling of default values is extremely nice to have. How many times have you tried to solve a problem in Python with just standard dictionaries, only to find yourself going back and adding from collections import defaultdict to the top of your file?
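To be fair, the fix is small; a minimal sketch of what that change looks like:
from collections import defaultdict

days_in_state = defaultdict(int)
days_in_state["some task"] += 1   # missing keys start at 0, much like awk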
Ease of deployment is a small point in awk’s favor as well. While dependency hell is almost never an issue for Python scripts this short, I don’t think I’ve ever logged onto a server where awk was not available.
Finally, I have barely touched on performance at all. I hope to do a followup post comparing awk’s performance with associative arrays to a variety of Python data structures. For what it’s worth, the two scripts described in this post were within 50% of each other’s (perfectly adequate) performance.