On Using Awk Instead of Python

2019-08-29

A Software Niche

Python is widely used as a language for one-off scripts that deal with text files. As an interpreted, dynamically typed language, Python has extremely low development and deployment overhead, making it perfect for automating tasks that would otherwise be done by hand, or that form one step of a manual process. This is perhaps part of the reason why Python is so popular in data science workflows. While the upfront complexity of transforming an input into a usable data table can be high, the number of "transactions", or processing batches, is relatively low. Therefore, data science pre-processing problems benefit from a language that offers high productivity up front, with a low amortized performance penalty.

While Python, and to a lesser extent Ruby, have justly taken much of the mind share for this kind of problem, I want to promote an older, perhaps cooler alternative. Many developers likely know awk as the inscrutable answer to Stack Overflow questions such as "Shell command to sum integers, one per line?", and "Find and kill a process in one line using bash and regex". If you're anything like me, you've probably copied and pasted more than a few awk one-liners without understanding the tool at all. I recently took the time to learn awk properly, and found that it is not only powerful, but extremely approachable.

My goal with this post is not to promote awk over Python, even for the very specific kind of workload that we'll consider. Rather, I want to introduce people to some of awk's power, and show that most of what comes naturally in Python is just as natural, if not more so, in awk.

A Motivating Example

Let's use the problem I was solving as an example. I replicate my org file TODOs to a server, from which I want to run a daily cron summarizing the task status changes. For example, if yesterday a task was marked as TODO, but today it is DONE, I want to have that task listed in an email summarizing the things I got done yesterday. In order to access historical information on the status of the org files, rather than put them in version control and inspect the history, I materialize the current status with a separate daily cron job, which writes rows to a .tsv file. So the architecture we have is two separate cron jobs, each using an awk script.

Let's look at each problem, and compare a strawman Python solution with my awk script.

Dealing with row-based data

The first script needs to take org-mode headlines, and transform them into structured data. An org-mode headline might look like this:

** TODO separate books for donation                       :moving:
   |--| |--------------------------------------------------------|
 keyword                            heading

We want to extract the heading, and treat that as a stable identifier. The TODO state keyword, as well as the file and the current date, will be stored in separate columns. The first task is to grep for all of the headings. I think even the most dedicated Python jockey would tend to reach for the command line tool here:

find ~/org -name '[^.]*.org' -type f \
     -exec egrep '\* (TODO|DONE|CANCELLED|NEXT|WAIT|HOLD)' {} + | \

This will produce a line of output per match, prepended with the name of the file. We can now pipe this stream through a processing script.

Python version

In Python, you might write something like this:

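import sys
import datetime

# Zero-padded date, matching the awk script's %Y-%m-%d output.
date = datetime.datetime.now().strftime("%Y-%m-%d")

for line in sys.stdin:
    # Split on any whitespace, as awk does; this also drops the trailing newline.
    words = line.split()
    # "work.org:**" -> "work.org\t**", so the file and the stars become separate columns.
    words[0] = words[0].replace(":", "\t", 1)
    heading = " ".join(words[2:])
    output = "{}\t{}\t{}\t{}".format(heading, words[0], words[1], date)
    print(output)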

and use it by piping the grep output into the Python interpreter.

Awk version

BEGIN {
    "date \"+%Y-%m-%d\"" | getline date
}

The first part is the BEGIN block, which is optional. If specified, it will run before the rest of the script. Conveniently, any variables defined here will be in scope for the per-line portion of the script. In this case we are using the cmd | getline form to precompute today's date. This runs cmd and stores the first line of its output in the variable named by the argument of getline.
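For readers coming from Python, the closest analogue to cmd | getline is probably the subprocess module. The following is only a sketch for comparison, assuming a Unix-like system where the date command is available; it is not part of either script in this post.

```python
import subprocess

# Run the same external command that the BEGIN block uses and capture its output.
result = subprocess.run(
    ["date", "+%Y-%m-%d"], capture_output=True, text=True, check=True
)

# Keep only the first line of output, much as a single getline call would.
date = result.stdout.splitlines()[0]
print(date)
```

Of course, in Python one would normally format the date with the datetime module instead of shelling out.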

Next is the main body of the script. This will run once per line. awk has already split the line by the field separator (by default, any whitespace), and stored the parts in the numbered variables $1, $2, $3 and so on.

{
    sub(":", "\t", $1);
    for (i=3;i<=NF;i++) {
        printf("%s ", $i)
    }
    printf("\t%s\t%s\t%s\n", $1, $2, date)
}

The first thing we do is an in-place substitution, using the sub function. This replaces the first instance of ":" in the string $1 with a tab. In our case, $1 contains a string like work.org:**, so the file name and the heading stars end up in separate columns. Next, we write out the whole headline after the TODO keyword, by iterating over every field from $3 onward. The variable NF contains the number of fields in the current row. Finally, we use the date variable we defined in the BEGIN block to write the rest of the row. We end up with a row like this:

separate books for donation\tpersonal.org\t**\tTODO\t2019-07-21
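As a rough cross-check of this row layout, the same transformation can be sketched as a small Python function applied to a single line of grep output. The function name to_row, the sample line, and the fixed date are made up for illustration. Note that the sketch keeps any trailing tag as part of the heading, which is also what a loop over the remaining fields would do.

```python
def to_row(line, date):
    """Turn one line of grep output into a tab-separated row (illustrative sketch)."""
    fields = line.split()                      # split on any whitespace, like awk's default
    filename, stars = fields[0].split(":", 1)  # "personal.org:**" -> "personal.org", "**"
    keyword = fields[1]                        # TODO, DONE, ...
    heading = " ".join(fields[2:])             # everything after the keyword
    return "\t".join([heading, filename, stars, keyword, date])


# Hypothetical input line and date, chosen to resemble the example above.
sample = "personal.org:** TODO separate books for donation   :moving:"
print(to_row(sample, "2019-07-21").split("\t"))
```

Splitting the printed row on tabs makes the five columns easy to see: heading, file, stars, keyword, and date.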

Comparison

The Python version of the script uses the expressive standard library, and comes out very succinct. The higher character count is mainly down to having better readability, and performing string copies rather than in-place modification. List slicing in particular makes it nice to write---" ".join(words[2:]) looks better than the for loop that awk requires.

On the other hand, writing that for loop really isn't that much trouble. What strikes me most is that the awk version, while it reads differently than the Python, has most of the same semantics that make Python appealing to write: easy use of global variables, solid string manipulation primitives, and simple access to system resources like the current time.

So far we are still in the realm of what I always knew awk was good for---munging rows of text into other rows of text. Next we'll take a look at a task for which I definitely would have used Python in the past.

Stringly- and dictly-typed programming

Now that we have per-day snapshots of what my org headings look like, it's time to process the data into a daily report. There are two requirements for the report:

It needs to show the number of tasks in the previous day that were marked DONE from a non-DONE state, and list each one.
It should list the five most stale tasks which remain undone---that is, those tasks whose TODO status has not changed in the longest period.

These scripts are more involved, so I'll discuss them with the corresponding sections interleaved, to highlight the similarities.

Preamble

In both scripts, we precompute the current day's string representation. Awk uses a pair of built-in functions to format the date string:

BEGIN {
  now = systime();
  today = strftime("%Y-%m-%d", now);
}

In Python, we also precompute the date. We have to set up our container data structures as well: a bunch of dictionaries and one set.

import sys
import datetime

now = datetime.datetime.now()
date = now.strftime("%Y-%m-%d")

headlines = set()
first_appearance = {}
latest_appearance = {}
previous_status = {}
current_file = {}
latest_status = {}
days_in_state = {}

Per-line processing

We can do everything we want in a single pass of the input data. For each line, we need to do the following:

Make a note of the latest date on which we have seen the headline. Since the input is sorted by day, this is just the date of the current row. We'll use this to filter out TODOs that are no longer current.
Store the current status and the previous day's status.
If the current status is equal to the previous day's, increment the number of days it has been in that status.
Store the file where the TODO is found on that day.

In Python, these steps look like so:

for line in sys.stdin.readlines():
    words = line.strip().split("\t")
    h = words[0]
    headlines.add(h)
    latest_appearance[h] = words[4]

    status = words[3]
    previous_status[h] = latest_status[h] if h in latest_status else ""
    latest_status[h] = status
    if h in days_in_state and status == previous_status[h]:
        days_in_state[h] = days_in_state[h] + 1
    else:
        days_in_state[h] = 1

    current_file[h] = words[1]

The awk version is strikingly similar. The only data structure on offer is the versatile associative array, which takes the place of both dictionaries and lists. It also allows us to model a set by simply using the number 1 (or any other value) as the value for each key we want to store.

{
  h = $1;
  headlines[h] = 1;
  latest_appearance[h] = $5;

  status = $4
  previous_status[h] = latest_status[h];
  latest_status[h] = status;
  if (status == previous_status[h]) {
    days_in_state[h]++;
  } else {
    days_in_state[h] = 1;
  }

  current_file[h] = $2;
}

Even though we didn't declare any of these arrays in the preamble, they do "the right thing" for missing values. For example, in the first instance of a headline, latest_status[h] is the empty string. awk has written the Python version's if expression for us. Similarly, days_in_state[h]++ increments the missing value to 1, as desired.
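For comparison, Python can get close to these missing-value semantics with collections.defaultdict. The following is a minimal sketch rather than the script above: the function name track and the sample rows are made up for illustration.

```python
from collections import defaultdict


def track(rows):
    """Count consecutive days each headline has spent in its current status."""
    latest_status = defaultdict(str)   # missing key -> "", like an unset awk string
    days_in_state = defaultdict(int)   # missing key -> 0, like an unset awk number
    for heading, status in rows:
        if status == latest_status[heading]:
            days_in_state[heading] += 1
        else:
            days_in_state[heading] = 1
        latest_status[heading] = status
    return dict(latest_status), dict(days_in_state)


# Hypothetical snapshots: (heading, status), ordered by day.
rows = [("buy boxes", "TODO"), ("buy boxes", "TODO"), ("buy boxes", "DONE"),
        ("call mover", "NEXT"), ("call mover", "NEXT")]
status, days = track(rows)
print(status, days)
```

As in awk, merely reading a missing key creates it, so a later membership test on a defaultdict can behave differently from the same test on a plain dictionary.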

Aggregations

After looping over all the per-day records, we need to materialize a few aggregates of the data.

For each headline, if its latest status is DONE and its previous status was NEXT or TODO, it is added to the list of tasks completed today.
The incomplete headlines are sorted by the number of days they have been in their current state, and the top 5 are added to the list of most stale TODOs.

Once again, this is code that Python is perfectly suited for. In particular, the list manipulation functions filter and sorted make it very easy to express the calculation for the five most stale TODOs.

donelist = []
for h in headlines:
    if latest_appearance[h] == date:
        if previous_status[h] in ["NEXT", "TODO"] and latest_status[h] == "DONE":
            donelist.append(h)

print("{} tasks done:\n\t".format(len(donelist)), end="")
print("\n\t".join(donelist))

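# Only consider tasks that still appear in today's snapshot, as the awk version does.
todos = filter(lambda h: latest_appearance[h] == date and latest_status[h] in ["NEXT", "TODO"], headlines)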
stalest_todos = sorted(todos, key=lambda h: days_in_state[h], reverse=True)[:5]

print("Top 5 stalest tasks:")
for h in stalest_todos:
    print("\t- {}".format(h))

The awk version is somewhat more ungainly. However, it still reads broadly similarly to the Python.

END {
  delete done_yesterday;
  for (h in headlines) {
    if (latest_appearance[h] == today) {
      prev = previous_status[h];
      curr = latest_status[h];
      if ((prev == "TODO" || prev == "NEXT") && curr == "DONE") {
        done_yesterday[h] = 1;
      }
      if (curr == "TODO" || curr == "NEXT") {
        todos[h] = days_in_state[h];
      }
    }
  }
  printf("Tasks completed yesterday: %s\n", length(done_yesterday));
  for (h in done_yesterday) {
    printf("\t%s", h);
  }

  printf("\n\n");

  # asorti returns the number of elements, so we never print more entries than exist.
  n = asorti(todos, stalest, "@val_num_desc")
  printf("Top 5 most stale TODO tasks:\n")
  for (i = 1; i <= 5 && i <= n; i++) {
    h = stalest[i];
    printf("\t[[file:%s::*%s][%s]]\n", current_file[h], h, h);
  }
}

The ugliest wart here is the line delete done_yesterday. This is required for length(done_yesterday) to return 0 if we never evaluate done_yesterday[h] = 1. awk doesn't require you to declare your variables, but it also doesn't have a facility for initializing a variable as an empty array! This is a fairly serious gotcha, in my opinion.

We also see an example of sorting arrays. While terse, sorting in awk is rather unnatural compared to the Python. The asorti function (a gawk extension) sorts the indices of its first argument, the (associative) array todos. The second argument tells asorti to copy the result into stalest, initializing it. The third argument, "@val_num_desc", is a string describing how to sort the indices. asorti can take a variety of traversal order specifiers in this argument, or the name of a user-provided function. We populate todos with the value of days_in_state in the loop, so this function call sorts the headlines by their staleness, numerically, in descending order.
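For readers more familiar with Python, the effect of this call can be approximated by sorting a dictionary's keys by their values. This is a sketch with made-up data: todos here is a plain dictionary standing in for the awk array, and the order of entries with equal values may differ from gawk's.

```python
# Hypothetical data: headline -> days in its current state.
todos = {"file taxes": 12, "buy boxes": 3, "call mover": 30, "fix bike": 7}

# Sort the keys by their values, numerically, in descending order.
stalest = sorted(todos, key=todos.get, reverse=True)

# Slicing takes at most five entries, even if fewer exist.
for heading in stalest[:5]:
    print("{}\t{}".format(todos[heading], heading))
```

The slice also sidesteps the question of what to print when fewer than five tasks are open.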

Comparison and Conclusion

Some of the more obvious differences between awk and Python are the most relevant to their applicability to this sort of problem. Python is a general purpose programming language, which means that one has to set up the boilerplate to read from STDIN for every script. On the other hand, its standard library offers basic data processing functions that are substantially more ergonomic than awk's. Sorting, filtering, and joining strings are all more readable and less error-prone in Python.

Nevertheless, awk is competitive with Python when it comes to its bread and butter: munging data in and out of associative data structures. It's frankly embarrassing how many problems are most naturally solved by an awkward series of intermediate hash maps. For my money, awk beats Python when it comes to this under-appreciated and under-discussed programming paradigm. Its handling of default values is extremely nice to have. How many times have you tried to solve a problem in Python with just standard dictionaries, only to find yourself going back and adding from collections import defaultdict to the top of your file?

Ease of deployment is a small point in awk's favor as well. While dependency hell is almost never an issue for Python scripts this short, I don't think I've ever logged onto a server where awk was not available.

Finally, I have barely touched on performance at all. I hope to do a followup post comparing awk's performance with associative arrays to a variety of Python data structures. For what it's worth, the two scripts described in this post were within 50% of each other's (perfectly adequate) performance.