Paper Reading: Efficient Path Profiling

Recently I’ve been going through CS 6120 from Cornell (compilers), and one of the papers listed in the course was quite interesting, namely Efficient Path Profiling. Once in a while you see a solution so neat that it almost feels like the problem was created in order to make such a solution useful; this paper gave me that feeling. This blog post will give a high level understanding of the problem, the algorithm and some intuitions, while leaving out all the technical details.

The problem setting is that you have a control flow graph (CFG), where each node is a block of code that always executes together (no branches), and each edge is a branch/jump instruction. With huge loss of generality, we assume there will be an ENTRY node and an EXIT node, and the CFG will be a directed acyclic graph (DAG) always going from ENTRY to EXIT. This is clearly unrealistic for normal programs due to the lack of loops, but the paper provides workarounds that aren’t very interesting. The task is to record the paths taken in each execution (from ENTRY to EXIT), so that we can compute statistics about which paths are the most common and make compiler optimizations accordingly.

In other words, we’re doing something like the following. Say we give each node a unique identifier (e.g. 0, 1, 2 …). Each time the program runs, we maintain a list of these identifiers, appending to it every time we enter a node. By the end of the run we can add the resulting list to some sort of aggregating data structure.
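As a rough sketch of what that naive scheme would look like (in Python; take_branch is a hypothetical stand-in for whichever successor the execution actually follows):

    from collections import Counter

    path_counts = Counter()                 # aggregated across executions

    def record_one_run(entry, exit_node):
        path = [entry]                      # list of node identifiers for this run
        node = entry
        while node != exit_node:
            node = take_branch(node)        # hypothetical: whichever branch the run takes
            path.append(node)
        path_counts[tuple(path)] += 1       # hashing the whole list is the expensive part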

But that’s a horribly inefficient way to do it. Both appending to the list at each node and aggregating the resulting lists at the end of each execution are going to be expensive. Here, the authors propose: what if, instead of keeping a list of node identifiers, we could somehow give integer weights to the edges in the CFG such that each path from ENTRY to EXIT has a unique sum? What if those path sums are small enough numbers that you could just make an array and increment the element at the index equal to the path sum?? What if you could pick the edges that are triggered the most and set those weights to 0, so you don’t even need to do anything on the hot paths???

Compact Unique Path Sums

It turns out all of those are possible. First, it’s actually easy to weight the edges such that each ENTRY->EXIT path gives a different sum: you could just give each edge a distinct power of 2, but that leads to really large weights. You can actually do much, much better and make the sums “compact”, meaning they form the range from 0 to the number of unique paths – 1, so the path sums are as small as possible. It’s also really simple to do.

First, we define NumPaths(v) as the number of unique paths from node v to EXIT, with NumPaths(EXIT) = 1; this takes linear time to compute by walking the DAG in reverse topological order. Then, for each node, put its outgoing edges in some fixed order. For the ith edge, take edges 0 through i – 1, look up the NumPaths of their destinations, and use the sum as the weight. The intuition is that an edge whose destination is d can be extended into NumPaths(d) different paths to EXIT, and each of those paths has a unique sum between 0 and NumPaths(d) – 1 (by induction). Since the first edge already claims the sums 0 through NumPaths(its destination) – 1, the second edge has to start counting from there, and we achieve that by adding that amount to the second edge’s weight, and so on for the rest.
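Here’s a minimal sketch of that weighting step in Python, assuming the CFG comes as an adjacency list succ mapping each node to its ordered list of successors (the names here are mine, not the paper’s):

    def assign_weights(succ, exit_node):
        # NumPaths(v): number of paths from v to EXIT, memoized.
        num_paths = {}

        def count_paths(v):
            if v not in num_paths:
                if v == exit_node:
                    num_paths[v] = 1
                else:
                    num_paths[v] = sum(count_paths(w) for w in succ[v])
            return num_paths[v]

        weights = {}
        for v in succ:
            count_paths(v)
            running = 0                      # NumPaths already claimed by earlier edges
            for w in succ[v]:
                weights[(v, w)] = running
                running += count_paths(w)
            # note: the first outgoing edge of every node gets weight 0
        return weights, num_paths

    # The small CFG used as the example later in the post:
    succ = {"ENTRY": ["A", "C"], "A": ["B", "C"], "B": ["EXIT"], "C": ["EXIT"], "EXIT": []}
    weights, num_paths = assign_weights(succ, "EXIT")
    # num_paths["ENTRY"] == 3, and the three paths get the sums 0, 1 and 2.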

If NumPaths(ENTRY) (which is the total number of paths) is small enough, we can just maintain an array of that length to count how often each path is taken across many runs. Otherwise, we can still maintain a hash table keyed by path sum, at the cost of a larger overhead.
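Continuing the previous sketch, the counting side then amounts to something like this (the real thing is an increment compiled into the program, not Python):

    counts = [0] * num_paths["ENTRY"]       # one slot per possible ENTRY->EXIT path

    def profile_one_run(edges_taken):
        # edges_taken: the sequence of CFG edges this execution followed
        r = 0
        for e in edges_taken:
            r += weights[e]                 # in the real program: an add on the edge (skipped if 0)
        counts[r] += 1                      # a single increment when we reach EXIT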

Choosing Weights to Zero Out

So far it’s been very simple. Notice that for each node, one of the outgoing edges (the first one) has weight 0, and at those edges we don’t actually need to insert any instruction at all. So we might as well order the edges so that the most frequently taken outgoing edge is the one that gets 0, to minimize the overhead we’re adding to the program.

But we can actually have more flexibility than picking one outgoing edge per node to zero out. This part is a bit more involved to understand, and this paper basically says “just look at that other paper”. While that other paper has a proof, it still didn’t quite explain why it works. I think I have an intuition, which I will lay out below, but I’m Not a Computer Scientist, so it might be wrong, etc.

The way it works is this: you start off with some estimates of the relative frequency with which each edge is taken. You add an edge from EXIT to ENTRY with frequency 1 (it fires every time the program runs), and compute the maximum-weight spanning tree of the resulting graph (the weight being the sum of edge frequencies), ignoring edge directions. All edges in that spanning tree will get an updated weight of 0. Note that this is never worse than zeroing out the most frequent outgoing edge of each node as described above, because picking one outgoing edge per node (for every node except EXIT) also forms a spanning tree: n – 1 edges in total, and every node is connected to EXIT.
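A sketch of that step, assuming the frequency estimates come as a dict freq keyed by directed edge (these names are mine); it’s just Kruskal’s algorithm on the undirected graph, keeping the heaviest edges:

    def max_spanning_tree(nodes, freq):
        # freq: {(src, dst): estimated frequency}, including the added EXIT->ENTRY edge.
        parent = {v: v for v in nodes}

        def find(v):                             # union-find with path compression
            while parent[v] != v:
                parent[v] = parent[parent[v]]
                v = parent[v]
            return v

        tree = set()
        # Take the most frequent edges first, skipping any that would close an undirected cycle.
        for (u, v), f in sorted(freq.items(), key=lambda kv: -kv[1]):
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
                tree.add((u, v))
        return tree      # these edges get weight 0; everything else becomes a "chord"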

Any edge not in the spanning tree is called a “chord”. Each chord f, when added to the spanning tree, forms a cycle C(f), again ignoring direction. Now, we have the weight assignment W(e) for every edge e from the previous section’s algorithm (for the added edge, W(EXIT->ENTRY) = 0). The new weight of a chord f, W’(f), is the sum of W over C(f), where we walk the cycle in the direction of f and negate W for any edge we traverse against its own direction. For example, say the chord A -> B has C(A -> B) = A -> B <- C -> A; then W’(A -> B) = W(A -> B) – W(C -> B) + W(C -> A). The claim is that, for any given path from ENTRY to EXIT, W and W’ yield the same path sum.
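Continuing the sketch, computing W’ for each chord means walking the unique tree path between the chord’s endpoints and accumulating W with signs:

    def reweight_chords(nodes, edges, tree, weights):
        # weights: the W from the previous section, with W of the EXIT->ENTRY edge being 0.
        # Undirected adjacency over the tree edges, remembering each edge's true direction.
        adj = {v: [] for v in nodes}
        for (u, v) in tree:
            adj[u].append((v, (u, v), +1))      # traversing with the edge's direction
            adj[v].append((u, (u, v), -1))      # traversing against it

        def tree_path_sum(src, dst):
            # Walk the (unique) tree path from src to dst, summing signed weights.
            stack = [(src, 0, {src})]
            while stack:
                node, acc, seen = stack.pop()
                if node == dst:
                    return acc
                for nxt, edge, sign in adj[node]:
                    if nxt not in seen:
                        stack.append((nxt, acc + sign * weights.get(edge, 0), seen | {nxt}))
            raise ValueError("the spanning tree should connect every pair of nodes")

        new_weights = {}
        for (u, v) in edges:
            if (u, v) not in tree:
                # Cycle C(f): the chord u->v, then back from v to u through the tree.
                new_weights[(u, v)] = weights.get((u, v), 0) + tree_path_sum(v, u)
        return new_weights                      # tree edges implicitly get weight 0

The surprising part is the claim at the end of the previous paragraph: swapping W for W’ preserves every path sum.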

But why is that? Here’s the handwavy part. The intuition is that every program execution, when appended with the edge EXIT->ENTRY, becomes a loop. A directed execution loop D must contain chords, since a spanning tree by itself has no cycles, and D is really just a “sum” of the cycles C(f) for the chords f in D, with edges traversed in opposite directions cancelling out. So the sum of W over D is equal to the sum of W over all the C(f) for f in D. And the sum of W over any C(f) is, by the definition of W’(f), equal to the sum of W’ over C(f): the chord contributes W’(f) and the tree edges contribute 0.

Hence:
sum W over D
= sum W over (C(f) for chord f in D)
= sum W’ over (C(f) for chord f in D)
= sum W’ over D

Here’s a simple example, not a proof, since I don’t have one.

program       spanning tree     execution loop
  ENTRY         ENTRY             ENTRY
   / \             \               /   \
  A   |         A   |             A     |
  | \ |         | \ |             |     |
  B   C         B   C             B     |
   \ /             /               \   /
   EXIT          EXIT              EXIT

In our execution loop, we have 3 chords, ENTRY->A, B->EXIT, and EXIT -> ENTRY.

C(ENTRY->A) = ENTRY – A – C – ENTRY
C(B->EXIT) = B – EXIT – C – A – B
C(EXIT->ENTRY) = EXIT – ENTRY – C – EXIT

Joining all three together (at A and EXIT), we have:

ENTRY – A – B – EXIT – ENTRY – C – EXIT – C – A – C – ENTRY

Cancelling out opposite edges, this simplifies to:

ENTRY – A – B – EXIT – ENTRY – C – EXIT – C – A – C – ENTRY
ENTRY – A – B – EXIT – ENTRY – C – A – C – ENTRY
ENTRY – A – B – EXIT – ENTRY – C – ENTRY
ENTRY – A – B – EXIT – ENTRY

Which is exactly the execution loop.
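For what it’s worth, the numbers also work out on this example. Using the weights from the first section (with outgoing edges ordered left to right as drawn, so W(ENTRY->C) = 2, W(A->C) = 1 and every other edge 0) and the spanning tree above, the cycle rule gives W’(ENTRY->A) = –1, W’(B->EXIT) = –1 and W’(EXIT->ENTRY) = 2 (negative increments are fine), and every path keeps its sum:

    W = {("ENTRY", "A"): 0, ("ENTRY", "C"): 2, ("A", "B"): 0, ("A", "C"): 1,
         ("B", "EXIT"): 0, ("C", "EXIT"): 0, ("EXIT", "ENTRY"): 0}
    W_new = {("ENTRY", "A"): -1, ("B", "EXIT"): -1, ("EXIT", "ENTRY"): 2}   # chords only

    paths = [["ENTRY", "A", "B", "EXIT"], ["ENTRY", "A", "C", "EXIT"], ["ENTRY", "C", "EXIT"]]
    for p in paths:
        loop = list(zip(p, p[1:])) + [("EXIT", "ENTRY")]        # close the execution loop
        assert sum(W[e] for e in loop) == sum(W_new.get(e, 0) for e in loop)
        # both sides come out to 0, 1 and 2 for the three paths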

With these two steps (compute weights, then optimize where the zero-weight edges go) we can insert instructions on a small set of edges in the given program to efficiently compute the unique sum of the path taken each time the program finishes executing. In retrospect, the algorithm that assigns weights to give compact path sums seems almost obvious, but that’s more a sign that we’ve asked the right question than that the problem is really trivial.


A Variant of the Optimal Coding Problem
Prerequisites: binary prefix code trees, Huffman coding

Here’s a problem: say you are given n positive numbers, and each time you are allowed to pick two numbers a and b from the list, add them together, put the sum back into the list, and pay me an amount of dollars equal to the sum a+b. Obviously your optimal strategy would be not to play this stupid game. Here’s the twist: what if I point a gun at your head and force you to play until we end up with only one number? Then what would be your optimal strategy, and what is the minimum amount of money you have to pay me?

At first glance this seems quite easy: of course you would just pick the two smallest numbers each time and add them, because that is the best option available in each round. It minimizes your cost every round, so surely I end up with the minimum total possible. However, if you paid attention in your greedy algorithms lecture in college, you will notice the fallacy in this reasoning: picking the best option each time does not necessarily lead to the best overall outcome.
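Still, the greedy strategy itself is easy enough to write down (a sketch using Python’s heapq):

    import heapq

    def greedy_cost(nums):
        # Repeatedly merge the two smallest numbers; return the total amount paid.
        heap = list(nums)
        heapq.heapify(heap)
        total = 0
        while len(heap) > 1:
            a = heapq.heappop(heap)
            b = heapq.heappop(heap)
            total += a + b                   # pay the sum...
            heapq.heappush(heap, a + b)      # ...and put it back into the list
        return total

    # greedy_cost([1, 1, 2, 3]) == 13       # pay 2, then 4, then 7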

I claim that it is not obvious that this algorithm is right, not only because it is greedy, but also because this problem is equivalent to the classic optimal coding problem, which is not that easy. For those who haven’t heard of it, the optimal coding problem is the one where you’re given a bunch of symbols and the average frequencies for each of them, and you try to design a set of binary prefix codes for these symbols such that, under this coding scheme, an average sequence of symbols is represented by the smallest expected number of bits.

What? Yeah, this number summing and merging problem is equivalent to the optimal coding problem. I’ll spare you and myself a formal proof, but here’s an example to convince you. Say we have 3 numbers a, b and c, and we first sum a+b, then sum (a+b)+c. Your total cost will be 2a+2b+c. Now if we draw out the summation as a tree:

      a+b+c
      /   \
    a+b    c
    / \
   a   b

(pardon my ascii art) you will see that a and b sit at depth=2, and c sits at depth=1. This gives another way to compute the total cost: multiply each number by its depth in the tree, and sum these products. This is always the same as the total cost, because each number gets added to the total cost once for every merge it takes part in, and that count is exactly its depth in the tree.
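With some arbitrary numbers plugged in, the two ways of computing the cost agree:

    a, b, c = 5, 3, 2                            # arbitrary example values
    cost_by_merges = (a + b) + ((a + b) + c)     # pay for each merge as it happens
    cost_by_depths = 2 * a + 2 * b + 1 * c       # each number times its depth in the tree
    assert cost_by_merges == cost_by_depths == 18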

Now if you paid attention in your algorithms lecture on binary prefix code trees, you will see that this looks exactly like a binary prefix code tree. To complete the analogy, say we are given three symbols A, B and C, and they have frequencies a, b and c. Had we given them the same code tree as above, the total code length weighted by frequency would be 2a+2b+c. If you don’t see why, you should pause here and think until you get it (or not, you decide).

Now that we’ve established that these two problems are the same (same objective function, same solution space), it follows that the optimal algorithms for both are the same. The solution to the optimal coding problem is Huffman coding, as spoiled from the very beginning of this post, and that algorithm is exactly the greedy algorithm above that we weren’t sure about. So I just spent all this time convincing you the algorithm might not be right, and then convincing you that it is indeed correct. Yay, progress!

Variants of that Variant

(Technically, that wasn’t really a variant, because it was exactly the same problem.) After talking to a few people in real life about this, I found it hard to convince them that this innocent-looking algorithm could possibly be wrong, and it certainly didn’t help that the algorithm ended up being right. One line I tried was: “even Shannon tried this problem and came up with the suboptimal Shannon-Fano coding scheme,” but people didn’t seem to appreciate that. In fact, it is quite stunning that when we turn the optimal coding problem into the summing problem, the solution seems so much more obvious.

So I came up with a second attempt: what if we tweak the problem a bit so that greedy no longer works? In the original problem, when we merge two numbers, the cost is the sum of the two, and we insert that sum back into the list. What if the cost function or the merge function were not addition? For example, we could change the cost function to take the minimum of the pair, or the merge function to return the square of the sum, and so on. Would the greedy algorithm still work? Now people are much less sure, which kind of proves the point that they shouldn’t have been so confident in the first place.

But would it? It turns out, perhaps unsurprisingly, that in most cases it no longer works. To give a few counterexamples: if the cost function is the minimum of the pair, (1 1 1 1) has the optimal merging 1 + (1 + (1 + 1)). If the cost is the maximum instead, (1 1 1 2 3) has the optimal merging ((1 + 1) + 2) + (1 + 3). And if the cost is the square of the sum, for (1 1 2 2) the optimal way is (1 + 2) + (1 + 2). In all these cases, the greedy way gives a suboptimal total cost. It so happens that if we make both the cost and the merge function min (or both max), the greedy approach still works, but those cases are a lot more trivial than the others.
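These instances are small enough to check by brute force. Here’s a sketch that compares the greedy total against an exhaustive search over all merge orders, with the cost and merge functions passed in as parameters; it reproduces the min-cost and squared-cost counterexamples above:

    import heapq
    from itertools import combinations

    def greedy_total(nums, cost, merge):
        # Always merge the two smallest numbers, as in the original greedy strategy.
        heap = list(nums)
        heapq.heapify(heap)
        total = 0
        while len(heap) > 1:
            a, b = heapq.heappop(heap), heapq.heappop(heap)
            total += cost(a, b)
            heapq.heappush(heap, merge(a, b))
        return total

    def best_total(nums, cost, merge):
        # Exhaustively try every possible sequence of merges and keep the cheapest.
        if len(nums) == 1:
            return 0
        best = None
        for i, j in combinations(range(len(nums)), 2):
            rest = [x for k, x in enumerate(nums) if k not in (i, j)]
            total = cost(nums[i], nums[j]) + best_total(rest + [merge(nums[i], nums[j])], cost, merge)
            if best is None or total < best:
                best = total
        return best

    add = lambda a, b: a + b
    square = lambda a, b: (a + b) ** 2

    print(greedy_total([1, 1, 1, 1], min, add), best_total([1, 1, 1, 1], min, add))        # 4 vs 3
    print(greedy_total([1, 1, 2, 2], square, add), best_total([1, 1, 2, 2], square, add))  # 56 vs 54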

At this point, it seems like we really lucked out that the optimal coding problem corresponds to a case where greedy works. Otherwise, computers would either have to spend many more cycles computing optimal codes when compressing files, or we would end up with larger file sizes for all kinds of compressed files.

To end this post with a piece of trivia: did you know that the Huffman trees in a gzip file are themselves encoded using a Huffman code? Since every file has a different frequency count of symbols, each file has a different Huffman tree that needs to be stored in the compressed file as well. That tree is then encoded, again, using another, much smaller Huffman code described alongside it, so the decoder knows how to decode everything, which is pretty meta.