Theoretically Or Practically Fast: Two Types of Hash Tables

When people say a program is fast, they usually mean it takes a short time to execute; but when computer scientists say an algorithm is efficient, they mean its run time grows slowly with the input size. The two notions usually point in the same direction, in that a more efficient algorithm yields a faster program, but not always. I recently learned about two types of hash tables that show this difference: Cuckoo hashing and Robin Hood hashing. Even though Cuckoo is more efficient in theory, Robin Hood is a lot faster in practice.

Vanilla Hash Tables

Before going into details, let’s revisit how a hash table works. Simple thing, you have a hash function that takes a key and yields a random (but deterministic) number as the “hash”, and you use this hash to figure out where to put this item in the table. If you want to take something out, just compute the hash again and you can find exactly where the item is stored.

That’s not the whole story though. What if you want to insert item A into spot 1, but item B is already there? You can’t just throw away item B, what are you, a savage? These situations are called collisions, and there are multiple ways to deal with them. The easiest way is chaining, which is to make each slot into a list of items, so inserting is just appending to the corresponding list. Sure, that would work, but now your code is a lot slower: each operation takes one lookup to reach the list, and another lookup to reach the list’s contents. This is really bad because the table and the lists will not sit nicely together in cache, and cache misses are BAD BAD things.
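
To make that concrete, here’s a minimal sketch of a chaining hash table in Python. The class name ChainedTable and the details are mine, not from any particular library:

```python
# A minimal chaining hash table: each slot holds a list of (key, value) pairs.
# Illustrative sketch only, not a production implementation.

class ChainedTable:
    def __init__(self, capacity=8):
        self.slots = [[] for _ in range(capacity)]

    def _index(self, key):
        # hash() is deterministic within one run; modulo maps it to a slot.
        return hash(key) % len(self.slots)

    def put(self, key, value):
        bucket = self.slots[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # collision or not, just append

    def get(self, key):
        for k, v in self.slots[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)

t = ChainedTable()
t.put("apple", 1)
t.put("banana", 2)
print(t.get("banana"))  # 2
```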

Linear Probing

To improve cache performance, people started to throw out bad ideas. One bad idea that kind of worked was: if this slot is occupied, just take the next one! If the next one is taken, take the next next one! This is called linear probing, and it is pretty commonly used. It solves the caching problem because we’re only accessing nearby memory slots, so our programs run a lot faster. So why did I call it a bad idea? Well, it turns out that as the table fills up, the probe count can get really high. Probe count is how many spots need to be checked before an item is found, or equivalently, how far away the item is from where it should be. Imagine that out of 5 spots, the first 3 are occupied. Then the next insertion has a 60% chance of collision, and if it does collide, the probe count is on average 2 (probe count = 3 if the hashed index is the first slot, 2 if the second, and 1 if the third). And after a collision, the occupied segment grows by one, making the next insertion even worse.
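
Here’s what linear probing looks like as a sketch, assuming the table never gets completely full (a real implementation would resize long before that); the class name is mine:

```python
# Linear probing: on a collision, scan forward one slot at a time.
# Sketch only: no resizing, so it assumes the table is never full.

class LinearProbingTable:
    def __init__(self, capacity=8):
        self.keys = [None] * capacity
        self.values = [None] * capacity

    def _home(self, key):
        return hash(key) % len(self.keys)

    def put(self, key, value):
        i = self._home(key)
        while self.keys[i] is not None and self.keys[i] != key:
            i = (i + 1) % len(self.keys)   # occupied by someone else: try the next slot
        self.keys[i], self.values[i] = key, value

    def get(self, key):
        i = self._home(key)
        while self.keys[i] is not None:
            if self.keys[i] == key:
                return self.values[i]
            i = (i + 1) % len(self.keys)
        raise KeyError(key)                # hit an empty slot: the key isn't here
```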

Robin Hood to the Rescue

This brings us to Robin Hood hashing. It is actually very similar to linear probing, but with one simple trick. Say in the example above we have item A sitting at slot 1, B at 2, and C at 3, like this: [A B C _ _], and another item D also hashes to slot 1. Instead of placing D linear-probing style as [A B C D _], we arrange the items as [A D B C _], so the probe counts become [0 1 1 1 _] instead of [0 0 0 3 _]. Now there are two things to understand: (1) what exactly happened, and (2) why is this better?

To answer (1), imagine we’re inserting D into slot 1. We see that A is there, and just like in linear probing, we move on. Now B is there with probe count 0, while D already has probe count 1. That’s not fair! How dare you have a lower probe count than I do! So D kicks out B, and now B has to find a place to live. B looks at C, yells at it, and kicks it out, and now C has to find a place, which happens to be slot 4. In other words, during probing, if the current item has a lower probe count than the item being inserted, it is swapped out and probing continues.
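
In code, the Robin Hood insertion rule is a small tweak on the linear probing insert above. This is a sketch with my own names, no resizing and no deletion:

```python
# Robin Hood insertion: while probing, if the resident entry is closer to its
# home slot than we currently are, swap with it and keep probing with the
# displaced entry instead. Sketch only: assumes the table never fills up.

class RobinHoodTable:
    def __init__(self, capacity=8):
        self.entries = [None] * capacity   # each entry is a (key, value) pair

    def _home(self, key):
        return hash(key) % len(self.entries)

    def _probe_count(self, key, slot):
        # How far `slot` is from the key's home slot, wrapping around.
        return (slot - self._home(key)) % len(self.entries)

    def put(self, key, value):
        slot = self._home(key)
        probe = 0
        while True:
            entry = self.entries[slot]
            if entry is None or entry[0] == key:
                self.entries[slot] = (key, value)
                return
            # "That's not fair": the resident is closer to home than we are,
            # so it gets evicted and we continue the insertion with it.
            if self._probe_count(entry[0], slot) < probe:
                self.entries[slot], (key, value) = (key, value), entry
                probe = self._probe_count(key, slot)
            slot = (slot + 1) % len(self.entries)
            probe += 1
```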

But (2) why is this better? If you think about it, the total probe count (which is 3) is unchanged, so the average probe count is the same as with plain linear probing, and it seems like the performance should be exactly the same. The reason this approach is superior is that the highest probe count is much lower: in this case it dropped from 3 to 1. If we can keep the max probe count small, then all probing can be done with almost no cache misses, which makes Robin Hood hash tables really fast. Having a low maximum probe count also solves a really painful problem in linear probing: removing an item leaves a gap, and when we later look for items, we can’t just declare one missing the first time we probe an empty spot, because maybe our item is stored further ahead and this spot was only emptied later on. With the maximum probe count really low in Robin Hood hashing, we can instead say “I’m just going to look at these n spots; if you can’t find it, it ain’t there.” Which is nice and simple. (It also enables shift-back deletion instead of tombstones, which is another boost to performance, but I don’t want to get too deep.)
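
The bounded lookup then looks something like this, a sketch continuing the RobinHoodTable above and assuming we know the table’s current maximum probe count (a full implementation would track it during insertion):

```python
# Lookup with a probe bound, for the RobinHoodTable sketch above.
# max_probe is assumed to be the largest probe count of any entry in the
# table; we examine at most max_probe + 1 slots before giving up.

def robin_hood_get(table, key, max_probe):
    slot = table._home(key)
    for _ in range(max_probe + 1):
        entry = table.entries[slot]
        if entry is not None and entry[0] == key:
            return entry[1]
        slot = (slot + 1) % len(table.entries)
    # Looked at every slot the key could possibly be in; it ain't there.
    raise KeyError(key)
```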

A Theoretically Better Solution

Even though the highest probe count is small, it still grows as we put more items in the hash table (the probe count goes up as the load factor goes up). What if I told you I don’t want it to grow at all? Meet Cuckoo hashing, where every lookup touches at most two spots, no matter how full the table gets.

The algorithm is slightly more complicated. Instead of one table, we have two tables, and one hash function each, chosen randomly among the space of all hash functions (not practical, but close enough). When we look up an item, we just need to compute the hash for the first and the second table, and look at those two spots. Done!

I mean, sure, that part is easy, but how do you insert? What if both spots are full? That’s where the name cuckoo comes from. You pick one of the two spots, kick that guy out, and put your item there. Then, with the newly evicted item, you go to its other available spot and kick that guy out, and so on and so forth. If you end up in a loop (or spend too long and decide to give up), you pick two new hash functions and rebuild the hash table. As long as at least half of the slots are empty, an insertion takes constant time in expectation. I don’t have any simple intuition about why this is true; if you do, please let me know.
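
Here’s a sketch of the whole thing. The class name is mine, the salted use of Python’s hash() is nowhere near a truly random hash function, and doubling the capacity on a rebuild is my own simplification, but it shows the mechanics of the two tables, the kicking, and the rebuild:

```python
import random

# Cuckoo hashing sketch: two tables, one hash function each. Lookups touch at
# most two slots; inserts may kick entries back and forth, and rebuild the
# whole table with new salts if they take too long.

class CuckooTable:
    def __init__(self, capacity=16):
        self.capacity = capacity
        self.tables = [[None] * capacity, [None] * capacity]
        self.salts = [random.random(), random.random()]

    def _slot(self, which, key):
        # Not a truly random hash function, just hash() mixed with a salt.
        return hash((self.salts[which], key)) % self.capacity

    def get(self, key):
        for which in (0, 1):
            entry = self.tables[which][self._slot(which, key)]
            if entry is not None and entry[0] == key:
                return entry[1]
        raise KeyError(key)              # checked both possible spots: not here

    def put(self, key, value, max_kicks=32):
        # Update in place if the key is already in one of its two spots.
        for which in (0, 1):
            slot = self._slot(which, key)
            entry = self.tables[which][slot]
            if entry is not None and entry[0] == key:
                self.tables[which][slot] = (key, value)
                return
        # If either of the two spots is free, just take it.
        for which in (0, 1):
            slot = self._slot(which, key)
            if self.tables[which][slot] is None:
                self.tables[which][slot] = (key, value)
                return
        # Both spots are full: kick entries around cuckoo-style.
        which = 0
        for _ in range(max_kicks):
            slot = self._slot(which, key)
            evicted = self.tables[which][slot]
            self.tables[which][slot] = (key, value)
            if evicted is None:
                return
            key, value = evicted         # the evicted entry needs a new home...
            which = 1 - which            # ...namely its spot in the other table
        self._rehash(key, value)         # too many kicks: new salts, rebuild

    def _rehash(self, pending_key, pending_value):
        old = [e for table in self.tables for e in table if e is not None]
        self.__init__(self.capacity * 2)  # new salts (and more room, to be safe)
        for k, v in old:
            self.put(k, v)
        self.put(pending_key, pending_value)
```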

Theoretically, Cuckoo hashing beats Robin Hood because its worst-case lookup time is constant, unlike Robin Hood’s. So why is it still slower than Robin Hood in practice? As it turns out, that constant is both a blessing and a curse. Looking up a missing item always takes two lookups, and even looking up an existing item takes 1.5 lookups on average. To make matters worse, since the two spots are spatially uncorrelated, they usually cause two cache misses. Once again, cache performance makes a big difference in the overall speed.

Having said all that, performance in real life is a very complicated matter. Cache performance is just one of the unpredictable components. There are also various compiler optimizations, chip architectures, threading/synchronization issues and language designs that can affect performance, even given the same algorithm. Writing a fast program is all about profiling, fine tuning, and finding the balance*.

*My coworker once said, “at the end of almost any meeting, you can say ‘it’s all about the balance’, and everyone would agree with you.”

Variants of the Optimal Coding Problem

A Variant of the Optimal Coding Problem

Prerequisites: binary prefix code trees, Huffman coding

Here’s a problem: say we are given n positive numbers, and each time you are allowed to pick two numbers (a and b) that are in the list, add them together, put the sum back into the list, and pay me an amount of dollars equal to the sum (a+b). Obviously your optimal strategy would be not to play this stupid game. Here’s the twist: what if I point a gun at your head and force you to play until we end up with only one number? What would your optimal strategy be then, and what is the minimum amount of money you would have to pay me?

At first glance this seems quite easy: of course you would just pick the smallest two numbers each time and add them, because that is the best option you have in each round. It minimizes your cost every round, so I should end up with the minimum amount possible. However, if you paid attention in your greedy algorithm lecture in college, you will notice the fallacy in this statement: picking the best option each time does not necessarily lead to the best overall outcome.
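
Just so we’re all staring at the same thing, here’s the greedy strategy in question as a few lines of Python (a sketch; the function name greedy_merge_cost is mine), using a min-heap to repeatedly grab the two smallest numbers:

```python
import heapq

def greedy_merge_cost(numbers):
    """Repeatedly merge the two smallest numbers; return the total cost paid."""
    heap = list(numbers)
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        total += a + b                # pay the sum...
        heapq.heappush(heap, a + b)   # ...and put the sum back into the list
    return total

print(greedy_merge_cost([1, 1, 2, 3]))  # merges cost 2, 4, 7 -> total 13
```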

I claim that it is not obvious that this algorithm is right, not only because it is greedy, but also because this problem is equivalent to the classic optimal coding problem, which is not that easy. For those who haven’t heard of it, the optimal coding problem is this: you are given a bunch of symbols and the average frequency of each, and you want to design a set of binary prefix codes for the symbols so that, under this coding scheme, an average sequence of symbols is represented in the fewest expected bits.

What? Yeah, this number summing and merging problem is equivalent to the optimal coding problem. I’ll spare you and myself a formal proof, but here’s an example to convince you. Say we have 3 numbers a, b and c, and we first sum a+b, then sum (a+b)+c. Your total cost will be 2a+2b+c. Now if we draw out the summation as a tree:

   a+b+c
    / \
  a+b  c
  / \
 a   b

(pardon my ASCII art) you will see that a and b sit at depth 2, and c sits at depth 1. This gives another way to compute the total cost: multiply each number by its depth in the tree, and add up the products. The two ways always agree, because each number gets added to the total cost once for every merge it takes part in, and that is exactly its depth in the tree.

Now, if you paid attention in your algorithms lecture on binary prefix code trees, you will see that this looks exactly like a binary prefix code tree. To complete the analogy, say we are given three symbols A, B and C with frequencies a, b and c. Had we given them the same code tree as above, the average code length weighted by frequency would be 2a+2b+c. If you don’t see why, you should pause here and think until you get it (or not, you decide).
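
To make this concrete with made-up numbers, say a=5, b=3 and c=2, and give A, B, C the codes 00, 01 and 1 from the tree above. A quick sanity check that the merge cost, the depth-weighted sum, and the expected code length all come out to the same number:

```python
# Toy check that "total merge cost" equals "sum of value x tree depth",
# using made-up frequencies a=5, b=3, c=2 and the tree from the ASCII art:
# A and B sit at depth 2 (codes "00" and "01"), C at depth 1 (code "1").

a, b, c = 5, 3, 2

merge_cost = (a + b) + ((a + b) + c)        # pay for a+b, then for (a+b)+c
depth_cost = 2 * a + 2 * b + 1 * c          # each value times its depth
code_bits  = len("00") * a + len("01") * b + len("1") * c  # expected bits

print(merge_cost, depth_cost, code_bits)    # 18 18 18
```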

Now that we’ve established that these two problems are the same (same objective function, same solution space), it follows that the optimal algorithm for both is the same. The solution to the optimal coding problem is Huffman coding, as spoiled at the very beginning of this post, and that algorithm is exactly the greedy algorithm above that we weren’t sure was right. So I just spent all this time convincing you that the algorithm might not be right, and then convincing you it is indeed correct. Yay, progress!
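
Since Huffman coding is exactly this greedy merging plus some bookkeeping to remember the tree, here’s a sketch that produces actual codes (the function name huffman_codes is mine, not from any library):

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """freqs: dict of symbol -> frequency. Returns dict of symbol -> bit string."""
    # Heap of (frequency, tiebreaker, tree); a tree is a symbol or a (left, right) pair.
    tie = count()
    heap = [(f, next(tie), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        fa, _, ta = heapq.heappop(heap)   # the two smallest, just like the merging game
        fb, _, tb = heapq.heappop(heap)
        heapq.heappush(heap, (fa + fb, next(tie), (ta, tb)))
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"   # degenerate single-symbol case
    _, _, root = heap[0]
    walk(root, "")
    return codes

print(huffman_codes({"A": 5, "B": 3, "C": 2}))  # {'A': '0', 'C': '10', 'B': '11'}
```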

Variants of that Variant

(Technically, that wasn’t really a variant because that was exactly the same problem.) After talking to a few people in real life about this, I found it hard to convince people that this innocent looking algorithm could possibly be wrong, and it certainly didn’t help that the algorithm ended up being right. One line I tried was: “even Shannon tried this problem and came up with a suboptimal Shannon-Fano coding scheme,” but people didn’t seem to appreciate that. In fact, it is quite stunning that when we turn the optimal coding problem into the summing problem, the solution seems so much more obvious.

So I came up with a second attempt: what if we tweak the problem a bit so that greedy wouldn’t work? In the original problem, when we merge two numbers, it costs us the sum of the two, and we reinsert the sum to the list of numbers. What if the cost function or the merge function is not an addition? For example, we can imagine changing the cost function to take minimum, or the merge function to return the square of the sum, etc. Then, would the greedy algorithm still work? Now people are much less sure – which kind of proves the point that they shouldn’t have been so confident in the first place.

But would it? It turns out, perhaps unsurprisingly, that in most cases it won’t work anymore. To give a few counterexamples: (1 1 1 1) has an optimal merging of 1 + (1 + (1 + 1)) if the cost function is the minimum of the pair; (1 1 1 2 3) has an optimal merging of ((1 + 1) + 2) + (1 + 3) if we take the maximum instead; and if the cost is the square of the sum, then for (1 1 2 2) the optimal way is (1 + 2) + (1 + 2). In all of these cases, the greedy strategy gives a sub-optimal total cost. It happens that if both the cost function and the merge function are min (or both are max), the greedy approach still works, but those cases are a lot more trivial than the others.
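
If you’d rather not take the counterexamples on faith, here’s a tiny brute-force checker you can play with (function names are mine); it tries every possible merge order on a small list and compares the best total against the greedy strategy:

```python
import heapq
from itertools import combinations

def greedy_total(nums, cost, merge):
    """Always merge the two smallest numbers; return the total cost."""
    heap = list(nums)
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        a, b = heapq.heappop(heap), heapq.heappop(heap)
        total += cost(a, b)
        heapq.heappush(heap, merge(a, b))
    return total

def best_total(nums, cost, merge):
    """Brute force over every possible sequence of merges (fine for tiny inputs)."""
    if len(nums) == 1:
        return 0
    best = float("inf")
    for i, j in combinations(range(len(nums)), 2):
        rest = [x for k, x in enumerate(nums) if k not in (i, j)]
        best = min(best, cost(nums[i], nums[j]) +
                         best_total(rest + [merge(nums[i], nums[j])], cost, merge))
    return best

add = lambda a, b: a + b

# Cost = min of the pair: greedy pays 4, the chain 1+(1+(1+1)) pays 3.
print(greedy_total([1, 1, 1, 1], min, add), best_total([1, 1, 1, 1], min, add))          # 4 3

# Cost = max of the pair: greedy pays 11, ((1+1)+2)+(1+3) pays 10.
print(greedy_total([1, 1, 1, 2, 3], max, add), best_total([1, 1, 1, 2, 3], max, add))    # 11 10

# Cost = square of the sum: greedy pays 56, (1+2)+(1+2) pays 54.
sq = lambda a, b: (a + b) ** 2
print(greedy_total([1, 1, 2, 2], sq, add), best_total([1, 1, 2, 2], sq, add))            # 56 54
```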

At this point, it seems like we really lucked out that the optimal coding problem corresponds to a case where greedy works. Otherwise, computers would either spend many more cycles computing optimal codes when compressing files, or we would have ended up with larger compressed files of all kinds.

To end this post with a piece of trivia: did you know that the Huffman tree in a gzip file is itself encoded using Huffman coding? Since every file has a different frequency count of symbols, each file has a different Huffman tree that needs to be stored in the compressed file as well. That tree description is, in turn, compressed with yet another Huffman code, so the decoder can decode the tree before decoding the data, which is pretty meta.