
Combing through old papers on concurrent data structures, I was delighted to see the bit-reversal subroutine used in two of the designs in ingenious ways. In this post I will try to capture their essence.

Bit reversal? Concurrent data structures?

Bit reversal is where we take an integer (say, 64 bits) and reverse the order of its bits, e.g. reversing 8 bits may look like 00110101 -> 10101100. x86 has no instruction for it (though ARM has rbit), but it can be implemented relatively efficiently in a few ways, e.g. by masking + shifting, by combining bswap (the x86 byte-swap instruction) with reversing the bits within each byte, by lookup tables, etc.
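
For illustration, here's a minimal sketch of the mask-and-shift approach for 32 bits (the 64-bit version just adds one more swap level):

    #include <cstdint>

    // Mask-and-shift bit reversal: swap adjacent bits, then adjacent
    // pairs, then nibbles, then bytes, and finally the two halves.
    uint32_t reverse_bits(uint32_t x) {
        x = ((x & 0x55555555u) << 1) | ((x >> 1) & 0x55555555u);
        x = ((x & 0x33333333u) << 2) | ((x >> 2) & 0x33333333u);
        x = ((x & 0x0F0F0F0Fu) << 4) | ((x >> 4) & 0x0F0F0F0Fu);
        x = ((x & 0x00FF00FFu) << 8) | ((x >> 8) & 0x00FF00FFu);
        return (x << 16) | (x >> 16);
    }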

Concurrent data structures are data structures that are safe to use under shared-memory parallelism. For performance, modern computers don't guarantee strong memory consistency by default, and threads that access the same memory locations need to coordinate explicitly, e.g. via mutexes and atomic operations.
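
As a tiny example of the latter (this snippet is mine, not from either paper), two threads bumping a shared counter need an atomic to get a well-defined result:

    #include <atomic>
    #include <cstdio>
    #include <thread>

    int main() {
        std::atomic<int> counter{0};  // a plain int here would be a data race
        auto bump = [&] { for (int i = 0; i < 1000; ++i) counter.fetch_add(1); };
        std::thread t1(bump), t2(bump);
        t1.join();
        t2.join();
        std::printf("%d\n", counter.load());  // always prints 2000
    }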

Concurrent priority queue by Hunt et al. (1994)

This section covers the paper An Efficient Algorithm for Concurrent Priority Queue Heaps. The problem is: let's say we want to have a heap (supports inserting an element with a priority and removing the element with the highest priority) and let a bunch of threads access it. The most naive thing to do is probably an array-based binary heap put behind a single lock. The ith node has children 2i+1 and 2i+2, a parent has higher priority than its children, inserts happen at the bottom and bubble up toward the root, removals swap the last element to the root and bubble it down, etc. Any thread that attempts to read/write needs to take the lock, and only one thread can proceed at a time.
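
Here's a minimal sketch of that baseline, assuming int priorities (std::priority_queue is exactly such an array-based binary heap):

    #include <mutex>
    #include <queue>

    // Coarse-grained baseline: one mutex guards the whole heap, so all
    // inserts and removals from all threads serialize behind it.
    class LockedHeap {
        std::mutex m;
        std::priority_queue<int> heap;  // max-heap: top() is the highest priority
    public:
        void insert(int priority) {
            std::lock_guard<std::mutex> lock(m);
            heap.push(priority);  // appended at the bottom, bubbled up internally
        }
        bool remove_max(int& out) {
            std::lock_guard<std::mutex> lock(m);
            if (heap.empty()) return false;
            out = heap.top();  // last element is swapped to the root, bubbled down
            heap.pop();
            return true;
        }
    };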

What if we put a lock at each node? Before you swap two nodes in a bubble-up process, you’d only need to lock those two nodes. That way, you wouldn’t need to lock the entire data structure, giving other threads a chance to proceed.
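
A sketch of one such swap step, just to show the flavor (this is not the paper's actual protocol, which additionally tags nodes so concurrent operations can detect elements still in transit):

    #include <cstddef>
    #include <mutex>
    #include <utility>
    #include <vector>

    // Hypothetical node layout: each slot carries its own lock.
    struct Node {
        std::mutex m;
        int priority;
    };

    // One bubble-up step on a 0-indexed max-heap `h`: lock only the child
    // and its parent, swap them if out of order. Returns true if the
    // caller should continue bubbling up from the parent's position.
    bool bubble_up_step(std::vector<Node>& h, std::size_t i) {
        if (i == 0) return false;
        std::size_t parent = (i - 1) / 2;
        std::scoped_lock lock(h[parent].m, h[i].m);  // locks both, deadlock-free
        if (h[parent].priority < h[i].priority) {
            std::swap(h[parent].priority, h[i].priority);
            return true;
        }
        return false;
    }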

Unfortunately this results in high contention. If two threads are inserting at the same time, one will be at node i, and the other will be at node i+1. Chances are they have the same parent, so they immediately start contending for the same lock. Even if they have different parents, their parents’ parents are probably the same, and so on.

This paper’s main insight is: what if after inserting at node i, we don’t insert at node i+1, but somewhere far away in the tree? Imagine the following binary tree:

    (0)
   /   \
 (1)   (2)
 / \   / \
3   4 5   6

Only nodes 0-2 are occupied. Inserting in the order (3, 4, 5, 6) will likely cause high contention. But if we insert in the order (3, 5, 4, 6), we avoid the most common lock contention, since consecutive inserts never share a parent.

More generally, when a low bit of the heap size changes, we want to pick a different sub-tree to insert into; a high bit only changes after we have inserted a lot of nodes, by which point it's fine to return to a nearby sub-tree.

What this ends up looking like is: the high bits of the slot at which to insert depend on the low bits of the heap size. We can implement that quite literally: switch to 1-based node indices (the root is node 1; node i has children 2i and 2i+1), and insert the nth element at the slot obtained by reversing the bits of n after its leading 1 bit. The insert order then looks like:

1; 10, 11; 100, 110, 101, 111; 1000, 1100, 1010, 1110, 1001, 1101, 1011, 1111; …
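
In code, the mapping from heap size to insertion slot could look like this sketch (C++20 for std::bit_width; the function name is mine):

    #include <bit>      // std::bit_width (C++20)
    #include <cstdint>

    // Slot for the next insert, in 1-based node indices (requires size >= 1):
    // keep the leading 1 bit of the heap size, reverse the bits below it.
    uint32_t insert_slot(uint32_t size) {
        int w = std::bit_width(size) - 1;         // position of the leading 1
        uint32_t below = size & ((1u << w) - 1);  // the bits after the leading 1
        uint32_t rev = 0;
        for (int i = 0; i < w; ++i)
            rev |= ((below >> i) & 1u) << (w - 1 - i);
        return (1u << w) | rev;  // e.g. sizes 4,5,6,7 map to slots 4,6,5,7
    }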

This idea is simple but clever. Unfortunately the design didn't quite stand the test of time: most importantly, there is still global contention on the heap size counter, and every removal contends at the root. Skip-list-based designs are now considered a better choice for concurrent priority queues.

Split-ordered lists by Shalev and Shavit (2006)

This section covers the paper Split-Ordered Lists: Lock-Free Extensible Hash Tables.

Compared to priority queues, hash tables are undoubtedly more ubiquitous. Generally they come in open and closed flavors: open hashing means each bucket holds a resizable container (e.g. a linked list), while closed hashing means all data lives in the array itself and collisions are handled by probing.

A significant challenge in designing a concurrent hash table is the resizing operation. Typically, resizes are done by doubling the size of the array and moving all old data over to the new one. During this migration, no readers/writers can proceed.

This paper asks: what if we don't move the data between buckets, but instead make the buckets point to the data in the right way? If we're only creating a new index over the same underlying data, the old index remains valid and can be used in the meantime.

More concretely, we want to put all the items in a linked list, so that all the hashes that are equal mod 2^k (for any k) are consecutive in the list. This way, we can maintain a hash table by having an array where the ith bucket points to the first node where hash mod 2^k = i.

It turns out you get this property by ordering the nodes by the bitwise reverse of the hash. Thinking about it more, this isn't that surprising: first, you want all the nodes whose hash ends in a 0 bit to come before the rest; then, within each of those halves, you want the nodes with 0 as the second-to-last bit in front; and so on.

As a quick example, say we have 8 items with hashes 0-7. Sorted by this order, we have:

000 100 010 110 001 101 011 111
^               ^               | 2 buckets (hash mod 2)
^       ^       ^       ^       | 4 buckets (hash mod 4)
^   ^   ^   ^   ^   ^   ^   ^   | 8 buckets (hash mod 8)

Notice how, across different numbers of buckets, each bucket's chain remains valid, even though the underlying linked list never changed.
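
A small sketch (mine, not the paper's) that reproduces the ordering above by sorting 3-bit hashes by their bit reversal:

    #include <algorithm>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Reverse the low `bits` bits of x.
    uint32_t reverse_low_bits(uint32_t x, int bits) {
        uint32_t rev = 0;
        for (int i = 0; i < bits; ++i)
            rev |= ((x >> i) & 1u) << (bits - 1 - i);
        return rev;
    }

    int main() {
        std::vector<uint32_t> hashes = {0, 1, 2, 3, 4, 5, 6, 7};
        // Split order: sort by the bit-reversed hash.
        std::sort(hashes.begin(), hashes.end(), [](uint32_t a, uint32_t b) {
            return reverse_low_bits(a, 3) < reverse_low_bits(b, 3);
        });
        // Prints: 000 100 010 110 001 101 011 111
        for (uint32_t h : hashes)
            std::printf("%u%u%u ", (h >> 2) & 1u, (h >> 1) & 1u, h & 1u);
        std::printf("\n");
    }

(The paper additionally inserts dummy "sentinel" nodes at the head of each bucket's chain, so a bucket pointer stays valid even if its first real item is deleted.)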

From what I can tell, this design is still not obsolete in 2025, even as new improvements have appeared and computers have evolved.