Unexpected Chinese Remainder Theorem
Last week, we had an incident at work. Both the bug itself and the debugging process were mildly interesting; I’ll describe them briefly below and discuss some lessons.
The Setup and the Incident
We have a system that basically subscribes to a whole bunch of data, does a bunch of computations on it, and publishes output in real time. The computations are split into tasks identified by unique names. For both CPU & memory reasons, the system spawns up to a few dozen workers (Linux processes) across a bunch of computers, and assigns each task to one worker by the hash of its name, modulo the number of workers.
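As a rough sketch of the assignment scheme (the names and the hash function here are my own stand-ins, not the real implementation; the key property is that the hash is stable across processes):

```python
import hashlib

def worker_for_task(task_name: str, num_workers: int) -> int:
    """Map a task name to a worker index: stable hash of the name, modulo the worker count."""
    # A stable hash (unlike Python's built-in hash(), which is salted per process)
    # ensures every process agrees on which worker owns which task.
    digest = hashlib.sha256(task_name.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_workers

# e.g. with 50 workers, each task name deterministically lands on one worker
print(worker_for_task("compute_foo_bar", 50))
```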
For redundancy and latency, we run a few replicas of the whole thing across multiple data centers, all doing roughly the same computations.
One day, during a routine system upgrade, all replicas crashed one after another. This got us into panic mode.
The Debugging Process
To be clear, this is bad - this is exactly what we tried to prevent by running replicas and staggering their upgrade schedules.
To find the root cause, the first step, as always, was to inspect the logs. There was a single error message saying which worker crashed first, along with the exception that crashed it. The exception suggested that one of the values coming from a data subscription was too large and caused a buffer overflow.
OK, that’s something, but we have hundreds of thousands of data subscriptions, so we needed more cleverness to narrow down the problematic data.
Well, we know from the logs that the worker number is X out of N total workers. If we get a list of all data subscriptions and their task names (a data subscription is also considered a task), then we can compute the hash of each name and see which ones are running on worker X. This narrows it down by a factor of N, which is on the order of 50. Which is not nearly enough.
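In code, that narrowing step might look roughly like this (reusing the hypothetical worker_for_task sketch from above; the task names and worker index are placeholders):

```python
def tasks_on_worker(task_names, worker_index, num_workers):
    """Return the task names that hash to the given worker."""
    return [name for name in task_names
            if worker_for_task(name, num_workers) == worker_index]

# Placeholder inputs: in reality there are hundreds of thousands of task names.
all_task_names = ["subscription_a", "subscription_b", "subscription_c"]
suspects = tasks_on_worker(all_task_names, worker_index=7, num_workers=50)
```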
But it happens that, for whatever reason, some replicas run with different numbers of workers. That means we can gather a bunch of (Xi, Ni) pairs and narrow down the set of suspected tasks further.
If you have a few equations of the form A mod Ni = Xi, the Chinese Remainder Theorem lets you merge them into a single equation A mod N* = X*, where N* is the LCM of all the Ni. The larger we can make N*, the fewer tasks will satisfy the equation, and the better chance we have of pinning down exactly the one task we’re looking for.
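Here’s a minimal sketch of that merging step, using the generalized CRT (which handles non-coprime moduli via the LCM, as long as the congruences are consistent); the (X, N) pairs below are made up for illustration:

```python
from math import gcd

def merge_congruences(x1, n1, x2, n2):
    """Merge A mod n1 == x1 and A mod n2 == x2 into A mod lcm(n1, n2) == x."""
    g = gcd(n1, n2)
    if (x2 - x1) % g != 0:
        raise ValueError("inconsistent congruences")
    lcm = n1 // g * n2
    # Solve n1 * t == (x2 - x1) (mod n2) by dividing everything through by g.
    t = ((x2 - x1) // g * pow(n1 // g, -1, n2 // g)) % (n2 // g)
    return (x1 + n1 * t) % lcm, lcm

# Hypothetical (X, N) pairs gathered from replicas with different worker counts.
pairs = [(21, 50), (33, 48), (9, 36), (45, 54)]
x_star, n_star = pairs[0]
for x, n in pairs[1:]:
    x_star, n_star = merge_congruences(x_star, n_star, x, n)
print(x_star, n_star)  # -> 5121 10800: suspects must satisfy hash % 10800 == 5121
```

With these made-up numbers, only about one in 10800 task names would survive the filter; the real numbers gave us a smaller but still very useful reduction.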
So we gathered 4 pairs of X and N, and ended up shrinking the number of suspected tasks by a factor of a few thousand, leaving us with only a few dozen options. Poring over the task names one by one, we finally found the one subscription that caused the crashes.
The Bug
The bug itself is fairly simple, but the mechanism by which it crashed all replicas is a bit subtle.
There was a recent code change that altered the behavior when the system gets erroneous values from data subscriptions. Previously, when a worker got an error, it just passed the error value on to downstream computations. The change was to append metadata to these errors to help track down where they came from.
This change seems innocent enough, but the issue manifests when the system is configured to subscribe to data published by itself. Let’s say such a self-loop exists in a task. When this task first computes, it subscribes to data that hasn’t been published yet, which results in an error. In the old code, this error would be fed back into the task, but the output would not change. In the new code, however, each round of feedback produces a bigger error value due to the additional metadata, eventually overflowing buffers and crashing the process, as well as any clients consuming the value.
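To make the mechanism concrete, here’s a toy model of the feedback loop (all names, sizes, and the metadata format are made up; the real system is obviously not a ten-line loop):

```python
MAX_VALUE_SIZE = 4 * 1024  # stand-in for some downstream buffer limit

def compute(task_input: str) -> str:
    """Toy stand-in for a task whose published output is also its own subscription."""
    if task_input.startswith("ERROR"):
        # New behavior: annotate the error with provenance metadata before passing it on.
        return task_input + " [via task 'self_loop_task' @ replica-1]"
    return "normal output"

value = "ERROR: subscribed data not yet published"
for round_number in range(1000):
    value = compute(value)           # output is published...
    if len(value) > MAX_VALUE_SIZE:  # ...and eventually no longer fits in consumers' buffers
        print(f"error value exceeded the buffer limit after {round_number} feedback rounds")
        break
# With the old behavior (pass the error through unchanged), the loop reaches a
# fixed point immediately and the value never grows.
```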
This also explains why all replicas crashed even though the new executable was rolled out to only a subset of them. When subscribing to data, you have to specify some sort of URL, and this URL points to one of the replicas. In other words, only one replica has the self-loop, and the other replicas are consumers of the loop’s output. So when the loop was completed during the rollout, all replicas crashed upon receiving the oversized value, regardless of whether they contained the code change.
Reflections
This incident didn’t end up causing too much headache because it was fixable with a rollback. Still, it’s a good exercise to think it through and extract as many lessons as possible.
Pinning down the issue this time required some amount of luck. In particular, we had replicas running with different numbers of workers. This did not have to be the case, and it even seems undesirable to run different environments. One change we could make here is to the scheme that hashes task names to workers: we could hash the task name and the replica ID together (i.e. use the ID to salt the hash). This way, we wouldn’t have to rely on the worker counts being coprime with each other to narrow down tasks using worker IDs.
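A sketch of what that salted assignment could look like (again with a hypothetical stable hash; the replica IDs are made up):

```python
import hashlib

def worker_for_task_salted(task_name: str, replica_id: str, num_workers: int) -> int:
    """Assign a task to a worker, salting the hash with the replica ID so that
    different replicas shard tasks independently even with identical worker counts."""
    digest = hashlib.sha256(f"{replica_id}:{task_name}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_workers

# Two replicas with the same number of workers now give two (nearly) independent
# constraints, each narrowing the suspect list by roughly a factor of num_workers.
print(worker_for_task_salted("some_task", "replica-a", 50))
print(worker_for_task_salted("some_task", "replica-b", 50))
```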
This incident is not the first time that snowballing error values have caused hiccups. I’ve also heard stories where parsing and appending to error values led to accidentally quadratic time complexity. Perhaps we should be a bit more careful when dealing with error values, because they can sometimes be unexpectedly large. (I’m not sure how much this applies across programming languages; some languages might not have the concept of an error value object that can be manipulated at runtime.)
Another idea, though a less clear-cut one, is that perhaps we could just outright ban self-loops, as these are probably just footguns. But this might or might not be reasonable depending on the actual situation.
One might also be tempted to think that instead of relying on clever filtering based on hashes, we should just improve the error messages in the logs to show exactly what caused the crash. I don’t think that would have helped much in practice. Sometimes you just don’t know where the system could crash - if we had anticipated it, we would’ve fixed it. Wrapping every single part of the code with error tagging just seems excessive and infeasible.
In the end, I don’t think anyone made a mistake in the process, and there was not much we could’ve done to avoid it. Testing couldn’t have caught it, because the loop would only exist in production (all testing systems also subscribe to the URL that points to production); the bug was hard to anticipate in code review; and careful deployment wouldn’t have prevented it.
Sometimes, incidents are just a cost of business, even if they happen in production.