Thoughts on Fooled by Randomness

Just finished Nassim Nicholas Taleb’s well-known book, Fooled by Randomness. Here are some brief thoughts, in no particular order.

The Birthday Irony

Despite the author’s years working in trading and writing a book about probability, he got the math wrong in one of the few places where he actually did any. Here’s the original:

If you meet someone randomly, there is a one in 365.25 chance of your sharing their birthday, and a considerably smaller one of having the exact birthday of the same year.

Nassim Nicholas Taleb, Fooled by Randomness

It seems like he was trying to say: on average there are 365.25 days in a year (a first order approximation that accounts for leap years), so you have a \frac{1}{365.25} chance of sharing a birthday with a randomly met person.

If you do the math though, here’s the actual probability: every four years (365 \times 4 + 1 = 1461 days), there are 1460 days on which your probability of sharing a birthday is \frac{4}{1461}, and 1 day on which it is \frac{1}{1461}. So the probability is \frac{1460}{1461} \times \frac{4}{1461} + \frac{1}{1461} \times \frac{1}{1461} \approx \frac{1}{365.44}. That’s far enough from 365.25 that you can’t really call it just a first order approximation.

To fully understand this error, let’s generalize: say there is one extra day every n years, instead of every 4. Then the “1 in X” number, instead of 365.25 or 365.44, will be (\frac{365n^2}{(365n+1)^2} + \frac{1}{(365n+1)^2})^{-1}. Taking a Taylor series expansion, we get 365 + \frac{2}{n} - \frac{364}{365n^2} + O(n^{-3}), or just 365 + \frac{2}{n} + O(n^{-2}), instead of the 365 + \frac{1}{n} that the author had guessed.
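For completeness, here’s a sketch of the algebra behind that expansion, under the same setup as above (a cycle of 365n + 1 days, in which each ordinary date occurs n times and the leap date occurs once):

P = 365 \left(\frac{n}{365n+1}\right)^2 + \left(\frac{1}{365n+1}\right)^2 = \frac{365n^2 + 1}{(365n+1)^2}

\frac{1}{P} = \frac{(365n+1)^2}{365n^2 + 1} = 365 + \frac{2}{n} - \frac{364}{365n^2} + O(n^{-3})

Plugging in n = 4 gives \frac{1}{P} = \frac{1461^2}{5841} \approx 365.44, matching the number above.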

Let’s spend a little time gaining some intuition on why it’s 365 + \frac{2}{n} instead of 365 + \frac{1}{n}. Consider Alice and Bob, and suppose a year is exactly 365 days. Then the chance of sharing a birthday is 1 in 365. Now say we add x days to Bob’s calendar only, so Bob’s birthday has 365 + x possible choices while Alice still has 365. Then the probability that they have the same birthday is 1 in 365 + x. At this point it is clear that if we also add x days to Alice’s calendar, the chance of sharing a birthday goes down further, so the author’s estimate is too high. Now add x to Alice’s calendar as well. If x is small, we can ignore the probability that their shared birthday falls on one of the x added days (that probability is second order). Then the probability of sharing a birthday is approximately \frac{365}{(365+x)^2}, which is close to \frac{1}{365 + 2x}, again ignoring the second order term. Substituting x = \frac{1}{n}, we arrive at the desired result. The factor of 2 comes from the fact that we added the leap day not only to Bob’s calendar, but also to Alice’s.
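In case the “close to \frac{1}{365+2x}” step looks like a leap, it’s just dropping another second order term:

\frac{(365+x)^2}{365} = 365 + 2x + \frac{x^2}{365} \approx 365 + 2x, \quad \text{so} \quad \frac{365}{(365+x)^2} \approx \frac{1}{365+2x}.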

Anyway, on a higher level, the lesson is that you should fully justify your simplifying assumptions, instead of jumping to conclusions.

Wittgenstein’s Ruler

This idea had never explicitly crossed my mind, so I found it interesting. It says something like: if you don’t have a reliable ruler and you use it to measure a table, you might be measuring your ruler with the table. One example he mentions is that some people in finance claimed a ten sigma event had happened. Applying the principle: if you measured a ten sigma event, your ruler (your mathematical model) is probably seriously flawed.

One takeaway from this is that statistics is merely a language for simplifying and describing the real world; the world does not actually run according to those rules. It would be ridiculous to plot data points under a bell curve and then declare the world wrong when a new data point doesn’t fit under it.

Another way of saying the same thing is conditional probability. Relevant xkcd: https://www.xkcd.com/1132/

One place I’ve seen this play out in real life is the current political situation in Hong Kong. Say there’s a certain probability that any one citizen goes nuts and riots in the street, and a certain probability that the government has done something terribly wrong. If very few people are rioting, then the ruler tells you those few are probably at fault. But if a majority of citizens support the riots or are rioting themselves, then they become the ruler, and it’s the government that’s being measured.

Think about All Possibilities

One very valid point in the book is that you should think of the world as taking one sample path out of infinitely many possibilities. When you evaluate an outcome, you should think of all the things that could have happened. For example, if your friend did something and it turned into a huge success, that doesn’t mean he made a good decision, or that you should have done the same, or that you should follow suit now. There is only one data point; you don’t know what the probability distribution looks like. Maybe he could have lost it all. When you think about everything that could have happened, you’ll feel less jealousy toward the lucky and more sympathy for the unfortunate.

Happiness is Relative

This is a point tangential to randomness, but still important to keep in mind. Given that your basic human needs are fulfilled, your happiness often doesn’t depend on how much you have, but on how much more you have compared to those around you. More generally, it’s not absolute well-being that matters, but the changes. So to be happy, don’t be the medium-sized fish in the big pond; go to the small pond and be a king. If you start out at the top, tough luck, because chances are your status will revert to the mean over time.

Limit Your Loss

If there’s one actionable item from the book, it’s to always remember to limit your worst case scenario. Given the choice between a steady increase in personal well-being with no risk of going bankrupt, and a higher income with a chance of losing everything, you should prefer the former, because eventually the unfortunate event will happen. That’s ergodicity: any event with a nonzero probability will, mathematically, eventually happen.

The Author’s Conspicuous Faults

I believe most readers will often find the author’s comments controversial and provocative, if not arrogant and overgeneralizing. There’s a bunch of stuff he said that is just plain wrong.

He says at the beginning of the book that he didn’t rewrite it according to his editor’s suggestions, because he didn’t want to hide his personal shortcomings. But the point of a non-autobiographical nonfiction book is not to convey who you are; it’s to give readers inspiration and positive influence. If you write something bad that you genuinely believe, you’re not “being true”, you’re being a bad influence! I don’t know exactly which suggestions he was referring to, but I suspect they include the points below.

He’s exceptionally arrogant, way off the charts. You’ll see him saying things like “I know nothing about this, despite having read a lot into it” and “I know nothing, but I am the person who knows the most about knowing nothing”. He just can’t write a single sentence that ends on a defeated note. Before he puts the period down, he has to tack on another clause to remind the reader that he’s only being humble, he didn’t really mean it. It’s quite funny once you start looking for it.

He also loves stereotyping people to the extreme. He says things like “journalists are born to be fooled by randomness”, “MBAs don’t know what they’re doing”, “company executives don’t have visible skills” and “economists don’t understand such-and-such a concept”. One thing he says at the beginning of the book is that he doesn’t need data to back up his claims, because he’s only doing “thought experiments”. I think he mistook “thought experiments” for “unfounded personal opinions”. When you claim that journalists and economists are dumb, that’s hardly a thought experiment. You absolutely need to back up your claims.

Overall, this book has some good ideas, but not that many. If you already have a decent background in math, maybe you can skip this book without harm.

Fast RNG in an Interval

https://arxiv.org/abs/1805.10941 – Fast Random Integer Generation in an Interval

I just read this interesting little paper. The original is already quite readable, but perhaps I can give an even more accessible write-up.

Problem Statement

Say you want to generate a random number in the range [0, s), but you only have a random number generator that gives you a random number in the range [0, 2^L) (inclusive, exclusive). One simple thing you can do is to generate a number x, divide it by s and take the remainder. Another thing you can do is to scale the range down, computing something like x * (s / 2^L) with floating point math, casting, whatever works. Both ways give you an integer in the specified range.

But these are not “correct”, in the sense that they don’t generate random integers with a uniform distribution. Say s = 3 and 2^L = 4; then one of the three outputs is always generated with probability 1/2, and the other two with probability 1/4 each. Given 4 equally likely inputs, you simply cannot convert them into 3 equally likely cases. More generally, these simple approaches cannot work whenever s does not divide 2^L, i.e. whenever s is not a power of 2.
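To make this concrete, here’s a minimal C++ sketch of both naive approaches (the 32-bit generator and the function names are my own choices for illustration, not anything from the paper):

#include <cstdint>
#include <random>

// Both of these map 2^32 equally likely inputs onto s outputs, so whenever
// s does not divide 2^32, some outputs are necessarily hit one extra time.

// Naive approach 1: take the remainder.
uint32_t biased_mod(std::mt19937 &gen, uint32_t s) {
    return static_cast<uint32_t>(gen()) % s;
}

// Naive approach 2: scale the range down, computed as (x * s) / 2^32
// with a 64-bit product and a shift instead of floating point.
uint32_t biased_scale(std::mt19937 &gen, uint32_t s) {
    uint64_t x = static_cast<uint32_t>(gen());
    return static_cast<uint32_t>((x * s) >> 32);
}

With s = 3 both functions favor one output slightly, for exactly the reason described above, just with a much smaller bias than in the 2^L = 4 toy example.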

First Attempt at Fixing Statistical Biases

To fix that, you need to reject some numbers and try again. In the example above, when you get the number 3, you roll again until you get a number from 0 to 2. Then all outcomes are equally likely.

More generally, you need to throw away 2^L mod s numbers, so that the count of remaining numbers is divisible by s. Let’s call that number r, for remainder. You can throw away the first r numbers and then use the first approach of taking the remainder, as shown in this first attempt (pseudocode):

r = (2^L - s) mod s   // 2^L is too large, so we subtract s
x = rand()
while x < r do
    x = rand()
return x mod s

That’s a perfectly fine solution, and in fact it has been used in some popular standard libraries (e.g. GNU C++). However, division is a slow operation compared to multiplication, addition and branching, and in this function we always perform two divisions (the two mod operations). If we can somehow cut down on the divisions, the function may run a lot faster.
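For reference, here’s roughly what that first attempt looks like as real C++ (my own sketch with a 32-bit generator; actual standard library implementations differ in the details):

#include <cstdint>
#include <random>

// Rejection sampling, version 1: throw away the first (2^32 mod s) values so
// that the remaining count is divisible by s, then take the remainder.
// This costs two divisions (the two % operations) on the common path.
uint32_t bounded_rand_two_divs(std::mt19937 &gen, uint32_t s) {
    uint32_t r = (UINT32_MAX - s + 1) % s;   // 2^32 mod s, since 2^32 itself doesn't fit
    uint32_t x = static_cast<uint32_t>(gen());
    while (x < r) {
        x = static_cast<uint32_t>(gen());
    }
    return x % s;
}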

Reducing the Number of Divisions

It turns out we can do just that, with a simple twist. Instead of getting rid of the first r numbers, we get rid of the last r numbers. And we can check whether x is among the last r numbers like so:

x = rand()
x_mod_s = x mod s
while x - x_mod_s > 2^L - s do
    x = rand()
    x_mod_s = x mod s
return x_mod_s

The greater-than comparison on line 3 is a little tricky. It’s mathematically the same as comparing x - x_mod_s + s with 2^L, but we write it this way because 2^L itself can’t be expressed in L bits. So the check is saying: if the smallest multiple of s greater than x would go past 2^L, then x is among the last r numbers and must be thrown away. We never actually calculate r, but with a little cleverness we perform the same check.

How many divisions are we doing here? Well, at least the one on line 2, plus however many the loop adds. Since we’re rejecting fewer than half of the possible outcomes (we keep at least s values and reject at most s - 1), we have at least a 1/2 chance of breaking out of the loop each time, which means the expected number of extra iterations is at most 1 (0 * 1/2 + 1 * 1/4 + 2 * 1/8 + … = 1). So the expected number of divisions is at worst 2, equal to that of the previous attempt. But most of the time the expected number is much closer to 1 (e.g. when s is small), so this can theoretically be almost a 2x speedup.
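In C++ this variant might look like the following (again my own sketch, not code taken from the paper or any particular library):

#include <cstdint>
#include <random>

// Rejection sampling, version 2: reject the last (2^32 mod s) values without
// ever computing that count. Only one division per attempt.
uint32_t bounded_rand_one_div(std::mt19937 &gen, uint32_t s) {
    uint32_t x = static_cast<uint32_t>(gen());
    uint32_t x_mod_s = x % s;
    // If the length-s block containing x runs past 2^32, then x is one of the
    // last (2^32 mod s) values and must be rejected.
    while (x - x_mod_s > UINT32_MAX - s + 1) {
        x = static_cast<uint32_t>(gen());
        x_mod_s = x % s;
    }
    return x_mod_s;
}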

So that’s pretty cool. But can we do even better?

Finally, Fast Random Integer

Remember that other than taking remainders, there’s also the scaling approach x * (s / 2^L)? It turns out that if you rewrite it as (x * s) / 2^L, it becomes quite efficient to compute, because computers can “divide” by a power of two by simply chopping off bits from the right. Plus, a lot of hardware can give you the full result of a multiplication, so we don’t have to worry about x * s overflowing. With the mod-based approaches we inevitably need at least one expensive division, but here we don’t anymore, thanks to the power-of-two denominator. So this direction seems promising, but again we have to fix the statistical bias.

So let’s investigate how to do that with our toy example of s = 3, 2^L = 4. Let’s look at what happens to all possible values of x.

x    s * x    (s * x) / 2^L    (s * x) mod 2^L
0    0        0                0
1    3        0                3
2    6        1                2
3    9        2                1

Essentially we have s intervals of size 2^L, and each interval maps to a single outcome: [0, 4) maps to 0, [4, 8) maps to 1, and [8, 12) maps to 2. From the third column, we have two cases mapping to 0, and we’d like to get rid of one of them.

Note that the fundamental reason behind this uneven distribution is that 2^L is not divisible by s, so a contiguous range of 2^L numbers can contain a varying number of multiples of s. That means we can fix it by rejecting r numbers in each interval! More specifically, if we reject the first r numbers in each interval, then every interval contains the same number of multiples of s. In the above example, the mapping becomes: [1, 4) maps to 0, [5, 8) maps to 1, and [9, 12) maps to 2. Fair and square!

Let’s put that in pseudocode:

r = (2^L - s) mod s
x = rand()
x_s = x * s
x_s_mod = lowest_n_bits x_s L   // equivalent to x_s mod 2^L
while x_s_mod < r do
    x = rand()
    x_s = x * s
    x_s_mod = lowest_n_bits x_s L
return shift_right x_s L   // equivalent to x_s / 2^L

Now that would work, and it would take exactly 1 expensive division, on line 1, to compute r every single time. That already beats both of the algorithms above. But wait, we can do even better! Since r < s, we can first compare x_s_mod against s, and only compute r when x_s_mod < s. This is the algorithm proposed in the paper. It looks something like this:

x = rand()
x_s = x * s
x_s_mod = lowest_n_bits x_s L
if x_s_mod < s then
    r = (2^L - s) mod s
    while x_s_mod < r do
        x = rand()
        x_s = x * s
        x_s_mod = lowest_n_bits x_s L
return shift_right x_s L

Now the number of expensive divisions is either 0 or 1, with a probability that depends on s and 2^L. This is clearly faster than the other algorithms, and the experiments in the paper confirm it. But as is often the case, the performance comes at the cost of less readable code. We’re also relying on hardware support for full multiplication results, so the code is less portable and in practice looks pretty low level and messy. Go and Swift have adopted this algorithm, judging the tradeoff worthwhile, according to the author’s blog (https://lemire.me/blog/2019/09/28/doubling-the-speed-of-stduniform_int_distribution-in-the-gnu-c-library/), and C++ may follow soon.
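Here’s a rough C++ sketch of the final algorithm. To keep it portable, this version produces a number in [0, s) for a 32-bit s using a 64-bit product, rather than the 64-bit/128-bit version discussed in the paper; the function name and structure are mine:

#include <cstdint>
#include <random>

// Multiply-and-shift mapping with lazy computation of the rejection
// threshold r = 2^32 mod s. Does either 0 or 1 integer divisions per call.
uint32_t bounded_rand(std::mt19937 &gen, uint32_t s) {
    uint64_t x = static_cast<uint32_t>(gen());
    uint64_t x_s = x * s;                           // full 64-bit product
    uint32_t x_s_mod = static_cast<uint32_t>(x_s);  // low 32 bits: x_s mod 2^32
    if (x_s_mod < s) {                              // only then can we possibly reject
        uint32_t r = (UINT32_MAX - s + 1) % s;      // 2^32 mod s, the lone division
        while (x_s_mod < r) {
            x = static_cast<uint32_t>(gen());
            x_s = x * s;
            x_s_mod = static_cast<uint32_t>(x_s);
        }
    }
    return static_cast<uint32_t>(x_s >> 32);        // x_s / 2^32
}

For example, std::mt19937 gen(42); followed by bounded_rand(gen, 6) + 1 gives an unbiased roll of a six-sided die.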

How Many Divisions Exactly?

There’s still one last part we haven’t figured out: we know the expected number of divisions is between 0 and 1, but what exactly is it? In other words, how many multiples of s in the range [0, s * 2^L) have a remainder less than s when divided by 2^L? To people with more number theory background this is probably obvious, but starting from scratch it can take quite a bit of work to prove, so I’ll just sketch the intuition.

It’s a well-known fact that if p and q are co-prime (no common factors other than 1), then the numbers { 0, p mod q, 2p mod q, 3p mod q, ..., (q-1)p mod q } are exactly the numbers 0 to q - 1, in some order. This is because if there were any repeat, we would have a * p mod q = b * p mod q for some a > b, which implies (a - b) * p mod q = 0. But 0 < a - b < q, and p has no common factor with q, so their product cannot be a multiple of q. So duplicates are impossible, and the multiples of p distribute evenly among [0, q) when taken mod q.

Now if s and 2^L are co-prime, there are exactly s multiples of s (out of the 2^L of them in [0, s * 2^L)) whose remainder mod 2^L falls in the range 0 to s - 1. That means the expected number of divisions in this case is s / 2^L.

If they aren’t co-prime, then s is divisible by some power of 2. Say s = s' * 2^k, where s' is odd. Then s * 2^(L-k) = s' * 2^L is 0 mod 2^L, so the multiples of s taken mod 2^L cycle back to 0 after 2^(L-k) steps, and you get 2^k repetitions of that cycle. If you tally up how many times each remainder occurs, the counts go: 2^k, followed by 2^k - 1 zeros, rinse and repeat. How many of these remainders are below s? There are s' nonzero counts below s, each equal to 2^k, which is, unsurprisingly, s again. So the expected number of divisions is still s / 2^L.
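If you’d rather not take the number theory on faith, the claim is easy to check by brute force for a small L (this is just my own sanity-check code, not anything from the paper):

#include <cstdint>
#include <cstdio>

int main() {
    const int L = 10;
    const uint32_t two_L = 1u << L;
    // For each s, count how many x in [0, 2^L) satisfy (x * s) mod 2^L < s.
    // The claim above is that this count is always exactly s.
    for (uint32_t s = 1; s <= two_L; ++s) {
        uint32_t count = 0;
        for (uint32_t x = 0; x < two_L; ++x) {
            if ((x * s) % two_L < s) {
                ++count;
            }
        }
        if (count != s) {
            std::printf("claim fails for s = %u: count = %u\n", s, count);
            return 1;
        }
    }
    std::printf("verified: count == s for every s in [1, 2^%d]\n", L);
    return 0;
}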

Final Thoughts

Earlier I said you need to throw away 2^L mod s numbers each time to get an even distribution, but that’s not strictly necessary. For example, if s = 5 and 2^L = 8, you don’t have to fully discard the 3 rejected cases. You can save that leftover randomness for the next iteration: say you land in 1 of the 3 rejected cases again. Combined with the 3 cases saved from last time, you are now in one of 9 equally likely events, and if you’re in the first 5 of them, you can safely return that value without introducing bias. However, this is only useful when generating random bits is really expensive, which is simply not the case outside of cryptographic settings.

One last note: we established that the expected number of divisions is s / 2^L, so as s gets close to 2^L, it seems like the code should get slower. But I don’t think that’s necessarily the case, because the time a division takes is probably variable as well, if the hardware divider does any sort of short-circuiting at all. When s is close to 2^L, computing 2^L mod s is essentially one or two subtractions plus some branching, which can in theory be done really fast. So, as my educated guess/pure speculation, s / 2^L growing isn’t a real concern.