Longest Sequence Without Duplicate Substring of Constant Length

Around 9 years ago I was working on a multiple choice exam and this problem came up. Recently I thought about it again and it’s actually very simple.

The problem is: say you are writing a sequence of letters A, B, C, D, with one constraint. Whatever substring of length 3 that already appeared in your sequence cannot appear again. What is the longest sequence length you can possibly come up with?

As an example, the sequence “ABCDACBB” is valid, but the sequence “ABCDABCB” is not, since “ABC” appears twice. By the same token, “AAAAB” is also invalid, since “AAA” appears twice.

There is an easy generalization of this problem to k choices of letters instead of 4, and m as the length of substring instead of 3. Quick maths: there are k^m substring combinations, so there can be k^m unique starting points, and the answer is upper bounded by k^m + m – 1.

9 years ago I didn’t know anything so I thought it was hard. But now it just looks like a graph traversal problem, which is either trivial or intractable (depending on whether it’s Eulerian or Hamiltonian). With this in mind, the problem is trivial. You can either read my solution below or spend a few minutes to figure it out.

Construct a directed graph where each node represents a unique substring made up of the k choices of letters of length m – 1. There are k outgoing edges and k incoming edges for each node. The k outgoing edges represent the k different ways you can append a letter to the current substring of length m – 1. So for example for the node “AB”, the outgoing edge “C” would point to the node “BC”, since after adding a letter “C”, you get to the new prefix “BC”. Then each edge corresponds to one unique substring of length m, and the problem becomes to find an Eulerian path in this graph. Since this graph is always connected and each node has the same number of in degrees as out degrees, the path always exists and must be a cycle, regardless of m and k. There will be many possible cycles as well. Now this problem is theoretically solved, and also programmatically solved.

As I later found out from my roommate’s math homework, this graph construction is called de Bruijn graph. Of course everything has to have a name.

I wish I could go back in time and explain this to the little me.

Having fun with queues

Recently I’ve been reading Okasaki’s Purely Functional Data Structures. In the first few chapters, the book discusses several really interesting ideas, that I will attempt to summarize and introduce below.

TLDR: persistence; lazy evaluation; amortization; scheduling.

Before getting into algorithms, there are some prerequisites to understanding the context of functional programming.

Immutable objects, persistence

Normally in imperative programming, we have variables, instances of classes and pointers that we can read or write. For example we can define an integer and change its value.

int x = 4;
x = x + 1;
cout << x << endl;

However in the functional version of the same code, even though it looks as if it does the same thing, variables are not modified.

let x = 4 in
let x = x + 1 in
printf "%d\n" x

What the above OCaml code does, is it first binds the name x to the number 4, then binds the name x to the number 5. The old value is not overwritten but shadowed, which means the old value is theoretically still there, just that we cannot access it anymore, since we lost the name that refers to it. In purely functional code, everything is immutable.

But if we use a new name for the new value, we can keep the old value, obviously:

let x = 4 in
let y = x + 1 in
printf "%d %d\n" x y

Now let’s look at some nontrivial example, modifying a list.

let x = [ 1; 2; 3 ] in
let y = 0 :: x in
print_list x;
print_list y
(* 1 2 3
   0 1 2 3 *)

Note that the double colon operator adds a new value on the head of a singly linked list. What this means is that after adding a new element to the linked list, not only do we have a linked list of [ 0; 1; 2; 3 ] called y, but we also still have the old list of [ 1; 2; 3 ] called x. We did not modify the original linked list in any way; all the values and pointers in it are still unchanged.

Now this, in itself, is already interesting: first, you can have y = 0 :: x and z = -1 :: x, essentially creating 3 linked list in total. But since they all share the same tail of length 3, so only 5 (-1, 0, 1, 2, 3) memory locations are being allocated, instead of 3+4+4 = 11. It is also worth the time to go through common list operations (reverse, map, reduce, fold, iter, take, drop, head, tail…) and verify that you can implement them in the same time complexity without making modifications to the original list. You cannot apply the same techniques to random access arrays, and that is why lists are the basic building blocks of functional programs the same way arrays are the basic building blocks of imperative programs.

Lazy evaluation, memoization

One of the advantages of having immutable objects is that imagine for a function f that takes in only a list as input, the output of the function will always be the same as far as the list has the same physical address. Then we do not have to repeat the same work, given the technique of memoization: after executing a pure function with an immutable input, we can write down the input output pair in a hash table and simply look it up the next time. What is even better is we can do this at the language/compiler level, so memoization is hidden from the programmer. A nice consequence is that we can simply write a recursion to solve problems that automatically turn into DP solutions.

Lazy evaluation capability is also built into many functional languages. The point of lazy evaluation is to delay computation as late as possible until we actually need the values. With this technique, you can define a sequence of infinite length, such as a sequence of all the natural numbers. There are two basic operations here: creating a lazy computation, and forcing it to get the output. This will be important later in our design of queues.

Queue: 1st gen

Now let’s get to the point: implementing purely functional queues. The first challenge: actually have a queue, regardless of time complexity. Given building blocks of lists, how do we implement a queue? Note that it is not desirable to build from arrays, because they are intrinsically mutable, using them will not help to create an immutable data structure. Fortunately we have a classic algorithm for this: implement a queue with two lists, F and R. We pop from the front of F and push to the front of R, and do (F, R) = (List.rev R, []), reverse the list R and put it in the front, whenever F becomes empty. Here’s the obligatory leetcode link.

This implementation claims to have amortized costs of O(1) for push, pop and top operators. This is obvious for push and top, and pop takes a little bit of analysis. Sometimes when popping, we will reverse a linked list of n elements, which takes time O(n). However in order to trigger this O(n) operation, we need to already have done n O(1) operations. That means approximately every n operations, we can at most have done n*O(1) + O(n) work, which is still O(n). Therefore on average we still only had O(1) work per operation. This is a very brief description, assuming prior knowledge in readers.

Amortization breaking down

However this amortization bound does not work for our functional version of queue. This is perhaps obvious to see: say we have one element in F, and we have n elements in R. These two lists together form a functional queue instance, called x. Calling y = Queue.pop x would be O(n), then calling y = Queue.pop (Queue.push x 1) would be another O(n). In fact, we can create many such O(n) operations. Obviously, the cost now for Queue.pop could be as bad as O(n), even in the sense of amortized cost. This is because now that we can access the entire history of the data structure, we can force it to perform the worse operation over and over again, hence breaking time complexity bound. Because of this way of repeating the expensive operation, it seems unlikely that persistent data structures in general can utilize amortization in run time analysis.

Queue: 2nd gen

The key idea to fixing the amortized cost is, spoiler alert, lazy evaluation. And this is supposedly the aha moment of this post as well. Instead of reversing the R when F becomes empty, we “reverse the list R and append it to F” when R’s length becomes greater than F’s length. The operation of F = F @ (List.rev R), where @ means concatenation, is written in quotes, because it is computed lazily. The intuition is that, say we have F and R both of length n, and pushing one element on R (or popping one element from F) will create a lazy computation of reversing R in O(n) time. However, this reversal will not actually be carried out until O(n) more pops are executed, such that the first element of (List.rev R) becomes exposed. Then, the O(n) cost of reversal can be amortized to the O(n) steps forcing the reversal, hence all operations become O(1).

Let’s compare this amortization with the 1st gen amortization. How come this works and the 1st gen didn’t work? The key distinction is we need to amortize expensive operations into future operations, not those in the history. The old method of forcing an expensive operation to happen multiple times to break amortization does not work anymore, because we must call pop O(n) times before we force a lazy reversal of O(n) to happen, and once it is reversed, we memoize that result, so there is no way to force that O(n) to duplicate itself. Meanwhile, memoization of list reversal doesn’t help with the 1st gen design, because we can add any element to R to make it a different list, which would constitute a different input to the List.rev function.

Sketch of proof

Having a working data structure is great, but what’s better is a framework for proving the amortization cost rigorously for not only queues, but also other persistent data structures. Traditionally we have 2 methods of proving amortized cost, the banker’s method of gaining and spending credits and the physicist’s method of assigning potential to data structures. Both of them are similar in that they calculate how much credit we have accumulated in the history up till a point, and then we can spend them on expensive operations that follow.

To prove amortization into the future for persistent data structures, we accumulate debits and try to pay for them in the future one step at a time. Here, debits would be the suspended costs of lazy computation. Adding n debit is essentially saying, “hey I need to do something O(n) here, please execute O(n) other operations before asking for my result.” Hence, you cannot ask for a particular result (like the reversed list R) if your debit account is positive; you must pay off all your debt first.

Then, proving these amortization bounds is simply a matter of choosing the right debits and payments for each operator in the banker’s method, and choosing the right debit potential function for each instance of the data structure. There are rigorous results given in the book, which will not be reproduced here. The curious reader can… just look them up in the book.


Lastly, another important point: we don’t really need amortization at all, instead we can just make everything O(1) worst case. The idea is that instead of lazily delaying the work until it is actually needed, we just actually complete it one step at a time. If reversing the list takes n steps, and we are amortizing that cost on n operations, why don’t we just actually do one step per operation? Since the operations come after the lazy computation is created, we can for sure do that. It was not possible before when the operations being amortized on came before the expensive computation. Of course, the code is going to be a little trickier and has more performance overhead, but theoretically the queue with scheduling achieves a worst case time complexity the same as the ephemeral (non-persistent) counterpart, everything is O(1).

This is not always possible, as for some other data structures it is not easy to find out what pieces of expensive computations we need to pay off per operation. Fortunately for queues it is not hard to imagine how to implement one.

What I left out

Of course, this is still a TLDR version of all the technical details of the implementation, which are all explained in length in the book. The proofs are still nontrivial to complete even given the ideas, and it takes some effort to make sure the scheduling is efficient. I also left out all actual implementations of the data structure in ML code in the book, since explaining the syntax would be heavy and would not contribute to understanding the key ideas.

Anyway, this is about the amount I want to discuss in this post; it’s also interesting to think about implementing maps and sets (like c++ STL ones) as functional data structures with the exact same asymptotic time complexity on common operators like add, remove, find for both and set, get for maps. Happy functional programming 🙂

Nerd sniping: prison break edition

First, relevant xkcd.

Nerd sniping is like a chain email; after being sniped, if you don’t nerd snipe 10 other people in 3 days, your IQ will drop by 30 points. – A wise man

Anyways here are three problems that sniped me throughout the years that I still remember. Solutions are in white font following the problems.

  1. There are 64 prisoners, 64 different types of hats, and one guard. The guard is going to play a game with these prisoners. First, the guard randomly picks a hat for each prisoner to put on, and any prisoner cannot see his/her own hat, but can see the hats of everyone else. Some of these hats might be of the same type. They all get one chance to guess their own hat type (individually without knowing other prisoners’ guesses), and if any prisoner gets it right, they all win the game. All prisoners can negotiate before starting the game, but cannot talk to each other afterwards. How do they win the game?
  2. There are 2 prisoners A and B, and one guard. The guard pulls out a chess board (8×8) and put one coin on each grid, either heads or tails. The guard points at a certain coin, and only A can see the choice (also the board with all coins). Then A gets to flip exactly one coin, then he leaves the room, and B comes in. By only looking at the 64 coins after A’s coin flip, can B figure out which coin the guard pointed at? Same conditions as before: negotiation only before the game starts.
  3. There are 2 prisoners A and B, and also the same guard. This time the guard pulls out 64 cards with numbers 1-64 written on each card, then scrambles the order and put them on a table, face up. A gets to see all cards and also can swap one pair of cards. Then all cards are concealed, keeping the order after A’s swap. B comes in, and the guard tells him a number between 1 and 64 inclusively, and B can try to find the corresponding card in 32 trials. In each trial, B can only reveal one card to see if it matches with the guard’s specified card. How can B find the card? Same conditions as before, blah blah blah.


Just kidding, I’m not going to write down the solutions. It wouldn’t exactly be nerd sniping if that’s the case :^)

Bloomberg CodeCon 2017

Went to Bloomberg’s headquarters in New York for a coding contest a few days ago. It looks more like a recruiting event than an actual contest, as the problems were fairly not difficult comparing to ACM contests, and it was only 2 hours.

Getting into the contest

There are like around 2^7 contestants in this contest, 3 from each invited university. Couple months before the actual contest on January 27, a local contest was held within USC to select representatives. 8 problems in 2 hours, each problem holds a different weighting, correlated with the difficulty.

After qualifying, Bloomberg schedules the flights, hotels and transport, so we the contestants just have to ditch the Friday classes and head to the airport. Of course, they pay for everything: round trip flights, stay at Fitzpatrick Grand Central over the weekend, cars between hotel and airport, etc. For free stuff, there’s a Bloomberg hoodie, beanie, $100 gift card, 2 $5.5 Metro cards, and a random nanoblock thing.

The horseshoe shaped building, from the inside.

When we got there, they had a little tour around their pretty nice building. They didn’t walk us through a lot of it though; we went to the top floor to see the Manhattan view, grab free food, look at the UX lab and some random fish tanks and that’s it. The other contestants didn’t seem to give a * about the tour, I bet most of them came for the free stuff.


Basically the CodeCon was the same as local contest: 8 problems in 2 hours, open Internet open book and everything. Technically we could’ve joined the contest without all the flying since it was online anyways. I think there was also a synchronous event in London, and colleges like ICL and Oxford were there too.

Top three guys can actually win a prize. They gave out a laptop, a PSVR and something else I forgot. Given the slim chances I didn’t try very hard, since the marginal benefit is basically 0.

The contest started to everybody’s surprise since they “instead of pushing back the contest by 10 minutes, accidentally pushed it early by 10 minutes.” Well… alright, let’s get into coding then.


Honestly I don’t like this problem set too much. Even the local USC contest was more interesting; it had a little math thing like GCD, BFS maze search and DP. A few problems in the finals, however, are purely brute-force problems. The input sizes would be like around 2^4, and you just have to enumerate all the possibilities and do some stuff. Here are a few examples.

The sorting problem

Given an integer array of size at most 5 with unique elements, print the minimum number of pair swaps to make it sorted. The correct/most efficient way was to pick the minimum and move it in the front, something like this:

int sorting(vector<int>& v) {
    int ans = 0;
    for (int i = 0; i < v.size(); i++) {
        int mn = i;
        for (int j = i+1; j < v.size(); j++)
            if (v[j] < v[mn])
                mn = j;
        if (mn != i) {
            swap(v[mn], v[i]);
    return ans;

But here’s what I did:

bool check(vector<int>& v) {
    for (int i = 1; i < v.size(); i++)
        if (v[i-1] > v[i])
            return false;
    return true;
int sorting(vector<int>& v) {
    vector<pair<int, vector<int> > > bfs;
    bfs.insert(make_pair(0, v));
    for (int i = 0; i < bfs.size(); i++) {
        if (check(bfs[i].second))
            return bfs[i].first;
        vector<int> copy = bfs[i].second;
        for (int j = 0; j < v.size(); j++)
            for (int k = j+1; k < v.size(); k++) {
                swap(copy[j], copy[k]);
                bfs.push_back(make_pair(bfs[i].first+1, copy));
                swap(copy[j], copy[k]);
    return 0; // not reached

I was a little rushed and didn’t want to think through the correctness of picking the smallest every time, so I just enumerated all the possible swapping without any optimization. Each level multiplies the number of elements in the BFS vector by 10, and the number of levels is bounded by 4, so it was definitely fast enough.

The Pokemon problem

Two trainers are battling N Pokemons of their own. Each trainer has an order of battling the Pokemons, and they battle 1vs1 until their health goes to zero, then the next Pokemon in the queue take up the space. Damage is a function of parameters of the attacking and defending Pokemons. The problem is to maximize the number of Pokemons standing from trainer A at the end of the war using any permutations of the queue, given the order of trainer B’s Pokemons.

The solution is also quite trivial, simply implement the battles and use next_permutation() to enumerate through all possible permutations. However I got stuck here and couldn’t debug because I was using the names of the Pokemons in the comparator of my Pokemon struct, but those names are not unique, so I’m missing some permutations. It was such a stupid bug.

And so on…

And then there are 2 more problems of exactly the same type: literally enumerate all possible configurations. I used the same recursive call structure, so the two solutions looked alike. After solving 4 and getting stuck on Pokemons, there weren’t too much time left, so I kind of just gave up at that point.


Team USC

4 problems and ranked 48th; my ICPC teammate Yuehui Wang also from USC got 7th. As you can tell, he carried my ICPC contests. It was more or less a speed contest, since I think only the last problem actually requires some algorithms (it was like a 2D matrix connectivity thing). I like the ones with a little more difficulty instead of “enumerate and take max”. Anyway, I need to get better and faster for next year’s ICPC SoCal regional.

The rest of the trip

Before heading to Bloomberg, I paid a short visit to Jane Street, since I haven’t seen the new office yet. Don’t think I’m allowed to take pictures inside, but here’s a T shirt I got from them. It has quite a number of references to JS’s favorite functional programming language OCaml.

Since it was the day before Chinese New Year, a big dinner is somehow compulsory by tradition. Sakagura is a legit Japanese place hidden in a basement somewhere.

Then I headed to Yale to visit a friend. First time at Yale, looks very much like Harvard to me; perhaps it’s an East coast thing.

The Harkness Tower and me

Last but not least, my favorite thing in New York: Luke’s Lobster.

Euler Path: 8 Lines Solution

The problem of Euler path marked a very fundamental moment in algorithm studies. When Euler posed the 7-bridge problem, there was no mathematical tool to solve it, hence he created the tool – graph theory. It sounds like how Newton invented calculus to solve his gravity problems and how Bernoulli invented calculus of variations to solve his brachistochrone problem. As I quote Sutherland,

“Well, I didn’t know it was hard.”

Problem statement

Say you have arrived in a new country with a bunch of islands and one way bridges between some pairs of islands. As a tourist you would like to visit all the bridges once and only once. Abstractly, an Euler path is a path that traverses all the edges once and only once. Is it always possible? If possible, how do we find such a path?

Existence of path

Unlike many other problems, we can determine the existence of the path without actually finding the path. There are 2 conditions for it. The first has to do with in and out degrees, and the second is about the connectivity of the graph.

For a directed graph, the in degree is the number of incoming edges to a vertex, and the out  degree is the number of outgoing edges to a vertex. Easily, if we have an Euler path, then there will be a start and end vertex. The start vertex will have an out degree – in degree of 1, the end vertex will have in degree – out degree of 1. Every other vertex will have an equal number of out degree and in degree, because if you have a different number, you obviously cannot go through all of those edges in one path. The only exception is when the start and end vertices are the same, in which case all the vertices have the same number of in and out degrees.

The second condition is that all vertices have to be weakly connected, meaning that by treating all edges as undirected edges, there exists a path between every pair of vertices. The only exception is for the vertices that have no edges – those can just be removed from the graph, and the Euler path trivially does not include them.

Given the above two properties, you can prove there is an Euler path by the following steps. First it is easy to see that if you start walking from the start vertex (out – in = 1) and removing edges as you walk through them, you can only end up at the end vertex (in – out = 1). Then, the remaining graph is full of cycles that can be visited through some vertex in the path we already have, and you just have to merge the cycles and the path to get the Euler path.

It is easier to see the other direction of the proof: if there is an Euler path, both conditions have to be met.

Implementing existence of path

As mentioned above, we code up the two conditions and check whether we can find a valid starting position, otherwise return -1. Vertices are numbered 0 to n-1. The graph is stored as an adjacency list, meaning that adj[i] has all the neighbors (edges go from i to those). So the out degree of i is adj[i].size().

int start(vector<vector<int> >& adj) {
    // condition 1: in and out degrees
    vector<int> deg(adj.size());
    for (int i = 0; i < adj.size(); i++) {
        deg[i] += adj[i].size();
        for (int x : adj[i])
    int ans = -1;
    for (int i = 0; i < deg.size(); i++)
        if ((ans != -1 && deg[i] != 0 && deg[i] != -1)
            || deg[i] > 1)
            return -1;
        else if (deg[i] == 1)
            ans = i;
    if (ans == -1)  // start and end vertices are the same 
        for (int i = 0; i < adj.size() && ans == -1; i++)
            if (!adj[i].empty())
                ans = i;
    if (ans == -1)  // there is no edge at all
        return -1;
    // condition 2: connectivity
    vector<bool> vis(adj.size());
    vector<int> bfs{ans};
    vis[ans] = true;
    for (int i = 0; i < bfs.size(); i++)
        for (int x : adj[bfs[i]])
            if (!vis[x]) {
                vis[x] = true;
    for (int i = 0; i < adj.size(); i++)
        if (!vis[i] && !adj[i].empty())
            return -1;
    return ans;

That is slightly more clumsy that I would like it to be, but it should be clear. Basically just counting in and out degrees and running a BFS on the starting vertex.

Actually finding the path

As mentioned above in the sketch of proof, finding a path consists of the 3 steps:

  1. Walk from the start vertex, removing edges as we use them, until there is nowhere to go. Then we have a path from start to end.
  2. For the remaining edges, start at a vertex on the path and randomly walk until we go back to the same vertex. Then we have a cycle. Merge the cycle on the path. For example, if we had a path s->a->b->c->…->t and a cycle c->alpha->beta->c, we merge them to become s->a->b->c->alpha->beta->c->…->t.
  3. Repeat step 2 until there is no remaining edge.

Well, that does not sound very easy to write nor very efficient to run, if we literally implemented the above. How many lines of code would that be?

The 8 lines solution

The answer is 8. Here’s the code:

void euler(vector<vector<int> >& adj, vector<int>& ans, int pos) {
    while (!adj[pos].empty()) {
        int next = adj[pos].back();
        euler(adj, ans, next);

This is a very beautiful solution using recursion, in my opinion. I was quite surprised when I first saw this. Where is everything? Where is getting the main path? Where is getting the cycles? Where are we merging them?

The gist of this recursion is a post-order DFS, meaning that we visit the end of the graph first, and then backtrack. This comes from a very crucial observation: we can always be sure what could be at the end of the path, but not the front. It is important to know that the answer array is in reverse order of visit, i.e. we need to reverse it to get the Euler path. Let’s go through the stages of how the program works.

  1. Path from start to end vertex. The first time we push back is when we run out of outgoing edges, which can only be the case of the end vertex, with in-out = 1. In all other vertices, the numbers are the same so if you can go in, you can definitely go out of that vertex. Hence the first push back occurs with the end vertex, and at that point the program execution stack has the entire path.
  2. As we return from a recursive call of the function, we are essentially going back from the end to the start. If a vertex on the main path does not have any outgoing edge, we know we will visit it next, so we push it to the ans vector and return from the function.
  3. But if a vertex on the path does have an outgoing edge, that means there is at least one cycle including this vertex. Then in the next iteration of the while loop, we will visit one of the outgoing edges and start another round of recursion. Again, this recursion must end on the same vertex because it is the only one with in-out = 1. This recursion gets you a cycle and by the time it returns, the cycle would have been pushed to the ans array already, finishing the merge operation.

By studying this code, there is one interesting point to note. That is, the while loop will only be executed 0 to 2 times in any recursive call. It will only be 0 at the end vertex, and on the main path with no branches, it will be 1. On the main path with branches, it will only be 2 but not more regardless of the out degree, because surely the recursive call for cycle needs to end on that vertex but it will not end until all outgoing edges are used up. Therefore to the program, multiple cycles on one vertex is just one big cycle. On the other vertices on cycles, it works the same way whether they have branches or not.

In case you want to see how to run it, here is the main function I wrote to test it:

int main() {
    int n, q;
    cin >> n >> q;
    vector<vector<int> > adj(n);
    for (int i = 0; i < q; i++) {
        int u, v;
        cin >> u >> v;
    int s = start(adj);
    if (s == -1) {
        cout << "no path" << endl;
        return 0;
    vector<int> ans;
    euler(adj, ans, s);
    reverse(ans.begin(), ans.end());
    for (int x : ans)
    cout << x << endl;
    return 0;

That’s it – a simple problem with a simple solution. Leetcode has one Euler path problem, and the algorithm in this blog post comes from the discussion of that problem. Everything is pretty much the same for undirected graphs – you just have to use different data structures to store the edges. The proof is mostly the same with the first condition now about odd/even number of edges at each vertex, as there is no distinction between in and out degree.

TIW: Subset Sum

Subset sum, aka the coin change problem, is a very specific problem that for some reason gets mentioned a lot. The problem is very simple: given some coins of different values, what values of items can we buy with a subset of the given coins? Sometimes the problem statement asks, can we buy an item of a certain price with our coins without change? If not, what is the minimum amount of change needed? What if we have a number of the same coins, or even an infinite number? Many of these variants can be solved with some tweaks to the same code. So let’s start with the most basic formulation: given a vector of positive integers, return a vector of all integers that are subset sums (i.e. there exists a subset in the input vector with a sum equal to that number).

There are many good ways to accomplish this task.

First attempt: iterative, boolean array

Let’s jump into the code directly:

vector<int> solution1(vector<int>& coins) {
    int total = 0;
    for (int x : coins)
        total += x;
    vector<bool> pos(total+1);
    pos[0] = true;
    for (int x : coins)
        for (int i = total; i >= 0; i--)
            if (pos[i])
                pos[i+x] = true;
    vector<int> ans;
    for (int i = 0; i <= total; i++)
        if (pos[i])
    return ans; 

OK, so first we calculated the total values of the coins, then we know all possible subset sums are in the range [0, total]. Then we constructed a boolean vector with indices in the same range, to indicate what values we can form. The algorithm proceeds by adding a new coin every iteration. Say we used to be able to pay {0, 1, 2, 3, 4, 5, 6}, and we have a new coin of value 10. Then by adding the new coin in, we can pay {10, 11, 12, 13, 14, 15, 16}. But we can also not use the new coin and pay [0, 6], so we need to keep that in the original array. Then we combine the two.

The most important part here is the loop for (int i = total; i >= 0; i–). It counts down, because counting up will lead to a logic error. Think of it this way: you could already pay 0, and you add a new coin of value 2 in, so you set pos[2] = true. And then you count up and see pos[2] is true – you will then set pos[4] to be true, although you already used the new coin in that combination! Counting down avoids this problem because every new element you set to be true will not be visited again in this iteration.

Second attempt: iterative, set of integers

vector<int> solution2(vector<int>& coins) {
    set<int> pos;
    for (int x : coins)
        for (auto it = pos.rbegin(); it != pos.rend(); it++)
    return vector<int>(pos.begin(), pos.end());

The basic idea is the same, but instead of indicating val as possible by setting pos[val] = true, we insert it to the set. The code is much more concise.

Third attempt: recursive, set of integers

void soln3helper(vector<int>& coins, int idx, int val, set<int>& pos) {
    if (idx == coins.size()) {
    soln3helper(coins, idx+1, val, pos);
    soln3helper(coins, idx+1, val+coins[idx], pos);

vector<int> solution3(vector<int>& coins) {
    set<int> pos;
    soln3helper(coins, 0, 0, pos);
    return vector<int>(pos.begin(), pos.end());

This is basically DFS, if you treat whether using a coin or not as an edge and a combination of coins as a vertex in the graph, you will get a tree. At the leaf nodes of the tree, insert the sum of the combination into the set.

Why the three approaches?

Each approach has its pros and cons. The second one has the least amount of code, and is easy to understand. The first one might be marginally more efficient for small test cases, because it only uses vectors but not any sets. Note however that the boolean vector could be huge if the coins are of really big values, in which case it would be very inefficient. The third one is intriguing because there is not any branch cutting. It incidentally takes care of negative coin values, which the other solutions do not. That means our runtime will always be the worst case runtime, 2^n where n is the number of coins. But is it really that bad? Let’s compare the runtimes:

First attempt: O(n*sum)

This is obvious from the nested loops. How big is sum though? The lower bound is n, when all the numbers are 1. In that case we get O(n^2) overall runtime. There is however not an upper bound – any coin can be arbitrarily big, and in that case we are kind of screwed.

Second attempt: O(n*2^n)

In the ith loop, we have at most 2^(i+1) items in the set. Looping through them and all insertions cost O(2^(i+1)*log(2^(i+1))) = O(i*2^i). Summing from i = 0 to i = n-1, we can see that each later term is larger than the previous term by 2 times at least, so we can drop all the front terms and only take the last, and we get the answer.

Third attempt: O(2^n)

Very simple: on each level i, we have 2^i edges coming out, i.e. O(2^i) work. Summing them gives O(2^n) by the same argument.

From the runtime analyses, you can see either the first or the third algorithms could be optimal for different inputs. The second, however, is never asymptotically optimal. Even in the case of all 1s, it runs in O(n^2*log(n)). Hence it is most useful when you don’t really need that much performance but prefer pretty code instead (say, during an interview).

Fourth attempt: iterative, integer array

vector<int> solution4(vector<int>& coins) {
    vector<int> pos{0};
    for (int x : coins) {
        vector<int> newpos, next;
        for (int p : pos)
        int l = 0, r = 0;
        while (l < pos.size() || r < newpos.size()) {
            if (l == pos.size() ||
                (r != newpos.size() && pos[l] >= newpos[r])) {
                if (next.empty() || next.back() != newpos[r])
            } else {
                if (next.empty() || next.back() != pos[l])
    return pos;       

This is clearly a more elaborate approach to the problem. What we’re trying to achieve here is an O(2^n) algorithm that has a best case scenario of O(n^2), in the case of all 1s as input array. This is built mostly upon the second attempt, and we’re trying to eliminate the use of a set, since it inevitably gives us an extra time complexity factor. The way to accomplish this is to store the new values in a new vector, and merge the two vectors like in merge sort but removing all duplicates. In the ith iteration, we must only have O(2^i) amount of work, and summing through all i’s gives O(2^n).

Multiple coins

That’s about it, but one last discussion about multiple coins or infinite coins. Say we have the coins {3, 4, 7, 19} and we want to pay for an item of price 25, can we do it? Yes, 3+3+19 = 25. How do we do this? There are a few approaches, but here is one of them. I will build it on attempt 2:

bool multipleCoins(vector<int>& coins, int val) {
    set<int> pos;
    for (int x : coins)
        for (auto it = pos.rbegin(); it != pos.rend(); it++)
            for (int v = x+*it; v <= val; v += x)
    return pos.count(val);

Instead of pushing in only one value, we keep pushing in more values with more coins of the same type until we go over the desired value. Of course we can immediately return true if we find val, which could potentially save a lot of work. But this is good enough as an example to show how to modify the above versions to account for multiple coins.

Brief Intro to Segment Tree

Something I just learned – segment tree, is a data structure more advanced and generalized than binary indexed tree. Even though I just learned it and might not be qualified to discuss it yet, I’m pretty excited so who cares. So here’s an intro to segment tree from a noob point of view. Most of the content comes from this Codeforces blog by Al.Cash, but that blog assumes more prior knowledge and I will attempt to explain it from scratch.


Member binary indexed tree? I member. For range sum query, BIT supports logarithmic time complexity for updating an element and querying the sum of any range. The way querying from l to r was done was a 2-phase process: first find the prefix sum of l-1 and r, then subtract the two. Pretty straight forward.

But what if we want to take minimum or maximum of the range, instead of just taking the sum? With BIT, we can only query the range from the beginning to a certain point. It is trivial really, refer to the code below:

// binary indexed tree for taking maximum of range [0, k)
void update(int v, int k, vector<int>& bit) {
    for (k++; k < bit.size(); k += k&(-k))
        bit[k] = max(bit[k], v);
int query(int k, vector<int>& bit) {
    int ans = 0;
    for (k++; k > 0; k -= k&(-k))
        ans = max(ans, bit[k]);
    return ans;

There are a few limitations with this approach. First, you can only update an element to a bigger integer, because there is no reverse operation for maximum like subtraction for addition. Second, you can only take maximum from the beginning to a certain element but not an arbitrary range. With summation that’s not a problem, as we can subtract sums of different ranges to get partial sums. But with taking minimum, you can’t just subtract minima can you?

Therefore, we need a data structure that is like BIT, but supports true range query, instead of prefix range query.

Basic specs

  1. Segment tree is like a binary indexed tree on steroids, all problems solved by BIT can be solved by segment tree.
  2. It takes 2 times the space of the original array (double of BIT).
  3. You can update any element of the original array in O(log(n)) time (same as BIT).
  4. You can look up any element of the original array in constant time (BIT takes log(n)).
  5. The original array can be of any data type (same as BIT).
  6. You can perform a range query of a certain associative function in O(log(n)) time (BIT can only make prefix range queries).

The last point is important. As far as a function is associative, you can use it with ST. It does not have to be commutative or reversible. A few examples:

  1. Addition. Just like BIT, given {1, 2, 3, 4, 5}, you can sum {2, 3, 4} in log(n) time, for example.
  2. Maximum. Given {1, 2, 3, 4, 5}, you can take maximum of {2, 3, 4} in log(n) time, for example.
  3. Matrix multiplication. Given {A1, A2, A3, A4, A5} of five matrices, you can get the product {A2, A3, A4} in log(n) time.

Associativity: say you have 3 variables a, b and c, and a function f that operates on two parameters. Then f(f(a, b), c) = f(a, f(b, c)) iff f is associative. Note that commutativity does not follow, f(a, b) does not necessarily equal f(b, a).

Associativity is important because for a range {a, b, c, d, e}, you can precompute {a, b} and {c, d, e}, then combine the two results.

Basic idea

Let’s borrow the sample from BIT: {1, 2, 3, 4, 5, 6, 7, 8}. Here’s a graph of what segment tree stores:


For comparison, here’s a graph of what binary indexed tree stores:


A few observations:

  1. ST is simply BIT with the whole table filled in, without any blanks.
  2. st[8..15] is our original array. In the second half of the ST array, we always store the original array.
  3. st[i] = st[i*2] + st[i*2+1]. Node i/2 is node i’s parent.
  4. To update any element changes the same number of nodes in the tree.
  5. We can sum up ranges directly. For example {3, 4, 5, 6, 7} = st[5]+st[6]+st[14].

It is easy to see that ST is an extension of BIT and supports direct range queries while BIT doesn’t. The only thing left now is how to actually implement it.


In general, the function does not have to be addition of integers, so I will abstract it to any function that takes 2 structs and returns a struct.

struct data {
    int num;
    data() {
        num = 0;
    data(int n) {
        num = n;
data merge(data a, data b) {
    return data(a.num+b.num);

For simplicity I still made it a wrapper of addition of numbers. You can make it any function with any data, either minimum of long longs, multiplication of matrices etc.

Then, given an array, we need to build the segment tree.

vector<data> build(vector<data>& v) {
    int n = v.size();
    vector<data> st(n*2);
    for (int i = 0; i < v.size(); i++) st[i+n] = v[i]; for (int i = n-1; i > 0; i--)
        st[i] = merge(st[i*2], st[i*2+1]);
    return st;

Technically we do not need a build function if we have an update function, just like with BIT. We can just update the entries one by one. But that would take n*log(n) time, while this function is O(n), so this is not entirely useless.

The build function takes in the original array and makes an array double its size. Then, we copy the array to the second half of the tree. Constructing the tree is a bottom-up procedure, each time calculating the new sum from the two lower nodes (from observation 3). That’s it.

The update function is even simpler.

void update(data v, int k, vector<data>& st) {
    int n = st.size()/2;
    st[k+n] = v;
    for (int i = (k+n)/2; i > 0; i /= 2)
        st[i] = merge(st[i*2], st[i*2+1]);

It is a 4-line function. Technically you can make it 1 line, but why?? Here, we first update the entry in the second half of the tree, then we go to its parent iteratively by dividing by 2, until you hit 1, the root.

Now the actual hard part: queries. Given the range [l, r) from l to r-1, query the “sum” (could be product or any arbitrary function) of the range.

data query(int l, int r, vector<data>& st) {
    int n = st.size()/2;
    data ansl, ansr;
    for (l += n, r += n; l < r; l /= 2, r /= 2) {
        if (l%2 == 1) {
            ansl = merge(ansl, st[l]);
        if (r%2 == 1) {
            ansr = merge(st[r], ansr);
    return merge(ansl, ansr);

I will only try my best to explain, but I will not go through everything in fine detail because it is too tedious. Let’s say we want to sum {2, 3, 4, 5, 6}, i.e. l = 1, r = 6 (because v[1] = 2, v[6-1] = 6). First we add n to both l and r, so we have l = 9, r = 14. Refer to the above graph, our goal is to sum st[9], st[5] and st[6]. In fact, in the first loop we will pick up st[9], because l is an odd number. After we add it to the left sum, we add 1 to l to denote that we have already added numbers in this subtree, so we have a smaller range of numbers to add. In the next loop, l and r are divided by 2 to go up a level in the tree, and become 5 and 7. Now both numbers are odd, and we will merge st[5] and st[6] to the left sum and right sum respectively. In the next iteration, l and r are both 3, meaning that the range to sum is empty, and the loop breaks.

To be honest, I don’t 100% understand why the conditions are l and r are odd. The idea could be that if l is even, that means both l and l+1 are in the range, and we would rather add the number at l/2 since it includes both, therefore we do not do anything when l is even. The only exception when not both l and l+1 are in the range is when r = l+1, but that means r is odd and we will add st[r-1], which is st[l]. Otherwise if l is odd, we might as well eliminate this subtree by adding st[l] and moving l to l+1. Everything should be mirrored and r should be checked to be even, but r is not included in the range [l, r), so the actual end point is r-1, so we check whether it is odd. Just like with low bits in BIT, you don’t actually need to understand it to use it; I bet you can’t implement a red black tree either but you still use set<> like an algorithm master anyway.

Important point: There are two ans variables, ansl and ansr, and they take sums from both parts respectively. This is to maintain the order of computation, in case the merge function is not commutative. In this case however it does not make any difference.

Array of arbitrary size

The above is explained with an array size of 8, which was sort of cheating, because you will most likely want an array of arbitrary size. Of course, you can add padding zeros at the end to make it a power of two.

n = v.size();
while (n != lowbit(n))  // n&(-n)
    n += lowbit(n);

That would do. This is because when n is a power of two, its low bit is itself. However this is not even needed; the original code, although designed for powers of two, will work for any vector size n. There is a short explanation on the original blog, but the full proof should be too complicated and not useful to know. We only need to know that it automagically works for any n, and happily copy paste code.

Sample: matrix multiplication

Build, update and query are copy pasted, so they are omitted in the sample.

struct data {
    int m, n;
    vector<vector<int> > A;
    data() {
        m = 0;
        n = 0;
    data(const vector<vector<int> >& B) {
        A = B;
        m = A.size();
        n = a[0].size();
    void print() {
        for (auto r : A) {
            for (auto c : r)
                cout << c << " ";
            cout << endl;
data merge(data a, data b) {
    if (a.m*a.n == 0)
        return b;
    if (b.m*b.n == 0)
        return b;
    int m = a.m, n = b.n, l = a.n;
    vector<vector<int> > C(m, vector<int>(n));
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < l; k++)
                C[i][j] += a.A[i][k]*b.A[k][j];
    return data(C);
vector<data> build(vector<data>& v) {...}
void update(data v, int k, vector<data>& st) {...}
data query(int l, int r, vector<data>& st) {...}
int main() {
    vector<data> v;
    v.push_back(data({{2, 0}, {0, 2}}));
    v.push_back(data({{1, 1, 4}, {4, 2, 2}}));
    v.push_back(data({{-2, 0}, {1, 4}, {1, 2}}));
    v.push_back(data({{0, 1}, {1, 0}}));
    vector<data> st = build(v);
    int l, r;
    while (cin >> l >> r)
        query(l, r, st).print();
    return 0;

You can mostly just copy paste the three functions and modify the data and merge definitions to fit your applications. I was motivated to study segment trees because of this problem on Codeforces. I was not able to do this problem during the contest, but neither could tourist, so whatever.

New Year and Old Subsequence

This is a slightly more advanced application of segment tree. The problem: given a string of digits, return the minimum number of digits that need to be removed such that there is a subsequence of “2017” but not “2016”. If 2017 is not a subsequence, print -1. More precisely, after a string of length up to 200,000 characters is given, there are up to 200,000 queries of the range l to r, and we need to answer what is the minimum number of digits to remove such that the sequence [l, r] has a subsequence of “2017” but not “2016”. The algorithm is described here. The gist is that for an interval of a string, all that we need to know is given we already have a certain prefix of 2017, how many digits do we need to erase in this interval so that we will have a longer prefix of 2017. For example, if the current digit is 6, then given “201” or “2017”, we will have to erase one digit (the 6) to ensure we have a prefix of “201” or “2017” without any “2016”. Otherwise the 6 doesn’t matter. Here’s my code.

struct data {
    unsigned int dp[5][5];
    data() {
        for (int j = 0; j < 5; j++)
            for (int i = 0; i <= j; i++)
                dp[j][i] = INT_MAX;
    void clear() {
        for (int i = 0; i < 5; i++)
        dp[i][i] = 0;

data merge(const data& a, const data& b) {
    data temp;
    for (int j = 0; j < 5; j++)
        for (int i = 0; i <= j; i++)
            for (int k = i; k <= j; k++)
                temp.dp[j][i] = min(temp.dp[j][i], a.dp[k][i]+b.dp[j][k]);
    return temp;

int main() {
    int n, q;
    string s;
    cin >> n >> q >> s;
    vector<data> st(2*n);
    for (int i = 0; i < n; i++) {
        if (s[i] == '2') {
            st[i+n].dp[0][0] = 1;
            st[i+n].dp[1][0] = 0;
        } else if (s[i] == '0') {
            st[i+n].dp[1][1] = 1;
            st[i+n].dp[2][1] = 0;
        } else if (s[i] == '1') {
            st[i+n].dp[2][2] = 1;
            st[i+n].dp[3][2] = 0;
        } else if (s[i] == '7') {
            st[i+n].dp[3][3] = 1;
            st[i+n].dp[4][3] = 0;
        } else if (s[i] == '6') {
            st[i+n].dp[3][3] = st[i+n].dp[4][4] = 1;
    // build segment tree
    for (int i = n-1; i; i--)
        st[i] = merge(st[i<<1], st[i<<1|1]);
    while (q--) {
        int l, r;
        cin >> l >> r;
        // query from l-1 to r
        data ansl, ansr;
        for (l += n-1, r += n; l < r; l >>= 1, r >>= 1) {
            if (l&1) {
                ansl = merge(ansl, st[l]);
            if (r&1) {
                ansr = merge(st[r], ansr);
        int ans = merge(ansl, ansr).dp[4][0];
        cout << (ans == INT_MAX ? -1 : ans) << endl;
    return 0;

If you paid attention to the code, you will see I embedded the build and query functions in the main function. Also you will notice I used integer array in C style instead of vectors in C++ style. There are also changes in details such as replacing *2 by <<1 (left shift) and +1 by |1 (bitwise or). I hate to do this, but the judge on Codeforces is very demanding and my normal coding style got TLE.

Tinkering with Deep Learning Style Transfer

A while ago, Prisma was quite popular on social media and everyone was filtering pictures with its artistic filters. Got some free time yesterday, so I thought I should try out some neural network style transfer apps.

A few words about what deep learning style transfer does. Two pictures for input: one style image and one content image. The style image is supposed to be an artistic piece with a distinct look, and the content image is normally a photo you take. The algorithm will then produce an output image that uses the artistic style from the style image to draw the objects shown in the content image. You will see some examples below.

First I went to Google’s Deep Dream. The content image:


This is a picture I took in New York, and you probably recognize it as I used this picture as the banner of this blog.

And the style image:


This is an artwork from Shinkai Makoto’s movie 5 Centimeters Per Second.

Alright, so I uploaded to deep dream and this is what I got:


That’s pretty cool actually. You can see the color palette is transferred over accurately, and the objects are clearly visible. Here are a few pluses:

  1. Colors are preserved well from the style image. There was a lot of yellow in content, but none in style, so it was entirely removed.
  2. The most visibly styled objects are humans. You can see the clothes turned into a gradient, and the shapes getting abstracted a little bit.
  3. Generation was relatively fast; I waited for a few minutes only. It was also free so I wouldn’t complain.

However there are two things I was not satisfied with:

  1. The resolution is pretty low and it looks quite compressed. The pictures I uploaded were HD, and I got this tiny picture with 693×520.
  2. There are many visible artifacts in the sky. It is understandable since there were clouds in the content image, and a lot of electric cables in the style image. It looks like the training was ended prematurely.

Therefore, I decided to pull the source code and run it for myself.

First attempt

First I Googled style transfer deep learning, and found the link to this Github repo. I’m running it on a Mac, and the installation instructions were quite clear. With all default settings, I got these results:





These four images are produced sequentially. As you can see the quality got better and better over time. There are no more line shaped artifacts in the sky, but you can still see a few red and green dots close to the skyline. Over all it looks like the one generated by deep dream, but I like the blue cloud in the center more.

However, these pictures are even smaller! The images were only 512 pixels wide, and it took my Mac 2 hours already. It’s sort of my fault for running deep neural nets on a Macbook Pro laptop without a GPU. But I really want to generate larger and clearer pictures, and if with 512 it’s already taking so long, generating a four times larger picture is going to take much much longer. So I Googled again for a faster solution.

Second attempt

With some more Googling I found this repo. It is written by the same guy working in the Stanford lab, Justin Johnson. It is more painful to install all the dependencies, and I had to modify some code for it to compile, but eventually I got it to work somehow. The read me file claims that the generation is hundreds of times faster and supports near real time video style transfer, so it should be good. Some results:




These pictures are styled with the pre-trained models, and even with a width of 1024, they are generated almost in real time. These models are styled with Starry Night, The Scream and a window mosaic art respectively. They are actually very lovely! You can see the brush strokes are vivid, and the images are of such high quality.

But where’s my Shinkai Makoto?

It turns out that if we already have a model trained with a styled image, then generating with a content image is very fast. But we need to train a model each time we have a new style image, and that I assume is what takes more time. Unfortunately I didn’t do it, for reasons explained below.

Sacrificing my CPU for art and science (and reading papers in the meantime)

Since I really want to make this one good picture, I am going back to the original code. Not only does generating a large picture take a lot of CPU time, it also takes a lot of space; I had to delete some huge software that I never used to make up enough space for it. It looks like the program is going to run for approximately a day. Meanwhile I should read the papers behind the above codes, and maybe study the code a little bit.

Here’s the paper for the original algorithm by Gatys, Ecker and Bethge, and here’s the website for the faster code by Johnson. In my understanding (which could very well be wrong), here are the TL;DRs:

Original paper:

  1. The outline of the algorithm is to extract the styles and content of an image separately, then start over with a random noise picture, modifying the pixels slightly over many times until it matches both in style and content of our desired picture.
  2. We already know how to extract content of a picture before the publication of this paper. There is a free trained model online called the VGG network, which is basically a computational graph with a fixed structure and fixed parameters, that is known for identifying objects as well as humans do.
  3. The way VGG works, or any other convolutional neural network, is like the following. On each layer, we have an input vector of numbers. We carry out certain mathematical operations to these numbers. We multiply some of them together, add them, add a constant, scale them, tanh them, take the max of them… all sorts of math, and generate a new vector of numbers. A deep neural network has many layers, one layer’s output feeding into the next layer’s input. If you just do random math, then the generated numbers will be meaningless. But a “trained” network like VGG will produce a vector of numbers that are meaningful. Maybe the 130th number in the output vector indicates how likely this picture has a cat, that kind of thing. Rumor has it that the field of computer vision is started primarily to deal with cat pics.
  4. A convolutional neural network is just a neural network with a special set of mathematical operations that are designed to capture the information in a picture, as it employs a hierarchical structure of calculations that preserves the 2D structure of pixels.
  5. The key breakthrough of this paper: activations from features represent the objects, we already know that. But if we look at different “features” of the network and take the correlation matrix of the output signals across features of a certain layer, we have obtained style information.
  6. So step one: run the content picture through VGG and capture the output signals of a certain layer. Step two: run the style picture through VGG and capture the correlation matrix of output signals from features of a certain layer. Step three: start with a random noise picture, run through VGG and capture the content and style information just as above. Step four: compare our random picture with our captured signals, and figure out how to change these random pixels a little bit so that the style matches with the style picture and the content matches with the content picture. Step five: go back to step three until your computer crashes and burns. Step six: output the “random picture” – it’s already not random anymore!

As you can imagine, changing the pixels a little bit at a time to make it look like something eventually is definitely not going to be fast. That is very understandable.

Faster code:

  1. The original paper framed the problem as an optimization problem, meaning that we have a function f(x), and we want to find the x that maximizes or minimizes f(x). This is true if we think of the output picture to be x and the combined difference between x and our desired picture in terms of style and content as f(x). f(x) is indeed our loss function, and we are trying to minimize it. The style and content images are hidden in the loss function.
  2. This new paper, however, frames the problem as a transformation problem. This means we have an input x, and we want to calculate y = g(x). This is actually very natural to think about, because we have 2 input pictures, and we want 1 output picture, so x could be the style and content images, y our generated picture, and g(x) will be our algorithm.
  3. Finding an unknown function is basically what machine learning does best: first make a really dumb robot, then tell it x0 (some input), it spins out some random crap y0*. You say no, no, no, bad boy; you should say y0. It’s really dumb so it only remembers a little bit. Then you move on to the next input x1 and so on, until the robot learns some patterns from your supervision and starts to make some sense. So one way to solve the transformation problem of style transfer could be something like this: collect many style pictures and content pictures, and run through the slow code to generate pictures. Then make a dumb robot and teach it the corresponding input output pairs, until the robot can do it by itself.
  4. All of the above was prior knowledge to the paper, and this approach has a great advantage over the old one: it is very fast and simple to generate a new picture now. We don’t have to guess anymore; just throw the pictures at the robot and it will instantly give a new one back to you. The downside of this approach, of course, is that getting the robot in the first place can be very expensive; you need to generate many thousands of pictures through slow code before you can generate one picture through the robot.
  5. The key insight of this paper is more of a subtle and technical one: when we teach the robot how to turn x into y, we don’t just compare the robot’s output to a picture we want, but instead we run the output image through the VGG network to extract the style and content, then we use the style and content differences to teach the robot how to do better. Teaching the robot has a formal name called “back propagation” because of how it is practically done. This approach gives higher quality pictures.

Although training can be more expensive, generating new pictures can be real time now. This is great for commercialization. Let’s say a company trains many models based on some distinctive artistic styles, then when users upload a picture, they can get instant artistic filters provided by the company. That’s basically what Prisma does, I suppose. Yet for my purpose, it will not be any faster than the optimization approach.

There are some exciting new developments by Google as well. It builds on top of Johnson’s work, and allows interpolation between styles, so you can mix Van Gogh with Monet, for example. It came out just a month ago! Since they also released the code, I’m going to try it out a little bit. Here’s a quick Monet style:


It’s alright, doesn’t look too great. Probably Monet’s brushes are too small, so this big picture looks just textured instead of styled. Unfortunately, training a new model takes YUGE space, like 500GB. YUGE. This is why the transformation approach is not suitable for a random individual like myself: training a model is very demanding in resources, and the benefits don’t outweigh the costs. Even more sadly, attempting to run this crashed my computer and I have to restart my 1024-Shinkai Makoto picture after running for 18 hours.

Anyway, done with reading papers, I’m just going to sit here and wait for results. After about a day of computation:


…I should really get myself a GPU.

TIW: Binary Indexed Tree

Binary indexed tree, also called Fenwick tree, is a pretty advanced data structure for a specific use. Recall the range sum post: binary indexed tree is used to compute the prefix sum. In the prefix sum problem, we have an input array v, and we need to calculate the sum from the first item to index k. There are two operations. Update: change the number at one index by adding a value (not resetting the value), and query: getting the sum from begin to a certain index. How do we do it? There are two trivial ways:

  1. Every time someone queries the sum, just loop through it and return the sum. O(1) update, O(n) query.
  2. Precompute the prefix sum array, and return the precomputed value from the table. O(n) update, O(1) query.

To illustrate the differences and better explain what we’re trying to achieve, I will write the code for both approaches. They are not the theme of this post though.

class Method1 {
    vector<int> x;
    Method1(int size) {
        x = vector<int>(size);
    void update(int v, int k) {
        x[k] += v;
    int query(int k) {
        int ans = 0;
        for (int i = 0; i <= k; i++)
            ans += x[i];
        return ans;
class Method2 {
    vetor<int> s;
    Method2(int size) {
        s = vector<int>(size);
    void update(int v, int k) {
        for (; k < s.size(); k++)
            x[k] += v;
    int query(int k) {
        return s[k];

Read through this and make sure you can write this code with ease. One note before we move on: we’re computing the sum from the first item to index k, but in general we want the range sum from index i to index j. To obtain range sum, you can simply subtract the prefix sums: query(j)-query(i-1).

OK, that looks good. If we make a lot of updates, we use method 1; if we make a lot of queries, we use method 2. What if we make the same amount of updates and queries? Say we make n each operations, then no matter which method we use, we end up getting O(n^2) time complexity (verify!). We either spend too much time pre-computing or too much time calculating the sum over and over again. Is there any way to do better?

Yes, of course! Instead of showing the code and convincing you that it works, I will derive it from scratch.

The quest of log(n)

The problem: say we have same amount of updates and queries, and we do not want to bias the computation on one of them. So we do a bit of pre-computation, and a bit of summation. That’s the goal.

Say we have an array of 8 numbers, {1, 2, 3, 4, 5, 6, 7, 8}. To calculate the sum of first 7 numbers, we would like to sum up a bunch of numbers (since there has to be a bit of summation). But the amount of numbers to be summed has to be sub-linear. Let’s say we want it to be log(n). log2(7) is almost 3, then maybe we can sum 3 numbers. In this case, we choose to sum the 3 numbers: sum{1, 2, 3, 4}, sum{5, 6} and sum{7}. Assume that we have these sums already pre-computed, we have log(n) numbers to sum, hence querying will be log(n). For clarity, let me put everything in a table:

Table 1a

sum{1} = sum{1}

sum{1, 2} = sum{1, 2}

sum{1, 2, 3} = sum{1, 2} + sum{3}

sum{1, 2, 3, 4} = sum{1, 2, 3, 4}

sum{1, 2, 3, 4, 5} = sum{1, 2, 3, 4} + sum{5}

sum{1, 2, 3, 4, 5, 6} = sum{1, 2, 3, 4} + sum{5, 6}

sum{1, 2, 3, 4, 5, 6, 7} = sum{1, 2, 3, 4} + sum{5, 6} + sum{7}

sum{1, 2, 3, 4, 5, 6, 7, 8} = sum{1, 2, 3, 4, 5, 6, 7, 8}

The left hand side of the table is the query, and all the terms on the right hand side are pre-computed. If you look closely enough you will see the pattern: for summing k numbers, first take the largest power of 2, 2^m, that is ≤ k, and pre-compute it. Then for the rest of the numbers, k-2^m, take the largest power of 2, 2^m’ such that 2^m’ ≤ k-2^m, and pre-compute it, and so on.

There are two steps to do: show that querying (adding terms on the right hand side) is log(n) and show that pre-computing the terms on the right hand side is log(n).

Querying is log(n) is easily seen, because by taking out the largest power of 2 each time, we will at least take out half of the numbers (Use proof by contradiction). Taking out no less than one half each time, after O(log(n)) time we would have taken out all of it.

Now we are one step from finishing on the theoretical side: how do we pre-compute those terms?

Let’s say we want to change the number 1 into 2, essentially carrying out update(1, 0). Look at the terms above: we need to change sum{1}, sum{1, 2}, sum{1, 2, 3, 4} and sum{1, 2, 3, 4, 5, 6, 7, 8}. Each time we update one more pre-computed term, we cover double the number of elements in the array. Therefore we also only need to update log(n) terms. Let’s see it in a table:

Table 1b

update 1: sum{1}, sum{1, 2}, sum{1, 2, 3, 4}, sum{1, 2, 3, 4, 5, 6, 7, 8}

update 2: sum{1, 2}, sum{1, 2, 3, 4}, sum{1, 2, 3, 4, 5, 6, 7, 8}

update 3: sum{3}, sum{1, 2, 3, 4}, sum {1, 2, 3, 4, 5, 6, 7, 8}

update 4: sum{1, 2, 3, 4}, sum{1, 2, 3, 4, 5, 6, 7, 8}

update 5: sum{5}, sum{5, 6}, sum{1, 2, 3, 4, 5, 6, 7, 8}

update 6: sum{5, 6}, sum{1, 2, 3, 4, 5, 6, 7, 8}

update 7: sum{7}, sum{1, 2, 3, 4, 5, 6, 7, 8}

update 8: sum{1, 2, 3, 4, 5, 6, 7, 8}

Cool, now we have a vague idea about what to pre-compute for update and what to add for query. Now we should figure out the details of the code.

How is the code written?

First, we need to determine the representation of the pre-computed terms. Here is a list of all pre-computed terms:

{1}, {1, 2}, {3}, {1, 2, 3, 4}, {5}, {5, 6}, {7}, {1, 2, 3, 4, 5, 6, 7, 8}

The last number of each term is unique and covers the range 1-8. That’s great news! We can use a vector to store these terms easily, and let the index of the array be the last number of the term. For example, the sum of {5, 6} will be stored at bit[6].

First, the query operation. Let’s revisit the table with binary representation of numbers:

Table 2a: revised version of table 1a, with sums written as bit elements, indices in binary

query 0001: bit[0001]

query 0010: bit[0010]

query 0011: bit[0011]+bit[0010]

query 0100: bit[0100]

query 0101: bit[0101]+bit[0100]

query 0110: bit[0110]+bit[0100]

query 0111: bit[0111]+bit[0110]+bit[0100]

query 1000: bit[1000]

Do you see the pattern yet? Hint: for queries that have k ones, we have k terms on the right. The pattern is that while the index has at least 2 ones, we remove the lowest bit that is one, then move on to the next term. 0111->0110->0100. Finally, here’s the code:

int query(vector<int>& bit, int k) {
    int ans = 0;
    for (k++; k; k -= k & (-k))
        ans += bit[k];
    return ans;

After all the work we’ve been through, the code is extremely concise! Two things to notice: the k++ is to change the indexing from 0-based to 1-based, as we can see from the above derivation we go from 1 to 8, instead of 0 to 7. The second thing is the use of k & (-k) to calculate the lowest bit. You can refer to the previous blog post on bitwise operations.

OK, we’re almost done. What about update? Another table:

Table 2b: revised version of table 1b

update 0001: bit[0001], bit[0010], bit[0100], bit[1000]

update 0010: bit[0010], bit[0100], bit[1000]

update 0011: bit[0011], bit[0100], bit[1000]

update 0100: bit[0100], bit[1000]

update 0101: bit[0101], bit[0110], bit[1000]

update 0110: bit[0110], bit[1000]

update 0111: bit[0111], bit[1000]

update 1000: bit[1000]

What’s the pattern this time? Hint: again, look for the lowest bit! Yes, this time instead of removing the lowest bit, we add the lowest bit of the index to itself. This is less intuitive than the last part. For example, lowest bit of 0101 is 1, so the next index is 0101+1 = 0110; lowest bit is 0010, next index is 0110+0010 = 1000.

So here’s the code, note also the k++:

void update(vector<int>& bit, int v, int k) {
    for (k++; k < bit.size(); k += k & (-k))
        bit[k] += v;

This is deceivingly easy! That can’t be right. It can’t be that easy… Actually it can, if you look at the category of this post; nothing I write about is hard.

Actually it was easy because we were just matching patterns and assuming it would generalize. Let’s study the for loops a little more to understand why and how low bits are involved in this. This is rather complicated, so for practical purposes you might as well skip them.

First, observe that the low bit of each index indicates the number of integers that index sums over. Say, 0110 has low bit 0010 which is 2, so bit[6] is a sum of two numbers: 5 and 6. This is by design, since this is exactly how we picked the pre-computed terms, so there is no way of explanation.

Second, bit[k] is the sum from indices k-lowbit(k)+1 to k. This is a direct consequence from (1) bit[k] is a summation that ends at the kth number and (2) bit[k] sums over lowbit(k) numbers.

In light of this fact, the code for querying becomes clear: for an index k, we first get the sum from k-lowbit(k)+1 to k from bit[k], then we need to find the sum from 1 to k-lowbit(k). The latter becomes a sub-problem, which is solved by setting k-lowbit(k) as the new k value and going into the next iteration.

For updating, it is much trickier. From the above, we have l-lowbit(l) < k ≤ l, iff bit[l] includes k. Below is a sketch of proof, the actual proof will include more details and be more tedious and boring to go through. For the kth number, bit[k+lowbit(k)] must include it. This is because the lowbit of k+lowbit(k) must be at least 2 times lowbit(k), so k+lowbit(k)-lowbit(k+lowbit(k)) ≤ k-lowbit(k) < k ≤ k+lowbit(k), satisfying the inequality. Also, k can be updated to k+lowbit(k) in the next iteration, because given lowbit(k) < lowbit(m) < lowbit(n) and that bit[m] includes k and bit[n] includes m, bit[n] must include k as well. Till now, we have shown that the bit[k] values we have modified in the for loop must include k.

Then, we also need to show that all bit[l] values that include k are modified in our loop. We can actually count all the bit[l] values that include k: it is equal to one plus the number of zeros before lowbit(k). It is not difficult to see how the for loop reduces the number of zeros before lowbit(k) each time the loop moves on to the next iteration. The only question remaining is why that number? Let’s look at the table 2b again. The numbers of terms for the first four entries, i.e. {4, 3, 3, 2}, are one more than the number of terms for the second four entries, i.e. {3, 2, 2, 1}. This is by design, because bit[4] covers the first four but not the second four, and everything else is pretty symmetric. Again, the first two entries have one more coverage than the second two entries, because bit[2] records the first two but not the second two. Hence, each time we “go down the tree” on a “zero edge” (appending a 0 to the index prefix), the numbers will be covered once more than if we “go down the tree” on a “one edge” (appending a 1 to the index prefix). After we hit the low bit, no more terms of smaller low bits will cover this index, and of course the index itself includes itself, thus the plus one. This is a basic and very rough explanation on how the numbers of zeros relate to the number of terms including a certain index. Here we have argued semi-convincingly the update loop is valid and complete.


OK, anyway, time for practice: Range Sum Query – Mutable

It’s literally implementing a binary indexed tree, nothing more.

class NumArray {
    vector<int> bit;
    void update_helper(int v, int k) {
        for (k++; k < bit.size(); k += k & (-k))
            bit[k] += v;
    int query_helper(int k) {
        int ans = 0;
        for (k++; k; k -= k & (-k))
            ans += bit[k];
        return ans;
    NumArray(vector<int> &nums) {
        for (int i = 0; i < nums.size(); i++)
            update_helper(nums[i], i);
    void update(int i, int val) {
        update_helper(val-query_helper(i)+query_helper(i-1), i);
    int sumRange(int i, int j) {
        return query_helper(j)-query_helper(i-1);

It got a little complicated because I didn’t store the original values, so we need some work on line 21 to calculate the change at a certain index given the new value and the old range sums. But that’s nothing important.

That’s it for the basic introduction of binary indexed trees. There are some variants to it, such as replacing the + sign in update function to a min or max function to take prefix min or max, or extending the algorithm to a 2D matrix, aka 2D binary indexed tree. We can even use it for some dynamic programming questions. There are in fact a few more questions on Leetcode that uses this data structure. But that’s for later.

I learned binary indexed tree through the TopCoder tutorial. If you think I did a really bad job and you do not understand at all, you can refer to it as well.

TIW: Bitwise

Bitwise operations are black magic. It is so simple but with them you can do things that you never thought would be so easy. For those who have never seen them, bitwise operations operate on one or two integer type variables, and treat them as an array of booleans. Each operation acts on the elements individually. Let’s see what bitwise operations we have:

  1. And: a & b. 0011 & 0101 = 0001 in binary, so 3 & 5 = 1.
  2. Or: a | b. 0011 | 0101 = 0111 in binary, so 3 & 5 = 7.
  3. Exclusive or: a ^ b. 0011 ^ 0101 = 0110 in binary, so 3 & 5 = 6.
  4. Not: ~a. ~00001111 = 11110000. Depending on the number of bits of your integer data type, the values could vary.
  5. Shift left and right: 001011 << 1 = 010110, 001011 >> 2 = 000010. It is essentially multiplying or dividing by 2 to the power of k.

Applications are sort of miscellaneous and interesting. First let me go through some common routines, then I will go over some problems.

Taking a bit at a certain index

int bitAtIndex(int v, int k) {
    return (v >> k) & 1;

Shift the bit you want to the least significant bit, then and it with 1 to get rid of all the other higher bits.

Clearing and setting a bit

void clearBit(int& v, int k) {
    v &= ~(1 << k);
void setBit(int& v, int k) {
    v |= 1 << k;

This idea is called masking: create a mask, apply it on the number by either or-ing or and-ing.

Getting the lowest 1 bit

int lowBit(int v) {
    return v & (-v);

This is sort of tricky. Let’s walk through it. Say our number is 00011000. In two’s complement, the negative number of v is obtained by 1+(~v). So the negative of 00011000 is 11100111 plus 1, which is 11101000. Taking the and result with 00011000 will yield 00001000, which is the lowest bit we want. The way to understand it is that the tail of our input number v, defined as the pattern 100000… at the end, will remain the same after taking the negative. Everything to the left of the tail will be flipped. Therefore taking the and result will yield the lowest bit that is a 1. This is particularly useful for Binary Indexed Tree, which I will talk about in a coming post.

Here’s the most cliched problem.

Single Number

Given an array, find the only number that appeared once. This problem is called Single Number for a reason: imagine it’s Christmas time and you’re out there on the streets alone, and everyone you see is with their significant others. Probably how they got the problem idea. O(n) time with O(1) space. To solve this problem, we need to find an operation that acting on the same number twice will yield the identity function, i.e. f(f(a, b), b) = a. This function had better be commutative and associative, so we can do it in any order, and cancel out all the pairs. Obviously +, -, *, / don’t meet the requirements. The answer is exclusive or. The ideas are: a^a = 0, a^0 = a. Therefore 3^2^1^2^3 = 2^2^3^3^1 = 0^1 = 1, exclusive or-ing all numbers gives you the answer.

int singleNumber(vector<int>& nums) {
    int ans = 0;
    for (int x : nums)
        ans ^= x;
    return ans;

One more remark: it’s possible to extend this algorithm to find the only number that appears once, given all other numbers appear n times for n ≥ 2. What we want to accomplish essentially is to create a commutative function that goes to identity after n operations. The function is this: maintain a vector of integers and size 32, counting the numbers of 1 at each bit mod n. It is easy to see after n times, the counts will be either 0 or n, both equal to 0 mod n. So we will end up with the answer. Our solution above is just a degenerate case when n = 2, so a vector of int mod 2 can be replaced by simply an integer, and modular addition can be replaced by exclusive or. Single Number II is the problem for n = 3. Don’t ask me where they got the problem idea from 🙂

int singleNumber(vector<int>& nums) {
    vector<int> c(32);
    for (int x : nums)
        for (int j = 0; j < 32; j++)
            c[j] = (c[j]+((x>>j)&1))%3;
    int ans = 0;
    for (int i = 0; i < 32; i++)
        ans |= c[i] << i;
    return ans;

For the people who have never seen bitwise, this is sort of complicated. You can see taking a bit on line 5 and setting a bit on line 8.

Generating the power set

Given a set s, the power set is the set of all subsets of s. For example if s = {1, 2, 3}, P(s) = {{}, {3}, {2}, {2, 3}, {1}, {1, 3}, {1, 2}, {1, 2, 3}}. Here is one possible implementation using bitwise and:

vector<vector<int> > powerSet(vector<int>& s) {
    vector<vector<int> > ans;
    for (int i = 0; i < (1 << s.size()); i++) {
        for (int j = 0; j < s.size(); j++)
            if (i & (1 << j))
    return ans;

The key idea is the outer loop of i. Say for s = {1, 2, 3}, i will loop from 0 to 7, or 000 to 111 in binary. We will have all the bit patterns in that case: 000, 001, 010, 011, 100, 101, 110, 111. Now for each number, each bit indicates whether we include a certain element of the original set. By looping through these bits, we can create each subset and generate the power set.

Counting the number of 1 bits

Given a number v, count how many bits are 1. There are different ways to do it, two will be shown below.

int countBits(unsigned int v) {
    int ans = 0;
    for (int i = 0; i < 32; i++)
        if ((v >> i) & 1)
    return ans;
int countBits(unsigned int v) {
    int ans = 0;
    for (; v > 0; v -= v & (-v), ans++);
    return ans;

The second method uses the low bit function, removing the lowest bit every time. I suppose it is more efficient, but it probably doesn’t make a real difference.

Swapping two numbers without space

void swapNumbers(int& a, int& b) {
    a = a^b;  //  a ^= b;
    b = a^b;  //  b ^= a;
    a = a^b;  //  a ^= b;

Not very useful, but good to know.

Sum of Two Integers

Here’s a brainteaser: add two numbers, but the code cannot have any + or -. The idea of the code: calculate the carry bits and the addition without carry bits, then add the two results together. The base case is when one number becomes 0. There will not be an infinite loop because the number of trailing zeros of the carry bits must increase each time in a recursion. In the following code, if any of the numbers is 0, the sum is equal to bitwise or. Otherwise, the bits without carry will be a exclusive or b, and the carry bits will be where both a and b are 1, shifted to the left by 1.

int getSum(int a, int b) {
    return (a == 0 || b == 0)? a | b : getSum(a ^ b, (a & b) << 1);

Sudoku Solver by Lee Hsien Loong

This code is written by the prime minister of Singapore. It is written in pure C, so it is kind of hard to read. He used a lot of bitwise operators in this code. I read it two years ago, and I don’t want to spend the time understanding everything again, so my short explanation might be faulty. The algorithm is not tricky, as he simply picks a grid with the smallest number of choices, and tries everything recursively (line 180). To speed things up, he used integers as boolean arrays to indicate what numbers are still available for a certain row, column or 3×3 block. Therefore to get the possible placements at a certain grid simply requires taking the bitwise and result (line 171). To use one possible result, he took the lowest bit (line 173). Another trick to reduce the runtime is to pick the grid with the fewest possible choices (lines 162, 163, 188 I assume). He also pre-computed some functions into arrays to avoid repeated work. Most of these are optimizations that reduce the time constant, replacing one O(1) operation with another O(1) operation. Tricky and efficient, but also with reduced readability, in my opinion.


Anyway that’s a lot already; I will split the Binary Indexed Tree part in a separate post. Surely these are mostly brainteasers, but some interviewers do like them, and for some lower level (closer to hardware) jobs they are quite important.