Longest Sequence Without Duplicate Substring of Constant Length
Around 9 years ago I was working on a multiple choice exam and this problem came up. Recently I thought about it again and it’s actually very simple.
The problem is: say you are writing a sequence of letters A, B, C, D, with one constraint. Whatever substring of length 3 that already appeared in your sequence cannot appear again. What is the longest sequence length you can possibly come up with?
As an example, the sequence “ABCDACBB” is valid, but the sequence “ABCDABCB” is not, since “ABC” appears twice. By the same token, “AAAAB” is also invalid, since “AAA” appears twice.
There is an easy generalization of this problem to k choices of letters instead of 4, and m as the length of substring instead of 3. Quick maths: there are k^m substring combinations, so there can be k^m unique starting points, and the answer is upper bounded by k^m + m - 1.
9 years ago I didn’t know anything so I thought it was hard. But now it just looks like a graph traversal problem, which is either trivial or intractable (depending on whether it’s Eulerian or Hamiltonian). With this in mind, the problem is trivial. You can either read my solution below or spend a few minutes to figure it out.
Construct a directed graph where each node represents a unique substring made up of the k choices of letters of length m - 1. There are k outgoing edges and k incoming edges for each node. The k outgoing edges represent the k different ways you can append a letter to the current substring of length m - 1. So for example for the node “AB”, the outgoing edge “C” would point to the node “BC”, since after adding a letter “C”, you get to the new prefix “BC”. Then each edge corresponds to one unique substring of length m, and the problem becomes to find an Eulerian path in this graph. Since this graph is always connected and each node has the same number of in degrees as out degrees, the path always exists and must be a cycle, regardless of m and k. There will be many possible cycles as well. Now this problem is theoretically solved, and also programmatically solved.
As I later found out from my roommate’s math homework, this graph construction is called de Bruijn graph. Of course everything has to have a name.
I wish I could go back in time and explain this to the little me.