Constructing Master String From Substrings

by Michael

I stumbled upon a problem where I had to construct a master string m given a corpus of substrings of the master string. In other words, the set S contains strings s such that the test m.contains(s) == true holds for all s. The result had to be constructed of as few substrings as possible. A substring may be used multiple times in order to reconstruct the master string.

m = ABCDEFGHIJKLM

S = { ABC, AB, CDEF, EFHI, BCDEFH, JKLM, … }

The first step in finding a solution is to find the position of s in m for all s in S. Once we have the positions, we can construct a directed graph whereby an edge connects two consecutive substrings s. We then keep connecting substrings to each other until we have reconstructed the entire master string.

position : substring
0 : ABCD

0 : AB
2 : CDEF
4 : EFHI
etc…

In order to find the solution which uses the least number of substrings, we should traverse our graph using Breadth-first search (BFS). Starting from the root node, we visit each child and check if we have satisfied the end condition. The end condition being that the last substring matches the ending of the master string m. If the end condition is not met, we go to the next level of substrings connected to the nodes that we’ve just visited.

When the end condition is met, we can traverse back up the branch in order to collect all the substrings needed. Alternatively, we could keep track of the nodes that we have visited so far, but this seems an unnecessary use of memory.

Some substrings may occur in multiple positions. We should use all the possible positions of a substring when we construct our graph since we’re allowed to use a substring more than once. Note that we do not actually need to fully create the graph, instead we can keep the strings in a map indexed by their positions and walk through the map as though it were a graph structure.

In the case of missing substrings, e.g., if we have no substrings that contain one of the letters in m, we could try to find the closest match by simply increasing the position counter until we hit the next substring. This means we tolerate a gap in our solution.

Since we use BFS traversal, we are guaranteed to stop when we have the solution with the least number of substrings. Depth-first search (DFS) would result in us traversing all possible branches first before finding the optimal solution, which is not what we want, but the problem can be solved this way too.

Would be nice to find out if anyone has a better solution, please let me know if you do.

Advertisements