Levenshtein Distance using Clojure

by Michael

I’m pretty much a total beginner when it comes to writing Clojure. In order to familiarize myself more with the language, I decided to try to implement the Levenshtein distance algorithm.

The Levenshtein distance is a metric that measures the number of single-character edits (insertions, deletions, and substitutions) required to transform one word into another. For example, it takes 3 operations to transform the word “kitten” into the word “sitting”.
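
Spelled out, the three edits are:

  kitten -> sitten  (substitute k with s)
  sitten -> sittin  (substitute e with i)
  sittin -> sitting (insert g at the end)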

My reason for choosing the Levenshtein distance for learning Clojure is that implementing dynamic programming solutions in functional languages is non-trivial. Dynamic programming solutions usually require populating a matrix of intermediate values that hold the optimal solutions to subproblems.

I ended up writing three different solutions. I wasn’t really happy with the first two, but the last one is quite OK, in my opinion.

My first solution was to use loop and recur, passing the current distance table along as a loop argument. It was pretty much a straight-up rip-off of the pseudo-code implementation given on Wikipedia. Note also that I created a separate function for building the initial values of the distance table.

(defn init-distance [m n]
  ;; Base cases of the distance table: d [i 0] = i and d [0 j] = j.
  (reduce #(assoc %1 %2 (apply max %2)) {}
    (concat
      '([0 0])
      (for [i (range 1 (inc m))] [i 0])
      (for [j (range 1 (inc n))] [0 j]))))
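
For example, with a two-character source and a three-character target the initial table contains only the first row and column of the matrix (the key order may differ):

(init-distance 2 3)
;; => {[0 0] 0, [1 0] 1, [2 0] 2, [0 1] 1, [0 2] 2, [0 3] 3}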

(defn levenshtein-distance1 [source target]
  (let [m (count source)
        n (count target)]
    (loop [j 1
           i 1
           d (init-distance m n)]
      (cond
        ;; all cells filled: the answer is d [m n]
        (> j n) (d [m n])
        ;; finished all i for the current j: advance to the next j
        (> i m) (recur (inc j) 1 d)
        ;; matching characters: no additional cost for this cell
        (= (get source (dec i)) (get target (dec j))) (recur j (inc i) (assoc d [i j] (d [(dec i) (dec j)])))
        ;; otherwise: one edit on top of the cheapest neighbouring cell
        :else (recur j (inc i) (assoc d [i j] (inc (min
                                                     (d [(dec i) j])
                                                     (d [i (dec j)])
                                                     (d [(dec i) (dec j)])))))))))
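
A quick sanity check at the REPL:

(levenshtein-distance1 "kitten" "sitting")
;; => 3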

Not very elegant. I use conditions to iterate over [i j], and generally the code is pretty messy. Also, I’m pretty sure we don’t need a separate function for creating the initial distance values. Reaching for loop/recur is usually a sign that a higher-order operation such as map or reduce could do the job instead.
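
As a toy illustration of that last point, the two functions below compute the same sum, but the second one pushes the iteration into reduce (sum-loop and sum-reduce are made-up names, not part of the Levenshtein code):

(defn sum-loop [xs]
  ;; explicit loop/recur with an accumulator
  (loop [acc 0
         xs xs]
    (if (empty? xs)
      acc
      (recur (+ acc (first xs)) (rest xs)))))

(defn sum-reduce [xs]
  ;; the same accumulation expressed as a reduce
  (reduce + 0 xs))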

Here comes version 2.

(def dist
  (memoize
    (fn [pred index]
      (let [[i j] index]
        (cond
          ;; first row or column: the distance is just the prefix length
          (zero? (min i j)) (max i j)
          ;; matching characters: same cost as the diagonal neighbour
          (pred index) (dist pred [(dec i) (dec j)])
          ;; otherwise: one edit plus the cheapest neighbouring cell
          :else (inc (min
                       (dist pred [(dec i) j])
                       (dist pred [i (dec j)])
                       (dist pred [(dec i) (dec j)]))))))))

(defn levenshtein-distance2 [source target]
  (let [pred (fn [index] (let [[i j] index]
                           (=
                             (get source (dec i))
                             (get target (dec j)))))]
    (->> (for [j (range (inc (count target)))
               i (range (inc (count source)))]
           [i j])
      (map (partial dist pred))
      last)))
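
It gives the same answer as version 1:

(levenshtein-distance2 "kitten" "sitting")
;; => 3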

I removed the loop and replaced it with a recursive call. The init-distance function is no longer needed, since an extra condition returns the correct value whenever either i or j is zero. Because the recursion goes through the named var, the intermediate results are memoized, which means we don’t recalculate the same values over and over again. There is still some noise, though: every call to dist has to drag the pred parameter along. Another concern is that we don’t really control how the cache of intermediate values is managed, which could lead to huge memory usage.
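
One way around both of those concerns would be to keep an explicit cache local to the function, so that pred never has to be passed around and the cache becomes garbage as soon as the call returns. A rough sketch of that idea (levenshtein-distance2b is just a made-up name for it):

(defn levenshtein-distance2b [source target]
  (let [cache (atom {})                ;; cache lives only for this call
        d (fn d [[i j :as index]]
            (or (@cache index)         ;; note: 0 is truthy in Clojure
                (let [v (cond
                          (zero? (min i j)) (max i j)
                          (= (get source (dec i)) (get target (dec j)))
                            (d [(dec i) (dec j)])
                          :else (inc (min (d [(dec i) j])
                                          (d [i (dec j)])
                                          (d [(dec i) (dec j)]))))]
                  (swap! cache assoc index v)
                  v)))]
    (d [(count source) (count target)])))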

Let’s take a look at the next and last version.

(defn dist-step [pred d index]
  ;; Fill in the cell for index [i j] and return the updated distance table.
  (let [[i j] index]
    (assoc d [i j]
      (cond
        (zero? (min i j)) (max i j)
        (pred index) (d [(dec i) (dec j)])
        :else (inc (min
                     (d [(dec i) j])
                     (d [i (dec j)])
                     (d [(dec i) (dec j)])))))))

(defn levenshtein-distance3 [source target]
  (let [m (count source)
        n (count target)
        pred (fn [index] (let [[i j] index]
                           (=
                             (get source (dec i))
                             (get target (dec j)))))
        step (partial dist-step pred)
        dist (reduce step {} (for [j (range (inc n)) i (range (inc m))] [i j]))]
    (dist [m n])))
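
And the same quick check:

(levenshtein-distance3 "kitten" "sitting")
;; => 3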

I removed the recursive call; instead, dist-step takes the current distance table as an argument and returns a new table containing the distance value for [i j]. This lets us use the function as a reducer, folding a list of indices into a single distance table, which is exactly what the actual levenshtein function does. Above all, this version is a lot more readable than the previous two.

It also exposes a general strategy for solving dynamic programming problems in functional languages: instead of making the recursive call that is so inherent in dynamic programming, we create a step function that performs one such operation on a piece of data and returns the new data value, and then use that step function inside a reduce.
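
To see the pattern in isolation, here is the same step-plus-reduce idea applied to a much simpler dynamic programming problem, the Fibonacci numbers (fib-step and fib are made-up names, not part of the Levenshtein code):

(defn fib-step [table i]
  ;; one step of the recurrence: table already holds every value below i
  (assoc table i
    (if (< i 2)
      i
      (+ (table (- i 1)) (table (- i 2))))))

(defn fib [n]
  ((reduce fib-step {} (range (inc n))) n))

(fib 10)
;; => 55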

I’m sure there is a much simpler and more elegant way of writing this in Clojure, and I would love to get some comments and feedback.

Some interesting links regarding recursion and dynamic programming in Clojure:
