Monday, December 29, 2014

Optimizing the Levenshtein Algorithm in C#

This is part 1 of a multi-part series of posts covering Levenshtein and Damerau-Levenshtein edit distance algorithms using C# and SQL. Part 2 covers Levenshtein in TSQL. Part 3 covers Damerau-Levenshtein in C#.Part 4 covers Damerau-Levenshtein in TSQL.

There is a well known edit distance metric called the Levenshtein distance. It measures the number of insertion, deletion, and substitution edits required to transform one string to another string. I won’t go into its details, because that’s covered well at Wikipedia, and thousands of other sites on the internet. What I will go into is how to make it run faster, because it can be somewhat expensive in time and memory.

If we literally implement the algorithm in C#, using Math.Min, we have a basic implementation. I’ll use this as the reference to compare our optimized versions. I’m not going to give actual benchmark times, because it’s entirely dependent on the strings being compared. Whether it’s big strings, little strings, big to little, lots of shared characters, very dissimilar, it changes relative speeds of different implementations. I’ll just speak generally about the differences. You should benchmark typical strings in whatever your application is to know how it will play out for you.

The first optimizations we can do is change our loops to be zero based (with matching changes to code using the loop variables), and replace the Math.Min calls with the comparable straight C# code, This can give us about a 20% improvement in speeds. It doesn’t change memory consumed, which is still primarily the matrix of edit costs whose size is the string1.Length * string2.Length.

Now we start doing optimizations that really matter. The first one is pretty significant. It’s an improvement described Sten Hjelmqvist in his article “Fast, memory efficient Levenshtein algorithm”. It uses a clever observation that an entire length1 * length2 matrix is not required to compute the Levenshtein edit distance. All that’s required is two arrays representing two columns of that matrix. When comparing large strings, this delivers a significant reduction in memory. Besides that, and even with small strings, it also delivers an improvement in speed. Using the implementation given in his article, execution time is cut in half compared to the literal implementation we started with.

We can improve this further. One minor thing we can do is ensure that the larger string is the one whose length is used for the inner loop. This reduces the number of executions of code that’s only in the outer loop. This decision improves speed, but at a slight cost in memory, since the column arrays are sized to the length of the string processed in the inner loop. This should only matter when strings can be of greatly different sizes. If you’re more concerned with memory than speed, you can reverse this logic, and ensure the smaller string is the one whose length is used for the inner loop. Another advantage of checking and swapping if necessary to ensure which of the two strings is the smaller is that we can then always use the length of that string as the min length of the two without further comparisons.

If you examine the sequence of how elements in the two arrays are read and written, it becomes evident that the idea described by Hjelmqvist can be taken even further. Rather than two arrays, we can do the job with a single array and a couple temporary int variables. This is done by reading the elements we need (containing values for the column just computed) just prior to them being overwritten with values of the next column being computed, and we do this all in-place in the single array. This reduces the memory used by half. It also allows us to reduce our array access to a single read and a single write in each inner loop iteration. The result is a further speed improvement, cutting the time in half again, in addition to the memory savings.

The final optimization, depending on the nature of the strings you throw at it, can be significant. When looking at two strings, any shared suffix and prefix should not affect the final edit distance. If we can identify and exclude shared suffix and/or prefix characters, the cost to do that will be much cheaper than the cost to have those characters processed through the normal algorithm. For example, if I want the edit distance between

johnathan and jonithan

we know that the cost is zero for the shared prefix jo and the shared suffix than. So we only need to find the edit distance between hna and ni. Even if we can shave off a couple characters, this optimization pays off. If even more characters are shared, as in this example, the improvement in time and memory can be very significant. With the original strings, we execute the inner loop 72 times (the length of string 1 times the length of string 2). By shaving off the shared 6 characters, we only execute the inner loop 6 times.

The code below is the version of Levenshtein edit distance after making the optimizations described above. What we end up with is a pretty fast implementation that requires less than 25% of the time of the basic implementation we started with when used with 9 character or 200 character test strings having no common prefix or suffix characters. When there are common suffix and prefix characters, the speed improvement is even more noticeable, with the "johnathan" example given earlier taking only 6% of the time required by the original algorithm. In addition to the speed improvement, we also use less memory, and if the strings being compared are large, we use significantly less memory.

/// <summary>
/// Computes and returns the Levenshtein edit distance between two strings, i.e. the
/// number of insertion, deletion, and sustitution edits required to transform one
/// string to the other. This value will be >= 0, where 0 indicates identical strings.
/// Comparisons are case sensitive, so for example, "Fred" and "fred" will have a 
/// distance of 1.
/// http://blog.softwx.net/2014/12/optimizing-levenshtein-algorithm-in-c.html
/// </summary>
/// <remarks>See http://en.wikipedia.org/wiki/Levenshtein_distance
/// This is based on Sten Hjelmqvist's "Fast, memory efficient" algorithm, described
/// at http://www.codeproject.com/Articles/13525/Fast-memory-efficient-Levenshtein-algorithm, 
/// with some additional optimizations.
/// </remarks>
/// <param name="s">String being compared for distance.</param>
/// <param name="t">String being compared against other string.</param>
/// <returns>int edit distance, >= 0 representing the number of edits required to transform one string to the other.</returns>
public static int Levenshtein(this string s, string t) {
    if (String.IsNullOrEmpty(s)) return (t ?? "").Length;
    if (String.IsNullOrEmpty(t)) return s.Length;

    // if strings of different lengths, ensure shorter string is in s. This can result in a little
    // faster speed by spending more time spinning just the inner loop during the main processing.
    if (s.Length > t.Length) {
        var temp = s; s = t; t = temp; // swap s and t
    }
    int sLen = s.Length; // this is also the minimun length of the two strings
    int tLen = t.Length;

    // suffix common to both strings can be ignored
    while ((sLen > 0) && (s[sLen - 1] == t[tLen - 1])) { sLen--; tLen--; }

    int start = 0;
    if ((s[0] == t[0]) || (sLen == 0)) { // if there's a shared prefix, or all s matches t's suffix
        // prefix common to both strings can be ignored
        while ((start < sLen) && (s[start] == t[start])) start++;
        sLen -= start; // length of the part excluding common prefix and suffix
        tLen -= start;

        // if all of shorter string matches prefix and/or suffix of longer string, then
        // edit distance is just the delete of additional characters present in longer string
        if (sLen == 0) return tLen;

        t = t.Substring(start, tLen); // faster than t[start+j] in inner loop below
    }
    var v0 = new int[tLen];
    for (int j = 0; j < tLen; j++) v0[j] = j + 1;

    int current = 0;
    for (int i = 0; i < sLen; i++) {
        char sChar = s[start + i];
        int left = current = i;
        for (int j = 0; j < tLen; j++) {
            int above = current;
            current = left; // cost on diagonal (substitution)
            left = v0[j];    
            if (sChar != t[j]) {
                current++;              // substitution
                int insDel = above + 1; // deletion
                if (insDel < current) current = insDel;
                insDel = left + 1;      // insertion
                if (insDel < current) current = insDel;
            }
            v0[j] = current;
        }
    }
    return current;
}