Sunday, February 22, 2015

Edit Distance

When a spell checker encounters a possible misspelling, it looks in its dictionary for other words that are close by. What is the appropriate notion of closeness in this case?
A natural measure of the distance between two strings is the extent to which they can be aligned, or matched up. Technically, an alignment is simply a way of writing the strings one above the other. For instance, here are two possible alignments of SNOWY and SUNNY:


The - indicates a “gap”, any number of these can be placed in either string. The cost of an alignment is the number of columns in which the letters differ. And the edit distance between two strings is the cost of their best possible alignment.

Edit distance is so named because it can also be thought of as the minimum number of edits—insertions, deletions, and substitutions of characters—needed to transform the first string into the second.

Dynamic Programming Solution:

Our goal is to find the edit distance between two strings x[1...m] and y[1...n]. So we need to find out the sub-problem, it should go part of the way toward solving the whole problem, so how about looking at the edit distance between some prefix of the first string, x[1...i], and some prefix of the second, y[1...j]? Call this sub-problem E(i, j). Our objective is to compute E(m, n).


where diff(i, j) = 0 if x[i] = y[j] and 1 otherwise.

Example:


Implementation:

int min(int i, int j, int k)
{
    return i < j ? (i < k ? i : k) : (j < k ? j : k);
}

int editDistance(string str1, string str2)
{
    int m = str1.length();
    int n = str2.length();
    int **E = new int*[m + 1];
    for(int i = 0; i <= m; ++i)
    {
            E[i] = new int[n + 1];
            E[i][0] = i;
    }
    for(int i = 0; i <= n; ++i)
            E[0][i] = i;
    
    for(int i = 1; i <= m; ++i)
    {
            for(int j = 1; j <= n; ++j)
                    E[i][j] = min(E[i - 1][j] + 1, E[i][j - 1] + 1, E[i - 1][j - 1] + (int)(str1[i - 1] != str2[j - 1]));
    }
    return E[m][n];
}

Time Complexity: O(mn)

No comments:

Post a Comment