Friday, February 13, 2015

Knuth-Morris-Pratt(KMP)

String Matching Problem:
             
            Suppose a Text is an array T[1..n] of length n and pattern is also an array P[1..m] (m <= n). We say pattern P occurs with shift s in Text T if 0 <= s <= n-m and T[s+1..s+m] = P[1..m] i.e. T[s + j] = P[j] for all 1 <= j <=m.
             
                 If P occurs with shift s in T the s is a valid shift otherwise it is an invalid shift.The string matching problem is the problem of finding all valid shifts with which a given Pattern P occurs in a given Text T.

 
                                   
Notation and Terminology:

Prefix : W is a prefix of string x if x = wy for some string y.
Suffix : W is suffix of string x if x = yw for some string y.
xy – string concatenation   | x | - length of string x

Knuth-Morris-Pratt Algorithm:

Knuth, Morris and Pratt proposed a linear time algorithm for the string matching problem.
A matching time of O(n) is achieved by avoiding comparisons with elements of T that have previously been involved in comparison with some elements of the pattern P to be matched. i.e., backtracking on the string T never occurs.

Components of KMP Algorithm:

Prefix Function ΠPrefix function Π for a pattern encapsulate knowledge about how the pattern matches against shifts of itself.

         
Knowing these q text characters allows us to determine immediately that certain shifts are invalid since the first character a would be aligned with a text character that is known to match with second pattern character b. So next shift should be s = s + 2.
           
 Given P[1..q] matches T[s+1..s+q] what is the least shift s1 > s such that
 P[1..k] = T[s1+1 . . s1+k] where s1 + k = s + q

Prefix function Π [q] = max ( k : k < q and Pk is suffix of Pq  where Pk  is P[1..k] i.e.  Π[q] is the length of longest prefix of P that is a proper suffix P."

My implementation of Prefix function is as follows:


intPrefix(char *pattern)
{
    int len = strlen(pattern);
    int *pre = new int[len];

    pre[0] = 0;
    int k = 0;

    for(int i=1i<len; ++i)
    {
        while(k>0 && pattern[k] != pattern[i])
            k = pre[k];
        if(pattern[k] == pattern[i])
            ++k;
        pre[i] = k;
    }
    return pre;
}

KMP Matcher:  With text T, pattern P and prefix function Π as inputs, it finds all the occurrence of P  in T and returns the number of shifts of  P after which occurrence is found.
Implementation of KMP Match function is as follows:

 bool Match(char *textchar *patternvector<int>& shifts)
{
    int textLength = strlen(text);
    int patrnLength = strlen(pattern);
    intprefixArray = Prefix(pattern); // Π function
    bool found = false;
    int q = 0;

    for(int i=0i<textLength; ++i)
    {
        whileq>0 && pattern[q] != text[i])
            q = prefixArray[q-1]; //Next character does not macth
        if(pattern[q] == text[i])
            ++q;  //Next character Matched
        if(q == patrnLength//All matched
        {
            shifts.push_back(i-patrnLength);
            found = true;
            q = prefixArray[q-1];  //Look for next match
        }
       
    }
    return found;
}

You can see both prefix( ) function and Match function takes linear time to execute.

No comments:

Post a Comment