**String Matching Problem:**

Suppose a Text is an array T[1..n] of length n and pattern is also an array P[1..m] (m <= n). We say pattern P occurs with shift s in Text T if 0 <= s <= n-m and T[s+1..s+m] = P[1..m] i.e. T[s + j] = P[j] for all 1 <= j <=m.

If P occurs with shift s in T the s is a valid shift otherwise it is an invalid shift.

**The string matching problem is the problem of finding all valid shifts with which a given Pattern P occurs in a given Text T.**

**Notation and Terminology:**

**Prefix :**W is a prefix of string x if x = wy for some string y.

**Suffix :**W is suffix of string x if x = yw for some string y.

**xy**– string concatenation

**| x |**- length of string x

**Knuth-Morris-Pratt Algorithm:**

Knuth, Morris and Pratt proposed a linear time algorithm for the string matching problem.

A matching time of O(n) is achieved by avoiding comparisons with elements of T that have previously been involved in comparison with some elements of the pattern P to be matched. i.e., backtracking on the string T never occurs.

**Components of KMP Algorithm:**

**Prefix Function**

**Π:**Prefix function Π for a pattern encapsulate knowledge about how the pattern matches against shifts of itself.

Knowing these q text characters allows us to determine immediately that certain shifts are invalid since the first character a would be aligned with a text character that is known to match with second pattern character b. So next shift should be s = s + 2.

Given P[1..q] matches T[s+1..s+q] what is the least shift s1 > s such that

P[1..k] = T[s1+1 . . s1+k] where s1 + k = s + q

Prefix function

**Π [q] = max ( k : k < q and P**where P_{k}is suffix of P_{q })_{k }is P[1..k] i.e.**"****Π[q] is the length of longest prefix of P that is a proper suffix P**_{q }."My implementation of Prefix function is as follows:

int* Prefix(char *pattern)

{

int len = strlen(pattern);

int *pre = new int[len];

pre[0] = 0;

int k = 0;

for(int i=1; i<len; ++i)

{

while(k>0 && pattern[k] != pattern[i])

k = pre[k];

if(pattern[k] == pattern[i])

++k;

pre[i] = k;

}

return pre;

}

**KMP Matcher:**With text T, pattern P and prefix function Π as inputs, it finds all the occurrence of P in T and returns the number of shifts of P after which occurrence is found.

Implementation of KMP Match function is as follows:

bool Match(char *text, char *pattern, vector<int>& shifts)

{

int textLength = strlen(text);

int patrnLength = strlen(pattern);

int* prefixArray = Prefix(pattern); // Π function

bool found = false;

int q = 0;

for(int i=0; i<textLength; ++i)

{

while( q>0 && pattern[q] != text[i])

q = prefixArray[q-1]; //Next character does not macth

if(pattern[q] == text[i])

++q; //Next character Matched

if(q == patrnLength) //All matched

{

shifts.push_back(i-patrnLength);

found = true;

q = prefixArray[q-1]; //Look for next match

}

}

return found;

}

## No comments:

## Post a Comment