Pages

About Me - 关于我

My photo
Madison, WI, United States
Joy Young ~~

2012/07/21

KMP supporting wild card

One day, I found it very challenging to extend the standard KMP algorithm to support basic wild card:
. : match arbitrary 1 character
*: match any length of strings.
I spent some time in this algorithm, and extended the KMP to support them without increase the time complexity: it's still O(n)
vector<int> next_ext(const string &str)
{
       int leng = str.size();
       if (leng < 0 || str.empty())
              return vector<int>();
       vector<int> next(leng + 1);
       int i = -1, j, off; //treated like str[-1] is a '*'
       off = i, j = off, i ++, next[i] = off;
       while (i < leng)
       {
              if (str[i] == '*' ) //consistent with the above comment.
                     off = i, j = off, i ++, next[i] = off;
              else if (j == off || str[i] == str[j] || str[j] == '.')
                     ++i, ++j, next[i] = j;
              else
                     j = next[j];
       }
       return next;
}
//if matched, return the last position of the first matched substring of the string, else return -1.

int kmp_ext(string str, string pattern)
{
       int str_len = str.size(), pattern_len = pattern.size();
       vector<int> next = next_ext(pattern);
       int i = 0, j = -1, off; // treated like pattern[-1] is a '*'
       off = j, j ++;
       while(i < str_len && j < pattern_len)
       {
              if (j != off && pattern[j] == '*' )
                     off = j, j ++; //consistent with the above comment
              else if( j == off || str[i] == pattern[j] || pattern[j] == '.')
                     j++, i++;
              else
                     j = next[j];
       }
       if(j == pattern_len)
              return i;
       else
              return -1;
}
int main()
{
       string s = "abdcxdefabc";    
       cout<<kmp_ext(s, "ab..cx")<<endl;
       cout<<kmp_ext(s, "ab*x.*fa*c")<<endl;
       cout<<kmp_ext(s, "cxd*")<<endl;
       cout<<kmp_ext(s, "*b*x.*fa*c")<<endl;
       cout<<kmp_ext(".ab", ".a.")<<endl;
       cout<<kmp_ext("a*b", "a*b")<<endl;
}

I did the minimum modification to do this. The differences from classic KMP of this method are:
1. Add one more condition in 'if' statement indicating pattern[j] == '.' is considered as a vaild match for any character.
2. Add one more 'if' statement to generalize the standard KMP to handle '*'.
3. Introduced an additional variable 'off', whcih is initialized with -1, but changes when a '*' is met. In standard KMP, it is set to be a constant: -1. Here I generalized it to a variable.
These 3 differences occur both in next() and kmp() function, symmetrically.
Except for these 3 differences (in each function), all other part is standard KMP.

How to extend it to solve a matching problem?
KMP is a "find" method. Generally speaking, "find" is usually harder than "match".
So, with slight modification, this method can be used to solve the "match" too-- just also return the beginning position of the first matched substring of the string. If the found substring turns out to cover the whole string, it is a match.

4 comments:

  1. I think your program is incorrect. When the input is:

    s=abaab
    p=*?a?

    The expected answer is 4 but your ouput is -1.

    ReplyDelete
  2. You have to replace "?" to ".", bitch. :)

    ReplyDelete
  3. This comment has been removed by the author.

    ReplyDelete
  4. try this case:

    `cout<<kmp_ext(".ab", "baabbaa")<<endl;`

    the result is `-1` on my computer, but apparently it's not correct

    ReplyDelete