字符串匹配算法- 校招VIP

文章申明：转载来源：https://www.cnblogs.com/gaochundong/p/string_matching.html

字符串匹配问题的形式定义：

文本（Text）是一个长度为 n 的数组 T[1..n]；
模式（Pattern）是一个长度为 m 且 m≤n 的数组 P[1..m]；
T 和 P 中的元素都属于有限的字母表 Σ 表；
如果 0≤s≤n-m，并且 T[s+1..s+m] = P[1..m]，即对 1≤j≤m，有 T[s+j] = P[j]，则说模式 P 在文本 T 中出现且位移为 s，且称 s 是一个有效位移（Valid Shift）。

比如上图中，目标是找出所有在文本 T = abcabaabcabac 中模式 P = abaa 的所有出现。该模式在此文本中仅出现一次，即在位移 s = 3 处，位移 s = 3 是有效位移。

解决字符串匹配的算法包括：
朴素算法（Naive Algorithm）、Rabin-Karp 算法、有限自动机算法（Finite Automation）、 Knuth-Morris-Pratt 算法（即 KMP Algorithm）、Boyer-Moore 算法、Simon 算法、Colussi 算法、Galil-Giancarlo 算法、Apostolico-Crochemore 算法、Horspool 算法和 Sunday 算法等。

朴素的字符串匹配算法（Naive String Matching Algorithm）
Knuth-Morris-Pratt 字符串匹配算法（即 KMP 算法）
Boyer-Moore 字符串匹配算法

字符串匹配算法通常分为两个步骤：
预处理（Preprocessing）和匹配（Matching）。
所以算法的总运行时间为预处理和匹配的时间的总和。

上图描述了常见字符串匹配算法的预处理和匹配时间。

朴素的字符串匹配算法（Naive String Matching Algorithm）

朴素的字符串匹配算法又称为暴力匹配算法（Brute Force Algorithm），它的主要特点是：

1、没有预处理阶段；
2、滑动窗口总是后移 1 位；
3、对模式中的字符的比较顺序不限定，可以从前到后，也可以从后到前；
4、匹配阶段需要 O((n - m + 1)m) 的时间复杂度；
5、需要 2n 次的字符比较；
很显然，朴素的字符串匹配算法 NAIVE-STRING-MATCHER 是最原始的算法，它通过使用循环来检查是否在范围 n-m+1 中存在满足条件 P[1..m] = T [s + 1..s + m] 的有效位移 s。

1 NAIVE-STRING-MATCHER(T, P)
2 n ← length[T]
3 m ← length[P]
4 for s ← 0 to n - m
5 do if P[1 .. m] = T[s + 1 .. s + m]
6 then print "Pattern occurs with shift" s

如上图中，对于模式 P = aab 和文本 T = acaabc，将模式 P 沿着 T 从左到右滑动，逐个比较字符以判断模式 P 在文本 T 中是否存在。
可以看出，NAIVE-STRING-MATCHER 没有对模式 P 进行预处理，所以预处理的时间为 0。而匹配的时间在最坏情况下为 Θ((n-m+1)m)，如果 m = [n/2]，则为 Θ(n2)。
下面是 NAIVE-STRING-MATCHER 的代码示例。

namespace StringMatching
{
  class Program
  {
    static void Main(string[] args)
    {
      char[] text1 = "BBC ABCDAB ABCDABCDABDE".ToCharArray();
      char[] pattern1 = "ABCDABD".ToCharArray();

      int firstShift1;
      bool isMatched1 = NaiveStringMatcher.TryMatch1(text1, pattern1, out firstShift1);
      Contract.Assert(isMatched1);
      Contract.Assert(firstShift1 == 15);

      char[] text2 = "ABABDAAAACAAAABCABAB".ToCharArray();
      char[] pattern2 = "AAACAAAA".ToCharArray();

      int firstShift2;
      bool isMatched2 = NaiveStringMatcher.TryMatch2(text2, pattern2, out firstShift2);
      Contract.Assert(isMatched2);
      Contract.Assert(firstShift2 == 6);

      char[] text3 = "ABAAACAAAAAACAAAABCABAAAACAAAAFDLAAACAAAAAACAAAA".ToCharArray();
      char[] pattern3 = "AAACAAAA".ToCharArray();

      int[] shiftIndexes = NaiveStringMatcher.MatchAll(text3, pattern3);
      Contract.Assert(shiftIndexes.Length == 5);
      Contract.Assert(string.Join(",", shiftIndexes) == "2,9,22,33,40");

      Console.WriteLine("Well done!");
      Console.ReadKey();
    }
  }

  public class NaiveStringMatcher
  {
    public static bool TryMatch1(char[] text, char[] pattern, out int firstShift)
    {
      firstShift = -1;
      int n = text.Length;
      int m = pattern.Length;
      int s = 0, j = 0;

      // for..for..
      for (s = 0; s < n - m; s++)
      {
        for (j = 0; j < m; j++)
        {
          if (text[s + j] != pattern[j])
          {
            break;
          }
        }
        if (j == m)
        {
          firstShift = s;
          return true;
        }
      }

      return false;
    }

    public static bool TryMatch2(char[] text, char[] pattern, out int firstShift)
    {
      firstShift = -1;
      int n = text.Length;
      int m = pattern.Length;
      int s = 0, j = 0;

      // while..
      while (s < n && j < m)
      {
        if (text[s] == pattern[j])
        {
          s++;
          j++;
        }
        else
        {
          s = s - j + 1;
          j = 0;
        }

        if (j == m)
        {
          firstShift = s - j;
          return true;
        }
      }

      return false;
    }

    public static int[] MatchAll(char[] text, char[] pattern)
    {
      int n = text.Length;
      int m = pattern.Length;
      int s = 0, j = 0;
      int[] shiftIndexes = new int[n - m + 1];
      int c = 0;

      // while..
      while (s < n && j < m)
      {
        if (text[s] == pattern[j])
        {
          s++;
          j++;
        }
        else
        {
          s = s - j + 1;
          j = 0;
        }

        if (j == m)
        {
          shiftIndexes[c] = s - j;
          c++;

          s = s - j + 1;
          j = 0;
        }
      }

      int[] shifts = new int[c];
      for (int y = 0; y < c; y++)
      {
        shifts[y] = shiftIndexes[y];
      }

      return shifts;
    }
  }
}

上面代码中 TryMatch1 和 TryMatch2 分别使用 for 和 while 循环达到相同效果。

Knuth-Morris-Pratt 字符串匹配算法（即 KMP 算法）

我们来观察一下朴素的字符串匹配算法的操作过程。如下图（a）中所描述，在模式 P = ababaca 和文本 T 的匹配过程中，模板的一个特定位移 s，q = 5 个字符已经匹配成功，但模式 P 的第 6 个字符不能与相应的文本字符匹配。

此时，q 个字符已经匹配成功的信息确定了相应的文本字符，而知道这 q 个文本字符，就使我们能够立即确定某些位移是非法的。
例如上图（a）中，我们可以判断位移 s+1 是非法的，因为模式 P 的第一个字符 a 将与模式的第二个字符 b 匹配的文本字符进行匹配，显然是不匹配的。
而图（b）中则显示了位移 s’ = s+2 处，使模式 P 的前三个字符和相应的三个文本字符对齐后必定会匹配。
KMP 算法的基本思路就是设法利用这些已知信息，不要把 "搜索位置" 移回已经比较过的位置，而是继续把它向后面移，这样就提高了匹配效率。

已知模式 P[1..q] 与文本 T[s+1..s+q] 匹配，那么满足 P[1..k] = T[s’+1..s’+k] 其中 s’+k = s+q 的最小位移 s’ > s 是多少？
这样的位移 s’ 是大于 s 的但未必非法的第一个位移，因为已知 T[s+1..s+q] 。在最好的情况下有 s’ = s+q，
因此立刻能排除掉位移 s+1, s+2 .. s+q-1。
在任何情况下，对于新的位移 s’，无需把 P 的前 k 个字符与 T 中相应的字符进行比较，因为它们肯定匹配。

可以用模式 P 与其自身进行比较，以预先计算出这些必要的信息。
例如上图（c）中所示，由于 T[s’+1..s’+k] 是文本中已经知道的部分，所以它是字符串 Pq 的一个后缀。

此处我们引入模式的前缀函数 π（Pai），π 包含有模式与其自身的位移进行匹配的信息。
这些信息可用于避免在朴素的字符串匹配算法中，对无用位移进行测试。
π[q] = max {k : k < q and Pk ⊐ Pq}
π[q] 代表当前字符之前的字符串中，最长的共同前缀后缀的长度。（π[q] is the length of the longest prefix of P that is a proper suffix of Pq.）

下图给出了关于模式 P = ababababca 的完整前缀函数 π，可称为部分匹配表（Partial Match Table）。

计算过程：
π[1] = 0，a 仅一个字符，前缀和后缀为空集，共有元素最大长度为 0；
π[2] = 0，ab 的前缀 a，后缀 b，不匹配，共有元素最大长度为 0；
π[3] = 1，aba，前缀 a ab，后缀 ba a，共有元素最大长度为 1；
π[4] = 2，abab，前缀 a ab aba，后缀 bab ab b，共有元素最大长度为 2；
π[5] = 3，ababa，前缀 a ab aba abab，后缀 baba aba ba a，共有元素最大长度为 3；
π[6] = 4，ababab，前缀 a ab aba abab ababa，后缀 babab abab bab ab b，共有元素最大长度为 4；
π[7] = 5，abababa，前缀 a ab aba abab ababa ababab，后缀 bababa ababa baba aba ba a，共有元素最大长度为 5；
π[8] = 6，abababab，前缀 .. ababab ..，后缀 .. ababab ..，共有元素最大长度为 6；
π[9] = 0，ababababc，前缀和后缀不匹配，共有元素最大长度为 0；
π[10] = 1，ababababca，前缀 .. a ..，后缀 .. a ..，共有元素最大长度为 1；
KMP 算法 KMP-MATCHER 中通过调用 COMPUTE-PREFIX-FUNCTION 函数来计算部分匹配表。

KMP-MATCHER(T, P)
n ← length[T]
m ← length[P]
π ← COMPUTE-PREFIX-FUNCTION(P)
q ← 0                          //Number of characters matched.
for i ← 1 to n                 //Scan the text from left to right.
    do while q > 0 and P[q + 1] ≠ T[i]
            do q ← π[q]        //Next character does not match.
        if P[q + 1] = T[i]
            then q ← q + 1     //Next character matches.
        if q = m               //Is all of P matched?
            then print "Pattern occurs with shift" i - m
            q ← π[q]           //Look for the next match.

COMPUTE-PREFIX-FUNCTION(P)
m ← length[P]
π[1] ← 0
k ← 0
for q ← 2 to m
     do while k > 0 and P[k + 1] ≠ P[q]
            do k ← π[k]
        if P[k + 1] = P[q]
           then k ← k + 1
        π[q] ← k
return π

预处理过程 COMPUTE-PREFIX-FUNCTION 的运行时间为 Θ(m)，KMP-MATCHER 的匹配时间为 Θ(n)。
相比较于 NAIVE-STRING-MATCHER，KMP-MATCHER 的主要优化点就是在当确定字符不匹配时对于 pattern 的位移。
NAIVE-STRING-MATCHER 的位移效果是：文本向后移一位，模式从头开始。

s = s - j + 1;
    j = 0;

KMP-MATCHER 首先对模式做了获取共同前缀后缀最大长度的预处理操作，位移过程是先将模式向后移 partial_match_length - table[partial_match_length - 1]，然后再判断是否匹配。
这样通过对已匹配字符串的已知信息的利用，可以有效节省比较数量。

if (j != 0)
        j = lps[j - 1];
    else
        s++;

下面描述了当发现字符 j 与 c 不匹配时的位移效果。

// partial_match_length - table[partial_match_length - 1]
    rrababababjjjjjiiooorababababcauuu
      ||||||||-
      ababababca
    // 8-6=2
    rrababababjjjjjiiooorababababcauuu
      xx||||||-
        ababababca
    // 6-4=2
    rrababababjjjjjiiooorababababcauuu
        xx||||-
          ababababca
    // 4-2=2
    rrababababjjjjjiiooorababababcauuu
          xx||-
            ababababca
    // 2-0=2
    rrababababjjjjjiiooorababababcauuu
            xx-
              ababababca

综上可知，KMP 算法的主要特点是：
1、需要对模式字符串做预处理；
2、预处理阶段需要额外的 O(m) 空间和复杂度；
3、匹配阶段与字符集的大小无关；
4、匹配阶段至多执行 2n - 1 次字符比较；
5、对模式中字符的比较顺序时从左到右；
下面是 KMP-MATCHER 的代码示例。

namespace StringMatching
{
  class Program
  {
    static void Main(string[] args)
    {
      char[] text1 = "BBC ABCDAB ABCDABCDABDE".ToCharArray();
      char[] pattern1 = "ABCDABD".ToCharArray();

      int firstShift1;
      bool isMatched1 = KmpStringMatcher.TryMatch1(text1, pattern1, out firstShift1);
      Contract.Assert(isMatched1);
      Contract.Assert(firstShift1 == 15);

      char[] text2 = "ABABDAAAACAAAABCABAB".ToCharArray();
      char[] pattern2 = "AAACAAAA".ToCharArray();

      int firstShift2;
      bool isMatched2 = KmpStringMatcher.TryMatch2(text2, pattern2, out firstShift2);
      Contract.Assert(isMatched2);
      Contract.Assert(firstShift2 == 6);

      char[] text3 = "ABAAACAAAAAACAAAABCABAAAACAAAAFDLAAACAAAAAACAAAA".ToCharArray();
      char[] pattern3 = "AAACAAAA".ToCharArray();

      int[] shiftIndexes3 = KmpStringMatcher.MatchAll1(text3, pattern3);
      Contract.Assert(shiftIndexes3.Length == 5);
      Contract.Assert(string.Join(",", shiftIndexes3) == "2,9,22,33,40");
      int[] shiftIndexes4 = KmpStringMatcher.MatchAll2(text3, pattern3);
      Contract.Assert(shiftIndexes4.Length == 5);
      Contract.Assert(string.Join(",", shiftIndexes4) == "2,9,22,33,40");

      Console.WriteLine("Well done!");
      Console.ReadKey();
    }
  }

  public class KmpStringMatcher
  {
    public static bool TryMatch1(char[] text, char[] pattern, out int firstShift)
    {
      // KMP needs a pattern preprocess to get the Partial Match Table
      int[] lps = PreprocessToComputeLongestProperPrefixSuffixArray(pattern);
      // pattern: ABCDABD
      // char:  | A | B | C | D | A | B | D |
      // index: | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
      // lps:   | 0 | 0 | 0 | 0 | 1 | 2 | 0 |

      firstShift = -1;
      int n = text.Length;
      int m = pattern.Length;
      int s = 0, j = 0;

      while (s < n && j < m)
      {
        if (j == -1 || text[s] == pattern[j])
        {
          s++;
          j++;
        }
        else
        {
          // here is different with naive string matcher
          if (j != 0)
            j = lps[j - 1];
          else
            s++;
        }

        if (j == m)
        {
          firstShift = s - j;
          return true;
        }
      }

      return false;
    }

    static int[] PreprocessToComputeLongestProperPrefixSuffixArray(char[] pattern)
    {
      int m = pattern.Length;

      // hold the longest prefix suffix values for pattern
      int[] lps = new int[m];
      lps[0] = 0;

      // length of the previous longest prefix suffix
      int k = 0;
      int q = 1;
      while (q < m)
      {
        if (pattern[k] == pattern[q])
        {
          k++;
          lps[q] = k;
          q++;
        }
        else
        {
          if (k != 0)
          {
            k = lps[k - 1];
          }
          else
          {
            lps[q] = 0;
            q++;
          }
        }
      }

      return lps;
    }

    public static bool TryMatch2(char[] text, char[] pattern, out int firstShift)
    {
      // KMP needs a pattern preprocess
      int[] next = PreprocessToGetNextArray(pattern);
      // pattern: ABCDABD
      // char:  | A | B | C | D | A | B | D |
      // index: | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
      // lps:   | 0 | 0 | 0 | 0 | 1 | 2 | 0 |
      // next:  |-1 | 0 | 0 | 0 | 0 | 1 | 2 | -> shift LPS 1 position to right

      firstShift = -1;
      int n = text.Length;
      int m = pattern.Length;
      int s = 0, j = 0;

      while (s < n && j < m)
      {
        if (j == -1 || text[s] == pattern[j])
        {
          s++;
          j++;
        }
        else
        {
          // here is different with naive string matcher
          j = next[j];
        }

        if (j == m)
        {
          firstShift = s - j;
          return true;
        }
      }

      return false;
    }

    static int[] PreprocessToGetNextArray(char[] pattern)
    {
      int m = pattern.Length;
      int[] next = new int[m];
      next[0] = -1;

      int k = -1;
      int q = 0;
      while (q < m - 1)
      {
        if (k == -1 || pattern[k] == pattern[q])
        {
          k++;
          q++;

          //next[q] = k;       // does not optimize

          if (pattern[k] != pattern[q])
            next[q] = k;
          else
            next[q] = next[k]; // with optimization
        }
        else
        {
          k = next[k];
        }
      }

      return next;
    }

    public static int[] MatchAll1(char[] text, char[] pattern)
    {
      // KMP needs a pattern preprocess
      int[] lps = PreprocessToComputeLongestProperPrefixSuffixArray(pattern);
      // pattern: ABCDABD
      // char:  | A | B | C | D | A | B | D |
      // index: | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
      // lps:   | 0 | 0 | 0 | 0 | 1 | 2 | 0 |

      int n = text.Length;
      int m = pattern.Length;
      int s = 0, j = 0;
      int[] shiftIndexes = new int[n - m + 1];
      int c = 0;

      while (s < n && j < m)
      {
        if (j == -1 || text[s] == pattern[j])
        {
          s++;
          j++;
        }
        else
        {
          // here is different with naive string matcher
          if (j != 0)
            j = lps[j - 1];
          else
            s++;
        }

        if (j == m)
        {
          shiftIndexes[c] = s - j;
          c++;

          j = lps[j - 1];
        }
      }

      int[] shifts = new int[c];
      for (int y = 0; y < c; y++)
      {
        shifts[y] = shiftIndexes[y];
      }

      return shifts;
    }

    public static int[] MatchAll2(char[] text, char[] pattern)
    {
      // KMP needs a pattern preprocess
      int[] next = PreprocessToGetNextArray(pattern);
      // pattern: ABCDABD
      // char:  | A | B | C | D | A | B | D |
      // index: | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
      // lps:   | 0 | 0 | 0 | 0 | 1 | 2 | 0 |
      // next:  |-1 | 0 | 0 | 0 | 0 | 1 | 2 | -> shift LPS 1 position to right

      int n = text.Length;
      int m = pattern.Length;
      int s = 0, j = 0;
      int[] shiftIndexes = new int[n - m + 1];
      int c = 0;

      while (s < n && j < m)
      {
        if (j == -1 || text[s] == pattern[j])
        {
          s++;
          j++;
        }
        else
        {
          // here is different with naive string matcher
          j = next[j];
        }

        if (j == m)
        {
          shiftIndexes[c] = s - j;
          c++;

          j = next[j - 1];
        }
      }

      int[] shifts = new int[c];
      for (int y = 0; y < c; y++)
      {
        shifts[y] = shiftIndexes[y];
      }

      return shifts;
    }
  }
}