The topic here is string search algorithm and not a regular expression. For example, find the position(s) of matched patterns in a loooong text.
The easiest implementation is:
A performance of this brute-force search is O(n^2), which considered bad. When the given text is huge, it’s apparent.
There’s a few algorithms to effectively find the pattern. Among them, Knuth-Morris-Pratt (KMP) algorithm would be well-known algorithm. KMP uses prefix function to skip or find a pattern effectively. The performance of KMP is O(N+M) where N, M are the lengths of text and pattern respectively.
KMP is quite effective, however, understanding its idea is quite tough. Especially, a prefix function, which is also called failure function or partial match table, is a challenging part. Actually, it is an array, neither of function nor table.
This famous algorithm is explained in books and websites, so we can easily find many. Unfortunately, most of them don’t help to understand the idea so much except:
This blog post is really good. The key idea of prefix function is to find “the longest proper prefix of a substring which is also a suffix.” The blog post tells us what is it clearly using good examples.
Also, the idea how to find the pattern using prefix function is described with easy-to-understand illustrations.
The prefix function (an array) will be created only from a pattern. We don’t use a text at all to compute values in prefix function. The function is used to see how the pattern matches itself. Based on this information, we can safely skip useless comparisons.
Here’s the code to calculate prefix function:
static int[] computePrefixFunction(char[] pattern) {
int m = pattern.length;
int[] p = new int[m];
p[0] = 0;
int k = 0;
for (int q=1; q<m; q++) {
while (k > 0 && pattern[k] != pattern[q]) {
k = p[k];
}
if (pattern[k] == pattern[q]) {
k++;
}
p[q] = k;
}
return p;
}
When the pattern is “ababaca”, above computes:
i | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|---|
pattern[i] | a | b | a | b | a | c | a |
p[i] | 0 | 0 | 1 | 2 | 3 | 0 | 1 |
Once prefix function is computed, it’s time to search the pattern. This part absolutely use the given text. The rule is:
Below is the code to search the pattern:
static void kmp(char[] text, char[] pattern) {
List<Integer> indexes = new ArrayList<>();
int n = text.length;
int m = pattern.length;
int[] p = computePrefixFunction(pattern);
int q = 0;
for (int i=0; i<n; i++) {
if (q > 0 && pattern[q] != text[i]) {
q = p[q-1];
}
if (pattern[q] == text[i]) {
q++;
}
if (q == m) {
indexes.add(i-q+1);
q = p[q-1];
}
}
if (indexes.size()==0) {
System.out.println((new String(pattern)) + " not found");
} else {
System.out.println((new String(pattern)) + " found at: " + indexes.toString());
}
}
All parts are ready. It’s time to run the code and see the result.
public static void main(String[] args) {
char[] pat = "ababaca".toCharArray();
char[] text = "bacbababaabcbab".toCharArray();
kmp(text, pat);
pat = "ababaca".toCharArray();
text = "bacbababaabcbababaca".toCharArray();
kmp(text, pat);
pat = "aba".toCharArray();
text = "bacbababaabcbababaca".toCharArray();
}
Above prints,
ababaca not found
ababaca found at: [13]
aba found at: [4, 6, 13, 15]