Friday, July 10, 2020

Approximate String Matching for iOS Dev.

Introduction.

Approximate string matching is one of those “sudden” problems. It jumped me while I was working on a little OCR pet project. The situation wouldn’t have been so severe if it hadn’t been for the time constraint that usually comes along with any computer-related task.

In short, my problem was finding a very long string (the recognized page text) inside a much longer string (the known book text). It wouldn’t be much of a pickle if I were looking for an exact match: that takes time linear in the length of the text, which is not so bad. But that’s not exactly my case - optical character recognition is prone to errors, and I have to account for them. So, what if I want two strings to match, but with some spelling errors allowed?

To approach this question, let’s figure out what a “spelling error” is. It’s best to think about errors in terms of edit distances. There are many edit distances (Hamming, Levenshtein, Damerau-Levenshtein), but the one that is “ours”, and which is commonly referred to as THE EDIT DISTANCE, is the Levenshtein distance. Simply put, edit distances differ from each other in the sets of unit operations they allow. For instance, the Hamming distance allows only substitution, that is, replacement of characters at corresponding positions in the two strings. The more replacements there are, the larger the distance between the two strings, and the more spelling errors. The Levenshtein distance allows substitutions, insertions and deletions. It has been found particularly useful in spellchecking, which is close to what I need.

The Levenshtein distance between two strings can be computed straightforwardly using dynamic programming. The setup is as follows: we have a d.p. matrix $C$, the first string and the second string, and we want the edit distance between the two. String one, of length $m$, is indexed by $i$; string two, of length $n$, by $j$. Row $0$ and column $0$ hold the base cases $C[i, 0] = i$ and $C[0, j] = j$. We then traverse the matrix column by column, filling it in using the following rule: $ C[i, j] = \min(C[i-1,j]+1,\ C[i,j-1]+1,\ C[i-1,j-1]+1_{str_1[i] \neq str_2[j]}) $ The resulting distance thus ends up in $C[m, n]$, the bottom-right cell. Putting it in code I got the following:
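
A minimal sketch of that recurrence in Swift (the actual framework code is Objective-C; this is just to illustrate the idea):

```swift
// Minimal illustration of the Levenshtein d.p. recurrence.
// Runs in O(n * m) time and O(n * m) memory.
func levenshteinDistance(_ a: [UInt8], _ b: [UInt8]) -> Int {
    let m = a.count, n = b.count
    // c[i][j] = edit distance between a[0..<i] and b[0..<j]
    var c = Array(repeating: Array(repeating: 0, count: n + 1), count: m + 1)
    for i in 0...m { c[i][0] = i }          // delete all of a[0..<i]
    for j in 0...n { c[0][j] = j }          // insert all of b[0..<j]
    guard m > 0 && n > 0 else { return c[m][n] }
    for j in 1...n {                        // column by column
        for i in 1...m {
            let substitution = c[i - 1][j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1)
            let deletion = c[i - 1][j] + 1
            let insertion = c[i][j - 1] + 1
            c[i][j] = min(substitution, deletion, insertion)
        }
    }
    return c[m][n]
}

// Example: "laughter" and "slaughter" differ by a single insertion.
let d = levenshteinDistance(Array("laughter".utf8), Array("slaughter".utf8))
print(d) // 1
```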

Easy enough, right? However, it’s clear that the runtime of this code is $O(nm)$. The code above computes the distance between two strings, but the situation doesn’t change much when we restate the problem and start looking for a substring in a text: the naive approach still yields a runtime of $O(nm)$. Given that in my case $n$ would be on the order of $10^7$ and $m$ on the order of $10^3$, this approach would limit my target audience to some very, very patient people.
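
For completeness, here is what that naive restatement looks like - the same d.p., except the top row is all zeros, so a match may start anywhere in the text, and we report every end position whose bottom cell is within $k$ (again a Swift sketch, not the code I actually shipped):

```swift
// Naive O(n * m) approximate substring search.
// Returns every end position j (1-based) such that some substring of `text`
// ending at j is within edit distance k of `pattern`.
func approximateMatchEnds(pattern: [UInt8], text: [UInt8], k: Int) -> [Int] {
    let m = pattern.count, n = text.count
    guard m > 0, n > 0 else { return [] }
    var matches: [Int] = []
    // col[i] = edit distance between pattern[0..<i] and the best
    // substring of `text` ending at the current position.
    var col = Array(0...m)
    for j in 1...n {
        var prev = col[0]               // C[i-1][j-1] for the current i
        col[0] = 0                      // a match may start anywhere: top row is 0
        for i in 1...m {
            let substitution = prev + (pattern[i - 1] == text[j - 1] ? 0 : 1)
            prev = col[i]
            col[i] = min(substitution, col[i] + 1, col[i - 1] + 1)
        }
        if col[m] <= k { matches.append(j) }
    }
    return matches
}
```

With $n \approx 10^7$ and $m \approx 10^3$ that inner loop runs about $10^{10}$ times, which is exactly the problem.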

Being almost, but not completely, disheartened by this discovery, I thought: “The Knuth-Morris-Pratt algorithm finds a substring in a string in linear time, while the naive method does it in $ O(nm) $. Is it possible that there is an algorithm like KMP, but for approximate string matching? If we allow for spelling errors, does the problem become so complicated that the runtime of $O(nm)$ cannot be significantly improved?”

After a short search I discovered that a lot of other people had run into the same problem and come up with a lot of great solutions, one of which I am going to discuss.


Myers’s Algorithm

The approach I decided to go with is described in Gene Myers’s article “A Fast Bit-Vector Algorithm For Approximate String Matching Based on Dynamic Programming” (1998). The abstract promises a seductive runtime of $O(kn/w)$, where $k$ is the edit distance we allow, $n$ is the length of the text, and $w$… well, $w$ is the length of the longest integer you’ve got: 32 or 64 bits, depending on the architecture. As it turned out, a factor of $\frac{1}{w}$ is a significant speed-up. Of course, if you set $k$ to 0 you won’t get an instantaneous result, and there is a preprocessing fee of $O(\left \lceil{\frac{m}{w}}\right \rceil \sigma)$, where $\sigma$ is the size of the alphabet… But that’s still a tremendous improvement. Even more so: in Gonzalo Navarro’s article “A Guided Tour to Approximate String Matching” (2001) it is mentioned that, as far as searching for approximate matches under the Levenshtein edit distance goes, Myers’s approach is about as good as it gets.
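
To get a feel for the numbers (using my own figures from above, and assuming a 64-bit word and $k$ of around 10): the naive $O(nm)$ bound works out to roughly $10^7 \times 10^3 = 10^{10}$ cell updates, while $O(kn/w)$ works out to roughly $10 \times 10^7 / 64 \approx 1.6 \times 10^6$ word operations - a difference of close to four orders of magnitude.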

Myers’s article is quite convoluted. Most of it consists of proofs of various lemmas, along with the introduction of a concept called “block-based dynamic programming”. However, the article also contains a step-by-step guide to implementing the method, along with some pseudo-code and actual code samples. This makes it possible to implement the method without diving head-first into its foundations.

However, even though the implementation is easy enough, the result which the method returns requires some additional thinking. Here is why: given a text $ T = t_1t_2...t_n $, a pattern $ P=p_1p_2...p_m $ and an edit distance $ k $, the algorithm finds all values $ j $ such that $ t_1...t_j $ has a suffix $ t_g...t_j $ ($ 1 \leq g \leq j $) for which $ editDistance(t_g...t_j,P) \leq k $. Look at this statement closely: the method doesn’t find the location in the text where the pattern starts or ends matching. It finds a location, and somewhere around that location lies the pattern, within the given edit distance. So the question of which of the suffixes of $ t_1...t_j $ actually match is kinda open.
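
A concrete (made-up) example: take $P$ = “survey” and $k = 2$, and suppose the text contains “…surgery…”. The algorithm will report the position $j$ of the final “y” of “surgery”, because the suffix “surgery” is within distance 2 of “survey” (substitute “g”→“v”, delete one “r”). But it won’t tell you where that suffix starts - only that at least one suffix ending at $j$ is within distance 2.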

Another question that comes to mind is: if there is a match at some edit distance, how confident am I that I found what I was looking for? For instance, “slaughter” is at distance 1 from “laughter”. That’s an important thing to consider, since, as it turns out, one distance gives a spot-on match, while another, not much different one, yields something completely orthogonal to our query.

The answers to these questions are intertwined. I’ll address them shortly; meanwhile, let’s take a look at one curious bug which I encountered while implementing the method.


Implementation Caveat

In this section I’m going to draw your attention to a little inconsistency which I spotted while implementing Myers’s method. Let’s take a look at the following pseudocode (image from the article):
In the loop at lines 15-16 there is no check for $y$ going out of bounds. Since the arrays I use in my implementation are just chunks of memory allocated with malloc, the program usually wouldn’t crash. But as $y$ went into negative values, $Score[y]$ would contain all kinds of garbage. This is an obnoxious and extremely hard-to-find bug, which, however, is easy enough to fix. I added a check for the value of $y$:
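
Schematically, the fix just bounds $y$ from below (a Swift-flavored sketch, not the paper’s exact pseudocode; `score`, `y`, `k` and `w` follow the paper’s naming, and the loop condition is my reading of lines 15-16 - the essential part is the added `y > 1` guard):

```swift
// Sketch of the fix. The loop that walks the last-active-block index y
// downwards must never step past block 1 (blocks are numbered from 1).
// The `score[y] >= k + w` test is my reading of the paper's condition:
// the block's score can no longer drop back down to k.
func lastRelevantBlock(startingAt y0: Int, score: [Int], k: Int, w: Int) -> Int {
    var y = y0
    while y > 1 && score[y] >= k + w {   // `y > 1` is the added bounds check
        y -= 1
    }
    return y
}
```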

From the article it’s not immediately clear whether this check is needed. $y$ is the index of something called a block, and blocks are numbered starting at 1 (pg. 409). So you can’t really get rid of any more blocks once you are at the first one.


Edit Distance Threshold

With a working implementation at hand, it’s now time to return to one of the most important questions: is edit distance a confidence level? What does it mean to have an edit distance of 5, 15, or 300?

A pretty neat way to think about it can be found in Navarro’s article “A Guided Tour to Approximate String Matching” (2001). The author suggests the following (G. Navarro, pg. 40): let $f(m,k)$ be the probability of finding a match given at most $k$ errors. For further convenience, let’s define an error level $\alpha = \frac{k}{m}$. How does $f(m, \alpha m)$ depend on $\alpha$?

The author suggests a way to find the critical $\alpha$ by means of the following experiment: take a randomly generated text and select a piece from it, several hundred characters long. Then look for that pattern in the text, allowing for greater and greater values of $\alpha$. How will $f(m, \alpha m)$ change?
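
A sketch of that experiment (using the naive `approximateMatchEnds` from earlier as the search routine; Myers’s algorithm would of course be the faster choice):

```swift
// Generate a random text, cut a pattern out of it, then count how many
// end positions match as the error level alpha grows.
let alphabet = Array("abcdefghijklmnopqrstuvwxyz ".utf8)
let text = (0..<100_000).map { _ in alphabet.randomElement()! }
let m = 300
let start = Int.random(in: 0...(text.count - m))
let pattern = Array(text[start..<(start + m)])

for alpha in stride(from: 0.1, through: 0.9, by: 0.1) {
    let k = Int(alpha * Double(m))
    let hits = approximateMatchEnds(pattern: pattern, text: text, k: k).count
    print("alpha = \(alpha), matches = \(hits)")
}
```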

As it turns out, for a random text, where each character has an equal chance of appearing, the probability of finding a match stays very low up to $\alpha = 0.8$, and at $\alpha = 0.8$ there are as many matches as there are characters in the text (a 100% chance of a match). Interestingly enough, for a non-random text, “Moby Dick”, this number is around 0.7. I explain the critical $\alpha$ being lower for a non-random text than for a random one as follows: in a random text all characters are equally likely to appear, while in a non-random text some characters are more frequent and others less frequent (this is confirmed by computing the variances of the respective character distributions). Thus, by changing the less frequent characters, we make the text more generic and increase the probability of a match. However, overall there are fewer unique characters, because what makes them unique is how rare they are; that’s why we still have to change a lot - around 70% of the text. There seems to be some sort of equilibrium around this value. I ran the same experiment with the text of “The Catcher in the Rye”, expecting $\alpha$ to be much lower, since the whole book has maybe a hundred words’ worth of vocabulary, but I got the same result: $\alpha = 0.7$. This could be researched some more, but the answer will be the same: for each text there is a value of $\alpha$ beyond which there is no point in searching, since almost anything will match. Well, duh...

What’s more important to note is that as $\alpha$ grows, the number of matches grows exponentially. Let’s take a look at the following two plots:
On these plots the $Y$-axis shows $\log_{10}(matches)$ - the logarithm of the number of matches at a given error level $\alpha$. The trend line is a linear regression. For both plots $R^2$ is close to 1, meaning the trend line fits the data nicely. And since the $Y$-axis is a logarithm, the number of matches grows exponentially with $\alpha$.
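
Spelling this out: if the fit is $\log_{10}(matches) \approx a\alpha + b$, then $matches \approx 10^{b} \cdot 10^{a\alpha}$, i.e. every additional $\frac{1}{a}$ of error level multiplies the number of matches by 10.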

Undoubtedly, more data points would make for a much more ironclad argument. But I didn’t really need that: for me this was one of those problems which are better done once on time than twice correctly. The main takeaway is that the maximum allowed edit distance $k$ must be very, very low. In my application I set it to around 1%-2% of the pattern length ($k=\alpha m$, $\alpha = 0.01$). With this edit distance I am confident that I found what I was looking for. This approach doesn’t exactly solve the “slaughter” vs. “laughter” problem, but I don’t think that one can be solved without making the computer understand what the words actually mean.


Suffix Lookup

Now we know that we shouldn’t allow for very high values of $k$, lest we find some complete gibberish. This fact comes in very handy, when we actually start looking for the suffix.

But first, what does it mean “to look for the suffix”? Remember from the problem statement that the algorithm finds, in the text $T=t_1t_2...t_n$, all positions $j$ for which there is a suffix $t_g...t_j$ such that $editDistance(t_g...t_j, Pattern) \leq k$. So the algorithm returns positions in the text around which the pattern can be found, with at most $k$ errors. By “looking for the suffix” I mean that among all the suffixes of $t_1...t_j$ we try to find one which is at an edit distance no bigger than $k$. As a matter of fact, there can be multiple suffixes matching that criterion around any given position $j$ - for simplicity I’ll focus on a single representative of that set.

The fact that $k$ cannot be very big comes in handy here, because the number of suffixes we have to go through is bounded in terms of $k$. There is no point in looking at anything shorter than $m-k$ characters: going any lower, i.e. to $m-k-1$, would give us a suffix at edit distance $k+1$ in the best case ($k+1$ characters have to be deleted from the pattern to match the suffix). Likewise, there is no point in looking at any suffix longer than $m+k$: for $m+k+1$ the edit distance would again be $k+1$ in the best case - even if $m$ characters match exactly, there are still $k+1$ remaining characters in the suffix, which bring the edit distance to $k+1$. And we want the edit distance to be at most $k$. Hence the limits: we have to look through at most $2k+1$ suffixes.
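
A sketch of that lookup, given an end position `j` reported by the search (1-based, as in the problem statement), and reusing the quadratic `levenshteinDistance` from earlier:

```swift
// Among the suffixes of text[0..<j] with lengths m-k ... m+k, return the
// start index of one that is within distance k of the pattern (or nil).
func matchStart(pattern: [UInt8], text: [UInt8], endPosition j: Int, k: Int) -> Int? {
    let m = pattern.count
    guard m > 0, j >= max(1, m - k) else { return nil }
    var best: (start: Int, distance: Int)? = nil
    for length in max(1, m - k)...min(j, m + k) {
        let start = j - length
        let suffix = Array(text[start..<j])
        let d = levenshteinDistance(pattern, suffix)
        if d <= k && (best == nil || d < best!.distance) {
            best = (start, d)
        }
    }
    return best?.start
}
```

In this sketch I simply keep the suffix with the smallest distance; other tie-breaking rules are equally defensible.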

Suffix lookup is an expensive procedure. For each candidate suffix the edit distance has to be calculated and then compared with the target, so the more efficient the edit distance calculation, the cheaper the lookup. Currently I use the most straightforward, quadratic-time implementation. There is a simple enough way of reducing this to $O(k\times \min(n,m))$, mentioned in Esko Ukkonen’s article “Algorithms for Approximate String Matching”. In some instances, when $k$ is much, much smaller than $m$ (i.e. $\frac{k}{m} \leq 0.1$), it is possible to avoid the lookup completely. Consider, for instance, the situation when you are looking for a book page’s text, several thousand symbols long, in a book’s text, with at most 5-10 errors. The method tells you that there is a match at position $j = 4,589,777$ in the text (it’s a freaking huge book). The end of your page’s text will then be located around position $j$ - a couple of places forwards or backwards might not make that much of a difference.


Putting It All Together

I’ve put all of the above into an Objective-C framework, which currently lives here: https://github.com/g00dm0us3/NiftySubstringSearch

The general usage advice is: keep your $k$ values low, and avoid unnecessary suffix lookups. 

One important implementation note: Myers’s algorithm relies heavily on the concept of an alphabet. In my implementation the alphabet is limited to 256 possible values, so the framework essentially searches for byte sequences. Extending the alphabet structure beyond a simple array and comparing actual character sequences using specialized Unicode algorithms is currently a good to-do item, but not an immediate necessity - it works fine with simple UTF-8 encoded text.


Conclusion

The problem of finding an approximate match for a substring in a text is a “sudden” one. There are many ways to state it, and each statement has its own solution. I call it “sudden” because it seems like there should already be a solution nicely wrapped in a pod, and yet there isn’t - at least not for the statement I described. It calls for an additional, unexpected dive into research.

And more generally, it seems like a lot of people have been surprised by it throughout the history of computer science, which led to the development of many different methods, starting with simple edit-distance-based substring search algorithms, like the one I discussed in this post, and going all the way to intricate machine learning techniques, which can deal with the actual meaning of the text (kinda).