Tuesday, December 29, 2020

Smooth Derivation of the Perspective Projection Matrix

Introduction

So, recently I’ve been toying around with 3D graphics, and decided that I really want to get to the bottom of why the perspective projection matrix looks like this:

$$ \begin{bmatrix} \frac{2n}{r - l} & 0 & - \frac{r+l}{r-l} & 0 \\ 0 & \frac{2n}{t - b} & - \frac{t+b}{t-b} & 0 \\ 0 & 0 & \frac{f+n}{f-n} & - \frac{2fn}{f-n} \\ 0 & 0 & 1 & 0 \end{bmatrix} $$

The answer kind of surprised me. The derivation turned out to be trivial, once I learned what exactly this matrix is used for (I mean in the big picture of 3D rendering). Being sort of, like, a really tiny bit nervous about forgetting all this, I decided to write it down.


Transforms for x & y

When I first attempted to derive this matrix, the only thing I knew was that it’s supposed to transform coordinates from the camera frustum into the NDC coordinate space: $[-1,1]\times[-1,1]\times[-1,1]$.


That’s literally the first thing I read in some answer on SO. Then there was a link to the derivation. I followed it, and got staggered by a page full of middle-school algebra. It looked overwhelming. It wasn’t complicated, but I wasn’t a hundred percent sure I would be able to follow the author’s reasoning. But then I thought: “it’s cool, in its essence it’s just a transform between two linear subspaces”, and I had just read about those. I felt confident, I had Hoffman & Kunze on my side (or so I thought).

So, I started reading into it. I coasted through the derivation of the formulas for the projections.


I mean, from the image above it’s quite clear why for any given x/y the following holds (I’m only going to give the formula for $x$; for $y$ it’s the same):

$$\frac{x_n}{z_n} = \frac{x_g}{z_g} \\ x_n = \frac{z_n x_g}{z_g}$$

Note: $z_n$ denotes $n$, the distance to the near plane.

However, soon I hit a brick wall. There were line equations, and stuff got plugged into those in a manner which I found jarring. I tried to imagine where these lines would lie, and failed. Have you ever had this desperate feeling, when you want to yell at the monitor: “what the hell does that even mean?!”, but you know that no one will answer? I had been drowning in it for a couple of seconds, and then I put two and two together.

By that point I had already projected my $x$ coordinate onto the near plane, and was quite sure that its value had very little reason to be less than $l$ or more than $r$ (or vice versa, depending on your coordinate system orientation). Also, I knew that NDC “wants” it to be between -1 and 1. And then I just happened to know this formula, for rescaling a value from one range into a new one:

$$val_{new}=\frac{new_{high}-new_{low}}{range_{high}-range_{low}}(val_{range}-range_{low})+new_{low}$$

Here $range$ denotes the input range, which we want to transform from; in our case, for the projected coordinate $x_n$, it is $[l,r]$. The $new$ correspondingly denotes the range we are transforming to, in our case the NDC’s $[-1,1]$.

This formula looks almost exactly like a line equation, $y=mx+b$, except that $x = (val_{range} - range_{low})$. That’s why I’ve always felt more comfortable thinking about it in terms of ranges, instead of in terms of lines. It goes something like: “suppose I have a value between 1 and 2, but I want it to be between 0 and 1, what should I do?” Obviously, I have to subtract 1 from my input; this way it will automatically land between 0 and 1. In this manner, toying with simple ranges, I’ve always been able to arrive at the formula above: it’s all about squeezing, stretching, and shifting a little to the left and a little to the right. I feel like it’s best to fiddle with concrete examples here, and leave words like “line equation” out of the picture.
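To make the range fiddling concrete, here is a minimal sketch of that rescaling formula in code (the function name and parameter labels are mine, nothing standard):

```swift
// Rescale `value` from [rangeLow, rangeHigh] into [newLow, newHigh].
func remap(_ value: Float,
           from rangeLow: Float, _ rangeHigh: Float,
           to newLow: Float, _ newHigh: Float) -> Float {
    (newHigh - newLow) / (rangeHigh - rangeLow) * (value - rangeLow) + newLow
}

// The toy example above: a value between 1 and 2 squeezed into [0, 1].
print(remap(1.25, from: 1, 2, to: 0, 1))   // 0.25
```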


However, reiterating on the example above, how do I know that I should subtract and scale, instead of, say, using a function like:

$$ \begin{equation} val_{new} = \begin{cases} 0, & val_{range} < 1.5 \\ 1, & val_{range} \geq 1.5 \end{cases} \end{equation} $$

Or perhaps something less seemingly ludicrous that still has the desired range? Why would I choose a linear transform out of all the possible functions?

Currently, I don’t have an exact answer to this question. I’m fairly certain that it has to do with the fact that a linear transform doesn’t “significantly alter the behavior” of the function it’s applied to. For instance, if I had a function $f(x)$, and I wanted to get a new function:

$g(x) = m f(x)+b$, then the derivative of $g$ would differ from that of $f$ only by a constant multiplier:

$$\frac{dg}{dx}=m\frac{df}{dx}$$

    I’m quite sure that it’s a big deal. I feel like it might be fun to toss the boring rescaling formula out of this derivation, and replace it with something more unstable and fun, but we will stick with what we have for the time being, and do the fun thing in another post.


Anyhow, plugging the equations for $x_n$, $y_n$ into the rescaling formula above yields:


$$ x_{ndc} = \frac{2z_n}{r-l}(\frac{x_g}{z_g})-\frac{r+l}{r-l}\\ y_{ndc} = \frac{2z_n}{t-b}(\frac{y_g}{z_g})-\frac{t+b}{t-b} $$

Note: I omitted some of the trivial transformations you need to arrive at this specific form of the expressions.
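For reference, the omitted steps for $x$ (the $y$ case is identical) are just substituting $x_n = \frac{z_n x_g}{z_g}$, the input range $[l,r]$ and the target range $[-1,1]$ into the rescaling formula:

$$ x_{ndc} = \frac{2}{r-l}\left(\frac{z_n x_g}{z_g} - l\right) - 1 = \frac{2z_n}{r-l}\left(\frac{x_g}{z_g}\right) - \frac{2l}{r-l} - \frac{r-l}{r-l} = \frac{2z_n}{r-l}\left(\frac{x_g}{z_g}\right) - \frac{r+l}{r-l} $$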


Important Digression

To finish up the derivation, at this point we need to recall what we are deriving. Specifically, we are looking for the matrix representing the following transformation (spoiler alert!); strictly speaking, the first three components on the right are obtained only after the product is divided by its fourth component, $z_g$, and that division is exactly what we are about to justify:



$$ \begin{bmatrix} a & b & c & d\\ e & f & g & h\\ i & j & k & l\\ m & n & o & p \end{bmatrix} \left[\begin{array}{c}x_g\\y_g\\z_g\\w\end{array}\right] = \left[\begin{array}{c}x_{ndc}\\y_{ndc}\\z_{ndc}\\z_g\end{array}\right] $$

We have already derived the expressions that take $x_g$ and $y_g$ to the NDC space:

$$ x_{ndc} = \frac{2z_n}{r-l}(\frac{x_g}{z_g})-\frac{r+l}{r-l}\\ y_{ndc} = \frac{2z_n}{t-b}(\frac{y_g}{z_g})-\frac{t+b}{t-b} $$

Plugging them into the matrix gives us:


$$ \begin{bmatrix} \frac{2z_{n}}{r-l} & 0 & -\frac{r+l}{r-l} & 0\\ 0 & \frac{2z_n}{t-b} & -\frac{t+b}{t-b} & 0\\ i & j & k & l\\ m & n & o & p \end{bmatrix} \left[\begin{array}{c}x_g\\y_g\\z_g\\w\end{array}\right] = \left[\begin{array}{c}x_{ndc}\\y_{ndc}\\z_{ndc}\\z_g\end{array}\right] $$

We place the coefficients so that the multiplication will yield:


 $$ x = \frac{2z_n}{r-l}(x_g)-\frac{r+l}{r-l}z_g\\ y = \frac{2z_n}{t-b}(y_g)-\frac{t+b}{t-b}z_g $$

Dividing that by $z_g$ will give us the NDC coordinates. Now, all we have to do is understand why in the world we would divide by $z_g$, or at least find a decent reason for it. To do so, let’s start with an intriguing question.


Isn’t it peculiar that we have a four-component vector to represent something in 3D? What is $w$? The answer to this question lies in understanding what homogeneous coordinates are. Now, understanding this is really quite simple. Imagine a point in space, and a plane which lies above it. We draw a ray from that point; it will cross the plane at some coordinates: $(x_c,y_c,z_c)$. For further convenience, let’s say that this plane lies parallel to the x/y plane of an orthonormal coordinate basis with its origin at the said point, and that it satisfies the equation $z = 1$ in this basis (which, together with the point of origin, gives an affine coordinate system). All of that would look something like this:


Now, when we have said and drawn all this for our convenience, let’s get back to our arbitrary ray, which starts from the aforementioned point and crosses the plane at some coordinates $(x_c, y_c, z_c)$ (point $A$ on the image above). Since our plane satisfies the equation $z = 1$, the coordinates of the crossing will be $(x_c, y_c, 1)$. Now let’s take any other point on this ray, say $B$, with coordinates $(x_l, y_l, z_l)$. It is clear that if we divide each of these by $z_l$, we’d get $(\frac{x_l}{z_l}, \frac{y_l}{z_l}, 1)$, a point on our plane. What is a little less clear is that these will be the coordinates of the very same point $A$. If we take point $C$ and transform its coordinates in a similar manner, we would, again, get the coordinates of $A$. And, even more generally, if we take any point on this ray and divide its coordinates, we would end up in $A$. This property allows us to map (bijectively, I daresay) (almost) each ray originating in point $O$ to a certain point on the plane. The coordinates of a point (say $B$ or $C$) belonging to a certain ray in this construct are called “homogeneous” coordinates (“hom-”: same, “genos”: kin).
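A concrete example, with numbers picked arbitrarily: take $B = (2, 4, 2)$ and $C = (6, 12, 6)$, which lie on the same ray through $O$. Dividing each by its last coordinate gives

$$ \left(\tfrac{2}{2}, \tfrac{4}{2}, \tfrac{2}{2}\right) = (1, 2, 1), \qquad \left(\tfrac{6}{6}, \tfrac{12}{6}, \tfrac{6}{6}\right) = (1, 2, 1) $$

so both are homogeneous coordinates of the same point $A = (1, 2, 1)$ on the plane $z = 1$.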

Homogeneous coordinates are important in 3D graphics since they very naturally model the idea of rendering a 3D scene as observed from a camera’s point of view. In this case the point $O$ is the camera, and the lines going out of this point are rays of light which intersect some object in 3D space. Of course, in reality rays of light go into the camera instead of going out of it, but we will omit this detail as insignificant.

You may have noticed how, in the discussion of homogeneous coordinates, we were talking about triplets: $(x_l, y_l, z_l)$. And here is why: x and y coordinates are enough if we only want a way of mapping rays onto points; however, for rendering a 3D scene we usually also need a z coordinate. Thus we need a four-component vector. But the fact that it has four components doesn’t really change its essence: you still divide (meaning the rendering pipeline does this for you) each of the components by the fourth one, to get what you need:



$$ x_{ndc} = \frac{x}{z_g}\\ y_{ndc} = \frac{y}{z_g}\\ z_{ndc} = \frac{z}{z_g} $$
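In code, the divide the pipeline performs on our behalf is as simple as it sounds. A minimal sketch (not any particular graphics API, just the arithmetic):

```swift
import simd

// Divide x, y, z of a homogeneous vector by its fourth component
// (which our matrix will set to z_g).
func perspectiveDivide(_ v: SIMD4<Float>) -> SIMD3<Float> {
    SIMD3<Float>(v.x, v.y, v.z) / v.w
}
```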

This thought actually gives us the fourth row of the transformation matrix:


$$ \begin{bmatrix} \frac{2z_{n}}{r-l} & 0 & -\frac{r+l}{r-l} & 0\\ 0 & \frac{2z_n}{t-b} & -\frac{t+b}{t-b} & 0\\ i & j & k & l\\ 0 & 0 & 1 & 0 \end{bmatrix} $$

And, even more so, since we know that the $z$ coordinate will inevitably be divided by $z_g$, we can get a head start on deriving the expression for the $z$ coordinate:



$$\frac{ \begin{bmatrix} i & j & k & l \end{bmatrix}\begin{bmatrix} x_{g} & y_{g} & z_{g} & w \end{bmatrix}^T}{z_g} = z_{ndc} $$

Note: later, by the application of the same rescaling formula we used to get the NDC coordinates, the x/y coords will be rescaled from $\left[-1,1\right]\times \left[-1,1\right]$ to $\left[0,screen_{width}\right]\times\left[0,screen_{height}\right]$. The z coordinate will be used in more esoteric ways to create some calming illusions of shadows, volume, and whatnot.
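For illustration, that later x/y rescale is just the same remapping again. A sketch under the assumption of a plain $[0, width]\times[0, height]$ target (real APIs typically also flip y and deal with pixel-center conventions, which I’m ignoring here):

```swift
import simd

// Rescale NDC x/y from [-1, 1] x [-1, 1] to [0, width] x [0, height].
func ndcToScreen(_ ndc: SIMD2<Float>, width: Float, height: Float) -> SIMD2<Float> {
    SIMD2<Float>((ndc.x + 1) / 2 * width,
                 (ndc.y + 1) / 2 * height)
}
```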

Wrapping it up

Using the insights we gained in the previous section, let’s find the remaining row of the matrix, you know, the $i, j, k, l$ in this:
$$ \begin{bmatrix} i & j & k & l \end{bmatrix}\left[\begin{array}{c}x_{g} \\ y_{g} \\ z_{g} \\ w\end{array}\right] $$


While $i$ and $j$ are obviously zero, $k$ and $l$ should be such that:
 $$ \frac{kz_g+lw}{z_g}=z_{ndc} $$ 
as we learned in the previous section. Without loss of generality, we can say that $w = 1$. And furthermore, we know the following (just plugging in the respective limits of the ranges: at $z_g = n$ we want $z_{ndc} = -1$, and at $z_g = f$ we want $z_{ndc} = 1$):
 $$ \begin{cases} k+\frac{l}{n} = -1\\ k+\frac{l}{f} = 1 \end{cases} $$ 
Careful solution of this system yields: 
$$ \begin{cases} k = \frac{f+n}{f-n}\\ l = -\frac{2fn}{f-n} \end{cases} $$ 
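For the record, the “careful solution” is just elimination: subtract the first equation from the second to get rid of $k$, then back-substitute:

$$ \frac{l}{f} - \frac{l}{n} = 2 \;\Rightarrow\; l = -\frac{2fn}{f-n}, \qquad k = 1 - \frac{l}{f} = 1 + \frac{2n}{f-n} = \frac{f+n}{f-n} $$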
These are the final pieces we need to put our matrix together: 
$$ \begin{bmatrix} \frac{2z_n}{r - l} & 0 & - \frac{r+l}{r-l} & 0 \\ 0 & \frac{2z_n}{t - b} & - \frac{t+b}{t-b} & 0 \\ 0 & 0 & \frac{f+z_n}{f-z_n} & - \frac{2fz_n}{f-z_n} \\ 0 & 0 & 1 & 0 \end{bmatrix} $$
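If you want to poke at this matrix in code, here is a sketch using Apple’s simd module. It follows the convention derived in this post (the camera looks down positive z, the pipeline divides by $z_g$, and NDC depth runs from -1 at the near plane to 1 at the far plane), so it is not the stock OpenGL or Metal projection matrix; the function name and parameter labels are mine.

```swift
import simd

// The matrix derived above, built row by row.
func perspectiveMatrix(l: Float, r: Float, b: Float, t: Float,
                       n: Float, f: Float) -> simd_float4x4 {
    simd_float4x4(rows: [
        SIMD4<Float>(2 * n / (r - l), 0, -(r + l) / (r - l), 0),
        SIMD4<Float>(0, 2 * n / (t - b), -(t + b) / (t - b), 0),
        SIMD4<Float>(0, 0, (f + n) / (f - n), -2 * f * n / (f - n)),
        SIMD4<Float>(0, 0, 1, 0)
    ])
}

// Usage: transform a camera-space point, then do the perspective divide.
let P = perspectiveMatrix(l: -1, r: 1, b: -1, t: 1, n: 1, f: 100)
let clip = P * SIMD4<Float>(0.5, 0.25, 10, 1)           // (x_g, y_g, z_g, w = 1)
let ndc = SIMD3<Float>(clip.x, clip.y, clip.z) / clip.w // divide by z_g
print(ndc)
```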

Conclusion

This derivation was built kind of backwards: first we found the expressions for the projections of $x_g$ and $y_g$ onto the near plane (by the way, intrinsically performing the perspective projection), then we rescaled them, and only afterwards did we write down what we are looking for and figure out that we’d have to divide everything by $z_g$. After that I brought in the concept of homogeneous coordinates, to explain the division, and added info about occlusion and shadows, to explain why the matrix is 4x4 (if you omit the depth test, the z coordinate is no longer needed and the matrix would be 3x3). There is a more direct way of deriving this matrix: it involves multiplying several scaling matrices. Find it if you prefer a more direct approach. Derive a matrix for orthogonal projection if you like doing homework. Basically, just follow your heart (responsibly).

Friday, July 10, 2020

Approximate String Matching for iOS Dev.

Introduction.

Approximate string matching is one of the “sudden” problems. I got jumped by it while working on a little OCR pet project. The situation wouldn’t have been so severe if it hadn’t been for the time constraint which usually comes along with any computer-related task.

In short, my problem was finding a very long string (the recognized page text) in a much longer string (the known book’s text). It wouldn’t be much of a pickle had I been looking for a precise match: that would require time which depends linearly on the length of the text. That’s not so bad. But it’s not exactly my case: optical character recognition is a process prone to errors, and I have to account for this. So, what if I want two strings to match, but with some spelling errors?

To approach this question let’s figure out what a “spelling error” is. It’s best to think about those in terms of edit distances. There are many edit distances (Hamming, Levenshtein, Damerau-Levenshtein), but the one that is “ours”, and which is commonly referred to as THE EDIT DISTANCE, is the Levenshtein edit distance. Roughly speaking, edit distances differ from each other by the sets of unit operations that they allow. For instance, the Hamming distance allows only substitution, that is, replacement of characters in corresponding positions in the two strings. The more replacements there are, the larger the distance between the two strings, and the more spelling errors. The Levenshtein distance allows substitutions, insertions and deletions (for example, “laughter” is at Levenshtein distance 1 from “slaughter”: one insertion). It has been found to be particularly useful in spellchecking, which is kind of what I need.

The Levenshtein distance between two strings can be calculated straightforwardly using dynamic programming. The setup of the problem is the following: we have a d.p. matrix $C$, the first string $s_1$ and the second string $s_2$, and we want to know the edit distance between the two. String one is indexed by $i$, string two by $j$; the extra zeroth row and column of $C$ stand for the empty prefixes ($C[i,0]=i$, $C[0,j]=j$). We then traverse the matrix columnwise, writing values into it using the following rule: $ C[i, j] = \min(C[i-1,j]+1, C[i,j-1]+1, C[i-1,j-1]+1_{s_1[i] \neq s_2[j]}) $. The resulting distance thus ends up in the bottom-right cell of $C$. Putting it in code I got the following:
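(The original listing isn’t reproduced here, so below is a minimal Swift sketch of the same recurrence; treat it as a reconstruction of the idea, not the exact code from the post.)

```swift
// Levenshtein distance via the dynamic programming matrix C described above.
func levenshteinDistance(_ a: String, _ b: String) -> Int {
    let s = Array(a), t = Array(b)
    if s.isEmpty { return t.count }
    if t.isEmpty { return s.count }

    // Extra row and column correspond to empty prefixes: C[i][0] = i, C[0][j] = j.
    var C = [[Int]](repeating: [Int](repeating: 0, count: t.count + 1),
                    count: s.count + 1)
    for i in 0...s.count { C[i][0] = i }
    for j in 0...t.count { C[0][j] = j }

    for i in 1...s.count {
        for j in 1...t.count {
            C[i][j] = min(C[i - 1][j] + 1,                                  // deletion
                          C[i][j - 1] + 1,                                  // insertion
                          C[i - 1][j - 1] + (s[i - 1] == t[j - 1] ? 0 : 1)) // match / substitution
        }
    }
    return C[s.count][t.count]
}

print(levenshteinDistance("laughter", "slaughter"))   // 1
```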

Easy enough, right? However, it’s clear that the runtime of this code is $O(nm)$. The code above calculates the distance between two strings, but the situation doesn’t change much when we restate the problem and start looking for a substring in a text: the naive approach still yields a runtime of $ O(nm) $, and given that in my case $n$ would be on the order of $10^7$ and $m$ on the order of $10^3$, this approach would limit my target audience to some very, very patient people.

Being almost, but not completely, disheartened by this discovery, I thought: “The Knuth-Morris-Pratt algorithm finds a substring in a string in linear time, while the naive method does it in $ O(nm) $. Is it possible that there is an algorithm like KMP, but for approximate string matching? If we allow for spelling errors, does the problem become so complicated that the runtime of $O(nm)$ cannot be significantly improved?”

After a short search I discovered that a lot of other people had had the same problem and came up with a lot of great solutions, one of which I am going to discuss.


Myers’s Algorithm

The approach I decided to go with is described in the article by Gene Myers, “A Fast Bit-Vector Algorithm For Approximate String Matching Based on Dynamic Programming” (1998). The abstract of the article promises a seductive runtime of $O(kn/w)$, where $k$ is the edit distance we allow, $n$ is the length of the text, and $w$… Well, $w$ is the machine word size: 32 or 64 bits, depending on the architecture. As it turned out, a factor of $\frac{1}{w}$ is a significant speed-up. Of course, if you set $k$ to 0 you won’t get an instantaneous result. And of course there is a preprocessing fee of $O(\left \lceil{\frac{m}{w}}\right \rceil \sigma) $, where $\sigma$ is the size of the alphabet… But that’s still a tremendous improvement. Even more so, in the article by Gonzalo Navarro, “A Guided Tour to Approximate String Matching” (2001), it is mentioned that as far as looking for approximate matches in terms of the Levenshtein edit distance goes, Myers’s approach is about as good as it gets.

Myers’s article is quite convoluted. Most of it consists of proofs of different lemmas, along with the introduction of a concept called “block-based dynamic programming”. However, the article also contains a step-by-step guide to the implementation of the method, along with some pseudo-code and actual code samples. This makes it possible to implement the method without diving head-first into its foundations.

However, even though the implementation is easy enough, the result which the method returns requires some additional thinking. And here is why: given a text $ T = t_1t_2...t_n $, a pattern $ P=p_1p_2...p_m $ and an edit distance $ k $, the algorithm finds all values $ j $ such that for $ t_1...t_j $ there exists a suffix $ t_g...t_j $ ($ 1 \leq g \leq j$) for which $editDistance(t_g...t_j,P) \leq k $. Look at this statement closely: the method doesn’t find the location in the text where the pattern started or ended matching. It finds a location, and somewhere around that location lies the pattern, within the given edit distance. So the question of which suffixes out of all the suffixes of $ t_1...t_j $ are matches is kind of open.

Another question that comes to mind is: if there is a match at some edit distance, then how confident am I that I found what I was looking for? For instance, “slaughter” is at distance 1 from “laughter”. That’s an important thing to consider, since, as it turns out, one distance gives a spot-on match, while another, not much different one, yields something completely orthogonal to our query.

The answers to these questions are intertwined. I’ll address them shortly; meanwhile, let’s take a look at one curious bug which I encountered while implementing the method.


Implementation Caveat

In this section I’m going to rivet your attention to a little inconsistency which I spotted while I was implementing Myers’s method. Let’s take a look at the following pseudocode (image from the article):
In the loop at lines 15-16 there is no check for $y$ going out of bounds. Since the arrays which I use in my implementation are just chunks of memory allocated with malloc, the program usually wouldn’t crash. But as $y$ went to negative values, $Score[y]$ would contain all kinds of garbage. This is an obnoxious and extremely hard to find bug, which, however, is easy enough to fix. I added a check for the value of y:

From the article it’s not immediately clear whether this check is needed. “y” is an index of something called a block, and blocks are numbered starting at 1 (pg. 409). So, you can’t really get rid of any more blocks once you are at the first one.


Edit Distance Threshold

With the working implementation at hand it’s now time to return to one of the most important questions - is edit distance a confidence level? What does it mean to have an edit distance of 5, 15, or 300?

A pretty neat way to think about it can be found in Navarro’s article “A Guided Tour to Approximate String Matching” (2001). The author suggests the following (G. Navarro, pg. 40): let $f(m,k) $ be the probability of finding a match given at most k errors. For further convenience let's define an error level $\alpha = \frac{k}{m}$. How does $f(m, \alpha m)$ depend on $\alpha$?

The author suggests a way to find $\alpha$ by means of the following experiment: let’s take a randomly generated text, and select a piece from it, several hundred characters long. Then let’s look for that piece (the pattern) in the text, allowing for greater and greater values of $\alpha$. How will $f(m, \alpha m)$ change?

As it turns out, for a random text, where each character has an equal chance of appearing, the probability of finding a match would be very low up to $\alpha = 0.8$, and at $\alpha = 0.8$ there would be as many matches as there are characters in the text (a 100% chance of a match). Interestingly enough, for a non-random text, “Moby Dick”, this number will be around 0.7. I explain this critical $\alpha$ being lower for non-random text than for random text as follows: in a random text all characters are equally likely to appear, while in a non-random text some characters are more frequent and others less frequent (this is confirmed by calculating the variances of the respective character distributions). Thus, by changing the less frequent characters, we make the text more generic and increase the probability of a match. However, overall there are fewer unique characters, because what makes them unique is how rare they are. That’s why we still have to change a lot: around 70 percent of the text. There seems to be some sort of equilibrium around this value. I did the same experiment with the text of “Catcher In the Rye”, expecting $\alpha$ to be much lower, since in the whole book there are about one hundred words’ worth of vocabulary, but I got the same result: $\alpha = 0.7$. This can be researched some more, however the answer will be the same: for each text there is a value of $\alpha$ beyond which there is no point in searching, since almost anything will match. Well, duh...

What’s more important to note is that as alpha grows the number of matches grows exponentially. Let’s take a look at the following two plots:
On these plots, along the $Y$-axis is $\log_{10}(\text{matches})$: the logarithm of the number of matches given the error level $\alpha$. The trend line is a linear regression. For both plots $R^2$ is close to 1, meaning that the trend line fits the data nicely. And given the fact that $Y$ is a logarithm, the number of matches grows exponentially with the growth of $\alpha$.

Undoubtedly, more points would give a much more ironclad proof. But I didn’t really need that. For me this problem was one of those things which are better done once on time than twice correctly. The main takeaway should be: the maximum allowed edit distance $k$ must be very, very low. In my application I set it to around 1%-2% of the pattern length ($k=\alpha m$, $\alpha = 0.01$). With this edit distance I am confident that I found what I was looking for. This approach doesn’t exactly solve the problem of “slaughter” vs. “laughter”, but I don’t think that can be solved without making the computer understand what the words actually mean.


Suffix Lookup

Now we know that we shouldn’t allow for very high values of $k$, lest we find some complete gibberish. This fact comes in very handy, when we actually start looking for the suffix.

But first, what does it mean “to look for the suffix”? Well, remember from the problem statement that the algorithm finds, in the text $T=t_1t_2...t_n$, all positions $j$ for which there is a suffix $t_g...t_j$ such that $editDistance(t_g...t_j, Pattern) \leq k$. So, the algorithm returns positions in the text around which the pattern can be found, with at most $k$ errors. By saying “look for the suffix”, I mean that among all of the suffixes of $t_1...t_j$, we try to find one which is at an edit distance no bigger than $k$. As a matter of fact, there can be multiple suffixes matching that criterion around any given position $j$; for simplicity I’ll focus on a single representative of that set.

The fact that $k$ cannot be very big comes in handy here, because the number of suffixes we have to go through is actually bounded by $k$. There is no point in looking at anything shorter than $m-k$, because going any lower, e.g. $m-k-1$, would give us a suffix at edit distance $k+1$ in the best case ($k+1$ characters have to be deleted from the pattern to match the suffix). Also, there is no point in looking at any suffix longer than $m+k$, because, e.g. for $m+k+1$, the edit distance would again be $k+1$ in the best case: even if $m$ characters match exactly, there are still $k+1$ remaining characters in the suffix, which would bring the edit distance to $k+1$. And we want the edit distance to be at most $k$. Hence the limits. Thus we have to look through at most $2k+1$ candidate suffixes.
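Here’s a sketch of that lookup (a helper of my own, not the framework’s code), reusing the quadratic `levenshteinDistance` from the Introduction. Given the end position $j$ reported by the search, it tries candidate suffix lengths from $m-k$ to $m+k$ and returns the start of the first one within distance $k$:

```swift
// Find a suffix of text[...j] whose edit distance to `pattern` is at most k.
// Returns the start index of the first such suffix, or nil if none qualifies.
func findMatchingSuffix(in text: [Character], endingAt j: Int,
                        pattern: [Character], maxDistance k: Int) -> Int? {
    let m = pattern.count
    guard m > 0, j < text.count else { return nil }
    for length in max(1, m - k)...(m + k) {
        let start = j - length + 1
        guard start >= 0 else { continue }
        if levenshteinDistance(String(text[start...j]), String(pattern)) <= k {
            return start
        }
    }
    return nil
}
```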

Suffix lookup is an expensive procedure. For each candidate suffix the edit distance has to be calculated and then compared with the target. Thus the more efficient the edit distance calculation, the cheaper the lookup. Currently I use the most straightforward, quadratic-time implementation. There is a simple enough way of reducing this time to $O(k\times min(n,m))$, mentioned in Esko Ukkonen’s article “Algorithms for Approximate String Matching”. In some instances, when $k$ is much, much smaller than $m$ (e.g. $\frac{k}{m} \leq 0.1$), it is possible to avoid the lookup completely. Consider, for instance, the situation where you are looking for a book page’s text, several thousand symbols long, in a book’s text, with at most 5-10 errors. The method tells you that there is a match at position $j = 4,589,777$ in the text (it’s a freaking huge book). Your page (the end of its text) will then be located around position $j$: a couple of places forwards or backwards might not make that much of a difference.


Putting It All Together

I’ve put all of the above into an Objective-C framework, which currently lives here: https://github.com/g00dm0us3/NiftySubstringSearch

The general usage advice is: keep your $k$ values low, and avoid unnecessary suffix lookups. 

One important implementation note: Myers’s algorithm relies heavily on the concept of an alphabet. In my implementation the alphabet is bounded by 256 possible values, and thus the framework basically looks for byte sequences. Extending the alphabet structure beyond a simple array, and comparing actual character sequences using specialized Unicode algorithms, is currently a good todo item, but not an immediate necessity. It works fine with simple UTF-8 encodings.


Conclusion

The problem of finding an approximate match for a substring in a text is a “sudden” one. There are many ways to state it, and each statement will have its own solution. I call it “sudden” because it seems like there should already be some solution, nicely wrapped in a pod, and yet there isn’t. At least not for the statement I described. It calls for an additional, unexpected dive into research.

And more generally, it seems like a lot of people have been surprised by it throughout the history of computer science, which led to the development of many different methods, starting with the simple edit-distance based substring search algorithms, like the one I discussed in this post, and going all the way to the intricate machine learning techniques, which can deal with the actual meaning of the text (kinda).