Tuesday, December 29, 2020

Smooth Derivation of the Perspective Projection Matrix

Introduction

So, recently I’ve been toying around with 3D graphics, and decided that I really want to get to the bottom of why does perspective projection matrix look like this:

$$ \begin{bmatrix} \frac{2n}{r - l} & 0 & - \frac{r+l}{r-l} & 0 \\ 0 & \frac{2n}{t - b} & - \frac{t+b}{t-b} & 0 \\ 0 & 0 & \frac{f+n}{f-n} & - \frac{2fn}{f-n} \\ 0 & 0 & 1 & 0 \end{bmatrix} $$

The answer kind of surprised me. The derivation was trivial, once I learned what exactly this matrix is used for. I mean in the big picture of 3D rendering. Being sort of, like a really tiny bit nervous about forgetting all this I decided to write it down.


Transforms for x & y

When I first attempted to derive this matrix the only thing I knew was that it’s supposed to transform coordinates from camera frustum into an NDC coordinate space: $[-1,1]\times[-1,1]\times[-1,1]$.


That’s literally the first thing I read on some answer on SO. Then there was a link to the derivation. I followed it, and got staggered by a page full of some middle-school algebra. It looked overwhelming. It wasn’t complicated, but I wasn’t a hundred percent sure I’m going to be able to follow author’s reasoning. But then I thought: “it’s cool, in its essence it’s just a transform between two linear subspaces” – and I just read about those. I felt confident, I had Hoffman & Kuntz on my side (I thought so).

So, I started reading into it. Coasted through the derivation of formulas for  projections.


I mean from the image above it’s quite clear why for any given x/y the following holds (I’m gonna give only formula for ‘x’, for ‘y’ it’s the same):

$$\frac{x_n}{z_n} = \frac{x_g}{z_g} \\ x_n = \frac{z_n x_g}{z_g}$$

Note: $z_n$ denotes $n$, or near plane.

     However, soon I hit a brick wall. There were line equations, and stuff got plugged into those, in a manner, which I found jarring. I tried to imagine, where these lines would lie, and failed. Have you ever had this desperate feeling, when you want to yell at the monitor: “what the hell does that even mean?!”, but you know that no one will answer? I had been drowning in it for a couple of seconds and then I put two and two together.

    By that point I’ve already projected my $x$ coordinates onto the near plane, and was quite sure that its value had very little reason of being less then $l$ or more than $r$(or vice a versa depending on your coordinate system orientation). Also, I knew that NDC “wants” it to be in between -1 and 1. And then I just happen to know this formula, for rescaling a value in one range into a new one:

$$val_{new}=\frac{new_{high}-new_{low}}{range_{high}-range_{low}}(val_{range}-range_{low})+new_{low}$$

Here $range$ denotes the input range, which we want to transform from, in our case for $x_g$ coordinate it is $[r,l]$; the $new$ correspondingly denotes the range we are transforming to, in our case it’s NDCs $[-1,1]$.

    This formula looks almost exactly like a line equation: $y=mx+b$, except for the fact that $x = (val_{range} – range_{low})$. That’s why I’ve always felt more comfortable thinking about it in terms of ranges, instead of in terms of lines. It goes something like: “suppose I have a value between 1 and 2, but I want it to be between 0 and 1, what should I do?” Obviously, I have to subtract 1 from my input – this way it will automatically lay between 0 and 1. In this manner, toying with simple ranges, I’ve always been able to arrive at the formula above – it’s all about squeezing and stretching and shifting a little to left and a little to the right. I feel like here it’s best to fiddle with the concrete examples, and leave the words like “line equation” out of the picture.


    However, reiterating on the example above, how do I know that I should subtract and scale, instead of, say use a function like:

$$ \begin{equation} val_{new} = \begin{cases} 0, & val_{range} < 1.5 \\ 1, & val_{range} \geq 1.5 \end{cases} \end{equation} $$

Or perhaps something, that is less seemingly ludicrous, and has the desired range? Why would I choose linear transform, out of all the possible functions? 

    Currently, I don’t have the exact answer for this question. I’m fairly certain, that it has to do with the fact that linear transform doesn’t “significantly alter the behavior" of the function it’s applied to. For instance, if I had a function $f(x)$, and I wanted to get a new function:

$g(x) = m*f(x)+b$, then the derivative of g would be different from f only by a constant multiplier:

$$\frac{dg}{dx}=m\frac{df}{dx}$$

    I’m quite sure that it’s a big deal. I feel like it might be fun to toss the boring rescaling formula out of this derivation, and replace it with something more unstable and fun, but we will stick with what we have for the time being, and do the fun thing in another post.


Anyhow, plugging the equations for $x_n$, $y_n$ in the rescaling formula above yields:


$$ x_{ndc} = \frac{2z_n}{r-l}(\frac{x_g}{z_g})-\frac{r+l}{r-l}\\ y_{ndc} = \frac{2z_n}{t-b}(\frac{y_g}{z_g})-\frac{t+b}{t-b} $$

Note: I omitted some of the trivial transformations you need to arrive at this specific form of expressions.


Important Digression

To finish up the derivation, at this point, we need to recall what we are deriving. Specifically, we are looking for the matrix representing the following transformation (spoiler alert!):



$$ \begin{bmatrix} a & b & c & d\\ e & f & g & h\\ i & j & k & l\\ m & n & o & p \end{bmatrix} \left[\begin{array}{c}x_g\\y_g\\z_g\\w\end{array}\right] = \left[\begin{array}{c}x_{ndc}\\y_{ndc}\\z_{ndc}\\z_g\end{array}\right] $$

We have already partially filled out this matrix with the expressions to transform $x_g$ and $y_g$ to the NDC space:

$$ x_{ndc} = \frac{2z_n}{r-l}(\frac{x_g}{z_g})-\frac{r+l}{r-l}\\ y_{ndc} = \frac{2z_n}{t-b}(\frac{y_g}{z_g})-\frac{t+b}{t-b} $$

Plugging them into matrix gives us:


$$ \begin{bmatrix} \frac{2z_{n}}{r-l} & 0 & -\frac{r+l}{r-l} & 0\\ 0 & \frac{2z_n}{t-b} & -\frac{t+b}{t-b} & 0\\ i & j & k & l\\ m & n & o & p \end{bmatrix} \left[\begin{array}{c}x_g\\y_g\\z_g\\w\end{array}\right] = \left[\begin{array}{c}x_{ndc}\\y_{ndc}\\z_{ndc}\\z_g\end{array}\right] $$

We place the coefficients so, that multiplication will yield:


 $$ x = \frac{2z_n}{r-l}(x_g)-\frac{r+l}{r-l}z_g\\ y = \frac{2z_n}{t-b}(y_g)-\frac{t+b}{t-b}z_g $$

    Dividing that by $z_g$ will give us the NDC coordinates. Now, all we have to do is understand, why in the world would we divide by $z_g$. Or at least find a decent reason to write it off to. To do so, let’s start with an intriguing question.


    Isn’t it peculiar, that we have a four component vector, to represent something in 3D? What is $w$? The answer to this question lies in understanding, what are homogeneous coordinates. Now, understanding this is really quite simple. Imagine a point in space, and a plane, which lies above it. We draw a ray from that point- it will cross the plane at some coordinates: $(x_c,y_c,z_c)$. For further convenience, let’s say, that this plane lies parallel to x/y plane of the orthonormal coordinate basis with the beginning at the said point, and that it satisfies an equation: $z = 1$, in this basis (which together with the point of origin gives an affine coordinate system). All of that would look something like this:


    Now, when we have said and drawn this for our convenience, let’s get back to our arbitrary ray, which starts from the aforementioned point, and crosses a plane at some coordinates $(x_c, y_c, z_c)$ (point $A$ on the image above). Since our plane satisfies the equation $z = 1$, the coordinates of the crossing will be $(x_c, y_c, 1)$. Now let’s take any other point on this line, say $B$, with coordinates: $(x_l, y_l, z_l)$. It is clear, that if we divide each of these by $z_l$, we’d get $(\frac{x_l}{z_l}, \frac{y_l}{z_l}, 1)$ – a point on our plane. What is a little less clear is that these will be the coordinates of the very same point $A$. If we take point $C$, and transform its coordinates in a similar manner, we would again, get the coordinates of $A$. And, even more generally, if we take any point on this line, and divide its coordinates we would end up in $A$. This property, allows us to map (bijectively I daresay) (almost) each ray originating in point $O$ to a certain point on a plane. The coordinates of a point (say $B$ or $C$) belonging to a certain ray in this construct are called “homogeneous” coordinates (“hom-” – same, “genos” – kin).

    Homogeneous coordinates are important in 3D graphics since they very naturally model the idea of rendering a 3D scene as being observed from a camera point of view. In this case a point $O$ will be the camera and lines going out of this point – are rays of light, which intersect some object in a 3D space. Of course, in reality rays of light go into the camera, instead of going out of it, but we will omit this detail, as insignificant.

    You may have noticed, how in the discussion of the homogeneous coordinates we were talking about triplets: $(x_l, y_l, z_l)$. And here is why – x and y coordinates are enough, if we want a way of mapping rays onto points, however, for rendering a 3D scene we would usually need a z coordinate.  Thus we would need a four component vector. But the fact that it’s four component doesn’t really change its essence – you still divide (meaning rendering pipeline does this for you) each of the components by the fourth one, to get what you need:



$$ x_{ndc} = \frac{x}{z_g}\\ y_{ndc} = \frac{y}{z_g}\\ z_{ndc} = \frac{z}{z_g} $$

This thought actually gives us the fourth row of the transformation matrix:


$$ \begin{bmatrix} \frac{2z_{n}}{r-l} & 0 & -\frac{r+l}{r-l} & 0\\ 0 & \frac{2z_n}{t-b} & -\frac{t+b}{t-b} & 0\\ i & j & k & l\\ 0 & 0 & 1 & 0 \end{bmatrix} $$

And, even more so – since we know that $z$coordinate will be inevitable divided by $z_g$, we can get a head start on deriving expression for $z$ coordinate:



$$\frac{ \left[\begin{array}& i & j & k & l\end{array}\right]\left[\begin{array}& x_{g} & y_{g} & z_{g} & w\end{array}\right]^T}{z_g} = z_{ndc} $$

Note: later by the application of the same rescaling formula we used to get NDC coordinates, x/y coords will be rescaled from $\left[-1,1\right]\times \left[-1,1\right]$ to $\left[0,screen_{width}\right]\times\left[0,screen_{height}\right]$. Z coordinate will be used in more esoteric ways to create some calming illusions, of shadows, volume and whatnot.

Wrapping it up

Using the insights we gained in the previous section, let’s find the final row of a matrix, you know, the i,j,k,l in this:
$$ \left[\begin{array}& i & j & k & l\end{array}\right]\left[\begin{array}{c}x_{g} \\ y_{g} \\ z_{g} \\ w\end{array}\right] $$


While i and j are obviously zero, k and l should be so, that:
 $$ \frac{kz_g+lw}{z_g}=z_{ndc} $$ 
as we learned in the previous section. Without loss of generality, we can say that w = 1. And furthermore, we know the following (just plugging in respective limits of ranges):
 $$ \begin{cases} k+\frac{l}{n} = -1\\ k+\frac{l}{f} = 1 \end{cases} $$ 
Careful solution of this system yields: 
$$ \begin{cases} k = \frac{f+n}{f-n}\\ l = -\frac{2fn}{f-n} \end{cases} $$ 
These are the final pieces we need to put our matrix together: 
$$ \begin{bmatrix} \frac{2z_n}{r - l} & 0 & - \frac{r+l}{r-l} & 0 \\ 0 & \frac{2z_n}{t - b} & - \frac{t+b}{t-b} & 0 \\ 0 & 0 & \frac{f+z_n}{f-z_n} & - \frac{2fz_n}{f-z_n} \\ 0 & 0 & 1 & 0 \end{bmatrix} $$

Conclusion

    This derivation was built kind of backwards – firstly we found the expressions for the projections of $x_g$ and $y_g$ onto the near plane (by the way intrinsically performing perspective projection), then we rescaled them, and only afterwards we have written down what we are looking for, and figured, that we’d have to divide by everything by $z_g$. After that I placed into all of this a concept of homogenous coordinates – to explain the division, and added info about occlusion and shadows, to explain why matrix is 4x4 (if you omit depth test, z coordinate is no longer needed and the matrix would be 3x3). There is a more direct way of deriving this matrix – it involves multiplying several scaling matrices. Find it if you prefer a more direct approach. Derive a matrix for orthogonal projection if you like doing homework. Basically, just follow your heart (responsibly).