Rasterization: Software Rasterization Pipeline

A painting of a rasterization pipeline by DALL-E

A painting of a rasterization pipeline by DALL-E 2

This post will simply be an outline of my plans for my CPU software rasterization project.

Scope

This post will primarily concern just the rasterizer stage of a triangle-rasterizer software rendering pipeline. That is to say, it will mostly not discuss vertex shading or pixel shading, nor any other stage before or after rasterization. However, in the rasterization I also plan on including several sub-stages such as frustum culling, binning, occlusion culling, hierarchical Z-buffering, and so on. Some closely related systems will also be described such as the instancing and batched drawing approach.

Features

For perfomance reasons I will attempt to make as good use of SIMD, multithreading and cache efficiency as possible. This will impact several aspects of the design of the rasterizer. As an experiment I plan on writing a small library to use 16 element wide SIMD operations for integer single precision floating point data, similar to AVX-512 but implemented using AVX and AVX2, as well as some FMA instructions. The expected advantage of this is that each operand will correspond to an entire cache line, making full use of the cache capacity, and simplifying the task of making sure data is contiguous, not straddling cache line boundaries, and preventing false sharing between threads. It should also make it relatively straight forward to estimate the number of cache lines used in various parts of the code. The cache line count could be used as a heuristic, or parameter, for tuning data batch and block sizes during development.

Sub-stages

Frustum culling
Backface culling
Triangle setup ( some data may be re-used from the culling steps )
Triangle binning (low-res conservative rasterization)
Occlusion culling
Rasterization and depth testing

Input and output

The input to the procedure will be a pointer to a buffer of triangle vertex positions, a pointer to a buffer of triangle vertex indices, a triangle count, and triange index offset. The triangle input offset will be used to handle instancing and batched rendering. The output will be a Z-buffer to that can be used by the next shading steps, material calculation and so on, and a visibility buffer, which is simply a buffer, or image, of triangle indices. The indices will also have the instance index baked into them (using the triangle index offset), so the index will have to be decoded to find the actual triangle index for the mesh.

Frustum culling details

There are some notes on frustum culling on the ryg blog here

Backface culling details

When we have access to the normal, or equivalent information, we can simply backface cull triangles by computing a single dot product by the view direction and the triangle normal.

Triangle setup details

For computing triangle overlap one will need to compute the triangle edge functions. It is possible to save some work by making sure to only compute the parts that are invariant over the triangle just once.

Triangle binning details

Triangle binning basically amounts to performing overestimated conservative rasterization, rasterizing triangles over tiles instead of pixels. A pixel is usually rasterized by sampling a single point, but a tile is a screen space axis aligned rectangular region. The basic sampling step for this is essentially a rectangle-triangle overlap test. This is essentially overestimated conservative rasterization. In the 2D case it is possible to determine overlap by testing whether the tile is fully outside one of the triangle edges or the tile is fully outside the triangle bounding box. In the 3D case we do not always have appropriate bounding boxes for the triangles. Instead it is possible to instead use the test of whether the tile is fully outside one of the triangle edges or the triangle is fully outside one of the tile edges. This can be made efficient for a grid of tiles (where many edges are shared), and generalized to 3D and clipless rasterization where the edges become planes for both triangles and tiles.

The test is a series of dot products. The tile edges are axis aligned, so it is possible to omit one coordinate. For 2D this becomes a one-coordinate test. For 3D it becomes a two-coordinate test. The vertical tile edges have normals with zero $y$ (vertical) coordinate. Since we do not care about the length of the normal, just the sign of the dot products, we can just rotate the vector from the camera to where the edge crosses the screen coordinate axes by a quarter turn. Conceptually it is just an optimization of taking the cross product of two adjacent corner positions of the tile.

The overlap test between the tile and the triangle in 3D is closely related to frustum culling with a skew frustum. However, because there are many tiles and they all share their edges, it is better to reuse as much computation and testing as possible instead of doing many full skew frustum - tetrahedon overlap tests.

It is also worth remembering that the edges of the tiles can be defined to lie exactly on the sample point location, reducing the area of the tiles slightly compared to the case where the edges were taken to line up with the pixel borders. However, this requires one to be extra careful with numerical precision and rasterization rules.

Acceptance and rejection tests

There are three basic outcomes of the overlap test. Full acceptance, full rejection, and partial overlap. If at any point there is full acceptance or rejection, then that branch of the computation can complete, and some savings can be had. The cases where there is full overlap must be split into two cases based on whether it is the triangle that fully covers the tile or vice versa. The outcome can be broken down into these cases:

The triangle fully covers the tile ( all triangle vertices are “outside” but around the tile )
The tile fully covers the triangle ( All triangle vertices are inside the tile )
The triangle is fully outside the tile
The triangle and tile partially overlap

For the last case it is sufficient but not neccessary that some of the triangle vertices are inside the tile. But is is neccessary that some edge crossings occurr.

When evaluating the edge function of a triangle on the tile corners it is sufficient to only test the nearest corner, which can be selected using the direction of the edge normal.

Occlusion culling details

For occlusion culling it makes sense to use a hierarchical depth buffer using conservative depth for all but the highest resolution level. If we at any level can find a set of tiles that completely conver the object to be drawn, e.g a triangle or bounding box, and all the depth values in those tiles have a depth value that is nearer than the nearest point of that object, then we can safely cull it. To find the nearest depth value of a triangle one can use rasterization, or one can compute the nearest depth of the camera space bounding box without needing to compute the other values.

Rasterization and depth testing details

Visibility buffer details

every triangle index is an unsigned 32-bit integer value that represents the triangle index of the currently rendered mesh plus a triangle index offset given as a paramter to the rasterization function.

Implementation details

Test cases

Triangle rendering test cases

Crossing frustum clipping planes
Trapped in a box ( all pixels covered )
Ground planes (large triangles with perfectly vertical normals)
Depth order ( overlapping triangles behind eachother rendered in various orders )
Behind camera
Behind “near plane”
Backfacing triangles
Intersecting triangles
Crossing camera view direction plane
Two triangles ( fullscreen )
One triangle ( fullscreen )
Triangle orientations: Side/edge on triangle

Footnotes and references:

Visibility Buffer Rendering with Material Graphs, by John Hable

Rasterization on Larrabee, by Michael Abrash

Conservative rasterization, By Tomas Akenine-Moller and Timo Aila