d-buffer: letting a third dimension back in...
It
has been over thirty years since Wolfgang Straßer (in early 1974) and
later Edwin Earl Catmull introduced us to the concept of Z-buffering as
a way of "flattening" 3D objects for representation on a 2D surface of
a display device. Their idea was so simple and yet so efficient that it
is the most widely used and what is more important hardware
accelerated visible surface problem solution to date. But what if we
wanted to have multiple surfaces visible for each picture element? What
if we wanted to store arbitrary number of data elements per-pixel? What
if we wanted to account for the fact that the world around us is not
flat nor perfectly regularly distributed between two distinguishable
points of space?
The facts represented by above questions are
painfully obvious but not as simple to solve in the rigid world of
highly parallel graphics hardware. Some time ago I have decided to jump
on the "real-3D" band wagon, to move beyond the world of a single
surface per pixel. The road was long and bumpy, but it seams that I
have at last found the efficient solution to these problems. As it turns out a
simple idea is most probably the efficient one just like in the
legendary Z-buffer's case.
Read on for the rest of the description of my not so clever storage optimization.
Acknowledgments:
To prof. Przemysław Rokita for encouraging me to keep working on the idea presented herein.
To prof. Karol Myszkowski for suggesting the idea of hierarchical optimizations based on count buffer contents.
Kudos:
To
J., C. Yang, J. Hensley, H. Gruen, and N. Thibieroz for their inspiring
work on "Real-Time Concurrent Linked List Construction on the GPU".
To H. Gruen and N. Thibieroz for their informational presentation on "
OIT and Indirect Illumination using DX11 Linked Lists".
To C. Crassin whose
blog pointed me towards the above presentation.
And last but not least to OpenGL ARB for finally making my dreams a core feature of the specification.
d-buffer:
Why
the "d"? Simply but not understandably from the "deque" -- both for its
similarity in pronunciation to the deck of cards (as a linear
arrangement
of elements from a multi-dimensional space -- for instance: rank and
suit) and frequently abused data structure. Why not "A"? First because
this technique was not designed with anti-aliasing at its core, nor it
has anything to do with a traditional list representation of an
A-buffer. Most simply put it is an irregular array of arrays. Then why
not Per-Pixel Fragment Tables? I feel overwhelmed by capitals in other
acronyms already and "d" is more catchy.
Problem:
Find a way
to store multiple fragments (that is contributions form different
polygons) in each pixel that is both efficient in terms of memory used
and reference order discovery.
Solution:
Use a preliminary fragment counting pass which finds both:
- total number of fragments generated for all pixels as well as
- the number of fragments that need to be stored for each individual pixel.
Pros:
- By
drawing into the count buffer you get the information about the number
of fragments that will by generated during the actual displayable
frame buffer rendering. This allows you to allocate the amount of memory
that you actually need. With solutions like those proposed in Per-Pixel Fragment Lists you really have two lose-lose choices:
- either you allocate memory for the worst possible case (see the benchmarking results below) or
- you
take into account that you can loose some fragments that will not fit
in the fragment buffer allocated based on the results of the previously
rendered frame.
- You get the precise information about the
fragment distribution before actual drawing. This can allow you to
optimize pixel processing based on its fragment count.
- And last
but not least, as mentioned before, this technique can be perceived in
terms of an "early, simplified rendering pass" just like "early
depth-only pass" used with Z-buffer. This can allow for some clever
hardware optimizations both hierarchical (when we assume some regularity
in the fragment distribution) as well as threshold value based, just like
in hierarchical and early Z-buffer test. The data collected during
fragment counting pass can be used to assign work to processing units
based on pixel fragment ownership (for increased locality of memory
references) and fragment count representing processing complexity (for
more efficient threading).
Cons:
- Additional
full-geometry rendering pass (I deal with that using separate geometry
transformation pass but you still need to rasterize twice).
- Additional fill-rate consumed by count and address buffer calculation passes (see details below).
Contributions:
- Fast memory access scheme based on simple element count reference planing.
- Efficient and accurate memory resources allocation.
- Implementable in terms of last three generations of graphics hardware.
- This
technique can provide means for increasing memory reference efficiency
for highly parallel architectures not just GPUs. Think of it as low cost
simulation of a particular computational algorithm run allowing the hardware to discover
the way that software will reference its memory (one of the most costly
operation on compute heavy architectures). Why guess (through caching
polices) the memory reference scheme, when you can simulate it cheaply.
Implementation:
I
have implemented the technique in terms of different versions of the
OpenGL graphics system specification. You can view the source code for
the ones prepared against the Shader Model 5.0 (yes I know that it is
awkward to bring up Direct3D naming scheme in this context, but what
better alternative is there, (GL_ARB_gpu_shader5 +
GL_ARB_shader_image_load_store + GL_shader_atomic_counters + ...)?)
hardware capabilities.
There are two versions of the source code:
Both released under terms of the
GNU LESSER GENERAL PUBLIC LICENSE (Version 3). The results provided below come from the second version.
The
source-code is preprocessor heavy, as it was meant for experimentation
and benchmarking. In particular it allows you to divide the drawing
into multiple sub-regions for possible increase in memory access
parallelism. I chose the compile time optimization over run time
parameter switching for performance reasons -- this is graphics code
and each operation adds to overall drawing cost.
As my work on this
subject started way back in the SM3 hardware era, I have also provided
results for SM3 and SM4 code-paths. I do not release the source code
for them yet, as they are still under heavy development. In particular
I am working on a geometry decomposition scheme, that would allow me to
use them for drawing objects with arbitrary number of layers. They now
use a temporary multiple-rendering-targets or multi-layered
frame buffers to store data later gathered into a d-buffer and the number
of layers in limited to the number of images in each temporary buffer.
This is the cause for missing columns in the results tables. Long story
short: stay tuned for the release of the source code for them if you
are gaming console or hand held developer.
Results:
May
vary with different hardware/software configurations. Therefore I
encourage you to take it for a spin (WSAD and horizontal mouse scroll
at your disposal). If you are up for the task that is (need at least
AMD Evergreen or NVIDIA Fermi based graphics card with OpenGL 4.2
drivers, although only tested on the latter):
All in all I get around 50% to 100% increase in performance compared to
Per-Pixel Fragment Lists
(state of the art solution prepared by guys at AMD). The benefits are
more prenounced, when there are more non-empty (containing at least one
fragment) pixels on the screen. As in the case of Per-Pixel Fragment
Lists, I have allocated memory for the worst case scenario (all pixels
fully-populated by the maximum number of layers for a given
model), the method flats out on the memory limit quickly. The overall
tendency is for the acceleration to rise for higher resolutions.
The future looks bright as the method proves to be scalable
in terms of resolution and reaches 100 frames-per-second mark for the
highest resolutions on current hardware. Maybe it is not yet
"production-ready", but it can surely be a fruit for thought for
interactive application developers. Interesting what can be done with a
"redundant" rendering pass.
The main idea behind d-buffer is almost embarrassingly simple.
Store the data required by your algorithm in a hardware friendly manner.
That is create a structure, which data order reassmbles the way
processing element will access it. This is especially important in
highly parallel architectures such as GPUs. Their processing elements
often need to access data in a non-conflicting way to do their job as
fast as possible. You also do not want to force your processing
elements to synchronize their memory operations, to keep them busy at
all time.

GPUs and graphics APIs used to solve the problem in a
somewhat brute force but extremely efficient manner: assign every
processing element a distinct memory location to output its result
into. It is one of the most efficient ways, if you consider that every
processing element outputs relevant data and that there is only one
output per processing element (lets assume that we use an architecture
that processes N element using N possibly virtual processing elements -
that is the base of GPU stream processing efficiency: independence of
results data between processing elements or kernels if you will). This
is how Z-buffer based architecture works (one output color for the
closest fragment belonging to some pixel). But what, if you want to
output multiple results per-element? It is fine, as long as you want to
output the same amount of data per-element.
This is how multiple
rendering targets (MRT) works for GPUs.
But what, if some
processing elements do not output any data? Then you get some unused,
but still allocated memory fragments. See first picture for an illustration of
this problem. Here you see an example of multi-layered image structure (frame buffer).
You can say that the frame buffer is sparse, if it has some unused (that is filled with clear color or not drawn to) pixels.
You can say that the frame buffer's pixel is sparse, if there are some unused fragments between used ones in consecutive layers.
In short: there is some wasted (irrelevant) space in terms of data processing algorithm in use.
To solve this we could pack the 2D (possibly multi-layered) structure into a dense 1D one.
But
in such unpredictable architecture as unsynchronized GPU, we can not
make any assumptions of data arrival order. This is what makes GPUs
work so fast. We need to be prepared for any case. The simplest way to
deal with it, is to use a 1D Per-Pixel Fragment List and some 1D to 2D
mapping structure (for example a 2D pointer map). The map will be
sparse, in therms of its pixels, but the fragment list will be dense.
This is how the guys at AMD solved this problem. I like to classify this method as unordered storage scheme.
The
problem with this method is that, we have to store next element pointer
for each generated fragment. This results in additional memory cost
in both fragment access time and storage.

Doing
random access into linear memory space can lead to major inefficiency.
It is bad for both caching and data bus occupancy. If we can assume
that we will need to access fragments belonging to the same pixel in
the same processing element, we could pack the pixel's fragments close
together in memory.
That is where the d-buffer comes into play. It is both:
We
only need one fragment pointer for each pixel. The fragments can be
accessed sequentially beginning with the pointer up to count buffer entry
elements.
And this is it! This is how d-buffering is done.
It works remarkably well for most of the multi-fragment computer graphics problems, which include a fragment merging step.
It only assumes that the device uses some form of linear memory addressing scheme.
Realization
of this idea is little more complicated and gets even more so for older
(more constrained) architectures.