d-buffer: letting a third dimension back in...

It has been over thirty years since Wolfgang Straßer (in early 1974) and later Edwin Earl Catmull introduced us to the concept of Z-buffering as a way of "flattening" 3D objects for representation on a 2D surface of a display device. Their idea was so simple and yet so efficient that it is the most widely used and what is more important hardware accelerated visible surface problem solution to date. But what if we wanted to have multiple surfaces visible for each picture element? What if we wanted to store arbitrary number of data elements per-pixel? What if we wanted to account for the fact that the world around us is not flat nor perfectly regularly distributed between two distinguishable points of space?
The facts represented by above questions are painfully obvious but not as simple to solve in the rigid world of highly parallel graphics hardware. Some time ago I have decided to jump on the "real-3D" band wagon, to move beyond the world of a single surface per pixel. The road was long and bumpy, but it seams that I have at last found the efficient solution to these problems. As it turns out a simple idea  is most probably the efficient one just like in the legendary Z-buffer's case.
Read on for the rest of the description of my not so clever storage optimization.

Acknowledgments:

To prof. Przemysław Rokita for encouraging me to keep working on the idea presented herein.
To prof. Karol Myszkowski for suggesting the idea of hierarchical optimizations based on count buffer contents.

Kudos:

To J., C. Yang, J. Hensley, H. Gruen, and N. Thibieroz for their inspiring work on "Real-Time Concurrent Linked List Construction on the GPU".
To H. Gruen and N. Thibieroz for their informational presentation on "OIT and Indirect Illumination using DX11 Linked Lists".
To C. Crassin whose blog pointed me towards the above presentation.
And last but not least to OpenGL ARB for finally making my dreams a core feature of the specification.

d-buffer:

Why the "d"? Simply but not understandably from the "deque" -- both for its similarity in pronunciation to the deck of cards (as a linear arrangement of elements from a multi-dimensional space -- for instance: rank and suit) and frequently abused data structure. Why not "A"? First because this technique was not designed with anti-aliasing at its core, nor it has anything to do with a traditional list representation of an A-buffer. Most simply put it is an irregular array of arrays. Then why not Per-Pixel Fragment Tables? I feel overwhelmed by capitals in other acronyms already and "d" is more catchy.

Problem:

Find a way to store multiple fragments (that is contributions form different polygons) in each pixel that is both efficient in terms of memory used and reference order discovery.

Solution:

Use a preliminary fragment counting pass which finds both:

Pros:

Cons:

Contributions:

Implementation:

I have implemented the technique in terms of different versions of the OpenGL graphics system specification. You can view the source code for the ones prepared against the Shader Model 5.0 (yes I know that it is awkward to bring up Direct3D naming scheme in this context, but what better alternative is there, (GL_ARB_gpu_shader5 + GL_ARB_shader_image_load_store + GL_shader_atomic_counters + ...)?) hardware capabilities. There are two versions of the source code:
Both released under terms of the GNU LESSER GENERAL PUBLIC LICENSE (Version 3). The results provided below come from the second version.
The source-code is preprocessor heavy, as it was meant for experimentation and benchmarking. In particular it allows you to divide the drawing into multiple sub-regions for possible increase in memory access parallelism. I chose the compile time optimization over run time parameter switching for performance reasons -- this is graphics code and each operation adds to overall drawing cost.
As my work on this subject started way back in the SM3 hardware era, I have also provided results for SM3 and SM4 code-paths. I do not release the source code for them yet, as they are still under heavy development. In particular I am working on a geometry decomposition scheme, that would allow me to use them for drawing objects with arbitrary number of layers. They now use a temporary multiple-rendering-targets or multi-layered frame buffers to store data later gathered into a d-buffer and the number of layers in limited to the number of images in each temporary buffer. This is the cause for missing columns in the results tables. Long story short: stay tuned for the release of the source code for them if you are gaming console or hand held developer.

Results:

May vary with different hardware/software configurations. Therefore I encourage you to take it for a spin (WSAD and horizontal mouse scroll at your disposal). If you are up for the task that is (need at least AMD Evergreen or NVIDIA Fermi based graphics card with OpenGL 4.2 drivers, although only tested on the latter):
All in all I get around 50% to 100% increase in performance compared to Per-Pixel Fragment Lists (state of the art solution prepared by guys at AMD). The benefits are more prenounced, when there are more non-empty (containing at least one fragment) pixels on the screen. As in the case of Per-Pixel Fragment Lists, I have allocated memory for the worst case scenario (all pixels fully-populated by the maximum number of  layers for a given model), the method flats out on the memory limit quickly. The overall tendency is for the acceleration to rise for higher resolutions.
The future looks bright as the method proves to be scalable in terms of resolution and reaches 100 frames-per-second mark for the highest resolutions on current hardware. Maybe it is not yet "production-ready", but it can surely be a fruit for thought for interactive application developers. Interesting what can be done with a "redundant" rendering pass.

Details:

The main idea behind d-buffer is almost embarrassingly simple. Store the data required by your algorithm in a hardware friendly manner. That is create a structure, which data order reassmbles the way processing element will access it. This is especially important in highly parallel architectures such as GPUs. Their processing elements often need to access data in a non-conflicting way to do their job as fast as possible. You also do not want to force your processing elements to synchronize their memory operations, to keep them busy at all time.
Sparse multi-layered framebuffer.GPUs and graphics APIs used to solve the problem in a somewhat brute force but extremely efficient manner: assign every processing element a distinct memory location to output its result into. It is one of the most efficient ways, if you consider that every processing element outputs relevant data and that there is only one output per processing element (lets assume that we use an architecture that processes N element using N possibly virtual processing elements - that is the base of GPU stream processing efficiency: independence of results data between processing elements or kernels if you will). This is how Z-buffer based architecture works (one output color for the closest fragment belonging to some pixel). But what, if you want to output multiple results per-element? It is fine, as long as you want to output the same amount of data per-element. This is how multiple rendering targets (MRT) works for GPUs.
But what, if some processing elements do not output any data? Then you get some unused, but still allocated memory fragments. See first picture for an illustration of this problem. Here you see an example of multi-layered image structure (frame buffer).
Unordered framebuffer.
You can say that the frame buffer is sparse, if it has some unused (that is filled with clear color or not drawn to) pixels.
You can say that the frame buffer's pixel is sparse, if there are some unused fragments between used ones in consecutive layers.
In short: there is some wasted (irrelevant) space in terms of data processing algorithm in use.
To solve this we could pack the 2D (possibly multi-layered) structure into a dense 1D one.
But in such unpredictable architecture as unsynchronized GPU, we can not make any assumptions of data arrival order. This is what makes GPUs work so fast. We need to be prepared for any case. The simplest way to deal with it, is to use a 1D Per-Pixel Fragment List and some 1D to 2D mapping structure (for example a 2D pointer map). The map will be sparse, in therms of its pixels, but the fragment list will be dense. This is how the guys at AMD solved this problem. I like to classify this method as unordered storage scheme.
The problem with this method is that, we have to store next element pointer for each generated fragment. This results in additional memory cost in both fragment access time and storage.
D-buffer.Doing random access into linear memory space can lead to major inefficiency. It is bad for both caching and data bus occupancy. If we can assume that we will need to access fragments belonging to the same pixel in the same processing element, we could pack the pixel's fragments close together in memory.
That is where the d-buffer comes into play. It is both:
The 1D to 2D map buffer can be realized in to ways:
We only need one fragment pointer for each pixel. The fragments can be accessed sequentially beginning with the pointer up to count buffer entry elements.
And this is it! This is how d-buffering is done.
It works remarkably well for most of the multi-fragment computer graphics problems, which include a fragment merging step.
It only assumes that the device uses some form of linear memory addressing scheme.
Realization of this idea is little more complicated and gets even more so for older (more constrained) architectures.

Copyright 2010-2011 Jarosław Konrad Lipowski.