Author Topic: realtime raytracer project (Read 17807 times)

nystep · « **on:** October 18, 2007 »

Hey,

I just wanted to post about one of my projects.
I'm attempting to code a state of the art raytracer.

Why raytracing?

Well, it's extremelly easy to code a raytracer for starters.
Even a state of the art raytracer with all optimisations fits in a few hundreds of lines of code. If you want to do a state of the art 3D engine with opengl, it will more likelly be thousands of lines of code.
The fastest raytracers use the kd-tree structure. It's an extremelly convinient structure, which has the very interresting propriety to scale very well with large scenes. I.e you can add as much geometry as you wish, it doesn't hurt the framerate. It has been proven that raytracers have a logarithmic complexity regarding the number of objects in the scene, whereas 3d acceleration and software engines have a linear complexity regarding the number of primitives.
Some may argue that the complexity of a hardware accelerated or software 3D engine can also be logarithmic regarding the number of objects, thanks to complex technics such as LOD, occlusion queries, impostors... However, these technics can also be implemented in a raytracer.
A raytracer can render pretty much every object that can be tested for intersections. Not only spheres, cylinders, conics and all the like, but also pixel-accurate volumes such as isosurfaces, volumetric effects, *accurate* reflections, *accurate* refractions, point light sources and area light sources, and global illumination. Furthemore, it is instinctive to add all these effects once you have a recursive procedure to trace a ray. However, these render-technics lead to ray multiplication, which hurts performance a lot.
You don't need a zbuffer to get correct object occlusions on the viewport. In fact, the inherent nature of a raytracer takes care of visibility determination.

So basicly, the raytracer's performance is measured in terms of rays per second, as the scene geometry has no real impact on render speed. I wish i could reach the state of the art raytracer's performance of 10M rays per second per core. It means that a state of the art raytracer can potentially render 800*600 scenes with whatever number of primitives in realtime, on nowaday's hardware. However, the number of cores per CPU is rapidly increasing, with dual cores being well spread already, and quad cores such as intel's Core2Quad being really affordable already. Furthemore, amd's phenom is coming along, as well as intel's nehalem with 8 hardware cores+hyperthreading in late 2008. As an endnote, intel demonstrated the larabee this year, a 80 core CPU which would be a true killer at raytracing! We expect it to hit the market arround 2010. There is also some experimental development done on hardware raytracing units. So yes, I think that raytracing is the future of realtime 3D visualisation.

My starting point is Ingo wald's excellent thesis on realtime raytracing and realtime global illumination. However, I forked from the state of the art technic to test a variant.

Here's the thesis: http://www.mpi-inf.mpg.de/~wald/PhD/

There are only a few points that have to be optimized for a state of the art raytracer:

- ray/primitive intersections
- kd-tree construction
- kd-tree traversal
- shading

Of course, these tasks have to be done through the use of SIMD instructions, such as SSE. I'm starting the code in C++ however, to get a chance to make the thing run before optimising it.

Here is the progress of the project currently, for each of the points above:

- ray/sphere intersections are implemented, and done in batch of 4. It enables to test 4 spheres at once later with SSE.
- kd-tree building is implemented with the naive algorithm and least efficient algorithm for raytracing: middle cut.
- kd-tree traversal has been implemented both in an iterative+stack way, and in a recursive fashion. The recursive is the fastest, because it has no stack datastructure overhead. Furthemore, i traverse for a single ray, whereas most state of the art raytracers test intersection for each primitive but for a packet of almost colinear rays.
- shading is done using basic gouraud diffuse model, but i rather display only normals for now.

The endpoint is performance: I get 250Krays per second per core. It means that a 16k spheres scene takes 1 second to render in 640*480 roughly. This poor performance is most likelly due to the number of tests i do per ray: 57. However, i test 4 spheres per leaf, instead of 1. It is also due to the poor quality of the kd-tree building algorithm i'm using. So there is a lot of room for performance improvement.

Next, i'll work on a SAH based heuristic to build kd-trees.
I noticed interesting research from september 2006 ( fresh, isn't it?

) on this topic: http://www.sci.utah.edu/~wald/RT06/papers/2006-09-18-RT06-Hunt.pdf

To finish, i attach a screenshot of the current raytraced scenes, and i'll post further improvements i do to the raytracer in this thread.

Shockwave · « **Reply #1 on:** October 18, 2007 »

That is a really interesting topic, I have seen raytracers made by quite a lot of people here so I think your topic will be watched with great interest.

Also I'd agree that when processors with many cores become available and are cheap enough that raytraced demos will be mind blowing.

Not sure if I agree that they will be the way that everything is made as the games industry seems to be going more and more for shaders, but then again ray tracing can be done with shaders too.

This topic may or may not be of interesting to you: http://dbfinteractive.com/index.php?topic=1612.0 There are some very interesting optimisations.

Is suspect that it probably isn't the way you want to tackle the raytracer though as you seem to want perfect accuracy

nystep · « **Reply #2 on:** October 18, 2007 »

> Is suspect that it probably isn't the way you want to tackle the raytracer though as you seem to want perfect accuracy

Thanks for the hint, i see the guys have made some great adaptative RT tricks! impressive actually.
However, yes, I'll try to prevent any approximations to the tracer. So no fixed point, no adaptative subsampling.. Furthemore, as I'll probably add textures at some point, I don't want any weird looking artifacts.

One of my ideas was basicly to even do some HDR rendering and implement a tone compression algorithm in realtime. I have no idea if it's really possible though yet. Wait and see...

Jim · « **Reply #3 on:** October 18, 2007 »

Snapshots look great! I know for a fact that OpenGL people are not interested in scene graph stuff - which might let you accelerate ray-tracing, so I wish you luck!

Jim

taj · « **Reply #4 on:** October 18, 2007 »

Quote from: nystep on October 18, 2007

Well, it's extremelly easy to code a raytracer for starters.
Even a state of the art raytracer with all optimisations fits in a few hundreds of lines of code.

I used to be a researcher of raytracing about 18 years ago and my raytracers were bigger than this even then. Honestly, I think this is a little optimistic for "State of the Art". If I remember right, POVRay is about 150,000 lines of code. Mind you its 50% fluff but even so. If you mean "fast but simple" then I agree. I'll watch your project with a great deal of interest. Good luck, seems like you are off to a very good start, do you have a blog?

Frankly I'm thinking of doing a similar long term project in shaders.

Chris

rain_storm · « **Reply #5 on:** October 18, 2007 »

I will be following this topic to see how things progress. I love raytracers and I cant wait to see this gaining momentum. Nystep you got my support. Are you planning on implementing fluids and fire as well. If you want to futer proof this project I would suggest you consider it

nystep · « **Reply #6 on:** October 19, 2007 »

I have done quite a lot of progress today.

First i fixed the sphere splitting bug, which was caused by tree-building.
Then, i performed additional benchmarks using both of the 2 tree-building heuristics i have now..
The speed can be catastrophic as well as pretty good. I had performance variation from a factor of 1 to 10 depending on where i put the spheres arround, which is kind of.. Weird.

The statistics show that the traversal algorithms visits 2 to 20 leafs depending on the scene. This is most likelly the cause of the slowdown.

I improved the recursive traversal algorithm with early ray termination.

The shooter performance is now arround 600K 700K rays per second per core, on "good" scenes of 16k spheres... I have used this speed gain to render 1024*768 scenes instead of 640*480. This performance is still rather poor, so more work waits for me soon...

It has been pretty much a pain in the ass to build a "correct" tree that doesn't split spheres, but i guess more pain is on the way, I'll work on the SAH heuristic for example, as said, and for starters i'll try to make a tree that has empty leafs for the most empty space as possible. It enables rays to travel in these at minimal cost.

I'm also bound currently by kd-tree building time. At the moment, it takes 7-8 seconds to build the tree for 16k spheres. I don't include it in the rendertime of course, as it's pretty much some preprocessing step, as I mean to do some animation in the end. It's going to be a problem, because I fear the SAH will be even slower for building kd-trees.

For the quantity of sources, i'm sorry it has hurt you in some way chris, i didn't mean to offend anyone. Furthemore, I just noticed i'm arround 1k lines of code now for this project, and it's pretty much like it's going to increase, as I will add other tree-building algorithms, and shading... I still hardly imagine how i would end up reaching 10k lines of code with a raytracer though. I'd like to keep the code as simple and straightforward as possible.

I promise i share the sources as soon as I have a working SAH for building kd-trees.

Paul · « **Reply #7 on:** October 19, 2007 »

Sounds like it's going petty well, keep up the good work

Shockwave · « **Reply #8 on:** October 19, 2007 »

Quote

The shooter performance is now arround 600K 700K rays per second per core, on "good" scenes of 16k spheres... I have used this speed gain to render 1024*768 scenes instead of 640*480. This performance is still rather poor, so more work waits for me soon...

Imagine 1024 * 768 running at I dunno... Let's pick a stupid figure out of the air...

100fps?

Maybe you'll code the killer production now that will come into it's own in 3 years time when processors are faster and have many more cores.

Paul · « **Reply #9 on:** October 20, 2007 »

What does x rays per second per core acually mean?
does it mean my core2duo gets 2fps om a 1024*768 resolution?
Surley it must depend on the speed of the cores, so what speeds are theese cores yuo're refering to going at?

nystep · « **Reply #10 on:** October 20, 2007 »

Quote from: Paul on October 20, 2007

What does x rays per second per core acually mean?
does it mean my core2duo gets 2fps om a 1024*768 resolution?
Surley it must depend on the speed of the cores, so what speeds are theese cores yuo're refering to going at?

actually my CPU is a centrino duo T2300 clocked at 1.66GHz... So you'd probably get more fps on a core2duo CPU.
I think that talking in number of rays per seconds per core makes it easier to measure the performance of a raytracer. For example, I came accross www.rayforce.net, the number of rays per second is really impressive, as they mention 40-60 millions of rays per second, but on the other side this result is from a 8 core machine, so if you really want to compare with your tracer you'd better take the number of rays per core.. Still, there are many types of CPUs arround.. What should we talk about, CPU cycles per ray then?

I've implemented the SAH (Surface Area Heuristic), but there must be something seriously fucked up in my code, as I didn't notice any really impressive performance gain... After a while I realized i'm never generating empty leaves at a high level in the tree, so i fixed it up. Even with that, i noticed disappointing performance. I came accross scenes where the number of visited leaves per ray was next to 1, I had a performance of 2Mrays per second per core on these scenes. But in more randomish scenes, it seems more mixed. The performance is still arround the 1Mrays per second (more likelly under, 600K-900K is something i usually get in average..). The problem is still the number of leaves traversed before finding an intersection. It tends to blow off if the "eye" of the camera is inside the kd-tree, and the renders are insanelly slow then. I've been profiling as well the program and i saw the ratio spent in the intersection test was only 33% of the CPU time, against 60% in the kd-tree traversal. It means if i optimize primitive intersection tests in assembler with SSE, i'll only get a limited performance boost. This is, once again, a little disappointing..

So next i'll take more care about my kd-tree, as the render speed is really dependant upon the quality of the tree that is built. I'll try to implement another heuristic, that aims at maximizing the empty space outside the remaining geometry when splitting along a plane. It would enable to make bigger empty leaves, at higher levels in the tree. However, it's not necessarly good to split once you find a little empty room somewhere in the scene, since deeper trees mean more traversal time.

I decided to share the source, hope it can be usefull to someone. Do you mind if it doesn't comes along with my whole engine? I'm too lazy to do a clean package now.. However, the code has little dependancy with the engine, it only uses the logfile, and a self-resizing vector class...

Still, it can be interresting as it shows how to do a clean and "not too slow" recursive kd-tree traversal (i've been fighting with it, really, i'm sure it can help). It also shows how to build a kd-tree from a set of primitives.

Next, i think i'll read up this MLRTA (Multi level raytracing algorithm) paper i just downloaded. It shows off really mindblowing performance results, but the people who have implemented it arround on the net have been disappointed from what i've been reading on some other forum

.. I'll also implement this other heuristic to build the tree and see which results it gets.

Jim · « **Reply #11 on:** October 20, 2007 »

I notice you're using Visual Studio 6.0. Have you considered upgrading to Visual Studio 2005 Express? That way you'd get free SSE2 optimisation for all your floating point code just by setting a compiler option. As you say, it won't fix the tree stuff, but it'll help with the geometric calculations.
Our 'Getting Started' thread is here http://dbfinteractive.com/index.php?topic=1243.0.

Jim

nystep · « **Reply #12 on:** October 20, 2007 »

Actually, I have Visual Studio 2005 Professional installed here, Yet I don't use it. It's too bloated, it installs a sql server in your back, and my versions of visual assist X don't install on it. So, I keep rocking the good old vc6, which furthemore is the last Visual studio which has a profiler integrated in the IDE.

And, quite sincerelly, i don't expect the compiler to do the SSE job in my place, even when you have proper SOAs, it doesn't even detect how to use them.. So intrisics, nasm still work fine. At least for hobbies... At work it's another story

taj · « **Reply #13 on:** October 23, 2007 »

nystep,

can you explain the principle of the SAH in a few words?
Thanks,
Chris

nystep · « **Reply #14 on:** October 25, 2007 »

Sure chris,

Building a kd-tree is kind of simple by concept. It's a recursive process, where you take a list of primitives as argument. If you meet an end critera, you store the primitives for this leaf and return. Else, you have to decide where to split your current Axis Aligned Bounding Box (aabb) for each children. Then you intersect your primitive list with each of the children's aabb, and recurse on each child to keep building the tree.

The biggest problem is where to split the current aabb. There are various algorithms for it, and the performance greatly varies amoung them. We can think of straightforward ways to split the current aabb: Put the split at the median point of the primitives, or divide the size of the aabb by 2. We can think the first is a good heuristic, because you end up having a balanced tree with it. However, if your rays travel large regions of empty space, it will still have to test for triangles it will most likelly never intersect. As a result, the median cut was the heuristic that produced the slowest resutls on my test scenes. On another side, if you divide the size of the aabb by 2, you end up having some empty space, but you can't really handle efficiently the case of a teapot in a stadium: the algorithm will store the geometry for the teapot very deep in the tree, so it will be very slow to reach it.

The idea behind the surface area heuristic is to associate a raytracing cost to an axis aligned bounding box. The SAH assumes that rays are equally distributed in space, and are not blocked by objects. Having an AABB named A, it is possible to calculate the probability of each of the sub-AABBs, let them be AL and AR.

P(VR/V) = SA(AR) / SA(A)
P(VL/V) = SA(AL) / SA(A)

Where P is a probability and SA is the Surface Area of each sub Axis Aligned Bounding Box. Once this is known, it is possible to associate a cost to each of the left and right sub-AABBs:

Let Ctrav be the cost for traveling a node in the kd-tree, and Cinter the cost of intersecting a primitive:

Csplit = Ctrav + ( P(VR/V)*NR + P(VL/V)*NL )

where NR is the number of primitives that intersect the left AABB, and NR the number of primitives that intersect the right AABB. (From ingo wald's thesis).

The idea of the heuristic is to extract the minimum Csplit. Hopefully, this cost function varies (only) at each boundary of an object, so you'd be forced to test along the xmin of a triangle and along the xmax. For a single axis scan, it still makes N*2 cost evaluations to do, and you have to determine NR and NL at each evaluation. So this is a pretty slow process, however, accelerating it is one of today's research topics in raytracing.

If the minimum cost you obtain is lower than your current AABB cost, you perform a split, otherwise, you store primitives and tag your current node as a leaf. There is some pretty readable further information about it on Jacco bikker's excellent tutorial page: http://www.devmaster.net/articles/raytracing_series/part7.php.

nystep · « **Reply #15 on:** October 25, 2007 »

So, i've done quite many changes that bring the performance next to 1,3 MRPS on a particular scene. The average performance is more often arround 1MRPS in the general case. I've performed quite a lot of optimisation to reach this performance. Here is what i have done:

- packed the kd-tree node structure to 8 bytes, and found a way to do it in 4 bytes..
- precomputed 1/direction_vector, which removes the division in the kd-tree traversal, overall speed bumped by 30% from this.
- tried to tweak my termination critera for kd-tree building, i have various speed improvements from it.. depends on the scene.
- made the iterative kd-tree traversal 25% faster by removing the stack template and implementing a hard-coded stack...

All in all, this tracer is now as fast as what Ingo Wald describes in his thesis, as for it is a mono-raytracer, single core, and consists of C++ code. I'm quite satisfied with it currently, next would be to implement a packet tracer, and MLRTA. MLRTA is quite tricky for what i've read so far... And the benefits are not quite those listed in the tables on the first page i think, for it only finds an entry point for the kd-tree traversal deeper in the code.. Even in the end of the paper they admit they only divide by 3 the number of steps in the kd-tree traversal in some cases, so the performance boost canot be that big...

I share my current sources of course.. Hope it can help someone..

taj · « **Reply #16 on:** October 25, 2007 »

Thankyou very much nystep. Definite Karma++ ++.

I love it when someone knowledgable makes the subject reasonably simple for others, rather than having to trawl through dozens of web pages, only to find out most pages are written by people who havent a clue.

Thanks again. I guess the research for acceleration is basically because people want to update kd-trees dynamically each frame?

Chris

taj · « **Reply #17 on:** October 25, 2007 »

nystep, Whats glSilent, it seems to be missing from the source code?

frea · « **Reply #18 on:** November 19, 2007 »

You might be interested in nvidia's CUDA. When i'll get a geforce 8xxx a raycaster will be the first thing i try

. 32, up to 128 parallel procesors, heaven for raytracing

.

Shockwave · « **Reply #19 on:** November 20, 2007 »

The next few years will be really interesting for raytracing when more and more cores are squeezed into processors, but I think that the real advances will come from using shaders for raytracing...