Author Topic: bumpmapping (Read 26115 times)

ninogenio · « **Reply #21 on:** May 09, 2013 »

cheers mate!

right last version i promise lol.. ive made look up tables for almost everything and am doing lots of precalcs. also split the bump map function into two parts and am running them on separate threads. only downside is, it will run really badly on single core cpus now.

the fps on this system shot up from around 15-20 to 62-66.. still to change all the array accesses to ptr addition and figure out how to pre calc pow. i should be able to hit the golden 70fps mark. doing this many calcs, 307200 times a frame, im amazed at the speed so far. it just might be possible to make a little 3d demo with a few bump textures at 128*128 or something at reasonable speed.

could anyone who tries this tell me their cpu and fps please.

ps to see fps run in windowed mode and look at the black box.
cheers.

hellfire · « **Reply #22 on:** May 09, 2013 »

33 fps on Intel Core2 Quad 2.8GHz.
Feels much faster than the previous versions.

And my magic crystal ball foretells twice the speed if you kick the floating point stuff out of the main loop and use only integers

ninogenio · « **Reply #23 on:** May 09, 2013 »

cheers hellfire.

yep you're right mate, its bottlenecked at the fpu atm.

ill confess fixed point scares me a bit. its been years since i did any. if im right you shift all ints and floats left by 8 bits (multiplying the floats by 256), add, subtract and multiply them together, then shift right at the end again to clip off the fraction and bring everything back in range.

hellfire · « **Reply #24 on:** May 10, 2013 »

Quote from: ninogenio on May 09, 2013

you shift all ints and floats left by 8 bits,
add subtract multiply them together
then shift right at the end again

Exactly.
Those values which contain integers anyway won't need any additional bits, though.
For the rest you usually have to check, how much precision is actually required.
The trick is to keep track of the number of fractional bits.
For example if you multiply two values with 8bits of fractional part, the result has 16bits fraction - so you need to shr8 to get back to 8bits of precision.
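To make the bookkeeping concrete, here is a minimal C sketch of the same idea. The 8.8 format and the names `FRAC_BITS`, `TO_FIXED`, `fx_mul` and `fx_add` are made up for this example and are not taken from the demo's code. It also assumes that `>>` on a negative int is an arithmetic shift, which is what common compilers do but is implementation-defined in C.

```c
#include <stdint.h>

#define FRAC_BITS 8   /* assumed: 8 fractional bits for this sketch */

/* Convert a constant to fixed point by scaling with 2^FRAC_BITS. */
#define TO_FIXED(x) ((int32_t)((x) * (1 << FRAC_BITS)))

/* Multiply two values that each carry FRAC_BITS fractional bits.
   The raw product carries 2*FRAC_BITS, so one shift brings it back. */
static int32_t fx_mul(int32_t a, int32_t b)
{
    return (a * b) >> FRAC_BITS;
}

/* Add/subtract need no correction: both operands already share
   the same number of fractional bits. */
static int32_t fx_add(int32_t a, int32_t b)
{
    return a + b;
}
```

So 1.5 * 2.0 becomes 384 * 512 = 196608 (16 fractional bits), and the shift by 8 gives 768, which is 3.0 with 8 fractional bits again. The product has to fit into 32 bits before the shift, so the inputs must stay reasonably small.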

For the diffuse part I would quantize the vectors to 8bits (so a normalized vector ranges from -255..+255):

Code: [Select]

DiffDot = ( Nx * Lx + Ny * Ly + Nz * Lz ) shr 8;

So you get an 8bit (0..255) value to shade your texture color, which fits nicely into mmx.

For the specular part you might need more bits because the exponent keeps only a small piece of the range (eg. 0.8 - 1.0), all the rest is black (below 1/255) anyway.
And the integer value makes it much easier to pick the pow-function from a table.

Another thing that makes your code slower at the moment is that you precalculated everything into double-arrays, which increases your memory bandwidth by a factor of 8.

I had a look at your code and noticed that this part:

Code: [Select]

            VdirX = (PixyX) * ReciOX
            VdirX = CamX-VdirX
            VdirY = (PixyY) * ReciOY
            VdirY = CamY-VdirY
            VdirZ = TextureZ - CamZ
            
            SpecDot = ( Nx*VDirX + Ny*VDirY + Nz*VDirZ )
            
            Rx = VDirX-Nx*2.0*SpecDot
            Ry = VDirY-Ny*2.0*SpecDot
            Rz = VDirZ-Nz*2.0*SpecDot

...is constant for each pixel and can be precalculated just as the normal map.

ninogenio · « **Reply #25 on:** May 10, 2013 »

cheers hellfire,

and thanks for looking at the code. excellent spot with the static viewport and texture optimization. im wanting to try this dynamically though, as it will be mapped onto a cube at some point so will be moving around.

well a fun night tonight, i brushed off the fixed point cobwebs and after about 2 hours the whole thing is integerized with ptr addition included also.. the only part that i cant get my head around fixed pointing is the pow function. also there are loads of shifts in there now so that will slow things down a bit.

i had to use ten point shifts for precision as i was losing too much specular term. even the diffuse part was suffering at eight.
Im Chuffed so far. at this point im up by about 5 fps at my end but im sure ive opened up loads of optimisation options now.

i just cant see them at this point, too many shifts and a few rounding errors do that though

oh and thanks for the quantize on the diffuse suggestion ill give that a bash next as i think it could get rid of a few calculations k+

hellfire · « **Reply #26 on:** May 10, 2013 »

This version runs at ~50fps on my machine, that's almost twice as fast as the previous one.

Quote

the only part that i cant get my head around fixed pointing is the pow function.

That's actually super easy:
If both vectors (Rx,Ry,Rz) and (Lx,Ly,Lz) are normalized to 10 bits, SpecDot is an integer in the range -1023..+1023.
Since you're only interested in positive values and (with an exponent of 40) all values <800 are zero anyway, you can look up the pow-function from a really small table.
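A minimal C sketch of such a table, assuming the numbers mentioned above (10 bit range, exponent 40, cutoff at 800). The names `specTable`, `initSpecTable` and `specLookup` are invented for the example. The power is computed by repeated multiplication so the sketch needs no math library.

```c
#define SPEC_ONE 1023   /* 10 bit "1.0" */
#define SPEC_MIN  800   /* everything below is treated as black */
#define SPEC_EXP   40   /* specular exponent */

/* Only the range SPEC_MIN..SPEC_ONE is stored: 224 bytes. */
static unsigned char specTable[SPEC_ONE - SPEC_MIN + 1];

static void initSpecTable(void)
{
    for (int i = SPEC_MIN; i <= SPEC_ONE; i++) {
        double base = (double)i / SPEC_ONE;
        double s = 1.0;
        for (int k = 0; k < SPEC_EXP; k++)   /* s = base^SPEC_EXP */
            s *= base;
        specTable[i - SPEC_MIN] = (unsigned char)(s * 255.0 + 0.5);
    }
}

/* Map an integer dot product to an 8bit specular intensity. */
static int specLookup(int specDot)
{
    if (specDot < SPEC_MIN) return 0;               /* negative or too dark */
    if (specDot > SPEC_ONE) specDot = SPEC_ONE;     /* guard against overshoot */
    return specTable[specDot - SPEC_MIN];
}
```

Call `initSpecTable()` once at startup; the inner loop then only needs a compare and one table read per pixel. The upper clamp matters if the vectors are not exactly normalized and the dot product overshoots.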

But I noticed that your view direction vector (VDirX,VDirY,VDirZ) is not normalized (and I'm a bit surprised that it still works so well), so you have to be a bit careful with the actual numeric range of the dot-products.

And code like this:

Code: [Select]

Rx = VDirX-Nx*2*SpecDot
Ry = VDirY-Ny*2*SpecDot
Rz = VDirZ-Nz*2*SpecDot

...is predestined for mmx, you just have to make sure that input and output fit into signed 16bit values.
If you extend your vectors to have a 4th coordinate (which just stays 0), it's much easier to load data into mmx registers.

ninogenio · « **Reply #27 on:** May 11, 2013 »

excellent thanks hellfire,

im trying to pull specdot into the range of -1..1 for a precalced pow by normalizing Vdir and the light vector, but there must be something off somewhere else because i always get around -1.4..1.4. only if i normalize the reflection vector, light vector and Vdir do i get -1..1. just for clarity, can the light vector be unnormalized until after the diffuse angle is worked out?

im surprised the extra sqrts dont hamper performance too much.

how would i go about making the reflection vector fit into a 16 bit number? that would mean only 4 bits of precision, do you think this would be enough?

i was having a little look here..
http://www.dbfinteractive.com/forum/index.php?topic=1726.msg26106#msg26106

i see what you mean about padding 4Dvectors with w being 0, it seems like the lesser of 2 evils.

hellfire · « **Reply #28 on:** May 11, 2013 »

Quote from: ninogenio on May 11, 2013

im trying to pull specdot into the range of -1..1 for a precalced pow by normalizing Vdir and the light vector, but there must be something off somewhere else because i always get around -1.4..1.4. only if i normalize the reflection vector, light vector and Vdir do i get -1..1.

the dot-product of two vectors v1 and v2 (v1.x*v2.x + v1.y*v2.y + v1.z*v2.z) is in the range -1..+1 only if the length of both vectors is 1.0.
If that's not the case, the diffuse term is calculated by

Code: [Select]

dot= (v1.x*v2.x + v1.y*v2.y + v1.z*v2.z)
diffuse= dot / ( length(v1) * length(v2) )

To avoid the square root associated with calculating the length, one tries to have both vectors *almost* normalized beforehand, so that length(v1) * length(v2) becomes ~1.0 and can be skipped.
Nobody will notice if the length of your vectors is off by a few percent, so a very rough approximation for 1/sqrt is totally sufficient (for floating point values the fast inverse square root function is popular).
On the other hand, why bother with normalization if your shading function gives good results with unnormalized vectors?
All that happens is that your dot products deliver somewhat larger (or smaller) values. So if you want to use it for a lookup-table, your array must be somewhat larger...
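For reference, a minimal C sketch of the well-known fast inverse square root with one Newton-Raphson step, used here to roughly normalize a vector. The helper names `fastInvSqrt` and `roughNormalize` are just for this example; it assumes 32bit IEEE floats and a positive input.

```c
#include <stdint.h>
#include <string.h>

/* Approximate 1/sqrt(x) for x > 0 using the classic bit trick
   plus one Newton-Raphson refinement step. */
static float fastInvSqrt(float x)
{
    uint32_t bits;
    float y;

    memcpy(&bits, &x, sizeof bits);       /* reinterpret float as integer */
    bits = 0x5f3759dfu - (bits >> 1);     /* initial guess */
    memcpy(&y, &bits, sizeof y);          /* back to float */

    return y * (1.5f - 0.5f * x * y * y); /* one refinement step */
}

/* Scale a 3d vector to roughly unit length without calling sqrt. */
static void roughNormalize(float v[3])
{
    float inv = fastInvSqrt(v[0]*v[0] + v[1]*v[1] + v[2]*v[2]);
    v[0] *= inv;
    v[1] *= inv;
    v[2] *= inv;
}
```

With one refinement step the result is off by well under one percent, which is below the "few percent" nobody will notice. On a modern cpu a hardware reciprocal square root may well be faster, so treat this as an illustration of the approximation idea rather than a recommendation.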

ninogenio · « **Reply #29 on:** May 11, 2013 »

Quote

To avoid the square root associated with calculating the length, one tries to have both vectors *almost* normalized beforehand, so that length(v1) * length(v2) becomes ~1.0 and can be skipped.
Nobody will notice if the length of your vectors is off by a few percent

im glad you said that as thats exactly what i was hoping for originally by reciprocally dividing each of the vectors elements by their known length. im still struggling with a lookup table for pow. as it is, my specdot is producing numbers in the range of -4..4, so i thought great, just multiply everything you wrote by a factor of 4, giving a lookup table of just over 800 elements. but i was wrong.

what i have noticed though is that using the look up table, even though the specular term isnt working correctly, the fps has shot up to 110ish, so this is going to be a big optimization i think.

ninogenio · « **Reply #30 on:** May 12, 2013 »

well after lots and lots of tinkering i think i have the whole thing completely working. on ints with pow arrayed im getting around 91 fps now. and its still decent quality.

hellfire · « **Reply #31 on:** May 12, 2013 »

Nice work, Nino!
Runs at about 85fps here.

This got me all a bit curious and I wanted to see how fast I could get it myself.
So I started from scratch with a glsl shader as it's much easier to figure out the math, check the required precisions and try different variations.
That code looks like this and runs at >5000 fps in 640x480 on my gtx560-ti:

Code: [Select]

uniform sampler2D colorTex;   // color texture
uniform sampler2D normalTex;  // normal texture
uniform vec3 lightPos;        // light position
uniform vec4 lightColor;      // light color (1.0, 0.4, 0.4, 1.0);
uniform vec4 specColor;       // specular color (1.0, 0.9, 0.9, 1.0)

varying vec2 fragPos;         // 2d pixel position (0..1, 0..1)
varying vec2 uv;              // 2d texture coordinate

void main()
{
   vec4 result= vec4(0.0, 0.0, 0.0, 0.0);

   // get color and normal from textures
   vec4 color= texture2D(colorTex, uv);
   vec3 normal= texture2D(normalTex, uv).xyz * 2.0 - 1.0;

   // make sure normal is actually normalized
   normal= normalize( normal );

   // get height from alpha channel of color map
   float height= 2.0 - color.a;

   // 3d fragment position
   vec3 pos= vec3(fragPos, height);

   // camera (0,0,0) to pixel
   vec3 viewDir= normalize(pos);

   // light to pixel
   vec3 lightDir= normalize(pos - lightPos);
   vec3 refl= normal*2.0*dot(normal, viewDir) - viewDir;

   // distance attenuation (1/dist^2) * 2.0 (disabled)
//   float dist= 2.0 / length(lightPos - pos);

   // diffuse term
   float diffuse= dot(normal, lightDir);
   if (diffuse > 0.0)
       result= color * lightColor * diffuse; // * dist;

   // specular term
   float specular= dot(refl, lightDir);
   if (specular > 0.8)
       result+= specColor * pow( specular, 20.0 );

   // store color
   gl_FragColor= result;
}

So I started to write this down in C, using 4d 16bit integer vectors to give the compiler a clue to use mmx and replaced the normalization- and pow-part with a lookup table.
That version ran at about 70fps using no multithreading. Looking at the disassembly it actually used mmx vectorization but wasn't very clever at it.
So I successively replaced all parts of the inner-loop with hand-crafted mmx blocks which got me to around 85fps and there's probably a good chance to make that a few percent faster.
Finally I added multi-core support to make it use all my cpu-cores and got to >200fps (exe attached).
I must admit that it uses all my 4 cores to the max while your version only uses ~50%.

The c code I started from looks like this:

Code: [Select]

typedef struct
{
   short x,y,z,w;
} ShortVector4;

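// forward declaration, the function is defined further down
void scanlineBump2d_c(unsigned int* dst, unsigned int* src, int width, int y,
   ShortVector4 lightPos, ShortVector4 lightColor,
   ShortVector4 specColor, ShortVector4 camera);

void drawBump2d_c(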
      unsigned int *dst,          // destination buffer
      unsigned int *src,          // interleaved color/normal data per pixel
      int width,                  // width of buffer
      int height,                 // height of buffer
      ShortVector4 lightPos,      // light position
      ShortVector4 camera,        // camera position (w/2, h/2, 0, 0)
      ShortVector4 lightColor,    // light color
      ShortVector4 specColor      // specular color
)
{
   // run scanlines in parallel:
   #pragma omp parallel for
   for (int y=0; y<height; y++)
   {
      scanlineBump2d_c(
         dst + y * width,
         src + y * width * 2,
         width,
         y,
         lightPos,
         lightColor,
         specColor,
         camera
      );
   }
}


void scanlineBump2d_c(
   unsigned int* dst,
   unsigned int* src,
   int width,
   int y,
   ShortVector4 lightPos,
   ShortVector4 lightColor,
   ShortVector4 specColor,
   ShortVector4 camera)
{
   ShortVector4 pos;
   ShortVector4 norm;
   ShortVector4 viewdir;
   ShortVector4 refl;
   ShortVector4 col;
   ShortVector4 lightdir;

   pos.x= 0;
   pos.y= y;
   pos.z= 0;
   pos.w= 0;
   for (int x=0; x<width; x++)
   {
      // start with black color
      col.x= 0;
      col.y= 0;
      col.z= 0;
      col.w= 0;

      unsigned int pixelColor= src[0];
      pos.z= (pixelColor >> 24 & 255); // height stored in alpha
      // pos is the pixel's 3d coordinate

      // normal vector in 7bit fractional
      unsigned int nrm= src[1];
      norm.x= (nrm & 255) - 128;
      norm.y= (nrm >> 8 & 255) - 128;
      norm.z= (nrm >> 16 & 255) - 128;
      norm.w= (nrm >> 24 & 255) - 128;

      // light direction
      lightdir.x= pos.x - lightPos.x;
      lightdir.y= pos.y - lightPos.y;
      lightdir.z= pos.z - lightPos.z;
      lightdir.w= pos.w - lightPos.w;

      // normalize
      int t;
      short inv;
      t= lightdir.x*lightdir.x + lightdir.y*lightdir.y + lightdir.z*lightdir.z + lightdir.w*lightdir.w;
      inv= invSqrt[t>>10];

      // rescale this to end up with 16bit of fraction to match mmx' pmulhw
      lightdir.x= (lightdir.x<<3)*inv>>16;
      lightdir.y= (lightdir.y<<3)*inv>>16;
      lightdir.z= (lightdir.z<<3)*inv>>16;
      lightdir.w= (lightdir.w<<3)*inv>>16;

      // calculate diffuse term - result: -16383..+16383
      int diffuse= norm.x*lightdir.x + norm.y*lightdir.y + norm.z*lightdir.z + norm.w*lightdir.w;
      if (diffuse > 0)
      {
         diffuse= diffuse>>5;

         col.x+= (pixelColor >>  0 & 255) * lightColor.x * diffuse >> 16;
         col.y+= (pixelColor >>  8 & 255) * lightColor.y * diffuse >> 16;
         col.z+= (pixelColor >> 16 & 255) * lightColor.z * diffuse >> 16;
         col.w+= (pixelColor >> 24 & 255) * lightColor.w * diffuse >> 16;
      }


      // view direction vector: camera -> pixel
      viewdir.x= pos.x - camera.x;
      viewdir.y= pos.y - camera.y;
      viewdir.z= pos.z - camera.z;
      viewdir.w= pos.w - camera.w;

      // normalize
      t= viewdir.x*viewdir.x + viewdir.y*viewdir.y + viewdir.z*viewdir.z + viewdir.w*viewdir.w;
      inv= invSqrt[t>>10];

      viewdir.x= (viewdir.x<<3)*inv>>16;
      viewdir.y= (viewdir.y<<3)*inv>>16;
      viewdir.z= (viewdir.z<<3)*inv>>16;
      viewdir.w= (viewdir.w<<3)*inv>>16;

      // reflection vector
      t= (norm.x*viewdir.x + norm.y*viewdir.y + norm.z*viewdir.z + norm.w*viewdir.w);
      refl.x= ((norm.x<<3) * t >> 16) - viewdir.x;
      refl.y= ((norm.y<<3) * t >> 16) - viewdir.y;
      refl.z= ((norm.z<<3) * t >> 16) - viewdir.z;
      refl.w= ((norm.w<<3) * t >> 16) - viewdir.w;

      // specular term. result: -16383..+16383
      int specular= refl.x*lightdir.x + refl.y*lightdir.y + refl.z*lightdir.z + refl.w*lightdir.w;

      if (specular > 12288) // 16383 * 0.75 -> pow(0.75, 20.0) < 1/255
      {
         specular= specular >> 7;

         unsigned int s= powTable[specular] & 255;

         col.x+= (s * specColor.x >> 8);
         col.y+= (s * specColor.y >> 8);
         col.z+= (s * specColor.z >> 8);
         col.w+= (s * specColor.w >> 8);
      }

      // saturate
      if (col.x>255) col.x=255;
      if (col.y>255) col.y=255;
      if (col.z>255) col.z=255;
      if (col.w>255) col.w=255;

      *dst= (col.z<<16)|(col.y<<8)|col.x;

      dst++;
      src+=2;
      pos.x++;
   }
}

The two tables look like this:

Code: [Select]

   int invSqrt[65536];             // way too much
   unsigned int powTable[2048];

   for (int i=0; i<2048; i++)
   {
      double p= pow(i/128.0, 20.0)*2.0;
      if (p<0.0) p=0.0;
      if (p>1.0) p=1.0;
      int v= p*255.0;
      powTable[i]= (v<<24)|(v<<16)|(v<<8)|v;
   }

   for (int i=0; i<65536; i++)
   {
      double t= 32767.0 * 32.0 / sqrt(i*1024.0);
      if (t > 0x7fff) t= 0x7fff;
      invSqrt[i]= (int)t;
   }

As I don't want to kill the suspense I'm not going to add the mmx code for now

ninogenio · « **Reply #32 on:** May 12, 2013 »

amazing, best sunday morning ever!!!

so you managed to get up to 85fps in your c code using only 1 core, then when you spread the workload across all cores you got >200. im getting 336fps at my end (core i7 3.2).

i honestly never imagined this could be made that quick. and your specular and diffuse terms look lovely!!
what value do you hold in light w? is the norm.w packed with your grey level height map and integrated into the diffuse term dot product? your diffuse term is much more prominent than mine.

just noticed you were able to make a look up table for normalization, that is amazing. it was annoying me that mine wasn't correct but i couldn't afford the extra sqrt's. i would never have thought of your solution.

ill take a little while to digest all your code, already i can see lots of areas i can improve mine k++

hellfire · « **Reply #33 on:** May 12, 2013 »

Quote

what value do you hold in light w?

i'm on my mobile, so just a short reply:
all w components are zero and are just there to make the compiler use a single vector instruction on the whole 4 values.
otherwise it tries to mask the 4th component away and ends up slower...
i just use alpha of the colormap to store 255-grey.
i also renormalize the normalmap after loading because it didn't really fit.
have to look up the z components of the light and camera when i'm back home.

ninogenio · « **Reply #34 on:** May 12, 2013 »

thanks mate, no problem.

i just noticed my threading isnt working properly, your example makes my cpu run at 100% full whack. i just tried to split mine into 4 sections and 4 core it. my fps went to 136 but with only 36% cpu usage. im guessing its because im not doing my multi threading on a scanline by scanline basis as you do, and as a result for whatever reason a lot of the time my code makes the cpu sit idle.

hellfire · « **Reply #35 on:** May 12, 2013 »

The rest of the parameters are:

Code: [Select]

w= width of bitmap (640)
h= height of bitmap (480)
time= time in seconds

lightPos.xyzw= ( (sin(time*1.6)+1)*w/2, (cos(time*1.8)+1)*h/2, 50, 0)
cameraPos.xyzw= (w/2, h/2, 0, 0)
lightColor.xyzw= (127, 127, 255, 0)  // blue, green, red, alpha
specColor.xyzw= (255, 192, 192, 0)   // blue, green, red, alpha

And the two textures are modified the following way:

Code: [Select]

// "height" into alpha channel:
color[i].w= 255 - ((color[i].x*30 + color[i].y*150 + color[i].z*76) >> 8);

// renormalize
int t= 32767 / sqrt( normal[i].x*normal[i].x + normal[i].y*normal[i].y + normal[i].z*normal[i].z);
normal[i].x= -(normal[i].x * t >> 8);  // negated!
normal[i].y=  (normal[i].y * t >> 8); 
normal[i].z=  (normal[i].z * t >> 8);

the color/normal-buffers got interleaved into a separate buffer, so i can read both with a single movq and save an address register.

As all vectors are normalized to signed 8bits (-128..+127) and the normalization-precision is very rough, you can see some quantization noise in the shading (which is actually good, if it wasn't there you'd see color banding) which could be removed by using more bits of the available range - but it's probably getting a bit trickier with mmx then...

Quote

im not doing my multi threading on a scanline by scanline basis as you do

open-mp doesn't schedule one job per scanline. instead it distributes the whole number of loop iterations (in this case 0..479) over the number of available cores.
So scanlines 0..119 are processed by core0, 120..239 by core1, 240..359 by core2, 360..479 by core3.
This gives minimal scheduling overhead but if one core gets interrupted by another task and thus finishes later, all other cores must wait until the last one finished.
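The block-wise split described above can be sketched in plain C. `RowRange` and `staticChunk` are hypothetical names for this example, and the even split is an assumption about how a static schedule typically behaves; the OpenMP default schedule is implementation-defined.

```c
/* Inclusive range of scanlines handled by one worker.
   The range is empty when first > last. */
typedef struct { int first, last; } RowRange;

/* Split `rows` loop iterations into `cores` contiguous blocks
   and return the block that worker number `core` would get. */
static RowRange staticChunk(int rows, int cores, int core)
{
    int chunk = (rows + cores - 1) / cores;   /* block size, rounded up */
    RowRange r;

    r.first = core * chunk;
    r.last  = r.first + chunk - 1;
    if (r.last >= rows)
        r.last = rows - 1;                    /* last block may be shorter */
    return r;
}
```

For 480 rows on 4 cores this gives 0..119, 120..239, 240..359 and 360..479. If one core regularly finishes late, OpenMP's `schedule(dynamic, n)` clause on the `parallel for` hands out blocks of n iterations on demand instead, trading a bit of scheduling overhead for better load balancing.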

ninogenio · « **Reply #36 on:** May 13, 2013 »

Excellent thanks mate. ive taken a step back to the float version and redone all the base calcs, watching their floor and roof values to make sure they stay in the correct ranges, and it works nicely. the specular and diffuse terms look and behave the same as yours.

next step is to change the fixed point version to behave the same, then ill change the size of my shifts to suit mmx.

its a pity freebasic doesn't have a version of open mp. the standard freebasic threading commands don't seem to use more than two of my cores, so the cpu usage only ever gets as high as 36%. ill have to go a bit deeper into that.

Quote

As all vectors are normalized to signed 8bits (-128..+127) and the normalization-precision is very rough, you can see some quantization noise in the shading (which is actually good, if it wasn't there you'd see color banding) which could be removed by using more bits of the available range - but it's probably getting a bit trickier with mmx then...

glad you wrote this as i just couldnt produce the nice speckled effect ( in the float version ) around the shading and it was driving me nuts.

hellfire · « **Reply #37 on:** May 13, 2013 »

Quote from: ninogenio on May 13, 2013

ive taken a step back to the float version
next step is to change the fixed point version to behave the same

Another option is to keep everything in floating point and take the sse route...

ninogenio · « **Reply #38 on:** May 14, 2013 »

i think ill probably go the fixed point mmx way mostly for memory bandwidth.

well i have to admit i didnt fully understand what was going on with all your shifts etc so have spent the whole night tearing all my code down to a number by number basis and watching all the values in real time. i've learned stuff like fixed point reciprocal divides etc.. its really been a while.

ive fixed the range normalizing issues i had, tidied it all up a bit and now its 4 core. i dont think it will use all the cpu though, still haven't got round to getting into that. im still using 10 point shifts atm. my next job is to bring everything into mmx range and try a bit of asm out.

150fps on my end at the moment.

hellfire · « **Reply #39 on:** May 14, 2013 »

Quote from: ninogenio on May 14, 2013

well i have to admit i didnt fully understand what was going on with all your shifts

Well, what's probably not so straightforward is the normalization part - which would usually look like this:

Code: [Select]

int t= lightdir.x*lightdir.x + lightdir.y*lightdir.y + lightdir.z*lightdir.z + lightdir.w*lightdir.w;

// factor to normalize to -128..+128 with 8bit of fractional part
int invSqrt= (128*256) / sqrt(t);

lightdir.x= lightdir.x * invSqrt >> 8;
lightdir.y= lightdir.y * invSqrt >> 8;
lightdir.z= lightdir.z * invSqrt >> 8;
lightdir.w= lightdir.w * invSqrt >> 8;

But when multiplying two 16bit values with mmx you can only keep the upper or the lower word, like this:

Code: [Select]

pmullw: lightdir= lightdir * invSqrt;
pmulhw: lightdir= lightdir * invSqrt >> 16;

With pmullw you get an overflow when the vector exceeds 0..255 (which it does),
with pmulhw the result gets 256x smaller than it's supposed to.
So I distributed the factor of 256 to both values, *8 to the vector and *32 to the invSqrt, ending up at:

Code: [Select]

invSqrt= (128*256*32) / sqrt(t);
viewdir= (viewdir<<3) * invSqrt >> 16;

Now the invSqrt doesn't fit into 16bit anymore for small values of t.
But that's not a big problem because it's impossible to normalize a very small vector anyway (because 1/sqrt(0) = infinity).
So I just clamp the table-values at 32767 and accept that vectors shorter than 0.25 (32 at 7bits fractional) will be too short.

I chose a factor of 32 because at that point the lookup-table for invSqrt was much larger and I was looking up invSqrt[t>>5], so only the value of invSqrt[0] got clamped (which must be clamped anyway).
Now that it's looking up invSqrt[t>>10], it makes more sense to put the whole *256 within the invSqrt-table and remove the shift...
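To see the redistribution of the factor 256 in numbers, here is a small scalar C sketch that mimics one lane of pmulhw. `mulhi16`, `normalizePlain` and `normalizeMulhi` are names invented for this example. To keep it free of sqrt and tables, the vector length `len` is passed in directly (assumed > 0), and the component must stay below 4096 in magnitude so that `v * 8` fits into 16 bits.

```c
#include <stdint.h>

/* Scalar stand-in for one lane of pmulhw: signed 16x16 multiply,
   keep only the upper 16 bits of the 32bit product. */
static int16_t mulhi16(int16_t a, int16_t b)
{
    return (int16_t)(((int32_t)a * (int32_t)b) >> 16);
}

/* Straightforward version: factor with 8 fractional bits, shift by 8. */
static int normalizePlain(int v, int len)
{
    int inv = (128 * 256) / len;
    return v * inv >> 8;
}

/* pmulhw-friendly version: *8 on the vector, *32 on the factor.
   8 * 32 = 256 makes up for the extra 8 bits dropped by >>16. */
static int normalizeMulhi(int v, int len)
{
    int inv = (128 * 256 * 32) / len;
    if (inv > 32767)
        inv = 32767;                      /* clamp as in the table */
    return mulhi16((int16_t)(v * 8), (int16_t)inv);
}
```

For the vector (300, 400) with length 500, both versions give 76 for x (exact value 76.8); for y the high-word version gives 102 and the plain one 101 (exact value 102.4), because the larger factor carries 5 more bits of precision. A vector of length 16 hits the clamp and comes out at 63 instead of 128, which is the "too short" case mentioned above.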