Messages - hellfire

1
Freebasic / Re: bumpmapping
« on: May 16, 2013 »
Since I don't have "virtual cores" it doesn't seem to make much of a difference:
BumpMap4: around 125fps and 75-85% cpu usage.
BumpMap8: around 130fps and 70-90% cpu usage.

I think there's simply no thread running while ptc transfers the framebuffer over to the window (and it's probably not very fast).
How about running them in parallel with a double buffer?
Code: [Select]
Start threads rendering to buffer0
display buffer1
wait for threads
swap pointers of buffer0 / buffer1
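In C, the loop above could look roughly like this. It is only a sketch: `renderScanlines` and `displayBuffer` are hypothetical stand-ins for the bump-mapping pass and the ptc transfer, and the OpenMP `sections` construct is just one way to overlap the two. Without OpenMP enabled the pragmas are ignored and the two steps simply run one after the other.

```c
#include <stddef.h>

#define FRAME_PIXELS (640 * 480)

/* Hypothetical stand-in for the renderer: fills the buffer with a frame id. */
static void renderScanlines(unsigned int *buffer, unsigned int frame)
{
   for (size_t i = 0; i < FRAME_PIXELS; i++)
      buffer[i] = frame;
}

/* Hypothetical stand-in for the ptc transfer: returns the first pixel shown. */
static unsigned int displayBuffer(const unsigned int *buffer)
{
   return buffer[0];
}

/* Renders `frame` into *back while *front is displayed, then swaps the two.
   Returns the value that was displayed during this call. */
static unsigned int runFrame(unsigned int **front, unsigned int **back, unsigned int frame)
{
   unsigned int shown = 0;
   unsigned int *renderTarget = *back;
   const unsigned int *displaySource = *front;

   #pragma omp parallel sections
   {
      #pragma omp section
      renderScanlines(renderTarget, frame);   /* start rendering to buffer0 */

      #pragma omp section
      shown = displayBuffer(displaySource);   /* display buffer1 meanwhile */
   }
   /* the implicit barrier at the end of the sections block waits for both */

   unsigned int *tmp = *front;                /* swap pointers */
   *front = *back;
   *back = tmp;
   return shown;
}
```

Call `runFrame(&front, &back, n)` once per frame; the buffer rendered in one call is the one displayed in the next.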

2
Freebasic / Re: bumpmapping
« on: May 15, 2013 »
Quote
does your code wait on the threads finishing before updating the frame or does it just unlock them and let them run
I don't really know how to determine that a frame has finished without waiting on the threads to finish.
At the moment I don't really do anything at all - all the multi-threading is handled by the open-mp macro automatically.
It behaves just like running without multiple threads, so the loop finishes as soon as all threads are done.
As open-mp provides a thread-id for every iteration of the parallel loop, I figured it processes contiguous blocks with each thread.
That's probably not the best possible solution for all scenarios but good enough to not think about a better way for now.
If each block requires very different processing time, it probably makes sense to work at a finer granularity to minimize thread sync time.
I haven't checked the processing time for each thread yet, but I guess it should be quite constant as the amount of source data is equal and accessed strictly linearly.
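If the per-block times do turn out to differ, OpenMP's `schedule` clause lets you trade scheduling overhead for better balance. A minimal sketch, assuming a hypothetical `processScanline` stand-in for the real per-scanline work and an arbitrary chunk size of 8 scanlines:

```c
/* Hypothetical stand-in for the per-scanline work: writes y into every pixel. */
static void processScanline(unsigned int *dst, int width, int y)
{
   for (int x = 0; x < width; x++)
      dst[x] = (unsigned int)y;
}

static void drawFrame(unsigned int *dst, int width, int height)
{
   /* hand out chunks of 8 scanlines to whichever thread is idle;
      schedule(static) would give one contiguous block per thread instead */
   #pragma omp parallel for schedule(dynamic, 8)
   for (int y = 0; y < height; y++)
      processScanline(dst + y * width, width, y);
}
```

Smaller chunks balance better but cost more scheduling work, so the chunk size is something to measure rather than guess.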

3
Freebasic / Re: bumpmapping
« on: May 15, 2013 »
Quote
do you think its possible i might have mutual exchange issues with my threading and that might make the cpu sit around a lot of the time each frame and do nothing.
would passing the the same segment of memory too different threads even though they were working on different cells cause any binding issues.
I guess you're not using any kind of mutexing, so there's no reason why any of the threads should wait.
And as every thread works on its own block of data, the threads cannot interfere.
However, every thread needs its own set of temporary variables - when working on the same global variable from different threads, the result is of course totally unpredictable.

But there can still be really awkward situations like this one:
Code: [Select]
// once loaded from memory, both variables will be kept in the same cache line
int dataA;
int dataB;

core0:
dataA= 1234;    // write value back into cache

core1:
int value= dataB;
// cache line of dataB got invalidated because core0 modified
// memory which refers to the same cache line!
// must request core0 to write its cache line back into memory
// wait until memory is available
// read back whole cache line
// wait until data is available in the cache
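One common way around this is to give each core's data its own cache line. A minimal sketch, assuming a 64-byte cache line (typical for x86, but check your target) and C11 `_Alignas`; the struct name is made up for the example:

```c
#include <stddef.h>

#define CACHE_LINE 64   /* assumed cache line size in bytes */

/* Each member starts on its own cache line, so a write to dataA by core0
   does not invalidate the line holding dataB on core1. */
typedef struct
{
   _Alignas(CACHE_LINE) int dataA;   /* written by core0 */
   _Alignas(CACHE_LINE) int dataB;   /* read by core1 */
} PerCoreData;
```

The price is wasted memory: the struct grows from 8 to 128 bytes, so this only makes sense for the few variables that are actually hammered from different cores.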

4
Freebasic / Re: bumpmapping
« on: May 14, 2013 »
Well, the main trick is to make your vectors fit into 4x short, so you can use mmx for vector and color-processing.
I'd suggest converting your code to asm in very, very small steps.
Write all intermediate values back into variables so you can check them.

Start with the simple stuff, for example:
Code: [Select]
ShortVector4 col;
/*
col.x= 0;
col.y= 0;
col.z= 0;
col.w= 0;
*/
_asm {
   pxor mm7,mm7
   movq [col], mm7
};

And
Code: [Select]
// calc light direction
/*
lightDir.x= pos.x - lightPos.x;
lightDir.y= pos.y - lightPos.y;
lightDir.z= pos.z - lightPos.z;
lightDir.w= pos.w - lightPos.w;
*/
_asm {
  movq      mm3, [pos]
  movq      mm4, [lightPos]
  psubw     mm3, mm4
  movq      [lightDir],mm3
};

Once you've converted the whole inner loop, you can remove most of the loading/storing from and to variables.
That's the point where your code suddenly gets much faster.

5
C / C++ /C# / Re: [C] 64-bit XM player
« on: May 14, 2013 »
Hi Rene,
why do you want to create a 64bit binary?
If you're working with Cuda anyway, you won't benefit much from 64bit code generation.
You probably won't need more than 2gb of memory, either.
And a 64bit binary rules out everyone who's still running a 32bit os...

6
Freebasic / Re: bumpmapping
« on: May 14, 2013 »
Quote
well i have too admit i didnt fully understand what was going on with your all your shifts

Well, what's probably not so straightforward is the normalization part - which would usually look like this:
Code: [Select]
int t= lightdir.x*lightdir.x + lightdir.y*lightdir.y + lightdir.z*lightdir.z + lightdir.w*lightdir.w;

// factor to normalize to -128..+128 with 8bit of fractional part
int invSqrt= (128*256) / sqrt(t);

lightdir.x= lightdir.x * invSqrt >> 8;
lightdir.y= lightdir.y * invSqrt >> 8;
lightdir.z= lightdir.z * invSqrt >> 8;
lightdir.w= lightdir.w * invSqrt >> 8;

But when multiplying two 16bit values with mmx you can only keep the upper or the lower word, like this:
Code: [Select]
pmullw: lightdir= lightdir * invSqrt;
pmulhw: lightdir= lightdir * invSqrt >> 16;

With pmullw you get an overflow when the vector exceeds 0..255 (which it does),
with pmulhw the result gets 256x smaller than it's supposed to be.
So I distributed the factor of 256 to both values, *8 to the vector and *32 to the invSqrt, ending up at:
Code: [Select]
invSqrt= (128*256*32) / sqrt(t);
lightdir= (lightdir<<3) * invSqrt >> 16;
Now the invSqrt doesn't fit into 16bit anymore for small values of t.
But that's not a big problem because it's impossible to normalize very small vectors anyway (because 1/sqrt(0) = infinity).
So I just clamp the table-values at 32767 and accept that vectors shorter than 0.25 (32 at 7bits fractional) will be too short.

I chose a factor of 32 because at that point the lookup-table for invSqrt was much larger and I was looking up invSqrt[t>>5], so only the value of invSqrt[0] got clamped (which must be clamped anyway).
Now that it's looking up invSqrt[t>>10], it makes more sense to put the whole *256 within the invSqrt-table and remove the shift...
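To check the scaling on paper, here is a small plain-C sketch that emulates one lane of pmulhw and applies the *8 / *32 split described above. The helper names are made up, and `isqrt` is a deliberately naive integer square root that stands in for the lookup table.

```c
/* Emulates one lane of mmx pmulhw: upper 16 bits of a signed 16x16 multiply.
   Relies on >> being an arithmetic shift for negative values, which holds
   on the usual compilers but is implementation-defined in C. */
static short pmulhw(short a, short b)
{
   return (short)(((int)a * (int)b) >> 16);
}

/* Integer square root by simple counting, only meant for small example values. */
static int isqrt(int t)
{
   int r = 0;
   while ((r + 1) * (r + 1) <= t)
      r++;
   return r;
}

/* Scale factor with the extra *32, clamped so it still fits a signed word. */
static short invSqrtScaled(int t)
{
   int r = isqrt(t);
   if (r == 0)
      return 32767;
   int v = (128 * 256 * 32) / r;
   return (short)(v > 32767 ? 32767 : v);
}

/* Normalizes one vector component to roughly -128..+128;
   t is the squared length of the whole vector. */
static short normalizeComponent(short c, int t)
{
   return pmulhw((short)(c * 8), invSqrtScaled(t));
}
```

For the vector (300, 400, 0, 0) the squared length is 250000, and the two components come out as 76 and 102, close to the exact 0.6*128 and 0.8*128.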

7
C / C++ /C# / Re: [C++][OpenGL] Sine Scroller
« on: May 13, 2013 »
Quote
Uploaded the executable file, i don't know if it works on another PC, please try  ;)
Even works with wine (a compatibility layer to run windows executables under linux), so it quite possibly runs everywhere :)

8
Freebasic / Re: bumpmapping
« on: May 13, 2013 »
Quote
ive taken a step back too the float version
next step is too change the fixed point version too behave the same
Another option is to keep everything in floating point and take the sse route...

9
Freebasic / Re: bumpmapping
« on: May 12, 2013 »
The rest of the parameters are:
Code: [Select]
w= width of bitmap (640)
h= height of bitmap (480)
time= time in seconds

lightPos.xyzw= ( (sin(time*1.6)+1)*w/2, (cos(time*1.8)+1)*h/2, 50, 0)
cameraPos.xyzw= (w/2, h/2, 0, 0)
lightColor.xyzw= (127, 127, 255, 0)  // blue, green, red, alpha
specColor.xyzw= (255, 192, 192, 0)   // blue, green, red, alpha

And the two textures are modified the following way:
Code: [Select]
// "height" into alpha channel:
color[i].w= 255 - ((color[i].x*30 + color[i].y*150 + color[i].z*76) >> 8);

// renormalize
int t= 32767 / sqrt( normal[i].x*normal[i].x + normal[i].y*normal[i].y + normal[i].z*normal[i].z);
normal[i].x= -(normal[i].x * t >> 8);  // negated!
normal[i].y=  (normal[i].y * t >> 8);
normal[i].z=  (normal[i].z * t >> 8);
The color/normal-buffers got interleaved into a separate buffer, so I can read both with a single movq and save an address register.
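
The height step can be written as a small helper on a packed 32-bit pixel. This is a sketch that assumes the byte order used elsewhere in this thread (x = blue in bits 0..7, then green, red, alpha); `packHeight` is a made-up name.

```c
/* Packs "height" (inverted grey value) into the alpha byte of a 32-bit pixel.
   Assumed channel order: blue in bits 0..7, green 8..15, red 16..23. */
static unsigned int packHeight(unsigned int pixel)
{
   unsigned int b = pixel & 255;
   unsigned int g = (pixel >> 8) & 255;
   unsigned int r = (pixel >> 16) & 255;

   /* the weights 30 + 150 + 76 add up to 256, so >> 8 gives a 0..255 grey value */
   unsigned int grey = (b * 30 + g * 150 + r * 76) >> 8;

   return (pixel & 0x00ffffffu) | ((255 - grey) << 24);
}
```

White pixels end up with height 0 and black pixels with height 255.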

As all vectors are normalized to signed 8bits (-128..+127) and the normalization-precision is very rough, you can see some quantization noise in the shading (which is actually good, if it wasn't there you'd see color banding). The noise could be removed by using more bits of the available range - but it's probably getting a bit trickier with mmx then...

Quote
im not doing my multi threading on a scanline by scanline basis as you do
Open-mp doesn't schedule one job per scanline. Instead it distributes the whole number of loop iterations (in this case 0..479) over the number of available cores.
So scanlines 0..119 are processed by core0, 120..239 by core1, 240..359 by core2, 360..479 by core3.
This gives minimal scheduling overhead, but if one core gets interrupted by another task and thus finishes later, all other cores must wait until the last one has finished.
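
To make the block split concrete, here is a small sketch that computes which scanlines each thread would get under an even static split. It only mirrors the description above: how a given OpenMP runtime schedules a loop without a `schedule` clause is implementation-defined, and `staticBlock` is a made-up helper.

```c
/* Computes the half-open scanline range [*begin, *end) for thread `id`
   when `count` iterations are split evenly over `threads` threads. */
static void staticBlock(int count, int threads, int id, int *begin, int *end)
{
   int base = count / threads;      /* iterations every thread gets */
   int extra = count % threads;     /* first `extra` threads get one more */

   *begin = id * base + (id < extra ? id : extra);
   *end = *begin + base + (id < extra ? 1 : 0);
}
```

With 480 scanlines and 4 threads, thread 1 gets the range 120..239, matching the split listed above.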

10
Freebasic / Re: bumpmapping
« on: May 12, 2013 »
Quote
what value do you hold in light w?
i'm on my mobile, so just a short reply:
all w components are zero and are just there to make the compiler use a single vector instruction on the whole 4 values.
otherwise it tries to mask the 4th component away and ends up slower...
i just use alpha of the colormap to store 255-grey.
i also renormalize the normalmap after loading because it didn't really fit.
have to look up the z components of the light and camera when i'm back home.

11
Freebasic / Re: bumpmapping
« on: May 12, 2013 »
Nice work, Nino!
Runs at about 85fps here.

This got me a bit curious and I wanted to see how fast I could get it myself.
So I started from scratch with a glsl shader as it's much easier to figure out the math, check the required precisions and try different variations.
That code looks like this and runs at >5000 fps in 640x480 on my gtx560-ti:
Code: [Select]
uniform sampler2D colorTex;   // color texture
uniform sampler2D normalTex;  // normal texture
uniform vec3 lightPos;        // light position
uniform vec4 lightColor;      // light color (1.0, 0.4, 0.4, 1.0);
uniform vec4 specColor;       // specular color (1.0, 0.9, 0.9, 1.0)

varying vec2 fragPos;         // 2d pixel position (0..1, 0..1)
varying vec2 uv;              // 2d texture coordinate

void main()
{
   vec4 result= vec4(0.0, 0.0, 0.0, 0.0);

   // get color and normal from textures
   vec4 color= texture2D(colorTex, uv);
   vec3 normal= texture2D(normalTex, uv).xyz * 2.0 - 1.0;

   // make sure normal is actually normalized
   normal= normalize( normal );

   // get height from alpha channel of color map
   float height= 2.0 - color.a;

   // 3d fragment position
   vec3 pos= vec3(fragPos, height);

   // camera (0,0,0) to pixel
   vec3 viewDir= normalize(pos);

   // light to pixel
   vec3 lightDir= normalize(pos - lightPos);
   vec3 refl= normal*2.0*dot(normal, viewDir) - viewDir;

   // distance attenuation (1/dist^2) * 2.0 (disabled)
//   float dist= 2.0 / length(lightPos - pos);

   // diffuse term
   float diffuse= dot(normal, lightDir);
   if (diffuse > 0.0)
       result= color * lightColor * diffuse; // * dist;

   // specular term
   float specular= dot(refl, lightDir);
   if (specular > 0.8)
       result+= specColor * pow( specular, 20.0 );

   // store color
   gl_FragColor= result;
}

So I started to write this down in C, using 4d 16bit integer vectors to give the compiler a clue to use mmx, and replaced the normalization- and pow-part with lookup tables.
That version ran at about 70fps using no multithreading. Looking at the disassembly, it actually used mmx vectorization but wasn't very clever at it.
So I successively replaced all parts of the inner loop with hand-crafted mmx blocks, which got me to around 85fps, and there's probably a good chance to make that a few percent faster.
Finally I added multi-core support to make it use all my cpu-cores and got to >200fps (exe attached).
I must admit that it uses all my 4 cores to the max while your version only uses ~50%.

The c code I started from looks like this:
Code: [Select]
typedef struct
{
   short x,y,z,w;
} ShortVector4;

// forward declaration, the function is defined further down
void scanlineBump2d_c(
   unsigned int* dst, unsigned int* src, int width, int y,
   ShortVector4 lightPos, ShortVector4 lightColor,
   ShortVector4 specColor, ShortVector4 camera);

void drawBump2d_c(
      unsigned int *dst,          // destination buffer
      unsigned int *src,          // interleaved color/normal data per pixel
      int width,                  // width of buffer
      int height,                 // height of buffer
      ShortVector4 lightPos,      // light position
      ShortVector4 camera,        // camera position (w/2, h/2, 0, 0)
      ShortVector4 lightColor,    // light color
      ShortVector4 specColor      // specular color
)
{
   // run scanlines in parallel:
   #pragma omp parallel for
   for (int y=0; y<height; y++)
   {
      scanlineBump2d_c(
         dst + y * width,
         src + y * width * 2,
         width,
         y,
         lightPos,
         lightColor,
         specColor,
         camera
      );
   }
}


void scanlineBump2d_c(
   unsigned int* dst,
   unsigned int* src,
   int width,
   int y,
   ShortVector4 lightPos,
   ShortVector4 lightColor,
   ShortVector4 specColor,
   ShortVector4 camera)
{
   ShortVector4 pos;
   ShortVector4 norm;
   ShortVector4 viewdir;
   ShortVector4 refl;
   ShortVector4 col;
   ShortVector4 lightdir;

   pos.x= 0;
   pos.y= y;
   pos.z= 0;
   pos.w= 0;
   for (int x=0; x<width; x++)
   {
      // start with black color
      col.x= 0;
      col.y= 0;
      col.z= 0;
      col.w= 0;

      unsigned int pixelColor= src[0];
      pos.z= (pixelColor >> 24 & 255); // height stored in alpha
      // pos is the pixel's 3d coordinate

      // normal vector in 7bit fractional
      unsigned int nrm= src[1];
      norm.x= (nrm & 255) - 128;
      norm.y= (nrm >> 8 & 255) - 128;
      norm.z= (nrm >> 16 & 255) - 128;
      norm.w= (nrm >> 24 & 255) - 128;

      // light direction
      lightdir.x= pos.x - lightPos.x;
      lightdir.y= pos.y - lightPos.y;
      lightdir.z= pos.z - lightPos.z;
      lightdir.w= pos.w - lightPos.w;

      // normalize
      int t;
      short inv;
      t= lightdir.x*lightdir.x + lightdir.y*lightdir.y + lightdir.z*lightdir.z + lightdir.w*lightdir.w;
      inv= invSqrt[t>>10];

      // rescale this to end up with 16bit of fraction to match mmx' pmulhw
      lightdir.x= (lightdir.x<<3)*inv>>16;
      lightdir.y= (lightdir.y<<3)*inv>>16;
      lightdir.z= (lightdir.z<<3)*inv>>16;
      lightdir.w= (lightdir.w<<3)*inv>>16;

      // calculate diffuse term - result: -16383..+16383
      int diffuse= norm.x*lightdir.x + norm.y*lightdir.y + norm.z*lightdir.z + norm.w*lightdir.w;
      if (diffuse > 0)
      {
         diffuse= diffuse>>5;

         col.x+= (pixelColor >>  0 & 255) * lightColor.x * diffuse >> 16;
         col.y+= (pixelColor >>  8 & 255) * lightColor.y * diffuse >> 16;
         col.z+= (pixelColor >> 16 & 255) * lightColor.z * diffuse >> 16;
         col.w+= (pixelColor >> 24 & 255) * lightColor.w * diffuse >> 16;
      }


      // view direction vector: camera -> pixel
      viewdir.x= pos.x - camera.x;
      viewdir.y= pos.y - camera.y;
      viewdir.z= pos.z - camera.z;
      viewdir.w= pos.w - camera.w;

      // normalize
      t= viewdir.x*viewdir.x + viewdir.y*viewdir.y + viewdir.z*viewdir.z + viewdir.w*viewdir.w;
      inv= invSqrt[t>>10];

      viewdir.x= (viewdir.x<<3)*inv>>16;
      viewdir.y= (viewdir.y<<3)*inv>>16;
      viewdir.z= (viewdir.z<<3)*inv>>16;
      viewdir.w= (viewdir.w<<3)*inv>>16;

      // reflection vector
      t= (norm.x*viewdir.x + norm.y*viewdir.y + norm.z*viewdir.z + norm.w*viewdir.w);
      refl.x= ((norm.x<<3) * t >> 16) - viewdir.x;
      refl.y= ((norm.y<<3) * t >> 16) - viewdir.y;
      refl.z= ((norm.z<<3) * t >> 16) - viewdir.z;
      refl.w= ((norm.w<<3) * t >> 16) - viewdir.w;

      // specular term. result: -16383..+16383
      int specular= refl.x*lightdir.x + refl.y*lightdir.y + refl.z*lightdir.z + refl.w*lightdir.w;

      if (specular > 12288) // 16383 * 0.75 -> pow(0.75, 20.0) < 1/255
      {
         specular= specular >> 7;

         unsigned int s= powTable[specular] & 255;

         col.x+= (s * specColor.x >> 8);
         col.y+= (s * specColor.y >> 8);
         col.z+= (s * specColor.z >> 8);
         col.w+= (s * specColor.w >> 8);
      }

      // saturate
      if (col.x>255) col.x=255;
      if (col.y>255) col.y=255;
      if (col.z>255) col.z=255;
      if (col.w>255) col.w=255;

      *dst= (col.z<<16)|(col.y<<8)|col.x;

      dst++;
      src+=2;
      pos.x++;
   }
}

The two tables look like this:
Code: [Select]
   int invSqrt[65536];             // way too much
   unsigned int powTable[2048];

   for (int i=0; i<2048; i++)
   {
      double p= pow(i/128.0, 20.0)*2.0;
      if (p<0.0) p=0.0;
      if (p>1.0) p=1.0;
      int v= p*255.0;
      powTable[i]= (v<<24)|(v<<16)|(v<<8)|v;
   }

   for (int i=0; i<65536; i++)
   {
      double t= 32767.0 * 32.0 / sqrt(i*1024.0);
      if (t > 0x7fff) t= 0x7fff;
      invSqrt[i]= (int)t;
   }

As I don't want to kill the suspense I'm not going to add the mmx code for now ;)

12
Freebasic / Re: bumpmapping
« on: May 11, 2013 »
Quote
im trying too pull specdot into the range of -1 1 for a precalced pow, normalizing Vdir and light  Vectors but there must be something off some where else because i always get around 1.4 -1.4. unless i normalize the reflection vector, light vector and Vdir then i get -1 1.
The dot-product of two vectors v1 and v2 (v1.x*v2.x + v1.y*v2.y + v1.z*v2.z) is in the range -1..+1 only if the length of both vectors is 1.0.
If that's not the case, the diffuse term is calculated by
Code: [Select]
dot= (v1.x*v2.x + v1.y*v2.y + v1.z*v2.z)
diffuse= dot / ( length(v1) * length(v2) )
To avoid the square root associated with calculating the length, one tries to have both vectors *almost* normalized beforehand, so that length(v1) * length(v2) becomes ~1.0 and can be skipped.
Nobody will notice if the length of your vectors is off by a few percent, so a very rough approximation for 1/sqrt is totally sufficient (for floating point values the fast inverse square root function is popular).
On the other hand, why bother with normalization if your shading function gives good results with unnormalized vectors?
All that happens is that your dot products deliver somewhat larger (or smaller) values. So if you want to use it for a lookup-table, your array must be somewhat larger...
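
For reference, a minimal sketch of the float route: the well-known fast inverse square root with one Newton-Raphson step, used to evaluate dot / (length(v1) * length(v2)) without calling sqrt. It assumes 32-bit IEEE 754 floats; `cosAngle` is a made-up helper name.

```c
#include <stdint.h>
#include <string.h>

/* Fast inverse square root approximation with one Newton-Raphson step. */
static float fastInvSqrt(float x)
{
   uint32_t bits;
   float y;

   memcpy(&bits, &x, sizeof bits);         /* reinterpret float as integer */
   bits = 0x5f3759dfu - (bits >> 1);       /* initial guess */
   memcpy(&y, &bits, sizeof y);

   return y * (1.5f - 0.5f * x * y * y);   /* refine once */
}

/* dot / (length(v1) * length(v2)) for two unnormalized 3d vectors */
static float cosAngle(const float v1[3], const float v2[3])
{
   float dot    = v1[0]*v2[0] + v1[1]*v2[1] + v1[2]*v2[2];
   float lenSq1 = v1[0]*v1[0] + v1[1]*v1[1] + v1[2]*v1[2];
   float lenSq2 = v2[0]*v2[0] + v2[1]*v2[1] + v2[2]*v2[2];

   return dot * fastInvSqrt(lenSq1) * fastInvSqrt(lenSq2);
}
```

The result is off by a fraction of a percent, which is well within what a shading function tolerates.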

13
Freebasic / Re: bumpmapping
« on: May 10, 2013 »
This version runs at ~50fps on my machine, that's almost twice as fast as the previous one.

Quote
the only part that i cant get my head around fixed pointing is the pow function.
That's actually super easy:
If both vectors (Rx,Ry,Rz) and (Lx,Ly,Lz) are normalized to 10 bits, SpecDot is an integer in the range -1023..+1023.
Since you're only interested in positive values and (with an exponent of 40) all values <800 are zero anyway, you can look up the pow-function from a really small table.
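
A minimal sketch of such a table, assuming 10-bit normalized vectors and an exponent of 40 as above. The names are made up, and the power is computed by repeated multiplication just to keep the example free of libm.

```c
#define SPEC_MIN 800     /* cutoff from above: everything below is zero anyway */
#define SPEC_MAX 1023    /* 1.0 in 10-bit fixed point */

static unsigned char specTable[SPEC_MAX - SPEC_MIN + 1];

/* Fills the table with pow(specDot / 1023.0, 40) scaled to 0..255. */
static void initSpecTable(void)
{
   for (int i = SPEC_MIN; i <= SPEC_MAX; i++)
   {
      double x = i / (double)SPEC_MAX;
      double p = 1.0;
      for (int n = 0; n < 40; n++)   /* integer exponent: plain repeated multiply */
         p *= x;
      specTable[i - SPEC_MIN] = (unsigned char)(p * 255.0);
   }
}

/* Looks up the specular intensity for a dot product in -1023..+1023. */
static int specularFromDot(int specDot)
{
   if (specDot < SPEC_MIN)
      return 0;
   if (specDot > SPEC_MAX)
      specDot = SPEC_MAX;
   return specTable[specDot - SPEC_MIN];
}
```

Call `initSpecTable()` once at startup; the table needs only 224 bytes.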

But I noticed that your view direction vector (VDirX,VDirY,VDirZ) is not normalized (and I'm a bit surprised that it still works so well), so you have to be a bit careful with the actual numeric range of the dot-products.

And code like this:
Code: [Select]
Rx = VDirX-Nx*2*SpecDot
Ry = VDirY-Ny*2*SpecDot
Rz = VDirZ-Nz*2*SpecDot
...is predestined for mmx, you just have to make sure that input and output fit into signed 16bit values.
If you extend your vectors to have a 4th coordinate (which just stays 0), it's much easier to load data into mmx registers.

14
Freebasic / Re: bumpmapping
« on: May 10, 2013 »
Quote
you shift all ints and floats left by 8 bits,
add subtract multiply them together
then shift right at the end again
Exactly.
Those values which contain integers anyway won't need any additional bits, though.
For the rest you usually have to check how much precision is actually required.
The trick is to keep track of the number of fractional bits.
For example if you multiply two values with 8bits of fractional part, the result has 16bits of fraction - so you need to shr 8 to get back to 8bits of precision.
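
In C terms, that bookkeeping looks like this. It is a minimal sketch; the macro and function names are made up for the example.

```c
#define FRAC_BITS 8
#define TO_FIXED(v) ((int)((v) * (1 << FRAC_BITS)))   /* e.g. 1.5 -> 384 */

/* Multiplies two values with 8 fractional bits each. The raw product has
   16 fractional bits, so shift right by 8 to get back to 8. */
static int fixedMul(int a, int b)
{
   return (a * b) >> FRAC_BITS;
}
```

For example 1.5 * 2.0 is 384 * 512 = 196608, and shifting right by 8 gives 768, which is 3.0 with 8 fractional bits.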

For the diffuse part I would quantize the vectors to 8bits (so a normalized vector ranges from -255..+255):
Code: [Select]
DiffDot = ( Nx * Lx + Ny * Ly + Nz * Lz ) shr 8;
So you get an 8bit (0..255) value to shade your texture color, which fits nicely into mmx.

For the specular part you might need more bits because the exponent keeps only a small piece of the range (eg. 0.8 - 1.0), all the rest is black (below 1/255) anyway.
And the integer value makes it much easier to pick the pow-function from a table.

Another thing that makes your code slower at the moment is that you precalculated everything into double-arrays, which increases your memory bandwidth requirements by a factor of 8.

I had a look at your code and noticed that this part:
Code: [Select]
            VdirX = (PixyX) * ReciOX
            VdirX = CamX-VdirX
            VdirY = (PixyY) * ReciOY
            VdirY = CamY-VdirY
            VdirZ = TextureZ - CamZ
           
            SpecDot = ( Nx*VDirX + Ny*VDirY + Nz*VDirZ )
           
            Rx = VDirX-Nx*2.0*SpecDot
            Ry = VDirY-Ny*2.0*SpecDot
            Rz = VDirZ-Nz*2.0*SpecDot
...is constant for each pixel and can be precalculated just like the normal map.

15
Freebasic / Re: bumpmapping
« on: May 09, 2013 »
33 fps on Intel Core2 Quad 2.8GHz.
Feels much faster than the previous versions.

And my magic crystal ball foretells twice the speed if you kick the floating point stuff out of the main loop and use only integers :)

16
Freebasic / Re: bumpmapping
« on: May 08, 2013 »
If you're looking from (almost) the same angle as the light source hits the surface, diffuse and specular are on the same spot and hard to distinguish.
A typical specular-only setting is the sun going down over the ocean (example) as the diffuse term gets almost zero but the light reflects towards the viewer.

17
Freebasic / Re: bumpmapping
« on: May 08, 2013 »
Quote
i've replaced it with this Exp(10*log(SpecDot))
Since floating point numbers are represented as x*2^exp, pow2 and log2 can be approximated by fiddling with the exponent-bits within the float.
And because
Code: [Select]
pow(x, exp) = pow2(exp * log2(x))
you can do a *very* rough approximation with:
Code: [Select]
   // x: base (float, > 0), e: exponent (float)
   const float inv23= 1.0f / (1 << 23);
   const float bias= 126.94269504f;

   // approximate log2(x)
   int *ip= (int*)&x; // cast x to integer
   float y = ip[0];
   y= y * inv23 - bias;
   e*=y;

   // approximate 2.0^e
   int i= (1 << 23) * (e + bias);
   float *fp= (float*)&i; // cast i to float
   float result= fp[0];
Be aware that you can be off by a factor of 2 but for a shading function that usually doesn't really matter.
And if you're limiting this to positive integer exponents, it gets quite a bit tighter.
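
Wrapped into a function, the same idea could look like this. It is a sketch that assumes 32-bit IEEE 754 floats and uses memcpy instead of pointer casts to stay clear of strict-aliasing problems; `approxPow` is a made-up name.

```c
#include <stdint.h>
#include <string.h>

/* Very rough pow(x, e) for x > 0, built from float bit fiddling. */
static float approxPow(float x, float e)
{
   const float inv23 = 1.0f / (1 << 23);
   const float bias = 126.94269504f;
   int32_t bits;
   float result;

   /* approximate log2(x) from the raw float bits */
   memcpy(&bits, &x, sizeof bits);
   float log2x = (float)bits * inv23 - bias;

   /* approximate 2.0^(e * log2(x)) by building the float bits directly */
   bits = (int32_t)((1 << 23) * (e * log2x + bias));
   memcpy(&result, &bits, sizeof result);
   return result;
}
```

For instance `approxPow(2.0f, 3.0f)` lands near 8.9 instead of 8, which shows how rough the approximation is.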

18
Welcome to the forum, Ash!

Quote
when i was looking about integrating audio into demoscenes.
Scene-grammar lesson for today: there are many demos  but only one demo scene!


19
Freebasic / Re: bumpmapping
« on: May 07, 2013 »
Quote
im intrested in what you wrote about packing a specular level into the alpha channel too stop incorrect lighting going on would it be possible for you too elaborate a little further?
In Reply #4 I suggested an exponential function for the specular term:
Code: [Select]
if (specular > 0.0) specular= pow(specular, 40.0)
The exponent is supposed to simulate the reflection of a light source.
A higher exponent creates a smaller highlight and makes the material appear more shiny (for example see here).
Now your surface might not be equally shiny everywhere, eg. imagine some rusty spots on a metal plane (like this).

One way to do this is to have a separate (grey-scale) map to define the brightness of the specular term, like this:
Code: [Select]
if (specular > 0.0) specular= pow(specular, 40.0) * specularLevel[pixelPosition]
Another way is to store the specular exponent in the map:
Code: [Select]
if (specular > 0.0) specular= pow(specular, specularLevel[pixelPosition])
Since the exponential function is somewhat expensive, you'd probably use the first version in a software renderer and pick the pow-function with a constant exponent from a lookup table.

For the color- and normal-maps you're typically working with 32bit rgba-colors and the alpha-channel is often unused.
And as the specular-map just contains a single scalar value for each pixel, it can be stored in the alpha-component of one of the other maps to avoid another texture fetch.
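
A minimal sketch of the first variant with the level stored in alpha, assuming a 32-bit texel with alpha in the top byte and a specular intensity of 0..255 that was already looked up from a pow table. `applySpecularLevel` is a made-up name.

```c
/* Scales a specular intensity (0..255, e.g. from a pow lookup table) by the
   specular level stored in the alpha byte of a 32-bit rgba texel. */
static unsigned int applySpecularLevel(unsigned int specular, unsigned int texel)
{
   unsigned int level = (texel >> 24) & 255;   /* alpha = specular level */
   return specular * level / 255;
}
```

A rusty spot with alpha 0 gets no highlight at all, a polished one with alpha 255 gets the full table value.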

20
Freebasic / Re: Confusion demo fx.
« on: May 07, 2013 »
That's some nice stuff, Rel!
