Author Topic: bumpmapping  (Read 7569 times)


Offline ninogenio

  • Pentium
  • *****
  • Posts: 1666
  • Karma: 133
    • View Profile
Re: bumpmapping
« Reply #40 on: May 14, 2013 »
yeah i can see how they all work now, cheers Hellfire!

i just wanted to make sure i knew 100% what was going on first. now that everything is pretty much ready for mmx: can mmx be used for almost everything in the loop (reflection vector, invsqrt, r,g,b saturation etc.), or is it better to use mmx for the color modulation part and saturation only?
Challenge Trophies Won:

Offline hellfire

  • Sponsor
  • Pentium
  • *******
  • Posts: 1289
  • Karma: 466
    • View Profile
    • my stuff
Re: bumpmapping
« Reply #41 on: May 14, 2013 »
Well, the main trick is to make your vectors fit into 4x short, so you can use mmx for vector and color-processing.
I'd suggest converting your code to asm in very, very small steps.
Write all intermediate values back into variables so you can check them.

Start with the simple stuff, for example:
Code: [Select]
ShortVector4 col;
/*
col.x= 0;
col.y= 0;
col.z= 0;
col.w= 0;
*/
_asm {
   pxor mm7,mm7
   movq [col], mm7
};

And
Code: [Select]
// calc light direction
/*
lightDir.x= pos.x - lightPos.x;
lightDir.y= pos.y - lightPos.y;
lightDir.z= pos.z - lightPos.z;
lightDir.w= pos.w - lightPos.w;
*/
_asm {
  movq      mm3, [pos]
  movq      mm4, [lightPos]
  psubw     mm3, mm4
  movq      [lightDir],mm3
};

Once you've converted the whole innerloop, you can remove most of the loading/storing from and to variables.
That's the point where your code suddenly gets much faster.
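
To check each converted step against known-good values, a plain C reference for the `psubw` step can help. This is only a sketch: the `ShortVector4` layout (four signed 16-bit lanes) and the helper names `wrap_sub16` and `sub4_reference` are assumptions for illustration, not part of the code above.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: four signed 16-bit lanes, 8 bytes in total,
   so one vector fits exactly into one MMX register. */
typedef struct {
    int16_t x, y, z, w;
} ShortVector4;

static_assert(sizeof(ShortVector4) == 8, "must fit one 64-bit MMX register");

/* psubw subtracts lane by lane and wraps around on overflow (no saturation). */
int16_t wrap_sub16(int16_t a, int16_t b)
{
    uint16_t d = (uint16_t)((uint16_t)a - (uint16_t)b);   /* modulo 2^16 */
    return (d >= 0x8000u) ? (int16_t)((int32_t)d - 0x10000) : (int16_t)d;
}

/* Hypothetical helper: the value "psubw mm3, mm4" should leave in lightDir. */
ShortVector4 sub4_reference(ShortVector4 pos, ShortVector4 lightPos)
{
    ShortVector4 r;
    r.x = wrap_sub16(pos.x, lightPos.x);
    r.y = wrap_sub16(pos.y, lightPos.y);
    r.z = wrap_sub16(pos.z, lightPos.z);
    r.w = wrap_sub16(pos.w, lightPos.w);
    return r;
}
```

Comparing the variable written by the asm block with `sub4_reference(pos, lightPos)` after each step shows right away where a conversion went wrong. If the differences can leave the 16-bit range, the saturating `psubsw` would be the instruction to compare against instead.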
« Last Edit: May 14, 2013 by hellfire »

Offline ninogenio

Re: bumpmapping
« Reply #42 on: May 14, 2013 »
cheers hellfire, will do mate. i've already started ;)
i've 4d-vectorized all my variables and shifted them into the correct ranges, and also started converting all the simple parts to asm. might take me a while though..

when you say the load and store instructions are where the code slows down, you probably mean something like this i guess  :)

Code: [Select]
_BUMP1@4:
push ebp
mov ebp, esp
sub esp, 140
push ebx
.Lt_00DF:
mov dword ptr [ebp-4], 0
mov dword ptr [ebp-8], 0
mov dword ptr [ebp-12], 0
mov dword ptr [ebp-16], 0
mov dword ptr [ebp-20], 0
mov dword ptr [ebp-24], 0
mov dword ptr [ebp-28], 0
mov dword ptr [ebp-32], 0
mov dword ptr [ebp-36], 0
mov dword ptr [ebp-40], 0
mov dword ptr [ebp-44], 0
mov dword ptr [ebp-48], 0
mov dword ptr [ebp-52], 0
mov dword ptr [ebp-56], 0
mov dword ptr [ebp-60], 0
mov dword ptr [ebp-64], 0
mov dword ptr [ebp-68], 0
mov dword ptr [ebp-72], 0
mov dword ptr [ebp-76], 0
mov dword ptr [ebp-80], 0
mov dword ptr [ebp-84], 0
mov dword ptr [ebp-88], 0
mov dword ptr [ebp-92], 0
mov dword ptr [ebp-96], 0
mov dword ptr [ebp-100], 0
mov dword ptr [ebp-104], 0
mov dword ptr [ebp-108], 0
mov dword ptr [ebp-112], 0
mov dword ptr [ebp-116], 0
mov dword ptr [ebp-120], 0
mov dword ptr [ebp-124], 0
mov dword ptr [ebp-128], 0
mov dword ptr [ebp-132], 0
mov dword ptr [ebp-136], 0
fld qword ptr [_Lt_00D9]
fmul qword ptr [_DIFFUSER]
fistp dword ptr [ebp-96]
fld qword ptr [_Lt_00D9]
fmul qword ptr [_DIFFUSEG]
fistp dword ptr [ebp-100]
fld qword ptr [_Lt_00D9]
fmul qword ptr [_DIFFUSEB]
fistp dword ptr [ebp-104]
mov dword ptr [ebp-92], 0
mov eax, dword ptr [ebp-92]
lea ebx, [_BUFFER+eax*4]
mov dword ptr [ebp-4], ebx
mov ebx, dword ptr [ebp-92]
lea eax, [_NORMX+ebx*4]
mov dword ptr [ebp-32], eax
mov eax, dword ptr [ebp-92]
lea ebx, [_NORMY+eax*4]
mov dword ptr [ebp-36], ebx
mov ebx, dword ptr [ebp-92]
lea eax, [_NORMZ+ebx*4]
mov dword ptr [ebp-40], eax
mov eax, dword ptr [ebp-92]
lea ebx, [_BUMPR+eax*4]
mov dword ptr [ebp-8], ebx
mov ebx, dword ptr [ebp-92]
lea eax, [_BUMPG+ebx*4]
mov dword ptr [ebp-12], eax
mov eax, dword ptr [ebp-92]
lea ebx, [_BUMPB+eax*4]
mov dword ptr [ebp-16], ebx
mov ebx, dword ptr [ebp-92]
lea eax, [_RNORM+ebx*4]
mov dword ptr [ebp-20], eax
mov eax, dword ptr [ebp-92]
lea ebx, [_GNORM+eax*4]
mov dword ptr [ebp-24], ebx
mov ebx, dword ptr [ebp-92]
lea eax, [_BNORM+ebx*4]
mov dword ptr [ebp-28], eax
mov eax, dword ptr [ebp-92]
lea ebx, [_HEIGHTMAP+eax*4]
mov dword ptr [ebp-44], ebx
mov dword ptr [ebp-140], 0
mov dword ptr [ebp-52], 0
.Lt_00E4:
mov dword ptr [ebp-48], 0
.Lt_00E8:
mov ebx, dword ptr [ebp-32]
mov eax, dword ptr [ebx]
mov dword ptr [ebp-68], eax
mov eax, dword ptr [ebp-36]
mov ebx, dword ptr [eax]
mov dword ptr [ebp-72], ebx
mov ebx, dword ptr [ebp-40]
mov eax, dword ptr [ebx]
mov dword ptr [ebp-76], eax
mov eax, dword ptr [ebp-44]
mov ebx, dword ptr [eax]
add ebx, -20480
sar ebx, 10
mov dword ptr [ebp-120], ebx
fild dword ptr [ebp-48]
fsub qword ptr [_LIGHTX]
fistp dword ptr [ebp-112]
fild dword ptr [ebp-52]
fsub qword ptr [_LIGHTY]
fistp dword ptr [ebp-116]
mov ebx, dword ptr [ebp-112]
imul ebx, dword ptr [ebp-112]
mov eax, dword ptr [ebp-116]
imul eax, dword ptr [ebp-116]
add ebx, eax
mov eax, dword ptr [ebp-120]
imul eax, dword ptr [ebp-120]
add ebx, eax
mov dword ptr [ebp-140], ebx
mov ebx, dword ptr [ebp-140]
sar ebx, 10
mov eax, dword ptr [_INVSQRT+ebx*4]
mov dword ptr [ebp-140], eax

and this was just a tiny snippet of the freebasic generated code from the bump function..
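
For readers who don't want to trace the listing: as far as it can be read from the asm, the per-pixel part starting at `.Lt_00E8` boils down to roughly the C below. This is a sketch, not the original FreeBASIC source. The function name, the parameter names and the bounds check are assumptions, and the light position is taken as integers here, while the listing converts it from double with `fild`/`fsub`/`fistp`.

```c
#include <assert.h>
#include <stdint.h>

/* Distance-based lookup into the inverse square root table.
   x and y are presumably the loop counters kept at [ebp-48] and [ebp-52]. */
int32_t invsqrt_lookup(int x, int y, int32_t height,
                       int lightX, int lightY,
                       const int32_t *invSqrtTable, int32_t tableSize)
{
    int32_t dx = x - lightX;                  /* stored at [ebp-112] */
    int32_t dy = y - lightY;                  /* stored at [ebp-116] */
    /* add ebx,-20480 / sar ebx,10. sar is an arithmetic shift; C leaves >>
       on negative values implementation-defined, mainstream compilers
       shift arithmetically as well. */
    int32_t dz = (height - 20480) >> 10;      /* stored at [ebp-120] */
    int32_t dist2 = dx * dx + dy * dy + dz * dz;
    int32_t index = dist2 >> 10;              /* second sar ebx,10 */

    assert(index >= 0 && index < tableSize);  /* not in the listing */
    return invSqrtTable[index];               /* mov eax,[_INVSQRT+ebx*4] */
}
```

In the generated code every one of these intermediate values makes a round trip through a stack slot (`[ebp-...]`), which is the loading and storing that hand-written asm can keep in registers.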

also, do you think it's possible i might have mutual exclusion issues with my threading, and that this might make the cpu sit around doing nothing for a lot of each frame? would passing the same segment of memory to different threads cause any binding issues, even though they are working on different cells? i have all my external arrays created globally atm and just let my bump1,2,3,4 pull them in and process different locations in them. there is actually a chance that my threads try to read variables such as lightx,y,z and address the same invsqrt and powtable indexes at the same time.
« Last Edit: May 14, 2013 by ninogenio »

Offline hellfire

Re: bumpmapping
« Reply #43 on: May 15, 2013 »
Quote
do you think it's possible i might have mutual exclusion issues with my threading, and that this might make the cpu sit around doing nothing for a lot of each frame?
would passing the same segment of memory to different threads cause any binding issues, even though they are working on different cells?

I guess you're not using any kind of mutexing, so there's no reason why any of the threads should wait.
And as every thread works on its own block of data, the threads cannot interfere.
However, every thread needs its own set of temporary variables: when different threads work on the same global variable, the result is of course totally unpredictable.

But there can still be really awkward situations like this one:
Code: [Select]
// once loaded from memory, both variables will be kept in the same cache line
int dataA;
int dataB;

core0:
dataA= 1234;    // write value back into cache

core1:
int value= dataB;
// cache line of dataB got invalidated because core0 modified
// memory which maps to the same cache line!
// must request core0 to write its cache line back into memory
// wait until memory is available
// read back whole cache line
// wait until data is available in the cache
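
One common way around this situation is to give data that different cores write to its own cache line. A minimal C11 sketch, assuming a 64-byte cache line (typical for current x86 CPUs, but not guaranteed) and with hypothetical struct names:

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 64   /* assumption, check the target CPU */

/* Both members will usually share one cache line: a write to dataA on
   one core invalidates the line another core needs for reading dataB. */
struct SharedLine {
    int dataA;
    int dataB;
};

/* Each member starts on its own cache line, so core0 writing dataA
   no longer disturbs core1 reading dataB. */
struct SeparateLines {
    _Alignas(CACHE_LINE_SIZE) int dataA;
    _Alignas(CACHE_LINE_SIZE) int dataB;
};

static_assert(offsetof(struct SeparateLines, dataB)
              - offsetof(struct SeparateLines, dataA) >= CACHE_LINE_SIZE,
              "dataA and dataB must not share a cache line");
```

The price is memory: the padded struct takes two full cache lines instead of eight bytes, so this only pays off for the few variables that are written often from different threads. Heap allocations of such a struct need an allocator that honors the alignment.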

Offline ninogenio

Re: bumpmapping
« Reply #44 on: May 15, 2013 »
ahh i see,

and does your code wait on the threads finishing before updating the frame, or does it just unlock them and let them run? the way mine is, if i remove the threadwaits i get similar behavior to yours: 100% usage with about 280 fps. yours gets 340fps but that would be down to mmx.

sorry for all the questions, i'm new to all this kind of stuff and it's great to be able to ask people with as much knowledge as yourself about it.

Offline hellfire

Re: bumpmapping
« Reply #45 on: May 15, 2013 »
Quote
does your code wait on the threads finishing before updating the frame or does it just unlock them and let them run

I don't really know how to determine that a frame has finished without waiting for the threads to finish.
At the moment I don't really do anything at all: all the multi-threading is handled automatically by the OpenMP pragma.
It behaves just like running without multiple threads, so the loop finishes as soon as all threads are done.
As OpenMP provides a thread id for every iteration of the parallel loop, I figured it processes one contiguous block with each thread.
That's probably not the best possible solution for all scenarios, but good enough to not think about a better way for now.
If the blocks require very different processing times, it probably makes sense to work at a finer granularity to minimize the time threads spend waiting for each other.
I haven't checked the processing time for each thread yet, but I guess it should be quite constant, as the amount of source data is equal and it is accessed strictly linearly.
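
The setup described above could look roughly like the following. This is a sketch under assumptions, not the actual code: `shade_row` is a hypothetical placeholder for the real bump-map inner loop, and `schedule(static)` is written out explicitly to request the contiguous, roughly equal blocks mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-row worker standing in for the real inner loop.
   It only touches its own row and its own local variables. */
void shade_row(uint32_t *dst, int width, int y)
{
    for (int x = 0; x < width; ++x) {
        dst[x] = (uint32_t)(x + y);   /* placeholder for the shading */
    }
}

void render_frame(uint32_t *buffer, int width, int height)
{
    int y;
    assert(buffer != NULL && width > 0 && height > 0);

    /* Rows are split into contiguous blocks, one block per thread.
       The loop has an implicit barrier at its end, so render_frame
       returns only after all threads are done. */
    #pragma omp parallel for schedule(static)
    for (y = 0; y < height; ++y) {
        shade_row(buffer + (size_t)y * width, width, y);
    }
}
```

Built without OpenMP support the pragma is simply ignored and the loop runs on one thread with the same result. For rows with very uneven cost, `schedule(dynamic, n)` hands out chunks of `n` rows on demand, which is the finer granularity mentioned above.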

Offline ninogenio

Re: bumpmapping
« Reply #46 on: May 15, 2013 »
i think i might be onto something.. i've noticed that the call to ptcupdate( @Buffer(0)) drops my cpu usage by about 50% across all cores. if i comment it out i obviously get no final render, but the cpu usage jumps to about 93% with all cores almost maxed out evenly. i'm going to give setting up a bit of gdi blitting a go rather than ptc, to see if that helps. if not i'll just ditch freebasic, jump back into visual studio and use OpenMP, as i can't afford for my cpu to be so heavily underutilized.
« Last Edit: May 15, 2013 by ninogenio »

Offline ninogenio

Re: bumpmapping
« Reply #47 on: May 16, 2013 »
i've fixed the problem!!  :)

my initial hunch was correct: there is something in ptc ext++ that doesn't behave properly on multi-core systems. it acts a bit like a sleep command, so no matter how many or how few threads you create, cpu usage doesn't go very high.

i've coded two versions in this zip using gdi. the thread stuff works great now. one of the versions uses eight threads, as my i7 has 8 effective cores (4 hardware, 4 virtual); this version gets 270-280fps with 90%-93% usage. the other is a 4 thread one, which gets 200-210fps with 50%-60% usage.

if anyone tries these, could they please tell me usage and fps? it would come in most handy to know that it works as hoped.

-Removed to keep the forum tidy, see below-
« Last Edit: May 16, 2013 by ninogenio »

Offline hellfire

Re: bumpmapping
« Reply #48 on: May 16, 2013 »
Since I don't have "virtual cores" it doesn't seem to make much of a difference:
BumpMap4: around 125fps and 75-85% cpu usage.
BumpMap8: around 130fps and 70-90% cpu usage.

I think there's simply no thread running while ptc transfers the framebuffer over to the window (and it's probably not very fast).
How about running them in parallel with a double buffer?
Code: [Select]
Start threads rendering to buffer0
display buffer1
wait for threads
swap pointers of buffer0 / buffer1
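
In C the four steps above might be sketched as follows. The struct, the callback types and all names are assumptions for illustration; how the threads are started, waited for and how the buffer gets onto the screen is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t *render;    /* the threads draw into this buffer */
    uint32_t *display;   /* this buffer is shown on screen */
} DoubleBuffer;

/* Swap the roles of the two buffers. Only the pointers move, no pixels. */
void swap_buffers(DoubleBuffer *db)
{
    uint32_t *tmp = db->render;
    db->render = db->display;
    db->display = tmp;
}

/* Hypothetical callbacks supplied by the caller. */
typedef void (*BufferFn)(uint32_t *buffer);
typedef void (*WaitFn)(void);

void run_frame(DoubleBuffer *db, BufferFn start_threads,
               BufferFn present, WaitFn wait_threads)
{
    assert(db != NULL && db->render != db->display);
    start_threads(db->render);   /* kick off the workers, return at once */
    present(db->display);        /* blit the previous frame meanwhile */
    wait_threads();              /* all workers must be done before the swap */
    swap_buffers(db);
}
```

The displayed image lags one frame behind the rendered one, which is the usual trade-off for overlapping the blit with the rendering.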

Offline ninogenio

Re: bumpmapping
« Reply #49 on: May 16, 2013 »
thanks very much hellfire.

great little offscreen rendering idea. i tried it with ptc but it doesn't make a difference. i think there is a little more to the ptc issue though, as when i start task manager to check usage with the ptc version, the fps jumps a lot, like 70fps, and stays there. so i think maybe one of the windows calls inside ptc is causing me issues. it might even be a problem that only affects my setup.

i'm fine using gdi, i was just being lazy using ptc. your little double buffer idea gave a nice little step up with it: about 5% more usage, so i'm now sitting at roughly 95%, which is fine for me. i wouldn't like to run at 100% all the time anyway.

i tried to multi-thread portions of the screen with SetDIBitsToDevice but windows didn't like that at all :).. the principle works fine, i can render the full screen with 4 segment calls, but the minute threading gets involved the app instantly gives up..

Offline ninogenio

Re: bumpmapping
« Reply #50 on: May 16, 2013 »
for anyone that might find it useful, i've coded a little sort of engine for this tonight. it dynamically allocates and deallocates threads, and you can see the number of threads in use in the text box. to reduce the thread count press Z, to add threads press X. i've also made the bump maps hold all their own data, so they can now easily be loaded and handed all the 3d info for use in texture mapping objects.. that's my next thing.

this now does 2 bump map images that can be flipped back and forth with the 1 and 2 keys.

so Z and X deallocate and allocate threads, and..
1 and 2 flip back and forth through the textures..

thanks for all your help hellfire, it's been a really great little project and i've enjoyed it loads!! i'm going to keep chipping away at the mmx stuff in my free time.

Offline ninogenio

Re: bumpmapping
« Reply #51 on: December 15, 2013 »
Quote
Nice one, Nino.
The pure 2d bump usually just offsets the light-texture coordinate according to the normal (that's how it's done here).
The light-map was made big enough that you didn't have to care about clipping, so you don't have to recalculate it every frame.
Code: [Select]

for y= ...
  for x= ...
    col= colorMap[y][x]
    normal= normalMap[y][x]

    nx= normal shr 16 and 255
    ny= normal shr 8 and 255

    light= lightMap[y+ny+lightPosY][x+nx+lightPosX]

    dst[y][x]= blend(col, light)

If you really want to work in 3d, you can handle diffuse and reflection separately, though.

^ on first page...

i was just doing my end of year backups and organizing all my things, and while doing so i was randomly running things. while backing this up i remembered your comment, hellfire, about how simply this can be done.

so after a quick half hour i coded this up, and it works and looks pretty neat, much better than expected actually. it is so cheap i got away with 4 lights (white, blue, red and green) and it still blazes along on my machine... well impressed, cheers mate  ;)
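
For reference, the quoted loop might look like this in C. This is a sketch and not the code from the attachment: the flat array layout, the 8-bit light map, the parameter names and the way `blend` scales each color channel by the light value are all assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed blend: scale each 8-bit channel of col by an 8-bit light value. */
uint32_t blend(uint32_t col, uint8_t light)
{
    uint32_t r = ((col >> 16) & 255u) * light / 255u;
    uint32_t g = ((col >> 8) & 255u) * light / 255u;
    uint32_t b = (col & 255u) * light / 255u;
    return (r << 16) | (g << 8) | b;
}

void bump2d(uint32_t *dst, const uint32_t *colorMap, const uint32_t *normalMap,
            int width, int height,
            const uint8_t *lightMap, int lightWidth, int lightHeight,
            int lightPosX, int lightPosY)
{
    /* The light map has to be big enough that no clipping is needed:
       the normal adds an offset of up to 255 in each direction. */
    assert(lightPosX >= 0 && lightPosY >= 0);
    assert(width + 255 + lightPosX <= lightWidth);
    assert(height + 255 + lightPosY <= lightHeight);

    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            size_t i = (size_t)y * width + x;
            uint32_t normal = normalMap[i];
            int nx = (int)((normal >> 16) & 255u);   /* x offset from the normal */
            int ny = (int)((normal >> 8) & 255u);    /* y offset from the normal */
            size_t li = (size_t)(y + ny + lightPosY) * lightWidth
                      + (size_t)(x + nx + lightPosX);
            dst[i] = blend(colorMap[i], lightMap[li]);
        }
    }
}
```

Several lights could be handled by doing one such lookup per light and adding up the results with saturation before writing the pixel.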
« Last Edit: December 15, 2013 by ninogenio »

Offline Hotshot

  • DBF Aficionado
  • ******
  • Posts: 2114
  • Karma: 91
    • View Profile
Re: bumpmapping
« Reply #52 on: December 17, 2013 »
That is the coolest bump mapping I have seen :)


Offline ninogenio

Re: bumpmapping
« Reply #53 on: December 18, 2013 »
thanks hotshot, this whole topic was an excellent learning project that i hope others can enjoy as much as i have.  :cheers: