Dark Bit Factory & Gravity

PROGRAMMING => Freebasic => Topic started by: Emil_halim on April 12, 2007

Title: Fast Normalize vector
Post by: Emil_halim on April 12, 2007
Hi all,

I haven't really used asm for two years or more, though I was a big fan of assembly code.

Anyway, I found this code while browsing the net and ported it to FreeBasic. It normalizes a vector using SSE asm.

The problem I still haven't solved is that it works on a 4-member vector, not a 3-member vector, and I want to apply the code to an array of 3-member vectors.

BTW, you can use it in a software renderer. It is very fast because it operates on 4 single variables at the same time.

Any help, please?

Code:

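' 4-member vector
type Vec4
    as single x,y,z,w
end type

sub SSE_Normalize ( vectors as single ptr)
   asm
      mov     eax, [vectors]     ' load the pointer itself; [vectors] is the parameter slot, not the data
      movups  xmm0, [eax]        ' xmm0 = x, y, z, w
      movaps  xmm2, xmm0         ' keep a copy of the original vector
      mulps   xmm0, xmm0         ' square each component
      movaps  xmm1, xmm0
      shufps  xmm0, xmm1, 0x4e   ' swap the low and high pairs
      addps   xmm0, xmm1
      movaps  xmm1, xmm0
      shufps  xmm1, xmm1, 0x11   ' swap neighbours within each pair
      addps   xmm0, xmm1         ' every lane now holds x*x + y*y + z*z + w*w
      rsqrtps xmm0, xmm0         ' approximate 1 / sqrt in every lane
      mulps   xmm2, xmm0         ' scale the original components
      movups  [eax], xmm2        ' write the normalized vector back
   end asm
end sub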
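
The two shufps/addps pairs in the routine above are a horizontal add: they leave the sum of all four squared components in every lane. As a rough illustration only, here is a scalar model in C (not FreeBasic, and not part of the original post). The helper names shufps_model and horizontal_sum_model are made up for this sketch, and lane 0 is assumed to be the lowest float in the register.

```c
#include <stdint.h>

/* Scalar model of SSE "shufps dst, src, imm8" (lane 0 is the lowest float).
   The two low result lanes are picked from dst, the two high ones from src. */
static void shufps_model(float dst[4], const float src[4], uint8_t imm)
{
    float r[4];
    r[0] = dst[imm & 3];
    r[1] = dst[(imm >> 2) & 3];
    r[2] = src[(imm >> 4) & 3];
    r[3] = src[(imm >> 6) & 3];
    for (int i = 0; i < 4; i++) dst[i] = r[i];
}

/* Replays the shuffle/add sequence of the routine on already squared
   components; afterwards every lane holds the full sum. */
static void horizontal_sum_model(float v[4])
{
    float t[4];
    for (int i = 0; i < 4; i++) t[i] = v[i];    /* movaps xmm1, xmm0        */
    shufps_model(v, t, 0x4e);                    /* shufps xmm0, xmm1, 0x4e  */
    for (int i = 0; i < 4; i++) v[i] += t[i];    /* addps  xmm0, xmm1        */
    for (int i = 0; i < 4; i++) t[i] = v[i];    /* movaps xmm1, xmm0        */
    shufps_model(t, t, 0x11);                    /* shufps xmm1, xmm1, 0x11  */
    for (int i = 0; i < 4; i++) v[i] += t[i];    /* addps  xmm0, xmm1        */
}
```

Feeding it the squares 1, 4, 9, 16 should leave 30 in all four lanes, which is why the later mulps can scale x, y, z and w in one go.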

   
Title: Re: Fast Normalize vector
Post by: Jim on April 12, 2007
The way SSE works, you almost *have* to work on 4 values at once.

The only problems here are the
Code:
movups xmm0,[vectors]
...
movups [vectors],xmm2
because they read in 128bits (4 floats) at a time.

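Either you need to wrap the routine like this:

sub normalise(byref x as single, byref y as single, byref z as single)
    dim quad(0 to 3) as single
    quad(0)=x
    quad(1)=y
    quad(2)=z
    quad(3)=0
    SSE_Normalize(@quad(0))
    ' copy the normalised components back to the caller
    x=quad(0)
    y=quad(1)
    z=quad(2)
end sub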

Alternatively you could pad your 3d vector type with an extra zero value.  If you keep that value at 0, then the calculation will work just fine as it is.  Don't worry about it wasting calculations on the zero, the gains of doing 4 values at once far outweigh the losses.
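
Why the zero padding is harmless can be checked with a few lines of scalar C. This is only a sketch of the idea, not code from the thread: the type and function names (Vec3, Vec4, length_sq3, length_sq4, pad) are invented for the example.

```c
typedef struct { float x, y, z; }    Vec3;
typedef struct { float x, y, z, w; } Vec4;   /* w is the padding lane */

/* Squared length as the 4-wide routine sees it: all four lanes are summed. */
static float length_sq4(Vec4 v) { return v.x*v.x + v.y*v.y + v.z*v.z + v.w*v.w; }

/* Squared length of the 3-component vector the caller actually means. */
static float length_sq3(Vec3 v) { return v.x*v.x + v.y*v.y + v.z*v.z; }

/* Pads a Vec3 into a Vec4 with w = 0. */
static Vec4 pad(Vec3 v) { Vec4 r = { v.x, v.y, v.z, 0.0f }; return r; }
```

With w held at 0 the two squared lengths agree, so the scale factor is the same as for the plain 3D vector. If w ever holds leftover data, the sum (and therefore the normalized result) is wrong.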

Jim
Title: Re: Fast Normalize vector
Post by: Emil_halim on April 12, 2007
Yes, Jim.

If I use the wrapper routine I will lose some speed, and then using SSE is pointless.

Using a Vector4 with zero padding is good, and I was considering it, but I cannot use it: if I render this array of vectors with DirectX, they must be 3-member vectors. That is my actual problem.
Title: Re: Fast Normalize vector
Post by: Jim on April 12, 2007
Nothing you can do about it.  SSE loads 4 words at a time, DirectX requires them to be in groups of 3.  You're going to end up copying them somewhere in the pipeline :-\

Jim
Title: Re: Fast Normalize vector
Post by: Emil_halim on April 12, 2007

OK, thanks for your help, Jim.
Title: Re: Fast Normalize vector
Post by: Emil_halim on April 14, 2007

OK, I have solved the problem by using two arrays of vectors: one for the input vectors, which have 4 members, and the other for the output vectors, which have 3 members. It may be useful for someone else.

Here is the test code:
Code:
/'==================================================='/
'   using fast SSE to normalize a vector
'         
'                  by Emil halim
'
/'==================================================='/


type Vector4
as single x,y,z,w
end type

type Vector3
as single x,y,z
end type

dim as integer i
dim as Vector4 in_vec(0 to 200)
  for i = 0 to 199
       in_vec(i).x = rnd * 10.0
       in_vec(i).y = rnd * 10.0
       in_vec(i).z = rnd * 10.0
       in_vec(i).w = 0.0
  next   

' element 200 is spare: it absorbs the tail of the last 16-byte store
dim as Vector3 out_vec(0 to 200)

   '
   ' normalizing vector
   '====================
   '
   '  Rec_len = 1.0 / sqr((vec1.x*vec1.x) + (vec1.y*vec1.y) + (vec1.z*vec1.z))
   '  vec1.x *= Rec_len
   '  vec1.y *= Rec_len
   '  vec1.z *= Rec_len
   '
   dim as integer in_addr  = @in_vec(0).x
   dim as integer out_addr = @out_vec(0).x
   asm
      mov    eax , [in_addr]
      mov    edx , [out_addr]
      mov    ecx , 200          ' process all 200 vectors (indices 0 to 199)
     .lab:
      movups xmm0, [eax]           
      movaps xmm2, xmm0           
      mulps  xmm0, xmm0           
      movaps xmm1, xmm0  'ddccbbaa
      shufps xmm0, xmm1,0b01001110
      addps  xmm0, xmm1           
      movaps xmm1, xmm0           
      shufps xmm1, xmm1,0b00010001
      addps  xmm0, xmm1           
      rsqrtps xmm0, xmm0           
      mulps  xmm2, xmm0           
      movups [edx], xmm2
      add eax,4*4
      add edx,3*4
      dec ecx
      jnz .lab   
   end asm

  'test the results
   i = 10
   print in_vec(i).x , in_vec(i).y , in_vec(i).z , in_vec(i).w
   print out_vec(i).x , out_vec(i).y , out_vec(i).z
   
   dim as single v_len =  sqr((out_vec(i).x*out_vec(i).x) + (out_vec(i).y*out_vec(i).y) + (out_vec(i).z*out_vec(i).z))
   print v_len  ' should be close to 1.0 (rsqrtps is only an approximation)
   
   Do
   Loop
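
A note on the printed length: rsqrtps returns an approximate reciprocal square root (Intel documents a relative error of at most 1.5 * 2^-12), so the length comes out slightly off 1.0. If more accuracy is ever needed, the usual remedy is one Newton-Raphson step on the estimate. The C function below is a scalar sketch of that step only; refine_rsqrt is a made-up name and nothing like it appears in the code above.

```c
/* One Newton-Raphson step for y ~ 1/sqrt(x):
       y' = y * (1.5 - 0.5 * x * y * y)
   Each step roughly doubles the number of correct bits of the estimate. */
static float refine_rsqrt(float x, float y)
{
    return y * (1.5f - 0.5f * x * y * y);
}
```

For x = 4 and a rough estimate of 0.4999, one step moves the value much closer to the exact 0.5. In SSE this would cost a few extra mulps/subps per vector, so it trades some of the speed gain for accuracy.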

Title: Re: Fast Normalize vector
Post by: Jim on April 15, 2007
Good solution!  Have some Karma!

Jim
Title: Re: Fast Normalize vector
Post by: Emil_halim on April 15, 2007

thanks Jim.  :)
Title: Re: Fast Normalize vector
Post by: Emil_halim on April 15, 2007
I have modified my last code so that you can see the advantage of using SSE asm code for floating-point calculations.

There are now two versions, one in plain FreeBasic and the other in SSE asm.

Please test it and post your results.

On my system
==========

FreeBasic time = 18884 Cycles

SSE  time = 7280 Cycles

here is the code
Code:
/'==================================================='/
'   using fast SSE to normalize a vector
'         
'                  by Emil halim
'
/'==================================================='/


type Vector4
as single x,y,z,w
end type

type Vector3
as single x,y,z
end type

   dim as integer i
   dim as Vector4 in_vec(0 to 200)
   for i = 0 to 199
       in_vec(i).x = rnd * 10.0
       in_vec(i).y = rnd * 10.0
       in_vec(i).z = rnd * 10.0
       in_vec(i).w = 0.0
   next   

   dim as Vector3 out_vec(0 to 200)

   dim as integer Cycles1 , Cycles , save
   
   '
   ' normalizing vector
   '====================
   '
   
   ' FreeBasic code
   asm rdtsc          '  measure of time
   asm mov [save],Eax
   for i = 0 to 199
     dim as single Rec_len = 1.0 / sqr((in_vec(i).x*in_vec(i).x) + (in_vec(i).y*in_vec(i).y) + (in_vec(i).z*in_vec(i).z))
     in_vec(i).x *= Rec_len
     in_vec(i).y *= Rec_len
     in_vec(i).z *= Rec_len
   next
   asm rdtsc          '  measure of time
   asm SUB Eax, [save]
   asm mov [Cycles1],eax


   ' SSE asm code
   for i = 0 to 199
       in_vec(i).x = rnd * 10.0
       in_vec(i).y = rnd * 10.0
       in_vec(i).z = rnd * 10.0
       in_vec(i).w = 0.0
   next   
   dim as integer in_addr  = @in_vec(0).x
   dim as integer out_addr = @out_vec(0).x
   asm
      rdtsc             '  measure of time
      mov [save],Eax         
      mov    esi , [in_addr]
      mov    edi , [out_addr]
      mov    ecx , 200          ' same 200 vectors as the FreeBasic loop
     .lab:
      movups xmm0, [esi]           
      movaps xmm2, xmm0           
      mulps  xmm0, xmm0           
      movaps xmm1, xmm0  'ddccbbaa
      shufps xmm0, xmm1,0b01001110
      addps  xmm0, xmm1           
      movaps xmm1, xmm0           
      shufps xmm1, xmm1,0b00010001
      addps  xmm0, xmm1           
      rsqrtps xmm0, xmm0           
      mulps  xmm2, xmm0           
      movups [edi], xmm2
      add esi,4*4
      add edi,3*4
      dec ecx
      jnz .lab   
      rdtsc             '  measure of time
      SUB Eax, [save]
      mov [Cycles],eax
    end asm

  'test the results
   i = 10
   print in_vec(i).x , in_vec(i).y , in_vec(i).z , in_vec(i).w
   print out_vec(i).x , out_vec(i).y , out_vec(i).z
   
   dim as single v_len =  sqr((out_vec(i).x*out_vec(i).x) + (out_vec(i).y*out_vec(i).y) + (out_vec(i).z*out_vec(i).z))
   print v_len  ' should be close to 1.0 (rsqrtps is only an approximation)
   
   print "SSE       time = " ; Cycles
   print "FreeBasic time = " ; Cycles1
   
   Do
   Loop
Title: Re: Fast Normalize vector
Post by: Jim on April 15, 2007
I'm surprised it's not massively quicker than that - that's what?  2.5 x speed?  Does it make any difference if you get your source vector4s aligned to a 128bit boundary and use movaps xmm0,[esi]?

Try moving the add esi/dec ecx to just after the rsqrtps.  It's obvious that's a slow instruction and it might help to overlap it with something!  If you start edi off with out_addr-12, you can put the add edi in there too if it makes any difference.

Also, you have to watch when you store the 200th vector3 result.  The movups [edi],xmm2 will store four 32 bit values into the vector3 you have allocated.  You need to make sure there's an extra vector3 on the end to take care of that.
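
To make the overhang concrete, here is a small C model of the store pattern (16-byte writes on a 12-byte stride). It is a sketch, not the thread's code: store_packed is a hypothetical helper, and memcpy stands in for movups.

```c
#include <stddef.h>
#include <string.h>

/* Writes n 4-float results into a packed 3-float array the way
   "movups [edi], xmm2 / add edi, 12" does. Each store is 16 bytes wide,
   so the 4th float lands on the next element's x and is overwritten by
   the following store. Returns the number of floats touched in out. */
static size_t store_packed(const float *in, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        memcpy(out + 3 * i, in + 4 * i, 4 * sizeof(float));   /* 16-byte store, 12-byte stride */
    return n ? 3 * n + 1 : 0;
}
```

For n vectors the routine touches 3n + 1 floats, so the output buffer needs at least 4 spare bytes after the last vector3. Rounding that up to one whole extra vector3 is the simple way to guarantee it.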

Question - in freebasic is 1.0 a single or a double?  Does Sqr take/return single or double?  I'm guessing all single?

Jim
 
Title: Re: Fast Normalize vector
Post by: Stonemonkey on April 15, 2007
afaik fb stores const floats as double and Sqr takes/returns singles or doubles:

fld dword ptr [ebp-8]
fsqrt
fstp dword ptr [ebp-8]

or

fld qword ptr [ebp-12]
fsqrt
fstp qword ptr [ebp-12]

or whatever way round you want.

Fryer.
Title: Re: Fast Normalize vector
Post by: Emil_halim on April 15, 2007
Of course, aligning the memory to 16 bytes (i.e. using __declspec(align(16)) in C++) will make it faster, but I don't know how to do that in FreeBasic. Besides, I really don't want to, because it wastes memory with a big mesh, and I need the output vector array to be vec3 so I can send it to DirectX.

Anyway, I will try the prefetchnta instruction to get the data into cache, and process TWO VECTORS at the same time.
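
For reference, the alignment trade-off discussed above looks roughly like this in standard C11. This is an illustration only and says nothing about how to do it in FreeBasic; Vec4A and is_aligned16 are names invented for the sketch.

```c
#include <stdalign.h>
#include <stdint.h>

typedef struct { float x, y, z; } Vec3;                        /* 12 bytes, packed */
typedef struct { alignas(16) float x; float y, z, w; } Vec4A;  /* 16 bytes, 16-byte aligned */

/* Returns 1 when p sits on a 16-byte boundary, which movaps requires. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}
```

Every element of a Vec4A array starts on a 16-byte boundary, so aligned loads are safe on the input side. The cost is 16 bytes per vector instead of 12, which is the memory overhead mentioned above. A packed Vec3 output array cannot keep that alignment for every element, so its stores have to stay unaligned.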
Title: Re: Fast Normalize vector
Post by: Emil_halim on April 15, 2007

OK, here I process 2 vectors at the same time and gain some speed.

Now the SSE time is 6492.

But using prefetchnta reduced the speed, so I commented it out.

Here is the code:
Code:
   asm
      rdtsc             '  measure of time
      mov [save],Eax         
      mov    esi , [in_addr]
      mov    edi , [out_addr]
      mov    ecx , 200          ' 200 vectors, halved below because each pass handles two
      shr    ecx , 1
     .lab:
      movups xmm0, [esi] 
      movups xmm3, [esi+16] 
      'prefetchnta  [esi+3*16]
      movaps xmm2, xmm0     
      movaps xmm5, xmm3       
      mulps  xmm0, xmm0
      mulps  xmm3, xmm3           
      movaps xmm1, xmm0  'ddccbbaa
      movaps xmm4, xmm3
      shufps xmm0, xmm1,0b01001110
      shufps xmm3, xmm4,0b01001110
      addps  xmm0, xmm1   
      addps  xmm3, xmm4
      movaps xmm1, xmm0
      movaps xmm4, xmm3
      shufps xmm1, xmm1,0b00010001
      shufps xmm4, xmm4,0b00010001
      addps  xmm0, xmm1   
      addps  xmm3, xmm4
      rsqrtps xmm0, xmm0   
      rsqrtps xmm3, xmm3
      mulps  xmm2, xmm0   
      mulps  xmm5, xmm3
      movups [edi], xmm2
      movups [edi+12], xmm5
      'prefetchnta   [edi+3*12]
      add esi,4*4*2
      add edi,3*4*2
      dec ecx
      jnz .lab   
      rdtsc             '  measure of time
      sub Eax, [save]
      mov [Cycles],eax
   end asm
Title: Re: Fast Normalize vector
Post by: Jim on April 15, 2007
There's still a 4 cycle 'hole' after each rsqrt and a 6 cycle hole after each mulps, so you might want to re-order the instructions to squeeze a little more out of it.
Might be worth asking on comp.lang.asm.x86 if anyone has any ideas for making it quicker?

Jim
Title: Re: Fast Normalize vector
Post by: Emil_halim on April 16, 2007

thanks Jim , i will ask there.
Title: Re: Fast Normalize vector
Post by: Paul on April 16, 2007
The first code generated this for me:
 2.655853      0.7265174     1.123683      0
 0.893011      0.2442862     0.3778302
 0.9999501
SSE       time =  4664
FreeBasic time =  16103

Does this mean it was quicker on my pc???
I'm not into this ASM stuff at all so I have no idea.
Title: Re: Fast Normalize vector
Post by: Emil_halim on April 16, 2007

Sorry Paul, I don't know what the word 'wuicker' means.

Did you test the last one?

Thanks for testing.
Title: Re: Fast Normalize vector
Post by: Jim on April 16, 2007
'quicker'.  It looks like it was a bit quicker on Paul's PC.  Perhaps the difference between HT and P4D or Core2 or AMD?

Jim
Title: Re: Fast Normalize vector
Post by: Emil_halim on April 17, 2007

OK, I see now, and yes, perhaps it is the speed of the CPU in Paul's PC.

Anyway, Paul's results indicate that it is about 3.5 times faster when using SSE, and this is good news.  :)

If anyone else can test it and post the results, please do!
Title: Re: Fast Normalize vector
Post by: Paul on April 17, 2007
yes, i meant the speed difference between sse and freebasic was greater on my machine.
sorry for being so unclear

Edit: only tried the first prog, how do i run the second one?
Title: Re: Fast Normalize vector
Post by: Emil_halim on April 17, 2007

here it is Paul

Code:
/'==================================================='/
'   using fast SSE to normalize a vector
'         
'                  by Emil halim
'
/'==================================================='/


type Vector4
as single x,y,z,w
end type

type Vector3
as single x,y,z
end type

   dim as integer i
   dim as Vector4 in_vec(0 to 200)
   for i = 0 to 199
       in_vec(i).x = rnd * 10.0
       in_vec(i).y = rnd * 10.0
       in_vec(i).z = rnd * 10.0
       in_vec(i).w = 0.0
   next   

   dim as Vector3 out_vec(0 to 200)

   dim as integer Cycles1 , Cycles , save
   
   '
   ' normalizing vector
   '====================
   '
   
   ' FreeBasic code
   asm rdtsc          '  measure of time
   asm mov [save],Eax
   for i = 0 to 199
     dim as single Rec_len = 1.0 / sqr((in_vec(i).x*in_vec(i).x) + (in_vec(i).y*in_vec(i).y) + (in_vec(i).z*in_vec(i).z))
     in_vec(i).x *= Rec_len
     in_vec(i).y *= Rec_len
     in_vec(i).z *= Rec_len
   next
   asm rdtsc          '  measure of time
   asm SUB Eax, [save]
   asm mov [Cycles1],eax


   ' SSE asm code
   for i = 0 to 199
       in_vec(i).x = rnd * 10.0
       in_vec(i).y = rnd * 10.0
       in_vec(i).z = rnd * 10.0
       in_vec(i).w = 0.0
   next   
   dim as integer in_addr  = @in_vec(0).x
   dim as integer out_addr = @out_vec(0).x
   asm
      rdtsc             '  measure of time
      mov [save],Eax         
      mov    esi , [in_addr]
      mov    edi , [out_addr]
      mov    ecx , 200          ' 200 vectors, halved below because each pass handles two
      shr    ecx , 1
     .lab:
      movups xmm0, [esi] 
      movups xmm3, [esi+16] 
      'prefetchnta  [esi+3*16]
      movaps xmm2, xmm0     
      movaps xmm5, xmm3       
      mulps  xmm0, xmm0
      mulps  xmm3, xmm3           
      movaps xmm1, xmm0  'ddccbbaa
      movaps xmm4, xmm3
      shufps xmm0, xmm1,0b01001110
      shufps xmm3, xmm4,0b01001110
      addps  xmm0, xmm1   
      addps  xmm3, xmm4
      movaps xmm1, xmm0
      movaps xmm4, xmm3
      shufps xmm1, xmm1,0b00010001
      shufps xmm4, xmm4,0b00010001
      addps  xmm0, xmm1   
      addps  xmm3, xmm4
      rsqrtps xmm0, xmm0   
      rsqrtps xmm3, xmm3
      mulps  xmm2, xmm0   
      mulps  xmm5, xmm3
      movups [edi], xmm2
      movups [edi+12], xmm5
      'prefetchnta   [edi+3*12]
      add esi,4*4*2
      add edi,3*4*2
      dec ecx
      jnz .lab   
      rdtsc             '  measure of time
      sub Eax, [save]
      mov [Cycles],eax
   end asm

  'test the results
   i = 10
   print in_vec(i).x , in_vec(i).y , in_vec(i).z , in_vec(i).w
   print out_vec(i).x , out_vec(i).y , out_vec(i).z
   
   dim as single v_len =  sqr((out_vec(i).x*out_vec(i).x) + (out_vec(i).y*out_vec(i).y) + (out_vec(i).z*out_vec(i).z))
   print v_len  ' should be close to 1.0 (rsqrtps is only an approximation)
   
   print "SSE       time = " ; Cycles
   print "FreeBasic time = " ; Cycles1
   
   Do
   Loop

Title: Re: Fast Normalize vector
Post by: Paul on April 17, 2007
Even better result :D

 2.655853      0.7265174     1.123683      0
 0.893011      0.2442862     0.3778302
 0.9999501
SSE       time =  3851
FreeBasic time =  16128

thats 4.1880031160737470786808621137367* faster :)
Title: Re: Fast Normalize vector
Post by: Emil_halim on April 17, 2007

great news.  :)
Title: Re: Fast Normalize vector
Post by: Jim on April 21, 2007
Not much luck with c.l.a.x86 :(
I tried the code on mine and got about 3x speed.  It's interesting to run it a few times and see the different results.  When I made it do 20000 instead of 200 I get something more reliable.
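
One common way to steady numbers like these is to time the same work several times and keep the best run. The C sketch below shows only that measurement pattern. It uses the standard clock() call rather than rdtsc, and best_of and count_call are hypothetical names, with count_call standing in for the real workload.

```c
#include <time.h>

/* Stand-in workload for the sketch: just counts how often it was called. */
static void count_call(void *p) { ++*(int *)p; }

/* Runs work(arg) `runs` times and returns the shortest elapsed clock() ticks.
   Taking the minimum discards runs disturbed by interrupts or cold caches. */
static clock_t best_of(void (*work)(void *), void *arg, int runs)
{
    clock_t best = 0;
    for (int r = 0; r < runs; r++) {
        clock_t t0 = clock();
        work(arg);
        clock_t dt = clock() - t0;
        if (r == 0 || dt < best) best = dt;
    }
    return best;
}
```

clock() is far coarser than a cycle counter, so the workload has to be large (for example 20000 vectors rather than 200) before the result means much.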

Jim
Title: Re: Fast Normalize vector
Post by: Emil_halim on April 21, 2007

No luck with comp.lang.asm.  :(

I posted there but got no answer.

I think the effect of SSE becomes noticeable when you do a huge number of calculations, just as in your test.

So I consider that good news too. :)   
Title: Re: Fast Normalize vector
Post by: Dr_D on April 27, 2007
Man, this makes me want to learn ASM really, really bad. Good job!  :clap:

SSE       time =  3933
FreeBasic time =  47493

Title: Re: Fast Normalize vector
Post by: Emil_halim on April 28, 2007
that is great news too.  :)

Edited:

It is 12 times faster!

Did you run it more than once and take the average, or what?

What is the configuration of your system?
Title: Re: Fast Normalize vector
Post by: Dr_D on April 28, 2007
Well, I just ran it a few times and just posted the results from the last one. They were all very good though. ;)

I'm running xp sp2 with a 2gig sempron and 512mb ddr sdram.
Title: Re: Fast Normalize vector
Post by: Emil_halim on April 28, 2007
OK, that's really nice. I think AMD will beat Intel in the next few months.
Title: Re: Fast Normalize vector
Post by: Dr_D on April 28, 2007
i made a little demo of a skeletal animation a while back. Do you mind if I stick this code in there to test it with a real application?  :cheers:
Title: Re: Fast Normalize vector
Post by: Emil_halim on April 28, 2007

Sure man , use it as you want.  :)

If I have some time I will do a matrix multiply with SSE too, and of course you can use that as well.
Title: Re: Fast Normalize vector
Post by: Dr_D on April 28, 2007
Man, that would be awesome! :D I really need to take a class on asm programming. :-\
Title: Re: Fast Normalize vector
Post by: Shockwave on April 28, 2007
"I really need to take a class on asm programming. :-\"

We have an asm forum here so we can deal with your questions :)