Author Topic: Fast Normalize vector  (Read 14783 times)

Offline Emil_halim

  • Atari ST
  • ***
  • Posts: 248
  • Karma: 21
    • View Profile
    • OgreMagic Library
Fast Normalize vector
« on: April 12, 2007 »
Hi all,

I have not really used asm for 2 years or more, although I was a big fan of assembly code.

Anyway, I found this code while browsing the net and ported it to FreeBasic. It normalizes a vector using SSE asm.

The problem I still have not solved is that it works with a 4-member vector, not a 3-member vector, and I want to apply the code to an array of 3-member vectors.

BTW, you can use it in a software renderer. It is very fast because it operates on 4 single variables at the same time.

Any help, please?

Code: [Select]

' 4 members vector
type Vec4
    as single x,y,z,w
end type

sub SSE_Normalize ( vectors as single ptr)
   asm
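mov eax, [vectors]      ' [vectors] is the pointer variable itself, so load the address first
movups xmm0, [eax]      ' xmm0 = (x, y, z, w)
movaps xmm2, xmm0       ' keep a copy of the original vector
mulps xmm0, xmm0        ' square each component
movaps xmm1, xmm0
shufps xmm0, xmm1, 0x4e ' swap the low and high pairs
addps xmm0, xmm1
movaps xmm1, xmm0
shufps xmm1, xmm1, 0x11 ' swap within each pair
addps xmm0,   xmm1      ' every lane now holds x*x + y*y + z*z + w*w
rsqrtps xmm0, xmm0      ' approximate 1 / sqrt
mulps xmm2, xmm0        ' scale the original vector
movups [eax], xmm2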
end asm

end sub
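
The two shuffle constants are the least obvious part of the routine. As a rough illustration (not part of the original post), the C sketch below models `shufps` with plain arrays and replays the same multiply, shuffle, add sequence. The names `shufps_model` and `horizontal_sum_of_squares` are hypothetical helpers. The point it tries to show is that after the second `addps` every lane holds x*x + y*y + z*z + w*w, which is why one final `mulps` can scale all components at once.

```c
#include <assert.h>

/* Scalar model of SHUFPS dst, src, imm8: the two low result lanes are picked
   from dst, the two high result lanes from src, two selector bits per lane. */
static void shufps_model(float dst[4], const float src[4], unsigned imm8)
{
    float r[4];
    int i;
    assert(imm8 <= 0xFFu);
    r[0] = dst[imm8 & 3u];
    r[1] = dst[(imm8 >> 2) & 3u];
    r[2] = src[(imm8 >> 4) & 3u];
    r[3] = src[(imm8 >> 6) & 3u];
    for (i = 0; i < 4; ++i) dst[i] = r[i];
}

/* Mirrors mulps / shufps 0x4e / addps / shufps 0x11 / addps.
   On return every lane of v holds the sum of the four squared inputs. */
static void horizontal_sum_of_squares(float v[4])
{
    float t[4];
    int i;
    for (i = 0; i < 4; ++i) v[i] *= v[i];   /* mulps  xmm0, xmm0 */
    for (i = 0; i < 4; ++i) t[i] = v[i];    /* movaps xmm1, xmm0 */
    shufps_model(v, t, 0x4e);               /* lanes become (c, d, a, b) */
    for (i = 0; i < 4; ++i) v[i] += t[i];   /* addps  xmm0, xmm1 */
    for (i = 0; i < 4; ++i) t[i] = v[i];    /* movaps xmm1, xmm0 */
    shufps_model(t, t, 0x11);               /* swap neighbours in each pair */
    for (i = 0; i < 4; ++i) v[i] += t[i];   /* addps  xmm0, xmm1 */
}
```

For the input (1, 2, 3, 0) all four lanes end up as 14. Because w*w is part of the sum, w has to be 0 when the data is really a 3D vector.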

   

Offline Jim

  • Founder Member
  • DBF Aficionado
  • ********
  • Posts: 5301
  • Karma: 402
    • View Profile
Re: Fast Normalize vector
« Reply #1 on: April 12, 2007 »
The way SSE works, you almost *have* to work on 4 values at once.

The only problems here are the
Code: [Select]
movups xmm0,[vectors]
...
movups [vectors],xmm2
because they read in 128bits (4 floats) at a time.

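Either you wrap the routine like this

sub normalise(byref x as single, byref y as single, byref z as single)
dim quad(0 to 3) as single
quad(0)=x
quad(1)=y
quad(2)=z
quad(3)=0   ' the padding must be zero so it adds nothing to the length
SSE_Normalize(@quad(0))
x=quad(0)   ' copy the normalised components back to the caller
y=quad(1)
z=quad(2)
end sub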

Alternatively you could pad your 3d vector type with an extra zero value.  If you keep that value at 0, then the calculation will work just fine as it is.  Don't worry about it wasting calculations on the zero, the gains of doing 4 values at once far outweigh the losses.

Jim
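
Why the padding value has to stay at zero can be seen from a small C sketch. It models the arithmetic only, not the SSE code, and the names `Vec4`, `pad_vec3` and `length_squared4` are assumptions made for illustration. The routine sums all four squared lanes, so anything left in w changes the length that the vector gets divided by.

```c
#include <assert.h>

typedef struct { float x, y, z, w; } Vec4;   /* hypothetical padded layout */

/* Build a padded vector from three components, with w forced to zero. */
static Vec4 pad_vec3(float x, float y, float z)
{
    Vec4 v;
    /* The layout must match one 16-byte SSE load. */
    assert(sizeof(Vec4) == 4 * sizeof(float));
    v.x = x; v.y = y; v.z = z; v.w = 0.0f;
    return v;
}

/* What the SSE routine effectively computes before the reciprocal square root. */
static float length_squared4(Vec4 v)
{
    return v.x * v.x + v.y * v.y + v.z * v.z + v.w * v.w;
}
```

With (3, 4, 12) the padded squared length is 169, the same as for the plain 3D vector. Setting w to 1 would turn it into 170 and the "normalized" result would come out slightly too short.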

Offline Emil_halim

« Reply #2 on: April 12, 2007 »
Yes, Jim.

If I use the wrapper routine I will lose some speed, and then using SSE is pointless.

Using a vector4 with zero padding is a good idea and I was thinking of it, but I cannot use it: if I render this array of vectors with DirectX, they must be 3-member vectors. That is my actual problem.

Offline Jim

« Reply #3 on: April 12, 2007 »
Nothing you can do about it.  SSE loads 4 words at a time, DirectX requires them to be in groups of 3.  You're going to end up copying them somewhere in the pipeline :-\

Jim

Offline Emil_halim

« Reply #4 on: April 12, 2007 »

OK, thanks for your help, Jim.

Offline Emil_halim

« Reply #5 on: April 14, 2007 »

OK, I have solved that problem by using 2 arrays of vectors: one for the in_vectors, which have 4 members, and the other for the out_vectors, which have 3 members. It may be useful for someone else.

Here is some test code
Code: [Select]
/'==================================================='/
'   using fast SSE to normalize a vector
'         
'                  by Emil halim
'
/'==================================================='/


type Verctor4
as single x,y,z,w
end type

type Verctor3
as single x,y,z
end type

dim as integer i
dim as Verctor4 in_vec(0 to 200)
  for i = 0 to 199
       in_vec(i).x = rnd * 10.0
       in_vec(i).y = rnd * 10.0
       in_vec(i).z = rnd * 10.0
       in_vec(i).w = 0.0
  next   

dim as Verctor3 out_vec(0 to 200)

   '
   ' normalizing vector
   '====================
   '
   '  Rec_len = 1.0 / sqr((vec1.x*vec1.x) + (vec1.y*vec1.y) + (vec1.z*vec1.z))
   '  vec1.x *= Rec_len
   '  vec1.y *= Rec_len
   '  vec1.z *= Rec_len
   '
   dim as integer in_addr  = @in_vec(0).x
   dim as integer out_addr = @out_vec(0).x
   asm
      mov    eax , [in_addr]
      mov    edx , [out_addr]
      mov    ecx , 200          ' 200 vectors, indices 0 to 199
     .lab:
      movups xmm0, [eax]           
      movaps xmm2, xmm0           
      mulps  xmm0, xmm0           
      movaps xmm1, xmm0  'ddccbbaa
      shufps xmm0, xmm1,0b01001110
      addps  xmm0, xmm1           
      movaps xmm1, xmm0           
      shufps xmm1, xmm1,0b00010001
      addps  xmm0, xmm1           
      rsqrtps xmm0, xmm0           
      mulps  xmm2, xmm0           
      movups [edx], xmm2
      add eax,4*4
      add edx,3*4
      dec ecx
      jnz .lab   
   end asm

  'test the results
   i = 10
   print in_vec(i).x , in_vec(i).y , in_vec(i).z , in_vec(i).w
   print out_vec(i).x , out_vec(i).y , out_vec(i).z
   
   dim as single v_len =  sqr((out_vec(i).x*out_vec(i).x) + (out_vec(i).y*out_vec(i).y) + (out_vec(i).z*out_vec(i).z))
   print v_len  ' should be close to 1.0 (rsqrtps is only an approximation)
   
   Do
   Loop


Offline Jim

« Reply #6 on: April 15, 2007 »
Good solution!  Have some Karma!

Jim

Offline Emil_halim

« Reply #7 on: April 15, 2007 »

thanks Jim.  :)

Offline Emil_halim

« Reply #8 on: April 15, 2007 »
I have modified my last code so that you can see the advantage of using SSE asm code for floating-point calculation.

There are now 2 versions, one in plain FreeBasic and the other in SSE asm.

Please test it and post your results.

On my system
============

FreeBasic time = 18884 cycles

SSE time = 7280 cycles

Here is the code
Code: [Select]
/'==================================================='/
'   using fast SSE to normalize a vector
'         
'                  by Emil halim
'
/'==================================================='/


type Vector4
as single x,y,z,w
end type

type Vector3
as single x,y,z
end type

   dim as integer i
   dim as Vector4 in_vec(0 to 200)
   for i = 0 to 199
       in_vec(i).x = rnd * 10.0
       in_vec(i).y = rnd * 10.0
       in_vec(i).z = rnd * 10.0
       in_vec(i).w = 0.0
   next   

   dim as Vector3 out_vec(0 to 200)

   dim as integer Cycles1 , Cycles , save
   
   '
   ' normalizing vector
   '====================
   '
   
   ' FreeBasic code
   asm rdtsc          '  measure of time
   asm mov [save],Eax
   for i = 0 to 199
     dim as single Rec_len = 1.0 / sqr((in_vec(i).x*in_vec(i).x) + (in_vec(i).y*in_vec(i).y) + (in_vec(i).z*in_vec(i).z))
     in_vec(i).x *= Rec_len
     in_vec(i).y *= Rec_len
     in_vec(i).z *= Rec_len
   next
   asm rdtsc          '  measure of time
   asm SUB Eax, [save]
   asm mov [Cycles1],eax


   ' SSE asm code
   for i = 0 to 199
       in_vec(i).x = rnd * 10.0
       in_vec(i).y = rnd * 10.0
       in_vec(i).z = rnd * 10.0
       in_vec(i).w = 0.0
   next
   dim as integer in_addr  = @in_vec(0).x
   dim as integer out_addr = @out_vec(0).x
   asm
      rdtsc             '  measure of time
      mov [save],Eax         
      mov    esi , [in_addr]
      mov    edi , [out_addr]
      mov    ecx , 200          ' same 200 vectors as the FreeBasic loop
     .lab:
      movups xmm0, [esi]           
      movaps xmm2, xmm0           
      mulps  xmm0, xmm0           
      movaps xmm1, xmm0  'ddccbbaa
      shufps xmm0, xmm1,0b01001110
      addps  xmm0, xmm1           
      movaps xmm1, xmm0           
      shufps xmm1, xmm1,0b00010001
      addps  xmm0, xmm1           
      rsqrtps xmm0, xmm0           
      mulps  xmm2, xmm0           
      movups [edi], xmm2
      add esi,4*4
      add edi,3*4
      dec ecx
      jnz .lab   
      rdtsc             '  measure of time
      SUB Eax, [save]
      mov [Cycles],eax
    end asm

  'test the results
   i = 10
   print in_vec(i).x , in_vec(i).y , in_vec(i).z , in_vec(i).w
   print out_vec(i).x , out_vec(i).y , out_vec(i).z
   
   dim as single v_len =  sqr((out_vec(i).x*out_vec(i).x) + (out_vec(i).y*out_vec(i).y) + (out_vec(i).z*out_vec(i).z))
   print v_len  ' should be close to 1.0 (rsqrtps is only an approximation)
   
   print "SSE       time = " ; Cycles
   print "FreeBasic time = " ; Cycles1
   
   Do
   Loop
« Last Edit: April 15, 2007 by Emil_halim »

Offline Jim

« Reply #9 on: April 15, 2007 »
I'm surprised it's not massively quicker than that - that's what?  2.5 x speed?  Does it make any difference if you get your source vector4s aligned to a 128bit boundary and use movaps xmm0,[esi]?

Try moving the add esi/dec ecx to just after the rsqrtps.  It's obvious that's a slow instruction and it might help to overlap it with something!  If you start edi off with out_addr-12, you can put the add edi in there too if it makes any difference.

Also, you have to watch when you store the 200th vector3 result.  The movups [edi],xmm2 will store four 32 bit values into the vector3 you have allocated.  You need to make sure there's an extra vector3 on the end to take care of that.

Question - in freebasic is 1.0 a single or a double?  Does Sqr take/return single or double?  I'm guessing all single?

Jim
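
The warning about the last store comes down to simple byte arithmetic. The C sketch below is only an illustration under the thread's layout (12-byte Vector3 results, each written with one 16-byte `movups`); `out_bytes_needed` and `out_elements_needed` are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { VEC3_STRIDE = 12, SSE_STORE_BYTES = 16 };

/* Bytes the output buffer must provide when `count` results are written
   with 16-byte stores placed 12 bytes apart. */
static size_t out_bytes_needed(size_t count)
{
    if (count == 0) return 0;
    /* Guard against overflow in the multiplication below. */
    assert(count <= (SIZE_MAX - SSE_STORE_BYTES) / VEC3_STRIDE);
    return (count - 1) * VEC3_STRIDE + SSE_STORE_BYTES;
}

/* Number of Vector3 elements to allocate so the last store stays in bounds. */
static size_t out_elements_needed(size_t count)
{
    size_t bytes = out_bytes_needed(count);
    return (bytes + VEC3_STRIDE - 1) / VEC3_STRIDE;   /* round up */
}
```

For 200 results this gives 2404 bytes, which rounds up to 201 Vector3 elements. An array dimensioned `0 to 200` provides exactly that one spare element.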
 

Offline Stonemonkey

  • Pentium
  • *****
  • Posts: 1315
  • Karma: 96
    • View Profile
« Reply #10 on: April 15, 2007 »
afaik fb stores const floats as double and Sqr takes/returns singles or doubles:

fld dword ptr [ebp-8]
fsqrt
fstp dword ptr [ebp-8]

or

fld qword ptr [ebp-12]
fsqrt
fstp qword ptr [ebp-12]

or whatever way round you want.

Fryer.

Offline Emil_halim

« Reply #11 on: April 15, 2007 »
Of course, aligning the memory to 16 bytes, i.e. using __declspec(align(16)) in C++, will make it faster, but I do not know how to do that in FreeBasic. Besides, I do not really want to do it because it wastes memory with a big mesh, and I need the out_vectors array to be vec3 so I can send it to DirectX.

Anyway, I will try the prefetchnta instruction to get the data into cache memory, and process TWO VECTORS at the same time.
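
On the alignment point: when a language offers no alignment attribute, a common workaround is to over-allocate by 15 bytes and round the start address up to the next multiple of 16. The C sketch below shows only that arithmetic; `align_up16` and `is_aligned16` are hypothetical names, and the assumption is that the same integer math could be applied to a pointer value in FreeBasic.

```c
#include <assert.h>
#include <stdint.h>

/* Returns nonzero if p sits on a 16-byte boundary. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Round p up to the next 16-byte boundary (returns p if already aligned).
   The caller must have allocated at least 15 spare bytes. */
static void *align_up16(void *p)
{
    uintptr_t a = ((uintptr_t)p + 15u) & ~(uintptr_t)15u;
    assert((a & 15u) == 0);
    return (void *)a;
}
```

Since a 4-float vector is exactly 16 bytes, every element of the input array is aligned once the first one is, so `movaps` could replace `movups` for the loads. The cost is at most 15 bytes per array. The 12-byte output stride still needs unaligned stores.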

Offline Emil_halim

« Reply #12 on: April 15, 2007 »

OK, here I process 2 vectors at the same time and gain some speed.

Now SSE time = 6492 cycles

Using prefetchnta decreased the speed, so I commented it out.

Here is the code
Code: [Select]
   asm
      rdtsc             '  measure of time
      mov [save],Eax         
      mov    esi , [in_addr]
      mov    edi , [out_addr]
      mov    ecx , 200          ' 200 vectors
      shr    ecx , 1            ' two vectors per iteration
     .lab:
      movups xmm0, [esi] 
      movups xmm3, [esi+16] 
      'prefetchnta  [esi+3*16]
      movaps xmm2, xmm0     
      movaps xmm5, xmm3       
      mulps  xmm0, xmm0
      mulps  xmm3, xmm3           
      movaps xmm1, xmm0  'ddccbbaa
      movaps xmm4, xmm3
      shufps xmm0, xmm1,0b01001110
      shufps xmm3, xmm4,0b01001110
      addps  xmm0, xmm1   
      addps  xmm3, xmm4
      movaps xmm1, xmm0
      movaps xmm4, xmm3
      shufps xmm1, xmm1,0b00010001
      shufps xmm4, xmm4,0b00010001
      addps  xmm0, xmm1   
      addps  xmm3, xmm4
      rsqrtps xmm0, xmm0   
      rsqrtps xmm3, xmm3
      mulps  xmm2, xmm0   
      mulps  xmm5, xmm3
      movups [edi], xmm2
      movups [edi+12], xmm5
      'prefetchnta   [edi+3*12]
      add esi,4*4*2
      add edi,3*4*2
      dec ecx
      jnz .lab   
      rdtsc             '  measure of time
      sub Eax, [save]
      mov [Cycles],eax
   end asm
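
One thing to watch when unrolling by two: `shr ecx, 1` silently drops the low bit, so an odd vector count leaves one vector unprocessed. A minimal C sketch of the bookkeeping follows; `PairSplit` and `split_into_pairs` are hypothetical names, not anything from the posted code.

```c
#include <assert.h>

typedef struct { unsigned pairs; unsigned leftover; } PairSplit;

/* Split a vector count into two-at-a-time iterations plus an optional tail. */
static PairSplit split_into_pairs(unsigned count)
{
    PairSplit s;
    s.pairs = count >> 1;       /* what shr ecx, 1 computes */
    s.leftover = count & 1u;    /* the bit that shr discards */
    assert(s.pairs * 2u + s.leftover == count);
    return s;
}
```

A count of 199 gives 99 pairs and 1 leftover vector, which would need a single-vector pass after the loop. If `pairs` is 0 the `dec ecx` / `jnz` loop has to be skipped entirely, because decrementing 0 wraps around instead of stopping.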

Offline Jim

« Reply #13 on: April 15, 2007 »
There's still a 4 cycle 'hole' after each rsqrtps and a 6 cycle hole after each mulps, so you might want to re-order the instructions to squeeze a little more out of it.
Might be worth asking on comp.lang.asm.x86 if anyone has any ideas for making it quicker?

Jim

Offline Emil_halim

« Reply #14 on: April 16, 2007 »

Thanks Jim, I will ask there.

Offline Paul

  • Pentium
  • *****
  • Posts: 1490
  • Karma: 47
    • View Profile
« Reply #15 on: April 16, 2007 »
The first code generated this for me:
 2.655853      0.7265174     1.123683      0
 0.893011      0.2442862     0.3778302
 0.9999501
SSE       time =  4664
FreeBasic time =  16103

Does this mean it was quicker on my pc???
I'm not into this ASM stuff at all, so I have no idea.
« Last Edit: April 17, 2007 by Paul »
I will bite you - http://s5.bitefight.se/c.php?uid=31059
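
The printed length of 0.9999501 rather than 1.0 is expected: `rsqrtps` returns only an approximation of 1/sqrt(x), with a documented relative error of at most 1.5 * 2^-12. If more accuracy is needed, a common remedy is one Newton-Raphson step on the estimate. Below is a minimal scalar C sketch of that step, with `rsqrt_refine` as a hypothetical name; in SSE it would cost a few extra multiplies and a subtract per vector, so it trades some of the speed gain for precision.

```c
#include <assert.h>

/* One Newton-Raphson step for y ~ 1/sqrt(x):
   y_next = y * (1.5 - 0.5 * x * y * y). */
static float rsqrt_refine(float x, float y)
{
    assert(x > 0.0f);
    return y * (1.5f - 0.5f * x * y * y);
}
```

Starting from a rough estimate of 0.49 for 1/sqrt(4), one step moves the value to within about 0.0003 of the exact 0.5.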

Offline Emil_halim

« Reply #16 on: April 16, 2007 »

Sorry Paul, I do not know what the word 'wuicker' means.

Did you test the last one?

Thanks for testing.

Offline Jim

« Reply #17 on: April 16, 2007 »
'quicker'.  It looks like it was a bit quicker on Paul's PC.  Perhaps the difference between HT and P4D or Core2 or AMD?

Jim

Offline Emil_halim

« Reply #18 on: April 17, 2007 »

OK, I see now, and yes, perhaps it is the speed of the CPU in Paul's PC.

Anyway, Paul's results indicate that it is roughly 3.5 times faster when using SSE (16103 / 4664 cycles), and this is good news.  :)

If anyone else can test it, please post your results!

Offline Paul

« Reply #19 on: April 17, 2007 »
Yes, I meant that the speed difference between SSE and FreeBasic was greater on my machine.
Sorry for being so unclear.

Edit: I only tried the first program. How do I run the second one?
« Last Edit: April 17, 2007 by Paul »