Topic: Fast Normalize vector

Emil_halim · « **on:** April 12, 2007 »

Hi all

I haven't really used asm for two years or more, though I was a big fan of assembly code.

Anyway, I found this code while browsing the net and ported it to FreeBasic. It normalizes a vector using SSE asm.

The problem I still haven't solved is that it works on a 4-member vector, not a 3-member vector, and I want to apply the code to an array of 3-member vectors.

BTW, you can use it in a software renderer. It is very fast because it operates on 4 single-precision values at the same time.

Any help, please?

Code:


' 4 members vector
type Vec4
    as single x,y,z,w
end type 

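sub SSE_Normalize ( vectors as single ptr) 
   asm 
		mov eax, [vectors]       ' fetch the pointer; [vectors] alone is the stack slot that holds it
		movups xmm0, [eax]       ' xmm0 = x, y, z, w
		movaps xmm2, xmm0        ' keep the original components
		mulps xmm0, xmm0         ' square each component
		movaps xmm1, xmm0
		shufps xmm0, xmm1, 0x4e  ' swap the low and high pairs
		addps xmm0, xmm1
		movaps xmm1, xmm0
		shufps xmm1, xmm1, 0x11
		addps xmm0,   xmm1       ' every lane now holds x*x + y*y + z*z + w*w
		rsqrtps xmm0, xmm0       ' approximate 1 / sqrt
		mulps xmm2, xmm0
		movups [eax], xmm2       ' write the result back through the pointer
	end asm
	
end sub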
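
For anyone puzzling over the two shufps constants, a scalar model may help. The sketch below is plain C rather than FreeBasic, and the names Quad, shufps_model, addps_model and sum_of_squares_all_lanes are made up for illustration. It follows the routine from mulps to the second addps and shows that every lane ends up holding x*x + y*y + z*z + w*w, which is why a fourth member of 0 leaves the 3D length unchanged.

```c
/* A four-lane register modelled as a plain array; lane 0 is the lowest float. */
typedef struct { float lane[4]; } Quad;

/* Model of `shufps dst, src, imm`: the two low result lanes are picked from
   dst, the two high result lanes from src, two selector bits per lane. */
static Quad shufps_model(Quad dst, Quad src, unsigned imm)
{
    Quad r;
    r.lane[0] = dst.lane[imm & 3];
    r.lane[1] = dst.lane[(imm >> 2) & 3];
    r.lane[2] = src.lane[(imm >> 4) & 3];
    r.lane[3] = src.lane[(imm >> 6) & 3];
    return r;
}

/* Model of `addps`: lane-by-lane addition. */
static Quad addps_model(Quad a, Quad b)
{
    Quad r;
    for (int i = 0; i < 4; ++i) r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}

/* Mirrors the instruction sequence between the load and rsqrtps. */
static Quad sum_of_squares_all_lanes(Quad v)
{
    Quad xmm0, xmm1;
    for (int i = 0; i < 4; ++i)
        xmm0.lane[i] = v.lane[i] * v.lane[i];   /* mulps  xmm0, xmm0 */
    xmm1 = xmm0;                                /* movaps xmm1, xmm0 */
    xmm0 = shufps_model(xmm0, xmm1, 0x4e);      /* (zz, ww, xx, yy) */
    xmm0 = addps_model(xmm0, xmm1);             /* (xx+zz, yy+ww, xx+zz, yy+ww) */
    xmm1 = xmm0;                                /* movaps xmm1, xmm0 */
    xmm1 = shufps_model(xmm1, xmm1, 0x11);      /* (yy+ww, xx+zz, yy+ww, xx+zz) */
    xmm0 = addps_model(xmm0, xmm1);             /* the full sum in all four lanes */
    return xmm0;
}
```

Feeding it (1, 2, 3, 0) gives 14 in every lane, the same sum of squares a 3-member vector would produce.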

Jim · « **Reply #1 on:** April 12, 2007 »

The way SSE works, you almost *have* to work on 4 values at once.

The only problems here are the

Code:

movups xmm0,[vectors]
...
movups [vectors],xmm2

because they read in 128bits (4 floats) at a time.

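Either you need to wrap the routine like this:

sub normalise(byref x as single, byref y as single, byref z as single)
    dim quad(0 to 3) as single
    quad(0)=x
    quad(1)=y
    quad(2)=z
    quad(3)=0
    SSE_Normalize(@quad(0))
    ' copy the normalised components back to the caller
    x=quad(0)
    y=quad(1)
    z=quad(2)
end sub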

Alternatively you could pad your 3d vector type with an extra zero value. If you keep that value at 0, then the calculation will work just fine as it is. Don't worry about it wasting calculations on the zero, the gains of doing 4 values at once far outweigh the losses.

Jim

Emil_halim · « **Reply #2 on:** April 12, 2007 »

Yes , Jim.

If I use the wrapper routine I will lose some speed, and then using SSE has no point.

Using a Vector4 with zero padding is good, and I was thinking of it, but I cannot use it: if I render this array of vectors with DirectX, it must be an array of 3-member vectors. That is my actual problem.

Jim · « **Reply #3 on:** April 12, 2007 »

Nothing you can do about it. SSE loads 4 words at a time, DirectX requires them to be in groups of 3. You're going to end up copying them somewhere in the pipeline :-\

Jim

Emil_halim · « **Reply #4 on:** April 12, 2007 »

ok , thanks Jim for your help.

Emil_halim · « **Reply #5 on:** April 14, 2007 »

OK, I have solved that problem by using 2 arrays of vectors: in_vec, whose elements have 4 members, and out_vec, whose elements have 3 members. It may be useful for someone else.

here is a test code

Code:

/'==================================================='/
'   using fast SSE to normalize a vector
'          
'                  by Emil halim 
' 
/'==================================================='/


type Vector4
	 as single x,y,z,w
end type

type Vector3
	 as single x,y,z
end type

dim as integer i
dim as Vector4 in_vec(0 to 200)
  for i = 0 to 199
       in_vec(i).x = rnd * 10.0
       in_vec(i).y = rnd * 10.0
       in_vec(i).z = rnd * 10.0
       in_vec(i).w = 0.0
  next    
	
dim as Vector3 out_vec(0 to 200)

   '
   ' normalizing vector
   '====================
   '
   '  Rec_len = 1.0 / sqr((vec1.x*vec1.x) + (vec1.y*vec1.y) + (vec1.z*vec1.z))
   '  vec1.x *= Rec_len
   '  vec1.y *= Rec_len
   '  vec1.z *= Rec_len
   '
   dim as integer in_addr  = @in_vec(0).x
   dim as integer out_addr = @out_vec(0).x
   asm
      mov    eax , [in_addr] 
      mov    edx , [out_addr]
      mov    ecx , 200   ' one pass per vector: in_vec(0) to in_vec(199)
     .lab: 
      movups xmm0, [eax]           
      movaps xmm2, xmm0            
      mulps  xmm0, xmm0            
      movaps xmm1, xmm0  'ddccbbaa 
      shufps xmm0, xmm1,0b01001110 
      addps  xmm0, xmm1            
      movaps xmm1, xmm0            
      shufps xmm1, xmm1,0b00010001 
      addps  xmm0, xmm1            
      rsqrtps xmm0, xmm0           
      mulps  xmm2, xmm0            
      movups [edx], xmm2
      add eax,4*4
      add edx,3*4
      dec ecx
      jnz .lab    
   end asm
	
  'test the results
   i = 10
   print in_vec(i).x , in_vec(i).y , in_vec(i).z , in_vec(i).w
   print out_vec(i).x , out_vec(i).y , out_vec(i).z 
   
   dim as single v_len =  sqr((out_vec(i).x*out_vec(i).x) + (out_vec(i).y*out_vec(i).y) + (out_vec(i).z*out_vec(i).z))
   print v_len  ' should be close to 1.0 (rsqrtps is only an approximation)
   
   Do
   Loop

Jim · « **Reply #6 on:** April 15, 2007 »

Good solution! Have some Karma!

Jim

Emil_halim · « **Reply #7 on:** April 15, 2007 »

thanks Jim.

Emil_halim · « **Reply #8 on:** April 15, 2007 »

I have modified my last code so that you can see the advantage of using SSE asm code for floating-point calculation.

So there are 2 versions, one in plain FreeBasic and the other in SSE asm.

Please test it and post your results.

On my system
==========

FreeBasic time = 18884 Cycles

SSE time = 7280 Cycles

here is the code

Code:

/'==================================================='/
'   using fast SSE to normalize a vector
'          
'                  by Emil halim 
' 
/'==================================================='/


type Vector4
	 as single x,y,z,w
end type

type Vector3
	 as single x,y,z
end type

   dim as integer i
   dim as Vector4 in_vec(0 to 200)
   for i = 0 to 199
       in_vec(i).x = rnd * 10.0
       in_vec(i).y = rnd * 10.0
       in_vec(i).z = rnd * 10.0
       in_vec(i).w = 0.0
   next    
	
   dim as Vector3 out_vec(0 to 200)

   dim as integer Cycles1 , Cycles , save
   
   '
   ' normalizing vector
   '====================
   '
   
   ' FreeBasic code
   asm rdtsc          '  measure of time
   asm mov [save],Eax
   for i = 0 to 199
     dim as single Rec_len = 1.0 / sqr((in_vec(i).x*in_vec(i).x) + (in_vec(i).y*in_vec(i).y) + (in_vec(i).z*in_vec(i).z))
     in_vec(i).x *= Rec_len
     in_vec(i).y *= Rec_len
     in_vec(i).z *= Rec_len
   next
   asm rdtsc          '  measure of time
   asm SUB Eax, [save]
   asm mov [Cycles1],eax	
	
	
   ' SSE asm code
   for i = 0 to 199
       in_vec(i).x = rnd * 10.0
       in_vec(i).y = rnd * 10.0
       in_vec(i).z = rnd * 10.0
       in_vec(i).w = 0.0
   next    
   dim as integer in_addr  = @in_vec(0).x
   dim as integer out_addr = @out_vec(0).x
   asm
      rdtsc             '  measure of time
      mov [save],Eax          
      mov    esi , [in_addr] 
      mov    edi , [out_addr]
      mov    ecx , 200   ' one pass per vector, same count as the FreeBasic loop
     .lab: 
      movups xmm0, [esi]           
      movaps xmm2, xmm0            
      mulps  xmm0, xmm0            
      movaps xmm1, xmm0  'ddccbbaa 
      shufps xmm0, xmm1,0b01001110 
      addps  xmm0, xmm1            
      movaps xmm1, xmm0            
      shufps xmm1, xmm1,0b00010001 
      addps  xmm0, xmm1            
      rsqrtps xmm0, xmm0           
      mulps  xmm2, xmm0            
      movups [edi], xmm2
      add esi,4*4
      add edi,3*4
      dec ecx
      jnz .lab   
      rdtsc             '  measure of time
      SUB Eax, [save]
      mov [Cycles],eax	
    end asm
	
  'test the results
   i = 10
   print in_vec(i).x , in_vec(i).y , in_vec(i).z , in_vec(i).w
   print out_vec(i).x , out_vec(i).y , out_vec(i).z 
   
   dim as single v_len =  sqr((out_vec(i).x*out_vec(i).x) + (out_vec(i).y*out_vec(i).y) + (out_vec(i).z*out_vec(i).z))
   print v_len  ' should be close to 1.0 (rsqrtps is only an approximation)
   
   print "SSE       time = " ; Cycles 
   print "FreeBasic time = " ; Cycles1 
   
   Do
   Loop

Jim · « **Reply #9 on:** April 15, 2007 »

I'm surprised it's not massively quicker than that - that's what? 2.5 x speed? Does it make any difference if you get your source vector4s aligned to a 128bit boundary and use movaps xmm0,[esi]?

Try moving the add esi/dec ecx to just after the rsqrtps. It's obvious that's a slow instruction and it might help to overlap it with something! If you start edi off with out_addr-12, you can put the add edi in there too if it makes any difference.

Also, you have to watch when you store the 200th vector3 result. The movups [edi],xmm2 will store four 32 bit values into the vector3 you have allocated. You need to make sure there's an extra vector3 on the end to take care of that.

Question - in freebasic is 1.0 a single or a double? Does Sqr take/return single or double? I'm guessing all single?

Jim
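
Jim's point about the last store can be modelled without any SSE at all. Below is a rough sketch in C, where store_4_at_stride_3 is a hypothetical helper that copies four floats per vector, the way movups [edi], xmm2 does, while advancing the output by only three floats.

```c
#include <stddef.h>
#include <string.h>

/* Copies `count` four-float results from `in` to `out`, advancing `out` by
   only three floats per vector, like `movups [edi], xmm2` followed by
   `add edi, 3*4`. The fourth float of each result lands on the x of the
   next output vector and is overwritten by the next store. The fourth
   float of the last result lands one float past the packed data, so `out`
   must have room for 3 * count + 1 floats. */
static void store_4_at_stride_3(const float *in, float *out, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        memcpy(out + 3 * i, in + 4 * i, 4 * sizeof(float));
    }
}
```

With two input vectors the function touches seven output floats, not six. In Emil's test program out_vec is declared (0 to 200) while only 200 vectors are filled, so the spare element is already there to absorb the final spill.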

Stonemonkey · « **Reply #10 on:** April 15, 2007 »

afaik fb stores const floats as double and Sqr takes/returns singles or doubles:

fld dword ptr [ebp-8]
fsqrt
fstp dword ptr [ebp-8]

or

fld qword ptr [ebp-12]
fsqrt
fstp qword ptr [ebp-12]

or whatever way round you want.

Fryer.

Emil_halim · « **Reply #11 on:** April 15, 2007 »

Of course, memory aligned to 16 bytes (i.e. using __declspec(align(16)) in C++) will make it faster, but I don't know how to do it in FreeBasic. Besides, I don't really want to do it, because it wastes memory with a big mesh, and I need the out_vec array to be vec3 to send it to DirectX.

Anyway, I will try the prefetchnta instruction to get the data into the cache, and process TWO VECTORS at the same time.
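
On the alignment question, one approach that needs no compiler support is to over-allocate by 15 bytes and round the address up to the next multiple of 16. The sketch below shows the arithmetic in C; alloc_aligned16 is a hypothetical helper, and the assumption is that the same pointer arithmetic could be done on a block obtained from FreeBasic's Allocate.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: returns a 16-byte aligned pointer inside a block of
   `bytes + 15` bytes obtained from malloc. The unadjusted pointer is handed
   back through `raw_out`, because that is the one that must be passed to
   free() later. */
static void *alloc_aligned16(size_t bytes, void **raw_out)
{
    void *raw = malloc(bytes + 15);
    *raw_out = raw;
    if (raw == NULL) {
        return NULL;
    }
    uintptr_t addr = (uintptr_t)raw;
    addr = (addr + 15) & ~(uintptr_t)15;   /* round up to a multiple of 16 */
    return (void *)addr;
}
```

The cost is at most 15 bytes per array, not per vector. Since each 4-member input vector is already 16 bytes, an aligned base keeps every element aligned, so the loads could use movaps. The 3-member output array has a 12-byte stride, so its stores would still need movups.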

Emil_halim · « **Reply #12 on:** April 15, 2007 »

OK, here I used 'get 2 vectors' at the same time and gained some speed.

Now the SSE time is 6492 cycles.

But using prefetchnta decreased the speed, so I commented it out.

here is the code

Code:

   asm
      rdtsc             '  measure of time
      mov [save],Eax          
      mov    esi , [in_addr] 
      mov    edi , [out_addr]
      mov    ecx , 200   ' 200 vectors, in_vec(0) to in_vec(199)
      shr    ecx , 1     ' two vectors per iteration
     .lab: 
      movups xmm0, [esi]  
      movups xmm3, [esi+16]  
      'prefetchnta  [esi+3*16] 
      movaps xmm2, xmm0     
      movaps xmm5, xmm3        
      mulps  xmm0, xmm0
      mulps  xmm3, xmm3            
      movaps xmm1, xmm0  'ddccbbaa 
      movaps xmm4, xmm3
      shufps xmm0, xmm1,0b01001110 
      shufps xmm3, xmm4,0b01001110 
      addps  xmm0, xmm1    
      addps  xmm3, xmm4
      movaps xmm1, xmm0
      movaps xmm4, xmm3
      shufps xmm1, xmm1,0b00010001 
      shufps xmm4, xmm4,0b00010001 
      addps  xmm0, xmm1   
      addps  xmm3, xmm4
      rsqrtps xmm0, xmm0   
      rsqrtps xmm3, xmm3
      mulps  xmm2, xmm0   
      mulps  xmm5, xmm3 
      movups [edi], xmm2
      movups [edi+12], xmm5
      'prefetchnta   [edi+3*12]
      add esi,4*4*2
      add edi,3*4*2
      dec ecx
      jnz .lab   
      rdtsc             '  measure of time
      sub Eax, [save]
      mov [Cycles],eax	
   end asm

Jim · « **Reply #13 on:** April 15, 2007 »

There's still a 4 cycle 'hole' after each rsqrt and a 6 cycle hole after each mulps, so you might want to re-order the instructions to squeeze a little more out of it.
Might be worth asking on comp.lang.asm.x86 if anyone has any ideas for making it quicker?

Jim

Emil_halim · « **Reply #14 on:** April 16, 2007 »

thanks Jim , i will ask there.

Paul · « **Reply #15 on:** April 16, 2007 »

The first code generated this for me:
2.655853 0.7265174 1.123683 0
0.893011 0.2442862 0.3778302
0.9999501
SSE time = 4664
FreeBasic time = 16103

Does this mean it was quicker on my pc???
I'm not in to this ASM stuff at all so I have no idea.
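
The printed length of 0.9999501 rather than exactly 1 is what one would expect from rsqrtps, which returns only an approximation of the reciprocal square root. If more accuracy were needed, a common refinement is one Newton-Raphson step on the estimate. Here is a minimal sketch in scalar C, with refine_rsqrt as a hypothetical name. In the asm this would mean a few extra mulps/subps instructions after rsqrtps, at some cost in speed.

```c
/* One Newton-Raphson step for y ~ 1/sqrt(x): given an estimate y, returns
   y * (1.5 - 0.5 * x * y * y). If y has a relative error e, the result has
   a relative error of roughly 1.5 * e * e. */
static float refine_rsqrt(float x, float y)
{
    return y * (1.5f - 0.5f * x * y * y);
}
```

For x = 4 the exact answer is 0.5. Starting from an estimate of 0.50015, one step brings the value to within about one part in a million of 0.5.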

Emil_halim · « **Reply #16 on:** April 16, 2007 »

Sorry Paul, I don't know what the word 'wuicker' means.

did you test the last one?

thanks for testing.

Jim · « **Reply #17 on:** April 16, 2007 »

'quicker'. It looks like it was a bit quicker on Paul's PC. Perhaps the difference between HT and P4D or Core2 or AMD?

Jim

Emil_halim · « **Reply #18 on:** April 17, 2007 »

OK, I see now, and yes, perhaps it is the speed of the CPU in Paul's PC.

Anyway, Paul's results indicate that it is about 3.5 times faster when using SSE (16103 / 4664), and this is good news.

If anyone else can test it, please post the results!

Paul · « **Reply #19 on:** April 17, 2007 »

yes, i meant the speed difference between sse and freebasic was greater on my machine.
sorry for being so unclear

Edit: only tried the first prog, how do i run the second one?