Dark Bit Factory & Gravity

PROGRAMMING => Freebasic => Topic started by: Emil_halim on April 12, 2007

Title: Fast Normalize vector
Post by: Emil_halim on April 12, 2007: Hi all

I really haven't used asm for two years or more, though I was a big fan of assembly code.

Anyway, I found this code while browsing the net and ported it to FreeBasic. It normalizes a vector using SSE asm.

The problem I still haven't solved is that it works on a 4-member vector, not a 3-member one, and I want to apply the code to an array of 3-member vectors.

BTW, you can use it in a software renderer. It is very fast because it operates on 4 single variables at the same time.

Any help, please?

Code: [Select]
' 4 members vector
type Vec4
    as single x,y,z,w
end type

sub SSE_Normalize ( vectors as single ptr)
    asm
        movups xmm0, [vectors]
        movaps xmm2, xmm0
        mulps xmm0, xmm0
        movaps xmm1, xmm0
        shufps xmm0, xmm1, 0x4e
        addps xmm0, xmm1
        movaps xmm1, xmm0
        shufps xmm1, xmm1, 0x11
        addps xmm0, xmm1
        rsqrtps xmm0, xmm0
        mulps xmm2, xmm0
        movups [vectors], xmm2
    end asm
end sub
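For readers who do not read SSE, the routine can be followed lane by lane in plain FreeBasic. This is only a sketch of what the instructions compute, not a replacement for them: the name ScalarNormalize4 and the two scratch arrays are made up for illustration, and sqr stands in for rsqrtps, which on real hardware returns an approximation. Note also that inside a FreeBasic asm block [vectors] names the pointer variable itself, which is why the later listings in this thread load the address into a register first.

```freebasic
' Illustrative only: a scalar walk-through of the SSE steps.
type Vec4
    as single x, y, z, w
end type

' Hypothetical helper; sqr stands in for the approximate rsqrtps.
sub ScalarNormalize4(byref v as Vec4)
    dim as single s(0 to 3), t(0 to 3)
    dim as integer k

    ' mulps xmm0, xmm0 : square each lane
    s(0) = v.x * v.x : s(1) = v.y * v.y
    s(2) = v.z * v.z : s(3) = v.w * v.w

    ' shufps 0x4e + addps : add the swapped halves
    t(0) = s(0) + s(2) : t(1) = s(1) + s(3)
    t(2) = s(2) + s(0) : t(3) = s(3) + s(1)

    ' shufps 0x11 + addps : add neighbouring lanes
    for k = 0 to 3
        s(k) = t(k) + t(k xor 1)
    next

    ' rsqrtps + mulps : every lane of s now holds x*x + y*y + z*z + w*w
    dim as single inv_len = 1.0 / sqr(s(0))
    v.x *= inv_len : v.y *= inv_len
    v.z *= inv_len : v.w *= inv_len
end sub

dim as Vec4 v
v.x = 3.0 : v.y = 0.0 : v.z = 4.0 : v.w = 0.0   ' length 5
ScalarNormalize4(v)
print v.x, v.y, v.z, v.w
```

Because all four squares are summed, the fourth member has to be 0 for the result to be the length of the x, y, z part.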
Title: Re: Fast Normalize vector
Post by: Jim on April 12, 2007: The way SSE works, you almost *have* to work on 4 values at once.

The only problems here are the
Code: [Select]
movups xmm0,[vectors]
...
movups [vectors],xmm2
because they read in 128 bits (4 floats) at a time.

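Either you need to wrap the routine like this:

sub normalise(byref x as single, byref y as single, byref z as single)
    dim quad(0 to 3) as single
    quad(0) = x
    quad(1) = y
    quad(2) = z
    quad(3) = 0
    SSE_Normalize(@quad(0))
    ' copy the normalized values back to the caller
    x = quad(0)
    y = quad(1)
    z = quad(2)
end sub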

Alternatively, you could pad your 3D vector type with an extra zero value. If you keep that value at 0, the calculation will work just fine as it is. Don't worry about wasting calculations on the zero; the gains of doing 4 values at once far outweigh the losses.

Jim
Title: Re: Fast Normalize vector
Post by: Emil_halim on April 12, 2007: Yes, Jim.

If I use the wrapper routine I will lose some speed, and then using SSE has no point.

Using a vector4 with zero padding is good, and I was thinking about it, but I cannot use it: if I render this array of vectors with DirectX, it must be an array of 3-member vectors. That is my actual problem.
Title: Re: Fast Normalize vector
Post by: Jim on April 12, 2007: Nothing you can do about it. SSE loads 4 words at a time, DirectX requires them to be in groups of 3. You're going to end up copying them somewhere in the pipeline :-\

Jim
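The copy Jim describes can be as plain as the loop below. It is a sketch under assumptions: the type names and the helper PackXYZ are made up for illustration, and nothing here is specific to DirectX beyond the 12-byte layout it is said to require.

```freebasic
' Illustrative sketch: work in padded vectors, hand packed ones to the renderer.
type Vec4
    as single x, y, z, w   ' 16 bytes, w kept at 0
end type

type Vec3
    as single x, y, z      ' 12 bytes, the packed layout
end type

' Hypothetical helper: copy n padded vectors into a packed array.
sub PackXYZ(src as Vec4 ptr, dst as Vec3 ptr, n as integer)
    for i as integer = 0 to n - 1
        dst[i].x = src[i].x
        dst[i].y = src[i].y
        dst[i].z = src[i].z
    next
end sub

dim as Vec4 padded(0 to 1)
dim as Vec3 packed(0 to 1)
padded(1).x = 1.0 : padded(1).y = 2.0 : padded(1).z = 3.0

PackXYZ(@padded(0), @packed(0), 2)
print sizeof(Vec4), sizeof(Vec3)
print packed(1).x, packed(1).y, packed(1).z
```

The two sizes printed show the mismatch in stride: 16 bytes per padded vector against 12 bytes per packed one.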
Title: Re: Fast Normalize vector
Post by: Emil_halim on April 12, 2007: OK, thanks for your help, Jim.
Title: Re: Fast Normalize vector
Post by: Emil_halim on April 14, 2007: OK, I have solved the problem by using two arrays of vectors: an input array whose vectors have 4 members, and an output array whose vectors have 3 members. It may be useful for someone else.

Here is some test code:
Code: [Select]
/'==================================================='/
' using fast SSE to normalize a vector '
' by Emil halim '
/'==================================================='/

type Vector4
    as single x,y,z,w
end type

type Vector3
    as single x,y,z
end type

dim as integer i
dim as Vector4 in_vec(0 to 200)
for i = 0 to 199
    in_vec(i).x = rnd * 10.0
    in_vec(i).y = rnd * 10.0
    in_vec(i).z = rnd * 10.0
    in_vec(i).w = 0.0
next

dim as Vector3 out_vec(0 to 200)

'
' normalizing vector
'====================
'
' Rec_len = 1.0 / sqr((vec1.x*vec1.x) + (vec1.y*vec1.y) + (vec1.z*vec1.z))
' vec1.x *= Rec_len
' vec1.y *= Rec_len
' vec1.z *= Rec_len
'

dim as integer in_addr = @in_vec(0).x
dim as integer out_addr = @out_vec(0).x

asm
    mov eax , [in_addr]
    mov edx , [out_addr]
    mov ecx , 199
.lab:
    movups xmm0, [eax]
    movaps xmm2, xmm0
    mulps xmm0, xmm0
    movaps xmm1, xmm0
    'ddccbbaa
    shufps xmm0, xmm1,0b01001110
    addps xmm0, xmm1
    movaps xmm1, xmm0
    shufps xmm1, xmm1,0b00010001
    addps xmm0, xmm1
    rsqrtps xmm0, xmm0
    mulps xmm2, xmm0
    movups [edx], xmm2
    add eax,4*4
    add edx,3*4
    dec ecx
    jnz .lab
end asm

'test the results
i = 10
print in_vec(i).x , in_vec(i).y , in_vec(i).z , in_vec(i).w
print out_vec(i).x , out_vec(i).y , out_vec(i).z
dim as single v_len = sqr((out_vec(i).x*out_vec(i).x) + (out_vec(i).y*out_vec(i).y) + (out_vec(i).z*out_vec(i).z))
print v_len ' must be 1.0

Do
Loop
Title: Re: Fast Normalize vector
Post by: Jim on April 15, 2007: Good solution! Have some Karma!

Jim
Title: Re: Fast Normalize vector
Post by: Emil_halim on April 15, 2007: Thanks, Jim. :)
Title: Re: Fast Normalize vector
Post by: Emil_halim on April 15, 2007: I have modified my last code so that you can see the advantage of using SSE asm in floating-point calculations.

There are now two versions: one in plain FreeBasic and the other in SSE asm.

Please test it and post your results.

On my system
==========

FreeBasic time = 18884 Cycles

SSE time = 7280 Cycles

Here is the code:
Code: [Select]
/'==================================================='/
' using fast SSE to normalize a vector '
' by Emil halim '
/'==================================================='/

type Vector4
    as single x,y,z,w
end type

type Vector3
    as single x,y,z
end type

dim as integer i
dim as Vector4 in_vec(0 to 200)
for i = 0 to 199
    in_vec(i).x = rnd * 10.0
    in_vec(i).y = rnd * 10.0
    in_vec(i).z = rnd * 10.0
    in_vec(i).w = 0.0
next

dim as Vector3 out_vec(0 to 200)
dim as integer Cycles1 , Cycles , save

'
' normalizing vector
'====================
'

' FreeBasic code
asm rdtsc ' measure of time
asm mov [save],Eax

for i = 0 to 199
    dim as single Rec_len = 1.0 / sqr((in_vec(i).x*in_vec(i).x) + (in_vec(i).y*in_vec(i).y) + (in_vec(i).z*in_vec(i).z))
    in_vec(i).x *= Rec_len
    in_vec(i).y *= Rec_len
    in_vec(i).z *= Rec_len
next

asm rdtsc ' measure of time
asm SUB Eax, [save]
asm mov [Cycles1],eax

' SSE asm code
for i = 0 to 199
    in_vec(i).x = rnd * 10.0
    in_vec(i).y = rnd * 10.0
    in_vec(i).z = rnd * 10.0
    in_vec(i).w = 0.0
next

dim as integer in_addr = @in_vec(0).x
dim as integer out_addr = @out_vec(0).x

asm
    rdtsc ' measure of time
    mov [save],Eax
    mov esi , [in_addr]
    mov edi , [out_addr]
    mov ecx , 199
.lab:
    movups xmm0, [esi]
    movaps xmm2, xmm0
    mulps xmm0, xmm0
    movaps xmm1, xmm0
    'ddccbbaa
    shufps xmm0, xmm1,0b01001110
    addps xmm0, xmm1
    movaps xmm1, xmm0
    shufps xmm1, xmm1,0b00010001
    addps xmm0, xmm1
    rsqrtps xmm0, xmm0
    mulps xmm2, xmm0
    movups [edi], xmm2
    add esi,4*4
    add edi,3*4
    dec ecx
    jnz .lab
    rdtsc ' measure of time
    SUB Eax, [save]
    mov [Cycles],eax
end asm

'test the results
i = 10
print in_vec(i).x , in_vec(i).y , in_vec(i).z , in_vec(i).w
print out_vec(i).x , out_vec(i).y , out_vec(i).z
dim as single v_len = sqr((out_vec(i).x*out_vec(i).x) + (out_vec(i).y*out_vec(i).y) + (out_vec(i).z*out_vec(i).z))
print v_len ' must be 1.0
print "SSE time = " ; Cycles
print "FreeBasic time = " ; Cycles1

Do
Loop
Title: Re: Fast Normalize vector
Post by: Jim on April 15, 2007: I'm surprised it's not massively quicker than that - that's what? 2.5 x speed? Does it make any difference if you get your source vector4s aligned to a 128bit boundary and use movaps xmm0,[esi]?

Try moving the add esi/dec ecx to just after the rsqrtps. It's obvious that's a slow instruction and it might help to overlap it with something! If you start edi off with out_addr-12, you can put the add edi in there too if it makes any difference.

Also, you have to watch when you store the 200th vector3 result. The movups [edi],xmm2 will store four 32 bit values into the vector3 you have allocated. You need to make sure there's an extra vector3 on the end to take care of that.
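The size of that overrun is simple arithmetic, sketched below. The helper name BytesNeeded is made up for illustration; the only assumptions are the ones Jim states: each result slot is 12 bytes wide and each movups writes 16.

```freebasic
' Illustrative arithmetic: how far the last 16-byte store reaches.
const VEC3_BYTES  = 12   ' three singles
const STORE_BYTES = 16   ' one movups writes four singles

' Hypothetical helper: bytes the output buffer must hold for n results.
function BytesNeeded(n as integer) as integer
    ' the last store starts at (n - 1) * 12 and writes 16 bytes
    return (n - 1) * VEC3_BYTES + STORE_BYTES
end function

dim as integer n = 200
print "packed size:            "; n * VEC3_BYTES
print "bytes needed:           "; BytesNeeded(n)
print "with one spare vector3: "; (n + 1) * VEC3_BYTES
```

For 200 results the last store reaches 4 bytes past a tightly sized array, so one spare vector3 (12 bytes) is more than enough. The listings in this thread dimension out_vec(0 to 200), which is 201 elements.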

Question - in freebasic is 1.0 a single or a double? Does Sqr take/return single or double? I'm guessing all single?

Jim
Title: Re: Fast Normalize vector
Post by: Stonemonkey on April 15, 2007: afaik fb stores const floats as double and Sqr takes/returns singles or doubles:

fld dword ptr [ebp-8]
fsqrt
fstp dword ptr [ebp-8]

or

fld qword ptr [ebp-12]
fsqrt
fstp qword ptr [ebp-12]

or whatever way round you want.

Fryer.
Title: Re: Fast Normalize vector
Post by: Emil_halim on April 15, 2007: Of course, memory aligned to 16 bytes (i.e. using __declspec(align(16)) in C++) will make it faster, but I do not know how to do that in FreeBasic. Besides, I do not really want to, because it wastes memory on a big mesh, and I need the out_vectors array to be vec3 to send it to DirectX.

Anyway, I will try the prefetchnta instruction to get the data into cache, and process TWO VECTORS at the same time.
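One way to get a 16-byte aligned block in FreeBasic without any compiler support is to over-allocate and round the address up. The sketch below assumes nothing beyond the standard allocate and deallocate; the variable names are made up for illustration. The cost is at most 15 bytes for the whole array, because every 16-byte vector4 after an aligned start is itself aligned. The packed vec3 output keeps its 12-byte stride, so its stores would still have to be unaligned.

```freebasic
' Illustrative sketch: get a 16-byte aligned block from the ordinary heap.
type Vec4
    as single x, y, z, w
end type

const VEC_COUNT = 200

' Over-allocate by 15 bytes, then round the address up to a multiple of 16.
dim as any ptr raw = allocate(VEC_COUNT * sizeof(Vec4) + 15)
if raw = 0 then
    print "allocation failed"
    end 1
end if

dim as uinteger addr = (cast(uinteger, raw) + 15) and (not cast(uinteger, 15))
dim as Vec4 ptr vecs = cast(Vec4 ptr, addr)

print "offset from raw block: "; addr - cast(uinteger, raw)
print "address mod 16:        "; addr mod 16

vecs[0].x = 1.0   ' the aligned block is used like any other array

deallocate(raw)   ' free the original pointer, not the adjusted one
```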
Title: Re: Fast Normalize vector
Post by: Emil_halim on April 15, 2007: OK, here I process two vectors at the same time and gain some speed.

Now SSE time = 6492

Using prefetchnta decreased the speed, so I commented it out.

Here is the code:
Code: [Select]
asm
    rdtsc ' measure of time
    mov [save],Eax
    mov esi , [in_addr]
    mov edi , [out_addr]
    mov ecx , 199
    shr ecx , 1
.lab:
    movups xmm0, [esi]
    movups xmm3, [esi+16]
    'prefetchnta [esi+3*16]
    movaps xmm2, xmm0
    movaps xmm5, xmm3
    mulps xmm0, xmm0
    mulps xmm3, xmm3
    movaps xmm1, xmm0
    'ddccbbaa
    movaps xmm4, xmm3
    shufps xmm0, xmm1,0b01001110
    shufps xmm3, xmm4,0b01001110
    addps xmm0, xmm1
    addps xmm3, xmm4
    movaps xmm1, xmm0
    movaps xmm4, xmm3
    shufps xmm1, xmm1,0b00010001
    shufps xmm4, xmm4,0b00010001
    addps xmm0, xmm1
    addps xmm3, xmm4
    rsqrtps xmm0, xmm0
    rsqrtps xmm3, xmm3
    mulps xmm2, xmm0
    mulps xmm5, xmm3
    movups [edi], xmm2
    movups [edi+12], xmm5
    'prefetchnta [edi+3*12]
    add esi,4*4*2
    add edi,3*4*2
    dec ecx
    jnz .lab
    rdtsc ' measure of time
    sub Eax, [save]
    mov [Cycles],eax
end asm
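When comparing the timings it may help to count how many vectors each loop actually touches. The sketch below only does the arithmetic for the loop counters used in the listings (ecx starting at 199, and shr ecx, 1 in the two-vector version); the variable names are made up for illustration.

```freebasic
' Illustrative arithmetic: vectors covered by each version of the loop.
dim as integer count = 199

dim as integer one_at_a_time = count              ' dec ecx / jnz runs ecx times
dim as integer two_at_a_time = (count shr 1) * 2  ' shr ecx, 1 halves and rounds down

print "single-vector loop:  "; one_at_a_time
print "two-vector loop:     "; two_at_a_time
print "BASIC for 0 to 199:  "; 200
```

The differences are small next to the measured speed-ups, but the three loops do not do exactly the same amount of work.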
Title: Re: Fast Normalize vector
Post by: Jim on April 15, 2007: There's still a 4 cycle 'hole' after each rsqrt and a 6 cycle hole after each mulps, so you might want to re-order the instructions to squeeze a little more out of it.
Might be worth asking on comp.lang.asm.x86 if anyone has any ideas for making it quicker?

Jim
Title: Re: Fast Normalize vector
Post by: Emil_halim on April 16, 2007: Thanks, Jim, I will ask there.
Title: Re: Fast Normalize vector
Post by: Paul on April 16, 2007: The first code generated this for me:
2.655853 0.7265174 1.123683 0
0.893011 0.2442862 0.3778302
0.9999501
SSE time = 4664
FreeBasic time = 16103

Does this mean it was quicker on my pc???
I'm not into this ASM stuff at all so I have no idea.
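The printed length of 0.9999501 rather than 1.0 is consistent with rsqrtps returning an approximate reciprocal square root. A common remedy, which is not used anywhere in this thread, is one Newton-Raphson step on the estimate. The sketch below shows the step in scalar FreeBasic; the helper name RefineRsqrt is made up, and the rough estimate is faked by scaling the exact value, since plain BASIC has no access to rsqrtps.

```freebasic
' Illustrative sketch: one Newton-Raphson step on a rough 1/sqrt estimate.
' Hypothetical helper, not part of the code posted in this thread.
function RefineRsqrt(x as single, estimate as single) as single
    ' y' = y * (1.5 - 0.5 * x * y * y)
    return estimate * (1.5 - 0.5 * x * estimate * estimate)
end function

dim as single x = 9.0
dim as single exact = 1.0 / sqr(x)
dim as single rough = exact * 1.0003    ' stand-in for an approximate estimate
dim as single refined = RefineRsqrt(x, rough)

print "rough error:   "; abs(rough - exact)
print "refined error: "; abs(refined - exact)
```

In SSE the same step costs a few extra mulps and subps per vector, so whether it is worth it depends on how much accuracy the normals need.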
Title: Re: Fast Normalize vector
Post by: Emil_halim on April 16, 2007: Sorry, Paul, I do not know what the word 'wuicker' means.

Did you test the last one?

Thanks for testing.
Title: Re: Fast Normalize vector
Post by: Jim on April 16, 2007: 'quicker'. It looks like it was a bit quicker on Paul's PC. Perhaps the difference between HT and P4D or Core2 or AMD?

Jim
Title: Re: Fast Normalize vector
Post by: Emil_halim on April 17, 2007: OK, I see now, and yes, perhaps it is the speed of the CPU in Paul's PC.

Anyway, Paul's results indicate that it is about 3.5 times faster when using SSE (16103 / 4664), and this is good news. :)

If anyone else can test it and post the results, please do!
Title: Re: Fast Normalize vector
Post by: Paul on April 17, 2007: Yes, I meant the speed difference between SSE and FreeBasic was greater on my machine.
Sorry for being so unclear.

Edit: I only tried the first program. How do I run the second one?
Title: Re: Fast Normalize vector
Post by: Emil_halim on April 17, 2007: Here it is, Paul:

Code: [Select]
/'==================================================='/
' using fast SSE to normalize a vector '
' by Emil halim '
/'==================================================='/

type Vector4
    as single x,y,z,w
end type

type Vector3
    as single x,y,z
end type

dim as integer i
dim as Vector4 in_vec(0 to 200)
for i = 0 to 199
    in_vec(i).x = rnd * 10.0
    in_vec(i).y = rnd * 10.0
    in_vec(i).z = rnd * 10.0
    in_vec(i).w = 0.0
next

dim as Vector3 out_vec(0 to 200)
dim as integer Cycles1 , Cycles , save

'
' normalizing vector
'====================
'

' FreeBasic code
asm rdtsc ' measure of time
asm mov [save],Eax

for i = 0 to 199
    dim as single Rec_len = 1.0 / sqr((in_vec(i).x*in_vec(i).x) + (in_vec(i).y*in_vec(i).y) + (in_vec(i).z*in_vec(i).z))
    in_vec(i).x *= Rec_len
    in_vec(i).y *= Rec_len
    in_vec(i).z *= Rec_len
next

asm rdtsc ' measure of time
asm SUB Eax, [save]
asm mov [Cycles1],eax

' SSE asm code
for i = 0 to 199
    in_vec(i).x = rnd * 10.0
    in_vec(i).y = rnd * 10.0
    in_vec(i).z = rnd * 10.0
    in_vec(i).w = 0.0
next

dim as integer in_addr = @in_vec(0).x
dim as integer out_addr = @out_vec(0).x

asm
    rdtsc ' measure of time
    mov [save],Eax
    mov esi , [in_addr]
    mov edi , [out_addr]
    mov ecx , 199
    shr ecx , 1
.lab:
    movups xmm0, [esi]
    movups xmm3, [esi+16]
    'prefetchnta [esi+3*16]
    movaps xmm2, xmm0
    movaps xmm5, xmm3
    mulps xmm0, xmm0
    mulps xmm3, xmm3
    movaps xmm1, xmm0
    'ddccbbaa
    movaps xmm4, xmm3
    shufps xmm0, xmm1,0b01001110
    shufps xmm3, xmm4,0b01001110
    addps xmm0, xmm1
    addps xmm3, xmm4
    movaps xmm1, xmm0
    movaps xmm4, xmm3
    shufps xmm1, xmm1,0b00010001
    shufps xmm4, xmm4,0b00010001
    addps xmm0, xmm1
    addps xmm3, xmm4
    rsqrtps xmm0, xmm0
    rsqrtps xmm3, xmm3
    mulps xmm2, xmm0
    mulps xmm5, xmm3
    movups [edi], xmm2
    movups [edi+12], xmm5
    'prefetchnta [edi+3*12]
    add esi,4*4*2
    add edi,3*4*2
    dec ecx
    jnz .lab
    rdtsc ' measure of time
    sub Eax, [save]
    mov [Cycles],eax
end asm

'test the results
i = 10
print in_vec(i).x , in_vec(i).y , in_vec(i).z , in_vec(i).w
print out_vec(i).x , out_vec(i).y , out_vec(i).z
dim as single v_len = sqr((out_vec(i).x*out_vec(i).x) + (out_vec(i).y*out_vec(i).y) + (out_vec(i).z*out_vec(i).z))
print v_len ' must be 1.0
print "SSE time = " ; Cycles
print "FreeBasic time = " ; Cycles1

Do
Loop
Title: Re: Fast Normalize vector
Post by: Paul on April 17, 2007: Even better result :D

2.655853 0.7265174 1.123683 0
0.893011 0.2442862 0.3778302
0.9999501
SSE time = 3851
FreeBasic time = 16128

That's 4.1880031160737470786808621137367 times faster :)
Title: Re: Fast Normalize vector
Post by: Emil_halim on April 17, 2007: Great news. :)
Title: Re: Fast Normalize vector
Post by: Jim on April 21, 2007: Not much luck with c.l.a.x86 :(
I tried the code on mine and got about 3x speed. It's interesting to run it a few times and see the different results. When I made it do 20000 instead of 200 I get something more reliable.

Jim
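A timing harness along the lines Jim describes might look like the sketch below. It is an assumption-laden illustration rather than the code anyone in the thread ran: it times only the plain BASIC loop, uses FreeBasic's timer function instead of rdtsc, and keeps the best of several passes to reduce noise from other processes. The names VEC_COUNT, PASSES and best are made up.

```freebasic
' Illustrative sketch: time many vectors over several passes, keep the best pass.
type Vec3
    as single x, y, z
end type

const VEC_COUNT = 20000
const PASSES = 5

dim as Vec3 vecs(0 to VEC_COUNT - 1)
dim as double best = 1e30

for pass as integer = 1 to PASSES
    ' refill so every pass normalizes fresh, non-zero data
    for i as integer = 0 to VEC_COUNT - 1
        vecs(i).x = rnd * 10.0 + 0.1
        vecs(i).y = rnd * 10.0
        vecs(i).z = rnd * 10.0
    next

    dim as double t0 = timer
    for i as integer = 0 to VEC_COUNT - 1
        dim as single sum_sq = vecs(i).x * vecs(i).x + vecs(i).y * vecs(i).y + vecs(i).z * vecs(i).z
        dim as single rec_len = 1.0 / sqr(sum_sq)
        vecs(i).x *= rec_len
        vecs(i).y *= rec_len
        vecs(i).z *= rec_len
    next
    dim as double elapsed = timer - t0
    if elapsed < best then best = elapsed
next

print "best of"; PASSES; " passes: "; best * 1000; " ms for"; VEC_COUNT; " vectors"
```

The resolution of timer depends on the platform, so for very short loops a larger VEC_COUNT gives steadier numbers.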
Title: Re: Fast Normalize vector
Post by: Emil_halim on April 21, 2007: No luck with comp.lang.asm :(

I posted there but got no answer.

I think the effect of SSE is noticeable when you do a huge number of calculations, just as in your test.

So I consider that good news too. :)
Title: Re: Fast Normalize vector
Post by: Dr_D on April 27, 2007: Man, this makes me want to learn ASM really, really bad. Good job! :clap:

SSE time = 3933
FreeBasic time = 47493
Title: Re: Fast Normalize vector
Post by: Emil_halim on April 28, 2007: That is great news too. :)

Edited:

It is 12 times faster!!!

Did you run it more than once and take the average, or what?

What is your system configuration?
Title: Re: Fast Normalize vector
Post by: Dr_D on April 28, 2007: Well, I just ran it a few times and just posted the results from the last one. They were all very good though. ;)

I'm running xp sp2 with a 2gig sempron and 512mb ddr sdram.
Title: Re: Fast Normalize vector
Post by: Emil_halim on April 28, 2007: OK, that is really nice. I think AMD will beat Intel in the next few months.
Title: Re: Fast Normalize vector
Post by: Dr_D on April 28, 2007: I made a little demo of a skeletal animation a while back. Do you mind if I stick this code in there to test it with a real application? :cheers:
Title: Re: Fast Normalize vector
Post by: Emil_halim on April 28, 2007: Sure, man, use it as you want. :)

If I have some time I will do a matrix multiply with SSE too, and of course you can use that as well.
Title: Re: Fast Normalize vector
Post by: Dr_D on April 28, 2007: Man, that would be awesome! :D I really need to take a class on asm programming. :-\
Title: Re: Fast Normalize vector
Post by: Shockwave on April 28, 2007: Quote from: Dr_D on April 28, 2007
I really need to take a class on asm programming. :-\

We have an asm forum here so we can deal with your questions :)