Dark Bit Factory & Gravity
PROGRAMMING => Freebasic => Topic started by: Emil_halim on April 12, 2007
-
Hi all
It has been 2 years or more since I last used asm, though I was a big fan of assembly code.
Anyway, I found this code while browsing the net and ported it to FreeBasic. It normalizes a vector using SSE asm.
The problem I still have not solved is that it works on a 4-member vector, not a 3-member one, and I want to apply the code to an array of 3-member vectors.
BTW you can use it in a software renderer. It is very fast because it operates on 4 single variables at the same time.
Any help, please?
' 4 members vector
type Vec4
as single x,y,z,w
end type
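sub SSE_Normalize ( vectors as single ptr)
asm
mov eax, [vectors] ' eax = address of the 4 floats ([vectors] alone is the pointer variable itself)
movups xmm0, [eax] ' load x, y, z, w
movaps xmm2, xmm0 ' keep a copy of the original vector
mulps xmm0, xmm0 ' x*x, y*y, z*z, w*w
movaps xmm1, xmm0
shufps xmm0, xmm1, 0x4e ' swap the low and high pairs
addps xmm0, xmm1
movaps xmm1, xmm0
shufps xmm1, xmm1, 0x11 ' swap neighbours within each pair
addps xmm0, xmm1 ' every lane now holds x*x + y*y + z*z + w*w
rsqrtps xmm0, xmm0 ' approximate 1 / sqrt(sum)
mulps xmm2, xmm0 ' scale the original vector
movups [eax], xmm2 ' store all 4 floats back
end asm
end sub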
-
The way SSE works, you almost *have* to work on 4 values at once.
The only problems here are the
movups xmm0,[vectors]
...
movups [vectors],xmm2
because they read in 128bits (4 floats) at a time.
Either you need to wrap the routine like this
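sub normalise(byref x as single, byref y as single, byref z as single)
dim as single quad(0 to 3)
quad(0)=x
quad(1)=y
quad(2)=z
quad(3)=0 ' zero padding so w does not affect the length
SSE_Normalize(@quad(0))
x=quad(0) ' copy the normalized values back to the caller
y=quad(1)
z=quad(2)
end sub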
Alternatively you could pad your 3d vector type with an extra zero value. If you keep that value at 0, then the calculation will work just fine as it is. Don't worry about it wasting calculations on the zero, the gains of doing 4 values at once far outweigh the losses.
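A minimal sketch of why the zero padding is harmless, in plain FreeBasic without any asm. The name length4 is made up for this example. The fourth term of the sum of squares is 0*0, so the 4-component length equals the 3-component one.

```freebasic
' Sketch only: a padded vector with w = 0 has the same length as the 3D vector.
type Vec4
    as single x, y, z, w
end type

' Hypothetical helper: length over all four components.
function length4(byref v as Vec4) as single
    return sqr(v.x*v.x + v.y*v.y + v.z*v.z + v.w*v.w)
end function

dim as Vec4 v = (3, 4, 12, 0)   ' w is the padding and must stay 0
print length4(v)                ' 13, the same as sqr(3*3 + 4*4 + 12*12)
```

If w ever holds something other than 0, it leaks into the sum and the x, y, z results come out too short.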
Jim
-
Yes, Jim.
If I use the wrapper routine I will lose some speed, and then using SSE has no point.
Using a Vector4 with zero padding is good, and I was thinking of it, but I cannot use it: if I render this array of vectors with DirectX, it must be an array of 3-member vectors. That is my actual problem.
-
Nothing you can do about it. SSE loads 4 words at a time, DirectX requires them to be in groups of 3. You're going to end up copying them somewhere in the pipeline :-\
Jim
-
ok , thanks Jim for your help.
-
OK, I have solved the problem by using 2 arrays of vectors: one for in_vec, which has 4 members, and the other for out_vec, which has 3 members. It may be useful for someone else.
Here is some test code:
/'==================================================='/
' using fast SSE to normalize a vector
'
' by Emil halim
'
/'==================================================='/
type Vector4
as single x,y,z,w
end type
type Vector3
as single x,y,z
end type
dim as integer i
dim as Vector4 in_vec(0 to 200)
for i = 0 to 199
in_vec(i).x = rnd * 10.0
in_vec(i).y = rnd * 10.0
in_vec(i).z = rnd * 10.0
in_vec(i).w = 0.0
next
dim as Vector3 out_vec(0 to 200) ' one spare element absorbs the last 16-byte store
'
' normalizing vector
'====================
'
' Rec_len = 1.0 / sqr((vec1.x*vec1.x) + (vec1.y*vec1.y) + (vec1.z*vec1.z))
' vec1.x *= Rec_len
' vec1.y *= Rec_len
' vec1.z *= Rec_len
'
dim as integer in_addr = @in_vec(0).x
dim as integer out_addr = @out_vec(0).x
asm
mov eax , [in_addr]
mov edx , [out_addr]
mov ecx , 200 ' one iteration per vector, indices 0 to 199
.lab:
movups xmm0, [eax]
movaps xmm2, xmm0
mulps xmm0, xmm0
movaps xmm1, xmm0 'ddccbbaa
shufps xmm0, xmm1,0b01001110
addps xmm0, xmm1
movaps xmm1, xmm0
shufps xmm1, xmm1,0b00010001
addps xmm0, xmm1
rsqrtps xmm0, xmm0
mulps xmm2, xmm0
movups [edx], xmm2
add eax,4*4
add edx,3*4
dec ecx
jnz .lab
end asm
'test the results
i = 10
print in_vec(i).x , in_vec(i).y , in_vec(i).z , in_vec(i).w
print out_vec(i).x , out_vec(i).y , out_vec(i).z
dim as single v_len = sqr((out_vec(i).x*out_vec(i).x) + (out_vec(i).y*out_vec(i).y) + (out_vec(i).z*out_vec(i).z))
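print v_len ' should be close to 1.0 (rsqrtps is only an approximation)
sleep ' wait for a key press instead of spinning in an empty loop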
-
Good solution! Have some Karma!
Jim
-
thanks Jim. :)
-
I have modified my last code so that you can see the advantage of using SSE asm code for floating-point calculation.
There are 2 versions, one in plain FreeBasic and the other in SSE asm.
Please test it and post your results.
On my system
==========
FreeBasic time = 18884 Cycles
SSE time = 7280 Cycles
Here is the code:
/'==================================================='/
' using fast SSE to normalize a vector
'
' by Emil halim
'
/'==================================================='/
type Vector4
as single x,y,z,w
end type
type Vector3
as single x,y,z
end type
dim as integer i
dim as Vector4 in_vec(0 to 200)
for i = 0 to 199
in_vec(i).x = rnd * 10.0
in_vec(i).y = rnd * 10.0
in_vec(i).z = rnd * 10.0
in_vec(i).w = 0.0
next
dim as Vector3 out_vec(0 to 200)
dim as integer Cycles1 , Cycles , save
'
' normalizing vector
'====================
'
' FreeBasic code
asm rdtsc ' measure of time
asm mov [save],Eax
for i = 0 to 199
dim as single Rec_len = 1.0 / sqr((in_vec(i).x*in_vec(i).x) + (in_vec(i).y*in_vec(i).y) + (in_vec(i).z*in_vec(i).z))
in_vec(i).x *= Rec_len
in_vec(i).y *= Rec_len
in_vec(i).z *= Rec_len
next
asm rdtsc ' measure of time
asm SUB Eax, [save]
asm mov [Cycles1],eax
' SSE asm code
for i = 0 to 199
in_vec(i).x = rnd * 10.0
in_vec(i).y = rnd * 10.0
in_vec(i).z = rnd * 10.0
in_vec(i).w = 0.0
next
dim as integer in_addr = @in_vec(0).x
dim as integer out_addr = @out_vec(0).x
asm
rdtsc ' measure of time
mov [save],Eax
mov esi , [in_addr]
mov edi , [out_addr]
mov ecx , 200 ' same 200 vectors as the FreeBasic loop
.lab:
movups xmm0, [esi]
movaps xmm2, xmm0
mulps xmm0, xmm0
movaps xmm1, xmm0 'ddccbbaa
shufps xmm0, xmm1,0b01001110
addps xmm0, xmm1
movaps xmm1, xmm0
shufps xmm1, xmm1,0b00010001
addps xmm0, xmm1
rsqrtps xmm0, xmm0
mulps xmm2, xmm0
movups [edi], xmm2
add esi,4*4
add edi,3*4
dec ecx
jnz .lab
rdtsc ' measure of time
SUB Eax, [save]
mov [Cycles],eax
end asm
'test the results
i = 10
print in_vec(i).x , in_vec(i).y , in_vec(i).z , in_vec(i).w
print out_vec(i).x , out_vec(i).y , out_vec(i).z
dim as single v_len = sqr((out_vec(i).x*out_vec(i).x) + (out_vec(i).y*out_vec(i).y) + (out_vec(i).z*out_vec(i).z))
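print v_len ' should be close to 1.0 (rsqrtps is only an approximation)
print "SSE time = " ; Cycles
print "FreeBasic time = " ; Cycles1
sleep ' wait for a key press instead of spinning in an empty loop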
-
I'm surprised it's not massively quicker than that - that's what? 2.5 x speed? Does it make any difference if you get your source vector4s aligned to a 128bit boundary and use movaps xmm0,[esi]?
Try moving the add esi/dec ecx to just after the rsqrtps. It's obvious that's a slow instruction and it might help to overlap it with something! If you start edi off with out_addr-12, you can put the add edi in there too if it makes any difference.
Also, you have to watch when you store the 200th vector3 result. The movups [edi],xmm2 will store four 32 bit values into the vector3 you have allocated. You need to make sure there's an extra vector3 on the end to take care of that.
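To put numbers on that, here is a small sketch of the byte arithmetic, assuming 12-byte vector3s and a 16-byte movups store (the constant names are made up for the example). The listings above dimension out_vec as (0 to 200), which is 201 elements, so they already have the spare one.

```freebasic
' Sketch only: where the last 16-byte store lands in an array of 12-byte vectors.
const VEC3_SIZE  = 12    ' 3 singles
const STORE_SIZE = 16    ' movups writes 4 singles
const NUM_VECS   = 200   ' vectors to normalize

dim as integer last_store_end = (NUM_VECS - 1) * VEC3_SIZE + STORE_SIZE
dim as integer exact_size     = NUM_VECS * VEC3_SIZE
dim as integer padded_size    = (NUM_VECS + 1) * VEC3_SIZE

print "last store ends at byte "; last_store_end   ' 2404
print "array of 200 vector3    "; exact_size       ' 2400, overrun by 4 bytes
print "array of 201 vector3    "; padded_size      ' 2412, the store fits
```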
Question - in freebasic is 1.0 a single or a double? Does Sqr take/return single or double? I'm guessing all single?
Jim
-
afaik fb stores const floats as double and Sqr takes/returns singles or doubles:
fld dword ptr [ebp-8]
fsqrt
fstp dword ptr [ebp-8]
or
fld qword ptr [ebp-12]
fsqrt
fstp qword ptr [ebp-12]
or whatever way round you want.
Fryer.
-
Of course memory aligned to 16 bytes, i.e. using __declspec(align(16)) in C++, will make it faster, but I do not know how to do that in FreeBasic. Besides, I really do not want to do it, because it wastes memory with a big mesh, and I need the out_vec array to be vec3 to send it to DirectX.
Anyway, I will try the prefetchnta instruction to get the data into cache memory, and process TWO VECTORS at the same time.
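For reference, one possible way to get a 16-byte aligned buffer in FreeBasic is to over-allocate with Allocate and round the pointer up. This is only a sketch and was not timed in this thread; AlignedBlock and aligned_alloc16 are hypothetical names, not part of FreeBasic.

```freebasic
' Sketch only: 16-byte aligned memory by over-allocating and rounding up.
type AlignedBlock
    as any ptr raw     ' pointer returned by Allocate, needed for Deallocate
    as single ptr p    ' 16-byte aligned pointer inside the raw block
end type

function aligned_alloc16(byval bytes as uinteger) as AlignedBlock
    dim as AlignedBlock b
    b.raw = allocate(bytes + 15)   ' up to 15 bytes are lost to the rounding
    if b.raw = 0 then
        b.p = 0                    ' allocation failed
    else
        ' round the address up to the next multiple of 16
        b.p = cast(single ptr, (cast(uinteger, b.raw) + 15) and (not cuint(15)))
    end if
    return b
end function

dim as AlignedBlock blk = aligned_alloc16(200 * 16)   ' room for 200 Vector4
print (cast(uinteger, blk.p) and 15)                   ' 0 when the pointer is aligned
deallocate(blk.raw)
```

Only the input array of vector4s can be aligned this way. The 12-byte output vectors still need movups, because only every fourth one starts on a 16-byte boundary.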
-
OK, here I processed 2 vectors at the same time and gained some speed.
Now the SSE time is 6492.
Using prefetchnta decreased the speed, so I commented it out.
Here is the code:
asm
rdtsc ' measure of time
mov [save],Eax
mov esi , [in_addr]
mov edi , [out_addr]
mov ecx , 200 ' number of vectors
shr ecx , 1 ' two vectors per iteration
.lab:
movups xmm0, [esi]
movups xmm3, [esi+16]
'prefetchnta [esi+3*16]
movaps xmm2, xmm0
movaps xmm5, xmm3
mulps xmm0, xmm0
mulps xmm3, xmm3
movaps xmm1, xmm0 'ddccbbaa
movaps xmm4, xmm3
shufps xmm0, xmm1,0b01001110
shufps xmm3, xmm4,0b01001110
addps xmm0, xmm1
addps xmm3, xmm4
movaps xmm1, xmm0
movaps xmm4, xmm3
shufps xmm1, xmm1,0b00010001
shufps xmm4, xmm4,0b00010001
addps xmm0, xmm1
addps xmm3, xmm4
rsqrtps xmm0, xmm0
rsqrtps xmm3, xmm3
mulps xmm2, xmm0
mulps xmm5, xmm3
movups [edi], xmm2
movups [edi+12], xmm5
'prefetchnta [edi+3*12]
add esi,4*4*2
add edi,3*4*2
dec ecx
jnz .lab
rdtsc ' measure of time
sub Eax, [save]
mov [Cycles],eax
end asm
-
There's still a 4 cycle 'hole' after each rsqrt and a 6 cycle hole after each mulps. You might want to re-order the instructions to squeeze a little more out of it.
Might be worth asking on comp.lang.asm.x86 if anyone has any ideas for making it quicker?
Jim
-
thanks Jim , i will ask there.
-
The first code generated this for me:
2.655853 0.7265174 1.123683 0
0.893011 0.2442862 0.3778302
0.9999501
SSE time = 4664
FreeBasic time = 16103
Does this mean it was quicker on my pc???
I'm not into this ASM stuff at all so I have no idea.
-
Sorry Paul, I do not know what the word "wuicker" means.
Did you test the last one?
Thanks for testing.
-
'quicker'. It looks like it was a bit quicker on Paul's PC. Perhaps the difference between HT and P4D or Core2 or AMD?
Jim
-
OK, I see now, and yes, perhaps it is the speed of the CPU in Paul's PC.
Anyway, Paul's results indicate that it is about 3.5 times faster when using SSE, and this is good news. :)
If anyone else can test it, please post the results!
-
Yes, I meant the speed difference between SSE and FreeBasic was greater on my machine.
Sorry for being so unclear.
Edit: I only tried the first prog. How do I run the second one?
-
here it is Paul
/'==================================================='/
' using fast SSE to normalize a vector
'
' by Emil halim
'
/'==================================================='/
type Vector4
as single x,y,z,w
end type
type Vector3
as single x,y,z
end type
dim as integer i
dim as Vector4 in_vec(0 to 200)
for i = 0 to 199
in_vec(i).x = rnd * 10.0
in_vec(i).y = rnd * 10.0
in_vec(i).z = rnd * 10.0
in_vec(i).w = 0.0
next
dim as Vector3 out_vec(0 to 200)
dim as integer Cycles1 , Cycles , save
'
' normalizing vector
'====================
'
' FreeBasic code
asm rdtsc ' measure of time
asm mov [save],Eax
for i = 0 to 199
dim as single Rec_len = 1.0 / sqr((in_vec(i).x*in_vec(i).x) + (in_vec(i).y*in_vec(i).y) + (in_vec(i).z*in_vec(i).z))
in_vec(i).x *= Rec_len
in_vec(i).y *= Rec_len
in_vec(i).z *= Rec_len
next
asm rdtsc ' measure of time
asm SUB Eax, [save]
asm mov [Cycles1],eax
' SSE asm code
for i = 0 to 199
in_vec(i).x = rnd * 10.0
in_vec(i).y = rnd * 10.0
in_vec(i).z = rnd * 10.0
in_vec(i).w = 0.0
next
dim as integer in_addr = @in_vec(0).x
dim as integer out_addr = @out_vec(0).x
asm
rdtsc ' measure of time
mov [save],Eax
mov esi , [in_addr]
mov edi , [out_addr]
mov ecx , 200 ' number of vectors
shr ecx , 1 ' two vectors per iteration
.lab:
movups xmm0, [esi]
movups xmm3, [esi+16]
'prefetchnta [esi+3*16]
movaps xmm2, xmm0
movaps xmm5, xmm3
mulps xmm0, xmm0
mulps xmm3, xmm3
movaps xmm1, xmm0 'ddccbbaa
movaps xmm4, xmm3
shufps xmm0, xmm1,0b01001110
shufps xmm3, xmm4,0b01001110
addps xmm0, xmm1
addps xmm3, xmm4
movaps xmm1, xmm0
movaps xmm4, xmm3
shufps xmm1, xmm1,0b00010001
shufps xmm4, xmm4,0b00010001
addps xmm0, xmm1
addps xmm3, xmm4
rsqrtps xmm0, xmm0
rsqrtps xmm3, xmm3
mulps xmm2, xmm0
mulps xmm5, xmm3
movups [edi], xmm2
movups [edi+12], xmm5
'prefetchnta [edi+3*12]
add esi,4*4*2
add edi,3*4*2
dec ecx
jnz .lab
rdtsc ' measure of time
sub Eax, [save]
mov [Cycles],eax
end asm
'test the results
i = 10
print in_vec(i).x , in_vec(i).y , in_vec(i).z , in_vec(i).w
print out_vec(i).x , out_vec(i).y , out_vec(i).z
dim as single v_len = sqr((out_vec(i).x*out_vec(i).x) + (out_vec(i).y*out_vec(i).y) + (out_vec(i).z*out_vec(i).z))
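print v_len ' should be close to 1.0 (rsqrtps is only an approximation)
print "SSE time = " ; Cycles
print "FreeBasic time = " ; Cycles1
sleep ' wait for a key press instead of spinning in an empty loop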
-
Even better result :D
2.655853 0.7265174 1.123683 0
0.893011 0.2442862 0.3778302
0.9999501
SSE time = 3851
FreeBasic time = 16128
That's about 4.19x faster :)
-
great news. :)
-
Not much luck with c.l.a.x86 :(
I tried the code on mine and got about 3x speed. It's interesting to run it a few times and see the different results. When I made it do 20000 instead of 200 I get something more reliable.
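A rough sketch of that kind of measurement, assuming a plain FreeBasic scalar loop and the built-in Timer instead of rdtsc. NUM_VECS, RUNS and normalize_all are names made up for the example. It uses the larger count and keeps the best of several runs, which cuts down the noise from one-off interruptions.

```freebasic
' Sketch only: best-of-N timing of a scalar normalize loop using Timer.
const NUM_VECS = 20000
const RUNS = 10

type Vector4
    as single x, y, z, w
end type

dim shared as Vector4 vecs(0 to NUM_VECS - 1)

' Hypothetical routine under test: scalar normalize of every vector.
sub normalize_all()
    for i as integer = 0 to NUM_VECS - 1
        dim as single rec_len = 1.0 / sqr(vecs(i).x*vecs(i).x + vecs(i).y*vecs(i).y + vecs(i).z*vecs(i).z)
        vecs(i).x *= rec_len
        vecs(i).y *= rec_len
        vecs(i).z *= rec_len
    next
end sub

for i as integer = 0 to NUM_VECS - 1
    vecs(i).x = rnd * 10.0 + 1.0   ' + 1.0 keeps the length away from zero
    vecs(i).y = rnd * 10.0 + 1.0
    vecs(i).z = rnd * 10.0 + 1.0
    vecs(i).w = 0.0
next

dim as double best = 1e30
for r as integer = 1 to RUNS
    dim as double t0 = timer
    normalize_all()
    dim as double elapsed = timer - t0
    if elapsed < best then best = elapsed   ' keep the fastest run
next
print "best of"; RUNS; " runs:"; best; " seconds"
```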
Jim
-
No luck with comp.lang.asm.x86 :(
I posted there but got no answer.
I think the effect of SSE is noticeable when you do a huge number of calculations, just as in your test.
So I consider it good news too. :)
-
Man, this makes me want to learn ASM really, really bad. Good job! :clap:
SSE time = 3933
FreeBasic time = 47493
-
that is great news too. :)
Edited:
It is about 12 times faster!
Did you run it more than once and take the average, or what?
What is the config of your system?
-
Well, I just ran it a few times and just posted the results from the last one. They were all very good though. ;)
I'm running xp sp2 with a 2gig sempron and 512mb ddr sdram.
-
OK, that is really nice. I think AMD will beat Intel in the next few months.
-
i made a little demo of a skeletal animation a while back. Do you mind if I stick this code in there to test it with a real application? :cheers:
-
Sure man , use it as you want. :)
If I have some time I will do a matrix multiply with SSE too, and of course you can use that as well.
-
Man, that would be awesome! :D I really need to take a class on asm programming. :-\
-
I really need to take a class on asm programming. :-\
We have an asm forum here so we can deal with your questions :)