Dark Bit Factory & Gravity
PROGRAMMING => Freebasic => Topic started by: Stonemonkey on January 24, 2012
-
Not quite sure where to put this as it's a freebasic sub using some asm. Thought I'd write this as there's been some discussion on asm and fading elsewhere.
Anyway, I've not done much testing so I hope it works ok and that it's fairly self explanatory. If not let me know.
'buffer_address=start address of 32bit depth unsigned int colour buffer
'wwidth/height=dimensions of the buffer
'fade=value 0-256 0=turn buffer black 256=no change
sub fade_buffer(byval buffer_address as uinteger ptr,_
byval wwidth as integer,_
byval height as integer,_
byval fade as integer)
dim as integer ptr last_address=buffer_address+(wwidth*height)-1
asm
mov ecx,dword ptr[buffer_address] 'load buffer start address into reg ecx
mov edx,dword ptr[fade] 'load fade value into reg edx
fade_loop:
mov eax,dword ptr[ecx] 'load 4 byte int into eax from address in ecx
mov ebx,eax 'copy 4 bytes into ebx
and eax,&hff00ff 'filter red and blue
and ebx,&h00ff00 'filter green
imul eax,edx 'mult by 8 bit fade value
imul ebx,edx ' " " " " " "
and eax,&hff00ff00 'filter high bytes of red and blue
and ebx,&h00ff0000 'filter high byte of green
or eax,ebx 're combine
shr eax,8 'shift back into position
mov dword ptr[ecx],eax 'write back to address in ecx
add ecx,4 'point ecx to next pixel
cmp ecx,dword ptr[last_address] 'compare with the end of the buffer
jle fade_loop 'repeat loop if ecx is not past the final pixel
end asm
end sub
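For anyone who wants to follow the arithmetic without the asm, here is a minimal plain FreeBASIC sketch of what the loop does to a single pixel. The name fade_pixel is made up for illustration, and ulong is assumed to be available as a fixed 32 bit type (newer FB versions). Red and blue are multiplied together in one go because each has 8 empty bits above it to catch the product.

```freebasic
' Hypothetical helper, not part of the routine above: fades one 32 bit pixel.
' fade is 0-256, 0 = black, 256 = no change. Alpha ends up 0, as in the asm.
function fade_pixel(byval colour as ulong, byval fade as ulong) as ulong
    dim as ulong rb = colour and &h00ff00ff   ' red and blue, 8 spare bits above each
    dim as ulong g  = colour and &h0000ff00   ' green on its own
    rb = (rb * fade) and &hff00ff00           ' keep the high byte of each product
    g  = (g  * fade) and &h00ff0000
    return (rb or g) shr 8                    ' shift back into position
end function

print hex(fade_pixel(&hffffff, 128), 8)   ' prints 007F7F7F
print hex(fade_pixel(&h804020, 256), 8)   ' prints 00804020 (unchanged)
sleep
```

The asm version does exactly these steps with eax holding red/blue and ebx holding green.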
-
And a little bit of code to demo it. This acts directly on the screen buffer but it'll work on any 32 bit colour buffer.
'place fade routine here
sub main
screenres 640,480,32
for y as integer=0 to 479
for x as integer=0 to 639
pset(x,y),rnd*&hffffff
next
next
for i as integer=1 to 1000
screenlock
fade_buffer(screenptr,640,480,255)
screenunlock
next
end sub
main
sleep
end
-
Given it a go with MMX instructions, it does 2 pixels at a time and is quite a bit faster. I'm not too sure about MMX so if anyone can do this better I'd love to see.
sub fade_buffer_mmx(byval buffer_address as uinteger ptr,_
byval wwidth as integer,_
byval height as integer,_
byval fade as integer)
dim as integer ptr last_address=buffer_address+(wwidth*height)-2
fade or=(fade shl 8)or(fade shl 16)
asm
mov ecx,dword ptr[buffer_address] 'load buffer start address into reg ecx
pxor mm7,mm7 'clear mm7 register, used for unpacking
movd mm6,[fade] 'load fade to mm6
punpcklbw mm6,mm7 'unpack fade bytes
fade_loop_mmx:
movq mm0,qword ptr[ecx] 'load 2 pixel data to mm0
movq mm1,mm0 'copy mm0 to mm1
punpcklbw mm0,mm7 'unpack lo pixel bytes
punpckhbw mm1,mm7 'unpack hi pixel bytes
pmullw mm0,mm6 '8 bit mult of lo pixel and fade
pmullw mm1,mm6 '8 bit mult of hi pixel and fade
psrlw mm0,8 'shift lo result 8 bits
psrlw mm1,8 'shift hi result 8 bits
packuswb mm0,mm1 'pack bytes from both together
movq [ecx],mm0 'write back to buffer
add ecx,8 'next 2 pixels
cmp ecx,dword ptr[last_address] 'check for end of buffer
jle fade_loop_mmx 'loop if not past end of buffer
emms 'reset fpu
end asm
end sub
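One thing worth checking in the MMX version is the range of fade. A small sketch of the byte replication done by the "fade or=" line, assuming I read the routine correctly (pack_fade is a made up name, ulong is assumed as a 32 bit type):

```freebasic
' Hypothetical helper: copies the fade value into the three colour bytes,
' the same way the "fade or=" line does.
function pack_fade(byval fade as ulong) as ulong
    return fade or (fade shl 8) or (fade shl 16)
end function

print hex(pack_fade(255), 8)   ' prints 00FFFFFF, every channel multiplied by 255
print hex(pack_fade(256), 8)   ' prints 01010100, 256 no longer fits in a byte
sleep
```

So for this version the useful range looks like 0-255 rather than 0-256: with 256 the low byte becomes 0 and the others become 1, which would turn the pixels black instead of leaving them unchanged.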
-
Thanks Stonemonkey, I will check out the routine soon (for my upcoming "first ever" demo), much thanks.
-
Good one Stonemonkey :)
K+
-
I'm not too sure about MMX so if anyone can do this better I'd love to see.
Looks pretty good to me.
You should make sure your buffer starts at a multiple of 8 so that all your memory accesses are 64bit aligned.
The two pmullw won't work in parallel, at least that was the case on p3 (but I doubt they ever changed anything on the mmx unit).
So you can put some instructions in between without any cost (but the cpu-front-end is probably clever enough to predraw add & cmp).
Personally I find it more convenient (and sometimes it's even faster) to have separate source- and destination-buffers.
You can also get rid of the cmp with a negative loop-counter (so you can check the zero-flag but it's still increasing)
mov edi, sourceBuffer + numberOfPixels*4 (4 byte per pixel)
mov ecx, -numberOfPixels/2 (two pixels per iteration)
fadeloop:
movq mm0,[edi+ecx*8] (first iteration accesses sourceBuffer[0])
...
inc ecx
jnz fadeloop
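The indexing part of that idea can be sketched in plain FreeBASIC. This only shows the addressing (a pointer to the end of the buffer plus a negative counter that runs up to zero), not the saved cmp, which is an asm-level gain. All names here are made up for the example.

```freebasic
const NUM_PIXELS = 8
dim as ulong pixels(0 to NUM_PIXELS - 1)
for i as integer = 0 to NUM_PIXELS - 1
    pixels(i) = i
next

dim as ulong ptr end_ptr = @pixels(0) + NUM_PIXELS   ' one past the last pixel
dim as integer n = -NUM_PIXELS                       ' negative counter, counts up to 0
do
    end_ptr[n] = end_ptr[n] * 2    ' first pass touches pixels(0)
    n += 1
loop until n = 0                   ' in asm this test is the zero flag from inc

print pixels(0), pixels(NUM_PIXELS - 1)   ' prints 0 and 14
sleep
```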
-
Looks pretty good to me.
You should make sure your buffer starts at a multiple of 8 so that all your memory accesses are 64bit aligned.
Yep, in this case I'm only using the fb screenbuffer but I think it's aligned, maybe to 16 bytes
The two pmullw won't work in parallel, at least that was the case on p3 (but I doubt they ever changed anything on the mmx unit).
So you can put some instructions in between without any cost (but the cpu-front-end is probably clever enough to predraw add & cmp).
ok thanks
Personally I find it more convenient (and sometimes it's even faster) to have separate source- and destination-buffers.
You can also get rid of the cmp with a negative loop-counter (so you can check the zero-flag but it's still increasing)
mov edi, sourceBuffer + numberOfPixels*4 (4 byte per pixel)
mov ecx, -numberOfPixels/2 (two pixels per iteration)
fadeloop:
movq mm0,[edi+ecx*8] (first iteration accesses sourceBuffer[0])
...
inc ecx
jnz fadeloop
That seems to be a little bit quicker, down to 1.58ms from 1.61ms, and compared to about 3.3ms for my non mmx version.
-
That seems to be a little bit quicker, down to 1.58ms from 1.61ms
That doesn't seem much (less than 2%) but makes perfect sense as the cpu is mostly waiting for the cache-loading and multiplications to finish.
You'll probably see a speed increase if you prefetch the pixels of the next scanline to some dummy register, so all the data has already been loaded into the cache before the actual computation starts.
This way alu and mmu are working in parallel without stalls.
On the downside it makes the code more complex and needs some special-case handling for the last scanline.
-
I've looked into prefetching before but never worked it out, so some tips on that would be more than welcome. Instead of the special case for the last scanline, what about just assigning an extra row's worth of memory to the buffer? Maybe not so easy to do with FB's screenbuffer but not a problem with something like tinyptc or my own buffers.
-
It turns out that the benefits of prefetching are tricky to show.
For a first try I simply copied a constant 640x480 image into the screen-buffer and profiled "fade_buffer_mmx" using rdtsc (http://faydoc.tripod.com/cpu/rdtsc.htm).
It runs at pretty constant 790k cycles (that's 2.5 cycles per pixel!).
Since all the data has just been copied it's still in the cache - there's not a single cache miss and the two pmullw must be executed in parallel, too.
The copying runs at about 700k to 1900k cycles, so that's the place where all the cache issues happen.
Considering that the fader is a post-processing filter (so you already managed to fill the buffer), it doesn't really make sense to put much thought into prefetching for this routine.
On a second try I changed the fader to work on separate source- and destination-buffers to get all the trouble into a single place but it behaves surprisingly well, too.
The hardware-prefetcher seems to be much smarter than the last time I looked at it (already a couple of years ago), so it doesn't seem to require hints by prefetch instructions (it's easy to predict since all data-access happens strictly linear, though).
Tests have been done on a core2 quad (with 3mb shared cache per core-pair).
Prefetching behaved significantly differently on an athlon xp.
-
Thanks hellfire, so you think the cpu is pre fetching based on continuing the sequence of each pixel in a full screen buffer?
I'm wondering about this for ideas other than the fade routine and have a few other questions.
Does prefetching help in both reading and writing?
Say something like the fade routine was only affecting a rect area of the buffer, would prefetching help in that?
Is there much that could be done for reading from textures, would this work:
in the inner loop of a triangle rasteriser, after reading from the texture but before shading and writing to the buffer, calculate the address of the texel read for the next pixel and give that hint?
-
so you think the cpu is pre fetching based on continuing the sequence of each pixel in a full screen buffer?
Unfortunately I haven't found any in-depth documentation on the exact behaviour of the hardware prefetcher.
From my understanding it keeps track of all memory accesses and predicts the next access based on repeating patterns.
In the best case you'd always access sequential memory locations but of course that's rarely possible (and the cpu-designers are aware of that)
A simple example (in C):
// object containing 3 integers
struct Object {
int a,b,c;
};
// array of 100 objects
struct Object array[100];
// iterate through all objects and access only a single member
int sum= 0;
for (int i=0; i<100; i++) sum += array[i].a;
Here the prefetcher will be clever enough to see that you're always skipping 8 bytes and tries to make the predicted memory location available before you actually access it.
It will always load a whole cache-line (on my core2 that's 64 bytes) and once a line is loaded the access is basically for free.
If you're skipping less than a cache-line, it would still need to transfer all the data into the cache, though.
And if you're not accessing memory in a somewhat sequential manner it might even load the wrong data into the cache.
So there will be some cases where it needs a little bit of help but most of the time you won't see any speed improvements from prefetch-instructions because the hardware prefetcher is already clever enough.
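To put numbers on the stride in that example, here is a rough FreeBASIC equivalent. The type name Thing and the 64 byte line size are assumptions for illustration (the line size is the core2 figure mentioned above), and long is used as FB's 32 bit integer.

```freebasic
' Same layout as the C struct: three 32 bit integers per element.
type Thing
    as long a, b, c
end type

dim as Thing things(0 to 99)

' distance in bytes between the .a members of two neighbouring elements
dim as integer stride = cast(integer, @things(1).a) - cast(integer, @things(0).a)

print sizeof(Thing)   ' prints 12
print stride          ' prints 12: 4 bytes read, 8 bytes skipped
print 64 \ stride     ' prints 5: about 5 elements share one 64 byte cache-line
sleep
```

So a loop that only reads .a still pulls every byte of the array through the cache, it just gets about 5 useful reads out of each line that is loaded.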
Does prefetching help in both reading and writing?
Depends on how you're writing.
If you're always filling whole cache-lines, there's no need to prefetch anything.
Especially when we're talking about rendering you'd preferably write sequential data (for example filling a scanline) but read data from several (and probably badly predictable) sources.
So prefetching the read-accesses is usually more important.
Still the start and end of each scanline will probably not fill a whole cache-line and need to be merged with the existing data.
Say something like the fade routine was only affecting a rect area of the buffer, would prefetching help in that?
Since you're accessing sequential pixels along each scanline, the prefetcher will probably mispredict the gap between two scanlines - but a single cache-miss per scanline wouldn't be that bad.
You can compensate the mis-prediction with a software-prefetch but since a cache-line needs some time to load you must schedule it early enough (which usually means a few hundred cycles in advance).
That might sound way too much but loading a cache-line always means to discard another, too.
Is there much that could be done for reading from textures, would this work:
in the inner loop of a triangle rasteriser, after reading from the texture but before shading and writing to the buffer, calculate the address of the texel read for the next pixel and give that hint?
That way you would still be waiting for the cache-line to load.
Since your uv-deltas are relatively linear (so you're accessing constant strides again) the hardware will predict most of that anyway.
But if your texels are far away from another, the hardware still loads a whole line although only a single pixel is required.
So in case of texture-mapping it's much more efficient to manage your textures in such a way that many required texels are in the same cache-line regardless of the orientation of the polygon.
For example use mip-maps and/or tiling and render batches of polygons which share the same texture.
Especially the latter will make most of the texture available in the cache after rendering a few polygons.
-
Trying out this routine...
In using the test code, it works fine (adjusted the loop to 255 steps). When I incorporate the subroutine into my own code, and call it once the ESC key has been pressed (outside of the loop, before ptc_close()/end/exitprocess(0)).. it does nothing.
It seems to me as the screenptr isnt correct, but not quite sure how to fix it. I'm using Jim's OpenGL screen setup at the moment with a screen set to...
screenres 800,600,32,0,2 or gfx_no_frame
It's currently setup like...
for i as integer=1 to 255
screenlock
fade_buffer(screenptr,xres,yres,255)
screenunlock
next
I also tried...
fade_buffer(@buffer(0),xres,yres,255)
but that didn't matter.
Not sure why it's not working.
Also, how can I adapt this to fade in from black to full screen (the opposite of what it's setup to do)?
-
Hi, sorry but it won't work with opengl as it needs the buffer it acts on to be in system memory.
As it is just now it darkens the contents of a buffer so to fade in would require either the buffer to be completely redrawn each frame with the shading applied before flipping or if the image being faded in is static then it could be set up to copy from one buffer to another while doing the fading in.
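A minimal sketch of the second option (static image kept in one buffer, faded copy written to another), in plain FreeBASIC rather than asm. fade_copy and the tiny two pixel buffers are made up for the example. It assumes both buffers are ordinary system memory holding 32 bit pixels, so it does not apply to an opengl screen.

```freebasic
' Hypothetical helper: reads from src, writes the faded pixel to dst.
' fade 0 = black, 256 = full colour. src is never modified.
sub fade_copy(byval src as ulong ptr, byval dst as ulong ptr, _
              byval count as integer, byval fade as ulong)
    for i as integer = 0 to count - 1
        dim as ulong c  = src[i]
        dim as ulong rb = ((c and &h00ff00ff) * fade) and &hff00ff00
        dim as ulong g  = ((c and &h0000ff00) * fade) and &h00ff0000
        dst[i] = (rb or g) shr 8
    next
end sub

dim as ulong image(0 to 1) = {&hffffff, &h804020}   ' the untouched full colour image
dim as ulong frame(0 to 1)                          ' what would be shown on screen

' fade in: step the fade value up from 0 to 256
for fade as integer = 0 to 256 step 128
    fade_copy(@image(0), @frame(0), 2, fade)
    print fade, hex(frame(0), 6), hex(frame(1), 6)
next
sleep
```

Running the fade value the other way (256 down to 0) gives a fade out from the same source image.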
-
Ah k, thanks for the reply. Back to the drawing board for my fade in/out routine then.
I might just use alpha blending (like that routine Shockwave did for me). Not sure how cpu intensive that would be though for a screen of 800x600. eg, setup a box of 800x600 in &H000000 and then transition that to full colour? idk.
Think I shall do some more googling around on alpha fades. I've seen a heap of demos on RetroRemake with full screen alpha fades, yet they ran so damn smooth... no idea on the methods used though. My guess is it's coded in PureBasic, not Freebasic, like I'm using.
Thanks for the reply Stonemonkey.
-
To do a fade in of a static image you'll need to have the full colour image in a separate buffer no matter what. If the screen is being redrawn each frame then you can apply the fade each frame to the framebuffer. Either way nothing will beat a bit of asm (as long as it's reasonably well written), and although PB is probably a little faster than FB I don't think FB is too bad. If you're using opengl then you'll have to do the fade with opengl calls.
-
Ok, thanks, I'll have an attempt at it soon'ish I guess.
I also thought of actually doing a memcpy into a variable when ESC is pressed, so whatever is on screen at that given time is stored, and then doing a loop of that for the rgb values from 1 (full colour) to 0 (black), which would also give the desired 'fade to black' effect. As for the fade in, it's just the opposite I guess.
-
You should:
1) copy away the screen when you press escape
2) in a loop
for c=255 to 0 step -1
copy the copy of the screen to the screen
fadescreentobrightness(c)
ptc_update
ptc_flip
next
or for c=0 to 255 to fade up
fadescreentobrightness will do something like
rgba = pixel value
'split out the r,g,b values
r = (rgba shr 16) and $ff
g = (rgba shr 8) and $ff
b = rgba and $ff
'apply the brightness
r = (r * c) shr 8
g = (g * c) shr 8
b = (b * c) shr 8
'repack the colours
newrgba = (r shl 16) or (g shl 8) or b or $ff000000
pixel value = newrgba
You can optimise that a bit, but get it working first :)
Jim
-
@Jim, thanks. I sort of had that idea in my head to begin with (with thanks to Shockwave), but didn't know how to go about it exactly.
I tried the sub routine, but I can't seem to get it to work.
My code for it is this...
do
' all the loopy stuff here
loop until inkey$ = chr$(27)
for c as integer = 255 to 0 step -1
memcpy(@buffer(0),store,xres*yres*4) ' whatever is stored in 'store' gets put back into the buffer
fadescreen(c) 'fade screen routine
ptc_update @buffer(0) ' update the buffer
next
ptc_close()
end
exitprocess(0)
and...
sub fadescreen(byval c as integer)
dim as uinteger rgbav,rv,gv,bv
rgbav = c
rv = (rgbav shr 16) and &hff
gv = (rgbav shr 8) and &hff
bv = rgbav and &hff
'apply the brightness
rv = (rv * c) shr 8
gv = (gv * c) shr 8
bv = (bv * c) shr 8
'repack the colours
rgbav = (rv shl 16) or (gv shl 8) or bv or &hff000000
c = rgbav
end sub
What am I missing exactly?
-
fadescreen looks like it's only working on the first pixel. It needs to work on xres*yres pixels!
Jim
-
Oh, k. thx... no idea how to fix it though. I've had a play with the code. Stumped.
"c+(xres*yres)" doesn't work.
Remember I'm an utter newbie, and get lost very quickly.
-
You need to loop through all the pixels and fade each one using the value of c to multiply each of the red/green/blue parts.
rgbav is the pixel colour read from each pixel, shaded and then written back to the pixel.
Do the looping for that inside the sub as calling the sub for each pixel would be slow.
-
sub fadescreen(byval c as integer)
dim as integer ptr s = @screen(0) ' point s to the beginning of screen memory
dim as uinteger rgbav,rv,gv,bv,p
for p = 0 to xres*yres-1
rgbav = *s ' pull out the pixel at 's'
rv = (rgbav shr 16) and &hff
gv = (rgbav shr 8) and &hff
bv = rgbav and &hff
'apply the brightness
rv = (rv * c) shr 8
gv = (gv * c) shr 8
bv = (bv * c) shr 8
'repack the colours
rgbav = (rv shl 16) or (gv shl 8) or bv or &hff000000
*s = rgbav ' put the pixel back to 's'
s=s+1 ' go to the next pixel in screen memory
next
end sub
Jim
-
@Jim. wow. Just as I was about to post again (for more help, as I was extremely stuck), it notified me that someone replied... and with much thanks (as I was pulling my hair out) it works as intended. Not real sure as to 'exactly' how it works at the moment, but I will be sure to read over the code 1000 times until I understand what's happening.
I have the basic 'jist' of what is going on, but the whole screen pointers n such I'm still yet to wrap my head around.
Much thanks again Jim.
K++ and all that jazz (<- does that actually add good karma to you? I forget to add that to ones that have helped me)
-
K++ and all that jazz (<- does that actually add good karma to you? I forget to add that to ones that have helped me)
You need to click 'applaud' under Jim's avatar to give him some Karma. People say "K++" just to let people know they've given them some karma :)
-
Ahh, thought so, so now I need to hit 'Applaud' on a few profiles... like 10 times or so.. hehe Raizor, Jim, Stonemonkey, Shockwave... etc etc.. if you notice your karma jump by 10-20 points... it's from me :D
-
I have the basic 'jist' of what is going on, but the whole screen pointers n such I'm still yet to wrap my head around.
It might seem a little strange at first but it's not really all that tricky, and it can be very useful. I'll try to keep this simple but if there's anything you don't get just say.
Every variable in a program is stored somewhere in memory and has its own unique address.
You can get the address of a variable using @
dim as integer a=10
print @a 'print the address in memory where the value of a is stored
'the address will be the same no matter what value is written into a
sleep
That address can be stored in a pointer and the contents of the memory it points to can be read or written to using *
dim as integer a=10
dim as integer pointer b=@a 'b contains (points to) the memory address of a
print *b 'prints the contents of the memory address b points to
'you can also write into the memory address
*b=20 'will have the same effect as a=20
sleep
Since an array is just a list of variables in an area of memory you can read/write from different parts of the array just by modifying the pointer, no symbol is used to modify the pointer.
dim as integer a(0 to 1)
a(0)=12
a(1)=34
dim as integer pointer b=@a(0) 'b points to the first address of the a array
print *b 'print the value in the address b points to
b=b+1 ' increment the pointer
print *b 'print the value in the new address b points to
sleep
-
@Stonemonkey... awesome response, and much thanks. Any little bit that helps me learn = very much appreciated.
With that third example as b being a pointer, does that mean each loop "b" is +1 each time, yet "a" still stays the same, so in fact that if I wanted to run a few different loops that point to the same thing, in effect I can setup the initial one... eg...
dim as integer a(0 to 2) , a(0)=10 , a(1)=100 , a(2)=1000
dim as integer pointer ap1=@a(0) , ap2=@a(0) , ap3=@a(0) ' the pointers ap1,ap2,ap3 that point back to a(0) which is "10"
for loopy as integer = 1 to 1000
ap1=ap1+1+loopy
ap2=ap2+3+loopy
ap3=ap3+5+loopy
cls
print "ap1 = "+str(ap1)
print "ap2 = "+str(ap2)
print "ap3 = "+str(ap3)
next
print "this number is "+str(a(1)) ' which should return "100"
if a(1) <> "10000" then
a(1)=10000 ' this changes the initial "a(1)" array from "100" to "10000"
endif
print "this number has been changed to a "+str(a(1)) ' which should return "10000"
I'm a lil' tired... but I hope I made some sense.
REALLY appreciate it Stonemonkey. I guess my next thing is setting up the different aspects of my demo's as buffers. So then I have my background buffer, scroller buffer, logo buffer etc etc etc, Thinking about doing it that way makes sense to me, as it would be easier to trace/modify/blit each one.
Thanks again.
*edited... see I did make a mistake. heh. Integers!!!!! not strings ;P
-
You have to be a bit more careful about how you modify your pointers, you only have an array of 3 integers there but you are adding 1,3,5 and loopy which goes from 1 to 1000.
Have a look at Jims loop and see how he's using the pointer to access the array and try some things out for yourself and see if you can get the results you expect.
EDIT: I think I see what you're saying, you can modify the pointer without changing the values? yes.
any_pointer=any_pointer+x
'changes the address that the pointer points to but doesn't affect the contents of what it points to.
*any_pointer=*any_pointer+x
'changes the contents of the memory address any_pointer points to but doesn't affect the pointer. Any change made in this way is also apparent when read by the original variable or by another pointer.
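A small runnable sketch of the two cases, using a made up 3 element array so the pointer never leaves it (long is used here as FB's 32 bit integer type):

```freebasic
dim as long a(0 to 2) = {10, 100, 1000}
dim as long ptr p = @a(0)   ' p points to a(0)

p = p + 1                   ' moves the pointer to a(1), array contents untouched
print a(0), a(1), a(2)      ' prints 10, 100, 1000

*p = *p + 5                 ' changes what p points to, the pointer stays on a(1)
print a(0), a(1), a(2)      ' prints 10, 105, 1000
print *p                    ' prints 105, same memory as a(1)
sleep
```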
-
Hmm, I think I misunderstood and then wrote code that doesn't make much sense, my bad.
I'll have a read tomorrow when I'm less under the influence.
By your comments and me re-reading the code what I quickly wrote, I see how it's wrong.
As no, I don't want to keep changing the pointer (Although I see that can be done now also, so I can swap ap0 from pointing to a(0) which = 10, and repoint ap0 to a(2) which = 1000) as this wasn't my intention.
What you briefly explained here...
any_pointer=any_pointer+x
'changes the address that the pointer points to but doesn't affect the contents of what it points to.
*any_pointer=*any_pointer+x
'changes the contents of the memory address any_pointer points to but doesn't affect the pointer.
^ explains what I was trying to understand. I'll give it another read over tomorrow. Much thanks.
-
Perhaps I jumped a step going straight in with a pointer. Compare this code with the previous one
sub fadescreen(byval c as integer)
dim as uinteger rgbav,rv,gv,bv,p
for p = 0 to xres*yres-1
rgbav = screen(p) ' pull out the pixel at array entry 'p'
rv = (rgbav shr 16) and &hff
gv = (rgbav shr 8) and &hff
bv = rgbav and &hff
'apply the brightness
rv = (rv * c) shr 8
gv = (gv * c) shr 8
bv = (bv * c) shr 8
'repack the colours
rgbav = (rv shl 16) or (gv shl 8) or bv or &hff000000
screen(p) = rgbav ' put the pixel back to 'p'
next
end sub
These pieces of code do exactly the same thing. So the step I skipped really is an optimisation.
Look at what happens for each loop in the above code
'p=0
rgbav = screen(0) ' pull out the pixel at array entry 0
'p=1
rgbav = screen(1) ' pull out the pixel at array entry 1
'p=2
rgbav = screen(2) ' pull out the pixel at array entry 2
...
So when we index an array we're asking the compiler:
"work out where screen is, offset by 'p' and pull out the value".
For the first 4 pixels this would be
1. address=screen+0*4
2. address=screen+1*4
3. address=screen+2*4
4. address=screen+3*4
But we're asking the same question again and again, starting from scratch, when in fact if we know the answer to "offset by 0" we can tell the answer to "offset by 1" is the same as just adding 1 to the previous answer. And then, "offset by 2" is just adding 1 again.
For the first 4 pixels this would be
1. address=screen
2. address=address+4
3. address=address+4
4. address=address+4
You can see the answers are the same in both cases. n.b. We are adding 4 because 4 bytes is the size of an integer, which is also the size of the rgba pixels we are working with.
Actually, maybe it will help to write that out longwise:
1. address=screen
2. address=screen+4
3. address=screen+4+4
4. address=screen+4+4+4
and simplifying
1. address=screen+0*4
2. address=screen+1*4
3. address=screen+2*4
4. address=screen+3*4
So, we should be
1) thinking of the screen as an array of pixels. In this case it's one dimensional, and we have to imagine it as being rectangular.
2) seeing that an array is really just a pointer to where the data starts in memory
3) noticing that indexing an array is just taking that pointer and offsetting in to it by the size of the index
4) seeing that some kinds of arithmetic, especially adding and subtracting, can be applied to pointers just as they can be to integers.
5) noting that doing arithmetic on pointers has no effect on the memory or variables being pointed at.
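The points above can be checked with a short FreeBASIC sketch. The 4x2 "screen", the names and the pixel values are all made up for the example, and ulong is assumed as the 32 bit pixel type.

```freebasic
const W = 4
const H = 2
dim as ulong pixels(0 to W * H - 1)      ' a tiny one dimensional "screen"
for i as integer = 0 to W * H - 1
    pixels(i) = i * 10
next

dim as ulong ptr first_pixel = @pixels(0)

' indexing and pointer offsetting land on the same memory
for p as integer = 0 to 3
    dim as integer byte_offset = cast(integer, first_pixel + p) - cast(integer, first_pixel)
    print p, byte_offset, *(first_pixel + p), pixels(p)   ' byte_offset is p*4
next

' imagining the array as rectangular: column x, row y
dim as integer x = 1, y = 1
print pixels(y * W + x)   ' prints 50
sleep
```

Note that adding 1 to the pointer moves the address on by 4 bytes, because the compiler scales pointer arithmetic by the size of the type being pointed at.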
Hope that helps to understand.
Jim
-
@Jim, thanks, I'll have a read through this thread properly this week. Really appreciate the help guys.