Topic: screen fade routine

Stonemonkey · « **on:** January 24, 2012 »

Not quite sure where to put this as it's a freebasic sub using some asm. Thought I'd write this as there's been some discussion on asm and fading elsewhere.

Anyway, I've not done much testing so I hope it works ok and that it's fairly self explanatory. If not let me know.

Code: [Select]

'buffer_address=start address of 32bit depth unsigned int colour buffer
'wwidth/height=dimensions of the buffer
'fade=value 0-256   0=turn buffer black   256=no change

sub fade_buffer(byval buffer_address as uinteger ptr,_
                byval wwidth as integer,_
                byval height as integer,_
                byval fade as integer)
    dim as integer ptr last_address=buffer_address+(wwidth*height)-1
    
    asm
        mov ecx,dword ptr[buffer_address]   'load buffer start address into reg ecx
        mov edx,dword ptr[fade]             'load fade value into reg edx
fade_loop:
            mov eax,dword ptr[ecx]          'load 4 byte int into eax from address in ecx
            mov ebx,eax                     'copy 4 bytes into ebx
            and eax,&hff00ff                'filter red and blue
            and ebx,&h00ff00                'filter green
            imul eax,edx                    'mult by 8 bit fade value
            imul ebx,edx                    '  "   " "  "   "    "
            and eax,&hff00ff00              'filter high bytes of red and blue
            and ebx,&h00ff0000              'filter high byte of green
            or eax,ebx                      're combine
            shr eax,8                       'shift back into position
            mov dword ptr[ecx],eax          'write back to address in ecx
            add ecx,4                       'point ecx to next pixel
            cmp ecx,dword ptr[last_address] 'compare with the end of the buffer
        jle fade_loop                       'repeat loop if ecx is not past the final pixel
    end asm
end sub
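
For anyone who wants to check the asm against something simpler, the same masked multiply can be written in plain FreeBASIC. This is only a sketch: `fade_pixel` and `fade_buffer_ref` are made-up names, and it uses `ulong` (always 32 bits) rather than `uinteger` so it behaves the same on 64 bit builds. Like the asm it clears the top (alpha) byte, and it expects fade in the range 0-256.

```freebasic
' Reference version of the per-pixel arithmetic (illustrative names, not from the thread)
function fade_pixel(byval pixel as ulong, byval fade as ulong) as ulong
    dim as ulong rb = pixel and &h00ff00ff   ' red and blue, each with 8 spare bits above it
    dim as ulong g  = pixel and &h0000ff00   ' green
    rb = (rb * fade) and &hff00ff00          ' keep the high byte of each product
    g  = (g  * fade) and &h00ff0000
    return (rb or g) shr 8                   ' shift back into position
end function

sub fade_buffer_ref(byval buffer as ulong ptr,_
                    byval wwidth as integer,_
                    byval height as integer,_
                    byval fade as ulong)
    for i as integer = 0 to wwidth * height - 1
        buffer[i] = fade_pixel(buffer[i], fade)
    next
end sub
```

It should give the same pixels as the asm routine, just a lot more slowly, so it's mostly useful for comparing results.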

Stonemonkey · « **Reply #1 on:** January 24, 2012 »

And a little bit of code to demo it. This acts directly on the screen buffer but it'll work on any 32 bit colour buffer.

Code: [Select]


'place fade routine here

sub main
    
    screenres 640,480,32
    for y as integer=0 to 479
        for x as integer=0 to 639
            pset(x,y),rnd*&hffffff
        next
    next
    
    for i as integer=1 to 1000
        screenlock
        fade_buffer(screenptr,640,480,255)
        screenunlock
    next
    
end sub

main
sleep
end
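
The reason the demo fades out even though fade stays at 255: the routine works in place, so every call multiplies what is already in the buffer. For a single 8 bit channel, (v*255) shr 8 works out to v-1 for any v from 1 to 255, so each pass darkens every non-black channel by exactly one step and 255 passes are enough to reach black. A small check of that arithmetic (plain FreeBASIC, not part of the demo above):

```freebasic
dim as ulong v = 255          ' brightest possible channel value
dim as integer passes = 0
while v > 0
    v = (v * 255) shr 8       ' one fade pass with fade=255
    passes += 1
wend
print passes                  ' prints 255
```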

Stonemonkey · « **Reply #2 on:** January 25, 2012 »

Given it a go with MMX instructions, it does 2 pixels at a time and is quite a bit faster. I'm not too sure about MMX so if anyone can do this better I'd love to see.

Code: [Select]

sub fade_buffer_mmx(byval buffer_address as uinteger ptr,_
                byval wwidth as integer,_
                byval height as integer,_
                byval fade as integer)
    dim as integer ptr last_address=buffer_address+(wwidth*height)-2
    fade or=(fade shl 8)or(fade shl 16)     'copy fade into the b, g and r byte lanes (needs fade 0-255, 256 won't fit in a byte)
    asm
        mov ecx,dword ptr[buffer_address]   'load buffer start address into reg ecx
        pxor mm7,mm7                        'clear mm7 register, used for unpacking
        movd mm6,[fade]                     'load fade to mm6
        punpcklbw mm6,mm7                   'unpack fade bytes
        
fade_loop_mmx:

            movq mm0,qword ptr[ecx]         'load 2 pixels (8 bytes) to mm0
            movq mm1,mm0                    'copy mm0 to mm1
            punpcklbw mm0,mm7               'unpack lo pixel bytes
            punpckhbw mm1,mm7               'unpack hi pixel bytes
            pmullw mm0,mm6                  '8 bit mult of lo pixel and fade
            pmullw mm1,mm6                  '8 bit mult of hi pixel and fade
            psrlw mm0,8                     'shift lo result 8 bits
            psrlw mm1,8                     'shift hi result 8 bits
            packuswb mm0,mm1                'pack bytes from both together
            movq [ecx],mm0                  'write back to buffer
            
            add ecx,8                       'next pair of pixels
            cmp ecx,dword ptr[last_address] 'check for end of buffer
            
        jle fade_loop_mmx                   'loop if not past end of buffer
        
        emms                                'reset fpu
    end asm
end sub

ttemper · « **Reply #3 on:** January 25, 2012 »

Thanks Stonemonkey, I will check out the routine soon (for my upcoming "first ever" demo), much thanks.

Shockwave · « **Reply #4 on:** January 25, 2012 »

Good one Stonemonkey

K+

hellfire · « **Reply #5 on:** January 25, 2012 »

Quote from: Stonemonkey on January 25, 2012

I'm not too sure about MMX so if anyone can do this better I'd love to see.

Looks pretty good to me.
You should make sure your buffer starts at a multiple of 8 so that all your memory accesses are 64bit aligned.
The two pmullw won't work in parallel, at least that was the case on p3 (but I doubt they ever changed anything on the mmx unit).
So you can put some instructions in between without any cost (but the cpu-front-end is probably clever enough to predraw add & cmp).
Personally I find it more convenient (and sometimes it's even faster) to have separate source- and destination-buffers.

You can also get rid of the cmp with a negative loop-counter (so you can check the zero-flag but it's still increasing)

Code: [Select]

mov  edi, sourceBuffer + numberOfPixels*4   (4 byte per pixel)
mov  ecx, -numberOfPixels/2                 (two pixels per iteration)
fadeloop:
   movq mm0,[edi+ecx*8]                     (first iteration accesses sourceBuffer[0])
   ...
   inc  ecx
   jnz  fadeloop
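
The same indexing trick can be tried outside asm as well. A rough FreeBASIC sketch, assuming a buffer of 32 bit pixels and fade in the range 0-256; `fade_neg_index` is a hypothetical name, and there's no guarantee the compiler turns this into a loop without the extra cmp:

```freebasic
' Negative loop counter: index runs from -count up to 0 (illustrative helper)
sub fade_neg_index(byval buffer as ulong ptr,_
                   byval count as integer,_
                   byval fade as ulong)
    if count <= 0 then exit sub
    dim as ulong ptr buffer_end = buffer + count   ' one past the last pixel
    dim as integer i = -count                      ' first iteration touches buffer[0]
    do
        dim as ulong p = buffer_end[i]
        buffer_end[i] = ((((p and &hff00ff) * fade) and &hff00ff00) or _
                         (((p and &h00ff00) * fade) and &h00ff0000)) shr 8
        i += 1
    loop until i = 0                               ' stop when the counter reaches zero
end sub
```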

Stonemonkey · « **Reply #6 on:** January 25, 2012 »

Quote from: hellfire on January 25, 2012

Looks pretty good to me.
You should make sure your buffer starts at a multiple of 8 so that all your memory accesses are 64bit aligned.

Yep, in this case I'm only using the fb screenbuffer but I think it's aligned, maybe to 16 bytes

Quote

The two pmullw won't work in parallel, at least that was the case on p3 (but I doubt they ever changed anything on the mmx unit).
So you can put some instructions in between without any cost (but the cpu-front-end is probably clever enough to predraw add & cmp).

ok thanks

Quote

Personally I find it more convenient (and sometimes it's even faster) to have separate source- and destination-buffers.
You can also get rid of the cmp with a negative loop-counter (so you can check the zero-flag but it's still increasing)
Code: [Select]

mov  edi, sourceBuffer + numberOfPixels*4   (4 byte per pixel)
mov  ecx, -numberOfPixels/2                 (two pixels per iteration)
fadeloop:
   movq mm0,[edi+ecx*8]                     (first iteration accesses sourceBuffer[0])
   ...
   inc  ecx
   jnz  fadeloop

That seems to be a little bit quicker, down to 1.58ms from 1.61ms, and compared to about 3.3ms for my non mmx version.

hellfire · « **Reply #7 on:** January 26, 2012 »

Quote from: Stonemonkey on January 25, 2012

That seems to be a little bit quicker, down to 1.58ms from 1.61ms

That doesn't seem much (less than 2%) but makes perfectly sense as the cpu is mostly waiting for the cache-loading and multiplications to finish.
You'll probably see a speed increase if you prefetch the pixels of the next scanline to some dummy register, so all the data has already been loaded into the cache before the actual computation starts.
This way alu and mmu are working in parallel without stalls.
On the downside it makes the code more complex and needs some special-case handling for the last scanline.

Stonemonkey · « **Reply #8 on:** January 26, 2012 »

I've looked into prefetching before but never worked it out, so some tips on that would be more than welcome. Instead of the special case for the last scanline, what about just allocating an extra row's worth of memory for the buffer? Maybe not so easy to do with FB's screenbuffer, but not a problem with something like tinyptc or my own buffers.

hellfire · « **Reply #9 on:** January 26, 2012 »

It turns out that the benefits of prefetching are tricky to show.
For a first try I simply copied a constant 640x480 image into the screen-buffer and profiled "fade_buffer_mmx" using rdtsc.
It runs at pretty constant 790k cycles (that's 2.5 cycles per pixel!).
Since all the data has just been copied it's still in the cache - there's not a single cache miss and the two pmullw must be executed in parallel, too.
The copying runs at about 700k to 1900k cycles, so that's the place where all the cache issues happen.
Considering that the fader is a post-processing filter (so you already managed to fill the buffer), it doesn't really make sense to put much thought into prefetching for this routine.

On a second try I changed the fader to work on separate source- and destination-buffers to get all the trouble into a single place but it behaves surprisingly well, too.
The hardware-prefetcher seems to be much smarter than the last time I looked at it (already a couple of years ago), so it doesn't seem to require hints by prefetch instructions (it's easy to predict since all data-access happens strictly linear, though).

Tests have been done on a core2 quad (with 3mb shared cache per core-pair).
Prefetching behaved significantly different on an athlon xp.

Stonemonkey · « **Reply #10 on:** January 26, 2012 »

Thanks hellfire, so you think the cpu is pre fetching based on continuing the sequence of each pixel in a full screen buffer?

I'm wondering about this for ideas other than the fade routine and have a few other questions.

Does prefetching help in both reading and writing?

Say something like the fade routine was only affecting a rect area of the buffer, would prefetching help in that?

Is there much that could be done for reading from textures, would this work:

in the inner loop of a triangle rasteriser, after reading from the texture but before shading and writing to the buffer, calculate the address of the texel read for the next pixel and give that hint?

hellfire · « **Reply #11 on:** January 27, 2012 »

Quote from: Stonemonkey on January 26, 2012

so you think the cpu is pre fetching based on continuing the sequence of each pixel in a full screen buffer?

Unfortunately I haven't found any in-depth documentation on the exact behaviour of the hardware prefetcher.
From my understanding it keeps track of all memory accesses and predicts the next access based on repeating patterns.
In the best case you'd always access sequential memory locations but of course that's rarely possible (and the cpu-designers are aware of that)
A simple example (in C):

Code: [Select]

// object containing 3 integers
struct Object {
  int a,b,c;
};

// array of 100 objects
struct Object array[100];

// iterate through all objects and access only a single member
int sum= 0;
for (int i=0; i<100; i++) sum += array[i].a;

Here the prefetcher will be clever enough to see that you're always skipping 8 bytes (a constant stride of 12 bytes, assuming 4 byte ints) and will try to make the predicted memory location available before you actually access it.

It will always load a whole cache-line (on my core2 that's 64 bytes) and once a line is loaded the access is basically for free.
If you're skipping less than a cache-line, it would still need to transfer all the data into the cache, though.
And if you're not accessing memory in a somewhat sequential manner it might even load the wrong data into the cache.
So there will be some cases where it needs a little bit of help but most of the time you won't see any speed improvements from prefetch-instructions because the hardware prefetcher is already clever enough.
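
To put rough numbers on the struct example: assuming 4 byte ints, 64 byte cache-lines and an array that starts on a line boundary (all assumptions, adjust for your cpu), the loop only uses 400 bytes but pulls in every line the array covers. A small FreeBASIC sketch of that arithmetic, with `Obj3` standing in for the struct:

```freebasic
type Obj3
    as long a, b, c                 ' three 32 bit ints, 12 bytes in total
end type

const LINE_BYTES = 64               ' assumed cache-line size
const N = 100                       ' number of objects, as in the example

dim as integer array_bytes = N * sizeof(Obj3)   ' 1200 bytes for the whole array
dim as integer used_bytes  = N * sizeof(long)   ' 400 bytes, only member a is read

' cache-lines touched when walking the array of structs
print (array_bytes + LINE_BYTES - 1) \ LINE_BYTES   ' prints 19
' cache-lines needed if the a members sat in their own packed array
print (used_bytes + LINE_BYTES - 1) \ LINE_BYTES    ' prints 7
```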

Quote

Does prefetching help in both reading and writing?

Depends on how you're writing.
If you're always filling whole cache-lines, there's no need to prefetch anything.
Especially when we're talking about rendering you'd preferably write sequential data (for example filling a scanline) but read data from several (and probably badly predictable) sources.
So prefetching the read-accesses is usually more important.
Still the start and end of each scanline will probably not fill a whole cache-line and need to be merged with the existing data.

Quote

Say something like the fade routine was only affecting a rect area of the buffer, would prefetching help in that?

Since you're accessing sequential pixels along each scanline, the prefetcher will probably mispredict the gap between two scanlines - but a single cache-miss per scanline wouldn't be that bad.
You can compensate for the misprediction with a software-prefetch, but since a cache-line needs some time to load you must schedule it early enough (which usually means a few hundred cycles in advance).
That might sound way too much but loading a cache-line always means to discard another, too.
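
For the rect case, the access pattern looks roughly like this in plain FreeBASIC. `fade_rect` and its parameters are hypothetical: pitch is the full buffer width in pixels, fade is 0-256 and there's no clipping. Pixels are sequential within a row, and the jump of (pitch - w) pixels between rows is the gap the prefetcher may mispredict.

```freebasic
' Fade only a w x h rectangle starting at (x0, y0) (illustrative helper, no clipping)
sub fade_rect(byval buffer as ulong ptr, byval pitch as integer,_
              byval x0 as integer, byval y0 as integer,_
              byval w as integer, byval h as integer,_
              byval fade as ulong)
    for y as integer = y0 to y0 + h - 1
        dim as ulong ptr row = buffer + y * pitch + x0   ' first pixel of this row
        for x as integer = 0 to w - 1                    ' sequential within the row
            dim as ulong p = row[x]
            row[x] = ((((p and &hff00ff) * fade) and &hff00ff00) or _
                      (((p and &h00ff00) * fade) and &h00ff0000)) shr 8
        next
    next
end sub
```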

Quote

Is there much that could be done for reading from textures, would this work:
in the inner loop of a triangle rasteriser, after reading from the texture but before shading and writing to the buffer, calculate the address of the texel read for the next pixel and give that hint?

That way you would still be waiting for the cache-line to load.
Since your uv-deltas are relatively linear (so you're accessing constant strides again) the hardware will predict most of that anyway.
But if your texels are far away from another, the hardware still loads a whole line although only a single pixel is required.
So in case of texture-mapping it's much more efficient to manage your textures in such a way that many required texels are in the same cache-line regardless of the orientation of the polygon.
For example use mip-maps and/or tiling and render batches of polygons which share the same texture.
Especially the latter will make most of the texture available in the cache after rendering a few polygons.
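
One possible way to lay out a texture in tiles, as a sketch: with 4 byte texels a 4x4 tile is exactly 64 bytes, so if the texture starts on a line boundary and lines are 64 bytes, each tile sits in a single cache-line whichever direction you step through it. `tiled_index` is a made-up helper; it assumes the texture width is a multiple of 4 and that u and v are not negative.

```freebasic
' Map texel coordinates (u, v) to an index in a texture stored as 4x4 tiles
function tiled_index(byval u as integer, byval v as integer,_
                     byval tex_width as integer) as integer
    dim as integer tiles_per_row = tex_width \ 4
    dim as integer tile   = (v \ 4) * tiles_per_row + (u \ 4)   ' which 4x4 tile
    dim as integer within = (v and 3) * 4 + (u and 3)           ' texel inside that tile
    return tile * 16 + within                                   ' 16 texels per tile
end function
```

A lookup would then be something like texture[tiled_index(u, v, 256)] instead of texture[v*256+u], with the texture data rearranged to match when it's loaded.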

ttemper · « **Reply #12 on:** February 21, 2012 »

Trying out this routine...

In using the test code, it works fine (adjusted the loop to 255 steps). When I incorporate the subroutine into my own code, and call it once the ESC key has been pressed (outside of the loop, before ptc_close()/end/exitprocess(0)).. it does nothing.

It seems to me as if the screenptr isn't correct, but I'm not quite sure how to fix it. I'm using Jim's OpenGL screen setup at the moment with a screen set to...

Code: [Select]

screenres 800,600,32,0,2 or gfx_no_frame

It's currently set up like...

Code: [Select]

	for i as integer=1 to 255
		screenlock
		fade_buffer(screenptr,xres,yres,255)
		screenunlock
	next

I also tried...

Code: [Select]

		fade_buffer(@buffer(0),xres,yres,255)

but that didn't matter.

Not sure why it's not working.

Also, how can I adapt this to fade in from black to full screen (the opposite of what it's setup to do)?

Stonemonkey · « **Reply #13 on:** February 21, 2012 »

Hi, sorry but it won't work with opengl as it needs the buffer it acts on to be in system memory.
As it is just now it darkens the contents of a buffer so to fade in would require either the buffer to be completely redrawn each frame with the shading applied before flipping or if the image being faded in is static then it could be set up to copy from one buffer to another while doing the fading in.

ttemper · « **Reply #14 on:** February 21, 2012 »

Ah k, thanks for the reply. Back to the drawing board for my fade in/out routine then.

I might just use alpha blending (like that routine Shockwave did for me). Not sure how cpu intensive that would be though for a screen of 800x600. eg, setup a box of 800x600 in &H000000 and then transition that to full colour? idk.

Think I shall do some more googling around on alpha fades. I've seen a heap of demos on RetroRemake with full screen alpha fades, yet they ran so damn smooth... no idea on the methods used though. My guess is it's coded in PureBasic, not Freebasic, like I'm using.

Thanks for the reply Stonemonkey.

Stonemonkey · « **Reply #15 on:** February 21, 2012 »

To do a fade in of a static image you'll need to have the full colour image in a separate buffer no matter what. If the screen is being redrawn each frame then you can apply the fade each frame to the framebuffer. Either way nothing will beat a bit of asm (as long as it's reasonably well written), and although PB is probably a little faster than FB I don't think FB is too bad. If you're using opengl then you'll have to do the fade with opengl calls.
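
A rough sketch of the copy-while-fading idea for a static image, in plain FreeBASIC rather than asm. `fade_copy` is a hypothetical name; source holds the untouched full colour image, dest is the buffer that gets shown, and fade is 0-256 as in the original routine.

```freebasic
' Write a faded copy of source into dest, leaving source unchanged (illustrative helper)
sub fade_copy(byval source as ulong ptr, byval dest as ulong ptr,_
              byval count as integer, byval fade as ulong)
    for i as integer = 0 to count - 1
        dim as ulong p = source[i]
        dest[i] = ((((p and &hff00ff) * fade) and &hff00ff00) or _
                   (((p and &h00ff00) * fade) and &h00ff0000)) shr 8
    next
end sub
```

Calling it once per frame with fade stepping from 0 up to 256 fades in, and from 256 down to 0 fades out. Since the source is never modified, each frame is computed from the full colour image instead of from an already darkened one.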

ttemper · « **Reply #16 on:** February 22, 2012 »

Ok, thanks, I'll have an attempt at it soon'ish I guess.

I also thought of actually doing a memcpy into a variable when ESC is pressed, so whatever is on screen at that given time is stored, and then doing a loop of that for the rgb values from 1 (full colour) to 0 (black), which would also give the desired 'fade to black' effect. As for the fade in, it's just the opposite I guess.

Jim · « **Reply #17 on:** February 22, 2012 »

You should:
1) copy away the screen when you press escape
2) in a loop

for c=255 to 0 step -1
copy the copy of the screen to the screen
fadescreentobrightness(c)
ptc_update
ptc_flip
next

or for c=0 to 255 to fade up

fadescreentobrightness will do something like

rgba = pixel value
'split out the r,g,b values
r = (rgba shr 16) and $ff
g = (rgba shr 8) and $ff
b = rgba and $ff
'apply the brightness
r = (r * c) shr 8
g = (g * c) shr 8
b = (b * c) shr 8
'repack the colours
newrgba = (r shl 16) or (g shl 8) or b or $ff000000
pixel value = newrgba

You can optimise that a bit, but get it working first

Jim

ttemper · « **Reply #18 on:** February 23, 2012 »

@Jim, thanks. I sort of had that idea in my head to begin with (with thanks to Shockwave), but didn't know how to go about it exactly.

I tried the sub routine, but I can't seem to get it to work.

My code for it is this...

Code: [Select]

do

' all the loopy stuff here

loop until inkey$ = chr$(27)

for c as integer = 255 to 0 step -1
	memcpy(@buffer(0),store,xres*yres*4) ' whatever is stored in 'store' gets put back into the buffer
	fadescreen(c) 'fade screen routine
	ptc_update @buffer(0) ' update the buffer
next

ptc_close()
end

exitprocess(0)

and...

Code: [Select]

sub fadescreen(byval c as integer)
	
	dim as uinteger rgbav,rv,gv,bv
	
	rgbav = c
	rv = (rgbav shr 16) and &hff
	gv = (rgbav shr 8) and &hff
	bv = rgbav and &hff
	'apply the brightness
	rv = (rv * c) shr 8
	gv = (gv * c) shr 8
	bv = (bv * c) shr 8
	'repack the colours
	rgbav = (rv shl 16) or (gv shl 8) or bv or &hff000000
	c = rgbav
end sub

What am I missing exactly?

Jim · « **Reply #19 on:** February 23, 2012 »

fadescreen looks like it's only working on the first pixel. It needs to work on xres*yres pixels!

Jim
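
In other words, the per-pixel steps need to sit inside a loop that reads each pixel from the buffer and writes the result back. A minimal sketch, assuming the buffer holds xres*yres 32 bit pixels as in the code above; `fadescreen_all` and its extra parameters are assumptions (the posted sub only takes c):

```freebasic
' Apply brightness c (0-255) to every pixel in the buffer (hypothetical rewrite)
sub fadescreen_all(byval buf as ulong ptr, byval count as integer, byval c as ulong)
    for i as integer = 0 to count - 1
        dim as ulong rgbav = buf[i]                 ' read the pixel, not the brightness
        'split out the r,g,b values
        dim as ulong rv = (rgbav shr 16) and &hff
        dim as ulong gv = (rgbav shr 8) and &hff
        dim as ulong bv = rgbav and &hff
        'apply the brightness
        rv = (rv * c) shr 8
        gv = (gv * c) shr 8
        bv = (bv * c) shr 8
        'repack the colours and write the pixel back
        buf[i] = (rv shl 16) or (gv shl 8) or bv or &hff000000
    next
end sub
```

It would be called from the fade loop as something like fadescreen_all(@buffer(0), xres*yres, c). If buffer is declared with a different 32 bit type, a cast on the pointer may be needed.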