Author Topic: Full screen 640 x 480 alpha fade between two images (Read 16517 times)

Jim · « **Reply #20 on:** September 14, 2006 »

About 103fps. 3.8GHz P4, Quadro FX1400.

I did a naughty thing and ran your code through my disassembler (IDA Pro). It shows me you're doing something like this:

Code: [Select]

for y=0 to 479
    for x=0 to 639
        pixel=screen(x+y*640)
        'some good mmx code to mix pixels
        screen(x+y*640)=new_pixel
    next
next

The MMX stuff looks fine, but the code to do x+y*640 is almost as big! Try using a pointer, so it would go something like this:

Code: [Select]

dim screen_ptr as uinteger ptr
screen_ptr = @screen(0)
for y=0 to 479
    for x=0 to 639
        pixel=*screen_ptr
        'some good mmx code to mix pixels
        *screen_ptr=new_pixel
        screen_ptr = screen_ptr+1
    next
next

Then you can take it one step further and roll the two loops into one:

Code: [Select]

dim screen_ptr as uinteger ptr
screen_ptr = @screen(0)
for p=0 to 640*480-1
    pixel=*screen_ptr
    'some good mmx code to mix pixels
    *screen_ptr=new_pixel
    screen_ptr = screen_ptr+1
next

And finally, now you've got something far simpler, why not re-write the entire loop in asm?
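
For anyone who wants to check that the indexed and pointer forms really do the same work, here is a minimal C sketch of the first and last versions above. It is not MrP's code: `mix_pixel` is a made-up stand-in for the MMX mix, and the names `fade_indexed` and `fade_pointer` are hypothetical. A modern optimising compiler will probably turn the first form into the second on its own, so treat this as an illustration of the idea rather than a guaranteed speed-up.

```c
#include <stdint.h>
#include <stddef.h>

enum { SCREEN_W = 640, SCREEN_H = 480 };

/* Hypothetical stand-in for the MMX mix: halves every 8-bit channel. */
static uint32_t mix_pixel(uint32_t p)
{
    return (p >> 1) & 0x7F7F7F7Fu;
}

/* First form: recompute x + y*640 for every pixel read and write. */
void fade_indexed(uint32_t *screen)
{
    for (int y = 0; y < SCREEN_H; y++) {
        for (int x = 0; x < SCREEN_W; x++) {
            uint32_t pixel = screen[x + y * SCREEN_W];
            screen[x + y * SCREEN_W] = mix_pixel(pixel);
        }
    }
}

/* Last form: one flat loop, pointer bumped once per pixel. */
void fade_pointer(uint32_t *screen)
{
    uint32_t *p = screen;
    uint32_t *end = screen + (size_t)SCREEN_W * SCREEN_H;

    while (p < end) {
        *p = mix_pixel(*p);
        p++;
    }
}
```

Both functions visit the pixels in the same order, so they leave identical buffers behind; only the address arithmetic per pixel differs.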

Jim

MrP · « **Reply #21 on:** September 14, 2006 »

Er... bit confused now... the entire loop is in asm. Granted, it's a bit messy and needs a good looking over to simplify it, but it is entirely in asm. The image drawing loop isn't just for drawing full screen pics: you can give it an image of any size and an x,y co-ordinate on screen and it will draw it there. I know that for the purpose of this full screen fade you could just blit two images starting from the top left and going down to the bottom right, and then you wouldn't need the y*640+x thing at all. A simple increment for each pixel is all that would be required until you got to the end of each image, just like in your last example.

The code as it stands lets you pass an x, y, source buffer and alpha value to it. It then calculates where on screen each pixel should be drawn using the y*640+x formula as it steps through the source buffer. This is what is expensive, because it is calculated for every pixel being drawn. A quicker approach (which I'm working on at the minute) would be to calculate where on the screen the top left pixel should be, then increment both the source and destination pointers by one pixel at a time. At the end of each line of the source buffer you then add 640 minus the width of the source buffer to the screen position, and continue line by line until all pixels are drawn. This, I suppose, would be more like the blitting approach in your last example.

I hope this explains why the code is a lot more complicated than it needs to be for the simple task of drawing two full screen images. Anyway, I'm at work at the mo, but when I get home I'll post the whole loop. Hopefully you can give me some pointers on where I might be able to improve it...

ninogenio · « **Reply #22 on:** September 14, 2006 »

I think what Jim was saying, mate, is exactly the same as what you're now trying to incorporate, and he was just showing you in BASIC for clarity.

I've done sprite blitting with the two methods and the latter works far quicker.

MrP · « **Reply #23 on:** September 14, 2006 »

OK, here's my routine to draw an image of any size anywhere on screen, although I've not incorporated any bounds checking yet. This is probably very primitive and, as I've said above, I'm working on a quicker way to do it, but this is my first bash at it and what you see here is exactly what I had when it worked for the first time, so I'm damn sure this is messy to say the least... Any comments on this would be most welcome. Even though it's not going to stay like this, any suggestions on what I could have done better would be appreciated...

Code: [Select]

    sub pgl_draw_image(byval src_buffer as integer ptr, byval x as integer, byval y as integer, byval alpha as ubyte)
    dim x_pos as integer = 0
    dim y_pos as integer = 0
    dim p_alpha as integer = rgb(alpha, alpha, alpha)
    asm
        mov eax, [src_buffer]       'get image width and height into edi and esi
        mov edi, [eax]               '
        add eax, 4                  '
        mov esi, [eax]               '
        add eax, 4
        
        dloop:
        mov edx, [x]                'edx has x position
        mov ecx, [y]                'ecx has y position
        
        mov ebx, [pgl_screen_ptr]   'ebx has screen pointer
        imul ecx, [pgl_screen_x]
        add ecx, edx
        lea ebx, [ebx + ecx * 4]

        pxor mm6, mm6
    
        movd mm0, [eax]       'image pixel col in mm0
        movd mm1, [ebx]       'screen pixel col in mm1
        movd mm2, [p_alpha]   'mm2 has alpha value
        
        punpcklbw mm0, mm6    'unpack mm0 using blank mm6 register as interleave
        punpcklbw mm1, mm6    'same for screen color
        punpcklbw mm2, mm6    'same for Alpha
    
        psubw mm0, mm1        'subtract screen color from image color
        pmullw mm0, mm2       'multiply resulting color by alpha
        psrlw mm0, 8          'divide that by 256 (shift right 8)
        paddb mm0, mm1        'and then add screen color back in

        packuswb mm0, mm0     'repack image pixel back to lower 32 bits of mm0
        movd [ebx], mm0       'move result back to screen
            
        push ecx
        mov ecx, [x_pos]
        inc ecx
        cmp ecx, edi
        ja inc_y
        back:
        mov [x_pos], ecx
        pop ecx
        add eax, 4
        add edx, 1
        mov [x], edx
        jmp dloop
        
        inc_y:
        push edx
        mov edx, [y_pos]
        inc edx
        cmp edx, esi
        ja finished
        mov [y_pos], edx
        pop edx
        sub [x], edi
        mov ecx, 0
        jmp back
        
        finished:
        emms
    end asm
end sub

Jim · « **Reply #24 on:** September 15, 2006 »

Don't be confused MrP, it's just how my brain works.

It's hard to tell from my original disassembly whether it was all asm or just the pixel bit, which, without meaning to insult you in any way, means it's not terribly good assembler. :-\ You need to get those xpos/x ypos/y variables into registers, which with some thought is definitely possible.

I didn't realise until you posted again that this is supposed to allow you to blit an arbitrary rectangle with alpha over the top of an existing screen. That makes it a lot more than just a demo that blends two screens together. Sorry if I misunderstood.

To speed it up, you need to fix that per-pixel calculation x+y*screen_width. That can be done by working with one scanline of the blitted image at a time. You calculate the screen address for the destination once, as x+y*screen_width, and then at the end of each line, having added on 1 for each pixel of the image you are blitting on top, you add on screen_width minus the width of the blitted image.
Also, one really simple optimisation is that mm2, which contains your unpacked alpha value, is constant for the whole blit, so you can easily move that outside your loop.
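
The scanline idea can be sketched in plain C as follows. This is only an illustration under assumptions: the function names, the argument list and the tightly packed `image` layout are hypothetical and do not match MrP's buffer header, there is no bounds checking, and the scalar `blend_channel` stands in for the MMX arithmetic. The destination address is computed once, and the only per-line work is adding `screen_w - img_w`.

```c
#include <stdint.h>

/* Blend one 8-bit channel: dst + (src - dst) * alpha / 256, rounded down. */
static uint8_t blend_channel(uint8_t src, uint8_t dst, uint8_t alpha)
{
    return (uint8_t)((src * alpha + dst * (256 - alpha)) >> 8);
}

/* Blend the three colour channels of a 32-bit pixel; dst's top byte is kept. */
static uint32_t blend_pixel(uint32_t src, uint32_t dst, uint8_t alpha)
{
    uint32_t out = dst & 0xFF000000u;

    for (int shift = 0; shift < 24; shift += 8) {
        uint8_t s = (uint8_t)(src >> shift);
        uint8_t d = (uint8_t)(dst >> shift);
        out |= (uint32_t)blend_channel(s, d, alpha) << shift;
    }
    return out;
}

/* Blit img_w x img_h pixels onto the screen at (x, y) with one constant alpha.
   No clipping: the caller must keep the rectangle on screen. */
void blit_alpha(uint32_t *screen, int screen_w,
                const uint32_t *image, int img_w, int img_h,
                int x, int y, uint8_t alpha)
{
    uint32_t *dst = screen + y * screen_w + x; /* x + y*w, computed once */
    int skip = screen_w - img_w;               /* added at the end of each line */

    for (int row = 0; row < img_h; row++) {
        for (int col = 0; col < img_w; col++) {
            *dst = blend_pixel(*image, *dst, alpha);
            dst++;
            image++;
        }
        dst += skip;
    }
}
```

For example, blitting a 2 x 2 white image onto a black 8 x 4 screen at (3, 1) with alpha 128 touches exactly four pixels and leaves each at 0x007F7F7F.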

I think you should be aiming for 15 instructions or fewer in the inner loop.

Jim

MrP · « **Reply #25 on:** September 15, 2006 »

Thanks for the pointers Jim. I'm in the middle of changing the routine, I've just been a bit busy this week and haven't had much time to commit to it. I will aim for 15 instructions and I'll post back with what I come up with. I've got the alpha bit inside the loop because eventually the drawing bit will calculate per pixel alpha information from the source image and not just one constant alpha value for the whole thing. However, I'm struggling with extracting the alpha information from the integer and putting it back together in assembly. E.g. my source pixel value has the format aarrggbb, so I need to extract the first byte and then put that into a register with the format 00aaaaaa so I can load it into mm2 and sort the alpha stuff out in the MMX bit. That's why it's done before the asm bit with the FreeBASIC rgb() command. I've tried moving a byte value into a register, shifting it left 8, or'ing it with the byte value, then shifting and or'ing again, but it doesn't seem to work. Anyway, before that really becomes an issue I need to do some basic housekeeping on the rest according to your excellent suggestions... Once again, thanks very much for your comments and suggestions, now let's see what I can do with them...
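
The aarrggbb to 00aaaaaa step described above can be written out in C as a small sketch of the arithmetic. The function name `splat_alpha` is hypothetical, and this says nothing about why the asm attempt failed. One thing worth noting when translating it back to asm is that the byte has to sit in a full 32-bit register first, because shifting an 8-bit register left by 8 would throw the value away.

```c
#include <stdint.h>

/* Take the top (alpha) byte of an aarrggbb pixel and copy it into the
   three colour positions, giving 00aaaaaa. */
uint32_t splat_alpha(uint32_t argb)
{
    uint32_t a = argb >> 24;        /* 000000aa */
    return a | (a << 8) | (a << 16); /* 00aaaaaa */
}
```

For instance, `splat_alpha(0x80112233)` gives `0x00808080`. Multiplying the byte by `0x010101` produces the same bit pattern, which is another option if a multiply is cheaper than two shifts and two ORs on the target.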

Shockwave · « **Reply #26 on:** September 15, 2006 »

That's a fair point, Mr.P. TGA files having that per pixel alpha info makes them huge though, and although I'm bound to say that they are great, there's no way I'd ever include one in a program of mine unless it was calculated mathematically at runtime. Just too big.

It would be very cool to see how fast you could make this go without that provision and just having one alpha value per blit.

MrP · « **Reply #27 on:** September 15, 2006 »

OK, I'll post up a version later tonight (Friday) that just blits two images with an alpha value.

MrP · « **Reply #28 on:** September 15, 2006 »

OK, here it is, alpha blitting rather than the other method. Christ, this is a lot quicker than I thought it would be, I'm maxing out at about 238 fps.

I'll include two files: one is just the exe if you've still got the two images, the other has the lot... Enjoy...

[oversized attachment deleted by admin]

Shockwave · « **Reply #29 on:** September 15, 2006 »

Wahey! That's more like it, it's fairly belting along at 223 fps now.

Nice work.

MrP · « **Reply #30 on:** September 15, 2006 »

Do you know when you do something, no matter how simple it is, and you just can't stop f!#*ing looking at it? Needless to say I'm a bit chuffed with that...

Clyde · « **Reply #31 on:** September 15, 2006 »

I get my full monitor refresh rate of 75/76, well done mate.

Shockwave · « **Reply #32 on:** September 15, 2006 »

I had the selfsame feeling when I finished my first dbf intro, mate.

Enjoy it, there's more to come.

Rbz · « **Reply #33 on:** September 15, 2006 »

I get 64 fps now, well done!

MrP · « **Reply #34 on:** September 16, 2006 »

Here's my latest routine to blit any size image to screen, rather than the drawing method I used before. This is tons faster, and on full screen images there's hardly any drop in frame rate from the full screen blit I put up in my last post...

@Jim, if you happen to take a look at this, I would very much appreciate your thoughts on it. I've tried to take into account what you said in your last reply and hopefully I'm somewhere closer to where I should be. Many thanks in advance, MrP...

Code: [Select]

sub pgl_blit_image(byval src_buffer as integer ptr, byval x as integer, byval y as integer, byval alpha as ubyte)
    dim p_alpha as integer = rgb(alpha, alpha, alpha)
    asm
        mov ebx, [pgl_screen_ptr]   'ebx has screen pointer

        mov esi, [x]                        'x location in esi
        mov edi, [y]                       'y location in edi
        
        imul edi, [pgl_screen_x]    'calculate screen y position
        add edi, esi                'then the x position
        lea ebx, [ebx + edi * 4]    'add the result to our screen pointer

        mov eax, [src_buffer]       'start of source buffer in eax
        mov edi, [eax]              'image width into edi
        mov edx, [pgl_blit_inc]     'move screen line increment into edx
        sub edx, edi                'subtract the width of our image
                                    'edx now has the amount we need to add to screen pointer 
                                    'at the end of each line of source image
        add eax, 8                  'move eax to byte 8 so we can get image length
        mov ecx, [eax]              'length of image for drawing completion check into ecx
        add eax, 4                  'move eax to beginning of image data
        add ecx, eax                'add start of image to end of image so we have true pointer to end
        
        pxor mm6, mm6               'blank register mm6
        movd mm2, [p_alpha]         'mm2 has alpha value
        punpcklbw mm2, mm6          'unpack it for processing
        
        mov esi, 0                  'esi will be counter for source image x location
        
        idloop:
            movd mm0, [eax]         'image pixel col in mm0
            movd mm1, [ebx]         'screen pixel col in mm1
        
            punpcklbw mm0, mm6      'unpack mm0 using blank mm6 register as interleave
            punpcklbw mm1, mm6      'same for screen color
    
            psubw mm0, mm1          'subtract screen color from image color
            pmullw mm0, mm2         'multiply resulting color by alpha
            psrlw mm0, 8            'divide that by 256 (shift right 8)
            paddb mm0, mm1          'and then add screen color back in

            packuswb mm0, mm0       'repack image pixel back to lower 32 bits of mm0
            movd [ebx], mm0         'move result back to screen

            add ebx, 4              'increment our screen pointer
            add eax, 4              'increment our source pointer
            add esi, 4              'increment our source x location
            
            cmp eax, ecx            'have we got to the end of image
            jz enddraw              'if we have jump out of drawing loop
            
            cmp esi, edi            'are we at the end of a line in our image
            ja incline              'if we are jump to add line bit
            
        jmp idloop                  'if none of the above are met jump back to beginning of draw loop
            
            incline:
            add ebx, edx            'add increment to screen pointer to get to right position on screen for next line
            mov esi, 0              'reset our image x counter
        jmp idloop                  'jump back to beginning of draw loop
            
            enddraw:
            emms                    'clear MMX state so the floating point registers can be used again
    end asm
end sub
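
To see what the MMX part of the loop computes per colour channel, here is a scalar C model of one 16-bit lane, stepping through the same instructions. It is a sketch for checking the arithmetic, not a replacement for the routine, and the names `mmx_lane` and `blend_ref` are hypothetical. It also shows why the final add being a byte add (`paddb`) matters: when the image channel is darker than the screen channel the word-sized intermediate wraps, and the byte-wise wraparound in the add brings it back to the right answer.

```c
#include <stdint.h>

/* One 16-bit lane of the inner loop, instruction by instruction. */
uint8_t mmx_lane(uint8_t image, uint8_t screen, uint8_t alpha)
{
    uint16_t m0 = image;   /* punpcklbw mm0, mm6: byte widened to a word */
    uint16_t m1 = screen;  /* punpcklbw mm1, mm6 */
    uint16_t m2 = alpha;   /* punpcklbw mm2, mm6 */

    m0 = (uint16_t)(m0 - m1); /* psubw: wraps modulo 65536 if image < screen */
    m0 = (uint16_t)(m0 * m2); /* pmullw: keeps the low 16 bits of the product */
    m0 = (uint16_t)(m0 >> 8); /* psrlw 8: logical shift, i.e. divide by 256 */

    /* paddb: the two bytes of the word are added separately, each wrapping */
    uint8_t lo = (uint8_t)((m0 & 0xFF) + (m1 & 0xFF));
    uint8_t hi = (uint8_t)((m0 >> 8) + (m1 >> 8));
    uint16_t word = (uint16_t)(lo | (hi << 8));

    /* packuswb: word to byte with unsigned saturation (hi is 0 here, so the
       value is already in 0..255 and passes straight through) */
    return word > 255 ? 255 : (uint8_t)word;
}

/* Reference: screen + (image - screen) * alpha / 256, rounded down. */
uint8_t blend_ref(uint8_t image, uint8_t screen, uint8_t alpha)
{
    return (uint8_t)((image * alpha + screen * (256 - alpha)) >> 8);
}
```

The two functions agree for every combination of inputs. Because the shift divides by 256 rather than 255, an alpha of 255 never quite reaches the source colour: `mmx_lane(255, 0, 255)` gives 254.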

Jim · « **Reply #35 on:** September 17, 2006 »

That looks great now, good effort! Another 30fps faster, now 135fps!

There's one way to make this a fraction faster still, and that is to mix the ordinary CPU instructions in with the MMX ones. So instead of going
O=ordinary
M=mmx
O,O,M,M,M,M,M,M,O,O,O,O,
try to get it to go
O,M,O,M,O,M,O,M,O,M,O,M,

The reason for that is that each MMX instruction takes more than 1 clock tick, so if the next instruction is waiting for the result of the previous one, e.g.
psubw mm0, mm1 'subtract image color from screen color
pmullw mm0, mm2 'multiply resulting color by alpha
here, the mul can't begin until the sub has taken place, because it needs mm0 to be ready. Internally to the CPU, then, there's a gap between the two instructions where nothing is happening, and that time is wasted. The solution is to move one of your other instructions into that gap and take advantage of the overlap.
So, you might move one of the adds for your pointers in there,
e.g.
psubw mm0, mm1 'subtract image color from screen color
add eax, 4 'increment our source pointer
pmullw mm0, mm2 'multiply resulting color by alpha
Effectively that gives you the add for free.

It'll be interesting to see whether that makes any difference.

Some other possibilities
ja incline 'if we are jump to add line bit
jmp idloop 'if none of the above are met jump back to beginning of draw loop
incline:
I'm sure you've done this for clarity, but you can replace that with
jbe idloop
incline:

You seem to have 3 counters, but only 2 sources, which means you've got an extra
add reg,4
in your loop. Can you move that outside the scanline loop so you're only adding/checking once per scanline instead of once per pixel?

There's some very useful information at http://www.agner.org/optimize/. Agner Fog is a total expert on how to wring the last cycle out of Intel chips!
If you look in the instruction_tables.pdf for P4 timings in there, you will see the latency entry for psub and pmullw. There's a gap of 1 after a psub, and 5 after a pmul.

Jim

MrP · « **Reply #36 on:** September 17, 2006 »

Thanks very much for that detailed explanation Jim... Would I be right in assuming that while you're waiting for the other instructions to finish you're effectively using the other pipe? I've read about the U and V pipes and the fact you can pair instructions so that you're not wasting clock cycles. I've only really touched on that and I'm not really sure what you can pair and what you can't... Anyway, thanks once again for your input on this, it is very much appreciated.

Jim · « **Reply #37 on:** September 17, 2006 »

U and V pipes are only found on the original Pentium. Basically all integer instructions (mov, add, cmp etc.) could go down either pipe, in pairs, except shifts, muls and divides. So a well-paired instruction effectively took 1/2 cycle. All the pairing had to be worked out by hand or by the compiler. With newer CPUs it's not so cut-and-dried. Each integer instruction is broken down into u-ops (micro-ops) and then there are a number of execution units that can process these ops, either in or out of order. So the CPU is kind of recompiling the code dynamically. MMX, XMM and FPU instructions are broken up the same way. The idea is, to get maximum speed, to get all these execution units firing all the time. So effectively you have far more than just U and V pipes. Hyperthreading was invented as an automatic way of making sure these units are all working at once: if there aren't enough instructions coming in from one thread, find another thread to start executing too!

You're nearly going as fast as you can with this routine, by the way. You'll end up being limited entirely by how slow it is to read the two screens, write them out, and blit the result using ptc. It would just be great to see if the pipelining squeezes a little bit more out first.

Jim

MrP · « **Reply #38 on:** September 18, 2006 »

OK Jim, I moved the increment of eax to after the psrlw mm0, 8 and got rid of the extra add to esi per pixel you mentioned, and squeezed another 5 fps or so out of it. I tried the add to eax in various places within the MMX code, but where it sits now is definitely faster than the rest... Cheers again for your help on this. I'm rather pleased with this now...

Oh and if clicking the applaud thingy gives you some karma, which I'm assuming it does, then you can have some of that along with my thanks.....

Shockwave · « **Reply #39 on:** September 18, 2006 »

Applaud does indeed give positive Karma

Squeezing an extra 5fps out of this when it was already fast is not small beans either. Well done Mr. P

Whoops I just pressed your applaud button!