That looks great now, good effort! Another 30fps faster, now 135fps!
There's one way to make this a fraction faster still, and that is to mix the ordinary CPU instructions in with the MMX ones. So instead of going
O=ordinary
M=mmx
O,O,M,M,M,M,M,M,O,O,O,O,
try to get it to go
O,M,O,M,O,M,O,M,O,M,O,M,
The reason is that each MMX instruction takes more than 1 clock tick to produce its result, so the CPU has to wait if the next instruction needs the result of the previous one, e.g.
psubw mm0, mm1 'subtract image color from screen color
pmullw mm0, mm2 'multiply resulting color by alpha
Here the mul can't begin until the sub has finished, because it needs mm0 to be ready. Inside the CPU there is then a gap between the two instructions where nothing is happening, and that time is wasted. The solution is to move one of your other instructions (one that doesn't touch mm0) into that gap and take advantage of the overlap.
So, you might move one of the adds for your pointers in there, e.g.
psubw mm0, mm1 'subtract image color from screen color
add eax, 4 'increment our source pointer
pmullw mm0, mm2 'multiply resulting color by alpha
Effectively that gives you the add for free.
It'll be interesting to see whether that makes any difference.
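To make the "gap" idea concrete, here is a toy model in C rather than real timing code. It assumes a CPU that issues one instruction per tick, strictly in order, and waits whenever an operand isn't ready yet. The Insn struct, count_stalls and the latency numbers (2 for psubw, 6 for pmullw, 1 for add, i.e. one tick plus the gaps mentioned further down) are my own illustrative assumptions, and a real P4 is a good deal more complicated than this.

```c
#include <assert.h>
#include <stddef.h>

enum { MM0, MM1, MM2, EAX, NUM_REGS };

typedef struct {
    const char *text;  /* the instruction, for reference only */
    int dst;           /* register written (and read, as in "op dst, src") */
    int src;           /* second register read, or -1 for an immediate */
    int latency;       /* ASSUMED ticks until dst is ready again */
} Insn;

/* Count the ticks lost waiting for an operand in a simple in-order,
   one-instruction-per-tick model (hypothetical, not a real CPU). */
int count_stalls(const Insn *code, size_t n)
{
    int ready[NUM_REGS] = {0};   /* tick at which each register is usable */
    int cycle = 0, stalls = 0;

    for (size_t i = 0; i < n; i++) {
        assert(code[i].dst >= 0 && code[i].dst < NUM_REGS);

        /* The instruction needs both of its operands to be ready. */
        int need = ready[code[i].dst];
        if (code[i].src >= 0 && ready[code[i].src] > need)
            need = ready[code[i].src];

        if (need > cycle) {          /* operand not ready: wasted ticks */
            stalls += need - cycle;
            cycle = need;
        }
        ready[code[i].dst] = cycle + code[i].latency;
        cycle++;                     /* next instruction issues a tick later */
    }
    return stalls;
}

/* The add left after the MMX pair. */
const Insn grouped[] = {
    { "psubw mm0, mm1",  MM0, MM1, 2 },
    { "pmullw mm0, mm2", MM0, MM2, 6 },
    { "add eax, 4",      EAX, -1,  1 },
};

/* The add moved into the gap between the sub and the mul. */
const Insn interleaved[] = {
    { "psubw mm0, mm1",  MM0, MM1, 2 },
    { "add eax, 4",      EAX, -1,  1 },
    { "pmullw mm0, mm2", MM0, MM2, 6 },
};
```

Calling count_stalls(grouped, 3) gives one wasted tick in this model, and count_stalls(interleaved, 3) gives none, because the add fills the tick the mul would otherwise spend waiting for mm0.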
Some other possibilities:
ja incline 'if we are jump to add line bit
jmp idloop 'if none of the above are met jump back to beginning of draw loop
incline:
I'm sure you've done this for clarity, but you can replace that with
jbe idloop
incline:
You seem to have 3 counters, but only 2 sources, which means you've got an extra
add reg,4
in your loop. Can you move that out of the per-pixel loop so you're only adding/checking once per scanline instead of once per pixel?
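As a rough sketch of what I mean, written in C because I'm guessing at your loop: I'm assuming the third counter is a per-pixel x counter. The function names, the pitch parameter (in pixels) and the simple "add source to destination" body are all made up for illustration, they are not your blend code.

```c
#include <stddef.h>
#include <stdint.h>

/* Per-pixel version: two pointers plus a separate x counter,
   so three updates for every pixel. */
void add_rows_counter(uint32_t *dst, const uint32_t *src,
                      size_t width, size_t height, size_t pitch)
{
    for (size_t y = 0; y < height; y++) {
        uint32_t *d = dst + y * pitch;
        const uint32_t *s = src + y * pitch;
        size_t x = 0;
        while (x < width) {
            *d += *s;
            d++;   /* add reg,4 */
            s++;   /* add reg,4 */
            x++;   /* the extra per-pixel counter */
        }
    }
}

/* Hoisted version: the end of the line is worked out once per scanline,
   and the inner loop tests the source pointer it already updates. */
void add_rows_end_pointer(uint32_t *dst, const uint32_t *src,
                          size_t width, size_t height, size_t pitch)
{
    for (size_t y = 0; y < height; y++) {
        uint32_t *d = dst + y * pitch;
        const uint32_t *s = src + y * pitch;
        const uint32_t *end = s + width;   /* once per scanline */
        while (s < end) {
            *d += *s;
            d++;   /* add reg,4 */
            s++;   /* add reg,4, also serves as the loop counter */
        }
    }
}
```

Both versions write the same pixels. The second just drops one add per pixel by comparing a pointer it has to update anyway.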
There's some very useful information here:
http://www.agner.org/optimize/
Agner Fog is a total expert on how to wring the last cycle out of Intel chips!
If you look up the P4 timings in instruction_tables.pdf there, you will see the latency entries for psubw and pmullw. There's a gap of 1 after a psubw, and 5 after a pmullw.
Jim