Hi Optimus,
when I took the screenshot I was halfway through implementing a span buffer.
So "poly loop" contains only the scanning of the polygon edges and insertion of the individual spans into a sorted list.
The timing for the actual span-drawing is missing in the screenshot.
And at that time it was still reaching into the second frame.
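
To illustrate the insertion part of the "poly loop": a minimal sketch in C, assuming one singly linked list of spans per scanline, sorted by the left edge. The `Span` struct and `span_insert` are made-up names for illustration, not the actual code from the demo.

```c
#include <stddef.h>

/* Hypothetical span record: covers pixels [x0, x1) on one scanline. */
typedef struct Span {
    short x0, x1;
    unsigned char color;
    struct Span *next;
} Span;

/* Insert 's' into the scanline's list, keeping it sorted by x0.
   'head' points at the list head pointer for that scanline. */
static void span_insert(Span **head, Span *s)
{
    Span **link = head;

    /* Walk until the next span starts at or after the new one. */
    while (*link && (*link)->x0 < s->x0)
        link = &(*link)->next;

    s->next = *link;
    *link = s;
}
```

With a list like this the drawing pass can walk each scanline left to right; the cost is the linear search on every insert, which is part of what the "poly loop" timing would be measuring.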
The GBA can only write aligned 16-bit values to the framebuffer, which is why the start and end of a scanline are a bit ugly to handle when using 8-bit pixels.
So I figured it would be easier to fill neighbouring spans in one go.
This can also include the clearing process, eliminates overdraw and possibly avoids pre-sorting.
Yet it isn't any faster than plain back to front rendering as long as there's not much overdraw (read as: it's actually slower most of the time).
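
For reference, the ugly start/end handling mentioned above looks roughly like this. It is only a sketch: `fill_span8` is a hypothetical name, and it assumes an 8bpp row with two pixels per halfword, the even x in the low byte.

```c
#include <stdint.h>

/* Fill pixels [x0, x1) of one 8bpp scanline with 'color' using only
   aligned 16-bit stores. 'row' is assumed to be 2-byte aligned. */
static void fill_span8(volatile uint16_t *row, int x0, int x1, uint8_t color)
{
    uint16_t both = (uint16_t)(color | (color << 8));
    int i;

    if (x0 >= x1)
        return;

    if (x0 & 1) {
        /* Odd start: the new pixel is the high byte, keep the low byte. */
        volatile uint16_t *p = &row[x0 >> 1];
        *p = (uint16_t)((*p & 0x00FF) | (color << 8));
        x0++;
    }
    if (x1 & 1) {
        /* Odd end: the last pixel is the low byte, keep the high byte. */
        volatile uint16_t *p = &row[(x1 - 1) >> 1];
        *p = (uint16_t)((*p & 0xFF00) | color);
        x1--;
    }

    /* Everything in between is whole halfwords, two pixels per store. */
    for (i = x0 >> 1; i < (x1 >> 1); i++)
        row[i] = both;
}
```

Each odd edge costs a read-modify-write. When two spans touch, the shared halfword can be composed once and written once, which is the point of filling neighbouring spans in one go.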
However, I'm still planning to cut the spans against each other to get a reasonably cheap z-buffer (I've got one scene in mind that will be quite difficult to sort).
For now I've been looking into affine texture mapping. It runs in 2 frames.
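
In case it helps, the inner loop of an affine mapper is basically the following. Again just a sketch under assumptions: 16.16 fixed-point texture coordinates, a 64x64 8bpp texture that wraps, and a hypothetical function name `affine_span`. It writes single bytes, so on the GBA it would have to target a buffer in RAM or be reworked to emit two pixels per 16-bit store.

```c
#include <stdint.h>

/* Draw 'count' pixels of one span with affine (non-perspective) texturing.
   u, v are 16.16 fixed-point texture coordinates; du, dv are the constant
   per-pixel steps. Unsigned arithmetic is used so wraparound is well
   defined; a negative step is passed as its two's complement value.
   'tex' is assumed to be a 64x64 texture with one byte per texel. */
static void affine_span(uint8_t *dst, int count, const uint8_t *tex,
                        uint32_t u, uint32_t v, uint32_t du, uint32_t dv)
{
    while (count-- > 0) {
        uint32_t tx = (u >> 16) & 63;   /* integer texel column, wrapped */
        uint32_t ty = (v >> 16) & 63;   /* integer texel row, wrapped */
        *dst++ = tex[(ty << 6) | tx];
        u += du;
        v += dv;
    }
}
```

Because du and dv are constant across the span, there is no divide per pixel, only two adds and a lookup.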
