So you think the CPU is prefetching based on continuing the sequence of each pixel in a full-screen buffer?
Unfortunately I haven't found any in-depth documentation on the exact behaviour of the hardware prefetcher.
From my understanding it keeps track of all memory accesses and predicts the next access based on repeating patterns.
In the best case you'd always access sequential memory locations, but of course that's rarely possible (and the CPU designers are aware of that).
A simple example (in C):
// object containing 3 integers (12 bytes with 4-byte ints)
struct Object {
    int a, b, c;
};
// array of 100 objects
struct Object array[100];
// iterate through all objects and access only a single member
int sum = 0;
for (int i = 0; i < 100; i++) sum += array[i].a;
Here the prefetcher will be clever enough to see that you're always skipping 8 bytes (a constant stride of 12 bytes, assuming 4-byte ints) and will try to make the predicted memory location available before you actually access it.
It will always load a whole cache-line (on my core2 that's 64 bytes), and once a line is loaded, further accesses to it are basically free.
If you're skipping less than a cache-line, it still needs to transfer all the data into the cache, though.
And if you're not accessing memory in a somewhat sequential manner it might even load the wrong data into the cache.
So there will be some cases where it needs a little bit of help but most of the time you won't see any speed improvements from prefetch-instructions because the hardware prefetcher is already clever enough.
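To put rough numbers on the struct example, here's a minimal sketch that counts how many cache-lines a strided loop touches. The function name lines_touched and its parameters are made up for illustration, the 64-byte line size is just what my core2 uses, and it only models which lines get transferred, not what the prefetcher actually does.

```c
#include <stddef.h>

/* Count the distinct cache-lines touched by `count` accesses of
 * `access_size` bytes each, starting at byte address `base` and
 * advancing by `stride` bytes per access.
 * Assumes stride > 0 and access_size > 0 (addresses only ascend). */
static size_t lines_touched(size_t base, size_t stride, size_t access_size,
                            size_t count, size_t line_size)
{
    size_t touched = 0;
    size_t next = base / line_size; /* lowest line index not counted yet */

    for (size_t i = 0; i < count; i++) {
        size_t addr = base + i * stride;
        size_t lo = addr / line_size;                     /* first line of this access */
        size_t hi = (addr + access_size - 1) / line_size; /* last line of this access */

        if (lo < next)
            lo = next;                /* skip lines that were already counted */
        if (hi >= lo) {
            touched += hi - lo + 1;
            next = hi + 1;
        }
    }
    return touched;
}
```

With 4-byte ints and an array that starts on a line boundary, lines_touched(0, 12, 4, 100, 64) gives 19 lines (1216 bytes transferred for 400 bytes actually used), while a tightly packed array of 100 ints, lines_touched(0, 4, 4, 100, 64), only needs 7.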
Does prefetching help in both reading and writing?
Depends on how you're writing.
If you're always filling whole cache-lines, there's no need to prefetch anything.
Especially when we're talking about rendering you'd preferably write sequential data (for example filling a scanline) but read data from several (and probably badly predictable) sources.
So prefetching the read-accesses is usually more important.
Still, the start and end of each scanline will probably not fill a whole cache-line and need to be merged with the existing data.
Say something like the fade routine was only affecting a rect area of the buffer, would prefetching help in that?
Since you're accessing sequential pixels along each scanline, the prefetcher will probably mispredict the gap between two scanlines - but a single cache-miss per scanline wouldn't be that bad.
You can compensate for the misprediction with a software prefetch, but since a cache-line needs some time to load you must schedule it early enough (which usually means a few hundred cycles in advance).
That might sound like a lot, but keep in mind that loading a cache-line always means evicting another one, too.
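As a rough sketch of such a software prefetch, the loop below fades a rectangle of an 8-bit buffer and hints the start of the next scanline before working on the current one. I don't know what your fade routine looks like, so fade_rect and its parameters are hypothetical, and halving each pixel is just a stand-in. __builtin_prefetch is a GCC/Clang builtin (other compilers need their own intrinsic), and whether one scanline of lead time is early enough depends on the width of the rect, so treat it as something to measure.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fade: halve every pixel inside a w*h rectangle at (x, y).
 * `pitch` is the number of bytes from one scanline to the next. */
static void fade_rect(uint8_t *buf, size_t pitch,
                      size_t x, size_t y, size_t w, size_t h)
{
    for (size_t row = 0; row < h; row++) {
        uint8_t *line = buf + (y + row) * pitch + x;

#if defined(__GNUC__)
        /* Hint the first cache-line of the next scanline now, so it has
         * the whole current scanline's worth of work to arrive.
         * Second argument 1 = we intend to write, third 3 = keep in cache. */
        if (row + 1 < h)
            __builtin_prefetch(line + pitch, 1, 3);
#endif

        for (size_t i = 0; i < w; i++)
            line[i] = (uint8_t)(line[i] >> 1);
    }
}
```

The prefetch is only a hint, so the result is the same with or without it; only the timing can change.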
Is there much that could be done for reading from textures, would this work:
In the inner loop of a triangle rasteriser, after reading from the texture but before shading and writing to the buffer, calculate the address of the texel read for the next pixel and give that as a hint?
That way you would still be waiting for the cache-line to load, because a single pixel ahead is far too late for the prefetch to finish.
Since your uv-deltas are relatively linear (so you're accessing constant strides again) the hardware will predict most of that anyway.
But if your texels are far apart from one another, the hardware still loads a whole line although only a single texel is required.
So in case of texture-mapping it's much more efficient to manage your textures in such a way that many required texels are in the same cache-line regardless of the orientation of the polygon.
For example, use mip-maps and/or tiling, and render batches of polygons which share the same texture.
The latter in particular will make most of the texture available in the cache after rendering a few polygons.
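To illustrate the tiling idea, here's a minimal sketch of a tiled texel index. The name tiled_index and the 4x4 tile size are my own choices for the example: with 32-bit texels a 4x4 tile is exactly 64 bytes, so it fits one cache-line on my core2, provided the texture data starts on a line boundary and its width is a multiple of the tile size.

```c
#include <stddef.h>

enum { TILE = 4 }; /* 4x4 texels per tile; an assumption for this sketch */

/* Map texel coordinates (u, v) to an index into a texture whose texels
 * are stored tile by tile instead of scanline by scanline.
 * `width` is the texture width in texels and must be a multiple of TILE. */
static size_t tiled_index(size_t u, size_t v, size_t width)
{
    size_t tiles_per_row = width / TILE;
    size_t tile   = (v / TILE) * tiles_per_row + (u / TILE); /* which tile */
    size_t within = (v % TILE) * TILE + (u % TILE);          /* texel inside it */
    return tile * (TILE * TILE) + within;
}
```

In a plain scanline layout a step of one texel in v moves a whole texture row ahead in memory; here it mostly stays within the same 16-texel tile, no matter how the polygon is oriented.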