It always predicts jmps correctly, since an unconditional jump is always taken. gotos are considered bad because they're not structured code and they make it hard for compilers to optimise; they lead to spaghetti code that can jump all over the place. Under the hood, though, every if/then, switch/case or loop is compiled down to jumps and branches, since that's how CPUs work.
Branch prediction works a bit like card counting at blackjack. You want to know if you are likely to bust with the next card. Since roughly 50% of the cards are 7 or over and 50% are under 7, you can take a guess at whether to twist if you have been dealt 14 or fewer (since that's where your odds of going bust are about 50:50). Every time you see a card of 7 or more come up, add 1 to a running total; every time you see one under 7, subtract 1. Then, to decide whether to twist on 14: if your total is >0 you've seen more high cards, if it's <0 you've seen more low cards. If you've seen lots of high cards you should twist, since the next card is more likely to be under 7. Don't do this in a real casino, they'll throw you out.

With branch prediction, you can do something similar. For every branch instruction in the program, keep a counter. When the branch is taken, add 1 to its counter; when it's not taken, subtract 1. Next time the CPU sees the branch, it checks the counter to see if the branch is more likely to be taken than not taken, based on its history.
1. Obviously, you can't have a counter for EVERY branch instruction. Instead, the CPU uses the bottom N bits of the instruction's address to index an array of counters; the value of N depends on the CPU architecture.
2. The real algorithm used is far better than the +1/-1 algorithm - I think even the Pentium MMX could predict patterns of taken/not taken up to 16 long!
3. Branch prediction allows the CPU to speculatively execute code down the predicted path. If it ends up being predicted incorrectly, then all that work is lost and has to be undone before execution restarts down the alternate path.
So, Stonemoneky - if your 'if' is always taken, the branch prediction will always be correct. If it's taken in a regular pattern, it will probably be predicted. If it's random, it won't be. As CPUs have got more complicated, mis-predicting a branch has become more and more expensive - I think it was something like 90+ cycles on a P4 (but it's less on a Core). You also can't compute gotos in standard C (GCC's "labels as values" is a non-standard extension).
I think a jump table might work better for you: have a table of routines, one for each render type. But you still have the problem of maintaining the code if you have lots and lots of different versions of inner loops - you'll end up with dozens or hundreds, and if you discover an optimisation you'll have to update every single version.

One way round this is to use macros. Something like
#define USE_X float x, dx
#define USE_Z float z, dz
#define USE_RGB float r,g,b, dr,dg,db
#define USE_UV float u,v, du,dv
etc.
#define INIT_X dx = x1-x0
#define INIT_Z dz = (z1-z0)/dx
#define INIT_RGB { dr = (r1-r0)/dx; dg = (g1-g0)/dx; db = (b1-b0)/dx; }
#define INIT_UV { du = (u1-u0)/dx; dv = (v1-v0)/dx; }
etc.
#define ADD_X x++
#define ADD_Z z += dz
#define ADD_RGB r+=dr, g+=dg, b+=db
#define ADD_UV u+=du, v+=dv
etc.
Then every routine can be written in terms of macros...
gouraud_poly(...)
{
    USE_X;
    USE_RGB;
    ...
    INIT_X;
    INIT_RGB;
    while (dy)
    {
        ...custom pixel plot
        ADD_X;
        ADD_RGB;
    }
}

gouraud_tex_z_rgb_poly(...)
{
    USE_X;
    USE_Z;
    USE_RGB;
    USE_UV;
    ...
    INIT_X;
    INIT_Z;
    INIT_RGB;
    INIT_UV;
    while (dy)
    {
        ...pixel plot
        ADD_X;
        ADD_Z;
        ADD_RGB;
        ADD_UV;
    }
}
That way, you only ever have to update your macros.

Jim