Well, I have to admit I didn't fully understand what was going on with all your shifts.
Well, what's probably not so straightforward is the normalization part, which would usually look like this:
int t= lightdir.x*lightdir.x + lightdir.y*lightdir.y + lightdir.z*lightdir.z + lightdir.w*lightdir.w;
// factor to normalize to -128..+128 with 8bit of fractional part
int invSqrt= (128*256) / sqrt(t);
lightdir.x= lightdir.x * invSqrt >> 8;
lightdir.y= lightdir.y * invSqrt >> 8;
lightdir.z= lightdir.z * invSqrt >> 8;
lightdir.w= lightdir.w * invSqrt >> 8;
But when multiplying two 16-bit values with MMX you can only keep either the upper or the lower word of each product, like this:
pmullw: lightdir= lightdir * invSqrt;
pmulhw: lightdir= lightdir * invSqrt >> 16;
With pmullw you get an overflow when the vector exceeds 0..255 (which it does),
with pmulhw the result comes out 256x smaller than it's supposed to be.
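To make the two cases concrete, here is a rough scalar model of a single 16-bit lane. The names lane_pmullw and lane_pmulhw are made up for this sketch, and it is plain C rather than actual MMX code:

```c
#include <stdint.h>

/* Scalar model of ONE 16-bit lane of pmullw: keeps only the low word
   of the signed 16x16 -> 32-bit product. Hypothetical helper name. */
int16_t lane_pmullw(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;
    uint16_t lo = (uint16_t)((uint32_t)p & 0xFFFFu);
    /* reinterpret the low word as a signed 16-bit value */
    return (int16_t)(lo >= 0x8000u ? (int32_t)lo - 0x10000 : (int32_t)lo);
}

/* Scalar model of ONE 16-bit lane of pmulhw: keeps only the high word,
   i.e. the product shifted right by 16 (rounding towards -infinity). */
int16_t lane_pmulhw(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;
    int32_t lo = (int32_t)((uint32_t)p & 0xFFFFu);
    /* p - lo is an exact multiple of 65536, so this division is exact */
    return (int16_t)((p - lo) / 65536);
}
```

With a component of 300 and a factor of 300 the full product is 90000, and what we want is 90000 >> 8 = 351. lane_pmullw(300, 300) returns 24464 (the overflowed low word), while lane_pmulhw(300, 300) returns 1, which is the wanted value divided by another 256.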
So I distributed the factor of 256 across both values, *8 on the vector and *32 on the invSqrt, ending up with:
invSqrt= (128*256*32) / sqrt(t);
lightdir= (lightdir<<3) * invSqrt >> 16;
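A quick way to check that the redistributed factors land in the same place is to compare both forms on one component. This is only a sketch in scalar C: scale_reference and scale_pmulhw are names invented here, and the factors are passed in directly instead of being computed from t.

```c
#include <stdint.h>

/* Straightforward form: component * invSqrt >> 8, where inv_sqrt_8
   stands for (128*256) / sqrt(t). Hypothetical helper name. */
int32_t scale_reference(int32_t v, int32_t inv_sqrt_8)
{
    return (v * inv_sqrt_8) >> 8;
}

/* pmulhw-friendly form: (component << 3) * invSqrt >> 16, where
   inv_sqrt_x32 stands for (128*256*32) / sqrt(t).
   The caller has to keep |v| below 4096 so that v*8 still fits in
   16 bits. Like the snippets in this post, it relies on >> being an
   arithmetic shift for negative values. */
int16_t scale_pmulhw(int16_t v, int16_t inv_sqrt_x32)
{
    int32_t p = (int32_t)(int16_t)(v * 8) * (int32_t)inv_sqrt_x32;
    return (int16_t)(p >> 16);
}
```

For a vector of length 500 the two factors would be 32768/500 = 65 and 1048576/500 = 2097. A component of 400 should normalize to 102.4: scale_reference(400, 65) gives 101 and scale_pmulhw(400, 2097) gives 102, so the second form even keeps a bit more precision because the factor carries five more fractional bits.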
Now the invSqrt doesn't fit into 16 bits anymore for small values of t.
But that's not a big problem, because it's impossible to normalize very small vectors anyway (because 1/sqrt(0) = infinity).
So I just clamp the table values at 32767 and accept that vectors shorter than 0.25 (32 at 7 bits fractional) will come out too short.
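The clamping could look roughly like the following. Take it as a sketch under assumptions: inv_sqrt_entry and isqrt32 are hypothetical names, and an integer square root is used here in place of sqrt() just to keep the example free of floating point, so the low bits may differ slightly from a table built with sqrt().

```c
#include <stdint.h>

/* Integer square root, floor(sqrt(x)); stands in for sqrt() here. */
uint32_t isqrt32(uint32_t x)
{
    uint32_t r = 0, bit = 1u << 30;
    while (bit > x)
        bit >>= 2;
    while (bit != 0) {
        if (x >= r + bit) {
            x -= r + bit;
            r = (r >> 1) + bit;
        } else {
            r >>= 1;
        }
        bit >>= 2;
    }
    return r;
}

/* One table value for a given squared length t:
   (128*256*32) / sqrt(t), clamped to the largest signed 16-bit value. */
int16_t inv_sqrt_entry(uint32_t t)
{
    uint32_t len = isqrt32(t);
    uint32_t v;
    if (len == 0)
        return 32767;               /* 1/sqrt(0) would be infinite */
    v = (128u * 256u * 32u) / len;
    return (int16_t)(v > 32767u ? 32767u : v);
}
```

A length of 33 still gets its exact factor (31775), a length of 32 would need 32768 and is clamped to 32767, and anything shorter gets the same 32767, so for example a vector of length 16 ends up at roughly half the intended length.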
I chose a factor of 32 because at that point the lookup table for invSqrt was much larger and I was looking up invSqrt[t>>5], so only the value of invSqrt got clamped (which must be clamped anyway).
Now that it's looking up invSqrt[t>>10], it makes more sense to put the whole *256 into the invSqrt table and remove the shift...