you shift all ints and floats left by 8 bits,
add subtract multiply them together
then shift right at the end again
Exactly.
Values that are already integers don't need any additional bits, though.
For the rest, you usually have to check how much precision is actually required.
The trick is to keep track of the number of fractional bits.
For example, if you multiply two values that each have 8 bits of fractional part, the result has 16 fractional bits, so you need to shr 8 to get back to 8 bits of precision.
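To make the bookkeeping concrete, here is a minimal sketch in C of values with 8 fractional bits. The names (fix8, FRAC_BITS, fix_mul and so on) are made up for illustration and are not from your code. The multiply divides by 256 instead of shifting, which gives the same result as shr 8 for non-negative values and avoids C's implementation-defined right shift of negative numbers.

```c
#include <stdint.h>

#define FRAC_BITS 8                 /* number of fractional bits (assumed) */
#define FIX_ONE   (1 << FRAC_BITS)  /* 1.0 in fixed point, i.e. 256 */

typedef int32_t fix8;               /* hypothetical fixed-point type */

/* Convert a float to fixed point, rounding to the nearest step. */
static fix8 fix_from_float(float f)
{
    return (fix8)(f * FIX_ONE + (f >= 0.0f ? 0.5f : -0.5f));
}

/* Addition keeps the number of fractional bits, so no shift is needed. */
static fix8 fix_add(fix8 a, fix8 b)
{
    return a + b;
}

/* The raw product has 16 fractional bits; dividing by 256 brings it
   back to 8. The 64-bit intermediate avoids overflow. */
static fix8 fix_mul(fix8 a, fix8 b)
{
    return (fix8)(((int64_t)a * b) / FIX_ONE);
}
```

With these, 1.5 is stored as 384 and 2.0 as 512, and fix_mul(384, 512) gives 768, which is 3.0.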
For the diffuse part I would quantize the vectors to 8 bits (so each component of a normalized vector ranges from -255 to +255):
DiffDot = ( Nx * Lx + Ny * Ly + Nz * Lz ) shr 8;
So you get an 8-bit (0..255) value to shade your texture color, which fits nicely into MMX.
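As a rough sketch in C, the integer diffuse term could look like this. The function names are hypothetical, and the clamping of negative results to 0 (surface facing away from the light) is my assumption, since the line above does not say what to do with them.

```c
#include <stdint.h>

/* Quantize one component of a normalized vector (-1.0..+1.0)
   to the range -255..+255, rounding to nearest. */
static int16_t quantize_component(float c)
{
    return (int16_t)(c * 255.0f + (c >= 0.0f ? 0.5f : -0.5f));
}

/* Diffuse term: dot product of two quantized vectors. The raw sum
   has 16 bits of fraction, so shifting right by 8 brings it back
   to an 8-bit value. */
static uint8_t diffuse_dot(const int16_t n[3], const int16_t l[3])
{
    int32_t dot = n[0] * l[0] + n[1] * l[1] + n[2] * l[2];
    if (dot < 0)
        dot = 0;        /* assumed: back-facing pixels get no light */
    dot >>= 8;          /* safe here, dot is non-negative */
    if (dot > 255)
        dot = 255;      /* guard against quantization rounding */
    return (uint8_t)dot;
}
```

Note that two identical unit vectors give 254 rather than 255, because 255 * 255 shifted right by 8 is 254. Whether that matters depends on how the result is used.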
For the specular part you might need more bits, because the exponent keeps only a small piece of the range (e.g. 0.8 - 1.0); everything else is black (below 1/255) anyway.
And the integer value makes it much easier to look up the pow function in a table.
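A table lookup for the specular term might be sketched like this in C. The 10-bit index (SPEC_BITS), the integer exponent and all names are assumptions chosen for the example; how many bits you really need depends on your exponent.

```c
#include <stdint.h>

#define SPEC_BITS 10                       /* assumed index precision */
#define SPEC_MAX  ((1 << SPEC_BITS) - 1)   /* 1023 represents 1.0 */

static uint8_t spec_table[SPEC_MAX + 1];

/* x raised to a non-negative integer power by repeated multiplication. */
static double int_pow(double x, int n)
{
    double r = 1.0;
    while (n-- > 0)
        r *= x;
    return r;
}

/* Fill the table once: entry i holds 255 * (i / SPEC_MAX)^exponent. */
static void build_spec_table(int exponent)
{
    for (int i = 0; i <= SPEC_MAX; i++) {
        double x = (double)i / SPEC_MAX;
        spec_table[i] = (uint8_t)(int_pow(x, exponent) * 255.0 + 0.5);
    }
}

/* Per pixel: clamp the integer dot product and read the table. */
static uint8_t specular(int spec_dot)
{
    if (spec_dot < 0)
        spec_dot = 0;
    if (spec_dot > SPEC_MAX)
        spec_dot = SPEC_MAX;
    return spec_table[spec_dot];
}
```

Call build_spec_table once at startup; after that the inner loop only needs a clamp and a single byte read instead of a pow call.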
Another thing that slows your code down at the moment is that you precalculated everything into double arrays, which increases the memory bandwidth you need by a factor of 8 compared with 8-bit values.
I had a look at your code and noticed that this part:
VDirX = (PixyX) * ReciOX
VDirX = CamX-VDirX
VDirY = (PixyY) * ReciOY
VDirY = CamY-VDirY
VDirZ = TextureZ - CamZ
SpecDot = ( Nx*VDirX + Ny*VDirY + Nz*VDirZ )
Rx = VDirX-Nx*2.0*SpecDot
Ry = VDirY-Ny*2.0*SpecDot
Rz = VDirZ-Nz*2.0*SpecDot
...is constant for each pixel and can be precalculated just like the normal map.
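Roughly, the precalculation could be done like this in C. The vec3 type, the function names and the flat per-pixel arrays are assumptions for the sketch; the formula itself is the one from your snippet, and I have left out the step that builds the view vector from PixyX, ReciOX and so on.

```c
#include <stddef.h>

typedef struct { float x, y, z; } vec3;   /* hypothetical helper type */

/* R = V - 2 * N * (N . V), the same formula as in the snippet above. */
static vec3 reflect_view(vec3 v, vec3 n)
{
    float d = n.x * v.x + n.y * v.y + n.z * v.z;   /* SpecDot */
    vec3 r = {
        v.x - n.x * 2.0f * d,
        v.y - n.y * 2.0f * d,
        v.z - n.z * 2.0f * d
    };
    return r;
}

/* Run once up front: store the reflection vector for every pixel
   so the per-frame loop only has to read it. */
static void precalc_reflection(const vec3 *view, const vec3 *normal,
                               vec3 *out, size_t count)
{
    for (size_t i = 0; i < count; i++)
        out[i] = reflect_view(view[i], normal[i]);
}
```

The stored vectors could then be quantized to integers in the same way as the normal map, which would also cut down the memory traffic.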