Well, the main trick is to make your vectors fit into four 16-bit shorts (64 bits, exactly one MMX register), so you can use MMX for both vector and color processing.
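Just to make the layout concrete, here is roughly what I mean by such a vector. This is only a sketch: the field order and the use of int16_t are my assumptions, so adapt it to whatever your own ShortVector4 looks like.

```c
#include <stdint.h>

/* Assumed layout: four signed 16-bit lanes, x in the lowest word. */
typedef struct {
    int16_t x, y, z, w;
} ShortVector4;

/* Compile-time check: one vector must be exactly 64 bits,
   so that a single movq can load or store all four lanes. */
typedef char ShortVector4_must_be_8_bytes[(sizeof(ShortVector4) == 8) ? 1 : -1];
```

If the struct ever grows padding or a fifth member, the typedef fails to compile, which is a lot nicer than a movq silently reading the wrong bytes.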
I'd suggest converting your code to asm in very, very small steps.
Write every intermediate value back into a variable so you can check it against the C version.
Start with the simple stuff, for example:
ShortVector4 col;
/*
col.x= 0;
col.y= 0;
col.z= 0;
col.w= 0;
*/
_asm {
pxor mm7, mm7      // mm7 = 0 in all four 16-bit lanes
movq [col], mm7    // store the four zeroed shorts into col
};
And then:
// calc light direction
/*
lightDir.x= pos.x - lightPos.x;
lightDir.y= pos.y - lightPos.y;
lightDir.z= pos.z - lightPos.z;
lightDir.w= pos.w - lightPos.w;
*/
_asm {
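movq mm3, [pos]        // mm3 = pos (x, y, z, w)
movq mm4, [lightPos]   // mm4 = lightPos
psubw mm3, mm4         // four 16-bit subtractions at once, wrapping on overflow
movq [lightDir], mm3   // write the result back so it can be checked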
};
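To actually check a step like this, it helps to keep a plain C reference around and compare against it. The following is a minimal sketch, not anything from your code: sub_wrap16, sv4_sub_ref and sv4_equal are made-up helper names, and the struct layout is assumed as before.

```c
#include <stdint.h>
#include <string.h>

typedef struct { int16_t x, y, z, w; } ShortVector4;  /* assumed layout */

/* Same result as psubw on one lane: the difference wraps modulo 2^16. */
static int16_t sub_wrap16(int16_t a, int16_t b)
{
    uint16_t d = (uint16_t)((uint16_t)a - (uint16_t)b);
    int16_t r;
    memcpy(&r, &d, sizeof r);  /* reinterpret the 16 bits as signed */
    return r;
}

/* Plain C reference for the psubw step, used only for checking. */
static ShortVector4 sv4_sub_ref(ShortVector4 a, ShortVector4 b)
{
    ShortVector4 r;
    r.x = sub_wrap16(a.x, b.x);
    r.y = sub_wrap16(a.y, b.y);
    r.z = sub_wrap16(a.z, b.z);
    r.w = sub_wrap16(a.w, b.w);
    return r;
}

/* Returns 1 when all four lanes match, 0 otherwise. */
static int sv4_equal(ShortVector4 a, ShortVector4 b)
{
    return a.x == b.x && a.y == b.y && a.z == b.z && a.w == b.w;
}
```

After the asm block has stored lightDir, compare it with sv4_sub_ref(pos, lightPos) using sv4_equal and stop at the first mismatch. Note that psubw wraps around on overflow; if you need clamping instead, psubsw is the saturating variant, and the reference would have to clamp as well.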
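Once you've converted the whole inner loop, you can remove most of the loads and stores to and from variables and keep the values in MMX registers instead.
That's the point where your code suddenly gets much faster.
Just remember to execute emms after the loop, before any floating-point code runs, because the MMX registers share state with the x87 FPU.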