Hmm, I'll have to look into this VTune, it sounds interesting. The Intel manual on using rdtsc to benchmark software is full of flaws: their method only works correctly if your software happens to be the only thing running on the CPU, as in no operating system and no task switching.
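For what it's worth, the usual workaround when you do have an OS underneath is to time the same code many times and keep the smallest reading, on the theory that a run interrupted by a task switch can only come out slower, never faster. Rough sketch below. This is not Intel's method, `work` and `min_cycles` are names I made up for illustration, and it assumes an x86 target where the compiler provides the `__rdtsc` intrinsic. It also skips serializing the instruction stream around rdtsc, so treat the numbers as approximate.

```c
#include <assert.h>
#include <stdint.h>
#ifdef _MSC_VER
#include <intrin.h>      /* __rdtsc with cl.exe / icl.exe */
#else
#include <x86intrin.h>   /* __rdtsc with gcc / clang / icc on Linux */
#endif

/* Hypothetical workload, only here so there is something to time. */
static uint64_t work(void)
{
    volatile uint64_t sum = 0;   /* volatile so the loop is not optimised away */
    for (uint64_t i = 0; i < 1000; ++i)
        sum += i;
    return sum;
}

/* Time fn() `runs` times and return the smallest cycle count seen.
   A run that was preempted or interrupted is inflated, so the minimum
   is the least disturbed sample. */
static uint64_t min_cycles(uint64_t (*fn)(void), int runs)
{
    assert(runs > 0);
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < runs; ++i) {
        uint64_t t0 = __rdtsc();
        fn();
        uint64_t t1 = __rdtsc();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}
```

Calling something like `min_cycles(work, 1000)` gives a figure that is far more repeatable than a single rdtsc pair, though it still includes the overhead of the rdtsc calls themselves.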
One more thing I was amazed by is how little bloat icc produces compared to msvc. The exact same source code compiles to 103KB using cl.exe but only 35.5KB using icl.exe. That's a whole lotta dead code removal using icc's libraries. Also, you should still have access to icc's optimised math kernel even if you specify NODEFAULTLIB, though I haven't tested this yet.