When you enable SSE code generation, VC tries to auto-vectorize (and it's not very good at it).
This ends up in one of two situations:
- It computes intermediate values in parallel, producing a single output value
- It computes four output values in parallel
The latter adds quite a bit of code-size overhead, because the compiler also has to emit code for element counts that are not a multiple of 4.
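
To make the two shapes concrete, here is a rough sketch in portable scalar C++ of what each one looks like. This is not actual compiler output: the function names `dot4` and `scale` are made up for illustration, and each group of four scalar operations only stands in for one SSE instruction working on four floats.

```cpp
#include <cstddef>

// Situation 1: four intermediate values computed "in parallel",
// then combined into a single output value.
// (Hypothetical example; p[] stands in for one 4-wide SSE register.)
float dot4(const float a[4], const float b[4]) {
    float p[4];
    for (int k = 0; k < 4; ++k) {
        p[k] = a[k] * b[k];              // one packed multiply
    }
    return (p[0] + p[1]) + (p[2] + p[3]); // horizontal add to one result
}

// Situation 2: four output values per iteration.
// The second loop is the extra code needed when n is not a multiple of 4.
void scale(float* out, const float* in, float s, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        for (int k = 0; k < 4; ++k) {
            out[i + k] = in[i + k] * s;  // one packed multiply, four outputs
        }
    }
    for (; i < n; ++i) {
        out[i] = in[i] * s;              // scalar tail for leftover elements
    }
}
```

With `n = 6`, for example, `scale` runs the four-wide body once and the scalar tail twice. A real vectorizer has to emit both loops, which is where the extra code size comes from.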