SSE is a separate processing unit with its own registers (so, unlike MMX, it doesn't interfere with the FPU). It contains eight 128-bit registers (xmm0..xmm7), each of which holds 4 single-precision floats.
One thing that differs from MMX is that there are two separate load/store instructions: one for aligned data (movaps) and one for unaligned data (movups).
Aligned data means that the address of the data is divisible by 16 (so the lower 4 bits of the address are zero), which guarantees that the whole 128 bits of data lie within the same cache line.
In contrast, an unaligned load/store may have to fetch data from two different cache lines and thus has to make sure both lines are actually present in the cache (so it might have to transfer more data from memory than is actually needed).
In practice you would allocate every block of data with 16 extra bytes and fix up its starting address:
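As a rough sketch of the difference, the two kinds of load can be written with compiler intrinsics from `<xmmintrin.h>`: `_mm_load_ps` corresponds to the aligned load (movaps) and `_mm_loadu_ps` to the unaligned one (movups). The helper names `is_aligned16` and `load4` are made up for this illustration and are not part of any library.

```c
#include <stdint.h>
#include <xmmintrin.h>

/* Hypothetical helper: non-zero if the lower 4 address bits are zero. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15) == 0;
}

/* Hypothetical helper: loads 4 floats, using the aligned load only
   when the address really is a multiple of 16. */
__m128 load4(const float *p)
{
    if (is_aligned16(p))
        return _mm_load_ps(p);   /* aligned load (movaps) */
    return _mm_loadu_ps(p);      /* unaligned load (movups) */
}
```

In real code you would normally not branch per load like this. You would arrange for the data to be aligned once and then use the aligned instructions throughout.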
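unsigned char* data= (unsigned char*)malloc( num_bytes + 16 ); // num_bytes = size you actually need
uintptr_t address= (uintptr_t)data; // uintptr_t (from <stdint.h>) is wide enough for any pointer
address = (address + 15) & ~(uintptr_t)15; // round up to the next multiple of 16
// remember the original "data" pointer somewhere so you have a chance to free it
data= (unsigned char*)address;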
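One possible way to "remember the original pointer" is to store it directly in front of the aligned block, so that a matching free function can find it again. The following is only a sketch under that assumption. The names `alloc16` and `free16` are hypothetical, and many platforms ship their own aligned allocators that you may prefer.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: returns a 16-byte aligned block of at least
   num_bytes, or NULL on failure. Release it with free16(), not free(). */
void *alloc16(size_t num_bytes)
{
    /* extra room: up to 15 bytes of alignment slack plus one saved pointer */
    unsigned char *raw = (unsigned char *)malloc(num_bytes + 15 + sizeof(void *));
    uintptr_t address;

    if (raw == NULL)
        return NULL;

    /* leave space for the saved pointer, then round up to a multiple of 16 */
    address = (uintptr_t)(raw + sizeof(void *));
    address = (address + 15) & ~(uintptr_t)15;

    /* store the original malloc pointer just before the aligned block */
    ((void **)address)[-1] = raw;
    return (void *)address;
}

/* Hypothetical helper: frees a block obtained from alloc16(). */
void free16(void *ptr)
{
    if (ptr != NULL)
        free(((void **)ptr)[-1]);
}
```

A buffer for n rgba pixels would then be obtained with `float *pixels = (float *)alloc16(n * 4 * sizeof(float));` and released with `free16(pixels);`.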
But let's ignore that for now and look at the add/mul example from above.
rgba addition:
float a1,r1,g1,b1;
float a2,r2,g2,b2;
asm {
lea edi,a1 // load address of a1 (assuming r1,g1,b1 follow right afterwards)
lea esi,a2 // load address of a2
movups xmm1,[edi] // load 4 floats (a1,r1,g1,b1) into xmm1
movups xmm2,[esi] // load 4 floats (a2,r2,g2,b2) into xmm2
addps xmm1, xmm2 // add 4 floats: a1+a2, r1+r2, g1+g2, b1+b2
movups [edi],xmm1 // store result in a1,r1,g1,b1
};
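For compilers without this inline assembler syntax, roughly the same addition can be sketched with the SSE intrinsics from `<xmmintrin.h>`. This version keeps the four components in a `float[4]` array, which avoids having to assume that four separate local variables lie next to each other in memory. The function name `rgba_add` is made up for this example.

```c
#include <xmmintrin.h>

/* Hypothetical helper: c1 += c2 for one rgba color stored as 4 floats. */
void rgba_add(float c1[4], const float c2[4])
{
    __m128 x1 = _mm_loadu_ps(c1);   /* movups: load a1,r1,g1,b1 */
    __m128 x2 = _mm_loadu_ps(c2);   /* movups: load a2,r2,g2,b2 */
    x1 = _mm_add_ps(x1, x2);        /* addps: add all 4 floats at once */
    _mm_storeu_ps(c1, x1);          /* movups: store the result in c1 */
}
```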
Since we're working in floating point, the results are not clamped to a fixed range, so we don't need to saturate them (as in the MMX example).
rgba multiply works exactly the same:
float a1,r1,g1,b1;
float a2,r2,g2,b2;
asm {
lea edi,a1 // load address of a1 (assuming r1,g1,b1 follow right afterwards)
lea esi,a2 // load address of a2
movups xmm1,[edi] // load 4 floats (a1,r1,g1,b1) into xmm1
movups xmm2,[esi] // load 4 floats (a2,r2,g2,b2) into xmm2
mulps xmm1, xmm2 // multiply 4 floats: a1*a2, r1*r2, g1*g2, b1*b2
movups [edi],xmm1 // store result in a1,r1,g1,b1
};
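The same pattern extends naturally from a single color to a whole buffer of pixels. The following sketch multiplies n rgba pixels in place using intrinsics. It assumes each pixel is stored as 4 consecutive floats, and `rgba_mul_n` is a hypothetical name chosen for this example.

```c
#include <stddef.h>
#include <xmmintrin.h>

/* Hypothetical helper: dst[i] *= src[i] for n rgba pixels of 4 floats each. */
void rgba_mul_n(float *dst, const float *src, size_t n)
{
    size_t i;
    for (i = 0; i < n; ++i) {
        __m128 x1 = _mm_loadu_ps(dst + 4 * i);   /* movups: load one pixel of dst */
        __m128 x2 = _mm_loadu_ps(src + 4 * i);   /* movups: load one pixel of src */
        x1 = _mm_mul_ps(x1, x2);                 /* mulps: multiply all 4 floats */
        _mm_storeu_ps(dst + 4 * i, x1);          /* movups: store the result */
    }
}
```

If both buffers are known to start on a 16-byte boundary, every pixel is aligned as well (each one is exactly 16 bytes), so the unaligned loads and stores could be swapped for `_mm_load_ps` and `_mm_store_ps`.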