Has anyone here done any?
At the moment the inner loop of my gradient-filled triangles looks like this:
tri_loop:
    vcvt.s32.f32 q1, q0        // convert 4 floats to ints
    vmov         r2, r3, s4, s5 // move r & g ints to ARM registers
    vmov         r4, s6        // move b int to ARM register
    orr          r3, r2, lsl #8 // shift red and OR with green
    orr          r4, r3, lsl #8 // shift r & g and OR with blue
    str          r4, [r0], #4  // write to pixel address and advance pointer
    vadd.f32     q0, q2        // add colour deltas
    cmp          r0, r1        // compare pointer with final address
    ble          tri_loop      // repeat if less or equal
But I'm wondering whether there's a way to shuffle the bytes in q1 instead: take the low byte of each 32-bit lane, pack those bytes together, and write the result straight to memory, rather than moving the values into the ARM registers and doing the shifting and OR-ing there.
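One possible direction, sketched here with C intrinsics plus a scalar fallback so it can be checked on a non-ARM machine: convert as before, then narrow twice with saturation so the low four bytes of the result hold the four lanes, and store that word directly. The names `pack_pixel` and `sat_u8` are hypothetical, and a little-endian target is assumed. To reproduce the 0x00RRGGBB word that the loop above builds, the lanes would have to be ordered blue, green, red, unused (with the unused lane held at zero), whereas the loop above keeps red in the first lane, so the colour and delta vectors would need reordering. The narrows also clamp to 0..255, which the shift-and-OR version does not do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

#if !HAVE_NEON
/* Scalar stand-in for "convert toward zero, then saturate to 0..255".
   NaN and negative inputs map to 0, values of 255 or more map to 255. */
static uint8_t sat_u8(float v)
{
    if (!(v > 0.0f)) return 0;
    if (v >= 255.0f) return 255;
    return (uint8_t)v;
}
#endif

/* Hypothetical helper: packs four float lanes into one 32-bit pixel.
   Lane 0 ends up in the lowest byte, so pass {blue, green, red, 0}
   to get 0x00RRGGBB on a little-endian target. */
static uint32_t pack_pixel(const float lanes[4])
{
    assert(lanes != NULL);
#if HAVE_NEON
    float32x4_t f   = vld1q_f32(lanes);
    int32x4_t   i32 = vcvtq_s32_f32(f);                  /* vcvt.s32.f32 q1, q0 */
    uint16x4_t  u16 = vqmovun_s32(i32);                  /* vqmovun.s32  d2, q1 */
    uint8x8_t   u8  = vqmovn_u16(vcombine_u16(u16, u16)); /* vqmovn.u16   d2, q1 */
    /* In assembly this word could be stored with: vst1.32 {d2[0]}, [r0]! */
    return vget_lane_u32(vreinterpret_u32_u8(u8), 0);
#else
    return (uint32_t)sat_u8(lanes[0])
         | (uint32_t)sat_u8(lanes[1]) << 8
         | (uint32_t)sat_u8(lanes[2]) << 16
         | (uint32_t)sat_u8(lanes[3]) << 24;
#endif
}
```

For example, calling `pack_pixel` on the lanes {16.0, 32.5, 255.9, 0.0} returns 0x00FF2010. If the wrap-around behaviour of plain truncation is wanted instead of clamping, the non-saturating narrows (`vmovn.i32` then `vmovn.i16`) are the ones to look at.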