Topic: arm neon assembly

Stonemonkey
arm neon assembly
« on: September 10, 2016 »
Has anyone here done any?

At the moment the inner loop of my gradient filled tris looks like this:
Code: [Select]
tri_loop:
  vcvt.s32.f32 q1, q0      //convert 4 floats to ints (rounds toward zero)
  vmov r2, r3, s4, s5      //move r&g ints to arm registers
  vmov r4, s6              //move b int to arm register
  orr r3, r3, r2, lsl #8   //shift red and or with green
  orr r4, r4, r3, lsl #8   //shift r&g and or with blue
  str r4, [r0], #4         //write to pixel address and modify pointer
  vadd.f32 q0, q0, q2      //add colour deltas
  cmp r0, r1               //compare pointer with final address
  bls tri_loop             //repeat if lower or same (unsigned, as these are addresses)

But I'm wondering if there's a way to shuffle the bytes in q1, taking the low byte of each 32-bit lane and packing them together so they can be written straight to memory, instead of moving the values into the ARM registers and doing the shifting and ORing there.
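
For anyone following along without an ARM board, here is a rough scalar C model of what the loop above does per pixel. It is only a sketch: the function names pack_rgb_from_floats and fill_span are made up for illustration, and it assumes each channel is already within 0..255, since the assembly does no clamping. The result is a 0x00RRGGBB pixel, which is what the two orr instructions build.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one pass through the loop body.
   Assumption: channels are already in [0, 256); the assembly does not clamp. */
static uint32_t pack_rgb_from_floats(float r, float g, float b)
{
    assert(r >= 0.0f && r < 256.0f);
    assert(g >= 0.0f && g < 256.0f);
    assert(b >= 0.0f && b < 256.0f);

    /* vcvt.s32.f32 rounds toward zero, like a C float-to-int cast. */
    uint32_t ri = (uint32_t)(int32_t)r;
    uint32_t gi = (uint32_t)(int32_t)g;
    uint32_t bi = (uint32_t)(int32_t)b;

    uint32_t rg = gi | (ri << 8);   /* orr r3, r3, r2, lsl #8 */
    return bi | (rg << 8);          /* orr r4, r4, r3, lsl #8 */
}

/* Hypothetical helper: fills first..last inclusive, mirroring the
   store / add deltas / compare / branch order of the assembly loop.
   Like the assembly, it always writes at least one pixel. */
static void fill_span(uint32_t *first, uint32_t *last,
                      float colour[3], const float delta[3])
{
    uint32_t *p = first;
    do {
        *p++ = pack_rgb_from_floats(colour[0], colour[1], colour[2]);
        for (int i = 0; i < 3; i++)
            colour[i] += delta[i];  /* vadd.f32 q0, q0, q2 */
    } while (p <= last);            /* cmp r0, r1 / branch while pointer <= final */
}
```

Calling fill_span(&row[x0], &row[x1], colour, delta) fills x0 through x1 inclusive, so the final address is the last pixel to write, not one past it.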
« Last Edit: September 10, 2016 by Stonemonkey »

Stonemonkey
Re: arm neon assembly
« Reply #1 on: September 11, 2016 »
Found a way. This is now the inner loop of a 16:16 fixed-point gradient triangle:

Code: [Select]
triangle_loop:
  vtbl.8 d2, {d0, d1}, d6  //shuffle bytes from d0/d1 to d2, d6=table
  vstr s4, [r6]            //write to pixel address (s4 is the low half of d2)
  add r6, r6, #4           //modify pixel pointer
  vadd.s32 q0, q0, q2      //add colour deltas
  cmp r6, r7               //compare pixel address to final address
  bls triangle_loop        //repeat if lower or same (unsigned)
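
The post does not say what is in the d6 table, so the following is only a guess at how it could be set up, written as a portable C model rather than real NEON code. It assumes q0 holds r, g, b in lanes 0, 1, 2 (the same order as the float version), 16:16 fixed point with the integer part in 0..255, a little-endian target, and a 0x00RRGGBB pixel. Under those assumptions the wanted byte of lane i is byte 4*i+2, so the table would start 10, 6, 2. The names vtbl2_model, kPixelShuffle and pack_rgb_from_fixed are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Model of vtbl.8 dD, {dN, dN+1}, dM: each index byte selects one of the
   16 table bytes, and any index outside 0..15 produces 0. */
static void vtbl2_model(uint8_t out[8], const uint8_t table[16],
                        const uint8_t idx[8])
{
    for (size_t i = 0; i < 8; i++)
        out[i] = idx[i] < 16 ? table[idx[i]] : 0;
}

/* Assumed contents of d6 (not given in the post): blue, green, red, then
   out-of-range indices so the top byte and the unused high word read as 0. */
static const uint8_t kPixelShuffle[8] = {10, 6, 2, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF};

/* rgbx = the four 16:16 lanes of q0; returns the word that the 32-bit
   store of s4 would write. Assumes a little-endian host. */
static uint32_t pack_rgb_from_fixed(const int32_t rgbx[4])
{
    for (int i = 0; i < 3; i++)
        assert(rgbx[i] >= 0 && rgbx[i] < (256 << 16)); /* integer part 0..255 */

    uint8_t table[16];
    uint8_t out[8];
    uint32_t pixel;

    memcpy(table, rgbx, sizeof table);      /* view q0 as 16 bytes */
    vtbl2_model(out, table, kPixelShuffle); /* vtbl.8 d2, {d0, d1}, d6 */
    memcpy(&pixel, out, sizeof pixel);      /* vstr s4, [r6] */
    return pixel;
}
```

With this table the shuffle replaces the two vmov and two orr instructions of the float version, and the fractional bits are simply dropped because only the third byte of each lane is read.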