I had a quick look at your source; here are some points that can be improved:
Bubble-sort is not a particularly fast way to pick the minimum/maximum from 4 values.
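For instance, a minimal sketch in C that gets both extremes with four comparisons and no swapping. The function name minmax4 and its signature are made up for illustration; they are not from your source.

```c
#include <assert.h>

/* Hypothetical helper: minimum and maximum of four values.
   Two comparisons order the pairs (a,b) and (c,d); two more pick the
   smaller of the two minima and the larger of the two maxima. */
static void minmax4(int a, int b, int c, int d, int *lo, int *hi)
{
    int lo1, hi1, lo2, hi2;

    assert(lo != 0 && hi != 0);

    if (a < b) { lo1 = a; hi1 = b; } else { lo1 = b; hi1 = a; }
    if (c < d) { lo2 = c; hi2 = d; } else { lo2 = d; hi2 = c; }

    *lo = (lo1 < lo2) ? lo1 : lo2;
    *hi = (hi1 > hi2) ? hi1 : hi2;
}
```

Called with the four corner coordinates, it fills in the bounding range directly.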
You'd be much faster using integer (e.g. fixed-point) coordinates for interpolation; the FPU requires a float-to-integer store (FIST) for every single value you interpolate.
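A rough sketch of what I mean, assuming a 16.16 fixed-point format and non-negative coordinates below 32767. The name interp_fixed and the format are my assumptions, not something taken from your code.

```c
#include <assert.h>
#include <stdint.h>

#define FP_ONE 65536  /* 1.0 in 16.16 fixed point (assumed format) */

/* Hypothetical helper: fill out[0..n-1] with integer values interpolated
   linearly from start to end, using only integer adds in the loop. */
static void interp_fixed(int start, int end, int n, int *out)
{
    int32_t step;
    uint32_t x;
    int i;

    assert(n > 1 && start >= 0 && start < 32767 && end >= 0 && end < 32767);

    /* One division per span, none per pixel. */
    step = (int32_t)(((int64_t)(end - start) * FP_ONE) / (n - 1));

    /* Start half a unit up so that the shift below rounds to nearest. */
    x = (uint32_t)start * FP_ONE + FP_ONE / 2;

    for (i = 0; i < n; i++) {
        out[i] = (int)(x >> 16);   /* integer part, no float conversion */
        x += (uint32_t)step;       /* unsigned wraparound handles negative steps */
    }
}
```

The per-pixel work is one shift and one add; the float-to-int conversion disappears from the loop entirely.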
Don't test every texel against the texture boundaries:
You can work out where the current scanline enters and leaves the valid range and skip the corresponding number of pixels (or use a polygon), which reduces the inner loop to just a few instructions.
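Roughly like this, as a sketch only. clip_span is a made-up name; it assumes each texture coordinate advances linearly along the scanline (c0 at pixel 0, plus dc per pixel) and that limit is the texture size in the same units (for 16.16 coordinates, size * 65536).

```c
#include <assert.h>
#include <stdint.h>

/* Floor and ceiling division for a positive divisor b. */
static int64_t floor_div(int64_t a, int64_t b)
{
    int64_t q = a / b;
    return (a % b < 0) ? q - 1 : q;
}

static int64_t ceil_div(int64_t a, int64_t b)
{
    return -floor_div(-a, b);
}

/* Hypothetical helper: narrow the half-open pixel range [*first, *last)
   to those pixels i for which 0 <= c0 + i*dc < limit.
   An empty result is returned as *first == *last. */
static void clip_span(int64_t c0, int64_t dc, int64_t limit,
                      int *first, int *last)
{
    int64_t lo, hi;

    assert(limit > 0 && *first <= *last);

    if (dc == 0) {                       /* constant along the scanline */
        if (c0 < 0 || c0 >= limit)
            *last = *first;
        return;
    }
    if (dc > 0) {                        /* enters at 0, leaves at limit */
        lo = ceil_div(-c0, dc);
        hi = floor_div(limit - 1 - c0, dc) + 1;
    } else {                             /* enters at limit, leaves at 0 */
        lo = ceil_div(c0 - (limit - 1), -dc);
        hi = floor_div(c0, -dc) + 1;
    }
    if (lo > *first) *first = (int)(lo < *last ? lo : *last);
    if (hi < *last)  *last  = (int)(hi > *first ? hi : *first);
}
```

Call it once for u and once for v on the same first/last pair, then run the inner loop from first to last without any per-texel range test.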
You don't need a multiplication to build a texel's address:
"my" always increases by the integer part of "tsdzx", or by one more if the fractional part overflows. So you can pick the increment from a table indexed by the carry flag (see here).
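In C the carry trick looks something like the sketch below. It assumes a non-negative 16.16 step and a texture stored row by row with pitch bytes per row; row_offsets and its parameter names are hypothetical and only loosely correspond to your "my" and "tsdzx". In assembly the carry comes for free from the add; here it is emulated by keeping the fraction in the low 16 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: write the row offset (row * pitch) for n
   consecutive steps of a 16.16 coordinate v0, v0+dv, v0+2*dv, ...
   without a multiplication inside the loop. */
static void row_offsets(uint32_t v0, uint32_t dv, uint32_t pitch,
                        int n, uint32_t *out)
{
    uint32_t step[2];
    uint32_t offset, frac, dfrac;
    int i;

    assert(n >= 0);

    step[0] = (dv >> 16) * pitch;    /* no carry: integer part of the step */
    step[1] = step[0] + pitch;       /* carry: one row more */

    offset = (v0 >> 16) * pitch;     /* single multiply, before the loop */
    frac   = v0 & 0xFFFF;
    dfrac  = dv & 0xFFFF;

    for (i = 0; i < n; i++) {
        out[i] = offset;
        frac += dfrac;               /* bit 16 now holds the carry */
        offset += step[frac >> 16];  /* table lookup replaces the multiply */
        frac &= 0xFFFF;
    }
}
```

With v0 = 1.5, dv = 0.75 and a pitch of 256, the rows visited are 1, 2, 3, 3, 4, so the offsets come out as 256, 512, 768, 768, 1024.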