Author Topic: using mutlithreading in freebasic  (Read 5152 times)


Offline ninogenio

  • Pentium
  • *****
  • Posts: 1666
  • Karma: 133
    • View Profile
using mutlithreading in freebasic
« on: April 02, 2013 »
Hello all, just wondering if anyone has tried multithreading in FreeBASIC. I'd love to give it a try but don't know where I would even begin. I'm working on a demo that does loads of number crunching: rendering to different buffers, applying filters to these buffers, combining them, etc. However, I'm on a Core i7, so most of my processor is going to waste at the moment.

When doing multithreading, is there a specific library that has to be used, or do certain built-in instructions suffice? Cheers for any help in advance.

Offline ninogenio

Re: using mutlithreading in freebasic
« Reply #1 on: April 02, 2013 »
Just realized there is a threading example in the FreeBASIC folder. Just by offloading one of my intensive blur filters to a second core, my fps has gone from 45ish to 105-109, so I'm gobsmacked!!

Offline hellfire

  • Sponsor
  • Pentium
  • *******
  • Posts: 1289
  • Karma: 466
    • View Profile
    • my stuff
Re: using mutlithreading in freebasic
« Reply #2 on: April 04, 2013 »
Quote: "offloading one of my intensive blur filters to a second core, my fps has gone from 45ish to 105-109"
The trick with "intensive blur filters" is to come up with a fast downsampling filter that halves or quarters the source image in each dimension without introducing aliasing.
This way the complex filter has to process only 1/4 or 1/16 of the original number of pixels.
Try to put the RGB-processing inner loops into MMX or SSE first. Then go for threads.
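To make the downsampling step concrete, here is a minimal sketch in plain C of halving an image with a 2x2 box filter. The box filter is only the simplest possible choice and an assumption on my part, not necessarily the filter meant above; the function names and the packed 32-bit ARGB layout are also just assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Average one 8-bit channel (selected by shift) of four packed pixels. */
static uint32_t avg4_channel(uint32_t p0, uint32_t p1, uint32_t p2, uint32_t p3, int shift)
{
    uint32_t sum = ((p0 >> shift) & 255u) + ((p1 >> shift) & 255u)
                 + ((p2 >> shift) & 255u) + ((p3 >> shift) & 255u);
    return ((sum + 2u) >> 2) << shift;   /* rounded average, moved back in place */
}

/* Halve a 32-bit ARGB image in both dimensions with a 2x2 box filter.
   src holds src_w * src_h pixels; dst must hold (src_w/2) * (src_h/2) pixels.
   An odd trailing row or column is simply ignored in this sketch. */
void downsample_half(const uint32_t *src, int src_w, int src_h, uint32_t *dst)
{
    int dst_w = src_w / 2, dst_h = src_h / 2;
    for (int y = 0; y < dst_h; ++y) {
        const uint32_t *row0 = src + (size_t)(2 * y) * src_w;
        const uint32_t *row1 = row0 + src_w;
        for (int x = 0; x < dst_w; ++x) {
            uint32_t p0 = row0[2 * x], p1 = row0[2 * x + 1];
            uint32_t p2 = row1[2 * x], p3 = row1[2 * x + 1];
            dst[(size_t)y * dst_w + x] =
                avg4_channel(p0, p1, p2, p3, 24) | avg4_channel(p0, p1, p2, p3, 16) |
                avg4_channel(p0, p1, p2, p3, 8)  | avg4_channel(p0, p1, p2, p3, 0);
        }
    }
}
```

Calling it twice in a row gives the quarter-size image, so the expensive blur then only touches 1/16 of the pixels.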



Offline ninogenio

Re: using mutlithreading in freebasic
« Reply #3 on: April 05, 2013 »
Cheers hellfire,

I'm currently downsampling to 1/4, blurring, then upsampling. I have never tried any MMX or SSE as I'm useless at x86 asm, but I can see the benefit of pixel packing. Would you happen to have a very basic example of MMX or SSE I could try?

There is a bit more to my multithreading than I first posted. I split my blur into 4 quads, carefully chosen so that no mutex locking is needed, then run each quad on its own thread in parallel. It gives massive speed gains very cheaply. But of course, if packing groups of data together for processing gives good returns too, I'm all for that.
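The quad split described above could look roughly like the following C sketch using POSIX threads (build with -pthread). This is not the poster's FreeBASIC code: the QuadJob struct, the function names and the trivial "halve every channel" filter are all hypothetical stand-ins, chosen only to show why disjoint regions need no locking.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stddef.h>

/* One rectangular region of the image, handled by one thread. */
typedef struct {
    uint32_t *pixels;        /* whole image, shared by all threads */
    int stride;              /* pixels per row */
    int x0, y0, x1, y1;      /* region: x0 <= x < x1, y0 <= y < y1 */
} QuadJob;

/* Hypothetical stand-in for a real filter: halve every colour channel. */
static void *process_quad(void *arg)
{
    QuadJob *job = (QuadJob *)arg;
    for (int y = job->y0; y < job->y1; ++y)
        for (int x = job->x0; x < job->x1; ++x) {
            uint32_t *p = &job->pixels[(size_t)y * job->stride + x];
            *p = (*p >> 1) & 0x7F7F7F7Fu;
        }
    return NULL;
}

/* Split the image into 2x2 quads and process each on its own thread.
   The quads do not overlap, so no two threads ever write the same pixel. */
int process_in_quads(uint32_t *pixels, int width, int height)
{
    int mx = width / 2, my = height / 2;
    QuadJob jobs[4] = {
        { pixels, width, 0,  0,  mx,    my     },
        { pixels, width, mx, 0,  width, my     },
        { pixels, width, 0,  my, mx,    height },
        { pixels, width, mx, my, width, height },
    };
    pthread_t threads[4];
    int started = 0;
    for (; started < 4; ++started)
        if (pthread_create(&threads[started], NULL, process_quad, &jobs[started]) != 0)
            break;
    for (int i = 0; i < started; ++i)
        pthread_join(threads[i], NULL);
    return started == 4 ? 0 : -1;   /* -1: a thread could not be created */
}
```

A real blur reads neighbouring pixels across the quad borders, so it would presumably read from a separate source buffer and only write into its own quad of the destination; that keeps the "no locking" property.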

Offline hellfire

MMX
« Reply #4 on: April 06, 2013 »
Quote: "would you happen to have a very basic example of mmx or sse i could give a try.."

You can think of MMX as an additional operating mode for your FPU: the eight MMX registers are aliased onto the FPU's register stack.
When you use an MMX instruction, whatever was in the FPU registers gets overwritten and the FPU is switched into MMX mode.
When you're done with MMX operations, a special instruction ("emms" = empty MMX state) marks the registers as empty again so the FPU can go back to floating-point operation (which is somewhat costly).
If you try to do any FPU operation while you're still in MMX mode, you will just get garbage; and since the compiler may expect some values to still exist in FPU registers, code that mixes the two will fail miserably.
This means you can't use floating-point and MMX operations at the same time, and you should switch between them as rarely as possible.
So you basically need to remove all floats from your inner loop to avoid permanent mode switching.
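As a rough illustration of "removing the floats", here is a small C sketch that converts a float weight to 8.8 fixed point once, outside the loop, and then scales pixels with integer maths only. The function names and the choice of 8 fractional bits are assumptions made for this example, and the weight is assumed to lie in 0.0..1.0.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a weight in [0.0, 1.0] to 8.8 fixed point (256 == 1.0). */
static uint32_t to_fixed8(float weight)
{
    return (uint32_t)(weight * 256.0f + 0.5f);
}

/* Scale one 8-bit channel by an 8.8 fixed-point weight. */
static uint32_t scale_channel(uint32_t value, uint32_t weight_fx)
{
    return (value * weight_fx) >> 8;
}

/* Scale every channel of every pixel; the loop itself is integer-only. */
void scale_pixels(uint32_t *pixels, int count, float weight)
{
    uint32_t w = to_fixed8(weight);          /* the only float operation */
    for (int i = 0; i < count; ++i) {
        uint32_t p = pixels[i];
        pixels[i] = (scale_channel((p >> 24) & 255u, w) << 24)
                  | (scale_channel((p >> 16) & 255u, w) << 16)
                  | (scale_channel((p >> 8)  & 255u, w) << 8)
                  |  scale_channel( p        & 255u, w);
    }
}
```

Because nothing inside the loop touches the FPU, a loop body like this could later be swapped for MMX instructions without any mode switching per pixel.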

In mmx mode you have 8 64bit registers called mm0...mm7 (just as many as the fpu has; not a big surprise).
These registers can be interpreted as 8 bytes, 4 shorts or 2 ints (each data-type has different instructions).

The easiest application for mmx is "rgba addition with saturation":
(sorry for the C example but I haven't touched freebasic for years)
Code: [Select]
// "unsigned int" is 32bits wide and contains 4 bytes with the a,r,g,b values
unsigned int color1= ...;
unsigned int color2= ...;

// add rgba components separately
int a= (color1 >> 24 & 255) + (color2 >> 24 & 255);
int r= (color1 >> 16 & 255) + (color2 >> 16 & 255);
int g= (color1 >> 8  & 255) + (color2 >> 8  & 255);
int b= (color1       & 255) + (color2       & 255);

// saturate components at 255
if (a > 255) a= 255;
if (r > 255) r= 255;
if (g > 255) g= 255;
if (b > 255) b= 255;

// store components back into color1
color1= ((unsigned int)a << 24) | (r << 16) | (g << 8) | b;

That's what mmx can handle in a single instruction:
Code: [Select]
unsigned int color1= ...;
unsigned int color2= ...;

asm {
     movd      mm0,[color1]   // load "color1" to lower 32bit of mmx register 0
     movd      mm1,[color2]   // load "color2" to lower 32bit of mmx register 1
     paddusb   mm0,mm1        // add mm1 to mm0 assuming that it contains unsigned bytes and saturate the result
     movd      [color1],mm0   // store lower 32bit of mm0 back into "color1"

     emms                     // empty mmx state: makes the fpu usable again
};

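The "paddusb" instruction always adds all 8 bytes of the 64-bit registers, although only the lower 4 are actually used here.
So in practice, if you load with "movq" instead of "movd", you can add two RGBA pixels to two other RGBA pixels in a single instruction.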
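To see what a packed saturating add computes on all 8 bytes at once, here is a small emulation in portable C. Note that this is a plain integer bit trick (often called SWAR), not MMX itself, and the function name is made up for this sketch; it is only meant to show the per-byte behaviour of "paddusb" on two packed pixels.

```c
#include <assert.h>
#include <stdint.h>

/* Add eight unsigned bytes packed in a and b, saturating each byte at 255.
   This mimics what "paddusb" computes, using ordinary 64-bit integer maths. */
uint64_t add_sat_u8x8(uint64_t a, uint64_t b)
{
    const uint64_t HI = 0x8080808080808080ULL;          /* top bit of every byte */
    uint64_t low  = (a & ~HI) + (b & ~HI);              /* add the low 7 bits; carries stay inside each byte */
    uint64_t sum  = low ^ ((a ^ b) & HI);               /* per-byte sum, wrapped modulo 256 */
    uint64_t over = ((a & b) | ((a | b) & ~sum)) & HI;  /* marks bytes whose true sum exceeded 255 */
    return sum | ((over >> 7) * 0xFFu);                 /* force those bytes to 255 */
}
```

With two ARGB pixels packed into each 64-bit operand, one call adds both pixel pairs, which is the "2x2 rgba" case from above.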


Another application is rgba multiplication:
Code: [Select]
unsigned int color1= ...;
unsigned int color2= ...;

// multiply rgba components separately
int a= (color1 >> 24 & 255) * (color2 >> 24 & 255);
int r= (color1 >> 16 & 255) * (color2 >> 16 & 255);
int g= (color1 >> 8  & 255) * (color2 >> 8  & 255);
int b= (color1       & 255) * (color2       & 255);

// shift back to 0..255 range
a= a>>8;
r= r>>8;
g= g>>8;
b= b>>8;

// store components back into color1
color1= ((unsigned int)a << 24) | (r << 16) | (g << 8) | b;

This works a bit differently in mmx because we can only multiply shorts, so we have to do a bit of converting:
Code: [Select]
asm {
    movd      mm0,[color1]  // load color1:  [0:0][0:0][a1:r1][g1:b1]
    movd      mm1,[color2]  // load color2:  [0:0][0:0][a2:r2][g2:b2]
    pxor      mm3,mm3       // fill mm3 with zeros
    punpcklbw mm0,mm3       // merge bytes of mm0/mm3 to 4 shorts:  [0:a1][0:r1][0:g1][0:b1]
    punpcklbw mm1,mm3       // merge bytes of mm1/mm3 to 4 shorts:  [0:a2][0:r2][0:g2][0:b2]
    pmullw    mm0,mm1       // multiply shorts: [a1*a2][r1*r2][g1*g2][b1*b2]
    psrlw     mm0,8         // shift shorts back into 0..255 range [0:a][0:r][0:g][0:b]
    packuswb  mm0,mm3       // "unmerge" bytes: [0:a][0:r][0:g][0:b] -> [0:0][0:0][a:r][g:b]
    movd      [color1],mm0  // store lower 32bit of mm0 back into "color1"

    emms                    // empty mmx state: makes the fpu usable again
};

If you're planning to apply many filters one after another, you might prefer to keep your data in the 4x short format to avoid the repeated conversion and the accumulating round-off errors (and the sign bit can be useful, too).


In practice you won't place the "emms" instruction after each pixel but at the very end when you're actually done with mmx processing:
Code: [Select]
// convert all your floats to fixed point
for (all pixels)
{
  // some mmx code
  // some integer code
  // but no floats
}
asm {
  emms
};
« Last Edit: April 07, 2013 by hellfire »

Offline hellfire

SSE
« Reply #5 on: April 07, 2013 »
SSE has its own set of registers (so, unlike MMX, it doesn't interfere with the FPU): 8 128-bit registers (xmm0..xmm7 in 32-bit mode), each of which holds 4 floats.

One aspect that is generally different is that there are two separate load/store instructions: one for unaligned data and one for aligned data.
Aligned data means that the address of the data is divisible by 16 (so the lower 4 bits of the address are zero), which guarantees that the whole 128 bits of data lie within the same cache line.
In contrast, an unaligned load/store may have to fetch data from two different cache lines and thus has to make sure both lines are actually present in the cache (so it might have to transfer more data from memory than is actually needed).

In practice you would allocate every block of data with 16 extra bytes and fix up its starting address:
Code: [Select]
// needs <stdlib.h> and <stdint.h>; numberOfBytes is the size you actually want
unsigned char* raw= (unsigned char*)malloc( numberOfBytes + 16 );
uintptr_t address= (uintptr_t)raw;           // pointer-sized, so it also works on 64-bit
address= (address + 15) & ~(uintptr_t)15;    // round up to the next multiple of 16
// keep "raw" around: that is the pointer you have to pass to free()
unsigned char* data= (unsigned char*)address;
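
As an alternative to fixing up the pointer by hand, C11 offers aligned_alloc in <stdlib.h>. The wrapper below is only a sketch (the name alloc_aligned16 is made up here), and it assumes a C11 runtime that actually provides aligned_alloc, which not every compiler does.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocate at least num_bytes with a 16-byte-aligned start address.
   aligned_alloc expects the size to be a multiple of the alignment,
   so the size is rounded up first. Release the block with plain free(). */
void *alloc_aligned16(size_t num_bytes)
{
    size_t rounded = (num_bytes + 15u) & ~(size_t)15u;
    if (rounded == 0)
        rounded = 16;                 /* avoid a zero-sized request */
    return aligned_alloc(16, rounded);
}
```

The advantage is that the returned pointer is also the one you free, so there is no second "original" pointer to remember.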

But let's ignore that for now and look at the add/mul examples from above.

rgba addition:
Code: [Select]
float a1,r1,g1,b1;
float a2,r2,g2,b2;

asm {
   lea edi,a1         // load address of a1 (assuming r1,g1,b1 follow right afterwards)
   lea esi,a2         // load address of a2
   movups xmm1,[edi]  // load 4 floats (a1,r1,g1,b1) into xmm1
   movups xmm2,[esi]  // load 4 floats (a2,r2,g2,b2) into xmm2
   addps  xmm1, xmm2  // add 4 floats: a1+a2, r1+r2, g1+g2, b1+b2
   movups [edi],xmm1  // store result in a1,r1,g1,b1
};
Since we're working with floats, we keep the full precision and don't need to saturate the results (as in the MMX example).

rgba multiply works exactly the same:
Code: [Select]
float a1,r1,g1,b1;
float a2,r2,g2,b2;

asm {
   lea edi,a1         // load address of a1 (assuming r1,g1,b1 follow right afterwards)
   lea esi,a2         // load address of a2
   movups xmm1,[edi]  // load 4 floats (a1,r1,g1,b1) into xmm1
   movups xmm2,[esi]  // load 4 floats (a2,r2,g2,b2) into xmm2
   mulps  xmm1, xmm2  // multiply 4 floats: a1*a2, r1*r2, g1*g2, b1*b2
   movups [edi],xmm1  // store result in a1,r1,g1,b1
};

« Last Edit: April 07, 2013 by hellfire »

Offline ninogenio

Re: using mutlithreading in freebasic
« Reply #6 on: April 07, 2013 »
hellfire, that's awesome mate!!!

You've made my day, K++. For the first time in years I can clearly see what MMX and SSE are all about, and they have benefits in lots of situations. I can apply this to lots of my tight inner loops, and the code examples are excellent too.

I'm going to try some examples today to see just how much quicker this is. I suspect it will be of good benefit in FreeBASIC, where the compiler won't produce asm as optimised as the C/C++ ones do.

I'll do a large array loop with and without MMX/SSE and run the loop a few thousand times to see how many milliseconds it takes.

Thanks very, very much. I'm sure your posts will help a lot more people than just me.



Offline ninogenio

Re: using mutlithreading in freebasic
« Reply #7 on: April 07, 2013 »
Here is a little test I made up with your code, hellfire. It's staggering how much faster this is: slightly over double! I'll test SSE out in a bit, as well as multiply.


Code: [Select]
#Include "Tinyptc_ext.Bi"
#Include "Windows.Bi"

Const XRes = 800
Const YRes = 600
Dim Shared As Double WndOrgX = XRes/2
Dim Shared As Double WndOrgY = YRes/2

RANDOMIZE TIMER

Type TimerType
   
    Frequency As LARGE_INTEGER
    LiStart As LARGE_INTEGER
    LiStop As LARGE_INTEGER
    LlTimeDiff As LONGLONG
    MDuration As Double

End Type
Declare Function    MmxAdd( ByVal Color1 As uInteger, ByVal Color2 As uInteger ) As uInteger
Declare Function    UnMmxSseAdd( ByRef Color1 As uInteger, ByVal Color2 As uInteger ) As uInteger
Declare Sub         MyLine( byval x1 as integer, byval y1 as integer, byval x2 as integer, byval y2 as integer, ByVal Col As Integer )
Declare Sub         PtcOpen()
Declare Sub         StartTimer( TempTimer As TimerType Ptr )
Declare Sub         DestroyTimer( TempTimer As TimerType Ptr )
Declare Function    NewTimer() As TimerType Ptr
Declare Function    GetTimerMs( TempTimer As TimerType Ptr ) As Double
Declare Function    GetTimerSec( TempTimer As TimerType Ptr ) As Double

Dim Shared As Integer Buffer( XRes * YRes )
Dim Shared As TimerType Ptr FrameTimer

FrameTimer = NewTimer()

PtcOpen()

Dim Shared As uInteger SourceColor = (120 shl 24) Or (40 shl 16) or (30 shl 8) or 30
Dim Shared As uInteger DestColor = (100 shl 24) Or (30 shl 16) or (20 shl 8) or 20
Dim Shared As uInteger ColorBuffer(1000)
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
' Test''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
   
    Ptc_Update( @Buffer(0) )
   
    StartTimer( FrameTimer )
    For Y = 0 To 60000
        For X = 0 To 1000
           ColorBuffer(X) = UnMmxSseAdd( SourceColor, DestColor )
        Next
    Next
    Print "Add without Mmx:   "+Str(GetTimerMs( FrameTimer ))
    Print Str(ColorBuffer(0) and 255)

   
    StartTimer( FrameTimer )
    For Y = 0 To 60000
        For X = 0 To 1000
            ColorBuffer(X) = MmxAdd( SourceColor, DestColor )
        Next
    Next
    Asm
        emms                     ' empty mmx state: makes the fpu usable again
    End Asm
    Print "Add With Mmx:   "+Str(GetTimerMs( FrameTimer ))
    Print Str(ColorBuffer(0) And 255)

''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''

do
Loop While (  GetTimerMs( FrameTimer ) <= 6000 )

DestroyTimer( FrameTimer )



Function MmxAdd( ByVal Color1 As uInteger, ByVal Color2 As uInteger ) As uInteger
   
    Asm
        movd      mm0,[color1]   ' load "color1" to lower 32bit of mmx register 0
        movd      mm1,[color2]   ' load "color2" to lower 32bit of mmx register 1
        paddusb   mm0,mm1        ' add mm1 to mm0 assuming that it contains unsigned bytes and saturate the result
        movd      [color1],mm0   ' store lower 32bit of mm0 back into "color1"
    End Asm
   
    MmxAdd = Color1
   
End Function



Function UnMmxSseAdd( ByRef Color1 As uInteger, ByVal Color2 As uInteger ) As uInteger
   
    Dim As uInteger c1
   
    Dim As uInteger a =     (color1 Shr 24  And 255) + (color2 Shr 24   And 255)
    Dim As uInteger r =     (color1 Shr 16  And 255) + (color2 Shr 16   And 255)
    Dim As uInteger g =     (color1 Shr 8   And 255) + (color2 Shr 8    And 255)
    Dim As uInteger b =     (color1         And 255) + (color2          And 255)
   
    'saturate components at 255
    If (a Shr 255) Then a = 255
    If (r Shr 255) Then r = 255
    If (g Shr 255) Then g = 255
    If (b Shr 255) Then b = 255

    'store componets back into c1
    c1 = (a Shl 24) Or (r Shl 16) Or (g Shl 8) Or b
   
    UnMmxSseAdd = c1
   
End Function



Sub PtcOpen()
   
    Ptc_AllowClose(0)
    Ptc_SetDialog(1,"Template"+CHR$(13)+"FullScreen",0)
    Ptc_SetFlip(1)
    If ( Ptc_Open( "Tunnel", XRes, YRes ) = 0 ) Then
        End - 1
    End If
   
End Sub



Function NewTimer() As TimerType Ptr
   
    Dim As TimerType Ptr TempTimer
   
    TempTimer = CAllocate( SizeOf( TimerType ) )
    QueryPerformanceFrequency( @TempTimer->Frequency )
    NewTimer = TempTimer
   
End Function



Sub StartTimer( TempTimer As TimerType Ptr )
   
    QueryPerformanceCounter( @TempTimer->LiStart )
   
End Sub



Function GetTimerMs( TempTimer As TimerType Ptr ) As Double
   
    QueryPerformanceCounter( @TempTimer->LiStop )
    TempTimer->LlTimeDiff = TempTimer->LiStop.QuadPart - TempTimer->LiStart.QuadPart
    TempTimer->MDuration = Cast( Double, TempTimer->LlTimeDiff ) * 1000.0 / Cast( Double , TempTimer->Frequency.QuadPart )
    GetTimerMs = TempTimer->MDuration
   
End Function



Function GetTimerSec( TempTimer As TimerType Ptr ) As Double
   
    QueryPerformanceCounter( @TempTimer->LiStop )
    TempTimer->LlTimeDiff = TempTimer->LiStop.QuadPart - TempTimer->LiStart.QuadPart
    TempTimer->MDuration = Cast( Double, TempTimer->LlTimeDiff ) * 1000.0 / Cast( Double , TempTimer->Frequency.QuadPart )
    GetTimerSec = TempTimer->MDuration/1000.0
   
End Function



Sub DestroyTimer( TempTimer As TimerType Ptr )
   
    If ( TempTimer ) Then
        DeAllocate( TempTimer )
    EndIf
   
End Sub

I get about 359 milliseconds with MMX and 736 without  :)
« Last Edit: April 07, 2013 by ninogenio »

Offline hellfire

Re: using mutlithreading in freebasic
« Reply #8 on: April 07, 2013 »
Quote: "its staggering how much faster this is! slightly over double."
At the moment you're even doing a function call per pixel, which destroys a lot of the benefit.
If you use an optimized inner loop for a whole scanline, it will be quite a bit faster.
Most compilers are also over-cautious when they see an asm block and save/restore all their registers around it.
Maybe you can try something like this (sorry, completely untested):
Code: [Select]
sub addBuffers(dst as UInteger Ptr, src as UInteger Ptr, count as Integer)
  ASM
     mov      edi, [dst]   ' load address of dst buffer
     mov      esi, [src]   ' load address of src buffer
     mov      ecx, [count] ' hold number of pixels in ecx (count must be > 0)
pixelloop:
     movd     mm0,[edi]    ' load rgba from dst buffer
     movd     mm1,[esi]    ' load rgba from src buffer
     paddusb  mm0,mm1      ' add rgba
     movd     [edi],mm0    ' store in dst buffer
     add      edi,4        ' increment address to next pixel in dst buffer
     add      esi,4        ' next pixel in src buffer
     dec      ecx          ' decrease number of pixels
     jnz      pixelloop    ' continue until ecx=0

     emms
  END ASM
end sub
Versus
Code: [Select]
Sub UnMmxSseAdd( ByVal dst As uInteger Ptr, ByVal src As uInteger Ptr, ByVal count As Integer )
    Dim As Integer x
    for x = 0 to count-1
        Dim As uInteger a =     (dst[x] Shr 24  And 255) + (src[x] Shr 24   And 255)
        Dim As uInteger r =     (dst[x] Shr 16  And 255) + (src[x] Shr 16   And 255)
        Dim As uInteger g =     (dst[x] Shr 8   And 255) + (src[x] Shr 8    And 255)
        Dim As uInteger b =     (dst[x]         And 255) + (src[x]          And 255)
   
        If (a > 255) Then a = 255
        If (r > 255) Then r = 255
        If (g > 255) Then g = 255
        If (b > 255) Then b = 255

        dst[x] = (a Shl 24) Or (r Shl 16) Or (g Shl 8) Or b
    next
End Sub

Oh, and this...
Code: [Select]
    'saturate components at 255
    If (a Shr 255) Then a = 255
    If (r Shr 255) Then r = 255
    If (g Shr 255) Then g = 255
    If (b Shr 255) Then b = 255
...was meant to be a compare, not a shift.


Back to multithreading: I found it advantageous to spawn a separate thread per scanline.
That's because on Intel two cores always share the same cache, and it's likely that they address the same source data, so they need to transfer less data from memory.
It's also guaranteed that each thread processes the same number of pixels (although the dimensions of an image are usually divisible by four anyway).
And in C++ most compilers support OpenMP, which can handle multithreading with a simple compiler directive (#pragma).
So this code layout suits my general laziness very well :)
Code: [Select]
#pragma omp parallel for
for (all scanlines)
{
   processScanline();
}
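
Spelled out as compilable C, the per-scanline OpenMP layout might look like the sketch below. The filter (inverting the colour channels) and the function names are hypothetical placeholders. OpenMP has to be enabled at build time (for example -fopenmp with GCC or Clang); without it the pragma is ignored and the loop just runs serially with the same result.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical per-scanline filter: invert the colour channels, keep alpha. */
static void process_scanline(uint32_t *row, int width)
{
    for (int x = 0; x < width; ++x)
        row[x] ^= 0x00FFFFFFu;
}

/* Each scanline is an independent unit of work; OpenMP hands the rows
   out to its worker threads. The directive goes in front of the loop,
   and every row is processed exactly once. */
void process_image(uint32_t *pixels, int width, int height)
{
    #pragma omp parallel for
    for (int y = 0; y < height; ++y)
        process_scanline(pixels + (size_t)y * width, width);
}
```

Note that OpenMP reuses a pool of threads and distributes the rows among them, rather than literally creating one thread per scanline.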
« Last Edit: April 07, 2013 by hellfire »

Offline ninogenio

Re: using mutlithreading in freebasic
« Reply #9 on: April 07, 2013 »
Thanks very much hellfire!!

I'm currently trying your pointer asm like so:

Code: [Select]
#Include "Tinyptc_ext.Bi"
#Include "Windows.Bi"

Const XRes = 800
Const YRes = 600
Dim Shared As Double WndOrgX = XRes/2
Dim Shared As Double WndOrgY = YRes/2

RANDOMIZE TIMER

Type TimerType
   
    Frequency As LARGE_INTEGER
    LiStart As LARGE_INTEGER
    LiStop As LARGE_INTEGER
    LlTimeDiff As LONGLONG
    MDuration As Double

End Type

Declare Sub         MmxAdd( ColorBuffer As uInteger Ptr, SourceBuffer As uInteger Ptr, ByVal BufferWidth As Integer )
Declare Sub         UnMmxSseAdd( ColorBuffer(), ByRef Color1 As uInteger, ByVal Color2 As uInteger, ByVal Ypos As uByte, ByVal BufferWidth As uByte )
Declare Sub         MyLine( byval x1 as integer, byval y1 as integer, byval x2 as integer, byval y2 as integer, ByVal Col As Integer )
Declare Sub         PtcOpen()
Declare Sub         StartTimer( TempTimer As TimerType Ptr )
Declare Sub         DestroyTimer( TempTimer As TimerType Ptr )
Declare Function    NewTimer() As TimerType Ptr
Declare Function    GetTimerMs( TempTimer As TimerType Ptr ) As Double
Declare Function    GetTimerSec( TempTimer As TimerType Ptr ) As Double

Dim Shared As Integer Buffer( XRes * YRes )
Dim Shared As TimerType Ptr FrameTimer

FrameTimer = NewTimer()

PtcOpen()

Dim Shared As uInteger SourceColor = (120 shl 24) Or (40 shl 16) or (30 shl 8) or 30
Dim Shared As uInteger DestColor = (100 shl 24) Or (30 shl 16) or (20 shl 8) or 20
Dim Shared As uByte ColBuffWidth = 255, ColBuffHeight = 255
Dim Shared As uInteger ColorBuffer( ColBuffWidth * ColBuffHeight )
Dim Shared As uInteger AddBuffer( ColBuffWidth * ColBuffHeight )
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
' Test''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
   
    Ptc_Update( @Buffer(0) )
   
    For Y = 0 To ColBuffHeight
        For X = 0 To ColBuffWidth
            ColorBuffer(X*Y) = SourceColor
            AddBuffer(X*Y) = DestColor
        Next
    Next
   
    StartTimer( FrameTimer )
    MmxAdd( @ColorBuffer(0), @AddBuffer(0), (ColBuffWidth-1)*(ColBuffHeight-1) )
    Asm
        emms                     ' empty mmx state: makes the fpu usable again
    End Asm
    Print "Add With Mmx:   "+Str(GetTimerMs( FrameTimer ))
    Print Str(ColorBuffer(0) And 255)

''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''

StartTimer( FrameTimer )
do
Loop While (  GetTimerMs( FrameTimer ) <= 4000 )'60fps Clamp

DestroyTimer( FrameTimer )



Sub MmxAdd( ColorBuffer As uInteger Ptr, SourceBuffer As uInteger Ptr, ByVal BufferWidth As Integer )
   
    ASM
       
        mov      edi, [ColorBuffer]   ' load address of dst buffer
        mov      esi, [SourceBuffer]   ' load address of src buffer
        mov      ecx, [BufferWidth] ' hold number of pixels in ecx
       
        pixelloop:
        movd     mm0,[edi]    ' load rgba from dst buffer
        movd     mm1,[esi]    ' load rgba from src buffer
        paddusb  mm0,mm1      ' add rgba
        movd     [edi],mm0    ' store in dst buffer
        add      edi,4        ' increment address to next pixel in dst buffer
        add      esi,4        ' next pixel in src buffer
        dec      ecx          ' decrease number of pixels
        jnz      pixelloop    ' continue until ecx=0

    END ASM
   
End Sub



Sub UnMmxSseAdd( ColorBuffer(), ByRef Color1 As uInteger, ByVal Color2 As uInteger, ByVal Ypos As uByte, ByVal BufferWidth As uByte )
   
    Dim As uInteger c1
   
    For X = 0 To BufferWidth - 1
        Dim As uInteger a =     (color1 Shr 24  And 255) + (color2 Shr 24   And 255)
        Dim As uInteger r =     (color1 Shr 16  And 255) + (color2 Shr 16   And 255)
        Dim As uInteger g =     (color1 Shr 8   And 255) + (color2 Shr 8    And 255)
        Dim As uInteger b =     (color1         And 255) + (color2          And 255)
   
        'saturate components at 255
        If (a > 255) Then a = 255
        If (r > 255) Then r = 255
        If (g > 255) Then g = 255
        If (b > 255) Then b = 255

        'store componets back into c1
        c1 = (a Shl 24) Or (r Shl 16) Or (g Shl 8) Or b
        ColorBuffer(X*Ypos) = c1
    Next
   
End Sub



Sub PtcOpen()
   
    Ptc_AllowClose(0)
    Ptc_SetDialog(1,"test"+CHR$(13)+"FullScreen",0)
    Ptc_SetFlip(1)
    If ( Ptc_Open( "Tunnel", XRes, YRes ) = 0 ) Then
        End - 1
    End If
   
End Sub



Function NewTimer() As TimerType Ptr
   
    Dim As TimerType Ptr TempTimer
   
    TempTimer = CAllocate( SizeOf( TimerType ) )
    QueryPerformanceFrequency( @TempTimer->Frequency )
    NewTimer = TempTimer
   
End Function



Sub StartTimer( TempTimer As TimerType Ptr )
   
    QueryPerformanceCounter( @TempTimer->LiStart )
   
End Sub



Function GetTimerMs( TempTimer As TimerType Ptr ) As Double
   
    QueryPerformanceCounter( @TempTimer->LiStop )
    TempTimer->LlTimeDiff = TempTimer->LiStop.QuadPart - TempTimer->LiStart.QuadPart
    TempTimer->MDuration = Cast( Double, TempTimer->LlTimeDiff ) * 1000.0 / Cast( Double , TempTimer->Frequency.QuadPart )
    GetTimerMs = TempTimer->MDuration
   
End Function



Function GetTimerSec( TempTimer As TimerType Ptr ) As Double
   
    QueryPerformanceCounter( @TempTimer->LiStop )
    TempTimer->LlTimeDiff = TempTimer->LiStop.QuadPart - TempTimer->LiStart.QuadPart
    TempTimer->MDuration = Cast( Double, TempTimer->LlTimeDiff ) * 1000.0 / Cast( Double , TempTimer->Frequency.QuadPart )
    GetTimerSec = TempTimer->MDuration/1000.0
   
End Function



Sub DestroyTimer( TempTimer As TimerType Ptr )
   
    If ( TempTimer ) Then
        DeAllocate( TempTimer )
    EndIf
   
End Sub

But it crashes when run. I noticed that when I comment out the add edi,4 and add esi,4 it runs fine, but obviously it never gets past the first element of the buffer. I'm not sure, however, whether I've done something silly somewhere else that's making it fail at that part.

Offline hellfire

Re: using mutlithreading in freebasic
« Reply #10 on: April 07, 2013 »
Well, the idea was to give the routine a whole buffer to process.
Something like
Code: [Select]
dim buffer1 as UInteger Ptr
dim buffer2 as UInteger Ptr
buffer1 = new UInteger[width*height]
buffer2 = new UInteger[width*height]
mmxAdd(buffer1, buffer2, width*height)

So I've thrown this into the tinyptc example source and it seems to work:
Code: [Select]
' from tinyptc example
dim shared image(320*240) as integer
loadtexture("media\fblogo.bmp", @image(0))

' add image to itself
mmxadd(@image(0), @image(0), 320*240)
(image got brighter)

Had to make some minor changes to your code to make it compile with fb 0.24 and tinyptc_ext++ (source attached).
Your init code was a bit awkward; I think what you wanted to do was this:
Code: [Select]
Dim Shared As uInteger ColorBuffer( ColBuffWidth * ColBuffHeight )
Dim Shared As uInteger AddBuffer( ColBuffWidth * ColBuffHeight )

...

    For Y = 0 To ColBuffHeight-1
        For X = 0 To ColBuffWidth-1
            ColorBuffer(Y*ColBuffWidth+X) = SourceColor
            AddBuffer(Y*ColBuffWidth+X) = DestColor
        Next
    Next

...
   
    MmxAdd( @ColorBuffer(0), @AddBuffer(0), ColBuffWidth*ColBuffHeight )

And the results for a 1024x1024 buffer are
With Mmx:   4.1467
Without Mmx: 18.456
Factor 4. Nice :)

« Last Edit: April 07, 2013 by hellfire »

Offline ninogenio

Re: using mutlithreading in freebasic
« Reply #11 on: April 07, 2013 »
That's amazing! :cheers:

I'm getting 4.1 with and 17.9 without. I honestly can't believe how much faster this is! Thanks very, very much hellfire. I'll be able to start messing about and properly integrating this into my projects now.

Offline hellfire

Re: using mutlithreading in freebasic
« Reply #12 on: April 07, 2013 »
No problem, mate.
Have fun :)

Offline hellfire

Re: using mutlithreading in freebasic
« Reply #13 on: April 07, 2013 »
One more thing.
You can change the innerloop to process two pixels at a time:
Code: [Select]
        shr     ecx, 1     ' number of pixels / 2 (assumes an even, non-zero pixel count)
    pixelloop:
        movq     mm0,[edi]   ' read quad-word instead of double-word (two pixels)
        movq     mm1,[esi]
        paddusb  mm0,mm1
        movq     [edi],mm0
        add      edi,8        ' increment address by two pixels
        add      esi,8       
        dec      ecx         
        jnz      pixelloop   
But it won't get noticeably faster!
That's because you're already spending most of the time waiting for the memory to deliver data.
So you can actually put more arithmetic instructions into the loop without getting any slower.
This also means you won't gain much from using multiple threads unless the individual threads can work on shared data that is already in the cache.

This depends very much on the size of your image, though. If it's significantly smaller and most of it is still cached when the next pass starts, the loop will execute much faster.
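One way to take advantage of that spare arithmetic is to fuse two filter passes into a single loop, so each pixel is fetched from memory only once. The C sketch below combines a saturating add with a brightness scale; the function names and the "level" parameter (0..256, where 256 means unchanged) are assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Saturating add of one 8-bit channel. */
static uint32_t add_sat(uint32_t a, uint32_t b)
{
    uint32_t s = a + b;
    return s > 255u ? 255u : s;
}

/* One pass over memory: add src to dst with saturation, then scale the
   result by level/256, instead of running an "add" pass followed by a
   separate "scale" pass over the same buffer. */
void add_then_scale(uint32_t *dst, const uint32_t *src, int count, uint32_t level)
{
    for (int i = 0; i < count; ++i) {
        uint32_t d = dst[i], s = src[i], out = 0;
        for (int shift = 0; shift < 32; shift += 8) {
            uint32_t c = add_sat((d >> shift) & 255u, (s >> shift) & 255u);
            out |= ((c * level) >> 8) << shift;
        }
        dst[i] = out;
    }
}
```

On a buffer that is too large for the cache, the fused version should cost roughly one memory pass instead of two, although the actual gain would have to be measured.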
« Last Edit: April 07, 2013 by hellfire »

Offline ninogenio

Re: using mutlithreading in freebasic
« Reply #14 on: April 07, 2013 »
I must confess I've been on and off the computer all day :). I've redone loads of routines, some of them over and over again, and the amount of performance FreeBASIC leaves on the table is crazy. Generally, tight routines that MMX is suited to can gain anywhere between 2x and 4x, and even then I don't think I'm anywhere near fully optimized.

Like you say, threading becomes less significant when using
Code: [Select]
        shr     ecx, 1     ' number of pixels / 2
    pixelloop:
        movq     mm0,[edi]   ' read quad-word instead of double-word (two pixels)
        movq     mm1,[esi]
        paddusb  mm0,mm1
        movq     [edi],mm0
        add      edi,8        ' increment address by two pixels
        add      esi,8       
        dec      ecx         
        jnz      pixelloop   

but the potential is there for some really good gains in certain circumstances. I guess the more I use these techniques, the better I will be able to decide on the best optimization for my particular routines.

I would advise everyone to learn this stuff. You get the best of both worlds: a taste of x86 asm and some decent performance for your programs.