Author Topic: Need some help solidifying an idea...  (Read 3625 times)

0 Members and 1 Guest are viewing this topic.

Offline Pixel_Outlaw

  • Pentium
  • *****
  • Posts: 1382
  • Karma: 83
    • View Profile
Need some help solidifying an idea...
« on: September 09, 2009 »


I was sitting in college today thinking about how file information is stored on computers. It occured to me that it might be possible to reduce file sizes based on not writing the leading 0's on byte values.

Values such as 00000001 could simply be stored as 1 bit = on
values such as 00000010 could simply be stored as 2 bits = on-off

of course, 11111111 would not have any compression.

This is all fine and well, but the problem now is marking the end of each value without using a special character. I know there are existing data compression methods but I would like to see this idea out if possible. How could I mark the end of each value so it could be converted back to an 8 bit character? Using a special ascii character would be a bad idea, and the files would end up being larger than originally created.

Maybe this cannot be done. But I would like to see the result of file sizes by using 8 bit values and trimming off the leading 0's.
Challenge Trophies Won:

Offline combatking0

  • JavaScript lives!
  • Senior Member
  • DBF Aficionado
  • ********
  • Posts: 4569
  • Karma: 235
  • Retroman!
    • View Profile
    • Combat King's Barcode Battler Home
Re: Need some help solidifying an idea...
« Reply #1 on: September 09, 2009 »
Short of inventing a crazy number like "2", I'm not sure how this could be done.

You could possibly map the number of different bit combinations - if the full 256 are not being used, and instead 128 or less are being used - you could build a map of the bit combinations, and encode the rest of the file in 7-bit chunks.
You are our 9001st visitor.
Challenge Trophies Won:

Offline Pixel_Outlaw

  • Pentium
  • *****
  • Posts: 1382
  • Karma: 83
    • View Profile
Re: Need some help solidifying an idea...
« Reply #2 on: September 09, 2009 »
Yeah, I thought of that too. This idea was purely out of bordom, there may be no way to tokenize the truncated data chunks...
Challenge Trophies Won:

Offline combatking0

  • JavaScript lives!
  • Senior Member
  • DBF Aficionado
  • ********
  • Posts: 4569
  • Karma: 235
  • Retroman!
    • View Profile
    • Combat King's Barcode Battler Home
Re: Need some help solidifying an idea...
« Reply #3 on: September 09, 2009 »
I've seen a similar concept used in barcodes, where zeroes are truncated from the middle of the barcode - utter madness I say!
You are our 9001st visitor.
Challenge Trophies Won:

Offline Jim

  • Founder Member
  • DBF Aficionado
  • ********
  • Posts: 5301
  • Karma: 402
    • View Profile
Re: Need some help solidifying an idea...
« Reply #4 on: September 13, 2009 »
As you've probably spotted, the problem is knowing when one byte finishes and the next one starts.  It occurred to me last night one way of doing this is to use 3 bits to encode how may bits follow.  So each byte would encode to something between 4 and 10 bits.  We can imply the leading 1 so we don't need to store it.
ie.
00000000 -> (1 bits), -> (000),0 -> 0000
00000001 -> (1 bits), -> (000),1 -> 0001
0000001x -> (2 bits),x -> (001),x -> 001x
000001xx -> (3 bits),xx -> (010),xx -> 010xx
00001xxx -> (4 bits),xxx -> (011),xxx -> 011xxx
0001xxxx -> (5 bits),xxxx -> (100),xxxx -> 100xxxx
001xxxxx -> (6 bits),xxxxx -> (101),xxxxx -> 101xxxxx
01xxxxxx -> (7 bits),xxxxxx -> (110),xxxxxx -> 110xxxxxx
1xxxxxxx -> (8 bits),xxxxxxx -> (111),xxxxxxx -> 111xxxxxxx
 
Will it compress?
Number of patterns for each bit length vs compressed length
bits -> occurences -> packed size
n-> 2^(n-1) -> (n-1)+3
0 -> 1 -> 4
1 -> 1 -> 4
2 -> 2 -> 4
3 -> 4 -> 5
4 -> 8 -> 6
5 -> 16 -> 7
6 -> 32 -> 8
7 -> 64 -> 9
8 -> 128 -> 10

So you can see that any values with more than 6 bits get bigger and any with fewer than 6 get smaller. So, 31 shrink, 32 stay the same, 193 get bigger.

But this doesn't mean it won't compress.  No compression algorithm can compress every data stream, and no compression algorithm can ever compress completely random data - which is good news for us.

If the data we were compressing only ever had the lower 6 bits set then we would have a winner.  ASCII data might compress using this algorithm, it never has the 7th bit set so the worst cases go out of the window.

Also, many compression algorithms go in stages - first, rearrange the data, then apply a Huffman compression.  JPEG, MPEG1, MPEG2 all do this, the final bit stream has several layers of compression.  It's possible that rearranging some streams using our algorithm (which will probably make the stream longer) will actually aid in compression with another algorithm like Huffman.

For instance, if all our bytes had 8 bits set, our output stream would be 10/8ths longer, but it now has some structure which might appeal to another compressor.

Jim
« Last Edit: September 13, 2009 by Jim »
Challenge Trophies Won:

Offline rain_storm

  • Here comes the Rain
  • DBF Aficionado
  • ******
  • Posts: 3088
  • Karma: 182
  • Rain never hurt nobody
    • View Profile
    • org_100h
Re: Need some help solidifying an idea...
« Reply #5 on: September 14, 2009 »
If I read the read me correctly I think crinkler does something similar to this when truncating real numbers. Its a great idea

Challenge Trophies Won:

Offline Pixel_Outlaw

  • Pentium
  • *****
  • Posts: 1382
  • Karma: 83
    • View Profile
Re: Need some help solidifying an idea...
« Reply #6 on: September 30, 2009 »


I never noticed there was more conversation on this topic! Thanks guys I'll take a look at it when I can get home.
Challenge Trophies Won: