tiny-c Programming Group exercise for March 12, 2010

Write a file compressor / decompressor in tiny-c.

Files handled by your program are "dictionary files" consisting of single words that use only the 26 capital letters followed by carriage return linefeed delimiters. The compression algorithm should be based on the following scheme:

Letters are divided into two classes, "common" and "rare." Here are the 15 "common" characters.

\n
stands for the newline character.

EISNATROLDCUGP\n

Here are the twelve "rare" characters.

MHBYFVWKZXQJ

The first byte of a compressed file is a "signature byte" of

FF
If the file to be compressed contains the following

A
ABLE
ABORT

...

MEDIUM
MEETING
MEGABAUD

the compressed file should look like so:

Rec  0

FF 4E 4F 28 0E 4F 27 65 E4 F2 76 50 9E 4F 27 65  .NO(.O'e..vP.O'e
13 CE 4F 27 65 2E 4F 27 B5 E4 F2 7F 50 E4 F2 27  ..O'e.O'....P..'
8B 50 8F 3E 4F 22 56 4A 5E 4A A0 80 64 57 6E 4A  .P.>O"VJ^J..dWnJ
A0 D5 E4 AA 0D 54 F2 80 E4 AA 02 2E 4A A0 22 09  .....T......J.".
E4 AA 02 21 F2 80 E4 AA 7F 0F 07 94 50 E4 AA 76  ...!........P..v
91 3C E4 AA BF 0B 84 50 E4 AA B6 45 0E 4A F1 10  .<.....P...E.J..
F5 09 E4 AF 73 7F 68 09 C0 E4 A6 72 2E 4A 51 F5  ....s.h....r.JQ.
45 09 E4 A5 B4 8E 4A 5B 48 8F 3E 49 4D 5E 49 9E  E.....J[H.>IM^I.

Rec  1

49 90 9E 49 91 3C E4 99 15 17 3E 49 91 51 73 48  I..I.<....>I.QsH
E4 99 60 22 13 CE 49 92 E4 9F BB 25 09 E4 97 D5  ..`"..I....%....
09 E4 9F 54 3A 02 E4 F4 50 6E 4C 41 3E 4C 41 32  ...T:...PnLA>LA2
5E 4C 0E 4C C6 1F 54 51 73 E4 C7 E4 C6 00 E4 80  ^L.L..TQs.......
65 E4 81 42 02 E4 88 E4 88 7A 45 17 3E 48 87 F6  e..B.....zE.>H..
E4 88 7F 60 9E 48 87 F6 2E 48 73 0E 48 73 CE 48  ...`.H...Hs.Hs.H
60 49 F3 E4 82 7E 48 50 60 9E 48 50 63 45 0E 48  `I...~HP`.HPcE.H
50 63 45 1F 50 E4 85 F1 7B CF 1E 4F 0E 4F 0F 21  PcE.P...{..O.O.!

...

Rec  33

EF 04 AF 11 30 EF 04 AF 11 30 2E F0 49 0E F0 41  ....0....0..I..A
8E F0 41 3E F0 41 35 41 3E F0 41 35 41 31 3C EF  ..A>.A5A>.A5A1<.
04 F7 0E F0 4F 70 2E F0 43 EF 04 34 C0 6E F0 43  ....Op..C..4.n.C
B4 88 F3 EF 04 3B F4 4A 5B 60 6E F0 43 F3 EF 04  .....;.J[`n.C...
6A F1 EF 04 6F 7E F0 46 F7 09 EF 04 6F 70 5E F0  j...o~.F....op^.
46 F7 13 CE F0 42 50 6E F0 45 06 14 8E F0 4F 91  F....BPn.E....O.
F0 BF 0E F0 4F 3E F0 4F 3F 20 EF 00 EF 00 43 2E  ....O>.O? ....C.
F0 04 2B 60 9E F0 09 14 EF 00 91 BF 0E F0 00 51  ..+`...........Q

Rec  34

3C EF 00 C4 F2 4B 9E FC                          <....K..

Explanation:

Byte 0 is the signature byte.

Byte 1 has two hexadecimal nibbles (a "nibble" is 4 "bits.") The 4 is because A is the 4th character in the "common" characters (we index using a base of 0). The E is because there's a newline after the "leading word" "A" in our input file and E is hexadecimal for decimal 14 and newline is the 14th character in the "common" characters (again, we index using a base of 0).

Byte 2 has two hexadecimal nibbles. The 4 is because A is the 4th character in the "common" characters. The F signals that we are have a "rare" character.

Byte 3 has two hexadecimal nibbles. The 2 is because B is the 2nd character in the "rare" characters. The 8 is because the L is the 8th "common character."

This compression method thus requires only one nibble for "common," two nibbles (i.e. one byte) for "rare" characters. File size is cut roughly in half by this relatively simple method. Since input files have "common" and "rare" characters at random spots, the signal you are entering "rare" mode might occur as a leading nibble or a trailing nibble.

End of file details:

If you find you have an unwritten nibble at the end of the compression, construct a full byte to write by using an

F
for the right nibble. Put this byte. Then put a hexadecimal
CC
as the last byte. If you don't have an unwritten nibble at the end, just put a hexadecimal
FC
as the last byte in your compressed file.

Test your decompressor on the file: SPTINY.LCX

This file looks like so if "dumped" in hexadecimal and ASCII.

Rec  0

FF 51 3F 3E AE D6 7C 64 F0 F0 06 2E 67 AF 7F CC  .Q?>..|d....g...
This is the compressed version of SPTINY.LEX

Test your compressor on the file: SP.LEX

Spoiler:

Solutions to the above are at

compress.tc

dompress.tc

A flowchart for the decompressor is at dompress.html