This is the third post in a series.
- Part 1: Studying an Old E-Reader for Fun
- Part 2: Studying an Old E-Reader for Fun : Text Compression 1
- Part 3: Studying an Old E-Reader : Compressing a Dictionary with Huffman Codes
- Part 4: Studying an Old E-Reader : Compressing a Dictionary with a Prefix Tree (coming soon)
Last week I established a few points that are relevant for today's post:
- The text of the King James Bible comes in at around 4.3MB for the raw text
- Using a modern compressor, it is possible to compress the raw text to about 740KB
- The KJ-21 (the device I am looking at in this series) likely uses some rules of the English language to eliminate storing " " (space)
- The KJ-21 stores three separate "files"
- A "lexicon" - a listing of the unique words used in the text
- A "text search-subfile" - storing key words as indexes back to the lexicon, plus offsets when words are repeated
- A "text master-subfile" which stores stop words and placeholders for words in the "search-subfile"
Today I want to go over how it is possible to compress text, specifically how to compress a dictionary of text, or in the terminology of the KJ-21, the "lexicon" file.
From what I have gleaned from the patents concerning the KJ-21, there is a dictionary listing of every word that the device knows about. That dictionary is the one place that the actual letters of those words are stored. The word "the" is only in memory in one place, instead of the thousands of times it is appears in the text.
Once all of the words are consolidated to the dictionary, we can then take a few steps to try to reduce how much memory it consumes. A very simple method of doing this is to use Huffman codes.
There are a couple of really good videos on the subject of Huffman coding which I will link to at the bottom, but I will do my best to describe the purpose here as it feeds into the rest of the discussion about the KJ-21.
When a computer writes the letter "A" to disk on most computer systems, it is written as a single byte. A byte is made up of 8 bits (1's or 0's). So we can say that the letter "A" is encoded as 8 bits. Most computer systems uses a variant of ASCII, which defines what letters and symbols can be represented in 8 bits, or the numbers 0-255. ASCII defines that the letter "A" is represented by the number 65, which is encoded as
01000001. Lower case "a" is represented by number 97, which is encoded as
When encoding English text, we usually are only dealing with a subset of the numbers 0-255. Specifically, 26 upper-case characters, 26 lower-case, 10 digits, space, and maybe 10 or so punctuation marks - for a total of 73 characters (give or take).
73 is far less than 255, and it turns out we could just use 7 bits to represent 73 different codes (in fact, 7 bits can represent 128 codes). Computers like to work in multiples of 2, which is how we landed at 8 bits, but they can work with 7 bits as well. So automatically we could save 1/8th of our space if we just encoded English letters with 7 bits.
But it turns out we can do better than that using variable-width codes, and one of the most widely used ones is the Huffman code.
Huffman codes take into account how frequent a specific code might appear, and encodes the most frequent codes in the fewest bits. In English text, the letter "e" typically is the most frequently used letter making up 13% of the letters in most texts. Then "t" (9%), "a" (8%), "o" (7.5%), etc.. until we get to "z" which is used only about 0.074% of the time. Adding spaces and capitalization to this distribution mixes things up a little bit - the "space" would likely be more prevalent than "e", and upper case "E" likely less prevalent than "e".
What if we could encode the letter "e" using 3 or 4 bits, at the expense of the letter "z" taking 7, 8 or maybe even 9 bits?
There is an algorithm that can build the most optimal Huffman code for a given text, taking into account the distribution of the letters being used in that specific text, and it assigns a bit code to represent a letter. The code table is stored with the text, because without it you would just see a stream of 1's and 0's and have no means of decoding them.
Wikipedia uses the following example:
this is an example of a huffman treesample text from Wikipedia
Here is what that text looks like encoded, with spaces and new lines added for clarity. In reality, there would be no spacing.
0110 1010 1000 1011 (this) 111 (space) 1000 1011 (is) 111 (space) 010 0010 (an) 111 (space) 000 10010 000 0111 10011 11001 000 (example) 111 (space) 00110 1101 (of) 111 (space) 010 (a) 111 (space) 1010 00111 1101 1101 0111 010 0010 (huffman) 111 (space) 0110 11000 000 000 (tree)
The text "this is an example of a huffman tree" is 36 characters long, but can be represented in 135 bits, or 17 bytes (not counting the lookup table).
Encoding the KJV Lexicon
The KJV has 13,600 distinct "tokens". These vast majority of these are words, but it does contain a few punctuation marks as well. These are also case-sensitive.
If we count how many bytes in ASCII coding make up the word list, we get 95,433. If we add a space (or a newline, or a
NULL between each word) that takes us up to 109,033. That is our number to beat.
We need to figure out how to compress this dictionary to something significantly smaller than 109kb, and this is where Huffman codes can come into play.
If we use
NULL as our separator, we get the following distributions over a sampling of some characters:
NULL separator is, not surprisingly, the most common code. We have one of these per word as a separator. And, we only need 3 bits to encode it -
The second most common character in the lexicon is the letter "e", which appears 12,182 times. That is to say, out of 109,033 bytes of raw text in the lexicon, 12,182 of them are the letter "e", about 11% (close to the 13% cited earlier). We can represent the letter "e" using the code
The letter "a" appears a little bit less frequently, and needs 4 bits to store it as
0100. We finally get to some of the characters like certain punctuation marks, which are only stored once in the lexicon. These need 17 bits to store their codes.
Writing the Huffman Tree to Disk
I will post some code that shows how the Huffman code tree is written in the near future, but I can summarize it. We don't actually need to write "The letter a is encoded as
0100". Instead, we just need to know how many bits are needed for each code. First write out how many characters your coding scheme covers - we cover 0-123, so write
NULL needs 3 bits, we write
3. We then don't use any characters from 1-32, so write out 31 zeroes. We then use character 33,
! once and it is encoded in 17 bits, so write out 17. Continue doing this until you reach your final character.
In hex, it looks like this:
007b # 7b : 123. We stop at the letter 'z', which is 123 in ASCII 03000000000000000000000000000000 # 3 bits for NULL, 0 bits for 1-15 00000000000000000000000000000000 # 0 bits for 16-31 001100000000001111110000110d1100 # 0 bit for 32, 0x11 = 17 bits for "!" (33) 00000000000000000000111100000011 # etc... 000809090a090b09080a090a0a080a0a 090f0908090b0d0b000d0a1100100000 000406050503060604040a0705060405 060a0404040507070a0708 # 0x08 = 8 bits needed for "z" (123)
There is a lot of repetition there, so the full tree can be written in 64 bytes with some simple run-length encoding applied. The entire Huffman-encoded lexicon for the KJV, with the serialized tree header, comes in at 60,990 bytes. This is pretty good - about 56% of the original size.
If we wanted to try just storing words once, case-insensitive, it helps a tiny bit. 12,616 words, encoded with Huffman codes, can be written with a 40 byte header and 54,323 bytes of data.
For comparison, we can also put this up against general-purpose compressors.
- Raw: 109,033
- Huffman: 60,990
- Huffman, lower-case only : 54,363
- GZIP: 49,739
- DEFLATE: 49,727
- BZIP2: 45,065
- XZ: 43,856
- LMZA: 43,786
Part of the overall goal here is to come up with a solution that would have been feasible on a 16-bit processor made in 1989, with only 2kb of memory to work with. So any algorithm other than maybe DEFLATE (which is the simplest of the above) would be out of the question.
There are tools though that can take this one step further, without having to resort to a general-purpose compressor.
Next time : prefix trees.
There are two good videos that explain Huffman codes. I linked to one last week, and here is another one.