Studying an Old E-Reader for Fun : Text Compression 1

This is the second post in a series.

Programming Tradeoffs

A lot of what we do in Computer Science is dealing with tradeoffs. One such tradeoff is structuring our applications and data for fast performance while at the same time optimizing the amount of storage we use.

As CPUs continue to get faster, and storage gets more dense and cheaper, a lot of developers forget that at one point these tradeoffs were big deals. This is why today we have relatively simple iPhone applications that are hundreds of megabytes in size, consuming more space than my first computer had for DOS, Windows, games, office suite, and room to spare - all on a 320 MB drive.

The State of Storage on our Phones

The largest application on my phone is DJI Go 4, an app for controlling my camera drone. It comes in at 515.9 MB, and that is just for the application. The second largest is Spotify, at 130.9MB (plus 9.31GB for the cache of music). I have 256GB of total storage, so it's not like I am hurting for space, but 64GB devices are still being sold, and bloated apps are more likely to affect those users.

This series of articles though is about an electronic Bible. So, on my iPhone, I have the following:

  • Logos Bible, version 9.3.0. Application: 180.2MB, data : 80.5MB. Translations downloaded: KJV, LEB, ASV
  • YouVersion, version 8.24.3. Application: 101.8MB, data: 94.7MB (CEB, CEV, ESV, YLT98)
  • Blue Letter Bible, version 78.9. Application 78.9MB, data 88MB. (CSB, KJV, ESV, YLT)

The trend I am seeing with these applications is that each installed translation consumes roughly 20-30 MB.

The raw text of the KJV at Project Guttenberg is 4.3MB. EPUB comes in at 1.5MB (which uses ZIP compression), and Kindle is 6.2MB. The EPUB for the KJV with additional formatting, footnotes, and the Apocrypha (additional books that are not included in most translations) comes in at 2.7MB. These three mobile apps store each translation in about 6-7x the size of the raw text.

So what are these mobile apps doing?

My guess is that they are storing the text in an uncompressed format, likely with hypertext (possibly HTML) to support formatting. It is also possible that the data is being loaded into a database (such as SQLite) to facilitate searching.

Compression

Very smart people have come up with algorithms that take data and make it consume less space. There are many different types of compressors out there. Some are general purpose - they don't understand anything about the data they are given, other than there is likely some repetition that can be optimized away. Then there are special purpose compressors, that can understand the type of file they are given and apply their magic better than a general purpose compressor could. Generally these exist for images, audio, and video, but there are even some out there for things for files that have a lot of numbers but not a lot of "text".

Under special purpose compressors there are additional categories - lossy and lossless. Lossy compressors for audio, video and still images take advantage of human perception, discarding some of the finer details that we won't generally notice. Lossless compressors, on the other hand, can take input A, compress it to B, and reconstruct A exactly. This comes at the expense of not compressing as small as it would using a lossy compression, but usually better than a general purpose compressor.

Now, in terms of the Bible, the KJV comes in at 4.3MB. Someone who goes by Yuki has already attempted to figure out how well the KJV can be compressed.

Compression MethodSize (bytes)Ratio
zpaq -m5739,40716.682%
bzip2 -9993,40622.412%
lzma -91,048,40823.653%
xz -91,048,61623.658%
7z -mx91,048,71023.660%
zstd –ultra -221,068,13724.099%
rar -m51,142,36025.773%
gzip -91,385,45731.258%
zip -91,385,59531.261%
lz4 -91,596,41836.017%
lzop -91,611,93936.367%
Uncompressed4,432,375100%

zpaq does a really good job of compressing the KJV. This compression algorithm consistently ranks as one of the best at compressing large bodies of text, and has won some awards for doing so. It certainly is not as mainstream as 7zip, RAR or ZIP. The tradeoff is its speed. Decompressing the KJV in Yuki's tests takes over 16 seconds using zpaq, where most other methods took under 100 milliseconds. zpaq is primarily for archival of data (think - backups of your computer), where you might only be retrieving it once per year, if even that frequently.

Most of these decompressors need a lot more RAM than the KJ-21 has (just 2KB) to work, and these compression numbers are for compressing the entire text. That means if we want to scan to the very end (the book of Revelations), we first have to decompress all 65 books before it. If we were to compress the KJV book by book (66 smaller "files"), access to a single book would be faster, but would come at the expense of larger file sizes, as compressors generally are more efficient space-wise when they have more input data to work with.

1990's Technology

As I said in the first post, the KJ-21 has 1.125MB of storage, and that includes not only the text, but the program code, a limited dictionary, footnotes, help text, and structures to enable searching. My guess is that the text data itself likely fits in well under 1MB. We can see that with modern general-purposes compressors, that this certainly an easily attainable goal.

It turns out the KJ-21 had a few contemporary applications that were able to achieve similar compression ratios.

Yuki's post is actually about a similar device from the same era. There was a Game Boy cartridge that contained the entire KJV, plus a couple mini-games. It also includes two search modes, with some features that overlap the KJ-21 (which I will get to in a future post). The ROM for the cartridge comes in at slightly less than 1MB, so their developers are likely using something similar to the KJ-21.

There is also a copy of the KJV called "Christ On Disk" which is designed to fit on a 1.4MB floppy. It is a Windows executable that is fully self contained. It is a very basic application, and does not have any search capabilities.

Christ On Disk
ApplicationSize
Raw KJV4.3MB
Christ on Disk - Win32 Executable, text1.3MB
KJ-21 - 16-bit program, text, inflection db, dictionary, help text1.125MB
Wisdom Tree, Game Boy - 8-bit program, text, games1MB
Summary of applications and their sizes

How can we compress the large body of text that makes up the KJV in to less than 1MB while being able scan to a specific location in it with limited memory and CPU power?

A Patents!

Patent 5153831, by the software developer behind many of Franklin / Proximity's devices, gives some hints at the compression being employed by the KJ-21.

US Patent 5153831 Figure 2

Instead of storing the text, the KJ-21 could store two files - a "lexicon file" and a "text file". The lexicon file contains a unique list of all words in the text, and the text file uses numbers to look up values in the lexicon. Using the example above, instead of storing the word "PETER" - 5 bytes, it can store it using a single integer (2 or fewer bytes).

Every token in the text is assigned a number in the lexicon. Yes, I said token - it could be a word, it could be a punctuation mark. One thing that is notably missing is the SPACE character, more on that later.

Lexicon Tokens

Out of the translations that can be downloaded that are in the public domain, the KJV is actually the simpler side. One would think the KJV, with all of its "thee", "thou" and older older terms it might have more words, but translations that use more modern language actually have more distinct tokens. There is a noticeable difference though with the BBE - The Bible in Basic English. This is a translation specifically designed for readers "with limited education or where English is a second language". It has fewer distinct tokens since it uses simpler words (1,000 distinct English words plus proper nouns), but has a total of about 4% more tokens than other translations.

TranslationDistinct TokensTotal TokensTokens w/ 1 Occurrence
ASV14,530917,0594,872
BBE7,336956,0402,311
KJV13,600917,0894,393
WEB15,549904,4915,550
YLT15,044938,2915,297

The KJV contains 917,089 total tokens, including punctuation but excluding newlines and spaces. The numbers 0-13,600 can be represented in less than two bytes (14 bits, which covers the range of 0-16,383, whereas 2 bytes is 16 bits, or the numbers 0-65,535).

For what it's worth, of those 13,600 distinct tokens, we could reduce that to 12,616 if we did not care about case-sensitivity.

Here is what a sampling of the lexicon of the KJV looks like, sorted by number of occurrences.

Index (0 based)TokenOccurrences
0, (comma)70,574
1the62,064
2and38,847
3of34,427
4. (period)26,201
5to13,378
6And12,846
7: (colon)12,698
15his8,385
31which4,283
63house2,023
127among903
255under391
511hold174
1023Bethel66
2047Mine26
4096Uri8
8191Besai2

You can see that at index 8,191 we have a token that only appears in the entire text only twice, and we have a total of 13,570 tokens. There are a lot of tokens that only appear once (4,393), two (1,885) or three (1,053 )times. Those 7,331 tokens in the KJV that appear 1-3 times make up a little more than half of the total of 13,570 tokens. Many of these are proper names.

If we take a very naïve approach and just use the full two bytes, this means we could encode the text (excluding the lexicon) in 970,000 x 2 = 1,940,000 bytes, or about 45% of the size of the raw text. Encoding the file using 14 bits (14 * 970,000 / 8) = 1,697,500 bytes, or about 40% of the size of the raw text. Once we throw in the lexicon data, the overall filesize is still somewhere close to 2MB, nowhere near as small as the sub 1MB that the text consumes on the KJ-21. This is still too big, so the KJ-21 and the other applications are doing something else.

What About Spaces?

It turns out that if we were to include spaces, there would be almost as many spaces as there are tokens - a full 758,532 spaces exist in the CSV version of the KJV, and there would be another 31,000 or so newlines. We really don't want to store those.

We could store them as part of the word tokens - for example using the patent example, we could store "PETER ", "PIPER ", "PICKED " in our lexicon, but what about "PIPER?" ? - We now have stored piper twice - once with a space, once without.

Just like how image compression algorithms can use some tricks to store images better than a general-purpose compressor, we can employ some tricks of our own by using a little bit of code. Now, these rules would likely only work for some Western languages, and would only work for texts that follow normal grammatical rules. Given those constraints, the decompressor can use a little bit of code to not store any spaces at all:

  1. After decompressing a token, write a space
  2. If we have decompressed " ' s ", rewrite that as "'s"
  3. If we decompress " ( ", rewrite that as " (" (eliminate the space after the opening parenthesis)
  4. When we encounter any space followed by any other punctuation mark, eliminate the space

Peter Piper, Peter Piper, Peter Piper - Eliminating Repetition

The KJ-21 takes a few more steps. The first one is that it uses lookahead markers for words that are repeated. It took me reading through the patent 5153831 a few times to understand the positive and negative numbers, but here is what they mean:

  • A positive number means read this many spaces ahead to find out what word to use. When skipping ahead, it is possible to encounter another positive number.
  • If after reading that many spaces ahead another positive number is encountered, repeat that process - skipping ahead again that number of spaces.
  • Once a negative number is encountered, negate it (make it positive) and look it up in the Lexicon.
  • The word from the Lexicon is inserted where the negative number was encountered, and any of the references in the chain that took us to that negative number.

Using the sample from the patent, the first two numbers we encounter are 10. If you skip ahead from each of those 10 spaces, you encounter another two 17's. 17 spaces after those you get -5 and -11. -5 corresponds to "PETER" and -11 corresponds to "PIPER" in the lexicon. The first 10, the first 17 and the -5 get replaced with "PETER", and the second 10, second 17 and -11 are replaced with "PIPER".

This compression scheme is a little bit different than compressors that were common at the time, like LZ77 (there is a video that explains this at the bottom). LZ will look for repeating bytes, and will write a backwards offset and length, usually expressed in 16 bits (12 for the offset, 4 for the length). The KJ-21 compression method is not dealing with bytes but rather is dealing with tokens, so it does not need the length component, just an offset. For some reason they chose to use a forward lookup instead of a backwards: this may have been done to avoid patent issues, or maybe there is a reason that we will discover later that makes sense with how this device works.

The decompression code would look something like the following:

var compressedTokens: Int[] = loadTokensFromFile()
var lexicon: String[] = loadLexiconFromTime()
var decompressedTokens = []

for (i = 0 ; i < compressedTokens.length ; i++ ) {
  var t = compressedTokens[i] 
  while (t > 0) {
    var n = compressedTokens[i+t]
    if (n < 0) { 
      t = n
    } else {
      t = t + n
    }
  }
  decompressedTokens[i] = lexicon[-t]
}
var text = joinTokensWithSpaces(decompressedTokens)

PETER PIPER PICKED PECK PICKLED PEPPERS - Stop Words

Read that headline again. You know the full phrase - "PETER PIPER PICKED A PECK OF PICKLED PEPPERS". The "A" and "OF" were omitted. Did you notice that at a a first, quick reading?

There is one final trick that the KJ-21 employs: the text is split into two files, storing "stop words" separately. This is a topic I want to address in a future post about searching as it has more relevance there, but for now let's say that stop words are words like "A", "AN", "THE", "OR", "IS", "OF", etc. These words are extremely common, and usually are not significant for search purposes.

In the KVJ, the top 8 tokens (which includes some punctuation) are ",", "the", "and", "of", ".", "to", "And", ":". These account for 271,036 of the 917,089 total tokens, or about 30%.

Looking at the patent, there is a "master-subfile" and a "text-file". The "master-subfile" contains the stop words and punctuation, and then markers to indicate that a word should be looked up in the separately compressed "text-file".

I will explain this a little further in the next post, but by separating the text out like this, the KJ-21 can compress the files slightly differently, and at the same time dramatically reduce how much has to be read to accomplish searching.

Next up, hopefully next week: how to actually compactly store text.

Further Learning

These two videos will give some insight into where this series is headed.

Leave a Reply