Compressing Our Digital World

We live in a wasteful digital world. Our data streams were originally designed with little regard for reducing latency or for optimizing network bandwidth and storage. But in a world where we need almost instant responses for our data transfers, we often implement compression. The most popular standard for compressing data streams is now gzip, which is a combination of LZ77 (Lempel–Ziv 1977) and Huffman coding.

With gzip, we can compress data into a stream and convert it to Base64 [here]:

Input:  hello
Compressed: eJzLSM3JyQcABiwCFQ==
Compressed: <Buffer 78 9c cb 48 cd c9 c9 07 00 06 2c 02 15>

And then we can decompress it [here]:

Input: eJwLSS0uMTQyBgAJ6gI3
Uncompressed: Test123
Uncompressed: <Buffer 54 65 73 74 31 32 33>

The compression does not look good in these cases, as we need repeated character sequences, and enough data, to justify the compression overhead. Now let's use repeated characters, and we see that the compressed stream stays short [here]:

Input: abcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabc
Compressed: eJxLTEpOpC8CAM4DK6U=

But watch what happens if we do not have repeated characters:

Input: Go learn some crypto, Python and JavaScript
Compressed: eJxzz1fISU0sylMozs9NVUguqiwoyddRCKgsycjPU0jMS1HwSixLDE4uyiwoAQBTyQ+2

In this case the compressed data is longer than the input. The methods thus find repeated and common patterns and then store these with a smaller number of bits than average.

Increasingly, Web pages send their content in a gzip format, which is then decompressed at the other end. The code we have used in this case is Node.js:

var zlib = require('zlib');

var test = "hello";
var flag = "zip";

console.log("Input: ", test);

if (flag == "zip") {
  var input = Buffer.from(test);
  zlib.deflate(input, function (err, buf) {
    if (err) throw err;
    console.log("Compressed: ", buf.toString('base64'));
    // console.log("Compressed: ", buf);
  });
} else {
  var input = Buffer.from(test, 'base64');
  zlib.inflate(input, function (err, buf) {
    if (err) throw err;
    console.log("Uncompressed:", buf.toString('utf8'));
    // console.log("Uncompressed: ", buf);
  });
}

To make a .gz file, we add on the header bytes “1f 8b 08” (the gzip magic number of 1f 8b, followed by 08 for the deflate method) and then add the gzip stream data bytes [here]:

Now let’s look at how Huffman and LZ77 coding work.

Huffman coding

Huffman coding uses a variable-length code for each of the elements within the data. This normally involves analyzing the data to determine the probability of its elements. The most probable elements are coded with a few bits and the least probable are coded with a greater number of bits. This could be done on a character-by-character basis in a text file, or on a byte-by-byte basis for other files. The following example relates to characters. First, the textual data is scanned to determine the number of occurrences of each letter. For example:

Letter:             'b'  'c'  'e'  'i'  'o'  'p'
No. of occurrences:  12    3   57   51   33   20

Next the characters are arranged in order of their number of occurrences, such as:

'e'  'i'  'o'  'p'  'b'  'c'
 57   51   33   20   12    3

After this the two least probable characters are assigned either a 0 or a 1. Figure 1 shows that the least probable (‘c’) has been assigned a 0 and the next least probable (‘b’) has been assigned a 1. The addition of the number of occurrences for these is then taken into the next column and the occurrence values are again arranged in descending order (that is, 57, 51, 33, 20 and 15). As with the first column, the least probable occurrence is assigned a 0 and the next least probable occurrence is assigned a 1. This continues until the last column. When complete, the Huffman-coded values are read from left to right and the bits are listed from right to left.

The final coding will be:

'e'  11     'i'  10
'o'  00     'p'  011
'b'  0101   'c'  0100

Figure 1

test

t 2 ------> 2 [0]
e 1 [0] --> 2 [1]
s 1 [1]

t e  s  t
0 10 11 0

hello

l 2 ------> 2 [1]
h 1 -----> 1 [0] ---> 3 [0]
e 1 [0] ------------> 2 [1]
o 1 [1]

00 01 11 11 10
h  e  l  l  o
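The tree-building procedure above can be sketched in a few lines of Node.js (a minimal sketch; the exact 0/1 assignments depend on tie-breaking, but the code lengths match the worked example):

```javascript
// Build a Huffman code from symbol weights by repeatedly merging
// the two least probable nodes into one
function huffman(freqs) {
  var nodes = Object.keys(freqs).map(function (sym) {
    return { sym: sym, w: freqs[sym] };
  });
  while (nodes.length > 1) {
    nodes.sort(function (a, b) { return a.w - b.w; });
    var a = nodes.shift(), b = nodes.shift();
    nodes.push({ w: a.w + b.w, left: a, right: b });
  }
  // Walk the tree, assigning 0 to one branch and 1 to the other
  var codes = {};
  (function walk(node, prefix) {
    if (node.sym !== undefined) { codes[node.sym] = prefix || '0'; return; }
    walk(node.left, prefix + '0');
    walk(node.right, prefix + '1');
  })(nodes[0], '');
  return codes;
}

var codes = huffman({ e: 57, i: 51, o: 33, p: 20, b: 12, c: 3 });
console.log(codes);
```

With these weights, 'e', 'i' and 'o' get 2-bit codes, 'p' a 3-bit code, and 'b' and 'c' 4-bit codes.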

The following is an example [here]:

Input: eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiioooooooooooooooooooooooooooooooooppppppppppppppppppppbbbbbbbbbbbbccc
Symbol  Weight  Huffman Code
e       57      11
i       51      10
o       33      00
p       20      011
b       12      0101
c        3      0100

Coding is:

11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 011 011 011 011 011 011 011 011 011 011 011 011 011 011 011 011 011 011 011 011 0101 0101 0101 0101 0101 0101 0101 0101 0101 0101 0101 0101 0100 0100 0100

Binary digits used: 402
Binary digits (using 8-bit ASCII): 1408. Eff: 28.55 %
Normal coding (3 bits per character): 528. Eff: 76.14 %
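The efficiency figures above can be reproduced with a little arithmetic (a sketch using the weights and code lengths from the table):

```javascript
// Symbol counts and Huffman code lengths from the example above
var counts = { e: 57, i: 51, o: 33, p: 20, b: 12, c: 3 };
var lens   = { e: 2,  i: 2,  o: 2,  p: 3,  b: 4,  c: 4 };

var total = 0, huffBits = 0;
for (var k in counts) {
  total += counts[k];
  huffBits += counts[k] * lens[k];
}

console.log('Huffman bits:', huffBits);   // 402
console.log('ASCII bits:  ', total * 8);  // 1408
console.log('3-bit coding:', total * 3);  // 528
console.log('Eff vs ASCII:', (100 * huffBits / (total * 8)).toFixed(2) + ' %'); // 28.55 %
console.log('Eff vs 3-bit:', (100 * huffBits / (total * 3)).toFixed(2) + ' %'); // 76.14 %
```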

As we have six characters, a fixed-length code would need a minimum of 3 bits per character. Try an example:

  • “peter piper picked a picked pepper”. Calc
  • “test” should give 1 00 01 1. Huff
  • “hello” should give 111 110 0 0 10 . Huff
  • “arkansas” should give 11 00 010 11 011 10 11 10. Huff
  • “eebbeecdebeeebecceeeddeb bbeceedebeeddeeeeccee eedeeedeeebeedeceedeb eeedeceeedebee”. Huff
  • “57 ‘e’, 51 ‘i’, 33 ‘o’, 22 ‘p’, 12 ‘b’, 3 ‘c’”. Huff
  • “34 ‘1 goal’, 21 ‘0 goals’, 15 ‘2 goals’, 14 ‘3 goals’, 5 ‘4 goals’ ,2 ‘5 goals’, 1 ‘6 goals’”. Huff

Lempel Ziv 1977 (LZ77)

Around 1977, Abraham Lempel and Jacob Ziv developed the Lempel–Ziv class of adaptive dictionary data compression techniques (also known as LZ-77 coding), which are now some of the most popular compression techniques. The LZ coding scheme is especially suited to data which has a high degree of repetition, and makes back references to these repeated parts. A flag is typically used to distinguish coded from unencoded parts, where the coded parts are back references to an earlier copy of the repeated sequence.
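A toy version of this idea can be sketched as follows (a minimal illustration, not the deflate format: literals are emitted as characters, and back references as [distance, length] pairs found by naive search):

```javascript
// Toy LZ77: scan a sliding window for the longest earlier match
function lz77Compress(text, windowSize, minMatch) {
  windowSize = windowSize || 255;
  minMatch = minMatch || 3;
  var out = [];
  var i = 0;
  while (i < text.length) {
    var bestLen = 0, bestDist = 0;
    for (var j = Math.max(0, i - windowSize); j < i; j++) {
      var len = 0;
      while (i + len < text.length && text[j + len] === text[i + len]) len++;
      if (len > bestLen) { bestLen = len; bestDist = i - j; }
    }
    if (bestLen >= minMatch) { out.push([bestDist, bestLen]); i += bestLen; }
    else { out.push(text[i]); i++; }
  }
  return out;
}

// Rebuild the text by copying literals and resolving back references
function lz77Decompress(tokens) {
  var out = '';
  tokens.forEach(function (t) {
    if (typeof t === 'string') { out += t; }
    else {
      for (var k = 0; k < t[1]; k++) out += out[out.length - t[0]];
    }
  });
  return out;
}

console.log(lz77Compress('Cows graze in groves on grass which grows in grooves'));
```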

The Lempel–Ziv–Welch (LZW) algorithm (a variant of the LZ-78 approach) builds a dictionary of frequently used groups of characters (or 8-bit binary values). The dictionary does not need to be transmitted or stored with the data: the receiver rebuilds an identical dictionary as it decodes. This method is good at compressing text files, because text files contain ASCII characters (which are stored as 8-bit binary values), but it is not so good for graphics files, which may have repeating patterns of binary digits that are not multiples of 8 bits.

A simple example is to use a six-character alphabet and a 16-entry dictionary; thus the resulting code word will have 4 bits. If the transmitted message is:

ababacdcdaaaaaaef

Then, the transmitter and receiver would initially add the following to their dictionaries:

0000        ‘a’
0001        ‘b’
0010        ‘c’
0011        ‘d’
0100        ‘e’
0101        ‘f’
0110–1111   empty

First the ‘a’ character is sent with 0000, next the ‘b’ character is sent and the transmitter checks to see that the ‘ab’ sequence has been stored in the dictionary. As it has not, it adds ‘ab’ to the dictionary, to give:

0000        ‘a’
0001        ‘b’
0010        ‘c’
0011        ‘d’
0100        ‘e’
0101        ‘f’
0110        ‘ab’
0111–1111   empty

The receiver will also add this to its table (thus, the transmitter and receiver will always have the same tables). Next, the transmitter reads the ‘a’ character and checks to see if the ‘ba’ sequence is in the code table. As it is not, it transmits the ‘a’ character as 0000, adds the ‘ba’ sequence to the dictionary, which will now contain:

0000        ‘a’
0001        ‘b’
0010        ‘c’
0011        ‘d’
0100        ‘e’
0101        ‘f’
0110        ‘ab’
0111        ‘ba’
1000–1111   empty

Next, the transmitter reads the ‘b’ character and checks to see if the ‘ba’ sequence is in the table. As it is, it will transmit the code table address which identifies it, i.e. 0111. When this is received, the receiver detects that it is in its dictionary and it knows that the addressed sequence is ‘ba’.

Next, the transmitter reads a ‘c’ and checks for the character in its dictionary. As it is included, it transmits its address, i.e. 0010. When this is received, the receiver checks its dictionary and locates the character ‘c’. This then continues with the transmitter and receiver maintaining identical copies of their dictionaries. A great deal of compression occurs when sending a sequence of one character, such as a long sequence of ‘a’.
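The transmitter's side of this walkthrough can be sketched as follows (a minimal LZW encoder; it uses the standard greedy-match bookkeeping, so the dictionary indices may differ slightly from the step-by-step narration above):

```javascript
// Toy LZW encoder: grow the current string w while it is in the
// dictionary; on a miss, emit the code for w and add w+ch
function lzwEncode(text, alphabet) {
  var dict = {};
  alphabet.forEach(function (ch, i) { dict[ch] = i; });
  var next = alphabet.length;
  var out = [];
  var w = '';
  for (var i = 0; i < text.length; i++) {
    var ch = text[i];
    if (dict.hasOwnProperty(w + ch)) {
      w += ch;
    } else {
      out.push(dict[w]);
      dict[w + ch] = next++;
      w = ch;
    }
  }
  if (w !== '') out.push(dict[w]);
  return out;
}

console.log(lzwEncode('ababacdcdaaaaaaef', ['a', 'b', 'c', 'd', 'e', 'f']));
```

For 'ababacdcdaaaaaaef' this emits [0, 1, 6, 0, 2, 3, 10, 0, 13, 14, 4, 5]: the codes for 'a', 'b', 'ab', 'a', 'c', 'd', 'cd', 'a', 'aa', 'aaa', 'e' and 'f'.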

Typically, in a practical implementation of LZW, the dictionary size starts at 4K (4,096 entries). The dictionary stores the single bytes at addresses 0 to 255, and addresses 256 to 4095 are used for strings (which can contain two or more characters). As there are 4,096 entries, it is a 12-bit coding scheme (addresses 0 to 4095 give 2¹² different values). Let's try an example [here]:

Input: Cows graze in groves on grass which grows in grooves in groves
Compressed:
['C', 'o', 'w', 's', ' ', 'g', 'r', 'a', 'z', 'e', ' ', 'i', 'n', 260, 'r', 'o', 'v', 'e', 259, 'o', 268, 261, 'a', 's', 259, 'w', 'h', 'i', 'c', 'h', 269, 257, 259, 267, 286, 271, 273, 266, 276, 270, 272, 's']
256
297
Decompressed:
Cows graze in groves on grass which grows in grooves in groves
Debug
Adding: [256] Co
Adding: [257] ow
Adding: [258] ws
Adding: [259] s
Adding: [260] g
Adding: [261] gr
Adding: [262] ra
Adding: [263] az
Adding: [264] ze
Adding: [265] e
Adding: [266] i
Adding: [267] in
Adding: [268] n
Adding: [269] gr
Adding: [270] ro
Adding: [271] ov
Adding: [272] ve
Adding: [273] es
Adding: [274] s o
Adding: [275] on
Adding: [276] n g
Adding: [277] gra
Adding: [278] as
Adding: [279] ss
Adding: [280] s w
Adding: [281] wh
Adding: [282] hi
Adding: [283] ic
Adding: [284] ch
Adding: [285] h
Adding: [286] gro
Adding: [287] ows
Adding: [288] s i
Adding: [289] in
Adding: [290] groo
Adding: [291] ove
Adding: [292] es
Adding: [293] in
Adding: [294] n gr
Adding: [295] rov
Adding: [296] ves

In the example above, we have:

Input:  Cows graze in groves on grass which grows in grooves in groves
Compressed:
['C', 'o', 'w', 's', ' ', 'g', 'r', 'a', 'z', 'e', ' ', 'i', 'n', 260, 'r', 'o', 'v', 'e', 259, 'o', 268, 261, 'a', 's', 259, 'w', 'h', 'i', 'c', 'h', 269, 257, 259, 267, 286, 271, 273, 266, 276, 270, 272, 's']
Decompressed:
Cows graze in groves on grass which grows in grooves in groves

The first entry in the dictionary, at position 256, will be ‘Co’; the next will be ‘ow’. We can see that the following index values have been defined:

  • The code ‘ g’ has been defined with an index of 260.
  • The code ‘s ’ has been defined with an index of 259.
  • 268 and 261 represent ‘n ’ and ‘gr’, respectively.
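The receiver's side can be sketched too (a minimal byte-initialized LZW pair, with codes 0–255 as single characters and 256 upwards as learned strings; the decoder handles the classic case of a code one step ahead of its dictionary, which arises with runs):

```javascript
// Minimal byte-initialized LZW encoder (included for the round trip)
function encode(text) {
  var dict = {}, next = 256, out = [], w = '';
  for (var i = 0; i < 256; i++) dict[String.fromCharCode(i)] = i;
  for (var j = 0; j < text.length; j++) {
    var ch = text[j];
    if (dict.hasOwnProperty(w + ch)) { w += ch; }
    else { out.push(dict[w]); dict[w + ch] = next++; w = ch; }
  }
  if (w !== '') out.push(dict[w]);
  return out;
}

// Matching decoder: rebuilds the same dictionary as it reads codes
function decode(codes) {
  var dict = [];
  for (var i = 0; i < 256; i++) dict.push(String.fromCharCode(i));
  var prev = dict[codes[0]], out = prev;
  for (var j = 1; j < codes.length; j++) {
    var code = codes[j];
    // A code not yet in the dictionary must be prev + prev[0]
    var entry = code < dict.length ? dict[code] : prev + prev[0];
    out += entry;
    dict.push(prev + entry[0]);
    prev = entry;
  }
  return out;
}

var msg = 'Cows graze in groves on grass which grows in grooves in groves';
console.log(decode(encode(msg)) === msg); // true
```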

Try an example:

  • “Cows graze in groves on grass which grows in grooves in groves”. Calc
  • “I wish to wish the wish you wish to wish, but if you wish the wish the witch wishes, I won’t wish the wish you wish to wish”. Calc
  • “Picky people pick Peter Pan Peanut-Butter, ’tis the peanut-butter picky people pick”. Calc

Conclusions

We are quite wasteful in our communications and in the storage of data. Gzip provides a way to convert data into compressed streams, and to be more efficient in our processing, storage and communications.