Which Method Has Done More To Reduce Bandwidth/Data Than Any Other?

You can try Huffman here. You can try LZW here.

The method is Huffman Coding. It has been working away in the background for decades, and has done more to reduce the amount of bandwidth we use, and the amount of disk space we consume, than virtually any other method.

Huffman coding often reduces data to less than 10% of its original size. The fewer bits we send and store, the less energy we use … so Huffman is the greenest protocol around … saving the planet! We see it in PKZIP, AVI, MOV, MP3, RAR, DOCX, MP4, and many other formats.

It also shares the crown with Lempel-Ziv, which is used to compress repeated patterns. In GIF, Lempel-Ziv takes identical pixel blocks and creates a pointer to them, so the encoding only needs to store the reference.

Huffman Principles

Huffman coding uses a variable length code for each of the elements within the data. This normally involves analyzing the data to determine the probability of its elements. The most probable elements are coded with a few bits and the least probable are coded with a greater number of bits. This can be done character by character in a text file, or byte by byte for other files. The following example relates to characters. First, the textual data is scanned to determine the number of occurrences of each letter. For example:

Letter:              ‘b’  ‘c’  ‘e’  ‘i’  ‘o’  ‘p’
No. of occurrences:   12    3   57   51   33   20

Next the characters are arranged in order of their number of occurrences, such as:

‘e’  ‘i’  ‘o’  ‘p’  ‘b’  ‘c’
 57   51   33   20   12    3

After this the two least probable characters are assigned either a 0 or a 1. Figure 1 shows that the least probable (‘c’) has been assigned a 0 and the next least probable (‘b’) has been assigned a 1. The addition of the number of occurrences for these is then taken into the next column and the occurrence values are again arranged in descending order (that is, 57, 51, 33, 20 and 15). As with the first column, the least probable occurrence is assigned a 0 and the next least probable occurrence is assigned a 1. This continues until the last column. When complete, the Huffman-coded values are read from left to right and the bits are listed from right to left.

The final coding will be:

‘e’ 11 ‘i’ 10
‘o’ 00 ‘p’ 011
‘b’ 0101 ‘c’ 0100
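As a sketch (my own illustration, not code from the original article), the merging process can be implemented with a priority queue. Following the same rule as above, the less probable of each merged pair gets a leading 0 and the other a leading 1:

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """Build a Huffman code table from a {symbol: frequency} dict."""
    tick = count()  # unique tie-breaker so the heap never compares the code dicts
    heap = [(f, next(tick), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, least = heapq.heappop(heap)       # least probable subtree
        f2, _, next_least = heapq.heappop(heap)  # next least probable subtree
        # Prefix the least probable side with 0 and the other with 1
        merged = {s: "0" + c for s, c in least.items()}
        merged.update({s: "1" + c for s, c in next_least.items()})
        heapq.heappush(heap, (f1 + f2, next(tick), merged))
    return heap[0][2]

freqs = {"b": 12, "c": 3, "e": 57, "i": 51, "o": 33, "p": 20}
codes = huffman_codes(freqs)
```

With these frequencies this reproduces the codes in the table above (‘e’ → 11, ‘c’ → 0100, and so on), although a different tie-breaking rule can produce different, but equally optimal, codes.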

Lempel-Ziv

Around 1977, Abraham Lempel and Jacob Ziv developed the Lempel–Ziv class of adaptive dictionary data compression techniques (also known as LZ-77 coding), which are now some of the most popular compression techniques. The LZ coding scheme is especially suited to data with a high degree of repetition, as it makes back references to the repeated parts. Typically a flag identifies coded and unencoded parts, where the flag holds a back reference to the repeated sequence. An example piece of text could be:

‘The receiver requires a receipt for it. This is automatically sent when it is received.’

This text has several repeated sequences, such as ‘ is ’, ‘it’, ‘en’, ‘re’ and ‘ receiv’. For example, take the repeated sequence ‘ recei’. The encoded output can replace it with a flag sequence of the form #m#n, where m represents the number of characters to trace back to find the character sequence and n the number of characters replaced.

Thus, the encoded message could become:

‘The receiver#9#3quires a#20#5pt for it. This is automatically sent wh#6#2 it #30#2#47#5ved.’
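A decoder for this flag scheme can be sketched as follows (my own illustration, not code from the article; it assumes a literal ‘#’ never appears in the text and that no literal digits immediately follow a flag, since a flag followed by digits is ambiguous):

```python
import re

# One flag: '#' m '#' n, where m is how far to trace back and
# n is how many characters to copy from that point.
FLAG = re.compile(r"#(\d+)#(\d+)")

def decode_backrefs(encoded):
    """Expand #m#n back references into the original text (sketch)."""
    out = []
    i = 0
    while i < len(encoded):
        match = FLAG.match(encoded, i)
        if match:
            back, length = int(match.group(1)), int(match.group(2))
            start = len(out) - back
            for k in range(length):         # copy character by character,
                out.append(out[start + k])  # so the copy may overlap itself
            i = match.end()
        else:
            out.append(encoded[i])
            i += 1
    return "".join(out)

print(decode_backrefs("ban#2#3"))  # banana
```

Copying character by character lets a reference run past its own starting point, which is how ‘ban#2#3’ expands to ‘banana’.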

Normally, a long sequence of text has many repeated words and phrases, such as ‘and’, ‘there’, and so on. Note that, in some cases, this could lead to longer files if short sequences were replaced with flags that were longer than the sequences themselves. Using the previous example of sports results:

Mulchester 3 Madchester 2
Smellmore Wanderers 60 Drinksome Wanderers 23

which could become:

Mulchester 3 Mad#13#7 2 Smellmore Wanderers 60 Drinksome#23#1123

The Lempel–Ziv–Welch (LZW) algorithm (a variant of LZ-78) builds a dictionary of frequently used groups of characters (or 8-bit binary values). The dictionary does not need to be sent with the data, as the encoder and the decoder build the same dictionary as they process it. In the following example we store the groups of characters in a table, and then refer to the table in the stored data.
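A minimal sketch of the LZW dictionary build (my own illustration, working on character strings rather than raw 8-bit values):

```python
def lzw_encode(data):
    """LZW sketch: grow a dictionary of seen sequences, output dictionary codes."""
    # Start with one entry for every possible single character (0-255)
    dictionary = {chr(i): i for i in range(256)}
    next_code = 256
    current = ""
    output = []
    for ch in data:
        candidate = current + ch
        if candidate in dictionary:
            current = candidate                 # keep extending the match
        else:
            output.append(dictionary[current])  # emit code for the known prefix
            dictionary[candidate] = next_code   # learn the new sequence
            next_code += 1
            current = ch
    if current:
        output.append(dictionary[current])
    return output

print(lzw_encode("ABABAB"))  # [65, 66, 256, 256]
```

Codes below 256 are single characters (65 is ‘A’, 66 is ‘B’), while 256 is the learned entry ‘AB’, used twice. The decoder can rebuild exactly the same dictionary from the code stream, which is why it never needs to be transmitted.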

This method is especially useful in graphics, where we have pixel blocks which repeat, such as repeated patterns of the same pixel colour.

Conclusions

Huffman and LZ are two of the key foundation blocks in many of the files that we see. A document that would otherwise be tens of megabytes can drop to a tenth of its size, saving both disk space and bandwidth when transmitted. You might not know it, but your DOCX and PPTX files are built on these compression methods.