Around 1977, Abraham Lempel and Jacob Ziv developed the Lempel–Ziv class of adaptive dictionary data compression techniques (also known as LZ-77 coding), which are now some of the most popular compression techniques. The LZ coding scheme is especially suited to data which has a high degree of repetition, and makes back references to these repeated parts. Typically a flag is normally used to identify coded and unencoded parts, where the flag creates back references to the repeated sequence.
Lempel-Ziv-Welsh Compression |
Example
In he example above, we have:
Input: Cows graze in groves on grass which grows in grooves in groves Compressed: ['C', 'o', 'w', 's', ' ', 'g', 'r', 'a', 'z', 'e', ' ', 'i', 'n', 260, 'r', 'o', 'v', 'e', 259, 'o', 268, 261, 'a', 's', 259, 'w', 'h', 'i', 'c', 'h', 269, 257, 259, 267, 286, 271, 273, 266, 276, 270, 272, 's'] Deompressed: Cows graze in groves on grass which grows in grooves in groves
The first entry in the dictionary add position 256 will be 'Co', next it will be 'ow'. We can see that the following index values have been defined:
- The code ' g' has been defined with an index of 260.
- The code 's ' has been defined with an index of 259.
- 268, 261 represents 'n ' and 'gr', representively.
The resulting dictionary entries that are added:
Adding: [256] Co Adding: [257] ow Adding: [258] ws Adding: [259] s Adding: [260] g Adding: [261] gr Adding: [262] ra Adding: [263] az Adding: [264] ze Adding: [265] e Adding: [266] i Adding: [267] in Adding: [268] n Adding: [269] gr Adding: [270] ro Adding: [271] ov Adding: [272] ve Adding: [273] es Adding: [274] s o Adding: [275] on Adding: [276] n g Adding: [277] gra Adding: [278] as Adding: [279] ss Adding: [280] s w Adding: [281] wh Adding: [282] hi Adding: [283] ic Adding: [284] ch Adding: [285] h Adding: [286] gro Adding: [287] ows Adding: [288] s i Adding: [289] in Adding: [290] groo Adding: [291] ove Adding: [292] es Adding: [293] in Adding: [294] n gr Adding: [295] rov Adding: [296] ves
Theory
The Lempel–Ziv–Welsh (LZW) algorithm (also known LZ-78) builds a dictionary of fre-quently used groups of characters (or 8-bit binary values). Before the file is decoded, the compression dictionary is sent (if transmitting data) or stored (if data is being stored). This method is good at compressing text files because text files contain ASCII characters (which are stored as 8-bit binary values) but not so good for graphics files, which may have repeating patterns of binary digits that might not be multiples of 8 bits.
A simple example is to use a six-character alphabet and a 16-entry dictionary, thus the resulting code word will have 4 bits. If the transmitted message is:
ababacdcdaaaaaaef
Then, the transmitter and receiver would initially add the following to its dictionary:
0000 ‘a’ 0001 ‘b’ 0010 ‘c’ 0011 ‘d’ 0100 ‘e’ 0101 ‘f’ 0110–1111 empty
First the ‘a’ character is sent with 0000, next the ‘b’ character is sent and the transmitter checks to see that the ‘ab’ sequence has been stored in the dictionary. As it has not, it adds ‘ab’ to the dictionary, to give:
0000 ‘a’ 0001 ‘b’ 0010 ‘c’ 0011 ‘d’ 0100 ‘e’ 0101 ‘f’ 0110 ‘ab’ 0111–1111 empty
The receiver will also add this to its table (thus, the transmitter and receiver will always have the same tables). Next, the transmitter reads the ‘a’ character and checks to see if the ‘ba’ sequence is in the code table. As it is not, it transmits the ‘a’ character as 0000, adds the ‘ba’ sequence to the dictionary, which will now contain:
0000 ‘a’ 0001 ‘b’ 0010 ‘c’ 0011 ‘d’ 0100 ‘e’ 0101 ‘f’ 0110 ‘ab’ 0111 ‘ba’ 1000–1111 empty
Next, the transmitter reads the ‘b’ character and checks to see if the ‘ba’ sequence is in the table. As it is, it will transmit the code table address which identifies it, i.e. 0111. When this is received, the receiver detects that it is in its dictionary and it knows that the addressed se-quence is ‘ba’.
Next, the transmitter reads a ‘c’ and checks for the character in its dictionary. As it is included, it transmits its address, i.e. 0010. When this is received, the receiver checks its dic-tionary and locates the character ‘c’. This then continues with the transmitter and receiver maintaining identical copies of their dictionaries. A great deal of compression occurs when sending a sequence of one character, such as a long sequence of ‘a’.
Typically, in a practical implementation of LZW, the dictionary size for LZW starts at 4K (4096). The dictionary then stores bytes from 0 to 255 and the addresses 256 to 4095 are used for strings (which can contain two or more characters). As there are 4096 entries then it is a 12-bit coding scheme (0 to 4096 gives 0 to 212–1 different addresses).
Coding
The following outlines the coding:
import sys ## Based on http://rosettacode.org/wiki/LZW_compression#Python message="Cows graze in groves on grass which grows in grooves in groves" if (len(sys.argv)>1): message=str(sys.argv[1]) def compress(uncompressed): """Compress a string to a list of output symbols.""" # Build the dictionary. dict_size = 256 dictionary = dict((chr(i), chr(i)) for i in range(dict_size)) # in Python 3: dictionary = {chr(i): chr(i) for i in range(dict_size)} w = "" result = [] for c in uncompressed: wc = w + c if wc in dictionary: w = wc else: result.append(dictionary[w]) # Add wc to the dictionary. dictionary[wc] = dict_size dict_size += 1 w = c # Output the code for w. if w: result.append(dictionary[w]) return result def decompress(compressed): """Decompress a list of output ks to a string.""" debug_str="" # Build the dictionary. dict_size = 256 dictionary = dict((chr(i), chr(i)) for i in range(dict_size)) # in Python 3: dictionary = {chr(i): chr(i) for i in range(dict_size)} print(dict_size) w = result = compressed.pop(0) for k in compressed: if k in dictionary: entry = dictionary[k] elif k == dict_size: entry = w + w[0] else: raise ValueError('Bad compressed k: %s' % k) result += entry # Add w+entry[0] to the dictionary. dictionary[dict_size] = w + entry[0] debug_str += "Adding: ["+str(dict_size)+"]\t"+dictionary[dict_size]+"\n" dict_size += 1 w = entry print(dict_size) return result,debug_str print("Input: ",message) compressed = compress(message) print("\nCompressed:\n",compressed) decompressed,debug_str = decompress(compressed) print("\nDeompressed:\n",(decompressed)) print("\n\nDebug") print(debug_str)