Lempel-Ziv-Welsh Compression

[Communications Home][Home]

Around 1977, Abraham Lempel and Jacob Ziv developed the Lempel–Ziv class of adaptive dictionary data compression techniques (also known as LZ-77 coding), which are now some of the most popular compression techniques. The LZ coding scheme is especially suited to data which has a high degree of repetition, and makes back references to these repeated parts. Typically a flag is normally used to identify coded and unencoded parts, where the flag creates back references to the repeated sequence.

Text input

Enter some text:

Cows graze in groves on grass which grows in grooves in groves

Try an example

"Cows graze in groves on grass which grows in grooves in groves". Calc
"I wish to wish the wish you wish to wish, but if you wish the wish the witch wishes, I won't wish the wish you wish to wish". Calc
"Picky people pick Peter Pan Peanut-Butter, 'tis the peanut-butter picky people pick". Calc

Input: Cows graze in groves on grass which grows in grooves in groves Compressed: ['C', 'o', 'w', 's', ' ', 'g', 'r', 'a', 'z', 'e', ' ', 'i', 'n', 260, 'r', 'o', 'v', 'e', 259, 'o', 268, 261, 'a', 's', 259, 'w', 'h', 'i', 'c', 'h', 269, 257, 259, 267, 286, 271, 273, 266, 276, 270, 272, 's'] 256 297 Deompressed: Cows graze in groves on grass which grows in grooves in groves Debug Adding: [256] Co Adding: [257] ow Adding: [258] ws Adding: [259] s Adding: [260] g Adding: [261] gr Adding: [262] ra Adding: [263] az Adding: [264] ze Adding: [265] e Adding: [266] i Adding: [267] in Adding: [268] n Adding: [269] gr Adding: [270] ro Adding: [271] ov Adding: [272] ve Adding: [273] es Adding: [274] s o Adding: [275] on Adding: [276] n g Adding: [277] gra Adding: [278] as Adding: [279] ss Adding: [280] s w Adding: [281] wh Adding: [282] hi Adding: [283] ic Adding: [284] ch Adding: [285] h Adding: [286] gro Adding: [287] ows Adding: [288] s i Adding: [289] in Adding: [290] groo Adding: [291] ove Adding: [292] es Adding: [293] in Adding: [294] n gr Adding: [295] rov Adding: [296] ves

Example

In he example above, we have:

Input:  Cows graze in groves on grass which grows in grooves in groves

Compressed:
['C', 'o', 'w', 's', ' ', 'g', 'r', 'a', 'z', 'e', ' ', 'i', 'n', 260, 'r', 'o', 'v', 'e', 259, 'o', 268, 261, 'a', 's', 259, 'w', 'h', 'i', 'c', 'h', 269, 257, 259, 267, 286, 271, 273, 266, 276, 270, 272, 's']

Deompressed:
Cows graze in groves on grass which grows in grooves in groves

The first entry in the dictionary add position 256 will be 'Co', next it will be 'ow'. We can see that the following index values have been defined:

The code ' g' has been defined with an index of 260.
The code 's ' has been defined with an index of 259.
268, 261 represents 'n ' and 'gr', representively.

The resulting dictionary entries that are added:

Adding: [256]	Co
Adding: [257]	ow
Adding: [258]	ws
Adding: [259]	s 
Adding: [260]	 g
Adding: [261]	gr
Adding: [262]	ra
Adding: [263]	az
Adding: [264]	ze
Adding: [265]	e 
Adding: [266]	 i
Adding: [267]	in
Adding: [268]	n 
Adding: [269]	 gr
Adding: [270]	ro
Adding: [271]	ov
Adding: [272]	ve
Adding: [273]	es
Adding: [274]	s o
Adding: [275]	on
Adding: [276]	n g
Adding: [277]	gra
Adding: [278]	as
Adding: [279]	ss
Adding: [280]	s w
Adding: [281]	wh
Adding: [282]	hi
Adding: [283]	ic
Adding: [284]	ch
Adding: [285]	h 
Adding: [286]	 gro
Adding: [287]	ows
Adding: [288]	s i
Adding: [289]	in 
Adding: [290]	 groo
Adding: [291]	ove
Adding: [292]	es 
Adding: [293]	 in
Adding: [294]	n gr
Adding: [295]	rov
Adding: [296]	ves

Theory

The Lempel–Ziv–Welsh (LZW) algorithm (also known LZ-78) builds a dictionary of fre-quently used groups of characters (or 8-bit binary values). Before the file is decoded, the compression dictionary is sent (if transmitting data) or stored (if data is being stored). This method is good at compressing text files because text files contain ASCII characters (which are stored as 8-bit binary values) but not so good for graphics files, which may have repeating patterns of binary digits that might not be multiples of 8 bits.

A simple example is to use a six-character alphabet and a 16-entry dictionary, thus the resulting code word will have 4 bits. If the transmitted message is:

ababacdcdaaaaaaef

Then, the transmitter and receiver would initially add the following to its dictionary:

0000			‘a’		0001				‘b’
0010			‘c’		0011				‘d’
0100			‘e’		0101				‘f’
0110–1111		empty

First the ‘a’ character is sent with 0000, next the ‘b’ character is sent and the transmitter checks to see that the ‘ab’ sequence has been stored in the dictionary. As it has not, it adds ‘ab’ to the dictionary, to give:

0000			‘a’		0001				‘b’
0010			‘c’		0011				‘d’
0100			‘e’		0101				‘f’
0110			‘ab’	0111–1111		empty

The receiver will also add this to its table (thus, the transmitter and receiver will always have the same tables). Next, the transmitter reads the ‘a’ character and checks to see if the ‘ba’ sequence is in the code table. As it is not, it transmits the ‘a’ character as 0000, adds the ‘ba’ sequence to the dictionary, which will now contain:

0000			‘a’	
0001			‘b’
0010			‘c’
0011			‘d’
0100			‘e’
0101			‘f’
0110			‘ab’
0111			‘ba’
1000–1111		empty

Next, the transmitter reads the ‘b’ character and checks to see if the ‘ba’ sequence is in the table. As it is, it will transmit the code table address which identifies it, i.e. 0111. When this is received, the receiver detects that it is in its dictionary and it knows that the addressed se-quence is ‘ba’.

Next, the transmitter reads a ‘c’ and checks for the character in its dictionary. As it is included, it transmits its address, i.e. 0010. When this is received, the receiver checks its dic-tionary and locates the character ‘c’. This then continues with the transmitter and receiver maintaining identical copies of their dictionaries. A great deal of compression occurs when sending a sequence of one character, such as a long sequence of ‘a’.

Typically, in a practical implementation of LZW, the dictionary size for LZW starts at 4K (4096). The dictionary then stores bytes from 0 to 255 and the addresses 256 to 4095 are used for strings (which can contain two or more characters). As there are 4096 entries then it is a 12-bit coding scheme (0 to 4096 gives 0 to 212–1 different addresses).

Coding

The following outlines the coding:

import sys

## Based on http://rosettacode.org/wiki/LZW_compression#Python

message="Cows graze in groves on grass which grows in grooves in groves"

if (len(sys.argv)>1):
        message=str(sys.argv[1])


def compress(uncompressed):
    """Compress a string to a list of output symbols."""

    # Build the dictionary.
    dict_size = 256
    dictionary = dict((chr(i), chr(i)) for i in range(dict_size))
    # in Python 3: dictionary = {chr(i): chr(i) for i in range(dict_size)}
    w = ""
    result = []
    for c in uncompressed:
        wc = w + c
        if wc in dictionary:
            w = wc
        else:
            result.append(dictionary[w])
            # Add wc to the dictionary.
            dictionary[wc] = dict_size
            dict_size += 1
            w = c

    # Output the code for w.
    if w:
        result.append(dictionary[w])

    return result


def decompress(compressed):
    """Decompress a list of output ks to a string."""
    debug_str=""

    # Build the dictionary.
    dict_size = 256
    dictionary = dict((chr(i), chr(i)) for i in range(dict_size))
    # in Python 3: dictionary = {chr(i): chr(i) for i in range(dict_size)}
    print(dict_size)
    w = result = compressed.pop(0)
    for k in compressed:
        if k in dictionary:
            entry = dictionary[k]
        elif k == dict_size:
            entry = w + w[0]
        else:
            raise ValueError('Bad compressed k: %s' % k)
        result += entry

        # Add w+entry[0] to the dictionary.
        dictionary[dict_size] = w + entry[0]
        debug_str += "Adding: ["+str(dict_size)+"]\t"+dictionary[dict_size]+"\n"
        dict_size += 1

        w = entry
    print(dict_size)
    return result,debug_str 


print("Input: ",message)
compressed = compress(message)
print("\nCompressed:\n",compressed)
decompressed,debug_str = decompress(compressed)
print("\nDeompressed:\n",(decompressed))

print("\n\nDebug")
print(debug_str)

Lempel-Ziv-Welsh Compression

Text input

Example

Theory

Coding

Referencing this page