Lecture 2: Huffman Coding


Transcript of Lecture 2: Huffman Coding

Page 1: Lecture 2: Huffman Coding (title slide)

Page 2: Contents

• Prefix Code
• Representing Prefix Codes Using Binary Trees
• Binary Tree Terminology
• Decoding a Prefix Code
• Example
• Huffman Coding
• Cost of a Huffman Tree
• Optimality

Page 3: Prefix Code

A prefix code is a type of code system (typically a variable-length code) distinguished by its possession of the "prefix property", which requires that there is no code word in the system that is a prefix (initial segment) of any other code word in the system. For example, a code with code words {9, 59, 55} has the prefix property; a code consisting of {9, 5, 59, 55} does not, because "5" is a prefix of "59" and also of "55".

A prefix code is a uniquely decodable code: a receiver can identify each word without requiring a special marker between words.

If every word in the code has the same length, the code is called a fixed-length code, or a block code.

[email protected]

Prefix code

Page 4: Prefix Codes

Suppose we have two binary code words a and b, where a is k bits long, b is n bits long, and k < n. If the first k bits of b are identical to a, then a is called a prefix of b. The last n − k bits of b are called the dangling suffix.

For example, if a = 010 and b = 01011, then a is a prefix of b and the dangling suffix is 11.
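A minimal sketch (not from the slides) of this prefix/dangling-suffix check; the function name is illustrative:

```python
def dangling_suffix(a: str, b: str):
    """Return the dangling suffix of b with respect to a,
    or None if a is not a prefix of b."""
    # a is a prefix of b when the first len(a) bits of b equal a
    if len(a) < len(b) and b.startswith(a):
        return b[len(a):]
    return None

# Example from the slide: a = 010, b = 01011 -> dangling suffix "11"
print(dangling_suffix("010", "01011"))  # prints 11
```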

[email protected]

Prefix codes

Page 5: Representing Prefix Codes Using Binary Trees

A prefix code is most easily represented by a binary tree in which the external nodes are labeled with single characters that are combined to form the message. The encoding for a character is determined by following the path down from the root of the tree to the external node that holds that character: a 0 bit identifies a left branch in the path, and a 1 bit identifies a right branch.

In order for this encoding scheme to reduce the number of bits in a message, we use short encodings for frequently used characters, and long encodings for infrequent ones.
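To make the tree picture concrete, here is a small sketch (assumed structure, not code from the lecture) that walks such a tree and collects each character's code word, with 0 for a left branch and 1 for a right branch; the node class and the example tree are illustrative only:

```python
# A code tree node: either a leaf holding a character, or an internal
# node with a left (0) child and a right (1) child.
class Node:
    def __init__(self, char=None, left=None, right=None):
        self.char = char
        self.left = left
        self.right = right

def build_code_table(node, path="", table=None):
    """Collect the code word for every character by walking root-to-leaf paths."""
    if table is None:
        table = {}
    if node.char is not None:          # external node: record its code word
        table[node.char] = path or "0"
    else:                              # internal node: 0 = left branch, 1 = right branch
        build_code_table(node.left, path + "0", table)
        build_code_table(node.right, path + "1", table)
    return table

# Small illustrative tree: 'a' at depth 1, 'c' and 'd' at depth 2
tree = Node(left=Node(char="a"),
            right=Node(left=Node(char="c"), right=Node(char="d")))
print(build_code_table(tree))   # {'a': '0', 'c': '10', 'd': '11'}
```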

[email protected]

Representing Prefix Codes using Binary Trees

Page 6: Lecture 2: Huffman Coding

A fundamental property of prefix codes is that messages can be formed by simply stringing together the code bits from left to right. For example, the bit string

0111110010110101001111100100

encodes the message "abracadabra!". The first 0 must encode 'a', then the next three 1's must encode 'b', then 110 must encode 'r', and so on as follows:

|0|111|110|0|1011|0|1010|0|111|110|0|100
 a  b   r  a  c   a  d   a  b   r  a  !
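A sketch of this left-to-right decoding (not from the slides), assuming the code table implied by the partition above; only the codes for 'a', 'b', and 'r' are named explicitly on the slide, the rest are read off the partition:

```python
# Code table read off the partition above (some entries are inferred,
# since the slide only names the codes for a, b and r explicitly).
code = {"a": "0", "b": "111", "r": "110", "c": "1011", "d": "1010", "!": "100"}
decode_table = {v: k for k, v in code.items()}

def decode(bits: str) -> str:
    """Scan left to right, emitting a symbol as soon as a code word matches.
    This is unambiguous precisely because no code word is a prefix of another."""
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in decode_table:
            out.append(decode_table[current])
            current = ""
    return "".join(out)

print(decode("0111110010110101001111100100"))  # abracadabra!
```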

[email protected]

Page 7: Binary Tree Terminology

1. Each node, except the root, has a unique parent.

2. Each internal node has exactly two children.

[email protected]

Page 8: Decoding a Prefix Code

11000111100

[email protected]

Page 9: Example

1. For a given list of symbols, develop a corresponding list of probabilities or frequency counts so that each symbol's relative frequency of occurrence is known. (We assume the following frequency counts: A : 15, B : 7, C : 6, D : 6, E : 5.)

2. Sort the list of symbols according to frequency, with the most frequently occurring symbols at the left and the least common at the right.

3. Divide the list into two parts, with the total frequency counts of the left half being as close to the total of the right as possible.

4. The left half of the list is assigned the binary digit 0, and the right half is assigned the digit 1. This means that the codes for the symbols in the first half will all start with 0, and the codes in the second half will all start with 1.

[email protected]

Example

Page 10: Example …

5. Recursively apply steps 3 and 4 to each of the two halves, subdividing groups and adding bits to the codes until each symbol has become a corresponding code leaf on the tree.
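The slides give no code for this procedure; the following is a minimal sketch of the top-down splitting described in steps 1–5, using the assumed frequency counts from the example (function and variable names are illustrative):

```python
def split_codes(symbols):
    """Assign codes by recursively splitting a frequency-sorted list into two
    halves whose totals are as close as possible (steps 3-5 above)."""
    if len(symbols) == 1:
        sym, _ = symbols[0]
        return {sym: ""}
    total = sum(f for _, f in symbols)
    # find the split point whose left-half total is closest to half of the total
    running, best_i, best_diff = 0, 1, float("inf")
    for i in range(1, len(symbols)):
        running += symbols[i - 1][1]
        diff = abs(total - 2 * running)
        if diff < best_diff:
            best_i, best_diff = i, diff
    codes = {}
    for sym, code in split_codes(symbols[:best_i]).items():
        codes[sym] = "0" + code          # left half gets a leading 0
    for sym, code in split_codes(symbols[best_i:]).items():
        codes[sym] = "1" + code          # right half gets a leading 1
    return codes

# Frequency counts from the example, already sorted most to least frequent
freqs = [("A", 15), ("B", 7), ("C", 6), ("D", 6), ("E", 5)]
print(split_codes(freqs))   # {'A': '00', 'B': '01', 'C': '10', 'D': '110', 'E': '111'}
```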

[email protected]

Example …

Page 11: Huffman Coding

• Huffman (1951)

• Uses frequencies of symbols in a string to build a variable-rate prefix code.

– Each symbol is mapped to a binary string.

– More frequent symbols have shorter codes.

– No code is a prefix of another.

• Example:

  a → 0
  b → 100
  c → 101
  d → 11
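As a quick illustration (not from the slides), a small check that this example code has the prefix property; the helper name is made up:

```python
# Example code from the slide
code = {"a": "0", "b": "100", "c": "101", "d": "11"}

def is_prefix_free(code):
    """True when no code word is a prefix (initial segment) of another."""
    words = sorted(code.values())
    # after lexicographic sorting, any prefix relation shows up between neighbours
    return all(not words[i + 1].startswith(words[i]) for i in range(len(words) - 1))

print(is_prefix_free(code))                      # True
print(is_prefix_free({"a": "0", "b": "01"}))     # False: "0" is a prefix of "01"
```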

[email protected]

Huffman Coding

Page 12: Huffman Coding …

Q. Given a text that uses 32 symbols (26 different letters, space, and some punctuation characters), how can we encode this text in bits?

A. With a fixed-length code: since 2^5 = 32, each of the 32 symbols can be encoded with 5 bits.

[email protected]


Page 14: Huffman Coding …

Q. Some symbols (e, t, a, o, i, n) are used far more often than others. How can we use this to reduce our encoding?

A. Encode these characters with fewer bits, and the others with more bits.

[email protected]


Page 16: Huffman Coding …

Q. How do we know when the next symbol begins?

A. Use a separation symbol (like the pause in Morse code), or make sure that there is no ambiguity by ensuring that no code is a prefix of another one.

Ex. c(a) = 01, c(b) = 010, c(e) = 1. What is 0101? It is ambiguous: it can be read as 01|01 = "aa" or as 010|1 = "be", because c(a) is a prefix of c(b).

[email protected]

Page 17: Cost of a Huffman Tree

• Let p1, p2, ..., pm be the probabilities for the symbols a1, a2, ..., am, respectively.

• Define the cost of the Huffman tree T to be

  C(T) = p1 r1 + p2 r2 + ... + pm rm,

  where ri is the length of the path from the root to ai.

• C(T) is the expected length of the code of a symbol coded by the tree T. C(T) is the bit rate of the code.

[email protected]

Page 18: Cost of a Huffman Tree

• Input: Probabilities p1, p2, ..., pm for symbols a1, a2, ..., am, respectively.

• Output: A prefix code that has the lowest possible average number of bits per symbol.

  That is, we want to minimize C(T) = p1 r1 + p2 r2 + ... + pm rm.

Suppose we model a code in a binary tree…

[email protected]

Cost of a Huffman Tree

Page 19: Example of Cost

• Example: a : 1/2, b : 1/8, c : 1/8, d : 1/4

[Tree on the slide: a is a leaf at depth 1, d is at depth 2, and b and c are at depth 3.]

C(T) = 1 × 1/2 + 3 × 1/8 + 3 × 1/8 + 2 × 1/4 = 1.75
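A short check of this arithmetic (not from the slides); the depths are read off the cost expression above:

```python
# Probabilities and root-to-leaf depths from the example tree
# (a at depth 1, b and c at depth 3, d at depth 2)
symbols = {"a": (0.5, 1), "b": (0.125, 3), "c": (0.125, 3), "d": (0.25, 2)}

# C(T) = sum of p_i * r_i, the expected code length in bits per symbol
cost = sum(p * r for p, r in symbols.values())
print(cost)   # 1.75
```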

[email protected]

Page 20: Lecture 2: Huffman Coding

Ex. c(a) = 11, c(e) = 01, c(k) = 001, c(l) = 10, c(u) = 000

Note: only the leaves have a label.

An encoding of x is a prefix of an encoding of y if and only if the path of x is a prefix of the path of y.

[email protected]

Page 21: Optimality

Principle 1

• In a Huffman tree, a lowest-probability symbol has maximum distance from the root.

– Exchanging a lowest-probability symbol with one at maximum distance cannot increase the cost. Suppose the lowest probability p sits at depth k and a symbol with probability q sits at maximum depth h, so q >= p and h >= k. Swapping the two symbols gives a tree T' with

  C(T) - C(T') = hq + kp - hp - kq = (h - k)(q - p) >= 0

Page 22: Optimality …

Principle 2

• The second-lowest probability symbol is a sibling of the lowest in some Huffman tree.

– We can move it there without raising the cost.

[email protected]

Optimality …

Page 23: Optimality …

Principle 3

• Assume we have a Huffman tree T whose two lowest-probability symbols are siblings at maximum depth. They can be replaced by a new symbol whose probability is the sum of their probabilities.

– The resulting tree is optimal for the new symbol set.

[email protected]

Optimality …

Page 24: Optimality …

1. If there is just one symbol, a tree with one node is optimal. Otherwise:

2. Find the two lowest-probability symbols, with probabilities p and q respectively.

3. Replace these with a new symbol with probability p + q.

4. Solve the problem recursively for the new set of symbols.

5. Replace the leaf for the new symbol with an internal node whose two children are the old symbols.
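A compact sketch of this procedure (not code from the slides), using a binary heap to find the two lowest-probability symbols instead of explicit recursion; the probabilities are the a/b/c/d example from the cost slides, and the resulting code words may differ from the earlier example while achieving the same expected length:

```python
import heapq
from itertools import count

def huffman_code(probs):
    """Build an optimal prefix code by repeatedly merging the two
    lowest-probability symbols (steps 2-5 above)."""
    tiebreak = count()                       # keeps heap entries comparable
    # each heap entry: (probability, tiebreak, subtree); a subtree is either
    # a symbol or a (left, right) pair
    heap = [(p, next(tiebreak), sym) for sym, p in probs.items()]
    heapq.heapify(heap)
    if len(heap) == 1:                       # single symbol: one-node tree
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        p, _, left = heapq.heappop(heap)     # two lowest probabilities
        q, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (p + q, next(tiebreak), (left, right)))
    # unfold the final tree into code words: left child 0, right child 1
    codes = {}
    def walk(node, path):
        if isinstance(node, tuple):
            walk(node[0], path + "0")
            walk(node[1], path + "1")
        else:
            codes[node] = path
    walk(heap[0][2], "")
    return codes

print(huffman_code({"a": 0.5, "b": 0.125, "c": 0.125, "d": 0.25}))
# e.g. {'a': '0', 'd': '10', 'b': '110', 'c': '111'} with cost 1.75
```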

[email protected]

Optimality …

Page 25: Optimality …

Principle 3 (cont'd)

• If T' were not optimal, then we could find a lower-cost tree T''. This would lead to a lower-cost tree T''' for the original alphabet.

[email protected]

Optimality …


Page 28: Optimality …

Q. What is the meaning of 111010001111101000?

A. "simpel"

Q. How can this prefix code be made more efficient?

A. Change the encoding of p and s to a shorter one. This tree is now full.

[email protected]

Optimality …


Page 30: Optimality …

Definition. A tree is full if every node that is not a leaf has two children.

Claim. The binary tree corresponding to the optimal prefix code is full.

Q. Where in the tree of an optimal prefix code should symbols with a high frequency be placed?

A. Near the top.

