CMSC 132 -- Project 6

Huffman Tree Encoding/Decoding

Project Due: Wednesday 4/20/05 at 11:00 PM

This project is the hardest one yet -- get started now!

Closed Policy

This is a closed project.  You are expected to do all of the work on this project without consulting with anyone other than the CMSC 132 instructors and TAs.  Treat this project as though it were a take home exam.  If you have any questions about whether or not a certain topic is okay to discuss with classmates or friends, please ask your instructor.



The goals of this project are:


What's a Huffman Tree?

In case you missed that lecture (or have been hiding under a rock) we will give a very brief description here.

A Huffman Tree is a special type of binary tree used for data compression.  A small Huffman Tree appears below:

Each leaf in the tree contains a symbol (in this case a Character) and an integer value.  Each non-leaf (internal node) contains just references to its left and right children.  The 0's and 1's that you see are not stored anywhere, we just mentally associate 0 with any left branch and 1 with any right branch.

Each character that appears in the tree is assigned a unique code (a sequence of 0's and 1's) obtained by following the path from the root of the tree to the leaf containing that character.   Below are the codes for the four characters found in this sample tree:


A sequence of characters can be "encoded" into a String of 0's and 1's by using the codings as described above.  For example, the String "eddbc" would be encoded into the sequence "01111100101".  (That's good compression -- we went from 5 characters down to 11 bits.)


A sequence of 0's and 1's can be "decoded" into a String of Characters by using the codings backwards.  For example, the sequence "11010111" could be decoded into the String "decd".

Constructing a Huffman Tree from a Stream of Characters

Given a "sample" stream of Characters, one can create a Huffman Tree which will do a very good job of compressing other streams whose characters share the same relative frequency as those in the sample stream used to create the Tree.

1.  The first step is to count the frequency of each character in the "sample" stream of Characters.  Suppose your sample stream is:


Below are the counts for each character that occurs in the stream:

2.  In this example, the next step is to create four little trees -- one for each character -- each consisting of just a single leaf.  These little trees should be put into some kind of list.  Below is a picture of the list of four little trees:


3.  The next step is to select the two trees with the SMALLEST frequencies.  In this example, that would be 8 and 11.  Remove them from the list of trees, and then join them as children of a new node to make a new tree.  Then add this new tree to the list of trees.  Below is the list of trees after this step has been accomplished.  There are now three trees in the list.


The "frequency" of a tree that is more than just a single leaf is the sum of the frequencies of all leaves below it.  So the frequency of the first tree above is 19, the sum of it's leaves.

Now we repeat step 3.  (This time, the two smallest frequencies are 19 and 12).  Below is the list of trees after we join them.  (We're down to TWO trees now).


Now we repeat step 3 again.  (This time, the two smallest frequencies are obvious -- 31 and 49).  Below is the list of trees after we join them.  (Just one tree now.)

Since the list now contains just one tree, we are done -- this is the Huffman Tree generated by the "sample" sequence of characters.


Project Overview

You will write a program that will first construct a Huffman Tree from a sequence of Characters.  Each leaf at the bottom of the tree will represent just a single Character.  We will allow letters, digits, punctuation, and any other type of Character.

Once the Huffman Tree has been built, your program will be able to do two things:


Efficiency Requirement

This is the first project where you may be penalized if your program performs too slowly.  If your implementation takes less than 10 times as long as Fawzi's to get through each release test, then you will be fine.  Otherwise, you will not pass the release tests.  (Fawzi's code is not particularly efficient, so as long as you don't do anything really inefficient, you should be okay.)



HuffmanTree is an interface that will be implemented by two classes:  HuffmanInternalNode and HuffmanLeaf.

A HuffmanLeaf is a node at the bottom of the Tree.  It has two fields, listed below.  You may not add any other fields to this class.

A HuffmanInternalNode is a node in the tree that is not a leaf.  It has two fields that are self-explanatory.  You may not add any other fields to this class.

There is a class called HuffmanTools that will contain the three static methods that make this project work.  You will be writing these methods.  They are:

There is a class called InternalTools, which you may ignore, but which you may find helpful while testing your project.  There are two public methods in there:


What YOU Will Implement

You will write most of the HuffmanInternalNode class, most of the HuffmanLeaf class, and almost all of the HuffmanTools class.  Please see the javadoc in the file for information about how to implement the methods of the HuffmanInternalNode and HuffmanLeaf classes.

You should begin by first reading over the javadoc for the methods in the HuffmanTools class, and then the javadoc for the HuffmanTree interface. This will give you a feel for how the project components fit together.  Then write the methods of the HuffmanLeaf class and the HuffmanInternalNode class.  Finally, write the methods of the HuffmanTools class.



Your grade on this project will be calculated as follows: