Problem: Given some sequence, like a string, we want to be able to assign a binary string to each letter, for storage or transmission purposes. How can we do this?

Intuition

Suppose we have a binary string like HELLOOOO. How can we assign a binary string to each letter?

One idea is to just simply assign incremental bit values to each letter

This will give us string , taking up 16 bits. However, what if we can do fewer?

To reduce the number of bits, we could potentially use assignment

But this would have issues, as some strings are prefixes of others, which creates ambiguity!

Let’s instead try

Which won’t have ambiguity, and give us string , taking up 14 bits!

Notice how with different assignments, we can encode the same string in a differing number of bits. This is because of differing encoding techniques.

  • Fixed Length Encoding represents encoding where all binary strings are of equal length.
  • Prefix Length Encoding represents encoding where there is no binary string that is a prefix of another (but they may not be the same size).

Given a character string (or any other sequence), we ask - how do we come up with an assignment of binary strings minimizing the total number of bits?

Algorithm

Example

Take string HELLOOOO. Then, for each unique character, calculate its count, and build a tree consisting of a single node with its count and letter.

graph TD
      1[1:H]; 2[1:E]; 3[2:L]; 4[4:O];

Now with this tree, iterate and do the following:

  1. Pick the 2 nodes with the smallest counts.
  2. Place these nodes as children of a root nodes with no letter, and count set as the sum of the two children’s counts.

After the first iteration, we choose the and node to obtain

graph TD
     1[H:1]; 2[E:1]; 3[L:2]; 4[O:4];
     5[-:2] -.- 1 & 2;

After the second iteration, we choose the previous root node and node to obtain

graph TD
     1[H:1]; 2[E:1]; 3[L:2]; 4[O:4];
     5[-:2] -.- 1 & 2;
     6[-:4] -.- 5 & 3;

Finally, after the third iteration, we obtain

graph TD
     1[H:1]; 2[E:1]; 3[L:2]; 4[O:4];
     5[-:2] -.- 1 & 2;
     6[-:4] -.- 5 & 3;
     7[-:8] -.- 6 & 4;

Notice that with this tree, the depth represents relative frequencies of counts, where the most frequent count is at the top, and the least frequent count is at the bottom!

Now, we’ll give labels to each of the branches, where denotes a left branch, and denotes a right branch. Note that this assignment of labels is arbitrary, and could be swapped.

graph TD
     1[H:1]; 2[E:1]; 3[L:2]; 4[O:4];
     5[-:2];
     5 -. 1 .- 1;
     5 -. 0 .- 2;
     
     
     6[-:4];
     6 -. 1 .- 3;
     6 -. 0 .- 5;
     
     7[-:8];
     7 -. 1 .- 4;
     7 -. 0 .- 6;

Finally, we’ll give binary assignments by concatenating the labels along the path from the root to each leaf (which represent a character).

We are done! It can be proven that this encoding we just derived minimizes the number of bits we can use.

Pseudocode

The below pseudocode represents the tree building portion of the algorithm.

def build_tree(nodes):
    min_heap = heapify(nodes) # build a min heap with the node counts
    
    for i in range(1, nodes.size - 1):
        c1 = min_heap.heappop() # extract tree/node with smallest count
        c2 = min_heap.heappush() # extract next tree/node
 
        tree = combine_trees(c1,c2)
        min_heap.heappush(tree) # add new tree back to heap
 
    return min_heap.peek(0) # last element in heap

We have the following time complexities. Let denote the number of nodes we start with.

  • Line takes time to build a min heap.
  • Line will iterate times.
    • Lines , , will each take time

This gives us total time complexity !

The below pseudocode represents the code extraction portion of the algorithm.

# dict will contain the final encodings
def extract_codes(node: Node, encodings: dict, s: String):
    if is_leaf(node):
       encodings[node.char] = s
    else:
        extract_codes(node.left, encodings, s + "0")
        extract_codes(node.right, encodings, s + "1")

This algorithm will run once for each node. If is the number of characters (starting nodes) we have, then our final tree will have nodes. This gives us final time complexity .