\section{Huffman Codes}

\example{}
Now consider the alphabet $\{\texttt{A}, \texttt{B}, \texttt{C}, \texttt{D}, \texttt{E}\}$. \par
With a na\"ive coding scheme, we can encode a length-$n$ string with $3n$ bits, by mapping...
\begin{itemize}
	\item $\texttt{A}$ to $\texttt{000}$
	\item $\texttt{B}$ to $\texttt{001}$
	\item $\texttt{C}$ to $\texttt{010}$
	\item $\texttt{D}$ to $\texttt{011}$
	\item $\texttt{E}$ to $\texttt{100}$
\end{itemize}
For example, this encodes \texttt{ADEBCE} as \texttt{[000 011 100 001 010 100]}. \par
To encode strings over $\{\texttt{A}, \texttt{B}, \texttt{C}, \texttt{D}, \texttt{E}\}$ with this scheme, we need exactly three bits per symbol.

\vspace{2mm}

One could argue that this coding scheme is wasteful: \par
we're not using three of the eight possible three-bit sequences!

\example{}
There is, of course, a better way. \par
Consider the following mapping:
\begin{itemize}
	\item $\texttt{A}$ to $\texttt{00}$
	\item $\texttt{B}$ to $\texttt{01}$
	\item $\texttt{C}$ to $\texttt{10}$
	\item $\texttt{D}$ to $\texttt{110}$
	\item $\texttt{E}$ to $\texttt{111}$
\end{itemize}

\problem{}
\begin{itemize}
	\item Using the above code, encode \texttt{ADEBCE}.
	\item Then, decode \texttt{[110011001111]}.
\end{itemize}
\begin{solution}
	\texttt{ADEBCE} becomes \texttt{[00 110 111 01 10 111]}, \par
	and \texttt{[110 01 10 01 111]} is \texttt{DBCBE}.
\end{solution}
\vfill

\problem{}
How many bits does this code need per symbol, on average?
\begin{solution}
	Assuming all five symbols are equally likely, the average is
	\begin{equation*}
		\frac{2 + 2 + 2 + 3 + 3}{5} = \frac{12}{5} = 2.4
	\end{equation*}
\end{solution}
\vfill

\problem{}
Consider the code below. How is it different from the one above? \par
Is this a good way to encode strings over our five-letter alphabet?
\begin{itemize}
	\item $\texttt{A}$ to $\texttt{00}$
	\item $\texttt{B}$ to $\texttt{01}$
	\item $\texttt{C}$ to $\texttt{10}$
	\item $\texttt{D}$ to $\texttt{110}$
	\item $\texttt{E}$ to $\texttt{11}$
\end{itemize}
\begin{solution}
	No.
	The code for \texttt{E} is a prefix of the code for \texttt{D}, so we can't decode sequences uniquely.
	For example, we could decode the fragment \texttt{[11001$\cdots$]} as \texttt{EA} or as \texttt{DB}.
\end{solution}
\vfill
\pagebreak

\remark{}
Huffman codes can be visualized as a tree that we traverse while decoding our sequence. \par
We start at the topmost node, taking the left edge if we see a \texttt{0} and the right edge if we see a \texttt{1}. \par
As an example, consider the code for $\{\texttt{A}, \texttt{B}, \texttt{C}, \texttt{D}, \texttt{E}\}$ on the previous page:
\begin{itemize}
	\item $\texttt{A}$ encodes as $\texttt{00}$
	\item $\texttt{B}$ encodes as $\texttt{01}$
	\item $\texttt{C}$ encodes as $\texttt{10}$
	\item $\texttt{D}$ encodes as $\texttt{110}$
	\item $\texttt{E}$ encodes as $\texttt{111}$
\end{itemize}
Drawing this scheme as a tree, we get the following:
\begin{center}
\begin{tikzpicture}[scale=1.0]
	\begin{scope}[layer = nodes]
		\node[int] (x) at (0, 0) {};
		\node[int] (0) at (-0.75, -1) {};
		\node[int] (1) at (0.75, -1) {};
		\node[end] (00) at (-1.25, -2) {\texttt{A}};
		\node[end] (01) at (-0.25, -2) {\texttt{B}};
		\node[end] (10) at (0.25, -2) {\texttt{C}};
		\node[int] (11) at (1.25, -2) {};
		\node[end] (110) at (0.75, -3) {\texttt{D}};
		\node[end] (111) at (1.75, -3) {\texttt{E}};
	\end{scope}

	\draw[-]
		(x) to node[midway, fill=white, text=gray] {\texttt{0}} (0)
		(x) to node[midway, fill=white, text=gray] {\texttt{1}} (1)
		(0) to node[midway, fill=white, text=gray] {\texttt{0}} (00)
		(0) to node[midway, fill=white, text=gray] {\texttt{1}} (01)
		(1) to node[midway, fill=white, text=gray] {\texttt{0}} (10)
		(1) to node[midway, fill=white, text=gray] {\texttt{1}} (11)
		(11) to node[midway, fill=white, text=gray] {\texttt{0}} (110)
		(11) to node[midway, fill=white, text=gray] {\texttt{1}} (111)
	;
\end{tikzpicture}
\end{center}
\vfill
\pagebreak
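
\example{}
To illustrate the traversal described above, here is a sketch of decoding the bit string \texttt{[0111010]} with this tree. \par
This string is chosen purely for illustration. We assume that each time we reach a leaf, we write down its letter and return to the topmost node.
\begin{center}
\begin{tabular}{c|l|c}
	bit read & move & output \\
	\hline
	\texttt{0} & from the topmost node, take the left edge & \\
	\texttt{1} & take the right edge, reaching a leaf & \texttt{B} \\
	\texttt{1} & from the topmost node, take the right edge & \\
	\texttt{1} & take the right edge & \\
	\texttt{0} & take the left edge, reaching a leaf & \texttt{D} \\
	\texttt{1} & from the topmost node, take the right edge & \\
	\texttt{0} & take the left edge, reaching a leaf & \texttt{C} \\
\end{tabular}
\end{center}
So \texttt{[0111010]} splits as \texttt{[01 110 10]} and decodes to \texttt{BDC}. \par
This takes seven bits, while the na\"ive three-bit scheme would need nine bits for the same three letters.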