\section{Huffman Codes}
\example{}
Now consider the alphabet $\{\texttt{A}, \texttt{B}, \texttt{C}, \texttt{D}, \texttt{E}\}$. \par
With a na\"ive coding scheme, we can encode a length-$n$ string with $3n$ bits, by mapping...
\begin{itemize}
\item $\texttt{A}$ to $\texttt{000}$
\item $\texttt{B}$ to $\texttt{001}$
\item $\texttt{C}$ to $\texttt{010}$
\item $\texttt{D}$ to $\texttt{011}$
\item $\texttt{E}$ to $\texttt{100}$
\end{itemize}
For example, this encodes \texttt{ADEBCE} as \texttt{[000 011 100 001 010 100]}. \par
To encode strings over $\{\texttt{A}, \texttt{B}, \texttt{C}, \texttt{D}, \texttt{E}\}$ with this scheme, we
need an average of three bits per symbol.
\vspace{2mm}
One could argue that this coding scheme is wasteful: \par
we're not using three of the eight possible three-bit sequences!
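
\vspace{2mm}

To make this count explicit: there are $2^3 = 8$ three-bit sequences and we use only five of them, leaving
\begin{equation*}
2^3 - 5 = 3
\end{equation*}
sequences unused. We can't fix this by shortening every code to two bits, since $2^2 = 4 < 5$.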
\example{}
There is, of course, a better way. \par
Consider the following mapping:
\begin{itemize}
\item $\texttt{A}$ to $\texttt{00}$
\item $\texttt{B}$ to $\texttt{01}$
\item $\texttt{C}$ to $\texttt{10}$
\item $\texttt{D}$ to $\texttt{110}$
\item $\texttt{E}$ to $\texttt{111}$
\end{itemize}
\problem{}
\begin{itemize}
\item Using the above code, encode \texttt{ADEBCE}.
\item Then, decode \texttt{[110011001111]}.
\end{itemize}
\begin{solution}
\texttt{ADEBCE} becomes \texttt{[00 110 111 01 10 111]}, \par
and \texttt{[110 01 10 01 111]} is \texttt{DBCBE}.
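
\vspace{2mm}

As a quick check on the savings, we can count the bits in the encoded string:
\begin{equation*}
2 + 3 + 3 + 2 + 2 + 3 = 15
\end{equation*}
The na\"ive three-bit code would need $3 \times 6 = 18$ bits for the same string.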
\end{solution}
\vfill
\problem{}
How many bits does this code need per symbol, on average?
\begin{solution}
\begin{equation*}
\frac{2 + 2 + 2 + 3 + 3}{5} = \frac{12}{5} = 2.4
\end{equation*}
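This average treats all five symbols as equally likely. As a sketch of the general case, suppose symbol $s$ makes up a fraction $p_s$ of the message and has a code word of $\ell_s$ bits (this notation is introduced here for illustration and is not used elsewhere in the handout). Then the average number of bits per symbol is
\begin{equation*}
\sum_{s} p_s \ell_s
\end{equation*}
which reduces to the value above when every $p_s = \frac{1}{5}$.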
\end{solution}
\vfill
\problem{}
Consider the code below. How is it different from the one on the previous page? \par
Is this a good way to encode strings over our five-letter alphabet?
\begin{itemize}
\item $\texttt{A}$ to $\texttt{00}$
\item $\texttt{B}$ to $\texttt{01}$
\item $\texttt{C}$ to $\texttt{10}$
\item $\texttt{D}$ to $\texttt{110}$
\item $\texttt{E}$ to $\texttt{11}$
\end{itemize}
\begin{solution}
No. The code for \texttt{E} is a prefix of the code for \texttt{D},
so we can't decode sequences uniquely. For example, the fragment
\texttt{[11001$\cdot\cdot\cdot$]} could begin with \texttt{EA}
or with \texttt{DB}.
\end{solution}
\vfill
\pagebreak
\remark{}
The code from the previous page can be visualized as a tree which we traverse while decoding our sequence.
Starting from the topmost node, we take the left edge if we see a \texttt{0} and the right edge if we see a \texttt{1}.
Once we reach a letter, we return to the top node and repeat the process.
\vspace{-5mm}
\null\hfill
\begin{minipage}[t]{0.48\textwidth}
\vspace{0pt}
\begin{itemize}
\item $\texttt{A}$ encodes as $\texttt{00}$
\item $\texttt{B}$ encodes as $\texttt{01}$
\item $\texttt{C}$ encodes as $\texttt{10}$
\item $\texttt{D}$ encodes as $\texttt{110}$
\item $\texttt{E}$ encodes as $\texttt{111}$
\end{itemize}
\end{minipage}
\hfill
\begin{minipage}[t]{0.48\textwidth}
\vspace{0pt}
\begin{center}
\begin{tikzpicture}[scale=1.0]
\begin{scope}[layer = nodes]
\node[int] (x) at (0, 0) {};
\node[int] (0) at (-0.75, -1) {};
\node[int] (1) at (0.75, -1) {};
\node[end] (00) at (-1.25, -2) {\texttt{A}};
\node[end] (01) at (-0.25, -2) {\texttt{B}};
\node[end] (10) at (0.25, -2) {\texttt{C}};
\node[int] (11) at (1.25, -2) {};
\node[end] (110) at (0.75, -3) {\texttt{D}};
\node[end] (111) at (1.75, -3) {\texttt{E}};
\end{scope}
\draw[-]
(x) to node[edg] {\texttt{0}} (0)
(x) to node[edg] {\texttt{1}} (1)
(0) to node[edg] {\texttt{0}} (00)
(0) to node[edg] {\texttt{1}} (01)
(1) to node[edg] {\texttt{0}} (10)
(1) to node[edg] {\texttt{1}} (11)
(11) to node[edg] {\texttt{0}} (110)
(11) to node[edg] {\texttt{1}} (111)
;
\end{tikzpicture}
\end{center}
\end{minipage}
\hfill\null
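
\vspace{2mm}

As a short illustration of this procedure, here is how we might decode \texttt{[01110]} with the tree above:
\begin{itemize}
\item \texttt{0}: go left from the top node.
\item \texttt{1}: go right. We reach \texttt{B}, so we return to the top.
\item \texttt{1}: go right from the top node.
\item \texttt{1}: go right again.
\item \texttt{0}: go left. We reach \texttt{D}.
\end{itemize}
The decoded string is \texttt{BD}.
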
\problem{}<treedecode>
Decode \texttt{[110111001001110110]} using the tree above.
\begin{solution}
This is \texttt{[110$\cdot$111$\cdot$00$\cdot$10$\cdot$01$\cdot$110$\cdot$110]}, which is \texttt{DEACBDD}.
\end{solution}
\vfill
\problem{}
In \ref{treedecode}, we needed 18 bits to encode \texttt{DEACBDD}. \par
\note{Note that we'd need $3 \times 7 = 21$ bits to encode this string na\"ively.}
\vspace{2mm}
Draw a tree that encodes this string more efficiently. \par
\begin{solution}
Two possible solutions are below. \par
\begin{itemize}
\item The left tree encodes \texttt{DEACBDD} as \texttt{[00$\cdot$111$\cdot$110$\cdot$10$\cdot$01$\cdot$00$\cdot$00]}, using 16 bits.
\item The right tree encodes \texttt{DEACBDD} as \texttt{[0$\cdot$111$\cdot$101$\cdot$110$\cdot$100$\cdot$0$\cdot$0]}, using 15 bits.
\end{itemize}
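Counting code word lengths symbol by symbol (in the order \texttt{D}, \texttt{E}, \texttt{A}, \texttt{C}, \texttt{B}, \texttt{D}, \texttt{D}) confirms these totals:
\begin{equation*}
2 + 3 + 3 + 2 + 2 + 2 + 2 = 16
\qquad\text{and}\qquad
1 + 3 + 3 + 3 + 3 + 1 + 1 = 15
\end{equation*}
\par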
\null\hfill
\begin{minipage}{0.48\textwidth}
\begin{center}
\begin{tikzpicture}[scale=1.0]
\begin{scope}[layer = nodes]
\node[int] (x) at (0, 0) {};
\node[int] (0) at (-0.75, -1) {};
\node[int] (1) at (0.75, -1) {};
\node[end] (00) at (-1.25, -2) {\texttt{D}};
\node[end] (01) at (-0.25, -2) {\texttt{B}};
\node[end] (10) at (0.25, -2) {\texttt{C}};
\node[int] (11) at (1.25, -2) {};
\node[end] (110) at (0.75, -3) {\texttt{A}};
\node[end] (111) at (1.75, -3) {\texttt{E}};
\end{scope}
\draw[-]
(x) to node[edg] {\texttt{0}} (0)
(x) to node[edg] {\texttt{1}} (1)
(0) to node[edg] {\texttt{0}} (00)
(0) to node[edg] {\texttt{1}} (01)
(1) to node[edg] {\texttt{0}} (10)
(1) to node[edg] {\texttt{1}} (11)
(11) to node[edg] {\texttt{0}} (110)
(11) to node[edg] {\texttt{1}} (111)
;
\end{tikzpicture}
\end{center}
\end{minipage}
\hfill
\begin{minipage}{0.48\textwidth}
\begin{center}
\begin{tikzpicture}[scale=1.0]
\begin{scope}[layer = nodes]
\node[int] (x) at (0, 0) {};
\node[end] (0) at (-0.75, -1) {\texttt{D}};
\node[int] (1) at (0.75, -1) {};
\node[int] (10) at (0.25, -2) {};
\node[int] (11) at (1.25, -2) {};
\node[end] (100) at (-0.15, -3) {\texttt{B}};
\node[end] (101) at (0.6, -3) {\texttt{A}};
\node[end] (110) at (0.9, -3) {\texttt{C}};
\node[end] (111) at (1.6, -3) {\texttt{E}};
\end{scope}
\draw[-]
(x) to node[edg] {\texttt{0}} (0)
(x) to node[edg] {\texttt{1}} (1)
(1) to node[edg] {\texttt{0}} (10)
(1) to node[edg] {\texttt{1}} (11)
(10) to node[edg] {\texttt{0}} (100)
(10) to node[edg] {\texttt{1}} (101)
(11) to node[edg] {\texttt{0}} (110)
(11) to node[edg] {\texttt{1}} (111)
;
\end{tikzpicture}
\end{center}
\end{minipage}
\hfill\null
\end{solution}
\vfill
\problem{}
Now, do the opposite: draw a tree that encodes \texttt{DEACBDD} \textit{less} efficiently than before.
\begin{solution}
Bury \texttt{D} as deep as possible in the tree, so that we need four bits to encode it.
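
\vspace{2mm}

One such tree (an illustrative choice, many others work) gives the code
\texttt{B} $\to$ \texttt{0}, \texttt{C} $\to$ \texttt{10}, \texttt{A} $\to$ \texttt{110}, \texttt{E} $\to$ \texttt{1110}, \texttt{D} $\to$ \texttt{1111}.
This encodes \texttt{DEACBDD} as \texttt{[1111$\cdot$1110$\cdot$110$\cdot$10$\cdot$0$\cdot$1111$\cdot$1111]}, using
\begin{equation*}
4 + 4 + 3 + 2 + 1 + 4 + 4 = 22
\end{equation*}
bits, more than the 18 bits needed in \ref{treedecode}.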
\end{solution}
\vfill
\remark{}
We say a coding scheme is \textit{prefix-free} if no whole code word is a prefix of another code word. \par
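For instance, in $\{\texttt{00}, \texttt{01}, \texttt{10}, \texttt{110}, \texttt{111}\}$ no code word begins with another, so this code is prefix-free.
In $\{\texttt{00}, \texttt{01}, \texttt{10}, \texttt{110}, \texttt{11}\}$, the code word \texttt{11} is a prefix of \texttt{110}, so that code is not. \par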
As we've seen, it is fairly easy to construct a prefix-free variable-length code using a binary tree. \par
Constructing the \textit{most efficient} prefix-free code for a given message is a bit more difficult. \par
We'll spend the rest of this section solving this problem.
\pagebreak