Advanced handouts

Add missing file
Co-authored-by: Mark <mark@betalupi.com>
Co-committed-by: Mark <mark@betalupi.com>
2025-01-22 12:28:44 -08:00
parent 13b65a6c64
commit dd4abdbab0
177 changed files with 20658 additions and 0 deletions

View File

@ -0,0 +1,31 @@
% use [nosolutions] flag to hide solutions.
% use [solutions] flag to show solutions.
\documentclass[
solutions,
singlenumbering
]{../../../lib/tex/ormc_handout}
\usepackage{../../../lib/tex/macros}
\input{tikzset.tex}
\usepackage{units}
\usepackage{pdfpages}
\uptitlel{Advanced 2}
\uptitler{\smallurl{}}
\title{Compression}
\subtitle{Prepared by Mark on \today{}}
% TODO: add a section on info theory,
% shannon entropy. etc.
\begin{document}
\maketitle
\input{parts/0 intro.tex}
\input{parts/1 runlength.tex}
\input{parts/2 lzss.tex}
\input{parts/3 huffman.tex}
\input{parts/4 bonus.tex}
\end{document}

View File

@ -0,0 +1,6 @@
[metadata]
title = "Compression"
[publish]
handout = true
solutions = true

View File

@ -0,0 +1,38 @@
\section{Introduction}
\definition{}
An \textit{alphabet} is a set of symbols. Two examples are
$\{\texttt{A}, \texttt{B}, \texttt{C}, \texttt{D}\}$ and $\{\texttt{0}, \texttt{1}\}$.
\definition{}
A \textit{string} is a sequence of symbols from an alphabet. \par
For example, \texttt{CBCAADDD} is a string over the alphabet $\{\texttt{A}, \texttt{B}, \texttt{C}, \texttt{D}\}$.
\problem{}
Say we want to store a length-$n$ string over the alphabet $\{\texttt{A}, \texttt{B}, \texttt{C}, \texttt{D}\}$ as a binary sequence. \par
How many bits will we need? \par
\hint{
Our alphabet has four symbols, so we can encode each symbol using two bits, \par
mapping $\texttt{A} \rightarrow \texttt{00}$,
$\texttt{B} \rightarrow \texttt{01}$,
$\texttt{C} \rightarrow \texttt{10}$, and
$\texttt{D} \rightarrow \texttt{11}$.
}
\begin{solution}
$2n$ bits.
\end{solution}
\vfill
\problem{}<naivelen>
Similarly, we can encode an $n$-symbol string over an alphabet of size $k$ \par
using $n \times \lceil \log_2k \rceil$ bits. Show that this is true. \par
\note[Note]{We'll call this the \textit{na\"ive coding scheme}.}
\vfill
As you might expect, this isn't ideal: we can do much better than $n \times \lceil \log_2k \rceil$.
We will spend the rest of this handout exploring more efficient ways of encoding such sequences of symbols.
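If you'd like to experiment on a computer, the na\"ive scheme is easy to implement. The Python sketch below is purely illustrative (the function name and the decision to sort the alphabet are ours); it assigns every symbol a fixed-width code of $\lceil \log_2 k \rceil$ bits.

\begin{verbatim}
# A minimal sketch of the naive coding scheme (illustrative only).
from math import ceil, log2

def naive_encode(string, alphabet):
    width = ceil(log2(len(alphabet)))      # bits per symbol
    code = {sym: format(i, f"0{width}b")
            for i, sym in enumerate(sorted(alphabet))}
    return "".join(code[s] for s in string)

# naive_encode("CBCAADDD", "ABCD") uses 2 bits per symbol: 16 bits total.
\end{verbatim}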
\pagebreak

View File

@ -0,0 +1,190 @@
% TODO:
% Basic run-length
% LZ77
\section{Run-length Coding}
%\definition{}
%\textit{Entropy} is a measure of information in a certain sequence. \par
%A sequence with high entropy contains a lot of information, and a sequence with low entropy contains relatively little.
%For example, consider the following two ten-symbol ASCII\footnotemark{} strings:
%\begin{itemize}
% \item \texttt{AAAAAAAAAA}
% \item \texttt{pDa3:7?j;F}
%\end{itemize}
%The first string clearly contains less information than the second.
%It's much harder to describe \texttt{pDa3:7?j;F} than it is \texttt{AAAAAAAAAA}.
%Thus, we say that the first has low entropy, and the second has fairly high entropy.
%
%\vspace{2mm}
%
%The definition above is intentionally hand-wavy. \par
%Formal definitions of entropy exist, but we won't need them today---we just need
%an intuitive understanding of the \say{density} of information in a given string.
%
%\footnotetext{
% American Standard Code for Information Interchange, an early character encoding for computers. \par
% It contains 128 symbols, including numbers, letters, and
% \texttt{!"\#\$\%\&`()*+,-./:;<=>?@[\textbackslash]\^\_\{|\}\textasciitilde}
%}
%\vspace{5mm}
\problem{}<runlenone>
Using the na\"ive coding scheme, encode \texttt{AAAA$\cdot$AAAA$\cdot$BCD$\cdot$AAAA$\cdot$AAAA} in binary. \par
\note[Note]{
We're still using the four-symbol alphabet $\{\texttt{A}, \texttt{B}, \texttt{C}, \texttt{D}\}$. \par
Dots ($\cdot$) in the string are drawn for readability. Ignore them.
}
\begin{solution}
There are eight \texttt{A}s on each end of that string. Mapping symbols as before, \par
we get \texttt{[00 00 00 00 00 00 00 00 01 10 11 00 00 00 00 00 00 00 00]}
\begin{instructornote}
In this handout, all encoded binary is written in square brackets. \par
Spaces, dashes, dots, and the like are added for readability and should be ignored.
\end{instructornote}
\end{solution}
\vfill
In \ref{runlenone}---and often, in the real world---the strings we want to encode have fairly low \textit{entropy}. \par
That is, they have predictable patterns, sequences of symbols that don't contain a lot of information. \par
\note{
For example, consider the text in this document. \par
The symbols \texttt{e}, \texttt{t}, and \texttt{<space>} are much more common than any others. \par
Also, certain subsequences are repeated: \texttt{th}, \texttt{and}, \texttt{encode}, and so on.
}
We can exploit this fact to develop encoding schemes that need relatively few bits per letter.
\example{}
A simple example of such a coding scheme is \textit{run-length encoding}. Instead of simply listing letters of a string
in their binary form, we'll add a \textit{count} to each letter, shortening repeated instances of the same symbol.
\vspace{2mm}
We'll encode our string into a sequence of 6-bit blocks, interpreted as follows:
\begin{center}
\begin{tikzpicture}
\node[anchor=west,color=gray] at (-2.3, 0) {Bits};
\node[anchor=west,color=gray] at (-2.3, -0.5) {Meaning};
\draw[color=gray] (-2.3, -0.25) -- (5.5, -0.25);
\draw[color=gray] (-2.3, 0.15) -- (-2.3, -0.65);
\node at (0, 0) {\texttt{0}};
\node at (1, 0) {\texttt{0}};
\node at (2, 0) {\texttt{1}};
\node at (3, 0) {\texttt{1}};
\node at (4, 0) {\texttt{0}};
\node at (5, 0) {\texttt{1}};
\draw (-0.5, 0.25) -- (5.5, 0.25);
\draw (-0.5, -0.25) -- (5.5, -0.25);
\draw (-0.5, -0.75) -- (5.5, -0.75);
\draw (-0.5, 0.25) -- (-0.5, -0.75);
\draw (3.5, 0.25) -- (3.5, -0.75);
\draw (5.5, 0.25) -- (5.5, -0.75);
\node at (1.5, -0.5) {number of copies};
\node at (4.5, -0.5) {symbol};
\end{tikzpicture}
\end{center}
So, the sequence \texttt{BBB} will be encoded as \texttt{[0011-01]}. \par
\note[Notation]{
Just like dots, dashes and spaces are added for readability. Pretend they don't exist. \par
Encoded binary sequences will always be written in square brackets. \texttt{[]}.
}
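If you want to check your work by computer, here is a rough Python sketch of this block format. The helper names are ours, and we cap each run at 15, the largest four-bit count; treat it as an illustration rather than a reference implementation.

\begin{verbatim}
# Rough sketch of the 6-bit run-length blocks described above.
SYMS = "ABCD"

def runlength_encode(string):
    blocks, i = [], 0
    while i < len(string):
        run = 1
        while (i + run < len(string)
               and string[i + run] == string[i]
               and run < 15):                  # 4-bit count: at most 15
            run += 1
        blocks.append(format(run, "04b") + format(SYMS.index(string[i]), "02b"))
        i += run
    return "".join(blocks)

def runlength_decode(bits):
    out = ""
    for i in range(0, len(bits), 6):
        count, sym = int(bits[i:i+4], 2), int(bits[i+4:i+6], 2)
        out += SYMS[sym] * count
    return out

# runlength_encode("BBB") == "001101"   (the example above)
\end{verbatim}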
\problem{}
Decode \texttt{[010000001111]} using this scheme.
\begin{solution}
\texttt{AAAADDD}
\end{solution}
\vfill
\problem{}
Encode \texttt{AAAA$\cdot$AAAA$\cdot$BCD$\cdot$AAAA$\cdot$AAAA} using this scheme. \par
Is this more or less efficient than \ref{runlenone}?
\begin{solution}
\texttt{[1000-00 0001-01 0001-10 0001-11 1000-00]} \par
This requires 30 bits, as compared to 38 in \ref{runlenone}.
\end{solution}
\vfill
\pagebreak
\problem{}
Give an example of a message on $\{\texttt{A}, \texttt{B}, \texttt{C}, \texttt{D}\}$
that uses $n$ bits when encoded with a na\"ive scheme, and \textit{fewer} than $\nicefrac{n}{2}$ bits
when encoded using the scheme described on the previous page.
\vfill
\problem{}
Give an example of a message on $\{\texttt{A}, \texttt{B}, \texttt{C}, \texttt{D}\}$
that uses $n$ bits when encoded with a na\"ive scheme, and \textit{more} than $2n$ bits
when encoded using the scheme described on the previous page.
\vfill
\problem{}
Is run-length coding always more efficient than na\"ive coding? \par
When does it work well, and when does it fail?
\vfill
\problem{}
Our coding scheme wastes a lot of space when our string has few runs of the same symbol. \par
Fix this problem: modify the scheme so that single occurrences of symbols do not waste space. \par
\hint{We don't need a run length for every symbol. We only need one for \textit{repeated} symbols.}
\begin{solution}
One idea is as follows: \par
\begin{itemize}
\item Encode single symbols na\"ively: \texttt{ABCD} becomes \texttt{[00 01 10 11]}
\item Signal runs using two copies of the same symbol: \texttt{AAAAAA} becomes \texttt{[00 00 0110]}. \par
When our decoder sees two copies of the same symbol, it will interpret the next four bits as
a run length.
\end{itemize}
\texttt{BDC$\cdot$DDDDD$\cdot$AADBDC} will be encoded as \texttt{[01 11 10 11-11-0101 00-00-0010 11 01 11 10]}.
\end{solution}
\vfill
\problem{}<firstlz>
Consider the following string: \texttt{ABCD$\cdot$ABCD$\cdot$BABABA$\cdot$ABCD$\cdot$ABCD}. \par
\begin{itemize}
\item How many bits do we need to encode this na\"ively? \par
\item How about with the (unmodified) run-length scheme described on the previous page?
\end{itemize}
\hint{You don't need to encode this string---just find the length of its encoded form.}
\begin{solution}
Na\"ively: \tab 22 bits \par
Run-length: \tab $6 \times 21 = 126$ bits. Watch out for the two repeated \texttt{A}s!
\end{solution}
\vfill
Neither solution to \ref{firstlz} is ideal. Run-length is very wasteful due to the lack of runs, and na\"ive coding
does not take advantage of repetition in the string. We'll need a better coding scheme.
\pagebreak

View File

@ -0,0 +1,174 @@
\section{LZ Codes}
The LZ family\footnotemark{} of codes (LZ77, LZ78, LZSS, LZMA, and others) takes advantage of repeated substrings
in a string. These codes are the basis of most modern compression algorithms, including DEFLATE, which is used in the ZIP, PNG,
and GZIP formats.
\footnotetext{
Named after Abraham Lempel and Jacob Ziv, the original inventors. \par
LZ77 is the algorithm described in their first paper on the topic, which was published in 1977. \par
LZ78, LZSS, and LZMA are minor variations on the same general idea.
}
\vspace{2mm}
The idea behind LZ is to represent repeated substrings as \textit{pointers} to previous parts of the string. \par
Pointers take the form \texttt{<pos, len>}, where \texttt{pos} is the position of the string to repeat and
\texttt{len} is the number of symbols to copy.
\vspace{2mm}
For example, we can encode the string \texttt{ABRACADABRA} as \texttt{[ABRACAD<7, 4>]}. \par
The pointer \texttt{<7, 4>} tells us to look back 7 positions (to the first \texttt{A}), and copy the next 4 symbols. \par
Note that pointers refer to the partially decoded output---\textit{not} to the encoded string. \par
This allows pointers to reference other pointers, and ensures that codes like \texttt{A<1,9>} are valid. \par
\note{For example, \texttt{[B<1,2>]} decodes to \texttt{BBB}.}
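For those who like to tinker, the decoding rule can be written in a few lines of Python. The sketch below is illustrative only: it models a pointer as a \texttt{(pos, len)} tuple and a plain symbol as a one-character string.

\begin{verbatim}
# Sketch of LZ decoding: pointers copy from the partially decoded OUTPUT.
def lz_decode(tokens):
    out = []
    for t in tokens:
        if isinstance(t, tuple):           # a pointer <pos, len>
            pos, length = t
            for _ in range(length):        # copy one symbol at a time, so a
                out.append(out[-pos])      # pointer may run past its own start
        else:                              # a plain symbol
            out.append(t)
    return "".join(out)

# lz_decode(list("ABRACAD") + [(7, 4)]) == "ABRACADABRA"
# lz_decode(["B", (1, 2)])              == "BBB"
\end{verbatim}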
\problem{}
Encode \texttt{ABCD$\cdot$ABCD$\cdot$BABABA$\cdot$ABCD$\cdot$ABCD} using this scheme. \par
Then, decode the following:
\begin{itemize}
\item \texttt{[ABCD<4,4>]}
\item \texttt{[A<1,9>]}
\item \texttt{[DAC<3,5>]}
\end{itemize}
\begin{solution}
% spell:off
\texttt{ABCD$\cdot$ABCD$\cdot$BABABA$\cdot$ABCD$\cdot$ABCD} becomes \texttt{[ABCD<4, 4> BA<2,4> ABCD<4,4>]}.
% spell:on
\linehack{}
In parts two and three, remember that we're reading the \textit{output string.} \par
The ten \texttt{A}s in part two are produced one by one, \par
with the decoder's \say{read head} following its \say{write head.}
\begin{itemize}
\item \texttt{ABCD$\cdot$ABCD}
\item \texttt{AAAAA$\cdot$AAAAA}
\item \texttt{DACDACDA}
\end{itemize}
\end{solution}
\vfill
\problem{}
Convince yourself that LZ is a generalization of the run-length code we discussed in the previous section.
\hint{\texttt{[A<1,9>]} and \texttt{[1010-00]} are the same thing: both decode to ten \texttt{A}s!}
\remark{}
Note that we left a few things out of this section: we didn't discuss the algorithm that converts a string to an LZ-encoded blob,
nor did we discuss how we should represent strings encoded with LZ in binary. We skipped these details because they are
problems of implementation---they're the engineer's headache, not the mathematician's. \par
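For the curious engineer, though, here is one possible greedy encoder in Python. At each position it emits a pointer to the longest match found in the text seen so far, and a plain symbol otherwise. Real implementations limit how far back they search and use much cleverer bookkeeping, so treat this only as an illustrative sketch.

\begin{verbatim}
# One possible greedy LZ encoder (illustrative only).
def lz_encode(string, min_len=3):
    tokens, i = [], 0
    while i < len(string):
        best_pos, best_len = 0, 0
        for pos in range(1, i + 1):            # how far back the pointer looks
            length = 0
            while (i + length < len(string)
                   and string[i + length] == string[i - pos + length]):
                length += 1                    # overlapping matches are fine
            if length > best_len:
                best_pos, best_len = pos, length
        if best_len >= min_len:                # short matches aren't worth a pointer
            tokens.append((best_pos, best_len))
            i += best_len
        else:
            tokens.append(string[i])
            i += 1
    return tokens

# lz_encode("ABRACADABRA") == ['A', 'B', 'R', 'A', 'C', 'A', 'D', (7, 4)]
\end{verbatim}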
\pagebreak
%\begin{instructornote}
% A simple LZ-scheme can work as follows. We encode our string into a sequence of
% nine-bit blocks, drawn below. The first bit of each block tells us whether or not
% this block is a pointer, and the next eight bits contain either a \texttt{pos, len} pair
% (using, say, four bits for each number) or a plain eight-bit symbol code.
% \begin{center}
% \begin{tikzpicture}
% \node[anchor=west,color=gray] at (-2.3, 0) {Bits};
% \node[anchor=west,color=gray] at (-2.3, -0.5) {Meaning};
% \draw[color=gray] (-2.3, -0.25) -- (5.5, -0.25);
% \draw[color=gray] (-2.3, 0.15) -- (-2.3, -0.65);
%
% \node at (0, 0) {\texttt{0}};
% \node at (1, 0) {\texttt{0}};
% \node at (2, 0) {\texttt{1}};
% \node at (3, 0) {\texttt{0}};
% \node at (4, 0) {\texttt{1}};
% \node at (5, 0) {\texttt{1}};
% \node at (6, 0) {\texttt{0}};
% \node at (7, 0) {\texttt{0}};
% \node at (8, 0) {\texttt{1}};
%
% \draw (-0.5, 0.25) -- (8.5, 0.25);
% \draw (-0.5, -0.25) -- (8.5, -0.25);
% \draw (-0.5, -0.75) -- (8.5, -0.75);
%
% \draw (-0.5, 0.25) -- (-0.5, -0.75);
% \draw (0.5, 0.25) -- (0.5, -0.75);
% \draw (8.5, 0.25) -- (8.5, -0.75);
%
% \node at (0, -0.5) {flag};
% \node at (4.5, -0.5) {if flag \texttt{<pos, len>}, else eight-bit symbol};
% \end{tikzpicture}
% \end{center}
%
% To encode a string, we read it using a \say{window}, shown below. This window consists of
% a search buffer and a lookahead buffer, both of which have a fixed (but configurable) size.
% This window passes over the string one character at a time, inserting a pointer if it finds
% the lookahead buffer inside its search buffer, and a plain character otherwise.
%
%
% \begin{center}
% \begin{tikzpicture}
% % Text tape
% \node[color=gray] at (-0.75, 0) {\texttt{...}};
% \node[color=gray] at (0.0, 0) {\texttt{D}};
% \node at (0.5, 0) {\texttt{A}};
% \node at (1.0, 0) {\texttt{B}};
% \node at (1.5, 0) {\texttt{C}};
% \node at (2.0, 0) {\texttt{D}};
% \node at (2.5, 0) {\texttt{A}};
% \node at (3.0, 0) {\texttt{B}};
% \node at (3.5, 0) {\texttt{C}};
% \node at (4.0, 0) {\texttt{D}};
% \node[color=gray] at (4.5, 0) {\texttt{B}};
% \node[color=gray] at (5.0, 0) {\texttt{D}};
% \node[color=gray] at (5.5, 0) {\texttt{A}};
% \node[color=gray] at (6.0, 0) {\texttt{C}};
% \node[color=gray] at (6.75, 0) {\texttt{...}};
%
% \draw (-1.75, 0.25) -- (7.25, 0.25);
% \draw (-1.75, -0.25) -- (7.25, -0.25);
%
%
% \draw[line width = 0.7mm, color=oblue, dotted] (2.25, 0.5) -- (2.25, -0.5);
% \draw[line width = 0.7mm, color=oblue]
% (-1.25, 0.5)
% -- (4.25, 0.5)
% -- (4.25, -0.5)
% -- (-1.25, -0.5)
% -- cycle
% ;
%
% \draw
% (4.2, -0.625)
% -- (4.2, -0.75)
% to node[anchor=north, midway] {lookahead} (2.3, -0.75)
% -- (2.3, -0.625)
% ;
%
% \draw
% (2.2, -0.625)
% -- (2.2, -0.75)
% to node[anchor=north, midway] {search buffer} (-1.1, -0.75)
% -- (-1.1, -0.625)
% ;
%
% \draw[color=gray]
% (2.2, 0.625)
% -- (2.2, 0.75)
% to node[anchor=south, midway] {match!} (0.3, 0.75)
% -- (0.3, 0.625)
% ;
%
% %\draw[->, color=gray] (2.5, 0.3) -- (2.5, 0.8) to[out=90,in=90] (0.5, 0.8);
% \node at (7.0, -0.75) {Result: \texttt{[$\cdot\cdot\cdot$DABCD<4,4>$\cdot\cdot\cdot$]}};
% \end{tikzpicture}
% \end{center}
%
% This is not the exact process used in practice---but it's close enough. \par
% This process may be tweaked in any number of ways.
%\end{instructornote}
%
%\makeatletter\if@solutions
% \vfill
% \pagebreak
%\fi\makeatother

View File

@ -0,0 +1,424 @@
\section{Huffman Codes}
\example{}
Now consider the alphabet $\{\texttt{A}, \texttt{B}, \texttt{C}, \texttt{D}, \texttt{E}\}$. \par
With the na\"ive coding scheme, we can encode a length $n$ string with $3n$ bits, by mapping...
\begin{itemize}
\item $\texttt{A}$ to $\texttt{000}$
\item $\texttt{B}$ to $\texttt{001}$
\item $\texttt{C}$ to $\texttt{010}$
\item $\texttt{D}$ to $\texttt{011}$
\item $\texttt{E}$ to $\texttt{100}$
\end{itemize}
For example, this encodes \texttt{ADEBCE} as \texttt{[000 011 100 001 010 100]}. \par
It is easy to see that this scheme uses an average of three bits per symbol.
\vspace{2mm}
However, one could argue that this coding scheme is wasteful: \par
we're not using three of the eight possible three-bit sequences!
\example{}
There is, of course, a better way. \par
Consider the following mapping:
\begin{itemize}
\item $\texttt{A}$ to $\texttt{00}$
\item $\texttt{B}$ to $\texttt{01}$
\item $\texttt{C}$ to $\texttt{10}$
\item $\texttt{D}$ to $\texttt{110}$
\item $\texttt{E}$ to $\texttt{111}$
\end{itemize}
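A mapping like this is easy to apply by machine. The short Python sketch below (illustrative, with the table stored as a dictionary) simply concatenates the code words; decoding is more interesting, and we'll return to it shortly.

\begin{verbatim}
# Encoding with the variable-length table above (illustrative).
CODE = {"A": "00", "B": "01", "C": "10", "D": "110", "E": "111"}

def encode(string):
    return "".join(CODE[s] for s in string)

# encode("ABC") == "000110"   (6 bits, versus 9 with the three-bit scheme)
\end{verbatim}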
\problem{}
\begin{itemize}
\item Using the above code, encode \texttt{ADEBCE}.
\item Then, decode \texttt{[110011001111]}.
\end{itemize}
\begin{solution}
\texttt{ADEBCE} becomes \texttt{[00 110 111 01 10 111]}, \par
and \texttt{[110 01 10 01 111]} is \texttt{DBCBE}.
\end{solution}
\vfill
\problem{}
How many bits does this code need per symbol, on average?
\begin{solution}
\begin{equation*}
\frac{2 + 2 + 2 + 3 + 3}{5} = \frac{12}{5} = 2.4
\end{equation*}
\end{solution}
\vfill
\problem{}
Consider the code below. How is it different from the one on the previous page? \par
Is this a good way to encode five-letter strings?
\begin{itemize}
\item $\texttt{A}$ to $\texttt{00}$
\item $\texttt{B}$ to $\texttt{01}$
\item $\texttt{C}$ to $\texttt{10}$
\item $\texttt{D}$ to $\texttt{110}$
\item $\texttt{E}$ to $\texttt{11}$
\end{itemize}
\begin{solution}
No. The code for \texttt{E} occurs inside the code for \texttt{D},
and we thus can't decode sequences uniquely. For example, we could
decode the fragment \texttt{[11001$\cdot\cdot\cdot$]} as \texttt{EA}
or as \texttt{DB}.
\end{solution}
\vfill
\pagebreak
\remark{}
The code from the previous page can be visualized as a full binary tree: \par
\note{Every node in a \textit{full binary tree} has either zero or two children.}
\vspace{-5mm}
\null\hfill
\begin{minipage}[t]{0.48\textwidth}
\vspace{0pt}
\begin{itemize}
\item $\texttt{A}$ encodes as $\texttt{00}$
\item $\texttt{B}$ encodes as $\texttt{01}$
\item $\texttt{C}$ encodes as $\texttt{10}$
\item $\texttt{D}$ encodes as $\texttt{110}$
\item $\texttt{E}$ encodes as $\texttt{111}$
\end{itemize}
\end{minipage}
\hfill
\begin{minipage}[t]{0.48\textwidth}
\vspace{0pt}
\begin{center}
\begin{tikzpicture}[scale=1.0]
\begin{scope}[layer = nodes]
\node[int] (x) at (0, 0) {};
\node[int] (0) at (-0.75, -1) {};
\node[int] (1) at (0.75, -1) {};
\node[end] (00) at (-1.25, -2) {\texttt{A}};
\node[end] (01) at (-0.25, -2) {\texttt{B}};
\node[end] (10) at (0.25, -2) {\texttt{C}};
\node[int] (11) at (1.25, -2) {};
\node[end] (110) at (0.75, -3) {\texttt{D}};
\node[end] (111) at (1.75, -3) {\texttt{E}};
\end{scope}
\draw[-]
(x) to node[edg] {\texttt{0}} (0)
(x) to node[edg] {\texttt{1}} (1)
(0) to node[edg] {\texttt{0}} (00)
(0) to node[edg] {\texttt{1}} (01)
(1) to node[edg] {\texttt{0}} (10)
(1) to node[edg] {\texttt{1}} (11)
(11) to node[edg] {\texttt{0}} (110)
(11) to node[edg] {\texttt{1}} (111)
;
\end{tikzpicture}
\end{center}
\end{minipage}
\hfill\null
You can think of each symbol's code as its \say{address} in this tree.
When decoding a string, we start at the topmost node. Reading the binary sequence
bit by bit, we move down the tree, taking a left edge if we see a \texttt{0}
and a right edge if we see a \texttt{1}.
Once we reach a letter, we return to the top node and repeat the process.
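Here is that walk written as a short Python sketch, with the tree stored as nested \texttt{(left, right)} pairs and letters at the leaves (a representation chosen purely for illustration).

\begin{verbatim}
# Decoding with a code tree: walk down bit by bit, emit a leaf, restart.
# Internal nodes are (left, right) pairs; leaves are single characters.
TREE = (("A", "B"), ("C", ("D", "E")))    # A=00, B=01, C=10, D=110, E=111

def tree_decode(bits, tree=TREE):
    out, node = [], tree
    for b in bits:
        node = node[0] if b == "0" else node[1]   # 0 = left edge, 1 = right edge
        if isinstance(node, str):                 # reached a letter
            out.append(node)
            node = tree                           # return to the top node
    return "".join(out)

# tree_decode("0111110") == "BEC"
\end{verbatim}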
\definition{}
We say a coding scheme is \textit{prefix-free} if no whole code word is a prefix of another code word. \par
\problem{}
Convince yourself that trees like the one above always produce a prefix-free code.
\problem{}<treedecode>
Decode \texttt{[110111001001110110]} using the tree above.
\begin{solution}
This is \texttt{[110$\cdot$111$\cdot$00$\cdot$10$\cdot$01$\cdot$110$\cdot$110]}, which is \texttt{DEACBDD}
\end{solution}
\vfill
\problem{}
Encode \texttt{ABDECBE} using this tree. \par
How many bits do we save over a na\"ive scheme?
\begin{solution}
This is \texttt{[00 01 110 111 10 01 111]}, and saves four bits.
\end{solution}
\vfill
\pagebreak
\problem{}
In \ref{treedecode}, we needed 18 bits to encode \texttt{DEACBDD}. \par
\note{Note that we'd need $3 \times 7 = 21$ bits to encode this string na\"ively.}
\vspace{2mm}
Draw a tree that encodes this string more efficiently. \par
\begin{solution}
Two possible solutions are below. \par
\begin{itemize}
\item The left tree encodes \texttt{DEACBDD} as \texttt{[00$\cdot$111$\cdot$110$\cdot$10$\cdot$01$\cdot$00$\cdot$00]}, using 16 bits.
\item The right tree encodes \texttt{DEACBDD} as \texttt{[0$\cdot$111$\cdot$101$\cdot$110$\cdot$100$\cdot$0$\cdot$0]}, using 15 bits.
\end{itemize}
\null\hfill
\begin{minipage}{0.48\textwidth}
\begin{center}
\begin{tikzpicture}[scale=1.0]
\begin{scope}[layer = nodes]
\node[int] (x) at (0, 0) {};
\node[int] (0) at (-0.75, -1) {};
\node[int] (1) at (0.75, -1) {};
\node[end] (00) at (-1.25, -2) {\texttt{D}};
\node[end] (01) at (-0.25, -2) {\texttt{B}};
\node[end] (10) at (0.25, -2) {\texttt{C}};
\node[int] (11) at (1.25, -2) {};
\node[end] (110) at (0.75, -3) {\texttt{A}};
\node[end] (111) at (1.75, -3) {\texttt{E}};
\end{scope}
\draw[-]
(x) to node[edg] {\texttt{0}} (0)
(x) to node[edg] {\texttt{1}} (1)
(0) to node[edg] {\texttt{0}} (00)
(0) to node[edg] {\texttt{1}} (01)
(1) to node[edg] {\texttt{0}} (10)
(1) to node[edg] {\texttt{1}} (11)
(11) to node[edg] {\texttt{0}} (110)
(11) to node[edg] {\texttt{1}} (111)
;
\end{tikzpicture}
\end{center}
\end{minipage}
\hfill
\begin{minipage}{0.48\textwidth}
\begin{center}
\begin{tikzpicture}[scale=1.0]
\begin{scope}[layer = nodes]
\node[int] (x) at (0, 0) {};
\node[int] (0) at (-0.75, -1) {\texttt{D}};
\node[int] (1) at (0.75, -1) {};
\node[end] (10) at (0.25, -2) {};
\node[int] (11) at (1.25, -2) {};
\node[end] (100) at (-0.15, -3) {\texttt{A}};
\node[end] (101) at (0.6, -3) {\texttt{B}};
\node[end] (110) at (0.9, -3) {\texttt{C}};
\node[end] (111) at (1.6, -3) {\texttt{E}};
\end{scope}
\draw[-]
(x) to node[edg] {\texttt{0}} (0)
(x) to node[edg] {\texttt{1}} (1)
(1) to node[edg] {\texttt{0}} (10)
(1) to node[edg] {\texttt{1}} (11)
(10) to node[edg] {\texttt{0}} (101)
(10) to node[edg] {\texttt{1}} (100)
(11) to node[edg] {\texttt{0}} (110)
(11) to node[edg] {\texttt{1}} (111)
;
\end{tikzpicture}
\end{center}
\end{minipage}
\hfill\null
\end{solution}
\vfill
\problem{}
Now, do the opposite: draw a tree that encodes \texttt{DEACBDD} \textit{less} efficiently than before.
\begin{solution}
Bury \texttt{D} as deep as possible in the tree, so that we need four bits to encode it.
\end{solution}
\vfill
\remark{}
As we just saw, constructing a prefix-free code is fairly easy. \par
Constructing the \textit{most efficient} prefix-free code for a
given message is a bit more difficult. \par
\pagebreak
\remark{}
Let's restate our problem. \par
Given an alphabet $A$ and a frequency function $f$, we want to construct a binary tree $T$ that minimizes
\begin{equation*}
\mathcal{B}_f(T) = \sum_{a \in A} f(a) \times d_T(a)
\end{equation*}
Where...
\begin{itemize}[itemsep=1mm]
\item $a$ is a symbol in $A$
\item $d_T(a)$ is the \say{depth} of $a$ in our tree. \par
\note{In other words, $d_T(a)$ is the number of bits we need to encode $a$}
\item $f(a)$ is a frequency function that maps each symbol in $A$ to a value in $[0, 1]$. \par
You can think of this as the distribution of symbols in messages we expect to encode. \par
For example, consider the alphabet $\{\texttt{A}, \texttt{B}, \texttt{C}\}$:
\begin{itemize}
\item In $\texttt{AAA}$, $f(\texttt{A}) = 1$ and $f(\texttt{B}) = f(\texttt{C}) = 0$.
\item In $\texttt{ABC}$, $f(\texttt{A}) = f(\texttt{B}) = f(\texttt{C}) = \nicefrac{1}{3}$.
\end{itemize}
\note{Note that $f(a) \geq 0$ and $\sum f(a) = 1$.}
\end{itemize}
\vspace{2mm}
Also notice that $\mathcal{B}_f(T)$ is the \say{average bits per symbol} metric we saw in previous problems.
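In code, $\mathcal{B}_f(T)$ is just a weighted sum. The sketch below (function name ours) computes it from a table of frequencies and a table of depths.

\begin{verbatim}
# B_f(T): the expected code length, i.e. average bits per symbol.
def average_bits(freq, depth):
    # freq[a]  = f(a), the relative frequency of symbol a (sums to 1)
    # depth[a] = d_T(a), the depth of a in the tree T
    return sum(freq[a] * depth[a] for a in freq)

# The five-letter code from earlier, with all symbols equally likely:
# average_bits({s: 0.2 for s in "ABCDE"},
#              {"A": 2, "B": 2, "C": 2, "D": 3, "E": 3})  ->  2.4 bits per symbol
\end{verbatim}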
\problem{}<hufptone>
Let $f$ be fixed frequency function over an alphabet $A$. \par
Let $T$ be an arbitrary tree for $A$, and let $a, b$ be two symbols in $A$. \par
Construct $T'$ by swapping $a$ and $b$ in $T$. Show that \par
\begin{equation*}
\mathcal{B}_f(T) - \mathcal{B}_f(T') = \Bigl(f(a) - f(b)\Bigr) \times \Bigl(d_T(a) - d_T(b)\Bigr)
\end{equation*}
\begin{solution}
$\mathcal{B}_f(T)$ and $\mathcal{B}_f(T')$ are nearly identical: they differ only in the terms for $a$ and $b$.
So, we get...
\begin{align*}
\mathcal{B}_f(T) - \mathcal{B}_f(T')
&= f(a)d_T(a) + f(b)d_T(b) - f(a)d_T(b) - f(b)d_T(a) \\
&= f(a)\bigl(d_T(a) - d_T(b)\bigr) + f(b)\bigl(d_T(b) - d_T(a)\bigr) \\
&= \Bigl(f(a) - f(b)\Bigr) \times \Bigl(d_T(a) - d_T(b)\Bigr)
\end{align*}
\end{solution}
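The identity is also easy to sanity-check numerically. The Python snippet below uses an arbitrary (illustrative) frequency function and a tree with leaves at depths 1, 2, 3, 3; the dyadic frequencies are chosen only so the arithmetic is exact.

\begin{verbatim}
# Numerical check of the swap identity on an arbitrary example.
freq  = {"a": 0.125, "b": 0.5, "c": 0.25, "d": 0.125}  # f (sums to 1)
depth = {"a": 3, "b": 1, "c": 2, "d": 3}               # d_T for some tree T
swapped = dict(depth, a=depth["b"], b=depth["a"])      # T': a and b exchanged

B_T  = sum(freq[s] * depth[s]   for s in freq)         # B_f(T)  = 1.75
B_Tp = sum(freq[s] * swapped[s] for s in freq)         # B_f(T') = 2.5

assert B_T - B_Tp == (freq["a"] - freq["b"]) * (depth["a"] - depth["b"])  # -0.75
\end{verbatim}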
\vfill
\pagebreak
\problem{}<hufpttwo>
Show that there is an optimal tree in which the two symbols with the lowest frequencies have the same parent.
\hint{You may assume that an optimal tree exists. There are a few cases.}
\begin{solution}
Let $T$ be an optimal tree, and let $a, b$ be the two symbols with the lowest frequency. \par
If there is a tie among three or more symbols, pick $a, b$ to be those with the greatest depth. \par
Label $a$ and $b$ so that $d_T(a) \geq d_T(b)$.
\vspace{1mm}
If $a$ and $b$ share a parent, we're done.
If $a$ and $b$ do not share a parent, we have three cases:
\begin{itemize}[itemsep=1mm]
\item There is a node $x$ with $d_T(x) > d_T(a)$. \par
Create $T'$ by swapping $a$ and $x$. Since $a$ has minimal frequency and we broke frequency ties by depth, $f(a) < f(x)$, and thus
by \ref{hufptone} $\mathcal{B}_f(T) > \mathcal{B}_f(T')$. This is a contradiction,
since we chose $T$ as an optimal tree---so this case is impossible.
\item $a$ is an only child. Create $T'$ by removing $a$'s parent and replacing it with $a$. \par
Then $\mathcal{B}_f(T) > \mathcal{B}_f(T')$, same contradiction as above. \par
\note{If we assume $T$ is a full binary tree, this case doesn't exist.}
\item $a$ has a sibling $x$, and $x$ isn't $b$. \par
Let $T'$ be the tree created by swapping $x$ and $b$ (thus making $a$ and $b$ siblings). \par
By \ref{hufptone}, $\mathcal{B}_f(T) \geq \mathcal{B}_f(T')$. $T$ is optimal, so there cannot
be a tree with a better average length---thus $\mathcal{B}_f(T) = \mathcal{B}_f(T')$ and $T'$
is also optimal.
\end{itemize}
\end{solution}
\vfill
\pagebreak
\problem{}
Devise an algorithm that builds an optimal tree given an alphabet $A$ and a frequency function $f$. \par
Then, use the previous two problems to show that your algorithm indeed produces an ideal tree. \par
\hint{
First, make an algorithm that makes sense intuitively. \par
Once you have something that looks good, start your proof.
} \par
\hint{Build from the bottom.}
\begin{solution}
\textbf{The Algorithm:} \par
Given an alphabet $A$ and a frequency function $f$...
\begin{itemize}
\item If $|A| = 1$, return a single node.
\item Let $a, b$ be two symbols with the smallest frequency.
\item Let $A' = A - \{a, b\} + \{x\}$ \tab \note{(Where $x$ is a new \say{placeholder} symbol)}
\item Let $f'(x) = f(a) + f(b)$, and $f'(s) = f(s)$ for all other symbols $s$.
\item Compute $T'$ by repeating this algorithm on $A'$ and $f'$
\item Create $T$ from $T'$ by adding $a$ and $b$ as children of $x$.
\end{itemize}
\vspace{2mm}
In plain English: pick the two symbols with the smallest frequencies, combine them into a single
\say{compound symbol}, and repeat until only one symbol is left.
\linehack{}
\textbf{The Proof:} \par
We'll proceed by induction on $|A|$. \par
Let $f$ be an arbitrary frequency function.
\vspace{4mm}
\textbf{Base case:} $|A| = 1$. We only have one vertex, and we thus only have one tree. \par
The algorithm above produces this tree. Done.
\vspace{4mm}
\textbf{Induction:} Assume that for all $A$ with $|A| = n - 1$, the algorithm above produces an ideal tree.
First, we'll show that $\mathcal{B}_f(T) = \mathcal{B}_{f'}(T') + f(a) + f(b)$, \par
where $x$ is the placeholder symbol from the algorithm, so that $d_T(a) = d_T(b) = d_{T'}(x) + 1$:
\begin{align*}
\mathcal{B}_f(T)
&= \sum_{s \in A - \{a, b\}} \Bigl(f(s)d_T(s)\Bigr) + f(a)d_T(a) + f(b)d_T(b) \\
&= \sum_{s \in A - \{a, b\}} \Bigl(f(s)d_T(s)\Bigr) + \Bigl(f(a)+f(b)\Bigr)\Bigl(d_{T'}(x) + 1\Bigr) \\
&= \sum_{s \in A - \{a, b\}} \Bigl(f(s)d_T(s)\Bigr) + f'(x)d_{T'}(x) + f(a) + f(b) \\
&= \sum_{s \in A'} \Bigl(f'(s)d_{T'}(s)\Bigr) + f(a) + f(b) \\
&= \mathcal{B}_{f'}(T') + f(a) + f(b)
\end{align*}
Now, assume that $T$ is not optimal. There then exists an optimal tree $U$ with $a$ and $b$ as siblings (by \ref{hufpttwo}).
Let $U'$ be the tree created by removing $a$ and $b$ from $U$ and treating their parent as the placeholder $x$. \par
Then $U'$ is a tree for $A'$ and $f'$, so we can repeat the calculation
above to find that $\mathcal{B}_f(U) = \mathcal{B}_{f'}(U') + f(a) + f(b)$.
\vspace{2mm}
So, $
\mathcal{B}_{f'}(T')
~=~ \mathcal{B}_f(T) - f(a) - f(b)
~>~ \mathcal{B}_f(U) - f(a) - f(b)
~=~ \mathcal{B}_{f'}(U')
$. \par
Since $T'$ is optimal for $A'$ and $f'$, this is a contradiction. $T$ must therefore be optimal.
\end{solution}
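For completeness, here is a rough Python sketch of this greedy procedure (this construction is the \textit{Huffman code} of the section title). It uses a heap to repeatedly pull out the two smallest frequencies and builds the tree as nested \texttt{(left, right)} pairs, mirroring the diagrams earlier; the details of the representation are ours.

\begin{verbatim}
# Sketch of the greedy (Huffman) construction described above.
import heapq
from itertools import count

def build_tree(freq):
    # freq maps symbols to frequencies. Returns a tree of nested
    # (left, right) pairs with symbols at the leaves.
    tiebreak = count()                       # avoids comparing trees directly
    heap = [(f, next(tiebreak), sym) for sym, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        fa, _, a = heapq.heappop(heap)       # the two smallest frequencies...
        fb, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (fa + fb, next(tiebreak), (a, b)))  # ...merged
    return heap[0][2]

# build_tree({"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125})
#     -> ('A', ('B', ('C', 'D')))   i.e. A=0, B=10, C=110, D=111
\end{verbatim}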
\vfill
\pagebreak

View File

@ -0,0 +1,40 @@
\section{Bonus problems}
\problem{}
Make sense of the document on the next page. \par
What does it describe, and how does it work?
\problem{}
Given a table with a marked point, $O$, and with $2013$ properly working watches put down on the table, prove that there exists a moment in time when the sum of the distances from $O$ to the watches' centers is less than the sum of the distances from $O$ to the tips of the watches' minute hands.
\vfill
\problem{A Minor Inconvenience}
A group of eight friends goes out to dinner. Each drives his own car, checking it in with the valet upon arrival.
Unfortunately, the valet attendant forgot to tag the friends' keys. Thus, when the group leaves the restaurant,
each friend is handed a random key.
\begin{itemize}
\item What is the probability that everyone gets the correct set of keys?
\item What is the probability that each friend gets the wrong set?
\end{itemize}
\vfill
\problem{Bimmer Parking}
A parking lot has a row of 16 spaces, of which a random 12 are taken. \par
Ivan drives a BMW, and thus needs two adjacent spaces to park. \par
What is the probability he'll find a spot?
\vfill
\pagebreak
\includepdf[
pages=1,
fitpaper=true
]{parts/qoi-specification.pdf}
\pagebreak

Binary file not shown.

View File

@ -0,0 +1,68 @@
\usetikzlibrary{arrows.meta}
\usetikzlibrary{shapes.geometric}
\usetikzlibrary{patterns}
% We put nodes in a separate layer, so we can
% slightly overlap with paths for a perfect fit
\pgfdeclarelayer{nodes}
\pgfdeclarelayer{path}
\pgfsetlayers{main,nodes}
% Layer settings
\tikzset{
% Layer hack, lets us write
% layer = * in scopes.
layer/.style = {
execute at begin scope={\pgfonlayer{#1}},
execute at end scope={\endpgfonlayer}
},
%
% Arrowhead tweak
>={Latex[ width=2mm, length=2mm ]},
%
% Labels inside edges
label/.style = {
rectangle,
% For automatic red background in solutions
fill = \ORMCbgcolor,
draw = none,
rounded corners = 0mm
},
%
% Nodes
edg/.style = {
midway,
fill = \ORMCbgcolor,
text = gray
},
int/.style = {},
end/.style = {
anchor=north
},
%
% Loop tweaks
loop above/.style = {
min distance = 2mm,
looseness = 8,
out = 45,
in = 135
},
loop below/.style = {
min distance = 5mm,
looseness = 10,
out = 315,
in = 225
},
loop right/.style = {
min distance = 5mm,
looseness = 10,
out = 45,
in = 315
},
loop left/.style = {
min distance = 5mm,
looseness = 10,
out = 135,
in = 215
}
}