From 8269bf1135cd777c95f7c6af898b02862c4431a4 Mon Sep 17 00:00:00 2001 From: Mark Date: Tue, 23 Apr 2024 17:33:58 -0700 Subject: [PATCH] Polish --- Advanced/Compression/parts/0 intro.tex | 4 +- Advanced/Compression/parts/1 runlength.tex | 84 +++++---- Advanced/Compression/parts/2 lzss.tex | 210 +++++++++++---------- Advanced/Compression/parts/3 huffman.tex | 59 ++++-- 4 files changed, 200 insertions(+), 157 deletions(-) diff --git a/Advanced/Compression/parts/0 intro.tex b/Advanced/Compression/parts/0 intro.tex index eda06ca..25d5644 100644 --- a/Advanced/Compression/parts/0 intro.tex +++ b/Advanced/Compression/parts/0 intro.tex @@ -9,7 +9,7 @@ A \textit{string} is a sequence of symbols from an alphabet. \par For example, \texttt{CBCAADDD} is a string over the alphabet $\{\texttt{A}, \texttt{B}, \texttt{C}, \texttt{D}\}$. \problem{} -Say we want to store a length-$n$ string over the alphabet $\{\texttt{A}, \texttt{B}, \texttt{C}, \texttt{D}\}$ as a binary blob. \par +Say we want to store a length-$n$ string over the alphabet $\{\texttt{A}, \texttt{B}, \texttt{C}, \texttt{D}\}$ as a binary sequence. \par How many bits will we need? \par \hint{ Our alphabet has four symbols, so we can encode each symbol using two bits, \par @@ -32,6 +32,6 @@ using $n \times \lceil \log_2k \rceil$ bits. Convince yourself that this is true \vfill -Of course, this isn't ideal---we can do much better than $n \times \lceil \log_2k \rceil$. +As you might expect, this isn't ideal: we can do much better than $n \times \lceil \log_2k \rceil$. We will spend the rest of this handout exploring more efficient ways of encoding such sequences of symbols. 
\pagebreak diff --git a/Advanced/Compression/parts/1 runlength.tex b/Advanced/Compression/parts/1 runlength.tex index 7696ac1..070cabb 100644 --- a/Advanced/Compression/parts/1 runlength.tex +++ b/Advanced/Compression/parts/1 runlength.tex @@ -5,37 +5,37 @@ \section{Run-length Coding} -\definition{} -\textit{Entropy} is a measure of information in a certain sequence. \par -A sequence with high entropy contains a lot of information, and a sequence with low entropy contains relatively little. -For example, consider the following two ten-symbol ASCII\footnotemark{} strings: -\begin{itemize} - \item \texttt{AAAAAAAAAA} - \item \texttt{pDa3:7?j;F} -\end{itemize} -The first string clearly contains less information than the second. -It's much harder to describe \texttt{pDa3:7?j;F} than it is \texttt{AAAAAAAAAA}. -Thus, we say that the first has low entropy, and the second has fairly high entropy. +%\definition{} +%\textit{Entropy} is a measure of information in a certain sequence. \par +%A sequence with high entropy contains a lot of information, and a sequence with low entropy contains relatively little. +%For example, consider the following two ten-symbol ASCII\footnotemark{} strings: +%\begin{itemize} +% \item \texttt{AAAAAAAAAA} +% \item \texttt{pDa3:7?j;F} +%\end{itemize} +%The first string clearly contains less information than the second. +%It's much harder to describe \texttt{pDa3:7?j;F} than it is \texttt{AAAAAAAAAA}. +%Thus, we say that the first has low entropy, and the second has fairly high entropy. +% +%\vspace{2mm} +% +%The definition above is intentionally hand-wavy. \par +%Formal definitions of entropy exist, but we won't need them today---we just need +%an intuitive understanding of the \say{density} of information in a given string. -\vspace{2mm} - -The definition above is intentionally hand-wavy. 
\par -Formal definitions of entropy exist, but we won't need them today---we just need -an intuitive understanding of the \say{density} of information in a given string. +% +%\footnotetext{ +% American Standard Code for Information Exchange, an early character encoding for computers. \par +% It contains 128 symbols, including numbers, letters, and +% \texttt{!"\#\$\%\&`()*+,-./:;<=>?@[\textbackslash]\^\_\{|\}\textasciitilde} +%} -\footnotetext{ - American Standard Code for Information Exchange, an early character encoding for computers. \par - It contains 128 symbols, including numbers, letters, and - \texttt{!"\#\$\%\&`()*+,-./:;<=>?@[\textbackslash]\^\_\{|\}\textasciitilde} -} - - -\vspace{5mm} +%\vspace{5mm} \problem{} -Using a na\"ive coding scheme, encode \texttt{AAAA$\cdot$AAAA$\cdot$BCD$\cdot$AAAA$\cdot$AAAA} as binary blob. \par +Using a na\"ive coding scheme, encode \texttt{AAAA$\cdot$AAAA$\cdot$BCD$\cdot$AAAA$\cdot$AAAA} in binary. \par \note[Note]{ We're still using the four-symbol alphabet $\{\texttt{A}, \texttt{B}, \texttt{C}, \texttt{D}\}$. \par Dots ($\cdot$) in the string are drawn for readability. Ignore them. @@ -48,12 +48,13 @@ Using a na\"ive coding scheme, encode \texttt{AAAA$\cdot$AAAA$\cdot$BCD$\cdot$AA \vfill -In \ref{runlenone}---and often, in the real world---the strings we want to encode have fairly low entropy. -We can leverage this fact to develop efficient encoding schemes. +In \ref{runlenone}---and often, in the real world---the strings we want to encode have fairly low \textit{entropy}. \par +They have predictable patterns, sequences of symbols that don't contain a lot of information. \par +We can exploit this fact to develop efficient encoding schemes. \example{} -The simplest such coding scheme is \textit{run-length encoding}. Instead of simply listing letters of a string -in their binary form, we'll add a \textit{count} to each letter, compressing repeated sequences of the same symbol. 
+A simple example of such a coding scheme is \textit{run-length encoding}. Instead of simply listing letters of a string +in their binary form, we'll add a \textit{count} to each letter, shortening repeated instances of the same symbol. \vspace{2mm} @@ -86,16 +87,10 @@ We'll encode our string into a sequence of 6-bit blocks, interpreted as follows: \end{tikzpicture} \end{center} So, the sequence \texttt{BBB} will be encoded as \texttt{[0011-01]}. \par -\note[Notation]{Just like spaces, dashes in a binary blob are added for readability.} - - -\remark{Notation} -In this handout, encoded binary blobs will always be written in square brackets. \par -Ignore spaces and dashes, they are provided for convenience. \par -For example, the binary sequences \texttt{[000 011 100 001 010 100]} and \texttt{[000011100001010100]} \par -are identical. The first, however, is easier to read. - -\pagebreak +\note[Notation]{ + Just like dots, dashes and spaces are added for readability. \par + Encoded binary sequences will always be written in square brackets. \texttt{[]}. +} \problem{} Encode \texttt{AAAA$\cdot$AAAA$\cdot$BCD$\cdot$AAAA$\cdot$AAAA} using this scheme. \par @@ -107,6 +102,15 @@ Is this more or less efficient than \ref{runlenone}? \end{solution} \vfill +\pagebreak + + + + + + + + \problem{} @@ -137,7 +141,7 @@ Fix this problem: modify the scheme so that single occurrences of symbols do not Consider the following string: \texttt{ABCD$\cdot$ABCD$\cdot$BABABA$\cdot$ABCD$\cdot$ABCD}. \par \begin{itemize} \item How many bits do we need to encode this na\"ively? \par - \item How about with the (unmodified) run-length scheme described above? + \item How about with the (unmodified) run-length scheme described on the previous page? 
\end{itemize} \hint{You don't need to encode this string---just find the length of its encoded form.} diff --git a/Advanced/Compression/parts/2 lzss.tex b/Advanced/Compression/parts/2 lzss.tex index c710993..0734333 100644 --- a/Advanced/Compression/parts/2 lzss.tex +++ b/Advanced/Compression/parts/2 lzss.tex @@ -1,6 +1,6 @@ \section{LZ Codes} -The LZ-family\footnotemark{} of codes (LZ77, LZ78, LZSS, LZMA, and others) take advantage of repeated sequences of symbols +The LZ-family\footnotemark{} of codes (LZ77, LZ78, LZSS, LZMA, and others) take advantage of repeated subsequences in a string. They are the basis of most modern compression algorithms, including DEFLATE, which is used in the ZIP, PNG, and GZIP formats. @@ -21,10 +21,10 @@ Pointers take the form \texttt{}, where \texttt{pos} is the position o For example, we can encode the string \texttt{ABRACADABRA} as \texttt{[ABRACAD<7, 4>]}. \par The pointer \texttt{<7, 4>} tells us to look back 7 positions (to the first \texttt{A}), and copy the next 4 symbols. \par Note that pointers refer to the partially decoded output---\textit{not} to the encoded string. \par -This allows pointers to reference other pointers, and ensures codes like \texttt{A<1,9>} are valid. +This allows pointers to reference other pointers, and ensures that codes like \texttt{A<1,9>} are valid. \problem{} -Encode \texttt{ABCD$\cdot$ABCD$\cdot$BABABA$\cdot$ABCD$\cdot$ABCD} using LZ. +Encode \texttt{ABCD$\cdot$ABCD$\cdot$BABABA$\cdot$ABCD$\cdot$ABCD} using this scheme. 
\par Then, decode the following: \begin{itemize} \item \texttt{[ABCD<4,4>]} @@ -39,7 +39,7 @@ Then, decode the following: \linehack{} In parts two and three, remember that we're reading the \textit{output string.} \par - The nine \texttt{A}s in part two are produced one by one, \par + The ten \texttt{A}s in part two are produced one by one, \par with the decoder's \say{read head} following its \say{write head.} \begin{itemize} @@ -58,98 +58,114 @@ Convince yourself that LZ is a generalization of the run-length code we discusse \remark{} Note that we left a few things out of this section: we didn't discuss the algorithm that converts a string to an LZ-encoded blob, nor did we discuss how we should represent strings encoded with LZ in binary. We skipped these details because they are -problems of implementation---they're the engineer's headache, not the mathematician's. If you're interested, a brief explanation is below. -Ask an instructor to explain. +problems of implementation---they're the engineer's headache, not the mathematician's. 
\par -\begin{center} - \begin{tikzpicture} - \node[anchor=west,color=gray] at (-2.3, 0) {Bits}; - \node[anchor=west,color=gray] at (-2.3, -0.5) {Meaning}; - \draw[color=gray] (-2.3, -0.25) -- (5.5, -0.25); - \draw[color=gray] (-2.3, 0.15) -- (-2.3, -0.65); +\pagebreak - \node at (0, 0) {\texttt{0}}; - \node at (1, 0) {\texttt{0}}; - \node at (2, 0) {\texttt{1}}; - \node at (3, 0) {\texttt{0}}; - \node at (4, 0) {\texttt{1}}; - \node at (5, 0) {\texttt{1}}; - \node at (6, 0) {\texttt{0}}; - \node at (7, 0) {\texttt{0}}; - \node at (8, 0) {\texttt{1}}; - - \draw (-0.5, 0.25) -- (8.5, 0.25); - \draw (-0.5, -0.25) -- (8.5, -0.25); - \draw (-0.5, -0.75) -- (8.5, -0.75); - - \draw (-0.5, 0.25) -- (-0.5, -0.75); - \draw (0.5, 0.25) -- (0.5, -0.75); - \draw (8.5, 0.25) -- (8.5, -0.75); - - \node at (0, -0.5) {flag}; - \node at (4.5, -0.5) {if flag \texttt{}, else eight-bit symbol}; - \end{tikzpicture} -\end{center} - - -\begin{center} - \begin{tikzpicture} - % Text tape - \node[color=gray] at (-0.75, 0) {\texttt{...}}; - \node[color=gray] at (0.0, 0) {\texttt{D}}; - \node at (0.5, 0) {\texttt{A}}; - \node at (1.0, 0) {\texttt{B}}; - \node at (1.5, 0) {\texttt{C}}; - \node at (2.0, 0) {\texttt{D}}; - \node at (2.5, 0) {\texttt{A}}; - \node at (3.0, 0) {\texttt{B}}; - \node at (3.5, 0) {\texttt{C}}; - \node at (4.0, 0) {\texttt{D}}; - \node[color=gray] at (4.5, 0) {\texttt{B}}; - \node[color=gray] at (5.0, 0) {\texttt{D}}; - \node[color=gray] at (5.5, 0) {\texttt{A}}; - \node[color=gray] at (6.0, 0) {\texttt{C}}; - \node[color=gray] at (6.75, 0) {\texttt{...}}; - - \draw (-1.75, 0.25) -- (7.25, 0.25); - \draw (-1.75, -0.25) -- (7.25, -0.25); - - - \draw[line width = 0.7mm, color=oblue, dotted] (2.25, 0.5) -- (2.25, -0.5); - \draw[line width = 0.7mm, color=oblue] - (-1.25, 0.5) - -- (4.25, 0.5) - -- (4.25, -0.5) - -- (-1.25, -0.5) - -- cycle - ; - - \draw - (4.2, -0.625) - -- (4.2, -0.75) - to node[anchor=north, midway] {lookahead} (2.3, -0.75) - -- (2.3, -0.625) - ; - - 
\draw - (2.2, -0.625) - -- (2.2, -0.75) - to node[anchor=north, midway] {search buffer} (-1.1, -0.75) - -- (-1.1, -0.625) - ; - - \draw[color=gray] - (2.2, 0.625) - -- (2.2, 0.75) - to node[anchor=south, midway] {match!} (0.3, 0.75) - -- (0.3, 0.625) - ; - - %\draw[->, color=gray] (2.5, 0.3) -- (2.5, 0.8) to[out=90,in=90] (0.5, 0.8); - \node at (7.0, -0.75) {Result: \texttt{[$\cdot\cdot\cdot$DABCD<4,4>$\cdot\cdot\cdot$]}}; - \end{tikzpicture} -\end{center} - - -\vfill -\pagebreak \ No newline at end of file +%\begin{instructornote} +% A simple LZ-scheme can work as follows. We encode our string into a sequence of +% nine-bit blocks, drawn below. The first bit of each block tells us whether or not +% this block is a pointer, and the next eight bits contain either a \texttt{pos, len} pair +% (using, say, four bits for each number) or a plain eight-bit symbol code. +% \begin{center} +% \begin{tikzpicture} +% \node[anchor=west,color=gray] at (-2.3, 0) {Bits}; +% \node[anchor=west,color=gray] at (-2.3, -0.5) {Meaning}; +% \draw[color=gray] (-2.3, -0.25) -- (5.5, -0.25); +% \draw[color=gray] (-2.3, 0.15) -- (-2.3, -0.65); +% +% \node at (0, 0) {\texttt{0}}; +% \node at (1, 0) {\texttt{0}}; +% \node at (2, 0) {\texttt{1}}; +% \node at (3, 0) {\texttt{0}}; +% \node at (4, 0) {\texttt{1}}; +% \node at (5, 0) {\texttt{1}}; +% \node at (6, 0) {\texttt{0}}; +% \node at (7, 0) {\texttt{0}}; +% \node at (8, 0) {\texttt{1}}; +% +% \draw (-0.5, 0.25) -- (8.5, 0.25); +% \draw (-0.5, -0.25) -- (8.5, -0.25); +% \draw (-0.5, -0.75) -- (8.5, -0.75); +% +% \draw (-0.5, 0.25) -- (-0.5, -0.75); +% \draw (0.5, 0.25) -- (0.5, -0.75); +% \draw (8.5, 0.25) -- (8.5, -0.75); +% +% \node at (0, -0.5) {flag}; +% \node at (4.5, -0.5) {if flag \texttt{}, else eight-bit symbol}; +% \end{tikzpicture} +% \end{center} +% +% To encode a string, we read it using a \say{window}, shown below. 
This window consists of +% a search buffer and a lookahead buffer, both of which have a fixed (but configurable) size. +% This window passes over the string one character at a time, inserting a pointer if it finds +% the lookahead buffer inside its search buffer, and a plain character otherwise. +% +% +% \begin{center} +% \begin{tikzpicture} +% % Text tape +% \node[color=gray] at (-0.75, 0) {\texttt{...}}; +% \node[color=gray] at (0.0, 0) {\texttt{D}}; +% \node at (0.5, 0) {\texttt{A}}; +% \node at (1.0, 0) {\texttt{B}}; +% \node at (1.5, 0) {\texttt{C}}; +% \node at (2.0, 0) {\texttt{D}}; +% \node at (2.5, 0) {\texttt{A}}; +% \node at (3.0, 0) {\texttt{B}}; +% \node at (3.5, 0) {\texttt{C}}; +% \node at (4.0, 0) {\texttt{D}}; +% \node[color=gray] at (4.5, 0) {\texttt{B}}; +% \node[color=gray] at (5.0, 0) {\texttt{D}}; +% \node[color=gray] at (5.5, 0) {\texttt{A}}; +% \node[color=gray] at (6.0, 0) {\texttt{C}}; +% \node[color=gray] at (6.75, 0) {\texttt{...}}; +% +% \draw (-1.75, 0.25) -- (7.25, 0.25); +% \draw (-1.75, -0.25) -- (7.25, -0.25); +% +% +% \draw[line width = 0.7mm, color=oblue, dotted] (2.25, 0.5) -- (2.25, -0.5); +% \draw[line width = 0.7mm, color=oblue] +% (-1.25, 0.5) +% -- (4.25, 0.5) +% -- (4.25, -0.5) +% -- (-1.25, -0.5) +% -- cycle +% ; +% +% \draw +% (4.2, -0.625) +% -- (4.2, -0.75) +% to node[anchor=north, midway] {lookahead} (2.3, -0.75) +% -- (2.3, -0.625) +% ; +% +% \draw +% (2.2, -0.625) +% -- (2.2, -0.75) +% to node[anchor=north, midway] {search buffer} (-1.1, -0.75) +% -- (-1.1, -0.625) +% ; +% +% \draw[color=gray] +% (2.2, 0.625) +% -- (2.2, 0.75) +% to node[anchor=south, midway] {match!} (0.3, 0.75) +% -- (0.3, 0.625) +% ; +% +% %\draw[->, color=gray] (2.5, 0.3) -- (2.5, 0.8) to[out=90,in=90] (0.5, 0.8); +% \node at (7.0, -0.75) {Result: \texttt{[$\cdot\cdot\cdot$DABCD<4,4>$\cdot\cdot\cdot$]}}; +% \end{tikzpicture} +% \end{center} +% +% This is not the exact process used in practice---but it's close enough. 
\par +% This process may be tweaked in any number of ways. +%\end{instructornote} +% +%\makeatletter\if@solutions +% \vfill +% \pagebreak +%\fi\makeatother \ No newline at end of file diff --git a/Advanced/Compression/parts/3 huffman.tex b/Advanced/Compression/parts/3 huffman.tex index 5767231..c264152 100644 --- a/Advanced/Compression/parts/3 huffman.tex +++ b/Advanced/Compression/parts/3 huffman.tex @@ -3,7 +3,7 @@ \example{} Now consider the alphabet $\{\texttt{A}, \texttt{B}, \texttt{C}, \texttt{D}, \texttt{E}\}$. \par -With a na\"ive coding scheme, we can encode a length-$n$ string with $3n$ bits, by mapping... +With a na\"ive coding scheme, we can encode a length $n$ string with $3n$ bits, by mapping... \begin{itemize} \item $\texttt{A}$ to $\texttt{000}$ \item $\texttt{B}$ to $\texttt{001}$ @@ -12,12 +12,12 @@ With a na\"ive coding scheme, we can encode a length-$n$ string with $3n$ bits, \item $\texttt{E}$ to $\texttt{100}$ \end{itemize} For example, this encodes \texttt{ADEBCE} as \texttt{[000 011 100 001 010 100]}. \par -To encoding strings over $\{\texttt{A}, \texttt{B}, \texttt{C}, \texttt{D}, \texttt{E}\}$ with this scheme, we +To encode strings over $\{\texttt{A}, \texttt{B}, \texttt{C}, \texttt{D}, \texttt{E}\}$ with this scheme, we need an average of three bits per symbol. \vspace{2mm} -One could argue that this coding scheme is wasteful: \par +However, one could argue that this coding scheme is wasteful: \par we're not using three of the eight possible three-bit sequences! \example{} @@ -86,9 +86,8 @@ Is this a good way to encode five-letter strings? \remark{} -The code from the previous page can be visualized as a tree which we traverse while decoding our sequence. -Starting from the topmost node, we take the left edge if we see a \texttt{0} and the right edge if we see a \texttt{1}. -Once we reach a letter, we return to the top node and repeat the process. 
+The code from the previous page can be visualized as a full binary tree: \par +\note{Every node in a \textit{full binary tree} has either zero or two children.} \vspace{-5mm} \null\hfill @@ -135,10 +134,19 @@ Once we reach a letter, we return to the top node and repeat the process. \end{center} \end{minipage} \hfill\null +You can think of each symbol's code as its \say{address} in this tree. +When decoding a string, we start at the topmost node. Reading the binary sequence +bit by bit, we move down the tree, taking a left edge if we see a \texttt{0} +and a right edge if we see a \texttt{1}. +Once we reach a letter, we return to the top node and repeat the process. +\definition{} +We say a coding scheme is \textit{prefix-free} if no whole code word is a prefix of another code word. \par +\problem{} +Convince yourself that trees like the one above always produce a prefix-free code. \problem{} Decode \texttt{[110111001001110110]} using the tree above. @@ -149,6 +157,18 @@ Decode \texttt{[110111001001110110]} using the tree above. \vfill +\problem{} +Encode \texttt{ABDECBE} using this tree. \par +How many bits do we save over a na\"ive scheme? + +\begin{solution} + This is \texttt{[00 01 110 111 10 01 111]}, and saves four bits. +\end{solution} + + +\vfill +\pagebreak + \problem{} In \ref{treedecode}, we needed 18 bits to encode \texttt{DEACBDD}. \par \note{Note that we'd need $3 \times 7 = 21$ bits to encode this string na\"ively.} @@ -236,13 +256,19 @@ Now, do the opposite: draw a tree that encodes \texttt{DEACBDD} \textit{less} ef \vfill \remark{} -We say a coding scheme is \textit{prefix-free} if no whole code word is a prefix of another code word. \par -As we've seen, it is fairly easy to construct a prefix-free variable-length code using a binary tree. \par +As we just saw, constructing a prefix-free code is fairly easy. \par Constucting the \textit{most efficient} prefix-free code for a given message is a bit more difficult. 
\par -We'll spend the rest of this section solving this problem. - \pagebreak + + + + + + + + + \remark{} Let's restate our problem. \par Given an alphabet $A$ and a frequency function $f$, we want to construct a binary tree $T$ that minimizes @@ -270,16 +296,13 @@ Where... \vspace{2mm} -Also, notice that $\mathcal{B}_f(T)$ is the \say{average bits per symbol} metric we saw in previous problems. +Also notice that $\mathcal{B}_f(T)$ is the \say{average bits per symbol} metric we saw in previous problems. \problem{} Let $f$ be a fixed frequency function over an alphabet $A$. \par Let $T$ be an arbitrary tree for $A$, and let $a, b$ be two symbols in $A$. \par - -\vspace{2mm} - -Now, construct $T'$ by swapping $a$ and $b$ in $T$. Show that \par +Construct $T'$ by swapping $a$ and $b$ in $T$. Show that \par \begin{equation*} \mathcal{B}_f(T) - \mathcal{B}_f(T') = \Bigl(f(b) - f(a)\Bigr) \times \Bigl(d_T(a) - d_T(b)\Bigr) \end{equation*} @@ -300,8 +323,8 @@ Now, construct $T'$ by swapping $a$ and $b$ in $T$. Show that \par \pagebreak \problem{} -Show that is an optimal tree in which the two symbols with the lowest frequencies have the same parent. -\hint{You may assume that an optimal tree exists. Check three nontrivial cases.} +Show that there is an optimal tree in which the two symbols with the lowest frequencies have the same parent. +\hint{You may assume that an optimal tree exists. There are a few cases.} \begin{solution} Let $T$ be an optimal tree, and let $a, b$ be the two symbols with the lowest frequency. \par @@ -356,7 +379,7 @@ Then, use the previous two problems to show that your algorithm indeed produces \vspace{2mm} In plain English: pick the two nodes with the smallest frequency, combine them, - and add that into the alphabet as a \say{compound symbol}. Repeat until you're done. + and replace them with a \say{compound symbol}. Repeat until you're done. \linehack{}