Mark 2024-04-23 17:33:58 -07:00
parent d8698b4c81
commit 8269bf1135
4 changed files with 200 additions and 157 deletions

@ -9,7 +9,7 @@ A \textit{string} is a sequence of symbols from an alphabet. \par
For example, \texttt{CBCAADDD} is a string over the alphabet $\{\texttt{A}, \texttt{B}, \texttt{C}, \texttt{D}\}$.
Say we want to store a length-$n$ string over the alphabet $\{\texttt{A}, \texttt{B}, \texttt{C}, \texttt{D}\}$ as a binary blob. \par
Say we want to store a length-$n$ string over the alphabet $\{\texttt{A}, \texttt{B}, \texttt{C}, \texttt{D}\}$ as a binary sequence. \par
How many bits will we need? \par
Our alphabet has four symbols, so we can encode each symbol using two bits, \par
@ -32,6 +32,6 @@ using $n \times \lceil \log_2k \rceil$ bits. Convince yourself that this is true
Of course, this isn't ideal---we can do much better than $n \times \lceil \log_2k \rceil$.
As you might expect, this isn't ideal: we can do much better than $n \times \lceil \log_2k \rceil$.
We will spend the rest of this handout exploring more efficient ways of encoding such sequences of symbols.
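The $n \times \lceil \log_2k \rceil$ bound above can be checked with a short sketch. This is only an illustration: the function name \texttt{naive\_encode} is hypothetical, and assigning codes in alphabet order (\texttt{A} to \texttt{00}, \texttt{B} to \texttt{01}, and so on) is an assumption.

```python
def naive_encode(string, alphabet):
    """Encode `string` with a fixed-width binary code for each symbol."""
    k = len(alphabet)
    # (k - 1).bit_length() equals ceil(log2(k)) for k >= 2, without floats.
    width = max(1, (k - 1).bit_length())
    # Assumed mapping: codes are handed out in alphabet order.
    codes = {s: format(i, f"0{width}b") for i, s in enumerate(alphabet)}
    return "".join(codes[s] for s in string)


bits = naive_encode("CBCAADDD", "ABCD")
print(bits, len(bits))  # 1001100000111111 16
```

An eight-symbol string over a four-symbol alphabet takes $8 \times 2 = 16$ bits, as expected.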

@ -5,37 +5,37 @@
\section{Run-length Coding}
\textit{Entropy} is a measure of information in a certain sequence. \par
A sequence with high entropy contains a lot of information, and a sequence with low entropy contains relatively little.
For example, consider the following two ten-symbol ASCII\footnotemark{} strings:
\item \texttt{AAAAAAAAAA}
\item \texttt{pDa3:7?j;F}
The first string clearly contains less information than the second.
It's much harder to describe \texttt{pDa3:7?j;F} than it is \texttt{AAAAAAAAAA}.
Thus, we say that the first has low entropy, and the second has fairly high entropy.
%\textit{Entropy} is a measure of information in a certain sequence. \par
%A sequence with high entropy contains a lot of information, and a sequence with low entropy contains relatively little.
%For example, consider the following two ten-symbol ASCII\footnotemark{} strings:
% \item \texttt{AAAAAAAAAA}
% \item \texttt{pDa3:7?j;F}
%The first string clearly contains less information than the second.
%It's much harder to describe \texttt{pDa3:7?j;F} than it is \texttt{AAAAAAAAAA}.
%Thus, we say that the first has low entropy, and the second has fairly high entropy.
%The definition above is intentionally hand-wavy. \par
%Formal definitions of entropy exist, but we won't need them today---we just need
%an intuitive understanding of the \say{density} of information in a given string.
The definition above is intentionally hand-wavy. \par
Formal definitions of entropy exist, but we won't need them today---we just need
an intuitive understanding of the \say{density} of information in a given string.
% American Standard Code for Information Interchange, an early character encoding for computers. \par
% It contains 128 symbols, including numbers, letters, and
% \texttt{!"\#\$\%\&`()*+,-./:;<=>?@[\textbackslash]\^\_\{|\}\textasciitilde}
American Standard Code for Information Interchange, an early character encoding for computers. \par
It contains 128 symbols, including numbers, letters, and
Using a na\"ive coding scheme, encode \texttt{AAAA$\cdot$AAAA$\cdot$BCD$\cdot$AAAA$\cdot$AAAA} as binary blob. \par
Using a na\"ive coding scheme, encode \texttt{AAAA$\cdot$AAAA$\cdot$BCD$\cdot$AAAA$\cdot$AAAA} in binary. \par
We're still using the four-symbol alphabet $\{\texttt{A}, \texttt{B}, \texttt{C}, \texttt{D}\}$. \par
Dots ($\cdot$) in the string are drawn for readability. Ignore them.
@ -48,12 +48,13 @@ Using a na\"ive coding scheme, encode \texttt{AAAA$\cdot$AAAA$\cdot$BCD$\cdot$AA
In \ref{runlenone}---and often, in the real world---the strings we want to encode have fairly low entropy.
We can leverage this fact to develop efficient encoding schemes.
In \ref{runlenone}---and often, in the real world---the strings we want to encode have fairly low \textit{entropy}. \par
They have predictable patterns, sequences of symbols that don't contain a lot of information. \par
We can exploit this fact to develop efficient encoding schemes.
The simplest such coding scheme is \textit{run-length encoding}. Instead of simply listing letters of a string
in their binary form, we'll add a \textit{count} to each letter, compressing repeated sequences of the same symbol.
A simple example of such a coding scheme is \textit{run-length encoding}. Instead of simply listing letters of a string
in their binary form, we'll add a \textit{count} to each letter, shortening repeated instances of the same symbol.
@ -86,16 +87,10 @@ We'll encode our string into a sequence of 6-bit blocks, interpreted as follows:
So, the sequence \texttt{BBB} will be encoded as \texttt{[0011-01]}. \par
\note[Notation]{Just like spaces, dashes in a binary blob are added for readability.}
In this handout, encoded binary blobs will always be written in square brackets. \par
Ignore spaces and dashes, they are provided for convenience. \par
For example, the binary sequences \texttt{[000 011 100 001 010 100]} and \texttt{[000011100001010100]} \par
are identical. The first, however, is easier to read.
Just like dots, dashes and spaces are added for readability. \par
Encoded binary sequences will always be written in square brackets (\texttt{[]}).
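The run-length scheme can be sketched in a few lines. This sketch assumes each 6-bit block holds a four-bit count followed by a two-bit symbol, which matches the \texttt{BBB} example above. The symbol codes for \texttt{A}, \texttt{C}, and \texttt{D}, and the choice to split runs longer than 15, are assumptions made for illustration.

```python
from itertools import groupby

# Assumed two-bit symbol codes; only B -> 01 is confirmed by the example.
SYMBOLS = {"A": "00", "B": "01", "C": "10", "D": "11"}


def rle_encode(string, max_run=15):
    """Encode `string` as 6-bit blocks: a 4-bit count, then a 2-bit symbol."""
    blocks = []
    for symbol, group in groupby(string):
        run = len(list(group))
        # A four-bit count holds at most 15, so longer runs are split.
        while run > 0:
            count = min(run, max_run)
            blocks.append(format(count, "04b") + "-" + SYMBOLS[symbol])
            run -= count
    return "[" + " ".join(blocks) + "]"


print(rle_encode("BBB"))       # [0011-01]
print(rle_encode("A" * 20))    # [1111-00 0101-00]
```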
Encode \texttt{AAAA$\cdot$AAAA$\cdot$BCD$\cdot$AAAA$\cdot$AAAA} using this scheme. \par
@ -107,6 +102,15 @@ Is this more or less efficient than \ref{runlenone}?
@ -137,7 +141,7 @@ Fix this problem: modify the scheme so that single occurrences of symbols do not
Consider the following string: \texttt{ABCD$\cdot$ABCD$\cdot$BABABA$\cdot$ABCD$\cdot$ABCD}. \par
\item How many bits do we need to encode this na\"ively? \par
\item How about with the (unmodified) run-length scheme described above?
\item How about with the (unmodified) run-length scheme described on the previous page?
\hint{You don't need to encode this string---just find the length of its encoded form.}

@ -1,6 +1,6 @@
\section{LZ Codes}
The LZ-family\footnotemark{} of codes (LZ77, LZ78, LZSS, LZMA, and others) take advantage of repeated sequences of symbols
The LZ-family\footnotemark{} of codes (LZ77, LZ78, LZSS, LZMA, and others) take advantage of repeated subsequences
in a string. They are the basis of most modern compression algorithms, including DEFLATE, which is used in the ZIP, PNG,
and GZIP formats.
@ -21,10 +21,10 @@ Pointers take the form \texttt{<pos, len>}, where \texttt{pos} is the position o
For example, we can encode the string \texttt{ABRACADABRA} as \texttt{[ABRACAD<7, 4>]}. \par
The pointer \texttt{<7, 4>} tells us to look back 7 positions (to the first \texttt{A}), and copy the next 4 symbols. \par
Note that pointers refer to the partially decoded output---\textit{not} to the encoded string. \par
This allows pointers to reference other pointers, and ensures codes like \texttt{A<1,9>} are valid.
This allows pointers to reference other pointers, and ensures that codes like \texttt{A<1,9>} are valid.
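A decoder for this pointer notation might look like the sketch below. The token representation is an assumption: plain symbols are one-character strings and a pointer \texttt{<pos, len>} is a tuple \texttt{(pos, len)}. The name \texttt{lz\_decode} is hypothetical.

```python
def lz_decode(tokens):
    """Decode a list of symbols and (pos, len) pointers into a string."""
    out = []
    for token in tokens:
        if isinstance(token, tuple):
            pos, length = token
            start = len(out) - pos  # look back `pos` positions in the output
            # Copy one symbol at a time, so a pointer may read symbols
            # that it has itself just written (as in A<1,9>).
            for i in range(length):
                out.append(out[start + i])
        else:
            out.append(token)
    return "".join(out)


print(lz_decode(list("ABRACAD") + [(7, 4)]))  # ABRACADABRA
print(lz_decode(["A", (1, 9)]))               # AAAAAAAAAA
```

Copying symbol by symbol is what makes \texttt{A<1,9>} valid: each copied \texttt{A} becomes available to the next step of the same pointer.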
Encode \texttt{ABCD$\cdot$ABCD$\cdot$BABABA$\cdot$ABCD$\cdot$ABCD} using LZ.
Encode \texttt{ABCD$\cdot$ABCD$\cdot$BABABA$\cdot$ABCD$\cdot$ABCD} using this scheme. \par
Then, decode the following:
\item \texttt{[ABCD<4,4>]}
@ -39,7 +39,7 @@ Then, decode the following:
In parts two and three, remember that we're reading the \textit{output string.} \par
The nine \texttt{A}s in part two are produced one by one, \par
The ten \texttt{A}s in part two are produced one by one, \par
with the decoder's \say{read head} following its \say{write head.}
@ -58,98 +58,114 @@ Convince yourself that LZ is a generalization of the run-length code we discusse
Note that we left a few things out of this section: we didn't discuss the algorithm that converts a string to an LZ-encoded blob,
nor did we discuss how we should represent strings encoded with LZ in binary. We skipped these details because they are
problems of implementation---they're the engineer's headache, not the mathematician's. If you're interested, a brief explanation is below.
Ask an instructor to explain.
problems of implementation---they're the engineer's headache, not the mathematician's. \par
\node[anchor=west,color=gray] at (-2.3, 0) {Bits};
\node[anchor=west,color=gray] at (-2.3, -0.5) {Meaning};
\draw[color=gray] (-2.3, -0.25) -- (5.5, -0.25);
\draw[color=gray] (-2.3, 0.15) -- (-2.3, -0.65);
\node at (0, 0) {\texttt{0}};
\node at (1, 0) {\texttt{0}};
\node at (2, 0) {\texttt{1}};
\node at (3, 0) {\texttt{0}};
\node at (4, 0) {\texttt{1}};
\node at (5, 0) {\texttt{1}};
\node at (6, 0) {\texttt{0}};
\node at (7, 0) {\texttt{0}};
\node at (8, 0) {\texttt{1}};
\draw (-0.5, 0.25) -- (8.5, 0.25);
\draw (-0.5, -0.25) -- (8.5, -0.25);
\draw (-0.5, -0.75) -- (8.5, -0.75);
\draw (-0.5, 0.25) -- (-0.5, -0.75);
\draw (0.5, 0.25) -- (0.5, -0.75);
\draw (8.5, 0.25) -- (8.5, -0.75);
\node at (0, -0.5) {flag};
\node at (4.5, -0.5) {if flag \texttt{<pos, len>}, else eight-bit symbol};
% Text tape
\node[color=gray] at (-0.75, 0) {\texttt{...}};
\node[color=gray] at (0.0, 0) {\texttt{D}};
\node at (0.5, 0) {\texttt{A}};
\node at (1.0, 0) {\texttt{B}};
\node at (1.5, 0) {\texttt{C}};
\node at (2.0, 0) {\texttt{D}};
\node at (2.5, 0) {\texttt{A}};
\node at (3.0, 0) {\texttt{B}};
\node at (3.5, 0) {\texttt{C}};
\node at (4.0, 0) {\texttt{D}};
\node[color=gray] at (4.5, 0) {\texttt{B}};
\node[color=gray] at (5.0, 0) {\texttt{D}};
\node[color=gray] at (5.5, 0) {\texttt{A}};
\node[color=gray] at (6.0, 0) {\texttt{C}};
\node[color=gray] at (6.75, 0) {\texttt{...}};
\draw (-1.75, 0.25) -- (7.25, 0.25);
\draw (-1.75, -0.25) -- (7.25, -0.25);
\draw[line width = 0.7mm, color=oblue, dotted] (2.25, 0.5) -- (2.25, -0.5);
\draw[line width = 0.7mm, color=oblue]
(-1.25, 0.5)
-- (4.25, 0.5)
-- (4.25, -0.5)
-- (-1.25, -0.5)
-- cycle
(4.2, -0.625)
-- (4.2, -0.75)
to node[anchor=north, midway] {lookahead} (2.3, -0.75)
-- (2.3, -0.625)
(2.2, -0.625)
-- (2.2, -0.75)
to node[anchor=north, midway] {search buffer} (-1.1, -0.75)
-- (-1.1, -0.625)
(2.2, 0.625)
-- (2.2, 0.75)
to node[anchor=south, midway] {match!} (0.3, 0.75)
-- (0.3, 0.625)
%\draw[->, color=gray] (2.5, 0.3) -- (2.5, 0.8) to[out=90,in=90] (0.5, 0.8);
\node at (7.0, -0.75) {Result: \texttt{[$\cdot\cdot\cdot$DABCD<4,4>$\cdot\cdot\cdot$]}};
% A simple LZ-scheme can work as follows. We encode our string into a sequence of
% nine-bit blocks, drawn below. The first bit of each block tells us whether or not
% this block is a pointer, and the next eight bits contain either a \texttt{pos, len} pair
% (using, say, four bits for each number) or a plain eight-bit symbol code.
% \begin{center}
% \begin{tikzpicture}
% \node[anchor=west,color=gray] at (-2.3, 0) {Bits};
% \node[anchor=west,color=gray] at (-2.3, -0.5) {Meaning};
% \draw[color=gray] (-2.3, -0.25) -- (5.5, -0.25);
% \draw[color=gray] (-2.3, 0.15) -- (-2.3, -0.65);
% \node at (0, 0) {\texttt{0}};
% \node at (1, 0) {\texttt{0}};
% \node at (2, 0) {\texttt{1}};
% \node at (3, 0) {\texttt{0}};
% \node at (4, 0) {\texttt{1}};
% \node at (5, 0) {\texttt{1}};
% \node at (6, 0) {\texttt{0}};
% \node at (7, 0) {\texttt{0}};
% \node at (8, 0) {\texttt{1}};
% \draw (-0.5, 0.25) -- (8.5, 0.25);
% \draw (-0.5, -0.25) -- (8.5, -0.25);
% \draw (-0.5, -0.75) -- (8.5, -0.75);
% \draw (-0.5, 0.25) -- (-0.5, -0.75);
% \draw (0.5, 0.25) -- (0.5, -0.75);
% \draw (8.5, 0.25) -- (8.5, -0.75);
% \node at (0, -0.5) {flag};
% \node at (4.5, -0.5) {if flag \texttt{<pos, len>}, else eight-bit symbol};
% \end{tikzpicture}
% \end{center}
% To encode a string, we read it using a \say{window}, shown below. This window consists of
% a search buffer and a lookahead buffer, both of which have a fixed (but configurable) size.
% This window passes over the string one character at a time, inserting a pointer if it finds
% the lookahead buffer inside its search buffer, and a plain character otherwise.
% \begin{center}
% \begin{tikzpicture}
% % Text tape
% \node[color=gray] at (-0.75, 0) {\texttt{...}};
% \node[color=gray] at (0.0, 0) {\texttt{D}};
% \node at (0.5, 0) {\texttt{A}};
% \node at (1.0, 0) {\texttt{B}};
% \node at (1.5, 0) {\texttt{C}};
% \node at (2.0, 0) {\texttt{D}};
% \node at (2.5, 0) {\texttt{A}};
% \node at (3.0, 0) {\texttt{B}};
% \node at (3.5, 0) {\texttt{C}};
% \node at (4.0, 0) {\texttt{D}};
% \node[color=gray] at (4.5, 0) {\texttt{B}};
% \node[color=gray] at (5.0, 0) {\texttt{D}};
% \node[color=gray] at (5.5, 0) {\texttt{A}};
% \node[color=gray] at (6.0, 0) {\texttt{C}};
% \node[color=gray] at (6.75, 0) {\texttt{...}};
% \draw (-1.75, 0.25) -- (7.25, 0.25);
% \draw (-1.75, -0.25) -- (7.25, -0.25);
% \draw[line width = 0.7mm, color=oblue, dotted] (2.25, 0.5) -- (2.25, -0.5);
% \draw[line width = 0.7mm, color=oblue]
% (-1.25, 0.5)
% -- (4.25, 0.5)
% -- (4.25, -0.5)
% -- (-1.25, -0.5)
% -- cycle
% ;
% \draw
% (4.2, -0.625)
% -- (4.2, -0.75)
% to node[anchor=north, midway] {lookahead} (2.3, -0.75)
% -- (2.3, -0.625)
% ;
% \draw
% (2.2, -0.625)
% -- (2.2, -0.75)
% to node[anchor=north, midway] {search buffer} (-1.1, -0.75)
% -- (-1.1, -0.625)
% ;
% \draw[color=gray]
% (2.2, 0.625)
% -- (2.2, 0.75)
% to node[anchor=south, midway] {match!} (0.3, 0.75)
% -- (0.3, 0.625)
% ;
% %\draw[->, color=gray] (2.5, 0.3) -- (2.5, 0.8) to[out=90,in=90] (0.5, 0.8);
% \node at (7.0, -0.75) {Result: \texttt{[$\cdot\cdot\cdot$DABCD<4,4>$\cdot\cdot\cdot$]}};
% \end{tikzpicture}
% \end{center}
% This is not the exact process used in practice---but it's close enough. \par
% This process may be tweaked in any number of ways.
% \vfill
% \pagebreak

@ -3,7 +3,7 @@
Now consider the alphabet $\{\texttt{A}, \texttt{B}, \texttt{C}, \texttt{D}, \texttt{E}\}$. \par
With a na\"ive coding scheme, we can encode a length-$n$ string with $3n$ bits, by mapping...
With a na\"ive coding scheme, we can encode a length $n$ string with $3n$ bits, by mapping...
\item $\texttt{A}$ to $\texttt{000}$
\item $\texttt{B}$ to $\texttt{001}$
@ -12,12 +12,12 @@ With a na\"ive coding scheme, we can encode a length-$n$ string with $3n$ bits,
\item $\texttt{E}$ to $\texttt{100}$
For example, this encodes \texttt{ADEBCE} as \texttt{[000 011 100 001 010 100]}. \par
To encoding strings over $\{\texttt{A}, \texttt{B}, \texttt{C}, \texttt{D}, \texttt{E}\}$ with this scheme, we
To encode strings over $\{\texttt{A}, \texttt{B}, \texttt{C}, \texttt{D}, \texttt{E}\}$ with this scheme, we
need an average of three bits per symbol.
One could argue that this coding scheme is wasteful: \par
However, one could argue that this coding scheme is wasteful: \par
we're not using three of the eight possible three-bit sequences!
@ -86,9 +86,8 @@ Is this a good way to encode five-letter strings?
The code from the previous page can be visualized as a tree which we traverse while decoding our sequence.
Starting from the topmost node, we take the left edge if we see a \texttt{0} and the right edge if we see a \texttt{1}.
Once we reach a letter, we return to the top node and repeat the process.
The code from the previous page can be visualized as a full binary tree: \par
\note{Every node in a \textit{full binary tree} has either zero or two children.}
@ -135,10 +134,19 @@ Once we reach a letter, we return to the top node and repeat the process.
You can think of each symbol's code as its \say{address} in this tree.
When decoding a string, we start at the topmost node. Reading the binary sequence
bit by bit, we move down the tree, taking a left edge if we see a \texttt{0}
and a right edge if we see a \texttt{1}.
Once we reach a letter, we return to the top node and repeat the process.
We say a coding scheme is \textit{prefix-free} if no whole code word is a prefix of another code word. \par
Convince yourself that trees like the one above always produce a prefix-free code.
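The decoding walk and the prefix-free property can both be sketched in code. The tree itself is not reproduced here, so the code words below are an assumption, inferred from the worked encoding of \texttt{ABDECBE} elsewhere in this handout. The function names are hypothetical.

```python
# Assumed code words, inferred from the worked example for ABDECBE.
CODES = {"A": "00", "B": "01", "C": "10", "D": "110", "E": "111"}


def is_prefix_free(codes):
    """Return True if no code word is a prefix of a different code word."""
    words = list(codes.values())
    return not any(
        i != j and words[j].startswith(words[i])
        for i in range(len(words))
        for j in range(len(words))
    )


def tree_decode(bits, codes):
    """Decode a bit string by growing a path until it matches a code word."""
    lookup = {word: symbol for symbol, word in codes.items()}
    out, path = [], ""
    for bit in bits.replace(" ", ""):
        path += bit             # move one edge down the tree
        if path in lookup:      # reached a leaf
            out.append(lookup[path])
            path = ""           # return to the top node
    if path:
        raise ValueError("input ends in the middle of a code word")
    return "".join(out)


print(is_prefix_free(CODES))                            # True
print(tree_decode("00 01 110 111 10 01 111", CODES))    # ABDECBE
```

Because the code is prefix-free, the decoder never needs to look ahead: the first code word that matches is the only one that can match.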
Decode \texttt{[110111001001110110]} using the tree above.
@ -149,6 +157,18 @@ Decode \texttt{[110111001001110110]} using the tree above.
Encode \texttt{ABDECBE} using this tree. \par
How many bits do we save over a na\"ive scheme?
This is \texttt{[00 01 110 111 10 01 111]}, and saves four bits.
In \ref{treedecode}, we needed 18 bits to encode \texttt{DEACBDD}. \par
\note{Note that we'd need $3 \times 7 = 21$ bits to encode this string na\"ively.}
@ -236,13 +256,19 @@ Now, do the opposite: draw a tree that encodes \texttt{DEACBDD} \textit{less} ef
We say a coding scheme is \textit{prefix-free} if no whole code word is a prefix of another code word. \par
As we've seen, it is fairly easy to construct a prefix-free variable-length code using a binary tree. \par
As we just saw, constructing a prefix-free code is fairly easy. \par
Constructing the \textit{most efficient} prefix-free code for a given message is a bit more difficult. \par
We'll spend the rest of this section solving this problem.
Let's restate our problem. \par
Given an alphabet $A$ and a frequency function $f$, we want to construct a binary tree $T$ that minimizes
@ -270,16 +296,13 @@ Where...
Also, notice that $\mathcal{B}_f(T)$ is the \say{average bits per symbol} metric we saw in previous problems.
Also notice that $\mathcal{B}_f(T)$ is the \say{average bits per symbol} metric we saw in previous problems.
Let $f$ be a fixed frequency function over an alphabet $A$. \par
Let $T$ be an arbitrary tree for $A$, and let $a, b$ be two symbols in $A$. \par
Now, construct $T'$ by swapping $a$ and $b$ in $T$. Show that \par
Construct $T'$ by swapping $a$ and $b$ in $T$. Show that \par
\mathcal{B}_f(T) - \mathcal{B}_f(T') = \Bigl(f(a) - f(b)\Bigr) \times \Bigl(d_T(a) - d_T(b)\Bigr)
@ -300,8 +323,8 @@ Now, construct $T'$ by swapping $a$ and $b$ in $T$. Show that \par
Show that is an optimal tree in which the two symbols with the lowest frequencies have the same parent.
\hint{You may assume that an optimal tree exists. Check three nontrivial cases.}
Show that there is an optimal tree in which the two symbols with the lowest frequencies have the same parent.
\hint{You may assume that an optimal tree exists. There are a few cases.}
Let $T$ be an optimal tree, and let $a, b$ be the two symbols with the lowest frequency. \par
@ -356,7 +379,7 @@ Then, use the previous two problems to show that your algorithm indeed produces
In plain English: pick the two nodes with the smallest frequency, combine them,
and add that into the alphabet as a \say{compound symbol}. Repeat until you're done.
and replace them with a \say{compound symbol}. Repeat until you're done.
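The greedy procedure described above is commonly known as Huffman coding. A minimal sketch follows, assuming frequencies are given as a dictionary of symbol counts. The function name \texttt{huffman\_codes} and the choice to label left edges \texttt{0} and right edges \texttt{1} are assumptions. Ties may be broken differently in other implementations, which changes individual code words but not the total length.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_codes(frequencies):
    """Build a prefix-free code by repeatedly merging the two rarest nodes."""
    if len(frequencies) == 1:
        # A one-symbol alphabet still needs a nonempty code word.
        return {s: "0" for s in frequencies}
    tiebreak = count()  # unique counter, so the dicts are never compared
    heap = [(f, next(tiebreak), {s: ""}) for s, f in frequencies.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Pick the two nodes with the smallest frequency...
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        # ...and replace them with a compound symbol, one level deeper.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]


freq = Counter("DEACBDD")
codes = huffman_codes(freq)
total = sum(freq[s] * len(codes[s]) for s in freq)
print(total)  # 15
```

For \texttt{DEACBDD}, this uses 15 bits in total, compared with 21 bits for the na\"ive three-bit scheme.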