% TODO:
% Basic run-length
% LZ77

\section{Run-length Coding}

%\definition{}
%\textit{Entropy} is a measure of information in a certain sequence. \par
%A sequence with high entropy contains a lot of information, and a sequence with low entropy contains relatively little.
%For example, consider the following two ten-symbol ASCII\footnotemark{} strings:
%\begin{itemize}
% \item \texttt{AAAAAAAAAA}
% \item \texttt{pDa3:7?j;F}
%\end{itemize}
%The first string clearly contains less information than the second.
%It's much harder to describe \texttt{pDa3:7?j;F} than it is \texttt{AAAAAAAAAA}.
%Thus, we say that the first has low entropy, and the second has fairly high entropy.
%
%\vspace{2mm}
%
%The definition above is intentionally hand-wavy. \par
%Formal definitions of entropy exist, but we won't need them today---we just need
%an intuitive understanding of the \say{density} of information in a given string.

%
%\footnotetext{
% American Standard Code for Information Interchange, an early character encoding for computers. \par
% It contains 128 symbols, including numbers, letters, and
% \texttt{!"\#\$\%\&`()*+,-./:;<=>?@[\textbackslash]\^\_\{|\}\textasciitilde}
%}


%\vspace{5mm}

\problem{}<runlenone>
Using the na\"ive coding scheme, encode \texttt{AAAA$\cdot$AAAA$\cdot$BCD$\cdot$AAAA$\cdot$AAAA} in binary. \par
\note[Note]{
We're still using the four-symbol alphabet $\{\texttt{A}, \texttt{B}, \texttt{C}, \texttt{D}\}$. \par
Dots ($\cdot$) in the string are drawn for readability. Ignore them.
}

\begin{solution}
There are eight \texttt{A}s on each end of that string. Mapping symbols as before, \par
we get \texttt{[00 00 00 00 00 00 00 00 01 10 11 00 00 00 00 00 00 00 00]}

\begin{instructornote}
In this handout, all encoded binary is written in square brackets. \par
Spaces, dashes, dots, etc.\ are added for readability and should be ignored.
\end{instructornote}
\end{solution}


\vfill
In \ref{runlenone}---and often, in the real world---the strings we want to encode have fairly low \textit{entropy}. \par
That is, they have predictable patterns, sequences of symbols that don't contain a lot of information. \par
\note{
For example, consider the text in this document. \par
The symbols \texttt{e}, \texttt{t}, and \texttt{<space>} are much more common than any others. \par
Also, certain subsequences are repeated: \texttt{th}, \texttt{and}, \texttt{encode}, and so on.
}
We can exploit this fact to develop encoding schemes that need relatively few bits per letter.

\example{}
A simple example of such a coding scheme is \textit{run-length encoding}. Instead of simply listing letters of a string
in their binary form, we'll add a \textit{count} to each letter, shortening repeated instances of the same symbol.

\vspace{2mm}

We'll encode our string into a sequence of 6-bit blocks, interpreted as follows:

\begin{center}
\begin{tikzpicture}
\node[anchor=west,color=gray] at (-2.3, 0) {Bits};
\node[anchor=west,color=gray] at (-2.3, -0.5) {Meaning};
\draw[color=gray] (-2.3, -0.25) -- (5.5, -0.25);
\draw[color=gray] (-2.3, 0.15) -- (-2.3, -0.65);

\node at (0, 0) {\texttt{0}};
\node at (1, 0) {\texttt{0}};
\node at (2, 0) {\texttt{1}};
\node at (3, 0) {\texttt{1}};
\node at (4, 0) {\texttt{0}};
\node at (5, 0) {\texttt{1}};

\draw (-0.5, 0.25) -- (5.5, 0.25);
\draw (-0.5, -0.25) -- (5.5, -0.25);
\draw (-0.5, -0.75) -- (5.5, -0.75);

\draw (-0.5, 0.25) -- (-0.5, -0.75);
\draw (3.5, 0.25) -- (3.5, -0.75);
\draw (5.5, 0.25) -- (5.5, -0.75);

\node at (1.5, -0.5) {number of copies};
\node at (4.5, -0.5) {symbol};
\end{tikzpicture}
\end{center}
So, the sequence \texttt{BBB} will be encoded as \texttt{[0011-01]}. \par
\note[Notation]{
Just like dots, dashes and spaces are added for readability. Pretend they don't exist. \par
Encoded binary sequences will always be written in square brackets: \texttt{[]}.
}
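
The block layout above can be sketched in Python. This is only an illustration, not part of the scheme itself: the names \texttt{rle\_encode} and \texttt{rle\_decode} are made up here, and splitting runs longer than 15 across several blocks is an assumption, since a four-bit count cannot hold anything larger.

```python
from itertools import groupby

# Two-bit symbol codes, matching the naive scheme used in this handout.
CODES = {"A": "00", "B": "01", "C": "10", "D": "11"}
SYMBOLS = {bits: sym for sym, bits in CODES.items()}
MAX_RUN = 15  # largest count that fits in four bits


def rle_encode(message):
    """Encode a string over {A, B, C, D} as 6-bit blocks: 4-bit count, 2-bit symbol."""
    blocks = []
    for symbol, run in groupby(message):
        remaining = len(list(run))
        # Assumption: a run longer than 15 is split across several blocks.
        while remaining > 0:
            count = min(remaining, MAX_RUN)
            blocks.append(format(count, "04b") + CODES[symbol])
            remaining -= count
    return "".join(blocks)


def rle_decode(bits):
    """Invert rle_encode: read six bits at a time and expand each block."""
    out = []
    for i in range(0, len(bits), 6):
        count = int(bits[i:i + 4], 2)
        out.append(SYMBOLS[bits[i + 4:i + 6]] * count)
    return "".join(out)


print(rle_encode("BBB"))  # 001101, i.e. [0011-01]
```

Both functions work on strings of \texttt{0} and \texttt{1} characters without separators, so the dashes and spaces used for readability must be removed before decoding.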

\problem{}
Decode \texttt{[010000001111]} using this scheme.

\begin{solution}
\texttt{AAAADDD}
\end{solution}
\vfill

\problem{}
Encode \texttt{AAAA$\cdot$AAAA$\cdot$BCD$\cdot$AAAA$\cdot$AAAA} using this scheme. \par
Is this more or less efficient than \ref{runlenone}?

\begin{solution}
\texttt{[1000-00 0001-01 0001-10 0001-11 1000-00]} \par
This requires 30 bits, as compared to 38 in \ref{runlenone}.
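\par
For reference, both counts follow directly from the string: the na\"ive encoding spends 2 bits on each of the 19 symbols, while the run-length encoding spends 6 bits on each of the 5 runs.
\[
	19 \times 2 = 38
	\qquad
	5 \times 6 = 30
\]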
\end{solution}

\vfill
\pagebreak



\problem{}
Give an example of a message on $\{\texttt{A}, \texttt{B}, \texttt{C}, \texttt{D}\}$
that uses $n$ bits when encoded with a na\"ive scheme, and \textit{fewer} than $\nicefrac{n}{2}$ bits
when encoded using the scheme described on the previous page.


\vfill

\problem{}
Give an example of a message on $\{\texttt{A}, \texttt{B}, \texttt{C}, \texttt{D}\}$
that uses $n$ bits when encoded with a na\"ive scheme, and \textit{more} than $2n$ bits
when encoded using the scheme described on the previous page.


\vfill


\problem{}
Is run-length coding always more efficient than na\"ive coding? \par
When does it work well, and when does it fail?

\vfill


\problem{}
Our coding scheme wastes a lot of space when our string has few runs of the same symbol. \par
Fix this problem: modify the scheme so that single occurrences of symbols do not waste space. \par
\hint{We don't need a run length for every symbol. We only need one for \textit{repeated} symbols.}

\begin{solution}
One idea is as follows: \par
\begin{itemize}
\item Encode single symbols na\"ively: \texttt{ABCD} becomes \texttt{[00 01 10 11]}
\item Signal runs using two copies of the same symbol: \texttt{AAAAAA} becomes \texttt{[00 00 0110]}. \par
When our decoder sees two copies of the same symbol, it will interpret the next four bits as
a run length.
\end{itemize}
\texttt{BDC$\cdot$DDDDD$\cdot$AADBDC} will be encoded as \texttt{[01 11 10 11-11-0101 00-00-0010 11 01 11 10]}.
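
\par
The idea above could be implemented roughly as follows. This is a sketch under stated assumptions, not the only possible fix: the names \texttt{encode\_doubled} and \texttt{decode\_doubled} are hypothetical, and runs longer than 15 are assumed to be split into shorter runs so that each length fits in four bits.

```python
from itertools import groupby

CODES = {"A": "00", "B": "01", "C": "10", "D": "11"}
SYMBOLS = {bits: sym for sym, bits in CODES.items()}
MAX_RUN = 15  # largest run length that fits in four bits


def encode_doubled(message):
    """Lone symbols cost 2 bits; a run is the symbol twice plus a 4-bit length."""
    out = []
    for symbol, run in groupby(message):
        remaining = len(list(run))
        # Assumption: runs longer than 15 are split into several shorter runs.
        while remaining > 0:
            count = min(remaining, MAX_RUN)
            if count == 1:
                out.append(CODES[symbol])
            else:
                out.append(CODES[symbol] * 2 + format(count, "04b"))
            remaining -= count
    return "".join(out)


def decode_doubled(bits):
    """Invert encode_doubled by watching for two equal symbols in a row."""
    out = []
    i = 0
    while i < len(bits):
        symbol = bits[i:i + 2]
        if bits[i + 2:i + 4] == symbol:
            # Doubled symbol: the next four bits hold the run length.
            count = int(bits[i + 4:i + 8], 2)
            out.append(SYMBOLS[symbol] * count)
            i += 8
        else:
            out.append(SYMBOLS[symbol])
            i += 2
    return "".join(out)
```

For instance, \texttt{encode\_doubled("AAAAAA")} returns \texttt{00000110}, which is \texttt{[00 00 0110]} without the spaces. Note that a run of exactly two symbols costs 8 bits here, twice its na\"ive cost.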
\end{solution}

\vfill

\problem{}<firstlz>
Consider the following string: \texttt{ABCD$\cdot$ABCD$\cdot$BABABA$\cdot$ABCD$\cdot$ABCD}. \par
\begin{itemize}
\item How many bits do we need to encode this na\"ively? \par
\item How about with the (unmodified) run-length scheme described on the previous page?
\end{itemize}
\hint{You don't need to encode this string---just find the length of its encoded form.}

\begin{solution}
Na\"ively: \tab $2 \times 22 = 44$ bits \par
Run-length: \tab $6 \times 21 = 126$ bits. Watch out for the two adjacent \texttt{A}s where \texttt{BABABA} meets \texttt{ABCD}: they form a single run.
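
\par
Both lengths can be checked without building the encoded strings. The sketch below only counts symbols and runs. The function names are hypothetical, and it assumes that no run is longer than 15, which holds for this string.

```python
from itertools import groupby


def naive_bits(message):
    """Two bits per symbol of the four-symbol alphabet."""
    return 2 * len(message)


def run_length_bits(message):
    """One 6-bit block per run, assuming no run is longer than 15."""
    runs = sum(1 for _symbol, _run in groupby(message))
    return 6 * runs


# The string from this problem, with the readability dots removed.
message = "ABCD" + "ABCD" + "BABABA" + "ABCD" + "ABCD"
print(naive_bits(message), run_length_bits(message))
```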
\end{solution}


\vfill

Neither solution to \ref{firstlz} is ideal. Run-length is very wasteful due to the lack of runs, and na\"ive coding
does not take advantage of repetition in the string. We'll need a better coding scheme.
\pagebreak