% TODO:
% Basic run-length
% LZ77

\section{Run-length Coding}

%\definition{}
%\textit{Entropy} is a measure of information in a certain sequence. \par
%A sequence with high entropy contains a lot of information, and a sequence with low entropy contains relatively little.
%For example, consider the following two ten-symbol ASCII\footnotemark{} strings:
%\begin{itemize}
%	\item \texttt{AAAAAAAAAA}
%	\item \texttt{pDa3:7?j;F}
%\end{itemize}
%The first string clearly contains less information than the second.
%It's much harder to describe \texttt{pDa3:7?j;F} than it is \texttt{AAAAAAAAAA}.
%Thus, we say that the first has low entropy, and the second has fairly high entropy.
%
%\vspace{2mm}
%
%The definition above is intentionally hand-wavy. \par
%Formal definitions of entropy exist, but we won't need them today; we just need
%an intuitive understanding of the \say{density} of information in a given string.
%
%\footnotetext{
%	American Standard Code for Information Interchange, an early character encoding for computers. \par
%	It contains 128 symbols, including numbers, letters, and
%	\texttt{!"\#\$\%\&`()*+,-./:;<=>?@[\textbackslash]\^\_\{|\}\textasciitilde}
%}

%\vspace{5mm}

\problem{}
Using a na\"ive coding scheme, encode \texttt{AAAA$\cdot$AAAA$\cdot$BCD$\cdot$AAAA$\cdot$AAAA} in binary. \par
\note[Note]{
	We're still using the four-symbol alphabet $\{\texttt{A}, \texttt{B}, \texttt{C}, \texttt{D}\}$. \par
	Dots ($\cdot$) in the string are drawn for readability. Ignore them.
}

\begin{solution}
	There are eight \texttt{A}s on each end of that string. Mapping symbols as before, \par
	we get \texttt{[00 00 00 00 00 00 00 00 01 10 11 00 00 00 00 00 00 00 00]}
\end{solution}

\vfill

In \ref{runlenone}, and often in the real world, the strings we want to encode have fairly low \textit{entropy}. \par
They have predictable patterns, sequences of symbols that don't contain a lot of information. \par
We can exploit this fact to develop efficient encoding schemes.

\example{}
A simple example of such a coding scheme is \textit{run-length encoding}. Instead of simply listing the symbols of a string
in their binary form, we'll attach a \textit{count} to each symbol, so that a run of repeated symbols is written only once.

\vspace{2mm}

We'll encode our string into a sequence of 6-bit blocks. In each block, the first four bits give the number of copies
(as a binary number) and the last two bits give the symbol:

\begin{center}
\begin{tikzpicture}
	\node[anchor=west,color=gray] at (-2.3, 0) {Bits};
	\node[anchor=west,color=gray] at (-2.3, -0.5) {Meaning};
	\draw[color=gray] (-2.3, -0.25) -- (5.5, -0.25);
	\draw[color=gray] (-2.3, 0.15) -- (-2.3, -0.65);

	\node at (0, 0) {\texttt{0}};
	\node at (1, 0) {\texttt{0}};
	\node at (2, 0) {\texttt{1}};
	\node at (3, 0) {\texttt{1}};
	\node at (4, 0) {\texttt{0}};
	\node at (5, 0) {\texttt{1}};

	\draw (-0.5, 0.25) -- (5.5, 0.25);
	\draw (-0.5, -0.25) -- (5.5, -0.25);
	\draw (-0.5, -0.75) -- (5.5, -0.75);

	\draw (-0.5, 0.25) -- (-0.5, -0.75);
	\draw (3.5, 0.25) -- (3.5, -0.75);
	\draw (5.5, 0.25) -- (5.5, -0.75);

	\node at (1.5, -0.5) {number of copies};
	\node at (4.5, -0.5) {symbol};
\end{tikzpicture}
\end{center}

So, the sequence \texttt{BBB} will be encoded as \texttt{[0011-01]}. \par
\note[Notation]{
	Just like dots, dashes and spaces are added for readability. \par
	Encoded binary sequences will always be written in square brackets (\texttt{[]}).
}

\problem{}
Encode \texttt{AAAA$\cdot$AAAA$\cdot$BCD$\cdot$AAAA$\cdot$AAAA} using this scheme. \par
Is this more or less efficient than \ref{runlenone}?

\begin{solution}
	\texttt{[1000-00 0001-01 0001-10 0001-11 1000-00]} \par
	This requires 30 bits, as compared to 38 in \ref{runlenone}.
\end{solution}

\vfill
\pagebreak

\problem{}
Is run-length coding always efficient? When does it work well, and when does it fail?

\vfill

\problem{}
Our coding scheme wastes a lot of space when our string has few runs of the same symbol. \par
Fix this problem: modify the scheme so that single occurrences of symbols do not waste space. \par
\hint{We don't need a run length for every symbol.
We only need one for \textit{repeated} symbols.}

\begin{solution}
	One idea is as follows: \par
	\begin{itemize}
		\item Encode single symbols na\"ively: \texttt{ABCD} becomes \texttt{[00 01 10 11]}.
		\item Signal runs using two copies of the same symbol: \texttt{AAAAAA} becomes \texttt{[00 00 0110]}. \par
		When our decoder sees two copies of the same symbol, it will interpret the next four bits as the total length of the run.
	\end{itemize}

	\texttt{BDC$\cdot$DDDDD$\cdot$AADBDC} will be encoded as \texttt{[01 11 10 11-11-0101 00-00-0010 11 01 11 10]}.
\end{solution}

\vfill

\problem{}
Consider the following string: \texttt{ABCD$\cdot$ABCD$\cdot$BABABA$\cdot$ABCD$\cdot$ABCD}. \par
\begin{itemize}
	\item How many bits do we need to encode this na\"ively? \par
	\item How about with the (unmodified) run-length scheme described on the previous page?
\end{itemize}
\hint{You don't need to encode this string---just find the length of its encoded form.}

\begin{solution}
	Na\"ively: \tab $2 \times 22 = 44$ bits \par
	Run-length: \tab $6 \times 21 = 126$ bits. Watch out for the two repeated \texttt{A}s!
\end{solution}

\vfill

Neither solution to \ref{firstlz} is ideal. Run-length coding is very wasteful because the string has almost no runs,
and na\"ive coding does not take advantage of repetition in the string. We'll need a better coding scheme.

\pagebreak
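
The bit counts in this section follow a simple pattern, sketched here as a rough accounting rather than a formal result.
It assumes the four-symbol alphabet and the unmodified 6-bit run-length scheme, and it assumes no run is longer than
15 symbols (the largest count that fits in four bits). The names $n$ and $r$ are introduced only for this remark. \par

\vspace{2mm}

For a string of $n$ symbols that splits into $r$ runs, na\"ive coding uses $2n$ bits and run-length coding uses $6r$ bits.
Run-length coding is therefore shorter exactly when
\[
	6r < 2n \iff \frac{n}{r} > 3,
\]
that is, when the average run is more than three symbols long. \par
For example, \texttt{AAAAAA$\cdot$BB} has $n = 8$ and $r = 2$: na\"ive coding takes $16$ bits, while run-length coding takes $12$.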