
% TODO:
% Basic run-length
% LZ77

\section{Run-length Coding}


%\definition{}
%\textit{Entropy} is a measure of information in a certain sequence. \par
%A sequence with high entropy contains a lot of information, and a sequence with low entropy contains relatively little.
%For example, consider the following two ten-symbol ASCII\footnotemark{} strings:
%\begin{itemize}
%	\item \texttt{AAAAAAAAAA}
%	\item \texttt{pDa3:7?j;F}
%\end{itemize}
%The first string clearly contains less information than the second.
%It's much harder to describe \texttt{pDa3:7?j;F} than it is \texttt{AAAAAAAAAA}.
%Thus, we say that the first has low entropy, and the second has fairly high entropy.
%
%\vspace{2mm}
%
%The definition above is intentionally hand-wavy. \par
%Formal definitions of entropy exist, but we won't need them today---we just need
%an intuitive understanding of the \say{density} of information in a given string.

%
%\footnotetext{
%	American Standard Code for Information Interchange, an early character encoding for computers. \par
%	It contains 128 symbols, including numbers, letters, and
%	\texttt{!"\#\$\%\&`()*+,-./:;<=>?@[\textbackslash]\^\_\{|\}\textasciitilde}
%}


%\vspace{5mm}


\problem{}<runlenone>
Using a na\"ive coding scheme, encode \texttt{AAAA$\cdot$AAAA$\cdot$BCD$\cdot$AAAA$\cdot$AAAA} in binary. \par
\note[Note]{
	We're still using the four-symbol alphabet $\{\texttt{A}, \texttt{B}, \texttt{C}, \texttt{D}\}$. \par
	Dots ($\cdot$) in the string are drawn for readability. Ignore them.
}

\begin{solution}
	There are eight \texttt{A}s on each end of that string. Mapping symbols as before, \par
	we get \texttt{[00 00 00 00 00 00 00 00 01 10 11 00 00 00 00 00 00 00 00]}
\end{solution}


\vfill
In \ref{runlenone}---and often, in the real world---the strings we want to encode have fairly low \textit{entropy}. \par
That is, they have predictable patterns, sequences of symbols that don't contain a lot of information. \par
\note{
	For example, consider the text in this document. \par
	The symbols \texttt{e}, \texttt{t}, and \texttt{<space>} are much more common than any others. \par
	Also, certain subsequences are repeated: \texttt{th}, \texttt{and}, \texttt{encode}, and so on.
}
We can exploit this fact to develop encoding schemes that need relatively few bits per letter.

\example{}
A simple example of such a coding scheme is \textit{run-length encoding}. Instead of simply listing letters of a string
in their binary form, we'll add a \textit{count} to each letter, shortening repeated instances of the same symbol.

\vspace{2mm}

We'll encode our string into a sequence of 6-bit blocks, interpreted as follows:

\begin{center}
	\begin{tikzpicture}
		\node[anchor=west,color=gray] at (-2.3, 0) {Bits};
		\node[anchor=west,color=gray] at (-2.3, -0.5) {Meaning};
		\draw[color=gray] (-2.3, -0.25) -- (5.5, -0.25);
		\draw[color=gray] (-2.3, 0.15) -- (-2.3, -0.65);

		\node at (0, 0) {\texttt{0}};
		\node at (1, 0) {\texttt{0}};
		\node at (2, 0) {\texttt{1}};
		\node at (3, 0) {\texttt{1}};
		\node at (4, 0) {\texttt{0}};
		\node at (5, 0) {\texttt{1}};

		\draw (-0.5, 0.25) -- (5.5, 0.25);
		\draw (-0.5, -0.25) -- (5.5, -0.25);
		\draw (-0.5, -0.75) -- (5.5, -0.75);

		\draw (-0.5, 0.25) -- (-0.5, -0.75);
		\draw (3.5, 0.25) -- (3.5, -0.75);
		\draw (5.5, 0.25) -- (5.5, -0.75);

		\node at (1.5, -0.5) {number of copies};
		\node at (4.5, -0.5) {symbol};
	\end{tikzpicture}
\end{center}
So, the sequence \texttt{BBB} will be encoded as \texttt{[0011-01]}. \par
\note[Notation]{
	Just like dots, dashes and spaces are added for readability. Pretend they don't exist. \par
	Encoded binary sequences will always be written in square brackets (\texttt{[]}).
}

\problem{}
Decode \texttt{[010000001111]} using this scheme.

\begin{solution}
	\texttt{AAAADDD}
\end{solution}
\vfill

\problem{}
Encode \texttt{AAAA$\cdot$AAAA$\cdot$BCD$\cdot$AAAA$\cdot$AAAA} using this scheme. \par
Is this more or less efficient than \ref{runlenone}?

\begin{solution}
	\texttt{[1000-00 0001-01 0001-10 0001-11 1000-00]} \par
	This requires 30 bits, as compared to 38 in \ref{runlenone}.
\end{solution}

\vfill
\pagebreak


\problem{}
Give an example of a message on $\{\texttt{A}, \texttt{B}, \texttt{C}, \texttt{D}\}$
that uses $n$ bits when encoded with a na\"ive scheme, and \textit{fewer} than $\nicefrac{n}{2}$ bits
when encoded using the scheme described on the previous page.
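
\begin{solution}
	One possible answer, offered as a sketch rather than the only option: \texttt{AAAA$\cdot$AAAA}. \par
	Na\"ively, its eight symbols take $n = 2 \times 8 = 16$ bits. \par
	With run-length coding it is the single block \texttt{[1000-00]}, which takes $6$ bits, and $6 < \nicefrac{n}{2} = 8$. \par
	Any run of seven to fifteen copies of one symbol works, since such a run length still fits in the four-bit count.
\end{solution}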


\vfill

\problem{}
Give an example of a message on $\{\texttt{A}, \texttt{B}, \texttt{C}, \texttt{D}\}$
that uses $n$ bits when encoded with a na\"ive scheme, and \textit{more} than $2n$ bits
when encoded using the scheme described on the previous page.
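
\begin{solution}
	One possible answer, again only a sketch: \texttt{ABCD}. \par
	Na\"ively, its four symbols take $n = 2 \times 4 = 8$ bits. \par
	No symbol repeats, so run-length coding spends a full block on each one:
	\texttt{[0001-00 0001-01 0001-10 0001-11]}. \par
	That is $4 \times 6 = 24$ bits, which equals $3n$ and is more than $2n = 16$.
\end{solution}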


\vfill


\problem{}
Is run-length coding always more efficient than na\"ive coding? \par
When does it work well, and when does it fail?
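
\begin{solution}
	No. A rough way to compare the two, using notation that is ours and not defined elsewhere in this handout: \par
	let $k$ be the number of symbols in the string and $r$ the number of runs. \par
	We assume every run has at most $15$ symbols, the largest count that fits in four bits. \par
	\[
		\text{na\"ive cost} = 2k \text{ bits},
		\qquad
		\text{run-length cost} = 6r \text{ bits}.
	\]
	Run-length coding is shorter exactly when $6r < 2k$, that is, when the average run length $\nicefrac{k}{r}$ is greater than $3$. \par
	It works well on strings with long runs, and fails on strings where most symbols differ from their neighbors.
\end{solution}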

\vfill


\problem{}
Our coding scheme wastes a lot of space when our string has few runs of the same symbol. \par
Fix this problem: modify the scheme so that single occurrences of symbols do not waste space. \par
\hint{We don't need a run length for every symbol. We only need one for \textit{repeated} symbols.}

\begin{solution}
	One idea is as follows: \par
	\begin{itemize}
		\item Encode single symbols na\"ively: \texttt{ABCD} becomes \texttt{[00 01 10 11]}
		\item Signal runs using two copies of the same symbol: \texttt{AAAAAA} becomes \texttt{[00 00 0110]}. \par
		When our decoder sees two copies of the same symbol, it will interpret the next four bits as
		a run length.
	\end{itemize}
	\texttt{BDC$\cdot$DDDDD$\cdot$AADBDC} will be encoded as \texttt{[01 11 10 11-11-0101 00-00-0010 11 01 11 10]}.
\end{solution}
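
\begin{solution}
	As a rough check on the idea above, we can count bits for \texttt{BDC$\cdot$DDDDD$\cdot$AADBDC}. This tally is a sketch under that solution's rules. \par
	\begin{itemize}
		\item Seven single symbols cost $7 \times 2 = 14$ bits.
		\item Each of the two runs costs $2 + 2 + 4 = 8$ bits, for $16$ bits in total.
	\end{itemize}
	\[
		\text{modified scheme} = 14 + 16 = 30 \text{ bits},
		\qquad
		\text{na\"ive} = 2 \times 14 = 28 \text{ bits},
		\qquad
		\text{unmodified run-length} = 6 \times 9 = 54 \text{ bits}.
	\]
	The modified scheme is far better than the unmodified one here, but a run of only two symbols still costs $8$ bits instead of $4$.
\end{solution}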

\vfill

\problem{}<firstlz>
Consider the following string: \texttt{ABCD$\cdot$ABCD$\cdot$BABABA$\cdot$ABCD$\cdot$ABCD}. \par
\begin{itemize}
	\item How many bits do we need to encode this na\"ively? \par
	\item How about with the (unmodified) run-length scheme described on the previous page?
\end{itemize}
\hint{You don't need to encode this string---just find the length of its encoded form.}

\begin{solution}
	Na\"ively: \tab $2 \times 22 = 44$ bits \par
	Run-length: \tab $6 \times 21 = 126$ bits. Watch out for the two repeated \texttt{A}s!
\end{solution}


\vfill

Neither solution to \ref{firstlz} is ideal. Run-length is very wasteful due to the lack of runs, and na\"ive coding
does not take advantage of repetition in the string. We'll need a better coding scheme.
\pagebreak