
\section{LZ Codes}

The LZ family\footnotemark{} of codes (LZ77, LZ78, LZSS, LZMA, and others) takes advantage of repeated substrings
in a string. These codes are the basis of many modern compression algorithms, including DEFLATE, which is used in the ZIP, PNG,
and GZIP formats.

\footnotetext{
	Named after Abraham Lempel and Jacob Ziv, the original inventors. \par
	LZ77 is the algorithm described in their first paper on the topic, which was published in 1977. \par
	LZ78, LZSS, and LZMA are later variations on the same general idea.
}

\vspace{2mm}

The idea behind LZ is to represent repeated substrings as \textit{pointers} to previous parts of the string. \par
Pointers take the form \texttt{<pos, len>}, where \texttt{pos} is how many symbols back the copy starts and
\texttt{len} is the number of symbols to copy.

\vspace{2mm}

For example, we can encode the string \texttt{ABRACADABRA} as \texttt{[ABRACAD<7, 4>]}. \par
The pointer \texttt{<7, 4>} tells us to look back 7 positions (to the first \texttt{A}) and copy 4 symbols starting from there. \par
Note that pointers refer to the partially decoded output---\textit{not} to the encoded string. \par
This allows pointers to copy symbols that earlier pointers produced, and ensures that codes like \texttt{[A<1,9>]} are valid even though \texttt{len} exceeds \texttt{pos}. \par
\note{For example, \texttt{[B<1,2>]} decodes to \texttt{BBB}.}
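
\vspace{2mm}

The decoding rule above can be sketched in a few lines of Python. \par
This is only an illustration, not part of any real format: the name \texttt{lz\_decode} and the choice to
represent an encoded string as a list of literal strings and \texttt{(pos, len)} tuples are assumptions made for this example.

```python
from typing import List, Tuple, Union

# Hypothetical token type for this sketch:
# a run of literal symbols, or a (pos, len) pointer.
Token = Union[str, Tuple[int, int]]


def lz_decode(tokens: List[Token]) -> str:
    """Decode a list of literals and <pos, len> pointers."""
    out: List[str] = []
    for token in tokens:
        if isinstance(token, str):
            # Literal symbols are appended one at a time.
            out.extend(token)
            continue

        pos, length = token
        if not 1 <= pos <= len(out):
            raise ValueError(f"pointer looks back {pos}, but only {len(out)} symbols exist")

        # The read head starts `pos` symbols behind the write head.
        start = len(out) - pos
        for i in range(length):
            # Copy symbol by symbol, so the read head may run into
            # symbols that this same pointer has just written.
            out.append(out[start + i])
    return "".join(out)


print(lz_decode(["ABRACAD", (7, 4)]))  # ABRACADABRA
print(lz_decode(["B", (1, 2)]))        # BBB
```

Copying one symbol at a time is what makes overlapping pointers such as \texttt{<1,2>} work:
each copied symbol is appended to the output before the next one is read.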

\problem{}
Encode \texttt{ABCD$\cdot$ABCD$\cdot$BABABA$\cdot$ABCD$\cdot$ABCD} using this scheme. \par
Then, decode the following:
\begin{itemize}
	\item \texttt{[ABCD<4,4>]}
	\item \texttt{[A<1,9>]}
	\item \texttt{[DAC<3,5>]}
\end{itemize}

\begin{solution}

	% spell:off
	\texttt{ABCD$\cdot$ABCD$\cdot$BABABA$\cdot$ABCD$\cdot$ABCD} becomes \texttt{[ABCD<4,4>BA<2,4>ABCD<4,4>]}.
	% spell:on

	\linehack{}

	In parts two and three, remember that we're reading the \textit{output string.} \par
	The ten \texttt{A}s in part two are produced one by one, \par
	with the decoder's \say{read head} following its \say{write head.}

	\begin{itemize}
		\item \texttt{ABCD$\cdot$ABCD}
		\item \texttt{AAAAA$\cdot$AAAAA}
		\item \texttt{DACDACDA}
	\end{itemize}
\end{solution}

\vfill

\problem{}
Convince yourself that LZ is a generalization of the run-length code we discussed in the previous section.
\hint{\texttt{[A<1,9>]} and \texttt{[00-1001]} are the same thing!}

\remark{}
Note that we left a few things out of this section: we didn't discuss the algorithm that converts a string to an LZ-encoded blob,
nor did we discuss how we should represent strings encoded with LZ in binary. We skipped these details because they are
problems of implementation---they're the engineer's headache, not the mathematician's. \par
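
\vspace{2mm}

For readers who want to experiment anyway, one simple way to produce such an encoding is a greedy search:
at each position, find the longest match that starts earlier in the string and emit a pointer to it,
falling back to a plain symbol when no match is long enough. \par
The sketch below is a brute-force illustration under assumptions of our own. The name \texttt{lz\_encode},
the token list, and the \texttt{min\_len} threshold are all hypothetical, and this is not the algorithm used by any particular LZ variant.

```python
from typing import List, Tuple, Union

# Hypothetical token type for this sketch:
# a single literal symbol, or a (pos, len) pointer.
Token = Union[str, Tuple[int, int]]


def lz_encode(text: str, min_len: int = 2) -> List[Token]:
    """Greedily encode `text` as literals and <pos, len> pointers."""
    tokens: List[Token] = []
    i = 0
    while i < len(text):
        best_pos, best_len = 0, 0

        # Try every possible look-back distance.
        for pos in range(1, i + 1):
            length = 0
            # Extend the match as far as it goes. The source may overlap
            # the symbols being encoded, which is what allows <1,9>.
            while i + length < len(text) and text[i + length - pos] == text[i + length]:
                length += 1
            if length > best_len:
                best_pos, best_len = pos, length

        if best_len >= min_len:
            tokens.append((best_pos, best_len))
            i += best_len
        else:
            # Matches shorter than min_len are written as plain symbols.
            tokens.append(text[i])
            i += 1
    return tokens


print(lz_encode("ABRACADABRA"))
# ['A', 'B', 'R', 'A', 'C', 'A', 'D', (7, 4)]
```

This search compares against every earlier position, so it is slow on long inputs.
Practical encoders restrict the search to a bounded window.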

\pagebreak

%\begin{instructornote}
%	A simple LZ-scheme can work as follows. We encode our string into a sequence of
%	nine-bit blocks, drawn below. The first bit of each block tells us whether or not
%	this block is a pointer, and the next eight bits contain either a \texttt{pos, len} pair
%	(using, say, four bits for each number) or a plain eight-bit symbol code.
%	\begin{center}
%		\begin{tikzpicture}
%			\node[anchor=west,color=gray] at (-2.3, 0) {Bits};
%			\node[anchor=west,color=gray] at (-2.3, -0.5) {Meaning};
%			\draw[color=gray] (-2.3, -0.25) -- (5.5, -0.25);
%			\draw[color=gray] (-2.3, 0.15) -- (-2.3, -0.65);
%
%			\node at (0, 0) {\texttt{0}};
%			\node at (1, 0) {\texttt{0}};
%			\node at (2, 0) {\texttt{1}};
%			\node at (3, 0) {\texttt{0}};
%			\node at (4, 0) {\texttt{1}};
%			\node at (5, 0) {\texttt{1}};
%			\node at (6, 0) {\texttt{0}};
%			\node at (7, 0) {\texttt{0}};
%			\node at (8, 0) {\texttt{1}};
%
%			\draw (-0.5, 0.25) -- (8.5, 0.25);
%			\draw (-0.5, -0.25) -- (8.5, -0.25);
%			\draw (-0.5, -0.75) -- (8.5, -0.75);
%
%			\draw (-0.5, 0.25) -- (-0.5, -0.75);
%			\draw (0.5, 0.25) -- (0.5, -0.75);
%			\draw (8.5, 0.25) -- (8.5, -0.75);
%
%			\node at (0, -0.5) {flag};
%			\node at (4.5, -0.5) {if flag \texttt{<pos, len>}, else eight-bit symbol};
%		\end{tikzpicture}
%	\end{center}
%
%	To encode a string, we read it using a \say{window}, shown below. This window consists of
%	a search buffer and a lookahead buffer, both of which have a fixed (but configurable) size.
%	This window passes over the string one character at a time, inserting a pointer if it finds
%	the lookahead buffer inside its search buffer, and a plain character otherwise.
%
%
%	\begin{center}
%		\begin{tikzpicture}
%			% Text tape
%			\node[color=gray] at (-0.75, 0) {\texttt{...}};
%			\node[color=gray] at (0.0, 0) {\texttt{D}};
%			\node at (0.5, 0) {\texttt{A}};
%			\node at (1.0, 0) {\texttt{B}};
%			\node at (1.5, 0) {\texttt{C}};
%			\node at (2.0, 0) {\texttt{D}};
%			\node at (2.5, 0) {\texttt{A}};
%			\node at (3.0, 0) {\texttt{B}};
%			\node at (3.5, 0) {\texttt{C}};
%			\node at (4.0, 0) {\texttt{D}};
%			\node[color=gray] at (4.5, 0) {\texttt{B}};
%			\node[color=gray] at (5.0, 0) {\texttt{D}};
%			\node[color=gray] at (5.5, 0) {\texttt{A}};
%			\node[color=gray] at (6.0, 0) {\texttt{C}};
%			\node[color=gray] at (6.75, 0) {\texttt{...}};
%
%			\draw (-1.75, 0.25) -- (7.25, 0.25);
%			\draw (-1.75, -0.25) -- (7.25, -0.25);
%
%
%			\draw[line width = 0.7mm, color=oblue, dotted] (2.25, 0.5) -- (2.25, -0.5);
%			\draw[line width = 0.7mm, color=oblue]
%				(-1.25, 0.5)
%				-- (4.25, 0.5)
%				-- (4.25, -0.5)
%				-- (-1.25, -0.5)
%				-- cycle
%			;
%
%			\draw
%				(4.2, -0.625)
%				-- (4.2, -0.75)
%				to node[anchor=north, midway] {lookahead} (2.3, -0.75)
%				-- (2.3, -0.625)
%			;
%
%			\draw
%				(2.2, -0.625)
%				-- (2.2, -0.75)
%				to node[anchor=north, midway] {search buffer} (-1.1, -0.75)
%				-- (-1.1, -0.625)
%			;
%
%			\draw[color=gray]
%				(2.2, 0.625)
%				-- (2.2, 0.75)
%				to node[anchor=south, midway] {match!} (0.3, 0.75)
%				-- (0.3, 0.625)
%			;
%
%			%\draw[->, color=gray] (2.5, 0.3) -- (2.5, 0.8) to[out=90,in=90] (0.5, 0.8);
%			\node at (7.0, -0.75) {Result: \texttt{[$\cdot\cdot\cdot$DABCD<4,4>$\cdot\cdot\cdot$]}};
%		\end{tikzpicture}
%	\end{center}
%
%	This is not the exact process used in practice---but it's close enough. \par
%	This process may be tweaked in any number of ways.
%\end{instructornote}
%
%\makeatletter\if@solutions
%	\vfill
%	\pagebreak
%\fi\makeatother