
\section{LZ Codes}

The LZ-family\footnotemark{} of codes (LZ77, LZ78, LZSS, LZMA, and others) take advantage of repeated sequences of symbols
in a string. They are the basis of most modern compression algorithms, including DEFLATE, which is used in the ZIP, PNG,
and GZIP formats.

\footnotetext{
	Named after Abraham Lempel and Jacob Ziv, the original inventors. \par
	LZ77 is the algorithm described in their first paper on the topic, which was published in 1977. \par
	LZ78, LZSS, and LZMA are later variations on the same general idea.
}

\vspace{2mm}

The idea behind LZ is to represent repeated substrings as \textit{pointers} to previous parts of the string. \par
Pointers take the form \texttt{<pos, len>}, where \texttt{pos} is the number of symbols to look back from the end of the
output produced so far, and \texttt{len} is the number of symbols to copy starting from that point.

\vspace{2mm}

For example, we can encode the string \texttt{ABRACADABRA} as \texttt{[ABRACAD<7, 4>]}. \par
The pointer \texttt{<7, 4>} tells us to look back 7 positions (to the first \texttt{A}), and copy the next 4 symbols. \par
Note that pointers refer to the partially decoded output---\textit{not} to the encoded string. \par
This allows a pointer to copy symbols that an earlier pointer produced, or even symbols it has just produced itself,
and ensures codes like \texttt{A<1,9>} are valid.
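
\vspace{2mm}

The decoding rule above can be sketched in a few lines of Python. This is only an illustration, not a definitive implementation:
the function name \texttt{lz\_decode} and the token format (a one-character string for a literal symbol,
a tuple \texttt{(pos, length)} for a pointer) are assumptions made for this example.

```python
def lz_decode(tokens):
    """Decode a list of LZ tokens into a string.

    Assumed token format (hypothetical, chosen for this sketch):
      - a one-character string is a literal symbol
      - a tuple (pos, length) is a pointer <pos, len>
    """
    out = []
    for tok in tokens:
        if isinstance(tok, tuple):
            pos, length = tok
            if not 1 <= pos <= len(out):
                raise ValueError(f"pointer looks back {pos}, but only {len(out)} symbols exist")
            start = len(out) - pos
            # Copy one symbol at a time from the decoded output.
            # Because `out` grows as we copy, a pointer may read
            # symbols that it wrote earlier in this same loop.
            for i in range(length):
                out.append(out[start + i])
        else:
            out.append(tok)
    return "".join(out)


# The example from the text: [ABRACAD<7, 4>]
print(lz_decode(list("ABRACAD") + [(7, 4)]))  # prints ABRACADABRA
```

Copying one symbol at a time, rather than slicing \texttt{len} symbols at once, is what lets the \say{read head}
follow the \say{write head} when \texttt{len} is larger than \texttt{pos}.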

\problem{}
Encode \texttt{ABCD$\cdot$ABCD$\cdot$BABABA$\cdot$ABCD$\cdot$ABCD} using LZ.
Then, decode the following:
\begin{itemize}
	\item \texttt{[ABCD<4,4>]}
	\item \texttt{[A<1,9>]}
	\item \texttt{[DAC<3,5>]}
\end{itemize}

\begin{solution}

	\texttt{ABCD$\cdot$ABCD$\cdot$BABABA$\cdot$ABCD$\cdot$ABCD} becomes \texttt{[ABCD<4, 4> BA<2,4> ABCD<4,4>]}.

	\linehack{}

	In parts two and three, remember that we're reading the \textit{output string.} \par
	The nine \texttt{A}s in part two are produced one by one, \par
	with the decoder's \say{read head} following its \say{write head.}

	\begin{itemize}
		\item \texttt{ABCD$\cdot$ABCD}
		\item \texttt{AAAAA$\cdot$AAAAA}
		\item \texttt{DACDACDA}
	\end{itemize}
\end{solution}

\vfill

\problem{}
Convince yourself that LZ is a generalization of the run-length code we discussed in the previous section.
\hint{\texttt{[A<1,9>]} and \texttt{[00-1001]} are the same thing!}

\remark{}
Note that we left a few things out of this section: we didn't discuss the algorithm that converts a string to an LZ-encoded blob,
nor did we discuss how to represent LZ-encoded strings in binary. We skipped these details because they are
problems of implementation: they're the engineer's headache, not the mathematician's. If you're interested, the two diagrams below
sketch one possible binary layout and how an encoder searches for matches. Ask an instructor to explain them.

\begin{center}
	\begin{tikzpicture}
		\node[anchor=west,color=gray] at (-2.3, 0) {Bits};
		\node[anchor=west,color=gray] at (-2.3, -0.5) {Meaning};
		\draw[color=gray] (-2.3, -0.25) -- (5.5, -0.25);
		\draw[color=gray] (-2.3, 0.15) -- (-2.3, -0.65);

		\node at (0, 0) {\texttt{0}};
		\node at (1, 0) {\texttt{0}};
		\node at (2, 0) {\texttt{1}};
		\node at (3, 0) {\texttt{0}};
		\node at (4, 0) {\texttt{1}};
		\node at (5, 0) {\texttt{1}};
		\node at (6, 0) {\texttt{0}};
		\node at (7, 0) {\texttt{0}};
		\node at (8, 0) {\texttt{1}};

		\draw (-0.5, 0.25) -- (8.5, 0.25);
		\draw (-0.5, -0.25) -- (8.5, -0.25);
		\draw (-0.5, -0.75) -- (8.5, -0.75);

		\draw (-0.5, 0.25) -- (-0.5, -0.75);
		\draw (0.5, 0.25) -- (0.5, -0.75);
		\draw (8.5, 0.25) -- (8.5, -0.75);

		\node at (0, -0.5) {flag};
		\node at (4.5, -0.5) {if flag \texttt{<pos, len>}, else eight-bit symbol};
	\end{tikzpicture}
\end{center}
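
The layout in the diagram above (one flag bit, then either a pointer or an eight-bit symbol) could be written out as follows.
This is a sketch under stated assumptions: the diagram does not say how many bits a pointer uses, so the four-bit
\texttt{pos} and four-bit \texttt{len} fields are hypothetical, as is the function name \texttt{pack\_tokens}.
We also assume a set flag (\texttt{1}) marks a pointer.

```python
def pack_tokens(tokens, pos_bits=4, len_bits=4):
    """Turn LZ tokens into a string of '0'/'1' characters.

    Assumptions for this sketch (not fixed by the handout):
      - flag 1 = pointer, flag 0 = literal
      - pointers use pos_bits + len_bits bits after the flag
      - literals are one-character strings with code points below 256
    """
    chunks = []
    for tok in tokens:
        if isinstance(tok, tuple):
            pos, length = tok
            if not (0 <= pos < 2**pos_bits and 0 <= length < 2**len_bits):
                raise ValueError(f"pointer {tok} does not fit in the chosen field widths")
            chunks.append("1" + format(pos, f"0{pos_bits}b") + format(length, f"0{len_bits}b"))
        else:
            code = ord(tok)
            if code > 255:
                raise ValueError(f"symbol {tok!r} does not fit in eight bits")
            chunks.append("0" + format(code, "08b"))
    return "".join(chunks)


print(pack_tokens(["Y"]))     # prints 001011001
print(pack_tokens([(7, 4)]))  # prints 101110100
```

The first call reproduces the nine bits shown in the diagram: a \texttt{0} flag followed by the eight-bit code for \texttt{Y}.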


\begin{center}
	\begin{tikzpicture}
		% Text tape
		\node[color=gray] at (-0.75, 0) {\texttt{...}};
		\node[color=gray] at (0.0, 0) {\texttt{D}};
		\node at (0.5, 0) {\texttt{A}};
		\node at (1.0, 0) {\texttt{B}};
		\node at (1.5, 0) {\texttt{C}};
		\node at (2.0, 0) {\texttt{D}};
		\node at (2.5, 0) {\texttt{A}};
		\node at (3.0, 0) {\texttt{B}};
		\node at (3.5, 0) {\texttt{C}};
		\node at (4.0, 0) {\texttt{D}};
		\node[color=gray] at (4.5, 0) {\texttt{B}};
		\node[color=gray] at (5.0, 0) {\texttt{D}};
		\node[color=gray] at (5.5, 0) {\texttt{A}};
		\node[color=gray] at (6.0, 0) {\texttt{C}};
		\node[color=gray] at (6.75, 0) {\texttt{...}};

		\draw (-1.75, 0.25) -- (7.25, 0.25);
		\draw (-1.75, -0.25) -- (7.25, -0.25);


		\draw[line width = 0.7mm, color=oblue, dotted] (2.25, 0.5) -- (2.25, -0.5);
		\draw[line width = 0.7mm, color=oblue]
			(-1.25, 0.5)
			-- (4.25, 0.5)
			-- (4.25, -0.5)
			-- (-1.25, -0.5)
			-- cycle
		;

		\draw
			(4.2, -0.625)
			-- (4.2, -0.75)
			to node[anchor=north, midway] {lookahead} (2.3, -0.75)
			-- (2.3, -0.625)
		;

		\draw
			(2.2, -0.625)
			-- (2.2, -0.75)
			to node[anchor=north, midway] {search buffer} (-1.1, -0.75)
			-- (-1.1, -0.625)
		;

		\draw[color=gray]
			(2.2, 0.625)
			-- (2.2, 0.75)
			to node[anchor=south, midway] {match!} (0.3, 0.75)
			-- (0.3, 0.625)
		;

		%\draw[->, color=gray] (2.5, 0.3) -- (2.5, 0.8) to[out=90,in=90] (0.5, 0.8);
		\node at (7.0, -0.75) {Result: \texttt{[$\cdot\cdot\cdot$DABCD<4,4>$\cdot\cdot\cdot$]}};
	\end{tikzpicture}
\end{center}
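
The search pictured above can be sketched as a simple greedy encoder. This is one possible approach, not the method of any
particular LZ variant: at each position it scans the search buffer for the longest match with the upcoming symbols and emits a pointer
if that match is long enough. The name \texttt{lz\_encode} and the parameters \texttt{window}, \texttt{max\_len}, and \texttt{min\_len}
are hypothetical. Tokens are one-character strings for literals and \texttt{(pos, length)} tuples for pointers.

```python
def lz_encode(text, window=255, max_len=255, min_len=2):
    """Greedy LZ encoder (illustrative sketch, not optimized).

    window  : how far back the search buffer reaches (assumed value)
    max_len : size of the lookahead, i.e. the longest allowed match (assumed value)
    min_len : shortest match worth replacing with a pointer (assumed value)
    """
    tokens = []
    i = 0
    while i < len(text):
        best_pos, best_len = 0, 0
        # Try every starting point inside the search buffer.
        for start in range(max(0, i - window), i):
            length = 0
            # Extend the match symbol by symbol. The match may run past
            # position i, which produces an overlapping pointer.
            while (length < max_len
                   and i + length < len(text)
                   and text[start + length] == text[i + length]):
                length += 1
            if length > best_len:
                best_pos, best_len = i - start, length
        if best_len >= min_len:
            tokens.append((best_pos, best_len))
            i += best_len
        else:
            tokens.append(text[i])
            i += 1
    return tokens
```

For example, \texttt{lz\_encode("ABRACADABRA")} returns the seven literals \texttt{A B R A C A D} followed by the pointer \texttt{(7, 4)},
which matches the encoding \texttt{[ABRACAD<7, 4>]} from earlier in this section. The nested loops make this sketch slow on long
inputs; it is meant to show the idea, not to be efficient.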


\vfill
\pagebreak