\section{LZ Codes}

The LZ family\footnotemark{} of codes (LZ77, LZ78, LZSS, LZMA, and others) takes advantage of repeated sequences of symbols
in a string. They are the basis of most modern compression algorithms, including DEFLATE, which is used in the ZIP, PNG,
and GZIP formats.

\footnotetext{
	Named after Abraham Lempel and Jacob Ziv, the original inventors. \par
	LZ77 is the algorithm described in their first paper on the topic, which was published in 1977. \par
	LZ78, LZSS, and LZMA are variations on the same general idea.
}

\vspace{2mm}

The idea behind LZ is to represent repeated substrings as \textit{pointers} to previous parts of the string. \par
Pointers take the form \texttt{<pos, len>}, where \texttt{pos} is how far back to look (counted in symbols from the
current end of the output) and \texttt{len} is the number of symbols to copy.

\vspace{2mm}

For example, we can encode the string \texttt{ABRACADABRA} as \texttt{[ABRACAD<7, 4>]}. \par
The pointer \texttt{<7, 4>} tells us to look back 7 positions (to the first \texttt{A}) and copy the next 4 symbols. \par
Note that pointers refer to the partially decoded output, \textit{not} to the encoded string. \par
This allows a pointer to copy symbols that were themselves produced by a pointer, and ensures codes like \texttt{[A<1,9>]} are valid.

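The decoding rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the token representation (a one-character string for a literal symbol, a \texttt{(pos, len)} tuple for a pointer) and the name \texttt{lz\_decode} are made up for this example and are not part of any standard format. Symbols are copied one at a time, so a pointer may read symbols that it has just written.

```python
def lz_decode(tokens):
    """Decode a list of LZ tokens (hypothetical representation for this sketch).

    Each token is either a literal symbol (a one-character string)
    or a pointer (pos, length): look back `pos` symbols in the output
    produced so far and copy `length` symbols from there.
    """
    out = []
    for token in tokens:
        if isinstance(token, str):
            out.append(token)  # literal symbol
            continue
        pos, length = token
        if pos < 1 or pos > len(out):
            raise ValueError(f"pointer {token} looks back past the start of the output")
        # Copy one symbol at a time: the read head trails the write head
        # by exactly `pos`, so it may read symbols written by this same pointer.
        for _ in range(length):
            out.append(out[-pos])
    return "".join(out)


print(lz_decode(list("ABRACAD") + [(7, 4)]))  # ABRACADABRA
print(lz_decode(["A", "B", (2, 5)]))          # ABABABA
```

The second call shows an overlapping pointer: only two symbols exist when \texttt{<2,5>} is read, yet five symbols are copied.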
\problem{}
Encode \texttt{ABCD$\cdot$ABCD$\cdot$BABABA$\cdot$ABCD$\cdot$ABCD} using LZ.
Then, decode the following:
\begin{itemize}
	\item \texttt{[ABCD<4,4>]}
	\item \texttt{[A<1,9>]}
	\item \texttt{[DAC<3,5>]}
\end{itemize}

\begin{solution}

	\texttt{ABCD$\cdot$ABCD$\cdot$BABABA$\cdot$ABCD$\cdot$ABCD} becomes \texttt{[ABCD<4, 4> BA<2,4> ABCD<4,4>]}.

	\linehack{}

	In parts two and three, remember that we're reading the \textit{output string.} \par
	The nine \texttt{A}s in part two are produced one by one, \par
	with the decoder's \say{read head} following its \say{write head.}

	\begin{itemize}
		\item \texttt{ABCD$\cdot$ABCD}
		\item \texttt{AAAAA$\cdot$AAAAA}
		\item \texttt{DACDACDA}
	\end{itemize}
\end{solution}

\vfill

\problem{}
Convince yourself that LZ is a generalization of the run-length code we discussed in the previous section.
\hint{\texttt{[A<1,9>]} and \texttt{[00-1001]} are the same thing!}

\remark{}
Note that we left a few things out of this section: we didn't discuss the algorithm that converts a string to an LZ-encoded blob,
nor did we discuss how we should represent strings encoded with LZ in binary. We skipped these details because they are
problems of implementation: they're the engineer's headache, not the mathematician's. If you're interested, the two diagrams
below give a brief sketch of both. Ask an instructor to walk you through them.

\begin{center}
	\begin{tikzpicture}
		\node[anchor=west,color=gray] at (-2.3, 0) {Bits};
		\node[anchor=west,color=gray] at (-2.3, -0.5) {Meaning};
		\draw[color=gray] (-2.3, -0.25) -- (5.5, -0.25);
		\draw[color=gray] (-2.3, 0.15) -- (-2.3, -0.65);

		\node at (0, 0) {\texttt{0}};
		\node at (1, 0) {\texttt{0}};
		\node at (2, 0) {\texttt{1}};
		\node at (3, 0) {\texttt{0}};
		\node at (4, 0) {\texttt{1}};
		\node at (5, 0) {\texttt{1}};
		\node at (6, 0) {\texttt{0}};
		\node at (7, 0) {\texttt{0}};
		\node at (8, 0) {\texttt{1}};

		\draw (-0.5, 0.25) -- (8.5, 0.25);
		\draw (-0.5, -0.25) -- (8.5, -0.25);
		\draw (-0.5, -0.75) -- (8.5, -0.75);

		\draw (-0.5, 0.25) -- (-0.5, -0.75);
		\draw (0.5, 0.25) -- (0.5, -0.75);
		\draw (8.5, 0.25) -- (8.5, -0.75);

		\node at (0, -0.5) {flag};
		\node at (4.5, -0.5) {if flag \texttt{<pos, len>}, else eight-bit symbol};
	\end{tikzpicture}
\end{center}

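One possible reading of the nine-bit layout in the diagram above is sketched below in Python. The diagram fixes only the flag bit and the eight bits that follow it; the choice to split a pointer's eight bits into four bits of \texttt{pos} and four bits of \texttt{len}, the convention that a set flag marks a pointer, and the name \texttt{pack\_tokens} are all assumptions made for this example.

```python
def pack_tokens(tokens):
    """Pack LZ tokens into a string of '0'/'1' characters, nine bits per token.

    Layout assumed for this sketch only:
      literal: flag 0, then the symbol's 8-bit character code
      pointer: flag 1, then 4 bits of pos and 4 bits of len
    """
    bits = []
    for token in tokens:
        if isinstance(token, str):
            code = ord(token)
            if code > 0xFF:
                raise ValueError("symbol does not fit in eight bits")
            bits.append("0" + format(code, "08b"))
        else:
            pos, length = token
            if not (0 < pos < 16 and 0 < length < 16):
                raise ValueError("pos and len must each fit in four bits")
            bits.append("1" + format(pos, "04b") + format(length, "04b"))
    return "".join(bits)


print(pack_tokens(["A", (1, 9)]))  # 001000001100011001
```

With this layout a pointer costs as much as one literal, so any match of two or more symbols saves space. Real formats choose the field widths differently.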
\begin{center}
	\begin{tikzpicture}
		% Text tape
		\node[color=gray] at (-0.75, 0) {\texttt{...}};
		\node[color=gray] at (0.0, 0) {\texttt{D}};
		\node at (0.5, 0) {\texttt{A}};
		\node at (1.0, 0) {\texttt{B}};
		\node at (1.5, 0) {\texttt{C}};
		\node at (2.0, 0) {\texttt{D}};
		\node at (2.5, 0) {\texttt{A}};
		\node at (3.0, 0) {\texttt{B}};
		\node at (3.5, 0) {\texttt{C}};
		\node at (4.0, 0) {\texttt{D}};
		\node[color=gray] at (4.5, 0) {\texttt{B}};
		\node[color=gray] at (5.0, 0) {\texttt{D}};
		\node[color=gray] at (5.5, 0) {\texttt{A}};
		\node[color=gray] at (6.0, 0) {\texttt{C}};
		\node[color=gray] at (6.75, 0) {\texttt{...}};

		\draw (-1.75, 0.25) -- (7.25, 0.25);
		\draw (-1.75, -0.25) -- (7.25, -0.25);

		\draw[line width = 0.7mm, color=oblue, dotted] (2.25, 0.5) -- (2.25, -0.5);
		\draw[line width = 0.7mm, color=oblue]
			(-1.25, 0.5)
			-- (4.25, 0.5)
			-- (4.25, -0.5)
			-- (-1.25, -0.5)
			-- cycle
		;

		\draw
			(4.2, -0.625)
			-- (4.2, -0.75)
			to node[anchor=north, midway] {lookahead} (2.3, -0.75)
			-- (2.3, -0.625)
		;

		\draw
			(2.2, -0.625)
			-- (2.2, -0.75)
			to node[anchor=north, midway] {search buffer} (-1.1, -0.75)
			-- (-1.1, -0.625)
		;

		\draw[color=gray]
			(2.2, 0.625)
			-- (2.2, 0.75)
			to node[anchor=south, midway] {match!} (0.3, 0.75)
			-- (0.3, 0.625)
		;

		%\draw[->, color=gray] (2.5, 0.3) -- (2.5, 0.8) to[out=90,in=90] (0.5, 0.8);
		\node at (7.0, -0.75) {Result: \texttt{[$\cdot\cdot\cdot$DABCD<4,4>$\cdot\cdot\cdot$]}};
	\end{tikzpicture}
\end{center}

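The sliding window in the diagram above suggests a simple greedy encoder, sketched here in Python. This is one possible strategy and not necessarily the one any real compressor uses: the name \texttt{lz\_encode}, the window sizes, the \texttt{min\_match} threshold, and the token representation (one-character strings for literals, \texttt{(pos, len)} tuples for pointers) are assumptions made for this example.

```python
def lz_encode(text, search_size=15, lookahead_size=15, min_match=2):
    """Greedy sliding-window encoder (a sketch, not an optimal one).

    Returns a list of tokens: literal symbols (one-character strings)
    and pointers (pos, length), where pos counts back from the cursor.
    """
    tokens = []
    i = 0  # cursor: boundary between search buffer and lookahead
    while i < len(text):
        best_pos, best_len = 0, 0
        max_len = min(lookahead_size, len(text) - i)
        # Try every start position inside the search buffer.
        for start in range(max(0, i - search_size), i):
            length = 0
            # The match may run past the cursor, which produces
            # overlapping pointers such as <1,9>.
            while length < max_len and text[start + length] == text[i + length]:
                length += 1
            if length > best_len:
                best_pos, best_len = i - start, length
        if best_len >= min_match:
            tokens.append((best_pos, best_len))
            i += best_len
        else:
            tokens.append(text[i])  # no useful match: emit a literal
            i += 1
    return tokens


print(lz_encode("ABRACADABRA"))
# ['A', 'B', 'R', 'A', 'C', 'A', 'D', (7, 4)]
```

At each step the encoder takes the longest match it can find in the search buffer and falls back to a literal when the match is too short to be worth a pointer. Greedy matching is easy to follow but does not always give the shortest encoding.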
\vfill
\pagebreak