\chapter[Compression and Decompression]{Compression and Decompression}
Algorithms for data compression usually proceed as follows.
They encode a text over some finite alphabet into a
sequence of bits, thereby exploiting the fact that
the letters of this alphabet occur with different frequencies.
For instance, an ``e'' occurs more frequently than a ``q'' and
will therefore be assigned a shorter codeword. The quality of the
compression procedure is then measured in terms of the average
codeword length.
So the underlying model is probabilistic, namely we consider
a finite alphabet and a probability distribution on this alphabet, where
the probability distribution reflects the (relative)
frequencies of the letters. Such a pair---an alphabet with a
probability distribution---is called a source.
We shall first introduce some basic facts from Information Theory.
Most important is the notion of entropy, since the source entropy
characterises the achievable lower bounds for compressibility.
The best understood source model is the discrete memoryless
source. Here the letters occur independently of each other in the text.
The use of prefix codes, in which no codeword is the beginning of
another one,
makes it possible to compress the text down to the entropy of the source.
We shall study this in detail. The lower bound
is obtained via Kraft's inequality, the achievability is demonstrated
by the use of Huffman codes, which can be shown to be optimal.
There are some assumptions on the discrete memoryless source, which
are not fulfilled in most practical situations.
Firstly, usually this source model is not realistic, since the letters do
not occur independently in the text. Secondly, the probability
distribution is not known in advance. So the coding algorithms
should be universal for a whole class of probability distributions
on the alphabet. The analysis of such universal coding techniques
is much more involved than the analysis of the discrete memoryless
source, such that we shall only present the algorithms and do not
prove the quality of their performance.
Universal coding techniques mainly fall into
two classes.
Statistical coding techniques estimate the probability of the next
letters as accurately as possible. This process is called modelling of
the source. Having enough information about the probabilities,
the text is encoded, where usually
arithmetic coding is applied. Here the probability is represented
by an interval and this interval will be encoded.
\newpage
Dictionary-based algorithms store patterns which occurred before
in the text in a dictionary; at the next occurrence a pattern
is encoded via its position in the dictionary. The most prominent
procedure of this kind is due to Ziv and Lempel.
We shall also present a third universal coding technique which falls
in neither of these two classes. The algorithm due to Burrows
\nevindex{Burrows, Michael} and
Wheeler \nevindex{Wheeler, David J.}
has become quite prominent in recent years, since implementations
based on it perform very well in practice.
All algorithms mentioned so far are lossless, i. e., there is no
information lost after decoding. So the original text
will be recovered without any errors. In contrast, there are
lossy data compression techniques, where the text obtained after
decoding does not completely coincide with the original text.
Lossy compression algorithms are used in applications
like image, sound, video, or speech compression. The loss should, of course,
only marginally affect the quality. For instance, frequencies not perceivable
by the human eye or ear can be dropped. However, the understanding
of such techniques requires a solid background in image, sound
or speech processing, which would be far beyond the scope of this
chapter, so we shall illustrate only the basic concepts behind
image compression algorithms such as JPEG.
We emphasise here the recent developments such as the
Burrows-Wheeler transform and the context-tree weighting method.
Rigorous proofs will only be presented for the results on the
discrete memoryless
source which is best understood but not a very realistic
source model in practice. However, it is also the basis for
more complicated source models, where the calculations
involve conditional probabilities.
The asymptotic computational complexity of compression algorithms
is often linear in the text length, since the algorithms simply parse
through the text. However, the running time relevant for practical
implementations is mostly
determined by constants such as the dictionary size in Ziv-Lempel coding
or the depth of the context tree, when arithmetic coding is applied.
\nevindex{Ziv, Jacob}\nevindex{Lempel, Abraham}
Further, an exact analysis or comparison of compression algorithms
often heavily depends on the structure of the source or the type of
file to be compressed, such that usually the performance of
compression algorithms is tested on benchmark files. The most
well-known collections of benchmark files are the Calgary Corpus
and the Canterbury Corpus.
\section{Facts from information theory}
\subsection{The Discrete Memoryless Source}
The source model discussed throughout this chapter is the
\ki{Discrete Memoryless Source} (DMS).\inddef{Discrete Memoryless Source}\index{DMS|see{Discrete Memoryless Source}}
Such a source is a pair $({{\cal X}},P)$, where
${{\cal X}}=\{1,\dots,m\}$ is a finite alphabet and $P=
(P(1),\dots,P(m))$ is a probability distribution on ${{\cal X}}$.
A discrete memoryless source can also be described
by a random variable $X$, where $\mbox{Prob}(X=x)=P(x)$
for all $x\in{{\cal X}}$. A word
$x^n=(x_1 x_2 \dots x_n)\in{{\cal X}}^n$ is the
realization of the random variable $(X_1\dots X_n)$, where the
$X_i\mbox{'s}$ are identically distributed and
independent of each other. So the probability
$P^n(x_1 x_2 \dots x_n)=P(x_1)\cdot P(x_2)\cdot\dots\cdot P(x_n)$
is the product of the probabilities of the single letters.
\newpage
Estimations for the letter probabilities in natural languages are obtained
by statistical methods.
If we consider the English language and choose for
${{\cal X}}$ the latin
alphabet with an additional symbol for Space and
punctuation marks, the probability distribution can be derived from the
frequency table in Figure \ref{datac:freq}, which is
obtained from the copy-fitting tables used by
professional printers. So $P(A)=0.064,P(B)=0.014$, etc.
\begin{figure}
\begin{center}
\begin{tabular}{crccrccrccr}
A & 64 & \quad & H & 42 & \quad & N & 56 & \quad & U & 31 \\
B & 14 & & I & 63 & & O & 56 & & V & 10 \\
C & 27 & & J & 3 & & P & 17 & & W & 10 \\
D & 35 & & K & 6 & & Q & 4 & & X & 3 \\
E & 100 & & L & 35 & & R & 49 & & Y & 18 \\
F & 20 & & M & 20 & & S & 56 & & Z & 2 \\
G & 14 & & T & 71 & & & & & &
\end{tabular}
\end{center}
\centerline{Space/Punctuation mark 166}
\caption{Frequency of letters in $1000$ characters of English.}
\label{datac:freq}
\end{figure}
Observe that this source model is often not realistic.
For instance, in English texts the combination `th' occurs more often
than `ht'.
This could not be the case if an English text were produced by a discrete
memoryless source, since then $P(th)=P(t)\cdot P(h)=P(ht)$.
In the discussion of the communication model it was pointed out that the
encoder wants to compress the original data
into a short sequence of binary digits, using a binary code, i. e.,
a function
$c:{{\cal X}}\longrightarrow\{0,1\}^*=\bigcup\limits_{n=0}^{\infty}\{0,1\}^n$. To each
element $x\in{{\cal X}}$ a codeword $c(x)$ is assigned.
The aim of the encoder is to minimise the average length of the codewords.
It turns out that the best possible data compression can be
described in terms of the \ki{entropy} $H(P)$ \inddef{entropy} of the probability distribution
$P$. The entropy is given by the formula
\[ H(P)= -\sum_{x\in{{\cal X}}} P(x)\cdot\lg P(x) \koz ,\]
where the logarithm is to the base 2. We shall also use the notation
$H(X)$ according to the interpretation of the source as a random variable.
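For illustration, the entropy of the letter distribution of Figure \ref{datac:freq} can be evaluated with a few lines of Python. This is only a sketch; the frequencies are normalised by their actual sum in the table:

```python
from math import log2

# Letter frequencies per 1000 characters of English (Figure datac:freq);
# the final entry is Space/Punctuation.
freq = {'A': 64, 'B': 14, 'C': 27, 'D': 35, 'E': 100, 'F': 20, 'G': 14,
        'H': 42, 'I': 63, 'J': 3, 'K': 6, 'L': 35, 'M': 20, 'N': 56,
        'O': 56, 'P': 17, 'Q': 4, 'R': 49, 'S': 56, 'T': 71, 'U': 31,
        'V': 10, 'W': 10, 'X': 3, 'Y': 18, 'Z': 2, 'SP': 166}

def entropy(P):
    """H(P) = -sum_x P(x) lg P(x), with the convention 0 lg 0 = 0."""
    return -sum(p * log2(p) for p in P if p > 0)

total = sum(freq.values())
H = entropy([f / total for f in freq.values()])
```

The value comes out close to the $H(P)\approx4.19$ bits per character quoted below for this distribution.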
\subsection{Prefix codes}
A \ki{code} (of variable length) is a function
$c:{\cal X}\longrightarrow \{0,1\}^*$, ${\cal X}=\{1,\dots,m\}$.
Here $\{c(1),c(2),\dots,c(m)\}$ is the set of \ki{codewords,} where for
$x=1,\dots,m$ the codeword is $c(x)=\left(c_1(x),c_2(x),\dots,c_{L(x)}
(x)\right)$, with $L(x)$ denoting the \ki{length} of
$c(x)$, i. e., the number of bits used to represent $c(x)$.
A code $c$ is \ki{uniquely decipherable} (UDC)
\inddef{uniquely decipherable code}\index{UDC|see{uniquely decipherable code}},
if every word in $\{0,1\}^*$ is representable by
at most one sequence of codewords.
A code $c$ is a \ki{prefix code,}\inddef{prefix code} if no codeword is the
prefix of another one, i. e., for any two codewords
$c(x)$ and $c(y)$, $x\neq y$, with
$L(x)\leq L(y)$ it holds that $\bigl(c_1(x),c_2(x),\dots,c_{L(x)}(x)\bigr)\neq
\bigl(c_1(y),c_2(y),\dots,c_{L(x)}(y)\bigr)$.
So in at least one of the first $L(x)$ components $c(x)$ and $c(y)$ differ.
Messages encoded using a prefix code are uniquely decipherable. The decoder
proceeds by reading the next letter until a codeword $c(x)$ is formed.
Since $c(x)$ cannot be the beginning of another codeword, it must
correspond to the letter $x\in{\cal X}$. Now the decoder continues until another
codeword is formed. The process may be repeated until the end of the message.
So after having found the codeword $c(x)$ the decoder instantaneously
knows that $x\in{\cal X}$ is the next letter of the message. Because of
this property a prefix code is also denoted as instantaneous code.
The criterion for data compression
is to minimise the average length of the codewords.
So if we are given a source $({\cal X},P)$, where ${\cal X}=\{1,\dots,m\}$ and
$P=\bigl(P(1),P(2),\dots,P(m)\bigr)$ is a probability distribution on
${\cal X}$, the \ki{average length} $\overline L(c)$ is defined by
\[ \overline L(c)=\sum_{x\in{\cal X}} P(x)\cdot L(x) \koz .\]
The following prefix code $c$ for English texts has average length
$\overline L(c)=3\cdot0.266+4\cdot0.415
+5\cdot0.190+6\cdot0.101+7\cdot0.016+8\cdot0.012=4.222$.
\medskip
\begin{center}
\begin{tabular}{lllll}
$A\longrightarrow0110$,& $B\longrightarrow010111$,& $C\longrightarrow10001$,& $D\longrightarrow01000,$\\
$E\longrightarrow110$,&$F\longrightarrow11111$,&$G\longrightarrow111100$,&$H\longrightarrow00100$,\\
$I\longrightarrow0111$,&$J\longrightarrow11110110$,&$K\longrightarrow1111010$,&$L\longrightarrow01010$,\\
$M\longrightarrow001010$,&$N\longrightarrow1010$,&$O\longrightarrow1001$,&$P\longrightarrow010011$,\\
$Q\longrightarrow01011010$,&$R\longrightarrow1110$,&$S\longrightarrow1011$,&$T\longrightarrow0011$,\\
$U\longrightarrow10000$,&$V\longrightarrow0101100$,&$W\longrightarrow001011$,&$X\longrightarrow01011011$,\\
$Y\longrightarrow010010$,&$Z\longrightarrow11110111$,&$SP\longrightarrow000$.
\end{tabular}
\end{center}
\medskip
We can still do better, if we do not encode single letters, but
blocks of $n$ letters for some $n\in N$.
In this case we replace the source $({\cal X},P)$ by $({\cal X}^n,P^n)$.
Remember that $P^n(x_1 x_2 \dots x_n)=P(x_1)\cdot P(x_2)\cdot\dots\cdot
P(x_n)$ for a word $(x_1 x_2 \dots x_n)\in{\cal X}^n$, since the source
is memoryless. If e.g. we are given an alphabet with two letters,
${\cal X}=\{a,b\}$ and $P(a)=0.9$, $P(b)=0.1$, then the code $c$ defined
by $c(a)=0$, $c(b)=1$ has average length
$\overline L(c)=0.9\cdot1+0.1\cdot1=1$.
Obviously we cannot find a better code. The combinations of two letters
now have the following probabilities:
\[P^2(aa)=0.81,\quad P^2(ab)=0.09,\quad P^2(ba)=0.09, \quad P^2(bb)=0.01 \koz .\]
The prefix code $c^2$ defined by
\[c^2(aa)=0,\quad c^2(ab)=10,\quad c^2(ba)=110,\quad c^2(bb)=111\]
has average length
$\overline L(c^2)=1\cdot0.81+2\cdot0.09+3\cdot0.09+3\cdot0.01=1.29$. So
$\frac12\overline L(c^2)=0.645$ could be interpreted as the average length
the code $c^2$ requires per letter of the alphabet ${\cal X}$.
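These numbers are easy to check mechanically; here is a small sketch in Python, using exact rational arithmetic to avoid rounding issues:

```python
from fractions import Fraction

# Block code c^2 for pairs over {a, b} with P(a) = 9/10, P(b) = 1/10.
P2 = {'aa': Fraction(81, 100), 'ab': Fraction(9, 100),
      'ba': Fraction(9, 100), 'bb': Fraction(1, 100)}
c2 = {'aa': '0', 'ab': '10', 'ba': '110', 'bb': '111'}

# Prefix property: no codeword is the beginning of another one
# (after sorting, a prefix would sit directly before an extension of it).
words = sorted(c2.values())
assert all(not b.startswith(a) for a, b in zip(words, words[1:]))

kraft = sum(Fraction(1, 2 ** len(w)) for w in c2.values())

# Average length per pair of letters, and per single letter.
avg_pair = sum(P2[x] * len(c2[x]) for x in c2)
avg_per_letter = avg_pair / 2
```

Here the sum `kraft` (this quantity is studied in the next subsection) equals $1$, the average length per pair is $1.29$, and per letter $0.645$, as computed above.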
When we encode blocks of $n$ letters we are interested
in the behaviour of
\[ L(n,P)=\min_{c\ {\rm UDC}}\ \frac1n\
\sum_{(x_1 \dots x_n)\in{\cal X}^n}
P^n(x_1 \dots x_n)L(x_1 \dots x_n) \koz .
\]
It follows from the Noiseless Coding Theorem, which is stated in the next
section, that $\lim_{n\longrightarrow\infty} L(n,P)=H(P)$, the entropy of the
source $({\cal X},P)$.
In our example for the English language we have $H(P)\approx4.19$. So the
code presented above, where only single letters are encoded, is already
nearly optimal with respect to $L(n,P)$.
Further compression is possible, if we consider the
dependencies between the letters.
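For the binary source above, the convergence of $L(n,P)$ towards the entropy $H(P)\approx 0.469$ can also be watched numerically. The following sketch computes the average length per letter of an optimal (Huffman) code for block lengths $n=1,2,3$; the Huffman construction is used only as a black box here:

```python
import heapq
from itertools import product
from math import log2

def huffman_avg_length(probs):
    """Average codeword length of a Huffman code for the given
    probabilities: every merge of two subtrees of total weight w
    contributes w to the average length."""
    heap = list(probs)
    heapq.heapify(heap)
    total = 0.0
    while len(heap) > 1:
        a, b = heapq.heappop(heap), heapq.heappop(heap)
        total += a + b
        heapq.heappush(heap, a + b)
    return total

P = {'a': 0.9, 'b': 0.1}
H = -sum(p * log2(p) for p in P.values())   # about 0.469

L = {}
for n in (1, 2, 3):
    # block probabilities of the memoryless source: products of letter probabilities
    block_probs = []
    for block in product(P, repeat=n):
        p = 1.0
        for letter in block:
            p *= P[letter]
        block_probs.append(p)
    L[n] = huffman_avg_length(block_probs) / n
```

This yields the per-letter lengths $1.0$, $0.645$ and about $0.533$, approaching the entropy from above.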
\subsection{Kraft's inequality and noiseless coding theorem}
\inddef{Kraft's inequality} \inddef{Noiseless Coding Theorem}
We shall now introduce a necessary and sufficient condition for the existence
of a prefix code with prescribed word lengths
$L(1),\dots,L(m)$.
\begin{tetel}[Kraft's inequality]
Let ${\cal X}=\{1,\dots,m\}$. A uniquely decipherable code
$c:{\cal X}\longrightarrow\{0,1\}^*$ with
word lengths $L(1),\dots,L(m)$ exists, if and only if
\[ \sum\limits_{x\in{\cal X}} 2^{-L(x)}\leq 1 \koz .\]
\end{tetel}
\begin{biz}
The central idea is to interpret the codewords as nodes of a rooted binary
tree with depth
$T=\max_{x\in{\cal X}}\{L(x)\}$. The tree is required to be
complete (every path from the root to a leaf has length $T$) and regular
(every inner node has outdegree 2).
The example in Figure \ref{datac:tree}
for $T=3$ may serve as an illustration.
\begin{figure}
\setlength{\unitlength}{0.02cm}%
\begin{center}
\begin{picture}(245,310)(80,540)
\thicklines
\put( 80,680){\vector( 1, 1){ 80}}
\put( 80,680){\vector( 1,-1){ 80}}
\put(175,765){\vector( 2, 1){ 66}}
\put(175,765){\vector( 2,-1){ 66}}
\put(175,599){\vector( 2,-1){ 66}}
\put(175,599){\vector( 2, 1){ 66}}
\put(260,805){\vector( 3, 1){ 60}}
\put(260,805){\vector( 3,-1){ 60}}
\put(260,730){\vector( 3, 1){ 60}}
\put(260,635){\vector( 3, 1){ 60}}
\put(260,730){\vector( 3,-1){ 60}}
\put(260,635){\vector( 3,-1){ 60}}
\put(260,565){\vector( 3,-1){ 60}}
\put(260,565){\vector( 3, 1){ 60}}
\put(165,760){\makebox(0,0)[lb]{$0$}}
\put(165,595){\makebox(0,0)[lb]{$1$}}
\put(245,800){\makebox(0,0)[lb]{$00$}}
\put(245,725){\makebox(0,0)[lb]{$01$}}
\put(245,630){\makebox(0,0)[lb]{$10$}}
\put(245,560){\makebox(0,0)[lb]{$11$}}
\put(325,825){\makebox(0,0)[lb]{$000$}}
\put(325,780){\makebox(0,0)[lb]{$001$}}
\put(325,750){\makebox(0,0)[lb]{$010$}}
\put(325,705){\makebox(0,0)[lb]{$011$}}
\put(325,655){\makebox(0,0)[lb]{$100$}}
\put(325,610){\makebox(0,0)[lb]{$101$}}
\put(325,585){\makebox(0,0)[lb]{$110$}}
\put(325,540){\makebox(0,0)[lb]{$111$}}
\end{picture}
\caption{Example of a code tree.}
\label{datac:tree}
\end{center}
\end{figure}
So the nodes with distance $n$ from the root are labeled with the words
$x^n\in\{0,1\}^n$. The upper successor of $x_1x_2\dots x_n$ is labeled
$x_1x_2\dots x_n 0$, its lower successor is labeled $x_1x_2\dots x_n 1.$
The \ki{shadow}\index{shadow} of a node labeled by $x_1x_2\dots x_n$ is the set of all
the leaves which are labeled by a word (of length $T)$ beginning with
$x_1x_2\dots x_n$. In other words, the shadow of $x_1\dots x_n$ consists
of the leaves labeled by a sequence with prefix $x_1\dots x_n$. In our example
$\{000,001,010,011\}$ is the shadow of the node labeled by $0$.
Now suppose we are given positive integers $L(1),\dots,L(m)$. We
further assume that $L(1)\leq L(2)\leq\dots\leq L(m)$.
As first codeword $c(1)=\underbrace{00\dots0}_{L(1)}$ is chosen. Since
$\sum\limits_{x\in{\cal X}}2^{T-L(x)}\leq 2^T$, we
have $2^{T-L(1)}<2^T$ (otherwise
only one letter has to be encoded). Hence there are left some nodes on the
$T\mbox{-th}$ level, which are not in the shadow of $c(1)$. We pick the
first of these remaining nodes and go back $T-L(2)$ steps in the direction
of the root. Since $L(2)\geq L(1)$ we shall find a node labeled by a sequence
of $L(2)$
bits, which is not a prefix of $c(1)$. So we can choose this sequence as
$c(2)$. Now again, either $m=2$, and we are ready, or by the hypothesis
$2^{T-L(1)}+2^{T-L(2)}<2^T$ and we can find a node on the $T\mbox{-th}$
level, which is not contained in the shadows of $c(1)$ and $c(2)$. We find
the next codeword as shown above. The process can be continued until all
codewords are assigned.
Conversely, observe that $\sum\limits_{x\in{\cal X}}2^{-L(x)}=
\sum\limits_{j=1}^Tw_j\ 2^{-j}$, where
$w_j$ is the number of codewords with length $j$ in the uniquely decipherable
code and $T$ again denotes the maximal word length.
The $s\mbox{-th}$ power of this term can be expanded as
\[ \left(\sum\limits_{j=1}^T\ w_j\ 2^{-j}\right)^s=
\sum\limits_{k=s}^{T\cdot s}\ N_k\ 2^{-k} \koz .\]
Here $N_k=\sum\limits_{i_1+\dots+i_s=k}w_{i_1}\dots w_{i_s}$ is the
total number of messages whose coded representation is of length $k.$
Since the code is uniquely decipherable, to every sequence of $k$ bits
there corresponds at most one possible message. Hence
$N_k\leq 2^k$ and $\sum\limits_{k=s}^{T\cdot s}N_k\ 2^{-k}\leq
\sum\limits_{k=s}^{T\cdot
s}1=T\cdot s-s+1\leq T\cdot s$.
\smallskip
Taking the $s\mbox{-th}$ root, this yields
$\sum\limits_{j=1}^Tw_j\ 2^{-j}\leq(T\cdot
s)^{\frac1s}$.
Since this inequality holds for any $s$ and $\lim\limits_{s\longrightarrow\infty}
(T\cdot s)^{\frac1s}=1$, we have the desired result
\[ \sum\limits_{j=1}^T w_j\ 2^{-j}=\sum\limits_{x\in{\cal X}}2^{-L(x)}\leq 1 \koz .
\]
\end{biz}
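The constructive direction of the proof translates into a short program: process the lengths in increasing order and always take the leftmost node of the required depth that is not in the shadow of a codeword chosen so far. A sketch:

```python
def kraft_code(lengths):
    """Construct a prefix code with the prescribed codeword lengths,
    assuming they satisfy Kraft's inequality."""
    assert sum(2 ** -l for l in lengths) <= 1, "Kraft's inequality violated"
    code = []
    node, depth = 0, 0       # leftmost free node, as an integer at `depth`
    for l in sorted(lengths):
        node <<= l - depth   # descend to depth l along 0-edges
        depth = l
        code.append(format(node, '0{}b'.format(l)))
        node += 1            # step just past the shadow of this codeword
    return code
```

For the lengths $1,2,3,3$ this produces the codewords $0, 10, 110, 111$.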
\begin{tetel}[Noiseless Coding Theorem]
For a source $({\cal X},P)$, ${\cal X}=\{1,\dots,m\}$,
it is always possible to find a uniquely decipherable
code $c:{\cal X}\longrightarrow \{0,1\}^*$ with average length
$$ H(P)\leq \overline{L}(c)< H(P)+1 \koz .$$
\end{tetel}
A code of minimal average length must assign shorter codewords to more
probable letters. Indeed, if $P(x)\geq P(y)$ and $L(x)>L(y)$, then the
code $d$ obtained by interchanging the codewords
$c(x)$ and $c(y)$ has average length $\overline L(d)\leq\overline L(c)$, since
\hspace*{-18pt}
\begin{tabular}{ll}
$\overline L(d)-\overline L(c)$&$=P(x)\cdot L(y)+P(y)\cdot
L(x)-P(x)\cdot L(x)-P(y)\cdot L(y)$\\
&$=(P(x)-P(y))\cdot(L(y)-L(x))\leq0 \koz .$
\end{tabular}
\begin{gyak}
\ujgyak
Compute the entropy\gyakindex{entropy} of the source $({\cal X}, P)$, with ${\cal X} = \{1,2\}$
and $P=(0.8, 0.2)$.
\ujgyak
Find the Huffman-codes and the Shannon-Fano-codes
for the sources $({\cal X}^n, P^n)$ with
$({\cal X}, P)$ as in the previous exercise for $n=1,2,3$
and calculate their average codeword lengths.
\ujgyak
Show that always $0 \leq H(P) \leq \lg |{\cal X}|.$
\ujgyak
Show that the redundancy $\rho (c) = \overline{L}(c) - H(P)$
of a prefix code $c$ for a source with probability distribution
$P$ can be expressed as a special I--divergence.
\ujgyak
Show that the I-divergence $D(P||Q)\geq 0$ for all probability distributions
$P$ and $Q$ over some alphabet ${\cal X}$
with equality exactly if $P=Q$ but that the I-divergence is not a metric.
\end{gyak}
\vspace{-2mm}
\section{Arithmetic coding and modelling}
\vspace{-1mm}
In statistical coding techniques
such as Shannon-Fano or Huffman coding the probability distribution of
the source is modelled as accurately as possible and then the words are
encoded such that a higher probability results in a shorter codeword length.
We know that Huffman-codes are optimal with respect to the average codeword
length. However, the entropy is approached by increasing
the block length. On the other hand, for long
blocks of source symbols, Huffman-coding is a rather complex
procedure, since it requires the calculation of the probabilities
of all sequences of the given block length and the construction
of the corresponding complete code.
For compression techniques based on statistical methods
often \ki{arithmetic coding} \inddef{arithmetic coding} is preferred.
Arithmetic coding is a straightforward extension of the Shannon-Fano-Elias-code.
The idea is to represent a probability by an interval.
In order to do so, the probabilities
have to be calculated very accurately. This process is denoted as
\ki{modelling} of the source \inddef{modelling of a source}.
So statistical compression techniques
consist of two stages: modelling and coding. As just mentioned,
coding is usually done by arithmetic coding. The different algorithms
like, for instance, DMC (Dynamic Markov Coding) and PPM
(Prediction by Partial Matching)
vary in the way of modelling the source.
We are going to present the context-tree weighting
method, a transparent algorithm for the estimation
of block probabilities due to Willems, Shtarkov, and Tjalkens,
which also allows a straightforward analysis of the
efficiency.
\nevindex{Willems, Frans M. J.} \nevindex{Shtarkov, Yuri M.}
\nevindex{Tjalkens, Tjalling J.}
\subsection{Arithmetic coding}
The idea behind arithmetic coding is to represent
a message $x^n = (x_1 \dots x_n)$ by the interval
$I(x^n)=[Q^n(x^n), Q^n(x^n)+ P^n(x^n))$,
where $Q^n(x^n) = \sum_{y^n < x^n} P^n(y^n)$ is
the sum of the probabilities of those sequences which are smaller
than $x^n$ in lexicographic order.
A codeword $c(x^n)$ assigned to message $x^n$ also corresponds to
an interval. Namely, we identify codeword $c=c(x^n)$ of length
$L=L(x^n)$ with interval $J(c) = [bin(c)\cdot 2^{-L}, bin(c)\cdot 2^{-L}+2^{-L})$,
where $bin(c)$ denotes the integer whose binary representation is $c$, so that
$bin(c)\cdot 2^{-L}$ is the value of $c$ read as a binary fraction.
The special choice of codeword $c(x^n)$ will be obtained from
$P^n(x^n)$ and $Q^n(x^n)$ as follows:
$$L(x^n) = \lceil \lg \frac{1}{P^n(x^n)} \rceil +1, \qquad
bin(c) = \lceil Q^n(x^n)\cdot 2^{L(x^n)} \rceil \koz .$$
So message $x^n$ is encoded by a codeword $c(x^n)$, whose interval
$J(x^n)$ is inside interval $I(x^n)$.
Let us illustrate arithmetic coding by the following example
of a discrete memoryless source with $P(1) = 0.1$ and $n=2$.
$$
\begin{array}{r|r|r|r|r}
x^n & P^n(x^n) & Q^n(x^n) & L(x^n) & c(x^n) \\
\hline
00 & 0.81 & 0.00 & 2 & 00 \\
01 & 0.09 & 0.81 & 5 & 11010 \\
10 & 0.09 & 0.90 & 5 & 11101 \\
11 & 0.01 & 0.99 & 8 & 11111110 \koz . \\
\end{array}
$$
At first glance it may seem that this code is much worse than the
Huffman code for the same source with codeword lengths
($1,2,3,3$) we found previously. On the other hand, it can be shown
that arithmetic coding always achieves an average codeword length
$\overline{L}(c) < H(P^n)+2$, which is only two bits apart from the
lower bound in the noiseless coding theorem.
Huffman coding would usually yield an even better code. However,
this ``negligible'' loss in compression rate is compensated by
several advantages. The codeword is directly computed from the source
sequence, which means that we do not have to store the code as in the case
of Huffman coding.
Further, the relevant source models
allow us to easily compute $P^n(x_1 x_2 \dots x_{n-1} x_n)$ and
$Q^n(x_1 x_2 \dots x_{n-1} x_n)$ from $P^{n-1}(x_1 x_2 \dots x_{n-1})$,
usually by multiplication by $P(x_n)$.
This means that the sequence to be encoded can be parsed sequentially
letter by letter, unlike in Huffman coding, where we would have to encode
blockwise.
\subsubsection{Encoding.} The basic algorithm for encoding a sequence
$(x_1 \dots x_n)$ by arithmetic coding works as follows.
We assume that $P^n(x_1 \dots x_n)= P_1(x_1)\cdot
P_2(x_2) \cdots P_n(x_n)$,
(in the case $P_i=P$ for all $i$ the discrete memoryless source arises,
but in the section on modelling more complicated formulae come into play)
and hence $Q_i(x_i)= \sum _{y < x_i} P_i(y)$.
\begin{alg}{Arithmetic-Encoder$(x)$}\inddef{\textsc{Arithmetic-Encoder}}
1 \' $B \leftarrow 0$ \\
2 \' $A \leftarrow 1$ \\
3 \' \key{for} \= $i \leftarrow 1$ \key{to} $n$ \\
4 \' \> \key{do} \= $B \leftarrow B + A \cdot Q_i(x[i])$ \\
5 \' \> \> $A \leftarrow A \cdot P_i(x[i])$ \\
6 \' $L \leftarrow \lceil \lg \frac1A \rceil +1$ \\
7 \' $c \leftarrow \lceil B \cdot 2^L \rceil$ \\
8 \' \key{return} $c$
\end{alg}
We shall illustrate the encoding procedure by the following example
from the literature. Let the discrete memoryless
source $({\cal X}, P)$ be given with
ternary alphabet ${\cal X} = \{1,2,3\}$ and $P(1) = 0.4$, $P(2) =0.5$,
$P(3)=0.1$. The sequence $x^4 = (2,2,2,3)$ has to be encoded.
Observe that $P_i=P$ and $Q_i=Q$ for all $i=1, 2,3,4$. Further
$Q(1)=0$, $Q(2)= P(1)=0.4$, and $Q(3)=P(1)+P(2)=0.9$.
The above algorithm yields
\[
\begin{array}{rrr}
i & B_i & A_i \\
\hline
0 & 0 & 1 \\
1 & B_0+A_0 \cdot Q(2)=0.4 & A_0 \cdot P(2) = 0.5 \\
2 & B_1+A_1 \cdot Q(2)=0.6 & A_1 \cdot P(2) = 0.25 \\
3 & B_2+A_2 \cdot Q(2)=0.7 & A_2 \cdot P(2) = 0.125 \\
4 & B_3+A_3 \cdot Q(3)=0.8125 & A_3 \cdot P(3) = 0.0125 \koz .\\
\end{array}
\]
Hence $Q(2,2,2,3)=B_4=0.8125$ and $P(2,2,2,3)=A_4=0.0125$. From this we
calculate $L= \lceil \lg \frac1A \rceil +1= 8$ and finally
$\lceil B \cdot 2^L \rceil=
\lceil 0.8125 \cdot 256 \rceil = 208$, whose binary representation
is the codeword $c(2,2,2,3)=11010000$.
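The computation above can be reproduced exactly if the encoder is written with rational arithmetic; with binary floating point, a rounding error in the last place could already change the value of $\lceil B\cdot 2^L\rceil$. A sketch in Python:

```python
from fractions import Fraction
from math import ceil, log2

def arithmetic_encode(x, P, Q):
    """Basic arithmetic encoder; returns the codeword as a bit string
    together with L, B and A."""
    B, A = Fraction(0), Fraction(1)
    for letter in x:
        B += A * Q[letter]      # lower interval bound
        A *= P[letter]          # interval width
    L = ceil(log2(1 / A)) + 1
    c = ceil(B * 2 ** L)
    return format(c, '0{}b'.format(L)), L, B, A

P = {1: Fraction(2, 5), 2: Fraction(1, 2), 3: Fraction(1, 10)}
Q = {1: Fraction(0), 2: Fraction(2, 5), 3: Fraction(9, 10)}

bits, L, B, A = arithmetic_encode((2, 2, 2, 3), P, Q)
# bits = '11010000', L = 8, B = 13/16 = 0.8125, A = 1/80 = 0.0125
```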
\subsubsection{Decoding.} Decoding is very similar to encoding. The decoder recursively
``undoes'' the encoder's recursion. We divide the interval $[0,1)$
into subintervals with bounds defined by the $Q_i$, with the convention
$Q_i(m+1)=1$. Then we
find the interval in which codeword $c$ can be found. This interval
determines the next symbol. Then we subtract $Q_i(x_i)$ and rescale
by multiplication by $\frac1{P_i(x_i)}$.
\begin{alg}{Arithmetic-Decoder$(c)$}\inddef{\textsc{Arithmetic-Decoder}}
1 \' \key{for} \= $i \leftarrow 1$ \key{to} $n$ \\
2 \' \> \key{do} \= $j \leftarrow 1$ \\
3 \' \> \> \key{while} \= $(c \geq Q_i(j))$ \\
4 \' \> \> \> \key{do} \= $j \leftarrow j+1$ \\
5 \' \> \> $x[i] \leftarrow j-1$ \\
6 \' \> \> $c \leftarrow (c-Q_i(x[i]))/ P_i(x[i])$ \\
7 \' \key{return} $x$
\end{alg}
Observe that when the decoder only receives codeword $c$
he does not know when the decoding procedure terminates.
For instance $c=0$ can be the codeword for $x^1=(1)$,
$x^2=(1,1)$, $x^3=(1,1,1)$, etc.
In the above pseudocode it is implicit that the number $n$ of symbols has
also been transmitted to the decoder, in which case it is clear
what the last letter to be encoded was. Another possibility would be
to provide a special end-of-file (EOF)-symbol with a small probability,
which is known to both the encoder and the decoder. When the decoder
sees this symbol, he stops decoding. In this case line 1 would be replaced
by
\smallskip
1 \key{while} ($x[i] \not=$ EOF)
\smallskip
(and $i$ would have to be increased).
In our above example, the decoder would receive the codeword $11010000$,
the binary expansion of $0.8125$ up to $L=8$ bits. This number falls in
the interval $[0.4,0.9)$ which belongs to the letter $2$, hence
the first letter $x_1=2$. Then he calculates
$(0.8125 - Q(2))\frac1{P(2)}=(0.8125-0.4)\cdot 2=0.825$. Again this number
is in the interval $[0.4,0.9)$ and the second letter is $x_2=2$.
In order to determine $x_3$ the calculation
$(0.825 - Q(2))\frac1{P(2)}=(0.825-0.4)\cdot 2=0.85$ must be performed. Again
$0.85 \in [0.4,0.9)$ such that also $x_3=2$. Finally
$(0.85 - Q(2))\frac1{P(2)}=(0.85-0.4)\cdot 2=0.9$. Since
$0.9 \in [0.9,1)$, the last letter of the sequence must be $x_4=3$.
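The decoding steps can likewise be carried out in exact arithmetic. In each step the decoder picks the largest letter $j$ with $Q(j)\leq c$; a sketch:

```python
from fractions import Fraction

def arithmetic_decode(c, n, P, Q):
    """Basic arithmetic decoder: recovers n letters from the value c."""
    x = []
    for _ in range(n):
        # the interval [Q(j), Q(j) + P(j)) containing c gives the letter
        j = max(a for a in Q if Q[a] <= c)
        x.append(j)
        c = (c - Q[j]) / P[j]   # rescale back to [0, 1)
    return x

P = {1: Fraction(2, 5), 2: Fraction(1, 2), 3: Fraction(1, 10)}
Q = {1: Fraction(0), 2: Fraction(2, 5), 3: Fraction(9, 10)}

# the received codeword 11010000 represents 208/2^8 = 0.8125
x = arithmetic_decode(Fraction(208, 256), 4, P, Q)
# x = [2, 2, 2, 3]
```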
\subsubsection{Correctness.} Recall that message $x^n$ is encoded by a codeword $c(x^n)$,
whose interval $J(x^n)$ is inside interval $I(x^n)$.
This follows from
$ \lceil Q^n(x^n)\cdot 2^{L(x^n)} \rceil 2^{-L(x^n)} + 2^{-L(x^n)}
< Q^n(x^n) + 2^{1-L(x^n)} = Q^n(x^n) + 2^{-\lceil \lg \frac{1}{P^n(x^n)} \rceil}
\leq Q^n(x^n) + P^n(x^n)$.
Obviously a prefix code is obtained, since a codeword can only be
a prefix of another one, if their corresponding intervals overlap --
and the intervals $J(c(x^n)) \subset I(x^n)$ are obviously disjoint
for different messages $x^n$.
Further, we mentioned already that arithmetic coding compresses
down to the entropy up to two bits. This is because for every
sequence $x^n$ we have $L(x^n) < \lg \frac{1}{P^n(x^n)}+2$.
It can also be shown that the additional transmission of block length
$n$ or the introduction of the EOF symbol only results in
a negligible loss of compression.
However, the basic algorithms we presented are not useful in order
to compress longer files, since with increasing block length $n$
the intervals are getting smaller and smaller, such that rounding errors
will be unavoidable. We shall present a technique to overcome
this problem in the following.
\subsubsection{Analysis.} The basic algorithm for arithmetic coding is linear in the length $n$
of the sequence to be encoded. Usually, arithmetic coding is compared
to Huffman coding. In contrast to Huffman coding, we do not have
to store the whole code, but can obtain the codeword directly
from the corresponding interval. However, for a discrete memoryless
source, where the probability distribution $P_i=P$ is the same
for all letters, this is not such a big advantage, since the
Huffman code will be the same for all letters (or blocks of $k$ letters)
and hence has to be computed only once. Huffman coding, on the other hand,
does not use any multiplications which slow down arithmetic coding.
For the adaptive case, in which the $P_i$'s may change for different
letters $x_i$ to be encoded, a new Huffman code would have to be
calculated for each new letter. In this case, usually arithmetic
coding is preferred. We shall investigate such situations in the
section on modelling.
For implementations in
practice floating point arithmetic is avoided. Instead, the subdivision
of the interval $[0,1)$ is represented by
a subdivision of the integer range $0, \dots ,M$, say, with proportions
according to the source probabilities. Now integer arithmetic
can be applied, which is faster and more precise.
\subsubsection{Precision problem.} In the basic algorithms for arithmetic encoding and decoding
the shrinking of the current interval would require
the use of high precision arithmetic for longer sequences.
Further, no bit of the codeword is produced until the complete
sequence $x^n$ has been read in.
This can be overcome by coding each bit as soon as it is known
and then doubling the length of the current interval $[LO,HI)$, say,
so that this expansion represents only
the unknown part of the interval.
This is the case when the leading bits of the lower and upper bound are the
same, i. e. the interval is completely contained either in $[0,\frac12)$
or in $[\frac12,1)$.
The following expansion rules guarantee that the current interval
does not become too small.
\smallskip
{\bf Case 1} ($[LO,HI) \subseteq [0,\frac12)$): $LO \leftarrow 2\cdot LO$,
$HI \leftarrow 2 \cdot HI$ \koz.
\smallskip
{\bf Case 2} ($[LO,HI) \subseteq [\frac12,1)$): $LO \leftarrow 2\cdot LO-1$,
$HI \leftarrow 2 \cdot HI-1$ \koz.
\smallskip
{\bf Case 3} ($\frac{1}{4}\leq LO < \frac{1}{2} \leq HI < \frac{3}{4}$):
$LO \leftarrow 2\cdot LO - \frac12$,
$HI \leftarrow 2\cdot HI - \frac12$ \koz.
\smallskip
The last case called \ki{underflow} (or follow)
prevents the interval from shrinking too much when the
bounds are close to $\frac{1}{2}$.
Observe that if the current interval
is contained in $[\frac14,\frac34)$ with $LO<\frac12 \leq HI$,
we do not know the next output bit, but we do know that whatever it
is, the following bit will have the opposite value. However, in contrast
to the other cases we cannot continue encoding here, but have to wait
(remain in the underflow state and adjust a counter $underflowcount$
to the number of subsequent underflows, i. e. $underflowcount \leftarrow
underflowcount+1$)
until the current interval falls into either $[0,\frac12)$ or
$[\frac12,1)$. In this case we encode the leading bit of this interval
-- $0$ for $[0,\frac12)$ and $1$ for $[\frac12,1)$ -- followed by
$underflowcount$ many inverse bits and then set $underflowcount=0$.
The procedure stops, when all letters are read in and the current interval
does not allow any further expansion.
\begin{alg}{Arithmetic-Precision-Encoder$(x)$}\inddef{\textsc{Arithmetic-Precision-Encoder}}
1 \' $LO \leftarrow 0$ \\
2 \' $HI \leftarrow 1$ \\
3 \' $A \leftarrow 1$ \\
4 \' \textit{underflowcount} $\leftarrow 0$ \\
5 \' \key{for} \= $i \leftarrow 1$ \key{to} $n$ \\
6 \' \> \key{do} \= $LO \leftarrow LO+Q_i(x[i])\cdot A$ \\
7 \' \> \> $A \leftarrow P_i(x[i])$ \\
8 \' \> \> $HI \leftarrow LO+A$ \\
9 \' \> \> \key{while} \= $HI-LO<\frac12$ AND NOT ($LO<\frac14$ AND $HI \geq \frac12$)\\
10 \' \> \> \> \key{do} \key{if} \= $HI<\frac12$ \\
11 \' \> \> \> \> \key{then} \=
$c \leftarrow c|| 0, \textit{underflowcount}$ many $1$s \\
12 \' \> \> \> \> \> \textit{underflowcount}
$\leftarrow 0$ \\
13 \' \> \> \> \> \> \textit{LO}
$\leftarrow 2 \cdot$ \textit{LO} \\
14 \' \> \> \> \> \> \textit{HI} $\leftarrow 2
\cdot$ \textit{HI} \\
15 \' \> \> \> \> \key{else} \key{if} \= $LO \geq \frac12$ \\
16 \' \> \> \> \> \> \key{then} \=
$c \leftarrow c||1,$ \textit{underflowcount} many 0s \\
17 \' \> \> \> \> \> \>
\textit{underflowcount} $\leftarrow 0$ \\
18 \' \> \> \> \> \> \>
$LO \leftarrow 2 \cdot LO-1$ \\
19 \' \> \> \> \> \> \>
$HI \leftarrow 2 \cdot HI-1$ \\
20 \' \> \> \> \> \> \key{else if} \=
$LO \geq \frac14$ AND $HI < \frac34$ \\
21 \' \> \> \> \> \> \>
\> \key{then} \= \textit{underflowcount}
$\leftarrow$ \textit{underflowcount} $+1$ \\
22 \' \> \> \> \> \> $LO \leftarrow 2 \cdot LO- \frac12$ \\
23 \' \> \> \> \> \> $HI \leftarrow 2 \cdot HI- \frac12$ \\
24 \' \key{if} \= \textit{underflowcount} $> 0$ \\
25 \' \> \key{then} $c \leftarrow c||0,$ \textit{underflowcount} many 1s \\
26 \' \key{return} $c$
\end{alg}
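The pseudocode above can be sketched as a short Python program. This is an illustrative reimplementation (names are ours), working with exact fractions to avoid rounding problems. It returns the bits produced by the expansion loop together with the final interval; the remaining bit(s) of the codeword -- in the example the final $0$ of $1101000$ -- then identify a point in this interval.

```python
from fractions import Fraction as F

def arithmetic_precision_encode(msg, probs):
    """Arithmetic encoding with interval expansion.

    probs maps each letter to its probability; the cumulative
    probabilities Q are derived from it.  Returns the bits emitted by
    the expansion loop and the final interval [lo, hi); a complete
    coder would append termination bits identifying this interval.
    """
    # cumulative probabilities: Q[x] = sum of probabilities of smaller letters
    q, acc = {}, F(0)
    for letter in sorted(probs):
        q[letter] = acc
        acc += probs[letter]
    lo, hi = F(0), F(1)
    half, quarter = F(1, 2), F(1, 4)
    underflowcount, out = 0, []
    for x in msg:
        width = hi - lo                       # current interval length A
        lo = lo + q[x] * width                # subdivide according to Q and P
        hi = lo + probs[x] * width
        while hi - lo < half and not (lo < quarter and hi >= half):
            if hi < half:                     # Case 1: emit 0 + pending 1s
                out.append('0' + '1' * underflowcount)
                underflowcount = 0
                lo, hi = 2 * lo, 2 * hi
            elif lo >= half:                  # Case 2: emit 1 + pending 0s
                out.append('1' + '0' * underflowcount)
                underflowcount = 0
                lo, hi = 2 * lo - 1, 2 * hi - 1
            else:                             # Case 3: underflow, wait
                underflowcount += 1
                lo, hi = 2 * lo - half, 2 * hi - half
    if underflowcount > 0:                    # flush pending underflow bits
        out.append('0' + '1' * underflowcount)
    return ''.join(out), (lo, hi)
```

Running it on the example message $(2,2,2,3)$ with $P=(0.4,0.5,0.1)$ reproduces the bits $110100$ and the final interval $[0,0.8)$ of Figure \ref{datac:arith}.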
We shall illustrate the encoding algorithm
in Figure \ref{datac:arith} by our example --
the encoding of the message $(2,2,2,3)$ with alphabet ${\cal X} = \{1,2,3\}$
and probability distribution $P = (0.4, 0.5, 0.1)$.
An underflow occurs in the sixth row: we keep track of
the underflow state and later encode the inverse of the next bit, here
this inverse bit is the
$0$ in the ninth row. The encoded string is $1101000$.
\begin{figure}
\begin{center}
\begin{tabular}{r|l|ccc|c}
Current & & \multicolumn{3}{c|}{Subintervals} &\\
Interval & Action & 1 & 2 & 3 & Input\\
\hline
$[0.00,1.00)$ & subdivide & $[0.00,0.40)$ & $[0.40,0.90)$ & $[0.90,1.00)$
& 2\\
$[0.40,0.90)$ & subdivide & $[0.40,0.60)$ & $[0.60,0.85)$ & $[0.85,0.90)$
& 2\\
$[0.60,0.85)$ & encode $1$ & & &\\
& expand $[\frac{1}{2},1)$ & & &\\
$[0.20,0.70)$ & subdivide & $[0.20,0.40)$ & $[0.40,0.65)$ & $[0.65,0.70)$
& 2\\
$[0.40,0.65)$ & underflow & & &\\
& expand $[\frac{1}{4},\frac{3}{4})$ & & &\\ $[0.30,0.80)$ & subdivide &
$[0.30,0.50)$ & $[0.50,0.75)$ & $[0.75,0.80)$ & 3\\
$[0.75,0.80)$ & encode $10$ & & &\\
& expand $[\frac{1}{2},1)$ & & &\\
$[0.50,0.60)$ & encode $1$ & & &\\
& expand $[\frac{1}{2},1)$ & & &\\
$[0.00,0.20)$ & encode $0$ & & &\\
& expand $[0,\frac{1}{2})$ & & &\\
$[0.00,0.40)$ & encode $0$ & & &\\
& expand $[0,\frac{1}{2})$ & & &\\
$[0.00,0.80)$ & encode $0$ & & &
\end{tabular}
\end{center}
\caption{Example of arithmetic encoding with interval expansion.}
\label{datac:arith}
\end{figure}
Precision-decoding involves the consideration of a third variable besides
the interval bounds \textit{LO} and \textit{HI}.
\subsection{Modelling}
\subsubsection{Modelling of memoryless sources.}
In this section we shall only consider binary sequences $x^n \in
\{0,1\}^n$ to be compressed by an arithmetic coder. Further,
we write $P(x^n)$ for short
instead of $P^n(x^n)$ in order to allow further subscripts and
superscripts describing the particular situation.
$P_e$ will denote estimated probabilities,
$P_w$ weighted probabilities,
and $P^s$ probabilities assigned to a special context $s$.
The application of arithmetic coding is quite appropriate if the
probability distribution of the source is such that
$P(x_1 x_2 \dots x_{n-1} x_n)$ can easily be calculated from
$P(x_1 x_2 \dots x_{n-1})$. Obviously this is the case,
when the source is discrete and memoryless, since then
$P(x_1 x_2 \dots x_{n-1}x_n) = P(x_1 x_2 \dots x_{n-1}) \cdot P(x_n)$.
Even when the underlying parameter $\theta = P(1)$ of a binary, discrete
memoryless source is not known,
there is an efficient way due to Krichevsky \nevindex{Krichevsky, R. E.}
and Trofimov\nevindex{Trofimov, V. K.} to estimate the probabilities via
$$P(X_n=1|x^{n-1}) = \frac{b+\frac12}{a+b+1} \koz ,$$
where $a$ and $b$ denote the number of $0$s and $1$s, respectively,
in the sequence $x^{n-1}=(x_1 x_2 \dots x_{n-1})$. So given the sequence
$x^{n-1}$ with $a$ many $0$s and $b$ many $1$s, the probability
that the next letter $x_n$ will be a $1$ is estimated as
$\frac{b+\frac12}{a+b+1}$.
The estimated block probability of a sequence containing $a$ zeros and $b$
ones then is
$$P_e(a,b) = \frac{\frac12 \cdot \frac32 \cdots (a-\frac12) \cdot \frac12 \cdot \frac32 \cdots (b-\frac12)}{1\cdot 2 \cdots (a+b)}$$
with initial value $P_e(0,0)=1$. The values of the
\ki{Krichevsky-Trofimov-estimator}
\inddef{Krichevsky-Trofimov-estimator}
$P_e(a,b)$ for small $(a,b)$ are listed in Figure \ref{datac:kte}.
\begin{figure}
\begin{center}
\begin{tabular}{c|cccccc}
$a \backslash b$ & 0 & 1 & 2 & 3 & 4 & 5\\
\hline
0 & 1 & 1/2 & 3/8 & 5/16 & 35/128 & 63/256\\
1 & 1/2 & 1/8 & 1/16 & 5/128 & 7/256 & 21/1024\\
2 & 3/8 & 1/16 & 3/128 & 3/256 & 7/1024 & 9/2048\\
3 & 5/16 & 5/128 & 3/256 & 5/1024 & 5/2048 & 45/32768
\end{tabular}
\end{center}
\caption{Table of the first values for the Krichevsky-Trofimov-estimator.}
\label{datac:kte}
\end{figure}
\smallskip
Note that the summand $\frac12$ in the numerator guarantees that the
probability for the next letter to be a $1$ is positive even when the
symbol $1$ has not occurred in the sequence so far. In order to avoid
infinite codeword lengths, this phenomenon has to be taken into account
carefully in all approaches to estimating the probability of the next
letter when arithmetic coding is applied.
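A small Python sketch (an illustration, using exact fractions; function names are ours) shows both views of the estimator: the sequential rule $(b+\frac12)/(a+b+1)$ and the closed product formula. The two agree, since the estimate depends only on the counts $(a,b)$, and the values match Figure \ref{datac:kte}.

```python
from fractions import Fraction as F

def kt_sequential(bits):
    """Krichevsky-Trofimov estimate, built up letter by letter:
    a 0 contributes (a+1/2)/(a+b+1), a 1 contributes (b+1/2)/(a+b+1)."""
    p, a, b = F(1), 0, 0
    for x in bits:
        if x == '0':
            p *= (a + F(1, 2)) / (a + b + 1)
            a += 1
        else:
            p *= (b + F(1, 2)) / (a + b + 1)
            b += 1
    return p

def kt_estimate(a, b):
    """Closed form: (1/2 . 3/2 ... (a-1/2)) (1/2 . 3/2 ... (b-1/2)) / (a+b)!"""
    p = F(1)
    for i in range(a):
        p *= i + F(1, 2)
    for i in range(b):
        p *= i + F(1, 2)
    for k in range(1, a + b + 1):
        p /= k
    return p
```

For instance, `kt_estimate(3, 5)` yields $45/32768$, the last entry of Figure \ref{datac:kte}, and any ordering of three zeros and two ones gives the same sequential value `kt_estimate(3, 2)`.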
\subsubsection{Models with known context tree.}
In most situations the source is not memoryless, i. e., the dependencies
between the letters have to be considered. A suitable way to represent such
dependencies is the use of a suffix tree, which we denote as
\ki{context tree.}\inddef{context tree} The context of a symbol
$x_t$ is the suffix $s$ preceding $x_t$.
To each context (or leaf in the suffix tree) $s$
there corresponds a parameter $\theta_s=P(X_t=1|s)$, which is the probability
of the occurrence of a 1 when the last sequence of past source symbols
is equal to context $s$
(and hence $1 - \theta_s$ is the probability for a
$0$ in this case).
We are distinguishing here between the model (the suffix tree) and
the parameters ($\theta_s$).
\begin{pelda}
Let ${\cal S} = \{00, 10, 1\}$ and $ \theta_{00} = \frac12$, $\theta_{10} =
\frac13$, and $\theta_1 = \frac15$. The corresponding suffix tree
jointly with the parsing process for a special sequence can be
seen in Figure \ref{datac:src3}.
\end{pelda}
\begin{figure}[!t]
\begin{center}
\setlength{\unitlength}{0.00083300in}
\begin{picture}(2937,1797)(76,-1123)
\put(2401,539){\line( 2,-1){600}}
\put(3001,239){\line(-2,-1){600}}
\put(2401,-61){\line(-2, 1){600}}
\put(2401,-61){\line(-2,-1){600}}
\put(1651,239){\line(-1, 0){1125}}
\put(526,239){\vector( 0,-1){1050}}
\put(1651,-361){\line(-1, 0){975}}
\put(676,-361){\vector( 0,-1){450}}
\put(2251,539){\line(-1, 0){1425}}
\put(826,539){\vector( 0,-1){1350}}
\put(976,239){\vector( 0,-1){1050}}
\put(1126,-361){\vector( 0,-1){450}}
\put(1276,539){\vector( 0,-1){1350}}
\put(1426,539){\vector( 0,-1){1350}}
\put(481,-706){\line( 0,-1){375}}
\put(2416,554){\makebox(0,0)[lb]{$\theta_1$}}
\put(1821,264){\makebox(0,0)[lb]{$\theta_{10}$}}
\put(1831,-536){\makebox(0,0)[lb]{$\theta_{00}$}}
\put(376,-961){\makebox(0,0)[lb]{0}}
\put(1426,-961){\makebox(0,0)[lb]{1}}
\put(1276,-961){\makebox(0,0)[lb]{1}}
\put(1126,-961){\makebox(0,0)[lb]{1}}
\put(976,-961){\makebox(0,0)[lb]{0}}
\put(826,-961){\makebox(0,0)[lb]{0}}
\put(676,-961){\makebox(0,0)[lb]{1}}
\put(526,-961){\makebox(0,0)[lb]{0}}
\put( 76,-961){\makebox(0,0)[lb]{0}}
\put(226,-961){\makebox(0,0)[lb]{1}}
\end{picture}
\caption{An example for a tree source.}
\label{datac:src3}
\end{center}
\end{figure}
The actual probability of the sequence '0100111' given the past '\dots010' is
$P^s(0100111|\dots010) = (1 - \theta_{10}) \theta_{00} (1 - \theta_1) (1 -
\theta_{10}) \theta_{00} \theta_1 \theta_1
= \frac23 \cdot \frac12 \cdot \frac45 \cdot \frac23 \cdot \frac12
\cdot \frac15 \cdot \frac15 = \frac{4}{1125}$,
since the first letter $0$ is preceded by suffix $10$, the second letter $1$
is preceded by suffix $00$, etc.
Suppose the model ${\cal S}$ is known, but not the parameters $\theta_s$.
The problem now is to find a good coding distribution for this case.
The tree structure makes it easy to determine which
context precedes a particular symbol.
All symbols having the same context (or suffix) $s \in {\cal S}$ form
a memoryless source subsequence whose probability is determined
by the unknown parameter $\theta_s$. In our example these subsequences
are '$11$' for $\theta_{00}$, '$00$' for $\theta_{10}$ and '$011$'
for $\theta_{1}$.
One uses the Krichevsky-Trofimov-estimator for this case.
For each node $s$ in the suffix tree, we count the number $a_s$ of zeros
and the number $b_s$ of ones preceded by suffix $s$. For the
children $0s$ and $1s$ of parent node $s$ obviously $a_{0s} +
a_{1s} = a_s$ and $b_{0s} + b_{1s} = b_s$ must be satisfied.
In our example $(a_{\lambda}, b_{\lambda})=(3,4)$ for the root $\lambda$,
$(a_1,b_1)=(1,2)$, $(a_0,b_0) = (2,2)$ and $(a_{10}, b_{10})
= (2,0)$, $(a_{00}, b_{00}) = (0,2)$. Further $(a_{11}, b_{11})=(0,1)$,
$(a_{01}, b_{01})=(1,1)$, $(a_{111}, b_{111})=(0,0)$,
$(a_{011}, b_{011})=(0,1)$,
$(a_{101}, b_{101})=(0,0)$,$(a_{001}, b_{001})=(1,1)$,
$(a_{110}, b_{110})=(0,0)$, $(a_{010}, b_{010})=(2,0)$,
$(a_{100}, b_{100})=(0,2)$, and $(a_{000}, b_{000})=(0,0)$.
These last numbers are not relevant for our special source
${\cal S}$ but will be important later on, when the source model
or the corresponding suffix tree, respectively,
is not known in advance.
\begin{pelda}
Let ${\cal S} = \{00, 10, 1\}$ as in the previous example. Encoding a
subsequence is done by successively updating the corresponding counters
for $a_s$ and $b_s$. For example, when we encode the sequence '0100111'
given the past '\dots010' using the above suffix tree and
Krichevsky-Trofimov--estimator we obtain
$$
P_e^s(0100111|\dots010) = \frac{1}{2} \cdot \frac{1}{2} \cdot \frac{1}{2}
\cdot \frac{3}{4} \cdot \frac{3}{4} \cdot \frac{1}{4} \cdot \frac{1}{2} =
\frac{3}{8} \cdot \frac{3}{8} \cdot \frac{1}{16} = \frac{9}{1024} \koz ,
$$
where $ \frac{3}{8}$, $\frac{3}{8}$ and $ \frac{1}{16}$ are the probabilities
of the subsequences '11', '00' and '011' in the context of the leaves. These
subsequences are assumed to be memoryless.
\end{pelda}
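The two computations above can be checked with a short Python sketch (an illustration; names are ours). It splits the sequence into the per-context subsequences and multiplies either the true parameters $\theta_s$ or the Krichevsky-Trofimov factors.

```python
from fractions import Fraction as F

def split_by_context(past, seq, contexts):
    """The context of a symbol is the unique suffix s in `contexts`
    matching the symbols immediately preceding it."""
    history = past
    sub = {s: '' for s in contexts}
    for x in seq:
        for s in contexts:
            if history.endswith(s):
                sub[s] += x
                break
        history += x
    return sub

def kt(bits):
    """Krichevsky-Trofimov estimate of a memoryless 0-1 subsequence."""
    p, a, b = F(1), 0, 0
    for x in bits:
        p *= (F(1, 2) + (a if x == '0' else b)) / (a + b + 1)
        a, b = a + (x == '0'), b + (x == '1')
    return p

# the tree source of the example
contexts = ['00', '10', '1']
theta = {'00': F(1, 2), '10': F(1, 3), '1': F(1, 5)}

sub = split_by_context('010', '0100111', contexts)

# actual probability with known parameters theta_s
p_actual = F(1)
for s, bits in sub.items():
    for x in bits:
        p_actual *= theta[s] if x == '1' else 1 - theta[s]

# estimated probability with unknown parameters (KT estimator per leaf)
p_estimated = F(1)
for s, bits in sub.items():
    p_estimated *= kt(bits)
```

The parsing yields the memoryless subsequences '11', '00' and '011' from the text, and the two products recover the probabilities computed by hand.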
\subsubsection{The context-tree weighting method.}
Suppose we have a good coding distribution $P_1$ for source 1 and
another one, $P_2$, for source 2. We are looking for a good coding
distribution for both sources. One possibility is to compute $P_1$ and
$P_2$ and then 1 bit is needed to identify the best model which then will
be used to compress the sequence. This method is called selecting.
Another possibility is to employ the weighted distribution, which is
$$P_w(x^n) = \frac{P_1(x^n) + P_2(x^n)}{2} \koz .$$
We shall present now the
\ki{context-tree weighting algorithm.}
\inddef{context-tree weighting algorithm} Under the assumption that a
context tree is a full tree of depth $D$, only
$a_s$ and $b_s$, i. e. the number
of zeros and ones in the subsequence of bits preceded by context $s$,
are stored in each node $s$ of the context tree.
Further, to each node $s$ is assigned a weighted probability
$P_w^s$ which is recursively defined as
$$
P_w^s = \left \{ \begin{array}{ll}
\frac{P_e(a_s,b_s) + P_w^{0s}P_w^{1s}}{2} & \mbox{for } 0\leq L(s) < D \koz ,\\
P_e(a_s,b_s) & \mbox{for } L(s) = D \koz ,
\end{array} \right.
$$
where $L(s)$ describes the length of the (binary) string $s$
and $P_e(a_s,b_s)$ is the estimated probability using the
Krichevsky-Trofimov-estimator.
\begin{pelda}
After encoding the sequence '0100111' given the past
'\dots010' we obtain the
context tree of depth 3 in Figure~\ref{datac:ctw}.
The weighted probability
$P_w^\lambda = \frac{31}{8192}$ of the root node $\lambda$ finally yields
the coding probability corresponding to the parsed sequence.
\end{pelda}
\begin{figure}[!t]
\begin{center}
{\scriptsize \setlength{\unitlength}{0.00083300in}
\begin{picture}(2475,3435)(301,-2788)
\put(601,539){\line( 1, 0){600}}
\put(1201,539){\line( 0,-1){600}}
\put(1201,-61){\line(-1, 0){600}}
\put(601,-361){\line( 1, 0){600}}
\put(1201,-361){\line( 0,-1){600}}
\put(1201,-961){\line(-1, 0){600}}
\put(601,-1261){\line( 1, 0){600}}
\put(1201,-1261){\line( 0,-1){600}}
\put(1201,-1861){\line(-1, 0){600}}
\put(601,-2161){\line( 1, 0){600}}
\put(1201,-2161){\line( 0,-1){600}}
\put(1201,-2761){\line(-1, 0){600}}
\put(1201,239){\line( 1, 0){600}}
\put(1801,239){\line( 0,-1){900}}
\put(1801,-661){\line(-1, 0){600}}
\put(1201,-1561){\line( 1, 0){600}}
\put(1801,-1561){\line( 0,-1){900}}
\put(1801,-2461){\line(-1, 0){600}}
\put(1801,-211){\line( 1, 0){600}}
\put(2401,-211){\line( 0,-1){1800}}
\put(2401,-2011){\line(-1, 0){600}}
\put(2401,-1111){\line( 1, 0){300}}
\put(2701,-661){\vector( 0, 1){900}}
\put(2701,-1561){\vector( 0,-1){900}}
\put(301,539){\makebox(0,0)[lb]{(0,0)}}
\put(301,-61){\makebox(0,0)[lb]{(0,1)}}
\put(301,-361){\makebox(0,0)[lb]{(0,0)}}
\put(301,-961){\makebox(0,0)[lb]{(1,1)}}
\put(301,-1261){\makebox(0,0)[lb]{(0,0)}}
\put(301,-1861){\makebox(0,0)[lb]{(2,0)}}
\put(301,-2161){\makebox(0,0)[lb]{(0,2)}}
\put(301,-2761){\makebox(0,0)[lb]{(0,0)}}
\put(301,-2011){\makebox(0,0)[lb]{$P_w^{010}=3/8$}}
\put(1276,314){\makebox(0,0)[lb]{$P_w^{11}=1/2$}}
\put(301,-211){\makebox(0,0)[lb]{$P_w^{011}=1/2$}}
\put(301,-1111){\makebox(0,0)[lb]{$P_w^{001}=1/8$}}
\put(301,-2311){\makebox(0,0)[lb]{$P_w^{100}=3/8$}}
\put(1276,-811){\makebox(0,0)[lb]{$P_w^{01}=1/8$}}
\put(1276,-1486){\makebox(0,0)[lb]{$P_w^{10}=3/8$}}
\put(1276,-2611){\makebox(0,0)[lb]{$P_w^{00}=3/8$}}
\put(1876,-136){\makebox(0,0)[lb]{$P_w^{1}=1/16$}}
\put(2476,-1036){\makebox(0,0)[lb]{$P_w^{\lambda}=31/8192$}}
\put(1876,-2161){\makebox(0,0)[lb]{$P_w^{0}=21/256$}}
\put(2776,-286){\makebox(0,0)[lb]{1}}
\put(2776,-2011){\makebox(0,0)[lb]{0}}
\put(1426,-2011){\makebox(0,0)[lb]{(2,2)}}
\put(826,-1561){\makebox(0,0)[lb]{(2,0)}}
\put(826,-2461){\makebox(0,0)[lb]{(0,2)}}
\put(826,-661){\makebox(0,0)[lb]{(1,1)}}
\put(826,239){\makebox(0,0)[lb]{(0,1)}}
\put(1426,-211){\makebox(0,0)[lb]{(1,2)}}
\put(2026,-1111){\makebox(0,0)[lb]{(3,4)}}
\end{picture}}
\caption{Weighted context tree for source sequence '0100111' with past
\dots010. The pair $(a_s,b_s)$ denotes
$a_s$ zeros and $b_s$ ones preceded by the corresponding context $s$.
For the contexts $s=111,101,110,000$ it is $P_w^s=P_e(0,0)=1$.}
\label{datac:ctw}
\end{center}
\end{figure}
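The recursion for $P_w^s$ can be traced with a short Python sketch (exact fractions; names are ours). For simplicity it recomputes the whole tree from the counts instead of performing the incremental update, which suffices for checking the node values of Figure~\ref{datac:ctw}.

```python
from fractions import Fraction as F

def kt(a, b):
    """Krichevsky-Trofimov estimate P_e(a,b)."""
    p = F(1)
    for i in range(a):
        p *= i + F(1, 2)
    for i in range(b):
        p *= i + F(1, 2)
    for k in range(1, a + b + 1):
        p /= k
    return p

def counts(past, seq, depth):
    """(a_s, b_s) for every context s, i.e. every substring of length
    at most `depth` preceding a symbol of seq."""
    full, start = past + seq, len(past)
    c = {}
    for t in range(start, len(full)):
        for d in range(depth + 1):
            s = full[t - d:t]
            a, b = c.get(s, (0, 0))
            c[s] = (a + (full[t] == '0'), b + (full[t] == '1'))
    return c

def p_w(s, c, depth):
    """Weighted probability of node s; its children are the contexts
    0s and 1s, extended by one older symbol."""
    a, b = c.get(s, (0, 0))
    if len(s) == depth:
        return kt(a, b)
    return (kt(a, b) + p_w('0' + s, c, depth) * p_w('1' + s, c, depth)) / 2
```

For the example one obtains the internal node values of the figure, e.g. $P_w^{0}=21/256$ and $P_w^{1}=1/16$, and the root probability $P_w^{\lambda}=31/8192$.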
Recall that for the application in arithmetic coding it is important
that probabilities $P(x_1 \dots x_{n-1} 0)$ and
$P(x_1 \dots x_{n-1} 1)$ can be efficiently calculated from
$P(x_1 \dots x_n)$. This is possible with the context-tree weighting
method, since the weighted probabilities $P_w^s$ only
have to be updated, when $s$ is changing. This just occurs for the contexts
along the path from the root to the leaf in the context tree preceding
the new symbol $x_n$---namely the $D+1$ contexts $(x_{n-1}, \dots , x_{n-i})$
for $i=1, \dots ,D$ and the root $\lambda$.
Along this path, $a_s \leftarrow a_s+1$ has to be performed
when $x_n=0$, and $b_s \leftarrow b_s+1$ has to be performed when $x_n=1$,
and the corresponding probabilities $P_e(a_s,b_s)$ and $P_w^s$ have to
be updated.
This suggests the following algorithm for updating the context tree
$CT(x_{1}, \dots ,x_{n-1}|x_{-D+1}, \dots x_0)$ when reading the next letter
$x_n$. Recall that to each node of the tree we store the parameters
$(a_s,b_s)$, $P_e(a_s,b_s)$ and $P_w^s$. These parameters have to be updated
in order to obtain $CT(x_{1}, \dots ,x_{n}|x_{-D+1}, \dots x_{0})$.
We assume the convention that the sequence $(x_{n-1}, \dots ,x_{n})$,
which arises for $i=D$, denotes the empty string, i.e., the root $\lambda$.
\begin{alg}{Update-Context-Tree($x_n, CT(x_1 \dots x_{n-1}|x_{-D+1}
\dots x_{0})$)}\inddef{\textsc{Update-Context-Tree}}
1 \' $s \leftarrow (x_{n-1} \dots x_{n-D})$ \\
2 \' \key{if} \= $x_n = 0$ \\
3 \' \> \key{then} \= $P_w^s \leftarrow P_w^s \cdot \frac{a_s+1/2}{a_s+b_s+1}$ \\
4 \' \> \> $a_s \leftarrow a_s +1$ \\
5 \' \> \key{else} \> $P_w^s \leftarrow P_w^s \cdot \frac{b_s+1/2}{a_s+b_s+1}$ \\
6 \' \> \> $b_s \leftarrow b_s +1$ \\
7 \' \key{for} \= $i \leftarrow 1$ \key{to} $D$ \\
8 \' \> \key{do} \= $s \leftarrow (x_{n-1}, \dots ,x_{n-D+i})$ \\
9 \' \> \> \key{if} \= $x_n = 0$ \\
10 \' \> \> \> \key{then} \= $P_e(a_s,b_s) \leftarrow P_e(a_s,b_s)
\cdot \frac{a_s+1/2}{a_s+b_s+1}$ \\
11 \' \> \> \> \> $a_s \leftarrow a_s +1$ \\
12 \' \> \> \> \key{else} \> $P_e(a_s,b_s) \leftarrow P_e(a_s,b_s)
\cdot \frac{b_s+1/2}{a_s+b_s+1}$ \\
13 \' \> \> \> \> $b_s \leftarrow b_s +1$ \\
14 \' \> $P_w^s \leftarrow \frac12 \cdot (P_e(a_s,b_s)+P_w^{0s} \cdot P_w^{1s})$ \\
15 \' \key{return} $P_w^s$
\end{alg}
The probability $P_w^{\lambda}$ assigned to the root in the context tree
will be used for the successive subdivisions in arithmetic coding.
Initially, before reading $x_1$,
the parameters in the context tree are $(a_s,b_s)=(0,0)$,
$P_e(a_s,b_s)=1$, and $P_w^s=1$ for all contexts $s$ in the tree.
In our example the updates given the past $(x_{-2},x_{-1},x_0)=(0,1,0)$
would yield the successive probabilities $P_w^{\lambda}$:
$\frac12$ for $x_1=0$, $\frac{5}{32}$ for $(x_1 x_2)=(01)$,
$\frac{5}{64}$ for $(x_1 x_2 x_3)=(010)$,
$\frac{13}{256}$ for $(x_1 \dots x_4)=(0100)$,
$\frac{27}{1024}$ for $(x_1 \dots x_5)=(01001)$,
$\frac{31}{4096}$ for $(x_1 \dots x_6)=(010011)$, and finally
$\frac{31}{8192}$ for $(x_1 \dots x_7)=(0100111)$.
\subsubsection{Correctness} Recall that the quality of a code concerning its compression
capability is measured with respect to the average codeword length.
The average codeword length of the best code comes as close as possible to
the entropy of the source. The difference between the average codeword length
and the entropy is denoted as the \ki{redundancy} \inddef{redundancy}
$\overline{\rho}(c)$ of code $c$, hence
$$\overline{\rho}(c) = \overline{L}(c) - H(P) \koz ,$$
which obviously is the weighted (by $P(x^n)$) sum of the individual
redundancies
$$\rho(x^n) = L(x^n) - \lg \frac{1}{P(x^n)} \koz \ .$$
The individual redundancy $\rho(x^n|{\cal S})$ of sequences $x^n$
given the (known) source ${\cal S}$ for all $\theta_s \in [0,1]$ for
$s \in {\cal S}$, $|{\cal S}|\leq n$ is bounded by
$$
\rho(x^n|{\cal S}) \leq
\frac{|{\cal S}|}{2}\lg\frac{n}{|{\cal S}|} + |{\cal S}| + 2 \koz .
$$
The individual redundancy $\rho(x^n|{\cal S})$ of sequences $x^n$
using the context-tree weighting algorithm (and hence a complete tree
of all possible contexts as model ${\cal S}$)
is bounded by
$$
\rho(x^n|{\cal S}) < 2|{\cal S}| - 1 +
\frac{|{\cal S}|}{2}\lg\frac{n}{|{\cal S}|} + |{\cal S}| + 2 \koz .
$$
Comparing these two formulae, we see that the difference of
the individual redundancies is $2|{\cal S}| - 1$~bits.
This can be considered as the
cost of not knowing the model, i.e. the model redundancy. So, the redundancy
splits into the parameter redundancy, i. e. the cost of not knowing
the parameter, and the model redundancy. It can
be shown that the expected redundancy behaviour of the context-tree
weighting method achieves the asymptotic lower bound due to
Rissanen who could demonstrate that about $\frac12 \lg n$
bits per parameter is the minimum possible expected redundancy for
$n \longrightarrow \infty$.
\subsubsection{Analysis.} The computational complexity is proportional to the number of nodes that
are visited when updating the tree, which is about
$n(D + 1)$. Therefore, the number of operations necessary for processing $n$
symbols is linear in $n$. However, these operations are mainly multiplications
with factors requiring high precision.
As for most modelling algorithms,
the main drawback of practical implementations is the huge amount of
memory required: a complete tree of depth $D$ has to be stored and updated.
Only with increasing $D$ do the estimations of the probabilities
become more accurate, and hence the average codeword length of
an arithmetic code based on these estimations becomes shorter.
The size of the memory, however, depends exponentially on the
depth of the tree.
We presented the context--tree weighting method only for
binary sequences. Note that in this case
the cumulative probability of a binary sequence
$(x_1 \dots x_n)$ can be calculated as
\[
Q^n(x_1 x_2 \dots x_{n-1} x_n) = \sum_{j=1, \dots ,n; x_j=1}
P^{j}(x_1 x_2 \dots x_{j-1} 0) \koz .
\]
For compression of sources with larger alphabets,
for instance ASCII-files, we refer to the literature.
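For a discrete memoryless source the identity for $Q^n$ can be checked by brute force: $Q^n(x^n)$ equals the total probability of all sequences preceding $x^n$ in lexicographic order. A small Python sketch (illustrative, using exact fractions; names are ours):

```python
from fractions import Fraction as F
from itertools import product

def prob(seq, p1):
    """Probability of a binary sequence under a memoryless source P(1)=p1."""
    p = F(1)
    for x in seq:
        p *= p1 if x == '1' else 1 - p1
    return p

def q_cumulative(seq, p1):
    """Q^n(x^n) = sum over j with x_j = 1 of P^j(x_1 ... x_{j-1} 0)."""
    return sum((prob(seq[:j] + '0', p1)
                for j in range(len(seq)) if seq[j] == '1'), F(0))
```

Since lexicographic order of equal-length binary strings coincides with numeric order, `q_cumulative` can be compared directly against the sum over all smaller sequences.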
\begin{gyak}
\ujgyak
Compute the arithmetic codes\gyakindex{code!arithmetic} for the sources $({\cal X}^n, P^n)$, $n=1,2,3$
with ${\cal X} = \{1,2\}$ and $P=(0.8, 0.2)$ and compare these codes
with the corresponding Huffman-codes derived previously.
\ujgyak
For the codes derived in the previous exercise compute the individual
redundancies of each codeword and the redundancies of the codes.
\ujgyak
Compute the estimated probabilities
$P_e(a,b)$ for the sequence
$0100110$ and all its subsequences
using the Krichevsky-Trofimov-estimator.
\ujgyak
Compute all parameters $(a_s,b_s)$ and the estimated probability
$P_e^s$ for the sequence $0100110$ given the
past $110$, when the context tree ${\cal S} = \{00, 10, 1\}$ is known.
What will be the codeword of an arithmetic code in this case?
\ujgyak
Compute all parameters $(a_s,b_s)$ and
the estimated probability $P_{\lambda}$
for the sequence $0100110$ given the
past $110$, when the context tree is not known,
using the context-tree weighting algorithm.
\ujgyak
Based on the computations from the previous exercise, update
the estimated probability for the sequence $01001101$ given the
past $110$.
\ujgyak
Show that for the cumulative probability of a binary sequence
$(x_1 \dots x_n)$ it is
\[
Q^n(x_1 x_2 \dots x_{n-1} x_n) = \sum_{j=1, \dots ,n; x_j=1}
P^{j}(x_1 x_2 \dots x_{j-1} 0) \koz .
\]
\end{gyak}
\section{Ziv-Lempel-coding}
In 1976--1978 Jacob Ziv and Abraham Lempel introduced two
universal coding algorithms which, in contrast to the statistical coding
techniques considered so far, do not make explicit use of the
underlying probability distribution.
The basic idea here is to replace a previously seen
string with a pointer into a history buffer (LZ77) or with the index of
a dictionary (LZ78). LZ algorithms are widely used---``zip'' and its variations
use the LZ77 algorithm. So, in contrast to the presentation
by several authors, Ziv-Lempel-coding\inddef{Ziv-Lempel-coding}
is not a single algorithm. Originally, Lempel and Ziv introduced a method
to measure the complexity of a string---akin to Kolmogorov complexity.
This led to two different algorithms, LZ77 and LZ78.
Many modifications and variations have been developed since.
However, we shall present the original algorithms and refer to
the literature for further information.
\subsection{LZ77}
The idea of LZ77 is to pass a \ki{sliding window}\index{sliding window} over the text to be
compressed. One looks for the longest substring in this window
representing the next letters of the text. The window consists of two parts:
a history window of length $l_h$, say,
in which the last $l_h$ bits of the text considered so far are stored,
and a lookahead window of length $l_f$ containing the next $l_f$ bits
of the text. In the simplest case $l_h$ and $l_f$ are fixed. Usually,
$l_h$ is much bigger than $l_f$. Then one encodes the triple
(offset, length, letter). Here the \ki{offset}\index{offset} is the number of letters
one has to go back in the text to find the matching substring, the length
is just the length of this matching substring, and the letter to be stored
is the letter following the matching substring.
Let us illustrate this procedure with an example. Assume the text to be
compressed is $... abaabbaabbaaabbbaaaabbabbbabbb...$, the window is
of size 15 with $l_h=10$ letters history and $l_f=5$ letters lookahead
buffer.
Assume, the sliding window now arrived at
$$...aba||abbaabbaaa|bbbaa|| \koz ,$$
i. e., the history window contains the 10 letters $abbaabbaaa$ and
the lookahead window contains the five letters $bbbaa$. The longest substring
matching the first letters of the lookahead window is $bb$ of length $2$,
which is found nine letters back from the right end of the history window.
So we encode $(9,2,b)$, since $b$ is the next letter (the string $bb$ is
also found five letters back; in the original LZ77 algorithm one would
select the larger offset). The window then is moved
$3$ letters forward
$$...abaabb||aabbaaabbb|aaaab|| \koz .$$
The next codeword is $(6,3,a)$, since the longest matching substring is
$aaa$ of length $3$ found $6$ letters backwards and $a$ is the
letter following this substring in the lookahead window. We proceed with
$$...abaabbaabb||aaabbbaaaa|bbabb|| \koz,$$
and encode $(6,3,b)$. Further
$$...abaabbaabbaaab||bbaaaabbab|babbb|| \koz .$$
Here we encode $(3,4,b).$ Observe that the match can extend into the lookahead window.
There are many subtleties to be taken into account. If a
symbol did not appear yet in the text, offset and length are set to $0$.
If there are two matching strings of the same length, one has to
choose between the first and the second offset. Both variations
have advantages. Initially one might start with an empty history
window and the first letters of the text to be compressed in the lookahead
window---there are also further variations.
A common modification of the original scheme is to output only the pair
(offset, length) and not the following letter of the text. Using this
coding procedure one has to take into consideration the case in which the
next letter does not occur in the history window. In this case, usually the
letter itself is stored, such that the decoder has to distinguish between
pairs of numbers and single letters. Further variations do not necessarily
encode the longest matching substring.
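The procedure can be sketched in Python as follows (an illustration, not an optimised implementation: it searches the history by brute force, uses fixed window sizes $l_h$ and $l_f$, and, like the original algorithm, selects the largest offset among equally long matches; function names are ours).

```python
def lz77_encode(text, lh=10, lf=5):
    """Encode text into (offset, length, letter) triples.
    offset = how far back the match starts; 0 if there is no match."""
    i, out = 0, []
    while i < len(text):
        best_len, best_off = 0, 0
        history_start = max(0, i - lh)
        max_len = min(lf, len(text) - i) - 1   # reserve one explicit letter
        for off in range(i - history_start, 0, -1):   # largest offset first
            length = 0
            # the match may extend into the lookahead buffer
            while length < max_len and text[i - off + length] == text[i + length]:
                length += 1
            if length > best_len:
                best_len, best_off = length, off
        out.append((best_off, best_len, text[i + best_len]))
        i += best_len + 1
    return out

def lz77_decode(triples):
    text = []
    for off, length, letter in triples:
        for _ in range(length):
            text.append(text[-off])   # copying one by one handles overlaps
        text.append(letter)
    return ''.join(text)
```

Decoding simply replays the triples, which also illustrates why LZ77 decoding is so much faster than encoding.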
\subsection{LZ78}
LZ78 does not use a sliding window but a dictionary which is
represented here as a table with an index and an entry. LZ78 parses the
text to be compressed into a collection of strings, where each string
is the longest matching string $\alpha$ seen so far plus the symbol $s$
following $\alpha$ in the text to be compressed.
The new string $\alpha s$ is added into the dictionary. The new entry is
coded as $(i,s)$, where $i$ is the index of the existing table entry
$\alpha$ and $s$ is the appended symbol.
As an example, consider the string ``$abaabbaabbaaabbbaaaabba$''.
It is divided by LZ78
into strings as shown below. String $0$ is here the empty string.
$$
\begin{array}{lccccccccccc}
\mbox{Input} & a & b & aa & bb & aab & ba & aabb & baa & aabba \\
\mbox{String Index} & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 \\
\mbox{Output} & (0,a) & (0,b) & (1,a) & (2,b) & (3,b) & (2,a) & (5,b) & (6,a) & (7,a) \koz . \\
\end{array}
$$
Since we are not using a sliding window, there is no limit for how far back
strings can reach. However, in practice the dictionary cannot continue
to grow infinitely. There are several ways to manage this problem.
For instance, after having reached the maximum
number of entries in the dictionary, no further entries can be added
to the table and coding becomes static. Another variation would be to
replace older entries.
The decoder knows how many bits must be
reserved for the index of the string in the dictionary,
and hence decompression is straightforward.
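A sketch of LZ78 in Python (illustrative; for simplicity the dictionary is allowed to grow without bound, and a leftover prefix at the end of the text is emitted as an ordinary pair):

```python
def lz78_encode(text):
    """Parse text into (index, letter) pairs; index 0 is the empty string."""
    dictionary = {'': 0}
    out, current = [], ''
    for ch in text:
        if current + ch in dictionary:
            current += ch                      # extend the matching string
        else:
            out.append((dictionary[current], ch))
            dictionary[current + ch] = len(dictionary)
            current = ''
    if current:                                # leftover already in dictionary
        out.append((dictionary[current[:-1]], current[-1]))
    return out

def lz78_decode(pairs):
    entries = ['']
    for idx, ch in pairs:
        entries.append(entries[idx] + ch)
    return ''.join(entries[1:])
```

Applied to the example string it reproduces the table of strings and outputs above.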
\subsubsection{Correctness} Ziv-Lempel coding asymptotically achieves
the best possible compression rate which again is the entropy rate
of the source. The source model, however, is much more general
than the discrete memoryless source. The stochastic process generating
the next letter is assumed to be stationary
(the probability of a sequence does not depend on the instant of time,
i. e. $P(X_1=x_1, \dots ,X_n=x_n)=P(X_{t+1}=x_1, \dots ,X_{t+n}=x_n)$ for all
$t$ and all sequences $(x_1 \dots x_n)$). For stationary processes
the limit $\lim_{n \rightarrow \infty} \frac1n H(X_1, \dots ,X_n)$
exists and is defined to be the entropy rate.
If $s(n)$ denotes the number of strings in the parsing process of LZ78
for a text generated by a stationary source,
then the number of bits required to encode all these strings is
$s(n) \cdot (\lg s(n)+1)$. It can be shown that
$\frac{s(n) \cdot (\lg s(n)+1)}{n}$ converges to the entropy rate of
the source. However, this would require that all strings can be
stored in the dictionary.
\subsubsection{Analysis.} If we fix the size of the sliding window or the dictionary,
the running time of encoding a sequence of $n$ letters will be
linear in $n$. However, as usual in data compression, there is
a tradeoff between compression rate and speed. A better compression
is only possible with larger memory. Increasing the size of
the dictionary or the window will, however, result in a
slower performance, since the most time consuming task is the search
for the matching substring or the position in the dictionary.
Decoding in both LZ77 and LZ78 is straightforward.
Observe that with LZ77 decoding is usually much faster than encoding,
since the decoder already obtains the information at which position
in the history it can read out the next letters of the text to be recovered,
whereas the encoder has to find the longest matching substring in the
history window. So algorithms based on LZ77 are useful for files
which are compressed once and decompressed more frequently.
Further, the encoded text is not necessarily shorter
than the original text. Especially in the beginning of the encoding
the coded version may expand a lot. This expansion has to be taken
into consideration.
For implementation it is not optimal to represent the text
as an array. A suitable data structure will be a circular queue
for the lookahead window and a binary search tree for the history window
in LZ77, while for LZ78 a dictionary tree should be used.
\begin{gyak}
\ujgyak
Apply the algorithms LZ77\gyakindex{LZ77} and LZ78\gyakindex{LZ78} to the string ``abracadabra''.
\ujgyak
Which type of files will be well compressed with LZ77 and LZ78,
respectively? For which type of files are LZ77 and LZ78 not
so advantageous?
\ujgyak
Discuss the advantages of encoding the first or the last offset,
when several matching substrings are found in LZ77.
\end{gyak}
\section{The Burrows-Wheeler-transform}
The \ki{Burrows-Wheeler-transform}\inddef{Burrows-Wheeler-transform}
will best be demonstrated by an example.
Assume that our original text is $\vec{X}=$ ``WHEELER''. This text will be
mapped to a second text $\vec{L}$
and an index $I$ according to the following rules.
\smallskip
1) We form a matrix $M$ consisting of all cyclic shifts of the original text
$\vec{X}$. In our example
$$
M= \left( \begin{array}{ccccccc}
W & H & E & E & L & E & R \\
H & E & E & L & E & R & W \\
E & E & L & E & R & W & H \\
E & L & E & R & W & H & E \\
L & E & R & W & H & E & E \\
E & R & W & H & E & E & L \\
R & W & H & E & E & L & E \\
\end{array} \right).
$$
2) From $M$ we obtain a new matrix $M'$ by simply ordering the rows in
$M$ lexicographically. Here this yields the matrix
$$
M'= \left( \begin{array}{ccccccc}
E & E & L & E & R & W & H \\
E & L & E & R & W & H & E \\
E & R & W & H & E & E & L \\
H & E & E & L & E & R & W \\
L & E & R & W & H & E & E \\
R & W & H & E & E & L & E \\
W & H & E & E & L & E & R \\
\end{array} \right).
$$
3) The transformed string $\vec{L}$ then is just the last column of the matrix
$M'$ and the index $I$ is the number of the row of $M'$,
in which the original text is contained. In our example
$\vec{L}=$ ``HELWEER'' and $I=6$ -- we start counting the rows
with row no. $0$.
\smallskip
This gives rise to the following pseudocode. We write
here $X$ instead of $\vec{X}$ and $L$ instead of $\vec{L}$,
since the purpose of the vector notation is only to distinguish
the vectors from the letters in the text.
\smallskip
\begin{algN}{BWT-Encoder($X$)}\inddef{BWT-Encoder}
1 \' \key{for} \= $j \leftarrow 0$ \key{to} $n-1$ \\
2 \' \> \key{do} $M[0,j]\leftarrow X[j]$ \\
3 \' \key{for} \= $i \leftarrow 1$ \key{to} $n-1$ \\
4 \' \> \key{do} \= \key{for} \= $j \leftarrow 0$ \key{to} $n-1$ \\
5 \' \> \> \> \key{do} $M[i,j] \leftarrow M[i-1,(j+1) \bmod n]$ \\
6 \' \key{for} \= $i \leftarrow 0$ \key{to} $n-1$ \\
7 \' \> \key{do} row $i$ of $M' \leftarrow$ row $i$ of $M$ in lexicographic order \\
8 \' \key{for} \= $i \leftarrow 0$ \key{to} $n-1$ \\
9 \' \> \key{do} $L[i] \leftarrow M'[i,n-1]$ \\
10 \' $i=0$ \\
11 \' \key{while} \= (row $i$ of $M' \not=$ row $0$ of $M$) \\
12 \' \> \key{do} $i \leftarrow i+1$ \\
13 \' $I \leftarrow i$ \\
14 \' \key{return} $L$ and $I$ \\
\end{algN}
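For illustration, the whole encoding step can be sketched in a few lines of Python (an illustrative sketch of ours, not part of the formal pseudocode); sorting the list of cyclic shifts replaces the explicit construction of the matrices $M$ and $M'$.

```python
def bwt_encode(x):
    """Burrows-Wheeler transform: sort all cyclic shifts of x and
    return the last column L together with the row index I of x."""
    n = len(x)
    rotations = [x[i:] + x[:i] for i in range(n)]   # rows of M
    rotations.sort()                                # rows of M'
    last_column = "".join(row[-1] for row in rotations)
    index = rotations.index(x)   # row of M' containing the original text
    return last_column, index

# The example from the text:
L, I = bwt_encode("WHEELER")    # L == "HELWEER", I == 6
```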
It can be shown that this transformation is invertible, i. e., it is possible
to reconstruct the original text $\vec{X}$ from its transform
$\vec{L}$ and the index $I$. This is because these two parameters just yield
enough information to find out the underlying permutation of the letters.
Let us illustrate this reconstruction using the above example again.
From the transformed string $\vec{L}$ we obtain a second string
$\vec{E}$ by simply
ordering the letters in $\vec{L}$ in ascending order.
Actually, $\vec{E}$ is the
first column of the matrix $M'$ above. So, in our example
$$
\vec{L} = ``H \quad E \quad L \quad W \quad E \quad E \quad R''
$$
$$
\vec{E} = ``E \quad E \quad E \quad H \quad L \quad R \quad W'' \koz .
$$
Now obviously the first letter $\vec{X}(0)$ of our original text $\vec{X}$
is the letter in position $I$ of the sorted string $\vec{E}$, so here
$\vec{X}(0) = \vec{E}(6) = W$. Then we look at the position of
the letter just considered in the string $\vec{L}$ -- here there is only
one W, which is letter no. 3 in $\vec{L}$. This position gives us
the location of the next letter of the original text, namely
$\vec{X}(1)= \vec{E}(3) = H$. $H$ is found in position no. 0 in
$\vec{L}$, hence $\vec{X}(2)= \vec{E}(0) = E$. Now there are three
E--s in the string $\vec{L}$ and we take the first one not used so far,
here the one in position no. 1, and hence $\vec{X}(3)= \vec{E}(1) = E$.
We iterate this procedure and find $\vec{X}(4)= \vec{E}(4) = L$,
$\vec{X}(5)= \vec{E}(2) = E$, $\vec{X}(6)= \vec{E}(5) = R$.
This suggests the following pseudocode.
\begin{alg}{BWT-Decoder($L, I$)}\inddef{BWT-Decoder}
1 \' $E[0\twodots n-1] \leftarrow$ sort $L[0 \twodots n-1]$ \\
2 \' $pi[-1] \leftarrow I$ \\
3 \' \key{for} \= $i \leftarrow 0$ \key{to} $n-1$ \\
4 \' \> \key{do} \= $j=0$ \\
5 \' \> \> \key{while} \= ($L[j] \not= E[pi[i-1]]$ OR $j$ is a component of $pi$) \\
6 \' \> \> \> \key{do} $j\leftarrow j+1$ \\
7 \' \> \> $pi[i] \leftarrow j$ \\
8 \' \> \> $X[i] \leftarrow L[j]$ \\
9 \' \key{return} $X$
\end{alg}
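The reconstruction just described can likewise be sketched in Python (again an illustration of ours); the search for the first not-yet-used position mirrors the while loop of the pseudocode.

```python
def bwt_decode(L, I):
    """Invert the Burrows-Wheeler transform: look up the current letter
    in the sorted string E, output it, and jump to its first
    not-yet-used position in L."""
    n = len(L)
    E = sorted(L)          # first column of the matrix M'
    used = set()
    position = I           # corresponds to pi[-1] in the pseudocode
    X = []
    for _ in range(n):
        letter = E[position]
        j = next(k for k in range(n) if L[k] == letter and k not in used)
        used.add(j)
        X.append(letter)
        position = j
    return "".join(X)

# Recovers the original text of the example:
original = bwt_decode("HELWEER", 6)   # == "WHEELER"
```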
This algorithm suggests a more formal description. Since the decoder
only knows $\vec{L}$, it has to sort this string to find out
$\vec{E}$. To each letter $\vec{L}(j)$ of
the transformed string assign the position $\pi(j)$
at which this letter occurs in $\vec{E}$, where multiple occurrences
of the same letter are matched in their order of appearance. This yields
a permutation $\pi$ with
$\vec{L}(j) = \vec{E}(\pi(j))$ for each $j=0, \dots ,n-1$.
In our example $\pi = (3,0,4,6,1,2,5)$; the vector $pi$ computed by the
pseudocode visits exactly these positions, namely $\pi(pi[i]) = pi[i-1]$ for all $i$.
This permutation can be used to reconstruct the original text
$\vec{X}$ of length $n$ via $\vec{X}(n-1-j) =
\vec{L}(\pi^j(I))$, where $\pi^0(x)=x$ and $\pi^j(x)=\pi(\pi^{j-1}(x))$
for $j=1, \dots ,n-1$.
Observe that so far the original data have only been transformed
and not compressed, since the string $\vec{L}$ has exactly the
same length as the original string $\vec{X}$. So what is the advantage
of the Burrows-Wheeler transformation? The idea is that the transformed
string can be much more efficiently encoded than the original string.
The dependencies among the letters have the effect that in the transformed
string $\vec{L}$ there appear long blocks consisting of the same letter.
In order to exploit such frequent blocks of the same letter,
Burrows and Wheeler suggested the following \ki{move-to-front-code},
\inddef{move-to-front code} which we shall illustrate again with our example above.
We write down a list containing the letters used in our text in
alphabetic order indexed by their position in this list.
$$
\begin{array}{ccccc}
E & H & L & R & W \\
0 & 1 & 2 & 3 & 4 \\
\end{array}$$
Then we parse through the transformed string $\vec{L}$ letter by letter,
note the index of the next letter and move this letter to the front
of the list. So in the first step we note 1---the index of the H,
move H to the front and obtain the list
$$
\begin{array}{ccccc}
H & E & L & R & W \\
0 & 1 & 2 & 3 & 4 \\
\end{array}
$$
Then we note 1 and move E to the front,
$$
\begin{array}{ccccc}
E & H & L & R & W \\
0 & 1 & 2 & 3 & 4 \\
\end{array}$$
note 2 and move L to the front,
$$
\begin{array}{ccccc}
L & E & H & R & W \\
0 & 1 & 2 & 3 & 4 \\
\end{array}$$
note 4 and move W to the front,
$$
\begin{array}{ccccc}
W & L & E & H & R \\
0 & 1 & 2 & 3 & 4 \\
\end{array}$$
note 2 and move E to the front,
$$
\begin{array}{ccccc}
E & W & L & H & R \\
0 & 1 & 2 & 3 & 4 \\
\end{array}$$
note 0 and leave E at the front,
$$
\begin{array}{ccccc}
E & W & L & H & R \\
0 & 1 & 2 & 3 & 4 \\
\end{array}$$
note 4 and move R to the front,
$$
\begin{array}{ccccc}
R & E & W & L & H \\
0 & 1 & 2 & 3 & 4 \\
\end{array}$$
So we obtain the sequence $(1,1,2,4,2,0,4)$ as our move-to-front-code.
The pseudocode may look as follows, where $Q$ is a list
of the letters occurring in the string $\vec{L}$.
\begin{alg}{Move-To-Front($L$)}\inddef{\textsc{Move-To-Front}}
1 \' $Q[0 \twodots m-1] \leftarrow$ list of the $m$ letters occurring in $L$ ordered alphabetically \\
2 \' \key{for} \= $i \leftarrow 0$ \key{to} $n-1$ \\
3 \' \> \key{do} \= $j \leftarrow 0$ \\
4 \' \> \> \key{while} \= ($Q[j] \not= L[i]$) \\
5 \' \> \> \> \key{do} $j \leftarrow j+1$ \\
6 \' \> \> $c[i] \leftarrow j$ \\
7 \' \> \> \key{for} \= $l \leftarrow j$ \key{downto} $1$ \\
8 \' \> \> \> \key{do} $Q[l] \leftarrow Q[l-1]$ \\
9 \' \> \> $Q[0] \leftarrow L[i]$ \\
10 \' \key{return} $c$
\end{alg}
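As an illustration (our sketch, not part of the original pseudocode), move-to-front coding can be written in Python, where the list operations replace the explicit shifting of $Q$.

```python
def move_to_front(text, alphabet):
    """Move-to-front coding: emit the current list position of each
    letter and then move that letter to the front of the list."""
    q = sorted(alphabet)          # the list Q, ordered alphabetically
    code = []
    for letter in text:
        j = q.index(letter)
        code.append(j)
        q.insert(0, q.pop(j))     # move the letter to the front
    return code

# The example from the text:
code = move_to_front("HELWEER", "EHLRW")   # == [1, 1, 2, 4, 2, 0, 4]
```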
The move-to-front-code $c$ will finally be compressed, for instance by Huffman-coding.
\subsubsection{Correctness.} The compression is due to the move-to-front-code obtained
from the transformed string $\vec{L}$.
It can easily be seen that this move-to-front coding procedure
is invertible, so one can recover the string $\vec{L}$ from the
code obtained as above.
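A sketch (ours) of this inverse procedure: the decoder maintains the same list and reads the letters off at the received positions.

```python
def move_to_front_decode(code, alphabet):
    """Invert move-to-front coding by replaying the list updates."""
    q = sorted(alphabet)
    text = []
    for j in code:
        text.append(q[j])
        q.insert(0, q.pop(j))     # same update as in the encoder
    return "".join(text)

# Recovers the transformed string of the example:
L = move_to_front_decode([1, 1, 2, 4, 2, 0, 4], "EHLRW")   # == "HELWEER"
```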
Now it can be observed that in the move-to-front-code
small numbers occur more frequently. Unfortunately, this will become
obvious only with much longer texts than in our example---for
long strings it has been observed that about 70 per cent of the numbers are $0$.
This irregularity in distribution can be exploited by compressing the sequence obtained
after move-to-front-coding, for instance by Huffman-codes or run-length codes.
The algorithm performed very
well in practice regarding the compression rate as well as the speed.
The asymptotic optimality of compression has been proven for a wide
class of sources.
\subsubsection{Analysis.} The most complex part of the Burrows-Wheeler transform is
the sorting of the block yielding the transformed string $\vec{L}$.
Due to fast sorting procedures, especially suited for the type of
data to be compressed, compression algorithms based on the Burrows-Wheeler
transform are usually very fast.
On the other hand, compression is done blockwise. The text to be compressed
has to be divided into blocks of appropriate size such that
the matrices $M$ and $M'$ still fit into the memory.
So the decoder has to wait until the whole next block is transmitted
and cannot work sequentially bit by bit as in arithmetic coding
or Ziv-Lempel coding.
\begin{gyak}
\ujgyak
Apply the Burrows-Wheeler-transform\gyakindex{Burrows-Wheeler-transform} and the move-to-front code
to the text ``abracadabra''.
\ujgyak
Verify that the transformed string $\vec{L}$ and
the index $i$ of the position in the sorted text $\vec{E}$
(containing the first letter of the original text to be compressed)
indeed yield enough information to reconstruct the original text.
\ujgyak
Show how in our example the decoder would obtain the string
$\vec{L}=$ ``HELWEER'' from the move-to-front code $(1,1,2,4,2,0,4)$
and the letters E,H,L,W,R occurring in the text.
Describe the general procedure for decoding move-to-front codes.
\ujgyak
We followed here the encoding procedure presented by Burrows
and Wheeler. Can the encoder obtain the transformed string $\vec{L}$
even without constructing the two matrices $M$ and $M'$?
\end{gyak}
\section{Image compression}
The idea of image compression algorithms is similar
to the one behind the Burrows-Wheeler-transform. The text to be compressed
is transformed to a format which is suitable for application
of the techniques presented in the previous sections,
such as Huffman coding or arithmetic
coding. There are several procedures based on the type of image
(for instance, black/white, greyscale or colour image)
or compression (lossless
or lossy). We shall present the basic steps---representation of data,
discrete cosine transform, quantisation, coding---of lossy image compression
procedures like the standard \ki{JPEG.} \inddef{JPEG}
\subsection{Representation of data}
A greyscale image is represented as a two-dimensional array
$X$, where each entry $X(i,j)$ represents the intensity (or brightness)
at position $(i,j)$ of the image. Each $X(i,j)$ is either a signed or an
unsigned $k$-bit integer, i. e., $X(i,j) \in \{0, \dots , 2^k-1\}$ or
$X(i,j) \in \{-2^{k-1}, \dots , 2^{k-1}-1 \}$.
A position in a colour image is usually represented by three greyscale
values $R(i,j)$, $G(i,j)$, and $B(i,j)$ per position corresponding to
the intensity of the primary colours red, green and blue.
In order to compress the image, the three arrays (or channels) $R$, $G$, $B$
are first converted to the luminance/chrominance space by the
\ki{$YC_bC_r$-transform} \inddef{YCbCr-transform} (performed entry-wise)
$$
\left( \begin{array}{c} Y \\ C_b \\ C_r \end{array} \right) =
\left( \begin{array}{ccc}
0.299 & 0.587 & 0.114 \\
-0.169 & -0.331 & 0.5 \\
0.5 & -0.419 & -0.0813
\end{array} \right) \cdot
\left( \begin{array}{c} R \\ G \\ B \end{array} \right)
$$
$Y=0.299 R + 0.587 G + 0.114 B$ is the luminance\inddef{luminance}
or intensity channel,
where the coefficients weighting the colours have been found empirically
and represent the best possible approximation of the intensity as
perceived by the human eye. The chrominance \inddef{chrominance} channels
$C_b = 0.564 (B-Y)$ and $C_r = 0.713(R-Y)$ contain the colour
information on red and blue
as the differences from $Y$. The information on green is, for the
most part, contained in the luminance $Y$.
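As a small illustration (our sketch, not part of the standard), the transform can be applied per pixel with the coefficients of the matrix above; for a neutral grey pixel both chrominance values (nearly) vanish.

```python
def ycbcr(r, g, b):
    """Entry-wise YCbCr transform with the coefficients given above."""
    y  =  0.299 * r + 0.587 * g + 0.114 * b      # luminance
    cb = -0.169 * r - 0.331 * g + 0.5   * b      # roughly 0.564 * (b - y)
    cr =  0.5   * r - 0.419 * g - 0.0813 * b     # roughly 0.713 * (r - y)
    return y, cb, cr

# A neutral grey pixel carries no colour information:
y, cb, cr = ycbcr(128, 128, 128)   # y == 128.0, cb and cr close to 0
```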
A first compression for colour images commonly is already obtained
after application of the $YC_bC_r$-transform by removing
\ki{irrelevant information.} Since the human eye is less
sensitive to rapid colour changes than to changes in intensity,
the resolution of the two chrominance channels $C_b$ and $C_r$
is reduced by a factor of $2$ in both vertical and horizontal direction,
which results after sub-sampling in arrays of $\frac14$ of the original size.
The arrays then are subdivided into $8 \times 8$ blocks, on which
successively the actual (lossy) data compression procedure is applied.
Let us consider the following example based
on a real image, on which the steps of compression will be
illustrated. Assume that the $8 \times 8$ block of
$8$-bit unsigned integers below is obtained as a part of
an image.
$$
f=
\left( \begin{array}{cccccccc}
139 & 144 & 149 & 153 & 155 & 155 & 155 & 155 \\
144 & 151 & 153 & 156 & 159 & 156 & 156 & 155 \\
150 & 155 & 160 & 163 & 158 & 156 & 156 & 156 \\
159 & 161 & 162 & 160 & 160 & 159 & 159 & 159 \\
159 & 160 & 161 & 161 & 160 & 155 & 155 & 155 \\
161 & 161 & 161 & 161 & 160 & 157 & 157 & 157 \\
162 & 162 & 161 & 163 & 162 & 157 & 157 & 157 \\
161 & 162 & 161 & 161 & 163 & 158 & 158 & 158
\end{array} \right)
$$
\subsection{The discrete cosine transform}
Each $8 \times 8$ block $(f(i,j))_{i,j=0,\dots ,7}$, say,
is transformed into a new block $(F(u,v))_{u,v=0,\dots ,7}$.
There are several possible transforms, usually the \ki{discrete cosine
transform} \inddef{discrete cosine transform} is applied,
which here obeys the formula
$$F(u,v) = \frac14 c_uc_v \left( \sum_{i=0}^7 \sum_{j=0}^7
f(i,j) \cdot \cos \frac{(2i+1)u\pi}{16} \cos\frac{(2j+1)v\pi}{16} \right) \koz .
$$
The cosine transform
is applied after shifting the unsigned integers to
signed integers by subtraction of $2^{k-1}$.
\begin{alg}{DCT($f$)}\inddef{\textsc{DCT} algorithm}
1 \' \key{for} \= $u \leftarrow 0$ \key{to} 7 \\
2 \' \> \key{do} \key{for} \= $v \leftarrow 0$ \key{to} 7 \\
3 \' \> \> \key{do} $F(u,v) \leftarrow$ DCT coefficient of matrix $f$ \\
4 \' \key{return} $F$
\end{alg}
The coefficients need not be calculated according to the formula above.
They can also be obtained via a related Fourier transform (see Exercises)
such that a Fast Fourier Transform may be applied. JPEG also supports
wavelet transforms, which may replace the discrete cosine transform here.
The discrete cosine transform can be inverted via
$$f(i,j) = \frac14 \left( \sum_{u=0}^7 \sum_{v=0}^7 c_uc_v
F(u,v) \cdot \cos \frac{(2i+1)u\pi}{16}\cos \frac{(2j+1)v\pi}{16} \right) \koz ,
$$
where $c_u = \left\{ \begin{array}{lr}
\frac{1}{\sqrt{2}} & \mbox{ for } u=0 \\
1 & \mbox{ for } u \not= 0 \end{array} \right.$ and
$c_v = \left\{ \begin{array}{lr}
\frac{1}{\sqrt{2}} & \mbox{ for } v=0 \\
1 & \mbox{ for } v \not= 0
\end{array} \right.$
are normalisation constants.
In our example, the transformed block $F$ is
$$
F=
\left( \begin{array}{cccccccc}
235.6 & -1.0 & -12.1 & -5.2 & 2.1 & -1.7 & -2.7 & 1.3 \\
-22.6 & -17.5 & -6.2 & -3.2 & -2.9 & -0.1 & 0.4 & -1.2 \\
-10.9 & -9.3 & -1.6 & 1.5 & 0.2 & -0.9 & -0.6 & -0.1 \\
-7.1 & -1.9 & 0.2 & 1.5 & 0.9 & -0.1 & 0.0 & 0.3 \\
-0.6 & -0.8 & 1.5 & 1.6 & -0.1 & -0.7 & 0.6 & 1.3 \\
1.8 & -0.2 & 1.6 & -0.3 & -0.8 & 1.5 & 1.0 & -1.0 \\
-1.3 & -0.4 & -0.3 & -1.5 & -0.5 & 1.7 & 1.1 & -0.8 \\
-2.6 & 1.6 & -3.8 & -1.8 & 1.9 & 1.2 & -0.6 & -0.4
\end{array} \right)
$$
where the entries are rounded.
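For illustration (a sketch of ours, not part of the original text), the transform of the example block can be evaluated directly from the formula; since the matrix $F$ above is printed with rounded entries, the values computed here may deviate by a fraction in some entries.

```python
from math import cos, pi, sqrt

def dct8x8(f):
    """8x8 discrete cosine transform as in the formula above, applied
    after shifting the unsigned 8-bit entries by 128."""
    c = [1 / sqrt(2)] + [1.0] * 7      # normalisation constants c_u, c_v
    F = [[0.0] * 8 for _ in range(8)]
    for u in range(8):
        for v in range(8):
            s = sum((f[i][j] - 128)
                    * cos((2 * i + 1) * u * pi / 16)
                    * cos((2 * j + 1) * v * pi / 16)
                    for i in range(8) for j in range(8))
            F[u][v] = 0.25 * c[u] * c[v] * s
    return F

# The example block from the text:
f = [[139, 144, 149, 153, 155, 155, 155, 155],
     [144, 151, 153, 156, 159, 156, 156, 155],
     [150, 155, 160, 163, 158, 156, 156, 156],
     [159, 161, 162, 160, 160, 159, 159, 159],
     [159, 160, 161, 161, 160, 155, 155, 155],
     [161, 161, 161, 161, 160, 157, 157, 157],
     [162, 162, 161, 163, 162, 157, 157, 157],
     [161, 162, 161, 161, 163, 158, 158, 158]]
F = dct8x8(f)   # F[0][0] is large, the high-frequency entries are small
```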
The discrete cosine transform is closely related to the discrete
Fourier transform and similarly maps signals to frequencies. Removing
higher frequencies results in a less sharp image, an effect that
is tolerated, such that higher frequencies are stored with less accuracy.
Of special importance is the entry $F(0,0)$, which can be interpreted
as a measure for the intensity of the whole block.
\subsection{Quantisation}
The discrete cosine transform maps integers to real numbers, which
in each case have to be rounded to be representable.
Of course, this rounding already results in a loss of information.
However, the transformed block $F$ will now be much easier to manipulate.
A \ki{quantisation}\inddef{quantisation} takes place, which maps the entries of $F$
to integers by division by the corresponding entry
in a luminance quantisation matrix $Q$.
In our example we use
$$
Q=
\left( \begin{array}{cccccccc}
16 & 11 & 10 & 16 & 24 & 40 & 51 & 61 \\
12 & 12 & 14 & 19 & 26 & 58 & 60 & 55 \\
14 & 13 & 16 & 24 & 40 & 57 & 69 & 56 \\
14 & 17 & 22 & 29 & 51 & 87 & 80 & 62 \\
18 & 22 & 37 & 56 & 68 & 109 & 103 & 77 \\
24 & 35 & 55 & 64 & 81 & 104 & 113 & 92 \\
49 & 64 & 78 & 87 & 103 & 121 & 120 & 101 \\
72 & 92 & 95 & 98 & 112 & 100 & 103 & 99
\end{array} \right).
$$
The quantisation matrix has to be carefully chosen in order
to leave the image at highest possible quality.
Quantisation is the lossy part
of the compression procedure. The idea is to remove information which
should not be ``visually significant''. Of course, at this point
there is a tradeoff between the compression rate and the quality of the
decoded image. So, in JPEG the quantisation table is not included
in the standard but must be specified (and hence be encoded).
\begin{alg}{Quantisation($F$)}\inddef{\textsc{Quantisation}}
1 \' \key{for} \= $i \leftarrow 0$ \key{to} 7 \\
2 \' \> \key{do} \key{for} \= $j \leftarrow 0$ \key{to} 7 \\
3 \' \> \> \key{do} $T(i,j) \leftarrow \{\frac{F(i,j)}{Q(i,j)} \}$ \\
4 \' \key{return} $T$
\end{alg}
This quantisation transforms block $F$
to a new block $T$ with $T(i,j)= \{\frac{F(i,j)}{Q(i,j)} \}$,
where $\{x\}$ is the closest integer to $x$.
This block will finally be encoded.
Observe that in the transformed block $F$ besides the entry
$F(0,0)$ all other entries are relatively small numbers,
which has the effect that $T$ mainly consists of $0$s.
$$
T=
\left( \begin{array}{cccccccc}
15 & 0 & -1 & 0 & 0 & 0 & 0 & 0 \\
-2 & -1 & 0 & 0 & 0 & 0 & 0 & 0 \\
-1 & -1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0
\end{array} \right) .
$$
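A Python sketch (ours) of the quantisation step: it applies $T(i,j)= \{\frac{F(i,j)}{Q(i,j)}\}$ entry-wise to the rounded matrix $F$ printed above.

```python
# The transformed block F (with rounded entries) and the luminance
# quantisation matrix Q from the text:
F = [[235.6, -1.0, -12.1, -5.2, 2.1, -1.7, -2.7, 1.3],
     [-22.6, -17.5, -6.2, -3.2, -2.9, -0.1, 0.4, -1.2],
     [-10.9, -9.3, -1.6, 1.5, 0.2, -0.9, -0.6, -0.1],
     [-7.1, -1.9, 0.2, 1.5, 0.9, -0.1, 0.0, 0.3],
     [-0.6, -0.8, 1.5, 1.6, -0.1, -0.7, 0.6, 1.3],
     [1.8, -0.2, 1.6, -0.3, -0.8, 1.5, 1.0, -1.0],
     [-1.3, -0.4, -0.3, -1.5, -0.5, 1.7, 1.1, -0.8],
     [-2.6, 1.6, -3.8, -1.8, 1.9, 1.2, -0.6, -0.4]]
Q = [[16, 11, 10, 16, 24, 40, 51, 61],
     [12, 12, 14, 19, 26, 58, 60, 55],
     [14, 13, 16, 24, 40, 57, 69, 56],
     [14, 17, 22, 29, 51, 87, 80, 62],
     [18, 22, 37, 56, 68, 109, 103, 77],
     [24, 35, 55, 64, 81, 104, 113, 92],
     [49, 64, 78, 87, 103, 121, 120, 101],
     [72, 92, 95, 98, 112, 100, 103, 99]]

def quantise(F, Q):
    """Divide each coefficient by the corresponding entry of Q and
    round to the closest integer."""
    return [[round(F[i][j] / Q[i][j]) for j in range(8)]
            for i in range(8)]

T = quantise(F, Q)   # T[0][0] == 15, most other entries are 0
```

Since the printed $F$ is rounded to one decimal, the borderline coefficient $F(3,0)=-7.1$ quantises to $-1$ here, while the exact coefficient yields the $0$ shown in the matrix $T$ above; all other entries agree.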
Coefficient $T(0,0)$, in this case $15$, deserves special consideration.
It is called DC term (direct current), while the other entries are denoted
AC coefficients (alternating current).
\subsection{Coding}
Matrix $T$ will finally be encoded by a Huffman code.
We shall only sketch the procedure.
First the DC term will be encoded by its difference from the DC term
of the previously encoded block. For instance, if the previous
DC term was 12, then $T(0,0)=15$ will be encoded as the difference $3$.
After that the AC coefficients are
encoded according to the zig-zag order $T(0,1)$, $T(1,0)$, $T(2,0)$,
$T(1,1)$, $T(0,2)$, $T(0,3)$, $T(1,2)$, etc. In our example, this yields
the sequence $0, -2, -1, -1, -1, 0, 0, -1$ followed by 55 zeros.
This zig--zag order exploits the fact that there are long runs of
successive zeros. These runs will be even more efficiently represented
by application of \ki{run-length coding,}\inddef{run-length coding}
i. e., we encode the number of zeros before the next nonzero element in the sequence
followed by this element.
Integers are written in such a way that small numbers have
shorter representations. This is achieved by splitting their
representation into size (number of bits to be reserved)
and amplitude (the actual value). So, $0$ has size $0$, $1$ and $-1$
have size $1$. $-3$, $-2$, $2$, and $3$ have size $2$, etc.
In our example this yields the sequence $(2)(3)$ for the DC term
followed by $(1,2)(-2)$, $(0,1)(-1)$, $(0,1)(-1)$, $(0,1)(-1)$, $(2,1)(-1)$,
and a final $(0,0)$ as an end-of-block symbol indicating that only zeros follow from now on.
$(1,2)(-2)$, for instance, means that there is 1 zero followed
by an element of size $2$ and amplitude $-2$.
These pairs are then assigned codewords from a Huffman code.
There are different Huffman codes for the pairs (run, size) and
for the amplitudes. These Huffman codes have to be specified and hence be included into
the code of the image.
In the following pseudocode for the encoding of a single
$8 \times 8$-block $T$ we shall denote
the different Huffman codes by encode-1, encode-2, encode-3.
\begin{algN}{Run-Length-Code$(T)$}\inddef{Run-Length-Code}
1 \' $c \leftarrow$ \textit{encode}-1(\textit{size}($T[0,0]-DC$)) \\
2 \' $c \leftarrow c||$ \textit{encode}-3(\textit{amplitude}($T[0,0]-DC$)) \\
3 \' $DC \leftarrow T[0,0]$ \\
4 \' \textit{run} $\leftarrow 0$ \\
5 \' \key{for} \= $l \leftarrow 1$ \key{to} 14 \\
6 \' \> \key{do} \key{for} \= $i \leftarrow \max(0,l-7)$ \key{to} $\min(l,7)$ \\
7 \' \> \> \key{do} \key{if} \= $l$ is odd \\
8 \' \> \> \> \key{then} \= $u \leftarrow i$ \\
9 \' \> \> \> \key{else} \> $u \leftarrow l-i$ \\
10 \' \> \> \key{if} \= $T[u,l-u]=0$ \\
11 \' \> \> \> \key{then} \= \textit{run} $\leftarrow \textit{run} + 1$ \\
12 \' \> \> \> \key{else} \> $c\leftarrow c||$ \textit{encode}-2(\textit{run}, \textit{size}$(T[u,l-u])$) \\
13 \' \> \> \> \> $c \leftarrow c||$ \textit{encode}-3(\textit{amplitude}($T[u,l-u]$)) \\
14 \' \> \> \> \> \textit{run} $\leftarrow 0$ \\
15 \' \key{if} \= \textit{run} > 0 \\
16 \' \> \key{then} $c \leftarrow c||$ \textit{encode}-2$(0,0)$ \\
17 \' \key{return} $c$
\end{algN}
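The zig-zag traversal and the formation of the (run, size)(amplitude) pairs can be illustrated by the following Python sketch (ours); here \textit{size}$(a)$ is taken to be the number of bits of $|a|$, so $0$ has size $0$.

```python
def zigzag_order():
    """Positions (u, v) of the 63 AC coefficients in zig-zag order."""
    order = []
    for l in range(1, 15):                 # anti-diagonals u + v = l
        for i in range(max(0, l - 7), min(l, 7) + 1):
            u = i if l % 2 == 1 else l - i
            order.append((u, l - u))
    return order

def size(a):
    """Number of bits needed to represent |a| (0 has size 0)."""
    return abs(a).bit_length()

def run_length(T):
    """(run, size, amplitude) triples for the AC coefficients of T,
    terminated by the end-of-block symbol (0, 0)."""
    pairs, run = [], 0
    for (u, v) in zigzag_order():
        if T[u][v] == 0:
            run += 1
        else:
            pairs.append((run, size(T[u][v]), T[u][v]))
            run = 0
    if run > 0:
        pairs.append((0, 0))               # end-of-block symbol
    return pairs

# The quantised block T of the example:
T = [[15, 0, -1, 0, 0, 0, 0, 0],
     [-2, -1, 0, 0, 0, 0, 0, 0],
     [-1, -1, 0, 0, 0, 0, 0, 0]] + [[0] * 8 for _ in range(5)]
# run_length(T) reproduces the AC part of the sequence in the text:
# [(1, 2, -2), (0, 1, -1), (0, 1, -1), (0, 1, -1), (2, 1, -1), (0, 0)]
```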
At the decoding end matrix $T$ will be reconstructed.
Finally, by multiplication of each entry $T(i,j)$ by
the corresponding entry $Q(i,j)$ from the quantisation matrix
$Q$ we obtain an approximation $\overline{F}$ to the
block $F$, here
$$
\overline{F}=
\left( \begin{array}{cccccccc}
240 & 0 & -10 & 0 & 0 & 0 & 0 & 0 \\
-24 & -12 & 0 & 0 & 0 & 0 & 0 & 0 \\
-14 & -13 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0
\end{array} \right).
$$
To $\overline{F}$ the inverse cosine transform is applied.
This yields an approximation $\overline{f}$ of the
$8 \times 8$--block $f$ of the original image -- in our example as
$$
\overline{f} =
\left( \begin{array}{cccccccc}
144 & 146 & 149 & 152 & 154 & 156 & 156 & 156 \\
148 & 150 & 152 & 154 & 156 & 156 & 156 & 156 \\
155 & 156 & 157 & 158 & 158 & 157 & 156 & 155 \\
160 & 161 & 161 & 162 & 161 & 159 & 157 & 155 \\
163 & 163 & 164 & 163 & 162 & 160 & 158 & 156 \\
163 & 164 & 164 & 164 & 162 & 160 & 158 & 157 \\
160 & 161 & 162 & 162 & 162 & 161 & 159 & 158 \\
158 & 159 & 161 & 161 & 162 & 161 & 159 & 158
\end{array} \right).
$$
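For illustration (our sketch), dequantisation and the inverse cosine transform can be carried out as follows; because of the rounding to integers, single entries of the result may deviate by one from the matrix $\overline{f}$ above.

```python
from math import cos, pi, sqrt

def idct8x8(Fbar):
    """Inverse 8x8 cosine transform; adding 128 undoes the level
    shift, and each entry is rounded to the closest integer."""
    c = [1 / sqrt(2)] + [1.0] * 7      # normalisation constants
    f = [[0] * 8 for _ in range(8)]
    for i in range(8):
        for j in range(8):
            s = sum(c[u] * c[v] * Fbar[u][v]
                    * cos((2 * i + 1) * u * pi / 16)
                    * cos((2 * j + 1) * v * pi / 16)
                    for u in range(8) for v in range(8))
            f[i][j] = round(0.25 * s + 128)
    return f

# Quantised block T and quantisation matrix Q of the example:
T = [[15, 0, -1, 0, 0, 0, 0, 0],
     [-2, -1, 0, 0, 0, 0, 0, 0],
     [-1, -1, 0, 0, 0, 0, 0, 0]] + [[0] * 8 for _ in range(5)]
Q = [[16, 11, 10, 16, 24, 40, 51, 61],
     [12, 12, 14, 19, 26, 58, 60, 55],
     [14, 13, 16, 24, 40, 57, 69, 56],
     [14, 17, 22, 29, 51, 87, 80, 62],
     [18, 22, 37, 56, 68, 109, 103, 77],
     [24, 35, 55, 64, 81, 104, 113, 92],
     [49, 64, 78, 87, 103, 121, 120, 101],
     [72, 92, 95, 98, 112, 100, 103, 99]]
# Dequantisation: entry-wise multiplication by Q.
Fbar = [[T[u][v] * Q[u][v] for v in range(8)] for u in range(8)]
# Fbar[0][0] == 240 and Fbar[1][0] == -24, as in the matrix above.
fbar = idct8x8(Fbar)
```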
\vspace{-8mm}
\begin{gyak}
\ujgyak
Find size and amplitude for the representation of
the integers 5, -19, and 32.
\vspace{-4mm}
\ujgyak
Write the entries of the following matrix in zig-zag order.
\[
\left( \begin{array}{cccccccc}
5 & 0 & -2 & 0 & 0 & 0 & 0 & 0 \\
3 & 1 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & -1 & 0 & 0 & 0 & 0 & 0 & 0 \\
2 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
-1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0
\end{array} \right) .
\]
How would this matrix be encoded if the difference of
the DC term to the previous one was $-2$?
\ujgyak
In our example after quantising the sequence
$(2)(3)$, $(1,2)(-2)$, $(0,1)(-1)$, $(0,1)(-1)$, $(0,1)(-1)$, $(2,1)(-1)$,
$(0,0)$ has to be encoded. Assume the Huffman codebooks
would yield $011$ to encode the size $2$ of the difference from the
preceding block's DC, $0$, $01$, and $11$ for the amplitudes $-1$, $-2$, and
$3$, respectively, and $1010$, $00$, $11011$, and $11100$ for the pairs
$(0,0)$, $(0,1)$, $(1,2)$, and $(2,1)$, respectively.
What would be the bitstream to be encoded for the $8 \times 8$ block
in our example? How many bits would hence be necessary to
compress this block?
\ujgyak
What would be matrices $T$, $\overline{F}$ and $\overline{f}$,
if we had used
$$Q=
\left( \begin{array}{cccccccc}
8 & 6 & 5 & 8 & 12 & 20 & 26 & 31 \\
6 & 6 & 7 & 10 & 13 & 29 & 30 & 28 \\
7 & 7 & 8 & 12 & 20 & 29 & 35 & 28 \\
7 & 9 & 11 & 15 & 26 & 44 & 40 & 31 \\
9 & 11 & 19 & 28 & 34 & 55 & 52 & 39 \\
12 & 18 & 28 & 32 & 41 & 52 & 57 & 46 \\
25 & 32 & 39 & 44 & 57 & 61 & 60 & 51 \\
36 & 46 & 48 & 49 & 56 & 50 & 57 & 50
\end{array} \right)
$$
for quantising\gyakindex{quantisation} after the cosine transform in the block of our example?
\ujgyak
What would be the zig-zag-code\gyakindex{zig-zag-code} in this case (assuming again
that the DC term would have difference $-3$ from the previous DC term)?
\ujgyak
For any sequence $(f(n))_{n=0, \dots ,m-1}$ define a new sequence
$(\hat{f}(n))_{n=0, \dots , 2m-1}$ by
\[ \hat{f}(n) = \left\{ \begin{array}{rl}
f(n) & \mbox{ for } n=0, \dots ,m-1, \\
f(2m-1-n) & \mbox{ for } n=m, \dots ,2m-1 \koz .
\end{array} \right. \]
This sequence can be expanded to a Fourier-series via
\[ \hat{f}(n)= \frac{1}{\sqrt{2m}} \sum_{u=0}^{2m-1} \hat{g}(u)
e^{i \frac{2\pi}{2m} nu} \quad \mbox{with }
\hat{g}(u) = \frac{1}{\sqrt{2m}} \sum_{n=0}^{2m-1} \hat{f}(n)
e^{-i \frac{2\pi}{2m} nu}, \quad i = \sqrt{-1} \koz. \]
Show how the coefficients of the discrete cosine transform
\[F(u) = c_u \sum_{n=0}^{m-1} f(n) \cos\left(\frac{(2n+1)\pi u}{2m}\right), \quad
c_u = \left\{ \begin{array}{rl}
\frac{1}{\sqrt{m}} & \mbox{ for } u=0 \\
\sqrt{\frac{2}{m}} & \mbox{ for } u \not=0
\end{array} \right. \]
arise from this Fourier series.
\end{gyak}
\begin{fld}
\ujfld{Adaptive Huffman-codes}{
Dynamic and adaptive Huffman-coding is based on the following
property. A binary code tree has the sibling property if each node
has a sibling and if the nodes can be listed in order of nonincreasing
probabilities with each node being adjacent to its sibling.
Show that a binary prefix code is a Huffman-code exactly if
the corresponding code tree has the sibling property.}
\ujfld{Generalisations of Kraft's inequality}{
In the proof of Kraft's inequality\felindex{Kraft's inequality} it is essential to order
the lengths $L(1)\leq\dots\leq L(a)$. Show that
the construction of a prefix code for given lengths $2,1,2$ is not possible
if we are not allowed to order the lengths.
This scenario of unordered lengths
occurs with the Shannon-Fano-Elias-code
and in the theory of alphabetic codes, which are related to
special search problems. Show that in this case a prefix code
with lengths $L(1)\leq\dots\leq L(a)$ exists if and only if
\[ \sum\limits_{x\in{\cal X}} 2^{-L(x)}\leq \frac12 \koz .\]
If we additionally require the prefix codes to be also suffix-free,
i. e., no codeword is the end of another one,
it is an open problem to show that Kraft's inequality holds with
the $1$ on the right--hand side replaced by 3/4, i. e.,
\[ \sum\limits_{x\in{\cal X}} 2^{-L(x)}\leq \frac34 \koz .\]}
\ujfld{Redundancy of Krichevsky-Trofimov-estimator}{
Show that using the Krichevsky-Trofimov-estimator,\felindex{Krichevsky-Trofimov-estimator} when
parameter $\theta$ of a discrete memoryless source is unknown,
the individual redundancy of sequence $x^n$
is at most $\frac12 \lg n +3$
for all sequences $x^n$ and all $\theta \in [0,1]$.}
\ujfld{Alternatives to move-to-front-codes}
{Find further procedures which like move-to-front-coding prepare
the text for compression after application of the Burrows-Wheeler-transform.}
\end{fld}
\begin{fejmegj}
The frequency table of the letters in English texts is taken
from \cite{Wel88}. The Huffman coding algorithm was introduced by
Huffman\nevindex{Huffman, David A. (1925--1999)} in
\cite{Huffman52}. A pseudocode can be found in
\cite{CormenLe2009},\nevindex{Cormen, Thomas H.}\nevindex{Leiserson, Charles E.}\nevindex{Rivest,
Ronald Lewis} where the
Huffman coding algorithm is presented as a special Greedy algorithm.
There are also adaptive or dynamic variants of Huffman-coding,
which adapt the Huffman-code if it is no
longer optimal for the actual frequency table, for the case that the
probability distribution of the source is not known in advance.
The ``3/4-conjecture'' on Kraft's inequality for fix-free-codes is due to Ahlswede, Balkenhol, and Khachatrian
\cite{ABK96}.\nevindex{Ahlswede, Rudolf}\nevindex{Balkenhol, Bernhard}\nevindex{Khachatrian, Levon (1954--2002)}
Arithmetic coding has been introduced by Rissanen\nevindex{Rissanen, J. J.} \cite{Ris76} and
Pasco\nevindex{Pasco, R.} \cite{Pas76}. For a discussion of implementation questions
see \cite{Lan84,WNC87}.
In the section on modelling
we are following the presentation of Willems,\nevindex{Willems, Frans M. J.}
Shtarkov\nevindex{Shtarkov, Yuri M.} and Tjalkens\nevindex{Tjalkens, Tjalling J.}
in \cite{WST97}. The exact calculations can be found in their original
paper \cite{WST95} which received the Best Paper Award of the IEEE
Information Theory Society in 1996. The Krichevsky-Trofimov-estimator
had been introduced in \cite{KT81}.
We presented the two original algorithms LZ77 and LZ78 \cite{ZL77, ZL78}
due to Lempel\nevindex{Lempel, Abraham} and Ziv.\nevindex{Ziv, Jacob}
Many variants, modifications and extensions have been developed
since then -- concerning the handling of the dictionary, the pointers,
the behaviour after the dictionary is complete, etc. For a description,
see, for instance, \cite{BWC89} or \cite{BWC90}. Most of the prominent
tools for data compression are variations of Ziv-Lempel-coding.
For example ``zip'' and ``gzip'' are based on LZ77 and a variant of LZ78
is used by the program ``compress''.
The Burrows-Wheeler transform was introduced in the
technical report \cite{BuWh94}. It became popular in the sequel,
especially because of the Unix compression tool ``bzip''
based on the Burrows-Wheeler-transform, which outperformed most
dictionary-based tools on several benchmark files. Also it
avoids arithmetic coding, for which patent rights have to be taken
into consideration.
Further investigations on the Burrows-Wheeler-transform
have been carried out, for instance in \cite{BaKu00,EVKV02,KuBa00}.
We only sketched the basics behind lossy image compression,
especially the preparation of the data for application of techniques
as Huffman coding. For a detailed discussion we refer to \cite{TM02},
where also the new JPEG2000 standard is described. Our example
is taken from \cite{Wal91}.
JPEG---short for Joint Photographic Experts Group---is very flexible. For instance, it also
supports lossless data compression.
The choices made in the section on image compression
are not unique. There are
models involving more basic colours and further transforms
besides the $YC_bC_r$-transform (for which even different scaling
factors for the chrominance channels were used, the formula
presented here is from \cite{TM02}). The cosine transform
may be replaced by another operation like
a wavelet transform.
Further, there is freedom to choose the quantisation matrix, responsible
for the quality of the compressed image, and the Huffman code. On the other
hand, this has the effect that these parameters have to be explicitly
specified and hence are part of the coded image.
The ideas behind procedures for video and sound compression are rather similar
to those for image compression. In principle, they follow the same
steps. The amount of data in these cases, however, is much bigger.
Again information is lost by removing irrelevant information
not perceivable by the human eye or ear (for instance
by psychoacoustic models) and by quantising, where the quality
should not be reduced significantly. More refined quantising methods
are applied in these cases.
Most information on data compression algorithms
can be found in literature on Information Theory, for instance
\cite{CT91, HK02}, since the analysis of the achievable compression rates requires
knowledge of source coding theory. Recently, there have appeared
several books on data compression, for instance \cite{BWC90, HHJ03, NG96, Sal04, Say00},
to which we refer for further reading. The benchmark
files of the Calgary Corpus and the Canterbury Corpus are available
under \cite{Cal} or \cite{Can}.
The book of I. Csiszár and J. Körner \cite{Csiszar81}\nevindex{Csiszár, Imre}\nevindex{Körner, János}
analyses different aspects of information theory including the problems of data compression too.
\end{fejmegj}