\part{APPLICATIONS}
\chapter[Bioinformatics]{Bioinformatics}\nevindex{Miklós, István}\index{bioinformatics|(}
In this chapter, we first present algorithms on sequences, trees and stochastic grammars,
then we continue with algorithms for comparing structures and for constructing
evolutionary trees, and we finish the chapter with some rarely discussed topics of
bioinformatics.
\section{Algorithms on sequences}
In this section, we are going to introduce dynamic programming algorithms working on sequences.
Sequences are finite series of characters over a finite alphabet.
The basic idea of dynamic programming is that the calculations for long sequences can be
built up from calculations on substrings of those sequences.
The algorithms introduced here are among the most important ones in bioinformatics;
they are the basis of several software packages.
\subsection{Distances of two sequences using linear gap penalty}
DNA \index{DNA} contains the information of living cells.
Before cell division, the DNA molecules are duplicated, and both daughter
cells contain one copy of the DNA.
The replication of DNA is not perfect: the stored information can be changed by
random mutations. Random mutations create variants in the population, and these variants
evolve into new species.
Given two sequences, we can ask how closely the two species are related, and
how many mutations are needed to describe the evolutionary history of the two sequences.
We suppose that mutations are independent of each other; hence, the
probability of a series of mutations is the product of the probabilities of the
individual mutations. Each mutation is associated with a weight: mutations with high
probability get a smaller weight, mutations with low probability get
a greater weight. A reasonable choice might be the logarithm of one over the
probability of the mutation. In this case the weight of a series of mutations
is the sum of the weights of the individual mutations. We also assume that
a mutation and its reverse have the same probability; therefore we study how
a sequence can be transformed into another instead of evolving two sequences
from a common ancestor. Assuming \ki{minimum evolution,}\inddef{minimum evolution}
we seek the minimum weight series of mutations that transforms
one sequence into another. An important question is how such a minimum weight
series can be found quickly. The naive algorithm finds all the possible series
of mutations and chooses the one of minimum weight. Since the number of possible series
of mutations grows exponentially -- as we are going to show in this chapter --,
the naive algorithm is obviously too slow.
We are going to introduce the Sellers' algorithm \cite{bio_sellers74}. Let $\Sigma$
be a finite set of symbols, and let $\Sigma^*$ denote the set of finite sequences over $\Sigma$.
The $n$ long prefix of $A \in \Sigma^*$ will be denoted by $A_n$, and $a_n$ denotes
the $n$th character of $A$. The following transformations can be applied for a sequence:
\begin{itemize}
\item Insertion of symbol $a$ before position $i$, denoted by $a\,\leftarrow^i\,-$.
\item Deletion of symbol $a$ at position $i$, denoted by \mbox{$-\,\leftarrow^i\,a$}.
\item Substitution of symbol $a$ to symbol $b$ at position $i$, denoted by
$b\,\leftarrow^i\,a$.
\end{itemize}
The concatenation of mutations is denoted by the $\circ$ symbol.
$\tau$ denotes the set of finite concatenations of the above mutations, and
$T(A)=B$ denotes that $T \in \tau$ transforms a sequence $A$ into sequence $B$.
Let $w:\tau \rightarrow R^+\cup\{0\}$ be a weight function such that
for any transformations $T_1$, $T_2$ and $S$ satisfying
\begin{equation}
T_1 \circ T_2 = S \koz ,
\end{equation}
the equation
\begin{equation}
w(T_1)+w(T_2) = w(S)
\end{equation}
also holds.
Furthermore, let $w(a\,\leftarrow^i\,b)$ be independent of $i$.
The transformation distance between two sequences, $A$ and $B$,
is the minimum weight of transformations transforming $A$ into $B$:
%
\begin{equation}
\delta(A,B) = \min\{w(T)\,|\,T(A) = B\} \koz .
\end{equation}
If we assume that $w$ satisfies
\begin{eqnarray}
w(a \leftarrow b) & = & w(b \leftarrow a) \koz ,\\
w(a \leftarrow a) & = & 0 \koz ,\\
w(b \leftarrow a) + w(c \leftarrow b)& \geq & w(c \leftarrow a)
\end{eqnarray}
for any $a$, $b$, $c \in \Sigma \cup \{-\}$, then the transformation
distance $\delta(\cdot,\cdot)$ is indeed a metric on $\Sigma^*$.
Since $w(\cdot,\cdot)$ satisfies the triangle inequality, it is enough to consider
transformations that change each position of a sequence at most once. Series of transformations are
depicted with \ki{sequence alignments.}\inddef{sequence alignment}
By convention, the sequence at the top is the ancestor and the sequence at the bottom is
its descendant. For example, the alignment below shows that there were substitutions
at positions three and five, there was an insertion in the first position and a deletion
in the eighth position.
\begin{center}
{\tt
- A U C G U A C A G
U A G C A U A - A G
}
\end{center}
A pair at an alignment position is called an aligned pair. The weight of the series of
transformations described by the alignment is the sum of the weights of the aligned
pairs. Each series of mutations can be described by an alignment, and this description
is unique up to the permutation of the mutations in the series. Since addition
is commutative, the weight of a series of mutations does not depend on the
order of the mutations.
We are going to show that the number of possible alignments also grows exponentially
with the length of the sequences. The alignments that do not contain this pattern
\begin{center}
{\tt
\# -
- \#
}
\end{center}
where {\tt \#} denotes an arbitrary character of $\Sigma$, form a subset of the possible alignments.
The size of this subset is ${|A| + |B| \choose |A|}$, since there is a bijection between
this set of alignments and the set of coloured sequences that contain the characters of $A$
and $B$ in increasing order, where the characters of $A$ are coloured with one colour, and the
characters of $B$ are coloured with the other colour. For example, if $|A| = |B| = n$, then
${|A| + |B| \choose |A|} = \Theta(2^{2n}/n^{0.5})$.
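This exponential growth can also be checked numerically; the following short Python sketch (our illustration, not part of the original argument) computes the binomial coefficient and compares it with $2^{2n}/\sqrt{n}$:

```python
import math

# The number of alignments of two n long sequences that avoid the excluded
# pattern is the central binomial coefficient C(2n, n) = Theta(2^(2n)/sqrt(n)).
for n in (5, 10, 20):
    count = math.comb(2 * n, n)
    print(n, count, count / (4 ** n / math.sqrt(n)))
```

The ratio in the last column tends to $1/\sqrt{\pi} \approx 0.5642$, in accordance with Stirling's formula.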
An alignment whose weight is minimal is called an \ki{optimal alignment.}\inddef{optimal alignment}
Let the set of optimal alignments of $A_i$ and $B_j$ be denoted by $\alpha^*(A_i,B_j)$, and let
$w(\alpha^*(A_i,B_j))$ denote the (common) weight of the alignments in $\alpha^*(A_i,B_j)$.
The key of the fast algorithm for finding an optimal alignment is that if we know
$w(\alpha^*(A_{i-1},B_j))$, $w(\alpha^*(A_i,B_{j-1}))$, and
$w(\alpha^*(A_{i-1},B_{j-1}))$, then we can calculate $w(\alpha^*(A_i,B_j))$
in constant time. Indeed, if we delete the last aligned pair of an optimal alignment of
$A_i$ and $B_j$, we get an optimal alignment of $A_{i-1}$ and $B_j$,
of $A_i$ and $B_{j-1}$, or of $A_{i-1}$ and $B_{j-1}$, depending on whether the last
aligned column depicts a deletion, an insertion, or a substitution or match, respectively.
Hence,
%
\begin{eqnarray}
w(\alpha^*(A_i,B_j)) =&\!\!\!\!\!\! \min\{ w(\alpha^*(A_{i-1},B_j))+w(- \leftarrow a_i); &\nonumber \\
&\,\,\,\,\,\,\,\; w(\alpha^*(A_i,B_{j-1}))+w(b_j \leftarrow -); &\label{eq:bio_main_basic}\\
&\,\,\,\,\;\;\;\;\;\; w(\alpha^*(A_{i-1},B_{j-1}))+w(b_j \leftarrow a_i)\} .&
\nonumber
\end{eqnarray}
%
The weights of optimal alignments are calculated in the
so-called \ki{dynamic programming table,}\inddef{dynamic programming table} $D$. The $d_{i,j}$ element of $D$ contains $w(\alpha^*(A_i,B_j))$.
Comparing an $n$ long and an $m$ long sequence requires the fill-in of an
$(n+1)\times(m+1)$ table, whose rows and columns are indexed from $0$ to $n$ and $m$, respectively.
The initial conditions for column $0$ and row $0$ are
\begin{eqnarray}
d_{0,0} & = & 0, \label{eq:bio_00}\\
d_{i,0} & = & \sum_{k=1}^i w(- \leftarrow a_k) \koz , \label{eq:bio_oszlop}\\
d_{0,j} & = & \sum_{l=1}^j w(b_l \leftarrow -) \koz . \label{eq:bio_sor}
\end{eqnarray}
\newpage
The table can be filled in using Equation~(\ref{eq:bio_main_basic}).
The time requirement for the fill-in is $\Theta(nm)$. After filling in
the dynamic programming table, the set of all optimal alignments can be find in the
following way, called \ki{trace-back.}\inddef{trace-back} We go from the right bottom
corner to the left top corner choosing the cell(s) giving the optimal value of
the current cell (there might be more than one such cells). Stepping up from
position $d_{i,j}$ means a deletion, stepping to the left means an insertion, and
the diagonal steps mean either a substitution or a match depending on
whether or not $a_i = b_j$. Each step is represented with an oriented edge; in this way,
we get an oriented graph whose vertices are a subset of the cells of the dynamic
programming table. The number of optimal alignments might grow exponentially
with the length of the sequences; however, the set of optimal alignments can be represented
in polynomial time and space. Indeed, each path from $d_{n,m}$ to $d_{0,0}$
on the oriented graph obtained in the trace-back gives an optimal alignment.
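The fill-in and the trace-back described above can be sketched in Python as follows; the unit substitution weight and the gap penalty $g = 1$ are our illustrative choices, and the trace-back below returns a single optimal alignment instead of the whole graph:

```python
# Sketch of the fill-in and the trace-back with unit substitution weight
# and linear gap penalty g = 1 (an illustrative weight function).

def global_alignment(A, B, gap=1, sub=lambda a, b: 0 if a == b else 1):
    n, m = len(A), len(B)
    # d[i][j] holds w(alpha*(A_i, B_j))
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * gap
    for j in range(1, m + 1):
        d[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + gap,                      # deletion
                          d[i][j - 1] + gap,                      # insertion
                          d[i - 1][j - 1] + sub(A[i - 1], B[j - 1]))
    # trace-back from d[n][m] to d[0][0]; returns one optimal alignment
    top, bottom, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + sub(A[i - 1], B[j - 1]):
            top.append(A[i - 1]); bottom.append(B[j - 1]); i -= 1; j -= 1
        elif i > 0 and d[i][j] == d[i - 1][j] + gap:
            top.append(A[i - 1]); bottom.append('-'); i -= 1
        else:
            top.append('-'); bottom.append(B[j - 1]); j -= 1
    return d[n][m], ''.join(reversed(top)), ''.join(reversed(bottom))

dist, top, bottom = global_alignment("AUCGUACAG", "AUGAUACAG")
print(dist)   # the transformation distance of the two sequences
```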
\subsection{Dynamic programming with arbitrary gap function}
Since deletions and insertions get the same weight, their common name
is indel or \ki{gap,}\inddef{gap} and their weight is called the \ki{gap penalty.}
\inddef{gap penalty} Usually gap penalties do not depend on the deleted or
inserted characters. The gap penalties used in the previous section grow linearly
with the length of the gap. This means that a long indel is considered as the result
of independent insertions or deletions of characters. However, the biological
observation is that long indels can be formed in one evolutionary step, and these long
indels are penalised too much with the linear gap penalty function. This observation
motivated the introduction of more complex gap penalty functions \cite{bio_wsb76}.
The algorithm introduced by Waterman {\it et al.} penalises a $k$ long gap with $g_k$.
For example the weight of this alignment:
\begin{center}
{\tt
- - A U C G A C G U A C A G
U A G U C - - - A U A G A G
}
\end{center}
is $g_2+w(G \leftarrow A)+g_3+w(A \leftarrow G)+w(G \leftarrow C)$.
We are still seeking the minimal weight series of transformations transforming one sequence
into another, or equivalently, an optimal alignment. Since there might be a long indel at the
end of the optimal alignment, besides knowing $w(\alpha^*(A_{i-1},B_{j-1}))$, we must know all
$w(\alpha^*(A_k,B_j))$, $0 \leq k < i$ and $w(\alpha^*(A_i,B_l))$, $0 \leq l < j$ to calculate
$w(\alpha^*(A_i,B_j))$. The dynamic programming recursion is given by the following equations:
%
\begin{eqnarray}
w(\alpha^*(A_i,B_j)) = & \min\{w(\alpha^*(A_{i-1},B_{j-1}))+ w(b_j \leftarrow a_i) \koz ; & \nonumber \\
& \,\,\,\,\,\,\,\min_{0 \leq k < i}\{w(\alpha^*(A_k,B_j))+g_{i-k}\} \koz ; &\\
& \,\,\,\,\,\,\,\min_{0 \leq l < j}\{w(\alpha^*(A_i,B_l))+g_{j-l}\}\} \koz . & \nonumber
\end{eqnarray}
%
The initial conditions are:
%
\begin{eqnarray}
d_{0,0} & = & 0 \koz , \\
d_{i,0} & = & g_i \koz , \\
d_{0,j} & = & g_j \koz .
\end{eqnarray}
%
The time requirement for calculating $d_{i,j}$ is $\Theta(i+j)$, hence the running time of the
fill-in part to calculate the weight of an optimal alignment is $\Theta(nm(n+m))$. Similarly
to the previous algorithm, the set of optimal alignments represented by paths from $d_{n,m}$
to $d_{0,0}$ can be found in the trace-back part.
If $|A| =|B| = n$, then the running time of this algorithm is $\Theta(n^3)$. With restrictions
on the gap penalty function, the running time can be decreased. We are going to show two such
algorithms in the next two sections.
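The cubic running time recursion can be sketched as follows; the concrete gap penalty function $g(k) = 2 + k$ and the unit substitution weight are our illustrative choices:

```python
# Sketch of the cubic running time recursion with an arbitrary gap penalty
# function; g(k) = 2 + k and the unit substitution weight are illustrative.

def align_arbitrary_gap(A, B, g=lambda k: 2 + k,
                        sub=lambda a, b: 0 if a == b else 1):
    n, m = len(A), len(B)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = g(i)
    for j in range(1, m + 1):
        d[0][j] = g(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = d[i - 1][j - 1] + sub(A[i - 1], B[j - 1])
            for k in range(i):          # gap ending at a_i, length i-k
                best = min(best, d[k][j] + g(i - k))
            for l in range(j):          # gap ending at b_j, length j-l
                best = min(best, d[i][l] + g(j - l))
            d[i][j] = best
    return d[n][m]

print(align_arbitrary_gap("AAAA", "AA"))  # one length-2 gap (4) beats two length-1 gaps (6)
```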
\subsection{Gotoh algorithm for affine gap penalty}
A gap penalty function is \ki{affine}\inddef{affine function} if
\begin{equation}
g_k = uk+v, \quad u \geq 0,\,\,\, v \geq 0 \koz .
\end{equation}
There exists a $\Theta(nm)$ running time algorithm for affine gap penalty \cite{bio_gotoh82}.
Recall that in the Waterman algorithm,
\begin{equation}
d_{i,j} = \min\{d_{i-1,j-1} + w(b_j \leftarrow a_i);\,\, p_{i,j};\,\, q_{i,j}\} \koz ,
\end{equation}
where
\begin{eqnarray}
p_{i,j} & = & \min_{1 \leq k \leq i}\{d_{i-k,j} + g_k\} \koz , \\
q_{i,j} & = & \min_{1 \leq l \leq j}\{d_{i,j-l} + g_l\} \koz .
\end{eqnarray}
The key of the Gotoh algorithm is the following reindexing
\begin{eqnarray}
p_{i,j} & = & \min\{d_{i-1,j} + g_1,\,\, \min_{2 \leq k \leq i}\{d_{i-k,j} +g_k\}\} \nonumber \\
& = & \min\{d_{i-1,j} + g_1,\,\, \min_{1 \leq k \leq i - 1}\{d_{i-1-k,j} + g_{k+1}\}\} \nonumber \\
& = & \min\{d_{i-1,j} + g_1,\,\, \min_{1 \leq k \leq i - 1}\{d_{i-1-k,j} + g_k\} + u\} \nonumber \\
& = & \min\{d_{i-1,j} + g_1,\,\, p_{i-1,j} + u\} \koz .
\end{eqnarray}
And similarly
\begin{equation}
q_{i,j} = \min\{d_{i,j-1} + g_1,\,\, q_{i,j-1} + u\} \koz .
\end{equation}
In this way, $p_{i,j}$ and $q_{i,j}$ can be calculated in constant time, and hence so can $d_{i,j}$.
Thus, the running time of the algorithm remains $\Theta(nm)$, and the algorithm is only
a constant factor slower than the dynamic programming algorithm for linear gap penalties.
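A sketch of the Gotoh recursions in Python (the penalty parameters are illustrative); $p$ and $q$ store the weights of the best alignments ending with a gap in $B$ and in $A$, respectively:

```python
# Sketch of the Gotoh recursions for the affine gap penalty g_k = u*k + v;
# p and q hold the best values ending with a gap in B and in A, respectively.

def gotoh(A, B, u=1, v=2, sub=lambda a, b: 0 if a == b else 1):
    n, m = len(A), len(B)
    INF = float('inf')
    g = lambda k: u * k + v
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    p = [[INF] * (m + 1) for _ in range(n + 1)]
    q = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0
    for i in range(1, n + 1):
        d[i][0] = p[i][0] = g(i)
    for j in range(1, m + 1):
        d[0][j] = q[0][j] = g(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            p[i][j] = min(d[i - 1][j] + g(1), p[i - 1][j] + u)
            q[i][j] = min(d[i][j - 1] + g(1), q[i][j - 1] + u)
            d[i][j] = min(d[i - 1][j - 1] + sub(A[i - 1], B[j - 1]),
                          p[i][j], q[i][j])
    return d[n][m]
```

With $v = 0$ the affine penalty reduces to the linear one, so the result agrees with that of the basic algorithm.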
\subsection{Concave gap penalty}
There is no biological justification for the affine gap penalty function \cite{bio_bcg93,bio_gcb92},
its wide-spread use (for example, CLUSTAL-W \cite{bio_thg94}) is due to its low running time.
There is a more realistic gap penalty function for which an algorithm exists whose running time
is slightly more than the running time for affine gap penalty, but it is still significantly better
than the cubic running time algorithm of Waterman {\it et al.} \cite{bio_gg89,bio_mm88}.
A gap penalty function is concave if for each $i$, $g_{i+1} - g_i \leq g_i - g_{i-1}$,
namely, the increments of the gap penalty decrease as the gap gets longer. Such a function
might start decreasing after a given point; to avoid this, it is usually assumed that the function
increases monotonically. Based on empirical data \cite{bio_bcg93}, if two sequences evolved
for $d$ PAM units \cite{bio_dso78}, the weight of a $q$ long indel is
%
\begin{equation}
35.03 - 6.88 \log_{10} d + 17.02 \log_{10} q \koz ,
\end{equation}
which is also a concave function. (One PAM unit is the time span on which 1\% of the sequence changed.)
There exists an $O(nm(\log n +\log m))$ running time algorithm for concave gap penalty functions,
based on a so-called \ki{forward looking}\inddef{forward looking algorithm} technique.
The \textsc{Forward-Looking} algorithm calculates the $i$th row of the dynamic programming table
in the following way, for an arbitrary gap penalty function:
\begin{algN}{Forward-Looking}\inddef{\textsc{Forward-Looking}}
1 \' \key{for} \= $1 \leq j \leq m$ \\
2 \' \> $q_1[i,j] \leftarrow d[i,0]+g[j]$\\
3 \' \> $b[i,j] \leftarrow 0$\\
4 \' \key{for} \= $1 \leq j \leq m$\\
5 \' \> $q[i,j] \leftarrow q_1[i,j]$\\
6 \' \> $d[i,j] \leftarrow \min[q[i,j],p[i,j],d[i-1,j-1]+w(b_j \leftarrow a_i)]$\\
7 \' \` $\rhd$ At this step, we suppose that $p[i,j]$ and $d[i-1,j-1]$ are already calculated.\\
8 \' \> \key{for} \= $j < j_1 \leq m$ \` $\rhd$ Inner cycle.\\
9 \' \> \> \key{if} \= $q_1[i,j_1] > d[i,j]+g[j_1-j]$\\
10 \' \> \> \> \key{then} $q_1[i,j_1] \leftarrow d[i,j]+g[j_1-j]$\\
11 \' \> \> \> $b[i,j_1] \leftarrow j$
\end{algN}
\noindent where $g[\,]$ is the gap penalty function and $b$ is a pointer whose role will be described later.
In row $6$, we assume that we already calculated $p[i,j]$ and $d[i-1,j-1]$.
It is easy to show that the forward looking algorithm makes the same comparisons as the
traditional, backward looking algorithm, but in a different order.
While the backward looking algorithm calculates $q_{i,j}$ at the $j$th position of the row looking back
to the already calculated entries of the dynamic programming table, the \textsc{Forward-Looking}
algorithm has already calculated $q_{i,j}$ by arriving to the $j$th position of the row.
On the other hand, it sends forward candidate values for $q[i,j_1]$, $j_1 > j$, and by arriving to
cell $j_1$, all the needed comparisons of candidate values have been made.
Therefore, the \textsc{Forward-Looking} algorithm is not faster than the traditional backward
looking algorithm; however, this conception helps to accelerate the algorithm.
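The two orders can be compared on a single row with given $d$ values (in the real algorithm these are computed on the fly); the values and the toy concave gap penalty below are our illustration:

```python
# Backward looking computation of q[j] = min_{j' < j} (d[j'] + g(j - j')),
# and the forward looking version that propagates candidate values ahead.

def backward_q(d_row, g):
    m = len(d_row) - 1
    return [min(d_row[jp] + g(j - jp) for jp in range(j))
            for j in range(1, m + 1)]

def forward_q(d_row, g):
    m = len(d_row) - 1
    q1 = [d_row[0] + g(j) for j in range(1, m + 1)]  # candidates from cell 0
    for j in range(1, m):
        for j1 in range(j + 1, m + 1):               # inner cycle
            cand = d_row[j] + g(j1 - j)
            if cand < q1[j1 - 1]:
                q1[j1 - 1] = cand
    return q1

g = lambda k: 35.0 + 17.0 * k ** 0.5                 # a toy concave gap penalty
d_row = [0, 5, 3, 9, 4, 7]
print(backward_q(d_row, g) == forward_q(d_row, g))   # True: the same minima
```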
The key idea is the following.
\begin{lemma} Let $j$ be the current cell in the row. If
\begin{equation}
d_{i,j} + g_{j_1-j} \geq q_1[i,j_1] \koz ,
\end{equation}
then for all $j_2>j_1$
\begin{equation}
d_{i,j} + g_{j_2 - j} \geq q_1[i,j_2] \koz .
\end{equation}
\end{lemma}
\begin{biz} From the condition it follows that there is a $j' < j$ for which
$q_1[i,j_1] = d[i,j'] + g_{j_1-j'}$. Since the gap penalty function is concave and $j' < j$,
for all $j_2 > j_1$
\begin{equation}
g_{j_2-j} - g_{j_1-j} \geq g_{j_2-j'} - g_{j_1-j'} \koz .
\end{equation}
Adding this inequality to the condition $d_{i,j} + g_{j_1-j} \geq d[i,j'] + g_{j_1-j'}$, we get
\begin{equation}
d_{i,j} + g_{j_2-j} \geq d[i,j'] + g_{j_2-j'} \geq q_1[i,j_2] \koz .
\end{equation}
\end{biz}
The lemma implies that the cells of a row can be partitioned into blocks: inside a block,
every cell gets its best candidate value from the same cell, the pointer of the block.
For row $i$, let $pn[i]$ denote the actual number of blocks, let $p[i,X]$ denote the end
position of block $X$, and let $b[i,X]$ denote its pointer; block $pn[i]$ contains the current
cell, and block $0$ ends at position $m$.
\begin{algN}{Forward-Looking-Binary-Searching}\inddef{\textsc{Forward-Looking-Binary-Searching}}
1 \' $pn[i] \leftarrow 0$; $b[i,0] \leftarrow 0$; $p[i,0] \leftarrow m$\\
2 \' \key{for} \= $1 \leq j \leq m$\\
3 \' \> \key{do} \= $q[i,j] \leftarrow d[i,b[i,pn[i]]] + g[j-b[i,pn[i]]]$\\
4 \' \> \> $d[i,j] \leftarrow \min[q[i,j],p[i,j],d[i-1,j-1]+w(b_j \leftarrow a_i)]$\\
5 \' \` $\rhd$ At this step, we suppose that $p[i,j]$ and $d[i-1,j-1]$ are already calculated.\\
6 \' \> \> \key{if} \= $p[i,pn[i]] = j$\\
7 \' \> \> \> \key{then} $pn[i] \leftarrow pn[i]-1$\\
8 \' \> \> \key{if} \= $j+1 \leq m$ \key{and} $d[i,b[i,0]] + g[m-b[i,0]] > d[i,j] + g[m-j]$\\
9 \' \> \> \> \key{then} $pn[i] \leftarrow 0$; $b[i,0] \leftarrow j$\\
10 \' \> \> \> \key{else if} \= $j+1 \leq m$ \key{and} $d[i,j] + g[1] < d[i,b[i,pn[i]]] + g[j+1-b[i,pn[i]]]$\\
11 \' \> \> \> \> \key{then} \= $Y \leftarrow \max_{0\leq X \leq pn[i]} \{X \,|\, d[i,b[i,X]] +g[p[i,X]$ \\
\' \> \> \> \> \> $ - b[i,X]] \leq d[i,j] + g[p[i,X] -j] \}$\\
12 \' \> \> \> \> \key{if} \= $d[i,b[i,Y]] +g[p[i,Y] - b[i,Y]] = d[i,j] + g[p[i,Y] -j]$\\
13 \' \> \> \> \> \> \key{then} \= $pn[i] \leftarrow Y$; $b[i,Y] \leftarrow j$\\
14 \' \> \> \> \> \> \key{else} \> $E \leftarrow p[i,Y]$\\
15 \' \> \> \> \> \> \> \key{if} \= $Y < pn[i]$\\
16 \' \> \> \> \> \> \> \> \key{then} \= $B \leftarrow p[i,Y+1] - 1$\\
17 \' \> \> \> \> \> \> \> \key{else} \> $B \leftarrow j+1$\\
18 \' \> \> \> \> \> \> $pn[i] \leftarrow Y+1$\\
19 \' \> \> \> \> \> \> $b[i,pn[i]] \leftarrow j$\\
20 \' \> \> \> \> \> \> $p[i,pn[i]]$ \= $\leftarrow \max_{B \leq X \leq E} \{X\,|\,d[i,j] + g[X-j] \leq$ \\
\' \> \> \> \> \> \> \> $ d[i,b[i,Y]]+g[X-b[i,Y]]\}$
\end{algN}
The algorithm works in the following way: for each row, we maintain a variable storing the
number of blocks, a list of positions of block ends, and a list of pointers for each block.
For each cell $j$, the algorithm finds, using binary search, the last position for which the cell
gives an optimal value. There is first a binary search for the blocks, then for the positions inside
the chosen block. It is enough to rewrite three values after the binary searches: the number of
blocks, the end of the last block and its pointer. Therefore the most time consuming part is the
binary search, which takes $O(\lg m)$ time for each cell.
We do the same for the columns. If the dynamic programming table is filled in row by row, then for
each position $j$ in row $i$, the algorithm uses the block system of column $j$. Therefore the
running time of the algorithm is $O(nm (\lg n + \lg m))$.
\subsection{Similarity of two sequences, the Smith-Waterman algorithm}
We can measure not only the distance but also the similarity of two sequences.
For measuring the similarity of two characters, $S(a,b)$, the most frequently used
function is the \ki{log-odds}\inddef{log-odds}:
\begin{equation}
S(a,b) = \log\left(\frac{\prob{a,b}}{\prob{a}\prob{b}}\right) \koz ,
\end{equation}
where $\prob{a,b}$ is the joint probability of the two characters (namely, the probability of
observing them together in an alignment column), $\prob{a}$ and $\prob{b}$ are the
marginal probabilities. The similarity is positive if $\prob{a,b} > \prob{a}\prob{b}$,
otherwise negative. Similarities are obtained from empirical data; for amino acids, the most
commonly used similarities are given by the PAM and BLOSUM matrices.
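For example, with made-up probabilities (not actual PAM or BLOSUM data):

```python
import math

# Log-odds similarity of two characters; the probabilities below are
# made-up illustrative numbers, not actual PAM or BLOSUM data.
def log_odds(p_ab, p_a, p_b):
    return math.log(p_ab / (p_a * p_b))

print(log_odds(0.04, 0.1, 0.1) > 0)    # observed together more often: positive
print(log_odds(0.005, 0.1, 0.1) < 0)   # observed together less often: negative
```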
If we penalise gaps with negative numbers, then the above described global alignment
algorithms work with similarities by changing minimisation to maximisation.
It is possible to define a special problem that works for similarities and does not
work for distances. It is the local similarity problem or the local sequence alignment
problem \cite{bio_sw81}. Given two sequences, a similarity and a gap penalty function,
the problem is to give two substrings of the sequences whose similarity is maximal.
A \ki{substring}\inddef{substring} of a sequence is a consecutive part of the sequence. The biological motivation
of the problem is that some parts of biological sequences evolve slowly, while other
parts evolve fast. The local alignment finds the most conserved part of the two sequences.
Local alignment is widely used for homology searching in databases. The reason why
local alignment works well for homology searching is that the local alignment score
can separate homologous and non-homologous sequences better, since the score is not
diminished by the variable regions of the sequences.
The Smith-Waterman algorithm works in the following way. The initial conditions are:
\begin{equation}
d_{0,0} = d_{i,0} = d_{0,j} = 0 \koz .
\end{equation}
Considering linear gap penalty, the dynamic programming table is filled in using the
following recursions:
\begin{equation}
d_{i,j} = \max\{0;\, d_{i-1,j-1} + S(a_i,b_j);\, d_{i-1,j} + g;\, d_{i,j-1} + g\} \koz .\label{eq:bio_main_SW}
\end{equation}
Here the gap penalty $g$ is a negative number. The best local similarity score of the two
sequences is the maximal number in the table. The trace-back starts in the cell containing the
maximal number, and ends when it first reaches a cell containing $0$.
It is easy to prove that the alignment obtained in the trace-back will be locally optimal:
if the alignment could be extended at the end with a sub-alignment whose similarity is positive
then there would be a greater number in the dynamic programming table. If the alignment
could be extended at the beginning with a subalignment having positive similarity then
the value at the end of the traceback would not be $0$.
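The recursion can be sketched as follows; the match and mismatch scores and the gap penalty $g = -1$ are illustrative choices:

```python
# Sketch of the Smith-Waterman fill-in; the similarity scores (match 2,
# mismatch -1) and the gap penalty g = -1 are illustrative choices.

def smith_waterman(A, B, g=-1, S=lambda a, b: 2 if a == b else -1):
    n, m = len(A), len(B)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = max(0,
                          d[i - 1][j - 1] + S(A[i - 1], B[j - 1]),
                          d[i - 1][j] + g,
                          d[i][j - 1] + g)
            best = max(best, d[i][j])
    return best

print(smith_waterman("AAAC", "CAAA"))   # the shared substring AAA scores 3 * 2 = 6
```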
\subsection{Multiple sequence alignment}
The multiple sequence alignment problem was introduced by David Sankoff
\cite{bio_sankoff75}, and by today, multiple sequence alignment has become a central
problem of bioinformatics. Dan Gusfield calls it the Holy Grail of bioinformatics
\cite{bio_gusfield97}. Multiple alignments are widespread both in searching databases and
inferring evolutionary relationships. Using multiple alignments, it is possible to find
the conserved parts of a sequence family, the positions that describe the functional
properties of the sequence family. As Arthur Lesk said \cite{bio_hlt96}:
\emph{What two sequences whisper, a multiple sequence alignment shouts out loud.}
A column of a multiple alignment of $k$ sequences is called an aligned $k$-tuple.
The dynamic programming for the optimal multiple alignment is the generalisation of
the dynamic programming for optimal pairwise alignment. To align $k$ sequences, we have to
fill in a $k$ dimensional dynamic programming table. To calculate an entry in this table
using linear gap penalty, we have to look back to a $k$ dimensional hypercube.
Therefore, the memory requirement for $k$ sequences, each of length $n$, is $\Theta(n^k)$,
and the running time of the algorithm is $\Theta(2^k n^k)$ if we use linear gap penalty,
and $\Theta(n^{2k - 1})$ with arbitrary gap penalty.
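The $k$-dimensional fill-in can be sketched as follows; the sum-of-pairs scoring of the aligned $k$-tuples and the unit costs are our choices for the illustration, and the sketch is practical only for very short sequences, in accordance with the running times above:

```python
import itertools

# Sketch of the k dimensional fill-in with linear gap penalty; an aligned
# k-tuple is scored by the sum of its pairwise costs (our choice here).

def multi_align_cost(seqs, gap=1, sub=lambda a, b: 0 if a == b else 1):
    k = len(seqs)

    def column_cost(column):
        cost = 0
        for x in range(k):
            for y in range(x + 1, k):
                a, b = column[x], column[y]
                if a != '-' or b != '-':
                    cost += gap if '-' in (a, b) else sub(a, b)
        return cost

    d = {(0,) * k: 0}
    for idx in itertools.product(*(range(len(s) + 1) for s in seqs)):
        if idx == (0,) * k:
            continue
        best = float('inf')
        # look back along every nonzero 0/1 direction of the hypercube
        for step in itertools.product((0, 1), repeat=k):
            if not any(step) or any(i < s for i, s in zip(idx, step)):
                continue
            prev = tuple(i - s for i, s in zip(idx, step))
            column = tuple(seqs[x][idx[x] - 1] if step[x] else '-'
                           for x in range(k))
            best = min(best, d[prev] + column_cost(column))
        d[idx] = best
    return d[tuple(len(s) for s in seqs)]
```

For $k = 2$ the sum-of-pairs score reduces to the pairwise distance of the previous sections.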
There are two fundamental problems with the multiple sequence alignment. The first is an
algorithmic problem: it is proven that the multiple sequence alignment problem is NP-complete
\cite{bio_wj94}. The other problem is methodical: it is not clear how to score a multiple
alignment. An objective scoring function could be given only if the evolutionary relationships
were known, in this case an aligned $k$-tuple could be scored according to an evolutionary tree
\cite{bio_ph98}.
A heuristic solution for both problems
is the \ki{iterative sequence alignment}\inddef{iterative sequence alignment}
\cite{bio_fd87},\cite{bio_corpet88},\cite{bio_thg94}. This method first constructs
a \ki{guide-tree}\inddef{guide-tree} using pairwise distances (such tree-building methods
are described in section~\ref{section:bio_torzsfa}). The guide-tree is then used to construct
the multiple alignment. Each leaf is labelled with a sequence; first the sequences in
``cherry motifs'' are aligned to each other, then sequence alignments are aligned
to sequences and to other sequence alignments according to the guide-tree. The iterative sequence alignment
method uses the ``once a gap -- always gap'' rule. This means that gaps already placed into
an alignment cannot be modified when aligning the alignment to another alignment or sequence.
The only possibility is to insert all-gap columns into an alignment. The aligned sequences are
usually described with a \ki{profile.}\inddef{profile} The profile is a $(|\Sigma|+1)\times l$
table, where $l$ is the length of the alignment. A column of the profile contains the statistics of
the corresponding aligned $k$-tuple: the frequencies of the characters and of the gap symbol.
The obtained multiple alignment can be used for constructing another guide-tree, which can be used
for another iterative sequence alignment, and this procedure can be iterated till convergence.
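A profile can be computed, for example, as follows (the small alignment is a toy example; the table is stored column by column):

```python
from collections import Counter

# Sketch of a profile: for each column of the alignment, the frequencies of
# the characters of the alphabet and of the gap symbol (a toy example).

def profile(alignment, alphabet="ACGU"):
    k = len(alignment)
    symbols = list(alphabet) + ['-']
    prof = []
    for j in range(len(alignment[0])):
        counts = Counter(row[j] for row in alignment)
        prof.append([counts[s] / k for s in symbols])
    return symbols, prof

symbols, prof = profile(["AU-G", "AUCG", "A-CG"])
print(prof[0])   # the first aligned 3-tuple is 'AAA': frequency 1 for 'A'
```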
The rationale behind the iterative alignment heuristic is that the optimal pairwise alignments
of closely related sequences coincide with their pairwise alignments within the optimal
multiple alignment. The drawback of
the heuristic is that even if the previous assumption is true, there might be several optimal
alignments for two sequences, and their number might grow exponentially with the length of the
sequences. For example, let us consider the two optimal alignments of the sequences {\tt AUCGGUACAG}
and {\tt AUCAUACAG}.
\begin{center}
{\tt
A U C G G U A C A G\,\,\,\,\,\,\,\,\,\,\,A U C G G U A C A G
A U C - A U A C A G\,\,\,\,\,\,\,\,\,\,\,A U C A - U A C A G \koz .
}
\end{center}
\noindent We cannot choose between the two alignments, however, in a multiple alignment, only one
of them might be optimal. For example, if we align the sequence {\tt AUCGAU} to the two
optimal alignments, we get the following locally optimal alignments:
\begin{center}
{\tt
A U C G G U A C A G\,\,\,\,\,\,\,\,\,\,\,A U C G G U A C A G
A U C - A U A C A G\,\,\,\,\,\,\,\,\,\,\,A U C A - U A C A G
A U C G A U - - - -\,\,\,\,\,\,\,\,\,\,\,A U C - G - A U - -
}
\end{center}
\noindent The left alignment is globally optimal, however, the right alignment is
only locally optimal.
Hence, the iterative alignment method yields only a locally optimal alignment.
Another problem of this method is that it does not give an upper bound for the
goodness of the approximation. In spite of this drawback, iterative alignment methods are
the most widely used ones for multiple sequence alignment in practice, since they are fast and
usually give biologically reasonable alignments. Recently some approximation methods for multiple
sequence alignment have been published with known upper bounds for their goodness
\cite{bio_gusfield93,bio_rk98}. However, the bounds are biologically not reasonable,
and in practice, these methods usually give worse results than the heuristic methods.
We must mention a novel greedy method that is not based on dynamic programming.
The DiAlign method \cite{bio_morgenstern99,bio_mdw96,bio_mfdw98} first searches
for gap-free homologous substrings by pairwise sequence comparison. The gap-free alignments
of the homologous substrings are called diagonals of the dynamic programming table, hence the
name of the method: Diagonal Alignment. The diagonals are scored according to their similarity
value, and diagonals that are not compatible with high-score diagonals get a penalty. Two diagonals
are not compatible if they cannot be in the same alignment. After scoring the diagonals, they are
aligned together into a multiple alignment in a greedy way. First the best diagonal is selected, then
the best diagonal that is compatible with the first one, then the next best diagonal that is
compatible with the first two ones, etc. The multiple alignment is the union of the
selected diagonals, which might not cover all the characters in the sequences. Those characters
that are not in any of the selected diagonals are marked as ``non-alignable''. The drawback
of the method is that sometimes it introduces too many gaps, due to not penalising gaps at all.
However, DiAlign has been one of the best heuristic alignment approaches and is widely used
in the bioinformatics community.
\subsection{Memory-reduction with the Hirschberg algorithm}
If we want to calculate only the distance or similarity between two sequences and we
are not interested in an optimal alignment, then in case of linear or affine gap penalties,
it is very easy to construct an algorithm that uses only linear memory. Indeed, note that
the dynamic programming recursion needs only the previous row (in case of filling in the
dynamic table by rows), and the algorithm does not need to store earlier rows. On the other hand,
once the dynamic programming algorithm has reached the last row and the earlier rows have been
discarded, it is not possible to trace-back the optimal alignment. If the dynamic programming table is scrolled
again and again in linear memory to trace-back the optimal alignment row by row, then the
running time grows to $O(n^3)$, where $n$ is the length of the sequences.
However, it is possible to design an algorithm that obtains an optimal alignment in $O(n^2)$
running time and uses only linear memory. This
is the \ki{Hirschberg algorithm}\inddef{Hirschberg algorithm}
\cite{bio_hirschberg75},\nevindex{Hirschberg, D. S.} which we are going to introduce for
distance-based alignment with linear gap penalty.
We introduce the suffixes of a sequence; a suffix is a substring ending at the end of the sequence.
Let $A^k$ denote the suffix of $A$ starting with character $a_{k+1}$.
The Hirschberg algorithm first performs the dynamic programming algorithm for sequences
$A_{\left[|A|/2\right]}$ and $B$ using linear memory as described above.
Similarly, it does a dynamic programming algorithm for the reverse of the sequences
$A^{\left[|A|/2\right]}$ and $B$.
Based on the two dynamic programming procedures, we know the score of the optimal
alignment of $A_{\left[|A|/2\right]}$ and an arbitrary prefix of $B$, and similarly the
score of the optimal alignment of $A^{\left[|A|/2\right]}$ and an arbitrary suffix of $B$.
From this we can tell the score of the optimal alignment of $A$ and $B$:
%
\begin{equation}
\min_j\left\{w(\alpha^*(A_{[|A|/2]},B_j)) + w(\alpha^*(A^{[|A|/2]},B^j))\right\} \koz ,
\label{eq:bio_Hirschberg}
\end{equation}
and from this calculation it must be clear that in the optimal alignment of $A$ and $B$,
$A_{\left[|A|/2\right]}$ is aligned with the prefix $B_j$ for which
\begin{equation}
w(\alpha^*(A_{[|A|/2]},B_j)) + w(\alpha^*(A^{[|A|/2]},B^j))
\end{equation}
is minimal.
Since we also know the previous rows of the two dynamic programming tables, we can tell whether
$a_{[|A|/2]}$ and $a_{[|A|/2]+1}$ are aligned with characters of $B$ or whether these characters
are deleted in the optimal alignment. Similarly, we can tell whether any character of $B$ is
inserted between $a_{[|A|/2]}$ and $a_{[|A|/2]+1}$.
In this way, we get at least two columns of the optimal alignment. Then we do the same for
$A_{[|A|/2]-1}$ and the remaining part of the prefix of $B$, and for $A^{[|A|/2]+1}$ and
the remaining part of the suffix of $B$. In this way we get alignment columns at the quarter
and the three fourths of sequence $A$. In the next iteration, we do the same for the four
pairs of sequences, etc., and we iterate until we get all the alignment columns of
the optimal alignment.
Obviously, the memory requirement still only grows linearly with the length of the sequences.
We show that the running time is still $\Theta(n m)$, where $n$ and $m$ are the lengths of the
sequences. This comes from the fact that the running time of the first iteration is
$|A| \times |B|$, the running time of the second iteration is
$(|A|/2) \times j^* + (|A|/2) \times (|B| - j^*)$, where $j^*$ is the position giving the minimum
in Eqn.~(\ref{eq:bio_Hirschberg}). Hence the total running time is:
\begin{equation}
n m\times \left(1 + \frac{1}{2} + \frac{1}{4} + \cdots \right) = \Theta(n m) \koz .
\end{equation}
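The linear-memory score computation and the selection of the middle split point in Equation~(\ref{eq:bio_Hirschberg}) can be sketched as follows, with unit costs as our illustrative choice:

```python
# Sketch: score of the optimal alignment in linear memory, and the middle
# split of the Hirschberg algorithm; unit costs are an illustrative choice.

def last_row(A, B, gap=1, sub=lambda a, b: 0 if a == b else 1):
    # returns [ w(alpha*(A, B_j)) for j = 0..|B| ], keeping only one row
    prev = [j * gap for j in range(len(B) + 1)]
    for i in range(1, len(A) + 1):
        cur = [i * gap]
        for j in range(1, len(B) + 1):
            cur.append(min(prev[j] + gap,                      # deletion
                           cur[j - 1] + gap,                   # insertion
                           prev[j - 1] + sub(A[i - 1], B[j - 1])))
        prev = cur
    return prev

def hirschberg_split(A, B):
    # j* minimising w(alpha*(A_half, B_j)) + w(alpha*(A^half, B^j))
    half = len(A) // 2
    left = last_row(A[:half], B)
    right = last_row(A[half:][::-1], B[::-1])[::-1]
    total = [l + r for l, r in zip(left, right)]
    j_star = min(range(len(total)), key=total.__getitem__)
    return half, j_star, total[j_star]

half, j_star, dist = hirschberg_split("AUCGUACAG", "AUGAUACAG")
print(dist)   # equals the distance of the full sequences
```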
\subsection{Memory-reduction with corner-cutting}
The dynamic programming algorithm reaches the optimal alignment of two sequences with
aligning longer and longer prefixes of the two sequences. The algorithm can be accelerated
with excluding the bad alignments of prefixes that cannot yield an optimal alignment.
Such alignments are given with the ordered paths going from the right top and the left bottom
corners to $d_{0,0}$, hence the name of the technique.
Most of the corner-cutting algorithms use a test value. This test value is an upper bound of the
evolutionary distance between the two sequences. Corner-cutting algorithms using a test value
can obtain the optimal alignment of two sequences only if the test value is not smaller than
the distance between the two sequences; otherwise the algorithm stops before reaching the right
bottom corner or gives a non-optimal alignment. Therefore these algorithms are useful when we are
searching for sequences similar to a given one and are not interested in sequences that
are farther from the query sequence than the test value.
We are going to introduce two algorithms. The Spouge algorithm \cite{bio_spouge89},\cite{bio_spouge91}
is a generalisation of the Fickett \cite{bio_fickett84} and the Ukkonen algorithms
\cite{bio_ukkonen84},\cite{bio_ukkonen85}. The other algorithm was given by Gusfield; it
is an example of a corner-cutting algorithm that reaches the right bottom corner even if
the distance between the two sequences is greater than the test value, but in this case the
calculated score is bigger than the test value, indicating that the obtained alignment is not
necessarily optimal.
The Spouge algorithm calculates only those $d_{i,j}$ for which
\begin{equation}
d_{i,j} + |(n-i) - (m-j)| \times g \leq t \koz ,
\end{equation}
where $t$ is the test value, $g$ is the gap penalty, and $n$ and $m$ are the lengths of the sequences.
The key observation of the algorithm is that any path going from $d_{i,j}$ to $d_{n,m}$ increases
the score of the alignment by at least $|(n-i) - (m-j)| \times g$. Therefore, if $t$ is not smaller
than the distance between the sequences, the Spouge algorithm obtains the optimal alignment; otherwise it
stops before reaching the right bottom corner.
This algorithm is a generalisation of the Fickett algorithm and the Ukkonen algorithm. Those algorithms
also use a test value, but the inequality in the Fickett algorithm is:
\begin{equation}
d_{i,j} \leq t \koz ,
\end{equation}
while the inequality in the Ukkonen algorithm is:
\begin{equation}
|i - j| \times g + |(n-i) - (m-j)| \times g \leq t \koz .
\end{equation}
Since in both cases the left hand side of the inequalities is not greater than the left hand
side of the Spouge inequality, the Fickett and the Ukkonen algorithms calculate at least
as large a part of the dynamic programming table as the Spouge algorithm. Empirical results showed that
the Spouge algorithm is significantly better \cite{bio_spouge91}. The algorithm can be extended to
affine and concave gap penalties, too.
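The test-value pruning might be sketched as follows. This is an illustrative sketch, assuming a linear gap penalty $g$ and a substitution cost function \texttt{sub}; the function name is ours. Cells violating the Spouge inequality are simply left uncomputed:

```python
import math

def spouge_align(a, b, sub, g, t):
    """Distance of a and b, computing only cells d[i][j] for which
    d[i][j] + |(n-i)-(m-j)|*g <= t. Returns None if the test value t
    is smaller than the distance (the pruning cuts every path)."""
    n, m = len(a), len(b)
    INF = math.inf
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            best = INF
            if i > 0 and j > 0:
                best = min(best, d[i - 1][j - 1] + sub(a[i - 1], b[j - 1]))
            if i > 0:
                best = min(best, d[i - 1][j] + g)
            if j > 0:
                best = min(best, d[i][j - 1] + g)
            # corner cutting: keep the cell only if any completion of the
            # alignment can still stay within the test value t
            if best + abs((n - i) - (m - j)) * g <= t:
                d[i][j] = best
    return None if d[n][m] == INF else d[n][m]
```

Replacing the kept-cell condition with $d_{i,j} \le t$ or with the Ukkonen band test gives the other two algorithms in the same skeleton.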
The $k$-difference global alignment problem \cite{bio_gusfield97} asks the following question:
Is there an alignment of the sequences whose weight is smaller than $k$? The algorithm
answering the question has $O(kn)$ running time, where $n$ is the length of the longer sequence.
The algorithm is based on the observation that any path from $d_{n,m}$ to $d_{0,0}$
having a score at most $k$ cannot contain a cell $d_{i,j}$ for which $|i-j|>k/g$. Therefore
the algorithm calculates only those cells $d_{i,j}$ for which $|i-j| \le k/g$. If there exists an alignment
with a score smaller than or equal to $k$, then $d_{n,m} \le k$; otherwise $d_{n,m}>k$, and this value
is not necessarily the score of the optimal alignment, since there might be an alignment that leaves
the band defined by the inequality $|i-j| \le k/g$.

For $n > 3$ points, let us assume that there are two trees representing the same metric.
We find a \ki{cherry motif}\inddef{cherry motif} on the first tree, with cherries $i$ and $j$.
A cherry motif is a motif with two leaves whose connecting path has exactly one internal node.
Every tree contains at least two cherry motifs: a path on the tree that has the maximal
number of internal nodes has a cherry motif at both ends.
If there is only one internal node
on the path connecting $i$ and $j$ on the other tree, then the lengths of the corresponding edges
in the two cherry motifs must be the same, since for any additional point $k$, we must get the
same subtree. We define a new metric by deleting points $i$ and $j$, and adding a new point $u$.
The distance between $u$ and any point $k$ is $d_{i,k} - d_{i,u}$, where $d_{i,u}$ is the length
of the edge connecting $i$ with the internal point in the cherry motif. If we delete nodes $i$
and $j$ from both trees, we get two trees that represent this new metric, and by the induction
hypothesis they are the same.
If the path between $i$ and $j$ contains more than one internal node on the other tree, then we
arrive at a contradiction. There is a point $u_1$ on the second tree for which $d_{i,u} \neq d_{i,u_1}$.
Consider a point $k$ such that the path connecting $i$ and $k$ contains node $u$. On the first tree,
\begin{equation}
d_{i,k} - d_{j,k} = d_{i,u} - d_{j,u} = 2 d_{i,u} - d_{i,j} \koz,
\end{equation}
while on the second tree
\begin{equation}
d_{i,k} - d_{j,k} = d_{i,u_1} - d_{j,u_1} = 2 d_{i,u_1} - d_{i,j} \koz ,
\end{equation}
which contradicts the assumption that $d_{i,u} \neq d_{i,u_1}$.
\end{biz}
\subsection{Clustering algorithms}
\begin{defi}
A metric is \ki{ultrametric}\inddef{ultrametric}\index{metric!ultrametric} if for any three points, $i$, $j$ and $k$
\begin{equation}
d_{i,j} \le \max\{ d_{i,k},\,d_{j,k}\}
\end{equation}
\end{defi}
It is easy to prove that in any ultrametric, among the three distances between any three points,
either all three are equal, or two of them are equal and the third is smaller.
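This property can be checked mechanically; a small sketch (the function name is ours), using the equivalent formulation that the two largest of the three pairwise distances must coincide:

```python
from itertools import combinations

def is_ultrametric(d, points):
    """d[x][y]: symmetric pairwise distance. For every triple of points,
    the two largest of the three distances must be equal."""
    for i, j, k in combinations(points, 3):
        x, y, z = sorted([d[i][j], d[i][k], d[j][k]])
        if y != z:          # the two largest distances differ
            return False
    return True
```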
\begin{tetel}
If the metric on a finite set of points is ultrametric, then there is exactly one
tree that represents it. Furthermore, this tree can be rooted such that the distance between
a point and the root is the same for all points.
\label{tetel:bio_ultra}
\end{tetel}
\begin{biz}
Based on Lemma~\ref{lemma:bio_egyetlenfa}, it is enough to construct one ultrametric tree
for any ultrametric. We represent ultrametric trees as \ki{dendrograms.}\inddef{dendrograms}
In this representation, the horizontal edges have length zero. For an example dendrogram, see
Figure \ref{figure:bio_dendrogram}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}
\setlength{\unitlength}{2000sp}%
\begingroup\makeatletter\ifx\SetFigFont\undefined%
\gdef\SetFigFont#1#2#3#4#5{%
\reset@font\fontsize{#1}{#2pt}%
\fontfamily{#3}\fontseries{#4}\fontshape{#5}%
\selectfont}%
\fi\endgroup%
\begin{center}
\begin{picture}(4224,3024)(889,-3373)
\thinlines
{\color[rgb]{0,0,0}\put(901,-3361){\line( 0, 1){600}}
\put(901,-2761){\line( 1, 0){600}}
\put(1501,-2761){\line( 0,-1){600}}
}%
{\color[rgb]{0,0,0}\put(1201,-2761){\line( 0, 1){900}}
\put(1201,-1861){\line( 1, 0){1500}}
\put(2701,-1861){\line( 0,-1){300}}
\put(2701,-2161){\line(-1, 0){600}}
\put(2101,-2161){\line( 0,-1){1200}}
}%
{\color[rgb]{0,0,0}\put(2701,-2161){\line( 1, 0){600}}
\put(3301,-2161){\line( 0,-1){300}}
\put(3301,-2461){\line(-1, 0){600}}
\put(2701,-2461){\line( 0,-1){600}}
\put(2701,-3061){\line(-1, 0){300}}
\put(2401,-3061){\line( 0,-1){300}}
}%
{\color[rgb]{0,0,0}\put(2701,-3061){\line( 1, 0){300}}
\put(3001,-3061){\line( 0,-1){300}}
}%
{\color[rgb]{0,0,0}\put(3301,-2461){\line( 1, 0){300}}
\put(3601,-2461){\line( 0,-1){900}}
}%
{\color[rgb]{0,0,0}\put(1801,-1861){\line( 0, 1){1500}}
\put(1801,-361){\line( 1, 0){2700}}
\put(4501,-361){\line( 0,-1){2100}}
\put(4501,-2461){\line(-1, 0){600}}
\put(3901,-2461){\line( 0,-1){900}}
}%
{\color[rgb]{0,0,0}\put(4501,-2461){\line( 1, 0){300}}
\put(4801,-2461){\line( 0,-1){300}}
\put(4801,-2761){\line(-1, 0){300}}
\put(4501,-2761){\line( 0,-1){600}}
}%
{\color[rgb]{0,0,0}\put(4801,-2761){\line( 1, 0){300}}
}%
{\color[rgb]{0,0,0}\put(5101,-2761){\line( 0,-1){600}}
}%
\end{picture}
\end{center}
\caption{A dendrogram.}
\label{figure:bio_dendrogram}
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
The proof is based on induction on the number of leaves. Obviously, we can construct a dendrogram for
two leaves. After constructing the dendrogram for $n$ leaves, we add leaf $n+1$ to the dendrogram
in the following way. We find a leaf $i$ in the dendrogram for which $d_{i,n+1}$ is minimal.
Then we walk up in the dendrogram till we reach height $d_{i,n+1}/2$ (we might go higher than the root).
Leaf $n+1$ is connected to the dendrogram at this point, see Figure \ref{figure:bio_dendrogram1}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}
\setlength{\unitlength}{1600sp}%
%
\begingroup\makeatletter\ifx\SetFigFont\undefined%
\gdef\SetFigFont#1#2#3#4#5{%
\reset@font\fontsize{#1}{#2pt}%
\fontfamily{#3}\fontseries{#4}\fontshape{#5}%
\selectfont}%
\fi\endgroup%
\begin{center}
\begin{picture}(11712,3666)(289,-3715)
\thinlines
{\color[rgb]{0,0,0}\put(301,-3361){\line( 0, 1){600}}
\put(301,-2761){\line( 1, 0){600}}
\put(901,-2761){\line( 0,-1){600}}
}%
{\color[rgb]{0,0,0}\put(2101,-2161){\line( 1, 0){600}}
\put(2701,-2161){\line( 0,-1){300}}
\put(2701,-2461){\line(-1, 0){600}}
\put(2101,-2461){\line( 0,-1){600}}
\put(2101,-3061){\line(-1, 0){300}}
\put(1801,-3061){\line( 0,-1){300}}
}%
{\color[rgb]{0,0,0}\put(2101,-3061){\line( 1, 0){300}}
\put(2401,-3061){\line( 0,-1){300}}
}%
{\color[rgb]{0,0,0}\put(2701,-2461){\line( 1, 0){300}}
\put(3001,-2461){\line( 0,-1){900}}
}%
{\color[rgb]{0,0,0}\put(601,-2761){\line( 0, 1){900}}
\put(601,-1861){\line( 1, 0){1500}}
\put(2101,-1861){\line( 0,-1){300}}
\put(2101,-2161){\line(-1, 0){600}}
\put(1501,-2161){\line( 0,-1){1200}}
}%
{\color[rgb]{0,0,0}\put(1201,-1861){\line( 0, 1){1800}}
\put(1201,-61){\line( 1, 0){2700}}
\put(3901,-61){\line( 0,-1){1800}}
\put(3901,-1861){\line(-1, 0){600}}
\put(3301,-1861){\line( 0,-1){1500}}
}%
{\color[rgb]{0,0,0}\put(3901,-1861){\line( 1, 0){600}}
\put(4501,-1861){\line( 0,-1){300}}
\put(4501,-2161){\line(-1, 0){600}}
\put(3901,-2161){\line( 0,-1){900}}
\put(3901,-3061){\line(-1, 0){300}}
\put(3601,-3061){\line( 0,-1){300}}
}%
{\color[rgb]{0,0,0}\put(3901,-3061){\line( 1, 0){300}}
\put(4201,-3061){\line( 0,-1){300}}
}%
{\color[rgb]{0,0,0}\put(4501,-2161){\line( 1, 0){300}}
\put(4801,-2161){\line( 0,-1){600}}
\put(4801,-2761){\line(-1, 0){300}}
\put(4501,-2761){\line( 0,-1){600}}
}%
{\color[rgb]{0,0,0}\put(4801,-2761){\line( 1, 0){300}}
\put(5101,-2761){\line( 0,-1){600}}
}%
{\color[rgb]{0,0,0}\put(5101,-1561){\vector( 1, 0){1200}}
}%
{\color[rgb]{0,0,0}\put(6001,-3361){\line( 0, 1){600}}
\put(6001,-2761){\line( 1, 0){600}}
\put(6601,-2761){\line( 0,-1){600}}
}%
{\color[rgb]{0,0,0}\put(6301,-2761){\line( 0, 1){900}}
\put(6301,-1861){\line( 1, 0){1500}}
\put(7801,-1861){\line( 0,-1){300}}
\put(7801,-2161){\line(-1, 0){600}}
\put(7201,-2161){\line( 0,-1){1200}}
}%
{\color[rgb]{0,0,0}\put(6901,-1861){\line( 0, 1){1800}}
\put(6901,-61){\line( 1, 0){3600}}
\put(10501,-61){\line( 0,-1){1200}}
\put(10501,-1261){\line(-1, 0){900}}
\put(9601,-1261){\line( 0,-1){600}}
\put(9601,-1861){\line(-1, 0){600}}
\put(9001,-1861){\line( 0,-1){1500}}
}%
{\color[rgb]{0,0,0}\put(7801,-2161){\line( 1, 0){600}}
\put(8401,-2161){\line( 0,-1){300}}
\put(8401,-2461){\line(-1, 0){600}}
\put(7801,-2461){\line( 0,-1){600}}
\put(7801,-3061){\line(-1, 0){300}}
\put(7501,-3061){\line( 0,-1){300}}
}%
{\color[rgb]{0,0,0}\put(7801,-3061){\line( 1, 0){300}}
\put(8101,-3061){\line( 0,-1){300}}
}%
{\color[rgb]{0,0,0}\put(8401,-2461){\line( 1, 0){300}}
\put(8701,-2461){\line( 0,-1){900}}
}%
{\color[rgb]{0,0,0}\put(9601,-1861){\line( 1, 0){600}}
\put(10201,-1861){\line( 0,-1){300}}
\put(10201,-2161){\line(-1, 0){600}}
\put(9601,-2161){\line( 0,-1){900}}
\put(9601,-3061){\line(-1, 0){300}}
\put(9301,-3061){\line( 0,-1){300}}
}%
{\color[rgb]{0,0,0}\put(9601,-3061){\line( 1, 0){300}}
\put(9901,-3061){\line( 0,-1){300}}
}%
{\color[rgb]{0,0,0}\put(10201,-2161){\line( 1, 0){300}}
\put(10501,-2161){\line( 0,-1){600}}
\put(10501,-2761){\line(-1, 0){300}}
\put(10201,-2761){\line( 0,-1){600}}
}%
{\color[rgb]{0,0,0}\put(10501,-2761){\line( 1, 0){300}}
\put(10801,-2761){\line( 0,-1){600}}
}%
{\color[rgb]{0,0,0}\put(10501,-1261){\line( 1, 0){900}}
\put(11401,-1261){\line( 0,-1){2100}}
}%
{\color[rgb]{0,0,0}\put(11701,-1261){\vector( 0, 1){ 0}}
\put(11701,-1261){\vector( 0,-1){2100}}
}%
\put(4201,-3661){\makebox(0,0)[lb]{\smash{\SetFigFont{10}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}$i$}%
}}}
\put(7501,-3661){\makebox(0,0)[lb]{\smash{\SetFigFont{10}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}$j$}%
}}}
\put(9901,-3661){\makebox(0,0)[lb]{\smash{\SetFigFont{10}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}$i$}%
}}}
\put(10801,-3661){\makebox(0,0)[lb]{\smash{\SetFigFont{10}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}$i\,'$}%
}}}
\put(11401,-3661){\makebox(0,0)[lb]{\smash{\SetFigFont{10}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}$n+1$}%
}}}
\put(12001,-2461){\makebox(0,0)[lb]{\smash{\SetFigFont{10}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}$d_{i,n+1}/2$}%
}}}
\end{picture}
\end{center}
\caption{Connecting leaf $n+1$ to the dendrogram.}
\label{figure:bio_dendrogram1}
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
This dendrogram properly represents the distances between leaf $n+1$ and any other leaf. Indeed,
if leaf $i\,'$ is below the new internal node that joins leaf $n+1$, then $d_{i,i\,'} \le d_{i,n+1}$,
and from the ultrametric property and the minimality of $d_{i,n+1}$ it follows that
$d_{i\,',n+1} = d_{i,n+1}$. On the other hand, if leaf $j$ is not below the new internal node joining leaf
$n+1$, then $d_{i,j} > d_{i,n+1}$, and from the ultrametric property it follows that
$d_{j,n+1} = d_{i,j}$.
\end{biz}
%
\noindent It is easy to see that the construction in the proof needs $O(n^2)$ running time, where
$n$ is the number of objects. We shall give another algorithm that finds the pair of objects $i$
and $j$ for which $d_{i,j}$ is minimal. From the ultrametric property, for any
$k \neq i,j$, $d_{i,k} = d_{j,k} (\ge d_{i,j})$, hence we can replace the pair of objects $i$ and $j$
with a new object, and the distance between this new object and any other object $k$ is well defined:
it is $d_{i,k} = d_{j,k}$. The objects $i$ and $j$ are connected at height $d_{i,j}/2$, and we treat
this sub-dendrogram as a single object. We continue the iteration till we have a single object.
This algorithm is slower than the previous one; however, it is the basis of the clustering
algorithms. The clustering algorithms create a dendrogram even if the input distances do not follow
an ultrametric. On the other hand, if the input distances follow an ultrametric, then most of
the clustering algorithms give back this ultrametric.
As we mentioned, the clustering algorithms find the object pair $i$ and $j$ for which
$d_{i,j}$ is minimal. The differences come from how the algorithms define the distance between
the new object replacing the pair of objects $i$ and $j$ and any other object. If the new object
is denoted by $u$, then the introduced clustering methods define $d_{u,k}$ in the following way:
%
\begin{itemize}
\item \textsc{Single link:} $d_{u,k} = \min\{d_{i,k},d_{j,k} \}$.
%
\item \textsc{Complete link:} $d_{u,k} = \max\{d_{i,k},d_{j,k}\}$.
%
\item \textsc{UPGMA:} the new distance is the arithmetic mean of the distances between the elements in $u$ and $k$:
$d_{u,k} = \frac{d_{i,k} \times |i| + d_{j,k} \times |j|}{|i| + |j|}$,
where $|i|$ and $|j|$ are the numbers of elements in $i$ and $j$.
%
\item \textsc{Single average:} $d_{u,k} = \frac{d_{i,k} + d_{j,k}}{2}$.
%
\item \textsc{Centroid:} This method is used when the objects can be embedded into a Euclidean
space. Then the distance between two objects can be defined as the distance between the centroids
of the elements of the objects. It is not necessary to use the coordinates of the Euclidean
space, since the distance $d_{u,k}$ in question is the distance between point $k$ and the point
dividing the edge $ij$ in proportion $|j|:|i|$ in the triangle formed by
points $i$, $j$ and $k$ (see Figure \ref{figure:bio_centroid}). This length can be calculated using
only $d_{i,j}$, $d_{i,k}$ and $d_{j,k}$. Hence the algorithm can be used even if the objects
cannot be embedded into a Euclidean space.
%
\item \textsc{Median:} The centroid of $u$ is the centroid of the centroids of $i$ and $j$.
This method is related to the centroid method as the single average is related to the UPGMA method.
It is again not necessary to know the coordinates of the elements, hence this method can be applied
to distances that cannot be embedded into a Euclidean space.
\end{itemize}
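The merge iteration, together with the first four linkage rules above, might be sketched as follows (a naive $O(n^3)$ illustration; all names are ours):

```python
from itertools import combinations

# linkage rules: distance from the merged cluster u = (i, j) to a cluster k
single   = lambda dik, djk, ni, nj: min(dik, djk)
complete = lambda dik, djk, ni, nj: max(dik, djk)
upgma    = lambda dik, djk, ni, nj: (dik * ni + djk * nj) / (ni + nj)
average  = lambda dik, djk, ni, nj: (dik + djk) / 2   # "single average"

def cluster(names, d, linkage):
    """names: leaf labels; d[a][b]: pairwise distances. Returns a nested
    (left, right, height) tuple; the closest pair joins at height d_ij/2."""
    trees = dict(enumerate(names))                # cluster id -> subtree
    dist = {(a, b): d[names[a]][names[b]]
            for a, b in combinations(range(len(names)), 2)}
    size = {i: 1 for i in trees}
    nxt = len(names)                              # id of the next new cluster
    while len(trees) > 1:
        i, j = min(dist, key=dist.get)            # closest pair of clusters
        trees[nxt] = (trees[i], trees[j], dist[(i, j)] / 2)
        size[nxt] = size[i] + size[j]
        for k in trees:
            if k not in (i, j, nxt):
                dik = dist[(min(i, k), max(i, k))]
                djk = dist[(min(j, k), max(j, k))]
                dist[(k, nxt)] = linkage(dik, djk, size[i], size[j])
        trees.pop(i); trees.pop(j)
        dist = {(a, b): v for (a, b), v in dist.items()
                if a in trees and b in trees}
        nxt += 1
    return trees.popitem()[1]
```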
%
It is easy to show that the first four methods give back the dendrogram of the input distances whenever
the input distances follow an ultrametric, since $d_{i,k} = d_{j,k}$ in this case.
However, the \textsc{Centroid} and \textsc{Median} methods do not give the corresponding
dendrogram for ultrametric input distances, since $d_{u,k}$ will be smaller than
$d_{i,k}$ (which equals $d_{j,k}$).
The central problem of the clustering algorithms is that they give a dendrogram
that might not be biologically correct. Indeed, the evolutionary tree of biological sequences
can be a dendrogram only if the \emph{molecular clock} hypothesis holds. The molecular clock hypothesis
says that the sequences evolve at the same rate on each branch of the tree, namely, they collect
the same number of mutations in a given time span. However, this is usually not true. Therefore
biologists want an algorithm that gives an ultrametric tree only if the input distances follow
an ultrametric. The most popular such method is the \ki{Neighbour-Joining}
\inddef{Neighbour-Joining} algorithm.
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}
\setlength{\unitlength}{2000sp}%
%
\begingroup\makeatletter\ifx\SetFigFont\undefined%
\gdef\SetFigFont#1#2#3#4#5{%
\reset@font\fontsize{#1}{#2pt}%
\fontfamily{#3}\fontseries{#4}\fontshape{#5}%
\selectfont}%
\fi\endgroup%
\begin{center}
\begin{picture}(6024,2889)(289,-2515)
\thinlines
{\color[rgb]{0,0,0}\put(2701,239){\line( 1,-1){2400}}
}%
{\color[rgb]{0,0,0}\put(301,-2161){\line( 1, 1){2400}}
\put(2701,239){\line( 3,-2){3600}}
\put(6301,-2161){\line(-1, 0){6000}}
}%
\put(2701,290){\makebox(0,0)[lb]{\smash{\SetFigFont{10}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}$k$}%
}}}
\put(301,-2461){\makebox(0,0)[lb]{\smash{\SetFigFont{10}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}$i$}%
}}}
\put(2401,-2461){\makebox(0,0)[lb]{\smash{\SetFigFont{10}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}$|j|$}%
}}}
\put(5101,-2461){\makebox(0,0)[lb]{\smash{\SetFigFont{10}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}$u$}%
}}}
\put(5701,-2461){\makebox(0,0)[lb]{\smash{\SetFigFont{10}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}$|i|$}%
}}}
\put(6301,-2461){\makebox(0,0)[lb]{\smash{\SetFigFont{10}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}$j$}%
}}}
\end{picture}
\end{center}
\caption{ Calculating $d_{u,k}$ according to the \textsc{Centroid} method.}
\label{figure:bio_centroid}
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Neighbour joining}
\begin{defi}
A metric is called \ki{additive} or \ki{four-point metric,}\inddef{additive metric}\inddef{four-point metric}
if for any four points $i$, $j$, $k$ and $l$
\begin{equation}
d_{i,j} + d_{k,l} \le \max\{ d_{i,k} + d_{j,l},\, d_{i,l} + d_{j,k}\}
\end{equation}
\end{defi}
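The four-point condition can be tested directly; a small sketch (the function name is ours), using the equivalent formulation that among the three sums $d_{i,j}+d_{k,l}$, $d_{i,k}+d_{j,l}$ and $d_{i,l}+d_{j,k}$ the two largest must be equal:

```python
from itertools import combinations

def is_additive(d, points):
    """d[x][y]: symmetric pairwise distance. Checks the four-point
    condition: the two largest of the three pair sums must coincide."""
    for i, j, k, l in combinations(points, 4):
        s = sorted([d[i][j] + d[k][l], d[i][k] + d[j][l], d[i][l] + d[j][k]])
        if s[1] != s[2]:
            return False
    return True
```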
\begin{tetel}
If a metric is additive on a finite set of objects, then there is exactly one tree that
represents it.
\label{tetel:bio_additiv}
\end{tetel}
\begin{biz}
Due to Lemma~\ref{lemma:bio_egyetlenfa}, there is at most one such tree, therefore it is
enough to construct it. First we give the construction, then we prove its correctness.
For three objects we can construct a tree according to (\ref{eq:bio_harompontkezdet})--(\ref{eq:bio_harompontveg}).
Assume that we have constructed the tree for $n \ge 3$ objects, and we want to add
leaf $n+1$ to the tree. First we find the topology and then we give the length
of the new edge. For obtaining the new topology, we start with any leaf $i$, and let
$u$ denote the neighbour of leaf $i$. There are at least two other edges going out from
$u$; we find two leaves on the paths starting with these two outgoing edges, and let
$k$ and $l$ denote these leaves, see Figure \ref{figure:bio_additivfa1}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}
\setlength{\unitlength}{2000sp}%
%
\begingroup\makeatletter\ifx\SetFigFont\undefined%
\gdef\SetFigFont#1#2#3#4#5{%
\reset@font\fontsize{#1}{#2pt}%
\fontfamily{#3}\fontseries{#4}\fontshape{#5}%
\selectfont}%
\fi\endgroup%
\begin{center}
\begin{picture}(3612,4119)(439,-3586)
\thinlines
{\color[rgb]{0,0,0}\put(2101,-1411){\line(-1, 1){1725}}
}%
{\color[rgb]{0,0,0}\put(2101,-3361){\line( 0, 1){1950}}
}%
{\color[rgb]{0,0,0}\put(2101,-1411){\line( 4, 3){1800}}
}%
{\color[rgb]{0,0,0}\put(2101,-2311){\line( 1, 0){900}}
}%
\put(2101,-3586){\makebox(0,0)[lb]{\smash{\SetFigFont{10}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}$i$}%
}}}
\put(601,389){\makebox(0,0)[lb]{\smash{\SetFigFont{10}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}$k$}%
}}}
\put(4051,-136){\makebox(0,0)[lb]{\smash{\SetFigFont{10}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}$l$}%
}}}
\put(3076,-2386){\makebox(0,0)[lb]{\smash{\SetFigFont{10}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}$n+1$}%
}}}
\put(2251,-1561){\makebox(0,0)[lb]{\smash{\SetFigFont{10}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}$u$}%
}}}
\end{picture}
\end{center}
\caption{Connecting leaf $n+1$ for constructing an additive tree.}
\label{figure:bio_additivfa1}
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Leaf $n+1$ is connected to the edge between $i$ and $u$ if
%
\begin{equation}
d_{i,n+1} + d_{k,l} < d_{i,k} + d_{n+1,l}
\label{equation:bio_csakeggyre}
\end{equation}
%
Using similar inequalities, we can decide whether leaf $n+1$ is before $u$ looking from $k$ or
looking from $l$. If the degree of $u$ is greater than $3$, then we find a leaf $l'$ on each of the other
paths, and we do the same investigation for the points $i$, $n+1$, $k$ and $l'$. From the additive property,
it follows that the inequality can hold for at most one case. If it holds for $i$, then we connect
leaf $n+1$ to the edge connecting $u$ and $i$. If the inequality holds for another case, then we
take the maximal subtree of the tree that contains $u$ as a leaf and also contains the leaf
for which the inequality holds. We define $d_{u,n+1}$ as $d_{i,n+1} - d_{i,u}$, then, renaming
$u$ to $i$, we continue searching for the connection place of leaf $n+1$. If we get
equality for all outgoing edges of $u$, then we connect leaf $n+1$ to $u$.
After finding the topology, we obtain the length of the new edge. Leaf $n+1$ is connected to
the edge between $i$ and $u$, let $u_1$ denote the new internal point, see
Figure \ref{figure:bio_additivfa2}/b.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}
\setlength{\unitlength}{1800sp}%
%
\begingroup\makeatletter\ifx\SetFigFont\undefined%
\gdef\SetFigFont#1#2#3#4#5{%
\reset@font\fontsize{#1}{#2pt}%
\fontfamily{#3}\fontseries{#4}\fontshape{#5}%
\selectfont}%
\fi\endgroup%
\begin{center}
\begin{picture}(8850,7611)(301,-7165)
\thinlines
{\color[rgb]{0,0,0}\put(1201,-361){\line(-1,-1){600}}
}%
{\color[rgb]{0,0,0}\put(601,239){\line( 1,-1){1200}}
}%
{\color[rgb]{0,0,0}\put(1801,-961){\line( 1, 1){1200}}
}%
{\color[rgb]{0,0,0}\put(1801,-2161){\line( 0, 1){1200}}
}%
{\color[rgb]{0,0,0}\put(2401,-361){\line( 1, 0){1200}}
}%
{\color[rgb]{0,0,0}\put(6601,239){\line( 1,-1){1200}}
}%
{\color[rgb]{0,0,0}\put(7801,-961){\line( 0,-1){1200}}
}%
{\color[rgb]{0,0,0}\put(7201,-361){\line(-1,-1){600}}
}%
{\color[rgb]{0,0,0}\put(7801,-961){\line( 1, 1){1200}}
}%
{\color[rgb]{0,0,0}\put(3601,-3961){\line( 1,-1){1200}}
}%
{\color[rgb]{0,0,0}\put(4801,-5161){\line( 0,-1){1200}}
}%
{\color[rgb]{0,0,0}\put(4801,-5161){\line( 1, 1){1200}}
}%
{\color[rgb]{0,0,0}\put(4801,-5161){\line( 1, 0){600}}
\put(5401,-5161){\line( 1, 0){600}}
}%
{\color[rgb]{0,0,0}\put(7801,-1561){\line( 1, 0){1200}}
}%
\put(376,239){\makebox(0,0)[lb]{\smash{\SetFigFont{10}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}$k$}%
}}}
\put(251,-1036){\makebox(0,0)[lb]{\smash{\SetFigFont{10}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}$k_1$}%
}}}
\put(2026,-961){\makebox(0,0)[lb]{\smash{\SetFigFont{10}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}$u$}%
}}}
\put(1801,-2386){\makebox(0,0)[lb]{\smash{\SetFigFont{10}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}$i$}%
}}}
\put(3751,-436){\makebox(0,0)[lb]{\smash{\SetFigFont{10}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}$n+1$}%
}}}
\put(3151,314){\makebox(0,0)[lb]{\smash{\SetFigFont{10}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}$l$}%
}}}
\put(6226,239){\makebox(0,0)[lb]{\smash{\SetFigFont{10}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}$k$}%
}}}
\put(6151,-961){\makebox(0,0)[lb]{\smash{\SetFigFont{10}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}$k_1$}%
}}}
\put(9151,239){\makebox(0,0)[lb]{\smash{\SetFigFont{10}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}$l$}%
}}}
\put(9076,-1561){\makebox(0,0)[lb]{\smash{\SetFigFont{10}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}$n+1$}%
}}}
\put(7651,-2386){\makebox(0,0)[lb]{\smash{\SetFigFont{10}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}$i$}%
}}}
\put(8026,-961){\makebox(0,0)[lb]{\smash{\SetFigFont{10}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}$u$}%
}}}
\put(7276,-1561){\makebox(0,0)[lb]{\smash{\SetFigFont{10}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}$u_1$}%
}}}
\put(3301,-3886){\makebox(0,0)[lb]{\smash{\SetFigFont{10}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}$k$}%
}}}
\put(6151,-3961){\makebox(0,0)[lb]{\smash{\SetFigFont{10}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}$l$}%
}}}
\put(6151,-5161){\makebox(0,0)[lb]{\smash{\SetFigFont{10}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}$n+1$}%
}}}
\put(4426,-5236){\makebox(0,0)[lb]{\smash{\SetFigFont{10}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}$u$}%
}}}
\put(4726,-6586){\makebox(0,0)[lb]{\smash{\SetFigFont{10}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}$i$}%
}}}
\put(1726,-2986){\makebox(0,0)[lb]{\smash{\SetFigFont{10}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}a)}%
}}}
\put(7726,-2911){\makebox(0,0)[lb]{\smash{\SetFigFont{10}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}b)}%
}}}
\put(4651,-7111){\makebox(0,0)[lb]{\smash{\SetFigFont{10}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}c)}%
}}}
\end{picture}
\end{center}
\caption{Some tree topologies for proving Theorem~\ref{tetel:bio_additiv}.}
\label{figure:bio_additivfa2}
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
We define $d_{u,n+1}$ as $d_{l,n+1} - d_{l,u}$. Then the distances
$d_{u,u_1}$, $d_{i,u_1}$, and $d_{u_1,n+1}$ can be calculated using
(\ref{eq:bio_harompontkezdet})--(\ref{eq:bio_harompontveg}).
If leaf $n+1$ is connected to $u$, then $d_{u,n+1} = d_{i,n+1} - d_{i,u}$.
Now we prove the correctness of the construction. First we show that $d_{u,n+1}$
is well-defined, namely, for all nodes $j$ that are not in the new subtree containing leaves $n+1$
and $u$, $d_{j,n+1}- d_{j,u} = d_{i,n+1} - d_{i,u}$. If the new subtree contains $l$,
then the claim obviously holds for the leaf $j = k$ that was used to find the place of leaf $n+1$
(see Figure \ref{figure:bio_additivfa2}/a). Due to the additive metric property and the place of
leaf $n+1$,
\begin{equation}
d_{k,n+1} + d_{i,l} = d_{i,n+1} + d_{k,l} \koz .
\end{equation}
Using the equalities $d_{i,l} = d_{i,u} + d_{u,l}$ and $d_{k,l} = d_{k,u} + d_{u,l}$, it follows that
\begin{equation}
d_{k,n+1} - d_{k,u} = d_{i,n+1} - d_{i,u}.
\end{equation}
Similarly for all leaves $k_1$ that are not separated from $k$ by $u$, it holds that
\begin{equation}
d_{k_1,n+1} + d_{i,l} = d_{i,n+1} + d_{k_1,l}
\end{equation}
This is due to the additive metric property and the inequality
\begin{equation}
d_{k,k_1} + d_{l,n+1} < d_{k,n+1} + d_{k_1,l}
\end{equation}
This latter inequality follows from the inequalities
\begin{eqnarray}
d_{k,k_1} + d_{i,l} & < & d_{k_1,l} + d_{k,i} \\
d_{l,n+1} + d_{k,i} & < & d_{i,l} + d_{k,n+1}
\end{eqnarray}
If the degree of $u$ is greater than $3$, then similar inequalities hold.
Due to the way of calculating the new edge lengths, $d_{i,n+1}$ is represented
properly on the new tree, hence $d_{j,n+1}$ is represented properly for all
$j$ that are separated from leaf $n+1$ by $i$. Note that $i$ might be an earlier $u$.
If leaf $n+1$ is connected to the edge between $i$ and $u$ (Figure \ref{figure:bio_additivfa2}/b),
then due to the definition of $d_{u,n+1}$, $d_{l,n+1}$ is represented properly. From the equation
\begin{equation}
d_{k,n+1} + d_{i,l} = d_{k,i} + d_{l,n+1}
\end{equation}
it follows that
\begin{equation}
d_{k,n+1} = d_{k,u} + d_{u,n+1} \koz ,
\end{equation}
hence $d_{k,n+1}$ is represented properly. It can be shown similarly that for
all points $j$ that are separated from $n+1$ by $u$, $d_{j,n+1}$ is represented
properly on the tree.
If leaf $n+1$ is connected to node $u$ (Figure \ref{figure:bio_additivfa2}/c), then
from the equations
\begin{equation}
d_{i,n+1} + d_{k,l} = d_{k,i} + d_{l,n+1} = d_{k,n+1} + d_{i,l}
\end{equation}
it follows that both $d_{k,n+1}$ and $d_{l,n+1}$ are represented properly on the new tree,
and with similar reasoning it is easy to show that for all nodes $j$ that are
separated from $n+1$ by $u$, $d_{j,n+1}$ is represented properly on the tree.
Hence we construct a tree containing leaf $n+1$ from the tree containing the first $n$ leaves,
thus proving Theorem~\ref{tetel:bio_additiv}.
\end{biz}
It is easy to show that the above algorithm, which constructs the tree representing an additive
metric, takes $O(n^2)$ running time. However, it works only if the input distances follow an additive
metric; otherwise inequality (\ref{equation:bio_csakeggyre}) might hold several times, and hence we
cannot decide where to join leaf $n+1$. We shall introduce an algorithm that has
$\Theta(n^3)$ running time and gives back the additive tree whenever the input distances follow
an additive metric; moreover, it generates an additive tree that approximates the input distances
if they do not follow an additive metric.
The \ki{\textsc{Neighbour-Joining}}\inddef{\textsc{Neighbour-Joining}} algorithm works in the following way.
Given a set of $n$ points and a distance function $d$ on the points, first we calculate
for each point $i$ the sum of the distances from the other points:
\begin{equation}
v_i = \sum_{j=1}^n d_{i,j} \koz .
\end{equation}
Then we find the pair of points for which
\begin{equation}
s_{i,j} = (n-2)d_{i,j} - v_i - v_j
\end{equation}
is minimal. The lengths of the edges from points $i$ and $j$ to the new
point $u$ are
\begin{equation}
e_{i,u} = \frac{d_{i,j}}{2} + \frac{v_i - v_j}{2n - 4}
\end{equation}
and
\begin{equation}
e_{j,u} = d_{i,j} - e_{i,u} \koz .
\end{equation}
Then we recalculate distances. We drop points $i$ and $j$, and add point $u$. The distance
between $u$ and any other point $k$ is defined as
\begin{equation}
d_{k,u} = \frac{d_{k,i} + d_{k,j} - d_{i,j}}{2} \koz .
\end{equation}
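The steps above can be sketched in Python. This is a minimal illustration, not a reference implementation: the distance map keyed by unordered pairs and the internal node names \texttt{u0}, \texttt{u1}, \dots are our own conventions, and the edge length to the new node is computed as $d_{i,j}/2 + (v_i-v_j)/(2n-4)$, following the standard Saitou--Nei sign convention.

```python
import itertools

def neighbour_joining(labels, d):
    """Sketch of the Neighbour-Joining algorithm.

    labels: list of point names; d: dict mapping frozenset({i, j}) -> distance.
    Returns a list of edges (node, node, length); the internal node names
    "u0", "u1", ... are our own convention.
    """
    nodes = list(labels)
    edges = []
    counter = itertools.count()
    while len(nodes) > 3:
        n = len(nodes)
        # v_i: sum of distances from i to all other current nodes
        v = {i: sum(d[frozenset((i, j))] for j in nodes if j != i) for i in nodes}
        # choose the pair minimising s_{i,j} = (n-2) d_{i,j} - v_i - v_j
        i, j = min(itertools.combinations(nodes, 2),
                   key=lambda p: (n - 2) * d[frozenset(p)] - v[p[0]] - v[p[1]])
        u = "u%d" % next(counter)
        dij = d[frozenset((i, j))]
        e_iu = dij / 2 + (v[i] - v[j]) / (2 * n - 4)  # edge length i -- u
        edges.append((i, u, e_iu))
        edges.append((j, u, dij - e_iu))              # edge length j -- u
        # replace i and j with the new node u, recomputing distances
        for k in nodes:
            if k != i and k != j:
                d[frozenset((k, u))] = (d[frozenset((k, i))]
                                        + d[frozenset((k, j))] - dij) / 2
        nodes.remove(i); nodes.remove(j); nodes.append(u)
    # three nodes remain: resolve the final star directly
    a, b, c = nodes
    dab, dac, dbc = (d[frozenset((a, b))], d[frozenset((a, c))],
                     d[frozenset((b, c))])
    centre = "u%d" % next(counter)
    edges.append((a, centre, (dab + dac - dbc) / 2))
    edges.append((b, centre, (dab + dbc - dac) / 2))
    edges.append((c, centre, (dac + dbc - dab) / 2))
    return edges
```

On distances generated by a tree, the sketch returns exactly the generating edge lengths; on other inputs it returns an approximating additive tree.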
\begin{tetel}
If $d$ follows an additive metric, then the \textsc{Neighbour-Joining} algorithm
generates a tree that gives back $d$.
\label{tetel:bio_nj}
\end{tetel}
\begin{biz}
From Theorem~\ref{tetel:bio_additiv} there is exactly one tree that represents the distances.
It is enough to prove that the \textsc{Neighbour-Joining} algorithm always picks a cherry motif of
this tree, since a straightforward calculation shows that in this case the calculated edge lengths
are proper.
First we prove that if $i$ and $j$ form a cherry motif, then for all $k$,
$s_{i,j} < s_{i,k}$ and $s_{i,j} < s_{k,j}$. Indeed, rearranging $s$, we have to prove that
%
\begin{equation}
\sum_{l \neq i,j} (d_{i,j} - d_{i,l} - d_{j,l}) - 2d_{i,j} -
\sum_{m \neq j,k} (d_{j,k} - d_{j,m} - d_{k,m}) + 2d_{j,k} < 0
\label{equation:bio_kisebbmintnulla}
\end{equation}
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}
\setlength{\unitlength}{2000sp}%
%
\begingroup\makeatletter\ifx\SetFigFont\undefined%
\gdef\SetFigFont#1#2#3#4#5{%
\reset@font\fontsize{#1}{#2pt}%
\fontfamily{#3}\fontseries{#4}\fontshape{#5}%
\selectfont}%
\fi\endgroup%
\begin{center}
\begin{picture}(2700,1689)(376,-1240)
\thinlines
{\color[rgb]{0,0,0}\put(601,239){\line( 1,-1){600}}
}%
{\color[rgb]{0,0,0}\put(1201,-361){\line(-1,-1){600}}
}%
{\color[rgb]{0,0,0}\put(1201,-361){\line( 1, 0){1200}}
}%
{\color[rgb]{0,0,0}\put(2401,-361){\line( 1, 1){600}}
}%
{\color[rgb]{0,0,0}\put(2401,-361){\line( 1,-1){600}}
}%
\put(451,314){\makebox(0,0)[lb]{\smash{\SetFigFont{10}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}$i$}%
}}}
\put(376,-1186){\makebox(0,0)[lb]{\smash{\SetFigFont{10}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}$j$}%
}}}
\put(3001,-1111){\makebox(0,0)[lb]{\smash{\SetFigFont{10}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}$k$}%
}}}
\put(3076,314){\makebox(0,0)[lb]{\smash{\SetFigFont{10}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}$l$}%
}}}
\put(1201,-211){\makebox(0,0)[lb]{\smash{\SetFigFont{10}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}$u$}%
}}}
\put(2251,-211){\makebox(0,0)[lb]{\smash{\SetFigFont{10}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}$w$}%
}}}
\end{picture}
\end{center}
\caption{The configuration of nodes $i$, $j$, $k$ and $l$ if $i$ and $j$ form a cherry motif.}
\label{figure:bio_NJbiz1}
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
If $l = m \neq i,j,k$, then we get that
\begin{equation}
(d_{i,j} - d_{i,l} - d_{j,l}) - d_{j,k} + d_{j,l} + d_{k,l} = 2d_{w,l} - 2d_{u,l} < 0 \koz ,
\end{equation}
(see also Figure \ref{figure:bio_NJbiz1}). The term $2d_{j,k} - 2d_{i,j}$ and the cases $l = k$ and
$m = i$ inside the sums cancel each other, hence inequality
(\ref{equation:bio_kisebbmintnulla}) holds.
Now we prove Theorem~\ref{tetel:bio_nj} in an indirect way. Suppose that $i$ and $j$
do not form a cherry motif, although $s_{i,j}$ is minimal. From the previous lemma,
neither $i$ nor $j$ is in a cherry motif with any other leaf. We find a cherry motif
with leaves $k$ and $l$ and internal node $w$. Let $v$ denote the last common node
of the paths going from $w$ to $i$ and to $j$. Since $s_{i,j}$ is minimal,
%
\begin{equation}
s_{k,l} - s_{i,j} > 0 \koz .
\end{equation}
Rearranging this we get that
\begin{equation}
\sum_{m_1 \neq k,l} (d_{k,l} - d_{m_1,k} - d_{m_1,l}) - 2d_{k,l} -
\sum_{m_2 \neq i,j} (d_{i,j} - d_{m_2,i} - d_{m_2,j}) + 2d_{i,j} > 0 \koz .
\end{equation}
$2d_{i,j} - 2d_{k,l}$ and the cases $m_1 = k$, $m_1 = l$, $m_2 = i$ and $m_2 = j$ inside the sums
cancel each other. For the remaining $m = m_1 = m_2 \neq i,j,k,l$, the left-hand side is
\begin{equation}
d_{k,l} - d_{m,k} - d_{m,l} - d_{i,j} + d_{m,i} + d_{m,j} \koz .
\label{equation:bio_njkifejezes}
\end{equation}
If $m$ joins the tree via the path connecting $i$ and $j$, then
expression~(\ref{equation:bio_njkifejezes}) is always negative, see also
Figure \ref{figure:bio_NJbiz2}. Let these cases be called class I cases.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}
\setlength{\unitlength}{2000sp}%
%
\begingroup\makeatletter\ifx\SetFigFont\undefined%
\gdef\SetFigFont#1#2#3#4#5{%
\reset@font\fontsize{#1}{#2pt}%
\fontfamily{#3}\fontseries{#4}\fontshape{#5}%
\selectfont}%
\fi\endgroup%
\begin{center}
\begin{picture}(2700,1794)(376,-1240)
\thinlines
{\color[rgb]{0,0,0}\put(601,239){\line( 1,-1){600}}
}%
{\color[rgb]{0,0,0}\put(1201,-361){\line(-1,-1){600}}
}%
{\color[rgb]{0,0,0}\put(1201,-361){\line( 1, 0){1200}}
}%
{\color[rgb]{0,0,0}\put(2401,-361){\line( 1, 1){600}}
}%
{\color[rgb]{0,0,0}\put(2401,-361){\line( 1,-1){600}}
}%
{\color[rgb]{0,0,0}\put(751, 89){\line( 1, 1){300}}
}%
{\color[rgb]{0,0,0}\put(901,-661){\line( 1,-1){300}}
}%
{\color[rgb]{0,0,0}\put(1801,-361){\line( 0,-1){450}}
}%
\put(451,314){\makebox(0,0)[lb]{\smash{\SetFigFont{10}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}$i$}%
}}}
\put(376,-1186){\makebox(0,0)[lb]{\smash{\SetFigFont{10}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}$j$}%
}}}
\put(3001,-1111){\makebox(0,0)[lb]{\smash{\SetFigFont{10}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}$k$}%
}}}
\put(3076,314){\makebox(0,0)[lb]{\smash{\SetFigFont{10}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}$l$}%
}}}
\put(1201,-211){\makebox(0,0)[lb]{\smash{\SetFigFont{10}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}$v$}%
}}}
\put(2251,-211){\makebox(0,0)[lb]{\smash{\SetFigFont{10}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}$w$}%
}}}
\put(1051,464){\makebox(0,0)[lb]{\smash{\SetFigFont{10}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}$m$}%
}}}
\put(1276,-1036){\makebox(0,0)[lb]{\smash{\SetFigFont{10}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}$m$}%
}}}
\put(1801,-986){\makebox(0,0)[lb]{\smash{\SetFigFont{10}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}$m$}%
}}}
\end{picture}
\end{center}
\caption{The possible places for node $m$ on the tree.}
\label{figure:bio_NJbiz2}
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
If $m$ joins the tree via the path between $v$ and $w$, then
expression~(\ref{equation:bio_njkifejezes}) might be positive. Let these cases be
called class II cases. To avoid a contradiction, the sum of the absolute values from
the class I cases must be less than the sum from the class II cases.
There is another node $v'$ on the path connecting $i$ and $j$, and we can find a cherry motif
beyond node $v'$ with leaves $k'$ and $l'$ and internal node $w'$. Here again the class II cases
have to outweigh the class I cases, but this contradicts the situation with the first
cherry motif. Hence $i$ and $j$ form a cherry motif, which proves Theorem~\ref{tetel:bio_nj}.
\end{biz}
\begin{gyak}
\ujgyak Show that in an ultrametric, the three distances coming from any three
points are either all equal, or two of them are equal and the third one is smaller. Prove the
similar claim for the three sums of distances coming from four points in an additive
metric.
\ujgyak Show that an ultrametric is always an additive metric.
\ujgyak Give an example for a metric that is not additive.
\ujgyak Is it true that every additive metric is a Euclidean metric?
\ujgyak Give the formula that calculates $d_{u,k}$ from $d_{i,j}$, $d_{i,k}$ and $d_{j,k}$
for the centroid method.
\ujgyak Give algorithms that decide in $O(n^2)$ time whether or not a metric is
\begin{itemize}
\item additive
\item ultrametric
\end{itemize}
($n$ is the number of points.)
\end{gyak}
\section{Miscellaneous topics}
In this section, we cover topics that are usually not mentioned in
bioinformatics books. We only mention the main results in a nutshell
and do not prove theorems.
\subsection{Genome rearrangement}
The genome of an organism consists of several genes. For each gene, only
one strand of the double stranded DNA contains meaningful information, the
other strand is the reverse complement. Since the DNA is chemically oriented,
we can talk about the direction of a gene. If each gene has one copy in the
genome then we can describe the order and directions of genes as a signed
permutation, where the signs give the directions of genes.
Given two genomes with the same gene content, each represented as a signed
permutation, the problem is to give the minimal series of mutations
that transforms one genome into the other. We consider three types of
mutations:
\begin{itemize}
%
\item {\bf Reversal} A reversal acts on a consecutive part of the
signed permutation. It reverses the order of the genes in the given part as
well as their signs.
%
\item {\bf Transposition} A transposition swaps two consecutive blocks
of genes.
%
\item {\bf Reverted transposition} It swaps two consecutive blocks, and one
of the blocks is reverted. As with reversals, the signs in the reverted block also
change.
%
\end{itemize}
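The three mutation types can be illustrated with the following short Python functions, a hypothetical sketch in which a signed permutation is represented as a list of signed integers and the block boundaries $i < j < k$ are 0-based, right-exclusive indices of our own convention:

```python
def reversal(perm, i, j):
    # reverse the segment perm[i:j] and flip the signs of its genes
    return perm[:i] + [-g for g in reversed(perm[i:j])] + perm[j:]

def transposition(perm, i, j, k):
    # swap the consecutive blocks perm[i:j] and perm[j:k]
    return perm[:i] + perm[j:k] + perm[i:j] + perm[k:]

def reverted_transposition(perm, i, j, k):
    # swap the two blocks and revert the first one, flipping its signs
    return perm[:i] + perm[j:k] + [-g for g in reversed(perm[i:j])] + perm[k:]
```

For example, \texttt{reversal([1, -3, -2, 4], 1, 3)} sorts this permutation in a single step, yielding \texttt{[1, 2, 3, 4]}.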
If we assume that only reversals happened, then there is an $O(n^2)$ running time
algorithm that obtains a shortest series of mutations transforming one genome into the other,
where $n$ is the number of genes.
If we consider other types of mutations, then the complexity of the problems is unknown.
For transpositions, the best known result is a $1.375$-approximation
\cite{bio_eh2005}; if we consider all possible types of mutations, then the
best known result is a $2$-approximation \cite{bio_gps99}. For a wide range of
biologically meaningful weights, the weighted sorting problem for all types of mutations
has a $1.5$-approximation \cite{bio_bo2006}.
If we do not know the signs, then the problem has been proved to be NP-complete
\cite{bio_caprara99a}. Similarly, the optimal reversal median problem is NP-complete
even for three genomes and signed permutations \cite{bio_caprara99}. The optimal reversal
median is a genome that minimises the sum of its distances from a set of genomes.
Below we describe the Hannenhalli-Pevzner theorem for the reversal distance of two genomes.
Instead of transforming permutation $\pi_1$ into $\pi_2$, we transform $\pi_2^{-1} \pi_1$
into the identity permutation. Based on elementary group theory, it is easy to show that
the two problems are equivalent. We assume that we have already calculated $\pi_2^{-1} \pi_1$,
and we will denote it simply by $\pi$.
We transform the $n$ long signed permutation into a $2n$ long unsigned permutation by
replacing $+i$ with $2i-1,\, 2i$ and $-i$ with $2i,\, 2i-1$. Additionally, we frame the
unsigned permutation between $0$ and $2n+1$. The vertices of the so-called
graph of desire and reality are the numbers of the unsigned permutation together with
$0$ and $2n+1$. Starting from $0$, we connect the neighbouring numbers pairwise
($0$ with the first number, the second number with the third, and so on); these
are the reality edges. We also connect $2i$ with $2i+1$ by an arc for every $i$;
these are the desire edges. An example graph can be seen in Figure \ref{fig:bio_breakpoint}.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\setlength{\unitlength}{2000sp}%
%
\begin{figure}
\begingroup\makeatletter\ifx\SetFigFont\undefined%
\gdef\SetFigFont#1#2#3#4#5{%
\reset@font\fontsize{#1}{#2pt}%
\fontfamily{#3}\fontseries{#4}\fontshape{#5}%
\selectfont}%
\fi\endgroup%
\begin{center}
\begin{picture}(6643,2709)(592,-2161)
\thinlines
{\color[rgb]{0,0,0}\put(1201,-1261){\oval(1200,1200)[tr]}
\put(1201,-1261){\oval(1200,1200)[tl]}
}%
{\color[rgb]{0,0,0}\put(1801,-1261){\oval(1200,1200)[tr]}
\put(1801,-1261){\oval(1200,1200)[tl]}
}%
{\color[rgb]{0,0,0}\put(4501,-1261){\oval(3000,2460)[tr]}
\put(4501,-1261){\oval(3000,2460)[tl]}
}%
{\color[rgb]{0,0,0}\put(6601,-1211){\oval( 2,100)[br]}
\put(5701,-1211){\oval(1802,1802)[tr]}
\put(5701,-1211){\oval(1802,1802)[tl]}
\put(4801,-1211){\oval( 2,100)[bl]}
}%
{\color[rgb]{0,0,0}\put(4501,-1261){\oval(1800,1800)[tr]}
\put(4501,-1261){\oval(1800,1800)[tl]}
}%
{\color[rgb]{0,0,0}\put(7201,-986){\oval( 50,550)[br]}
\put(5701,-986){\oval(3050,3050)[tr]}
\put(5701,-986){\oval(3050,3050)[tl]}
\put(4201,-986){\oval( 50,550)[bl]}
}%
\thicklines
{\color[rgb]{0,0,0}\put(1951,-1561){\line( 1, 0){450}}
}%
{\color[rgb]{0,0,0}\put(751,-1561){\line( 1, 0){450}}
}%
{\color[rgb]{0,0,0}\put(3151,-1561){\line( 1, 0){150}}
\put(3301,-1561){\line( 1, 0){150}}
\put(3451,-1561){\line( 1, 0){150}}
}%
{\color[rgb]{0,0,0}\put(4351,-1561){\line( 1, 0){450}}
}%
{\color[rgb]{0,0,0}\put(5551,-1561){\line( 1, 0){150}}
\put(5701,-1561){\line( 1, 0){150}}
\put(5851,-1561){\line( 1, 0){150}}
}%
{\color[rgb]{0,0,0}\put(6751,-1561){\line( 1, 0){150}}
\put(6901,-1561){\line( 1, 0){150}}
\put(7051,-1561){\line( 1, 0){150}}
}%
\put(601,-1561){\makebox(0,0)[lb]{\smash{\SetFigFont{12}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}0}%
}}}
\put(1201,-1561){\makebox(0,0)[lb]{\smash{\SetFigFont{12}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}2}%
}}}
\put(1801,-1561){\makebox(0,0)[lb]{\smash{\SetFigFont{12}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}1}%
}}}
\put(2401,-1561){\makebox(0,0)[lb]{\smash{\SetFigFont{12}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}3}%
}}}
\put(3001,-1561){\makebox(0,0)[lb]{\smash{\SetFigFont{12}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}4}%
}}}
\put(3601,-1561){\makebox(0,0)[lb]{\smash{\SetFigFont{12}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}9}%
}}}
\put(4201,-1561){\makebox(0,0)[lb]{\smash{\SetFigFont{12}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}10}%
}}}
\put(4801,-1561){\makebox(0,0)[lb]{\smash{\SetFigFont{12}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}7}%
}}}
\put(5401,-1561){\makebox(0,0)[lb]{\smash{\SetFigFont{12}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}8}%
}}}
\put(6001,-1561){\makebox(0,0)[lb]{\smash{\SetFigFont{12}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}5}%
}}}
\put(6601,-1561){\makebox(0,0)[lb]{\smash{\SetFigFont{12}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}6}%
}}}
\put(7201,-1561){\makebox(0,0)[lb]{\smash{\SetFigFont{12}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}11}%
}}}
\put(1501,-2161){\makebox(0,0)[lb]{\smash{\SetFigFont{12}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}-1}%
}}}
\put(2701,-2161){\makebox(0,0)[lb]{\smash{\SetFigFont{12}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}+2}%
}}}
\put(3901,-2161){\makebox(0,0)[lb]{\smash{\SetFigFont{12}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}+5}%
}}}
\put(5101,-2161){\makebox(0,0)[lb]{\smash{\SetFigFont{12}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}+3}%
}}}
\put(6301,-2161){\makebox(0,0)[lb]{\smash{\SetFigFont{12}{14.4}{\rmdefault}{\mddefault}{\updefault}{\color[rgb]{0,0,0}+4}%
}}}
\end{picture}
\end{center}
\caption{Representation of the $-1,\, +2,\, +5,\, +3,\, +4$ signed permutation with
an unsigned permutation, and its graph of desire and reality.}
\label{fig:bio_breakpoint}
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
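The construction above can be sketched in Python. The following hypothetical helper builds the reality and desire edges for a signed permutation and counts the cycles of the graph of desire and reality; it is only an illustration of the construction, and does not detect hurdles or fortresses:

```python
def cycles_of_desire_and_reality(signed_perm):
    """Count the cycles of the graph of desire and reality of a signed
    permutation (a sketch; hurdles and fortresses are not detected here)."""
    n = len(signed_perm)
    # replace +i with 2i-1, 2i and -i with 2i, 2i-1, framed by 0 and 2n+1
    unsigned = [0]
    for g in signed_perm:
        unsigned += [2 * g - 1, 2 * g] if g > 0 else [-2 * g, -2 * g - 1]
    unsigned.append(2 * n + 1)
    # reality edges pair up neighbouring positions starting from 0
    reality = {}
    for p in range(0, len(unsigned), 2):
        a, b = unsigned[p], unsigned[p + 1]
        reality[a] = b; reality[b] = a
    # desire edges connect 2i with 2i+1
    desire = {}
    for i in range(n + 1):
        desire[2 * i] = 2 * i + 1; desire[2 * i + 1] = 2 * i
    # every vertex has degree two, so the graph decomposes uniquely into
    # cycles that alternate between reality and desire edges
    seen, cycles = set(), 0
    for start in unsigned:
        if start not in seen:
            cycles += 1
            v, use_reality = start, True
            while v not in seen:
                seen.add(v)
                v = reality[v] if use_reality else desire[v]
                use_reality = not use_reality
    return cycles
```

For the permutation $-1,\, +2,\, +5,\, +3,\, +4$ of Figure \ref{fig:bio_breakpoint}, this sketch finds three cycles.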
Since each vertex in the graph of desire and reality has degree two, the graph can be
uniquely decomposed into cycles. We call a cycle directed if, walking along the
cycle, we traverse at least one reality edge from left to right and at least one
reality edge from right to left. Other cycles are unoriented cycles.
The span of a desire edge is the interval between its left and right vertices.
Two cycles overlap if there are two desire edges in the two cycles whose spans intersect.
The vertices of the overlap graph of a signed permutation are the cycles in its
graph of desire and reality; two nodes are connected if the two cycles overlap.
The overlap graph can be decomposed into components. A component is directed if it contains
a directed cycle, otherwise it is unoriented. The span of a component is the interval
between its leftmost and rightmost nodes in the graph of desire and reality.
An unoriented component is a hurdle if its span either does not contain any other unoriented
component or contains all unoriented components. Other unoriented components are called
protected non-hurdles. A super-hurdle is a hurdle for which it is true that if we delete it,
then one of the protected non-hurdles becomes a hurdle. A fortress is a permutation in which
all hurdles are super-hurdles and their number is odd.
The Hannenhalli-Pevzner theorem is the following:
%
\begin{tetel}
Given a signed permutation $\pi$, the minimum number of reversals sorting this permutation
into the identity permutation is
\begin{equation}
n + 1 - c_{\pi} + h_{\pi} + f_{\pi}
\label{bio_hp_elmelet}
\end{equation}
where $n$ is the length of the permutation, $c_{\pi}$ is the number of cycles,
$h_{\pi}$ is the number of hurdles, and $f_{\pi} = 1$ if the permutation is a
fortress, otherwise $f_{\pi} = 0$.
\end{tetel}
%
The proof of the theorem can be found in the book
due to Pevzner\nevindex{Pevzner, Pavel A.}.
The reversal distance can be calculated in $\Theta(n)$ time with the algorithm of
Bader et al.\nevindex{Bader, M.}.
It is very easy to obtain $c_{\pi}$ in $\Theta(n)$ time. The hard part is to calculate
$h_{\pi}$ and $f_{\pi}$. The source of the difficulty is that the overlap graph might contain
$\Omega(n^2)$ edges. Therefore the fast algorithm does not construct the entire overlap graph, only
a spanning tree on each of its components.
\subsection{Shotgun sequencing}
The genome of an organism usually contains significantly more than a million
nucleotides. Using a special biochemical technology, the order of the nucleotides
can be obtained; however, the uncertainty grows with the length of the DNA, and the
reading becomes completely unreliable after about 500 nucleotides.
A possible solution for overcoming this problem is the following: several copies are
made from the original DNA sequence and they are fragmented into small parts that can
be sequenced in the above described way. Then the original sequence must be reconstructed
from its overlapping fragments. This technique is called \ki{shotgun sequencing.}
\inddef{shotgun sequencing}
The mathematical formulation of the problem is that we want to find the shortest common
super-sequence of a set of sequences. Sequence $B$ is a super-sequence of $A$ if $A$ is a
subsequence of $B$. Recall that a subsequence is not necessarily a consecutive part of the
sequence. Maier proved that the shortest common super-sequence problem is NP-complete
if the size of the alphabet is at least $5$, and conjectured that this is also the case if the size
is at least $3$. Later on it was proved that the problem is NP-complete for every non-trivial
alphabet \cite{bio_ru81}.
A similar problem is the shortest common super-string problem, which is also NP-complete
\cite{bio_gms80}. This latter problem has biological relevance, since we are looking for
overlapping substrings. Several approximation algorithms have been published for
the shortest common super-string problem. A greedy algorithm finds for each pair of strings
the maximal possible overlap, then it tries to find a shortest common super-string
by merging the overlapping strings in a greedy way \cite{bio_tu88}. The running
time of the algorithm is $O(Nm)$, where $N$ is the number of sequences and $m$
is the total length of the sequences. This greedy method has been proved to be
a $4$-approximation \cite{bio_bjl+91}. A modified version that is a $3$-approximation
also exists, and the conjecture is that the modified version is a $2$-approximation
\cite{bio_bjl+91}.
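The greedy merging idea can be sketched as follows. This simple illustration recomputes all pairwise overlaps in each round, so it does not achieve the $O(Nm)$ running time cited above; it is only meant to show the merging strategy:

```python
def greedy_superstring(strings):
    """Greedy common super-string heuristic: repeatedly merge the pair of
    strings with the maximal overlap. A quadratic sketch, not the O(Nm)
    implementation referred to in the text."""
    def overlap(a, b):
        # length of the longest suffix of a that is a prefix of b
        for k in range(min(len(a), len(b)), 0, -1):
            if a.endswith(b[:k]):
                return k
        return 0
    # drop duplicates and strings contained in another string
    strings = list(dict.fromkeys(strings))
    strings = [s for s in strings if not any(s != t and s in t for t in strings)]
    while len(strings) > 1:
        k, a, b = max(((overlap(a, b), a, b)
                       for a in strings for b in strings if a != b),
                      key=lambda t: t[0])
        strings.remove(a)
        strings.remove(b)
        strings.append(a + b[k:])   # merge b onto a, reusing the overlap
    return strings[0]
```

For instance, the fragments \texttt{ACGT}, \texttt{CGTA} and \texttt{GTAC} are merged into the six-character super-string \texttt{ACGTAC}.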
The sequencing of DNA is not perfect, insertions, deletions and substitutions
might happen during sequencing. Therefore Jiang and Li suggested the shortest
$k$-approximative common super-string problem \cite{bio_jl91}. Kececioglu and
Myers worked out a software package including several heuristic algorithms
for the problem \cite{bio_km93}. Later on Myers worked for Celera, which
played a very important role in sequencing the human genome. A review
paper on the topic can be found in \cite{bio_wm97}.
\begin{gyak}
%
\ujgyak Show that a fortress\gyakindex{fortress} contains at least three super-hurdles.
%
\ujgyak At least how long is a fortress?
%
\end{gyak}
\begin{fld}
\ujfld{Concave Smith--Waterman}
Give the Smith--Waterman-algorithm\felindex{Smith--Waterman-algorithm} for concave gap penalty.
\ujfld{Concave Spouge}
Give Spouge-algorithm\felindex{Spouge-algorithm} for concave gap penalty.
\ujfld{Serving at a petrol station}
There are two rows at a petrol station. Each car needs either petrol or diesel oil.
At most two cars can be served at the same time, but only if they need different types of fuel
and the two cars are the first ones in the two rows or the first two in the same row.
The serving time is the same regardless of whether two cars are being served or only one.
Give a pair-HMM for which the Viterbi-algorithm\prindex{Viterbi-algorithm} provides a shortest serving scenario.
\ujfld{Moments of an HMM}
Given an HMM and a sequence. Obtain the mean, variance, $k$th moment of the probabilities
of paths emitting the given sequence.
\ujfld{Moments of a SCFG}
Given a SCFG and a sequence. Obtain the mean, variance, $k$th moment of the probabilities of
derivations of the given sequence.
%
\ujfld{Co-emission probability of two HMMs}
Can this probability be calculated in $O((n_1 n_2)^2)$ time where $n_1$ and $n_2$ are the
number of states in the HMMs?
%
\ujfld{Sorting reversals}
A sorting reversal is a reversal that decreases the reversal distance of a signed permutation.
How can a sorting reversal change the number of cycles and hurdles?
\end{fld}
\begin{fejmegj}
The first dynamic programming algorithm for aligning biological sequences was given by
Needleman\nevindex{Needleman, S. B.} and Wunsch\nevindex{Wunsch, C. D.} in 1970 \cite{bio_nw70}.
Though the concave gap penalty function is
biologically more relevant, the affine gap penalty has been the standard scoring scheme for
aligning biological sequences. For example, one of the most popular multiple alignment
programs, CLUSTAL-W, uses affine gap penalty and iterative sequence alignment \cite{bio_thg94}.
%
The edit distance of two strings can be calculated faster than in $\Theta(l^2)$
time using the famous ``Four Russians' speedup''\index{Four Russians speedup} \cite{bio_adkf70}.
The running time of the algorithm is $O(n^2/\log(n))$; however, it has such a big
constant factor that it is not worth using for the sequence lengths appearing
in biological applications.
%
The longest common subsequence problem can be solved with a dynamic programming algorithm
similar to the dynamic programming algorithm for aligning sequences. Unlike that algorithm,
the algorithm of Hunt and Szymanski creates a graph whose nodes are the characters of the
sequences $A$ and $B$, and $a_i$ is connected to $b_j$ iff $a_i = b_j$. Using this graph, the
longest common subsequence can be found in $\Theta((r+n)\log(n))$ time, where $r$ is the number of
edges in the graph and $n$ is the number of nodes \cite{bio_hs77}. Although the worst-case running
time of this algorithm is $O(n^2\lg(n))$, since the number of edges might be $O(n^2)$, in many cases
the number of edges is only $O(n)$, and in these cases the running time is only $O(n\lg(n))$.
%
A very sophisticated version of the corner-cutting method is the diagonal extension technique,
which fills in the dynamic programming table by diagonals and does not need a test value.
An example of such an algorithm is the algorithm of
Wu\nevindex{Wu, S.}\nevindex{Myers, Eugene W.}\nevindex{Manber, U.}\nevindex{Miller, Webb} et al. \cite{bio_wmmm90}.
The {\tt diff} command of the Unix operating system is also based on diagonal extension
\cite{bio_mm85}, having a running time of $O(n + m + d_e^2)$, where $n$ and $m$ are the lengths of
the sequences and $d_e$ is the edit distance between the two sequences.
%
The Knuth-Morris-Pratt string-searching algorithm searches for a small pattern $P$ in a
long string $M$. Its running time is $\Theta(p + m)$, where $p$ and $m$ are the lengths of
the sequences \cite{bio_kmp77}. Landau and Vishkin modified this algorithm such that
the modified version can find a pattern in $M$ that differs in at most $k$ positions
\cite{bio_lv86}. The running time of the algorithm is $\Theta(k(p\log(p) + m))$, the
memory requirement is $\Theta(k(p+m))$.
%
Although dynamic programming algorithms are the most frequently used techniques for
aligning sequences, it is also possible to attack the problem with integer linear programming.
Kececioglu\nevindex{Kececioglu, John D.} and his colleagues gave the first linear programming algorithm
for aligning sequences \cite{bio_klm+2000}. Their method has been extended to arbitrary gap penalty functions
\cite{bio_aclr2002}. Lancia\nevindex{Lancia, G.} wrote a review paper on the topic \cite{bio_lancia2004}
and Pachter and Sturmfels showed the relationship between the dynamic programming and
integer linear programming approach in their book \textit{Algebraic Statistics for Computational
Biology} \cite{Pachter05}.\nevindex{Pachter, Lior}\nevindex{Sturmfels, Bernd}
%
The structural alignment considers only the 3D structure of sequences. The
optimal structural alignment problem is to find an alignment in which gaps are penalised, but
the aligned characters are scored not by their similarity but by how close they are in the
superposed 3D structures. Several algorithms have been developed for the problem; one of them
is the combinatorial extension (CE) algorithm \cite{bio_sb98}.
%
For a given topology it is possible to find the Maximum Likelihood labelling
\cite{bio_ppsg2000}. This algorithm has been integrated into PAML, which is one
of the most popular software packages for phylogenetic analysis
(http://abacus.gene.ucl.ac.uk/software/paml.html).
%
The Maximum Likelihood tree problem is to find for a substitution model
and a set of sequences the tree topology and edge lengths for which the likelihood is
maximal. Surprisingly, it has only recently been proved that the problem is NP-complete
\cite{bio_ct2005,bio_roch2006}.
The similar Ancestral Maximum Likelihood problem has also been shown to be NP-complete
only recently \cite{bio_abh+2003}. The AML problem is to find the tree topology, edge lengths
and labellings for which the likelihood of a set of sequences is maximal in a given substitution
model.
%
The two most popular sequence alignment algorithms based on HMMs are the SAM
\cite{bio_hk96} and the HMMER (http://hmmer.wustl.edu/) packages. An example for
HMM for genome annotation is the work of Pedersen and Hein
\cite{bio_ph_2003}.\nevindex{Pedersen, Jakob Skou}\nevindex{Hein, Jotun}
Comparative genome annotation can be done with pair-HMMs like the
DoubleScan \cite{bio_md2002},
(http://www.sanger.ac.uk/Software/analysis/doublescan/) and the Projector \cite{bio_md2004},
(http://www.sanger.ac.uk/Software/analysis/projector/) programs.
%
Goldman, Thorne and Jones were the first to publish an HMM in which the
emission probabilities are calculated from evolutionary information
\cite{bio_gtj96}. It was used
for protein secondary structure prediction. The HMM emits alignment columns, the
emission probabilities can be calculated with the Felsenstein algorithm.
%
The Knudsen-Hein grammar is used in the PFold program, which predicts
RNA secondary structures \cite{bio_kh2003}. This SCFG generates RNA multiple alignments,
where the terminal symbols are alignment columns. The derivation probabilities can be
calculated with the Felsenstein algorithm, the corresponding substitution model is a
single nucleotide or a dinucleotide model, according to the derivation rules.
%
The running time of the \textsc{Forward} algorithm grows quadratically with the number of states
in the HMM. However, this is not always the fastest algorithm. For a biologically important
HMM, it is possible to reduce the $\Theta(5^n L^n)$ running time of the \textsc{Forward}
algorithm to $\Theta(2^n L^n)$ with a more sophisticated algorithm \cite{bio_lmdjh2003,bio_lmsh2003}.
However, it is unknown whether or not a similar acceleration exists for
the \textsc{Viterbi} algorithm.
%
The Zuker-Tinoco model \cite{bio_tul71} defines free energies for RNA secondary
structure elements, and the free energy of an RNA structure is the sum of
free energies of the elements. The Zuker-Sankoff algorithm calculates in $\Theta(l^4)$
time the minimum free energy structure, using $\Theta(l^2)$ memory, where $l$ is the
length of the RNA sequence. It is also possible to calculate the partition function
of the Boltzmann distribution with the same running time and memory requirement
\cite{bio_mccaskill90}. For a special case of free energies, both the optimal structure
and the partition function can be calculated in $\Theta(l^3)$ time, using still only
$\Theta(l^2)$ memory \cite{bio_lzp99}.
%
Two base-pairings, $i \cdot j$ and $i' \cdot j'$, form a pseudo-knot if
$i < i' < j < j'$. Predicting the optimal RNA secondary structure in which
arbitrary pseudo-knots are allowed is NP-complete \cite{bio_lp2000}. For special
types of pseudo-knots, polynomial running time algorithms exist
\cite{bio_akutsu2000,bio_lp2000,bio_re99,bio_uhky99}.
%
RNA secondary structures can be compared with aligning ordered forests
\cite{bio_htgk2003}.
%
Atteson gave a mathematical definition for the goodness of tree-constructing
methods, and showed that the \textsc{Neighbor-Joining}\index{\textsc{Neighbor-Joining}} algorithm is the best
one for some definitions \cite{bio_atteson99}. Elias and Lagergren recently published
an improved algorithm for \textsc{Neighbor-Joining} that has only $O(n^2)$ running time
\cite{bio_el2005}.
%
There are three possible tree topologies for four species that are called quartets.
If we know all the quartets of the tree, it is possible to reconstruct it. It is proved
that it is enough to know only the short quartets of a tree that are the quartets of
closely related species \cite{bio_essw97}.
%
A genome might contain more than one DNA sequence; these DNA sequences are called
chromosomes. A genome rearrangement might happen between chromosomes, too; such
mutations are called translocations. Hannenhalli\nevindex{Hannenhali, Sridhar} gave a $\Theta(n^3)$ running time
algorithm for calculating the translocation and reversal distance \cite{bio_hannenhalli96}.
Pisanti\nevindex{Pisanti, N.} and Sagot\nevindex{Sagot, Marie-France}
generalised the problem and gave results for the translocation diameter \cite{bio_ps2002}.
%
The generalisation of sorting permutations is the problem of finding the
minimum length generating word for an element of a group. The problem is
known to be NP-complete \cite{bio_jerrum85}.
Besides the reversal distance and the translocation distance problems, a polynomial
running time algorithm exists only for the block interchange distance
\cite{bio_christie96}. We mention that Bill Gates, the founder of Microsoft,
also worked on sorting permutations, namely with prefix reversals
\cite{bio_gp79}.
Description of many algorithms of bioinformatics can be found in the book of
Pevzner\nevindex{Pevzner, Pavel A.} and Jones\nevindex{Jones, Neil C.} \cite{Pevzner2004}.
%
We wrote only about the most important topics of bioinformatics, and we
did not cover several topics like recombination, pedigree analysis, character-based
tree reconstructing methods, partial digesting, protein threading methods, DNA chip
analysis, knowledge representation, biochemical pathways, scale-free networks, etc.
We close the chapter with the words of Donald Knuth:
``It is hard for me to say confidently that, after fifty more years of explosive growth of
computer science, there will still be a lot of fascinating unsolved problems at peoples' fingertips,
that it won't be pretty much working on refinements of well-explored things. Maybe all of the simple
stuff and the really great stuff has been discovered. It may not be true, but I can't predict an
unending growth. I can't be as confident about computer science as I can about biology. Biology
easily has 500 years of exciting problems to work on, it's at that level.''
\index{bioinformatics|)}
\end{fejmegj}