Table of Contents

I. APPLICATIONS
21. Bioinformatics
21.1 Algorithms on sequences
21.1.1 Distances of two sequences using linear gap penalty
21.1.2 Dynamic programming with arbitrary gap function
21.1.3 Gotoh algorithm for affine gap penalty
21.1.4 Concave gap penalty
21.1.5 Similarity of two sequences, the Smith-Waterman algorithm
21.1.6 Multiple sequence alignment
21.1.7 Memory-reduction with the Hirschberg algorithm
21.1.8 Memory-reduction with corner-cutting
21.2 Algorithms on trees
21.2.1 The small parsimony problem
21.2.2 The Felsenstein algorithm
21.3 Algorithms on stochastic grammars
21.3.1 Hidden Markov Models
21.3.2 Stochastic context-free grammars
21.4 Comparing structures
21.4.1 Aligning labelled, rooted trees
21.4.2 Co-emission probability of two HMMs
21.5 Distance based algorithms for constructing evolutionary trees
21.5.1 Clustering algorithms
21.5.2 Neighbour joining
21.6 Miscellaneous topics
21.6.1 Genome rearrangement
21.6.2 Shotgun sequencing

List of Figures

21.1. The tree on which we introduce the Felsenstein algorithm. Evolutionary times are denoted with $v$s on the edges of the tree.
21.2. A dendrogram.
21.3. Connecting leaf $n+1$ to the dendrogram.
21.4. Calculating $d_{u,k}$ according to the Centroid method.
21.5. Connecting leaf $n+1$ for constructing an additive tree.
21.6. Some tree topologies for proving Theorem 21.7.
21.7. The configuration of nodes $i$, $j$, $k$ and $l$ if $i$ and $j$ follow a cherry motif.
21.8. The possible places for node $m$ on the tree.
21.9. Representation of the $-1,\,+2,\,+5,\,+3,\,+4$ signed permutation with an unsigned permutation, and its graph of desire and reality.

Part I. APPLICATIONS

Chapter 21. Bioinformatics

In this chapter we first present algorithms on sequences, trees and stochastic grammars, then continue with algorithms for comparing structures and for constructing evolutionary trees, and finish the chapter with some rarely discussed topics of bioinformatics.

21.1 Algorithms on sequences

In this section, we introduce dynamic programming algorithms working on sequences. Sequences are finite series of characters over a finite alphabet. The basic idea of dynamic programming is that calculations for long sequences can be obtained from calculations on their substrings.

The algorithms introduced here are the most important ones in bioinformatics; they are the basis of several software packages.

21.1.1 Distances of two sequences using linear gap penalty

DNA contains the information of living cells. Before the duplication of cells, the DNA molecules are doubled, and both daughter cells contain one copy of DNA. The replication of DNA is not perfect; the stored information can be changed by random mutations. Random mutations create variants in the population, and these variants may evolve into new species.

Given two sequences, we can ask how closely the two species are related, and how many mutations are needed to describe the evolutionary history of the two sequences.

We suppose that mutations are independent of each other, and hence, the probability of a series of mutations is the product of the probabilities of the individual mutations. Each mutation is associated with a weight; mutations with high probability get a smaller weight, mutations with low probability get a greater weight. A reasonable choice is the logarithm of one over the probability of the mutation; in this case the weight of a series of mutations is the sum of the weights of the individual mutations. We also assume that a mutation and its reverse have the same probability, therefore we study how a sequence can be transformed into another instead of evolving two sequences from a common ancestor. Assuming minimum evolution, we seek the minimum-weight series of mutations that transforms one sequence into another. An important question is how we can quickly find such a minimum-weight series. The naive algorithm finds all the possible series of mutations and chooses the one with minimum weight. Since the number of possible series of mutations grows exponentially, as we will show in this chapter, the naive algorithm is obviously too slow.

We are going to introduce Sellers' algorithm [257]. Let $\Sigma$ be a finite set of symbols, and let $\Sigma^*$ denote the set of finite sequences over $\Sigma$. The $n$ long prefix of $A$ will be denoted by $A_n$, and $a_n$ denotes the $n$th character of $A$. The following transformations can be applied to a sequence:

  • Insertion of symbol $\sigma$ before position $i$.

  • Deletion of symbol $\sigma$ at position $i$.

  • Substitution of symbol $\sigma$ by symbol $\sigma'$ at position $i$.

The concatenation of mutations is denoted by the $\circ$ symbol. $\mathcal{T}$ denotes the set of finite concatenations of the above mutations, and $\tau(A) = B$ denotes that transformation $\tau \in \mathcal{T}$ transforms sequence $A$ into sequence $B$. Let $w \colon \mathcal{T} \to \mathbb{R}^{+} \cup \{0\}$ be a weight function such that for any transformations $\tau_1$, $\tau_2$ and $\tau$ satisfying

Equation 21.1.

$\tau_1 \circ \tau_2 = \tau \;,$

the

Equation 21.2.

$w(\tau_1) + w(\tau_2) = w(\tau)$

equation also holds. Furthermore, let $w(\tau)$ be independent of the order of the mutations in $\tau$. The transformation distance between two sequences, $A$ and $B$, is the minimum weight of transformations transforming $A$ into $B$:

Equation 21.3.

$\delta(A, B) = \min\{w(\tau) \mid \tau(A) = B\} \;.$

If we assume that $w$ satisfies

Equation 21.4.

$w(\sigma \leftrightarrow \nu) = w(\nu \leftrightarrow \sigma) \;,$

Equation 21.5.

$w(\sigma \leftrightarrow \sigma) = 0 \;,$

Equation 21.6.

$w(\sigma \leftrightarrow \nu) + w(\nu \leftrightarrow \rho) \ge w(\sigma \leftrightarrow \rho)$

for any symbols $\sigma$, $\nu$, $\rho$, then the transformation distance is indeed a metric on $\Sigma^*$.

Since $\delta$ is a metric, it is enough to consider transformations that change each position of a sequence at most once. Series of transformations are depicted with sequence alignments. By convention, the sequence at the top is the ancestor and the sequence at the bottom is its descendant. For example, the alignment below shows that there were substitutions at positions three and five, there was an insertion in the first position and a deletion in the eighth position.

  - A U C G U A C A G  
  U A G C A U A - A G  

A pair at a position is called an aligned pair. The weight of the series of transformations described by the alignment is the sum of the weights of the aligned pairs. Each series of mutations can be described by an alignment, and this description is unique up to the permutation of the mutations in the series. Since summation is commutative, the weight of a series of mutations does not depend on the order of the mutations.

We are going to show that the number of possible alignments also grows exponentially with the length of the sequences. The alignments that do not contain this pattern

  # -  
  - #  

where # is an arbitrary character of $\Sigma$, form a subset of the possible alignments. The size of this subset is $\binom{n+m}{m}$, since there is a bijection between this set of alignments and the set of coloured sequences that contain the characters of $A$ and $B$ in increasing order, where the characters of $A$ are coloured with one colour and the characters of $B$ are coloured with the other colour. For example, if $n = m$, then the size of this subset is $\binom{2n}{n} = \Theta(4^{n}/\sqrt{n})$.

An alignment whose weight is minimal is called an optimal alignment. Let the set of optimal alignments of $A_i$ and $B_j$ be denoted by $\alpha^*(A_i, B_j)$, and let $\delta(A_i, B_j)$ denote the weight of any alignment in $\alpha^*(A_i, B_j)$.

The key to the fast algorithm for finding an optimal alignment is that if we know $\delta(A_{i-1}, B_{j-1})$, $\delta(A_{i-1}, B_{j})$ and $\delta(A_{i}, B_{j-1})$, then we can calculate $\delta(A_{i}, B_{j})$ in constant time. Indeed, if we delete the last aligned pair of an optimal alignment of $A_i$ and $B_j$, we get an optimal alignment of $A_{i-1}$ and $B_{j}$, or $A_{i}$ and $B_{j-1}$, or $A_{i-1}$ and $B_{j-1}$, depending on whether the last aligned column depicts a deletion, an insertion, or a substitution or match, respectively. Hence,

Equation 21.7.

$\delta(A_i, B_j) = \min\{\delta(A_{i-1}, B_{j}) + g;\; \delta(A_{i}, B_{j-1}) + g;\; \delta(A_{i-1}, B_{j-1}) + w(a_i, b_j)\} \;,$

where $g$ is the gap penalty, the weight of an insertion or deletion, and $w(a_i, b_j)$ is the weight of substituting $a_i$ by $b_j$.

The weights of optimal alignments are calculated in the so-called dynamic programming table, $D$. The $d_{i,j}$ element of $D$ contains $\delta(A_i, B_j)$. Comparing an $n$ long and an $m$ long sequence requires the fill-in of an $(n+1) \times (m+1)$ table; the indexing of rows and columns runs from $0$ till $n$ and $m$, respectively. The initial conditions for column $0$ and row $0$ are

Equation 21.8.

$d_{0,0} = 0 \;,$

Equation 21.9.

$d_{i,0} = i \cdot g \;,$

Equation 21.10.

$d_{0,j} = j \cdot g \;.$

The table can be filled in using Equation (21.7); the time requirement of the fill-in is $\Theta(nm)$. After filling in the dynamic programming table, the set of all optimal alignments can be found in the following way, called trace-back. We go from the right bottom corner to the left top corner choosing the cell(s) giving the optimal value of the current cell (there might be more than one such cell). Stepping up from position $d_{i,j}$ means a deletion, stepping to the left means an insertion, and the diagonal steps mean either a substitution or a match depending on whether or not $a_i = b_j$. Each step is represented with an oriented edge; in this way, we get an oriented graph whose vertices are a subset of the cells of the dynamic programming table. The number of optimal alignments might grow exponentially with the length of the sequences, however, the set of optimal alignments can be represented in polynomial time and space. Indeed, each path from $d_{n,m}$ to $d_{0,0}$ on the oriented graph obtained in the trace-back gives an optimal alignment.
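
The following sketch illustrates the fill-in of Equations (21.7)–(21.10) and the recovery of one optimal alignment. It is a minimal Python sketch, not the original code of the algorithm; the 0/1 substitution weight and the unit gap penalty are illustrative assumptions.

  # A minimal sketch of the fill-in and of one trace-back path.
  # The weight function w (0 for match, 1 for mismatch) and the gap
  # penalty g = 1 are illustrative assumptions.
  def global_distance(A, B, w=lambda a, b: 0 if a == b else 1, g=1):
      n, m = len(A), len(B)
      d = [[0] * (m + 1) for _ in range(n + 1)]
      for i in range(1, n + 1):          # initial conditions, Eq. (21.9)
          d[i][0] = i * g
      for j in range(1, m + 1):          # initial conditions, Eq. (21.10)
          d[0][j] = j * g
      for i in range(1, n + 1):          # fill-in using Eq. (21.7)
          for j in range(1, m + 1):
              d[i][j] = min(d[i - 1][j] + g,                 # deletion
                            d[i][j - 1] + g,                 # insertion
                            d[i - 1][j - 1] + w(A[i - 1], B[j - 1]))
      aligned, i, j = [], n, m           # trace-back of one optimal alignment
      while i > 0 or j > 0:
          if i > 0 and j > 0 and \
                  d[i][j] == d[i - 1][j - 1] + w(A[i - 1], B[j - 1]):
              aligned.append((A[i - 1], B[j - 1])); i -= 1; j -= 1
          elif i > 0 and d[i][j] == d[i - 1][j] + g:
              aligned.append((A[i - 1], '-')); i -= 1       # deletion
          else:
              aligned.append(('-', B[j - 1])); j -= 1       # insertion
      return d[n][m], aligned[::-1]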

21.1.2 Dynamic programming with arbitrary gap function

Since deletions and insertions get the same weight, their common name is indel or gap, and their weights are called gap penalty. Usually gap penalties do not depend on the deleted or inserted characters. The gap penalties used in the previous section grow linearly with the length of the gap. This means that a long indel is considered as the result of independent insertions or deletions of characters. However, the biological observation is that long indels can be formed in one evolutionary step, and these long indels are penalised too much with the linear gap penalty function. This observation motivated the introduction of more complex gap penalty functions [289]. The algorithm introduced by Waterman et al. penalises a $k$ long gap with $g(k)$. For example the weight of this alignment:

  - - A U C G A C G U A C A G  
  U A G U C - - - A U A G A G  

is $g(2) + g(3) + w(A,G) + w(U,U) + w(C,C) + w(G,A) + w(U,U) + w(A,A) + w(C,G) + w(A,A) + w(G,G)$.

We still seek the minimal weight series of transformations transforming one sequence into another or, equivalently, an optimal alignment. Since there might be a long indel at the end of the optimal alignment, besides knowing $\delta(A_{i-1}, B_{j-1})$, we must know all $\delta(A_k, B_j)$, $0 \le k < i$, and all $\delta(A_i, B_k)$, $0 \le k < j$, to calculate $\delta(A_i, B_j)$. The dynamic programming recursion is given by the following equation:

Equation 21.11.

$d_{i,j} = \min\{d_{i-1,j-1} + w(a_i, b_j);\; \min_{0 \le k < i}\{d_{k,j} + g(i-k)\};\; \min_{0 \le k < j}\{d_{i,k} + g(j-k)\}\} \;.$

The initial conditions are:

Equation 21.12.

$d_{0,0} = 0 \;,$

Equation 21.13.

$d_{i,0} = g(i) \;,$

Equation 21.14.

$d_{0,j} = g(j) \;.$

The time requirement for calculating $d_{i,j}$ is $\Theta(i+j)$, hence the running time of the fill-in part needed to calculate the weight of an optimal alignment is $\Theta(nm(n+m))$. Similarly to the previous algorithm, the set of optimal alignments, represented by paths from $d_{n,m}$ to $d_{0,0}$, can be found in the trace-back part.

If $n = m$, then the running time of this algorithm is $\Theta(n^3)$. With restrictions on the gap penalty function, the running time can be decreased. We are going to show two such algorithms in the next two sections.
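
A direct implementation of Equations (21.11)–(21.14) is sketched below in Python; the concave logarithmic gap function is only an illustrative assumption, any gap penalty function could be passed in.

  import math

  # Fill-in with an arbitrary gap penalty function, Theta(nm(n+m)) time.
  def distance_arbitrary_gap(A, B, w=lambda a, b: 0 if a == b else 1,
                             g=lambda k: 1.0 + math.log(k)):
      n, m = len(A), len(B)
      d = [[0.0] * (m + 1) for _ in range(n + 1)]
      for i in range(1, n + 1):
          d[i][0] = g(i)               # one long deletion, Eq. (21.13)
      for j in range(1, m + 1):
          d[0][j] = g(j)               # one long insertion, Eq. (21.14)
      for i in range(1, n + 1):
          for j in range(1, m + 1):
              best = d[i - 1][j - 1] + w(A[i - 1], B[j - 1])
              for k in range(i):       # a deletion of length i - k
                  best = min(best, d[k][j] + g(i - k))
              for k in range(j):       # an insertion of length j - k
                  best = min(best, d[i][k] + g(j - k))
              d[i][j] = best
      return d[n][m]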

21.1.3 Gotoh algorithm for affine gap penalty

A gap penalty function is affine if

Equation 21.15.

$g(k) = u k + v, \qquad u \ge 0,\; v \ge 0 \;.$
There exists a $\Theta(nm)$ running time algorithm for affine gap penalty [113]. Recall that in the Waterman algorithm,

Equation 21.16.

$d_{i,j} = \min\{d_{i-1,j-1} + w(a_i, b_j);\; p_{i,j};\; q_{i,j}\} \;,$

where

Equation 21.17.

$p_{i,j} = \min_{1 \le k \le i}\{d_{i-k,j} + g(k)\} \;,$

Equation 21.18.

$q_{i,j} = \min_{1 \le k \le j}\{d_{i,j-k} + g(k)\} \;.$

The key of the Gotoh algorithm is the following reindexing:

Equation 21.19.

$p_{i,j} = \min\{d_{i-1,j} + g(1);\; \min_{1 \le k \le i-1}\{d_{i-1-k,j} + g(k+1)\}\} = \min\{d_{i-1,j} + u + v;\; p_{i-1,j} + u\} \;.$

And similarly

Equation 21.20.

$q_{i,j} = \min\{d_{i,j-1} + u + v;\; q_{i,j-1} + u\} \;.$

In this way, $p_{i,j}$ and $q_{i,j}$ can be calculated in constant time, hence $d_{i,j}$, too. Thus, the running time of the algorithm remains $\Theta(nm)$, and the algorithm is only a constant factor slower than the dynamic programming algorithm for linear gap penalties.
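
The reindexing above translates into three tables filled in together; the sketch below is a minimal Python rendering of Equations (21.16), (21.19) and (21.20), with illustrative gap parameters u = 1 and v = 2 (so that g(k) = k + 2).

  # Gotoh algorithm for the affine gap penalty g(k) = u*k + v.
  def gotoh(A, B, w=lambda a, b: 0 if a == b else 1, u=1, v=2):
      n, m = len(A), len(B)
      INF = float("inf")
      d = [[INF] * (m + 1) for _ in range(n + 1)]
      p = [[INF] * (m + 1) for _ in range(n + 1)]
      q = [[INF] * (m + 1) for _ in range(n + 1)]
      d[0][0] = 0
      for i in range(1, n + 1):
          d[i][0] = p[i][0] = u * i + v       # a single gap of length i
      for j in range(1, m + 1):
          d[0][j] = q[0][j] = u * j + v
      for i in range(1, n + 1):
          for j in range(1, m + 1):
              # Eq. (21.19): open a new gap or extend the running one
              p[i][j] = min(d[i - 1][j] + u + v, p[i - 1][j] + u)
              # Eq. (21.20): the same for insertions
              q[i][j] = min(d[i][j - 1] + u + v, q[i][j - 1] + u)
              d[i][j] = min(d[i - 1][j - 1] + w(A[i - 1], B[j - 1]),
                            p[i][j], q[i][j])  # Eq. (21.16)
      return d[n][m]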

21.1.4 Concave gap penalty

There is no biological justification for the affine gap penalty function [29], [112], its wide-spread use (for example, CLUSTAL-W [274]) is due to its low running time. There is a more realistic gap penalty function for which an algorithm exists whose running time is slightly more than the running time for affine gap penalty, but it is still significantly better than the cubic running time algorithm of Waterman et al. [101], [212].

A gap penalty function is concave if for each $k$, $g(k+1) - g(k) \le g(k) - g(k-1)$. Namely, each further extension of a gap is penalised less and less. It might happen that the function starts decreasing after a given point; to avoid this, it is usually assumed that the function increases monotonously. Based on empirical data [29], if two sequences evolved for $d$ PAM units [66], the weight of a $k$ long indel is

Equation 21.21. 


which is also a concave function. (One PAM unit is the time span during which 1% of the sequence changed.) There exists an $O(nm(\lg n + \lg m))$ running time algorithm for concave gap penalty functions, based on the so-called forward-looking technique. The Forward-Looking algorithm calculates the $a$th row of the dynamic programming table in the following way for an arbitrary gap penalty function:

Forward-Looking

   1  FOR j ← 1 TO m
   2    DO b[j] ← d[a,0] + g(j)
   3       p[j] ← 0
   4  FOR j ← 1 TO m
   5    DO d[a,j] ← min{b[j]; c[j]; d[a-1,j-1] + w(a_a, b_j)}
   6          ▹ At this step, we suppose that b[j] and c[j] are already calculated. 
   7    FOR i ← j+1 TO m          ▹ Inner cycle. 
   8       DO IF d[a,j] + g(i-j) < b[i] THEN 
   9             b[i] ← d[a,j] + g(i-j)
  10             p[i] ← j

where $g$ is the gap penalty function, $b[i]$ holds the best candidate value sent forward so far for cell $i$ of the row, $c[j]$ is the analogous candidate value coming from column $j$, and $p$ is a pointer whose role will be described later. In row $a$, we assume that we have already calculated the previous rows of the table. It is easy to show that the forward-looking algorithm makes the same comparisons as the traditional, backward-looking algorithm, just in a different order. While the backward-looking algorithm calculates $d[a,j]$ at the $j$th position of the row by looking back to the already calculated entries of the dynamic programming table, the Forward-Looking algorithm has already calculated $d[a,j]$ by arriving at the $j$th position of the row. On the other hand, it sends forward the candidate values $d[a,j] + g(i-j)$ for all $i > j$, and by arriving at cell $i$, all the needed comparisons of candidate values have been made. Therefore, the Forward-Looking algorithm is not faster than the traditional backward-looking algorithm, however, this conception helps accelerate the algorithm.

The key idea is the following.

Lemma 21.1 Let $j$ be the actual cell in the row. If

Equation 21.22.

$b[i] \le d[a,j] + g(i-j) \;,$

then for all $i' > i$

Equation 21.23.

$b[i'] \le d[a,j] + g(i'-j) \;.$

Proof. From the condition it follows that there is a $j' < j$ for which

Equation 21.24.

$d[a,j'] + g(i-j') \le d[a,j] + g(i-j) \;.$

Let us add $g(i'-j') - g(i-j')$ to the equation:

Equation 21.25.

$d[a,j'] + g(i'-j') \le d[a,j] + g(i-j) + g(i'-j') - g(i-j') \;.$

For each concave gap penalty function,

Equation 21.26.

$g(i'-j') - g(i-j') \le g(i'-j) - g(i-j) \;,$

rearranging this and using Equation (21.25):

Equation 21.27.

$b[i'] \le d[a,j'] + g(i'-j') \le d[a,j] + g(i'-j) \;.$


The idea of the algorithm is to find, with a binary search, the position from where the actual cell cannot send forward optimal values. This alone is still not enough for the desired acceleration, since $O(m)$ candidate values might have to be rewritten in the worst case. However, the corollary of the previous lemma leads to the desired acceleration:

Corollary 21.2 Before the $j$th cell sends forward candidate values in the inner cycle of the forward-looking algorithm, the cells after cell $j$ form blocks, each block having the same pointer, and the pointer values decrease by blocks from left to right.

The pseudocode of the algorithm is the following:

Forward-Looking-Binary-Searching()

   1  ; ;  
   2  FOR  TO  
   3    DO  
   4         
   5         At this step, we suppose that  and  are already calculated. 
   6       IF  THEN 
   7          THEN  
   8       IF  AND  
   9          THEN ;  
  10          ELSE IF  
  11             THEN  
                        
  12          IF  
  13             THEN ;  
  14             ELSE  
  15                IF  
  16                   THEN  
  17                   ELSE  
  18            
  19            
  20            
                     

The algorithm works in the following way: for each row, we maintain a variable storing the number of blocks, a list of the positions of the block ends, and a list of the pointers of the blocks. For each cell $j$, the algorithm finds the last position for which cell $j$ gives an optimal candidate value, using binary search: there is first a binary search among the blocks, then among the positions inside the chosen block. It is enough to rewrite three values after the binary searches: the number of blocks, the end position of the last block, and its pointer. Therefore the most time-consuming part is the binary search, which takes $O(\lg m)$ time for each cell.

We do the same for the columns. If the dynamic programming table is filled in row by row, then for each position $j$ in row $a$, the algorithm uses the block system of column $j$. Therefore the running time of the algorithm is $O(nm(\lg n + \lg m))$.

21.1.5 Similarity of two sequences, the Smith-Waterman algorithm

We can measure not only the distance but also the similarity of two sequences. For measuring the similarity of two characters $a$ and $b$, the most frequently used function is the log-odds:

Equation 21.28.

$S(a, b) = \log\!\left(\frac{p_{a,b}}{q_a q_b}\right) \;,$

where $p_{a,b}$ is the joint probability of the two characters (namely, the probability of observing them together in an alignment column), and $q_a$ and $q_b$ are the marginal probabilities. The similarity is positive if $p_{a,b} > q_a q_b$, otherwise it is negative. Similarities are obtained from empirical data; for amino acids, the most commonly used similarities are given by the PAM and BLOSUM matrices.

If we penalise gaps with negative numbers, then the above described global alignment algorithms work with similarities by changing minimisation to maximisation.

It is possible to define a special problem that works for similarities and does not work for distances. It is the local similarity problem or the local sequence alignment problem [261]. Given two sequences, a similarity function and a gap penalty function, the problem is to give two substrings of the sequences whose similarity is maximal. A substring of a sequence is a consecutive part of the sequence. The biological motivation of the problem is that some parts of biological sequences evolve slowly while other parts evolve fast. The local alignment finds the most conserved part of the two sequences. Local alignment is widely used for homology searching in databases. The reason why local alignment works well for homology searching is that the local alignment score can separate homologous and non-homologous sequences better, since the score is not decreased by the variable regions of the sequences.

The Smith-Waterman algorithm works in the following way. The initial conditions are:

Equation 21.29.

$d_{0,0} = d_{i,0} = d_{0,j} = 0 \;.$

Considering linear gap penalty, the dynamic programming table is filled in using the following recursion:

Equation 21.30.

$d_{i,j} = \max\{0;\; d_{i-1,j-1} + S(a_i, b_j);\; d_{i-1,j} + g;\; d_{i,j-1} + g\} \;.$

Here $g$, the gap penalty, is a negative number. The best local similarity score of the two sequences is the maximal number in the table. The trace-back starts in the cell having the maximal number, and ends when it first reaches a $0$.

It is easy to prove that the alignment obtained in the trace-back is locally optimal: if the alignment could be extended at its end with a sub-alignment whose similarity is positive, then there would be a greater number in the dynamic programming table; if it could be extended at its beginning with a sub-alignment having positive similarity, then the value at the end of the trace-back would not be $0$.
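
The sketch below implements Equations (21.29) and (21.30) in Python; the similarity values (match +2, mismatch -1) and the gap penalty -2 are illustrative assumptions, in practice they would come from, e.g., a BLOSUM matrix.

  # Smith-Waterman local alignment with linear gap penalty.
  def smith_waterman(A, B, S=lambda a, b: 2 if a == b else -1, g=-2):
      n, m = len(A), len(B)
      d = [[0] * (m + 1) for _ in range(n + 1)]    # Eq. (21.29)
      best, best_cell = 0, (0, 0)
      for i in range(1, n + 1):
          for j in range(1, m + 1):
              d[i][j] = max(0,                     # Eq. (21.30)
                            d[i - 1][j - 1] + S(A[i - 1], B[j - 1]),
                            d[i - 1][j] + g,
                            d[i][j - 1] + g)
              if d[i][j] > best:
                  best, best_cell = d[i][j], (i, j)
      # the trace-back would start at best_cell and stop at the first 0
      return best, best_cell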

21.1.6 Multiple sequence alignment

The multiple sequence alignment problem was introduced by David Sankoff [254], and by today, multiple sequence alignment has become a central problem in bioinformatics. Dan Gusfield calls it the Holy Grail of bioinformatics [123]. Multiple alignments are widespread both in searching databases and in inferring evolutionary relationships. Using multiple alignments, it is possible to find the conserved parts of a sequence family, the positions that describe the functional properties of the sequence family. As Arthur Lesk said [141]: what two sequences whisper, a multiple sequence alignment shouts out loud.

The columns of a multiple alignment of $k$ sequences are called aligned $k$-tuples. The dynamic programming for the optimal multiple alignment is the generalisation of the dynamic programming for optimal pairwise alignment. To align $k$ sequences, we have to fill in a $k$-dimensional dynamic programming table. To calculate an entry in this table using linear gap penalty, we have to look back to the $2^k - 1$ neighbouring entries of a $k$-dimensional hypercube. Therefore, the memory requirement in case of $k$ sequences, each $l$ long, is $\Theta(l^k)$, and the running time of the algorithm is $\Theta(2^k l^k)$ if we use linear gap penalty, and even greater with arbitrary gap penalty.

There are two fundamental problems with multiple sequence alignment. The first is algorithmic: it has been proven that the multiple sequence alignment problem is NP-complete [288]. The other problem is methodological: it is not clear how to score a multiple alignment. An objective scoring function could be given only if the evolutionary relationships were known; in this case an aligned $k$-tuple could be scored according to an evolutionary tree [228].

A heuristic solution for both problems is the iterative sequence alignment [87], [62], [274]. This method first constructs a guide-tree using pairwise distances (such tree-building methods are described in the section called “21.5 Distance based algorithms for constructing evolutionary trees”). The guide-tree is then used to construct a multiple alignment. Each leaf is labelled with a sequence: first the sequences in “cherry motifs” are aligned to each other, then sequence alignments are aligned to sequences and to other sequence alignments according to the guide-tree. The iterative sequence alignment method uses the “once a gap – always gap” rule. This means that gaps already placed into an alignment cannot be modified when aligning the alignment to another alignment or sequence; the only possibility is to insert all-gap columns into an alignment. The aligned sequences are usually described with a profile. The profile is a $(|\Sigma| + 1) \times l$ table, where $l$ is the length of the alignment. A column of a profile contains the statistics of the corresponding aligned $k$-tuple: the frequencies of the characters and of the gap symbol.

The obtained multiple alignment can be used for constructing another guide-tree, which can be used for another iterative sequence alignment, and this procedure can be iterated till convergence. The rationale behind the iterative alignment heuristic is that the optimal pairwise alignment of closely related sequences will be the same in the optimal multiple alignment. The drawback of the heuristic is that even if this assumption is true, there might be several optimal alignments for two sequences, and their number might grow exponentially with the length of the sequences. For example, let us consider the following two optimal alignments of the sequences AUCGGUACAG and AUCAUACAG:

  A U C G G U A C A G      A U C G G U A C A G  
  A U C - A U A C A G      A U C A - U A C A G  

We cannot choose between the two alignments, however, in a multiple alignment, only one of them might be optimal. For example, if we align the sequence AUCGAU to the two optimal alignments, we get the following locally optimal alignments:

The left alignment is globally optimal, however, the right alignment is only locally optimal.

Hence, the iterative alignment method yields only a locally optimal alignment. Another problem of this method is that it does not give an upper bound on the goodness of the approximation. In spite of these drawbacks, iterative alignment methods are the most widely used ones for multiple sequence alignment in practice, since they are fast and usually give biologically reasonable alignments. Recently some approximation methods for multiple sequence alignment have been published with known upper bounds on their goodness [124], [246]. However, the bounds are biologically not reasonable, and in practice, these methods usually give worse results than the heuristic methods.

We must mention a novel greedy method that is not based on dynamic programming. DiAlign [215], [216], [217] first searches for gap-free homologous substrings by pairwise sequence comparison. The gap-free alignments of the homologous substrings are called diagonals of the dynamic programming table, hence the name of the method: Diagonal Alignment. The diagonals are scored according to their similarity value, and diagonals that are not compatible with high-score diagonals get a penalty. Two diagonals are not compatible if they cannot be in the same alignment. After scoring the diagonals, they are merged into a multiple alignment in a greedy way. First the best diagonal is selected, then the best diagonal that is compatible with the first one, then the best diagonal that is compatible with the first two, etc. The multiple alignment is the union of the selected diagonals, which might not cover all the characters in the sequences. Those characters that are not in any of the selected diagonals are marked as “non-alignable”. The drawback of the method is that sometimes it introduces too many gaps due to not penalising the gaps at all. However, DiAlign has been one of the best heuristic alignment approaches and is widely used in the bioinformatics community.

21.1.7 Memory-reduction with the Hirschberg algorithm

If we want to calculate only the distance or similarity of two sequences and we are not interested in an optimal alignment, then in case of linear or affine gap penalties it is very easy to construct an algorithm that uses only linear memory. Indeed, note that the dynamic programming recursion needs only the previous row (in case of filling in the dynamic programming table row by row), and the algorithm does not need to store earlier rows. On the other hand, once the dynamic programming table has reached the last row and forgotten the earlier rows, it is not possible to trace back the optimal alignment. If the dynamic programming table is scrolled again and again in linear memory to trace back the optimal alignment row by row, then the running time grows up to $\Theta(n^3)$, where $n$ is the length of the sequences.

However, it is possible to design an algorithm that obtains an optimal alignment in $\Theta(nm)$ running time and uses only linear memory. This is the Hirschberg algorithm [137], which we are going to introduce for distance-based alignment with linear gap penalty.

We introduce the suffixes of a sequence; a suffix is a substring ending at the end of the sequence. Let $A^k$ denote the suffix of $A$ starting with character $a_{k+1}$.

The Hirschberg algorithm first runs the linear-memory dynamic programming algorithm described above for sequences $A_{\lceil n/2 \rceil}$ and $B$. Similarly, it runs the dynamic programming algorithm for the reverses of the sequences $A^{\lceil n/2 \rceil}$ and $B$.

Based on the two dynamic programming procedures, we know the score of the optimal alignment of $A_{\lceil n/2 \rceil}$ and an arbitrary prefix of $B$, and similarly the score of the optimal alignment of $A^{\lceil n/2 \rceil}$ and an arbitrary suffix of $B$. From this we can tell the score of the optimal alignment of $A$ and $B$:

Equation 21.31.

$\delta(A, B) = \min_{0 \le j \le m}\{\delta(A_{\lceil n/2 \rceil}, B_j) + \delta(A^{\lceil n/2 \rceil}, B^j)\} \;,$

and from this calculation it must be clear that in the optimal alignment of $A$ and $B$, $A_{\lceil n/2 \rceil}$ is aligned with the prefix $B_j$ for which

Equation 21.32.

$\delta(A_{\lceil n/2 \rceil}, B_j) + \delta(A^{\lceil n/2 \rceil}, B^j)$

is minimal.

Since we know the previous rows of the dynamic programming tables, we can also tell whether $a_{\lceil n/2 \rceil}$ and $a_{\lceil n/2 \rceil + 1}$ are aligned with characters of $B$ or are deleted in the optimal alignment. Similarly, we can tell whether any character of $B$ is inserted between $a_{\lceil n/2 \rceil}$ and $a_{\lceil n/2 \rceil + 1}$.

In this way, we get at least two columns of the optimal alignment. Then we do the same for $A_{\lceil n/2 \rceil}$ and the remaining prefix of $B$, and for $A^{\lceil n/2 \rceil}$ and the remaining suffix of $B$. In this way we get alignment columns at the quarter and at the three-quarters of sequence $A$. In the next iteration, we do the same for the four pairs of sequences, etc., and we iterate till we get all the alignment columns of the optimal alignment.

Obviously, the memory requirement still grows only linearly with the length of the sequences. We show that the running time is still $\Theta(nm)$, where $n$ and $m$ are the lengths of the sequences. This comes from the fact that the running time of the first iteration is $c \cdot nm$ for some constant $c$, while the running time of the second iteration is $c(j^* n/2 + (m - j^*) n/2) = c \cdot nm/2$, where $j^*$ is the position for which we get the minimum distance in Eqn. (21.31). Hence the total running time is:

Equation 21.33.

$c \cdot nm \left(1 + \frac{1}{2} + \frac{1}{4} + \cdots \right) \le 2 c \cdot nm = \Theta(nm) \;.$
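
A minimal Python sketch of the whole procedure is given below; the weight function and gap penalty are the illustrative ones used earlier. The helper computes one row of the dynamic programming table in linear memory, and the recursion splits the problem at the position minimising Equation (21.32).

  # Hirschberg algorithm with linear gap penalty; returns one optimal
  # alignment as a list of aligned pairs, using linear working memory.
  def last_row(A, B, w, g):
      # distances of A against every prefix of B, in linear memory
      prev = [j * g for j in range(len(B) + 1)]
      for i in range(1, len(A) + 1):
          cur = [i * g] + [0] * len(B)
          for j in range(1, len(B) + 1):
              cur[j] = min(prev[j] + g, cur[j - 1] + g,
                           prev[j - 1] + w(A[i - 1], B[j - 1]))
          prev = cur
      return prev

  def hirschberg(A, B, w=lambda a, b: 0 if a == b else 1, g=1):
      if len(A) == 0:
          return [('-', b) for b in B]
      if len(A) == 1:
          if len(B) == 0:
              return [(A[0], '-')]
          # either delete A[0] and insert all of B, or match it to one b_j
          costs = [w(A[0], B[j]) + (len(B) - 1) * g for j in range(len(B))]
          j = min(range(len(B)), key=lambda t: costs[t])
          if costs[j] >= (len(B) + 1) * g:
              return [(A[0], '-')] + [('-', b) for b in B]
          return ([('-', B[t]) for t in range(j)] + [(A[0], B[j])] +
                  [('-', B[t]) for t in range(j + 1, len(B))])
      h = len(A) // 2
      left = last_row(A[:h], B, w, g)                     # delta(A_h, B_j)
      right = last_row(A[h:][::-1], B[::-1], w, g)[::-1]  # delta(A^h, B^j)
      j = min(range(len(B) + 1), key=lambda t: left[t] + right[t])
      return hirschberg(A[:h], B[:j], w, g) + hirschberg(A[h:], B[j:], w, g)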


21.1.8 Memory-reduction with corner-cutting

The dynamic programming algorithm reaches the optimal alignment of two sequences by aligning longer and longer prefixes of the two sequences. The algorithm can be accelerated by excluding the bad alignments of prefixes that cannot yield an optimal alignment. The cells corresponding to such alignments lie around the right top and the left bottom corners of the dynamic programming table, and these corners can be cut off from the calculation, hence the name of the technique.

Most of the corner-cutting algorithms use a test value. This test value is an upper bound of the evolutionary distance between the two sequences. A corner-cutting algorithm using a test value can obtain the optimal alignment of two sequences only if the distance between the two sequences is indeed smaller than the test value; otherwise the algorithm stops before reaching the right bottom corner or gives a non-optimal alignment. Therefore these algorithms are useful when we search for sequences similar to a given one and are not interested in sequences that are farther from the query sequence than the test value.

We are going to introduce two algorithms. The Spouge algorithm [263], [262] is a generalisation of the Fickett [88] and the Ukkonen [285], [284] algorithms. The other algorithm was given by Gusfield, and it is an example of a corner-cutting algorithm that reaches the right bottom corner even if the distance between the two sequences is greater than the test value; in this case the calculated score is bigger than the test value, indicating that the obtained alignment is not necessarily optimal.

The Spouge algorithm calculates only those $d_{i,j}$ for which

Equation 21.34.

$d_{i,j} + |(n - i) - (m - j)| \cdot g \le t \;,$

where $t$ is the test value, $g$ is the gap penalty, and $n$ and $m$ are the lengths of the sequences. The key observation of the algorithm is that any path going from $d_{i,j}$ to $d_{n,m}$ increases the score of the alignment by at least $|(n - i) - (m - j)| \cdot g$. Therefore, if the distance between the sequences is smaller than $t$, the Spouge algorithm obtains the optimal alignment; otherwise it stops before reaching the right bottom corner.

This algorithm is a generalisation of the Fickett algorithm and the Ukkonen algorithm. Those algorithms also use a test value, but the inequality in the Fickett algorithm is:

Equation 21.35.

$d_{i,j} \le t \;,$

while the inequality in the Ukkonen algorithm is:

Equation 21.36.

$|i - j| \cdot g + |(n - i) - (m - j)| \cdot g \le t \;.$

Since in both cases the left hand side of the inequality is not greater than the left hand side of the Spouge inequality, the Fickett and the Ukkonen algorithms calculate at least as large a part of the dynamic programming table as the Spouge algorithm. Empirical results showed that the Spouge algorithm is significantly better [262]. The algorithm can be extended to affine and concave gap penalties, too.

The $k$-difference global alignment problem [123] asks the following question: is there an alignment of the sequences whose weight is smaller than $k$? The algorithm answering the question has $O(kn)$ running time, where $n$ is the length of the longer sequence. The algorithm is based on the observation that any path from $d_{0,0}$ to $d_{n,m}$ having a score at most $k$ cannot contain a cell $d_{i,j}$ for which $|i - j| \cdot g > k$. Therefore the algorithm calculates only those cells for which $|i - j| \cdot g \le k$ and disregards the neighbours of the border elements of this band. If there exists an alignment with a score smaller than or equal to $k$, then $d_{n,m} \le k$ and $d_{n,m}$ is indeed the distance of the two sequences. Otherwise $d_{n,m} > k$, and $d_{n,m}$ is not necessarily the score of the optimal alignment, since there might be an alignment that leaves the band defined by the inequality and still has a smaller score than the best alignment inside the band.
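
The band of the $k$-difference algorithm can be sketched in Python as follows; the unit gap penalty and 0/1 substitution weights are illustrative assumptions.

  # k-difference banded alignment: only the cells with |i-j|*g <= k
  # are calculated; returns the distance if it is at most k, else None.
  def k_difference(A, B, k, w=lambda a, b: 0 if a == b else 1, g=1):
      n, m = len(A), len(B)
      if abs(n - m) * g > k:
          return None              # the corner itself is outside the band
      band = k // g
      d = {(0, 0): 0}
      for i in range(n + 1):
          for j in range(max(0, i - band), min(m, i + band) + 1):
              if i == 0 and j == 0:
                  continue
              best = float("inf")
              if (i - 1, j) in d:                       # deletion
                  best = min(best, d[(i - 1, j)] + g)
              if (i, j - 1) in d:                       # insertion
                  best = min(best, d[(i, j - 1)] + g)
              if i > 0 and j > 0:                       # substitution/match
                  best = min(best, d[(i - 1, j - 1)] + w(A[i - 1], B[j - 1]))
              d[(i, j)] = best
      return d[(n, m)] if d[(n, m)] <= k else None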

The corner-cutting technique has been extended to multiple sequence alignments scored by the sum-of-pairs scoring scheme [46]. The sum-of-pairs score is:

Equation 21.37.

$\mathrm{SP} = \sum_{h=1}^{L} \sum_{1 \le i < j \le k} w(c_{i,h}, c_{j,h}) \;,$

where $w$ is the distance function on $\Sigma \cup \{-\}$, $k$ is the number of sequences, $L$ is the length of the multiple alignment, and $c_{i,h}$ is the character of the multiple alignment in the $i$th row and $h$th column. The $l$-suffix of a sequence is the suffix starting after position $l$. Let $\hat{\delta}_{i,j}(l_i, l_j)$ denote the distance of the optimal alignment of the $l_i$-suffix and the $l_j$-suffix of the $i$th and the $j$th sequences. The Carrillo and Lipman algorithm calculates only the positions for which

Equation 21.38.

$d_{l_1, l_2, \ldots, l_k} + \sum_{1 \le i < j \le k} \hat{\delta}_{i,j}(l_i, l_j) \le t \;,$

where $t$ is the test value. The goodness of the algorithm follows from the fact that the sum-of-pairs score of the optimal alignment of the not yet aligned suffixes cannot be smaller than the sum of the scores of the optimal pairwise alignments. This corner-cutting method can align at most six moderately long sequences [191].

Exercises

21.1-1 Show that the number of possible alignments of an $n$ long and an $m$ long sequence is the Delannoy number $\sum_{k=0}^{\min\{n,m\}} \binom{n}{k} \binom{m}{k} 2^{k}$.

21.1-2 Give a series of pairs of sequences and a scoring scheme such that the number of optimal alignments grows exponentially with the length of the sequences.

21.1-3 Give the Hirschberg algorithm for multiple alignments.

21.1-4 Give the Hirschberg algorithm for affine gap penalties.

21.1-5 Give the Smith-Waterman algorithm for affine gap penalties.

21.1-6 Give the Spouge algorithm for affine gap penalties.

21.1-7 Construct an example showing that the optimal multiple alignment of three sequences might contain a pairwise alignment that is only suboptimal.

21.2 Algorithms on trees

Algorithms introduced in this section work on rooted trees. The dynamic programming is based on the reduction to rooted subtrees. As we will see, besides obtaining optimal cases, we can calculate algebraic expressions in the same running time.

21.2.1 The small parsimony problem

The (weighted) parsimony principle is to describe the changes of biological sequences with the minimum number (minimum weight) of mutations. We will deal only with substitutions; namely, the input sequences have the same length, and the problem is to give the evolutionary relationships of the sequences using only substitutions and the parsimony principle. We can define the large and the small parsimony problem. For the large parsimony problem, we do not know the topology of the evolutionary tree showing the evolutionary relationships of the sequences, hence the problem is to find both the best topology and an evolutionary history on the tree. The solution is not only locally but globally optimal. It has been proved that the large parsimony problem is NP-complete [99].

The small parsimony problem is to find the most parsimonious evolutionary history on a given tree topology. Its solution is optimal only for the given topology; there is no guarantee that it is a global optimum over all topologies.

Each position of the sequences is scored independently, therefore it is enough to find a solution for the case where there is only one character at each leaf of the tree. In this case, the evolutionary history can be described by labelling the internal nodes with characters. If two characters at neighbouring vertices are the same, then no mutation happened on the corresponding edge, otherwise one mutation happened. The naive algorithm investigates all possible labelings and selects the most parsimonious solution. Obviously, it is too slow, since the number of possible labelings grows exponentially with the number of internal nodes of the tree.

The dynamic programming is based on the reduction to smaller subtrees [254]. Here the definition of subtrees is the following: there is a natural partial ordering on the nodes of the rooted tree such that the root is the greatest node and the leaves are minimal. A subtree is defined by a node; the subtree contains this node and all nodes that are smaller than the given node. The given node is the root of the subtree. We suppose that for any child $v$ of the node $u$ and any character $\sigma$ we know the minimum number of mutations that are needed on the subtree with root $v$, given that there is character $\sigma$ at node $v$. Let $m_v(\sigma)$ denote this number. Then

Equation 21.39.

$m_u(\sigma) = \sum_{v \in D(u)} \min_{\sigma' \in \Sigma}\{m_v(\sigma') + \delta(\sigma, \sigma')\} \;,$

where $D(u)$ is the set of children of $u$, $\Sigma$ is the alphabet, and $\delta(\sigma, \sigma')$ is $0$ if $\sigma = \sigma'$ and $1$ otherwise.

The minimum number of mutations on the entire tree is $\min_{\sigma} m_r(\sigma)$, where $r$ is the root of the tree. A most parsimonious labelling can be obtained by tracing back the tree from the root to the leaves, writing to each node the character that minimises Eqn. (21.39). To do this, we have to store $m_v(\sigma)$ for all $v$ and $\sigma$.

The running time of the algorithm is $\Theta(n |\Sigma|^2)$ for one character, where $n$ is the number of nodes of the tree, and $\Theta(n l |\Sigma|^2)$ for entire sequences, where $l$ is the length of the sequences.
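
A minimal Python sketch of Equation (21.39) for a single character per leaf follows; the nested-tuple tree encoding (a leaf is a character, an internal node is a tuple of its children) is an illustrative assumption.

  SIGMA = "ACGU"

  # Returns, for each character sigma, the minimum number of mutations
  # in the subtree, given that its root is labelled with sigma.
  def sankoff(tree):
      if isinstance(tree, str):                    # a leaf
          return {s: (0 if s == tree else float("inf")) for s in SIGMA}
      cost = {s: 0 for s in SIGMA}
      for child in tree:                           # sum over the children
          child_cost = sankoff(child)
          for s in SIGMA:                          # Eq. (21.39)
              cost[s] += min(child_cost[t] + (0 if s == t else 1)
                             for t in SIGMA)
      return cost

  root_cost = sankoff((("A", "C"), ("A", "G")))
  print(min(root_cost.values()))                   # prints 2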

21.2.2 The Felsenstein algorithm

The input of the Felsenstein algorithm [86] is a multiple alignment of DNA (or RNA or protein) sequences, an evolutionary tree topology with edge lengths, and a model that gives, for each pair of characters $a$ and $b$ and each time $t$, the probability that $a$ evolves to $b$ during time $t$. Let $f_{a,b}(t)$ denote this probability. The equilibrium probability distribution of the characters is denoted by $\pi$. The question is what the likelihood of the tree is, namely, what the probability of observing the sequences at the leaves is, given the evolutionary parameters consisting of the edge lengths and the parameters of the substitution model.

We assume that each position evolves independently, hence the probability of an evolutionary process is the product of the evolutionary probabilities for each position. Therefore it is enough to show how to calculate the likelihood for one sequence position. We show this for the example tree that can be seen in Figure 21.1. $x_i$ will denote the character at node $i$, and $v_i$ is the evolutionary time on the edge entering node $i$. Since we do not know the characters at the internal nodes, we must sum the probabilities over all possible configurations:

Equation 21.40. 


If we consider the four-character alphabet of DNA, the summation has a term for every configuration of the internal nodes; in case of $n$ species on a rooted binary tree this means $4^{n-1}$ terms, namely, the computational time grows exponentially with the number of sequences. However, if we move the expressions not depending on the summation index out of the summation, then we get the following product:

Equation 21.41. 


which can be calculated in significantly less time. Note that the parenthesisation in (21.41) corresponds to the topology of the tree. Each summation can be calculated independently, then we multiply the results. Hence the running time of calculating the likelihood for one position decreases to $\Theta(n |\Sigma|^2)$, and the running time of calculating the likelihood for the whole multiple alignment is $\Theta(n l |\Sigma|^2)$, where $l$ is the length of the alignment.

Figure 21.1. The tree on which we introduce the Felsenstein algorithm. Evolutionary times are denoted with $v$s on the edges of the tree.
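
A minimal Python sketch of the rearranged recursion follows. The Jukes-Cantor substitution model and the nested-tuple tree encoding (a leaf is a character, an internal node is (left, v_left, right, v_right)) are illustrative assumptions, not taken from the text.

  import math

  SIGMA = "ACGU"
  PI = {s: 0.25 for s in SIGMA}            # equilibrium distribution

  def f(a, b, t):
      # Jukes-Cantor probability that a evolves to b during time t
      e = math.exp(-4.0 * t / 3.0)
      return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

  def conditional(tree):
      # L[s]: probability of the leaves below, given character s on top
      if isinstance(tree, str):
          return {s: (1.0 if s == tree else 0.0) for s in SIGMA}
      left, v_left, right, v_right = tree
      Ll, Lr = conditional(left), conditional(right)
      return {s: sum(f(s, t, v_left) * Ll[t] for t in SIGMA) *
                 sum(f(s, t, v_right) * Lr[t] for t in SIGMA)
              for s in SIGMA}

  def likelihood(tree):                    # one alignment column
      L = conditional(tree)
      return sum(PI[s] * L[s] for s in SIGMA)

  print(likelihood((("A", 0.1, "C", 0.1), 0.2, "A", 0.3)))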

Exercises

21.2-1 Give an algorithm for the weighted small parsimony problem where we want to get minimum weight evolutionary labeling given a tree topology and a set of sequences associated to the leaves of the tree.

21.2-2 The gene content changes in species: a gene that can be found in the genome of one species might be missing from the genome of another. In the simplest model, an existing gene might be deleted from the genome and a missing gene might appear. Give the small parsimony algorithm for this gene content evolution model.

21.2-3 Give an algorithm that obtains the Maximum Likelihood labelling on a tree.

21.2-4 Rewrite the small parsimony problem in the form of (Equation 21.40) replacing sums with minimisations, and show that the Sankoff algorithm is based on the same rearrangement as the Felsenstein algorithm.

21.2-5 The Fitch algorithm [90] works in the following way: each node $u$ is associated with a set of characters, $F_u$. Each leaf is associated with the set containing its character, and each internal node $u$ gets the set $\bigcap_{v \in D(u)} F_v$ if this intersection is not empty, otherwise $\bigcup_{v \in D(u)} F_v$.

After reaching the root, we select an arbitrary character from $F_r$, where $r$ is the root of the tree; going down the tree, we choose the same character that we chose at the parent node if the set of the child node contains this character, otherwise an arbitrary character from the set of the child node. Show that we get a most parsimonious labelling. What is the running time of this algorithm?

21.2-6 Show that the Sankoff algorithm gives all possible most parsimonious labellings, while there are most parsimonious labellings that cannot be obtained with the Fitch algorithm.

21.3 Algorithms on stochastic grammars

Below we give algorithms on stochastic transformational grammars. Stochastic transformational grammars play a central role in modern bioinformatics. Two types of transformational grammars are widespread, the Hidden Markov Models (HMMs) are used for protein structure prediction and gene finding, while Stochastic Context Free Grammars (SCFGs) are used for RNA secondary structure prediction.

21.3.1 Hidden Markov Models

We give the formal definition of Hidden Markov Models (HMMs): let $Q$ denote a finite set of states. There are two distinguished states among them, the START and the END states. The states are divided into two parts: emitting and non-emitting states. We assume that only the START and the END states are non-emitting; we will show that this assumption is not too strict.

The M transition matrix contains the transition probabilities $m_{p,q}$: the probability that the Markov process jumps to state $q$ from state $p$. Emitting states emit characters from a finite alphabet $\Sigma$. The probability that state $q$ emits character $\sigma$ is denoted by $e_q(\sigma)$. The Markov process starts in the START state, ends in the END state, and walks according to the transition probabilities in M. Each emitting state visited emits a character, and the emitted characters form a sequence. The process is hidden since the observer observes only the sequence and does not observe the path that the Markov process walked on. There are three important questions for HMMs that can be answered using dynamic programming algorithms.

The first question is the following: given an HMM and a sequence, what is the most likely path that emits the given sequence? The Viterbi algorithm gives the answer to this question. Recall that $X_i$ is the $i$ long prefix of sequence $X$, and $x_i$ is the character in the $i$th position. The dynamic programming answering the first question is based on the fact that we can calculate $v_q(i)$, the probability of the most probable path emitting prefix $X_i$ and ending in state $q$, if we have already calculated $v_p(i-1)$ for all possible $p$, since

Equation 21.42.

$v_q(i) = e_q(x_i) \max_{p}\{v_p(i-1)\, m_{p,q}\} \;.$


The reason behind the above equation is that the probability of any path is the product of transition and emission probabilities. Among the products having the same last two terms (in our case, $m_{p,q}$ and $e_q(x_i)$), the maximal is the one for which the product of the other terms is maximal.

The initialisation of the dynamic programming is

Equation 21.43.

$v_{\mathrm{START}}(0) = 1 \;.$


Since the END state does not emit a character, the termination of the dynamic programming algorithm is

Equation 21.44.

$v_{\mathrm{END}}(L) = \max_{p}\{v_p(L)\, m_{p,\mathrm{END}}\} \;,$

where $L$ is the length of the sequence and $v_{\mathrm{END}}(L)$ is the probability of the most likely path emitting the given sequence. One of the most likely paths can be obtained with a trace-back.
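
A Python sketch of the Viterbi algorithm follows; the HMM encoding (dictionaries m and e over state names, with the distinguished names "START" and "END") is an illustrative assumption. In practice one computes with logarithms of the probabilities to avoid underflow.

  # states: the emitting states; m[p][q]: transition probabilities;
  # e[q][c]: emission probabilities. Returns the probability of the
  # most likely path emitting X, and one such path.
  def viterbi(states, m, e, X):
      v = [{q: m["START"].get(q, 0.0) * e[q].get(X[0], 0.0)
            for q in states}]                          # Eq. (21.43)
      back = [{}]
      for i in range(1, len(X)):
          v.append({}); back.append({})
          for q in states:                             # Eq. (21.42)
              p = max(states, key=lambda r: v[i - 1][r] * m[r].get(q, 0.0))
              v[i][q] = e[q].get(X[i], 0.0) * v[i - 1][p] * m[p].get(q, 0.0)
              back[i][q] = p
      # termination, Eq. (21.44): jump into the non-emitting END state
      q = max(states, key=lambda r: v[-1][r] * m[r].get("END", 0.0))
      prob = v[-1][q] * m[q].get("END", 0.0)
      path = [q]
      for i in range(len(X) - 1, 0, -1):               # trace-back
          q = back[i][q]; path.append(q)
      return prob, path[::-1]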

The second question is the following: given an HMM and a sequence, what is the probability that the HMM emits the sequence? This probability is the sum of the probabilities of the paths that emit the given sequence. Since the number of paths emitting a given sequence might grow exponentially with the length of the sequence, the naive algorithm that finds all the possible emitting paths and sums their probabilities would be too slow.

The dynamic programming algorithm that quickly calculates the probability in question is called the Forward algorithm. It is very similar to the Viterbi algorithm, just there is a sum in it instead of maximisation:

Equation 21.45.

$f_q(i) = e_q(x_i) \sum_{p} f_p(i-1)\, m_{p,q} \;.$


Since the END state does not emit, the termination is

Equation 21.46.

$f_{\mathrm{END}}(L) = \sum_{p} f_p(L)\, m_{p,\mathrm{END}} \;,$

where $f_{\mathrm{END}}(L)$ is the probability that the HMM emits sequence $X$.

The most likely path obtained by the Viterbi algorithm has more and less reliable parts. Therefore we are interested in the probability

$P(x_i \text{ is emitted by state } p \mid \text{the HMM emitted sequence } X) \;.$

This is the third question that we answer with a dynamic programming algorithm. The above mentioned probability is the sum of the probabilities of the paths that emit $x_i$ in state $p$, divided by the probability that the HMM emits sequence $X$. Since the number of such paths might grow exponentially, the naive algorithm that finds all the possible paths and sums their probabilities is too slow.

To answer the question, we first calculate for each suffix $X^i$ and state $p$ the probability $b_p(i)$ that the HMM emits suffix $X^i$ given that state $p$ emitted $x_i$. This can be calculated with the Backward algorithm, which is similar to the Forward algorithm, just it starts the recursion at the end of the sequence:

Equation 21.47.

$b_p(i) = \sum_{q} b_q(i+1)\, e_q(x_{i+1})\, m_{p,q} \;.$


Let $u_p(i)$ denote the probability

$P(\text{the HMM emitted sequence } X \text{ and } x_i \text{ was emitted by state } p) \;.$

Then

Equation 21.48.

$u_p(i) = f_p(i)\, b_p(i) \;,$

and from this

Equation 21.49.

$P(x_i \text{ is emitted by state } p \mid \text{the HMM emitted } X) = \frac{f_p(i)\, b_p(i)}{f_{\mathrm{END}}(L)} \;,$

which is the needed probability.
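
The three recursions can be sketched in Python as follows, using the same illustrative HMM encoding as the Viterbi sketch above.

  def forward(states, m, e, X):
      f = [{q: m["START"].get(q, 0.0) * e[q].get(X[0], 0.0)
            for q in states}]
      for i in range(1, len(X)):                       # Eq. (21.45)
          f.append({q: e[q].get(X[i], 0.0) *
                       sum(f[i - 1][p] * m[p].get(q, 0.0) for p in states)
                    for q in states})
      total = sum(f[-1][p] * m[p].get("END", 0.0) for p in states)
      return f, total                                  # total: Eq. (21.46)

  def backward(states, m, e, X):
      b = [dict() for _ in X]
      b[-1] = {p: m[p].get("END", 0.0) for p in states}
      for i in range(len(X) - 2, -1, -1):              # Eq. (21.47)
          b[i] = {p: sum(b[i + 1][q] * e[q].get(X[i + 1], 0.0) *
                         m[p].get(q, 0.0) for q in states)
                  for p in states}
      return b

  def posterior(states, m, e, X, i, p):
      # P(x_i is emitted by state p | the HMM emitted X), Eq. (21.49)
      f, total = forward(states, m, e, X)
      b = backward(states, m, e, X)
      return f[i][p] * b[i][p] / total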

21.3.2 Stochastic context-free grammars

It can be shown that every context-free grammar can be rewritten into Chomsky normal form. Each rule of a grammar in Chomsky normal form has the form $W_v \to W_y W_z$ or $W_v \to a$, where the $W$s are non-terminal symbols and $a$ is a terminal symbol. In a stochastic grammar, each derivation rule has a probability, a non-negative number, such that the probabilities of the derivation rules of each non-terminal sum up to $1$.

Given an SCFG and a sequence, we can ask questions analogous to the three questions we asked for HMMs: what is the probability of the most likely derivation, what is the probability of the derivation of the sequence, and what is the probability that a substring has been derived starting from a given non-terminal, given that the SCFG derived the sequence? The first question can be answered with the CYK (Cocke-Younger-Kasami) algorithm, which is the Viterbi-equivalent algorithm for SCFGs. The second question can be answered with the Inside algorithm, the Forward-equivalent for SCFGs. The third question can be answered with the combination of the Inside and Outside algorithms; as expected, the Outside algorithm is analogous to the Backward algorithm. Though the introduced algorithms are equivalent to the algorithms used for HMMs, their running time is significantly greater.

Let $t_v(y, z)$ denote the probability of the rule $W_v \to W_y W_z$, and let $e_v(a)$ denote the probability of the rule $W_v \to a$. The Inside algorithm calculates $\alpha(v, i, j)$ for all $v$ and $i \le j$: this is the probability that non-terminal $W_v$ derives the substring from $x_i$ till $x_j$. The initial conditions are:

Equation 21.50.

$\alpha(v, i, i) = e_v(x_i)$

for all $v$ and $i$. The main recursion is:

Equation 21.51.

$\alpha(v, i, j) = \sum_{y=1}^{M} \sum_{z=1}^{M} \sum_{k=i}^{j-1} \alpha(y, i, k)\, t_v(y, z)\, \alpha(z, k+1, j) \;,$

where $M$ is the number of non-terminals. The dynamic programming table is an upper triangular matrix for each non-terminal; the fill-in of the table starts with the main diagonal and is continued with the other diagonals. The derivation probability is $\alpha(1, 1, L)$, where $L$ is the length of the sequence and $W_1$ is the starting non-terminal. The running time of the algorithm is $\Theta(M^3 L^3)$, the memory requirement is $\Theta(M L^2)$.
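
A Python sketch of the Inside algorithm follows; the grammar encoding (t[v] maps the pair (y, z) to the probability of the rule W_v → W_y W_z, e[v] maps a terminal to the probability of W_v → a, and non-terminal 0 is the starting one) is an illustrative assumption.

  # Inside algorithm for an SCFG in Chomsky normal form.
  def inside(M, t, e, x):
      L = len(x)
      alpha = [[[0.0] * L for _ in range(L)] for _ in range(M)]
      for v in range(M):                   # main diagonal, Eq. (21.50)
          for i in range(L):
              alpha[v][i][i] = e[v].get(x[i], 0.0)
      for span in range(1, L):             # then diagonal by diagonal
          for i in range(L - span):
              j = i + span
              for v in range(M):           # Eq. (21.51)
                  alpha[v][i][j] = sum(
                      prob * alpha[y][i][k] * alpha[z][k + 1][j]
                      for (y, z), prob in t[v].items()
                      for k in range(i, j))
      return alpha[0][0][L - 1]            # the derivation probability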

The Outside algorithm calculates $\beta(v, i, j)$ for all $v$ and $i \le j$: this is the part of the derivation probability of deriving the sequence which is “outside” of the derivation of the substring from $x_i$ till $x_j$, starting the derivation from $W_v$. A more formal definition of $\beta(v, i, j)$ is that it is the sum of the derivation probabilities in which the substring from $x_i$ till $x_j$ is derived from $W_v$, divided by $\alpha(v, i, j)$. Here we define $0/0$ as $0$. The initial conditions are:

Equation 21.52.

$\beta(1, 1, L) = 1 \;,$

Equation 21.53.

$\beta(v, 1, L) = 0 \quad \text{for } v \ne 1 \;.$


The main recursion is:

Equation 21.54.

$\beta(v, i, j) = \sum_{y=1}^{M} \sum_{z=1}^{M} \left( \sum_{k=1}^{i-1} \alpha(z, k, i-1)\, t_y(z, v)\, \beta(y, k, j) + \sum_{k=j+1}^{L} \alpha(z, j+1, k)\, t_y(v, z)\, \beta(y, i, k) \right) \;.$

The reasoning for Eqn. (21.54) is the following. The non-terminal $W_v$ was derived from a non-terminal $W_y$ together with a non-terminal $W_z$, and their derivation order could be both $W_z W_v$ and $W_v W_z$. The outside probability of non-terminal $W_v$ is the product of the outside probability of $W_y$, the rule probability, and the inside probability of $W_z$, summed over the possible cases. As we can see, inside probabilities are needed to calculate outside probabilities; this is a significant difference from the Backward algorithm, which can be used without the Forward algorithm.

The CYK algorithm is very similar to the Inside algorithm, just there are maximisations instead of summations:

Equation 21.55.

$\delta(v, i, j) = \max_{y, z} \max_{i \le k < j} \delta(y, i, k)\, t_v(y, z)\, \delta(z, k+1, j) \;,$

with the initial conditions $\delta(v, i, i) = e_v(x_i)$. The probability of the most likely derivation is $\delta(1, 1, L)$. The most likely derivation can be obtained with a trace-back.

Finally, the probability that the substring from $x_i$ till $x_j$ has been derived by $W_v$, given that the SCFG derived the sequence, is:

Equation 21.56.

$\frac{\alpha(v, i, j)\, \beta(v, i, j)}{\alpha(1, 1, L)} \;.$


Exercises

21.3-1 In a regular grammar, each derivation rule is either of the form $W \to a W'$ or of the form $W \to a$. Show that each HMM can be rewritten as a stochastic regular grammar. On the other hand, show that there are stochastic regular grammars that cannot be described as HMMs.

21.3-2 Give dynamic programming algorithms that calculate for a stochastic regular grammar and a sequence

  • the most likely derivation,

  • the probability of derivation,

  • the probability that character $x_i$ is derived by non-terminal $W$.

21.3-3 An HMM can contain silent states that do not emit any character. Show that any HMM containing silent states other than the START and END states can be rewritten into an HMM that contains no silent states besides the START and END states and emits sequences with the same probabilities.

21.3-4 Pair Hidden Markov models are Markov models in which states can emit characters not only to one but two sequences. Some states emit only into one of the sequences, some states emit into both sequences. The observer sees only the sequences and does not see which state emits which characters and which characters are co-emitted. Give the Viterbi, Forward and Backward algorithms for pair-HMMs.

21.3-5 The Viterbi algorithm does not use the fact that probabilities are probabilities, namely, that they are non-negative and sum up to one. Moreover, the Viterbi algorithm still works if we replace the multiplications with additions (say, we calculate with the logarithms of the probabilities). Give a modified HMM, namely one in which the “probabilities” do not necessarily sum up to one and might also be negative, such that the Viterbi algorithm with additions is equivalent to the Gotoh algorithm.

21.3-6 Secondary structures of RNA sequences are sets of basepairings, in which for any two basepaired position pairs $(i, j)$ and $(k, l)$ with $i < k$, either $i < k < l < j$ or $i < j < k < l$ holds. The possible basepairings are A–U, U–A, C–G, G–C, G–U and U–G. Give a dynamic programming algorithm that finds the secondary structure containing the maximum number of basepairings for an RNA sequence. This problem was first solved by Nussinov et al. [221].

21.3-7 The derivation rules of the Knudsen-Hein grammar are [172], [171]

  S → LS | L  
  L → s | dFd  
  F → dFd | LS  

where $s$ has to be substituted with the possible characters of RNA sequences, and the $d$s in the same rule have to be replaced by the possible basepairings. Show that the probability of the derivation of a sequence as well as the most likely derivation can be obtained without rewriting the grammar into Chomsky normal form.

21.4 Comparing structures

In this section, we give dynamic programming algorithms for comparing structures. As we will see, aligning labelled rooted trees is a generalisation of sequence alignment. The recursions in the dynamic programming algorithm for comparing HMMs yield a linear equation system due to circular dependencies. However, we can still call it a dynamic programming algorithm.

21.4.1 Aligning labelled, rooted trees

Let $\Sigma$ be a finite alphabet, $\Sigma^- = \Sigma \cup \{-\}$, and $\Sigma_2 = \Sigma^- \times \Sigma^- \setminus \{(-,-)\}$. A labelling of a tree is a function that assigns a character of the alphabet to each node. If we delete a node from the tree, then the children of the node become children of the parental node. If we delete the root of the tree, then the tree becomes a forest. Let $F$ and $G$ be rooted trees labelled with characters from $\Sigma$. A tree $A$ labelled with characters from $\Sigma_2$ is an alignment of trees $F$ and $G$ if restricting the labelling of $A$ to the first (respectively, second) coordinates and deleting the nodes labelled with '−' yields tree $F$ (respectively, $G$). Let $s \colon \Sigma_2 \to \mathbb{R}$ be a similarity function. An optimal alignment of trees $F$ and $G$ is the tree $A^*$ labelled with characters from $\Sigma_2$ for which

Equation 21.57.

$\sum_{n \in A^*} s(\mathrm{label}(n))$

is maximal. This tree is denoted by $A^*(F, G)$. Note that a sequence can be represented with a unary tree, which has a single leaf. Therefore aligning trees is a generalisation of aligning sequences (with linear gap penalty).

Below we will deal only with trees in which each node has at most two children. The recursion in the dynamic programming algorithm goes on rooted subtrees. A rooted subtree of a tree contains a node $r$ of the tree and all nodes that are smaller than $r$. The subtree rooted at $r$ is denoted by $t(r)$.

A tree can be aligned to an empty tree in only one way. Two leaves labelled by $a$ and $b$ can be aligned in three different ways: the alignment might contain only one node, labelled with $(a, b)$, or it might contain two nodes, one of them labelled with $(a, -)$, the other with $(-, b)$. In the latter case, either of the two nodes can be the root, the other being the leaf.

Similarly, when we align a single leaf to a tree, then in the alignment either the single character of the leaf is labelled together with a character of the tree, or it is labelled together with '−' in an independent node. This new node can be placed in several ways on the tree, however, the score of each placement is the same.

After this initialisation, the dynamic programming algorithm aligns greater rooted subtrees using the alignments of smaller rooted subtrees. When aligning subtrees $t(u)$ and $t(v)$, where $u_1$ and $u_2$ are the children of $u$ and $v_1$ and $v_2$ are the children of $v$, we assume that we already know the scores of the optimal alignments of the subtrees rooted at $u$, $u_1$, $u_2$ and at $v$, $v_1$, $v_2$ in every combination of a subtree from the first tree with a subtree from the second. Should one of the nodes have only one child, the dynamic programming reduces the problem of aligning $t(u)$ and $t(v)$ to fewer subproblems. We assume that the algorithm also knows the score of the optimal alignment of $t(u)$ to the empty tree and the score of the optimal alignment of $t(v)$ to the empty tree. We have to consider a constant number of subproblems for obtaining the score of the optimal alignment of $t(u)$ and $t(v)$. If one of the trees is aligned to one of the children's subtrees of the other tree, then the other child and the root of the other tree are labelled together with '−'. If the character of $u$ is co-labelled with the character of $v$, then their children nodes are aligned together, as well. The last situation is the case when the roots are not aligned together in $A^*$, but one of the roots is the root of $A^*$ and the other root is its only child; the children might or might not be aligned together. This gives five possible cases altogether.

Since the number of rooted subtrees equals the number of nodes of the tree, the optimal alignment can be obtained in $\Theta(|F| \cdot |G|)$ time, where $|F|$ and $|G|$ denote the number of nodes in $F$ and $G$.

21.4.2 Co-emission probability of two HMMs

Let $M_1$ and $M_2$ be Hidden Markov Models. The co-emission probability of the two models is

Equation 21.58.

$C(M_1, M_2) = \sum_{s} P(s \mid M_1)\, P(s \mid M_2) \;,$

where the summation is over all possible sequences and $P(s \mid M)$ is the probability that model $M$ emitted sequence $s$. The probability that path $p$ emitted its sequence $s(p)$ is denoted by $P(p)$; a path from the START state till the state $q$ is denoted by $p_q$. Since state $q$ can be reached on several paths, this notation is not well-defined, however, this will not cause a problem later on. Since the co-emission probability is the sum of the products of the emission probabilities of paths,

Equation 21.59.

$C(M_1, M_2) = \sum_{p, q\,:\; s(p) = s(q)} P(p)\, P(q) \;,$


where $p$ runs over the complete paths of $M_1$ and $q$ over those of $M_2$. Let $A(q_1, q_2)$ denote the sum of the products $P(p_{q_1})\, P(p_{q_2})$ over the pairs of paths $p_{q_1}$ in $M_1$ and $p_{q_2}$ in $M_2$ that emit the same sequence. Let $p'$ denote the path that can be obtained by removing the last state from path $p$, and let $q^{-}$ be the state before $q$ in path $p_q$ (we define these similarly for the second model). Hence

Equation 21.60.

$C(M_1, M_2) = \sum_{q_1, q_2} A(q_1, q_2)\, m^{1}_{q_1, \mathrm{END}}\, m^{2}_{q_2, \mathrm{END}} \;,$

where $m^{1}_{q_1, \mathrm{END}}$ is the probability of jumping to END from $q_1$ in the first model (and similarly for $m^{2}$), and

Equation 21.61.

$A(q_1, q_2) = \sum_{p_{q_1}, p_{q_2}\,:\; s(p_{q_1}) = s(p_{q_2})} P(p_{q_1})\, P(p_{q_2}) \;.$

$A(q_1, q_2)$ can also be obtained with this equation:

Equation 21.62.

$A(q_1, q_2) = \left( \sum_{\sigma \in \Sigma} e_{q_1}(\sigma)\, e_{q_2}(\sigma) \right) \sum_{r_1, r_2} A(r_1, r_2)\, m^{1}_{r_1, q_1}\, m^{2}_{r_2, q_2} \;,$

where $e_q(\sigma)$ is the probability that $q$ emitted $\sigma$. Equation (21.62) defines a linear equation system for all pairs of emitting states $q_1$ and $q_2$. The initial conditions are:

Equation 21.63.

$A(\mathrm{START}_1, \mathrm{START}_2) = 1 \;,$

Equation 21.64.

$A(q_1, \mathrm{START}_2) = 0 \quad \text{for } q_1 \ne \mathrm{START}_1 \;,$

Equation 21.65.

$A(\mathrm{START}_1, q_2) = 0 \quad \text{for } q_2 \ne \mathrm{START}_2 \;.$

Unlike the case of traditional dynamic programming, we do not fill in a dynamic programming table, but solve a linear equation system defined by Equation (21.62). Hence, the co-emission probability can be calculated in $O((n_1 n_2)^3)$ time, the time needed to solve the equation system, where $n_1$ and $n_2$ are the numbers of emitting states of the models.
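
A sketch of the calculation with numpy follows; the HMM encoding ((k+2) × (k+2) transition matrices indexed so that START = 0 and END = k+1, and a list of emission dictionaries for the k emitting states) is an illustrative assumption.

  import numpy as np

  # Solves the linear equation system of Eq. (21.62) and evaluates
  # the co-emission probability of Eq. (21.60).
  def coemission(m1, e1, m2, e2):
      k1, k2 = len(e1), len(e2)
      N = (k1 + 1) * (k2 + 1)                  # pairs over {START} + emitting
      idx = lambda p, q: p * (k2 + 1) + q      # index 0 stands for START
      K = np.zeros((N, N)); rhs = np.zeros(N)
      rhs[idx(0, 0)] = 1.0                     # A(START, START) = 1, Eq. (21.63)
      for p in range(1, k1 + 1):
          for q in range(1, k2 + 1):
              E = sum(e1[p - 1][c] * e2[q - 1].get(c, 0.0)
                      for c in e1[p - 1])
              for r in range(k1 + 1):          # predecessors, including START
                  for s in range(k2 + 1):
                      K[idx(p, q), idx(r, s)] = E * m1[r][p] * m2[s][q]
      A = np.linalg.solve(np.eye(N) - K, rhs)  # A = K A + rhs, Eq. (21.62)
      end1, end2 = k1 + 1, k2 + 1
      return sum(A[idx(p, q)] * m1[p][end1] * m2[q][end2]
                 for p in range(1, k1 + 1) for q in range(1, k2 + 1))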

Exercises

21.4-1 Give a dynamic programming algorithm for the local similarities of two trees. This is the score of the most similar subtrees of the two trees. Here subtrees are any consecutive parts of the tree.

21.4-2 Ordered trees are rooted trees in which the children of each node are ordered. The ordered alignment of two ordered trees preserves the orderings of the aligned trees. Give an algorithm that obtains the optimal ordered alignment of two ordered trees and whose running time is polynomial in both the maximum number of children and the number of nodes.

21.4-3 Consider the infinite Euclidean space whose coordinates are indexed by the possible sequences. Each Hidden Markov model is a vector in this space: the coordinates of the vector are the emission probabilities of the corresponding sequences. Obtain the angle between two HMMs in this space.

21.4-4 Give an algorithm that calculates the generating function of the length of the emitted sequences of an HMM, that is,

$$\sum_{L=0}^{\infty} p_L\, z^L\,,$$

where $p_L$ is the probability that the Markov model emitted a sequence of length $L$.

21.4-5 Give an algorithm that calculates the generating function of the lengths of the emitted sequences of a pair-HMM, that is,

$$\sum_{L_1=0}^{\infty} \sum_{L_2=0}^{\infty} p_{L_1,L_2}\, x^{L_1} y^{L_2}\,,$$

where $p_{L_1,L_2}$ is the probability that the first emitted sequence has length $L_1$ and the second emitted sequence has length $L_2$.

21.5 Distance based algorithms for constructing evolutionary trees

In this section, we shall introduce algorithms whose input is a set of objects and distances between the objects. The distances might be obtained from pairwise alignments of sequences; however, the introduced algorithms work for any kind of distances. The leaves of the constructed tree are the given objects, and the topology and the lengths of the edges are obtained from the distances. Every weighted tree defines a metric on the leaves of the tree: we define the distance between two leaves as the sum of the weights of the edges on the path connecting them. The goodness of an algorithm can be measured as the deviation between the input distances and the distances obtained on the tree.

We define two special metrics, the ultrametric and the additive metric. The clustering algorithms generate a tree that is always ultrametric. We shall prove that the clustering algorithms give back the ultrametric if the input distances follow an ultrametric, namely, the tree obtained by a clustering algorithm then defines exactly the input distances.

Similarly, the Neighbour Joining algorithm creates a tree that represents an additive metric, and whenever the input distances follow an additive metric, the generated tree gives back the input distances.

For both proofs, we need the following lemma:

Lemma 21.3 For any metric, there is at most one tree that represents it and has positive weights.

Proof. The proof is based on induction; the induction starts with three points. For three points, there is exactly one possible topology, a star-tree. Let the lengths of the edges connecting points $i$, $j$ and $k$ with the internal node of the star-tree be $x_i$, $x_j$ and $x_k$, respectively. The lengths of the edges are defined by the

Equation 21.66.

$$x_i + x_j = d_{i,j}\,,$$

Equation 21.67.

$$x_i + x_k = d_{i,k}\,,$$

Equation 21.68.

$$x_j + x_k = d_{j,k}$$
equation system, which has a unique solution, since the determinant

Equation 21.69.

$$\begin{vmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 1 \end{vmatrix}$$
is not $0$ (its value is $-2$).
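Solving the system explicitly (our illustration, not part of the original text):

    def star_edge_lengths(dij, dik, djk):
        # the unique solution of Equations 21.66-21.68
        xi = (dij + dik - djk) / 2
        xj = (dij + djk - dik) / 2
        xk = (dik + djk - dij) / 2
        return xi, xj, xk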

For more than three points, let us assume that there are two trees representing the same metric. We find a cherry motif on the first tree, with leaves $i$ and $j$. A cherry motif is a motif with two leaves whose connecting path has exactly one internal node. Every tree contains at least two cherry motifs: a path on the tree that has the maximal number of internal nodes has a cherry motif at both of its ends.

If there is only one internal node on the path connecting $i$ and $j$ on the other tree, too, then the lengths of the corresponding edges in the two cherry motifs must be the same, since for any additional point $k$, we must get the same subtree. We define a new metric by deleting points $i$ and $j$ and adding a new point $u$. The distance between $u$ and any point $k$ is $d_{i,k} - x_i$, where $x_i$ is the length of the edge connecting $i$ with the internal point of the cherry motif. If we delete nodes $i$ and $j$, we get from both trees a tree that represents this new metric, and these trees are the same according to the induction hypothesis; hence the two original trees are also the same.

If the path between $i$ and $j$ contains more than one internal node on the other tree, then we find a contradiction. In this case there is an internal point $c$ on this path for which $d_{i,c} \neq x_i$. Consider a leaf $k$ whose path to $i$ reaches the path between $i$ and $j$ at node $c$. From the first tree

Equation 21.70.

$$d_{i,k} - d_{j,k} = x_i - x_j\,,$$
while on the second tree

Equation 21.71.

$$d_{i,k} - d_{j,k} = d_{i,c} - d_{j,c} \neq x_i - x_j\,,$$
which contradicts that the two trees represent the same metric (indeed, $d_{i,c} + d_{c,j} = d_{i,j} = x_i + x_j$, so $d_{i,c} \neq x_i$ implies $d_{i,c} - d_{j,c} \neq x_i - x_j$).

21.5.1 Clustering algorithms

Definition 21.4 A metric is ultrametric if for any three points $i$, $j$ and $k$

Equation 21.72.

$$d_{i,j} \le \max\{ d_{i,k},\, d_{j,k} \}\,.$$
It is easy to prove that in any ultrametric, the three distances between any three points are either all equal, or two of them are equal and the third one is smaller.
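This characterisation is easy to check by machine; the following Python sketch (our illustration, not part of the original text) tests Equation 21.72 on every triple by verifying that the two largest of the three distances coincide.

    from itertools import combinations

    def is_ultrametric(d, points, tol=1e-9):
        # d[i][j]: pairwise distances of the points
        for i, j, k in combinations(points, 3):
            a, b, c = sorted((d[i][j], d[i][k], d[j][k]))
            if c - b > tol:          # the two largest distances must be equal
                return False
        return True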

Theorem 21.5 If the metric on a finite set of points is ultrametric, then there is exactly one tree that represents it. Furthermore, this tree can be rooted such that the distance between a point and the root is the same for all points.

Proof. Based on Lemma 21.3, it is enough to construct one ultrametric tree for any ultrametric. We represent ultrametric trees as dendrograms. In this representation, the horizontal edges have length zero. For an example dendrogram, see Figure 21.2.

Figure 21.2.  A dendrogram.



The proof is based on induction on the number of leaves. Obviously, we can construct a dendrogram for two leaves. After constructing the dendrogram for $n$ leaves, we add leaf $n+1$ to the dendrogram in the following way: we find the leaf $i$ in the dendrogram for which $d_{i,n+1}$ is minimal, then we walk up on the dendrogram from leaf $i$ till we reach the height $d_{i,n+1}/2$ (we might go above the root). Leaf $n+1$ is connected to the dendrogram at this point, see Figure 21.3.

Figure 21.3.  Connecting leaf $n+1$ to the dendrogram.


This dendrogram properly represents the distances between leaf $n+1$ and any other leaf. Indeed, if leaf $k$ is below the new internal node that connects leaf $n+1$, then $d_{i,k} \le d_{i,n+1}$, and from the ultrametric property and the minimality of $d_{i,n+1}$ it follows that $d_{k,n+1} = d_{i,n+1}$. On the other hand, if leaf $k$ is not below the new internal point joining leaf $n+1$, then $d_{i,k} > d_{i,n+1}$, and from the ultrametric property it follows that $d_{k,n+1} = d_{i,k}$.

It is easy to see that the construction in the proof needs $O(n^2)$ running time, where $n$ is the number of objects. We now describe another algorithm: it finds the pair of objects $i$ and $j$ for which $d_{i,j}$ is minimal. From the ultrametric property, $d_{i,k} = d_{j,k}$ for any other object $k$; hence we can replace the pair of objects $i$ and $j$ with a new object, and the distance between this new object and any other object $k$ is well defined, it is $d_{i,k} = d_{j,k}$. The objects $i$ and $j$ are connected at height $d_{i,j}/2$, and we treat this sub-dendrogram as a single object. We continue the iteration till we have a single object. This algorithm is slower than the previous one; however, it is the basis of the clustering algorithms. The clustering algorithms create a dendrogram even if the input distances do not follow an ultrametric. On the other hand, if the input distances follow an ultrametric, then most of the clustering algorithms give back this ultrametric.

As we mentioned, the clustering algorithms find the object pair $i$ and $j$ for which $d_{i,j}$ is minimal. The differences come from how the algorithms define the distance between the new object $u$ replacing the pair of objects $i$ and $j$ and any other object $k$. The introduced clustering methods define $d_{u,k}$ in the following way (a code sketch of the common loop follows below):

  • Single link: $d_{u,k} = \min\{ d_{i,k},\, d_{j,k} \}$.

  • Complete link: $d_{u,k} = \max\{ d_{i,k},\, d_{j,k} \}$.

  • UPGMA: the new distance is the arithmetic mean of the distances between the elements of $u$ and $k$: $d_{u,k} = (n_i d_{i,k} + n_j d_{j,k})/(n_i + n_j)$, where $n_i$ and $n_j$ are the numbers of elements in $i$ and $j$.

  • Single average: $d_{u,k} = (d_{i,k} + d_{j,k})/2$.

  • Centroid: This method is used when the objects can be embedded into a Euclidean space. Then the distance between two objects can be defined as the distance between the centroids of the elements of the objects. It is not necessary to use the coordinates of the Euclidean space, since the distance in question is the distance between point $k$ and the point dividing the edge between $i$ and $j$ in $n_j : n_i$ proportion in the triangle obtained by points $i$, $j$ and $k$ (see Figure 21.4). This length can be calculated using only $d_{i,j}$, $d_{i,k}$ and $d_{j,k}$. Hence the method can be used even if the objects cannot be embedded into a Euclidean space.

  • Median: The centroid of $u$ is defined as the midpoint of the segment connecting the centroids of $i$ and $j$. This method is related to the Centroid method as the single average is related to the UPGMA method. It is again not necessary to know the coordinates of the elements, hence this method can be applied to distances that cannot be embedded into a Euclidean space.

It is easy to show that the first four methods give back the dendrogram of the input distances whenever the input distances follow an ultrametric, since in this case $d_{i,k} = d_{j,k} = d_{u,k}$. However, the Centroid and Median methods do not give back the corresponding dendrogram for ultrametric input distances, since $d_{u,k}$ will be smaller than $d_{i,k}$ (which equals $d_{j,k}$).
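The common loop of these methods is small enough to spell out; below is a Python sketch (our illustration, with an assumed dict-of-dicts input format) in which the four rules above are simply different update functions.

    from itertools import combinations

    def cluster(d, update):
        # d: dict of dicts of pairwise distances; returns the merge history
        # (object1, object2, distance), which determines the dendrogram.
        d = {i: dict(row) for i, row in d.items()}
        size = {i: 1 for i in d}
        merges = []
        while len(d) > 1:
            i, j = min(combinations(d, 2), key=lambda p: d[p[0]][p[1]])
            merges.append((i, j, d[i][j]))
            u = (i, j)                   # the new object replacing i and j
            si, sj = size.pop(i), size.pop(j)
            du = {k: update(d[i][k], d[j][k], si, sj)
                  for k in d if k not in (i, j)}
            del d[i], d[j]
            for k in d:
                del d[k][i], d[k][j]
                d[k][u] = du[k]
            d[u], size[u] = du, si + sj
        return merges

    single_link    = lambda dik, djk, ni, nj: min(dik, djk)
    complete_link  = lambda dik, djk, ni, nj: max(dik, djk)
    upgma          = lambda dik, djk, ni, nj: (ni * dik + nj * djk) / (ni + nj)
    single_average = lambda dik, djk, ni, nj: (dik + djk) / 2

For ultrametric input, all four update rules return $d_{i,k} = d_{j,k}$ unchanged, which is exactly the observation made above.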

The central problem of the clustering algorithms is that they always give a dendrogram, which might not be biologically correct. Indeed, the evolutionary tree of biological sequences is a dendrogram only if the molecular clock hypothesis holds. The molecular clock hypothesis says that the sequences evolve at the same rate on each branch of the tree, namely, they collect the same number of mutations in a given time span. However, this is usually not true. Therefore biologists want an algorithm that gives back an ultrametric tree only if the input distances follow an ultrametric. The most popular such method is the Neighbour-Joining algorithm.

Figure 21.4.  Calculating $d_{u,k}$ according to the Centroid method.

21.5.2 Neighbour joining

Definition 21.6 A metric is called additive or four-point metric if for any four points $i$, $j$, $k$ and $l$

Equation 21.73.

$$d_{i,j} + d_{k,l} \le \max\{ d_{i,k} + d_{j,l},\; d_{i,l} + d_{j,k} \}\,.$$
Theorem 21.7 If a metric is additive on a finite set of objects, then there is exactly one tree that represents it.

Proof. Due to Lemma 21.3, there is at most one such tree, therefore it is enough to construct it. First we give the construction, then we prove its correctness.

For three objects we can construct a tree according to (Equation 21.66)–(Equation 21.68). Assume that we have constructed the tree for $n$ objects, and we want to add leaf $n+1$ to the tree. First we find the topology, then we give the length of the new edge. For obtaining the new topology, we start with any leaf $i$, and let $c$ denote the neighbour of leaf $i$. There are at least two other edges going out from $c$; we find a leaf on each of the paths starting with two of these outgoing edges, and let $j$ and $k$ denote these leaves, see Figure 21.5.

Figure 21.5.  Connecting leaf $n+1$ for constructing an additive tree.


The leaf $n+1$ is connected to the edge between $i$ and $c$ if

Equation 21.74.

$$d_{i,n+1} + d_{j,k} < d_{i,j} + d_{k,n+1}\,.$$
Using similar inequalities, we can decide whether leaf $n+1$ is beyond $c$ in the direction of $j$ or in the direction of $k$. If the degree of $c$ is greater than $3$, then we find a leaf on each of the other paths going out from $c$, and we do the same investigation for these leaves, too. From the additive property, it follows that the inequality can hold for at most one case. If it holds for (Equation 21.74), then we connect leaf $n+1$ to the edge connecting $i$ and $c$. If the inequality holds for another case, then we take the maximal subtree of the tree that contains $c$ as a leaf and also contains the leaf for which the inequality holds; renaming $c$ to $i$, we continue searching for the connection place of leaf $n+1$ in this subtree. If we get equality for all outgoing edges of $c$, then we connect leaf $n+1$ to $c$.

After finding the topology, we obtain the length of the new edge. If leaf $n+1$ is connected to the edge between $i$ and $c$, let $m$ denote the new internal point, see Figure 21.6/b.

Figure 21.6.  Some tree topologies for proving Theorem 21.7.



We treat $i$, $j$ and $n+1$ as three points whose star-tree has internal node $m$; the distances $d_{i,m}$, $d_{j,m}$ and $d_{n+1,m}$ can then be calculated using (Equation 21.66)–(Equation 21.68). If the leaf $n+1$ is connected to node $c$, then $d_{n+1,c} = d_{i,n+1} - d_{i,c}$.

Now we prove the correctness of the construction. First we show that the place of the new internal node is well-defined, namely, that every node that is not in the new subtree containing leaves $n+1$ and $i$ defines the same connecting point. If the new subtree contains $j$, then for any such node the inequality that was used to find the place of leaf $n+1$ will obviously hold (see Figure 21.6/a). Due to the additive metric property and the place of leaf $n+1$,

Equation 21.75. 


Using the previous inequalities, it follows that

Equation 21.76. 


Similarly, for all leaves that are not separated from $j$ by $c$, it holds that

Equation 21.77. 


This is due to the additive metric property and the inequality

Equation 21.78. 


this latter inequality comes from the following inequalities:

Equation 21.79. 


Equation 21.80. 


If the degree of $c$ is greater than $3$, then similar inequalities hold.

Due to the way of calculating the new edge lengths, $d_{i,n+1}$ is represented properly on the new tree; hence the distance from leaf $n+1$ is represented properly for every node that is separated from leaf $n+1$ by $m$. Note that $i$ might be an earlier $c$.

If leaf $n+1$ is connected to the edge between $i$ and $c$ (Figure 21.6/b), then due to the definition of the new edge lengths, $d_{i,n+1}$ is represented properly. From the equation

Equation 21.81. 


it follows that

Equation 21.82. 


hence $d_{j,n+1}$ is represented properly. It can be shown similarly that for all points that are separated from $i$ by $m$, the distance from leaf $n+1$ is represented properly on the tree.

If leaf $n+1$ is connected to node $c$ (Figure 21.6), then from the equations

Equation 21.83. 


it follows that both $d_{i,n+1}$ and $d_{j,n+1}$ are represented properly on the new tree, and with similar reasoning it is easy to show that for all nodes separated from $n+1$ by $c$, the distance from leaf $n+1$ is represented properly on the tree.

Hence we can construct a tree containing leaf $n+1$ from the tree containing the first $n$ leaves, thus proving Theorem 21.7.

It is easy to show that the above algorithm constructing the tree representing an additive metric takes $O(n^2)$ running time. However, it works only if the input distances follow an additive metric; otherwise inequality (Equation 21.74) might hold for several cases, and hence we cannot decide where to join leaf $n+1$. We shall introduce an algorithm that has $O(n^3)$ running time and gives back the additive tree whenever the input distances follow an additive metric; moreover, it generates an additive tree that approximates the input distances if they do not follow an additive metric.

The Neighbour-Joining algorithm works in the following way. Given a set of $n$ points and a distance function $d$ on the points, first we calculate for each point $i$ the sum of its distances from the other points:

Equation 21.84.

$$s_i = \sum_{k \neq i} d_{i,k}\,.$$
Then we find the pair of points $i$ and $j$ for which

Equation 21.85.

$$(n-2)\, d_{i,j} - s_i - s_j$$
is minimal. The lengths of the edges from points $i$ and $j$ to the new point $u$ are

Equation 21.86.

$$d_{i,u} = \frac{d_{i,j}}{2} + \frac{s_i - s_j}{2(n-2)}$$
and

Equation 21.87.

$$d_{j,u} = \frac{d_{i,j}}{2} + \frac{s_j - s_i}{2(n-2)}\,.$$
Then we recalculate the distances: we drop points $i$ and $j$, and add point $u$. The distance between $u$ and any other point $k$ is defined as

Equation 21.88.

$$d_{u,k} = \frac{d_{i,k} + d_{j,k} - d_{i,j}}{2}\,.$$

Theorem 21.8 If $d$ follows an additive metric, then the Neighbour-Joining algorithm generates a tree that gives back $d$.

Proof. From Theorem 21.7, there is exactly one tree that represents the distances. It is enough to prove that the Neighbour-Joining algorithm always picks a cherry motif of this tree, since a straightforward calculation shows that in this case the calculated edge lengths are proper.

First we prove that if $i$ and $j$ follow a cherry motif, then expression (Equation 21.85) is smaller for the pair $i$, $j$ than for any pair $i$, $k$ or $j$, $k$ with $k \neq i, j$. Indeed, rearranging the difference of the two expressions, we have to prove that

Equation 21.89.

$$\sum_{l \neq i,j,k} \bigl( d_{i,j} + d_{k,l} - d_{i,k} - d_{j,l} \bigr) < 0\,.$$
Figure 21.7.  The configuration of nodes $i$, $j$, $k$ and $l$ if $i$ and $j$ follow a cherry motif.


For every $l \neq i, j, k$, we get that

Equation 21.90.

$$d_{i,j} + d_{k,l} < d_{i,k} + d_{j,l}$$
(see also Figure 21.7); the cases $l = j$ and $l = k$ cancelled each other during the rearrangement, hence we proved that inequality (Equation 21.89) holds.

Now we prove Theorem 21.8 in an indirect way. Suppose that $i$ and $j$ do not follow a cherry motif; however, expression (Equation 21.85) is minimal for them. From the previous lemma, neither $i$ nor $j$ is in a cherry motif with any other leaf. We find a cherry motif with leaves $k$ and $l$ and internal node $c$. Let $e$ denote the last common node of the paths going from $i$ to $c$ and from $j$ to $c$. Since the expression is minimal for $i$ and $j$,

Equation 21.91.

$$(n-2)\, d_{i,j} - s_i - s_j \le (n-2)\, d_{k,l} - s_k - s_l\,.$$
Rearranging this we get that

Equation 21.92.

$$\sum_{m} \bigl( d_{k,l} + d_{i,m} + d_{j,m} \bigr) \;\ge\; \sum_{m} \bigl( d_{i,j} + d_{k,m} + d_{l,m} \bigr)\,,$$
and the cases $m = i$, $m = j$, $m = k$ and $m = l$ inside the sums cancel each other. For any other $m$, the contribution to the left hand side minus the right hand side is

Equation 21.93.

$$d_{k,l} - d_{i,j} + d_{i,m} + d_{j,m} - d_{k,m} - d_{l,m}\,.$$
If $m$ joins the tree via the path connecting $i$ and $j$, then expression (Equation 21.93) is always negative, see also Figure 21.8. Let these cases be called class I cases.

Figure 21.8.  The possible places for node $m$ on the tree.


If $m$ joins the tree via the path between $e$ and $c$, then expression (Equation 21.93) might be positive. Let these cases be called class II cases. To avoid a contradiction, the sum of the absolute values coming from the class I cases must be less than the sum coming from the class II cases.

There is another node $e'$ on the path connecting $i$ and $j$, and we can find a cherry motif beyond node $e'$, with leaves $k'$ and $l'$ and internal node $c'$. Here again the class II cases have to outweigh the class I cases, but this contradicts the situation at the first cherry motif. Hence $i$ and $j$ must form a cherry motif, and we proved Theorem 21.8.
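As an illustration (not part of the original text), here is a direct Python test of the four-point condition (Equation 21.73), followed by a compact transcription of (Equation 21.84)–(Equation 21.88); the internal node names u0, u2, ... are invented for the output.

    import numpy as np
    from itertools import combinations

    def is_additive(d, points, tol=1e-9):
        # Equation 21.73 checked on every quadruple: the two largest of
        # the three pairings of the four points must coincide
        for i, j, k, l in combinations(points, 4):
            s = sorted((d[i][j] + d[k][l], d[i][k] + d[j][l], d[i][l] + d[j][k]))
            if s[2] - s[1] > tol:
                return False
        return True

    def neighbour_joining(D, names):
        # D: symmetric numpy array of pairwise distances, names: leaf labels;
        # returns the edges (node, internal node, edge length) of the tree.
        D = D.astype(float).copy()
        nodes = list(names)
        edges = []
        while len(nodes) > 2:
            n = len(nodes)
            s = D.sum(axis=1)                            # Equation 21.84
            Q = (n - 2) * D - s[:, None] - s[None, :]    # Equation 21.85
            np.fill_diagonal(Q, np.inf)
            i, j = divmod(int(np.argmin(Q)), n)
            u = f"u{len(edges)}"                         # invented node name
            ei = D[i, j] / 2 + (s[i] - s[j]) / (2 * (n - 2))   # Equation 21.86
            ej = D[i, j] - ei                                  # Equation 21.87
            edges += [(nodes[i], u, ei), (nodes[j], u, ej)]
            du = (D[i] + D[j] - D[i, j]) / 2             # Equation 21.88
            keep = [k for k in range(n) if k not in (i, j)]
            D = np.vstack([np.hstack([D[np.ix_(keep, keep)], du[keep, None]]),
                           np.hstack([du[keep], [0.0]])])
            nodes = [nodes[k] for k in keep] + [u]
        edges.append((nodes[0], nodes[1], D[0, 1]))
        return edges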

Exercises

21.5-1 Show that in an ultrametric, the three distances between any three points are either all equal, or two of them are equal and the third one is smaller. Prove the similar claim for the three sums of distances coming from four points in an additive metric.

21.5-2 Show that an ultrametric is always an additive metric.

21.5-3 Give an example for a metric that is not additive.

21.5-4 Is it true that every additive metric is a Euclidean metric?

21.5-5 Give the formula that calculates $d_{u,k}$ from $d_{i,j}$, $d_{i,k}$ and $d_{j,k}$ for the Centroid method.

21.5-6 Give algorithms that decide in $O(n^2)$ time whether or not a metric is

  • additive

  • ultrametric

($n$ is the number of points.)

21.6 Miscellaneous topics

In this section, we cover topics that are usually not mentioned in bioinformatics books. We only mention the main results in a nutshell and do not prove theorems.

21.6.1 Genome rearrangement

The genome of an organism consists of several genes. For each gene, only one strand of the double-stranded DNA contains meaningful information; the other strand is the reverse complement. Since the DNA is chemically oriented, we can talk about the direction of a gene. If each gene has exactly one copy in the genome, then we can describe the order and directions of the genes as a signed permutation, where the signs give the directions of the genes.

Given two genomes with the same gene content, each represented as a signed permutation, the problem is to give a minimal series of mutations that transforms one genome into the other. We consider three types of mutations:

  • Reversal. A reversal acts on a consecutive part of the signed permutation. It reverses the order of the genes in the given part, as well as the signs of the genes.

  • Transposition. A transposition swaps two consecutive blocks of genes.

  • Reverted transposition. It swaps two consecutive blocks, and one of the blocks is also reverted. As for reversals, the signs in the reverted block change.

If we assume that only reversals happened, then we can give an $O(n^2)$ running time algorithm that obtains a shortest series of mutations transforming one genome into the other, where $n$ is the number of genes.

If we consider the other types of mutations, then the complexity of the problems is unknown. For transpositions, the best known result is a $1.375$-approximation [79]; if we consider all possible types of mutations, then the best known result is a constant-factor approximation [121]. For a wide range of biologically meaningful weights, the weighted sorting problem for all types of mutations has a $1.5$-approximation [20].

If we do not know the signs, then the problem is proved to be NP-complete [45]. Similarly, the optimal reversal median problem, even for three genomes and signed permutations, is NP-complete [44]. The optimal reversal median is a genome that minimises the sum of distances from a set of genomes.

Below we describe the Hannenhalli-Pevzner theorem for the reversal distance of two genomes. Instead of transforming permutation $\pi_1$ into $\pi_2$, we transform $\pi_2^{-1}\pi_1$ into the identical permutation. Based on elementary group theory, it is easy to show that the two problems are equivalent. We assume that we have already calculated $\pi_2^{-1}\pi_1$, and we will denote it simply by $\pi$.

We transform the $n$ long signed permutation into a $2n$ long unsigned permutation by replacing $+i$ with $2i-1,\ 2i$ and $-i$ with $2i,\ 2i-1$. Additionally, we frame the unsigned permutation between $0$ and $2n+1$. The vertexes of the so-called graph of desire and reality are the numbers of the unsigned permutation together with $0$ and $2n+1$. Starting with $0$, we connect every other pair of neighbouring numbers in the permutation; these are the reality edges. Also starting with $0$, we connect $2i$ with $2i+1$ with an arc; these are the desire edges. An example graph can be seen in Figure 21.9.

Figure 21.9.  Representation of the $-1,\,+2,\,+5,\,+3,\,+4$ signed permutation with an unsigned permutation, and its graph of desire and reality.


Since each vertex in the graph of desire and reality has a degree of two, the graph can be unequivocally decomposed into cycles. We call a cycle a directed cycle if, walking along the cycle, we traverse at least one reality edge from left to right and at least one reality edge from right to left. Other cycles are unoriented cycles.
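The transformation and the cycle decomposition are easy to program; the following Python sketch (our illustration, not part of the original text) computes the framed unsigned permutation and counts the cycles by alternately following reality and desire edges.

    def unsigned(perm):
        # replace +x by 2x-1, 2x and -x by 2x, 2x-1; frame by 0 and 2n+1
        out = [0]
        for x in perm:
            out += [2 * x - 1, 2 * x] if x > 0 else [-2 * x, -2 * x - 1]
        return out + [2 * len(perm) + 1]

    def count_cycles(perm):
        u = unsigned(perm)
        pos = {v: p for p, v in enumerate(u)}
        seen, cycles = set(), 0
        for start in u:
            if start in seen:
                continue
            cycles += 1
            v = start
            while v not in seen:
                seen.add(v)
                p = pos[v]
                w = u[p + 1] if p % 2 == 0 else u[p - 1]  # follow the reality edge
                seen.add(w)
                v = w + 1 if w % 2 == 0 else w - 1        # follow the desire edge
        return cycles

    # For the permutation of Figure 21.9: count_cycles([-1, 2, 5, 3, 4]) == 3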

The span of a desire edge is the interval between its left and right vertexes. Two cycles overlap if there are two desire edges in the two cycles whose spans intersect. The vertexes of the overlap graph of a signed permutation are the cycles of its graph of desire and reality; two nodes are connected if the two cycles overlap. The overlap graph can be decomposed into components. A component is directed if it contains a directed cycle, otherwise it is unoriented. The span of a component is the interval between its leftmost and rightmost nodes in the graph of desire and reality. An unoriented component is a hurdle if its span does not contain any other unoriented component or if it contains all other unoriented components. The other unoriented components are called protected non-hurdles.

A super-hurdle is a hurdle for which it is true that if we delete this hurdle, then one of the protected non-hurdles becomes a hurdle. A fortress is a permutation in which all hurdles are super-hurdles and their number is odd.

The Hannenhalli-Pevzner theorem is the following:

Theorem 21.9 Given a signed permutation $\pi$, the minimum number of reversals sorting the permutation into the identical permutation is

Equation 21.94.

$$d(\pi) = n + 1 - c(\pi) + h(\pi) + f(\pi)\,,$$
where $n$ is the length of the permutation, $c(\pi)$ is the number of cycles, $h(\pi)$ is the number of hurdles, and $f(\pi) = 1$ if the permutation is a fortress, otherwise $f(\pi) = 0$.
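As a worked example (our own calculation, not part of the original text), take the permutation $-1,\,+2,\,+5,\,+3,\,+4$ of Figure 21.9. Its framed unsigned permutation is $0,2,1,3,4,9,10,5,6,7,8,11$, and its graph of desire and reality decomposes into three cycles, so $n + 1 - c(\pi) = 5 + 1 - 3 = 3$. One non-trivial cycle is unoriented and forms a hurdle, and the permutation is not a fortress, hence $d(\pi) = 3 + 1 + 0 = 4$. Indeed, four reversals sort the permutation: $-1,+2,+5,+3,+4 \to +1,+2,+5,+3,+4 \to +1,+2,-4,-3,-5 \to +1,+2,+3,+4,-5 \to +1,+2,+3,+4,+5$.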

The proof of the theorem can be found in the book by Pevzner.

The reversal distance was calculated in $O(n)$ time by Bader et al. It is very easy to obtain $n + 1 - c(\pi)$ in $O(n)$ time; the hard part is to calculate $h(\pi)$ and $f(\pi)$. The source of the difficulty is that the overlap graph might contain $\Theta(n^2)$ edges. Therefore the fast algorithm does not build the entire overlap graph, only a spanning tree on each component of it.

21.6.2 Shotgun sequencing

A genome of an organism usually contains significantly more than one million nucleotides. Using a special biochemical technology, the order of the nucleotides can be obtained; however, the uncertainty grows with the length of the DNA, and the reading becomes absolutely unreliable after about 500 nucleotides.

A possible solution to overcome this problem is the following: several copies are made of the original DNA sequence, and they are fragmented into small parts that can be sequenced in the above described way. Then the original sequence must be reconstructed from its overlapping fragments. This technique is called shotgun sequencing.

The mathematical definition of the problem is that we want to find the shortest common super-sequence of a set of sequences. Sequence $B$ is a super-sequence of $A$ if $A$ is a subsequence of $B$. Recall that a subsequence is not necessarily a consecutive part of the sequence. Maier proved that the shortest common super-sequence problem is NP-complete if the size of the alphabet is at least $5$, and conjectured that this is the case if the size is at least $2$. Later on it was proved that the problem is NP-complete for every non-trivial alphabet [247].

A similar problem is the shortest common super-string problem, which is also NP-complete [102]. This latter problem has biological relevance, since here we are looking for overlapping substrings. Several approximation algorithms have been published for the shortest common super-string problem. A greedy algorithm finds for each pair of strings the maximal possible overlap, then it tries to find a shortest common super-string by merging the overlapping strings in a greedy way [270]; its running time is polynomial in the number of sequences and in the total length of the sequences. This greedy method is proved to be a $4$-approximation [36]. A modified version that is a $3$-approximation also exists, and the conjecture is that the greedy method itself is a $2$-approximation [36].
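A minimal Python sketch of the greedy merging (our illustration, not part of the original text; it uses a naive quadratic overlap search rather than the efficient implementation cited above):

    def overlap(a, b):
        # length of the longest suffix of a that is a prefix of b
        for k in range(min(len(a), len(b)), 0, -1):
            if a.endswith(b[:k]):
                return k
        return 0

    def greedy_superstring(strings):
        # drop strings contained in others, then repeatedly merge the pair
        # of strings with the maximal overlap
        s = [x for x in strings if not any(x != y and x in y for y in strings)]
        while len(s) > 1:
            i, j, k = max(((i, j, overlap(s[i], s[j]))
                           for i in range(len(s)) for j in range(len(s)) if i != j),
                          key=lambda t: t[2])
            s = [x for m, x in enumerate(s) if m not in (i, j)] + [s[i] + s[j][k:]]
        return s[0] if s else ""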

The sequencing of DNA is not perfect: insertions, deletions and substitutions might happen during sequencing. Therefore Jiang and Li suggested the shortest approximative common super-string problem [154]. Kececioglu and Myers worked out a software package including several heuristic algorithms for the problem [166]. Later on, Myers worked for Celera, which played a very important role in sequencing the human genome. A review paper on the topic can be found in [291].

Exercises

21.6-1 Show that a fortress contains at least three super-hurdles.

21.6-2 At least how long is a fortress?

  Problems  

21-1 Concave Smith–Waterman

Give the Smith–Waterman algorithm for concave gap penalty.

21-2 Concave Spouge

Give the Spouge algorithm for concave gap penalty.

21-3 Serving at a petrol station

There are two rows at a petrol station. Each car needs either petrol or diesel oil. At most two cars can be served at the same time, but only if they need different types of fuel and the two cars are the first ones in the two rows or the first two in the same row. The serving time is the same regardless of whether two cars are being served or only one. Give a pair-HMM for which the Viterbi algorithm provides a shortest serving scenario.

21-4 Moments of an HMM

Given an HMM and a sequence, obtain the mean, the variance and the $k$-th moment of the probabilities of the paths emitting the given sequence.

21-5 Moments of a SCFG

Given a SCFG and a sequence, obtain the mean, the variance and the $k$-th moment of the probabilities of the derivations of the given sequence.

21-6 Co-emission probability of two HMMs

Can this probability be calculated in less time than what is needed for solving the linear equation system of Equation 21.62, where $n_1$ and $n_2$ are the numbers of states in the HMMs?

21-7 Sorting reversals

A sorting reversal is a reversal that decreases the reversal distance of a signed permutation. How can a sorting reversal change the number of cycles and hurdles?

  Chapter Notes  

The first dynamic programming algorithm for aligning biological sequences was given by Needleman and Wunsch in 1970 [218]. Though the concave gap penalty function is biologically more relevant, the affine gap penalty has been the standard scoring scheme for aligning biological sequences. For example, one of the most popular multiple alignment programs, CLUSTAL-W, uses affine gap penalty and iterative sequence alignment [274]. The edit distance of two strings can be calculated faster than $O(n^2)$ time, that is the famous “Four Russians' speedup” [13]. The running time of the algorithm is $O(n^2/\log n)$; however, it has such a big constant in the running time that it is not worth using for the sequence lengths appearing in biological applications. The longest common subsequence problem can be solved using a dynamic programming algorithm similar to the one for aligning sequences. Unlike that algorithm, the algorithm of Hunt and Szymanski creates a graph whose points are the characters of the sequences $A$ and $B$, and $a_i$ is connected to $b_j$ iff $a_i = b_j$. Using this graph, the longest common subsequence can be found in $O((r+n)\log n)$ time, where $r$ is the number of edges in the graph and $n$ is the number of nodes [143]. Although the worst-case running time of this algorithm is $O(n^2 \log n)$, since the number of edges might be $O(n^2)$, in many cases the number of edges is only $O(n)$, and in these cases the running time is only $O(n \log n)$. A very sophisticated version of the corner-cutting method is the diagonal extension technique, which fills in the dynamic programming table by diagonals and does not need a test value. An example for such an algorithm is the algorithm of Wu et al. [297]. The diff command in the Unix operating system is also based on diagonal extension [211], having a running time of $O((n+m)d)$, where $n$ and $m$ are the lengths of the sequences and $d$ is the edit distance between them. The Knuth-Morris-Pratt string-searching algorithm searches a small pattern $P$ in a long string $S$; its running time is $O(|P|+|S|)$ [173]. Landau and Vishkin modified this algorithm such that the modified version can find a pattern in $S$ that differs from $P$ in at most $k$ positions [184]; both the running time and the memory requirement of the modified algorithm depend on $k$ as well as on the lengths of the sequences. Although dynamic programming algorithms are the most frequently used techniques for aligning sequences, it is also possible to attack the problem with integer linear programming. Kececioglu and his colleagues gave the first linear programming algorithm for aligning sequences [165]. Their method has been extended to arbitrary gap penalty functions [9]. Lancia wrote a review paper on the topic [183], and Pachter and Sturmfels showed the relationship between the dynamic programming and the integer linear programming approaches in their book Algebraic Statistics for Computational Biology [227]. The structural alignment considers only the 3D structure of sequences. The optimal structural alignment problem is to find an alignment where gaps are penalised as usual; however, the aligned characters are scored not by their similarity but by how close they are to each other in the superposed 3D structures. Several algorithms have been developed for the problem; one of them is the combinatorial extension (CE) algorithm [258]. For a given topology, it is possible to find the Maximum Likelihood labelling [242]. This algorithm has been integrated into PAML, one of the most popular software packages for phylogenetic analysis (http://abacus.gene.ucl.ac.uk/software/paml.html). 
The Maximum Likelihood tree problem is to find, for a substitution model and a set of sequences, the tree topology and edge lengths for which the likelihood is maximal. Surprisingly, it has only recently been proved that the problem is NP-complete [51], [250]. The similar problem, the Ancestral Maximum Likelihood problem, has been shown to be NP-complete also only recently [5]. The AML problem is to find the tree topology, edge lengths and labellings for which the likelihood of a set of sequences is maximal in a given substitution model. The two most popular sequence alignment algorithms based on HMMs are the SAM [142] and the HMMER (http://hmmer.wustl.edu/) packages. An example for an HMM used for genome annotation is the work of Pedersen and Hein [233]. Comparative genome annotation can be done with pair-HMMs like the DoubleScan [208], (http://www.sanger.ac.uk/Software/analysis/doublescan/) and the Projector [209], (http://www.sanger.ac.uk/Software/analysis/projector/) programs. Goldman, Thorne and Jones were the first who published an HMM in which the emission probabilities are calculated from evolutionary information [110]. It was used for protein secondary structure prediction. The HMM emits alignment columns, and the emission probabilities can be calculated with the Felsenstein algorithm. The Knudsen-Hein grammar is used in the PFold program, which predicts RNA secondary structures [171]. This SCFG generates RNA multiple alignments, where the terminal symbols are alignment columns. The derivation probabilities can be calculated with the Felsenstein algorithm; the corresponding substitution model is a single nucleotide or a dinucleotide model, according to the derivation rules. The running time of the Forward algorithm grows quadratically with the number of states of the HMM. However, this is not always the fastest algorithm: for a biologically important HMM, it is possible to significantly reduce the running time of the Forward algorithm with a more sophisticated algorithm [194], [195]. However, it is unknown whether or not a similar acceleration exists for the Viterbi algorithm. The Zuker-Tinoco model [275] defines free energies for RNA secondary structure elements, and the free energy of an RNA structure is the sum of the free energies of its elements. The Zuker-Sankoff algorithm calculates the minimum free energy structure in $O(n^4)$ time, using $O(n^2)$ memory, where $n$ is the length of the RNA sequence. It is also possible to calculate the partition function of the Boltzmann distribution with the same running time and memory requirement [206]. For a special case of free energies, both the optimal structure and the partition function can be calculated in $O(n^3)$ time, still using only $O(n^2)$ memory [199]. Two base-pairings, $(i,j)$ and $(k,l)$, form a pseudo-knot if $i < k < j < l$. Predicting the optimal RNA secondary structure in which arbitrary pseudo-knots are allowed is NP-complete [198]. For special types of pseudo-knots, polynomial running time algorithms exist [8], [198], [249], [281]. RNA secondary structures can be compared with aligning ordered forests [140]. Atteson gave a mathematical definition for the goodness of tree-constructing methods, and showed that the Neighbour-Joining algorithm is the best one for some definitions [16]. Elias and Lagergren recently published an improved version of Neighbour-Joining that has only $O(n^2)$ running time [80]. There are three possible tree topologies for four species, and these are called quartets. If we know all the quartets of the tree, it is possible to reconstruct it. 
It is proved that it is enough to know only the short quartets of a tree, which are the quartets of closely related species [81]. A genome might contain more than one DNA sequence; the DNA sequences are called chromosomes. A genome rearrangement might happen between chromosomes, too; such mutations are called translocations. Hannenhalli gave a polynomial running time algorithm for calculating the translocation and reversal distance [132]. Pisanti and Sagot generalised the problem and gave results for the translocation diameter [238]. The generalisation of sorting permutations is the problem of finding the minimum length generating word for an element of a group; this problem is known to be NP-complete [153]. Besides the reversal distance and the translocation distance, a polynomial running time algorithm exists only for the block interchange distance [52]. We mention that Bill Gates, the co-founder of Microsoft, also worked on sorting permutations, actually with prefix reversals [106].

Description of many algorithms of bioinformatics can be found in the book of Pevzner and Jones [236]. We wrote only about the most important topics of bioinformatics, and we did not cover several topics like recombination, pedigree analysis, character-based tree reconstructing methods, partial digesting, protein threading methods, DNA chip analysis, knowledge representation, biochemical pathways, scale-free networks, etc. We close the chapter with the words of Donald Knuth: “It is hard for me to say confidently that, after fifty more years of explosive growth of computer science, there will still be a lot of fascinating unsolved problems at peoples' fingertips, that it won't be pretty much working on refinements of well-explored things. Maybe all of the simple stuff and the really great stuff has been discovered. It may not be true, but I can't predict an unending growth. I can't be as confident about computer science as I can about biology. Biology easily has 500 years of exciting problems to work on, it's at that level.”

Bibliography

[2] S. Abiteboul. Querying semi-structured data, In F. Afrati, P. Kolaitis (Eds.) Proceedings of ICDT'97, Lecture Notes in Computer Science. Springer-Verlag. 1186. 1997. 1–18.

[3] S. Abiteboul, O. Duschka. Complexity of answering queries using materialized views, In Proceedings of the Seventeenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. ACM-Press. 1998. 254–263.

[4] S. Abiteboul, V. Vianu. Foundations of Databases. Addison-Wesley. 1995.

[5] L. Addario-Berry, B. Chor, M. Hallett, J. Lagergren, A. Panconesi, T. Wareham. Ancestral Maximum Likelihood of Phylogenetic Trees is Hard. Lecture Notes in Bioinformatics. 2812. 2003. 202–215.

[6] R. G. Addie, M. Zukerman, T. Neame. Broadband Traffic Modeling: Simple Solutions to Hard Problems. IEEE Communications Magazine. 36(8). 1998. 88–95.

[7] A. Aho, C. Beeri, J. D. Ullman. The theory of joins in relational databases. ACM Transactions on Database Systems. 4(3). 1979. 297–314.

[8] T. Akutsu. Dynamic programming algorithms for RNA secondary structure prediction with pseudoknots. Discrete Applied Mathematics. 104. 2000. 45–62.

[9] E. Althaus, A. Caprara, H. Lenhof, K. Reinert. Multiple sequence alignment with arbitrary gap costs: Computing an optimal solution using polyhedral combinatorics. Bioinformatics. 18. 2002. S4–S16.

[10] G. M. Amdahl. Validity of the single-processor approach to achieving large-scale computer capabilities, In AFIPS Conference Proceedings. 30. 1967. 483–485.

[11] D. Anick, D. Mitra, M. Sondhi. Stochastic theory of a data handling system with multiple sources. The Bell System Technical Journal. 61. 1982. 1871–1894.

[12] M. Arenas, L. Libkin. A normal form for XML documents, In Proceedings of the 21st Symposium on Principles of Database Systems. 2002. 85–96.

[13] V. Arlazarov, E. A. Dinic, M. Kronrod, I. Faradzev. On economic construction of the transitive closure of a directed graph. Doklady Akademii Nauk SSSR. 194. 1970. 487–488.

[14] W. Armstrong. Dependency structures of database relationships, In Proceedings of IFIP Congress. North-Holland. 1974. 580–583.

[15] J. Aspnes, C. Busch, S. Dolev, P. Fatourou, C. Georgiou, A. Shvartsman, P. Spirakis, R. Wattenhofer. Eight open problems in distributed computing. Bulletin of the European Association for Theoretical Computer Science (EATCS). 90. 2006. 109–126.

[16] K. Atteson. The performance of the neighbor-joining method of phylogeny reconstruction. Algorithmica. 25(2/3). 1999. 251–278.

[17] H. Attiya, C. Dwork, N. A. Lynch, L. J. Stockmeyer. Bounds on the time to reach agreement in the presence of timing uncertainty. Journal of the ACM. 41. 1994. 122–142.

[18] H. Attiya, J. Welch. Distributed Computing, Fundamentals, Simulations and Advanced Topics. McGraw-Hill. 1998.

[19] B. Awerbuch. Complexity of network synchronization. Journal of the ACM. 32(4). 1985. 804–823.

[20] M. Bader, E. Ohlebusch. Sorting by weighted reversals, transpositions and inverted transpositions. Lecture Notes in Bioinformatics. 3909. 2006. 563–577.

[21] B. Baker. A tight asymptotic bound for next-fit decreasing bin-packing. SIAM Journal on Algebraic and Discrete Methods. 2(2). 1981. 147–152.

[22] Á. Balogh, A. Iványi. Memory Management, In A. Iványi (Ed.) Algorithms of Informatics. Mondat Kiadó. 2. 2007. 799–850.

[23] J. Banks, J. Carson, B. Nelson. Discrete-Event Simulation. 1996.

[24] C. Beeri. On the membership problem for functional and multivalued dependencies in relational databases. ACM Transactions on Database Systems. 5. 1980. 241–259.

[25] C. Beeri, P. Bernstein. Computational problems related to the design of normal form relational schemas. ACM Transactions on Database Systems. 4(1). 1979. 30–59.

[26] C. Beeri, M. Dowd. On the structure of Armstrong relations for functional dependencies. Journal of the ACM. 31(1). 1984. 30–46.

[27] C. Beeri, R. Fagin, J. Howard. A complete axiomatization for functional and multivalued dependencies, In ACM SIGMOD Symposium on the Management of Data. 1977. 47–61.

[28] R. Bello, K. Dias, A. Downing, J. Freenan, T. Finnerty, W. D. Norcott, H. Sun, A. Witkowski, M. Ziauddin. Materialized views in Oracle, In Proceedings of Very Large Data Bases'98. 1998. 659–664.

[29] S. A. Benner, M. A. Cohen, G. H. Gonnet. Empirical and structural models for insertions and deletions in the divergent evolution of proteins. Journal of Molecular Biology. 229(4). 1993. 1065–1082.

[30] J. Beran. Statistics for Long-Memory Processes, Monographs on Statistics and Applied Probability. Chapman & Hall. 1986.

[31] J. Beran, R. Sherman, M. Taqqu, W. Willinger. Long-range dependence in variable-bit-rate video traffic. IEEE Transactions on Communications. 43. 1995. 1566–1579.

[32] P. Berman, J. Garay. Cloture votes: $n/4$-resilient distributed consensus in $t+1$ rounds. Mathematical Systems Theory. 26(1). 1993. 3–19.

[33] K. A. Berman, J. L. Paul. Sequential and Parallel Algorithms. PWS Publishing Company. 1996.

[34] A. Békéssy, J. Demetrovics. Contribution to the theory of data base relations. Discrete Mathematics. 27(1). 1979. 1–10.

[35] L. A. Bélády, R. Nelson, G. S. Shedler. An anomaly in space-time characteristics of certain programs running in paging machine. Communications of the ACM. 12(1). 1969. 349–353.

[36] A. Blum, T. Jiang, M. Li, J. Tromp, M. Yannakakis. Linear approximation of shortest superstrings, In Proceedings of the 23rd ACM Symposium on Theory of Computing. 1991. 328–336.

[37] R. P. Brent. The Parallel Evaluation of General Arithmetic Expressions. Journal of the ACM. 21. 1974. 201–206.

[38] P. Buneman. Semistructured data, In Proceedings of the Sixteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. ACM-Press. 1997. 117–121.

[39] P. Buneman, M. Fernandez, D. Suciu. UnQL: a query language and algebra for semistructured data based on structural recursion. The International Journal on Very Large Data Bases. 9(1). 2000. 76–110.

[40] J. E. Burns. A formal model for message passing systems. Technical Report 91, Indiana University. 1980.

[41] J. E. Burns, N. A. Lynch. Bounds on shared memory for mutual exclusion. Information and Computation. 107(2). 1993. 171–184.

[42] CACI. COMNET III. CACI Products Co. 1997.

[43] D. Calvanese, G. De Giacomo, M. Lenzerini, M. Y. Vardi. Answering regular path queries using views, In Proceedings of the Sixteenth International Conference on Data Engineering. 2000. 190–200.

[44] A. Caprara. Formulations and complexity of multiple sorting by reversals, In RECOMB-99. ACM-Press. 1999. 84–93.

[45] A. Caprara. Sorting Permutations by Reversals and Eulerian Cycle Decompositions. SIAM Journal on Discrete Mathematics. 12(1). 1999. 91–110.

[46] H. Carillo, D. Lipman. The multiple sequence alignment problem in biology. SIAM Journal on Applied Mathematics. 48. 1988. 1073–1082.

[47] H. Casanova, A. Legrand, Y. Robert. Parallel Algorithms. Chapman & Hall. 2009.

[48] R. Chandra, L. Dagum, D. Kohr, D. Maydan, J. McDonald, R. Menon. Parallel Programming in OpenMP. Morgan Kaufmann Publishers. 2000.

[49] S. Chaudhuri, R. Krishnamurthy, S. Potamianos, K. Shim. Optimizing queries with materialized views, In Proceedings of the Eleventh International Conference on Data Engineering. 1995. 190–200.

[50] Q. Chen, A. Lim, K. W. Ong. An adaptive structural summary for graph-structured data, In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data. 2003. 134–144.

[51] B. Chor, T. Tuller. Maximum likelihood of evolutionary trees: hardness and approximation. Bioinformatics. 21. 2005. i97–i106.

[52] D. Christie. Sorting permutations by block-interchanges. Information Processing Letters. 60(4). 1996. 165–169.

[53] E. F. Codd. A relational model of data for large shared data banks. Communications of the ACM. 13(6). 1970. 377–387.

[54] E. F. Codd. Further normalization of the data base relational model, In R. Rustin (Ed.) Courant Computer Science Symposium 6: Data Base Systems. Prentice Hall. 1972. 33–64.

[55] E. F. Codd. Normalized database structure: A brief tutorial, In ACM SIGFIDET Workshop on Data Description, Access and Control. 1971. 24–30.

[56] E. F. Codd. Recent investigations in relational data base systems, In Information Processing 74. North-Holland. 1974. 1017–1021.

[57] E. F. Codd. Relational completeness of database sublanguages, In R. Rustin (Ed.) Courant Computer Science Symposium 6: Data Base Systems. Prentice Hall. 1972. 65–98.

[58] E. Coffman. Computer and Job Shop Scheduling. John Wiley & Sons. 1976.

[62] F. Corpet. Multiple sequence alignment with hierarchical clustering. Nucleic Acids Research. 16. 1988. 10881–10890.

[63] M. Crovella, A. Bestavros. Self-Similarity in World Wide Web Traffic: Evidence and Possible Causes. IEEE/ACM Transactions on Networking. 5(6). 1997. 835–846.

[64] D. E. Culler, R. M. Karp, D. Patterson, A. Sahay, E. E. Santos, K. E. Schauser, R. Subramonian, T. von Eicken. LogP: A practical model of parallel computation. Communications of the ACM. 39(11). 1996. 78–85.

[66] M. O. Dayhoff, R. M. Schwartz, B. Orcutt. A model of evolutionary change in proteins. Atlas of Protein Sequence and Structure. 5. 1978. 345–352.

[67] C. Delobel. Normalization and hierarchical dependencies in the relational data model. ACM Transactions on Database Systems. 3(3). 1978. 201–222.

[68] J. Demetrovics, Gy. O. H. Katona, A. Sali. Minimal Representations of Branching Dependencies. Discrete Applied Mathematics. 40. 1992. 139–153.

[69] J. Demetrovics, Gy. O. H. Katona, A. Sali. Minimal Representations of Branching Dependencies. Acta Scientiarum Mathematicarum (Szeged). 60. 1995. 213–223.

[70] J. Demetrovics, Gy. O. H. Katona, A. Sali. Design type problems motivated by database theory. Journal of Statistical Planning and Inference. 72. 1998. 149–164.

[71] P. Denning. Virtual memory. Computing Surveys. 2(3). 1970. 153–189.

[72] A. Deutsch, M. Fernandez, D. Suciu. Storing semistructured data with STORED, In Proceedings of SIGMOD'99. 1999. 431–442.

[73] A. Deutsch, L. Popa, D. Tannen. Physical data independence, constraints and optimization with universal plans, In Proceedings of VLDB'99. 1999. 459–470.

[74] D. Dolev, R. Strong. Authenticated algorithms for Byzantine agreement. SIAM Journal on Computing. 12(4). 1983. 656–666.

[75] N. Duffield, N. O'Connell. Large deviations and overflow probabilities for the general single-server queue, with applications. Mathematical Proceedings of the Cambridge Philosophical Society. 118. 1995. 363–374.

[76] D. Duffy, A. McIntosh, M. Rosenstein, W. Willinger. Statistical analysis of CCSN/SS7 traffic data from working CCS subnetworks. IEEE Journal on Selected Areas in Communications. 12. 1994. 544–551.

[77] O. Duschka, M. Genesereth. Answering recursive queries using views, In Proceedings of the Sixteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. ACM-Press. 1997. 109–116.

[78] O. Duschka, M. Genesereth. Query planning in infomaster, In Proceedings of ACM Symposium on Applied Computing. ACM-Press. 1997. 109–111.

[79] I. Elias, T. Hartman. A 1.375 approximation algorithm for sorting by transpositions. Lecture Notes in Bioinformatics. 3692. 2005. 204–215.

[80] I. Elias, J. Lagergren. Fast Neighbor Joining. Lecture Notes in Computer Science. 3580. 2005. 1263–1274.

[81] P. L. Erdős, M. Steel, L. Székely, T. Warnow. Local quartet splits of a binary tree infer all quartet splits via one dyadic inference rule. Computers and Artificial Intelligence. 16(2). 1997. 217–227.

[82] A. Erramilli, O. Narayan, W. Willinger. Experimental queueing analysis with long-range dependent packet-traffic. IEEE/ACM Transactions on Networking [http://portal.acm.org/portal.cfm?CFID=20545300&CFTOKEN=71807633]. 4(2). 1996. 209–223.

[83] R. Fagin [http://www.almaden.ibm.com/cs/people/fagin/]. Armstrong databases, In Proceedings of IBM Symposium on Mathematical Foundations of Computer Science. 1982. 24 pages.

[84] R. Fagin. Horn clauses and database dependencies. Journal of ACM [http://portal.acm.org/browse_dl.cfm?linked=1&part=journal&idx=J401&coll=ACM&dl=ACM&CFID=25015209&CFTOKEN=28278630]. 29(4). 1982. 952–985.

[85] R. Fagin. Multivalued dependencies and a new normal form for relational databases. ACM [http://portal.acm.org/browse_dl.cfm?linked=1&part=transaction&idx=J777&coll=ACM&dl=ACM&CFID=25015209&CFTOKEN=28278630] Transactions on Database Systems. 2. 1977. 262–278.

[86] J. Felsenstein. Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of Molecular Evolution [link.springer.de/link/service/journals/00239/]. 17. 1981. 368–376.

[87] D. Feng, R. F. Doolittle. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. Journal of Molecular Evolution [http://www.springerlink.com/app/home/journal.asp?wasp=ege9fdlrul6xnnwvuxv2&referrer=parent&backto=linkingpublicationresults,id:100107,1]. 25. 1987. 351–360.

[88] J. W. Fickett. Fast optimal alignment. Nucleid Acids Research [http://nar.oupjournals.org/]. 12. 1984. 175–180.

[89] M. J. Fischer, N. A. Lynch [http://theory.lcs.mit.edu//~lynch], M. S. Paterson. Impossibility of distributed consensus with one faulty proces. Journal of the ACM [http://portal.acm.org/browse_dl.cfm?linked=1&part=journal&idx=J401&coll=portal&dl=ACM&CFID=10136019&CFTOKEN=486195]. 32(2). 1985. 374–382.

[90] W. M. Fitch. Toward defining the course of evolution: minimum change for a specified tree topology. Systematic Zoology [http://www.jstor.org/journals/00397989.html]. 20. 1971. 406–416.

[91] D. Florescu. Search spaces for object-oriented query optimization. 1996.

[92] D. Florescu, A. Halevy, A. O. Mendelzon. Database techniques for the world-wide web: a survey. SIGMOD Record. 27(3). 1998. 59–74.

[93] D. Florescu, L. Raschid, P. Valduriez. Answering queries using OQL view expressions, In Workshop on Materialized Views, in cooperation with ACM SIGMOD. 1996. 627–638.

[94] M. J. Flynn. Very high-speed computing systems. Proceedings of the IEEE. 54(12). 1966. 1901–1909.

[95] P. Fornai, A. Iványi. Bélády's anomaly is unbounded, In E. Kovács, Z. Winkler (Eds.) 5th International Conference on Applied Informatics. Molnár és társa. 2002. 65–72.

[96] S. Fortune, J. Wyllie. Parallelism in Random Access Machines, In Proceedings of the Tenth Annual ACM Symposium on Theory of Computing. 1978. 114–118.

[97] I. Foster, C. Kesselman. The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann Publishers. 2004 (2nd edition).

[98] I. Foster, C. Kesselman, J. M. Nick, S. Tuecke. The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration. Available at www.globus.org/research/papers/ogsa.pdf. 2002.

[99] L. Foulds, R. L. Graham. The Steiner problem in phylogeny is NP-complete. Advances in Applied Mathematics. 3. 1982. 43–49.

[100] M. Friedman, D. S. Weld. Efficient execution of information gathering plans, In Proceedings of the International Joint Conference on Artificial Intelligence. 1997. 785–791.

[101] Z. Galil, R. Giancarlo. Speeding up dynamic programming with applications to molecular biology. Theoretical Computer Science. 64. 1989. 107–118.

[102] J. Gallant, D. Maier, J. Storer. On finding minimal length superstrings. Journal of Computer and System Sciences. 20(1). 1980. 50–58.

[103] O. Goldreich, S. Micali, A. Wigderson. Proofs that yield nothing but their validity or all languages in NP have zero-knowledge proof systems. Journal of the ACM. 38(3). 1991. 691–729.

[104] H. Garcia-Molina, J. Seiferas. Elections in a distributed computing system. IEEE Transactions on Computers. C-31(1). 1982. 47–59.

[105] M. R. Garey, D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman. 1979.

[106] W. H. Gates, C. H. Papadimitriou. Bounds for sorting by prefix reversals. Discrete Mathematics. 27. 1979. 47–57.

[107] P. Gács. Compatible sequences and a slow Winkler percolation. Combinatorics, Probability and Computing. 13(6). 2004. 815–856.

[108] F. Gécseg, I. Peák. Algebraic Theory of Automata. Akadémiai Kiadó. 1972.

[109] P. B. Gibbons, Y. Matias, V. Ramachandran. Can a shared-memory model serve as a bridging model for parallel computation? Theory of Computing Systems. 32(3). 1999. 327–359.

[110] N. Goldman, J. Thorne, D. Jones. Using evolutionary trees in protein secondary structure prediction and other comparative sequence analyses. Journal of Molecular Biology. 263(2). 1996. 196–208.

[111] J. Goldstein, P. A. Larson. Optimizing queries using materialized views: a practical, scalable solution, In Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data. 2001. 331–342.

[112] G. H. Gonnet, M. A. Cohen, S. A. Benner. Exhaustive matching of the entire protein sequence database. Science. 256. 1992. 1443–1445.

[113] O. Gotoh. An improved algorithm for matching biological sequences. Journal of Molecular Biology. 162. 1982. 705–708.

[114] G. Gottlob, C. Koch, R. Pichler. The complexity of XPath query evaluation, In Proceedings of the 22nd Symposium on Principles of Database Systems. 2003. 179–190.

[115] G. Grahne, A. Mendelzon. Tableau techniques for querying information sources through global schemas, Lecture Notes in Computer Science, In Proceedings of ICDT'99. Springer-Verlag. 1540. 1999. 332–347.

[116] A. Grama, A. Gupta, G. Karypis, V. Kumar. Introduction to Parallel Computing. Addison-Wesley. 2003 (2nd edition).

[117] J. Grant, J. Minker. Inferences for numerical dependencies. Theoretical Computer Science. 41. 1985. 271–287.

[118] J. Grant, J. Minker. Normalization and axiomatization for numerical dependencies. Information and Control. 65. 1985. 1–17.

[119] J. N. Gray. Notes on database operating systems, In R. Bayer, R. M. Graham, G. Seegmüller (Eds.) Operating Systems: An Advanced Course, Lecture Notes in Computer Science. Springer-Verlag. 60. 1978. 393–481.

[120] W. Gropp, M. Snir, B. Nitzberg, E. Lusk. MPI: The Complete Reference, Scientific and Engineering Computation Series. The MIT Press. 1998.

[121] Q-P. Gu, S. Peng, H. Sudborough. A 2-Approximation Algorithm for Genome Rearrangements by Reversals and Transpositions. Theoretical Computer Science. 210(2). 1999. 327–339.

[122] R. Gusella. A measurement study of diskless workstation traffic on an Ethernet. IEEE Transactions on Communications. 38. 1990. 1557–1568.

[123] D. M. Gusfield. Algorithms on Strings, Trees and Sequences. Cambridge University Press. 1997.

[124] D. M. Gusfield. Efficient methods for multiple sequence alignment with guaranteed error bounds. Bulletin of Mathematical Biology. 55. 1993. 141–154.

[125] J. L. Gustafson. Reevaluating Amdahl's law. Communications of the ACM. 31(5). 1988. 532–533.

[126] T. Gyires. Simulation of the Harmful Consequences of Self-Similar Network Traffic. The Journal of Computer Information Systems. 42(4). 2002. 94–111.

[127] T. Gyires. Extension of multiprotocol label switching for long-range dependent traffic: QoS routing and performance in IP networks. Computer Standards and Interfaces. 27. 2005. 117–132.

[129] A. Halevy. Logic based techniques in data integration, In J. Minker (Ed.) Logic-based Artificial Intelligence. Kluwer Academic Publishers. 2000. 575–595.

[130] A. Halevy, A. Mendelzon, Y. Sagiv, D. Srivastava. Answering queries using views, In Proceedings of the Fourteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. ACM Press. 1995. 95–104.

[131] A. Halevy, A. Rajaraman, J. J. Ordille, D. Srivastava. Querying heterogeneous information sources using source descriptions, In Proceedings of Very Large Data Bases. 1996. 251–262.

[132] S. Hannenhalli. Polynomial-time algorithm for computing translocation distance between genomes. Discrete Applied Mathematics. 71. 1996. 137–151.

[133] H. He, J. Yang. Multiresolution indexing of XML for frequent queries, In Proceedings of the 20th International Conference on Data Engineering. 2004. 683–694.

[134] H. Heffes, D. Lucantoni. A Markov modulated characterization of packetized voice and data traffic and related statistical multiplexer performance. IEEE Journal on Selected Areas in Communications. 4. 1986. 856–868.

[135] M. R. Henzinger, T. A. Henzinger, P. Kopke. Computing simulations on finite and infinite graphs, In Proceedings of the 36th Annual Symposium on Foundations of Computer Science. IEEE Computer Society Press. 1995. 453–462.

[136] F. Heylighen. Principia Cybernetica Project. http://pespmc1.vub.ac.be/HEYL.html. 2004.

[137] D. S. Hirschberg. A linear space algorithm for computing maximal common subsequences. Communications of the ACM. 18. 1975. 341–343.

[138] J. E. Hopcroft, R. Motwani, J. D. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley. 2001 (2nd edition). (In German: Einführung in Automatentheorie, Formale Sprachen und Komplexitätstheorie. Pearson Studium. 2002.)

[139] E. Horowitz, S. Sahni, S. Rajasekaran. Computer Algorithms. Computer Science Press. 1998.

[140] M. Höchsmann, T. Töller, R. Giegerich, S. Kurtz. Local similarity in RNA secondary structure, In Proceedings of the IEEE Bioinformatics Conference 2003. 2003. 159–168.

[141] T. J. P. Hubbard, A. M. Lesk, A. Tramontano. Gathering them into the fold. Nature Structural Biology. 4. 1996. 313.

[142] R. Hughey, A. Krogh. Hidden Markov models for sequence analysis: Extension and analysis of the basic method. CABIOS. 12(2). 1996. 95–107.

[143] J. Hunt, T. Szymanski. A fast algorithm for computing longest common subsequences. Communications of the ACM. 20(5). 1977. 350–353.

[144] K. Hwang, Z. Xu. Scalable Parallel Computing. McGraw-Hill. 1998.

[145] Interoute. http://www.interoute.com/glossary.html. 2004.

[146] A. Iványi. Density of safe matrices. Acta Universitatis Sapientiae. 1(2). 2009. 121–142.

[147] A. Iványi. On dumpling-eating giants, Colloquia of Mathematical Society János Bolyai, In Finite and Infinite Sets (Eger, 1981). North-Holland. 37. 1984. 279–390.

[148] A. Iványi. Párhuzamos algoritmusok (Parallel Algorithms). ELTE Eötvös Kiadó. 2003.

[149] A. Iványi. Performance bounds for simple bin packing algorithms. Annales Universitatis Scientiarum Budapestinensis de Rolando Eötvös Nominatae, Sectio Computatorica. 5. 1984. 77–82.

[150] A. Iványi. Tight worst-case bounds for bin packing algorithms, Colloquia of Mathematical Society János Bolyai, In Theory of Algorithms (Pécs, 1984). North-Holland. 44. 1985. 233–240.

[151] A. Iványi, R. Szmeljánszkij. Elements of Theoretical Programming (in Russian). Moscow State University. 1985.

[152] R. Jain, S. Routhier. Packet trains: Measurements and a new model for computer network traffic. IEEE Journal on Selected Areas in Communications. 4. 1986. 986–995.

[153] M. R. Jerrum. The complexity of finding minimum-length generator sequences. Theoretical Computer Science. 36. 1986. 265–289.

[154] T. Jiang, M. Li. Towards a DNA sequencing theory (learning strings). Technical Report 91-08. 1991.

[155] D. S. Johnson. Near-Optimal Bin Packing Algorithms. MIT Department of Mathematics. 1973.

[156] D. S. Johnson, A. Demers, J. D. Ullman, M. R. Garey, R. L. Graham. Worst-case performance bounds for simple one-dimensional bin packing algorithms. SIAM Journal on Computing. 3. 1974. 299–325.

[157] K. Jones. Consultant's Guide to COMNET III. CACI Product Company. 1997.

[158] G. A. Jones, J. Jones. Information and Coding Theory. Springer-Verlag. 2000.

[159] M. Kandemir, J. Ramanujam, A. Choudhary. Compiler algorithms for optimizing locality and parallelism on shared and distributed-memory machines. Journal of Parallel and Distributed Computing. 60. 2000. 924–965.

[160] R. M. Karp, R. E. Miller, S. Winograd. The organization of computations for uniform recurrence equations. Journal of the ACM. 14. 1967. 563–590.

[161] R. Kaushik, P. Bohannon, J. F. Naughton, H. Korth. Covering indexes for branching path queries, In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data. 2002. 133–144.

[162] R. Kaushik, P. Bohannon, J. F. Naughton, H. Korth, P. Shenoy. Updates for structure indexes, In Proceedings of Very Large Data Bases. 2002. 239–250.

[163] R. Kaushik, R. Krishnamurthy, J. F. Naughton, R. Ramakrishnan. Exploiting local similarity for indexing paths in graph-structured data, In Proceedings of the 18th International Conference on Data Engineering. 2002. 129–140.

[164] R. Kaushik, P. Shenoy, P. Bohannon, E. Gudes. On the integration of structure indexes and inverted lists, In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data. 2004. 779–790.

[165] J. Kececioglu, H. Lenhof, K. Mehlhorn, P. Mutzel, K. Reinert, M. Vingron. A polyhedral approach to sequence alignment problems. Discrete Applied Mathematics. 104(1-3). 2000. 143–186.

[166] J. Kececioglu, E. Myers. Combinatorial algorithms for DNA sequence assembly. 1993.

[167] K. Kennedy, R. Allen. Optimizing Compilers for Modern Architectures. Morgan Kaufmann Publishers. 2001.

[168] M. Khosrow-Pour. Encyclopedia of Information Science and Technology, Vols. 1–5. Idea Group Inc. 2005.

[169] S. Kleiman, D. Shah, B. Smaalders. Programming with Threads. Prentice Hall. 1996.

[170] L. Kleinrock. Queueing Systems. John Wiley & Sons. 1975.

[171] B. Knudsen, J. Hein. Pfold: RNA secondary structure prediction using stochastic context-free grammars. Nucleic Acids Research. 31(13). 2003. 3423–3428.

[172] B. Knudsen, J. Hein. RNA secondary structure prediction using stochastic context-free grammars and evolutionary history. Bioinformatics. 15(6). 1999. 446–454.

[173] D. E. Knuth, J. H. Morris, V. R. Pratt. Fast pattern matching in strings. SIAM Journal on Computing. 6(2). 1977. 323–350.

[174] C. H. Koelbel, D. B. Loveman, R. S. Schreiber, G. Steele Jr., M. E. Zosel. The High Performance Fortran Handbook. The MIT Press. 1994.

[175] A. D. Kshemkalyani, M. Singhal. Distributed Computing. Cambridge University Press. 2008.

[176] H. T. Kung, C. E. Leiserson. Systolic arrays (for VLSI), In I. S. Duff, G. W. Stewart (Eds.) Sparse Matrix Proceedings. SIAM. 1978. 256–282.

[177] C. T. Kwok, D. Weld. Planning to gather information, In Proceedings of the AAAI 13th National Conference on Artificial Intelligence. 1996. 32–39.

[178] T. Lai, S. Sahni. Anomalies in parallel branch and bound algorithms. Communications of the ACM. 27(6). 1984. 594–602.

[179] E. T. Lambrecht, S. Kambhampati, S. Gnanaprakasam. Optimizing recursive information gathering plans, In Proceedings of the 16th International Joint Conference on Artificial Intelligence. 1999. 1204–1211.

[180] L. Lamport. A new solution of Dijkstra's concurrent programming problem. Communications of the ACM. 17(8). 1974. 453–455.

[181] L. Lamport. A fast mutual exclusion algorithm. ACM Transactions on Computer Systems. 5(1). 1987. 1–11.

[182] L. Lamport, R. Shostak, M. Pease. The Byzantine generals problem. ACM Transactions on Programming Languages and Systems. 4(3). 1982. 382–401.

[183] G. Lancia. Integer programming models for computational biology problems. Journal of Computer Science and Technology. 19(1). 2004. 60–77.

[184] G. Landau, U. Vishkin. Efficient string matching with mismatches. Theoretical Computer Science. 43. 1986. 239–249.

[185] A. Law, W. Kelton. Simulation Modeling and Analysis. McGraw-Hill Higher Education. 2000 (3rd edition).

[186] F. T. Leighton. Introduction to Parallel Algorithms and Architectures: Arrays-Trees-Hypercubes. Morgan Kaufmann Publishers. 1992.

[187] F. T. Leighton. Introduction to Parallel Algorithms and Architectures: Algorithms and VLSI. Morgan Kaufmann Publishers. 2001.

[188] W. Leland, M. Taqqu, W. Willinger, D. Wilson. On the self-similar nature of Ethernet traffic. Computer Communication Review. 23. 1993. 183–193.

[189] C. Leopold. Parallel and Distributed Computing, Wiley Series on Parallel and Distributed Computing. John Wiley & Sons. 2001.

[190] B. Lewis, D. J. Berg. Multithreaded Programming with Pthreads. Prentice Hall. 1998.

[191] D. Lipman, S. F. Altschul, J. Kececioglu. A tool for multiple sequence alignment. Proc. Natl. Academy Science. 86. 1989. 4412–4415.

[192] M. Listanti, V. Eramo, R. Sabella. Architectural and Technological Issues for Future Optical Internet Networks. IEEE Communications Magazine. 8(9). 2000. 82–92.

[193] C. Lucchesi. Candidate keys for relations. Journal of Computer and System Sciences. 17(2). 1978. 270–279.

[194] G. Lunter, I. Miklós, A. Drummond, J. L. Jensen, J. Hein. Bayesian phylogenetic inference under a statistical indel model. Lecture Notes in Bioinformatics. 2812. 2003. 228–244.

[195] G. Lunter, I. Miklós, Y. Song, J. Hein. An efficient algorithm for statistical multiple alignment on arbitrary phylogenetic trees. Journal of Computational Biology. 10(6). 2003. 869–889.

[196] N. A. Lynch. Distributed Algorithms. Morgan Kaufmann Publishers. 2001 (5th edition).

[197] N. A. Lynch, M. J. Fischer. On describing the behavior and implementation of distributed systems. Theoretical Computer Science. 13(1). 1981. 17–43.

[198] R. Lyngso, C. N. S. Pedersen. RNA Pseudoknot Prediction in Energy Based Models. Journal of Computational Biology. 7(3/4). 2000. 409–428.

[199] R. Lyngso, M. Zuker, C. Pedersen. Fast evaluation of internal loops in RNA secondary structure prediction. Bioinformatics. 15(6). 1999. 440–445.

[200] D. Maier. Minimum covers in the relational database model. Journal of the ACM. 27(4). 1980. 664–674.

[201] D. Maier, A. O. Mendelzon, Y. Sagiv. Testing implications of data dependencies. ACM Transactions on Database Systems. 4(4). 1979. 455–469.

[202] B. Mandelbrot. The Fractal Geometry of Nature. W. H. Freeman. 1982.

[203] B. Mandelbrot, J. W. Van Ness. Fractional Brownian Motions, Fractional Noises and Applications. SIAM Review. 10. 1968. 422–437.

[204] R. L. Mattson, J. Gecsei, D. R. Slutz, I. Traiger. Evaluation Techniques for Storage Hierarchies. IBM Systems Journal. 9(2). 1970. 78–117.

[205] McAfee. Sniffer Technologies. http://www.nai.com/us/index.asp. 2004.

[206] J. S. McCaskill. The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers. 29. 1990. 1105–1119.

[207] B. Melamed. An overview of TES processes and modeling methodology, Lecture Notes in Computer Science, In L. Donatiello, A. R. Nelson (Eds.) Models and Techniques for Performance Evaluation of Computer and Communications Systems. Springer-Verlag. 1993. 359–393.

[208] I. M. Meyer, R. Durbin. Comparative ab initio prediction of gene structures using pair HMMs. Bioinformatics. 18(10). 2002. 1309–1318.

[209] I. M. Meyer, R. Durbin. Gene structure conservation aids similarity based gene prediction. Nucleic Acids Research. 32(2). 2004. 776–783.

[210] Sz. Mihnovskiy, N. Shor. Estimation of the page fault number in paged memory (in Russian). Kibernetika (Kiev). 1(5). 1965. 18–20.

[211] W. Miller, E. Myers. A file comparison program. Software – Practice and Experience. 15(11). 1985. 1025–1040.

[212] W. Miller, E. W. Myers. Sequence comparison with concave weighting functions. Bulletin of Mathematical Biology. 50. 1988. 97–120.

[213] T. Milo, D. Suciu. Index structures for path expressions, Lecture Notes in Computer Science, In Proceedings of the 7th International Conference on Database Theory (ICDT'99). Springer-Verlag. 1540. 1999. 277–295.

[214] S. Molnár, A. Vidács, A. Nilsson. Bottlenecks on the way towards characterization of network traffic: Estimation and interpretation of the Hurst parameter. http://hsnlab.ttt.bme.hu/~molnar (conference papers). 1997.

[215] B. Morgenstern. DIALIGN 2: improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics. 15. 1999. 211–218.

[216] B. Morgenstern, A. Dress, T. Werner. Multiple DNA and protein sequence alignment based on segment-to-segment comparison. Proc. Natl. Academy Science. 93. 1996. 12098–12103.

[217] B. Morgenstern, K. Frech, A. Dress, T. Werner. DIALIGN: Finding local similarities by multiple sequence alignment. Bioinformatics. 14. 1998. 290–294.

[218] S. B. Needleman, C. D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology. 48. 1970. 443–453.

[219] M. F. Neuts. A versatile Markovian point process. Journal of Applied Probability. 16. 1979. 764–779.

[220] M. F. Neuts. Structured Stochastic Matrices of M/G/1 Type and Their Applications. Marcel Dekker. 1989.

[221] R. Nussinov, G. Pieczenk, J. Griggs, D. Kleitman. Algorithms for loop matching. SIAM Journal of Applied Mathematics. 35. 1978. 68–82.

[222] S. Oaks, H. Wong. Java Threads. O'Reilly. 1999.

[223] OpenMP Website. http://www.openmp.org. 2007.

[225] S. Owicki, D. Gries. An axiomatic proof technique for parallel programs I. Acta Informatica. 6(4). 1976. 319–340.

[226] S. Owicki, L. Lamport. Proving liveness properties of concurrent programs. ACM Transactions on Programming Languages and Systems. 4(3). 1982. 455–495.

[227] L. Pachter, B. Sturmfels (Eds.). Algebraic Statistics for Computational Biology. Cambridge University Press. 2005.

[228] R. Page, E. Holmes. Molecular Evolution: a Phylogenetic Approach. Blackwell. 1998.

[229] R. Paige, R. Tarjan. Three partition refinement algorithms. SIAM Journal on Computing. 16(6). 1987. 973–989.

[230] C. Partridge. The end of simple traffic models. IEEE Network. 7(9). 1993.

[231] V. Paxson, S. Floyd. Wide-area traffic: The failure of Poisson modeling. IEEE/ACM Transactions on Networking. 3. 1995. 226–244.

[232] M. Pease, R. Shostak, L. Lamport. Reaching agreement in the presence of faults. Journal of the ACM. 27(2). 1980. 228–234.

[233] J. S. Pedersen, J. Hein. Gene finding with a hidden Markov model of genome structure and evolution. Bioinformatics. 19(2). 2003. 219–227.

[234] G. Peterson, J. Fischer. Economical solutions for the critical section problem in distributed systems, In Proceedings of the 9th ACM Symposium on Theory of Computing. IEEE Computer Society Press. 1977. 91–97.

[235] S. Petrov. Finite axiomatization of languages for representation of system properties. Information Sciences. 47. 1989. 339–372.

[236] P. A. Pevzner, N. Jones. Bioinformatics Algorithms. The MIT Press. 2004.

[237] G. F. Pfister. In Search of Clusters. Prentice Hall. 1998 (2nd edition).

[238] N. Pisanti, M. Sagot. Further thoughts on the syntenic distance between genomes. Algorithmica. 34(2). 2002. 157–180.

[239] N. Polyzotis, M. N. Garofalakis. Statistical synopses for graph-structured XML databases, In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data. 2002. 358–369.

[240] R. Pottinger. MiniCon: A scalable algorithm for answering queries using views. The VLDB Journal. 10(2). 2001. 182–198.

[241] R. Pottinger, A. Halevy. A scalable algorithm for answering queries using views, In Proceedings of Very Large Data Bases'00. 2000. 484–495.

[242] T. Pupko, I. Peer, R. Shamir, D. Graur. A Fast Algorithm for Joint Reconstruction of Ancestral Amino Acid Sequences. Molecular Biology and Evolution. 17. 2000. 890–896.

[243] X. Qian. Query folding, In Proceedings of International Conference on Data Engineering. 1996. 48–55.

[244] P. Quinton. Automatic synthesis of systolic arrays from uniform recurrent equations, In Proceedings of the 11th Annual International Symposium on Computer Architecture. 1984. 208–214.

[245] S. K. Rao. Regular iterative algorithms and their implementations on processor arrays. PhD dissertation, Stanford University. 1985.

[246] R. Ravi, J. D. Kececioglu. Approximation algorithms for multiple sequence alignment under a fixed evolutionary tree. Discrete Applied Mathematics. 88(1-3). 1998. 355–366.

[247] K-J. Räihä, E. Ukkonen. The shortest common supersequence problem over binary alphabet is NP-complete. Theoretical Computer Science. 16. 1981. 187–198.

[248] A. Rényi. Probability Theory. Akadémiai Kiadó/North-Holland Publ. House. 1970.

[249] E. Rivas, S. Eddy. A dynamic programming algorithm for RNA structure prediction including pseudoknots. Journal of Molecular Biology. 285(5). 1999. 2053–2068.

[250] S. Roch. A Short Proof that Phylogenetic Tree Reconstruction by Maximum Likelihood Is Hard. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 3(1). 2006. 92–94.

[251] S. H. Roosta. Parallel Processing and Parallel Algorithms. Springer-Verlag. 1999.

[253] A. Sali, Sr., A. Sali. Generalized Dependencies in Relational Databases. Acta Cybernetica. 13. 1998. 431–438.

[254] D. Sankoff. Minimal mutation trees of sequences. SIAM Journal of Applied Mathematics. 28. 1975. 35–42.

[255] N. Santoro. Design and Analysis of Distributed Algorithms. John Wiley & Sons, New York. 2006.

[256] A. Segall. Distributed network protocols. IEEE Transactions on Information Theory. IT-29(1). 1983. 23–35.

[257] P. H. Sellers. On the theory and computation of evolutionary distances. SIAM Journal of Applied Mathematics. 26. 1974. 787–793.

[258] I. Shindyalov, P. Bourne. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Engineering. 11(9). 1998. 739–747.

[260] D. Sima, T. Fountain, P. Kacsuk. Advanced Computer Architectures: a Design Space Approach. Addison-Wesley Publishing Company. 1998 (2nd edition).

[261] T. F. Smith, M. S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology. 147. 1981. 195–197.

[262] J. L. Spouge. Fast optimal alignment. CABIOS. 7. 1991. 1–7.

[263] J. L. Spouge. Speeding up dynamic programming algorithms for finding optimal lattice paths. SIAM Journal of Applied Mathematics. 49. 1989. 1552–1566.

[264] L. J. Stockmeyer, A. R. Meyer. Word problems requiring exponential time, In Proceedings of the 5th Annual ACM Symposium on Theory of Computing. ACM Press. 1973. 1–9.

[265] A. S. Tanenbaum. Computer Networks. Prentice Hall. 2004.

[266] A. S. Tanenbaum. Modern Operating Systems. Prentice Hall. 2001.

[267] A. S. Tanenbaum, M. van Steen. Distributed Systems. Principles and Paradigms. Prentice Hall. 2002.

[268] A. S. Tanenbaum, A. Woodhull. Operating Systems. Design and Implementation. Prentice Hall. 1997.

[269] M. Taqqu, W. Teverovsky, W. Willinger. Estimators for long-range dependence: an empirical study. Fractals. 3(4). 1995. 785–788.

[270] J. Tarhio, E. Ukkonen. A greedy approximation algorithm for constructing shortest common superstrings. Theoretical Computer Science. 57. 1988. 131–145.

[271] J. Teich, L. Thiele. Control generation in the design of processor arrays. International Journal of VLSI and Signal Processing. 3(2). 1991. 77–92.

[272] G. Tel. Introduction to Distributed Algorithms. Cambridge University Press. 2000 (2nd edition).

[273] B. Thalheim. Dependencies in Relational Databases. B. G. Teubner. 1991.

[274] J. D. Thompson, D. G. Higgins, T. J. Gibson. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific penalties and weight matrix choice. Nucleic Acids Research. 22. 1994. 4673–4680.

[275] I. Tinoco, O. Uhlenbeck, M. Levine. Estimation of secondary structure in ribonucleic acids. Nature. 230. 1971. 362–367.

[276] Top 500 Supercomputer Sites. http://www.top500.org. 2007.

[277] A. Tucker. Handbook of Computer Science. Chapman & Hall/CRC. 2004.

[279] O. G. Tsatalos, M. C. Solomon, Y. Ioannidis. The GMAP: a versatile tool for physical data independence. The VLDB Journal. 5(2). 1996. 101–118.

[280] D. M. Tsou, P. C. Fischer. Decomposition of a relation scheme into Boyce–Codd normal form. SIGACT News. 14(3). 1982. 23–29.

[281] Y. Uemura, A. Hasegawa, Y. Kobayashi, T. Yokomori. Tree adjoining grammars for RNA structure prediction. Theoretical Computer Science. 210. 1999. 277–303.

[282] J. D. Ullman. Information integration using logical views, Lecture Notes in Computer Science, In Proceedings of ICDT'97. Springer-Verlag. 1186. 1997. 19–40.

[283] J. D. Ullman. Principles of Database and Knowledge Base Systems. Vol. 1. Computer Science Press. 1989 (2nd edition).

[284] E. Ukkonen. Algorithms for approximate string matching. Information and Control. 64. 1985. 100–118.

[285] E. Ukkonen. On approximate string matching. Lecture Notes in Computer Science. 158. 1984. 487–495.

[286] L. G. Valiant. A bridging model for parallel computation. Communications of the ACM. 33(8). 1990. 103–111.

[287] V. Vianu. A Web Odyssey: from Codd to XML, In Proceedings of the 20th Symposium on Principles of Database Systems. 2001. 1–5.

[288] L. Wang, T. Jiang. On the complexity of multiple sequence alignment. Journal of Computational Biology. 1. 1994. 337–348.

[289] M. S. Waterman, T. F. Smith, W. A. Beyer. Some biological sequence metrics. Advances in Mathematics. 20. 1976. 367–387.

[290] Web Dictionary of Cybernetics and Systems. http://pcp.vub.ac.be/ASC.html. 2007.

[291] J. L. Weber, E. Myers. Human Whole Genome Shotgun Sequencing. Genome Research. 7. 1997. 401–409.

[292] W. Willinger, V. Paxson. Discussion of "Heavy tail modeling and teletraffic data" by S. I. Resnick. The Annals of Statistics. 25(5). 1997. 1805–1869.

[293] W. Willinger, M. Taqqu, R. Sherman, D. Wilson. Self-similarity through high-variability: statistical analysis of Ethernet LAN traffic at the source level. IEEE/ACM Transactions on Networking. 5. 1997. 71–86.

[294] W. Willinger, D. Wilson, W. Leland, M. Taqqu. On traffic measurements that defy traffic models (and vice versa): self-similar traffic modeling for high-speed networks. Connections. 8(11). 1994. 14–24.

[295] P. Winkler. Dependent percolation and colliding random walks. Random Structures & Algorithms. 16(1). 2000. 58–84.

[296] M. J. Wolfe. High Performance Compilers for Parallel Computing. Addison-Wesley Longman. 1996.

[297] S. Wu, E. W. Myers, U. Manber, W. Miller. An O(NP) Sequence Comparison Algorithm. Information Processing Letters. 35(6). 1990. 317–323.

[298] J. Wu, S. On cost-optimal merge of two intransitive sorted sequences. International Journal of Foundations of Computer Science. 1(14). 2003. 99–106.

[299] J. Yang, K. Karlapalem, Q. Li. Algorithms for materialized view design in data warehousing environment, In Proceedings of Very Large Data Bases'97. 1997. 136–145.

[300] H. Z. Yang, P. A. Larson. Query transformation for PSJ-queries, In Proceedings of Very Large Data Bases'87. 1987. 245–254.

[301] K. Yi, H. He, I. Stanoi, J. Yang. Incremental maintenance of XML structural indexes, In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data. 2004. 491–502.

[302] C. Yu, D. Johnson. On the complexity of finding the set of candidate keys for a given set of functional dependencies, In Information Processing 74. North-Holland. 1974. 580–583.

[303] M. Zaharioudakis, R. Cochrane, G. Lapis, H. Pirahesh, M. Urata. Answering complex SQL queries using automatic summary tables, In Proceedings of SIGMOD'00. 2000. 105–116.

[304] C. Zaniolo. Analysis and Design of Relational Schemata for Database Systems. UCLA-Eng-7669. 1976.

[305] C. Zaniolo. A new normal form for the design of relational database schemata. ACM Transactions on Database Systems. 7. 1982. 489–499.