AnTonCom, Budapest, 2011

This electronic book was prepared in the framework of project Eastern Hungarian Informatics Books Repository no. TÁMOP-4.1.2-08/1/A-2009-0046. This electronic book appeared with the support of the European Union and with the co-financing of the European Social Fund.

Editor: Antal Iványi

Authors of Volume 1: László Lovász (Preface), Antal Iványi (Introduction), Zoltán Kása (Chapter 1), Zoltán Csörnyei (Chapter 2), Ulrich Tamm (Chapter 3), Péter Gács (Chapter 4), Gábor Ivanyos and Lajos Rónyai (Chapter 5), Antal Járai and Attila Kovács (Chapter 6), Jörg Rothe (Chapters 7 and 8), Csanád Imreh (Chapter 9), Ferenc Szidarovszky (Chapter 10), Zoltán Kása (Chapter 11), Aurél Galántai and András Jeney (Chapter 12)

Validators of Volume 1: Zoltán Fülöp (Chapter 1), Pál Dömösi (Chapter 2), Sándor Fridli (Chapter 3), Anna Gál (Chapter 4), Attila Pethő (Chapter 5), Lajos Rónyai (Chapter 6), János Gonda (Chapter 7), Gábor Ivanyos (Chapter 8), Béla Vizvári (Chapter 9), János Mayer (Chapter 10), András Recski (Chapter 11), Tamás Szántai (Chapter 12), Anna Iványi (Bibliography)

Authors of Volume 2: Burkhard Englert, Dariusz Kowalski, Grzegorz Malewicz, and Alexander Shvartsman (Chapter 13), Tibor Gyires (Chapter 14), Claudia Fohry and Antal Iványi (Chapter 15), Eberhard Zehendner (Chapter 16), Ádám Balogh and Antal Iványi (Chapter 17), János Demetrovics and Attila Sali (Chapters 18 and 19), Attila Kiss (Chapter 20), István Miklós (Chapter 21), László Szirmay-Kalos (Chapter 22), Ingo Althöfer and Stefan Schwarz (Chapter 23)

Validators of Volume 2: István Majzik (Chapter 13), János Sztrik (Chapter 14), Dezső Sima (Chapters 15 and 16), László Varga (Chapter 17), Attila Kiss (Chapters 18 and 19), András Benczúr (Chapter 20), István Katsányi (Chapter 21), János Vida (Chapter 22), Tamás Szántai (Chapter 23), Anna Iványi (Bibliography)

©2011 AnTonCom Infokommunikációs Kft. Homepage: http://www.antoncom.hu/
It is a special pleasure for me to recommend to the Readers the book Algorithms of Computer Science, edited with great care by Antal Iványi. Computer algorithms form a very important and fast developing branch of computer science. Design and analysis of large computer networks, large scale scientific computations and simulations, economic planning, data protection and cryptography and many other applications require effective, carefully planned and precisely analyzed algorithms.
Many years ago we wrote a small book with Péter Gács [168] under the title Algorithms. The three volumes of the book Algorithms of Computer Science show how this topic has developed into a complex area that branches off into many exciting directions. It gives me special pleasure that so many excellent representatives of Hungarian computer science have cooperated to create this book. It is obvious to me that this book will be one of the most important reference books for students, researchers and computer users for a long time.
Budapest, July 2010
László Lovász
The first volume of the book Informatikai algoritmusok appeared in 2004 in Hungarian [127], and the second volume of the book appeared in 2005 [128]. The two volumes contained 31 chapters: the 23 chapters of the first and second volumes of the present electronic book, and further chapters on clustering, frequent elements in databases, geoinformatics, inner-point methods, number theory, Petri nets, queuing theory, and scheduling.
The Hungarian version of the first volume contains those chapters which were finished by May 2004, and the second volume contains the chapters finished by April 2005.
The printed English version contains the chapters submitted by April 2007. Volume 1 [129] contains the chapters belonging to the fundamentals of informatics, while the second volume [130] contains the chapters having a closer connection with some applications.
The first and second volumes of the present book represent an extended and corrected electronic version of the printed book written in English. The third volume of the present book contains new chapters.
The chapters of the first volume are divided into three parts. The chapters of Part 1 are connected with automata: Automata and Formal Languages (written by Zoltán Kása, Sapientia Hungarian University of Transylvania), Compilers (Zoltán Csörnyei, Eötvös Loránd University), Compression and Decompression (Ulrich Tamm, Chemnitz University of Technology Commitment), Reliable Computations (Péter Gács, Boston University).
The chapters of Part 2 have algebraic character: here are the chapters Algebra (written by Gábor Ivanyos and Lajos Rónyai, Budapest University of Technology and Economics), Computer Algebra (Antal Járai and Attila Kovács, Eötvös Loránd University), further Cryptology and Complexity Theory (Jörg Rothe, Heinrich Heine University).
The chapters of the third part have numeric character: Competitive Analysis (Csanád Imreh, University of Szeged), Game Theory (Ferenc Szidarovszky, The University of Arizona) and Scientific Computations (Aurél Galántai, Óbuda University and András Jeney, University of Miskolc).
The second volume is also divided into three parts. The chapters of Part 4 are connected with computer networks: Distributed Algorithms (Burkhard Englert, California State University; Dariusz Kowalski, University of Liverpool; Grzegorz Malewicz, University of Alabama; Alexander Allister Shvartsman, University of Connecticut), Parallel Algorithms (Claudia Fohry, University of Kassel and Antal Iványi, Eötvös Loránd University), Network Simulation (Tibor Gyires, Illinois State University) and Systolic Systems (Eberhard Zehendner, Friedrich Schiller University).
The chapters of Part 5 are Relational Databases and Query in Relational Databases (János Demetrovics, Eötvös Loránd University and Attila Sali, Alfréd Rényi Institute of Mathematics), Semistructured Data Bases (Attila Kiss, Eötvös Loránd University) and Memory Management (Ádám Balogh, Antal Iványi, Eötvös Loránd University).
The chapters of the third part of the second volume have close connections with biology: Bioinformatics (István Miklós, Rényi Institute of Mathematics), Human-Computer Interactions (Ingo Althöfer, Stefan Schwarz, Friedrich Schiller University), and Computer Graphics (László Szirmay-Kalos, Budapest University of Technology and Economics).
The chapters were validated by Gábor Ivanyos, István Majzik, Lajos Rónyai, András Recski, and Tamás Szántai (Budapest University of Technology and Economics), András Benczúr, Sándor Fridli, János Gonda, István Katsányi, Attila Kiss, László Varga, János Vida, and Béla Vizvári (Eötvös Loránd University), Dezső Sima (Óbuda University), Pál Dömösi, János Sztrik, and Attila Pethő (University of Debrecen), Zoltán Fülöp (University of Szeged), Anna Gál (University of Texas), János Mayer (University of Zürich).
The first and second volumes contain verbal description, pseudocode and analysis of over 200 algorithms, and over 350 figures and 120 examples illustrating how the algorithms work. Each section ends with exercises and each chapter ends with problems. In the two volumes you can find over 330 exercises and 70 problems.
We have supplied an extensive bibliography in the section Chapter Notes of each chapter. In the bibliography the names of the authors, journals and publishers are usually active links to the corresponding web sites (the live elements are underlined both in the printed version and on the screen).
The LaTeX style file was written by Viktor Belényesi, Zoltán Csörnyei, László Domoszlai and Antal Iványi. The figures were drawn or corrected by Kornél Locher. Anna Iványi transformed the bibliography into hypertext. The DOCBOOK version was made by Marton 2001 Kft.
Using the data of the colophon page you can contact any of the creators of the book. We welcome ideas for new exercises and problems, and also critical remarks or bug reports.
The publication of the printed book was supported by the Department of Mathematics of the Hungarian Academy of Sciences, and the electronic version received support from the European Union and from the European Social Fund.
Budapest, May 26, 2011
Antal Iványi (tony@compalg.inf.elte.hu)
Automata and formal languages play an important role in designing and implementing compilers. In the first section grammars and formal languages are defined. The different grammars and languages are discussed based on the Chomsky hierarchy. In the second section we deal in detail with finite automata and the languages accepted by them, while in the third section pushdown automata and the corresponding accepted languages are discussed. Finally, references from a rich bibliography are given.
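A finite and nonempty set of symbols is called an alphabet. The elements of an alphabet are letters, but sometimes they are also called symbols.
Words are composed from the letters of an alphabet. If $a_1, a_2, \ldots, a_n \in \Sigma$, $n \ge 0$, then $a_1a_2\ldots a_n$ is a word over the alphabet $\Sigma$ (the letters $a_i$ are not necessarily distinct). The number of letters of a word, with their multiplicities, constitutes the length of the word. If $w = a_1a_2\ldots a_n$ then the length of $w$ is $|w| = n$. If $n = 0$ then the word is an empty word, which will be denoted by $\varepsilon$ (sometimes $\lambda$ in other books). The set of words over the alphabet $\Sigma$ will be denoted by $\Sigma^*$:
$$\Sigma^* = \{a_1a_2\ldots a_n \mid a_1, a_2, \ldots, a_n \in \Sigma,\ n \ge 0\}.$$
For the set of nonempty words over $\Sigma$ the notation $\Sigma^+ = \Sigma^* \setminus \{\varepsilon\}$ will be used. The set of words of length $n$ over $\Sigma$ will be denoted by $\Sigma^n$, and $\Sigma^0 = \{\varepsilon\}$. Then
$$\Sigma^* = \Sigma^0 \cup \Sigma^1 \cup \cdots \cup \Sigma^n \cup \cdots \quad\text{and}\quad \Sigma^+ = \Sigma^1 \cup \Sigma^2 \cup \cdots \cup \Sigma^n \cup \cdots.$$
The words $u = a_1a_2\ldots a_m$ and $v = b_1b_2\ldots b_n$ are equal (i.e. $u = v$), if $m = n$ and $a_i = b_i$ for $i = 1, 2, \ldots, n$.
We define on $\Sigma^*$ the binary operation called concatenation. The concatenation (or product) of the words $u = a_1a_2\ldots a_m$ and $v = b_1b_2\ldots b_n$ is the word $uv = a_1a_2\ldots a_mb_1b_2\ldots b_n$. It is clear that $|uv| = |u| + |v|$. This operation is associative but not commutative. Its neutral element is $\varepsilon$ because $\varepsilon u = u\varepsilon = u$ for all $u \in \Sigma^*$. $\Sigma^*$ with the concatenation is a monoid.
We introduce the power operation. If $u \in \Sigma^*$ then $u^0 = \varepsilon$ and $u^n = u^{n-1}u$ for $n \ge 1$. The reversal (or mirror image) of the word $u = a_1a_2\ldots a_n$ is $u^{-1} = a_na_{n-1}\ldots a_1$. The reversal of $u$ is sometimes denoted by $u^R$ or $\tilde{u}$. It is clear that $(u^{-1})^{-1} = u$ and $(uv)^{-1} = v^{-1}u^{-1}$.
Word $v$ is a prefix of the word $u$ if there exists a word $z$ such that $u = vz$. If $z \ne \varepsilon$ then $v$ is a proper prefix of $u$. Similarly $v$ is a suffix of $u$ if there exists a word $x$ such that $u = xv$. The proper suffix can be defined analogously. Word $v$ is a subword of the word $u$ if there are words $p$ and $q$ such that $u = pvq$. If $pq \ne \varepsilon$ then $v$ is a proper subword.
A subset $L$ of $\Sigma^*$ is called a language over the alphabet $\Sigma$. Sometimes this is called a formal language because the words are considered here without any meaning. Note that $\emptyset$ is the empty language while $\{\varepsilon\}$ is a language which contains the empty word.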
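If $L$, $L_1$, $L_2$ are languages over $\Sigma$ we define the following operations:
union: $L_1 \cup L_2 = \{u \in \Sigma^* \mid u \in L_1 \text{ or } u \in L_2\}$,
intersection: $L_1 \cap L_2 = \{u \in \Sigma^* \mid u \in L_1 \text{ and } u \in L_2\}$,
difference: $L_1 \setminus L_2 = \{u \in \Sigma^* \mid u \in L_1 \text{ and } u \notin L_2\}$,
complement: $\overline{L} = \Sigma^* \setminus L$,
multiplication: $L_1L_2 = \{uv \mid u \in L_1,\ v \in L_2\}$,
power: $L^0 = \{\varepsilon\}$, $L^n = L^{n-1}L$ if $n \ge 1$,
iteration or star operation: $L^* = \bigcup_{i=0}^{\infty} L^i = L^0 \cup L \cup L^2 \cup \cdots \cup L^i \cup \cdots$,
mirror: $L^{-1} = \{u^{-1} \mid u \in L\}$.
We will also use the notation $L^+ = \bigcup_{i=1}^{\infty} L^i = L \cup L^2 \cup \cdots \cup L^i \cup \cdots$.
The union, product and iteration are called regular operations.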
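The operations on languages can be tried out on small finite languages. The following is a minimal Python sketch, assuming that words are represented as strings, languages as finite sets of strings, and the empty word as the empty string; the function names are chosen for illustration only. Since $L^*$ is infinite for any language containing a nonempty word, the sketch only computes the union of the powers up to a given bound.

```python
from itertools import product


def concat(l1, l2):
    """Product L1 L2 = { uv : u in L1, v in L2 }."""
    return {u + v for u, v in product(l1, l2)}


def power(lang, n):
    """L^0 = {empty word}, L^n = L^(n-1) L for n >= 1."""
    result = {""}
    for _ in range(n):
        result = concat(result, lang)
    return result


def star_up_to(lang, max_power):
    """Finite part of L*: the union L^0, L^1, ..., L^max_power."""
    result = set()
    for i in range(max_power + 1):
        result |= power(lang, i)
    return result


def mirror(lang):
    """Mirror of a language: the reversals of all its words."""
    return {u[::-1] for u in lang}


L1 = {"a", "ab"}
L2 = {"b", ""}
print(sorted(concat(L1, L2)))
print(sorted(star_up_to({"a"}, 3)))
```

Union, intersection and difference need no helper functions here, because Python sets already provide them as `|`, `&` and `-`.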
Languages can be specified in several ways. For example a language can be specified using
1) the enumeration of its words,
2) a property such that all words of the language have this property and no other word has it,
3) a grammar.
For example the following are languages
Even if we cannot enumerate all the elements of an infinite set, an infinite language can be specified by enumeration if, after enumerating its first few elements, we can continue the enumeration using a rule. The following is such a language
The following sets are languages
where denotes the number of letters in word and the number of letters .
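We now define the generative grammar, or grammar for short.
Definition 1.1 A grammar is an ordered quadruple $G = (N, T, P, S)$, where
$N$ is the alphabet of variables (or nonterminal symbols),
$T$ is the alphabet of terminal symbols, where $N \cap T = \emptyset$,
$P \subseteq (N \cup T)^*N(N \cup T)^* \times (N \cup T)^*$ is a finite set, that is $P$ is the finite set of productions of the form $(u, v)$, where $u, v \in (N \cup T)^*$ and $u$ contains at least one nonterminal symbol,
$S \in N$ is the start symbol.
Remarks. Instead of the notation $(u, v)$ sometimes $u \to v$ is used.
In the production $u \to v$ or $(u, v)$ word $u$ is called the left-hand side of the production while $v$ is the right-hand side. If a grammar has more than one production with the same left-hand side, then these productions $u \to v_1,\ u \to v_2,\ \ldots,\ u \to v_r$ can be written as $u \to v_1 \mid v_2 \mid \ldots \mid v_r$.
We define on the set $(N \cup T)^*$ the relation called direct derivation:
$$u \Rightarrow v, \text{ if } u = p_1pp_2,\ v = p_1qp_2 \text{ and } (p, q) \in P.$$
In fact we replace in $u$ an occurrence of the subword $p$ by $q$ and we get $v$. Other notations for the same relation are $\vdash$ or $\models$.
If we want to emphasize the grammar $G$ used, then the notation $\Rightarrow$ can be replaced by $\Rightarrow_G$. Relation $\Rightarrow^*$ is the reflexive and transitive closure of $\Rightarrow$, while $\Rightarrow^+$ denotes its transitive closure. Relation $\Rightarrow^*$ is called a derivation.
From the definition of a reflexive and transitive relation we can deduce the following: $u \Rightarrow^* v$, if there exist words $w_0, w_1, \ldots, w_n \in (N \cup T)^*$, $n \ge 0$, such that $u = w_0$, $w_0 \Rightarrow w_1$, $w_1 \Rightarrow w_2$, $\ldots$, $w_{n-1} \Rightarrow w_n$, $w_n = v$. This can be written shortly as $u = w_0 \Rightarrow w_1 \Rightarrow w_2 \Rightarrow \cdots \Rightarrow w_{n-1} \Rightarrow w_n = v$. If $n = 0$ then $u = v$. In the same way we can define the relation $u \Rightarrow^+ v$, except that $n \ge 1$ always, so at least one direct derivation will be used.
Definition 1.2 The language generated by grammar $G = (N, T, P, S)$ is the set
$$L(G) = \{u \in T^* \mid S \Rightarrow_G^* u\}.$$
So $L(G)$ contains all words over the alphabet $T$ which can be derived from the start symbol $S$ using the productions from $P$.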
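Direct derivation and the generated language can be explored mechanically for small grammars. The sketch below is an illustration under stated assumptions, not part of the original text: symbols are single characters, uppercase letters are nonterminals and lowercase letters are terminals, and the grammar with productions $S \to aSb$ and $S \to ab$ is a hypothetical example. Sentential forms longer than the bound are pruned, which is only safe when no production shortens a word.

```python
from collections import deque


def direct_derivations(word, productions):
    """All v with word => v: one occurrence of a left-hand side p is replaced by q."""
    results = set()
    for p, q in productions:
        start = word.find(p)
        while start != -1:
            results.add(word[:start] + q + word[start + len(p):])
            start = word.find(p, start + 1)
    return results


def generated_words(productions, start_symbol, max_len):
    """Terminal words of length <= max_len derivable from start_symbol.

    Assumption of this sketch: no production shortens a word, so forms
    longer than max_len can never lead back to a short enough word.
    """
    seen = {start_symbol}
    queue = deque([start_symbol])
    words = set()
    while queue:
        form = queue.popleft()
        if form == "" or form.islower():   # only terminals left
            words.add(form)
            continue
        for nxt in direct_derivations(form, productions):
            if len(nxt) <= max_len and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return words


# Hypothetical grammar: S -> aSb | ab
P = [("S", "aSb"), ("S", "ab")]
print(sorted(generated_words(P, "S", 6), key=len))
```

A breadth-first search is used so that every sentential form is expanded once; the `seen` set plays the role of a visited list and guarantees termination under the length bound.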
It is easy to see that because
where up to the last but one replacement the first production () was used, while at the last replacement the production . This derivation can be written Therefore can be derived from for all and no other words can be derived from .
Definition 1.3 Two grammars and are equivalent, and this is denoted by if .
Example 1.2 The following two grammars are equivalent because both of them generate the language .
, where
,
where
First let us prove by mathematical induction that for If then
The inductive hypothesis is We use production , then times production , and then production , afterwards again times production . Therefore
If now we use production we get for , but by the production , so for any . We have to prove in addition that using the productions of the grammar we can derive only words of the form . It is easy to see that a successful derivation (which ends in a word containing only terminals) can be obtained only in the presented way.
Similarly for
Here the productions ( times), , ( times), , ( times), , ( times) were used in this order. But So , . It is also easy to see that other words cannot be derived using grammar .
The grammars and are not equivalent because .
Theorem 1.4 Not all languages can be generated by grammars.
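Proof. We encode grammars for the proof as words over the alphabet $\{0, 1\}$. For a given grammar $G = (N, T, P, S)$ let $N = \{S_1, S_2, \ldots, S_n\}$, $T = \{a_1, a_2, \ldots, a_m\}$ and $S = S_1$. The encoding is the following:
the code of $S_i$ is $10\underbrace{11\ldots1}_{i \text{ times}}01$, the code of $a_i$ is $100\underbrace{11\ldots1}_{i \text{ times}}001$.
In the code of the grammar the letters are separated by 000, the code of the arrow is 0000, and the productions are separated by 00000.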
It is enough, of course, to encode the productions only. For example, consider the grammar
.
The code of is 10101, the code of is 1001001, the code of is 10011001. The code of the grammar is
It follows from this encoding that the grammars with terminal alphabet $T$ can be enumerated as $G_1, G_2, \ldots, G_k, \ldots$ and the set of these grammars is a denumerable infinite set.
Footnote Let us suppose that in the alphabet $\{0, 1\}$ there is a linear order $<$, let us say $0 < 1$. The words which are codes of grammars can be enumerated by ordering them first by their lengths, and within words of equal length alphabetically, using the order of their letters. But we can equally use the lexicographic order, which means that $u < v$ ($u$ is before $v$) if $u$ is a proper prefix of $v$ or there exist decompositions $u = xay$ and $v = xby'$, where $x$, $y$, $y'$ are subwords, and $a$ and $b$ are letters with $a < b$.
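The encoding used in the proof can be sketched in a few lines of Python. This is an illustration only: the representation (ordered lists of single-character symbols, productions as pairs of strings) and the function names are assumptions of the sketch, and the grammar with productions $S \to aSb$, $S \to ab$ is a hypothetical input chosen to match the symbol codes quoted in the text.

```python
def encode_symbol(sym, nonterminals, terminals):
    """Code of S_i is 10 1^i 01, code of a_i is 100 1^i 001 (indices start at 1)."""
    if sym in nonterminals:
        i = nonterminals.index(sym) + 1
        return "10" + "1" * i + "01"
    i = terminals.index(sym) + 1
    return "100" + "1" * i + "001"


def encode_word(word, nonterminals, terminals):
    """Letters of a word are separated by 000."""
    return "000".join(encode_symbol(s, nonterminals, terminals) for s in word)


def encode_grammar(productions, nonterminals, terminals):
    """Arrow is coded by 0000, productions are separated by 00000."""
    coded = [
        encode_word(lhs, nonterminals, terminals)
        + "0000"
        + encode_word(rhs, nonterminals, terminals)
        for lhs, rhs in productions
    ]
    return "00000".join(coded)


# Hypothetical grammar with N = [S], T = [a, b]
N = ["S"]
T = ["a", "b"]
print(encode_symbol("S", N, T), encode_symbol("a", N, T), encode_symbol("b", N, T))
print(encode_grammar([("S", "aSb"), ("S", "ab")], N, T))
```

Because every grammar is mapped to a finite word over $\{0, 1\}$, the codes can be listed in order of length, which is the enumeration argument the proof relies on.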
Consider now the set of all languages over $T$, denoted by $\mathcal{L}_T = \{L \mid L \subseteq T^*\}$, that is $\mathcal{L}_T = \mathcal{P}(T^*)$. The set $T^*$ is denumerable because its words can be ordered. Let this order be $s_0, s_1, s_2, \ldots$, where $s_0 = \varepsilon$. We associate to each language $L \in \mathcal{L}_T$ an infinite binary sequence $b_0, b_1, b_2, \ldots$ in the following way:
$$b_i = \begin{cases} 1, & \text{if } s_i \in L, \\ 0, & \text{if } s_i \notin L, \end{cases} \qquad i = 0, 1, 2, \ldots$$
It is easy to see that the set of all such binary sequences is not denumerable, because each sequence can be considered as a positive number less than 1 using its binary representation (the binary point is considered to be before the first digit). Conversely, to each positive number less than 1 in binary representation a binary sequence can be associated. So the cardinality of the set of infinite binary sequences is equal to the cardinality of the interval $(0, 1)$, which is of continuum power. Therefore the set $\mathcal{L}_T$ is of continuum cardinality. Now to each grammar with terminal alphabet $T$ associate the corresponding generated language over $T$. Since the set of grammars is denumerable, there exists a language in $\mathcal{L}_T$ without an associated grammar, that is, a language which cannot be generated by a grammar.
Putting some restrictions on the form of productions, four types of grammars can be distinguished.
Definition 1.5 Define for a grammar $G = (N, T, P, S)$ the following four types.
A grammar $G$ is of type 0 (phrase-structure grammar) if there are no restrictions on productions.
A grammar $G$ is of type 1 (context-sensitive grammar) if all of its productions are of the form $\alpha A\gamma \to \alpha\beta\gamma$, where $A \in N$, $\alpha, \gamma \in (N \cup T)^*$, $\beta \in (N \cup T)^+$. A production of the form $S \to \varepsilon$ can also be accepted if the start symbol $S$ does not occur in the right-hand side of any production.
A grammar $G$ is of type 2 (context-free grammar) if all of its productions are of the form $A \to \beta$, where $A \in N$, $\beta \in (N \cup T)^+$. A production of the form $S \to \varepsilon$ can also be accepted if the start symbol $S$ does not occur in the right-hand side of any production.
A grammar $G$ is of type 3 (regular grammar) if its productions are of the form $A \to aB$ or $A \to a$, where $a \in T$ and $A, B \in N$. A production of the form $S \to \varepsilon$ can also be accepted if the start symbol $S$ does not occur in the right-hand side of any production.
If a grammar $G$ is of type $i$ then language $L(G)$ is also of type $i$.
This classification was introduced by Noam Chomsky.
A language $L$ is of type $i$ ($i = 0, 1, 2, 3$) if there exists a grammar $G$ of type $i$ which generates the language $L$, so $L = L(G)$.
Denote by $\mathcal{L}_i$ ($i = 0, 1, 2, 3$) the class of the languages of type $i$. It can be proved that
$$\mathcal{L}_0 \supset \mathcal{L}_1 \supset \mathcal{L}_2 \supset \mathcal{L}_3.$$
By the definition of the different types of languages, the inclusions ($\supseteq$) are evident, but the strict inclusions ($\supset$) must be proved.
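The four production forms can be checked syntactically. The following Python sketch classifies a grammar by the most restrictive type whose conditions all of its productions satisfy. It assumes single-character symbols, productions given as pairs of strings, and the empty string for $\varepsilon$; the function names and the sample grammars in the usage lines are hypothetical.

```python
def production_type(lhs, rhs, nonterminals):
    """Highest type (3, 2, 1 or 0) whose production form lhs -> rhs satisfies."""
    def is_nt(sym):
        return sym in nonterminals

    if len(lhs) == 1 and is_nt(lhs) and len(rhs) >= 1:
        # A -> a  or  A -> aB  is regular, any other A -> beta is context-free.
        if (len(rhs) == 1 and not is_nt(rhs)) or (
            len(rhs) == 2 and not is_nt(rhs[0]) and is_nt(rhs[1])
        ):
            return 3
        return 2
    # Context-sensitive form: alpha A gamma -> alpha beta gamma, beta nonempty.
    for i, sym in enumerate(lhs):
        alpha, gamma = lhs[:i], lhs[i + 1:]
        if (is_nt(sym) and len(rhs) >= len(lhs)
                and rhs.startswith(alpha) and rhs.endswith(gamma)):
            return 1
    return 0


def grammar_type(productions, nonterminals, start):
    """Type of the grammar: the minimum over the types of its productions."""
    start_on_rhs = any(start in rhs for _, rhs in productions)
    types = []
    for lhs, rhs in productions:
        if lhs == start and rhs == "" and not start_on_rhs:
            continue  # S -> epsilon is allowed if S is on no right-hand side
        types.append(production_type(lhs, rhs, nonterminals))
    return min(types, default=3)


print(grammar_type([("S", "aS"), ("S", "a")], {"S"}, "S"))
print(grammar_type([("S", "aSb"), ("S", "ab")], {"S"}, "S"))
```

The check is purely about the form of the productions. A grammar classified as type 0 here may still generate a language of a higher type, since the type of a language depends on the existence of some grammar of that type.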
Example 1.3 We give an example for each type of context-sensitive, context-free and regular grammars.
Context-sensitive grammar. , where
Elements of are:
Language contains words of the form with and .
Context-free grammar. , where
Elements of are:
Language contains algebraic expressions which can be correctly built using letter , operators and and brackets.
Regular grammar. , where
Elements of are:
Language contains words over the alphabet with at least two letters at the beginning.
It is easy to prove that any finite language is regular. The productions are chosen so as to generate all words of the language. For example, if $u = a_1a_2\ldots a_n$ is in the language, then we introduce the productions $S \to a_1A_1$, $A_1 \to a_2A_2$, $\ldots$, $A_{n-2} \to a_{n-1}A_{n-1}$, $A_{n-1} \to a_n$, where $S$ is the start symbol of the grammar and $A_1, \ldots, A_{n-1}$ are distinct nonterminals. We define such productions for all words of the language, using different nonterminals for different words, except for the start symbol $S$. If the empty word is also an element of the language, then the production $S \to \varepsilon$ is also included.
The empty set is also a regular language, because the regular grammar $G = (\{S\}, \{a\}, \{S \to aS\}, S)$ generates it.
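The construction for finite languages is easy to mechanize. In the sketch below (an illustration, with hypothetical names) a production is a pair of a nonterminal and a tuple of symbols, the fresh nonterminals are named `A1`, `A2`, and so on, and `S` is assumed not to clash with any letter of the alphabet.

```python
def regular_grammar_for(words, start="S"):
    """Productions of a regular grammar generating the finite language `words`."""
    productions = []
    counter = 0
    for word in sorted(words):
        if word == "":
            productions.append((start, ()))        # S -> epsilon
            continue
        current = start
        for letter in word[:-1]:
            counter += 1
            fresh = f"A{counter}"                  # new nonterminal per step
            productions.append((current, (letter, fresh)))
            current = fresh
        productions.append((current, (word[-1],))) # last letter ends the word
    return productions


for lhs, rhs in regular_grammar_for({"ab", "c", ""}):
    print(lhs, "->", " ".join(rhs) if rhs else "epsilon")
```

The start symbol never appears on a right-hand side, so adding $S \to \varepsilon$ for the empty word keeps the grammar regular.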
A production of the form $A \to B$ is called a unit production, where $A, B \in N$. Unit productions can be eliminated from a grammar in such a way that the new grammar will be of the same type and equivalent to the first one.
Let $G = (N, T, P, S)$ be a grammar with unit productions. Define an equivalent grammar $G' = (N, T, P', S)$ without unit productions. The following algorithm will construct the equivalent grammar.
Eliminate-Unit-Productions($G$)
1  if the unit productions $A \to B$ and $B \to C$ are in $P$, put also the unit production $A \to C$ in $P$, while $P$ can be extended
2  if the unit production $A \to B$ and the production $B \to \alpha$ ($\alpha \notin N$) are in $P$, put also the production $A \to \alpha$ in $P$
3  let $P'$ be the set of productions of $P$ except unit productions
4  RETURN $G'$
Clearly, $G$ and $G'$ are equivalent. If $G$ is of type $i \in \{0, 1, 2, 3\}$ then $G'$ is also of type $i$.
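The three steps of the algorithm translate directly into set operations. The Python sketch below is one possible rendering, assuming single-character symbols and productions stored as pairs of strings; the sample grammar in the usage lines is hypothetical.

```python
def eliminate_unit_productions(productions, nonterminals):
    """Return the productions of an equivalent grammar without unit productions."""
    P = set(productions)

    def is_unit(prod):
        return prod[0] in nonterminals and prod[1] in nonterminals

    # Step 1: add A -> C whenever A -> B and B -> C are present, until stable.
    changed = True
    while changed:
        changed = False
        units = [p for p in P if is_unit(p)]
        for a, b in units:
            for b2, c in units:
                if b == b2 and (a, c) not in P:
                    P.add((a, c))
                    changed = True

    # Step 2: for A -> B and B -> alpha (alpha not a nonterminal) add A -> alpha.
    for a, b in [p for p in P if is_unit(p)]:
        for lhs, rhs in list(P):
            if lhs == b and rhs not in nonterminals:
                P.add((a, rhs))

    # Step 3: keep everything except the unit productions.
    return {p for p in P if not is_unit(p)}


# Hypothetical grammar: S -> A, A -> B | aA, B -> b
result = eliminate_unit_productions(
    {("S", "A"), ("A", "B"), ("A", "aA"), ("B", "b")}, {"S", "A", "B"}
)
for lhs, rhs in sorted(result):
    print(lhs, "->", rhs)
```

Step 2 needs only one pass because step 1 has already closed the unit productions under transitivity.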
Example 1.4 Use the above algorithm in the case of the grammar , where contains
Using the first step of the algorithm, we get the following new unit productions:
(because of and ),
(because of and ),
(because of and ),
(because of and ),
(because of and ),
(because of and ).
In the second step of the algorithm will be considered only productions with or in the right-hand side, since productions , and can be used (the other productions are all unit productions). We get the following new productions:
(because of and ),
(because of and ),
(because of and ),
(because of and ),
(because of and ).
The new grammar will have the productions:
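A grammar is said to be in normal form if its productions have no terminal symbols in the left-hand side.
We need the following notions. For alphabets $\Sigma_1$ and $\Sigma_2$ a homomorphism is a function $h: \Sigma_1^* \to \Sigma_2^*$ for which $h(u_1u_2) = h(u_1)h(u_2)$ for all $u_1, u_2 \in \Sigma_1^*$. It is easy to see that for an arbitrary $u = a_1a_2\ldots a_n \in \Sigma_1^*$ the value $h(u)$ is uniquely determined by the restriction of $h$ to $\Sigma_1$, because $h(u) = h(a_1)h(a_2)\ldots h(a_n)$.
If a homomorphism $h$ is a bijection then $h$ is an isomorphism.
Theorem 1.6 To any grammar an equivalent grammar in normal form can be associated.
Proof. Grammars of type 2 and 3 have only a nonterminal in the left-hand side of every production, so they are in normal form. The proof has to be done for grammars of type 0 and 1 only.
Let $G = (N, T, P, S)$ be the original grammar and we define the grammar in normal form as $G' = (N', T, P', S)$.
Let $a_1, a_2, \ldots, a_k$ be those terminal symbols which occur in the left-hand side of productions. We introduce the new nonterminals $A_1, A_2, \ldots, A_k$. The following notation will be used: $T_1 = \{a_1, a_2, \ldots, a_k\}$, $T_2 = T \setminus T_1$, $N_1 = \{A_1, A_2, \ldots, A_k\}$ and $N' = N \cup N_1$.
Define the isomorphism $h: N \cup T \to N' \cup T_2$, where
$$h(a_i) = A_i, \text{ if } a_i \in T_1, \qquad h(X) = X, \text{ if } X \in N \cup T_2.$$
Define the set $P'$ of productions as
$$P' = \{h(\alpha) \to h(\beta) \mid (\alpha \to \beta) \in P\} \cup \{A_i \to a_i \mid i = 1, 2, \ldots, k\}.$$
In this case $\alpha \Rightarrow_G^* \beta$ if and only if $h(\alpha) \Rightarrow_{G'}^* h(\beta)$. From this the theorem follows immediately, because $S \Rightarrow_G^* u \iff S = h(S) \Rightarrow_{G'}^* h(u) \Rightarrow_{G'}^* u$.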
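The renaming used in the proof can be sketched as follows. This is an illustrative Python version under stated assumptions: words are tuples of symbol names, the new nonterminal for a terminal `a` is named `<a>` (a naming convention invented for this sketch), and the sample productions are hypothetical.

```python
def to_normal_form(productions, terminals):
    """Replace terminals occurring on left-hand sides by new nonterminals.

    productions: list of (lhs, rhs) pairs, each side a tuple of symbol names.
    Returns the new productions and the set of new nonterminals.
    """
    # T1: terminals that occur in some left-hand side.
    t1 = sorted({s for lhs, _ in productions for s in lhs if s in terminals})
    h = {a: f"<{a}>" for a in t1}          # h(a_i) = A_i, identity elsewhere

    def rename(word):
        return tuple(h.get(s, s) for s in word)

    new_productions = [(rename(lhs), rename(rhs)) for lhs, rhs in productions]
    new_productions += [((h[a],), (a,)) for a in t1]   # A_i -> a_i
    return new_productions, set(h.values())


# Hypothetical productions: S -> a S b, a S -> a a
P = [(("S",), ("a", "S", "b")), (("a", "S"), ("a", "a"))]
new_P, new_N = to_normal_form(P, {"a", "b"})
for lhs, rhs in new_P:
    print(" ".join(lhs), "->", " ".join(rhs))
```

Only terminals that actually occur in a left-hand side are renamed, so grammars that are already in normal form are returned unchanged.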
Example 1.5 Let , where contains
In the left-hand side of productions the terminals occur, therefore consider the new nonterminals , and include in also the new productions , and .
Terminals will be replaced by nonterminals respectively, and we get the set as
Let us see what words can be generated by this grammars. It is easy to see that because
, so
We prove, using the mathematical induction, that for . For this is the case, as we have seen before. Continuing the derivation we get , and this is what we had to prove.
But So . These words can be generated also in .
In this subsection extended grammars of type 1, 2 and 3 will be presented.
Extended grammar of type 1. All productions are of the form $\alpha \to \beta$, where $|\alpha| \le |\beta|$, except possibly the production $S \to \varepsilon$.
Extended grammar of type 2. All productions are of the form $A \to \beta$, where $A \in N$, $\beta \in (N \cup T)^*$.
Extended grammar of type 3. All productions are of the form $A \to uB$ or $A \to u$, where $A, B \in N$, $u \in T^*$.
Theorem 1.7 To any extended grammar an equivalent grammar of the same type can be associated.
Proof. Denote by $G_{\text{ext}}$ the extended grammar and by $G$ the corresponding equivalent grammar of the same type.
Type 1. Define the productions of grammar by rewriting the productions , where of the extended grammar in the form allowed in the case of grammar by the following way.
Let () be a production of , which is not in the required form. Add to the set of productions of the following productions, where are new nonterminals:
Furthermore, add to the set of productions of without any modification the productions of which are of permitted form, i.e. . Inclusion can be proved because each used production of in a derivation can be simulated by productions obtained from it. Furthermore, since the productions of can be used only in the prescribed order, we could not obtain other words, so also is true.
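Type 2. Let $G_{\text{ext}} = (N, T, P_{\text{ext}}, S)$. Productions of the form $A \to \varepsilon$ have to be eliminated; only $S \to \varepsilon$ can remain, if $S$ does not occur in the right-hand side of productions. For this define the following sets:
$$U_0 = \{A \in N \mid (A \to \varepsilon) \in P_{\text{ext}}\}, \qquad U_i = U_{i-1} \cup \{A \in N \mid (A \to w) \in P_{\text{ext}},\ w \in U_{i-1}^+\}.$$
Since for $i \ge 1$ we have $U_{i-1} \subseteq U_i$, $U_i \subseteq N$ and $N$ is a finite set, there must exist a $k$ for which $U_{k-1} = U_k$. Let us denote this set by $U$. It is easy to see that a nonterminal $A$ is in $U$ if and only if $A \Rightarrow^* \varepsilon$. (In addition $\varepsilon \in L(G_{\text{ext}})$ if and only if $S \in U$.)
We define the productions of $G$ starting from the productions of $G_{\text{ext}}$ in the following way. For each production $A \to \alpha$ with $\alpha \ne \varepsilon$ of $G_{\text{ext}}$ add to the set of productions of $G$ this one and all productions which can be obtained from it by eliminating from $\alpha$ one or more nonterminals which are in $U$, but only in the case when the right-hand side does not become $\varepsilon$.
It is not difficult to see that this grammar $G$ generates the same language as $G_{\text{ext}}$ does, except for the empty word $\varepsilon$. So, if $\varepsilon \notin L(G_{\text{ext}})$ then the proof is finished. But if $\varepsilon \in L(G_{\text{ext}})$, then there are two cases. If the start symbol $S$ does not occur in any right-hand side of productions, then by introducing the production $S \to \varepsilon$, grammar $G$ will generate also the empty word. If $S$ occurs in the right-hand side of a production, then we introduce a new start symbol $S'$ and the new productions $S' \to S$ and $S' \to \varepsilon$. Now the empty word can also be generated by grammar $G$.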
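The computation of the set of nonterminals that derive the empty word, and the removal of the corresponding productions, can be sketched in Python. The sketch assumes single-character symbols, productions as pairs of strings with the empty string standing for the empty word, and a hypothetical sample grammar; re-adding the empty word through a start production, as described in the proof, is left out.

```python
from itertools import product


def nullable_nonterminals(productions):
    """Fixed point of the sets U_0, U_1, ...: nonterminals deriving the empty word."""
    U = {lhs for lhs, rhs in productions if rhs == ""}
    while True:
        bigger = U | {
            lhs for lhs, rhs in productions
            if rhs and all(s in U for s in rhs)
        }
        if bigger == U:
            return U
        U = bigger


def eliminate_epsilon(productions):
    """Productions without empty right-hand sides, plus the nullable set U."""
    U = nullable_nonterminals(productions)
    result = set()
    for lhs, rhs in productions:
        # Each nullable symbol may either be kept or be dropped.
        choices = [(s, "") if s in U else (s,) for s in rhs]
        for pick in product(*choices):
            new_rhs = "".join(pick)
            if new_rhs:                      # never produce an empty right side
                result.add((lhs, new_rhs))
    return result, U


# Hypothetical grammar: S -> AB, A -> aA | epsilon, B -> bB | epsilon
P = [("S", "AB"), ("A", "aA"), ("A", ""), ("B", "bB"), ("B", "")]
new_P, U = eliminate_epsilon(P)
print(sorted(U))
print(sorted(new_P))
```

If the start symbol is in the returned set, the original grammar generates the empty word, and one of the two cases from the proof has to be applied afterwards.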
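Type 3. First we use for $G_{\text{ext}}$ the procedure defined for grammars of type 2 to eliminate productions of the form $A \to \varepsilon$. From the obtained grammar we eliminate the unit productions using the algorithm Eliminate-Unit-Productions.
In the obtained grammar for each production $A \to a_1a_2\ldots a_nB$, where $n \ge 2$ and $B \in N \cup \{\varepsilon\}$, add to the productions of $G$ the following ones:
$$A \to a_1A_1,\quad A_1 \to a_2A_2,\quad \ldots,\quad A_{n-1} \to a_nB,$$
where $A_1, A_2, \ldots, A_{n-1}$ are new nonterminals. It is easy to prove that the grammar $G$ built in this way is equivalent to $G_{\text{ext}}$.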
Example 1.6 Let be an extended grammar of type 1, where , and contains the following productions:
The only production which is not context-sensitive is . Using the method given in the proof, we introduce the productions:
Now the grammar is context-sensitive, where the elements of are
It can be proved that .
Example 1.7 Let be an extended grammar of type 2, where contains:
Then , , . The productions of the new grammar are:
The original grammar generates also the empty word and because occurs in the right-hand side of a production, a new start symbol and two new productions will be defined: . The context-free grammar equivalent to the original grammar is with the productions:
Both of these grammars generate language .
Example 1.8 Let be the extended grammar of type 3 under examination, where :
First, we eliminate production . Since , the productions will be
The latter production (which is a unit production) can also be eliminated, by replacing it with . Productions and have to be transformed. Since both productions have the same right-hand side, it is enough to introduce only one new nonterminal and to use the productions and instead of . Production will be replaced by . The new grammar is , where :
Can be proved that .
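We will prove the following theorem, by which the Chomsky classes of languages are closed under the regular operations, that is, the union and product of two languages of type $i$ are also of type $i$, and the iteration of a language of type $i$ is also of type $i$ ($i = 0, 1, 2, 3$).
Theorem 1.8 The class $\mathcal{L}_i$ ($i = 0, 1, 2, 3$) of languages is closed under the regular operations.
Proof. For the proof we will use extended grammars. Consider the extended grammars $G_1 = (N_1, T_1, P_1, S_1)$ and $G_2 = (N_2, T_2, P_2, S_2)$ of type $i$ each. We can suppose that $N_1 \cap N_2 = \emptyset$.
Union. Let $G_{\cup} = (N_1 \cup N_2 \cup \{S\},\ T_1 \cup T_2,\ P_1 \cup P_2 \cup \{S \to S_1, S \to S_2\},\ S)$.
We will show that $L(G_{\cup}) = L(G_1) \cup L(G_2)$. If $i \in \{0, 2, 3\}$ then from the assumption that $G_1$ and $G_2$ are of type $i$ it follows by definition that $G_{\cup}$ is also of type $i$. If $i = 1$ and one of the grammars generates the empty word, then we eliminate from $G_{\cup}$ the corresponding production (possibly both) $S_k \to \varepsilon$ ($k = 1, 2$) and replace it by the production $S \to \varepsilon$.
Product. Let $G_{\times} = (N_1 \cup N_2 \cup \{S\},\ T_1 \cup T_2,\ P_1 \cup P_2 \cup \{S \to S_1S_2\},\ S)$.
We will show that $L(G_{\times}) = L(G_1)L(G_2)$. By definition, if $i \in \{0, 2\}$ then $G_{\times}$ will be of the same type. If $i = 1$ and there is a production $S_1 \to \varepsilon$ in $P_1$ but there is no production $S_2 \to \varepsilon$ in $P_2$, then production $S_1 \to \varepsilon$ will be replaced by $S \to S_2$. We proceed in the same way in the symmetrical case. If there is a production $S_1 \to \varepsilon$ in $P_1$ and a production $S_2 \to \varepsilon$ in $P_2$, then they will be replaced by $S \to S_1 \mid S_2 \mid \varepsilon$.
In the case of regular grammars ($i = 3$), because $S \to S_1S_2$ is not a regular production, we need to use another grammar $G_{\times} = (N_1 \cup N_2,\ T_1 \cup T_2,\ P_1' \cup P_2,\ S_1)$, where the difference between $P_1'$ and $P_1$ is that instead of the productions of the form $A \to u$, $u \in T^*$, of $P_1$, the set $P_1'$ contains productions of the form $A \to uS_2$.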
Iteration. Let .
In the case of grammars of type 2 let . Then also is of type 2.
In the case of grammars of type 3, as in the case of product, we will change the productions, that is , where the difference between and lies in that for each will be replaced by , and the others will be not changed. Then also will be of type 3.
The productions given in the case of type 2 are not valid for , because when applying production we can get the derivations of type , , , where can be a left-hand side of a production. In this case, replacing by its right-hand side in derivation , we can generate a word which is not in the iterated language. To avoid such situations, first let us assume that the language is in normal form, i.e. the left-hand side of productions does not contain terminals (see Section 1.1), second we introduce a new nonterminal , so the set of nonterminals now is , and the productions are the following:
Now we can avoid situations in which the left-hand side of a production can extend over the limits of words in a derivation because of the iteration. The above derivations can be used only by beginning with and getting derivation . Here we can not replace unless the last symbol in is a terminal symbol, and only after using a production of the form .
It is easy to show that for each type.
Exercises
1.1-1 Give a grammar which generates language and determine its type.
1.1-2 Let be an extended context-free grammar, where ,
.
Give an equivalent context-free grammar.
1.1-3 Show that and are regular languages over arbitrary alphabet .
1.1-4 Give a grammar to generate language , where represents the number of 0's in word and the number of 1's.
1.1-5 Give a grammar to generate all natural numbers.
1.1-6 Give a grammar to generate the following languages, respectively:
,
,
,
.
1.1-7 Let be an extended grammar, where , and contains the productions:
Determine the type of this grammar. Give an equivalent, not extended grammar of the same type. What language does it generate?
Finite automata are computing models with an input tape and a finite set of states (Fig. 1.1). Among the states some are called initial and some final. At the beginning the automaton reads the first letter of the input word written on the input tape. Beginning with an initial state, the automaton reads the letters of the input word one after another while changing its state, and if, after reading the last input letter, the current state is a final one, we say that the automaton accepts the given word. The set of words accepted by such an automaton is called the language accepted (recognized) by the automaton.
Definition 1.9 A nondeterministic finite automaton (NFA) is a system $A = (Q, \Sigma, E, I, F)$, where
$Q$ is a finite, nonempty set of states,
$\Sigma$ is the input alphabet,
$E$ is the set of transitions (or of edges), where $E \subseteq Q \times \Sigma \times Q$,
$I \subseteq Q$ is the set of initial states,
$F \subseteq Q$ is the set of final states.
An NFA is in fact a directed, labelled graph, whose vertices are the states, and there is a (directed) edge labelled with $a$ from vertex $p$ to vertex $q$ if $(p, a, q) \in E$. Among the vertices some are initial and some are final states. Initial states are marked by a small arrow entering the corresponding vertex, while the final states are marked with double circles. If two vertices are joined by two edges with the same direction then these can be replaced by a single edge labelled with two letters. This graph can be called a transition graph.
,
The automaton can be seen in Fig. 1.2.
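In the case of an edge $(p, a, q)$ vertex $p$ is the start-vertex, $q$ the end-vertex and $a$ the label. Now define the notion of the walk as in the case of graphs. A sequence
$$(q_0, a_1, q_1), (q_1, a_2, q_2), \ldots, (q_{n-2}, a_{n-1}, q_{n-1}), (q_{n-1}, a_n, q_n)$$
of edges of an NFA is a walk with the label $a_1a_2\ldots a_n$. If $n = 0$ then $q_0 = q_n$ and $a_1a_2\ldots a_n = \varepsilon$. Such a walk is called an empty walk. For a walk the notation
$$q_0 \xrightarrow{a_1} q_1 \xrightarrow{a_2} q_2 \xrightarrow{a_3} \cdots \xrightarrow{a_{n-1}} q_{n-1} \xrightarrow{a_n} q_n$$
will be used, or if $w = a_1a_2\ldots a_n$ then we write shortly $q_0 \xrightarrow{w} q_n$. Here $q_0$ is the start-vertex and $q_n$ the end-vertex of the walk. The states in a walk are not necessarily distinct. A walk is productive if its start-vertex is an initial state and its end-vertex is a final state. We say that an NFA accepts or recognizes a word if this word is the label of a productive walk. The empty word $\varepsilon$ is accepted by an NFA if there is an empty productive walk, i.e. there is an initial state which is also a final state.
The set of words accepted by an NFA will be called the language accepted by this NFA. The language accepted or recognized by NFA $A$ is
$$L(A) = \{w \in \Sigma^* \mid \exists p \in I,\ \exists q \in F,\ \exists\, p \xrightarrow{w} q\}.$$
The NFA $A_1$ and $A_2$ are equivalent if $L(A_1) = L(A_2)$.
Sometimes the following transition function is useful:
$$\delta: Q \times \Sigma \to \mathcal{P}(Q), \qquad \delta(p, a) = \{q \in Q \mid (p, a, q) \in E\}.$$
This function associates to a state $p$ and an input letter $a$ the set of states to which the automaton can go if its current state is $p$ and the head is on input letter $a$.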
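Acceptance by an NFA can be decided by following all walks at once, keeping the set of states reachable after each letter. The Python sketch below illustrates this with the transition function; the automaton in the usage lines is hypothetical and accepts the words over {0, 1} that end in 01.

```python
def delta(edges, state, letter):
    """Transition function: states reachable from `state` by reading `letter`."""
    return {q for (p, a, q) in edges if p == state and a == letter}


def accepts(edges, initial, final, word):
    """True if `word` is the label of a walk from an initial to a final state."""
    current = set(initial)
    for letter in word:
        current = {q for p in current for q in delta(edges, p, letter)}
    return bool(current & set(final))


# Hypothetical NFA: q0 loops on 0 and 1, q0 -0-> q1, q1 -1-> q2
E = {("q0", "0", "q0"), ("q0", "1", "q0"), ("q0", "0", "q1"), ("q1", "1", "q2")}
I = {"q0"}
F = {"q2"}
print(accepts(E, I, F, "1101"))
print(accepts(E, I, F, "10"))
```

The empty word is handled without a special case: no letter is read, so the word is accepted exactly when some initial state is also final.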
Denote by the cardinality (the number of elements) of .
Footnote. The same notation is used for the cardinality of a set and the length of a word, but this causes no confusion because we use lowercase letters for words and capital letters for sets. The only exception is , but this cannot be confused with a word.
An NFA is a deterministic finite automaton (DFA) if
In Fig. 1.2 a DFA can be seen.
Condition can be replaced by
If for a DFA for each state and for each letter then it is called a complete DFA.
Every DFA can be transformed into a complete DFA by introducing a new state, which can be called a snare state. Let be a DFA. An equivalent and complete DFA will be , where is the new state and . It is easy to see that . Using the transition function we can easily define the transition table. The rows of this table are indexed by the elements of , its columns by the elements of . At the intersection of row and column we put . In the case of Fig. 1.2, the transition table is:
The NFAs in Fig. 1.3 are not deterministic: the first (automaton A) has two initial states, the second (automaton B) has two transitions with from state (to states and ). The transition tables of these two automata are in Fig. 1.4. is the set of words over which do not begin with two zeroes (of course is in the language), is the set of words which contain as a subword.
Let be a finite automaton. A state is accessible if it is on a walk which starts in an initial state. The following algorithm determines the inaccessible states by building a sequence , , of sets, where is the set of initial states, and for any is the set of accessible states which are at distance at most from an initial state.
Inaccessible-States(A)
1 2 3REPEAT
4 5FOR
all 6DO
FOR
all 7DO
8UNTIL
9 10RETURN
The inaccessible states of the automaton can be eliminated without changing the accepted language.
If $|Q| = n$ and $|\Sigma| = m$, then the running time of the algorithm (the number of steps) in the worst case is $O(n^2 m)$, because the number of steps in the two nested loops is at most $nm$ and the REPEAT loop is executed at most $n$ times.
Set has the property that if and only if . The above algorithm can be extended by inserting the condition to decide whether language is empty or not.
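The level-by-level computation of the accessible states can be sketched in Python as follows. This is an illustration under stated assumptions: the transition function is stored in a dictionary `delta` mapping a pair (state, letter) to a set of states, and the function name and the sample automaton are hypothetical.

```python
def accessible_states(alphabet, delta, initial):
    """Return the set of states lying on a walk that starts in an initial state.

    delta maps (state, letter) to a set of states; a missing key means
    that there is no such transition.
    """
    current = set(initial)                  # states at distance 0
    while True:
        following = set(current)
        for q in current:
            for a in alphabet:
                following |= delta.get((q, a), set())
        if following == current:            # nothing new was found: the sequence is stable
            return current
        current = following


# Hypothetical automaton in which q2 and q3 cannot be reached from q0.
STATES = {"q0", "q1", "q2", "q3"}
DELTA = {("q0", "a"): {"q1"}, ("q1", "b"): {"q0"}, ("q2", "a"): {"q3"}}
reachable = accessible_states("ab", DELTA, {"q0"})
print(sorted(STATES - reachable))           # ['q2', 'q3']
print(bool(reachable & {"q3"}))             # False: with q3 as the only final state the language is empty
```

The last line shows the emptiness test mentioned above: the accepted language is nonempty exactly when some accessible state is final.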
Let be a finite automaton. A state is productive if it is on a walk which ends in a final state. To find the productive states, the following algorithm uses the function :
This function for a state and a letter gives the set of all states from which using this letter the automaton can go into the state .
Nonproductive-States(A)
1 2 3REPEAT
4 5FOR
all 6DO
FOR
all 7DO
8UNTIL
9 10RETURN
The nonproductive states of the automaton can be eliminated without changing the accepted language.
If $n$ is the number of states and $m$ the number of letters in the alphabet, then the running time of the algorithm is also $O(n^2 m)$, as in the case of the algorithm INACCESSIBLE-STATES.
The set given by the algorithm has the property that if and only if . So, with a small modification it can be used to decide whether language is empty or not.
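A backward search from the final states gives the productive states. The sketch below is an assumption-laden illustration: it uses a worklist instead of the level-by-level sets of the pseudocode, it stores the inverse transitions per state only (the letters are not needed once all of them are scanned), and all names are hypothetical.

```python
def productive_states(edges, final):
    """Return the states lying on a walk that ends in a final state.

    edges -- set of triples (p, a, q): an edge from p to q labelled a
    """
    inverse = {}                              # q -> set of states with an edge into q
    for (p, a, q) in edges:
        inverse.setdefault(q, set()).add(p)
    productive = set(final)
    pending = list(final)                     # states whose predecessors are still to be examined
    while pending:
        q = pending.pop()
        for p in inverse.get(q, ()):
            if p not in productive:
                productive.add(p)
                pending.append(p)
    return productive


# Hypothetical automaton: q3 only loops on itself and never reaches the final state q2.
EDGES = {("q0", "a", "q1"), ("q1", "b", "q2"), ("q0", "b", "q3"), ("q3", "a", "q3")}
print(sorted(productive_states(EDGES, {"q2"})))   # ['q0', 'q1', 'q2']
```

The nonproductive states are the complement of the returned set, and the language is nonempty exactly when some initial state is productive.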
Next we show that any NFA can be transformed into an equivalent DFA.
Theorem 1.10 For any NFA one may construct an equivalent DFA.
Proof. Let be an NFA. Define a DFA , where
,
edges of are those triplets for which are not empty, and ,
,
.
We prove that .
a) First prove that . Let . Then there exists a walk
Using the transition function of NFA we construct the sets , . Then and since we get , so . Thus, there exists a walk
There are sets for which , and for we have , and
is a productive walk. Therefore . That is .
b) Now we show that . Let . Then there is a walk
Using the definition of we have , i.e. there exists , that is by the definitions of and there is such that . Similarly, there are the states such that , where , thus, there is a walk
so .
In constructing DFA we can use the corresponding transition function :
The empty set was excluded from the states, so we used here instead of .
Example 1.10 Apply Theorem 1.10 to transform NFA in Fig. 1.3. Introduce the following notation for the states of the DFA:
where is the initial state. Using the transition function we get the transition table:
This automaton contains many inaccessible states. By algorithm INACCESSIBLE-STATES
we determine the accessible states of DFA:
.
Initial state is also a final state. States and are final states. States are inaccessible and can be removed from the DFA. The transition table of the resulting DFA is
The corresponding transition graph is in Fig. 1.5.
The algorithm given in Theorem 1.10 can be simplified. It is not necessary to consider all subsets of the set of states of the NFA. The states of the DFA can be obtained successively. Begin with the state and determine the states for all . For the newly obtained states we determine the states accessible from them. This can be continued until no new states arise.
In our previous example is the initial state. From this we get
The transition table is
which is the same (except for the notation) as before.
The next algorithm constructs for an NFA the transition table of the equivalent DFA , but without determining the final states (which can easily be included). The value of ISIN() in the algorithm is true if state is already in and is false otherwise. Let be an ordered list of the letters of .
NFA-DFA(A)
1 2 3 counts the rows. 4 counts the states. 5REPEAT
6FOR
counts the columns. 7DO
8IF
9THEN
IF
ISIN
() 10THEN
11ELSE
12 13 14 15ELSE
16 17UNTIL
18RETURN
transition table of
Since the REPEAT loop is executed as many times as the number of states of the new automaton, the running time can be exponential in the worst case: if the number of states of the NFA is $n$, then the DFA can have as many as $2^n - 1$ states. (The number of subsets of a set of $n$ elements is $2^n$, including the empty set.)
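The successive construction of the DFA states can be sketched in Python. This is a minimal illustration, not the table-driven procedure above: it keeps the subsets in a queue instead of counting rows and columns, it also computes the final states, and the names `nfa_to_dfa` and `DELTA` as well as the sample NFA are assumptions of the sketch.

```python
from collections import deque


def nfa_to_dfa(alphabet, delta, initial, final):
    """Build only those DFA states (subsets) that are accessible.

    delta maps (state, letter) to a set of NFA states.
    Returns (dfa_states, dfa_table, dfa_start, dfa_final).
    """
    start = frozenset(initial)
    table = {}
    seen = {start}
    queue = deque([start])
    while queue:
        subset = queue.popleft()
        for a in alphabet:
            target = frozenset(q for p in subset for q in delta.get((p, a), ()))
            if not target:
                continue                      # the empty set is not used as a DFA state
            table[(subset, a)] = target
            if target not in seen:            # a new DFA state has appeared
                seen.add(target)
                queue.append(target)
    dfa_final = {s for s in seen if s & set(final)}
    return seen, table, start, dfa_final


# Hypothetical NFA for the words over {0, 1} that contain 01 as a subword.
DELTA = {
    ("q0", "0"): {"q0", "q1"}, ("q0", "1"): {"q0"},
    ("q1", "1"): {"q2"},
    ("q2", "0"): {"q2"}, ("q2", "1"): {"q2"},
}
states, table, start, dfa_final = nfa_to_dfa("01", DELTA, {"q0"}, {"q2"})
print(len(states), len(dfa_final))            # 4 2
```

For this NFA only 4 of the 7 nonempty subsets are ever produced, which is the practical benefit of generating the states successively.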
Theorem 1.10 states that for any NFA one may construct an equivalent DFA. Conversely, any DFA is also an NFA by definition. So nondeterministic finite automata accept the same class of languages as deterministic finite automata.
In this subsection we will use complete deterministic finite automata only. In this case has a single element. In formulae, sometimes, instead of set we will use its single element. We introduce for a set the function which gives us the single element of set , so . Using walks which begin with the initial state and have the same label in two DFA's we can determine the equivalence of these DFA's. If only one of these walks ends in a final state, then they cannot be equivalent.
Consider two DFA's over the same alphabet and . We want to determine whether they are equivalent or not. We construct a table with elements of the form , where and . Beginning with the second column of the table, we associate a column with each letter of the alphabet . If the first element of the th row is then at the intersection of the th row and the column associated with letter will be the pair .
In the first column of the first row we put and complete the first row using the above method. If in any column of the first row there occurs a pair of states of which one is a final state and the other is not, then the algorithm ends: the two automata are not equivalent. If there is no such pair of states, every new pair is written in the first column. The algorithm continues with the next unfilled row. If no new pair of states occurs in the table and for each pair both states are final or both are not, then the algorithm ends and the two DFA's are equivalent.
If , and then, taking into account that in the worst case the REPEAT loop is executed times and the FOR loop times, the running time of the algorithm in the worst case will be , or if then .
Our algorithm was described to determine the equivalence of two complete DFA's. If we have to determine the equivalence of two NFA's, first we transform them into complete DFA's and after this we can apply the above algorithm.
DFA-Equivalence( )
 1  write in the first column of the first row the pair
 2  $i \leftarrow 0$
 3  REPEAT
 4    $i \leftarrow i + 1$
 5    let be the pair in the first column of the $i$th row
 6    FOR all
 7      DO write in the column associated to in the $i$th row the pair
 8         IF one state in is final and the other not
 9           THEN RETURN NO
10           ELSE write pair in the next empty row of the first column, if it has not already occurred in the first column
11  UNTIL the first element of the $(i+1)$th row becomes empty
12  RETURN YES
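The table method can be sketched in Python as follows. The sketch is an illustration under assumptions: each complete DFA is given by a dictionary mapping (state, letter) to a single state, the list `rows` plays the role of the first column of the table, and the function name and the two sample automata are hypothetical. Unlike the pseudocode, the sketch tests a pair when its row is processed, so the pair of initial states is tested as well.

```python
def dfa_equivalent(alphabet, delta1, start1, final1, delta2, start2, final2):
    """Decide the equivalence of two complete DFA's over the same alphabet."""
    rows = [(start1, start2)]                 # first column of the table
    seen = {(start1, start2)}
    i = 0
    while i < len(rows):                      # stop when the next row is empty
        p, q = rows[i]
        if (p in final1) != (q in final2):    # one state is final and the other is not
            return False
        for a in alphabet:
            pair = (delta1[(p, a)], delta2[(q, a)])
            if pair not in seen:              # a new pair goes to the next empty row
                seen.add(pair)
                rows.append(pair)
        i += 1
    return True


# Hypothetical example: parity of the number of a's, with two and with four states.
ALPHABET = "ab"
PARITY = {("e", "a"): "o", ("o", "a"): "e", ("e", "b"): "e", ("o", "b"): "o"}
MOD4 = {(i, "a"): (i + 1) % 4 for i in range(4)}
MOD4.update({(i, "b"): i for i in range(4)})
print(dfa_equivalent(ALPHABET, PARITY, "e", {"e"}, MOD4, 0, {0, 2}))   # True
print(dfa_equivalent(ALPHABET, PARITY, "e", {"e"}, MOD4, 0, {0}))      # False
```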
Example 1.11 Determine if the two DFA's in Fig. 1.6 are equivalent or not. The algorithm gives the table
The two DFA's are equivalent because all possible pairs of states are considered and in every pair both states are final or both are not final.
Example 1.12 The table of the two DFA's in Fig. 1.7 is:
These two DFA's are not equivalent, because in the last column of the second row in the pair the first state is final and the second not.
We have seen that NFA's accept the same class of languages as DFA's. The following theorem states that this class is that of regular languages.
Theorem 1.11 If is a language accepted by a DFA, then one may construct a regular grammar which generates language .
Proof. Let be the DFA accepting language , that is . Define the regular grammar with the productions:
If for and , then put production in .
If and , then put also production in .
Prove that .
Let and . Thus, since accepts word , there is a walk
Then there are in the productions
(in the right-hand side of the last production does not occur, because ), so there is the derivation
Therefore, .
Conversely, let and . Then there exists a derivation
in which productions
were used, which by definition means that in DFA there is a walk
and since is a final state, .
If the DFA accepts also the empty word , then in the above grammar we introduce a new start symbol instead of , consider the new production and for each production introduce also .
Example 1.13 Let be a DFA, where . The corresponding transition table is
The transition graph of is in Fig. 1.8. By Theorem 1.11 we define regular grammar with the productions in
One may prove that .
The method described in the proof of Theorem 1.11 can easily be given as an algorithm. The productions of the regular grammar obtained from the DFA can be determined by the following algorithm.
Regular-Grammar-from-DFA(A)
1 2FOR
all 3DO
FOR
all 4DO
FOR
all 5DO
IF
6THEN
7IF
8THEN
9IF
10THEN
11RETURN
It is easy to see that the running time of the algorithm is $O(n^2 m)$, if the number of states is $n$ and the number of letters in the alphabet is $m$. In lines 2–4 we can consider only one loop, if we iterate over the transitions directly. Then the worst case running time is proportional to the number of transitions of the DFA. This is also $O(n^2 m)$, since all transitions are possible. This algorithm is:
Regular-Grammar-from-DFA'(A)
1 2FOR
all 3DO
4IF
5THEN
6IF
7THEN
8RETURN
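The one-loop variant can be sketched in Python. This is an illustration with assumed conventions: a production is stored as a pair (left side, right side), the right side being a tuple of symbols, and the states themselves serve as nonterminals. The adjustment of the start symbol needed when the DFA accepts the empty word is left out of the sketch.

```python
def regular_grammar_from_dfa(edges, final):
    """Return the productions obtained from the transitions of a DFA.

    edges -- set of triples (p, a, q): a transition from p to q on letter a
    """
    productions = set()
    for (p, a, q) in edges:
        productions.add((p, (a, q)))          # production p -> a q
        if q in final:
            productions.add((p, (a,)))        # production p -> a
    return productions


# Hypothetical DFA with two states and the final state q1.
EDGES = {("q0", "a", "q1"), ("q1", "b", "q0")}
for lhs, rhs in sorted(regular_grammar_from_dfa(EDGES, {"q1"})):
    print(lhs, "->", " ".join(rhs))
```

Each transition is visited once, so the number of steps is proportional to the number of transitions.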
Theorem 1.12 If is a regular language, then one may construct an NFA that accepts language .
Proof. Let be the grammar which generates language . Define NFA :
, where (i.e. is a new symbol),
For every production , define transition in .
For every production , define transition in .
Prove that .
Let , . Then there is in a derivation of word :
This derivation is based on productions
Then, by the definition of the transitions of NFA there exists a walk
Thus, . If , there is production , but in this case the initial state is also a final one, so . Therefore, .
Let now . Then there exists a walk
If is the empty word, then instead of we have in the above formula , which also is a final state. In other cases only can be as last symbol. Thus, in there exist the productions
and there is the derivation
thus, and therefore .
Example 1.14 Let be a regular grammar. The NFA associated is , where . The corresponding transition table is
The transition graph is in Fig. 1.9. This NFA can be simplified, states and can be contracted in one final state.
Using the above theorem we define an algorithm which associates an NFA with a regular grammar .
NFA-from-Regular-Grammar(A)
1 2 3FOR
all 4DO
FOR
all 5DO
IF
6THEN
7FOR
all 8DO
IF
9THEN
10IF
11THEN
12ELSE
13RETURN
A
As in the case of algorithm REGULAR-GRAMMAR-FROM-DFA, the running time is $O(n^2 m)$, where $n$ is the number of nonterminals and $m$ the number of terminals. The loops in lines 3, 4 and 7 can be replaced by a single loop over the productions. The running time in this case is better: it is proportional to the number of productions. This algorithm is:
NFA-from-Regular-Grammar'(A)
1 2 3FOR
all 4DO
IF
5THEN
6IF
7THEN
8IF
9THEN
10ELSE
11RETURN
A
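The loop over the productions can be sketched in Python. The conventions are assumptions of this illustration: a production is a pair (left side, right side) whose right side is a tuple of symbols, the empty tuple stands for the empty word, and the name `Z` of the new final state is hypothetical.

```python
def nfa_from_regular_grammar(productions, start, new_final="Z"):
    """Return (edges, initial, final) of an NFA for a regular grammar."""
    edges = set()
    final = {new_final}
    for lhs, rhs in productions:
        if len(rhs) == 2:                     # production A -> a B
            letter, target = rhs
            edges.add((lhs, letter, target))
        elif len(rhs) == 1:                   # production A -> a
            edges.add((lhs, rhs[0], new_final))
        elif lhs == start:                    # production S -> empty word
            final.add(start)
    return edges, {start}, final


# Hypothetical grammar: S -> aS | bA | b, A -> bA | b.
PRODUCTIONS = {
    ("S", ("a", "S")), ("S", ("b", "A")), ("S", ("b",)),
    ("A", ("b", "A")), ("A", ("b",)),
}
edges, initial, final = nfa_from_regular_grammar(PRODUCTIONS, "S")
print(sorted(edges))
```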
From Theorems 1.10, 1.11 and 1.12 it follows that the class of regular languages coincides with the class of languages accepted by NFA's and also with the class of languages accepted by DFA's. The result of these three theorems is illustrated in Fig. 1.10 and can also be summarised in the following theorem.
Figure 1.10. Relations between regular grammars and finite automata. To any regular grammar one may construct an NFA which accepts the language generated by that grammar. Any NFA can be transformed in an equivalent DFA. To any DFA one may construct a regular grammar which generates the language accepted by that DFA.
Theorem 1.13 The following three classes of languages are the same:
the class of regular languages,
the class of languages accepted by DFA's,
the class of languages accepted by NFA's.
It is known (see Theorem 1.8) that the set of regular languages is closed under the regular operations, that is if are regular languages, then languages , and are also regular. The following statements also hold for regular languages.
The complement of a regular language is also regular. This is easy to prove using automata. Let be a regular language and let be a DFA which accepts language . It is easy to see that the DFA accepts language . So, is also regular.
The intersection of two regular languages is also regular. Since , the intersection is also regular.
The difference of two regular languages is also regular. Since , the difference is also regular.
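The argument for the complement can be checked on a small example. The sketch below assumes a complete DFA stored as a dictionary from (state, letter) to a state; the automaton and the names are hypothetical. Exchanging the final and nonfinal states reverses the answer for every word, which is why completeness of the DFA matters.

```python
def accepts(delta, start, final, word):
    """Run a complete DFA on `word` and report whether it ends in a final state."""
    state = start
    for letter in word:
        state = delta[(state, letter)]
    return state in final


# Hypothetical complete DFA: words over {a, b} with an even number of a's.
STATES = {"even", "odd"}
DELTA = {
    ("even", "a"): "odd", ("odd", "a"): "even",
    ("even", "b"): "even", ("odd", "b"): "odd",
}
FINAL = {"even"}
CO_FINAL = STATES - FINAL                     # final states of the complement automaton

print(accepts(DELTA, "even", FINAL, "aba"))     # True
print(accepts(DELTA, "even", CO_FINAL, "aba"))  # False
```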
A finite automaton with -moves (FA with -moves) extends the NFA in such a way that it may have transitions on the empty input , i.e. it may change its state without reading any input symbol. In the case of an FA with -moves for the set of transitions it is true that .
The transition function of a FA with -moves is:
The FA with -moves in Fig. 1.11 accepts words of form , where and .
Theorem 1.14 To any FA with -moves one may construct an equivalent NFA (without -moves).
Let be an FA with -moves and we construct an equivalent NFA . The following algorithm determines sets and . For a state denote by the set of states (including even ) in which one may go from using -moves only. This may be extended also to sets
Clearly, for all and both and may be computed. Suppose in the sequel that these are given.
The following algorithm determines the transitions using the transition function , which is defined in line 5.
If and , then lines 2–6 show that the running time in the worst case is .
Eliminate-Epsilon-Moves(A)
1 2FOR
all 3DO
FOR
all 4DO
5 6 7RETURN
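The elimination can be sketched in Python. This is an illustration under assumptions: the empty string is used as the label of a move made without reading a letter, the transition function is a dictionary from (state, label) to a set of states, and the function names and the sample automaton are hypothetical. In the sketch an initial state becomes final when a final state can be reached from it without reading any letter.

```python
EPS = ""   # label chosen in this sketch for a move that reads no letter


def closure(delta, q):
    """States reachable from q (including q) by moves labelled EPS only."""
    result, pending = {q}, [q]
    while pending:
        p = pending.pop()
        for r in delta.get((p, EPS), ()):
            if r not in result:
                result.add(r)
                pending.append(r)
    return result


def eliminate_epsilon_moves(states, alphabet, delta, initial, final):
    """Return the transition function and final states of an equivalent NFA."""
    new_delta = {}
    for q in states:
        for a in alphabet:
            step = set()                       # states reached by reading a after EPS moves
            for p in closure(delta, q):
                step |= delta.get((p, a), set())
            target = set()                     # close the result under EPS moves again
            for p in step:
                target |= closure(delta, p)
            if target:
                new_delta[(q, a)] = target
    new_final = set(final) | {q for q in initial if closure(delta, q) & set(final)}
    return new_delta, new_final


# Hypothetical automaton accepting some 0's, then some 1's, then some 2's.
DELTA = {
    ("q0", "0"): {"q0"}, ("q0", EPS): {"q1"},
    ("q1", "1"): {"q1"}, ("q1", EPS): {"q2"},
    ("q2", "2"): {"q2"},
}
new_delta, new_final = eliminate_epsilon_moves(
    {"q0", "q1", "q2"}, "012", DELTA, {"q0"}, {"q2"})
print(sorted(new_delta[("q0", "1")]), sorted(new_final))   # ['q1', 'q2'] ['q0', 'q2']
```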
Example 1.15 Consider the FA with -moves in Fig. 1.11. The corresponding transition table is:
Apply algorithm ELIMINATE-EPSILON-MOVES
.
, ,
, and its intersection with is not empty, thus .
,
.
,
,
,
,
.
The transition table of NFA is:
and the transition graph is in Fig. 1.12.
Define regular operations on NFA: union, product and iteration. The result will be an FA with -moves.
The operations will also be illustrated by diagrams. An NFA is given as in Fig. 1.13(a). Initial states are represented by a circle with an arrow, final states by a double circle.
Figure 1.13. (a) Representation of an NFA. Initial states are represented by a circle with an arrow, final states by a double circle. (b) Union of two NFA's.
Let and be NFAs. The result of any operation is an FA with -moves . We always suppose that . If not, we can rename the elements of either set of states.
Union. , where
,
,
,
,
.
For the result of the union see Fig. 1.13(b). The result is the same if instead of a single initial state we choose as set of initial states the union . In this case the result automaton will be without -moves. By the definition it is easy to see that .
Product. , where
,
,
,
,
.
For the result automaton see Fig. 1.14(a). Here also .
Iteration. , where
,
,
,
,
.
The iteration of an FA can be seen in Fig. 1.14(b). For this operation it is also true that .
The definition of these three operations proves again that regular languages are closed under the regular operations.
A DFA is called a minimum state automaton if for any equivalent complete DFA it is true that . We give an algorithm which builds for any complete DFA an equivalent minimum state automaton.
States and of a DFA are equivalent if for an arbitrary word we reach from both either final or nonfinal states, that is
if for any word
If two states are not equivalent, then they are distinguishable. In the following algorithm the distinguishable states will be marked by a star, and equivalent states will be merged. The algorithm associates a list of pairs of states with some pairs of states, in expectation of a later marking by a star: if we mark a pair of states by a star, then all pairs on the associated list will also be marked by a star. The algorithm is given for DFA's without inaccessible states. The DFA used is complete, so contains exactly one element; the function defined in Subsection 1.2.2, which gives the unique element of the set, will also be used here.
Automaton-Minimization( )
1  mark with a star all pairs of states for which and or and
2  associate an empty list with each unmarked pair
3  FOR all unmarked pairs of states and for all symbols examine the pairs of states
     IF any of these pairs is marked,
       THEN mark also pair with all the elements on the list previously associated with pair
       ELSE IF all the above pairs are unmarked
         THEN put pair on each list associated with pairs , unless
4  merge all unmarked (equivalent) pairs
After finishing the algorithm, if a cell of the table does not contain a star, then the states corresponding to its row and column index are equivalent and may be merged. Merging states is continued as long as it is possible. We can say that the equivalence relation decomposes the set of states into equivalence classes, and the states in such a class may all be merged.
Remark. The above algorithm can also be used in the case of a DFA which is not complete, that is, when there are states for which some transition does not exist. Then a pair may occur, and if is a final state, consider this pair marked.
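The marking of distinguishable pairs can be sketched in Python. Note that the sketch swaps in a simpler technique: instead of keeping lists associated with pairs, it repeats the marking pass until nothing changes, which gives the same marked pairs at the price of more passes. The DFA is assumed to be complete and stored as a dictionary from (state, letter) to a state; the names and the sample automaton are hypothetical.

```python
from itertools import combinations


def equivalent_state_classes(states, alphabet, delta, final):
    """Return the classes of equivalent states of a complete DFA."""
    states = sorted(states)
    # Pairs distinguished by the empty word: exactly one of the two states is final.
    marked = {frozenset((p, q)) for p, q in combinations(states, 2)
              if (p in final) != (q in final)}
    changed = True
    while changed:                             # repeat the pass until no new star appears
        changed = False
        for p, q in combinations(states, 2):
            pair = frozenset((p, q))
            if pair in marked:
                continue
            for a in alphabet:
                successors = frozenset((delta[(p, a)], delta[(q, a)]))
                if len(successors) == 2 and successors in marked:
                    marked.add(pair)
                    changed = True
                    break
    classes = []                               # merge the unmarked (equivalent) pairs
    for s in states:
        for cls in classes:
            if frozenset((s, cls[0])) not in marked:
                cls.append(s)
                break
        else:
            classes.append([s])
    return classes


# Hypothetical DFA counting the a's modulo 4, accepting even counts.
MOD4 = {(i, "a"): (i + 1) % 4 for i in range(4)}
MOD4.update({(i, "b"): i for i in range(4)})
print(equivalent_state_classes(range(4), "ab", MOD4, {0, 2}))   # [[0, 2], [1, 3]]
```

Each class becomes one state of the minimum state automaton; in the example the four states collapse into two.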
Example 1.16 Let be the DFA in Fig. 1.15. We will use a table for marking pairs with a star. Marking pair means putting a star in the cell corresponding to row and column (or row and column ).
First we mark pairs , , , and (because is the single final state). Then consider all unmarked pairs and examine them as the algorithm requires. Let us begin with pair . Associate with it pairs , that is , . Because pair is already marked, mark also pair .
In the case of pair the new pairs are and . With pair associate pair on a list, that is
Continuing with , one obtains pairs and , with which nothing is associated by the algorithm.
Continue with pair . The associated pairs are and . None of them are marked, so associate with them on a list pair , that is
Now continuing with we get the pairs and , and because this latter is marked we mark pair and also pair associated to it on a list. Continuing we will get the table in Fig. 1.15, that is we get that and . After merging them we get an equivalent minimum state automaton (see Fig. 1.16).
The following theorem, called the pumping lemma for historical reasons, may be used efficiently to prove that a language is not regular. It gives a necessary (but not sufficient) condition for a language to be regular.
Theorem 1.15 (pumping lemma) For any regular language there exists a natural number (depending only on ), such that any word of with length at least may be written as such that
(1) ,
(2) ,
(3) for all .
Proof. If is a regular language, then there is a DFA which accepts (by Theorems 1.12 and 1.10). Let be this DFA, so . Let be the number of its states, that is . Let and . Then, because the automaton accepts word , there are states and a walk
Because the number of states is and , by the pigeonhole principle states can not all be distinct (see Fig. 1.17), there are at least two of them which are equal.
Footnote. Pigeonhole principle: If we have to put more than objects into boxes, then at least one box will contain at least two objects.
Let , where and is the least such index. Then . Decompose word as:
.
This decomposition immediately yields and . We will prove that for any .
Because , there exists a walk
and because of , this may be written also as
From this walk can be omitted or can be inserted many times. So, there are the following walks:
Therefore for all , and this proves the theorem.
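The decomposition used in the proof can be computed by following the walk until a state repeats. The sketch below is an illustration under assumptions: the DFA is complete and stored as a dictionary from (state, letter) to a state, and the names and the sample automaton are hypothetical.

```python
def accepts(delta, start, final, word):
    """Run a complete DFA on `word` and report whether it ends in a final state."""
    state = start
    for letter in word:
        state = delta[(state, letter)]
    return state in final


def pumping_decomposition(delta, start, word):
    """Split `word` into three parts at the first repeated state of its walk.

    Returns None if no state repeats, which can only happen when the word
    is shorter than the number of states.
    """
    first_seen = {start: 0}                    # state -> position where it first occurs
    state = start
    for pos, letter in enumerate(word, start=1):
        state = delta[(state, letter)]
        if state in first_seen:                # the walk closes a cycle here
            j = first_seen[state]
            return word[:j], word[j:pos], word[pos:]
        first_seen[state] = pos
    return None


# Hypothetical DFA: words over {a, b} with an even number of a's.
DELTA = {
    ("even", "a"): "odd", ("odd", "a"): "even",
    ("even", "b"): "even", ("odd", "b"): "odd",
}
u, v, w = pumping_decomposition(DELTA, "even", "abab")
print(u, v, w)                                                          # a b ab
print(all(accepts(DELTA, "even", {"even"}, u + v * i + w) for i in range(5)))   # True
```

The middle part labels a closed walk, so it can be omitted or repeated without changing the state in which the automaton ends.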
Example 1.17 We use the pumping lemma to show that is not regular. Assume that is regular, and let be the corresponding natural number given by the pumping lemma. Because the length of the word is , this word can be written as in the lemma. We prove that this leads to a contradiction. Let be the decomposition as in the lemma. Then , so and can contain no other letters than , and because we must have , word contains at least one . Then for will contain a different number of 's and 's, therefore for any . This contradicts the third assertion of the lemma, so the assumption that is regular is false. Therefore .
Because the context-free grammar generates language , we have . From these two facts it follows that .
Example 1.18 We show that is not regular. ( is the number of 0's in , while the number of 1's).
We proceed as in the previous example using here word , where is the natural number associated by lemma to language .
Example 1.19 We prove, using the pumping lemma, that is not a regular language. Let , where is again the natural number associated with by the pumping lemma. From we have that contains no other letters than , but it contains at least one. By the lemma we have , which is not possible. Therefore is not regular.
The pumping lemma has several interesting consequences.
Corollary 1.16 Regular language is not empty if and only if there exists a word , , where is the natural number associated to by the pumping lemma.
Proof. The assertion is obvious in one direction: if there exists a word shorter than in , then . Conversely, let and let be the shortest word in . We show that . If , then we apply the pumping lemma, and give the decomposition , and . This is a contradiction, because and is the shortest word in . Therefore .
Corollary 1.17 There exists an algorithm that decides whether a regular language is empty or not.
Proof. Assume that , where is a DFA. By Corollary 1.16 and Theorem 1.15 language is not empty if and only if it contains a word shorter than , where is the number of states of automaton . So it is enough to decide whether there is a word shorter than which is accepted by automaton . Because the number of words shorter than is finite, the problem can be decided.
When we gave an algorithm for finding the inaccessible states of a DFA, we remarked that the procedure can also be used to decide whether the language accepted by that automaton is empty or not. Because finite automata accept regular languages, we already have two procedures to decide whether a regular language is empty. Moreover, we have a third procedure, if we take into account that the algorithm for finding productive states can also be used to decide whether a regular language is empty.
Corollary 1.18 A regular language is infinite if and only if there exists a word such that , where is the natural number associated to language , given by the pumping lemma.
Proof. If is infinite, then it contains words longer than , and let be the shortest word longer than in . Because is regular we can use the pumping lemma, so , where , thus is also true. By the lemma . But because and the shortest word in longer than is , we get . From we get also .
Conversely, if there exists a word such that , then using the pumping lemma, we obtain that , and for any , therefore is infinite.
Now, the question is: how can we apply the pumping lemma to a finite regular language, since by pumping words we get an infinite number of words? The number of states of a DFA accepting language is greater than the length of the longest word in . So, in there is no word with length at least , where is the natural number associated with by the pumping lemma. Therefore, no word in can be decomposed in the form , where , , , and this is why we cannot obtain an infinite number of words in .
In this subsection we introduce for any alphabet the notion of regular expressions over and the corresponding representing languages. A regular expression is a formula, and the corresponding language is a language over . For example, if , then , , are regular expressions over which represent respectively languages , , . The exact definition is the following.
Definition 1.19 Define recursively a regular expression over and the language it represents.
is a regular expression representing the empty language.
is a regular expression representing language .
If , then is a regular expression representing language .
If , are regular expressions representing languages and respectively, then , , are regular expressions representing languages , and respectively.
Regular expressions over can be obtained only by using the above rules a finite number of times.
Some brackets can be omitted in the regular expressions if taking into account the priority of operations (iteration, product, union) the corresponding languages are not affected. For example instead of we can consider .
Two regular expressions are equivalent if they represent the same language, that is if , where and are the languages represented by regular expressions and respectively. Figure 1.18 shows some equivalent expressions.
We show that with any finite language one can associate a regular expression which represents language . If , then . If , then , where for any expression is a regular expression representing language . This latter can be done by the following rule. If , then , else if , where depends on , then , where the brackets are omitted. We now prove the theorem of Kleene, which describes the relationship between regular languages and regular expressions.
Theorem 1.20 (Kleene's theorem) Language is regular if and only if there exists a regular expression over representing language .
Proof. First we prove that if is a regular expression, then language which represents is also regular. The proof will be done by induction on the construction of expression.
If , , , then , , respectively. Since is finite in all three cases, it is also regular.
If , then , where and are the languages which represent the regular expressions and respectively. By the induction hypothesis languages and are regular, so is also regular because regular languages are closed under union. Cases and can be proved in a similar way.
Conversely, we prove that if is a regular language, then a regular expression can be associated with it, which represents exactly the language . If is regular, then there exists a DFA for which . Let be the states of the automaton . Define languages for all and . is the set of words for which automaton goes from state to state without using any state with index greater than . Using the transition graph we can say: a word is in , if from state we arrive at state following the edges of the graph, and concatenating the corresponding labels on edges we get exactly that word, not using any state . Sets can also be defined formally:
, if ,
,
for all .
We can prove by induction that sets can be described by regular expressions. Indeed, if , then for all and languages are finite, so they can be expressed by regular expressions representing exactly these languages. Moreover, if for all and language can be expressed by a regular expression, then language can also be expressed by a regular expression, which can be constructed correspondingly from the regular expressions representing languages , , and respectively, using the above formula for .
Finally, if is the set of final states of the DFA , then can be expressed by a regular expression obtained from expressions representing languages using operation .
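The inductive construction in the proof can be sketched in Python. The following is only an illustration under several assumptions: the states are numbered from 1 to n, the DFA is a dictionary from (state, letter) to a state, the expressions are written in the syntax of Python's `re` module (with `None` standing for the empty language and the empty string for the empty word), and all function names are hypothetical. The resulting expression is correct but far from the shortest one.

```python
import re
from itertools import product


def union(r, s):
    if r is None:
        return s
    if s is None:
        return r
    return r if r == s else f"(?:{r}|{s})"


def concat(r, s):
    return None if r is None or s is None else r + s


def star(r):
    return "" if r in (None, "") else f"(?:{r})*"


def dfa_to_regex(n, delta, start, final):
    """Build an expression for the language of a DFA with states 1..n."""
    # Level 0: one letter labels of direct edges, plus the empty word on the diagonal.
    R = [[None] * (n + 1) for _ in range(n + 1)]
    for (p, a), q in delta.items():
        R[p][q] = union(R[p][q], re.escape(a))
    for i in range(1, n + 1):
        R[i][i] = union(R[i][i], "")
    # Level k: additionally allow state k as an intermediate state.
    for k in range(1, n + 1):
        R = [[union(R[i][j], concat(concat(R[i][k], star(R[k][k])), R[k][j]))
              if i and j else None
              for j in range(n + 1)] for i in range(n + 1)]
    result = None
    for f in final:
        result = union(result, R[start][f])
    return result


# Hypothetical DFA: state 1 = even number of a's (initial and final), state 2 = odd.
DELTA = {(1, "a"): 2, (1, "b"): 1, (2, "a"): 1, (2, "b"): 2}
regex = dfa_to_regex(2, DELTA, 1, {1})
words = ("".join(letters) for length in range(5) for letters in product("ab", repeat=length))
print(all((re.fullmatch(regex, w) is not None) == (w.count("a") % 2 == 0) for w in words))
```

The last line compares the expression with the defining property of the language on all words of length at most 4.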
Next we give some procedures which associate a DFA with a regular expression and, conversely, a regular expression with a DFA.
We present here three methods, each of which associates with a DFA the corresponding regular expression.
Method 1. Using the result of the theorem of Kleene, we will construct the sets , and write a regular expression which represents the language , where is the set of final states of the automaton.
Figure 1.20. DFA in Example 1.21 to which a regular expression is associated by Method 1. The computations are in Figure 1.21.
Example 1.20 Consider the DFA in Fig. 1.19.
Then the regular expression corresponding to is .
Example 1.21 Find a regular expression associated to DFA in Fig. 1.20. The computations are in Figure 1.21. The regular expression corresponding to is .
Method 2. Now we generalize the notion of finite automaton, considering words instead of letters as labels of edges. In such an automaton each walk determines a regular expression, which determines a regular language. The regular language accepted by a generalized finite automaton is the union of the regular languages determined by the productive walks. It is easy to see that generalized finite automata accept regular languages.
The advantage of generalized finite automata is that the number of edges can be reduced by equivalent transformations, which do not change the accepted language, and which lead to a graph with only one edge whose label is exactly the accepted language. The possible equivalent transformations can be seen in Fig. 1.22. If some of the vertices 1, 2, 4, 5 in the figure coincide, they are merged in the result, and a loop will arise.
First, the automaton is transformed by corresponding -moves to have only one initial and one final state. Then, applying the equivalent transformations until the graph will have only one edge, we will obtain as the label of this edge the regular expression associated to the automaton.
Figure 1.22. Possible equivalent transformations for finding regular expression associated to an automaton.
Example 1.22 In the case of Fig. 1.19 the result is obtained by steps illustrated in Fig. 1.23. This result is , which represents the same language as obtained by Method 1 (See example 1.20).
Example 1.23 In the case of Fig. 1.20 it is not necessary to introduce a new initial and final state. The steps of the transformations can be seen in Fig. 1.23. The resulting regular expression can also be written as , which is the same as that obtained by the previous method.
Method 3. The third method for writing regular expressions associated with finite automata uses formal equations. A variable is associated with each state of the automaton (different variables with different states). Associate with each state an equation whose left side contains and whose right side contains a sum of terms of the form or , where is a variable associated with a state, and is the corresponding input symbol. If there is no incoming edge in the state corresponding to , then the right side of the equation with left side contains ; otherwise it is the sum of all terms of the form for which there is a transition labelled with letter from the state corresponding to to the state corresponding to . If the state corresponding to is both an initial and a final state, then the right side of the equation with the left side will also contain a term equal to . For example, in the case of Fig. 1.20 let be the variables corresponding to the states . The corresponding equations are
.
If an equation is of the form , where are arbitrary words not containing , then it is easy to see by a simple substitution that is a solution of the equation.
Because these equations are linear, all of them can be written in the form or , where do not contain any variable. Substituting this in the other equations, the number of remaining equations will be reduced by one. In this way the system of equations can be solved for each variable.
The solution will be given by variables corresponding to final states summing the corresponding regular expressions.
In our example from the first equation we get . From here , or , and solving this we get . Variable can be obtained immediately and we obtain .
Using this method in the case of Fig. 1.19, the following equations will be obtained
Therefore
.
Adding the two equations we will obtain
, from where (considering as and as ) we get the result
.
From here the value of after the substitution is
,
which is equivalent to the expression obtained using the other methods.
Associate to the regular expression a generalized finite automaton:
After this, use the transformations in Fig. 1.25 step by step, until an automaton with labels equal to letters from or is obtained.
Example 1.24 Start from the regular expression . The steps of the transformations are shown in Fig. 1.26(a)-(e). The last finite automaton (see Fig. 1.26(e)) can be given in a simpler form, as can be seen in Fig. 1.26(f). After eliminating the ε-moves and transforming it into a deterministic finite automaton, the DFA in Fig. 1.27 is obtained, which is equivalent to the DFA in Fig. 1.19.
Figure 1.25. Possible transformations to obtain finite automaton associated to a regular expression.
Exercises
1.2-1 Give a DFA which accepts natural numbers divisible by 9.
1.2-2 Give a DFA which accepts the language containing all words formed by
a. an even number of 0's and an even number of 1's,
b. an even number of 0's and an odd number of 1's,
c. an odd number of 0's and an even number of 1's,
d. an odd number of 0's and an odd number of 1's.
1.2-3 Give a DFA to accept respectively the following languages:
.
1.2-4 Give an NFA which accepts words containing at least two 0's and any number of 1's. Give an equivalent DFA.
1.2-5 Minimize the DFAs in Fig. 1.28.
1.2-6 Show that the DFA in Fig. 1.29(a) is a minimum state automaton.
1.2-7 Transform the NFA in Fig. 1.29(b) into a DFA, and then minimize it.
1.2-8 Define finite automaton which accepts all words of the form (), and finite automaton which accepts all words of the form (). Define the union automaton , and then eliminate the -moves.
1.2-9 Associate to DFA in Fig. 1.30 a regular expression.
1.2-10 Associate to regular expression a DFA.
1.2-11 Prove, using the pumping lemma, that none of the following languages are regular:
.
1.2-12 Prove that if is a regular language, then is also regular.
1.2-13 Prove that if is a regular language, then the following languages are also regular.
.
1.2-14 Show that the following languages are all regular.
,
,
.
In this section we deal with pushdown automata and with the class of languages accepted by them, the context-free languages.
As we have seen in Section 1.1, a context-free grammar is one with productions of the form , , . The production is also permitted if does not appear in the right-hand side of any production. Language is the context-free language generated by grammar .
We have seen that finite automata accept the class of regular languages. Now we get to know a new kind of automata, the so-called pushdown automata, which accept context-free languages. Pushdown automata differ from finite automata mainly in that they can change states without reading any input symbol (i.e. by reading the empty symbol) and that they possess a stack memory, which uses the so-called stack symbols (see Fig. 1.31).
The pushdown automaton gets a word as input and starts to work from an initial state, having a special symbol, the initial stack symbol, in the stack. While working, the pushdown automaton changes its state based on the current state, the next input symbol (or the empty word) and the top symbol of the stack, and replaces the top symbol of the stack with a (possibly empty) word.
There are two types of acceptance. The pushdown automaton accepts a word by final state when, after reading it, the automaton enters a final state. The pushdown automaton accepts a word by empty stack when, after reading it, the automaton empties its stack. We show that these two types of acceptance are equivalent.
Definition 1.21 A nondeterministic pushdown automaton is a system
where
is the finite, non-empty set of states
is the input alphabet,
is the stack alphabet,
is the set of transitions or edges,
is the initial state,
is the start symbol of stack,
is the set of final states.
A transition means that if pushdown automaton is in state , reads from the input tape letter (instead of input letter we can also consider the empty word ), and the top symbol in the stack is , then the pushdown automaton enters state and replaces in the stack by word . Writing word in the stack is made by natural order (letters of word will be put in the stack letter by letter from left to right). Instead of writing transition we will use a more suggestive notation .
Here, as in the case of finite automata, we can define a transition function
which associates to the current state, input letter and top letter of the stack pairs of the form , where is the word written in the stack and the new state.
Because the pushdown automaton is nondeterministic, we will have for the transition function
(if the pushdown automaton reads an input letter and moves to right), or
(without move on the input tape).
A pushdown automaton is deterministic, if for any and we have
, and
if , then , .
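The two conditions of determinism can be checked mechanically. The sketch below assumes the usual reading of the definition: every set of the transition function has at most one element, and a possible ε-move excludes every move that reads a letter in the same state with the same top symbol. The dictionary encoding of the transition function, the name `is_deterministic` and the sample tables are choices made only for this illustration.

```python
def is_deterministic(delta, alphabet):
    """delta maps (state, letter or "", top symbol) to a set of
    (pushed word, new state) pairs; "" stands for the empty word."""
    for (state, letter, top), targets in delta.items():
        if len(targets) > 1:                  # at most one move per triple
            return False
        if letter == "" and targets:          # an ε-move excludes reading moves
            if any(delta.get((state, a, top)) for a in alphabet):
                return False
    return True

# Hypothetical transition tables.
DET = {("q0", "0", "Z"): {("ZA", "q1")}, ("q1", "1", "A"): {("", "q1")}}
TWO_CHOICES = {("q", "0", "Z"): {("Z0", "q"), ("", "p")}}
EPS_CLASH = {("q", "", "Z"): {("", "q")}, ("q", "0", "Z"): {("Z", "q")}}
```

Here `is_deterministic(DET, "01")` holds, while the other two tables violate the first and the second condition respectively.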
We can associate to any pushdown automaton a transition table, exactly as in the case of finite automata. The rows of this table are indexed by elements of , the columns by elements from and (to each and will correspond a column). At intersection of row corresponding to state and column corresponding to and we will have pairs if . The transition graph, in which the label of edge will be corresponding to transition , can be also defined.
Example 1.25 . Elements of are:
The transition function:
The transition table:
Because for the transition function every set which is not empty contains only one element (e.g. ), in the above table each cell contains only one element, and the set notation is not used. Generally, if a set has more than one element, then its elements are written one under other. The transition graph of this pushdown automaton is in Fig. 1.32.
The current state, the unread part of the input word and the content of the stack constitute a configuration of the pushdown automaton, i.e. for each , and the triplet can be a configuration. If and , then the pushdown automaton can change its configuration in two ways:
,
if
,
if .
The reflexive and transitive closure of the relation will be denoted by . Instead of using , sometimes is considered.
How does such a pushdown automaton work? Starting from the initial configuration we consider all possible next configurations, then the next configurations of these, and so on, as long as this is possible.
Definition 1.22 Pushdown automaton accepts (recognizes) word by final state if there exists a sequence of configurations of for which the following are true:
the first element of the sequence is ,
there is a going from each element of the sequence to the next element, excepting the case when the sequence has only one element,
the last element of the sequence is , where and .
Therefore pushdown automaton accepts word by final state, if and only if for some and . The set of words accepted by final state by pushdown automaton will be called the language accepted by by final state and will be denoted by .
Definition 1.23 Pushdown automaton accepts (recognizes) word by empty stack if there exists a sequence of configurations of for which the following are true:
the first element of the sequence is ,
there is a going from each element of the sequence to the next element,
the last element of the sequence is and is an arbitrary state.
Therefore pushdown automaton accepts a word by empty stack if for some . The set of words accepted by empty stack by pushdown automaton will be called the language accepted by empty stack by and will be denoted by .
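The two acceptance modes can be made concrete by a small simulation. The following Python sketch is only an illustration: the function name `accepts`, the dictionary encoding of the transition function and the sample automaton are assumptions, not taken from the text. Configurations are triplets (state, unread input, stack) and, following the convention above, the top of the stack is the last letter of the stack word.

```python
from collections import deque

def accepts(delta, start, z0, finals, word, mode="final", limit=10_000):
    """Breadth-first search over configurations (state, unread input, stack).

    delta maps (state, letter or "", top symbol) to a set of pairs
    (pushed word, new state); "" plays the role of the empty word.
    The top of the stack is the LAST character of the stack string.
    """
    first = (start, word, z0)
    seen, queue = {first}, deque([first])
    while queue and len(seen) <= limit:      # limit guards against endless ε-loops
        state, rest, stack = queue.popleft()
        if not rest:
            if mode == "final" and state in finals:
                return True
            if mode == "empty" and not stack:
                return True
        if not stack:                        # no move is possible with an empty stack
            continue
        top, below = stack[-1], stack[:-1]
        moves = []
        if rest:                             # read one input letter
            moves.append((rest[0], rest[1:]))
        moves.append(("", rest))             # ε-move, input unchanged
        for letter, new_rest in moves:
            for pushed, new_state in delta.get((state, letter, top), ()):
                conf = (new_state, new_rest, below + pushed)
                if conf not in seen:
                    seen.add(conf)
                    queue.append(conf)
    return False

# Hypothetical automaton for {0^n 1^n : n >= 0}; "Z" is the initial stack symbol.
DELTA = {
    ("q0", "0", "Z"): {("ZA", "q1")},   # first 0: keep Z, push A
    ("q1", "0", "A"): {("AA", "q1")},   # further 0's: push another A
    ("q1", "1", "A"): {("", "q2")},     # first 1: pop A
    ("q2", "1", "A"): {("", "q2")},     # further 1's: pop A
    ("q2", "", "Z"): {("", "q0")},      # all A's matched: delete Z
}
```

With `finals = {"q0"}` the word 0011 is accepted in both modes, while the empty word is accepted by final state only, since the initial stack symbol is never removed in that case.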
Example 1.26 Pushdown automaton of Example 1.25 accepts the language by final state. Consider the derivation for words and .
Word is accepted by the considered pushdown automaton because
and because is a final state the pushdown automaton accepts this word. But the stack being empty, it accepts this word also by empty stack.
Because the initial state is also a final state, the empty word is accepted by final state, but not by empty stack.
To show that word is not accepted, we need to study all possibilities. It is easy to see that in our case there is only a single possibility:
, but there is no further move, so word is not accepted.
Example 1.27 The transition table of the pushdown automaton is:
The corresponding transition graph can be seen in Fig. 1.33. Pushdown automaton accepts the language . Because is nondeterministic, all the configurations obtained from the initial configuration can be illustrated by a computation tree. For example the computation tree associated to the initial configuration can be seen in Fig. 1.34. From this computation tree we can observe that, because is a leaf of the tree, pushdown automaton accepts word 1001 by empty stack. The computation tree in Fig. 1.35 shows that pushdown automaton does not accept word , because the configurations in the leaves cannot be continued and none of them has the form .
Figure 1.35. Computation tree to show that the pushdown automaton in Example 1.27 does not accept word .
Theorem 1.24 A language is accepted by a nondeterministic pushdown automaton by empty stack if and only if it can be accepted by a nondeterministic pushdown automaton by final state.
Proof. a) Let be the pushdown automaton which accepts language by empty stack. Define pushdown automaton , where and
.
Working of : Pushdown automaton with an ε-move first goes into the initial state of , writing (the initial stack symbol of ) in the stack (beside ). After this it works as . If for a given word empties its stack, then still has in the stack, which can be deleted by using an ε-move, while a final state is reached. can reach a final state only if has emptied the stack.
b) Let be a pushdown automaton, which accepts language by final state. Define pushdown automaton , where , and
.
Working of : Pushdown automaton with an ε-move writes in the stack beside the initial stack symbol of , then works as , i.e. reaches a final state for each accepted word. After this empties the stack by an ε-move. can empty the stack only if goes into a final state.
The next two theorems prove that the class of languages accepted by nondeterministic pushdown automata is just the set of context-free languages.
Theorem 1.25 If is a context-free grammar, then there exists a nondeterministic pushdown automaton which accepts by empty stack, i.e. .
We outline the proof only. Let be a context-free grammar. Define pushdown automaton , where , and the set of transitions is:
If the set of productions of contains a production of type , then put into the transition ,
For any letter put into the transition .
If there is a production in , the pushdown automaton puts the mirror of in the stack with an ε-move. If the input letter coincides with the one on the top of the stack, then the automaton deletes it from the stack. If there is a nonterminal on the top of the stack, then the mirror of the right-hand side of a production which has on its left-hand side is put in the stack. If the stack is empty after reading all letters of the input word, then the pushdown automaton has recognized the input word.
The following algorithm builds for a context-free grammar the pushdown automaton , which accepts by empty stack the language generated by .
From-Cfg-to-Pushdown-Automaton( )
1 FOR all production
    DO put in the transition
2 FOR all terminal
    DO put in the transition
3 RETURN
If has productions and terminals, then the number of steps of the algorithm is .
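The construction can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the book's program: symbols are single characters, the empty string stands for the empty word, the automaton has a single state named `q`, and the sample grammar is hypothetical. Because the top of the stack is the last letter of the stack word, right-hand sides are written in mirrored form. The small search routine is added only to try the automaton out; it assumes an ε-free grammar.

```python
def cfg_to_pda(productions, terminals, state="q"):
    """Return the transition set of a one-state pushdown automaton.

    Each transition is a tuple (state, letter, top, pushed word, new state).
    """
    transitions = set()
    for lhs, rhs in productions:            # line 1 of the algorithm
        transitions.add((state, "", lhs, rhs[::-1], state))
    for a in terminals:                     # line 2 of the algorithm
        transitions.add((state, a, a, "", state))
    return transitions

def accepts_by_empty_stack(transitions, start_symbol, word):
    """Depth-first search over (unread input, stack) pairs.

    Assumes an ε-free grammar: every stack symbol then yields at least one
    letter, so a stack longer than the unread input is pruned.
    """
    todo, seen = [(word, start_symbol)], set()
    while todo:
        rest, stack = todo.pop()
        if (rest, stack) in seen or len(stack) > len(rest):
            continue
        seen.add((rest, stack))
        if not rest and not stack:
            return True
        if not stack:
            continue
        top, below = stack[-1], stack[:-1]
        for (_, letter, t, pushed, _) in transitions:
            if t != top:
                continue
            if letter == "":                          # expand a nonterminal
                todo.append((rest, below + pushed))
            elif rest and rest[0] == letter:          # match an input letter
                todo.append((rest[1:], below + pushed))
    return False

# Hypothetical grammar S -> aSb | ab, generating {a^n b^n : n >= 1}.
PRODUCTIONS = [("S", "aSb"), ("S", "ab")]
E = cfg_to_pda(PRODUCTIONS, "ab")
```

For this grammar the automaton has one transition per production and one per terminal, four in total, which agrees with the step count given above.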
Example 1.28 Let . Then , with the following transition table.
Let us see how pushdown automaton accepts word , which in grammar can be derived in the following way:
where productions and were used. Word is accepted by empty stack (see Fig. 1.36).
Theorem 1.26 For a nondeterministic pushdown automaton there always exists a context-free grammar such that accepts language by empty stack, i.e. .
Instead of a proof we will give a method to obtain grammar . Let be the nondeterministic pushdown automaton in question.
Then , where
and .
Productions in will be obtained as follows.
For all state put in production .
If , where , () and , put in for all possible states productions
.
If , where , and , put in production
.
The context-free grammar defined in this way is an extended one, to which an equivalent context-free grammar can be associated. The proof of the theorem is based on the fact that to every sequence of configurations by which the pushdown automaton accepts a word we can associate a derivation in grammar . This derivation generates just the word in question, because of the productions of the form , which were defined for all possible states . Using Example 1.27 we show how a derivation can be associated to a sequence of configurations. The pushdown automaton defined in the example recognizes word 00 by the sequence of configurations
,
which sequence is based on the transitions
,
,
.
To these transitions, by the definition of grammar , the following productions can be associated
(1) for all states ,
(2) ,
(3) .
Furthermore, for each state productions were defined.
By the existence of production there exists the derivation , where can be chosen arbitrarily. In production (1) above, let us choose state to be equal to . Then there exists also the derivation
,
where can be chosen arbitrarily. If , then the derivation
will result. Now let be equal to , then
,
which proves that word 00 can be derived using the above grammar.
The next algorithm builds for a pushdown automaton a context-free grammar , which generates the language accepted by pushdown automaton by empty stack.
From-Pushdown-Automaton-to-Cf-Grammar( )
1 FOR all
2   DO put in production
3 FOR all , (),
4   DO FOR all states
5     DO put in productions
6 FOR all ,
7   DO put in production
If the automaton has states and productions, then the above algorithm executes at most steps, so in the worst case the number of steps is .
Finally, without proof, we mention that the class of languages accepted by deterministic pushdown automata is a proper subset of the class of languages accepted by nondeterministic pushdown automata. This shows that pushdown automata behave differently from finite automata.
Example 1.29 As an example, consider the pushdown automaton from Example 1.28: . Grammar is:
where for all instead of we used the shorter notation . The transitions:
Based on these, the following productions are defined:
.
It is easy to see that can be eliminated, and the productions will be:
,
,
, ,
and these productions can be replaced:
,
.
Consider context-free grammar . A derivation tree of is a finite, ordered, labelled tree, whose root is labelled by the start symbol , every interior vertex is labelled by a nonterminal and every leaf by a terminal. If an interior vertex labelled by a nonterminal has descendants, then in there exists a production such that the descendants are labelled by letters , , . The result of a derivation tree is a word over , which can be obtained by reading the labels of the leaves from left to right. A derivation tree is also called a syntax tree.
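Reading the result of a derivation tree is a simple recursive traversal. In the sketch below the encoding of a tree as a pair (label, list of subtrees), the function name `result` and the sample tree are assumptions made for the illustration.

```python
def result(tree):
    """Read the leaf labels of a derivation tree from left to right.

    A tree is a pair (label, children); a leaf has an empty child list.
    """
    label, children = tree
    if not children:
        return label
    return "".join(result(child) for child in children)

# Hypothetical tree for the derivation S => aSb => aabb with S -> aSb | ab.
TREE = ("S", [("a", []),
              ("S", [("a", []), ("b", [])]),
              ("b", [])])
```

For the sample tree, `result(TREE)` gives the word aabb.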
Consider the context-free grammar . It generates language . Derivation of word is:
In Fig. 1.37 this derivation can be seen, whose result is .
To every derivation we can associate a syntax tree. Conversely, to any syntax tree more than one derivation can be associated. For example to syntax tree in Fig. 1.37 the derivation
also can be associated.
Definition 1.27 Derivation is a leftmost derivation, if for all there exist words , and productions , for which we have
Consider grammar:
.
In this grammar word has two different leftmost derivations:
,
.
Definition 1.28 A context-free grammar is ambiguous if in there exists a word with more than one leftmost derivation. Otherwise is unambiguous.
The above grammar is ambiguous, because word has two different leftmost derivations. A language can be generated by more than one grammar, and among these there can be both ambiguous and unambiguous ones. A context-free language is inherently ambiguous if there is no unambiguous grammar which generates it.
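Since every derivation tree corresponds to exactly one leftmost derivation, ambiguity of a given word can be tested by counting derivation trees. The following sketch does this for small grammars. It assumes that the grammar has no ε-productions and no unit productions, that nonterminals are uppercase letters and that all other characters are terminals; the function name `count_trees` and the sample grammar S → S+S | a are hypothetical.

```python
from functools import lru_cache

def count_trees(productions, start, word):
    """Count derivation trees (equivalently, leftmost derivations) of word."""

    @lru_cache(maxsize=None)
    def symbol(x, i, j):
        # number of ways symbol x derives word[i:j]
        if not x.isupper():
            return 1 if word[i:j] == x else 0
        return sum(sequence(rhs, i, j) for lhs, rhs in productions if lhs == x)

    @lru_cache(maxsize=None)
    def sequence(rhs, i, j):
        # number of ways the symbol string rhs derives word[i:j]
        if len(rhs) == 1:
            return symbol(rhs, i, j)
        # the first symbol takes word[i:k]; every later symbol needs >= 1 letter
        return sum(symbol(rhs[0], i, k) * sequence(rhs[1:], k, j)
                   for k in range(i + 1, j - len(rhs) + 2))

    return symbol(start, 0, len(word))

# Hypothetical ambiguous grammar.
GRAMMAR = [("S", "S+S"), ("S", "a")]
```

A word is derived ambiguously exactly when the count is greater than one; for instance the word a+a+a has two derivation trees in this grammar.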
Example 1.30 Examine the following two grammars.
Grammar is ambiguous because
and
.
Grammar is unambiguous.
It can be proved that .
As for regular languages, there is a pumping lemma for context-free languages too.
Theorem 1.29 (pumping lemma) For any context-free language there exists a natural number (which depends only on ), such that every word of the language longer than can be written in the form and the following are true:
(1) ,
(2) ,
(3) ,
(4) is also in for all .
Proof. Let be a grammar without unit productions, which generates language . Let be the number of nonterminals, and let be the maximum of the lengths of the right-hand sides of productions, i.e. . Let and , such that . Then there exists a derivation tree with the result . Let be the height of (the maximum of the path lengths from the root to the leaves). Because in all interior vertices have at most descendants, has at most leaves, i.e. . On the other hand, because of , we get that . From this it follows that in derivation tree there is a path from the root to a leaf on which there are more than vertices. Consider such a path. Because in the number of nonterminals is and on this path the vertices different from the leaf are labelled with nonterminals, by the pigeonhole principle there must be a nonterminal on this path which occurs at least twice.
Let us denote by the nonterminal which is the first to be repeated on this path from the root to the leaf. Denote by the subtree whose root is this occurrence of . Similarly, denote by the subtree whose root is the second occurrence of on this path. Let be the result of the tree . Then the result of is of the form , while that of is of the form . Derivation tree with this decomposition of can be seen in Fig. 1.38. We show that this decomposition of satisfies conditions (1)–(4) of the lemma. Because in there are no ε-productions (except maybe the case ), we have . Furthermore, because each interior vertex of the derivation tree has at least two descendants (namely there are no unit productions), so does the root of , hence . Because is the first repeated nonterminal on this path, the height of is at most , and from this results.
After eliminating from all vertices of excepting the root, the result of obtained tree is , i.e. .
Similarly, after eliminating we get , and finally because of the definition of we get Then . Therefore and for all . Therefore, for all we have , i.e. for all .
Now we present two consequences of the lemma.
Proof. This consequence states that there exists a context-sensitive language which is not context-free. To prove this it is sufficient to find a context-sensitive language for which the lemma is not true. Let this language be .
To show that this language is context-sensitive it is enough to give a convenient grammar. In Example 1.2 both grammars are extended context-sensitive, and we know that to each extended grammar of type an equivalent grammar of the same type can be associated.
Let be the natural number associated to by lemma, and consider the word . Because of , if is context-free can be decomposed in such that conditions (1)–(4) are true. We show that this leads us to a contradiction.
First, we show that words and can each contain only one type of letter. Indeed, if either or contains more than one type of letter, then in word the letters will not be in the order , so , which contradicts condition (4) of the lemma.
If both and contain at most one type of letter, then in word the numbers of occurrences of the different letters will not all be equal, so . This also contradicts condition (4) of the lemma. Therefore is not context-free.
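The argument can be checked by brute force for small values of the constant. The sketch below assumes that the language in question is {a^n b^n c^n : n ≥ 1}, the standard example for this statement, and it uses only the conditions |vx| ≥ 1 and |vwx| ≤ n on the decomposition; any additional condition only reduces the set of decompositions to be examined. The function names are hypothetical.

```python
def in_language(word):
    """Membership in the assumed language {a^n b^n c^n : n >= 1}."""
    n = len(word) // 3
    return n >= 1 and word == "a" * n + "b" * n + "c" * n

def pumpable_decomposition(z, n, max_i=3):
    """Return a decomposition (u, v, w, x, y) of z with |vx| >= 1 and
    |vwx| <= n whose pumped words stay in the language for i = 0..max_i,
    or None if no such decomposition exists."""
    for start in range(len(z) + 1):                     # vwx = z[start:end]
        for end in range(start + 1, min(start + n, len(z)) + 1):
            for p in range(start, end + 1):             # v = z[start:p]
                for q in range(p, end + 1):             # x = z[q:end]
                    u, v, w, x, y = (z[:start], z[start:p], z[p:q],
                                     z[q:end], z[end:])
                    if not v and not x:
                        continue
                    if all(in_language(u + v * i + w + x * i + y)
                           for i in range(max_i + 1)):
                        return (u, v, w, x, y)
    return None
```

A result of `None` for the word a^n b^n c^n means that every admissible decomposition already fails for some i ≤ `max_i`, which is what the proof claims. A result different from `None` would prove nothing, because only finitely many values of i are tested.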
Corollary 1.31 The class of context-free languages is not closed under intersection.
Proof. We give two context-free languages whose intersection is not context-free. Let , and
where :
,
,
,
and , where :
,
,
.
Languages and are context-free. But
is not context-free (see the proof of Corollary 1.30).
In the case of arbitrary grammars the normal form was defined (see Section 1.1) as grammars with no terminals in the left-hand side of productions. The normal form in the case of context-free languages contains some restrictions on the right-hand sides of productions. Two normal forms (Chomsky and Greibach) will be discussed.
Definition 1.32 A context-free grammar is in Chomsky normal form, if all productions have the form or , where , .
Example 1.31 Grammar is in Chomsky normal form and .
To each ε-free context-free grammar an equivalent grammar in Chomsky normal form can be associated. The next algorithm transforms an ε-free context-free grammar into a grammar which is in Chomsky normal form.
Chomsky-Normal-Form( )
1
2 eliminate unit productions, and let be the new set of productions (see algorithm Eliminate-Unit-Productions in Section 1.1)
3 in replace in each production with at least two letters in right-hand side all terminals by a new nonterminal , and add this nonterminal to and add production to
4 replace all productions , where and , by the following: , , , , where are new nonterminals, and add them to
5 RETURN
Example 1.32 Let . It is easy to see that . Steps of transformation to Chomsky normal form are the following:
Step 1:
Step 2: After eliminating the unit production the productions are:
,
.
Step 3: We introduce three new nonterminals because of the three terminals in productions. Let these be . Then the productions are:
,
,
,
.
Step 4: Only one new nonterminal (let this be ) must be introduced because of a single production with three letters in the right-hand side. Therefore , and the productions in are:
,
,
,
,
.
All these productions are in required form.
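Steps 3 and 4 of the algorithm can be sketched as follows. The sketch assumes a grammar without ε-productions whose unit productions have already been eliminated. A production is encoded as a pair (left-hand side, tuple of symbols); the names `T_a` and `N1`, `N2`, ... of the new nonterminals and the sample grammar are hypothetical and are assumed not to clash with existing symbols.

```python
def to_chomsky(productions, terminals):
    """Apply steps 3 and 4 of the algorithm to a list of productions."""
    # Step 3: in right-hand sides of length >= 2 replace terminal a by T_a.
    used, step3 = set(), []
    for lhs, rhs in productions:
        if len(rhs) >= 2:
            used.update(s for s in rhs if s in terminals)
            rhs = tuple(f"T_{s}" if s in terminals else s for s in rhs)
        step3.append((lhs, rhs))
    step3.extend((f"T_{a}", (a,)) for a in sorted(used))
    # Step 4: split right-hand sides longer than two into a chain of pairs.
    result, counter = [], 0
    for lhs, rhs in step3:
        while len(rhs) > 2:
            counter += 1
            new = f"N{counter}"
            result.append((lhs, (rhs[0], new)))
            lhs, rhs = new, rhs[1:]
        result.append((lhs, rhs))
    return result

def is_chomsky(productions, terminals):
    """Check that every production has the form A -> BC or A -> a."""
    return all(
        (len(rhs) == 1 and rhs[0] in terminals)
        or (len(rhs) == 2 and not any(s in terminals for s in rhs))
        for _, rhs in productions
    )

# Hypothetical grammar S -> aSb | ab.
GRAMMAR = [("S", ("a", "S", "b")), ("S", ("a", "b"))]
CNF = to_chomsky(GRAMMAR, {"a", "b"})
```

For the sample grammar one new nonterminal is introduced for each of the two terminals and one for the single right-hand side of length three, in the same way as in the example above.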
Definition 1.33 A context-free grammar is in Greibach normal form if all productions are of the form , where , , .
Example 1.33 Grammar is in Greibach normal form and .
To each ε-free context-free grammar an equivalent grammar in Greibach normal form can be given. We give an algorithm which transforms a context-free grammar in Chomsky normal form into a grammar in Greibach normal form.
First, we give an order of the nonterminals: , where is the start symbol. The algorithm will use the notations , .
Greibach-Normal-Form( )
1
2
3 FOR TO Case
4   DO FOR TO
5     DO for all productions and (where has no as first letter) in productions , delete from productions
6   IF there is a production Case
7     THEN put in the new nonterminal , for all productions put in productions and , delete from production , for all production (where is not the first letter of ) put in production
8 FOR DOWNTO Case
9   DO FOR TO
10    DO for all productions and put in production and delete from productions ,
11 FOR TO Case
12   DO FOR TO
13    DO for all productions and put in production and delete from productions
14 RETURN
The algorithm first transforms productions of the form such that or , where the latter is in Greibach normal form. After this, introducing a new nonterminal, it eliminates productions , and using substitutions all productions of the form and are transformed into Greibach normal form.
Example 1.34 Transform productions in Chomsky normal form
in Greibach normal form.
Steps of the algorithm:
3–5: Production must be transformed. For this production is appropriate. Put in the set of productions and eliminate .
The productions will be:
6–7: Elimination of production will be made using productions:
Then, after steps 6–7, the productions will be:
8–10: We make substitutions in productions with in left-hand side. The result is:
11–13: Similarly with productions with in left-hand side:
After the elimination in steps 8–13 of productions in which substitutions were made, the following productions, which are now in Greibach normal form, result:
can be generated by grammar
First, we eliminate the single unit production, then we give an equivalent grammar in Chomsky normal form, which is then transformed into Greibach normal form.
Productions after the elimination of production :
.
We introduce productions , and replace terminals by the corresponding nonterminals:
,
,
.
After introducing two new nonterminals (, ):
,
,
,
,
.
This is now in Chomsky normal form. Replace the nonterminals by letters as in the algorithm. Then, after applying the replacements
replaced by , replaced by , replaced by , replaced by , replaced by , replaced by , replaced by ,
our grammar will have the productions:
,
,
,
,
.
In steps 3–5 of the algorithm the new productions will occur:
then
, then
.
Therefore
,
,
,
.
Steps 6–7 will be skipped, because we have no left-recursive productions. In steps 8–10 after the appropriate substitutions we have:
,
,
,
,
,
.
Exercises
1.3-1 Give pushdown automata to accept the following languages:
,
,
.
1.3-2 Give a context-free grammar to generate language , and transform it in Chomsky and Greibach normal forms. Give a pushdown automaton which accepts .
1.3-3 What languages are generated by the following context-free grammars?
.
1.3-4 Give a context-free grammar to generate words with an equal number of letters and .
1.3-5 Prove, using the pumping lemma, that a language whose words contain an equal number of letters , and cannot be context-free.
1.3-6 Let the grammar , where
,
,
,
Show that word if a then if a then c else c has two different leftmost derivations.
1.3-7 Prove that if is context-free, then is also context-free.
PROBLEMS
1-1
Linear grammars
A grammar which has productions only in the form or , where , is called a linear grammar. If in a linear grammar all productions are of the form or , then it is called a left-linear grammar. Prove that the language generated by a left-linear grammar is regular.
1-2
Operator grammars
An ε-free context-free grammar is called an operator grammar if in the right-hand side of productions there are no two successive nonterminals. Show that for every ε-free context-free grammar an equivalent operator grammar can be built.
1-3
Complement of context-free languages
Prove that the class of context-free languages is not closed under complement.
CHAPTER NOTES
In the definition of finite automata we have used the transition graph instead of the transition function, which in many cases helps us to give simpler proofs.
There are many classical books on automata and formal languages. We mention the following: two books by Aho and Ullman [5], [6] from 1972 and 1973, the book by Gécseg and Peák [87] from 1972, two books by Salomaa [221], [222] from 1969 and 1973, a book by Hopcroft and Ullman [118] from 1979, a book by Harrison [108] from 1978, and a book by Manna [174], which was also published in Hungarian in 1981. We also mention a book by Sipser [242] from 1997 and a monograph by Rozenberg and Salomaa [220]. In a book by Lothaire (the common name of a group of French authors) [166] on combinatorics of words we can read about other types of automata. The paper of Giammarresi and Montalbano [89] generalises the notion of finite automata. A newer monograph is that of Hopcroft, Motwani and Ullman [117]. In German we recommend the textbook of Asteroth and Baier [14]. The concise description of the transformation into Greibach normal form is based on this book.
A practical introduction to formal languages is written by Webber [270].
Other books in English: [32], [40], [68], [142], [149], [159], [164], [165], [177], [185], [238], [240], [251], [252].
At the end of the next chapter on compilers other books on the subject are mentioned.
When a programmer writes down a solution to her problem, she writes a program in a programming language. These programming languages are very different from the native languages of computers, the machine languages. Therefore we have to produce executable forms of the programs created by the programmer. We need a software or hardware tool that translates the source language program, written in a high level programming language, into the target language program, which is in a lower level programming language, mostly machine code.
There are two fundamental methods to execute a program written in a higher level language. The first is using an interpreter. In this case, the generated machine code is not saved but executed immediately. The interpreter is considered as a special computer, whose machine code is the high level language. Essentially, when we use an interpreter, we create a two-level machine; its lower level is the real computer, on which the higher level, the interpreter, is built. The higher level is usually realized by a computer program, but, for some programming languages, there are special hardware interpreter machines.
The second method is using a compiler program. The difference between this method and the first is that here the result of the translation is not executed, but saved in an intermediate file called the target program.
The target program may be executed later, and the result of the program is received only then. In this case, in contrast with interpreters, the times of translation and execution are distinguishable.
With respect to translation, the two methods are identical, since the interpreter and the compiler both generate target programs. For this reason we speak about compilers only. We will deal with these translator programs, called compilers (Figure 2.1).
Our task is to study the algorithms of compilers. This chapter deals with the translators of high level imperative programming languages; the translation methods of logical or functional languages will not be investigated.
First the structure of compilers will be given. Then we will deal with scanners, that is, lexical analysers. In the topic of parsers (syntactic analysers), the two most successful methods will be studied: the LL and the LALR parsing methods. The advanced methods of semantic analysis use O-ATG grammars, and the task of code generation is also described by this type of grammar. In this book these topics are not considered, nor will we study such important and interesting problems as symbol table handling, error repairing or code optimising. The reader can find new, modern and efficient methods for these problems in the bibliography.
A compiler translates the source language program (in short, source program) into a target language program (in short, target program). Moreover, it creates a list by which the programmer can check her own program. This list contains the detected errors, too.
Using the notation program (input)(output) the compiler can be written as
compiler (source program)(target program, list).
In the following, the structure of compilers is studied, and the tasks of the program elements are described, using the above notation.
The first program of a compiler transforms the source language program into character stream that is easy to handle. This program is the source handler.
source handler (source program)(character stream).
The form of the source program depends on the operating system. The source handler reads the file of the source program using the operating system, and omits the characters marking the ends of lines, since these characters have no importance in the next steps of compilation. This modified, “poured” character stream will be the input data of the next steps.
The list created by the compiler has to contain the original source language program written by the programmer, instead of this modified character stream. Hence we define a list handler program,
list handler (source program, errors)(list),
which creates the list according to the file form of the operating system, and puts this list on a secondary memory.
It is practical to join the source handler and the list handler programs, since they have the same input files. This program is the source handler.
source handler (source program, errors)(character stream, list).
The target program is created by the compiler from the generated target code. It is located on a secondary memory, too, usually in a transferable binary form. Of course this form depends on the operating system. This task is done by the code handler program.
code handler (target code)(target program).
Using the above programs, the structure of a compiler is the following (Figure 2.2):
source handler (source program, errors)(character stream, list),
compiler (character stream)(target code, errors),
code handler (target code)(target program).
This decomposition is not a sequence: the three program elements are not executed sequentially. The decomposition consists of three independent working units. Their connections are indicated by their inputs and outputs.
In what follows we do not deal with the handlers because of their dependence on computers, operating systems and peripherals, although the outer form, the connection with the user and the availability of the compiler are determined mainly by these programs. The task of the program compiler is the translation. It consists of two main subtasks: analysing the input character stream, and synthesising the target code. The first problem of the analysis is to determine the connected characters in the character stream. These are the symbolic items, e.g., the constants, names of variables, keywords, operators. This is done by the lexical analyser, in short, scanner.
From the character stream the scanner makes a series of symbols, and during this task it detects lexical errors.
scanner (character stream)(series of symbols, lexical errors).
This series of symbols is the input of the syntactic analyser, in short, parser. Its task is to check the syntactic structure of the program. This process is similar to a language teacher checking the verb, the subject, the predicates and the attributes of a sentence in a language lesson. The errors detected during this analysis are the syntactic errors. The result of the syntactic analysis is the syntax tree of the program, or some equivalent structure.
parser (series of symbols)(syntactically analysed program, syntactic errors).
The third program of the analysis is the semantic analyser. Its task is to check the static semantics. For example, when the semantic analyser checks declarations and the types of variables in the expression a + b, it verifies whether the variables a and b are declared, whether they are of the same type, and whether they have values. The errors detected by this program are the semantic errors.
semantic analyser (syntactically analysed program)(analysed program, semantic errors).
The output of the semantic analyser is the input of the programs of synthesis. The first step of the synthesis is the code generation, which is done by the code generator:
code generator (analysed program)(target code).
The target code usually depends on the computer and the operating system. It is usually an assembly language program or machine code. The next step of synthesis is the code optimisation:
code optimiser (target code)(target code).
The code optimiser transforms the target code in such a way that the new code is better in some respect, for example in running time or size.
As follows from the considerations above, a compiler consists of the following components (the structure of the compiler program is shown in Figure 2.3):
source handler (source program, errors)(character stream, list),
scanner (character stream)(series of symbols, lexical errors),
parser (series of symbols)(syntactically analysed program, syntactic errors),
semantic analyser (syntactically analysed program)(analysed program, semantic errors),
code generator (analysed program)(target code),
code optimiser (target code)(target code),
code handler (target code)(target program).
The algorithm of the part of the compiler that performs analysis and synthesis is the following:
Compiler
1 determine the symbolic items in the text of source program
2 check the syntactic correctness of the series of symbols
3 check the semantic correctness of the series of symbols
4 generate the target code
5 optimise the target code
The tasks described in the first two steps will be analysed in the next sections.
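The five steps of the Compiler algorithm can be illustrated by a small sketch. It is only a toy under stated assumptions: the source language consists of sums of integers such as 1 + 2 + 3, the target is a hypothetical stack machine with PUSH and ADD instructions, and the semantic step is omitted because this toy language has no declarations. All function names follow the notation program (input)(output) used above, but the bodies are illustrative, not the method of any real compiler.

```python
def scanner(chars):
    """character stream -> (series of symbols, lexical errors)"""
    symbols, errors = [], []
    for word in chars.split():
        if word.isdigit():
            symbols.append(("num", int(word)))
        elif word == "+":
            symbols.append(("plus", None))
        else:
            errors.append("lexical error: " + word)
    return symbols, errors


def parser(symbols):
    """series of symbols -> (syntactically analysed program, syntactic errors)"""
    kinds = [kind for kind, _ in symbols]
    expected = ["num" if i % 2 == 0 else "plus" for i in range(len(kinds))]
    if len(kinds) % 2 == 0 or kinds != expected:
        return None, ["syntactic error: expected num (+ num)*"]
    return [value for kind, value in symbols if kind == "num"], []


def code_generator(operands):
    """analysed program -> target code for a hypothetical stack machine"""
    code = ["PUSH %d" % operands[0]]
    for value in operands[1:]:
        code += ["PUSH %d" % value, "ADD"]
    return code


def code_optimiser(code):
    """target code -> target code: fold the constant sum into one PUSH"""
    total = sum(int(line.split()[1]) for line in code if line.startswith("PUSH"))
    return ["PUSH %d" % total]


def compiler(chars):
    """character stream -> (target code, errors)"""
    symbols, errors = scanner(chars)
    if errors:
        return None, errors
    operands, errors = parser(symbols)
    if errors:
        return None, errors
    return code_optimiser(code_generator(operands)), []
```

For the input 1 + 2 + 3 the generator produces five instructions and the optimiser replaces them by the single instruction PUSH 6, which is better in both running time and size.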
Exercises
2.1-1 Using the above notations, give the structure of interpreters.
2.1-2 Take a programming language, and write program fragments in which there are lexical, syntactic and semantic errors.
2.1-3 Give aspects in which the code optimiser can produce target code that is better than the original.
The source handler transforms the source program into a character stream. The main task of the lexical analyser (scanner) is to recognise the symbolic units in this character stream. These symbolic units are called symbols.
Unfortunately, in different programming languages the same symbolic units consist of different character streams, and different symbolic units consist of the same character streams. For example, there is a programming language in which the 1. and .10 character streams denote real numbers. If we concatenate these symbols, then the result is the 1..10 character stream. The fact that an arithmetic operator is missing between the two numbers will be detected by the next analyser, which does the syntactic analysis. However, there are programming languages in which this character stream is decomposed into three components: 1 and 10 are the lower and upper limits of an interval type variable.
The lexical analyser determines not only the characters of a symbol, but also the attributes derived from the surrounding text. Such attributes are, e.g., the type and value of a symbol.
The scanner assigns codes to the symbols, the same code to the same sort of symbols. For example, the code of all integer numbers is the same; another unique code is assigned to variables.
The lexical analyser transforms the character stream into a series of symbol codes, and the attributes of a symbol are written in this series immediately after the code of the symbol concerned.
The output information of the lexical analyser is not “readable”: it is usually a series of binary codes. We note that, from the viewpoint of the compiler, from this step of the compilation on it does not matter from which characters the symbol was made, i.e. whether the code of the if symbol was made from the English if, the Hungarian ha or the German wenn characters. Therefore, for a programming language using English keywords, it is easy to construct another programming language using keywords of another language. In the compiler of this new programming language only the lexical analysis has to be modified; the other parts of the compiler are unchanged.
The exact definition of symbolic units can be given by a regular grammar, regular expressions or a deterministic finite automaton. The theories of regular grammars, regular expressions and deterministic finite automata were studied in previous chapters.
In practice the lexical analyser may be a part of the syntactic analysis. The main reason to distinguish these analysers is that a lexical analyser made from a regular grammar is much simpler than a lexical analyser made from a context-free grammar. Context-free grammars are used to create syntactic analysers.
One of the most popular methods to create the lexical analyser is the following:
describe symbolic units in the language of regular expressions, and from this information construct the deterministic finite automaton which is equivalent to these regular expressions,
implement this deterministic finite automaton.
We note that regular expressions are used for describing symbols because they are more convenient and readable than regular grammars. There are standard programs, such as the lex of UNIX systems, that generate a complete lexical analyser from regular expressions. Moreover, there are generator programs that give the automaton of the scanner, too.
A very simple implementation of the deterministic finite automaton uses multiway CASE instructions. The conditions of the branches are the characters of state transitions, and the instructions of a branch represent the new state the automaton reaches when it carries out the given state transition.
The main principle of the lexical analyser is building a symbol from the longest possible series of characters. For example, the string ABC is one three-letter symbol, rather than three one-letter symbols. This means that the alternative instructions of the CASE branch read characters as long as they are parts of the symbol under construction.
Functions can be attached to the final states of the automaton. For example, one function converts constant symbols into the internal binary form of constants, and another writes identifiers into the symbol table.
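The following sketch shows one way to hand-code such an automaton. It is an illustration under assumptions, not the implementation of any particular compiler: the branches of the if/elif chain play the role of the CASE instruction, the inner loops realise the longest-match principle, and the symbol kinds "id" and "int" as well as the list used as symbol table are hypothetical choices.

```python
def scan(text):
    """Return (symbols, symbol_table) for a text of identifiers and integers."""
    symbol_table = []              # identifiers in order of first appearance
    symbols = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in " \t":                        # white space is not passed on
            i += 1
        elif ch.isalpha():                     # branch: an identifier starts
            start = i
            while i < len(text) and text[i].isalnum():
                i += 1                         # read the longest possible symbol
            name = text[start:i]
            if name not in symbol_table:       # function of the final state
                symbol_table.append(name)
            symbols.append(("id", symbol_table.index(name)))
        elif ch.isdigit():                     # branch: an integer starts
            start = i
            while i < len(text) and text[i].isdigit():
                i += 1
            # function of the final state: convert to internal binary form
            symbols.append(("int", int(text[start:i])))
        else:
            raise ValueError("lexical error at position %d" % i)
    return symbols, symbol_table
```

With this sketch the text ABC is scanned as one identifier, and a repeated identifier receives the same symbol table address as its first occurrence.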
The input stream of the lexical analyser contains tabulators and space characters, since the source handler removes the carriage return and line feed characters only. In most programming languages it is possible to write any number of spaces or tabulators between symbols. From the point of view of the compiler these symbols have no importance after their recognition, hence they have the name white spaces.
Removing white spaces is the task of the lexical analyser. The description of the white space is the following regular expression:
$(\mathrm{space} \mid \mathrm{tab})^{*}$,
where space and the tab tabulator are the characters which build the white space symbols and $\mid$ is the symbol for the or function. No action is needed for these white space symbols; the scanner does not pass them to the syntactic analyser.
Some examples of regular expressions:
Example 2.1 Introduce the following notations: Let be an arbitrary digit, and let be an arbitrary letter,
the not-visible characters are denoted by their short names, and let be the name of the empty character stream. denotes a character distinct from . The regular expressions are:
real number: ,
positive integer and real number: ,
identifier: ,
comment: ,
comment terminated by : ,
string of characters: .
Deterministic finite automata constructed from regular expressions 2 and 3 are in Figures 2.4 and 2.5.
The task of the lexical analyser is to determine the text of symbols, but not all the characters of a regular expression belong to the symbol. As in the 6th example, the first and the last " characters do not belong to the symbol. To solve this problem, a buffer is created for the scanner. After a symbol has been recognised, its characters will be in the buffer. The deterministic finite automaton is therefore supplemented by a transfer function, which marks the transitions at which the character just read is inserted into the buffer.
Example 2.2 The 4th and 6th regular expressions of Example 2.1 are supplemented by the transfer function; the automata for these expressions are in Figures 2.6 and 2.7. The automaton of the 4th regular expression has no transfer function, since it recognises comments. The automaton of the 6th regular expression recognises This is a "string" from the character string "This is a ""string""".
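The buffer handling of the string automaton can be sketched as follows. This is a minimal illustration, assuming that a string is enclosed in " characters and that a doubled " inside the string stands for a single one; the function name scan_string is hypothetical.

```python
def scan_string(text):
    """Recognise a quoted string at the start of text.

    Returns (buffer content, number of characters consumed).
    """
    if not text or text[0] != '"':
        raise ValueError("a string must start with a quote character")
    buffer = []                     # receives only the characters of the symbol
    i = 1                           # the opening quote is not transferred
    while i < len(text):
        if text[i] != '"':
            buffer.append(text[i])  # transfer an ordinary character
            i += 1
        elif i + 1 < len(text) and text[i + 1] == '"':
            buffer.append('"')      # a doubled quote transfers one quote
            i += 2
        else:
            return "".join(buffer), i + 1   # closing quote, not transferred
    raise ValueError("unterminated string")
```

Applied to the character string of Example 2.2, the sketch leaves This is a "string" in the buffer and consumes the whole input.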
Now we write the algorithm of the lexical analyser given by deterministic finite automaton. (The state of the set of one element will be denoted by the only element of the set).
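Let $A = (Q, \Sigma, \delta, q_0, F)$ be the deterministic finite automaton which is the scanner. We augment the alphabet $\Sigma$ with a new notion: let others be all the characters not in $\Sigma$. Accordingly, we modify the transition function $\delta$:
$$\delta'(q, a) = \begin{cases} \delta(q, a), & \text{if } a \neq \text{others}, \\ \emptyset, & \text{otherwise}. \end{cases}$$
The algorithm of the analysis, using the augmented automaton $A'$, follows: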
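Lex-analyse($x\#, A'$)
1 $q \leftarrow q_0$, $a \leftarrow$ first character of $x$
2 $s' \leftarrow$ analysing
3 WHILE $a \neq \#$ and $s' =$ analysing
4   DO IF $\delta'(q, a) \neq \emptyset$
5     THEN $q \leftarrow \delta'(q, a)$
6       $a \leftarrow$ next character of $x$
7     ELSE $s' \leftarrow$ ERROR
8 IF $s' =$ analysing and $q \in F$
9   THEN $s' \leftarrow$ O.K.
10   ELSE $s' \leftarrow$ ERROR
11 RETURN $s'$, $a$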
The algorithm has two parameters: the first one is the input character string terminated by #, the second one is the automaton of the scanner. In line 1 the state of the scanner is set to $q_0$, the start state of the automaton, and the first character of the input string is determined. The variable $s'$ indicates that the algorithm is analysing the input string; the text analysing is put into this variable in line 2. In line 5 a state transition is executed. It can be seen that the above augmentation is needed to terminate in the case of an unexpected, invalid character. In lines 8–10 O.K. means that the analysed character string is correct, and ERROR signals that a lexical error was detected. In the case of successful termination the variable $a$ contains the # character; at erroneous termination it contains the invalid character.
We note that the algorithm LEX-ANALYSE recognises one symbol only, and then it terminates. A program written in a programming language consists of many symbols, hence after recognising a symbol, the algorithm has to be continued by detecting the next symbol. The work of the analyser is restarted at the start state of the automaton. We propose the full algorithm of the lexical analyser as an exercise (see Problem 2-1).
Example 2.3 The automaton of the identifier in the point 3 of example 2.1 is in Figure 2.5. The start state is 0, and the final state is 1. The transition function of the automaton follows:
The augmented transition function of the automaton:
The algorithm LEX-ANALYSE gives a series of states and the sign O.K. for the input string abc123#; it gives the sign ERROR for the input string 9abc#
, and the series and sign ERROR to the input string abc
123
.
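The behaviour described in Example 2.3 can be tried out with a short sketch. It assumes an identifier automaton with start state 0 and final state 1 in which an identifier is a letter followed by letters or digits; the transition function identifier_delta and the use of None in place of the empty set are assumptions made for illustration.

```python
def lex_analyse(text, delta, start, finals):
    """Run the augmented automaton on text, which must end with '#'.

    Returns (sign, states visited, character at which the scan stopped).
    """
    state = start
    states = [state]
    pos = 0
    status = "analysing"
    while text[pos] != "#" and status == "analysing":
        nxt = delta(state, text[pos])
        if nxt is not None:
            state = nxt                 # carry out the state transition
            states.append(state)
            pos += 1                    # take the next character
        else:
            status = "ERROR"            # the "others" case: no transition
    if status == "analysing" and state in finals:
        status = "O.K."
    else:
        status = "ERROR"
    return status, states, text[pos]


def identifier_delta(state, ch):
    """Assumed identifier automaton: start state 0, final state 1."""
    if state == 0 and ch.isalpha():
        return 1
    if state == 1 and ch.isalnum():
        return 1
    return None                         # plays the role of the empty set
```

For abc123# the sketch visits state 0 once and state 1 six times and returns O.K.; for 9abc# it stops in the start state with ERROR, and the returned character is the invalid 9.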
In this subsection we investigate the problems that emerge while the lexical analyser is running, and supply solutions for these problems.
Programming languages allow identifiers having special names and predefined meanings. They are the keywords. Keywords are used only in their original meaning. However, there are identifiers which also have a predefined meaning but can be redefined in programs. These words are called standard words.
The number of keywords and standard words varies between programming languages. For example, there is a programming language in which three keywords are used for the zero value: zero, zeros and zeroes.
Now we investigate how the lexical analyser recognises keywords and standard words, and how it distinguishes them from identifiers created by the programmers.
Using a standard word with a meaning different from its original one causes extra difficulty, not only for the compilation process but also for the readability of the program, as in the next example:
if if then else = then;
or if we declare procedures which have the names begin and end:
begin begin; begin end; end; begin end; end;
Recognition of keywords and standard words is a simple task if they are written using special type characters (for example bold characters), or if they are between special prefix and postfix characters (for example between apostrophes).
We give two methods to analyse keywords.
All keywords are written as a regular expression, and the automaton created from this expression is implemented. The disadvantage of this method is the size of the analyser program. It will be large even if the descriptions of keywords whose first letters are the same are contracted.
Keywords are stored in a special keyword table. The words can be determined in the character stream by a general identifier recogniser. Then, by a simple search algorithm, we check whether this word is in the keyword table. If this word is in the table then it is a keyword. Otherwise it is an identifier defined by the user. This method is very simple, but the efficiency of the search depends on the structure of the keyword table and on the search algorithm. A well-selected mapping function and an adequate keyword table can be very effective.
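The second method can be sketched in a few lines. The keyword set below is an arbitrary illustrative choice, and the hash-based frozenset of Python stands in for the “well-selected mapping function”; the names KEYWORDS, WORD, classify and words are hypothetical.

```python
import re

# Illustrative keyword table; a hashed set gives constant expected search time.
KEYWORDS = frozenset({"if", "then", "else", "begin", "end"})

# General identifier recogniser: a letter followed by letters or digits.
WORD = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def classify(word):
    """Decide by table lookup whether a recognised word is a keyword."""
    kind = "keyword" if word in KEYWORDS else "identifier"
    return kind, word


def words(text):
    """Recognise all words of text and classify each of them."""
    return [classify(match.group()) for match in WORD.finditer(text)]
```

Because the recogniser takes the longest match, begin1 is looked up as a whole and is classified as an identifier, not as the keyword begin followed by a digit.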
If it is possible to write standard words in the programming language, then the lexical analyser recognises the standard words using one of the above methods. But the meaning of a standard word depends on its context. To decide whether it has its original meaning or it was redefined by the programmer is the task of the syntactic analyser.
Since the lexical analyser creates a symbol from the longest character stream, it has to look ahead one or more characters to locate the right end of a symbol. There is a classical example for this problem, the next two FORTRAN statements:
DO 10 I = 1.1000
DO 10 I = 1,1000
In the FORTRAN programming language space characters are not significant, hence the character between 1 and 1000 decides whether the statement is a DO loop statement or an assignment statement for the DO10I identifier.
To mark the right end of the symbol, we introduce the symbol / into the description of regular expressions. Its name is lookahead operator. Using this symbol the description of the above DO keyword is the following:
DO / (letter | digit)* = (letter | digit)* ,
This definition means that the lexical analyser treats the first two letters, D and O, as the DO keyword if, looking ahead after the O letter, there are letters or digits, then there is an equal sign, after this sign there are letters or digits again, and finally there is a “,” character. The lookahead operator implies that the lexical analyser has to look ahead after the DO characters. We remark that using this lookahead method the lexical analyser recognises the DO keyword even if there is an error in the character stream, such as in the DO2A=3B, character stream, but in a correct assignment statement it does not detect the DO keyword.
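The effect of the lookahead operator can be imitated with the lookahead assertion (?=...) of Python regular expressions. The sketch below is only an analogy under assumptions: it removes the spaces first, as they are not significant in FORTRAN, and it treats upper-case letters and digits as the only letters and digits; the names DO_KEYWORD and starts_with_do_keyword are hypothetical.

```python
import re

# "DO" is matched only if the text after it looks like
# (letter | digit)* = (letter | digit)* ,
# The lookahead part is inspected but not consumed.
DO_KEYWORD = re.compile(r"DO(?=[A-Z0-9]*=[A-Z0-9]*,)")


def starts_with_do_keyword(statement):
    """Tell whether the statement begins with the DO keyword."""
    compact = statement.replace(" ", "")   # spaces are not significant
    return DO_KEYWORD.match(compact) is not None
```

The match object covers only the two characters DO, so the scanner would continue with the characters that were looked at but not consumed.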
In the next example we consider positive integers. The definition of integer numbers is a prefix of the definition of the real numbers, and the definition of real numbers is a prefix of the definition of real numbers containing an explicit power-part.
The automaton for all of these three expressions is the automaton of the longest character stream, the real number containing an explicit power-part.
The problem of the lookahead symbols is resolved using the following algorithm. Put each character into a buffer, and put auxiliary information beside this character. This information is “invalid” if the character string ending with the character just read is not a correct symbol; otherwise it is the type of the symbol. If the automaton is in a final state, then the automaton recognises a real number with explicit power-part. If the automaton is in an internal state, and there is no possibility to read a next character, then the longest character stream which has valid information is the recognised symbol.
Example 2.4 Consider the 12.3e+f# character stream, where the character # is the end sign of the analysed text. If in this character stream there were a positive integer number in the place of the character f, then this character stream would be a real number. The content of the buffer of the lexical analyser:
The recognised symbol is the 12.3 real number. The lexical analysis is continued at the text e+f.
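The buffer technique of Example 2.4 can be followed with the sketch below. The automaton is an assumption made for illustration: states 1, 2, 3 and 6 are taken to be final, so that 12 is an integer, 12. and 12.3 are real numbers and 12.3e+5 is a real number with explicit power-part; the names KIND, step and longest_symbol are hypothetical.

```python
# Type of the symbol recognised in each final state (assumed automaton).
KIND = {1: "integer", 2: "real", 3: "real", 6: "real with exponent"}


def step(state, ch):
    """Transition function; None means that there is no transition."""
    digit = ch.isdigit()
    if state in (0, 1) and digit:
        return 1
    if state == 1 and ch == ".":
        return 2
    if state in (2, 3) and digit:
        return 3
    if state == 3 and ch == "e":
        return 4
    if state == 4 and ch in "+-":
        return 5
    if state in (4, 5, 6) and digit:
        return 6
    return None


def longest_symbol(text):
    """Return (buffer, recognised symbol, text where analysis continues)."""
    state, buffer = 0, []
    for i, ch in enumerate(text):
        state = step(state, ch)
        # store the prefix read so far with its type, or "invalid"
        buffer.append((text[: i + 1], KIND.get(state, "invalid")))
        if state is None:
            break                      # no further character can be read
    valid = [entry for entry in buffer if entry[1] != "invalid"]
    symbol = valid[-1] if valid else None
    rest = text[len(symbol[0]):] if symbol else text
    return buffer, symbol, rest
```

For 12.3e+f# the last buffer entry with valid information is 12.3, so this is the recognised symbol and the analysis continues at e+f#.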
The number of lookahead-characters may be determined from the definition of the program language. In the modern languages this number is at most two.
There are programming languages, for example C, in which small letters and capital letters are different. In this case the lexical analyser uses the characters of all symbols without modification. Otherwise the lexical analyser converts all characters to their small letter form or all characters to their capital letter form. It is proposed to execute this transformation in the source handler program.
In the case of simpler programming languages the lexical analyser writes the characters of the detected symbol into the symbol table, if this symbol is not there yet. After writing it, or if this symbol was already in the symbol table, the lexical analyser returns the table address of this symbol, and writes this information into its output. These data will be important in semantic analysis and code generation.
In programming languages the directives serve to control the compiler. The lexical analyser identifies directives and recognises their operands, and usually there are further tasks with these directives.
If the directive is the if of conditional compilation, then the lexical analyser has to detect all the parameters of this condition, and it has to evaluate the value of the branch. If this value is false, then it has to omit the next lines until the else or endif directive. It means that the lexical analyser performs syntactic and semantic checking, and creates code-style information. This task is more complicated if the programming language makes it possible to write nested conditions.
Other types of directives are the substitution of macros and the inclusion of files into the source text. These tasks are far away from the original task of the lexical analyser.
The usual way to solve these problems is the following. The compiler executes a pre-processing program, and this program performs all of the tasks described by directives.
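A pre-processing step for nested conditional compilation might look like the following sketch. The directive syntax (#if NAME, #else, #endif, each on its own line) and the rule that a condition is true exactly when NAME is in a set of defined names are assumptions chosen for illustration; they are not taken from a specific language.

```python
def preprocess(lines, defined):
    """Keep only the lines selected by #if NAME / #else / #endif directives."""
    output = []
    stack = []                  # one truth value for every open #if
    for line in lines:
        words = line.split()
        active = all(stack)     # a line is kept only if all open branches hold
        if words[:1] == ["#if"]:
            stack.append(words[1] in defined)
        elif words[:1] == ["#else"]:
            stack[-1] = not stack[-1]
        elif words[:1] == ["#endif"]:
            stack.pop()
        elif active:
            output.append(line)
    if stack:
        raise ValueError("missing #endif")
    return output
```

The stack is what makes nesting work: an inner branch is copied to the output only if its own condition and the conditions of all enclosing directives are true.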
Exercises
2.2-1 Give a regular expression to the comments of a programming language. In this language the delimiters of comments are and , and inside of a comment may occurs and characters, but is forbidden.
2.2-2 Modify the result of the previous question if it is supposed that the programming language has possibility to write nested comments.
2.2-3 Give a regular expression for positive integer numbers, if the pre- and post-zero characters are prohibited. Give a deterministic finite automaton for this regular expression.
2.2-4 Write a program which re-creates the original source program from the output of the lexical analyser. Pay attention to the neat and correct positioning of the re-created character streams.
The complete definition of a programming language includes the definition of its syntax and semantics.
The syntax of programming languages cannot be fully described by context free grammars. It is possible by using context dependent grammars, two-level grammars or attribute grammars. For these grammars there are no efficient parsing methods, hence the description of a language consists of two parts. The main part of the syntax is given using context free grammars, and for the remaining part a context dependent or an attribute grammar is applied. For example, the description of the program structure or the description of the statement structure belongs to the first part, and the type checking, the scope of variables or the correspondence of formal and actual parameters belong to the second part.
The checking of properties written by context free grammars is called syntactic analysis or parsing. Properties that cannot be written by context free grammars form the static semantics. These properties are checked by the semantic analyser.
The conventional semantics has the name run-time semantics or dynamic semantics. The dynamic semantics can be given by verbal methods or some interpreter methods, where the operation of the program is given by the series of state alterations of the interpreter and its environment.
We deal with context free grammars, and in this section we will use extended grammars for the syntactic analysis. We investigate methods of checking properties which are written by context free grammars. First we give the basic notions of syntactic analysis, then the parsing algorithms will be studied.
Definition 2.1 Let $G = (N, T, P, S)$ be a grammar. If $S \Rightarrow^{*} \alpha$ and $\alpha \in (N \cup T)^{*}$ then $\alpha$ is a sentential form. If $S \Rightarrow^{*} x$ and $x \in T^{*}$ then $x$ is a sentence of the language defined by the grammar.
The sentence has an important role in parsing. The program written by a programmer is a series of terminal symbols, and this series is a sentence if it is correct, that is, it has no syntactic errors.
Definition 2.2 Let be a grammar and is a sentential form (). We say that is a phrase of , if there is a symbol , which and . We say that is a simple phrase of , if .
We note that every sentence is a phrase. The leftmost simple phrase has an important role in parsing; it has its own name.
Definition 2.3 The leftmost simple phrase of a sentence is the handle.
The leaves of the syntax tree of a sentence are terminal symbols, other points of the tree are nonterminal symbols, and the root symbol of the tree is the start symbol of the grammar.
In an ambiguous grammar there is at least one sentence, which has several syntax trees. It means that this sentence has more than one analysis, and therefore there are several target programs for this sentence. This ambiguity raises a lot of problems, therefore the compilers translate languages generated by unambiguous grammars only.
We suppose that the grammar has properties as follows:
the grammar is cycle free, that is, it has no series of derivations $A \Rightarrow^{+} A$ ($A \in N$),
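the grammar is reduced, that is, there are no “unused symbols” in the grammar: all nonterminals occur in some derivation, and from every nonterminal we can derive a part of a sentence. This last property means that for all $A \in N$ it is true that $S \Rightarrow^{*} \alpha A \beta \Rightarrow^{*} \alpha y \beta \Rightarrow^{*} xyz$, where $A \Rightarrow^{*} y$ and $|y| > 0$ ($\alpha, \beta \in (N \cup T)^{*}$, $x, y, z \in T^{*}$).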
As has been shown, the lexical analyser translates the program written by a programmer into a series of terminal symbols, and this series is the input of the syntactic analyser. The task of the syntactic analyser is to decide whether this series is a sentence of the grammar or not. To achieve this goal, the parser creates the syntax tree of the series of symbols. From the known start symbol and the leaves of the syntax tree the parser creates all vertices and edges of the tree, that is, it creates a derivation of the program. If this is possible, then we say that the program is an element of the language. It means that the program is syntactically correct.
From now on we deal with left to right parsing methods. These methods read the symbols of the program left to right. All real compilers use this method.
To create the inner part of the syntax tree there are several methods. One of these methods builds the syntax tree from its start symbol $S$. This method is called the top-down method. If the parser goes from the leaves to the symbol $S$, then it uses the bottom-up parsing method.
We deal with top-down parsing methods in Subsection 2.3.1. We investigate bottom-up parsers in Subsection 2.3.2; these are the methods used in real compilers today.
If we analyse from top to bottom then we start with the start symbol. This symbol is the root of the syntax tree; we attempt to construct the syntax tree. Our goal is that the leaves of the tree are the terminal symbols.
First we review the notions that are necessary in top-down parsing. Then the table methods and the recursive descent method will be analysed.
Our methods build the syntax tree top-down and read the symbols of the program left to right. To this end we try to create terminals on the left side of sentential forms.
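Definition 2.4 If $A \to \alpha \in P$ then the leftmost direct derivation of the sentential form $xA\beta$ ($x \in T^{*}$, $\alpha, \beta \in (N \cup T)^{*}$) is $x\alpha\beta$, and
$$xA\beta \underset{leftmost}{\Longrightarrow} x\alpha\beta.$$
Definition 2.5 If all of the direct derivations in $S \Rightarrow^{*} x$ ($x \in T^{*}$) are leftmost, then this derivation is said to be a leftmost derivation, and
$$S \underset{leftmost}{\overset{*}{\Longrightarrow}} x.$$
In a leftmost derivation terminal symbols appear at the left side of the sentential forms. Therefore we use leftmost derivations in all top-down parsing methods. Hence, when we deal with top-down methods, we do not write the text “leftmost” at the arrows.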
One might as well say that we create all possible syntax trees. Reading the leaves from left to right, we obtain sentences of the language. Then we compare these sentences with the text to be parsed, and if a sentence is the same as this text, then we can read the steps of parsing from the syntax tree which belongs to this sentence. But this method is not practical; generally it is even impossible to apply.
A good idea is the following. We start at the start symbol of the grammar, and using leftmost derivations we try to create the text of the program. If we use an unsuitable derivation at one of the steps of parsing, then we find that, at the next step, we cannot apply a proper derivation. In this case there are terminal symbols at the left side of the sentential form that are not the same as in the text to be parsed.
For the leftmost terminal symbols we state the theorem as follows.
The proof of this theorem is trivial. It is not possible to change the leftmost terminal symbols of sentential forms using derivation rules of a context free grammar.
This theorem is used during the building of the syntax tree, to check that the leftmost terminals of the tree are the same as the leftmost symbols of the text to be parsed. If they are different then we went in a wrong direction with this syntax tree. In this case we have to backtrack, and we have to apply another derivation rule. If this is impossible (since, for example, there are no more derivation rules) then we have to backtrack once again.
General top-down methods are realised by using backtrack algorithms, but these backtrack steps make the parser very slow. Therefore we will deal only with grammars that have parsing methods without backtracks.
The main properties of $LL(k)$ grammars are the following. If, by creating the leftmost derivation $S \Rightarrow^{*} wx$ ($w, x \in T^{*}$), we obtain the sentential form $S \Rightarrow^{*} wA\beta$ ($A \in N$, $\beta \in (N \cup T)^{*}$) at some step of this derivation, and our goal is to achieve $A\beta \Rightarrow^{*} x$, then the next step of the derivation for the nonterminal $A$ can be determined unambiguously from the first $k$ symbols of $x$.
To look ahead $k$ symbols we define the function $First_k$.
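Definition 2.7 Let $First_k(\alpha)$ ($k \geq 0$, $\alpha \in (N \cup T)^{*}$) be the set as follows.
$$First_k(\alpha) = \{x \mid \alpha \Rightarrow^{*} x\beta \text{ and } |x| = k\} \cup \{x \mid \alpha \Rightarrow^{*} x \text{ and } |x| < k\} \quad (x \in T^{*},\ \beta \in (N \cup T)^{*}).$$
The set $First_k(x)$ consists of the first $k$ symbols of $x$; for $|x| < k$, it consists of the full $x$. If $\alpha \Rightarrow^{*} \varepsilon$, then $\varepsilon \in First_k(\alpha)$.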
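Definition 2.8 The grammar $G$ is an $LL(k)$ grammar ($k \geq 0$), if for derivations
$$S \Rightarrow^{*} wA\beta \Rightarrow w\alpha_1\beta \Rightarrow^{*} wx,$$
$$S \Rightarrow^{*} wA\beta \Rightarrow w\alpha_2\beta \Rightarrow^{*} wy$$
($A \in N$, $x, y, w \in T^{*}$, $\alpha_1, \alpha_2, \beta \in (N \cup T)^{*}$) the equality
$$First_k(x) = First_k(y)$$
implies
$$\alpha_1 = \alpha_2.$$
Using this definition, if a grammar is an $LL(k)$ grammar then the $k$ symbols after the parsed $w$ determine the next derivation rule unambiguously (Figure 2.8).
One can see from this definition that if a grammar is an $LL(k_0)$ grammar then for all $k > k_0$ it is also an $LL(k)$ grammar. If we speak about an $LL(k)$ grammar then we also mean that $k$ is the least number such that the properties of the definition are true.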
Example 2.5 The next grammar is a grammar. Let be a grammar whose derivation rules are:
We have to use the derivation for the start symbol if the next symbol of the parseable text is or . We use the derivation if the next symbol is the mark #.
Example 2.6 The next grammar is a grammar. Let be a grammar whose the derivation rules are:
One can see that at the last step of derivations
and
if we look ahead one symbol, then in both derivations we obtain the symbol . The proper rule for symbol is determined to look ahead two symbols ( or ).
There are context free grammars that are not $LL(k)$ grammars. For example, the next grammar is not an $LL(k)$ grammar for any $k$.
Example 2.7 Let be a grammar whose the derivation rules are:
consists of sentences and . If we analyse the sentence , then at the first step we cannot decide by looking ahead $k$ symbols whether we have to use the derivation or , since for all .
By the definition of the grammar, if we get the sentential form using leftmost derivations, then the next symbol determines the next rule for symbol . This is stated in the next theorem.
Theorem 2.9 Grammar is a grammar iff
implies
If there is a rule in the grammar, then the set consists the length prefixes of terminal series generated from . It implies that, for deciding the property , we have to check not only the derivation rules, but also the infinite derivations.
We can give good methods that are used in practice for $LL(1)$ grammars only. We define the follower-series, which follow a symbol or series of symbols.
Definition 2.10 , and if , then () .
The second part of the definition is necessary because if there are no symbols after the in the derivation , that is , then the next symbol after is the mark # only.
() consists of terminal symbols that can be immediately after the symbol in the derivation
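Theorem 2.11 The grammar $G$ is an $LL(1)$ grammar iff, for all nonterminals $A$ and for all derivation rules $A \to \gamma \mid \delta$ ($\gamma \neq \delta$),
$$First_1(\gamma\, Follow_1(A)) \cap First_1(\delta\, Follow_1(A)) = \emptyset.$$
In this theorem the expression $First_1(\gamma\, Follow_1(A))$ means that we have to concatenate to $\gamma$ the elements of the set $Follow_1(A)$ separately, and for all elements of this new set we have to apply the function $First_1$.
It is evident that Theorem 2.11 is suitable to decide whether a grammar is $LL(1)$ or not.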
From now on we deal with languages determined by $LL(1)$ grammars, and we investigate the parsing methods of $LL(1)$ languages. For the sake of simplicity, we omit the indexes from the names of the functions $First_1$ and $Follow_1$.
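The elements of the set $First(\alpha)$ are determined using the next algorithm.
First($\alpha$)
 1 IF $\alpha = \varepsilon$
 2   THEN $F \leftarrow \{\varepsilon\}$
 3 IF $\alpha = a$, where $a \in T$
 4   THEN $F \leftarrow \{a\}$
 5 IF $\alpha = A$, where $A \in N$
 6   THEN IF $A \to \varepsilon \in P$
 7     THEN $F \leftarrow \{\varepsilon\}$
 8     ELSE $F \leftarrow \emptyset$
 9   FOR all $A \to Y_1 Y_2 \ldots Y_m \in P$ $(m \geq 1)$
10     DO $F \leftarrow F \cup (First(Y_1) \setminus \{\varepsilon\})$
11       FOR $k \leftarrow 1$ TO $m - 1$
12         DO IF $Y_1 Y_2 \ldots Y_k \Longrightarrow^{*} \varepsilon$
13           THEN $F \leftarrow F \cup (First(Y_{k+1}) \setminus \{\varepsilon\})$
14       IF $Y_1 Y_2 \ldots Y_m \Longrightarrow^{*} \varepsilon$
15         THEN $F \leftarrow F \cup \{\varepsilon\}$
16 IF $\alpha = Y_1 Y_2 \ldots Y_m$ $(m \geq 2)$
17   THEN $F \leftarrow First(Y_1) \setminus \{\varepsilon\}$
18     FOR $k \leftarrow 1$ TO $m - 1$
19       DO IF $Y_1 Y_2 \ldots Y_k \Longrightarrow^{*} \varepsilon$
20         THEN $F \leftarrow F \cup (First(Y_{k+1}) \setminus \{\varepsilon\})$
21     IF $Y_1 Y_2 \ldots Y_m \Longrightarrow^{*} \varepsilon$
22       THEN $F \leftarrow F \cup \{\varepsilon\}$
23 RETURN $F$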
In lines 1–4 the set is given for $\varepsilon$ and for a terminal symbol $a$. In lines 5–15 we construct the elements of this set for a nonterminal $A$. If $\varepsilon$ can be derived from $A$, then we put the symbol $\varepsilon$ into the set in lines 6–7 and 14–15. If the argument is a symbol stream, then the elements of the set are constructed in lines 16–22. We notice that we can terminate the FOR cycle in lines 11 and 18 as soon as the current symbol is a terminal, since in this case it is not possible to derive $\varepsilon$ from the prefix examined so far.
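The computation of the First sets can be sketched in Python. This is a minimal sketch, not the algorithm of the text verbatim: it replaces the recursive formulation by a fixed-point iteration, which also copes with recursive rules. The names `first_sets`, `first_of_string` and `EPS` are hypothetical, and the grammar is the expression grammar that the recursive-descent program later in this section is written for.

```python
EPS = ""  # stands for the empty word (epsilon)

def first_of_string(symbols, first):
    """First of a symbol string Y1 Y2 ... Ym (the case of lines 16-22)."""
    result = set()
    for sym in symbols:
        sym_first = first[sym] if sym in first else {sym}  # terminal a: {a}
        result |= sym_first - {EPS}
        if EPS not in sym_first:
            return result          # early exit: this symbol cannot vanish
    result.add(EPS)                # every Yi derives epsilon (or m = 0)
    return result

def first_sets(grammar):
    """First(A) for every nonterminal A, by iterating until nothing changes."""
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, alternatives in grammar.items():
            for rhs in alternatives:
                new = first_of_string(rhs, first)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first

# Nonterminal -> list of right-hand sides; () is the epsilon rule.
GRAMMAR = {
    "E":  [("T", "E'")],
    "E'": [("+", "T", "E'"), ()],
    "T":  [("F", "T'")],
    "T'": [("*", "F", "T'"), ()],
    "F":  [("(", "E", ")"), ("i",)],
}

FIRST = first_sets(GRAMMAR)
```

For this grammar `FIRST["E"]` contains the symbols `(` and `i`, while `FIRST["E'"]` contains `+` and the empty word.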
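In Theorem 2.11 and hereafter, it is necessary to know the elements of the set $Follow(A)$. The next algorithm constructs this set.
Follow($A$)
 1 IF $A = S$
 2   THEN $F \leftarrow \{\#\}$
 3   ELSE $F \leftarrow \emptyset$
 4 FOR all rules $B \to \alpha A \beta \in P$
 5   DO IF $|\beta| > 0$
 6     THEN $F \leftarrow F \cup (First(\beta) \setminus \{\varepsilon\})$
 7       IF $\beta \Longrightarrow^{*} \varepsilon$
 8         THEN $F \leftarrow F \cup Follow(B)$
 9     ELSE $F \leftarrow F \cup Follow(B)$
10 RETURN $F$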
The elements of the set $Follow(A)$ are collected in the set $F$ returned by the algorithm. In lines 4–9 we check, whenever the argument stands on the right side of a derivation rule, which symbols may stand immediately after it. It is obvious that $\varepsilon$ is never in this set, and the symbol # is in the set only if the argument is the rightmost symbol of a sentential form.
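The Follow sets can be sketched in the same way. This is an illustrative sketch that again uses fixed-point iteration instead of the recursive calls of the pseudocode; the helper names (`first_of`, `first_sets`, `follow_sets`) and the example grammar are assumptions of the sketch, not part of the text.

```python
EPS, END = "", "#"   # empty word and end mark

GRAMMAR = {
    "E":  [("T", "E'")],
    "E'": [("+", "T", "E'"), ()],
    "T":  [("F", "T'")],
    "T'": [("*", "F", "T'"), ()],
    "F":  [("(", "E", ")"), ("i",)],
}
START = "E"

def first_of(symbols, first):
    """First of a symbol string; terminals are their own First set."""
    out = set()
    for s in symbols:
        f = first.get(s, {s})
        out |= f - {EPS}
        if EPS not in f:
            return out
    return out | {EPS}

def first_sets(g):
    first = {a: set() for a in g}
    while True:
        before = sum(len(v) for v in first.values())
        for a, alts in g.items():
            for rhs in alts:
                first[a] |= first_of(rhs, first)
        if sum(len(v) for v in first.values()) == before:
            return first

def follow_sets(g, start):
    first = first_sets(g)
    follow = {a: set() for a in g}
    follow[start].add(END)                       # lines 1-2
    changed = True
    while changed:
        changed = False
        for b, alts in g.items():                # rules B -> alpha A beta
            for rhs in alts:
                for pos, a in enumerate(rhs):
                    if a not in g:
                        continue                 # only nonterminals have Follow
                    tail = first_of(rhs[pos + 1:], first)
                    new = tail - {EPS}           # line 6
                    if EPS in tail:              # beta is empty or vanishes
                        new |= follow[b]         # lines 8-9
                    if not new <= follow[a]:
                        follow[a] |= new
                        changed = True
    return follow

FOLLOW = follow_sets(GRAMMAR, START)
```

The result agrees with the sets quoted in the comments of the recursive-descent program later in this section: `FOLLOW["E'"]` holds `)` and `#`, and `FOLLOW["T'"]` holds `+`, `)` and `#`.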
Suppose that we analyse a series of terminal symbols and the part has already been analysed without errors. We analyse the text with a top-down method, so we use leftmost derivations. Suppose that our sentential form is , that is, it has form or () (Figure 2.9).
In the first case the next step is the substitution of symbol . We know the next element of the input series, this is the terminal , therefore we can determine the correct substitution of symbol . This substitution is the rule for which . If there is such a rule then, according to the definition of $LL(1)$ grammar, there is exactly one. If there is no such rule, then a syntactic error has been found.
In the second case the next symbol of the sentential form is the terminal symbol , thus we expect the symbol as the next symbol of the analysed text. If this holds, that is, , then the symbol is a correct symbol and we can go further. We put the symbol into the already analysed text. If , then there is a syntactic error. We can see that the position of the error is known, and the erroneous symbol is the terminal symbol .
The action of the parser is the following. Let # be the sign of the right end of the analysed text, that is, the mark # is the last symbol of the text. We use a stack during the analysis; the bottom of the stack is marked by #, too. We give serial numbers to the derivation rules, and during the analysis we write the number of each applied rule into a list. At the end of parsing we can build the syntax tree from this list (Figure 2.10).
We denote the state of the parser by triples $(ay\#, X\alpha\#, v)$. The symbol $ay\#$ is the text not analysed yet. $X\alpha\#$ is the part of the sentential form corresponding to the not analysed text; this information is in the stack, the symbol $X$ is at the top of the stack. $v$ is the list of the serial numbers of production rules.
When we analyse the text, we observe the symbol $X$ at the top of the stack and the symbol $a$ that is the first symbol of the not analysed text. The symbol $a$ is called the actual symbol. There are pointers to the top of the stack and to the actual symbol.
We use a top-down parser, therefore the initial content of the stack is $S\#$. If the initial analysed text is $xay$, then the initial state of the parsing process is the triple $(xay\#, S\#, \varepsilon)$, where $\varepsilon$ is the sign of the empty list.
We analyse the text, the series of symbols, using a parsing table. The rows of this table are labelled by the symbols at the top of the stack, the columns of the table are labelled by the next input symbols, and we write the mark # to the last row and the last column of the table. Hence the number of rows of the table is greater by one than the number of symbols of the grammar, and the number of columns is greater by one than the number of terminal symbols of the grammar.
The element $T[X, a]$ of the table is as follows.
$T[X,a] = \begin{cases} (\beta, i), & \text{if } X \to \beta \text{ is the } i\text{-th rule and } a \in First(\beta), \text{ or } \varepsilon \in First(\beta) \text{ and } a \in Follow(X), \\ pop, & \text{if } X = a, \\ accept, & \text{if } X = \# \text{ and } a = \#, \\ error & \text{otherwise.} \end{cases}$
We fill in the parsing table using the following algorithm.
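LL(1)-Table-Fill-in($G$)
 1 FOR all $A \in N$
 2   DO IF $A \to \alpha \in P$ is the $i$-th rule
 3     THEN FOR all $a \in First(\alpha)$
 4       DO $T[A, a] \leftarrow (\alpha, i)$
 5     IF $\varepsilon \in First(\alpha)$
 6       THEN FOR all $a \in Follow(A)$
 7         DO $T[A, a] \leftarrow (\alpha, i)$
 8 FOR all $a \in T$
 9   DO $T[a, a] \leftarrow$ pop
10 $T[\#, \#] \leftarrow$ accept
11 FOR all $X \in (N \cup T \cup \{\#\})$ and all $a \in (T \cup \{\#\})$
12   DO IF $T[X, a] =$ "empty"
13     THEN $T[X, a] \leftarrow$ error
14 RETURN $T$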
In line 10 we write the text accept into the lower right corner of the table. In lines 8–9 we write the text pop into the main diagonal of the square labelled by terminal symbols. Lines 1–7 write a tuple in which the first element is the right part of a derivation rule and the second element is the serial number of this rule. In lines 12–13 we write error texts into the empty positions.
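The table-filling steps can be sketched as follows. The First sets of the right-hand sides and the Follow sets are supplied as data; the rule numbering, the dictionary representation of the table and the conflict check in `put` are assumptions of this sketch and do not appear in the pseudocode. Missing keys play the role of the error entries.

```python
EPS, END = "", "#"

# Numbered rules (left side, right side); the numbering is an assumption.
RULES = [
    ("E",  ("T", "E'")),        # 1
    ("E'", ("+", "T", "E'")),   # 2
    ("E'", ()),                 # 3  epsilon rule
    ("T",  ("F", "T'")),        # 4
    ("T'", ("*", "F", "T'")),   # 5
    ("T'", ()),                 # 6  epsilon rule
    ("F",  ("(", "E", ")")),    # 7
    ("F",  ("i",)),             # 8
]
TERMINALS = ["+", "*", "(", ")", "i"]

# First set of each right-hand side (by rule number) and the Follow sets
# that are needed for the epsilon rules.
FIRST_RHS = {1: {"(", "i"}, 2: {"+"}, 3: {EPS}, 4: {"(", "i"},
             5: {"*"}, 6: {EPS}, 7: {"("}, 8: {"i"}}
FOLLOW = {"E'": {")", END}, "T'": {"+", ")", END}}

def fill_ll1_table(rules, terminals, first_rhs, follow):
    table = {}

    def put(row, col, entry):
        # Two different entries in one cell mean the grammar is not LL(1).
        if table.get((row, col), entry) != entry:
            raise ValueError(f"conflict at T[{row},{col}]")
        table[(row, col)] = entry

    for number, (lhs, rhs) in enumerate(rules, start=1):
        for a in first_rhs[number] - {EPS}:
            put(lhs, a, (rhs, number))
        if EPS in first_rhs[number]:
            for a in follow[lhs]:
                put(lhs, a, (rhs, number))
    for a in terminals:
        put(a, a, "pop")
    put(END, END, "accept")
    return table

TABLE = fill_ll1_table(RULES, TERMINALS, FIRST_RHS, FOLLOW)
```

A dictionary keyed by (row, column) keeps the sketch short; a two-dimensional array indexed by symbol codes would be the more usual choice in a real parser.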
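The actions of the parser are described by state transitions. The initial state is $(x\#, S\#, \varepsilon)$, where the initial text is $x$, and the parsing process finishes when the parser reaches the state $(\#, \#, w)$; this state is the final state. If the text is $ay\#$ in an intermediate step, and the symbol $X$ is at the top of the stack, then the possible state transitions are as follows.
$(ay\#, X\alpha\#, v) \to \begin{cases} (ay\#, \beta\alpha\#, vi), & \text{if } T[X,a] = (\beta, i), \\ (y\#, \alpha\#, v), & \text{if } T[X,a] = pop, \\ \text{O.K.}, & \text{if } T[X,a] = accept, \\ \text{ERROR}, & \text{if } T[X,a] = error. \end{cases}$
The letters O.K. mean that the analysed text is syntactically correct; the text ERROR means that a syntactic error is detected.
The actions of this parser are described by the next algorithm.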
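LL(1)-Parser($xay\#, T$)
 1 $s \leftarrow (xay\#, S\#, \varepsilon)$, $s' \leftarrow$ analyze
 2 REPEAT
 3   IF $s = (ay\#, A\alpha\#, v)$ and $T[A, a] = (\beta, i)$
 4     THEN $s \leftarrow (ay\#, \beta\alpha\#, vi)$
 5     ELSE IF $s = (ay\#, a\alpha\#, v)$
 6       THEN $s \leftarrow (y\#, \alpha\#, v)$      ▷ Then $T[a, a] =$ pop.
 7       ELSE IF $s = (\#, \#, v)$
 8         THEN $s' \leftarrow$ O.K.      ▷ Then $T[\#, \#] =$ accept.
 9         ELSE $s' \leftarrow$ ERROR      ▷ Then $T[A, a] =$ error.
10 UNTIL $s' =$ O.K. OR $s' =$ ERROR
11 RETURN $s', s$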
The input parameters of this algorithm are the text $xay$ and the parsing table $T$. The variable $s'$ describes the state of the parser: its value is analyze during the analysis, and it is either O.K. or ERROR at the end. The parser determines its action from the actual symbol $a$ and from the symbol at the top of the stack, using the parsing table $T$. In lines 3–4 the parser builds the syntax tree using the derivation rule $A \to \beta$. In lines 5–6 the parser executes a shift action, since there is a symbol $a$ at the top of the stack. In lines 8–9 the algorithm finishes its work: if the stack is empty and the parser is at the end of the text, the text is correct, otherwise a syntactic error was detected. At the end the variable $s'$ holds O.K. or ERROR, and the triple $s$ is also part of the output of the algorithm. If the text was correct, then we can create the syntax tree of the analysed text from the third element of the triple. If there was an error, then the first element of the triple points to the position of the erroneous symbol.
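The parser loop can be sketched with an explicit stack. The table below is written out by hand for the expression grammar that the recursive-descent program later in this section is built on; the rule numbers, the function name `ll1_parse` and the returned values are assumptions of the sketch. Table cells that are absent stand for error.

```python
END = "#"
NONTERMINALS = {"E", "E'", "T", "T'", "F"}

# T[A, a] = (right-hand side, rule number); the numbering is assumed.
TABLE = {
    ("E", "("): (("T", "E'"), 1), ("E", "i"): (("T", "E'"), 1),
    ("E'", "+"): (("+", "T", "E'"), 2),
    ("E'", ")"): ((), 3), ("E'", END): ((), 3),
    ("T", "("): (("F", "T'"), 4), ("T", "i"): (("F", "T'"), 4),
    ("T'", "*"): (("*", "F", "T'"), 5),
    ("T'", "+"): ((), 6), ("T'", ")"): ((), 6), ("T'", END): ((), 6),
    ("F", "("): (("(", "E", ")"), 7), ("F", "i"): (("i",), 8),
}

def ll1_parse(text, table, start="E"):
    """Return ("O.K.", list of applied rules) or ("ERROR", error position)."""
    symbols = list(text) + [END]
    stack = [END, start]              # the top of the stack is the list end
    applied = []
    pos = 0
    while True:
        top, actual = stack[-1], symbols[pos]
        if top in NONTERMINALS:
            entry = table.get((top, actual))
            if entry is None:
                return "ERROR", pos
            rhs, number = entry
            stack.pop()
            stack.extend(reversed(rhs))   # leftmost symbol ends up on top
            applied.append(number)
        elif top == actual == END:
            return "O.K.", applied        # accept
        elif top == actual:
            stack.pop()                   # pop: the terminal matches
            pos += 1
        else:
            return "ERROR", pos

RESULT = ll1_parse("i+i*i", TABLE)
```

The pop and accept entries of the table are handled directly by the comparisons `top == actual`, so only the rows of the nonterminals need to be stored.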
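Example 2.8 Let $G$ be the grammar $G = (\{E, E', T, T', F\}, \{+, *, (, ), i\}, P, E)$, where the set $P$ of derivation rules is:
$E \to TE'$
$E' \to +TE' \mid \varepsilon$
$T \to FT'$
$T' \to *FT' \mid \varepsilon$
$F \to (E) \mid i$
From these rules we can determine the $Follow(A)$ sets. To fill in the parsing table, the following sets are required:
$First(TE') = \{(, i\}$,
$First(+TE') = \{+\}$,
$First(FT') = \{(, i\}$,
$First(*FT') = \{*\}$,
$First((E)) = \{(\}$,
$First(i) = \{i\}$,
$Follow(E') = \{), \#\}$,
$Follow(T') = \{+, ), \#\}$.
The parsing table is as follows. The empty positions in the table mean errors.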
Example 2.9 Using the parsing table of the previous example, analyse the text .
The syntax tree of the analysed text is shown in Figure 2.11.
There is another frequently used method for backtrack-free top-down parsing. Its essence is that we write a real program for the applied grammar. We create procedures for the symbols of the grammar, and the recursive calls of these procedures realize the stack of the parser and the stack management. This is a top-down parsing method, and the procedures call each other recursively; this is the origin of the name of the method, recursive-descent method.
To check the terminal symbols we create the procedure Check. Let the parameter of this procedure be the expected symbol, that is, the leftmost unchecked terminal symbol of the sentential form, and let the actual symbol be the symbol which is analysed at that moment.
procedure Check(a);
begin
  if actual_symbol = a
  then Next_symbol
  else Error_report
end;
The procedure Next_symbol reads the next symbol; it is a call to the lexical analyser. This procedure determines the next symbol and puts it into the actual_symbol variable. The procedure Error_report creates an error report and then finishes the parsing.
We create procedures for the symbols of the grammar as follows. The procedure of the nonterminal symbol $A$ is the next.
procedure A;
begin
  T(A)
end;
where T(A) is determined by the symbols on the right side of the derivation rules having the symbol $A$ on their left side.
The grammars which are used for syntactic analysis are reduced grammars. This means that there are no unnecessary symbols in the grammar, and every nonterminal symbol occurs on the left side of at least one derivation rule. Therefore, if we consider the symbol $A$, there is at least one production rule $A \to \alpha$.
If there is only one production rule for the symbol $A$:
for the rule $A \to a$ the program T(A) is the call Check(a),
for the rule $A \to B$ we give the procedure call B,
for the rule $A \to X_1 X_2 \ldots X_n$ $(n \geq 2)$ we give the next block:
begin
T(X_1);
T(X_2);
...
T(X_n)
end;
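If there are more rules for the symbol $A$:
If the rules $A \to \alpha_1 \mid \alpha_2 \mid \ldots \mid \alpha_n$ are $\varepsilon$-free, that is, from $\alpha_i$ $(1 \leq i \leq n)$ it is not possible to deduce $\varepsilon$, then T(A) is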
case actual_symbol of
First(alpha_1) : T(alpha_1);
First(alpha_2) : T(alpha_2);
...
First(alpha_n) : T(alpha_n)
end;
where First(alpha_i) is the sign of the set $First(\alpha_i)$.
We note that this is the first point of the recursive-descent method where we use the fact that the grammar is an $LL(1)$ grammar.
We use the grammar to describe a programming language, therefore it is not convenient to require that the grammar be $\varepsilon$-free. For the rules $A \to \alpha_1 \mid \alpha_2 \mid \ldots \mid \alpha_{n-1} \mid \varepsilon$ we create the next T(A) program:
case actual_symbol of
First(alpha_1) : T(alpha_1);
First(alpha_2) : T(alpha_2);
...
First(alpha_(n-1)) : T(alpha_(n-1));
Follow(A) : skip
end;
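where Follow(A) is the set $Follow(A)$.
In particular, if for the rules $A \to \alpha_1 \mid \alpha_2 \mid \ldots \mid \alpha_n$ we have $\alpha_i \Longrightarrow^{*} \varepsilon$ for some $i$ $(1 \leq i \leq n)$, that is, $\varepsilon \in First(\alpha_i)$, then the $i$-th row of the case statement is
Follow(A) : skip
In the program T(A), if it is possible, we use an if-then-else or while statement instead of the case statement.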
The start procedure, that is the main program of this parsing program, is the procedure which is created for the start symbol of the grammar.
We can create the recursive-descent parsing program with the next algorithm. The input of this algorithm is the grammar $G$, and the result is the parsing program $P$. In this algorithm we use a WRITE-PROGRAM procedure, which concatenates the new program lines to the program $P$. We will not go into the details of this algorithm.
Create-Rec-Desc($G$)
 1 $P \leftarrow \emptyset$
 2 WRITE-PROGRAM(
 3   procedure Check(a);
 4   begin
 5     if actual_symbol = a
 6     then Next_symbol
 7     else Error_report
 8   end;
 9 )
10 FOR all symbols $A \in N$ of the grammar $G$
11   DO IF $A = S$
12     THEN WRITE-PROGRAM(
13       program S;
14       begin
15         REC-DESC-STAT($S, P$)
16       end;
17     )
18     ELSE WRITE-PROGRAM(
19       procedure A;
20       begin
21         REC-DESC-STAT($A, P$)
22       end;
23     )
24 RETURN $P$
The algorithm creates the Check procedure in lines 2–9. Then, for all nonterminals of the grammar $G$, it determines their procedures using the algorithm REC-DESC-STAT. In lines 11–17 we can see that for the start symbol $S$ we create the main program. The output of the algorithm is the parsing program.
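Rec-Desc-Stat($A, P$)
1 IF there is only one rule $A \to \alpha$
2   THEN REC-DESC-STAT1($\alpha, P$)      ▷ $A \to \alpha$.
3   ELSE REC-DESC-STAT2($A, (\alpha_1, \ldots, \alpha_n), P$)      ▷ $A \to \alpha_1 \mid \cdots \mid \alpha_n$.
4 RETURN $P$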
The form of the statements of the parsing program depends on the derivation rules of the symbol $A$. Therefore the algorithm REC-DESC-STAT divides the task into two parts. The algorithm REC-DESC-STAT1 deals with the case when there is only one derivation rule, and the algorithm REC-DESC-STAT2 creates the program for the alternatives.
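Rec-Desc-Stat1($\alpha, P$)
 1 IF $\alpha = a$
 2   THEN WRITE-PROGRAM(
 3     Check(a)
 4   )
 5 IF $\alpha = B$
 6   THEN WRITE-PROGRAM(
 7     B
 8   )
 9 IF $\alpha = X_1 X_2 \ldots X_n$ $(n \geq 2)$
10   THEN WRITE-PROGRAM(
11     begin
12       REC-DESC-STAT1($X_1, P$);
13       REC-DESC-STAT1($X_2, P$);
14       ...
15       REC-DESC-STAT1($X_n, P$)
16     end;
   )
17 RETURN $P$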
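Rec-Desc-Stat2($A, (\alpha_1, \ldots, \alpha_n), P$)
 1 IF the rules $\alpha_1, \ldots, \alpha_n$ are $\varepsilon$-free
 2   THEN WRITE-PROGRAM(
 3     case actual_symbol of
 4       First(alpha_1) : REC-DESC-STAT1($\alpha_1, P$);
 5       ...
 6       First(alpha_n) : REC-DESC-STAT1($\alpha_n, P$)
 7     end;
 8   )
 9 IF there is an $\varepsilon$-rule, $\alpha_i = \varepsilon$ $(1 \leq i \leq n)$
10   THEN WRITE-PROGRAM(
11     case actual_symbol of
12       First(alpha_1) : REC-DESC-STAT1($\alpha_1, P$);
13       ...
14       First(alpha_(i-1)) : REC-DESC-STAT1($\alpha_{i-1}, P$);
15       Follow(A) : skip;
16       First(alpha_(i+1)) : REC-DESC-STAT1($\alpha_{i+1}, P$);
17       ...
18       First(alpha_n) : REC-DESC-STAT1($\alpha_n, P$)
19     end;
20   )
21 RETURN $P$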
These two algorithms create the program described above.
The recursive-descent parsing method checks the end of the parsed text with the next modification. We generate a new derivation rule for the end mark #. If the start symbol of the grammar is $S$, then we create the new rule $S' \to S\#$, where the new symbol $S'$ is the start symbol of our new grammar. The mark # is considered a terminal symbol. Then we generate the parsing program for this new grammar.
Example 2.10 We augment the grammar of Example 2.8 in the above manner. The production rules are as follows.
$S' \to E\#$
$E \to TE'$
$E' \to +TE' \mid \varepsilon$
$T \to FT'$
$T' \to *FT' \mid \varepsilon$
$F \to (E) \mid i$
In Example 2.8 we gave the necessary $First$ and $Follow$ sets. We use the next sets:
$First(+TE') = \{+\}$,
$Follow(E') = \{), \#\}$,
$First(*FT') = \{*\}$,
$Follow(T') = \{+, ), \#\}$,
$First((E)) = \{(\}$,
$First(i) = \{i\}$.
In the comments of the program lines we indicate where these sets are used. The first characters of a comment are the character pair --.
The program of the recursive-descent parser is the following.
program S';
begin
  E;
  Check(#)
end.
procedure E;
begin
  T;
  E'
end;
procedure E';
begin
  case actual_symbol of
    + : begin                -- First(+TE')
          Check(+);
          T;
          E'
        end;
    ),# : skip               -- Follow(E')
  end
end;
procedure T;
begin
  F;
  T'
end;
procedure T';
begin
  case actual_symbol of
    * : begin                -- First(*FT')
          Check(*);
          F;
          T'
        end;
    +,),# : skip             -- Follow(T')
  end
end;
procedure F;
begin
  case actual_symbol of
    ( : begin                -- First((E))
          Check(();
          E;
          Check())
        end;
    i : Check(i)             -- First(i)
  end
end;
We can see that the main program of this parser belongs to the symbol $S'$.
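The same parser can be written as a runnable Python sketch. The class and method names (`Parser`, `check`, `s_prime`, `e_prime`, and so on) and the use of an exception for Error_report are assumptions of this translation; the lexical analyser is replaced by a list of single-character symbols.

```python
class ParseError(Exception):
    """Plays the role of Error_report: reports the error and stops parsing."""

class Parser:
    def __init__(self, text):
        self.symbols = list(text) + ["#"]   # end mark appended to the text
        self.pos = 0

    @property
    def actual_symbol(self):
        return self.symbols[self.pos]

    def check(self, a):
        if self.actual_symbol == a:
            self.pos += 1                    # Next_symbol
        else:
            raise ParseError(f"expected {a!r} at position {self.pos}")

    def s_prime(self):                       # S' -> E #
        self.e()
        self.check("#")

    def e(self):                             # E -> T E'
        self.t()
        self.e_prime()

    def e_prime(self):
        if self.actual_symbol == "+":        # First(+TE')
            self.check("+")
            self.t()
            self.e_prime()
        elif self.actual_symbol in (")", "#"):   # Follow(E'): skip
            pass
        else:
            raise ParseError(f"unexpected symbol at position {self.pos}")

    def t(self):                             # T -> F T'
        self.f()
        self.t_prime()

    def t_prime(self):
        if self.actual_symbol == "*":        # First(*FT')
            self.check("*")
            self.f()
            self.t_prime()
        elif self.actual_symbol in ("+", ")", "#"):   # Follow(T'): skip
            pass
        else:
            raise ParseError(f"unexpected symbol at position {self.pos}")

    def f(self):
        if self.actual_symbol == "(":        # First((E))
            self.check("(")
            self.e()
            self.check(")")
        elif self.actual_symbol == "i":      # First(i)
            self.check("i")
        else:
            raise ParseError(f"unexpected symbol at position {self.pos}")

def accepts(text):
    try:
        Parser(text).s_prime()
        return True
    except ParseError:
        return False
```

Calling `accepts("i+i*i")` runs the main program belonging to $S'$; the call stack of the Python interpreter takes over the role of the parser stack.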
If we analyse bottom-up, then we start with the program text. We search for the handle of the sentential form, and we substitute for this handle the nonterminal symbol that belongs to it. After this first step, we repeat this procedure several times. Our goal is to reach the start symbol of the grammar. This symbol will be the root of the syntax tree, and by this time the terminal symbols of the program text are the leaves of the tree.
First we review the notions which are necessary in the parsing.
To analyse bottom-up, we have to determine the handle of the sentential form. The problem is to create a good method which finds the handle, and to find the best substitution if there is more than one possibility.
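Definition 2.12 If $A \to \alpha \in P$, then the rightmost substitution of the sentential form $\gamma A x$ ($x \in T^{*}$, $\alpha, \gamma \in (N \cup T)^{*}$) is $\gamma\alpha x$, that is, $\gamma A x \Longrightarrow_{rightmost} \gamma\alpha x$.
Definition 2.13 If in the derivation $S \Longrightarrow^{*} x$ ($x \in T^{*}$) all of the substitutions were rightmost substitutions, then this derivation is a rightmost derivation, $S \Longrightarrow^{*}_{rightmost} x$.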
In a rightmost derivation, terminal symbols are at the right side of the sentential form. By the connection of the notion of the handle and the rightmost derivation, if we apply the steps of a rightmost derivation backwards, then we obtain the steps of a bottom-up parsing. Hence the bottom-up parsing is equivalent with the “inverse” of a rightmost derivation. Therefore, if we deal with bottom-up methods, we will not write the text “rightmost” at the arrows.
General bottom-up parsing methods are realized by using backtrack algorithms. They are similar to the top-down parsing methods. But the backtrack steps make the parser very slow. Therefore we only deal with grammars that have parsing methods without backtracks.
Henceforth we present a very efficient algorithm for a large class of context-free grammars. This class contains the grammars of the programming languages.
The parsing is called $LR(k)$ parsing; the grammar is called $LR(k)$ grammar. $LR$ means the "Left to Right" method, and $k$ means that if we look ahead $k$ symbols then we can determine the handles of the sentential forms. The $LR(k)$ parsing method is a shift-reduce method.
We deal with $LR(1)$ parsing only, since for every $LR(k)$ $(k > 1)$ grammar there is an equivalent $LR(1)$ grammar. This fact is very important for us since, using this type of grammars, it is enough to look ahead one symbol in all cases.
Creating $LR(k)$ parsers is not an easy task. However, there are standard programs (for example yacc in UNIX systems) that create the complete parsing program from the derivation rules of a grammar. Using these programs the task of writing parsers is not too hard.
After studying the $LR(1)$ grammars we will deal with the $LALR(1)$ parsing method. This method is used in the compilers of modern programming languages.
As we did previously, we write a mark # to the right end of the text to be analysed. We introduce a new nonterminal symbol $S'$ and a new rule $S' \to S$ into the grammar.
Definition 2.14 Let $G'$ be the augmented grammar belonging to the grammar $G = (N, T, P, S)$, where $G' = (N \cup \{S'\}, T, P \cup \{S' \to S\}, S')$.
Assign serial numbers to the derivation rules of the grammar, and let $S' \to S$ be the 0th rule. Using this numbering, if we apply the 0th rule, it means that the parsing process is concluded and the text is correct.
We notice that if the original start symbol $S$ does not occur on the right side of any rule, then there is no need for this augmentation. However, for the sake of generality, we deal with augmented grammars only.
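Definition 2.15 The augmented grammar $G'$ is an LR(k) grammar $(k \geq 0)$, if for derivations
$S' \Longrightarrow^{*} \alpha A w \Longrightarrow \alpha\beta w$,
$S' \Longrightarrow^{*} \gamma B x \Longrightarrow \gamma\delta x = \alpha\beta y$
($A, B \in N$, $x, y, w \in T^{*}$, $\alpha, \beta, \gamma, \delta \in (N \cup T)^{*}$) the equality
$First_k(w) = First_k(y)$
implies
$\alpha = \gamma$, $A = B$ and $x = y$.
The feature of $LR(k)$ grammars is that, in the sentential form $\alpha\beta w$, looking ahead $k$ symbols from $w$ unambiguously decides whether $\beta$ is the handle or not. If the handle is $\beta$, then we have to reduce the form using the rule $A \to \beta$, which results in the new sentential form $\alpha A w$. The reason is the following: suppose that, for sentential forms $\alpha\beta w$ and $\alpha\beta y$ (their prefixes $\alpha\beta$ are the same), $First_k(w) = First_k(y)$, and we can reduce $\alpha\beta w$ to $\alpha A w$ and $\alpha\beta y$ to $\gamma B x$. In this case, since the grammar is an $LR(k)$ grammar, $\alpha = \gamma$ and $A = B$ hold. Therefore in this case either the handle is $\beta$ or $\beta$ is never the handle.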
Example 2.11 Let be a grammar and let the derivation rules be as follows.
This grammar is not an grammar, since using notations of the definition, in the derivations
it holds that , and .
Example 2.12 The next grammar is a grammar. , the derivation rules are:
.
In the next example we show that there is a context-free grammar that is not an $LR(k)$ grammar for any $k$ $(k \geq 0)$.
Example 2.13 Let be a grammar and let the derivation rules be
Now for all
and
but
It is not certain that, for an $LL(k)$ $(k > 1)$ grammar, we can find an equivalent $LL(1)$ grammar. However, $LR(k)$ grammars have this nice property.
Theorem 2.16 For every $LR(k)$ $(k > 1)$ grammar there is an equivalent $LR(1)$ grammar.
The great significance of this theorem is that it makes it sufficient to study the $LR(1)$ grammars instead of the $LR(k)$ $(k > 1)$ grammars.
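Now we define a very important notion of $LR$ parsing.
Definition 2.17 If $\beta$ is the handle of the sentential form $\alpha\beta x$ ($\alpha, \beta \in (N \cup T)^{*}$, $x \in T^{*}$), then the prefixes of $\alpha\beta$ are the viable prefixes of $\alpha\beta x$.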
Example 2.14 Let be a grammar and the derivation rule as follows.
is a sentential form, and the first is the handle. The viable prefixes of this sentential form are .
By the above definition, symbols after the handle are not parts of any viable prefix. Hence the task of finding the handle is the task of finding the longest viable prefix.
For a given grammar, the set of viable prefixes is determined, but it is obvious that the size of this set is not always finite.
The significance of viable prefixes is the following. We can assign states of a deterministic finite automaton to viable prefixes, and we can assign state transitions to the symbols of the grammar. From the initial state we go to a state along the symbols of a viable prefix. Using this property, we will give a method to create an automaton that executes the task of parsing.
Definition 2.18 If $A \to \alpha\beta$ is a rule of a grammar $G'$, then let
$[A \to \alpha.\beta, a]$ $(a \in T \cup \{\#\})$
be an LR(1)-item, where $A \to \alpha.\beta$ is the core of the $LR(1)$-item, and $a$ is the lookahead symbol of the $LR(1)$-item.
The lookahead symbol plays a role in reductions, i.e. when the item has the form $[A \to \alpha., a]$. It means that we can execute the reduction only if the symbol $a$ follows the handle $\alpha$.
Definition 2.19 The $LR(1)$-item $[A \to \alpha.\beta, a]$ is valid for the viable prefix $\gamma\alpha$ if
$S' \Longrightarrow^{*} \gamma A x \Longrightarrow \gamma\alpha\beta x$ ($\gamma \in (N \cup T)^{*}$, $x \in T^{*}$),
and $a$ is the first symbol of $x$, or if $x = \varepsilon$ then $a = \#$.
Example 2.15 Let a grammar and the derivation rules as follows.
Using these rules, we can derive . Here is a viable prefix, and is valid for this viable prefix. Similarly, , and -item is valid for viable prefix .
To create an $LR(1)$ parser, we construct the canonical sets of $LR(1)$-items. To achieve this we have to define the closure and read functions.
Definition 2.20 Let the set $H$ be a set of $LR(1)$-items for a given grammar. The set closure($H$) consists of the next $LR(1)$-items:
1. every element of the set $H$ is an element of the set closure($H$),
2. if $[A \to \alpha.B\beta, a] \in$ closure($H$), and $B \to \gamma$ is a derivation rule of the grammar, then $[B \to .\gamma, b] \in$ closure($H$) for all $b \in First(\beta a)$,
3. the set closure($H$) has to be expanded using step 2 until no more items can be added to it.
By the definitions, if the $LR(1)$-item $[A \to \alpha.B\beta, a]$ is valid for the viable prefix $\delta\alpha$, then the $LR(1)$-item $[B \to .\gamma, b]$ is valid for the same viable prefix in the case of $b \in First(\beta a)$ (Figure 2.14). It is obvious that the function closure creates all of the $LR(1)$-items which are valid for the viable prefix $\delta\alpha$.
We can define the function closure($H$), i.e. the closure of the set $H$, by the following algorithm. The result of this algorithm is the set $K$.
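Closure-Set-of-Items($H$)
1 $K \leftarrow \emptyset$
2 FOR all $LR(1)$-items $E \in H$
3   DO $K \leftarrow K \cup$ CLOSURE-ITEM($E$)
4 RETURN $K$
Closure-Item($E$)
 1 $K_E \leftarrow \{E\}$
 2 IF the $LR(1)$-item $E$ has form $[A \to \alpha.B\beta, a]$
 3   THEN $I \leftarrow \emptyset$
 4     $J \leftarrow K_E$
 5     REPEAT
 6       FOR all $LR(1)$-items in $J$ which have form $[C \to \gamma.D\delta, b]$
 7         DO FOR all rules $D \to \eta \in P$
 8           DO FOR all symbols $c \in First(\delta b)$
 9             DO $I \leftarrow I \cup [D \to .\eta, c]$
10       $J \leftarrow I$
11       IF $I \neq \emptyset$
12         THEN $K_E \leftarrow K_E \cup I$
13           $I \leftarrow \emptyset$
14     UNTIL $J = \emptyset$
15 RETURN $K_E$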
The algorithm CLOSURE-ITEM creates $K_E$, the closure of the item $E$. If, in the argument $E$, the "point" is followed by a terminal symbol, then the result is this item only (line 1). If in $E$ the "point" is followed by a nonterminal symbol $B$, then we can create new items from every rule having the symbol $B$ on its left side (line 9). We have to check this condition for all new items, too; the REPEAT cycle is in lines 5–14. These steps are executed until no more items can be added (line 14). The set $J$ contains the items to be checked, the set $I$ contains the new items. We can find the operation $J \leftarrow I$ in line 10.
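The closure computation can be sketched with a work list instead of the two sets of the pseudocode. The grammar used here ($S' \to S$, $S \to AA$, $A \to aA \mid b$) is an assumption chosen for illustration, and so are the names `Item`, `first_of_symbol` and `closure`. The sketch also assumes a grammar without $\varepsilon$-rules, so that the First set of a string is the First set of its first symbol.

```python
from collections import namedtuple

# [lhs -> rhs[:dot] . rhs[dot:], look]
Item = namedtuple("Item", "lhs rhs dot look")

GRAMMAR = {                      # assumed example grammar, no epsilon rules
    "S'": [("S",)],
    "S":  [("A", "A")],
    "A":  [("a", "A"), ("b",)],
}

def first_of_symbol(sym, grammar, seen=frozenset()):
    """First set of a single symbol; valid only without epsilon rules."""
    if sym not in grammar:
        return {sym}                         # terminal
    if sym in seen:
        return set()                         # already being expanded
    out = set()
    for rhs in grammar[sym]:
        out |= first_of_symbol(rhs[0], grammar, seen | {sym})
    return out

def closure(items, grammar):
    result = set(items)
    todo = list(items)                       # items still to be checked
    while todo:
        item = todo.pop()
        if item.dot == len(item.rhs):
            continue                         # point at the end: nothing to add
        b = item.rhs[item.dot]
        if b not in grammar:
            continue                         # point before a terminal
        beta = item.rhs[item.dot + 1:]
        # First(beta a): first symbol of beta, or the lookahead if beta is empty
        looks = first_of_symbol(beta[0], grammar) if beta else {item.look}
        for gamma in grammar[b]:
            for c in looks:
                new = Item(b, gamma, 0, c)
                if new not in result:
                    result.add(new)
                    todo.append(new)
    return result

H0 = closure({Item("S'", ("S",), 0, "#")}, GRAMMAR)
```

For the assumed grammar `H0` has six items: the start item, $[S \to .AA, \#]$, and the four items of $A$ with lookaheads $a$ and $b$.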
Definition 2.21 Let $H$ be a set of $LR(1)$-items for the grammar $G$. Then the set read($H, X$) ($X \in (N \cup T)$) consists of the following $LR(1)$-items.
1. if $[A \to \alpha.X\beta, a] \in H$, then all items of the set closure($[A \to \alpha X.\beta, a]$) are in read($H, X$),
2. the set read($H, X$) is extended using step 1 until no more items can be added to it.
The function read($H, X$) "reads symbol $X$" in the items of $H$, and after this operation the "point" in the items gets to the right side of $X$. If the set $H$ contains the valid $LR(1)$-items for the viable prefix $\gamma$, then the set read($H, X$) contains the valid $LR(1)$-items for the viable prefix $\gamma X$.
The algorithm READ-SET executes the function read. The result is the set $K$.
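Read-Set($H, Y$)
1 $K \leftarrow \emptyset$
2 FOR all $E \in H$
3   DO $K \leftarrow K \cup$ READ-ITEM($E, Y$)
4 RETURN $K$
Read-Item($E, Y$)
1 IF $E = [A \to \alpha.X\beta, a]$ and $X = Y$
2   THEN $K_{E,Y} \leftarrow [A \to \alpha X.\beta, a]$
3   ELSE $K_{E,Y} \leftarrow \emptyset$
4 RETURN $K_{E,Y}$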
Using these algorithms we can create all items that describe the state after reading the given symbol.
Now we introduce the following notation for $LR(1)$-items, to give shorter descriptions. Let
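The two read algorithms amount to moving the point over one symbol. A minimal sketch follows; the item representation and the function names are assumptions, and the closure is not applied here because it is taken afterwards, as in CLOSURE-SET-OF-ITEMS(READ-SET).

```python
from collections import namedtuple

# [lhs -> rhs[:dot] . rhs[dot:], look]
Item = namedtuple("Item", "lhs rhs dot look")

def read_item(item, y):
    """Move the point over y if y stands right after the point."""
    if item.dot < len(item.rhs) and item.rhs[item.dot] == y:
        return {item._replace(dot=item.dot + 1)}
    return set()

def read_set(items, y):
    """Union of read_item over all items of the set."""
    result = set()
    for item in items:
        result |= read_item(item, y)
    return result

# A small hand-written set of items, used only as illustration.
H = {
    Item("S", ("A", "A"), 0, "#"),
    Item("A", ("a", "A"), 0, "a"),
    Item("A", ("b",), 0, "a"),
}
AFTER_A = read_set(H, "A")
```

Only the item whose point stands before `A` survives in `AFTER_A`; items with the point before another symbol contribute nothing.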
be a notation for items
Example 2.16 The -item is an item of the grammar in the example 2.15. For this item
We can create the canonical sets of $LR(1)$-items, or shortly the $LR(1)$-canonical sets, with the following method.
Definition 2.22 Canonical sets of $LR(1)$-items $H_0, H_1, \ldots, H_m$ are the following.
$H_0 =$ closure($[S' \to .S, \#]$),
Create the set read($H_0, X$) for a symbol $X$. If this set is not empty and it is not equal to the canonical set $H_0$, then it is the next canonical set $H_1$.
Repeat this operation for all possible terminal and nonterminal symbols $X$. If we get a nonempty set which is not equal to any of the previous sets, then this set is a new canonical set, and its index is greater by one than the maximal index of the previously generated canonical sets.
Repeat the above operation for all previously generated canonical sets and for all symbols of the grammar until no new canonical set can be created.
The sets
$H_0, H_1, \ldots, H_m$
are the canonical sets of $LR(1)$-items of the grammar $G$.
The number of $LR(1)$-items of a grammar is finite, hence the above method terminates in finite time.
The next algorithm creates the canonical sets of the grammar $G$.
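Create-Canonical-Sets($G$)
 1 $i \leftarrow 0$
 2 $H_i \leftarrow$ CLOSURE-ITEM($[S' \to .S, \#]$)
 3 $I \leftarrow \{H_i\}$, $K \leftarrow \{H_i\}$
 4 REPEAT
 5   $L \leftarrow K$
 6   FOR all $M \in I$
 7     DO $I \leftarrow I \setminus M$
 8       FOR all $X \in T \cup N$
 9         DO $J \leftarrow$ CLOSURE-SET-OF-ITEMS(READ-SET($M, X$))
10           IF $J \neq \emptyset$ and $J \notin K$
11             THEN $i \leftarrow i + 1$
12               $H_i \leftarrow J$
13               $K \leftarrow K \cup \{H_i\}$
14               $I \leftarrow I \cup \{H_i\}$
15 UNTIL $K = L$
16 RETURN $K$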
The result of the algorithm is $K$. The first canonical set is the set $H_0$ in line 2. Further canonical sets are created by the functions CLOSURE-SET-OF-ITEMS(READ-SET) in line 9. The program in line 10 checks that the new set differs from the previous sets, and if it does, then this set becomes a new set in lines 11–12. The FOR cycle in lines 6–14 guarantees that these operations are executed for all sets previously generated. In lines 3–14 the REPEAT cycle generates new canonical sets as long as it is possible.
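The whole construction can be sketched in a few lines. The grammar ($S' \to S$, $S \to AA$, $A \to aA \mid b$) is an assumed illustration, items are plain tuples `(lhs, rhs, dot, lookahead)`, and the First computation assumes that the grammar has no $\varepsilon$-rules. Instead of the sets $I$, $K$ and $L$ of the pseudocode, the sketch walks through a growing list.

```python
GRAMMAR = {"S'": [("S",)], "S": [("A", "A")], "A": [("a", "A"), ("b",)]}
SYMBOLS = ["S", "A", "a", "b"]          # nonterminals and terminals to read

def first(sym, seen=frozenset()):
    """First set of one symbol; valid because there are no epsilon rules."""
    if sym not in GRAMMAR:
        return {sym}
    if sym in seen:
        return set()
    out = set()
    for rhs in GRAMMAR[sym]:
        out |= first(rhs[0], seen | {sym})
    return out

def closure(items):
    result, todo = set(items), list(items)
    while todo:
        lhs, rhs, dot, look = todo.pop()
        if dot < len(rhs) and rhs[dot] in GRAMMAR:
            beta = rhs[dot + 1:]
            looks = first(beta[0]) if beta else {look}   # First(beta a)
            for gamma in GRAMMAR[rhs[dot]]:
                for c in looks:
                    new = (rhs[dot], gamma, 0, c)
                    if new not in result:
                        result.add(new)
                        todo.append(new)
    return frozenset(result)

def read(items, x):
    """Move the point over x in every item where x follows the point."""
    return {(l, r, d + 1, a) for (l, r, d, a) in items
            if d < len(r) and r[d] == x}

def canonical_sets():
    sets = [closure({("S'", ("S",), 0, "#")})]           # H0
    transitions = {}                                     # (i, X) -> j
    i = 0
    while i < len(sets):                 # the list grows while we scan it
        for x in SYMBOLS:
            j = closure(read(sets[i], x))
            if j:
                if j not in sets:
                    sets.append(j)       # a new canonical set
                transitions[(i, x)] = sets.index(j)
        i += 1
    return sets, transitions

SETS, TRANS = canonical_sets()
```

For the assumed grammar the construction stops with ten canonical sets; `TRANS` records the arcs of the automaton, which are needed later when the parsing tables are filled in.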
Example 2.17 The canonical sets of -items for the Example 2.15 are as follows.
The automaton of the parser is in Figure 2.15.
If the canonical sets of $LR(1)$-items
$H_0, H_1, \ldots, H_m$
were created, then assign the state $k$ of an automaton to the set $H_k$. The relation between the states of the automaton and the canonical sets of $LR(1)$-items is stated by the next theorem. This theorem is the "great" theorem of $LR(1)$ parsing.
Theorem 2.23 The set of the $LR(1)$-items that are valid for a viable prefix $\gamma$ can be assigned to the automaton state $k$ such that there is a path from the initial state to state $k$ labelled by $\gamma$.
This theorem states that we can create the automaton of the parser using canonical sets. Now we give a method to create this $LR(1)$ parser from the canonical sets of $LR(1)$-items.
The deterministic finite automaton can be described with a table, which is called the $LR(1)$ parsing table. The rows of the table are assigned to the states of the automaton.
The parsing table has two parts. The first is the action table. Since the operations of the parser are determined by the symbols of the analysed text, the action table is divided into columns labelled by the terminal symbols. The action table contains information about the action to be performed in the given state for the given symbol. These actions can be shifts or reductions. The sign of a shift operation is $sj$, where $j$ is the next state. The sign of a reduction is $ri$, where $i$ is the serial number of the applied rule. The reduction by the rule having the serial number zero means the termination of the parsing and that the parsed text is syntactically correct; for this reason we call this operation accept.
The second part of the parsing table is the goto table. This table contains information about the shifts caused by nonterminals. (Shifts belonging to terminals are in the action table.)
Let $\{0, 1, \ldots, m\}$ be the set of states of the automaton. The $i$-th row of the table is filled in from the $LR(1)$-items of the canonical set $H_i$.
The $i$-th row of the action table:
if $[A \to \alpha.a\beta, b] \in H_i$ and read($H_i, a$) $= H_j$, then $action[i, a] = sj$,
if $[A \to \alpha., a] \in H_i$ and $A \neq S'$, then $action[i, a] = rl$, where $A \to \alpha$ is the $l$-th rule of the grammar,
if $[S' \to S., \#] \in H_i$, then $action[i, \#] = accept$.
The method of filling in the goto table:
if read($H_i, A$) $= H_j$, then $goto[i, A] = j$.
In both tables we have to write the text error into the empty positions.
These action and goto tables are called canonical parsing tables.
Theorem 2.24 The augmented grammar $G'$ is an $LR(1)$ grammar iff we can fill in the parsing tables created for this grammar without conflicts.
We can fill in the parsing tables with the next algorithm.
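Fill-in-LR(1)-Table($G$)
 1 FOR all LR(1) canonical sets $H_i$
 2   DO FOR all LR(1)-items
 3     IF $[A \to \alpha.a\beta, b] \in H_i$ and read($H_i, a$) $= H_j$
 4       THEN $action[i, a] = sj$
 5     IF $[A \to \alpha., a] \in H_i$ and $A \neq S'$ and $A \to \alpha$ is the $l$-th rule
 6       THEN $action[i, a] = rl$
 7     IF $[S' \to S., \#] \in H_i$
 8       THEN $action[i, \#] = accept$
 9     IF read($H_i, A$) $= H_j$
10       THEN $goto[i, A] = j$
11   FOR all $a \in (T \cup \{\#\})$
12     DO IF $action[i, a] =$ "empty"
13       THEN $action[i, a] \leftarrow error$
14   FOR all $X \in N$
15     DO IF $goto[i, X] =$ "empty"
16       THEN $goto[i, X] \leftarrow error$
17 RETURN action, goto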
We fill in the tables line by line. In lines 2–6 of the algorithm we fill in the action table, in lines 9–10 we fill in the goto table. In lines 11–13 we write the text error into the positions which remained empty.
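The filling rules can be sketched on a few hand-written canonical sets. Everything concrete here is an assumption made for illustration: the grammar ($S' \to S$, $S \to AA$, $A \to aA \mid b$), the numbering of rules and states, and the fact that only three canonical sets are written out to keep the sketch short. Missing dictionary keys stand for the error entries, and the helper `put` raises an exception when a cell would receive two different entries.

```python
RULES = [("S'", ("S",)), ("S", ("A", "A")), ("A", ("a", "A")), ("A", ("b",))]
NONTERMINALS = {"S'", "S", "A"}

# Items are (lhs, rhs, dot, lookahead); only sets 0, 1 and 4 are listed.
SETS = {
    0: {("S'", ("S",), 0, "#"), ("S", ("A", "A"), 0, "#"),
        ("A", ("a", "A"), 0, "a"), ("A", ("a", "A"), 0, "b"),
        ("A", ("b",), 0, "a"), ("A", ("b",), 0, "b")},
    1: {("S'", ("S",), 1, "#")},
    4: {("A", ("b",), 1, "a"), ("A", ("b",), 1, "b")},
}
# READ[(i, X)] = j means that reading X in set i leads to set j.
READ = {(0, "S"): 1, (0, "A"): 2, (0, "a"): 3, (0, "b"): 4}

def fill_lr1_tables(sets, read, rules):
    action, goto = {}, {}

    def put(table, key, value):
        if table.get(key, value) != value:
            raise ValueError(f"conflict at {key}: {table[key]} / {value}")
        table[key] = value

    for i, items in sets.items():
        for lhs, rhs, dot, look in items:
            if dot < len(rhs):
                x = rhs[dot]
                if x not in NONTERMINALS and (i, x) in read:
                    put(action, (i, x), ("s", read[(i, x)]))     # shift
            elif lhs == "S'":
                put(action, (i, "#"), ("accept", None))          # rule 0
            else:
                put(action, (i, look), ("r", rules.index((lhs, rhs))))
    for (i, x), j in read.items():
        if x in NONTERMINALS:
            put(goto, (i, x), j)
    return action, goto

ACTION, GOTO = fill_lr1_tables(SETS, READ, RULES)
```

A conflict reported by `put` corresponds to the situation excluded by Theorem 2.24: the tables of an $LR(1)$ grammar can be filled in without any cell receiving two entries.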
Now we deal with the steps of the $LR(1)$ parsing (Figure 2.16).
The state of the parsing is described by configurations. A configuration of the $LR(1)$ parser consists of two parts: the first is the stack and the second is the unread input text.
The stack of the parser is a double stack: we write or read two data items with each push or pop operation. The stack consists of pairs of symbols; the first element of a pair is a terminal or nonterminal symbol, and the second element is the serial number of a state of the automaton. The initial content of the stack is $\#0$.
The start configuration is $(\#0, z\#)$, where $z$ means the unread text.
The parsing is successful if the parser moves to the final state. In the final state the content of the stack is $\#0$, and the parser is at the end of the text.
Suppose that the parser is in the configuration $(\#0 \ldots Y_k i_k, ay\#)$. The next move of the parser is determined by $action[i_k, a]$.
State transitions are the following.
If $action[i_k, a] = sl$, i.e. the parser executes a shift, then the actual symbol $a$ and the new state $l$ are written into the stack. That is, the new configuration is
$(\#0 \ldots Y_k i_k, ay\#) \to (\#0 \ldots Y_k i_k\, a\, l, y\#)$.
If $action[i_k, a] = rl$, then we execute a reduction by the $l$-th rule $A \to \alpha$. In this step we delete $|\alpha|$ rows, i.e. we delete $2|\alpha|$ elements from the stack, and then we determine the new state using the goto table. If after the deletion the state $i_{k-r}$ is at the top of the stack, then the new state is $goto[i_{k-r}, A] = i_l$.
$(\#0 \ldots Y_{k-r} i_{k-r} \ldots Y_k i_k, ay\#) \to (\#0 \ldots Y_{k-r} i_{k-r}\, A\, i_l, ay\#)$,
where $|\alpha| = r$.
If $action[i_k, a] = accept$, then the parsing is completed, and the analysed text was correct.
If $action[i_k, a] = error$, then the parsing terminates, and a syntactic error was discovered at the symbol $a$.
The $LR(1)$ parser is often named canonical $LR(1)$ parser.
Denote the action and goto tables together by $T$. We can give the following algorithm for the steps of the $LR(1)$ parser.
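LR(1)-Parser($xay\#, T$)
 1 $s \leftarrow (\#0, xay\#)$, $s' \leftarrow$ parsing
 2 REPEAT
 3   $s = (\#0 \ldots Y_{k-1} i_{k-1} Y_k i_k, ay\#)$
 4   IF $action[i_k, a] = sl$
 5     THEN $s \leftarrow (\#0 \ldots Y_{k-1} i_{k-1} Y_k i_k\, a\, l, y\#)$
 6     ELSE IF $action[i_k, a] = rl$ and $A \to \alpha$ is the $l$-th rule and
 7          $|\alpha| = r$ and $goto[i_{k-r}, A] = j$
 8       THEN $s \leftarrow (\#0 \ldots Y_{k-r} i_{k-r}\, A\, j, ay\#)$
 9       ELSE IF $action[i_k, a] = accept$
10         THEN $s' \leftarrow$ O.K.
11         ELSE $s' \leftarrow$ ERROR
12 UNTIL $s' =$ O.K. OR $s' =$ ERROR
13 RETURN $s', s$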
The input parameters of the algorithm are the text $xay$ and the table $T$. The variable $s'$ indicates the action of the parser. It has the value parsing in the intermediate states, and its value is O.K. or ERROR at the final states. In line 3 we detail the configuration of the parser, which is necessary in lines 6–8. Using the action table, the parser determines its move from the state $i_k$ at the top of the stack and from the actual symbol $a$. In lines 4–5 we execute a shift step, in lines 6–8 a reduction. The algorithm is completed in lines 9–11. At this moment, if the parser is at the end of the text and the state 0 is at the top of the stack, then the text is correct, otherwise a syntax error was detected. According to this, the output of the algorithm is O.K. or ERROR, and the final configuration is at the output, too. In the case of error, the first symbol of the second element of the configuration is the erroneous symbol.
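The shift-reduce loop can be sketched with hand-written tables. The grammar ($S' \to S$, $S \to AA$, $A \to aA \mid b$), the rule and state numbers, the tables and the name `lr1_parse` are assumptions of this sketch; the tables were derived by hand for that grammar and are not the tables of the examples in the text. Absent table cells stand for error, and the sketch stops as soon as it meets the accept entry.

```python
RULES = [("S'", ("S",)), ("S", ("A", "A")), ("A", ("a", "A")), ("A", ("b",))]

# ("s", j): shift and go to state j; ("r", l): reduce by rule l.
ACTION = {
    (0, "a"): ("s", 3), (0, "b"): ("s", 4),
    (1, "#"): ("accept", None),
    (2, "a"): ("s", 6), (2, "b"): ("s", 7),
    (3, "a"): ("s", 3), (3, "b"): ("s", 4),
    (4, "a"): ("r", 3), (4, "b"): ("r", 3),
    (5, "#"): ("r", 1),
    (6, "a"): ("s", 6), (6, "b"): ("s", 7),
    (7, "#"): ("r", 3),
    (8, "a"): ("r", 2), (8, "b"): ("r", 2),
    (9, "#"): ("r", 2),
}
GOTO = {(0, "S"): 1, (0, "A"): 2, (2, "A"): 5, (3, "A"): 8, (6, "A"): 9}

def lr1_parse(text):
    """Return ("O.K.", list of reductions) or ("ERROR", error position)."""
    symbols = list(text) + ["#"]
    stack = [("#", 0)]               # pairs (symbol, state): the double stack
    pos = 0
    reductions = []
    while True:
        state, actual = stack[-1][1], symbols[pos]
        kind, arg = ACTION.get((state, actual), ("error", None))
        if kind == "s":                          # shift
            stack.append((actual, arg))
            pos += 1
        elif kind == "r":                        # reduce by rule number arg
            lhs, rhs = RULES[arg]
            del stack[len(stack) - len(rhs):]    # delete |alpha| pairs
            stack.append((lhs, GOTO[(stack[-1][1], lhs)]))
            reductions.append(arg)
        elif kind == "accept":
            return "O.K.", reductions
        else:
            return "ERROR", pos

RESULT = lr1_parse("abb")
```

The list of reductions is a rightmost derivation in reverse order, so reading it backwards gives the steps needed to build the syntax tree from the root.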
Example 2.18 The action and goto tables of the parser for the grammar of Example 2.15 are as follows. The empty positions denote errors.
Example 2.19 Using the tables of the previous example, analyse the text .
The syntax tree of the sentence is in Figure 2.11.
Our goal is to decrease the number of states of the parser, since not only the size but also the speed of the compiler depends on the number of states. At the same time, we do not wish to restrict radically the set of $LR(1)$ grammars and languages by using our new method.
There are a lot of $LR(1)$-items in the canonical sets that are very similar: their cores are the same, only their lookahead symbols are different. If there are two or more canonical sets in which there are similar items only, then we merge these sets.
If the canonical sets $H_i$ and $H_j$ are mergeable, then let $K_{[i,j]} = H_i \cup H_j$.
Execute all possible mergings of $LR(1)$ canonical sets. After renumbering the indexes we obtain sets $K_0, K_1, \ldots, K_n$; these are the merged $LR(1)$ canonical sets or $LALR(1)$ canonical sets.
We create the $LALR(1)$ parser from these united canonical sets.
Example 2.20 Using the canonical sets of the example 2.17, we can merge the next canonical sets:
and ,
and ,
and .
In the Figure 2.15 it can be seen that mergeable sets are in equivalent or similar positions in the automaton.
There is no difficulty with the function read if we use merged canonical sets. If
and
then
We can prove this in the following way. By the definition of function read, the set depends on the core of $LR(1)$-items in only, and it is independent of the lookahead symbols. Since the cores of $LR(1)$-items in the sets are the same, the cores of $LR(1)$-items of
are also the same. It follows that these sets are mergeable into a set , thus .
However, after merging canonical sets of $LR(1)$-items, the elements of the merged set can raise difficulties. Suppose that .
After merging there are not shift-shift conflicts. If
and
then there is a shift for the symbol and we saw that the function read does not cause problem, i.e. the set is equal to the set .
If there is an item
in the canonical set and there is an item
in the set , then the merged set is an inadequate set with the symbol , i.e. there is a shift-reduce conflict in the merged set.
But this case never happens. Both items are elements of the set and of the set . These sets are mergeable sets, thus they are different in lookahead symbols only. It follows that there is an item in the set . Using Theorem 2.24 we get that the grammar is not an $LR(1)$ grammar; we get a shift-reduce conflict from the set for the $LR(1)$ parser, too.
However, after merging a reduce-reduce conflict may arise. The properties of $LR(1)$ grammars do not exclude this case. In the next example we show such a case.
Example 2.21 Let be a grammar, and the derivation rules are as follows.
This grammar is a grammar. For the viable prefix the -items
for the viable prefix the -items
create two canonical sets.
After merging these two sets we get a reduce-reduce conflict. If the input symbol is or then the handle is , but we cannot decide that if we have to use the rule or the rule for reducing.
Now we give the method for creating a parsing table. First we give the canonical sets of -items
then we merge those canonical sets whose sets of item cores are identical. Let
be the canonical sets.
For the calculation of the size of the action and goto tables and for filling in these tables we use the sets . The method is the same as it was for the parsers. The constructed tables are called parsing tables.
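The merging step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: an item with a lookahead symbol is modelled as the tuple `(lhs, rhs, dot, lookahead)`, the function names `core`, `merge_by_core` and `reduce_reduce_conflicts` are hypothetical, and the two item sets at the end are a made-up pair (rules `A -> c` and `B -> c` with swapped lookaheads `d` and `e`), not sets taken from the examples of this chapter.

```python
from collections import defaultdict

# Assumed item layout: (lhs, rhs, dot, lookahead), where rhs is a tuple of
# grammar symbols and dot is the position of the point in the right side.

def core(item_set):
    """Return the set of item cores, i.e. the items without lookaheads."""
    return frozenset((lhs, rhs, dot) for lhs, rhs, dot, _ in item_set)

def merge_by_core(canonical_sets):
    """Merge all canonical sets whose sets of cores are identical."""
    merged = defaultdict(set)
    for item_set in canonical_sets:
        merged[core(item_set)] |= set(item_set)
    return [frozenset(items) for items in merged.values()]

def reduce_reduce_conflicts(item_set):
    """Map each lookahead to the rules it could reduce by, if more than one."""
    by_lookahead = defaultdict(set)
    for lhs, rhs, dot, lookahead in item_set:
        if dot == len(rhs):                 # the point is at the end
            by_lookahead[lookahead].add((lhs, rhs))
    return {a: rules for a, rules in by_lookahead.items() if len(rules) > 1}

# Hypothetical canonical sets with identical cores but swapped lookaheads.
first = frozenset({("A", ("c",), 1, "d"), ("B", ("c",), 1, "e")})
second = frozenset({("A", ("c",), 1, "e"), ("B", ("c",), 1, "d")})
merged_sets = merge_by_core([first, second])
```

Neither `first` nor `second` has a conflict on its own, while the single merged set has a reduce-reduce conflict for both lookaheads, which is the situation described above.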
Definition 2.25 If filling in the parsing tables does not produce conflicts, then the grammar is said to be an grammar.
The run of the parser is the same as that of the parser.
Example 2.22 Denote the result of merging canonical sets and by . Let be the state belonging to this set.
The canonical sets of the grammar of Example 2.15 were given in Example 2.17 and the mergeable sets were seen in Example 2.20. For this grammar we can create the following parsing tables.
Filling in the tables is conflict free, therefore the grammar is an grammar. The automaton of this parser is in Figure 2.18.
Example 2.23 Analyse the text using the parsing table of the previous example.
The syntax tree of the parsed text is in the Figure 2.17.
As can be seen from the previous example, the grammars are grammars. The converse assertion is not true. In Example 2.21 there is a grammar which is , but it is not an grammar.
Programming languages can be described by grammars. The most frequently used method in compilers of programming languages is the method. The advantage of the parser is that its parsing tables are smaller than the parsing tables.
For example, the parsing tables for the Pascal language have a few hundred lines, whilst the parsers for this language have a few thousand lines.
Exercises
2.3-1 Find the grammars among the following grammars (we give their derivation rules only).
1.
2.
3.
4.
2.3-2 Prove that the next grammars are grammars (we give their derivation rules only).
1.
2.
3.
2.3-3 Prove that the next grammars are not grammars (we give their derivation rules only).
1.
2.
3.
2.3-4 Show that a language has only one sentence.
2.3-5 Prove that the next grammars are grammars (we give their derivation rules only).
1.
2.
2.3-6 Prove that the next grammars are grammars (we give their derivation rules only).
1.
2.
2.3-7 Prove that the next grammars are not grammars for any (we give their derivation rules only).
1.
2.
2.3-8 Prove that the next grammars are but are not grammars (we give their derivation rules only).
1.
2.
2.3-9 Create parsing table for the above grammars.
2.3-10 Using the recursive descent method, write the parsing program for the above grammars.
2.3-11 Create canonical sets and the parsing tables for the above grammars.
2.3-12 Create merged canonical sets and the parsing tables for the above grammars.
PROBLEMS
2-1
Lexical analysis of a program text
The algorithm LEX-ANALYSE in Section 2.2 gives a scanner for a text that is described by only one regular expression or deterministic finite automaton, i.e. this scanner is able to analyse only one symbol. Create an automaton which executes the total lexical analysis of a programming language, and give the algorithm LEX-ANALYSE-LANGUAGE for this automaton. Let the input of the algorithm be the text of a program, and the output be the series of symbols. It is obvious that if the automaton goes into a final state then its new work begins at the initial state, for analysing the next symbol. The algorithm finishes its work if it is at the end of the text or a lexical error is detected.
2-2
Series of symbols augmented with data of symbols
Modify the algorithm of the previous task in such a way that the output is the series of symbols augmented with the appropriate attributes. For example, the attribute of a variable is the character string of its name, or the attribute of a number is its value and type. It is practical to write pointers to the symbols in places of data.
2-3
parser from
canonical sets
If we omit lookahead symbols from the -items then we get -items. We can define functions closure and read for -items too, ignoring the lookahead symbols. Using a method similar to the method of , we can construct canonical sets
One can observe that the number of merged canonical sets is equal to the number of canonical sets, since the cores of -items of the merged canonical sets are the same as the items of the canonical sets. Therefore the number of states of parser is equal to the number of states of its parser.
Using this property, we can construct canonical sets from canonical sets, by completing the items of the canonical sets with lookahead symbols. The result of this procedure is the set of canonical sets.
It is obvious that the right part of an -item begins with the point symbol only if this item was constructed by the function closure. (Note that there is one exception, the item of the canonical set .) Therefore there is no need to store all items of canonical sets. Let the kernel of the canonical set be the -item , and let the kernel of any other canonical set be the set of the -items such that there is no point at the first position on the right side of the item. We give an canonical set by its kernel, since all of its items can be constructed from the kernel using the function closure.
If we complete the items of the kernel of canonical sets then we get the kernel of the merged canonical sets. That is, if the kernel of an canonical set is , then from it with completions we get the kernel of the canonical set, .
If we know then we can construct easily. If , and , then . For -items, if , and then we have to determine also the lookahead symbols, i.e. the symbols such that .
If and then it is sure that . In this case, we say that the lookahead symbol was spontaneously generated for this item of canonical set . The symbol does not play an important role in the construction of the lookahead symbol.
If then is an element of the set , and the lookahead symbol is . In this case we say that the lookahead symbol is propagated from into the item of the set .
If the kernel of an canonical set is given then we construct the propagated and spontaneously generated lookahead symbols for items of by the following algorithm.
For all items we construct the set , where is a dummy symbol,
if and then and the symbol is spontaneously generated into the item of the set ,
if then , and the symbol is propagated from into the item of the set .
The kernel of the canonical set has only one element. The core of this element is . For this item we can give the lookahead symbol directly. Since the cores of the kernels of all canonical sets are given, using the above method we can calculate all propagated and spontaneously generated symbols.
Give an algorithm which constructs canonical sets from canonical sets using the methods of propagation and spontaneous generation.
CHAPTER NOTES
The theory and practice of compilers, computers and programming languages are of the same age. The construction of the first compilers dates back to the 1950s. Writing a compiler was a very hard task at that time: the first Fortran compiler took 18 man-years to implement [6]. Since then, more and more precise definitions and solutions have been given to the problems of compilation, and better and better methods and utilities have been used in the construction of translators.
The development of formal languages and automata was a great leap forward, and we can say that this development was driven by the demand for writing compilers. Nowadays this task is a simple routine project. New results and new discoveries are expected in the field of code optimisation only.
Some of the earliest nondeterministic and backtracking algorithms appeared in the 1960s. The first two dynamic programming algorithms were the CYK (Cocke-Younger-Kasami) algorithm from 1965–67 and the Earley algorithm from 1965. The idea of precedence parsers is from the end of 1970's and from the beginning of 1980's. The grammars were defined by Knuth in 1965; the definition of grammars dates from the beginning of the 1970s. grammars were studied by DeRemer in 1971; the elaboration of parsing methods was finished at the beginning of the 1980s [7][5][6].
By the middle of the 1980s it became obvious that the parsing methods are the really efficient methods, and since then the methods have been used in compilers [7].
Many excellent books deal with the theory and practice of compilers. Perhaps the most successful of them was the book of Gries [100]; this book contains interesting results for precedence grammars. The first successful book covering the new algorithms was that of Aho and Ullman [5], where we can also find the CYK and the Earley algorithms. It was followed by the “dragon book” of Aho and Ullman [6]; its extended and corrected edition was published in 1986 by Aho, Ullman and Sethi [7].
Without aiming at completeness, we mention the books of Fischer and LeBlanc [74], Tremblay and Sorenson [258], Waite and Goos [267], Hunter [121], Pittman [206] and Mak [171]. Advanced achievements can be found in recently published books, among others in the book of Muchnick [189], in the book of Grune, Bal, Jacobs and Langendoen [102], in the book of Cooper and Torczon [52] and in a chapter of the book by Louden [169].
Algorithms for data compression usually proceed as follows. They encode a text over some finite alphabet into a sequence of bits, thereby exploiting the fact that the letters of this alphabet occur with different frequencies. For instance, an “e” occurs more frequently than a “q” and will therefore be assigned a shorter codeword. The quality of the compression procedure is then measured in terms of the average codeword length.
So the underlying model is probabilistic, namely we consider a finite alphabet and a probability distribution on this alphabet, where the probability distribution reflects the (relative) frequencies of the letters. Such a pair—an alphabet with a probability distribution—is called a source. We shall first introduce some basic facts from Information Theory. Most important is the notion of entropy, since the source entropy characterises the achievable lower bounds for compressibility.
The best understood source model is the discrete memoryless source. Here the letters occur independently of each other in the text. The use of prefix codes, in which no codeword is the beginning of another one, allows the text to be compressed down to the entropy of the source. We shall study this in detail. The lower bound is obtained via Kraft's inequality, the achievability is demonstrated by the use of Huffman codes, which can be shown to be optimal.
There are some assumptions on the discrete memoryless source which are not fulfilled in most practical situations. Firstly, this source model is usually not realistic, since the letters do not occur independently in the text. Secondly, the probability distribution is not known in advance. So the coding algorithms should be universal for a whole class of probability distributions on the alphabet. The analysis of such universal coding techniques is much more involved than the analysis of the discrete memoryless source, so we shall only present the algorithms and shall not prove the quality of their performance. Universal coding techniques mainly fall into two classes.
Statistical coding techniques estimate the probability of the next letters as accurately as possible. This process is called modelling of the source. Having enough information about the probabilities, the text is encoded, where usually arithmetic coding is applied. Here the probability is represented by an interval and this interval will be encoded.
Dictionary-based algorithms store patterns, which occurred before in the text, in a dictionary and at the next occurrence of a pattern this is encoded via its position in the dictionary. The most prominent procedure of this kind is due to Ziv and Lempel.
We shall also present a third universal coding technique which falls in neither of these two classes. The algorithm due to Burrows and Wheeler has become quite prominent in recent years, since implementations based on it perform very well in practice.
All algorithms mentioned so far are lossless, i.e., no information is lost after decoding. So the original text will be recovered without any errors. In contrast, there are lossy data compression techniques, where the text obtained after decoding does not completely coincide with the original text. Lossy compression algorithms are used in applications like image, sound, video, or speech compression. The loss should, of course, only marginally affect the quality. For instance, frequencies not perceivable by the human eye or ear can be dropped. However, the understanding of such techniques requires a solid background in image, sound or speech processing, which would be far beyond the scope of this chapter, so we shall illustrate only the basic concepts behind image compression algorithms such as JPEG.
We emphasise here the recent developments such as the Burrows-Wheeler transform and the context-tree weighting method. Rigorous proofs will only be presented for the results on the discrete memoryless source, which is best understood but is not a very realistic source model in practice. However, it is also the basis for more complicated source models, where the calculations involve conditional probabilities. The asymptotic computational complexity of compression algorithms is often linear in the text length, since the algorithms simply parse through the text. However, the running time relevant for practical implementations is mostly determined by constants such as the dictionary size in Ziv-Lempel coding or the depth of the context tree when arithmetic coding is applied. Further, an exact analysis or comparison of compression algorithms often heavily depends on the structure of the source or the type of file to be compressed, so the performance of compression algorithms is usually tested on benchmark files. The most well-known collections of benchmark files are the Calgary Corpus and the Canterbury Corpus.
The source model discussed throughout this chapter is the Discrete Memoryless Source (DMS). Such a source is a pair , where , is a finite alphabet and is a probability distribution on . A discrete memoryless source can also be described by a random variable , where for all . A word is the realization of the random variable , where the are identically distributed and independent of each other. So the probability is the product of the probabilities of the single letters.
Estimates of the letter probabilities in natural languages are obtained by statistical methods. If we consider the English language and choose for the Latin alphabet with an additional symbol for Space and punctuation marks, the probability distribution can be derived from the frequency table in 3.1, which is obtained from the copy-fitting tables used by professional printers. So , etc.
Observe that this source model is often not realistic. For instance, in English texts the combination `th' occurs more often than `ht'. This could not be the case if an English text were produced by a discrete memoryless source, since then .
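The product rule behind this remark can be checked with a minimal sketch. The function name `word_probability` and the two letter probabilities are assumptions chosen for illustration; they are not values from the frequency table.

```python
from math import isclose, prod

def word_probability(word, letter_probs):
    """Probability of a word under a discrete memoryless source:
    the product of the probabilities of its letters."""
    return prod(letter_probs[ch] for ch in word)

# Hypothetical letter probabilities, for illustration only.
letter_probs = {"t": 0.09, "h": 0.06}

p_th = word_probability("th", letter_probs)
p_ht = word_probability("ht", letter_probs)
```

Since multiplication is commutative, `p_th` and `p_ht` coincide, so a memoryless model cannot express that one letter order is more frequent than the other.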
In the discussion of the communication model it was pointed out that the encoder wants to compress the original data into a short sequence of binary digits, hereby using a binary code, i. e., a function . To each element a codeword is assigned. The aim of the encoder is to minimise the average length of the codewords. It turns out that the best possible data compression can be described in terms of the entropy of the probability distribution . The entropy is given by the formula
where the logarithm is to the base 2. We shall also use the notation according to the interpretation of the source as a random variable.
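As a small numerical illustration, the entropy of a distribution can be computed as follows. The function name `entropy` and the convention that letters of probability zero contribute nothing are assumptions of this sketch.

```python
from math import log2

def entropy(probs):
    """Entropy in bits: minus the sum of p * log2(p) over all letters.
    Letters with probability 0 are skipped (they contribute nothing)."""
    return -sum(p * log2(p) for p in probs if p > 0)

uniform_two = entropy([0.5, 0.5])        # one bit per letter
uniform_four = entropy([0.25] * 4)       # two bits per letter
skewed = entropy([0.9, 0.1])             # well below one bit per letter
```

A uniform source on two letters needs one bit per letter, whereas the skewed source has an entropy below one bit, which is why it can in principle be compressed.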
A code (of variable length) is a function , . Here is the set of codewords, where for the codeword is where denotes the length of , i.e., the number of bits used to represent .
A code is uniquely decipherable (UDC), if every word in is representable by at most one sequence of codewords.
A code is a prefix code if no codeword is a prefix of another one, i.e., for any two codewords and , , with holds . So and differ in at least one of the first components.
Messages encoded using a prefix code are uniquely decipherable. The decoder proceeds by reading the next letter until a codeword is formed. Since cannot be the beginning of another codeword, it must correspond to the letter . Now the decoder continues until another codeword is formed. The process may be repeated until the end of the message. So after having found the codeword the decoder instantaneously knows that is the next letter of the message. Because of this property a prefix code is also called an instantaneous code.
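The decoding procedure just described can be sketched as follows. The function name `decode_prefix_code` and the three-letter code table are hypothetical and serve only as an illustration.

```python
def decode_prefix_code(bits, code):
    """Decode a bit string with a prefix code given as {letter: codeword}.

    Bits are read one by one; as soon as the bits read so far form a
    codeword, the corresponding letter is emitted and reading restarts.
    """
    inverse = {word: letter for letter, word in code.items()}
    decoded, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:          # a complete codeword was formed
            decoded.append(inverse[current])
            current = ""
    if current:
        raise ValueError("input ends in the middle of a codeword")
    return decoded

# Hypothetical prefix code: no codeword is a prefix of another one.
code = {"a": "0", "b": "10", "c": "11"}
message = decode_prefix_code("010110", code)
```

The decoder never has to look ahead: each letter is known at the moment its last bit is read, which is the instantaneous property.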
The criterion for data compression is to minimise the average length of the codewords. So if we are given a source , where and is a probability distribution on , the average length is defined by
The following prefix code for English texts has average length .
We can still do better, if we do not encode single letters, but blocks of letters for some . In this case we replace the source by for some . Remember that for a word , since the source is memoryless. If e.g. we are given an alphabet with two letters, and , , then the code defined by , has average length . Obviously we cannot find a better code. The combinations of two letters now have the following probabilities:
The prefix code defined by
has average length . So could be interpreted as the average length the code requires per letter of the alphabet . When we encode blocks of letters we are interested in the behaviour of
It follows from the Noiseless Coding Theorem, which is stated in the next section, that the entropy of the source .
In our example for the English language we have . So the code presented above, where only single letters are encoded, is already nearly optimal in respect of . Further compression is possible, if we consider the dependencies between the letters.
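The effect of encoding blocks can be illustrated numerically. In this sketch the two-letter source with probabilities 0.9 and 0.1, the prefix code on pairs, and all function names are assumptions made for the example.

```python
from itertools import product
from math import log2, prod

def block_distribution(letter_probs, k):
    """Distribution of blocks of k letters of a memoryless source."""
    return {"".join(w): prod(letter_probs[c] for c in w)
            for w in product(letter_probs, repeat=k)}

def average_length(code, probs):
    """Expected codeword length of a code under a distribution."""
    return sum(probs[w] * len(code[w]) for w in probs)

# Hypothetical source and hypothetical prefix code on pairs of letters.
letter_probs = {"a": 0.9, "b": 0.1}
pair_code = {"aa": "0", "ab": "10", "ba": "110", "bb": "111"}

pairs = block_distribution(letter_probs, 2)
bits_per_letter = average_length(pair_code, pairs) / 2
source_entropy = -sum(p * log2(p) for p in letter_probs.values())
```

Encoding single letters of this source costs one bit per letter, while the pair code needs about 0.645 bits per letter, which is closer to the entropy of roughly 0.469 bits.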
We shall now introduce a necessary and sufficient condition for the existence of a prefix code with prescribed word lengths .
Theorem 3.1 (Kraft's inequality) Let . A uniquely decipherable code with word lengths exists, if and only if
Proof. The central idea is to interpret the codewords as nodes of a rooted binary tree with depth . The tree is required to be complete (every path from the root to a leaf has length ) and regular (every inner node has outdegree 2). The example in Figure 3.2 for may serve as an illustration.
So the nodes with distance from the root are labeled with the words . The upper successor of is labeled , its lower successor is labeled .
The shadow of a node labeled by is the set of all the leaves which are labeled by a word (of length ) beginning with . In other words, the shadow of consists of the leaves labeled by a sequence with prefix . In our example is the shadow of the node labeled by .
Now suppose we are given positive integers . We further assume that . As first codeword is chosen. Since , we have (otherwise only one letter has to be encoded). Hence some nodes are left on the level which are not in the shadow of . We pick the first of these remaining nodes and go back steps towards the root. Since we shall find a node labeled by a sequence of bits, which is not a prefix of . So we can choose this sequence as . Now again, either , and we are done, or by the hypothesis and we can find a node on the level, which is not contained in the shadows of and . We find the next codeword as shown above. The process can be continued until all codewords are assigned.
Conversely, observe that , where is the number of codewords with length in the uniquely decipherable prefix code and again denotes the maximal word length.
The power of this term can be expanded as
Here is the total number of messages whose coded representation is of length .
Since the code is uniquely decipherable, every sequence of letters corresponds to at most one possible message. Hence and . Taking the –th root, this yields .
Since this inequality holds for any and , we have the desired result
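The constructive direction of the proof can be sketched in code. The sketch assumes the usual form of Kraft's inequality, namely that the sum of 2 to the power of minus the word length, taken over all codewords, is at most 1. Instead of walking through the tree explicitly, it assigns codewords in order of nondecreasing length by counting in binary, which picks the first free node on each level; the function names are hypothetical.

```python
from fractions import Fraction

def kraft_sum(lengths):
    """Exact value of the sum of 2**(-l) over the given word lengths."""
    return sum(Fraction(1, 2 ** l) for l in lengths)

def prefix_code_from_lengths(lengths):
    """Build a binary prefix code with the given positive word lengths.

    Codewords are returned in order of nondecreasing length. Each new
    codeword is the first node of its level that is outside the shadows
    of all codewords chosen so far.
    """
    if kraft_sum(lengths) > 1:
        raise ValueError("Kraft's inequality is violated")
    codewords, value, previous = [], 0, 0
    for l in sorted(lengths):
        value <<= l - previous            # descend to level l
        codewords.append(format(value, f"0{l}b"))
        value += 1                        # next free node on this level
        previous = l
    return codewords

example_code = prefix_code_from_lengths([3, 1, 3, 2])
```

For the lengths 1, 2, 3, 3 the Kraft sum is exactly 1 and the construction yields the codewords 0, 10, 110 and 111.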
Theorem 3.2 (Noiseless Coding Theorem) For a source , it is always possible to find a uniquely decipherable code with average length
Proof. Let denote the codeword lengths of an optimal uniquely decipherable code. Now we define a probability distribution by for , where . By Kraft's inequality .
For two probability distributions and on the I-divergence is defined by
The I-divergence is a good measure for the distance between two probability distributions. In particular, the I-divergence always satisfies . So for any probability distribution
From this it follows that
Since , and hence .
In order to prove the right-hand side of the Noiseless Coding Theorem for we define . Observe that and hence .
So and from Kraft's Inequality we know that there exists a uniquely decipherable code with word lengths . This code has average length
In the proof of the Noiseless Coding Theorem it was explicitly shown how to construct a prefix code for a given probability distribution . The idea was to assign to each a codeword of length by choosing an appropriate vertex in the tree introduced. However, this procedure does not always yield an optimal code. If, e.g., we are given the probability distribution , we would encode , , and thus achieve an average codeword length . But the code with , , has an average length of only .
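A short numerical check of this observation is given below. It assumes that the codeword length assigned to a letter of probability p is the logarithm of 1/p to the base 2, rounded up, and it uses the uniform distribution on three letters as a hypothetical example; the function names are illustrative.

```python
from math import ceil, log2

def shannon_lengths(probs):
    """Assumed length rule: ceil(log2(1/p)) for a letter of probability p."""
    return [ceil(-log2(p)) for p in probs]

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

def average_length(lengths, probs):
    return sum(p * l for p, l in zip(probs, lengths))

# Hypothetical source: three equally likely letters.
probs = [1 / 3, 1 / 3, 1 / 3]
lengths = shannon_lengths(probs)                 # every letter gets 2 bits
avg = average_length(lengths, probs)

# The lengths 2, 2, 1 also satisfy Kraft's inequality and are shorter.
better_avg = average_length([2, 2, 1], probs)
```

The average length stays within one bit of the entropy, as the theorem promises, yet a code with lengths 2, 2, 1 has a smaller average length, so the length rule alone does not give an optimal code.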
Shannon gave an explicit procedure for obtaining codes with codeword lengths using the binary representation of cumulative probabilities (Shannon remarked this procedure was originally due to Fano). The elements of the source are ordered according to increasing probabilities . Then the codeword consists of the first bits of the binary expansion of the sum . This procedure was further developed by Elias. The elements of the source now may occur in any order. The Shannon-Fano-Elias-code has as codewords the first bits of the binary expansion of the sum .
We shall illustrate these procedures with the example in Figure 3.3.
A more efficient procedure is also due to Shannon and Fano. The Shannon-Fano-algorithm will be illustrated by the same example in Figure 3.4.
The messages are first written in order of nonincreasing probabilities. Then the message set is partitioned into two most equiprobable subsets and . A is assigned to each message contained in one subset and a to each of the remaining messages. The same procedure is repeated for subsets of and ; that is, will be partitioned into two subsets and . Now the code word corresponding to a message contained in will start with and that corresponding to a message in will begin with . This procedure is continued until each subset contains only one message.
However, this algorithm does not yield an optimal code in general either, since the prefix code , , , , , , has average length .
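The partitioning procedure described above can be sketched as follows. This is a minimal illustration under assumptions: "most equiprobable" is read as the split point that minimises the difference of the two subset probabilities, the first subset receives a 0 and the second a 1, and the four-letter source is hypothetical.

```python
from fractions import Fraction

def shannon_fano(probs):
    """Shannon-Fano code for {letter: probability}, as {letter: bits}."""
    items = sorted(probs.items(), key=lambda kv: -kv[1])  # nonincreasing
    codes = {letter: "" for letter, _ in items}

    def split(group):
        if len(group) < 2:
            return
        total = sum(p for _, p in group)
        running, best_k, best_diff = 0, 1, None
        for k in range(1, len(group)):
            running += group[k - 1][1]
            diff = abs(total - 2 * running)   # |first part - second part|
            if best_diff is None or diff < best_diff:
                best_k, best_diff = k, diff
        for letter, _ in group[:best_k]:
            codes[letter] += "0"
        for letter, _ in group[best_k:]:
            codes[letter] += "1"
        split(group[:best_k])
        split(group[best_k:])

    split(items)
    return codes

# Hypothetical source; exact fractions avoid rounding effects in the split.
source = {"a": Fraction(2, 5), "b": Fraction(3, 10),
          "c": Fraction(1, 5), "d": Fraction(1, 10)}
sf_code = shannon_fano(source)
```

For this source the first split separates `a` from the rest, and the resulting codewords are 0, 10, 110 and 111.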
The Huffman coding algorithm is a recursive procedure, which we shall illustrate with the same example as for the Shannon-Fano-algorithm in Figure 3.5 with and . The source is successively reduced by one element. In each reduction step we add up the two smallest probabilities and insert their sum in the increasingly ordered sequence , thus obtaining a new probability distribution with . Finally we arrive at a source with two elements ordered according to their probabilities. The first element is assigned a , the second element a . Now we again “blow up” the source until the original source is restored. In each step and are obtained by appending or , respectively, to the codeword corresponding to .
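A compact way to carry out the same reduction is to keep the current probabilities in a priority queue. The sketch below is this heap-based formulation rather than the literal reduce-and-restore recursion described above: it merges the two smallest probabilities repeatedly and prepends a bit to every codeword of the merged groups. The function name and the four-letter source are assumptions; which of the two merged groups receives 0 or 1 is arbitrary.

```python
import heapq
from itertools import count

def huffman_code(probs):
    """Huffman code for {letter: probability}, returned as {letter: bits}."""
    if len(probs) == 1:
        return {next(iter(probs)): "0"}
    tie = count()                          # tie-breaker for equal sums
    heap = [(p, next(tie), {letter: ""}) for letter, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, group0 = heapq.heappop(heap)   # smallest probability
        p1, _, group1 = heapq.heappop(heap)   # second smallest
        merged = {s: "0" + w for s, w in group0.items()}
        merged.update({s: "1" + w for s, w in group1.items()})
        heapq.heappush(heap, (p0 + p1, next(tie), merged))
    return heap[0][2]

# Hypothetical source with four letters.
source = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
code = huffman_code(source)
average = sum(source[s] * len(code[s]) for s in source)
```

For this source the codeword lengths are 1, 2, 3 and 3, giving an average length of about 1.9 bits.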
The following theorem demonstrates that the Huffman coding algorithm always yields a prefix code optimal with respect to the average codeword length.
Theorem 3.3 We are given a source , where and the probabilities are ordered non–increasingly . A new probability distribution is defined by
Let be an optimal prefix code for . Now we define a code for the distribution by
Then is an optimal prefix code for and , where denotes the length of an optimal prefix code for probability distribution .
Proof. For a probability distribution on with there exists an optimal prefix code with
i)
ii)
iii) and differ exactly in the last position.
This holds, since:
i) Assume that there are with and . Then the code obtained by interchanging the codewords and has average length , since
ii) Assume we are given a code with . Because of the prefix property we may drop the last bits and thus obtain a new code with .
iii) If no two codewords of maximal length agree in all places but the last, then we may drop the last digit of all such codewords to obtain a better code.
Now we are ready to prove the statement from the theorem. From the definition of and we have
Now let be an optimal prefix code with the properties ii) and iii) from the preceding lemma. We define a prefix code for
by for and is obtained by dropping the last bit of or .
Now
and hence , since .
If denotes the size of the source alphabet, the Huffman coding algorithm needs additions and code modifications (appending or ). Further we need insertions, so that the total complexity can be roughly estimated to be . However, observe that, by the Noiseless Coding Theorem, the compression rate can only be improved by jointly encoding blocks of, say, letters, which would result in a Huffman code for the source of size . So, the price for better compression is a rather drastic increase in complexity. Further, the codewords for all letters have to be stored. Encoding a sequence of letters can then be done in steps.
Exercises
3.1-1 Show that the code with and is uniquely decipherable but not instantaneous for any .
3.1-2 Compute the entropy of the source , with and .
3.1-3 Find the Huffman-codes and the Shannon-Fano-codes for the sources with as in the previous exercise for and calculate their average codeword lengths.
3.1-4 Show that always .
3.1-5 Show that the redundancy of a prefix code for a source with probability distribution can be expressed as a special I–divergence.
3.1-6 Show that the I-divergence for all probability distributions and over some alphabet with equality exactly if but that the I-divergence is not a metric.
In statistical coding techniques such as Shannon-Fano or Huffman coding, the probability distribution of the source is modelled as accurately as possible and then the words are encoded such that a higher probability results in a shorter codeword length.
We know that Huffman-codes are optimal with respect to the average codeword length. However, the entropy is approached by increasing the block length. On the other hand, for long blocks of source symbols, Huffman-coding is a rather complex procedure, since it requires the calculation of the probabilities of all sequences of the given block length and the construction of the corresponding complete code.
For compression techniques based on statistical methods, arithmetic coding is often preferred. Arithmetic coding is a straightforward extension of the Shannon-Fano-Elias-code. The idea is to represent a probability by an interval. In order to do so, the probabilities have to be calculated very accurately. This process is called modelling of the source. So statistical compression techniques consist of two stages: modelling and coding. As just mentioned, coding is usually done by arithmetic coding. The different algorithms like, for instance, DMC (Dynamic Markov Coding) and PPM (Prediction by Partial Matching) vary in the way of modelling the source. We are going to present the context-tree weighting method, a transparent algorithm for the estimation of block probabilities due to Willems, Shtarkov, and Tjalkens, which also allows a straightforward analysis of the efficiency.
The idea behind arithmetic coding is to represent a message by an interval , where is the sum of the probabilities of those sequences which are smaller than in lexicographic order.
A codeword assigned to message also corresponds to an interval. Namely, we identify codeword of length with interval , where is the binary expansion of the numerator in the fraction . The special choice of codeword will be obtained from and as follows:
So message is encoded by a codeword , whose interval is inside interval .
Let us illustrate arithmetic coding by the following example of a discrete memoryless source with and .
At first glance it may seem that this code is much worse than the Huffman code for the same source with codeword lengths () we found previously. On the other hand, it can be shown that arithmetic coding always achieves an average codeword length , which is only two bits apart from the lower bound in the noiseless coding theorem. Huffman coding would usually yield an even better code. However, this “negligible” loss in compression rate is compensated by several advantages. The codeword is directly computed from the source sequence, which means that we do not have to store the code as in the case of Huffman coding. Further, the relevant source models allow easy computation of and from , usually by multiplication by . This means that the sequence to be encoded can be parsed sequentially bit by bit, unlike in Huffman coding, where we would have to encode blockwise.
The basic algorithm for encoding a sequence by arithmetic coding works as follows. We assume that , (in the case for all the discrete memoryless source arises, but in the section on modelling more complicated formulae come into play) and hence
Starting with and the first letters of the text to be compressed determine the current interval . These current intervals are successively refined via the recursions
is usually called the augend. The final interval will then be encoded by interval as described above. So the algorithm looks as follows.
Arithmetic-Encoder( )
1
2
3 FOR TO
4 DO
5
6
7
8 RETURN
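The encoding recursion can be sketched in Python with exact rational arithmetic. Everything specific in this sketch is an assumption: the names `low` and `width` for the lower bound and the length of the current interval, the codeword length taken as the logarithm of 1/width to the base 2, rounded up, plus one, the codeword taken as the smallest multiple of 2 to the power of minus that length that is not below `low`, and the numerical values of the ternary source.

```python
from fractions import Fraction
from math import ceil

def arithmetic_encode(sequence, probs):
    """Encode a sequence over the letters of probs ({letter: Fraction}).

    Returns the codeword as a bit string.
    """
    # Cumulative probability of all letters preceding a given letter.
    cumulative, acc = {}, Fraction(0)
    for letter, p in probs.items():
        cumulative[letter] = acc
        acc += p

    low, width = Fraction(0), Fraction(1)
    for x in sequence:
        low += width * cumulative[x]      # move the lower bound
        width *= probs[x]                 # shrink the interval

    # Assumed length rule: ceil(log2(1 / width)) + 1, computed exactly.
    length = 0
    while 2 ** length < 1 / width:
        length += 1
    length += 1

    numerator = ceil(low * 2 ** length)   # smallest grid point >= low
    return format(numerator, f"0{length}b")

# Hypothetical ternary source and input sequence.
probs = {1: Fraction(2, 5), 2: Fraction(1, 2), 3: Fraction(1, 10)}
codeword = arithmetic_encode([2, 2, 2, 3], probs)
```

With these assumed values the final interval starts at 0.8125 and has length 0.0125, so the codeword has 8 bits and equals the binary expansion of 0.8125.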
We shall illustrate the encoding procedure by the following example from the literature. Let the discrete, memoryless source be given with ternary alphabet and , , . The sequence has to be encoded. Observe that and for all . Further , , and .
The above algorithm yields
Hence and . From this can be calculated that and finally whose binary representation is codeword .
Decoding is very similar to encoding. The decoder recursively “undoes” the encoder's recursion. We divide the interval into subintervals with bounds defined by . Then we find the interval in which codeword can be found. This interval determines the next symbol. Then we subtract and rescale by multiplication by .
Arithmetic-Decoder( )
1 FOR TO
2 DO
3 WHILE
4 DO
5
6
7 RETURN
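The decoding steps (locate the subinterval, subtract its lower bound, rescale) can be sketched as follows. The sketch assumes that the number of symbols `n` is known to the decoder, and it uses a hypothetical ternary source and codeword; the function name is illustrative.

```python
from fractions import Fraction

def arithmetic_decode(codeword, n, probs):
    """Decode n letters from a codeword given as a bit string.

    probs maps letters to Fractions; the letter order of the dict
    determines the order of the subintervals of [0, 1).
    """
    value = Fraction(int(codeword, 2), 2 ** len(codeword))
    decoded = []
    for _ in range(n):
        lower = Fraction(0)
        for letter, p in probs.items():
            if lower <= value < lower + p:    # subinterval containing value
                break
            lower += p
        else:
            raise ValueError("codeword value lies outside [0, 1)")
        decoded.append(letter)
        value = (value - lower) / p           # subtract and rescale
    return decoded

# Hypothetical ternary source and codeword.
probs = {1: Fraction(2, 5), 2: Fraction(1, 2), 3: Fraction(1, 10)}
sequence = arithmetic_decode("11010000", 4, probs)
```

With these assumed values the codeword stands for 0.8125; the rescaled values 0.825, 0.85 and 0.9 appear in turn, and the decoded sequence is 2, 2, 2, 3.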
Observe that when the decoder only receives codeword he does not know when the decoding procedure terminates. For instance can be the codeword for , , , etc. In the above pseudocode it is implicit that the number of symbols has also been transmitted to the decoder, in which case it is clear what the last letter to be encoded was. Another possibility would be to provide a special end-of-file (EOF)-symbol with a small probability, which is known to both the encoder and the decoder. When the decoder sees this symbol, he stops decoding. In this case line 1 would be replaced by
1 WHILE (EOF)
(and would have to be increased). In our above example, the decoder would receive the codeword , the binary expansion of up to bits. This number falls in the interval which belongs to the letter , hence the first letter . Then he calculates . Again this number is in the interval and the second letter is . In order to determine the calculation must be performed. Again such that also . Finally . Since , the last letter of the sequence must be .
Recall that message is encoded by a codeword , whose interval is inside interval . This follows from .
Obviously a prefix code is obtained, since a codeword can only be a prefix of another one, if their corresponding intervals overlap – and the intervals are obviously disjoint for different -s.
Further, we mentioned already that arithmetic coding compresses down to the entropy up to two bits. This is because for every sequence it is . It can also be shown that the additional transmission of block length or the introduction of the EOF symbol only results in a negligible loss of compression.
However, the basic algorithms we presented are not useful in order to compress longer files, since with increasing block length the intervals are getting smaller and smaller, such that rounding errors will be unavoidable. We shall present a technique to overcome this problem in the following.
The basic algorithm for arithmetic coding is linear in the length of the sequence to be encoded. Usually, arithmetic coding is compared to Huffman coding. In contrast to Huffman coding, we do not have to store the whole code, but can obtain the codeword directly from the corresponding interval. However, for a discrete memoryless source, where the probability distribution is the same for all letters, this is not such a big advantage, since the Huffman code will be the same for all letters (or blocks of letters) and hence has to be computed only once. Huffman coding, on the other hand, does not use any multiplications which slow down arithmetic coding.
For the adaptive case, in which the 's may change for different letters to be encoded, a new Huffman code would have to be calculated for each new letter. In this case, usually arithmetic coding is preferred. We shall investigate such situations in the section on modelling. For implementations in practice floating point arithmetic is avoided. Instead, the subdivision of the interval is represented by a subdivision of the integer range , say, with proportions according to the source probabilities. Now integer arithmetic can be applied, which is faster and more precise.
In the basic algorithms for arithmetic encoding and decoding the shrinking of the current interval would require the use of high precision arithmetic for longer sequences. Further, no bit of the codeword is produced until the complete sequence has been read in. This can be overcome by coding each bit as soon as it is known and then doubling the length of the current interval $[LO, HI)$, say, so that this expansion represents only the unknown part of the interval. This is the case when the leading bits of the lower and upper bound are the same, i.e. the interval is completely contained either in $[0, 1/2)$ or in $[1/2, 1)$. The following expansion rules guarantee that the current interval does not become too small.
Case 1 ($[LO, HI) \subseteq [0, 1/2)$): $LO \leftarrow 2 \cdot LO$, $HI \leftarrow 2 \cdot HI$.
Case 2 ($[LO, HI) \subseteq [1/2, 1)$): $LO \leftarrow 2 \cdot LO - 1$, $HI \leftarrow 2 \cdot HI - 1$.
Case 3 ($[LO, HI) \subseteq [1/4, 3/4)$ with $LO < 1/2 < HI$): $LO \leftarrow 2 \cdot LO - 1/2$, $HI \leftarrow 2 \cdot HI - 1/2$.
The last case, called underflow (or follow), prevents the interval from shrinking too much when the bounds are close to $1/2$. Observe that if the current interval is contained in $[1/4, 3/4)$ with $LO < 1/2 < HI$, we do not know the next output bit, but we do know that whatever it is, the following bit will have the opposite value. However, in contrast to the other cases we cannot continue encoding here, but have to wait (remain in the underflow state and adjust a counter $underflowcount$ to the number of subsequent underflows, i.e. $underflowcount \leftarrow underflowcount + 1$) until the current interval falls into either $[0, 1/2)$ or $[1/2, 1)$. In this case we encode the leading bit of this interval, $0$ for $[0, 1/2)$ and $1$ for $[1/2, 1)$, followed by $underflowcount$ many inverse bits, and then set $underflowcount \leftarrow 0$. The procedure stops when all letters are read in and the current interval does not allow any further expansion.
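Arithmetic-Precision-Encoder(x)
  LO ← 0
  HI ← 1
  c ← empty bit string
  underflowcount ← 0
  FOR i ← 1 TO n
      DO A ← HI − LO
         HI ← LO + A · (Q(x[i]) + P(x[i]))
         LO ← LO + A · Q(x[i])
         WHILE (HI ≤ 1/2 OR LO ≥ 1/2 OR (LO ≥ 1/4 AND HI ≤ 3/4))
             DO IF HI ≤ 1/2
                   THEN c ← c || 0, followed by underflowcount many 1s
                        underflowcount ← 0
                        LO ← 2 · LO
                        HI ← 2 · HI
                   ELSE IF LO ≥ 1/2
                           THEN c ← c || 1, followed by underflowcount many 0s
                                underflowcount ← 0
                                LO ← 2 · LO − 1
                                HI ← 2 · HI − 1
                           ELSE IF LO ≥ 1/4 AND HI ≤ 3/4
                                   THEN underflowcount ← underflowcount + 1
                                        LO ← 2 · LO − 1/2
                                        HI ← 2 · HI − 1/2
  IF underflowcount > 0
      THEN c ← c || 0, followed by underflowcount many 1s
  RETURN c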
We shall illustrate the encoding algorithm in Figure 3.6 by our example – the encoding of the message with alphabet and probability distribution . An underflow occurs in the sixth row: we keep track of the underflow state and later encode the inverse of the next bit, here this inverse bit is the in the ninth row. The encoded string is .
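The three expansion rules can be tried out with the following Python sketch. It is not the implementation behind Figure 3.6: it uses exact fractions instead of a fixed integer range, the names `precision_encode`, `pending` and `emit` are hypothetical, and the termination step (one extra underflow plus a final bit) is a common convention assumed here rather than taken from the text.

```python
from fractions import Fraction

HALF, QUARTER = Fraction(1, 2), Fraction(1, 4)


def precision_encode(message, probs):
    """Return the code bits for message, emitting bits as soon as known."""
    # Cumulative lower bounds in the insertion order of probs.
    cum, acc = {}, Fraction(0)
    for symbol, p in probs.items():
        cum[symbol] = acc
        acc += p

    lo, hi = Fraction(0), Fraction(1)
    bits, pending = [], 0  # pending counts unresolved underflows

    def emit(bit):
        nonlocal pending
        bits.append(bit)
        bits.extend([1 - bit] * pending)  # the postponed inverse bits
        pending = 0

    for symbol in message:
        width = hi - lo
        hi = lo + width * (cum[symbol] + probs[symbol])
        lo = lo + width * cum[symbol]
        while True:
            if hi <= HALF:                              # Case 1
                emit(0)
                lo, hi = 2 * lo, 2 * hi
            elif lo >= HALF:                            # Case 2
                emit(1)
                lo, hi = 2 * lo - 1, 2 * hi - 1
            elif lo >= QUARTER and hi <= 3 * QUARTER:   # Case 3 (underflow)
                pending += 1
                lo, hi = 2 * lo - HALF, 2 * hi - HALF
            else:
                break

    # Termination (assumed convention): one more underflow and a final bit
    # pin the codeword inside the remaining interval.
    pending += 1
    emit(0 if lo < QUARTER else 1)
    return bits
```

For the hypothetical source `{"x": 3/8, "y": 1/4, "z": 3/8}` the single letter `y` has the interval $[3/8, 5/8)$, which triggers two underflows before any bit is known; the sketch then outputs `[0, 1, 1, 1]`, i.e. the number $7/16$.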
Precision-decoding involves the consideration of a third variable besides the interval bounds LO and HI.
In this section we shall only consider binary sequences to be compressed by an arithmetic coder. Further, we shortly write instead of in order to allow further subscripts and superscripts for the description of the special situation. will denote estimated probabilities, weighted probabilities, and probabilities assigned to a special context .
The application of arithmetic coding is quite appropriate if the probability distribution of the source is such that can easily be calculated from . Obviously this is the case, when the source is discrete and memoryless, since then .
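Even when the underlying parameter $\theta$ of a binary, discrete memoryless source is not known, there is an efficient way due to Krichevsky and Trofimov to estimate the probabilities via
$$P(X_n = 1 \mid x^{n-1}) = \frac{b + \frac{1}{2}}{a + b + 1},$$
where $a$ and $b$ denote the number of $0$s and $1$s, respectively, in the sequence $x^{n-1} = (x_1 x_2 \ldots x_{n-1})$. So given the sequence $x^{n-1}$ with $a$ many $0$s and $b$ many $1$s, the probability that the next letter $x_n$ will be a $1$ is estimated as $\frac{b + 1/2}{a + b + 1}$. The estimated block probability of a sequence containing $a$ zeros and $b$ ones then is
$$P_e(a, b) = \frac{\frac{1}{2} \cdot \frac{3}{2} \cdots \left(a - \frac{1}{2}\right) \cdot \frac{1}{2} \cdot \frac{3}{2} \cdots \left(b - \frac{1}{2}\right)}{1 \cdot 2 \cdots (a + b)}$$
with initial values $a = 0$ and $b = 0$ (so that $P_e(0, 0) = 1$) as in Figure 3.7, where the values of the Krichevsky-Trofimov estimator $P_e(a, b)$ for small $(a, b)$ are listed.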
Note that the summand $1/2$ in the numerator guarantees that the probability for the next letter to be a $1$ is positive even when the symbol $1$ did not occur in the sequence so far. To avoid infinite codeword lengths, every approach to estimating the parameters for arithmetic coding has to make sure that the next letter is never assigned probability zero.
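The Krichevsky-Trofimov estimator is easy to evaluate sequentially. The following Python sketch (function names are hypothetical) multiplies the conditional estimates letter by letter, using exact fractions so that small values can be compared with a table by hand.

```python
from fractions import Fraction


def kt_next_one(a, b):
    """Estimated probability that the next letter is 1 after a zeros, b ones."""
    return Fraction(2 * b + 1, 2 * (a + b + 1))


def kt_block_probability(bits):
    """Estimated probability of a whole binary sequence."""
    a = b = 0
    p = Fraction(1)
    for x in bits:
        if x == 1:
            p *= kt_next_one(a, b)
            b += 1
        else:
            p *= 1 - kt_next_one(a, b)  # equals (a + 1/2) / (a + b + 1)
            a += 1
    return p
```

For instance, `kt_block_probability([0, 1])` gives $1/8$ and `kt_block_probability([0, 0])` gives $3/8$. The result depends only on the numbers of zeros and ones, not on their order.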
In most situations the source is not memoryless, i. e., the dependencies between the letters have to be considered. A suitable way to represent such dependencies is the use of a suffix tree, which we denote as context tree. The context of symbol is suffix preceding . To each context (or leaf in the suffix tree) there corresponds a parameter , which is the probability of the occurrence of a 1 when the last sequence of past source symbols is equal to context (and hence is the probability for a in this case). We are distinguishing here between the model (the suffix tree) and the parameters ().
Example 3.1 Let and , , and . The corresponding suffix tree jointly with the parsing process for a special sequence can be seen in Figure 3.8.
The actual probability of the sequence ' ' given the past ' ' is , since the first letter is preceded by suffix , the second letter is preceded by suffix , etc.
Suppose the model is known, but not the parameters . The problem now is to find a good coding distribution for this case. The tree structure allows to easily determine which context precedes a particular symbol. All symbols having the same context (or suffix) form a memoryless source subsequence whose probability is determined by the unknown parameter . In our example these subsequences are ' ' for , ' ' for and ' ' for . One uses the Krichevsky-Trofimov-estimator for this case. To each node in the suffix tree, we count the numbers of zeros and of ones preceded by suffix . For the children and of parent node obviously and must be satisfied.
In our example for the root , and , . Further , , , , , , , , , and . These last numbers are not relevant for our special source but will be important later on, when the source model or the corresponding suffix tree, respectively, is not known in advance.
Example 3.2 Let as in the previous example. Encoding a subsequence is done by successively updating the corresponding counters for and . For example, when we encode the sequence ' ' given the past ' ' using the above suffix tree and Krichevsky-Trofimov-estimator we obtain
where , and are the probabilities of the subsequences ' ', ' ' and ' ' in the context of the leaves. These subsequences are assumed to be memoryless.
Suppose we have a good coding distribution $P_1$ for source 1 and another one, $P_2$, for source 2. We are looking for a good coding distribution for both sources. One possibility is to compute $P_1$ and $P_2$ and then spend 1 bit to identify the better model, which is then used to compress the sequence. This method is called selecting. Another possibility is to employ the weighted distribution, which is
$$P_w(x^n) = \frac{P_1(x^n) + P_2(x^n)}{2}.$$
We shall now present the context-tree weighting algorithm. Under the assumption that a context tree is a full tree of depth $D$, only $a_s$ and $b_s$, i.e. the number of zeros and ones in the subsequence of bits preceded by context $s$, are stored in each node $s$ of the context tree.
Further, to each node $s$ is assigned a weighted probability $P_w^s$, which is recursively defined as
$$P_w^s = \begin{cases} \dfrac{P_e(a_s, b_s) + P_w^{0s} P_w^{1s}}{2} & \text{for } 0 \le l(s) < D, \\[2mm] P_e(a_s, b_s) & \text{for } l(s) = D, \end{cases}$$
where $l(s)$ describes the length of the (binary) string $s$ and $P_e(a_s, b_s)$ is the estimated probability using the Krichevsky-Trofimov estimator.
Example 3.3 After encoding the sequence ' ' given the past ' ' we obtain the context tree of depth 3 in Figure 3.9. The weighted probability of the root node finally yields the coding probability corresponding to the parsed sequence.
Figure 3.9. Weighted context tree for source sequence ' ' with past . The pair denotes zeros and ones preceded by the corresponding context . For the contexts it is .
Recall that for the application in arithmetic coding it is important that probabilities and can be efficiently calculated from . This is possible with the context-tree weighting method, since the weighted probabilities only have to be updated, when is changing. This just occurs for the contexts along the path from the root to the leaf in the context tree preceding the new symbol —namely the contexts for and the root . Along this path, has to be performed, when , and has to be performed, when , and the corresponding probabilities and have to be updated.
This suggests the following algorithm for updating the context tree when reading the next letter . Recall that to each node of the tree we store the parameters , and . These parameters have to be updated in order to obtain . We assume the convention that the ordered pair denotes the root .
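Update-Context-Tree(x_t, x_{t−D} … x_{t−1})
  s ← x_{t−D} … x_{t−1}
  IF x_t = 0
      THEN P_w^s ← P_w^s · (a_s + 1/2) / (a_s + b_s + 1)
           a_s ← a_s + 1
      ELSE P_w^s ← P_w^s · (b_s + 1/2) / (a_s + b_s + 1)
           b_s ← b_s + 1
  FOR i ← 1 TO D
      DO s ← x_{t−D+i} … x_{t−1}
         IF x_t = 0
             THEN P_e^s ← P_e^s · (a_s + 1/2) / (a_s + b_s + 1)
                  a_s ← a_s + 1
             ELSE P_e^s ← P_e^s · (b_s + 1/2) / (a_s + b_s + 1)
                  b_s ← b_s + 1
         P_w^s ← (P_e^s + P_w^{0s} · P_w^{1s}) / 2
  RETURN P_w^s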
The probability assigned to the root in the context tree will be used for the successive subdivisions in arithmetic coding. Initially, before reading , the parameters in the context tree are , , and for all contexts in the tree. In our example the updates given the past would yield the successive probabilities : for , for , for , for , for , for , for , and finally for .
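The weighted root probability can also be computed in one batch, which is convenient for checking small examples by hand. The Python sketch below is such a batch version, not the incremental update described above: it first collects the counts of zeros and ones for every context of length at most `depth` and then evaluates the weighting recursion. All names are hypothetical, contexts are stored as tuples with the oldest bit first, and the past is assumed to contain at least `depth` bits.

```python
from fractions import Fraction


def kt_probability(a, b):
    """Krichevsky-Trofimov block probability for a zeros and b ones."""
    p = Fraction(1)
    for i in range(a):
        p *= Fraction(2 * i + 1, 2 * (i + 1))
    for j in range(b):
        p *= Fraction(2 * j + 1, 2 * (a + j + 1))
    return p


def ctw_root_probability(past, sequence, depth):
    """Weighted probability of sequence at the root of a full context tree."""
    if len(past) < depth:
        raise ValueError("the past must contain at least depth bits")
    counts = {}  # context -> [number of zeros, number of ones]
    history = list(past)
    for x in sequence:
        for k in range(depth + 1):
            context = tuple(history[len(history) - k:])  # last k bits
            counts.setdefault(context, [0, 0])[x] += 1
        history.append(x)

    def weighted(context):
        if context not in counts:
            return Fraction(1)  # unvisited subtree: all counts are zero
        a, b = counts[context]
        estimate = kt_probability(a, b)
        if len(context) == depth:
            return estimate
        children = weighted((0,) + context) * weighted((1,) + context)
        return (estimate + children) / 2

    return weighted(())
```

With `depth = 1` and past `[0]`, the four sequences of length two receive the probabilities $3/8$, $1/8$, $3/16$ and $5/16$, which sum to $1$ as required of a coding distribution.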
Recall that the quality of a code concerning its compression capability is measured with respect to the average codeword length. The average codeword length of the best code comes as close as possible to the entropy of the source. The difference between the average codeword length and the entropy is denoted as the redundancy of code , hence
which obviously is the weighted (by ) sum of the individual redundancies
The individual redundancy of sequences given the (known) source for all for , is bounded by
The individual redundancy of sequences using the context–tree weighting algorithm (and hence a complete tree of all possible contexts as model ) is bounded by
Comparing these two formulae, we see that the difference of the individual redundancies is bits. This can be considered as the cost of not knowing the model, i.e. the model redundancy. So, the redundancy splits into the parameter redundancy, i. e. the cost of not knowing the parameter, and the model redundancy. It can be shown that the expected redundancy behaviour of the context-tree weighting method achieves the asymptotic lower bound due to Rissanen who could demonstrate that about bits per parameter is the minimum possible expected redundancy for .
The computational complexity is proportional to the number of nodes that are visited when updating the tree, which is about $D + 1$ per symbol. Therefore, the number of operations necessary for processing $n$ symbols is linear in $n$. However, these operations are mainly multiplications with factors requiring high precision.
As for most modelling algorithms, the drawback of implementations in practice is the huge amount of memory required. A complete tree of depth $D$ has to be stored and updated. Only with increasing $D$ do the estimates of the probabilities become more accurate, and hence the average codeword length of an arithmetic code based on these estimates becomes shorter. The size of the memory, however, depends exponentially on the depth of the tree.
We presented the context-tree weighting method only for binary sequences. Note that in this case the cumulative probability of a binary sequence can be calculated as
For compression of sources with larger alphabets, for instance ASCII-files, we refer to the literature.
Exercises
3.2-1 Compute the arithmetic codes for the sources , with and and compare these codes with the corresponding Huffman-codes derived previously.
3.2-2 For the codes derived in the previous exercise compute the individual redundancies of each codeword and the redundancies of the codes.
3.2-3 Compute the estimated probabilities for the sequence and all its subsequences using the Krichevsky-Trofimov-estimator.
3.2-4 Compute all parameters and the estimated probability for the sequence given the past , when the context tree is known. What will be the codeword of an arithmetic code in this case?
3.2-5 Compute all parameters and the estimated probability for the sequence given the past , when the context tree is not known, using the context-tree weighting algorithm.
3.2-6 Based on the computations from the previous exercise, update the estimated probability for the sequence given the past .
Show that for the cumulative probability of a binary sequence it is
In 1976–1978 Jacob Ziv and Abraham Lempel introduced two universal coding algorithms which, in contrast to the statistical coding techniques considered so far, do not make explicit use of the underlying probability distribution. The basic idea here is to replace a previously seen string with a pointer into a history buffer (LZ77) or with the index of a dictionary (LZ78). LZ algorithms are widely used: "zip" and its variations use the LZ77 algorithm. So, in contrast to the presentation by several authors, Ziv-Lempel coding is not a single algorithm. Originally, Lempel and Ziv introduced a method to measure the complexity of a string, as in Kolmogorov complexity. This led to two different algorithms, LZ77 and LZ78. Many modifications and variations have been developed since. However, we shall present the original algorithms and refer to the literature for further information.
The idea of LZ77 is to pass a sliding window over the text to be compressed. One looks for the longest substring in this window representing the next letters of the text. The window consists of two parts: a history window of length , say, in which the last bits of the text considered so far are stored, and a lookahead window of length containing the next bits of the text. In the simplest case and are fixed. Usually, is much bigger than . Then one encodes the triple (offset, length, letter). Here the offset is the number of letters one has to go back in the text to find the matching substring, the length is just the length of this matching substring, and the letter to be stored is the letter following the matching substring. Let us illustrate this procedure with an example. Assume the text to be compressed is , the window is of size 15 with letters history and letters lookahead buffer. Assume, the sliding window now arrived at
i. e., the history window contains the 10 letters and the lookahead window contains the five letters . The longest substring matching the first letters of the lookahead window is of length , which is found nine letters back from the right end of the history window. So we encode , since is the next letter (the string is also found five letters back, in the original LZ77 algorithm one would select the largest offset). The window then is moved letters forward
The next codeword is , since the longest matching substring is of length found letters backwards and is the letter following this substring in the lookahead window. We proceed with
and encode . Further
Here we encode . Observe that the match can extend into the lookahead window.
There are many subtleties to be taken into account. If a symbol did not appear yet in the text, offset and length are set to $0$. If there are two matching strings of the same length, one has to choose between the first and the second offset. Both variations have advantages. Initially one might start with an empty history window and the first letters of the text to be compressed in the lookahead window; there are also further variations.
A common modification of the original scheme is to output only the pair (offset, length) and not the following letter of the text. Using this coding procedure one has to take into consideration the case in which the next letter does not occur in the history window. In this case, usually the letter itself is stored, such that the decoder has to distinguish between pairs of numbers and single letters. Further variations do not necessarily encode the longest matching substring.
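The original scheme with triples (offset, length, letter) can be sketched in Python as follows. The sketch starts with an empty history window, selects the largest offset among equally long matches, and lets a match run into the lookahead window. The function names are hypothetical, the default window sizes are those of the example above, and the plain linear search is chosen for readability, not for speed.

```python
def lz77_encode(text, history=10, lookahead=5):
    """Encode text as a list of (offset, length, letter) triples."""
    triples, i = [], 0
    while i < len(text):
        best_offset, best_length = 0, 0
        # A match must leave room for the following letter in the lookahead.
        max_length = min(lookahead - 1, len(text) - i - 1)
        for j in range(max(0, i - history), i):
            k = 0
            while k < max_length and text[j + k] == text[i + k]:
                k += 1
            if k > best_length:  # strict: ties keep the largest offset
                best_offset, best_length = i - j, k
        triples.append((best_offset, best_length, text[i + best_length]))
        i += best_length + 1
    return triples


def lz77_decode(triples):
    """Rebuild the text; copying letter by letter handles overlapping matches."""
    out = []
    for offset, length, letter in triples:
        for _ in range(length):
            out.append(out[-offset])
        out.append(letter)
    return "".join(out)
```

For example, `lz77_encode("aaaaaa")` yields `[(0, 0, 'a'), (1, 4, 'a')]`: the second triple copies four letters from an offset of one, so the match overlaps the text being produced, and `lz77_decode` still recovers the original string.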
LZ78 does not use a sliding window but a dictionary which is represented here as a table with an index and an entry. LZ78 parses the text to be compressed into a collection of strings, where each string is the longest matching string seen so far plus the symbol following in the text to be compressed. The new string is added into the dictionary. The new entry is coded as , where is the index of the existing table entry and is the appended symbol.
As an example, consider the string “ ”. It is divided by LZ78 into strings as shown below. String is here the empty string.
Since we are not using a sliding window, there is no limit for how far back strings can reach. However, in practice the dictionary cannot continue to grow infinitely. There are several ways to manage this problem. For instance, after having reached the maximum number of entries in the dictionary, no further entries can be added to the table and coding becomes static. Another variation would be to replace older entries. The decoder knows how many bits must be reserved for the index of the string in the dictionary, and hence decompression is straightforward.
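A minimal Python sketch of the LZ78 parsing with an unbounded dictionary might look as follows. The names are hypothetical, index $0$ stands for the empty string, and the handling of a leftover phrase at the end of the text (it is emitted again as prefix plus last letter) is one possible convention assumed here.

```python
def lz78_encode(text):
    """Parse text into (index, letter) pairs."""
    dictionary = {"": 0}
    pairs, current = [], ""
    for letter in text:
        if current + letter in dictionary:
            current += letter  # keep extending the longest known string
        else:
            pairs.append((dictionary[current], letter))
            dictionary[current + letter] = len(dictionary)
            current = ""
    if current:  # the text ended inside a string that is already known
        pairs.append((dictionary[current[:-1]], current[-1]))
    return pairs


def lz78_decode(pairs):
    """Rebuild the text while reconstructing the same dictionary."""
    entries, out = [""], []
    for index, letter in pairs:
        entry = entries[index] + letter
        entries.append(entry)
        out.append(entry)
    return "".join(out)
```

The decoder never needs the dictionary to be transmitted, since it rebuilds the table entry by entry in the same order as the encoder.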
Ziv-Lempel coding asymptotically achieves the best possible compression rate, which again is the entropy rate of the source. The source model, however, is much more general than the discrete memoryless source. The stochastic process generating the next letter is assumed to be stationary (the probability of a sequence does not depend on the instant of time, i.e. $P(X_1 = x_1, \ldots, X_n = x_n) = P(X_{t+1} = x_1, \ldots, X_{t+n} = x_n)$ for all $t$ and all sequences $(x_1 \ldots x_n)$). For stationary processes the limit $\lim_{n \to \infty} \frac{1}{n} H(X_1, \ldots, X_n)$ exists and is defined to be the entropy rate.
If $s(n)$ denotes the number of strings in the parsing process of LZ78 for a text of length $n$ generated by a stationary source, then the number of bits required to encode all these strings is $s(n) \cdot (\log s(n) + 1)$. It can be shown that $\frac{s(n) \cdot (\log s(n) + 1)}{n}$ converges to the entropy rate of the source. However, this would require that all strings can be stored in the dictionary.
If we fix the size of the sliding window or the dictionary, the running time of encoding a sequence of $n$ letters will be linear in $n$. However, as usual in data compression, there is a tradeoff between compression rate and speed. Better compression is only possible with larger memory. Increasing the size of the dictionary or the window will, however, result in slower performance, since the most time-consuming task is the search for the matching substring or the position in the dictionary.
Decoding in both LZ77 and LZ78 is straightforward. Observe that with LZ77 decoding is usually much faster than encoding, since the decoder already obtains the information at which position in the history he can read out the next letters of the text to be recovered, whereas the encoder has to find the longest matching substring in the history window. So algorithms based on LZ77 are useful for files which are compressed once and decompressed more frequently.
Further, the encoded text is not necessarily shorter than the original text. Especially in the beginning of the encoding the coded version may expand a lot. This expansion has to be taken into consideration.
For implementation it is not optimal to represent the text as an array. A suitable data structure will be a circular queue for the lookahead window and a binary search tree for the history window in LZ77, while for LZ78 a dictionary tree should be used.
Exercises
3.3-1 Apply the algorithms LZ77 and LZ78 to the string “abracadabra”.
3.3-2 Which type of files will be well compressed with LZ77 and LZ78, respectively? For which type of files are LZ77 and LZ78 not so advantageous?
3.3-3 Discuss the advantages of encoding the first or the last offset, when several matching substrings are found in LZ77.
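The Burrows-Wheeler transform is best demonstrated by an example. Assume that our original text is $\vec{X} =$ "WHEELER". This text will be mapped to a second text $\vec{L}$ and an index $I$ according to the following rules.
1) We form a matrix $M$ consisting of all cyclic shifts of the original text $\vec{X}$. In our example
$$M = \begin{pmatrix} W & H & E & E & L & E & R \\ H & E & E & L & E & R & W \\ E & E & L & E & R & W & H \\ E & L & E & R & W & H & E \\ L & E & R & W & H & E & E \\ E & R & W & H & E & E & L \\ R & W & H & E & E & L & E \end{pmatrix}.$$
2) From $M$ we obtain a new matrix $M'$ by simply ordering the rows in $M$ lexicographically. Here this yields the matrix
$$M' = \begin{pmatrix} E & E & L & E & R & W & H \\ E & L & E & R & W & H & E \\ E & R & W & H & E & E & L \\ H & E & E & L & E & R & W \\ L & E & R & W & H & E & E \\ R & W & H & E & E & L & E \\ W & H & E & E & L & E & R \end{pmatrix}.$$
3) The transformed string $\vec{L}$ then is just the last column of the matrix $M'$, and the index $I$ is the number of the row of $M'$ in which the original text is contained. In our example $\vec{L} =$ "HELWEER" and $I = 6$; we start counting the rows with row no. $0$.
This gives rise to the following pseudocode. We write here $X$ instead of $\vec{X}$ and $L$ instead of $\vec{L}$, since the purpose of the vector notation is only to distinguish the vectors from the letters in the text.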
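BWT-Encoder(X)
  FOR j ← 0 TO n − 1
      DO M[0, j] ← X[j]
  FOR i ← 1 TO n − 1
      DO FOR j ← 0 TO n − 1
             DO M[i, j] ← M[i − 1, (j + 1) mod n]
  FOR i ← 0 TO n − 1
      DO row i of M' ← row of M at position i in lexicographic order
  FOR i ← 0 TO n − 1
      DO L[i] ← M'[i, n − 1]
  i ← 0
  WHILE (row i of M' ≠ row 0 of M)
      DO i ← i + 1
  I ← i
  RETURN L and I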
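In a language with built-in sorting the encoder reduces to a few lines. The Python sketch below (the function name is hypothetical) builds all cyclic shifts explicitly, exactly as in the three rules, which takes quadratic memory and is only meant for short texts.

```python
def bwt_encode(text):
    """Return the transformed string and the row index of the original text."""
    n = len(text)
    rows = sorted(text[i:] + text[:i] for i in range(n))  # sorted cyclic shifts
    last_column = "".join(row[-1] for row in rows)
    return last_column, rows.index(text)
```

For the running example, `bwt_encode("WHEELER")` returns `('HELWEER', 6)`.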
It can be shown that this transformation is invertible, i.e., it is possible to reconstruct the original text $X$ from its transform $L$ and the index $I$. This is because these two parameters just yield enough information to find out the underlying permutation of the letters. Let us illustrate this reconstruction using the above example again. From the transformed string $L$ we obtain a second string $E$ by simply ordering the letters in $L$ in ascending order. Actually, $E$ is the first column of the matrix $M'$ above. So, in our example
$$L = \text{``H E L W E E R''}, \qquad E = \text{``E E E H L R W''}.$$
Now obviously the first letter $X(0)$ of our original text $X$ is the letter in position $I$ of the sorted string $E$, so here $X(0) = E(6) = W$. Then we look at the position of the letter just considered in the string $L$; here there is only one W, which is letter no. 3 in $L$. This position gives us the location of the next letter of the original text, namely $X(1) = E(3) = H$. H is found in position no. 0 in $L$, hence $X(2) = E(0) = E$. Now there are three E's in the string $L$. Since $E(0)$ is the first E in the sorted string, we take the first E in $L$, the one in position no. 1, and hence $X(3) = E(1) = E$. In general, the $k$-th occurrence of a letter in $E$ corresponds to the $k$-th occurrence of the same letter in $L$; in this example this always coincides with the first occurrence not used so far. We iterate this procedure and find $X(4) = E(4) = L$, $X(5) = E(2) = E$, $X(6) = E(5) = R$.
This suggests the following pseudocode.
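BWT-Decoder(L, I)
  E[0 . . n − 1] ← sort L[0 . . n − 1]
  p ← I
  FOR i ← 0 TO n − 1
      DO X[i] ← E[p]
         r ← number of positions q < p with E[q] = E[p]
         j ← 0
         WHILE (L[j] ≠ E[p] OR number of positions q < j with L[q] = E[p] is not r)
             DO j ← j + 1
         T[i] ← j
         p ← j
  RETURN X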
This algorithm implies a more formal description. Since the decoder only knows , he has to sort this string to find out . To each letter from the transformed string record the position in from which it was jumped to by the process described above. So the vector in our pseudocode yields a permutation such that for each row it is in matrix . In our example . This permutation can be used to reconstruct the original text of length via , where and for .
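The reconstruction can be sketched in Python with a stable sort, which pairs the $k$-th occurrence of a letter in the sorted string with the $k$-th occurrence of the same letter in the transformed string. The names below are hypothetical, and the list `order` plays the role of the permutation: `order[p]` is the position in the transformed string of the letter that stands at position `p` of the sorted string.

```python
def bwt_decode(last_column, index):
    """Recover the original text from the transformed string and the index."""
    n = len(last_column)
    # sorted() is stable, so equal letters keep their relative order.
    order = sorted(range(n), key=lambda k: last_column[k])
    first_column = [last_column[k] for k in order]
    out, p = [], index
    for _ in range(n):
        out.append(first_column[p])
        p = order[p]  # jump to the matching occurrence in the last column
    return "".join(out)
```

`bwt_decode("HELWEER", 6)` gives back `'WHEELER'`. The text "ABA", whose transform is "BAA" with index 1, shows why the occurrences have to be matched by rank: taking simply the first unused A would produce "AAB".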
Observe that so far the original data have only been transformed and are not compressed, since string has exactly the same length as the original string . So what is the advantage of the Burrows-Wheeler transformation? The idea is that the transformed string can be much more efficiently encoded than the original string. The dependencies among the letters have the effect that in the transformed string there appear long blocks consisting of the same letter.
In order to exploit such frequent blocks of the same letter, Burrows and Wheeler suggested the following move-to-front-code, which we shall illustrate again with our example above.
We write down a list containing the letters used in our text in alphabetic order, indexed by their position in this list.

    E H L R W
    0 1 2 3 4

Then we parse through the transformed string $L$ letter by letter, note the index of the next letter and move this letter to the front of the list. So in the first step we note 1, the index of the H, move H to the front and obtain the list

    H E L R W
    0 1 2 3 4

Then we note 1 and move E to the front,

    E H L R W
    0 1 2 3 4

note 2 and move L to the front,

    L E H R W
    0 1 2 3 4

note 4 and move W to the front,

    W L E H R
    0 1 2 3 4

note 2 and move E to the front,

    E W L H R
    0 1 2 3 4

note 0 and leave E at the front,

    E W L H R
    0 1 2 3 4

note 4 and move R to the front,

    R E W L H
    0 1 2 3 4

So we obtain the sequence $(1, 1, 2, 4, 2, 0, 4)$ as our move-to-front code. The pseudocode may look as follows, where $Q$ is a list of the letters occurring in the string $L$.
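Move-To-Front(L)
  Q[0 . . m − 1] ← list of the m letters occurring in L, ordered alphabetically
  FOR i ← 0 TO n − 1
      DO j ← 0
         WHILE (Q[j] ≠ L[i])
             DO j ← j + 1
         c[i] ← j
         FOR l ← j DOWNTO 1
             DO exchange Q[l] and Q[l − 1]
  RETURN c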
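Both directions of the move-to-front code fit into a short Python sketch. The function names are hypothetical, and the decoder is assumed to be given the set of letters occurring in the text, from which it builds the same initial list.

```python
def move_to_front_encode(text):
    """Return the list of indices noted while parsing text."""
    table = sorted(set(text))  # letters in alphabetic order
    code = []
    for letter in text:
        j = table.index(letter)
        code.append(j)
        table.insert(0, table.pop(j))  # move the letter to the front
    return code


def move_to_front_decode(code, alphabet):
    """Invert the code, starting from the same alphabetically ordered list."""
    table = sorted(alphabet)
    out = []
    for j in code:
        letter = table.pop(j)
        out.append(letter)
        table.insert(0, letter)
    return "".join(out)
```

`move_to_front_encode("HELWEER")` returns `[1, 1, 2, 4, 2, 0, 4]`, and `move_to_front_decode` applied to this list and the letters `"EHLRW"` returns `'HELWEER'`.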
The move-to-front-code will finally be compressed, for instance by Huffman-coding.
The compression is due to the move-to-front-code obtained from the transformed string . It can easily be seen that this move-to-front coding procedure is invertible, so one can recover the string from the code obtained as above.
Now it can be observed that in the move-to-front code small numbers occur more frequently. Unfortunately, this becomes obvious only with much longer texts than in our example: in long strings it was observed that even about 70 per cent of the numbers are $0$. This irregularity in distribution can be exploited by compressing the sequence obtained after move-to-front coding, for instance by Huffman codes or run-length codes.
The algorithm performed very well in practice regarding the compression rate as well as the speed. The asymptotic optimality of compression has been proven for a wide class of sources.
The most complex part of the Burrows-Wheeler transform is the sorting of the block yielding the transformed string . Due to fast sorting procedures, especially suited for the type of data to be compressed, compression algorithms based on the Burrows-Wheeler transform are usually very fast. On the other hand, compression is done blockwise. The text to be compressed has to be divided into blocks of appropriate size such that the matrices and still fit into the memory. So the decoder has to wait until the whole next block is transmitted and cannot work sequentially bit by bit as in arithmetic coding or Ziv-Lempel coding.
Exercises
3.4-1 Apply the Burrows-Wheeler-transform and the move-to-front code to the text “abracadabra”.
3.4-2 Verify that the transformed string and the index of the position in the sorted text (containing the first letter of the original text to be compressed) indeed yield enough information to reconstruct the original text.
3.4-3 Show how in our example the decoder would obtain the string "HELWEER" from the move-to-front code and the letters E, H, L, W, R occurring in the text. Describe the general procedure for decoding move-to-front codes.
3.4-4 We followed here the encoding procedure presented by Burrows and Wheeler. Can the encoder obtain the transformed string even without constructing the two matrices and ?
The idea of image compression algorithms is similar to the one behind the Burrows-Wheeler-transform. The text to be compressed is transformed to a format which is suitable for application of the techniques presented in the previous sections, such as Huffman coding or arithmetic coding. There are several procedures based on the type of image (for instance, black/white, greyscale or colour image) or compression (lossless or lossy). We shall present the basic steps—representation of data, discrete cosine transform, quantisation, coding—of lossy image compression procedures like the standard JPEG.
A greyscale image is represented as a two-dimensional array $X$, where each entry $X(i, j)$ represents the intensity (or brightness) at position $(i, j)$ of the image. Each $X(i, j)$ is either a signed or an unsigned $k$-bit integer, i.e., $X(i, j) \in \{-2^{k-1}, \ldots, 2^{k-1} - 1\}$ or $X(i, j) \in \{0, \ldots, 2^k - 1\}$.
A position in a colour image is usually represented by three greyscale values $R(i, j)$, $G(i, j)$, and $B(i, j)$ per position corresponding to the intensity of the primary colours red, green and blue.
In order to compress the image, the three arrays (or channels) $R$, $G$, $B$ are first converted to the luminance/chrominance space by the $YC_bC_r$-transform (performed entry-wise)
$$\begin{pmatrix} Y \\ C_b \\ C_r \end{pmatrix} = \begin{pmatrix} 0.299 & 0.587 & 0.114 \\ -0.169 & -0.331 & 0.5 \\ 0.5 & -0.419 & -0.081 \end{pmatrix} \cdot \begin{pmatrix} R \\ G \\ B \end{pmatrix}.$$
$Y$ is the luminance or intensity channel, where the coefficients weighting the colours have been found empirically and represent the best possible approximation of the intensity as perceived by the human eye. The chrominance channels $C_b$ and $C_r$ contain the colour information on red and blue as the differences from $Y$. The information on green is contained for the most part in the luminance $Y$.
A first compression for colour images commonly is already obtained after application of the $YC_bC_r$-transform by removing irrelevant information. Since the human eye is less sensitive to rapid colour changes than to changes in intensity, the resolution of the two chrominance channels $C_b$ and $C_r$ is reduced by a factor of $2$ in both vertical and horizontal direction, which results after sub-sampling in arrays of $1/4$ of the original size.
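The colour conversion and the sub-sampling step can be illustrated by the following Python sketch. It assumes the coefficients of the widely used JFIF file format, which add an offset of 128 so that the chrominance values of unsigned 8-bit input stay in the unsigned range, and it sub-samples by averaging each group of $2 \times 2$ entries, which is one of several possible choices. The function names are hypothetical.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one pixel; JFIF coefficients with offset 128 (assumed)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


def subsample(channel):
    """Halve the resolution in both directions by averaging 2x2 groups.

    Assumes that the height and the width of the channel are even.
    """
    return [
        [
            (channel[i][j] + channel[i][j + 1]
             + channel[i + 1][j] + channel[i + 1][j + 1]) / 4
            for j in range(0, len(channel[0]), 2)
        ]
        for i in range(0, len(channel), 2)
    ]
```

A grey pixel with $R = G = B$ is mapped to $Y = R$ and, up to rounding, to the neutral chrominance value 128 in both colour channels; a channel with $4 \times 4$ entries is reduced to $2 \times 2$ entries, a quarter of the original size.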
The arrays then are subdivided into $8 \times 8$ blocks, on which successively the actual (lossy) data compression procedure is applied.
Let us consider the following example based on a real image, on which the steps of compression will be illustrated. Assume that the $8 \times 8$ block of $8$-bit unsigned integers below is obtained as a part of an image.
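Each $8 \times 8$ block $(f(i, j))_{i,j}$, say, is transformed into a new block $(F(u, v))_{u,v}$. There are several possible transforms; usually the discrete cosine transform is applied, which here obeys the formula
$$F(u, v) = \frac{1}{4} c_u c_v \sum_{i=0}^{7} \sum_{j=0}^{7} f(i, j) \cos\frac{(2i + 1) u \pi}{16} \cos\frac{(2j + 1) v \pi}{16}.$$
The cosine transform is applied after shifting the unsigned integers to signed integers by subtraction of $2^{k-1}$.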
DCT(f)
  FOR u ← 0 TO 7
      DO FOR v ← 0 TO 7
             DO F(u, v) ← DCT-coefficient of matrix f
  RETURN F
The coefficients need not be calculated according to the formula above. They can also be obtained via a related Fourier transform (see Exercises) such that a Fast Fourier Transform may be applied. The later JPEG 2000 standard replaces the discrete cosine transform by a wavelet transform.
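The discrete cosine transform can be inverted via
$$f(i, j) = \frac{1}{4} \sum_{u=0}^{7} \sum_{v=0}^{7} c_u c_v F(u, v) \cos\frac{(2i + 1) u \pi}{16} \cos\frac{(2j + 1) v \pi}{16},$$
where $c_u$ and $c_v$ are normalisation constants, equal to $1/\sqrt{2}$ for index $0$ and to $1$ otherwise.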
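The transform and its inverse can be written down directly from the definition. The Python sketch below assumes the usual JPEG normalisation, in which the constant is $1/\sqrt{2}$ for index $0$ and $1$ otherwise, and evaluates the double sums naively; the function names are hypothetical, and a practical implementation would use a fast algorithm instead.

```python
import math


def _c(k):
    """Normalisation constant (assumed: 1/sqrt(2) for index 0, else 1)."""
    return 1 / math.sqrt(2) if k == 0 else 1.0


def _cos(x, k):
    return math.cos((2 * x + 1) * k * math.pi / 16)


def dct_8x8(f):
    """Discrete cosine transform of an 8x8 block given as a list of rows."""
    return [
        [
            0.25 * _c(u) * _c(v) * sum(
                f[i][j] * _cos(i, u) * _cos(j, v)
                for i in range(8) for j in range(8)
            )
            for v in range(8)
        ]
        for u in range(8)
    ]


def idct_8x8(F):
    """Inverse transform; recovers the block up to rounding errors."""
    return [
        [
            0.25 * sum(
                _c(u) * _c(v) * F[u][v] * _cos(i, u) * _cos(j, v)
                for u in range(8) for v in range(8)
            )
            for j in range(8)
        ]
        for i in range(8)
    ]
```

For a block in which every entry equals $10$, the entry in position $(0, 0)$ of the transformed block is $80$ and all other coefficients vanish up to rounding, which reflects that this entry measures the intensity of the whole block.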
In our example, the transformed block is
where the entries are rounded.
The discrete cosine transform is closely related to the discrete Fourier transform and similarly maps signals to frequencies. Removing higher frequencies results in a less sharp image, an effect that is tolerated, such that higher frequencies are stored with less accuracy.
Of special importance is the entry $F(0, 0)$, which can be interpreted as a measure for the intensity of the whole block.
The discrete cosine transform maps integers to real numbers, which in each case have to be rounded to be representable. Of course, this rounding already results in a loss of information. However, the transformed block will now be much easier to manipulate. A quantisation takes place, which maps the entries of to integers by division by the corresponding entry in a luminance quantisation matrix . In our example we use
The quantisation matrix has to be carefully chosen in order to leave the image at highest possible quality. Quantisation is the lossy part of the compression procedure. The idea is to remove information which should not be “visually significant”. Of course, at this point there is a tradeoff between the compression rate and the quality of the decoded image. So, in JPEG the quantisation table is not included into the standard but must be specified (and hence be encoded).
Quantisation(F)
  FOR i ← 0 TO 7
      DO FOR j ← 0 TO 7
             DO T(i, j) ← {F(i, j) / Q(i, j)}
  RETURN T
This quantisation transforms block $F$ to a new block $T$ with $T(i, j) = \{F(i, j) / Q(i, j)\}$, where $\{x\}$ is the closest integer to $x$. This block will finally be encoded. Observe that in the transformed block $F$, besides the entry $F(0, 0)$, all other entries are relatively small numbers, which has the effect that $T$ mainly consists of $0$s.
Coefficient , in this case , deserves special consideration. It is called DC term (direct current), while the other entries are denoted AC coefficients (alternating current).
Matrix will finally be encoded by a Huffman code. We shall only sketch the procedure. First the DC term will be encoded by the difference to the DC term of the previously encoded block. For instance, if the previous DC term was 12, then will be encoded as .
After that the AC coefficients are encoded according to the zig-zag order , , , , , , , etc.. In our example, this yields the sequence followed by 55 zeros. This zig–zag order exploits the fact that there are long runs of successive zeros. These runs will be even more efficiently represented by application of run-length coding, i. e., we encode the number of zeros before the next nonzero element in the sequence followed by this element.
Integers are written in such a way that small numbers have shorter representations. This is achieved by splitting their representation into size (number of bits to be reserved) and amplitude (the actual value). So, has size , and have size . , , , and have size , etc.
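The split into size and amplitude can be sketched as follows. The function names are hypothetical. The `size` function follows the rule described above (0 has size 0, ±1 have size 1, and so on). The bit pattern produced by `amplitude_bits` is an assumption: it uses the usual JPEG convention in which a negative value v is stored as v + 2^size − 1, which the text itself does not spell out.

```python
def size(v):
    """Number of bits to reserve for v: the bit length of |v|, so 0 has size 0."""
    return abs(v).bit_length()

def amplitude_bits(v):
    """Bit string of length size(v) for v.

    Assumed convention: positive values are written in binary,
    negative values are written as v + 2**size(v) - 1.
    """
    s = size(v)
    if s == 0:
        return ""
    code = v if v > 0 else v + (1 << s) - 1
    return format(code, "0{}b".format(s))

for v in (0, 1, -1, 2, -3, 7, -4):
    print(v, size(v), amplitude_bits(v))
```

With this convention the leading bit of the amplitude tells the sign, so no separate sign bit is needed.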
In our example this yields the sequence for the DC term followed by , , , , , and a final as an end-of-block symbol indicating that only zeros follow from now on. , for instance, means that there is 1 zero followed by an element of size and amplitude .
These pairs are then assigned codewords from a Huffman code. There are different Huffman codes for the pairs (run, size) and for the amplitudes. These Huffman codes have to be specified and hence be included in the code of the image.
In the following pseudocode for the encoding of a single $8 \times 8$-block we shall denote the different Huffman codes by encode-1, encode-2, encode-3.
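Run-Length-Code(T)
▷ DIFF holds the DC term of the previously encoded block
c ← encode-1(size(T[0, 0] − DIFF))
c ← c || encode-3(amplitude(T[0, 0] − DIFF))
DIFF ← T[0, 0]
run ← 0
FOR l ← 1 TO 14
  DO FOR i ← max(0, l − 7) TO min(l, 7)
    DO IF l mod 2 = 1
      THEN u ← i
      ELSE u ← l − i
    IF T[u, l − u] = 0
      THEN run ← run + 1
      ELSE c ← c || encode-2(run, size(T[u, l − u]))
        c ← c || encode-3(amplitude(T[u, l − u]))
        run ← 0
IF run > 0
  THEN c ← c || encode-2(0, 0)
RETURN c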
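The zig-zag scan and the run-length step can be made concrete with a short Python sketch. It stops before the Huffman stage and only produces the intermediate symbols. The names `zigzag_indices` and `run_length_symbols` are hypothetical, the block `T` below is a made-up illustration, and the restriction that real JPEG places on very long zero runs is ignored here.

```python
def size(v):
    """Number of bits to reserve for v (0 has size 0)."""
    return abs(v).bit_length()

def zigzag_indices(n=8):
    """Positions (row, column) of an n x n block in zig-zag order."""
    order = []
    for l in range(2 * n - 1):                      # l = row + column
        lo, hi = max(0, l - n + 1), min(l, n - 1)   # valid rows on this diagonal
        rows = range(lo, hi + 1) if l % 2 == 1 else range(hi, lo - 1, -1)
        order.extend((u, l - u) for u in rows)
    return order

def run_length_symbols(T, prev_dc):
    """DC difference first, then (run, size, amplitude) triples, then (0, 0) as end-of-block."""
    diff = T[0][0] - prev_dc
    symbols = [(size(diff), diff)]
    run = 0
    for u, v in zigzag_indices(len(T))[1:]:
        if T[u][v] == 0:
            run += 1
        else:
            symbols.append((run, size(T[u][v]), T[u][v]))
            run = 0
    if run > 0:
        symbols.append((0, 0))   # only zeros follow
    return symbols

# Hypothetical quantised block: a few nonzero entries in the upper left corner.
T = [[0] * 8 for _ in range(8)]
T[0][0], T[1][0], T[2][0], T[1][1], T[0][2], T[2][1] = 15, -2, -1, -1, -1, -1
print(run_length_symbols(T, prev_dc=12))
# [(2, 3), (1, 2, -2), (0, 1, -1), (0, 1, -1), (0, 1, -1), (2, 1, -1), (0, 0)]
```

Each triple would then be split into a (run, size) pair and an amplitude, and each part would be looked up in its own Huffman codebook.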
At the decoding end matrix will be reconstructed. Finally, by multiplication of each entry by the corresponding entry from the quantisation matrix we obtain an approximation to the block , here
The inverse cosine transform is applied to this approximation. This allows us to decode the $8 \times 8$-block of the original image, in our example as
Exercises
3.5-1 Find size and amplitude for the representation of the integers , , and .
3.5-2 Write the entries of the following matrix in zig-zag order
How would this matrix be encoded if the difference of the DC term to the previous one was ?
3.5-3 In our example after quantising the sequence , , , , , , has to be encoded. Assume the Huffman codebooks would yield to encode the difference from the preceding block's DC, , , and for the amplitudes , , and , respectively, and , , , and for the pairs , , , and , respectively. What would be the bitstream to be encoded for the block in our example? How many bits would hence be necessary to compress this block?
3.5-4 What would be matrices , and , if we had used
for quantising after the cosine transform in the block of our example?
3.5-5 What would be the zig-zag-code in this case (assuming again that the DC term would have difference from the previous DC term)?
3.5-6 For any sequence define a new sequence by
This sequence can be expanded to a Fourier-series via
Show how the coefficients of the discrete cosine transform
arise from this Fourier series.
PROBLEMS
3-1 Adaptive Huffman-codes
Dynamic and adaptive Huffman-coding is based on the following property. A binary code tree has the sibling property if each node has a sibling and if the nodes can be listed in order of nonincreasing probabilities with each node being adjacent to its sibling. Show that a binary prefix code is a Huffman-code exactly if the corresponding code tree has the sibling property.
3-2 Generalisations of Kraft's inequality
In the proof of Kraft's inequality it is essential to order the lengths . Show that the construction of a prefix code for given lengths is not possible if we are not allowed to order the lengths. This scenario of unordered lengths occurs with the Shannon-Fano-Elias-code and in the theory of alphabetic codes, which are related to special search problems. Show that in this case a prefix code with lengths exists if and only if
If we additionally require the prefix codes to be suffix-free as well, i.e., no codeword is the end of another one, it is an open problem to show that Kraft's inequality holds with the 1 on the right-hand side replaced by 3/4, i.e.,
3-3 Redundancy of Krichevsky-Trofimov-estimator
Show that using the Krichevsky-Trofimov-estimator, when parameter of a discrete memoryless source is unknown, the individual redundancy of sequence is at most for all sequences and all .
3-4 Alternatives to move-to-front-codes
Find further procedures which like move-to-front-coding prepare the text for compression after application of the Burrows-Wheeler-transform.
CHAPTER NOTES
The frequency table of the letters in English texts is taken from [272]. The Huffman coding algorithm was introduced by Huffman in [119]. A pseudocode can be found in [54], where the Huffman coding algorithm is presented as a special Greedy algorithm. There are also adaptive or dynamic variants of Huffman-coding, which adapt the Huffman-code if it is no longer optimal for the actual frequency table, for the case that the probability distribution of the source is not known in advance. The “3/4-conjecture” on Kraft's inequality for fix-free-codes is due to Ahlswede, Balkenhol, and Khachatrian [4].
Arithmetic coding was introduced by Rissanen [212] and Pasco [199]. For a discussion of implementation questions see [158], [277]. In the section on modelling we follow the presentation of Willems, Shtarkov and Tjalkens in [275]. The exact calculations can be found in their original paper [274], which received the Best Paper Award of the IEEE Information Theory Society in 1996. The Krichevsky-Trofimov-estimator was introduced in [153].
We presented the two original algorithms LZ77 and LZ78 [281], [282] due to Lempel and Ziv. Many variants, modifications and extensions have been developed since then, concerning the handling of the dictionary, the pointers, the behaviour after the dictionary is complete, etc. For a description, see, for instance, [25] or [26]. Most of the prominent tools for data compression are variations of Ziv-Lempel-coding. For example, “zip” and “gzip” are based on LZ77, and a variant of LZ78 is used by the program “compress”.
The Burrows-Wheeler transform was introduced in the technical report [36]. It subsequently became popular, especially because of the Unix compression tool “bzip” based on the Burrows-Wheeler-transform, which outperformed most dictionary-based tools on several benchmark files. It also avoids arithmetic coding, for which patent rights have to be taken into consideration. Further investigations of the Burrows-Wheeler-transform have been carried out, for instance in [21], [70], [155].
We only sketched the basics behind lossy image compression, especially the preparation of the data for the application of techniques such as Huffman coding. For a detailed discussion we refer to [256], where the new JPEG2000 standard is also described. Our example is taken from [268].
JPEG (short for Joint Photographic Experts Group) is very flexible. For instance, it also supports lossless data compression. None of the choices presented in the section on image compression is the only one possible. There are models involving more basic colours and further transforms besides the -transform (for which even different scaling factors for the chrominance channels were used; the formula presented here is from [256]). The cosine transform may be replaced by another operation like a wavelet transform. Further, there is freedom to choose the quantisation matrix, which is responsible for the quality of the compressed image, and the Huffman code. On the other hand, this has the effect that these parameters have to be explicitly specified and hence are part of the coded image.
The ideas behind procedures for video and sound compression are rather similar to those for image compression. In principle, they follow the same steps. The amount of data in these cases, however, is much bigger. Again information is lost by removing irrelevant information not perceptible to the human eye or ear (for instance by psychoacoustic models) and by quantising, where the quality should not be reduced significantly. More refined quantising methods are applied in these cases.
Most information on data compression algorithms can be found in the literature on Information Theory, for instance [55], [104], since the analysis of the achievable compression rates requires knowledge of source coding theory. Recently, several books on data compression have appeared, for instance [26], [106], [191], [224], [225], to which we refer for further reading. The benchmark files of the Calgary Corpus and the Canterbury Corpus are available under [37] or [38].
The book of I. Csiszár and J. Körner [58] analyses different aspects of information theory, including the problems of data compression.
Any planned computation will be subject to different kinds of unpredictable influences during execution. Here are some examples:
(1) Loss or change of stored data during execution.
(2) Random, physical errors in the computer.
(3) Unexpected interactions between different parts of the system working simultaneously, or loss of connections in a network.
(4) Bugs in the program.
(5) Malicious attacks.
Up to now, it does not seem that the problem of bugs can be solved just with the help of appropriate algorithms. The discipline of software engineering addresses this problem by studying and improving the structure of programs and the process of their creation.
Malicious attacks are addressed by the discipline of computer security. A large part of the recommended solutions involves cryptography.
Problems of kind (3) are very important and a whole discipline, distributed computing, has been created to deal with them.
The problem of storage errors is similar to the problems of reliable communication, studied in information theory: it can be viewed as communication from the present to the future. In both cases, we can protect against noise with the help of error-correcting codes (you will see some examples below).
In this chapter, we will discuss some sample problems, mainly from category (2). In this category, a distinction should also be made between permanent and transient errors. An error is permanent when a part of the computing device is damaged physically and remains faulty for a long time, until some outside intervention repairs it. It is transient if it happens only in a single step: the part of the device in which it happened is not damaged, and in the next step it operates correctly again. For example, if a position in memory turns from 0 to 1 by accident, but a subsequent write operation can write a 0 again, then a transient error happened. If the bit turned to 1 and the computer cannot change it to 0 again, this is a permanent error.
Some of these problems, especially the ones for transient errors, are as old as computing. The details of any physical errors depend on the kind of computer it is implemented on (and, of course, on the kind of computation we want to carry out). But after abstracting away from a lot of distracting details, we are left with some clean but challenging theoretical formulations, and some rather pleasing solutions. There are also interesting connections to other disciplines, like statistical physics and biology.
The computer industry has been amazingly successful over the last five decades in making the computer components smaller, faster, and at the same time more reliable. Among the daily computer horror stories seen in the press, the one conspicuously missing is where the processor wrote a 1 in place of a 0, just out of caprice. (It indisputably happens, but too rarely to become the identifiable source of some visible malfunction.) On the other hand, the generality of some of the results on the correction of transient errors makes them applicable in several settings. Though individual physical processors are very reliable (error rate is maybe once in every executions), when considering a whole network as performing a computation, the problems caused by unreliable network connections or possibly malicious network participants are not unlike the problems caused by unreliable processors.
The key idea for making a computation reliable is redundancy, which might be formulated as the following two procedures:
(i) Store information in such a form that losing any small part of it is not fatal: it can be restored using the rest of the data. For example, store it in multiple copies.
(ii) Perform the needed computations repeatedly, to make sure that the faulty results can be outvoted.
Our chapter will only use these methods, but there are other remarkable ideas which we cannot follow up here. For example, method (ii) seems especially costly; it is desirable to avoid a lot of repeated computation. The following ideas target this dilemma.
(A) Perform the computation directly on the information in its redundant form: then maybe recomputations can be avoided.
(B) Arrange the computation into “segments” in such a way that those partial results that are to be used later can be cheaply checked at each “milestone” between segments. If the checking finds an error, repeat the last segment.
The present chapter does not require great sophistication in probability theory but there are some facts coming up repeatedly which I will review here. If you need additional information, you will find it in any graduate probability theory text.
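A probability space is a triple $(\Omega, \mathcal{A}, \mathbf{P})$ where $\Omega$ is the set of elementary events, $\mathcal{A}$ is a set of subsets of $\Omega$ called the set of events and $\mathbf{P} : \mathcal{A} \to [0, 1]$ is a function. For $E \in \mathcal{A}$, the value $\mathbf{P}(E)$ is called the probability of event $E$. It is required that $\Omega \in \mathcal{A}$ and that $E \in \mathcal{A}$ implies $\Omega \setminus E \in \mathcal{A}$. Further, if a (possibly infinite) sequence of sets is in $\mathcal{A}$ then so is their union. Also, it is assumed that $\mathbf{P}(\Omega) = 1$ and that if $E_1, E_2, \ldots \in \mathcal{A}$ are disjoint then
$$\mathbf{P}\Bigl(\bigcup_i E_i\Bigr) = \sum_i \mathbf{P}(E_i).$$
For $\mathbf{P}(F) > 0$, the conditional probability of $E$ given $F$ is defined as
$$\mathbf{P}(E \mid F) = \mathbf{P}(E \cap F) / \mathbf{P}(F).$$
Events $E_1, \ldots, E_n$ are independent if for any sequence $1 \le i_1 < \cdots < i_k \le n$ we have
$$\mathbf{P}(E_{i_1} \cap \cdots \cap E_{i_k}) = \mathbf{P}(E_{i_1}) \cdots \mathbf{P}(E_{i_k}).$$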
Example 4.1 Let where is the set of all subsets of and . This is an example of a discrete probability space: one that has a countable number of elements.
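More generally, a discrete probability space is given by a countable set $\Omega = \{\omega_1, \omega_2, \ldots\}$, and a sequence $p_1, p_2, \ldots$ with $p_i \ge 0$, $\sum_i p_i = 1$. The set $\mathcal{A}$ of events is the set of all subsets of $\Omega$, and for an event $E \subseteq \Omega$ we define $\mathbf{P}(E) = \sum_{\omega_i \in E} p_i$.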
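A small discrete probability space can be written out directly, which may help to check the definitions of probability, conditional probability and independence by hand. The following Python sketch uses two fair dice as a hypothetical example. The names `prob` and `cond_prob` are illustrative only, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from itertools import product

# Hypothetical example: two fair dice, all 36 elementary events equally likely.
omega = list(product(range(1, 7), repeat=2))
p = {w: Fraction(1, 36) for w in omega}

def prob(event):
    """P(E): the sum of the weights of the elementary events in E."""
    return sum((p[w] for w in event), Fraction(0))

def cond_prob(e, f):
    """P(E | F) = P(E and F) / P(F), defined only when P(F) > 0."""
    return prob(e & f) / prob(f)

first_is_six = {w for w in omega if w[0] == 6}
second_is_even = {w for w in omega if w[1] % 2 == 0}
total_is_twelve = {w for w in omega if sum(w) == 12}

print(prob(first_is_six))                          # 1/6
print(cond_prob(total_is_twelve, first_is_six))    # 1/6
print(prob(first_is_six & second_is_even)
      == prob(first_is_six) * prob(second_is_even))  # True: the two events are independent
```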
A random variable over a probability space $(\Omega, \mathcal{A}, \mathbf{P})$ is a function $f$ from $\Omega$ to the real numbers, with the property that every set of the form $\{\omega : f(\omega) < c\}$ is an event: it is in $\mathcal{A}$. Frequently, random variables are denoted by capital letters $X, Y, Z$, possibly with indices, and the argument $\omega$ is omitted from $X(\omega)$. The event $\{\omega : X(\omega) < c\}$ is then also written as $[X < c]$. This notation is freely and informally extended to more complicated events. The distribution of a random variable $X$ is the function $F(c) = \mathbf{P}[X < c]$. We will frequently only specify the distribution of our variables, and not mention the underlying probability space, when it is clear from the context that it can be specified in one way or another. We can speak about the joint distribution of two or more random variables, but only if it is assumed that they can be defined as functions on a common probability space. Random variables $X_1, \ldots, X_n$ with a joint distribution are independent if every $n$-tuple of events of the form $[X_1 < c_1], \ldots, [X_n < c_n]$ is independent.
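The expected value of a random variable $X$ taking values $a_1, a_2, \ldots$ with probabilities $p_1, p_2, \ldots$ is defined as
$$\mathbf{E} X = p_1 a_1 + p_2 a_2 + \cdots.$$
It is easy to see that the expected value is a linear function of the random variable:
$$\mathbf{E}(\alpha X + \beta Y) = \alpha \mathbf{E} X + \beta \mathbf{E} Y,$$
even if $X, Y$ are not independent. On the other hand, if variables $X, Y$ are independent then the expected values can also be multiplied:
$$\mathbf{E}(XY) = \mathbf{E} X \cdot \mathbf{E} Y. \tag{4.1}$$
There is an important simple inequality called the Markov inequality, which says that for an arbitrary nonnegative random variable $X$ and any value $\lambda > 0$ we have
$$\mathbf{P}[X \ge \lambda] \le \mathbf{E} X / \lambda. \tag{4.2}$$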
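In what follows the bounds
$$\frac{x}{1+x} \le \ln(1+x) \le x \quad \text{for } x > -1 \tag{4.3}$$
will be useful. Of these, the well-known upper bound $\ln(1+x) \le x$ holds since the graph of the function $\ln(1+x)$ is below its tangent line drawn at the point $x = 0$. The lower bound is obtained from the identity $\frac{1}{1+x} = 1 - \frac{x}{1+x}$ and
$$-\ln(1+x) = \ln\frac{1}{1+x} = \ln\Bigl(1 - \frac{x}{1+x}\Bigr) \le -\frac{x}{1+x}.$$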
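Consider $n$ independent random variables $X_1, \ldots, X_n$ that are identically distributed, with
$$\mathbf{P}[X_i = 1] = p, \qquad \mathbf{P}[X_i = 0] = 1 - p.$$
Let
$$S_n = X_1 + \cdots + X_n.$$
We want to estimate the probability $\mathbf{P}[S_n \ge fn]$ for any constant $0 < f < 1$. The “law of large numbers” says that if $f > p$ then this probability converges fast to 0 as $n \to \infty$ while if $f < p$ then it converges fast to 1. Let
$$D(f, p) = f \ln\frac{f}{p} + (1-f) \ln\frac{1-f}{1-p} \tag{4.4}$$
$$> f \ln\frac{f}{p} - f = f \ln\frac{f}{ep}, \tag{4.5}$$
where the inequality (useful for small $f$ and $ep < f$) comes via $1 > 1 - p > 1 - f$ and $\ln(1-f) \ge -\frac{f}{1-f}$ from (4.3). Using the concavity of logarithm, it can be shown that $D(f, p)$ is always nonnegative, and is 0 only if $f = p$ (see Exercise 4.1-1).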
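Theorem 4.1 (Large deviations for coin-toss) If $f > p$ then
$$\mathbf{P}[S_n \ge fn] \le e^{-n D(f, p)}.$$
This theorem shows that if $f > p$ then $\mathbf{P}[S_n \ge fn]$ converges to 0 exponentially fast. Inequality (4.5) will allow the following simplification:
$$\mathbf{P}[S_n \ge fn] \le e^{-n f \ln\frac{f}{ep}} = \Bigl(\frac{ep}{f}\Bigr)^{nf}, \tag{4.6}$$
useful for small $f$ and $ep < f$.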
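Proof. For a certain real number $\alpha > 1$ (to be chosen later), let $Y_i$ be the random variable that is $\alpha$ if $X_i = 1$ and $1$ if $X_i = 0$, and let $P_n = Y_1 \cdots Y_n = \alpha^{S_n}$: then
$$\mathbf{P}[S_n \ge fn] = \mathbf{P}[P_n \ge \alpha^{fn}].$$
Applying the Markov inequality (4.2) and (4.1), we get
$$\mathbf{P}[P_n \ge \alpha^{fn}] \le \mathbf{E} P_n / \alpha^{fn} = (\mathbf{E} Y_1 / \alpha^{f})^n,$$
where $\mathbf{E} Y_1 = p\alpha + (1 - p)$. Let us choose $\alpha = \frac{f(1-p)}{p(1-f)}$, this is $> 1$ if $p < f$. Then we get $\mathbf{E} Y_1 = \frac{1-p}{1-f}$, and hence
$$\mathbf{E} Y_1 / \alpha^{f} = \frac{p^f (1-p)^{1-f}}{f^f (1-f)^{1-f}} = e^{-D(f, p)}.$$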
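The large deviation bound can be checked numerically against the exact binomial tail. The sketch below assumes the standard form of the bound, $\mathbf{P}[S_n \ge fn] \le e^{-nD(f,p)}$ with $D(f,p) = f \ln\frac{f}{p} + (1-f)\ln\frac{1-f}{1-p}$, together with the weaker estimate $(ep/f)^{nf}$. The function names and the parameter values are illustrative only.

```python
import math

def divergence(f, p):
    """D(f, p) = f ln(f/p) + (1 - f) ln((1 - f)/(1 - p)), for 0 < f, p < 1."""
    return f * math.log(f / p) + (1 - f) * math.log((1 - f) / (1 - p))

def exact_tail(n, p, k):
    """Exact P[S_n >= k] for a sum S_n of n independent coin tosses with bias p."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n, p, k = 100, 0.1, 30          # hypothetical parameters, f = k / n = 0.3 > p
f = k / n
tail = exact_tail(n, p, k)
bound = math.exp(-n * divergence(f, p))
weak_bound = (math.e * p / f) ** k

print(tail, bound, weak_bound)
print(tail <= bound <= weak_bound)   # True
```

The threshold is passed as the integer `k` rather than computed from `f * n`, since floating-point rounding of the product could shift it by one.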
This theorem also yields some convenient estimates for binomial coefficients. Let
$$h(f) = -f \ln f - (1-f) \ln(1-f).$$
This is sometimes called the entropy of the probability distribution $(f, 1-f)$ (measured in logarithms over base $e$ instead of base 2). From inequality (4.3) we obtain the estimate
$$-f \ln f \le h(f) \le f \ln\frac{e}{f}, \tag{4.7}$$
which is useful for small $f$.
In particular, taking with gives
Proof. Theorem 4.1 says for the case :
Substituting , and noting the symmetries , and (4.7) gives (4.8).
Remark 4.3 Inequality (4.6) also follows from the trivial estimate combined with (4.9).
Exercises
4.1-1 Prove the statement made in the main text that $D(f, p)$ is always nonnegative, and is 0 only if $f = p$.
4.1-2 For , derive from Theorem 4.1 the useful bound
Hint. Let , and use the Taylor formula: , where .
4.1-3 Prove that in Theorem 4.1, the assumption that are independent and identically distributed can be weakened: replaced by the single inequality
In a model of computation taking errors into account, the natural assumption is that errors occur everywhere. The most familiar kind of computer, which is separated into a single processor and memory, seems extremely vulnerable under such conditions: while the processor is not “looking”, noise may cause irreparable damage in the memory. Let us therefore rather consider computation models that are parallel: information is being processed everywhere in the system, not only in some distinguished places. Then error correction can be built into the work of every part of the system. We will concentrate on the best known parallel computation model: Boolean circuits.
Let us look inside a computer (actually inside an integrated circuit, with a microscope). Discouraged by a lot of physical detail irrelevant to abstract notions of computation, we will decide to look at the blueprints of the circuit designer, at the stage when it shows the smallest elements of the circuit still according to their computational functions. We will see a network of lines that can be in two states (of electric potential), “high” or “low”, or in other words “true” or “false”, or, as we will write, 1 or 0. The points connected by these lines are the familiar logic components: at the lowest level of computation, a typical computer processes bits. Integers, floating-point numbers and characters are all represented as strings of bits, and the usual arithmetical operations can be composed of bit operations.
Definition 4.4 A Boolean vector function is a mapping $f : \{0, 1\}^n \to \{0, 1\}^m$. Most of the time, we will take $m = 1$ and speak of a Boolean function.
The variables in $f(x_1, \ldots, x_n)$ are sometimes called Boolean variables or bits.
Example 4.2 Given an undirected graph with nodes, suppose we want to study the question whether it has a Hamiltonian cycle (a sequence listing all vertices of such that is an edge for each and also is an edge). This question is described by a Boolean function as follows. The graph can be described with Boolean variables (): is 1 if and only if there is an edge between nodes and . We define if there is a Hamiltonian cycle in and 0 otherwise.
Example 4.3 [Boolean vector function] Let , let the input be two integers , written as -bit strings: . The output of the function is their product (written in binary): if , then .
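There are only four one-variable Boolean functions: the identically 0, identically 1, the identity and the negation: $x \mapsto \neg x = 1 - x$. We mention only the following two-variable Boolean functions: the operation of conjunction (logical AND):
$$x \wedge y = \begin{cases} 1 & \text{if } x = y = 1, \\ 0 & \text{otherwise;} \end{cases}$$
this is the same as multiplication. The operation of disjunction, or logical OR:
$$x \vee y = \begin{cases} 0 & \text{if } x = y = 0, \\ 1 & \text{otherwise.} \end{cases}$$
It is easy to see that $x \vee y = \neg(\neg x \wedge \neg y)$: in other words, disjunction can be expressed using the functions $\neg, \wedge$ and the operation of composition. The following two-argument Boolean functions are also frequently used:
$$x \to y = \neg x \vee y \ \text{(implication)}, \qquad x \leftrightarrow y = (x \to y) \wedge (y \to x) \ \text{(equivalence)}, \qquad x \oplus y = x + y \bmod 2 \ \text{(binary addition)}.$$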
A finite number of Boolean functions is sufficient to express all others: thus, arbitrarily complex Boolean functions can be “computed” by “elementary” operations. In some sense, this is what happens inside computers.
Definition 4.5 A set of Boolean functions is a complete basis if every other Boolean function can be obtained by repeated composition from its elements.
Claim 4.6 The set $\{\wedge, \vee, \neg\}$ forms a complete basis; in other words, every Boolean function can be represented by a Boolean expression using only these connectives.
The proof can be found in all elementary introductions to propositional logic. Note that since $\vee$ can be expressed using $\{\wedge, \neg\}$, this latter set is also a complete basis (and so is $\{\vee, \neg\}$).
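That conjunction and negation alone suffice for other connectives can be verified by brute force over all inputs. The following Python sketch builds disjunction (by De Morgan's law) and binary addition from `AND` and `NOT` only, and compares them with Python's built-in bit operators. The function names are illustrative.

```python
from itertools import product

def NOT(x):
    return 1 - x

def AND(x, y):
    return x * y                         # conjunction is multiplication on {0, 1}

# Everything below is composed from NOT and AND only.
def OR(x, y):
    return NOT(AND(NOT(x), NOT(y)))      # De Morgan's law

def XOR(x, y):
    return AND(OR(x, y), NOT(AND(x, y)))  # "x or y, but not both"

for x, y in product((0, 1), repeat=2):
    print(x, y, OR(x, y), XOR(x, y))
    assert OR(x, y) == (x | y)
    assert XOR(x, y) == (x ^ y)
```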
From now on, under a Boolean expression (formula), we mean an expression built up from elements of some given complete basis. If we do not mention the basis then the complete basis will be meant.
In general, one and the same Boolean function can be expressed in many ways as a Boolean expression. Given such an expression, it is easy to compute the value of the function. However, most Boolean functions can still be expressed only by very large Boolean expressions (see Exercise 4.2-4).
A Boolean expression is sometimes large since when writing it, there is no possibility for reusing partial results. (For example, in the expression
the part occurs twice.) This deficiency is corrected by the following more general formalism.
A Boolean circuit is essentially an acyclic directed graph, each of whose nodes computes a Boolean function (from some complete basis) of the bits coming into it on its input edges, and sends out the result on its output edges (see Figure 4.2). Let us give a formal definition.
Figure 4.2. The assignment (values on nodes, configuration) gets propagated through all the gates. This is the “computation”.
Definition 4.7 Let be a complete basis of Boolean functions. For an integer let be a set of nodes. A Boolean circuit over is given by the following tuple:
For every node there is a natural number showing its number of inputs. The sources, nodes with , are called input nodes: we will denote them, in increasing order, as
To each non-input node a Boolean function
from the complete basis is assigned: it is called the gate of node . It has as many arguments as the number of entering edges. The sinks of the graph, nodes without outgoing edges, will be called output nodes: they can be denoted by
(Our Boolean circuits will mostly have just a single output node.) To every non-input node and every belongs a node (the node sending the value of input variable of the gate of ). The circuit defines a graph whose set of edges is
We require for each (we identified the with the natural numbers ): this implies that the graph is acyclic. The size
of the circuit is the number of nodes. The depth of a node is the maximal length of directed paths leading from an input node to . The depth of a circuit is the maximum depth of its output nodes.
Definition 4.8 An input assignment, or input configuration to our circuit is a vector with giving value to node
for , . The function can be extended to a unique configuration on all other nodes of the circuit as follows. If gate has arguments then
For example, if , and () are the input nodes to then . The process of extending the configuration by the above equation is also called the computation of the circuit. The vector of the values for is the result of the computation. We say that the Boolean circuit computes the vector function
The assignment procedure can be performed in stages: in stage , all nodes of depth receive their values.
We assign values to the edges as well: the value assigned to an edge is the one assigned to its start node.
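The computation of a circuit can be sketched as a single pass over the nodes in increasing order, which works because every argument of a gate has a smaller index than the gate itself. In the Python sketch below the representation (a list of pairs of a gate function and its argument indices) and the name `evaluate` are assumptions made for illustration. The example circuit computes the binary sum of two input bits.

```python
def evaluate(num_inputs, gates, inputs):
    """Extend an input assignment to all nodes.

    gates[k] = (function, argument node indices) describes node num_inputs + k.
    Every argument index must be smaller than the node's own index (acyclicity).
    Returns the values and the depths of all nodes.
    """
    values = list(inputs)
    depth = [0] * num_inputs
    for fn, args in gates:
        assert all(a < len(values) for a in args), "argument must precede the gate"
        values.append(fn(*(values[a] for a in args)))
        depth.append(1 + max(depth[a] for a in args))
    return values, depth

AND = lambda x, y: x & y
OR = lambda x, y: x | y
NOT = lambda x: 1 - x

# Hypothetical circuit with input nodes 0 and 1.
gates = [
    (OR, (0, 1)),     # node 2
    (AND, (0, 1)),    # node 3
    (NOT, (3,)),      # node 4
    (AND, (2, 4)),    # node 5: the single output node
]
values, depth = evaluate(2, gates, [1, 0])
print(values[-1], depth[-1])   # 1 3
```

Node 3 is used only once here, but a node may feed several gates; this reuse of partial results is what distinguishes a circuit from an expression.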
The depth of a Boolean circuit can be viewed as the shortest time it takes to compute the output vector from the input vector by this circuit. As an example application of Boolean circuits, let us develop a circuit that computes the sum of its input bits very fast. We will need this result later in the present chapter for error-correcting purposes.
Definition 4.9 We will say that a Boolean circuit computes a near-majority if it outputs a bit with the following property: if of all input bits is equal to then .
The depth of our circuit is clearly , since the output must have a path to the majority of inputs. In order to compute the majority, we will also solve the task of summing the input bits.
Theorem 4.10 (a) Over the complete basis consisting of the set of all 3-argument Boolean functions, for each there is a Boolean circuit of input size and depth whose output vector represents the sum of the input bits as a binary number.
(b) Over this same complete basis, for each there is a Boolean circuit of input size and depth computing a near-majority.
Proof. First we prove (a). For simplicity, assume : if is not of this form, we may add some fake inputs. The naive approach would be to proceed according to Figure 4.3: to first compute . Then, to compute , and so on. Then will indeed be computed in stages.
It is somewhat troublesome that here is a number, not a bit, and therefore must be represented by a bit vector, that is by a group of nodes in the circuit, not just by a single node. However, the general addition operation
when performed in the naive way, will typically take more than a constant number of steps: the numbers have length up to and therefore the addition may add to the depth, bringing the total depth to .
The following observation helps to decrease the depth. Let be three numbers in binary notation: for example, . There are simple parallel formulas to represent the sum of these three numbers as the sum of two others, that is to compute where are numbers also in binary notation
Since both formulas are computed by a single 3-argument gate, 3 numbers can be reduced to 2 (while preserving the sum) in a single parallel computation step. Two such steps reduce 4 numbers to 2. In steps therefore they reduce a sum of terms to a sum of 2 numbers of length . Adding these two numbers in the regular way increases the depth by : we found that bits can be added in steps.
To prove (b), construct the circuit as in the proof of (a), but without the last addition: the output is two -bit numbers whose sum we are interested in. The highest-order nonzero bit of these numbers is at some position . If the sum is more than then one of these numbers has a nonzero bit at position or . We can determine this in two applications of 3-input gates.
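The reduction of three numbers to two used in the proof of (a) can be sketched with bitwise operations. The formulas below are the usual carry-save ones (sum bit = parity of the three bits, carry bit = their majority, shifted one position up). They are stated here as an assumption about the formulas meant in the proof, and the function names are hypothetical. Each output bit depends on only three input bits, which is why one reduction step costs a single level of 3-argument gates.

```python
def reduce_three_to_two(x, y, z):
    """Return (b, c) with b + c == x + y + z, computed bit position by bit position."""
    b = x ^ y ^ z                                  # parity of the three bits
    c = ((x & y) | (x & z) | (y & z)) << 1         # majority, carried one position up
    return b, c

def carry_save_sum(bits):
    """Add up the bits by repeated 3-to-2 reduction; also count the parallel rounds."""
    nums = list(bits)
    rounds = 0
    while len(nums) > 2:
        full = len(nums) - len(nums) % 3           # number of entries in complete triples
        nxt = []
        for i in range(0, full, 3):
            nxt.extend(reduce_three_to_two(*nums[i:i + 3]))
        nxt.extend(nums[full:])                    # leftovers wait for the next round
        nums = nxt
        rounds += 1
    return sum(nums), rounds                       # the final ordinary addition

total, rounds = carry_save_sum([1, 0, 1, 1, 0, 1, 1, 1])
print(total, rounds)   # 6 4
```

Every round shrinks the list by roughly a factor 2/3, so the number of rounds grows only logarithmically with the number of input bits.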
Exercises
4.2-1 Show that is a complete basis.
4.2-2 Show that the function forms a complete basis by itself.
4.2-3 Let us fix the complete basis . Prove Claim 4.6 (or look up its proof in a textbook). Use it to give an upper bound for an arbitrary Boolean function of variables, on:
(a) the smallest size of a Boolean expression for ;
(b) the smallest size of a Boolean circuit for ;
(c) the smallest depth of a Boolean circuit for ;
4.2-4 Show that for every there is a Boolean function of variables such that every Boolean circuit in the complete basis computing contains nodes.
Hint. For a constant , upper-bound the number of Boolean circuits with at most nodes and compare it with the number of Boolean functions over variables.
4.2-5 Consider a circuit with inputs, whose single output bit is computed from the inputs by levels of 3-input majority gates. Show that there is an input vector which is 1 in only positions but with which outputs 1. Thus a small minority of the inputs, when cleverly arranged, can command the result of this circuit.
Let be a Boolean circuit as given in Definition 4.7. When noise is allowed then the values
will not be determined by the formula (4.11) anymore. Instead, they will be random variables . The random assignment will be called a random configuration.
Definition 4.11 At vertex , let
In other words, if gate is not equal to the value computed by the noise-free gate from its inputs . (See Figure 4.4.) The set of vertices where is non-zero is the set of faults.
Let us call the difference the deviation at node .
Let us impose conditions on the kind of noise that will be allowed. Each fault should occur only with probability at most , two specific faults should only occur with probability at most , and so on.
Definition 4.12 For an , let us say that the random configuration is -admissible if
(a) for .
(b) For every set of non-input nodes, we have
In other words, in an -admissible random configuration, the probability of having faults at different specific gates is at most . This is how we require that not only is the fault probability low but also, faults do not “conspire”. The admissibility condition is satisfied if faults occur independently with probability .
Our goal is to build a circuit that will work correctly, with high probability, despite the ever-present noise: in other words, in which errors do not accumulate. This concept is formalized below.
Definition 4.13 We say that the circuit with output node is -resilient if for all inputs , all -admissible configurations , we have .
Let us explore this concept. There is no -resilient circuit with , since even the last gate can fail with probability . So, let us, a little more generously, allow . Clearly, for each circuit and for each we can choose small enough so that is -resilient. But this is not what we are after: hopefully, one does not need more reliable gates every time one builds a larger circuit. So, we hope to find a function
and an with the property that for all , , and every Boolean circuit of size , there is some -resilient circuit of size computing the same function as . If we achieve this then we can say that we prevented the accumulation of errors. Of course, we want to make relatively small, and large (allowing more noise). The function can be called the redundancy: the factor by which we need to increase the size of the circuit to make it resilient. Note that the problem is nontrivial even with, say, . Unless the accumulation of errors is prevented, we will gradually lose all information about the desired output, and no could be guaranteed.
How can we correct errors? A simple idea is this: do “everything” 3 times and then continue with the result obtained by majority vote.
Definition 4.14 For odd natural number , a -input majority gate is a Boolean function that outputs the value equal to the majority of its inputs.
Note that a -input majority can be computed using gates of type AND and NOT.
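One way to see that a majority gate needs only AND and NOT is to write it as a disjunction over all groups of more than half of the inputs, with the disjunction itself expressed by De Morgan's law. The Python sketch below does this literally. It is meant as a check of the claim, not as an efficient construction, since the number of groups grows quickly with the number of inputs. The function names are illustrative.

```python
from itertools import combinations, product

def NOT(x):
    return 1 - x

def AND(x, y):
    return x & y

def OR(x, y):
    return NOT(AND(NOT(x), NOT(y)))        # built from AND and NOT only

def majority(bits):
    """Majority of an odd number of bits: 1 iff some group of (k + 1) / 2 inputs is all 1."""
    k = len(bits)
    assert k % 2 == 1, "the number of inputs must be odd"
    result = 0
    for group in combinations(bits, (k + 1) // 2):
        term = 1
        for b in group:
            term = AND(term, b)
        result = OR(result, term)
    return result

for bits in product((0, 1), repeat=3):
    print(bits, majority(bits))
```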
Why should majority voting help? The following informal discussion helps to understand the benefits and pitfalls. Suppose for a moment that the output is a single bit. If the probability of each of the three independently computed results failing is $\delta$ then the probability that at least 2 of them fail is bounded by $3\delta^2$. Since the majority vote itself can fail with some probability $\varepsilon$, the total probability of failure is bounded by $3\delta^2 + \varepsilon$. We decrease the probability of failure, provided the condition $3\delta^2 + \varepsilon < \delta$ holds.
We found that if is small, then repetition and majority vote can “make it” smaller. Of course, in order to keep the error probability from accumulating, we would have to perform this majority operation repeatedly. Suppose, for example, that our computation has stages. Our bound on the probability of faulty output after stage is . We plan to perform the majority operation after each stage . Let us perform stage three times. The probability of failure is now bounded by
Here, the error probabilities of the different stages accumulate, and even if we only get a bound . So, this strategy will not work for arbitrarily large computations.
Here is a somewhat mad idea to avoid accumulation: repeat everything before the end of stage three times, not only stage itself. In this case, the growing bound (4.15) would be replaced with
Now if and then also , so errors do not accumulate. But we paid an enormous price: the fault-tolerant version of the computation reaching stage is 3 times larger than the one reaching stage . To make stages fault-tolerant this way will cost a factor of in size. This way, the function introduced above may become exponential in .
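The informal discussion can be followed numerically. The sketch below assumes the bound suggested by the argument above: if each of three independent results fails with probability at most δ and the vote itself fails with probability at most ε, the voted result fails with probability at most ε + 3δ². The values of ε and δ are arbitrary illustrations, and the last printed column is the size factor 3^t paid by the "repeat everything" scheme after t stages.

```python
import math

def exact_two_of_three(delta):
    """Probability that at least 2 of 3 independent results fail, each with probability delta."""
    return 3 * delta**2 * (1 - delta) + delta**3

def after_vote(delta, eps):
    """Upper bound on the failure probability after triplication and one noisy majority vote."""
    return eps + 3 * delta**2

eps, delta = 0.01, 0.1                   # hypothetical noise levels
bound = delta
for stage in range(6):
    print(stage, round(bound, 6), 3**stage)
    bound = after_vote(bound, eps)

# The bound settles at the smaller root of 3 d^2 - d + eps = 0 instead of growing.
print((1 - math.sqrt(1 - 12 * eps)) / 6)
```

The error bound stays below its starting value at every stage, while the size factor in the last column grows exponentially, which is the price noted in the text.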
The theorem below formalizes the above discussion.
Theorem 4.15 Let be a finite and complete basis for Boolean functions. If then every function can be computed by an -resilient circuit over .
Proof. For simplicity, we will prove the result for a complete basis that contains the three-argument majority function and contains no functions with more than three arguments. We also assume that faults occur independently. Let be a noise-free circuit of depth computing function . We will prove that there is an -resilient circuit of depth computing . The proof is by induction on . The sufficient conditions on and will emerge from the proof.
The statement is certainly true for , so suppose . Let be the output gate of the circuit , then . The subcircuits computing the functions have depth . By the inductive assumption, there exist -resilient circuits of depth that compute . Let be a new circuit containing copies of the circuits (with the corresponding input nodes merged), with a new node in which is computed as is applied to the outputs of . Then the probability of error of is at most if since each circuit can err with probability and the node with gate can fail with probability .
Let us now form by taking three copies of (with the inputs merged) and adding a new node computing the majority of the outputs of these three copies. The error probability of is at most . Indeed, error will be due to either a fault at the majority gate or an error in at least two of the three independent copies of . So under condition
the circuit is -resilient. This condition will be satisfied by .
The circuit constructed in the proof above is at least times larger than . So, the redundancy is enormous. Fortunately, we will see a much more economical solution. But there are interesting circuits with small depth, for which the factor is not extravagant.
Theorem 4.16 Over the complete basis consisting of all 3-argument Boolean functions, for all sufficiently small , if then for each there is an -resilient Boolean circuit of input size , depth and size outputting a near-majority (as given in Definition 4.9).
Proof. Apply Theorem 4.15 to the circuit from part (a) of Theorem 4.10: it gives a new, -deep -resilient circuit computing a near-majority. The size of any such circuit with 3-input gates is at most
Exercises
4.3-1 Exercise 4.2-5 suggests that the iterated majority vote is not safe against manipulation. However, it works very well under some circumstances. Let the input to be a vector of independent Boolean random variables with . Denote the (random) output bit of the circuit by . Assuming that our majority gates can fail with probability independently, prove
Hint. Define , , , and prove .
4.3-2 We say that a circuit computes the function in an -input-robust way, if the following holds: For any input vector , for any vector of independent Boolean random variables “perturbing it” in the sense , for the output of circuit on input we have . Show that if the function is computable on an -input-robust circuit then .
In this section, we will see ways to introduce fault-tolerance that scale up better. Namely, we will show:
Theorem 4.17 There are constants such that for
for all , , for every deterministic computation of size there is an -resilient computation of size with the same result.
Let us introduce a concept that will simplify the error analysis of our circuits, making it independent of the input vector .
Definition 4.18 In a Boolean circuit , let us call a majority gate at a node a correcting majority gate if for every input vector of , all input wires of node have the same value. Consider a computation of such a circuit . This computation will make some nodes and wires of tainted. We define taintedness by the following rules:
– The input nodes are untainted.
– If a node is tainted then all of its output wires are tainted.
– A correcting majority gate is tainted if either it fails or a majority of its inputs are tainted.
– Any other gate is tainted if either it fails or one of its inputs is tainted.
Clearly, if for all -admissible random configurations the output is tainted with probability then the circuit is -resilient.
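The four taint rules can be sketched as a single pass over the circuit. The encoding below is hypothetical (it is not the tuple notation of the text): each node id maps to a kind and a list of input node ids, node ids are assumed to be in topological order, and a wire is treated as tainted exactly when its source node is tainted.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical encoding: node id -> (kind, list of input node ids).
# kind is "input", "maj" (correcting majority gate) or "gate" (anything else).
Circuit = Dict[int, Tuple[str, List[int]]]


def tainted_nodes(circuit: Circuit, failed: Set[int]) -> Set[int]:
    """Apply the taint rules; node ids are assumed topologically ordered."""
    tainted: Set[int] = set()
    for node in sorted(circuit):
        kind, inputs = circuit[node]
        if kind == "input":
            continue  # input nodes are untainted
        # A wire is tainted exactly when the node it leaves is tainted.
        bad = sum(1 for i in inputs if i in tainted)
        if node in failed:
            tainted.add(node)
        elif kind == "maj" and 2 * bad > len(inputs):
            tainted.add(node)  # a majority of the inputs is tainted
        elif kind != "maj" and bad > 0:
            tainted.add(node)  # any tainted input taints an ordinary gate
    return tainted
```

In a small circuit where three identical gates feed one correcting majority gate, a single failed copy stays contained, while two failed copies taint the majority gate and everything after it.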
So far, we have only made use of redundancy idea (ii) of the introduction to the present chapter: repeating computation steps. Let us now try to use idea (i) (keeping information in redundant form) in Boolean circuits. To protect information traveling from gate to gate, we replace each wire of the noiseless circuit by a “cable” of wires (where will be chosen appropriately). Each wire within the cable is supposed to carry the same bit of information, and we hope that a majority will carry this bit even if some of the wires fail.
Definition 4.19 In a Boolean circuit , a certain set of edges is allowed to be called a cable if in a noise-free computation of this circuit, each edge carries the same Boolean value. The width of the cable is its number of elements. Let us fix an appropriate constant threshold . Consider any possible computation of the noisy version of the circuit , and a cable of width in . This cable will be called -safe if at most of its wires are tainted.
Let us take a Boolean circuit that we want to make resilient. As we replace wires of with cables of containing wires each, we will replace each noiseless 2-argument gate at a node by a module called the executive organ of gates, which for each , passes the th wire of both incoming cables into the th node of the organ. Each of these nodes contains a gate of one and the same type . The wires emerging from these nodes form the output cable of the executive organ.
The number of tainted wires in this output cable may become too high: indeed, if there were tainted wires in the cable and also in the cable then there could be as many as such wires in the cable (not even counting the possible new taints added by faults in the executive organ). The crucial part of the construction is to attach to the executive organ a so-called restoring organ: a module intended to decrease the taint in a cable.
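The executive organ and the worst-case growth of taint can be illustrated as follows. This is a minimal sketch: a cable is modelled as a list of bits, and the function names are chosen here for illustration only.

```python
from typing import Callable, List, Set

Cable = List[int]  # one bit per wire


def executive_organ(gate: Callable[[int, int], int], a: Cable, b: Cable) -> Cable:
    """Apply the same 2-argument gate to the i-th wires of both cables."""
    if len(a) != len(b):
        raise ValueError("cables must have the same width")
    return [gate(x, y) for x, y in zip(a, b)]


def worst_case_taint(tainted_a: Set[int], tainted_b: Set[int]) -> Set[int]:
    """Output wire i may be tainted if wire i of either input cable is."""
    return tainted_a | tainted_b
```

If the tainted positions of the two incoming cables are disjoint, the output cable can carry as many tainted wires as both inputs together, before counting any new faults in the organ itself.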
How to build a restoring organ? Keeping in mind that this organ itself must also work in noise, one solution is to build (for an appropriate ) a special -resilient circuit that computes the near-majority of its inputs in independent copies. Theorem 4.16 provides a circuit of size to do this.
It turns out that, at least asymptotically, there is a better solution. We will look for a very simple restoring organ: one whose own noise we can analyse easily. What could be simpler than a circuit having only one level of gates? We fix an odd positive integer constant (for example, ). Each gate of our organ will be a -input majority gate.
Definition 4.20 A multigraph is a graph in which between any two vertices there may be several edges, not just 0 or 1. Let us call a bipartite multigraph with inputs and outputs, -half-regular if each output node has degree . Such a graph is a -compressor if it has the following property: for every set of at most inputs, the number of those output points connected to at least elements of (with multiplicity) is at most .
The compressor property is interesting generally when . For example, in an -compressor the outputs have degree , and the majority operation in these nodes decreases every error set confined to 10% of all input to just 5% of all outputs. A compressor with the right parameters could serve as our restoring organ: it decreases a minority to a smaller minority and may in this way restore the safety of a cable. But, are there compressors?
Theorem 4.21 For all , all integers with
there is an such that for all integer there is a -compressor.
Note that for , the theorem does not guarantee a compressor with .
Proof. We will not give an explicit construction for the multigraph, we will just show that it exists. We will select a -half-regular multigraph randomly (each such multigraph with the same probability), and show that it will be a -compressor with positive probability. This proof method is called the probabilistic method. Let
Our construction will be somewhat more general, allowing outputs. Let us generate a random bipartite -half-regular multigraph with inputs and outputs in the following way. To each output, we draw edges from random input nodes chosen independently and with uniform distribution over all inputs. Let be an input set of size , let be an output node and let be the event that has or more edges from . Then we have
On the average (in expected value), the event will occur for different output nodes . For an input set , let be the event that the set of nodes for which holds has size . By inequality (4.6) we have
The number of sets of inputs with elements is, using inequality (4.7),
The probability that our random graph is not a compressor is at most as large as the probability that there is at least one input set for which event holds. This can be bounded by
where
As we decrease the first term of this expression dominates. Its coefficient is positive according to the assumption (4.17). We will have if
Example 4.4 Choosing , , the value will work.
We turn a -compressor into a restoring organ , by placing -input majority gates into its outputs. If the majority elements sometimes fail then the output of is random. Assume that at most inputs of are tainted. Then outputs can only be tainted if majority gates fail. Let
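A noise-free restoring organ is just one layer of majority gates wired according to the multigraph. The sketch below assumes a hypothetical representation in which `neighbors[j]` lists the input wires (with multiplicity) feeding output `j`. The ring wiring used in the usage note is a toy and is not claimed to be a compressor.

```python
from typing import List


def restore(neighbors: List[List[int]], cable: List[int]) -> List[int]:
    """One layer of majority gates wired according to `neighbors`.

    neighbors[j] lists the input wires (with multiplicity) feeding output j.
    """
    out = []
    for nbrs in neighbors:
        if len(nbrs) % 2 == 0:
            raise ValueError("majority gates need an odd number of inputs")
        ones = sum(cable[i] for i in nbrs)
        out.append(1 if 2 * ones > len(nbrs) else 0)
    return out
```

With the ring wiring `[[j, (j + 1) % 5, (j + 2) % 5] for j in range(5)]`, a single dissenting wire is voted away, but two adjacent dissenting wires are not reduced. This is the kind of clustered error set that the compressor property is meant to exclude.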
be the probability of this event. Assuming that the gates of fail independently with probability , inequality 4.6 gives
Example 4.5 Choose , , as in Example 4.4, further (this will satisfy the inequality (4.19) needed later). With , we get .
The attractively small degree led to an extremely unattractive probability bound on the failure of the whole compressor. This bound does decrease exponentially with cable width , but only an extremely large would make it small.
Example 4.6 Choosing again , but (voting in each gate of the compressor over 41 wires instead of 7), leads to somewhat more realistic results. This choice allows . With , again, we get .
These numbers look less frightening, but we will still need many scores of wires in the cable to drive down the probability of compression failure. And although in practice our computing components fail with frequency much less than , we may want to look at the largest that still can be tolerated.
Compressors allow us to construct a reliable Boolean circuit all of whose cables are safe.
Definition 4.22 Given a Boolean circuit with a single bit of output (for simplicity), a cable width and a Boolean circuit with inputs and outputs, let
be the Boolean circuit that we obtain as follows. The input nodes of are the same as those of . We replace each wire of with a cable of width , and each gate of with an executive organ followed by a restoring organ that is a copy of the circuit . The new circuit has outputs: the outputs of the restoring organ of belonging to the last gate of .
In noise-free computations, on every input, the output of is the same as the output of , but in identical copies.
Lemma 4.23 There are constants and for every cable width a circuit of size and gate size with the following property. For every Boolean circuit of gate size and number of nodes , for every , for every -admissible configuration of , the probability that not every cable of is -safe is .
Proof. We know that there are and with the property that for all a -compressor exists. Let be chosen to satisfy
and define
Let be a restoring organ built from a -compressor. Consider a gate of circuit , and the corresponding executive organ and restoring organ in . Let us estimate the probability of the event that the input cables of this combined organ are -safe but its output cable is not. Assume that the two incoming cables are safe: then at most of the outputs of the executive organ are tainted due to the incoming cables: new taint can still occur due to failures. Let be the event that the executive organ taints at least more of these outputs. Then , using the estimate (4.18). The outputs of the executive organ are the inputs of the restoring organ. If no more than of these are tainted then, in case the organ operates perfectly, it would decrease the number of tainted wires to . Let be the event that the restoring organ taints an additional of these wires. Then again, . If neither nor occur then at most (see (4.19)) tainted wires emerge from the restoring organ, so the outgoing cable is safe. Therefore and hence .
Let be the nodes of the circuit . Since the incoming cables of the whole circuit are safe, the event that there is some cable that is not safe is contained in ; hence the probability is bounded by .
Proof. [Proof of Theorem 4.17] We will prove the theorem only for the case when our computation is a Boolean circuit with a single bit of output. The generalization with more bits of output is straightforward. The proof of Lemma 4.23 gives us a circuit whose output cable is safe except for an event of probability . Let us choose in such a way that this becomes :
It remains to add a little circuit to this output cable to extract from it the majority reliably. This can be done using Theorem 4.16, adding a small extra circuit of size that can be called the coda to . Let us call the resulting circuit .
The probability that the output cable is unsafe is . The probability that the output cable is safe but the “coda” circuit fails is bounded by . So, the probability that fails is , by the assumption .
Let us estimate the size of . By 4.21, we can choose cable width . We have , hence
Example 4.7 Take the constants of Example 4.6, with defined in equation (4.20): then , , , , , , giving
so making as small as possible (ignoring that it must be integer), we get . With , this allows . In addition to this truly unpleasant cable size, the size of the “coda” circuit is , which dominates the size of the rest of (though as it becomes asymptotically negligible).
As Example 4.7 shows, the actual price in redundancy computable from the proof is unacceptable in practice. The redundancy sounds good, since it is only logarithmic in the size of the computation, and by choosing a rather large majority gate (41 inputs), the factor in the here also does not look bad; still, we do not expect the final price of reliability to be this high. How much can this redundancy be improved by optimization or other methods? Problem 4-6 shows that in a slightly more restricted error model (all faults are independent and have the same probability), with more randomization, better constants can be achieved. Exercises 4.4-1, 4.4-2 and 4.4-5 are concerned with an improved construction for the “coda” circuit. Exercise 4.5-2 shows that the coda circuit can be omitted completely. But none of these improvements bring redundancy to an acceptable level. Even aside from the discomfort caused by their random choice (this can be helped), concentrators themselves are rather large and unwieldy. The problem is probably with using circuits as a model for computation. There is no natural way to break up a general circuit into subunits of non-constant size in order to deal with the reliability problem in modular style.
This subsection is sketchier than the preceding ones, and assumes some knowledge of linear algebra.
We have shown that compressors exist. How expensive is it to find a -compressor, say, with , , , as in Example 4.6? In a deterministic algorithm, we could search through all the approximately -half-regular bipartite graphs. For each of these, we could check all possible input sets of size : as we know, their number is . The cost of checking each subset is , so the total number of operations is . Though this number is exponential in , recall that in our error-correcting construction, for the size of the noiseless circuit: therefore the total number of operations needed to find a compressor is polynomial in .
The proof of Theorem 4.21 shows that a randomly chosen -half-regular bipartite graph is a compressor with large probability. Therefore there is a faster, randomized algorithm for finding a compressor. Pick a random -half-regular bipartite graph, check if it is a compressor: if it is not, repeat. We will be done in a constant expected number of repetitions. This is a faster algorithm, but it is still exponential in , since each check takes operations.
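The "pick, check, repeat" procedure can be sketched as follows. The parameter names `k` (cable width), `d` (degree), `alpha` and `gamma` are assumed labels for the compressor parameters, since the symbols are not legible in this copy. The check is a brute-force one and therefore exponential in `k`; it relies on the observation that enlarging the input set can only increase the number of affected outputs, so only the largest allowed set size is tried.

```python
import itertools
import random
from typing import List, Optional

Graph = List[List[int]]  # Graph[j] = the d inputs (with multiplicity) of output j


def random_half_regular(k: int, d: int, rng: random.Random) -> Graph:
    """Each of the k outputs draws d inputs independently and uniformly."""
    return [[rng.randrange(k) for _ in range(d)] for _ in range(k)]


def is_compressor(graph: Graph, d: int, alpha: float, gamma: float) -> bool:
    """Brute-force check of the compressor property (exponential in k)."""
    k = len(graph)
    size = int(alpha * k)  # largest allowed input set; smaller sets hit less
    for subset in itertools.combinations(range(k), size):
        e = set(subset)
        # Outputs with at least d/2 edges (with multiplicity) into the set.
        hit = sum(1 for nbrs in graph if 2 * sum(i in e for i in nbrs) >= d)
        if hit > gamma * alpha * k:
            return False
    return True


def find_compressor(k: int, d: int, alpha: float, gamma: float,
                    rng: random.Random, max_tries: int = 1000) -> Optional[Graph]:
    """Pick a random graph, check it, repeat; None if max_tries is exhausted."""
    for _ in range(max_tries):
        graph = random_half_regular(k, d, rng)
        if is_compressor(graph, d, alpha, gamma):
            return graph
    return None
```

The `max_tries` cap is a practical safeguard added here, not part of the procedure described in the text. For parameters outside the range covered by Theorem 4.21, the search may simply return `None`.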
Is it possible to construct a compressor explicitly, avoiding any search that takes exponential time in ? The answer is yes. We will show here only, however, that the compressor property is implied by a certain property involving linear algebra, which can be checked in polynomial time. Certain explicitly constructed graphs are known that possess this property. These are generally sought after not so much for their compressor property as for their expander property (see the section on reliable storage).
For vectors , let denote their inner product. A -half-regular bipartite multigraph with nodes can be defined by an incidence matrix , where is the number of edges connecting input to output . Let be the vector . Then , so is an eigenvector of with eigenvalue . Moreover, is the largest eigenvalue of . Indeed, denoting by for any row vector , we have .
Theorem 4.24 Let be a multigraph defined by the matrix . For all , and
there is an such that if the second largest eigenvalue of the matrix is then is a -compressor.
Proof. The matrix has largest eigenvalue . Since it is symmetric, it has a basis of orthogonal eigenvectors of unit length with corresponding nonnegative eigenvalues
where and . Recall that in the orthonormal basis , any vector can be written as . For an arbitrary vector , we can estimate as follows.
Let now be a set of size and where for and 0 otherwise. Then, coordinate of counts the number of edges coming from the set to the node . Also, , the number of elements of . We get
Suppose that there are nodes with , then this says
Since (4.22) implies , it follows that is a -compressor for small enough .
It is actually sufficient to look for graphs with large and where are constants. To see this, let us define the product of two bipartite multigraphs with vertices by the multigraph belonging to the product of the corresponding matrices.
Suppose that is symmetric: then its second largest eigenvalue is and the ratio of the two largest eigenvalues of is . Therefore using for a sufficiently large as our matrix, the condition (4.22) can be satisfied. Unfortunately, taking the power will increase the degree , taking us probably even farther away from practical realizability.
We found that there is a construction of a compressor with the desired parameters as soon as we find multigraphs with arbitrarily large sizes , with symmetric matrices and with a ratio of the two largest eigenvalues of bounded by a constant independent of . There are various constructions of such multigraphs (see the references in the historical overview). The estimation of the desired eigenvalue quotient is never very simple.
Exercises
4.4-1 The proof of Theorem 4.17 uses a “coda” circuit of size . Once we proved this theorem we could, of course, apply it to the computation of the final majority itself: this would reduce the size of the coda circuit to . Try out this approach on the numerical examples considered above to see whether it results in a significant improvement.
4.4-2 The proof of Theorem 4.21 provided also bipartite graphs with the compressor property, with inputs and outputs. An idea to build a smaller “coda” circuit in the proof of Theorem 4.17 is to concatenate several such compressors, decreasing the number of cables in a geometric series. Explore this idea, keeping in mind, however, that as decreases, the “exponential” error estimate in inequality 4.18 becomes weaker.
4.4-3 In a noisy Boolean circuit, let if the gate at vertex fails and otherwise. Further, let if is tainted, and 0 otherwise. Suppose that the distribution of the random variables does not depend on the Boolean input vector. Show that then the joint distribution of the random variables is also independent of the input vector.
4.4-4 This exercise extends the result of Exercise 4.3-1 to random input vectors: it shows that if a random input vector has only a small number of errors, then the iterated majority vote of Exercise 4.2-5 may still work for it, if we rearrange the input wires randomly. Let , and let be a vector of integers . We define a Boolean circuit as follows. This circuit takes input vector , computes the vector where (in other words, just leads a wire from input node to an “intermediate node” ) and then inputs into the circuit .
Denote the (possibly random) output bit of by . For any fixed input vector , assuming that our majority gates can fail with probability independently, denote . Assume that the input is a vector of (not necessarily independent) Boolean random variables, with . Denoting , assume . Prove that there is a choice of the vector for which
The choice may depend on the distribution of the random vector .
Hint. Choose the vector (and hence the circuit ) randomly, as a random vector where the variables are independent and uniformly distributed over , and denote . Then prove
For this, interchange the averaging over and . Then note that is the probability of when the “wires” are chosen randomly “on the fly” during the computation of the circuit.
4.4-5 Taking the notation of Exercise 4.4-3 suppose, like there, that the random variables are independent of each other, and their distribution does not depend on the Boolean input vector. Take the Boolean circuit introduced in Definition 4.22, and define the random Boolean vector where if and only if the th output node is tainted. Apply Exercise 4.4-4 to show that there is a circuit that can be attached to the output nodes to play the role of the “coda” circuit in the proof of Theorem 4.17. The size of is only linear in , not as for the coda circuit in the proof there. But, we assumed a little more about the fault distribution, and also the choice of the “wiring” depends on the circuit .
There is hardly any simpler computation than not doing anything, just keeping the input unchanged. This task does not fit well, however, into the simple model of Boolean circuits as introduced above.
An obvious element of ordinary computations is missing from the above described Boolean circuit model: repetition. If we want to repeat some computation steps, then we need to introduce timing into the work of computing elements and to store the partial results between consecutive steps. Let us look at the drawings of the circuit designer again. We will see components like in Figure 4.9, with one ingoing edge and no operation associated with them; these will be called shift registers. The shift registers are controlled by one central clock (invisible on the drawing). At each clock pulse, the assignment value on the incoming edge jumps onto the outgoing edges and “stays in” the register. Figure 4.10 shows how shift registers may be used inside a circuit.
Definition 4.25 A clocked circuit over a complete basis is given by a tuple just like a Boolean circuit in 4.10. Also, the circuit defines a graph similarly. Recall that we identified nodes with the natural numbers . To each non-input node either a gate is assigned as before, or a shift register: in this case (there is only one argument). We do not require the graph to be acyclic, but we do require every directed cycle (if there is any) to pass through at least one shift register.
Figure 4.10. Part of a circuit which computes the sum of two binary numbers . We feed the digits of and beginning with the lowest-order ones, at the input nodes. The digits of the sum come out on the output edge. A shift register holds the carry.
The circuit works in a sequence of clock cycles. Let us denote the input vector at clock cycle by , the shift register states by , and the output vector by . The part of the circuit going from the inputs and the shift registers to the outputs and the shift registers defines two Boolean vector functions and . The operation of the clocked circuit is described by the following equations (see Figure 4.11, which does not show any inputs and outputs).
Figure 4.11. A “computer” consists of some memory (shift registers) and a Boolean circuit operating on it. We can define the size of computation as the size of the computer times the number of steps.
Frequently, we have no inputs or outputs during the work of the circuit, so the equations (4.23) can be simplified to
How to use a clocked circuit described by this equation for computation? We write some initial values into the shift registers, and propagate the assignment using the gates, for the given clock cycle. Now we send a clock pulse to the registers, causing them to write new values to their output edges (which are identical to the input edges of the circuit). After this, the new assignment is computed, and so on.
How to compute a function with the help of such a circuit? Here is a possible convention. We enter the input (only in the first step), and then run the circuit, until it signals at an extra output edge when the desired result can be received from the other output nodes.
Example 4.8 This example uses a convention different from the above described one: new input bits are supplied in every step, and the output is also delivered continuously. For the binary adder of Figure 4.10, let and be the two input bits in cycle , let be the content of the carry, and be the output in the same cycle. Then the equations (4.23) now have the form
where is the majority operation.
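The behaviour of the serial adder can be sketched in a few lines. This is an illustrative simulation, not the circuit itself: each loop iteration stands for one clock cycle, the variable `carry` plays the role of the shift register, and the output bit is taken to be the mod-2 sum of the two input bits and the carry.

```python
from typing import List, Tuple


def maj(a: int, b: int, c: int) -> int:
    """Three-input majority."""
    return 1 if a + b + c >= 2 else 0


def serial_add(x_bits: List[int], y_bits: List[int]) -> Tuple[List[int], int]:
    """Clocked adder: bits are fed lowest-order first, one pair per cycle."""
    carry = 0  # initial content of the shift register
    out = []
    for x, y in zip(x_bits, y_bits):
        out.append(x ^ y ^ carry)  # output bit of this clock cycle
        carry = maj(x, y, carry)   # next state of the shift register
    return out, carry
```

For example, adding 6 and 3 with four cycles, `serial_add([0, 1, 1, 0], [1, 1, 0, 0])`, yields the bits of 9 in lowest-order-first form, with a final carry of 0.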
A clocked circuit is an interesting parallel computer but let us now pose a task for it that is trivial in the absence of failures: information storage. We would like to store a certain amount of information in such a way that it can be recovered after some time, despite failures in the circuit. For this, the transition function introduced in (4.24) cannot be just the identity: it will have to perform some error-correcting operations. The restoring organs discussed earlier are natural candidates. Indeed, suppose that we use memory cells to store a bit of information. We can call the content of this -tuple safe when the number of memory cells that dissent from the correct value is under some threshold . Let the rest of the circuit be a restoring organ built on a -compressor with . Suppose that the input cable is safe. Then the probability that after the transition, the new output cable (and therefore the new state) is not safe is for some constant . Suppose we keep the circuit running for steps. Then the probability that the state is not safe in some of these steps is which is small as long as is significantly smaller than . When storing bits of information, the probability that any of the bits loses its safety in some step is .
To make this discussion rigorous, an error model must be introduced for clocked circuits. Since we will only consider simple transition functions like the majority vote above, with a single computation step between times and , we will make the model also very simple.
Definition 4.26 Consider a clocked circuit described by equation (4.24), where at each time instant , the configuration is described by the bit vector . Consider a sequence of random bit vectors for . Similarly to (4.13) we define
Thus, says that a failure occurs at the space-time point . The sequence will be called -admissible if (4.14) holds for every set of space-time points with .
By the just described construction, it is possible to keep bits of information for steps in
memory cells. More precisely, the cable will be safe with large probability in any admissible evolution ().
Can we not do better? The reliable information storage problem is related to the problem of information transmission: given a message , a sender wants to transmit it to a receiver through a noisy channel. Only now sender and receiver are the same person, and the noisy channel is just the passing of time. Below, we develop some basic concepts of reliable information transmission, and then we will apply them to the construction of a reliable data storage scheme that is more economical than the naive, repetition-based solution seen above.
To protect information, we can use redundancy in a way more efficient than repetition. We might even add only a single redundant bit to our message. Let , be the word we want to protect. Let us create the error check bit
For example, . Our codeword will be subject to noise and it turns into a new word, . If differs from in a single changed (not deleted or added) bit then we will detect this, since then violates the error check relation
We will not be able to correct the error, since we do not know which bit was corrupted.
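A single error check bit is easy to try out. The sketch below assumes the check bit is the mod-2 sum of the message bits, so that every codeword has an even number of ones; the function names are chosen here for illustration.

```python
from typing import List


def add_parity(bits: List[int]) -> List[int]:
    """Append one check bit so that the total number of ones is even."""
    return bits + [sum(bits) % 2]


def parity_ok(word: List[int]) -> bool:
    """The error check relation: the sum of all bits is 0 modulo 2."""
    return sum(word) % 2 == 0
```

Flipping any single bit of a codeword makes `parity_ok` return `False`, but the check gives no hint about which bit changed, and flipping two bits goes unnoticed.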
To also correct corrupted bits, we need to add more error check bits. We may try to add two more bits:
Then an uncorrupted word must satisfy the error check relations
or, in matrix notation , where
Note . The matrix is called the error check matrix, or parity check matrix.
Another way to write the error check relations is
Now if is corrupted, even if only in a single position, unfortunately we still cannot correct it: since , the error could be in position 1 or 5 and we could not tell the difference. If we choose our error-check matrix in such a way that the column vectors are all different (of course also from 0), then we can always correct an error, provided there is only one. Indeed, if the error was in position 3 then
Since all vectors are different, if we see the vector we can infer that the bit is corrupted. This code is called the Hamming code. For example, the following error check matrix defines the Hamming code of size 7:
In general, if we have s error check bits then our code can have size 2^s − 1, hence the number of bits left to store information, the information bits, is 2^s − s − 1. So, to protect m bits of information from a single error, the Hamming code adds about log m error check bits. This is much better than repeating every bit 3 times.
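Single-error correction with a Hamming code of size 7 can be sketched as follows. One assumption should be stated clearly: the parity-check matrix used here has as its j-th column the number j written in binary, which is a common choice but not necessarily the column order of the matrix printed in the text. With this ordering, the syndrome read as a binary number is the position of the corrupted bit.

```python
from typing import List

# Assumed parity-check matrix: column j (1-based) is j written in binary,
# lowest-order bit in row 0. All 7 columns are distinct and nonzero.
H = [[(j >> r) & 1 for j in range(1, 8)] for r in range(3)]


def syndrome(word: List[int]) -> List[int]:
    """Compute H * word over the two-element field."""
    return [sum(h * w for h, w in zip(row, word)) % 2 for row in H]


def correct_single_error(word: List[int]) -> List[int]:
    """Flip the bit whose column of H equals the syndrome, if any."""
    s = syndrome(word)
    pos = s[0] + 2 * s[1] + 4 * s[2]  # 1-based error position, 0 if none
    fixed = list(word)
    if pos:
        fixed[pos - 1] ^= 1
    return fixed
```

The word `[1, 1, 1, 0, 0, 0, 0]` has syndrome zero under this matrix, so it is a codeword, and corrupting any one of its seven bits is undone by `correct_single_error`.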
Let us summarize the error-correction scenario in general terms.
In order to fight noise, the sender encodes the message by an encoding function into a longer string which, for simplicity, we also assume to be binary. This codeword will be changed by noise into a string . The receiver gets and applies to it a decoding function .
Definition 4.27 The pair of functions and is called a code if holds for all . The strings are called messages, words of the form are called codewords. (Sometimes the set of all codewords by itself is also called a code.) For every message , the set of words is called the decoding set of . (Of course, different decoding sets are disjoint.) The number
is called the rate of the code.
We say that our code corrects errors if for all possible messages , if the received word differs from the codeword in at most positions, then .
If the rate is then the -bit codewords carry bits of useful information. In terms of decoding sets, a code corrects errors if each decoding set contains all words that differ from in at most symbols (the set of these words is a kind of “ball” of radius ).
The Hamming code corrects a single error, and its rate is close to 1. One of the important questions connected with error-correcting codes is how much do we have to lower the rate in order to correct more errors.
Having a notion of codes, we can formulate the main result of this section about information storage.
Theorem 4.28 (Network information storage) There are constants with the following property. For all sufficiently large , there is a code with message length and codeword length , and a Boolean clocked circuit of size with inputs and outputs, such that the following holds. Suppose that at time 0, the memory cells of the circuit contain string . Suppose further that the evolution of the circuit has -admissible failures. Then we have
This theorem shows that it is possible to store bits of information for time , in a clocked circuit of size
As long as the storage time is below the exponential bound for a certain constant , this circuit size is only a constant times larger than the amount of information it stores. (In contrast, in (4.26) we needed an extra factor when we used a separate restoring organ for each bit.)
The theorem says nothing about how difficult it is to compute the codeword at the beginning and how difficult it is to carry out the decoding at the end. Moreover, it is desirable to perform these two operations also in a noise-tolerant fashion. We will return to the problem of decoding later.
Since we will be dealing more with bit matrices, it is convenient to introduce the algebraic structure
which is a two-element field. Addition and multiplication in are defined modulo 2 (of course, for multiplication this is no change). It is also convenient to vest the set of binary strings with the structure of an -dimensional vector space over the field . Most theorems and algorithms of basic linear algebra apply to arbitrary fields: in particular, one can define the row rank of a matrix as the maximum number of linearly independent rows, and similarly the column rank. Then it is a theorem that the row rank is equal to the column rank. From now on, in algebraic operations over bits or bit vectors, we will write in place of unless this leads to confusion. To save space, we will frequently write column vectors horizontally: we write
where denotes the transpose of matrix . We will write
for the identity matrix over the vector space .
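The remark that basic linear algebra carries over to the two-element field can be tried out directly. The following Python sketch computes the rank of a bit matrix by Gaussian elimination, where adding two rows modulo 2 is a bitwise XOR. The function names and the sample matrix are illustrative assumptions, not part of the text.

```python
def rank_gf2(matrix):
    """Rank of a 0/1 matrix over the two-element field (Gaussian elimination)."""
    rows = [list(r) for r in matrix]
    ncols = len(rows[0]) if rows else 0
    rank = 0
    for col in range(ncols):
        # Look for a row at or below position `rank` with a 1 in this column.
        pivot = next((r for r in range(rank, len(rows)) if rows[r][col]), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(len(rows)):
            if r != rank and rows[r][col]:
                # Row addition modulo 2 is a componentwise XOR.
                rows[r] = [a ^ b for a, b in zip(rows[r], rows[rank])]
        rank += 1
    return rank


def transpose(matrix):
    """Swap rows and columns."""
    return [list(col) for col in zip(*matrix)]


# Sample matrix: the third row is the sum (mod 2) of the first two.
M = [[1, 0, 1, 1],
     [0, 1, 1, 0],
     [1, 1, 0, 1]]
```

Calling `rank_gf2(M)` and `rank_gf2(transpose(M))` gives the row rank and the column rank of the sample matrix; by the theorem quoted above the two values agree.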
Let us generalize the idea of the Hamming code.
Definition 4.29 A code with message length and codeword length is linear if, when viewing the message and code vectors as vectors over the field , the encoding function can be computed according to the formula
with an matrix called the generator matrix of the code. The number is called the number of information bits in the code, the number
the number of error-check bits.
Example 4.9 The matrix in (4.27) can be written as , with
Then the error check relation can be written as
This shows that the bits can be taken to be the message bits, or “information bits”, of the code, making the Hamming code a linear code with the generator matrix . (Of course, over the field .)
The following statement is proved using standard linear algebra, and it generalizes the relation between error check matrix and generator matrix seen in Example 4.9.
Claim 4.30 Let be given with .
(a) For every matrix of rank over there is a matrix of rank with the property
(b) For every matrix of rank over there is an matrix of rank with property (4.28).
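The relation between a generator matrix and an error-check matrix can be illustrated in the systematic case of Example 4.9. The Python sketch below assumes that property (4.28) states that the product of the error-check matrix and the generator matrix is the zero matrix over the two-element field; this reading, the block shape of the two matrices, and all names are assumptions of the sketch.

```python
def identity(n):
    """The n x n identity matrix as a list of rows."""
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


def matmul_gf2(a, b):
    """Matrix product with entries reduced modulo 2."""
    return [[sum(x * y for x, y in zip(row, col)) % 2 for col in zip(*b)]
            for row in a]


def systematic_pair(a_block):
    """From an r x k bit block A build G = [I_k stacked over A] and H = [A | I_r].

    G has k + r rows and k columns; H has r rows and k + r columns.
    """
    r, k = len(a_block), len(a_block[0])
    g = identity(k) + [list(row) for row in a_block]
    eye_r = identity(r)
    h = [list(row) + eye_r[i] for i, row in enumerate(a_block)]
    return g, h


# Illustrative block with r = 3 check bits and k = 4 information bits.
A = [[1, 1, 0, 1],
     [1, 0, 1, 1],
     [0, 1, 1, 1]]
G, H = systematic_pair(A)
```

With these shapes the product is A + A, which vanishes modulo 2, so every word of the form G times a message satisfies all the parity checks of H.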
Definition 4.31 For a vector , let denote the number of its nonzero elements: we will also call it the weight of .
In what follows it will be convenient to define a code starting from an error-check matrix . If the matrix has rank then the code has rate
We can fix any subset of linearly independent columns, and call the indices error check bits and the indices the information bits. (In Example 4.9, we chose .) Important operations can be performed over a code, however, without fixing any separation into error-check bits and information bits.
Correcting a single error was not too difficult; finding a similar scheme to correct 2 errors is much harder. However, in storing bits, typically (much more than 2) of those bits will be corrupted in every step. There are ingenious and quite efficient codes of positive rate (independent of ) correcting even this many errors. When applied to information storage, however, the error-correction mechanism itself must also work in noise, so we are looking for a particularly simple one. It works in our favor, however, that not all errors need to be corrected: it is sufficient to cut down their number, similarly to the restoring organ in reliable Boolean circuits above.
For simplicity, as gates of our circuit we will allow certain Boolean functions with a large, but constant, number of arguments. On the other hand, our Boolean circuit will have just depth 1, similarly to a restoring organ of Section 4.4. The output of each gate is the input of a memory cell (shift register). For simplicity, we identify the gate and the memory cell and call it a cell. At each clock tick, a cell reads its inputs from other cells, computes a Boolean function on them, and stores the result (till the next clock tick). But now, instead of majority vote among the values of the input cells, the Boolean function computed by each cell will be slightly more complicated.
Our particular restoring operations will be defined with the help of a certain parity check matrix . Let be a vector of bits. For some , let (from “vertical”) be the set of those indices with . For integer , let (from “horizontal”) be the set of those indices with . Then the condition can also be expressed by saying that for all , we have . The sets are called the parity check sets belonging to the matrix . From now on, the indices will be called checks, and the indices locations.
Definition 4.32 A linear code is a low-density parity-check code with bounds if the following conditions are satisfied:
(a) For each we have ;
(b) For each we have .
In other words, the weight of each row is at most and the weight of each column is at most .
In our constructions, we will keep the bounds constant while the length of codewords grows. Consider a situation when is a codeword corrupted by some errors. To check whether bit is incorrect we may check all the sums
for all . If all these sums are 0 then we would not suspect to be in error. If only one of these is nonzero, we will know that has some errors but we may still think that the error is not in bit . But if a significant number of these sums is nonzero then we may suspect that is a culprit and may want to change it. This idea suggests the following definition.
Definition 4.33 For a low-density parity-check code with bounds , the refreshing operation associated with the code is the following, to be performed simultaneously for all locations :
Find out whether more than of the sums are nonzero among the ones for . If this is the case, flip .
Let denote the vector obtained from by this operation. For parameters , let us call a -refresher if for each vector of length with weight the weight of the resulting vector decreases thus:
Notice the similarity of refreshers to compressors. The following lemma shows the use of refreshers, and is an example of the advantages of linear codes.
Lemma 4.34 For an -refresher , let be an -vector and a codeword of length with . Then .
Proof. Since is a codeword, , implying . Therefore the error correction flips the same bits in as in : , giving . So, if , then .
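The refreshing operation and the linearity argument of Lemma 4.34 can be sketched in Python as follows. The flipping threshold is not legible in the definition above, so the sketch assumes that a bit is flipped when a strict majority of the parity checks containing it fail; this threshold, the tiny cyclic parity-check matrix used as an example, and all names are assumptions.

```python
def check_sets(H):
    """For each location j, the list of checks i with H[i][j] = 1."""
    n = len(H[0])
    return [[i for i, row in enumerate(H) if row[j]] for j in range(n)]


def refresh(H, x):
    """One round of the refreshing operation, applied to all locations at once."""
    # Parity sums are computed from the old vector x for every check.
    sums = [sum(h * b for h, b in zip(row, x)) % 2 for row in H]
    out = list(x)
    for j, checks in enumerate(check_sets(H)):
        failed = sum(sums[i] for i in checks)
        # Assumed threshold: flip when a strict majority of the checks fail.
        if 2 * failed > len(checks):
            out[j] ^= 1
    return out


def xor(u, v):
    """Componentwise sum of two bit vectors modulo 2."""
    return [a ^ b for a, b in zip(u, v)]


# Example matrix: check i compares locations i and i+1 around a cycle of 5,
# so the all-zero and all-one vectors satisfy every check.
N = 5
H = [[1 if j in (i, (i + 1) % N) else 0 for j in range(N)] for i in range(N)]
```

Because the parity sums of a codeword are all zero, `refresh(H, xor(y, e))` equals `xor(y, refresh(H, e))` for every codeword `y`; this is the identity used in the proof above. The example matrix only illustrates the mechanics and is not claimed to be a refresher in the sense of the definition.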
Theorem 4.35 There is a parameter and integers such that for all sufficiently large codelength and there is a -refresher with at least information bits.
In particular, we can choose , , .
We postpone the proof of this theorem, and apply it first.
Proof of Theorem 4.28. Theorem 4.35 provides us with a device for information storage. Indeed, we can implement the operation using a single gate of at most inputs for each bit of . Now as long as the inequality holds for some codeword , the inequality follows with . Of course, some gates will fail and introduce new deviations resulting in some rather than . Let and . Then just as earlier, the probability that there are more than failures is bounded by the exponentially decreasing expression . With fewer than new deviations, we will still have . The probability that at any time the number of failures is more than is bounded by
Example 4.10 Let . Using the sample values in Theorem 4.35 we can take , , so the information rate is . With the corresponding values of , and , we have . The probability that there are more than failures is bounded by
This is exponentially decreasing with , albeit initially very slowly: it is not really small until . Still, for , it gives .
In order to use a refresher for information storage, first we need to enter the encoded information into it, and at the end, we need to decode the information from it. How can this be done in a noisy environment? We have nothing particularly smart to say here about encoding besides the reference to the general reliable computation scheme discussed earlier. On the other hand, it turns out that if is sufficiently small then decoding can be avoided altogether.
Recall that in our codes, it is possible to designate certain symbols as information symbols. So, in principle it is sufficient to read out these symbols. The question is only how likely it is that any one of these symbols will be corrupted. The following theorem gives an upper bound on the probability for any symbol to be corrupted, at any time.
Theorem 4.36 For parameters , integers , codelength , with , consider a -refresher. Build a Boolean clocked circuit of size with inputs and outputs based on this refresher, just as in the proof of Theorem 4.28. Suppose that at time 0, the memory cells of the circuit contain string . Suppose further that the evolution of the circuit has -admissible failures. Let be the bits stored at time . Then implies
for some depending on .
Remark 4.37 What we are bounding is only the probability of a corrupt symbol in the particular position . Some of the symbols will certainly be corrupt, but any one symbol one points to will be corrupt only with probability .
The upper bound on required in the condition of the theorem is very severe, underscoring the theoretical character of this result.
Proof. As usual, it is sufficient to assume . Let , and let be the set of circuit elements which fail at time . Let us define the following sequence of integers:
It is easy to see by induction
The first members of the sequence are 1,2,3,4,6,8,11,15,18,24,32, and for they are 1,1,1,2,2,3,4,5,6,8,11.
Lemma 4.38 Suppose that . Then either there is a time at which circuit elements failed, or there is a sequence of sets for and with the following properties.
(a) For , every element of shares some error-check with some element of . Also every element of shares some error-check with some element of .
(b) We have for , on the other hand .
(c) We have , , for all , and .
Proof. We will define the sequence recursively, and will say when to stop. If then we set , , and stop. Suppose that is already defined. Let us define (or if ). Let be the set of those which share some error-check with an element of , and let . The refresher property implies that either or
In the former case, there must have been some time with , otherwise could never become larger than . In the latter case, the property implies
Now if then let be any subset of with size (there is one), else let and a set of size (there is one). This construction has the required properties.
For a given , the number of different choices for is bounded by
where we used (4.9). Similarly, the number of different choices for is bounded by
It follows that the number of choices for the whole sequence is bounded by
On the other hand, the probability for a fixed to have is . This way, we can bound the probability that the sequence ends exactly at by
where we used (4.29). For small this gives
Therefore
where we used and the property for . We can bound the last expression by with an appropriate constantb .
We found that the event happens either if there is a time at which circuit elements failed (this has probability bound ) or an event of probability occurs.
We will construct our refreshers from bipartite multigraphs with a property similar to compressors: expanders.
Definition 4.39 Here, we will distinguish the two parts of the bipartite (multi) graphs not as inputs and outputs but as left nodes and right nodes. A bipartite multigraph is -regular if the points of the left set have degree and the points in the right set have degree . Consider such a graph, with the left set having nodes (then the right set has nodes). For a subset of the left set of , let consist of the points connected by some edge to some element of . We say that the graph expands by a factor if we have . For , our graph is an -expander if expands every subset of size of the left set by a factor .
Definition 4.40 Given an -regular bipartite multigraph , with left set and right set , we assign to it a parity-check code as follows: if is connected to , and 0 otherwise.
Now for every possible error set , the set describes the set of parity checks that the elements of participate in. Under some conditions, the lower bound on the size of guarantees that a sufficient number of errors will be corrected.
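The passage from a bipartite multigraph to a parity-check matrix, and the neighbor set of an error set, can be sketched in Python as below. The sketch assumes that left nodes play the role of locations and right nodes the role of checks, and it reads "expands by a factor" as the neighbor set having at least that factor times as many elements as the subset; both readings, the edge-list representation, and all names are assumptions.

```python
def parity_check_matrix(edges, n_left, n_right):
    """H[i][j] = 1 if right node i is joined to left node j by some edge."""
    H = [[0] * n_left for _ in range(n_right)]
    for left, right in edges:
        H[right][left] = 1
    return H


def neighbors(edges, subset):
    """Right nodes connected by some edge to an element of `subset`."""
    return {right for left, right in edges if left in subset}


def expands_by(edges, subset, factor):
    """Assumed reading of expansion: |N(E)| >= factor * |E|."""
    return len(neighbors(edges, subset)) >= factor * len(subset)


def row_and_column_weights(H):
    """Weights of the rows and of the columns of a bit matrix."""
    rows = [sum(row) for row in H]
    cols = [sum(col) for col in zip(*H)]
    return rows, cols


# Illustrative graph: 4 left and 4 right nodes, every degree equal to 2.
EDGES = [(0, 0), (0, 1), (1, 1), (1, 2), (2, 2), (2, 3), (3, 3), (3, 0)]
H = parity_check_matrix(EDGES, 4, 4)
```

For this small graph, `neighbors(EDGES, {0, 1})` is the set of checks in which locations 0 and 1 participate, and `row_and_column_weights(H)` reports the quantities bounded in the definition of a low-density parity-check code.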
Theorem 4.41 Let be an -expander with integer . Let . Then is a -refresher.
Proof. More generally, for any , let be an -expander with integer . We will prove that is a -refresher. For an -dimensional bit vector with , , assume
Our goal is to show : in other words, that in the corrected vector the number of errors decreases at least by a factor of .
Let be the set of bits in that the error correction operation fails to flip, with , and the set of bits that were 0 but the operation turns them to 1, with . Our goal is to bound . The key observation is that each element of shares at least half of its neighbors with elements of , and similarly, each element of shares at least half of its neighbors with other elements of . Therefore both and contribute relatively weakly to the expansion of . Since this expansion is assumed strong, the size of must be limited.
Let
By expansion, .
First we show . Assume namely that, on the contrary, , and let be a subset of such that (an integer, according to the assumptions of the theorem). By expansion,
Each bit in has at most neighbors that are not neighbors of ; so,
Combining these:
since . This contradiction with (4.30) shows .
Now implies (recalling that each element of contributes at most new neighbors):
Each must share at least half of its neighbors with others in . Therefore contributes at most neighbors on its own; the contribution of the other must be divided by 2, so the total contribution of to the neighbors of is at most :
Combining with (4.31):
Are there expanders good enough for Theorem 4.41? The maximum expansion factor is the degree and we require a factor of . It turns out that random choice works here, too, similarly to the one used in the “construction” of compressors.
The choice has to be done in a way that the result is an -regular bipartite multigraph of left size . We will start with left nodes and right nodes . Now we choose a random matching, that is a set of edges with the property that every left node is connected by an edge to exactly one right node. Let us call the resulting graph . We obtain now as follows: we collapse each group of left nodes into a single node: into one node, into another node, and so on. Similarly, we collapse each group of right nodes into a single node: into one node, into another node, and so on. The edges between any pair of nodes in are inherited from the ancestors of these nodes in . This results in a graph with left nodes of degree and right nodes of degree . The process may give multiple edges between nodes of , this is why is a multigraph. Two nodes of will be called cluster neighbors if they are collapsed to the same node of .
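The random construction just described (choose a random matching between two equally large sets of auxiliary nodes, then collapse consecutive groups on each side) can be sketched in Python as follows. The parameter names `n_left`, `d` (left degree) and `k` (right degree), the edge-list output, and the optional seeded generator are assumptions of this sketch.

```python
import random
from collections import Counter


def random_regular_bipartite_multigraph(n_left, d, k, rng=None):
    """Return a list of (left, right) edges with left degree d and right degree k.

    The auxiliary graph has n_left * d nodes on each side; a uniformly random
    permutation serves as the random matching between them.
    """
    stubs = n_left * d
    if stubs % k:
        raise ValueError("n_left * d must be divisible by k")
    rng = rng or random.Random()
    perm = list(range(stubs))
    rng.shuffle(perm)  # left auxiliary node s is matched to right node perm[s]
    # Collapse groups of d consecutive left nodes and k consecutive right nodes.
    return [(s // d, perm[s] // k) for s in range(stubs)]


def degrees(edges):
    """Count how many edge ends fall on each left and each right node."""
    left = Counter(l for l, _ in edges)
    right = Counter(r for _, r in edges)
    return left, right
```

Since several auxiliary edges may land between the same pair of collapsed nodes, the result is in general a multigraph, which is why the returned edge list may contain repeated pairs.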
Then the above random choice gives an -expander with positive probability.
Example 4.11 If , then the inequality in the condition of the theorem becomes
Proof. Let be a set of size in the left set of . We will estimate the probability that has too few neighbors. In the above choice of the graph we might as well start with assigning edges to the nodes of , in some fixed order of the nodes of the preimage of in . There are edges to assign. Let us call a node of the right set of occupied if it has a cluster neighbor already reached by an earlier edge. Let be a random variable that is 1 if the th edge goes to an occupied node and 0 otherwise. There are
choices for the th edge and at most of these are occupied. Therefore
Using the large deviations theorem in the generalization given in Exercise 4.1-3, we have, for :
Now, the number of different neighbors of is , hence
Let us now multiply this with the number
of sets of size :
where in the last step we assumed . This is if
Substituting gives the formula of the theorem.
Proof of Theorem 4.35. Theorem 4.41 shows how to get a refresher from an expander, and Theorem 4.42 shows the existence of expanders for certain parameters. Example 4.11 shows that the parameters can be chosen as needed for the refreshers.
Exercises
4.5-1 Prove Claim 4.30.
4.5-2 Apply the ideas of the proof of Theorem 4.36 to the proof of Theorem 4.17, showing that the “coda” circuit is not needed: each wire of the output cable carries the correct value with high probability.
PROBLEMS
Consider a circuit like the one in Exercise 4.2-5, assuming that each gate fails with probability independently of all the others and of the input. Assume that the input vector is all 0, and let be the probability that the circuit outputs a 1. Show that there is a value with the property that for all we have , and for , we have . Estimate also the speed of convergence in both cases.
We defined a compressor as a -halfregular bipartite multigraph. Let us call a compressor regular if it is a -regular multigraph (the input nodes also have degree ). Prove a theorem similar to Theorem 4.21: for each there is an integer and an such that for all integer there is a regular -compressor.
Hint. Choose a random -regular bipartite multigraph by the following process: (1. Replace each vertex by a group of vertices. 2. Choose a random complete matching between the new input and output vertices. 3. Merge each group of vertices into one vertex again.) Prove that the probability, over this choice, that a -regular multigraph is not a compressor is small. For this, express the probability with the help of factorials and estimate the factorials using Stirling's formula.
4-3
Two-way expander
Recall the definition of expanders. Call a -expander regular if it is a -regular multigraph (the input nodes also have degree ). We will call this multigraph a two-way expander if it is an expander in both directions: from to and from to . Prove a theorem similar to the one in Problem 4-2: for all there is an such that for all integers there is a two-way regular -expander.
4-4
Restoring organ from 3-way voting
The proof of Theorem 4.21 did not guarantee a -compressor with any , . If we only want to use 3-way majority gates, consider the following construction. First create a 3-halfregular bipartite graph with inputs and outputs , with a 3-input majority gate in each . Then create new nodes , with a 3-input majority gate in each . The gate of computes the majority of , the gate of computes the majority of , and so on. Calculate whether a random choice of the graph will turn the circuit with inputs and outputs into a restoring organ. Then consider three stages instead of two, where has outputs and see what is gained.
4-5
Restoring organ from NOR gates
The majority gate is not the only gate capable of strengthening the majority. Recall the NOR gate introduced in Exercise 4.2-2, and form . Show that a construction similar to Problem 4-4 can be carried out with used in place of 3-way majority gates.
4-6
More randomness, smaller restoring organs
Taking the notation of Exercise 4.4-3, suppose, as there, that the random variables are independent of each other, and their distribution does not depend on the Boolean input vector. Apply the idea of Exercise 4.4-5 to the construction of each restoring organ. Namely, construct a different restoring organ for each position: the choice depends on the circuit preceding this position. Show that in this case, our error estimates can be significantly improved. The improvement comes, just as in Exercise 4.4-5, since now we do not have to multiply the error probability by the number of all possible sets of size of tainted wires. Since we know the distribution of this set, we can average over it.
4-7
Near-sorting with expanders
In this problem, we show that expanders can be used for “near-sorting”. Let be a regular two-way -expander, whose two parts of size are and . According to a theorem of Kőnig, (the edge-set of) every -regular bipartite multigraph is the disjoint union of (the edge-sets of) complete matchings . To such an expander, we assign a Boolean circuit of depth as follows. The circuit's nodes are subdivided into levels . On level we have two disjoint sets of size of nodes , (). The Boolean value on will be and respectively. Denote the vector of values at stage by . If is an edge in the matching , then we put an gate into , and a gate into :
This network is trying to “sort” the 0's to and the 1's to in stages. More generally, the values in the vectors could be arbitrary numbers. Then if still means and means then each vector is a permutation of the vector . Let . Prove that is -sorted in the sense that for all , at least among the smallest values of is in its left half and at least among the largest values are in its right half.
4-8
Restoring organ from near-sorters
Develop a new restoring organ using expanders, as follows. First, split each wire of the input cable , to get two sets . Attach the -sorter of Problem 4-7, getting outputs . Now split the wires of into two sets . Attach the -sorter again, getting outputs . Keep only for the output cable. Show that the Boolean vector circuit leading from to can be used as a restoring organ.
CHAPTER NOTES
The large deviation theorem (Theorem 4.1), or theorems similar to it, are sometimes attributed to Chernoff or Bernstein. One of its frequently used variants is given in Exercise 4.1-2.
The problem of reliable computation with unreliable components was addressed by John von Neumann in [192] on the model of logic circuits. A complete proof of the result of that paper (with a different restoring organ) appeared first in the paper [63] of R. L. Dobrushin and S. I. Ortyukov. Our presentation relied on parts of the paper [205] of N. Pippenger.
The lower-bound result of Dobrushin and Ortyukov in the paper [62] (corrected in [203], [211] and [85]) shows that redundancy of is unavoidable for a general reliable computation whose complexity is . However, this lower bound only shows the necessity of putting the input into a redundantly encoded form (otherwise critical information may be lost in the first step). As shown in [205], for many important function classes, linear redundancy is achievable.
It seems natural to separate the cost of the initial encoding: it might be possible to perform the rest of the computation with much less redundancy. An important step in this direction has been made by D. Spielman in the paper [248] in (essentially) the clocked-circuit model. Spielman takes a parallel computation with time running on elementary components and makes it reliable using only times more processors and running it times longer. The failure probability will be . This is small as long as is not much larger than . So, the redundancy is bounded by some power of the logarithm of the space requirement; the time requirement does not enter explicitly. In Boolean circuits, time and space complexity are not defined separately. The size of the circuit is analogous to the quantity obtained in other models by taking the product of space and time complexity.
Questions more complex than Problem 4-1 have been studied in [204]. The method of Problem 4-2, for generating random -regular multigraphs is analyzed for example in [27]. It is much harder to generate simple regular graphs (not multigraphs) uniformly. See for example [143].
The result of Exercise 4.2-4 is due to C. Shannon, see [234]. The asymptotically best circuit size for the worst functions was found by Lupanov in [170]. Exercise 4.3-1 is based on [63], and Exercise 4.3-2 is based on [62] (and its corrections).
Problem 4-7 is based on the starting idea of the depth sorting networks in [9].
For storage in Boolean circuits we partly relied on A. V. Kuznietsov's paper [156] (the main theorem, on the existence of refreshers is from M. Pinsker). Low density parity check codes were introduced by R. G. Gallager in the book [80], and their use in reliable storage was first suggested by M. G. Taylor in the paper [257]. New, constructive versions of these codes were developed by M. Sipser and D. Spielman in the paper [247], with superfast coding and decoding.
Expanders, invented by Pinsker in [202], have been used extensively in theoretical computer science: see for example [188] for some more detail. This book also gives references on the construction of graphs with large eigenvalue-gap. Exercise 4.4-4 and Problem 4-6 are based on [63].
The use of expanders in the role of refreshers was suggested by Pippenger (private communication): our exposition follows Sipser and Spielman in [243]. Random expanders were found for example by Pinsker. The needed expansion rate ( times the left degree) is larger than what can be implied from the size of the eigenvalue gap. As shown in [202] (see the proof in Theorem 4.42) random expanders have the needed expansion rate. Lately, constructive expanders with nearly maximal expansion rate were announced by Capalbo, Reingold, Vadhan and Wigderson in [39].
Reliable computation is also possible in a model of parallel computation that is much more regular than logic circuits: in cellular automata. We cannot present those results here: see for example the papers [84], [86].
First, in this chapter, we will discuss some of the basic concepts of algebra, such as fields, vector spaces and polynomials (Section 5.1). Our main focus will be the study of polynomial rings in one variable. These polynomial rings play a very important role in constructive applications. After this, we will outline the theory of finite fields, putting a strong emphasis on the problem of constructing them (Section 5.2) and on the problem of factoring polynomials over such fields (Section 5.3). Then we will study lattices and discuss the Lenstra-Lenstra-Lovász algorithm which can be used to find short lattice vectors (Section 5.4). We will present a polynomial time algorithm for the factorisation of polynomials with rational coefficients; this was the first notable application of the Lenstra-Lenstra-Lovász algorithm (Section 5.5).
In this section we will overview some important concepts related to rings and polynomials.
We recall some definitions introduced in Chapters 31–33 of the textbook Introduction to Algorithms. In the sequel all cross references to Chapters 31–33 refer to results in that book.
A set with at least two elements is called a ring, if it has two binary operations, the addition, denoted by the sign, and the multiplication, denoted by the sign. The elements of form an Abelian group with respect to the addition, and they form a monoid (that is, a semigroup with an identity), whose identity element is denoted by 1, with respect to the multiplication. We assume that . Further, the distributive properties also hold: for arbitrary elements we have
Being an Abelian group with respect to the addition means that the operation is associative, commutative, it has an identity element (denoted by 0), and every element has an inverse with respect to this identity. More precisely, these requirements are the following:
associative property: for all triples we have ;
commutative property: for all pairs we have ;
existence of the identity element: for the zero element of and for all elements of , we have ;
existence of the additive inverse: for all there exists , such that .
It is easy to show that each of the elements in has a unique inverse. We usually denote the inverse of an element by .
Concerning the multiplication, we require that it must be associative and that the multiplicative identity should exist. The identity of a ring is the multiplicative identity of . The usual name of the additive identity is zero. We usually omit the sign when writing the multiplication, for example we usually write instead of .
(i) The set of integers with the usual operations and .
(ii) The set of residue classes modulo with respect to the addition and multiplication modulo .
(iii) The set of -matrices with real entries with respect to the addition and multiplication of matrices.
Let and be rings. A map is said to be a homomorphism, if preserves the operations, in the sense that and holds for all pairs . A homomorphism is called an isomorphism, if is a one-to-one correspondence, and the inverse is also a homomorphism. We say that the rings and are isomorphic, if there is an isomorphism between them. If and are isomorphic rings, then we write . From an algebraic point of view, isomorphic rings can be viewed as identical.
For example the map which maps an integer to its residue modulo 6 is a homomorphism: , etc.
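The homomorphism property of the residue map modulo 6 can be checked mechanically on a finite sample of integers. The Python sketch below does this; the function names and the sampled range are illustrative assumptions, and a finite check is of course only a sanity test, not a proof.

```python
def phi(a, m=6):
    """Map an integer to its residue modulo m (always in 0..m-1 in Python)."""
    return a % m


def preserves_operations(sample, m=6):
    """Check phi(a+b) = phi(a)+phi(b) and phi(ab) = phi(a)phi(b) on a sample.

    The right-hand sides are computed in the ring of residues modulo m.
    """
    for a in sample:
        for b in sample:
            if phi(a + b, m) != (phi(a, m) + phi(b, m)) % m:
                return False
            if phi(a * b, m) != (phi(a, m) * phi(b, m)) % m:
                return False
    return True
```

For instance, `preserves_operations(range(-20, 21))` runs the check over all pairs of integers between -20 and 20.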
A useful and important ring theoretic construction is the direct sum. The direct sum of the rings and is denoted by . The underlying set of the direct sum is , that is, the set of ordered pairs where . The operations are defined componentwise: for we let
Easy calculation shows that is a ring with respect to the operations above. This construction can easily be generalised to more than two rings. In this case, the elements of the direct sum are the -tuples, where is the number of rings in the direct sum, and the operations are defined componentwise.
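The componentwise operations of a direct sum are easy to write down when each summand is a ring of residue classes. The following Python sketch builds addition and multiplication for such a direct sum and verifies the distributive law by exhaustive search; restricting to residue rings, the tuple representation, and all names are assumptions of the sketch.

```python
from itertools import product


def direct_sum_ops(moduli):
    """Addition and multiplication on tuples, one residue ring per component."""
    def add(x, y):
        return tuple((a + b) % m for a, b, m in zip(x, y, moduli))

    def mul(x, y):
        return tuple((a * b) % m for a, b, m in zip(x, y, moduli))

    return add, mul


def elements(moduli):
    """All tuples of the direct sum."""
    return list(product(*(range(m) for m in moduli)))


def is_distributive(moduli):
    """Check x(y + z) = xy + xz and (y + z)x = yx + zx for all triples."""
    add, mul = direct_sum_ops(moduli)
    elems = elements(moduli)
    return all(
        mul(x, add(y, z)) == add(mul(x, y), mul(x, z))
        and mul(add(y, z), x) == add(mul(y, x), mul(z, x))
        for x in elems for y in elems for z in elems
    )
```

With `moduli = (2, 3)` the direct sum has six elements, the zero element is `(0, 0)` and the identity is `(1, 1)`; passing a longer tuple of moduli gives the generalisation to more than two rings.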
A ring is said to be a field, if its non-zero elements form an Abelian group with respect to the multiplication. The multiplicative inverse of a non-zero element is usually denoted .
The best-known examples of fields are the sets of rational numbers, real numbers, and complex numbers with respect to the usual operations. We usually denote these fields by , respectively.
Another important class of fields consists of the fields of -elements where is a prime number. The elements of are the residue classes modulo , and the operations are the addition and the multiplication defined on the residue classes. The distributive property can easily be derived from the distributivity of the integer operations. By Theorem 33.12, is a group with respect to the addition, and, by Theorem 33.13, the set of non-zero elements of is a group with respect to the multiplication. In order to prove this latter claim, we need to use that is a prime number.
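The role of primality in the last claim can be seen by tabulating multiplicative inverses of residue classes. The Python sketch below searches for inverses by brute force; the function name is an illustrative assumption.

```python
def inverses_mod(m):
    """Map each nonzero residue a modulo m to a residue b with ab = 1, if any."""
    table = {}
    for a in range(1, m):
        for b in range(1, m):
            if (a * b) % m == 1:
                table[a] = b
                break
    return table
```

For a prime modulus such as 7 every nonzero residue appears as a key, so the nonzero residues form a group under multiplication. For a composite modulus such as 6 only some residues are invertible, so the residue classes modulo 6 do not form a field.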
In an arbitrary field, we may consider the set of elements of the form , that is, the set of elements that can be written as the sum of copies of the multiplicative identity where is a positive integer. Clearly, one of the two possibilities must hold:
(a) none of the elements is zero;
(b) is zero for some .
In case (a) we say that is a field with characteristic zero. In case (b) the characteristic of is the smallest such that . In this case, the number must be a prime, for, if , then , and so either or .
Suppose that denotes the smallest subfield of that contains . Then is said to be the prime field of . In case (a) the subfield consists of the elements where is an integer and is a positive integer. In this case, is isomorphic to the field of rational numbers. The identification is obvious: .
In case (b) the characteristic is a prime number, and is the set of elements where . In this case, is isomorphic to the field of residue classes modulo .
Let be a field. An additively written Abelian group is said to be a vector space over , or simply an -vector space, if for all elements and , an element is defined (in other words, acts on ) and the following hold:
Here are arbitrary elements of , the elements are arbitrary in , and the element 1 is the multiplicative identity of .
The space of -matrices over is an important example of vector spaces. Their properties are studied in Chapter 31.
A vector space $V$ over a field $\mathbb{F}$ is said to be finite-dimensional if there is a collection $\{v_1, \ldots, v_n\}$ of finitely many elements in $V$ such that each of the elements $v \in V$ can be written as a linear combination $v = a_1 v_1 + \cdots + a_n v_n$ for some $a_1, \ldots, a_n \in \mathbb{F}$. Such a set is called a generating set of $V$. The cardinality of the smallest generating set of $V$ is referred to as the dimension of $V$ over $\mathbb{F}$, denoted $\dim_{\mathbb{F}} V$. In a finite-dimensional vector space, a generating system containing $\dim_{\mathbb{F}} V$ elements is said to be a basis.
A set $\{v_1, \ldots, v_n\}$ of elements of a vector space $V$ is said to be linearly independent, if, for $a_1, \ldots, a_n \in \mathbb{F}$, the equation $0 = a_1 v_1 + \cdots + a_n v_n$ implies $a_1 = \cdots = a_n = 0$. It is easy to show that a basis in $V$ is a linearly independent set. An important property of linearly independent sets is that such a set can be extended to a basis of the vector space. The dimension of a vector space coincides with the cardinality of its largest linearly independent set.
A non-empty subset $U$ of a vector space $V$ is said to be a subspace of $V$, if it is an (additive) subgroup of $V$, and $au \in U$ holds for all $a \in \mathbb{F}$ and $u \in U$. It is obvious that a subspace can be viewed as a vector space.
The concept of homomorphisms can be defined for vector spaces, but in this context we usually refer to them as linear maps. Let $V_1$ and $V_2$ be vector spaces over a common field $\mathbb{F}$. A map $\varphi : V_1 \to V_2$ is said to be linear, if, for all $a, b \in \mathbb{F}$ and $u, v \in V_1$, we have
$$\varphi(au + bv) = a\varphi(u) + b\varphi(v).$$
The linear mapping $\varphi$ is an isomorphism if $\varphi$ is a one-to-one correspondence and its inverse is also a homomorphism. Two vector spaces are said to be isomorphic if there is an isomorphism between them.
Lemma 5.1 Suppose that $\varphi : V_1 \to V_2$ is a linear mapping. Then $U = \varphi(V_1)$ is a subspace in $V_2$. If $\varphi$ is one-to-one, then $\dim_{\mathbb{F}} U = \dim_{\mathbb{F}} V_1$. If, in this case, $\dim_{\mathbb{F}} V_1 = \dim_{\mathbb{F}} V_2 < \infty$, then $U = V_2$ and the mapping $\varphi$ is an isomorphism.
Proof. As
$$a\varphi(u) + b\varphi(v) = \varphi(au + bv),$$
we obtain that $U$ is a subspace. Further, it is clear that the images of the elements of a generating set of $V_1$ form a generating set for $U$. Let us now suppose that $\varphi$ is one-to-one. In this case, the image of a linearly independent subset of $V_1$ is linearly independent in $V_2$. It easily follows from these observations that the image of a basis of $V_1$ is a basis of $U$, and so $\dim_{\mathbb{F}} U = \dim_{\mathbb{F}} V_1$. If we assume, in addition, that $\dim_{\mathbb{F}} V_2 = \dim_{\mathbb{F}} V_1$, then a basis of $U$ is also a basis of $V_2$, as it is a linearly independent set, and so it can be extended to a basis of $V_2$. Thus $U = V_2$ and the mapping $\varphi$ must be a one-to-one correspondence. It is easy to see, and is left to the reader, that $\varphi^{-1}$ is a linear mapping.
The direct sum of vector spaces can be defined similarly to the direct sum of rings. The direct sum of the vector spaces $V_1$ and $V_2$ is denoted by $V_1 \oplus V_2$. The underlying set of the direct sum is $V_1 \times V_2$, and the addition and the action of the field $\mathbb{F}$ are defined componentwise. It is easy to see that
$$\dim_{\mathbb{F}} (V_1 \oplus V_2) = \dim_{\mathbb{F}} V_1 + \dim_{\mathbb{F}} V_2.$$
Let $\mathbb{F}$ be a field and let $G \subseteq \mathbb{F}$ be a finite multiplicative subgroup of $\mathbb{F}$. That is, the set $G$ contains finitely many elements of $\mathbb{F}$, each of which is non-zero, $G$ is closed under multiplication, and the multiplicative inverse of an element of $G$ also lies in $G$. We aim to show that the group $G$ is cyclic, that is, $G$ can be generated by a single element. The main concepts related to cyclic groups can be found in Section 33.3.4. The order $o(\alpha)$ of an element $\alpha \in G$ is the smallest positive integer $k$ such that $\alpha^k = 1$.
The cyclic group generated by an element $\alpha$ is denoted by $\langle \alpha \rangle$. Clearly, $|\langle \alpha \rangle| = o(\alpha)$, and an element $\alpha^i$ generates the group $\langle \alpha \rangle$ if and only if $i$ and $o(\alpha)$ are relatively prime. Hence the group $\langle \alpha \rangle$ has exactly $\varphi(o(\alpha))$ generators where $\varphi$ is Euler's totient function (see Subsection 33.3.2).
The following identity is valid for an arbitrary integer $k \geq 1$:
$$\sum_{d \mid k} \varphi(d) = k.$$
Here the summation index $d$ runs through all positive divisors of $k$. In order to verify this identity, consider all the rational numbers $i/k$ with $1 \leq i \leq k$. The number of these is exactly $k$. After simplifying these fractions, they will be of the form $j/d$ where $d$ is a positive divisor of $k$. A fixed denominator $d$ will occur exactly $\varphi(d)$ times.
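The counting argument behind the identity $\sum_{d \mid k} \varphi(d) = k$ can be checked numerically. The following is a minimal sketch; the helper names `phi` and `reduced_denominators` are illustrative and do not come from the text.

```python
from collections import Counter
from fractions import Fraction
from math import gcd

def phi(n):
    """Euler's totient: the number of 1 <= i <= n with gcd(i, n) == 1."""
    return sum(1 for i in range(1, n + 1) if gcd(i, n) == 1)

def reduced_denominators(k):
    """Count the denominators of i/k, i = 1..k, after reduction to lowest terms."""
    return Counter(Fraction(i, k).denominator for i in range(1, k + 1))

k = 12
counts = reduced_denominators(k)
for d in sorted(counts):
    print(d, counts[d], phi(d))          # each denominator d occurs phi(d) times
print(sum(phi(d) for d in counts) == k)  # the divisor sum recovers k
```

Every divisor of `k` shows up as a denominator, and the multiplicities add up to `k`, which is exactly the statement of the identity.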
Theorem 5.2 Suppose that $\mathbb{F}$ is a field and let $G$ be a finite multiplicative subgroup of $\mathbb{F}$. Then there exists an element $\alpha \in G$ such that $G = \langle \alpha \rangle$.
Proof. Suppose that $|G| = k$. Lagrange's theorem (Theorem 33.15) implies that the order of an element $\alpha \in G$ is a divisor of $k$. We claim, for an arbitrary $d$, that there are at most $\varphi(d)$ elements in $\mathbb{F}$ with order $d$. The elements with order $d$ are roots of the polynomial $x^d - 1$. If $\mathbb{F}$ has an element $\alpha$ with order $d$, then, by Lemma 5.5, $x^d - 1 = (x - \alpha)(x - \alpha^2) \cdots (x - \alpha^d)$ (the lemma will be verified later). Therefore all the elements of $\mathbb{F}$ with order $d$ are contained in the group $\langle \alpha \rangle$, which, in turn, contains exactly $\varphi(d)$ elements of order $d$.
If $G$ had no element of order $k$, then the order of each of the elements of $G$ would be a proper divisor of $k$. In this case, however, using the identity above and the fact that $\varphi(k) > 0$, we obtain
$$k = |G| \leq \sum_{d \mid k,\ d < k} \varphi(d) < k,$$
which is a contradiction.
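For a concrete instance of the theorem, one can take the multiplicative group of the residue classes modulo a prime. The sketch below finds the generators by brute force; it assumes the field is $\mathbb{Z}/p\mathbb{Z}$ for a small prime $p$, and the function names are illustrative.

```python
def order(a, p):
    """Multiplicative order of a modulo the prime p (a not divisible by p)."""
    k, x = 1, a % p
    while x != 1:
        x = x * a % p
        k += 1
    return k

def generators(p):
    """All elements of order p - 1 in the multiplicative group of Z/pZ."""
    return [a for a in range(1, p) if order(a, p) == p - 1]

print(generators(13))  # [2, 6, 7, 11]
```

The group has 12 elements and exactly $\varphi(12) = 4$ generators, in line with the counting used in the proof.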
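Suppose that $\mathbb{F}$ is a field and that $a_0, \ldots, a_n$ are elements of $\mathbb{F}$. Recall that an expression of the form
$$f = f(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_n x^n,$$
where $x$ is an indeterminate, is said to be a polynomial over $\mathbb{F}$ (see Chapter 32). The scalars $a_i$ are the coefficients of the polynomial $f$. The degree of the zero polynomial is zero, while the degree of a non-zero polynomial $f$ is the largest index $j$ such that $a_j \neq 0$. The degree of $f$ is denoted by $\deg f$.
The set of all polynomials over $\mathbb{F}$ in the indeterminate $x$ is denoted by $\mathbb{F}[x]$. If
$$f = a_0 + a_1 x + \cdots + a_n x^n$$
and
$$g = b_0 + b_1 x + \cdots + b_n x^n$$
are polynomials with degree not larger than $n$, then their sum is defined as the polynomial
$$h = f + g = d_0 + d_1 x + \cdots + d_n x^n$$
whose coefficients are $d_i = a_i + b_i$.
The product $fg$ of the polynomials $f$ and $g$ is defined as the polynomial
$$fg = c_0 + c_1 x + \cdots + c_{2n} x^{2n}$$
with degree at most $2n$ whose coefficients are given by the equations $c_j = \sum_{i=0}^{j} a_i b_{j-i}$. On the right-hand side of these equations, the coefficients with index greater than $n$ are considered zero. Easy computation shows that $\mathbb{F}[x]$ is a commutative ring with respect to these operations. It is also straightforward to show that $\mathbb{F}[x]$ has no zero divisors, that is, whenever $fg = 0$, then either $f = 0$ or $g = 0$.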
The ring of polynomials over is quite similar, in many ways, to the ring of integers. One of their similar features is that the procedure of division with remainder can be performed in both rings.
Lemma 5.3 Let $f, g \in \mathbb{F}[x]$ be polynomials such that $g \neq 0$. Then there exist polynomials $q$ and $r$ such that
$$f = qg + r,$$
and either $r = 0$ or $\deg r < \deg g$. Moreover, the polynomials $q$ and $r$ are uniquely determined by these conditions.
Proof. We verify the claim about the existence of the polynomials $q$ and $r$ by induction on the degree of $f$. If $f = 0$ or $\deg f < \deg g$, then the assertion clearly holds. Let us suppose, therefore, that $\deg f \geq \deg g$. Then subtracting a suitable multiple $q^* g$ of $g$ from $f$, we obtain that the degree of $f_1 = f - q^* g$ is smaller than $\deg f$. Then, by the induction hypothesis, there exist polynomials $q_1$ and $r_1$ such that
$$f_1 = q_1 g + r_1,$$
and either $r_1 = 0$ or $\deg r_1 < \deg g$. It is easy to see that, in this case, the polynomials $q = q_1 + q^*$ and $r = r_1$ are as required.
It remains to show that the polynomials $q$ and $r$ are unique. Let $Q$ and $R$ be polynomials, possibly different from $q$ and $r$, satisfying the assertions of the lemma. That is, $f = qg + r = Qg + R$, and so $(q - Q)g = R - r$. If the polynomial on the left-hand side is non-zero, then its degree is at least $\deg g$, while the degree of the polynomial on the right-hand side is smaller than $\deg g$. This, however, is not possible.
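The existence part of the proof is effectively the schoolbook long-division procedure. A minimal sketch over a prime field $\mathbb{F}_p$ follows. The coefficient-list representation (lowest degree first, the empty list for the zero polynomial) and the names `trim` and `poly_divmod` are assumptions made for this illustration.

```python
def trim(f):
    """Drop zero coefficients from the high-degree end; [] is the zero polynomial."""
    f = list(f)
    while f and f[-1] == 0:
        f.pop()
    return f

def poly_divmod(f, g, p):
    """Return (q, r) with f = q*g + r over F_p and r == [] or deg r < deg g."""
    f, g = trim([c % p for c in f]), trim([c % p for c in g])
    if not g:
        raise ZeroDivisionError("division by the zero polynomial")
    inv_lead = pow(g[-1], -1, p)          # inverse of the leading coefficient (Python 3.8+)
    q = [0] * max(len(f) - len(g) + 1, 0)
    r = f
    while len(r) >= len(g):
        shift = len(r) - len(g)
        c = r[-1] * inv_lead % p          # chosen so that the leading term cancels
        q[shift] = c
        for i, gc in enumerate(g):
            r[i + shift] = (r[i + shift] - c * gc) % p
        r = trim(r)                       # the degree drops in every round
    return trim(q), r

# (x^4 + x^3 + x^2 + x) divided by (x^3 + x + 1) over F_2
print(poly_divmod([0, 1, 1, 1, 1], [1, 1, 0, 1], 2))  # ([1, 1], [1, 1])
```

Each pass of the loop subtracts a suitable multiple of `g`, which mirrors the induction step; the number of field operations is quadratic in the degrees.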
Let be a commutative ring with a multiplicative identity and without zero divisors, and set . The ring is said to be a Euclidean ring if there is a function such that , for all ; and, further, if , , then there are elements such that , and if , then . The previous lemma shows that is a Euclidean ring where the role of the function is played by the degree function.
The concept of divisibility in $\mathbb{F}[x]$ can be defined similarly to the definition of the corresponding concept in the ring of integers. A polynomial $g$ is said to be a divisor of a polynomial $f$ (the notation is $g \mid f$), if there is a polynomial $h \in \mathbb{F}[x]$ such that $f = gh$. The non-zero elements of $\mathbb{F}$, which are clearly divisors of each of the polynomials, are called the trivial divisors or units. A non-zero polynomial $f \in \mathbb{F}[x]$ is said to be irreducible, if whenever $f = gh$ with $g, h \in \mathbb{F}[x]$, then either $g$ or $h$ is a unit.
Two polynomials $f, g \in \mathbb{F}[x]$ are called associates, if there is some non-zero $u \in \mathbb{F}$ such that $f = ug$.
Using Lemma 5.3, one can easily prove the unique factorisation theorem in the ring of polynomials following the argument of the proof of the corresponding theorem in the ring of integers (see Section 33.1). The role of the absolute value of integers is played by the degree of polynomials.
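Theorem 5.4 An arbitrary polynomial $0 \neq f \in \mathbb{F}[x]$ can be written in the form
$$f = u\, q_1^{e_1} \cdots q_r^{e_r},$$
where $u \in \mathbb{F} \setminus \{0\}$ is a unit, the polynomials $q_i \in \mathbb{F}[x]$ are pairwise non-associate and irreducible, and, further, the numbers $e_i$ are positive integers. Furthermore, this decomposition is essentially unique in the sense that whenever
$$f = U\, Q_1^{d_1} \cdots Q_s^{d_s}$$
is another such decomposition, then $r = s$, and, after possibly reordering the factors $Q_i$, the polynomials $q_i$ and $Q_i$ are associates, and moreover $d_i = e_i$ for all $1 \leq i \leq r$.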
Two polynomials are said to be relatively prime, if they have no common irreducible divisors.
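A scalar $a \in \mathbb{F}$ is a root of a polynomial $f \in \mathbb{F}[x]$, if $f(a) = 0$. Here the value $f(a)$ is obtained by substituting $a$ into the place of $x$ in $f(x)$.
Lemma 5.5 Suppose that $a \in \mathbb{F}$ is a root of a polynomial $0 \neq f \in \mathbb{F}[x]$. Then there exists a polynomial $g \in \mathbb{F}[x]$ such that $f(x) = (x - a) g(x)$. Hence the polynomial $f$ may have at most $\deg f$ roots.
Proof. By Lemma 5.3, there exist $g(x) \in \mathbb{F}[x]$ and $r \in \mathbb{F}$ such that $f(x) = (x - a) g(x) + r$. Substituting $a$ for $x$, we find that $r = 0$. The second assertion now follows by induction on $\deg f$ from the fact that every root of $f$ other than $a$ is also a root of $g$.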
Suppose that $f, g \in \mathbb{F}[x]$ are polynomials of degree at most $n$. Then the polynomials $f \pm g$ can obviously be computed using $O(n)$ field operations. The product $fg$ can be obtained, using its definition, by $O(n^2)$ field operations. If the Fast Fourier Transform can be performed over $\mathbb{F}$, then the multiplication can be computed using only $O(n \lg n)$ field operations (see Theorem 32.2). For general fields, the cost of the fastest known multiplication algorithms for polynomials (for instance the Schönhage–Strassen method) is $O(n \lg n \lg \lg n)$, that is, $\widetilde{O}(n)$ field operations.
The division with remainder, that is, determining the polynomials $q$ and $r$ for which $f = qg + r$ and either $r = 0$ or $\deg r < \deg g$, can be performed using $O(n^2)$ field operations following the straightforward method outlined in the proof of Lemma 5.3. There is, however, an algorithm (the Sieveking–Kung algorithm) for the same problem using only $\widetilde{O}(n)$ steps. The details of this algorithm are not discussed here.
Let $f \in \mathbb{F}[x]$ with $\deg f = n > 0$, and let $g, h \in \mathbb{F}[x]$. We say that $g$ is congruent to $h$ modulo $f$, or simply $g \equiv h \pmod{f}$, if $f$ divides the polynomial $g - h$. This concept of congruence is similar to the corresponding concept introduced in the ring of integers (see 33.3.2). It is easy to see from the definition that the relation $\equiv$ is an equivalence relation on the set $\mathbb{F}[x]$. Let $[g]_f$ (or simply $[g]$ if $f$ is clear from the context) denote the equivalence class containing $g$. From Lemma 5.3 we obtain immediately, for each $g$, that there is a unique $r \in \mathbb{F}[x]$ such that $[r] = [g]$, and either $r = 0$ (if $f$ divides $g$) or $\deg r < n$. This polynomial $r$ is called the representative of the class $[g]$. The set of equivalence classes is traditionally denoted by $\mathbb{F}[x]/(f)$.
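Lemma 5.6 Let $f, g, g_1, h, h_1 \in \mathbb{F}[x]$ and let $a \in \mathbb{F}$. Suppose that $g \equiv g_1 \pmod{f}$ and $h \equiv h_1 \pmod{f}$. Then
$$g + h \equiv g_1 + h_1 \pmod{f}, \qquad gh \equiv g_1 h_1 \pmod{f},$$
and
$$ag \equiv a g_1 \pmod{f}.$$
Proof. The first congruence is valid, as
$$(g + h) - (g_1 + h_1) = (g - g_1) + (h - h_1),$$
and the right-hand side of this is clearly divisible by $f$. The second and the third congruences follow similarly from the identities
$$gh - g_1 h_1 = (g - g_1) h + (h - h_1) g_1$$
and
$$ag - a g_1 = a (g - g_1),$$
respectively.
The previous lemma makes it possible to define the sum and the product of two congruence classes $[g]_f$ and $[h]_f$ as $[g]_f + [h]_f := [g + h]_f$ and $[g]_f [h]_f := [gh]_f$, respectively. The lemma claims that the sum and the product are independent of the choice of the congruence class representatives. The same way, we may define the action of $\mathbb{F}$ on the set of congruence classes: we set $a [g]_f := [ag]_f$.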
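Theorem 5.7 Suppose that $f \in \mathbb{F}[x]$ and that $\deg f = n > 0$.
(i) The set of residue classes $\mathbb{F}[x]/(f)$ is a commutative ring with an identity under the operations $+$ and $\cdot$ defined above.
(ii) The ring $\mathbb{F}[x]/(f)$ contains the field $\mathbb{F}$ as a subring, and it is an $n$-dimensional vector space over $\mathbb{F}$. Further, the residue classes $[1], [x], \ldots, [x^{n-1}]$ form a basis of $\mathbb{F}[x]/(f)$.
(iii) If $f$ is an irreducible polynomial in $\mathbb{F}[x]$, then $\mathbb{F}[x]/(f)$ is a field.
Proof. (i) The fact that $\mathbb{F}[x]/(f)$ is a ring follows easily from the fact that $\mathbb{F}[x]$ is a ring. Let us, for instance, verify the distributive property:
$$[g]([h_1] + [h_2]) = [g][h_1 + h_2] = [g(h_1 + h_2)] = [g h_1 + g h_2] = [g h_1] + [g h_2] = [g][h_1] + [g][h_2].$$
The zero element of $\mathbb{F}[x]/(f)$ is the class $[0]$, the additive inverse of the class $[g]$ is the class $[-g]$, while the multiplicative identity element is the class $[1]$. The details are left to the reader.
(ii) The set $\{[a] : a \in \mathbb{F}\}$ is a subring isomorphic to $\mathbb{F}$. The correspondence is obvious: $a \leftrightarrow [a]$. By part (i), $\mathbb{F}[x]/(f)$ is an additive Abelian group, and the action of $\mathbb{F}$ satisfies the vector space axioms. This follows from the fact that the polynomial ring is itself a vector space over $\mathbb{F}$. Let us, for example, verify the distributive property:
$$a([h_1] + [h_2]) = a[h_1 + h_2] = [a(h_1 + h_2)] = [a h_1 + a h_2] = [a h_1] + [a h_2] = a[h_1] + a[h_2].$$
The other properties are left to the reader.
We claim that the classes $[1], [x], \ldots, [x^{n-1}]$ are linearly independent. For, if
$$[0] = a_0 [1] + a_1 [x] + \cdots + a_{n-1} [x^{n-1}] = [a_0 + a_1 x + \cdots + a_{n-1} x^{n-1}],$$
then $a_0 = \cdots = a_{n-1} = 0$, as the zero polynomial is the unique polynomial with degree less than $n$ that is divisible by $f$. On the other hand, for a polynomial $g$, the degree of the class representative of $[g]$ is less than $n$. Thus the class $[g]$ can be expressed as a linear combination of the classes $[1], [x], \ldots, [x^{n-1}]$. Hence the classes $[1], [x], \ldots, [x^{n-1}]$ form a basis of $\mathbb{F}[x]/(f)$, and so $\dim_{\mathbb{F}} \mathbb{F}[x]/(f) = n$.
(iii) Suppose that $f$ is irreducible. First we show that $\mathbb{F}[x]/(f)$ has no zero divisors. If $[0] = [g][h] = [gh]$, then $f$ divides $gh$, and so $f$ divides either $g$ or $h$. That is, either $[g] = [0]$ or $[h] = [0]$. Suppose now that $g \in \mathbb{F}[x]$ with $[g] \neq [0]$. We claim that the classes $[g], [xg], \ldots, [x^{n-1} g]$ are linearly independent. Indeed, an equation $[0] = a_0 [g] + a_1 [xg] + \cdots + a_{n-1} [x^{n-1} g]$ implies $[0] = [g][a_0 + a_1 x + \cdots + a_{n-1} x^{n-1}]$, and, in turn, it also yields that $a_0 = \cdots = a_{n-1} = 0$. Therefore the classes $[g], [xg], \ldots, [x^{n-1} g]$ form a basis of $\mathbb{F}[x]/(f)$. Hence there exist coefficients $b_i \in \mathbb{F}$ for which
$$[1] = b_0 [g] + b_1 [xg] + \cdots + b_{n-1} [x^{n-1} g] = [g][b_0 + b_1 x + \cdots + b_{n-1} x^{n-1}].$$
Thus we find that the class $[g] \neq [0]$ has a multiplicative inverse, and so $\mathbb{F}[x]/(f)$ is a field, as required.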
We note that the converse of part (iii) of the previous theorem is also true, and its proof is left to the reader (Exercise 5.1-1).
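Example 5.2 We usually represent the elements of the residue class ring $\mathbb{F}[x]/(f)$ by their representatives, which are polynomials with degree less than $\deg f$.
1. Suppose that $\mathbb{F} = \mathbb{F}_2$ is the field of two elements, and let $f(x) = x^3 + x + 1$. Then the ring $\mathbb{F}[x]/(f)$ has 8 elements, namely
$$[0],\ [1],\ [x],\ [x+1],\ [x^2],\ [x^2+1],\ [x^2+x],\ [x^2+x+1].$$
Practically speaking, the addition of classes is the addition of polynomials. For instance
$$[x^2 + 1] + [x^2 + x] = [x + 1].$$
When computing the product, we compute the product of the representatives, and substitute it (or reduce it) with its remainder after dividing by $f$. For instance,
$$[x^2 + 1] \cdot [x^2 + x] = [x^4 + x^3 + x^2 + x] = [x + 1].$$
The polynomial $f$ is irreducible over $\mathbb{F}_2$, since it has degree 3, and has no roots. Hence the residue class ring $\mathbb{F}[x]/(f)$ is a field.
2. Let $\mathbb{F} = \mathbb{R}$ and let $f(x) = x^2 - 1$. The elements of the residue class ring $\mathbb{F}[x]/(f)$ are the classes of the form $[ax + b]$ where $a, b \in \mathbb{R}$. The ring $\mathbb{F}[x]/(f)$ is not a field, since $f$ is not irreducible. For instance, $[x + 1] \cdot [x - 1] = [0]$.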
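Arithmetic in a residue class ring over the two-element field is convenient to experiment with on a computer. The sketch below assumes the modulus $x^3 + x + 1$, one of the two irreducible cubics over $\mathbb{F}_2$, and stores a polynomial as a bitmask (bit $i$ is the coefficient of $x^i$); the names `MOD`, `gf_add` and `gf_mul` are illustrative.

```python
MOD = 0b1011   # x^3 + x + 1 (assumed modulus)
N = 3          # degree of the modulus

def gf_add(a, b):
    """Addition of classes: coefficientwise addition modulo 2, i.e. XOR."""
    return a ^ b

def gf_mul(a, b):
    """Multiply two representatives (degree < N) and reduce modulo MOD."""
    prod = 0
    while b:
        if b & 1:
            prod ^= a
        b >>= 1
        a <<= 1                # multiply a by x
        if (a >> N) & 1:       # degree reached N: subtract the modulus
            a ^= MOD
    return prod

print(bin(gf_add(0b101, 0b110)))  # 0b11, i.e. [x^2+1] + [x^2+x] = [x+1]
print(bin(gf_mul(0b101, 0b110)))  # 0b11, i.e. [x^2+1] * [x^2+x] = [x+1]

# Every non-zero class has a multiplicative inverse, so the ring is a field.
inverses = {a: next(b for b in range(1, 8) if gf_mul(a, b) == 1) for a in range(1, 8)}
print(inverses)
```

Reducing after every shift keeps all intermediate values below degree 3, so no separate division step is needed.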
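Lemma 5.8 Let $\mathbb{L}$ be a field containing a field $\mathbb{F}$ and let $\alpha \in \mathbb{L}$.
(i) If $\mathbb{L}$ is finite-dimensional as a vector space over $\mathbb{F}$, then there is a non-zero polynomial $f \in \mathbb{F}[x]$ such that $\alpha$ is a root of $f$.
(ii) Assume that there is a non-zero polynomial $f \in \mathbb{F}[x]$ with $f(\alpha) = 0$, and let $g$ be such a polynomial with minimal degree. Then the polynomial $g$ is irreducible in $\mathbb{F}[x]$. Further, if $h \in \mathbb{F}[x]$ with $h(\alpha) = 0$ then $g$ is a divisor of $h$.
Proof. (i) For a sufficiently large $n$, the elements $1, \alpha, \ldots, \alpha^n$ are linearly dependent over $\mathbb{F}$. A linear dependence gives a polynomial $0 \neq f \in \mathbb{F}[x]$ such that $f(\alpha) = 0$.
(ii) If $g = g_1 g_2$, then, as $0 = g(\alpha) = g_1(\alpha) g_2(\alpha)$, the element $\alpha$ is a root of either $g_1$ or $g_2$. As $g$ was chosen to have minimal degree, one of the polynomials $g_1, g_2$ is a unit, and so $g$ is irreducible. Finally, let $h \in \mathbb{F}[x]$ such that $h(\alpha) = 0$. Let $q, r \in \mathbb{F}[x]$ be the polynomials as in Lemma 5.3 for which $h = qg + r$. Substituting $\alpha$ for $x$ into the last equation, we obtain $r(\alpha) = 0$, which is only possible if $r = 0$.
Definition 5.9 The polynomial $g \in \mathbb{F}[x]$ in the last lemma is said to be a minimal polynomial of $\alpha$.
It follows from the previous lemma that the minimal polynomial is unique up to a scalar multiple. It will often be helpful to assume that the leading coefficient (the coefficient of the term with the highest degree) of the minimal polynomial $g$ is 1.
Corollary 5.10 Let $\mathbb{L}$ be a field containing $\mathbb{F}$, and let $\alpha \in \mathbb{L}$. Suppose that $f \in \mathbb{F}[x]$ is irreducible and that $f(\alpha) = 0$. Then $f$ is a minimal polynomial of $\alpha$.
Proof. Suppose that $g$ is a minimal polynomial of $\alpha$. By the previous lemma, $g \mid f$ and $f$ is irreducible. This is only possible if the polynomials $f$ and $g$ are associates.
Let $\mathbb{L}$ be a field containing $\mathbb{F}$ and let $\alpha \in \mathbb{L}$. Let $\mathbb{F}(\alpha)$ denote the smallest subfield of $\mathbb{L}$ that contains $\mathbb{F}$ and $\alpha$.
Theorem 5.11 Let $\mathbb{L}$ be a field containing $\mathbb{F}$ and let $\alpha \in \mathbb{L}$. Suppose that $f \in \mathbb{F}[x]$ is a minimal polynomial of $\alpha$. Then the field $\mathbb{F}(\alpha)$ is isomorphic to the field $\mathbb{F}[x]/(f)$. More precisely, there exists an isomorphism $\phi : \mathbb{F}[x]/(f) \to \mathbb{F}(\alpha)$ such that $\phi(a) = a$, for all $a \in \mathbb{F}$, and $\phi([x]_f) = \alpha$. The map $\phi$ is also an isomorphism of vector spaces over $\mathbb{F}$, and so $\dim_{\mathbb{F}} \mathbb{F}(\alpha) = \deg f$.
Proof. Let us consider the map $\psi : \mathbb{F}[x] \to \mathbb{L}$, which maps a polynomial $g \in \mathbb{F}[x]$ into $g(\alpha)$. This is clearly a ring homomorphism, and $\psi(\mathbb{F}[x]) \subseteq \mathbb{F}(\alpha)$. We claim that $\psi(g) = \psi(h)$ if and only if $[g]_f = [h]_f$. Indeed, $\psi(g) = \psi(h)$ holds if and only if $\psi(g - h) = 0$, that is, if $g(\alpha) - h(\alpha) = 0$, which, by Lemma 5.8, is equivalent to $f \mid g - h$, and this amounts to saying that $[g]_f = [h]_f$. Suppose that $\phi$ is the map $\mathbb{F}[x]/(f) \to \mathbb{L}$ induced by $\psi$, that is, $\phi([g]_f) := \psi(g)$. By the argument above, the map $\phi$ is one-to-one. Routine computation shows that $\phi$ is a ring, and also a vector space, homomorphism. As $\mathbb{F}[x]/(f)$ is a field, its homomorphic image $\phi(\mathbb{F}[x]/(f))$ is also a field. The field $\phi(\mathbb{F}[x]/(f))$ contains $\mathbb{F}$ and $\alpha$, and so necessarily $\phi(\mathbb{F}[x]/(f)) = \mathbb{F}(\alpha)$.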
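Let $f, g \in \mathbb{F}[x]$ be polynomials such that $g \neq 0$. Set $f_0 = f$, $f_1 = g$, and define the polynomials $q_i$ and $f_i$ using division with remainder as follows:
$$f_0 = q_1 f_1 + f_2, \qquad f_1 = q_2 f_2 + f_3, \qquad \ldots, \qquad f_{k-1} = q_k f_k + f_{k+1}.$$
Note that if $1 \leq i < k$ then $\deg f_{i+1}$ is smaller than $\deg f_i$. We form this sequence of polynomials until we obtain that $f_{k+1} = 0$. By Lemma 5.3, this defines a finite process. Let $n$ be the maximum of $\deg f$ and $\deg g$. As, in each step, we decrease the degree of the polynomials, we have $k \leq n + 1$. The computation outlined above is usually referred to as the Euclidean algorithm. A version of this algorithm for the ring of integers is described in Section 33.2.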
We say that the polynomial $h$ is the greatest common divisor of the polynomials $f$ and $g$, if $h \mid f$, $h \mid g$, and, if a polynomial $h_1$ is a divisor of $f$ and $g$, then $h_1$ is a divisor of $h$. The usual notation for the greatest common divisor of $f$ and $g$ is $\gcd(f, g)$. It follows from Theorem 5.4 that $\gcd(f, g)$ exists and it is unique up to a scalar multiple.
Theorem 5.12 Suppose that $f, g \in \mathbb{F}[x]$ are polynomials, that $g \neq 0$, and let $n$ be the maximum of $\deg f$ and $\deg g$. Assume, further, that the number $k$ and the polynomial $f_k$ are defined by the procedure above. Then
(i) $\gcd(f, g) = f_k$.
(ii) There are polynomials $u, v$ with degree at most $n$ such that
$$f_k = u f + v g. \qquad (5.1)$$
(iii) With a given input $f, g$, the polynomials $u$, $v$, $f_k$ can be computed using $O(n^3)$ field operations in $\mathbb{F}$.
Proof. (i) Going backwards in the Euclidean algorithm, it is easy to see that the polynomial divides each of the , and so it divides both and . The same way, if a polynomial divides and , then it divides , for all , and, in particular, it divides . Thus .
(ii) The claim is obvious if , and so we may assume without loss of generality that . Starting at the beginning of the Euclidean sequence, it is easy to see that there are polynomials such that
We observe that (5.2) also holds if we substitute by its remainder after dividing by and substitute by its remainder after dividing by . In order to see this, we compute
and notice that the degree of the polynomials on both sides of this congruence is smaller than . This gives
(iii) Once we determined the polynomials , , and , the polynomials , and can be obtained using field operations in . Initially we have and . As , the claim follows.
Remark. Traditionally, the Euclidean algorithm is only used to compute the greatest common divisor. The version that also computes the two cofactor polynomials in (5.1) is usually called the extended Euclidean algorithm. In Chapter 6 the reader can find a discussion of the Euclidean algorithm for polynomials. It is relatively easy to see that the greatest common divisor and the two cofactor polynomials in (5.1) can, in fact, be computed using $O(n^2)$ field operations. The cost of the asymptotically best method is $\widetilde{O}(n)$.
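The extended Euclidean algorithm can be sketched in a few lines for polynomials over a prime field $\mathbb{F}_p$. The code below is an illustration under stated assumptions, not the procedure of Chapter 6: polynomials are coefficient lists with the lowest degree first, and all function names are hypothetical. It returns a greatest common divisor $h$ together with cofactors $u, v$ such that $h = uf + vg$.

```python
def trim(f):
    f = list(f)
    while f and f[-1] == 0:
        f.pop()
    return f

def sub(f, g, p):
    n = max(len(f), len(g))
    f, g = list(f) + [0] * (n - len(f)), list(g) + [0] * (n - len(g))
    return trim([(a - b) % p for a, b in zip(f, g)])

def mul(f, g, p):
    if not f or not g:
        return []
    out = [0] * (len(f) + len(g) - 1)
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            out[i + j] = (out[i + j] + a * b) % p
    return trim(out)

def poly_divmod(f, g, p):
    """Division with remainder; g must be non-zero."""
    f, g = trim([c % p for c in f]), trim([c % p for c in g])
    inv_lead = pow(g[-1], -1, p)
    q, r = [0] * max(len(f) - len(g) + 1, 0), f
    while len(r) >= len(g):
        shift, c = len(r) - len(g), r[-1] * inv_lead % p
        q[shift] = c
        for i, gc in enumerate(g):
            r[i + shift] = (r[i + shift] - c * gc) % p
        r = trim(r)
    return trim(q), r

def extended_euclid(f, g, p):
    """Return (h, u, v) with h = u*f + v*g, where h is a gcd of f and g over F_p."""
    r0, r1 = trim([c % p for c in f]), trim([c % p for c in g])
    u0, u1, v0, v1 = [1], [], [], [1]      # invariants: r0 = u0*f + v0*g, r1 = u1*f + v1*g
    while r1:
        q, r = poly_divmod(r0, r1, p)
        r0, r1 = r1, r
        u0, u1 = u1, sub(u0, mul(q, u1, p), p)
        v0, v1 = v1, sub(v0, mul(q, v1, p), p)
    return r0, u0, v0

# f = (x+1)(x+2), g = (x+1)(x+3) over F_5; the gcd is a scalar multiple of x + 1
print(extended_euclid([2, 3, 1], [3, 4, 1], 5))  # ([4, 4], [1], [4])
```

The returned divisor is determined only up to a non-zero scalar; multiplying by the inverse of its leading coefficient gives the monic form. When the gcd is a constant, the cofactor $v$ yields an inverse of $g$ modulo $f$.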
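The derivative of a polynomial is often useful when investigating multiple factors. The derivative of the polynomial
$$f(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_n x^n \in \mathbb{F}[x]$$
is the polynomial
$$f'(x) = a_1 + 2 a_2 x + \cdots + n a_n x^{n-1}.$$
It follows immediately from the definition that the map $f(x) \mapsto f'(x)$ is an $\mathbb{F}$-linear mapping $\mathbb{F}[x] \to \mathbb{F}[x]$. Further, for $f, g \in \mathbb{F}[x]$ and $a \in \mathbb{F}$, the equations $(f + g)' = f' + g'$ and $(af)' = a f'$ hold. The derivative of a product can be computed using the Leibniz rule: for all $f, g \in \mathbb{F}[x]$ we have $(fg)' = f' g + f g'$. As the derivation is a linear map, in order to show that the Leibniz rule is valid, it is enough to verify it for polynomials of the form $f(x) = x^i$ and $g(x) = x^j$. It is easy to see that, for such polynomials, the Leibniz rule is valid.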
The derivative is sensitive to multiple factors in the irreducible factorisation of .
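Lemma 5.13 Let $\mathbb{F}$ be an arbitrary field, and assume that $f, g \in \mathbb{F}[x]$ and $g^d \mid f$ where $d > 0$. Then $g^{d-1}$ divides the derivative $f'$ of the polynomial $f$.
Proof. Using induction on $d$ and the Leibniz rule, we find $(g^d)' = d g^{d-1} g'$. Thus, applying the Leibniz rule again, $f = g^d h$ implies $f' = g^{d-1}(d g' h + g h')$. Hence $g^{d-1} \mid f'$.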
In many cases the converse of the last lemma also holds.
Lemma 5.14 Let be an arbitrary field, and assume that and where the polynomials and are relatively prime. Suppose further that (for instance has characteristic and is non-constant). Then the derivative is not divisible by .
Proof. By the Leibniz rule, . Since is smaller than , we obtain that is not divisible by , and neither is the product , as and are relatively prime.
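The formal derivative is easy to compute, and a few small cases over a prime field show both its sensitivity to multiple factors and why positive characteristic needs care. In this sketch the coefficient-list representation (lowest degree first) and the name `derivative` are assumptions for illustration.

```python
def trim(f):
    f = list(f)
    while f and f[-1] == 0:
        f.pop()
    return f

def derivative(f, p):
    """Formal derivative over F_p; f[i] is the coefficient of x^i."""
    return trim([(i * c) % p for i, c in enumerate(f)][1:])

p = 3
print(derivative([0, -1, 0, 1], p))   # x^3 - x  ->  [2], the constant -1 in F_3
print(derivative([0, 0, 0, 1], p))    # x^3      ->  [], the zero polynomial
print(derivative([1, 2, 1], p))       # (x+1)^2  ->  [2, 2], that is 2(x + 1)
```

The first case has a constant non-zero derivative, so the polynomial can have no repeated factor. The second shows that in characteristic $p$ a non-constant polynomial may have zero derivative, and the third shows a repeated factor surviving in the derivative.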
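Using the following theorem, the ring $\mathbb{F}[x]/(f)$ can be assembled from rings of the form $\mathbb{F}[x]/(g)$ where $g \mid f$.
Theorem 5.15 (Chinese remainder theorem for polynomials) Let $f_1, \ldots, f_k \in \mathbb{F}[x]$ be pairwise relatively prime polynomials with positive degree and set $f = f_1 \cdots f_k$. Then the rings $\mathbb{F}[x]/(f)$ and $\mathbb{F}[x]/(f_1) \oplus \cdots \oplus \mathbb{F}[x]/(f_k)$ are isomorphic. The mapping realizing the isomorphism is
$$\phi : [g]_f \mapsto ([g]_{f_1}, \ldots, [g]_{f_k}), \qquad g \in \mathbb{F}[x].$$
Proof. First we note that the map $\phi$ is well-defined. If $h \in [g]_f$, then $h = g + f^* f$ for some $f^* \in \mathbb{F}[x]$, which implies that $h$ and $g$ give the same remainder after division by the polynomial $f_i$, that is, $[h]_{f_i} = [g]_{f_i}$.
The mapping $\phi$ is clearly a ring homomorphism, and it is also a linear mapping between two vector spaces over $\mathbb{F}$. The mapping $\phi$ is one-to-one; for, if $\phi([g]) = \phi([h])$, then $\phi([g - h]) = (0, \ldots, 0)$, that is, $f_i \mid g - h$ ($1 \leq i \leq k$), which gives $f \mid g - h$ and $[g] = [h]$.
The dimensions of the vector spaces $\mathbb{F}[x]/(f)$ and $\mathbb{F}[x]/(f_1) \oplus \cdots \oplus \mathbb{F}[x]/(f_k)$ coincide: indeed, both spaces have dimension $\deg f$. Lemma 5.1 implies that $\phi$ is an isomorphism between vector spaces. It only remains to show that $\phi^{-1}$ preserves the multiplication; this, however, is left to the reader.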
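A small brute-force check makes the isomorphism tangible. The sketch assumes the two-element field and the relatively prime moduli $x^2 + x + 1$ and $x + 1$ (chosen here for illustration), with polynomials stored as bitmasks; `clmul`, `pmod` and `crt_map` are hypothetical helper names.

```python
F1, F2 = 0b111, 0b11   # x^2 + x + 1 and x + 1, relatively prime over F_2

def clmul(a, b):
    """Product in F_2[x] (carry-less multiplication of bitmasks)."""
    out = 0
    while b:
        if b & 1:
            out ^= a
        a <<= 1
        b >>= 1
    return out

def pmod(a, m):
    """Remainder of a modulo the non-zero polynomial m in F_2[x]."""
    dm = m.bit_length() - 1
    while a.bit_length() - 1 >= dm:
        a ^= m << (a.bit_length() - 1 - dm)
    return a

F = clmul(F1, F2)      # 0b1001, that is x^3 + 1

def crt_map(g):
    """Send the class of g modulo F to its pair of classes modulo F1 and F2."""
    return (pmod(g, F1), pmod(g, F2))

images = [crt_map(g) for g in range(1 << 3)]   # all 8 residues modulo F
print(len(set(images)) == 8)                   # the map is one-to-one
```

Since both sides have 8 elements, injectivity already gives a bijection, which is the dimension argument of the proof in miniature.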
Exercises
5.1-1 Let $f \in \mathbb{F}[x]$ be a polynomial. Show that the residue class ring $\mathbb{F}[x]/(f)$ has no zero divisors if and only if $f$ is irreducible.
5.1-2 Let be a commutative ring with an identity. A subset is said to be an ideal, if is an additive subgroup, and , imply . Show that is a field if and only if its ideals are exactly and .
5.1-3 Let . Let denote the smallest ideal in that contains the elements . Show that always exists, and it consists of the elements of the form where .
5.1-4 A commutative ring with an identity and without zero divisors is said to be a principal ideal domain if, for each ideal of , there is an element such that (using the notation of the previous exercise) . Show that and where is a field, are principal ideal domains.
5.1-5 Suppose that $R$ is a commutative ring with an identity, that $I$ is an ideal in $R$, and that $a, b \in R$. Define a relation on $R$ as follows: $a \equiv b \pmod{I}$ if and only if $a - b \in I$. Verify the following:
a.) The relation is an equivalence relation on .
b.) Let denote the equivalence class containing an element , and let denote the set of equivalence classes. Set , and . Show that, with respect to these operations, is a commutative ring with an identity.
Hint. Follow the argument in the proof of Theorem 5.7.
5.1-6 Let be a field and let such that . Show that there exists a polynomial such that .
Hint. Use the Euclidean algorithm.
Finite fields, that is, fields with a finite number of elements, play an important role in mathematics and in several of its application areas, for instance, in computing. They are also fundamental in many important constructions. In this section we summarise the most important results in the theory of finite fields, putting an emphasis on the problem of their construction.
In this section denotes a prime number, and denotes a power of with a positive integer exponent.
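Theorem 5.16 Suppose that $\mathbb{F}$ is a finite field. Then there is a prime number $p$ such that the prime field of $\mathbb{F}$ is isomorphic to $\mathbb{F}_p$ (the field of residue classes modulo $p$). Further, the field $\mathbb{F}$ is a finite dimensional vector space over $\mathbb{F}_p$, and the number of its elements is a power of $p$. In fact, if $\dim_{\mathbb{F}_p} \mathbb{F} = d$, then $|\mathbb{F}| = p^d$.
Proof. The characteristic of $\mathbb{F}$ must be a prime, say $p$, as a field with characteristic zero must have infinitely many elements. Thus the prime field $\mathbb{P}$ of $\mathbb{F}$ is isomorphic to $\mathbb{F}_p$. Since $\mathbb{P}$ is a subfield, the field $\mathbb{F}$ is a vector space over $\mathbb{P}$. Let $\alpha_1, \ldots, \alpha_d$ be a basis of $\mathbb{F}$ over $\mathbb{P}$. Then each $\alpha \in \mathbb{F}$ can be written uniquely in the form $\sum_{j=1}^{d} a_j \alpha_j$ where $a_j \in \mathbb{P}$. Hence $|\mathbb{F}| = p^d$.
In a field $\mathbb{F}$, the set of non-zero elements (the multiplicative group of $\mathbb{F}$) is denoted by $\mathbb{F}^*$. From Theorem 5.2 we immediately obtain the following result.
Theorem 5.17 If $\mathbb{F}$ is a finite field, then its multiplicative group $\mathbb{F}^*$ is cyclic.
A generator of the group $\mathbb{F}^*$ is said to be a primitive element. If $|\mathbb{F}| = q$ and $\alpha$ is a primitive element of $\mathbb{F}$, then the elements of $\mathbb{F}$ are $0, \alpha, \alpha^2, \ldots, \alpha^{q-1} = 1$.
Corollary 5.18 Suppose that $\mathbb{F}$ is a finite field with order $p^d$ and let $\alpha$ be a primitive element of $\mathbb{F}$. Let $g \in \mathbb{F}_p[x]$ be a minimal polynomial of $\alpha$ over $\mathbb{F}_p$. Then $g$ is irreducible in $\mathbb{F}_p[x]$, the degree of $g$ is $d$, and $\mathbb{F}$ is isomorphic to the field $\mathbb{F}_p[x]/(g)$.
Proof. Since the element $\alpha$ is primitive in $\mathbb{F}$, we have $\mathbb{F} = \mathbb{F}_p(\alpha)$. The rest of the corollary follows from Lemma 5.8 and from Theorem 5.11.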
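Theorem 5.19 Let $\mathbb{F}$ be a finite field with order $q$. Then
(i) (Fermat's little theorem) If $\alpha \in \mathbb{F}^*$, then $\alpha^{q-1} = 1$.
(ii) If $\alpha \in \mathbb{F}$, then $\alpha^q = \alpha$.
Proof. (i) Suppose that $\beta \in \mathbb{F}^*$ is a primitive element. Then we may choose an integer $i$ such that $\alpha = \beta^i$. Therefore
$$\alpha^{q-1} = (\beta^i)^{q-1} = (\beta^{q-1})^i = 1^i = 1.$$
(ii) Clearly, if $\alpha = 0$ then this claim is true, while, for $\alpha \neq 0$, the claim follows from part (i).
Theorem 5.20 Let $\mathbb{F}$ be a field with $q$ elements. Then
$$x^q - x = \prod_{\alpha \in \mathbb{F}} (x - \alpha).$$
Proof. By Theorem 5.19 and Lemma 5.5, the product on the right-hand side is a divisor of the polynomial $x^q - x \in \mathbb{F}[x]$. Now the assertion follows, as the degrees and the leading coefficients of the two polynomials in the equation coincide.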
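For a prime field the factorisation of $x^q - x$ into linear factors can be verified by expanding the product directly. The following sketch assumes $q = p$ is a prime and uses coefficient lists with the lowest degree first; the function names are illustrative.

```python
def mul_linear(f, a, p):
    """Multiply the polynomial f by (x - a) over F_p."""
    out = [0] * (len(f) + 1)
    for i, c in enumerate(f):
        out[i + 1] = (out[i + 1] + c) % p
        out[i] = (out[i] - a * c) % p
    return out

def product_of_linear_factors(p):
    """Expand the product of (x - a) over all a in F_p."""
    f = [1]
    for a in range(p):
        f = mul_linear(f, a, p)
    return f

print(product_of_linear_factors(7))  # [0, 6, 0, 0, 0, 0, 0, 1], that is x^7 - x
```

All middle coefficients vanish modulo $p$, leaving only $x^p$ and $-x$.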
Corollary 5.21 Arbitrary two finite fields with the same number of elements are isomorphic.
Proof. Suppose that , and that both and are fields with elements. Let be a primitive element in . Then Corollary 5.18 implies that a minimal polynomial of over is irreducible (in ) with degree . Further, . By Lemma 5.8 and Theorem 5.19, the minimal polynomial is a divisor of the polynomial . Applying Theorem 5.20 to , we find that the polynomial , and also its divisor , can be factored as a product of linear terms in , and so has at least one root in . As is irreducible in , it must be a minimal polynomial of (see Corollary 5.10), and so is isomorphic to the field . Comparing the number of elements in and in , we find that , and further, that and are isomorphic.
In the sequel, we let denote the field with elements, provided it exists. In order to prove the existence of such a field for each prime-power , the following two facts will be useful.
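Lemma 5.22 If $p$ is a prime number and $j$ is an integer such that $0 < j < p$, then $p \mid \binom{p}{j}$.
Proof. On the one hand, the number $\binom{p}{j}$ is an integer. On the other hand, $\binom{p}{j} = p(p-1) \cdots (p-j+1)/j!$ is a fraction such that, for $0 < j < p$, its numerator is divisible by $p$, but its denominator is not.
Lemma 5.23 Let $R$ be a commutative ring and let $p$ be a prime such that $pr = 0$ for all $r \in R$. Then the map $\Phi_p : R \to R$ mapping $r \mapsto r^p$ is a ring homomorphism.
Proof. Suppose that $r, s \in R$. Clearly,
$$\Phi_p(rs) = (rs)^p = r^p s^p = \Phi_p(r) \Phi_p(s).$$
By the previous lemma,
$$\Phi_p(r + s) = (r + s)^p = \sum_{j=0}^{p} \binom{p}{j} r^{p-j} s^j = r^p + s^p = \Phi_p(r) + \Phi_p(s).$$
We obtain in the same way that $\Phi_p(r - s) = \Phi_p(r) - \Phi_p(s)$.
The homomorphism $\Phi_p$ in the previous lemma is called the Frobenius endomorphism.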
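Both lemmas can be checked on a small ring of characteristic $p$. The sketch below uses pairs $(a, b)$ standing for $a + bi$ with $i^2 = -1$ and coefficients modulo $p$; this particular ring and the function names are assumptions made for illustration.

```python
from math import comb

def add(u, v, p):
    return ((u[0] + v[0]) % p, (u[1] + v[1]) % p)

def mul(u, v, p):
    (a, b), (c, d) = u, v
    return ((a * c - b * d) % p, (a * d + b * c) % p)   # uses i*i = -1

def frobenius(u, p):
    """The p-th power map r -> r^p, computed by repeated multiplication."""
    out = (1, 0)
    for _ in range(p):
        out = mul(out, u, p)
    return out

p = 5
print(all(comb(p, j) % p == 0 for j in range(1, p)))   # binomial coefficients vanish mod p
ring = [(a, b) for a in range(p) for b in range(p)]
print(all(frobenius(add(u, v, p), p) == add(frobenius(u, p), frobenius(v, p), p)
          for u in ring for v in ring))                # the map is additive
```

For a composite modulus the first check fails, for example $\binom{4}{2} = 6$ is not divisible by 4, which is why the lemma requires a prime.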
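Theorem 5.24 Assume that the polynomial $f \in \mathbb{F}_q[x]$ is irreducible, and, for a positive integer $d$, it is a divisor of the polynomial $x^{q^d} - x$. Then the degree of $f$ divides $d$.
Proof. Let $n$ be the degree of $f$, and suppose, by contradiction, that $d = tn + s$ where $0 < s < n$. The assumption that $f \mid x^{q^d} - x$ can be rephrased as $x^{q^d} \equiv x \pmod{f}$. However, this means that, for an arbitrary polynomial $u(x) \in \mathbb{F}_q[x]$, we have
$$u(x)^{q^d} = u(x^{q^d}) \equiv u(x) \pmod{f}.$$
Note that we applied Lemma 5.23 to the ring $R = \mathbb{F}_q[x]/(f)$, and Theorem 5.19 to $\mathbb{F}_q$. The residue class ring $\mathbb{F}_q[x]/(f)$ is isomorphic to the field $\mathbb{F}_{q^n}$, which has $q^n$ elements. Let $u \in \mathbb{F}_q[x]$ be a polynomial for which $u(x) \pmod{f}$ is a primitive element in the field $\mathbb{F}_{q^n}$. That is, $u(x)^{q^n - 1} \equiv 1 \pmod{f}$, but $u(x)^j \not\equiv 1 \pmod{f}$ for $j = 1, \ldots, q^n - 2$. Therefore,
$$u(x) \equiv u(x)^{q^d} = \left(u(x)^{q^{tn}}\right)^{q^s} \equiv u(x)^{q^s} \pmod{f},$$
and so $u(x)\left(u(x)^{q^s - 1} - 1\right) \equiv 0 \pmod{f}$. Since the residue class ring $\mathbb{F}_q[x]/(f)$ is a field, $u(x) \not\equiv 0 \pmod{f}$, but we must have $u(x)^{q^s - 1} \equiv 1 \pmod{f}$. As $0 < q^s - 1 < q^n - 1$, this contradicts the primitivity of $u(x) \pmod{f}$.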
Theorem 5.25 For an arbitrary prime $p$ and positive integer $n$, there exists a field with $p^n$ elements.
Proof. We use induction on $n$. The claim clearly holds if $n = 1$. Now let $n > 1$ and let $r$ be a prime divisor of $n$. By the induction hypothesis, there is a field with $q = p^{n/r}$ elements. By Theorem 5.24, each of the irreducible factors, in $\mathbb{F}_q[x]$, of the polynomial $f(x) = x^{q^r} - x$ has degree either 1 or $r$. Further, $f'(x) = (x^{q^r} - x)' = -1$, and so, by Lemma 5.13, $f(x)$ is square-free. Over $\mathbb{F}_q$, the number of linear factors of $f(x)$ is at most $q$, and so is the degree of their product. Hence there exist at least $(q^r - q)/r \geq 1$ polynomials with degree $r$ that are irreducible in $\mathbb{F}_q[x]$. Let $g$ be such a polynomial. Then the field $\mathbb{F}_q[x]/(g)$ is isomorphic to the field with $q^r = p^n$ elements.
Corollary 5.26 For each positive integer , there is an irreducible polynomial with degree .
Proof. Take a minimal polynomial over of a primitive element in .
A little bit later, in Theorem 5.31, we will prove a stronger statement: a random polynomial in with degree is irreducible with high probability.
The following theorem describes all subfields of a finite field.
Theorem 5.27 The field contains a subfield isomorphic to , if and only if . In this case, there is exactly one subfield in that is isomorphic to .
Proof. The condition that is necessary, since the larger field is a vector space over the smaller field, and so must hold with a suitable integer .
Conversely, suppose that , and let be an irreducible polynomial with degree . Such a polynomial exists by Corollary 5.26. Let . Applying Theorem 5.19, we obtain, in , that , which yields . Thus must be a divisor of the polynomial . Using Theorem 5.20, we find that has a root in . Now we may prove in the usual way that the subfield is isomorphic to .
The last assertion is valid, as the elements of are exactly the roots of (Theorem 5.20), and this polynomial can have, in an arbitrary field, at most roots.
Next we prove an important property of the irreducible polynomials over finite fields.
Theorem 5.28 Assume that are finite fields, and let . Let be the minimal polynomial of over with leading coefficient , and suppose that . Then
Moreover, the elements are pairwise distinct.
Proof. Let . If with , then, using Lemma 5.23 and Theorem 5.19, we obtain
Thus is also a root of .
As is a root of , the argument in the previous paragraph shows that so are the elements . Hence it suffices to show, that they are pairwise distinct. Suppose, by contradiction, that and that . Let and let . By assumption, , which, by Lemma 5.8, means that . From Theorem 5.24, we obtain, in this case, that , which is a contradiction, as .
This theorem shows that a polynomial which is irreducible over a finite field cannot have multiple roots. Further, all the roots of can be obtained from a single root taking -th powers repeatedly.
In this section we characterise certain automorphisms of finite fields.
Definition 5.29 Suppose that are finite fields. The map is an -automorphism of the field , if it is an isomorphism between rings, and holds for all .
Recall that the map is defined as follows: where .
Theorem 5.30 The set of -automorphisms of the field is formed by the maps .
Proof. By Lemma 5.23, the map is a ring homomorphism. The map is obviously one-to-one, and hence it is also an isomorphism. It follows from Theorem 5.19, that leaves the elements fixed. Thus the maps are -automorphisms of .
Suppose that , and with , and that is an -automorphism of . We claim that is a root of . Indeed,
Let be a primitive element of and assume now that is a minimal polynomial of . By the observation above and by Theorem 5.28, , with some , that is, . Hence the images of a generating element of under the automorphisms and coincide, which gives .
Let . By Theorem 5.7 and Corollary 5.26, the field can be written in the form , where is an irreducible polynomial with degree . In practical applications of field theory, for example in computer science, this is the most common method of constructing a finite field. Using, for instance, the polynomial in Example 5.2, we may construct the field . The following theorem shows that we have a good chance of obtaining an irreducible polynomial by a random selection.
Theorem 5.31 Let be a uniformly distributed random polynomial with degree and leading coefficient . (Being uniformly distributed means that the probability of choosing is .) Then is irreducible over with probability at least .
Proof. First we estimate the number of elements for which . We claim that the number of such elements is at least
where the summation runs for the distinct prime divisors of . Indeed, if does not generate, over , the field , then it is contained in a maximal subfield of , and these maximal subfields are, by Theorem 5.27, exactly the fields of the form . The number of distinct prime divisors of is at most , and so the number of such elements is at least . The minimal polynomials with leading coefficients 1 over of such elements have degree and they are irreducible. Such a polynomial is a minimal polynomial of exactly elements (Theorem 5.28). Hence the number of distinct irreducible polynomials with degree and leading coefficient 1 in is at least
from which the claim follows.
If, having , we would like to construct one of its extensions , then it is worth selecting a random polynomial
In other words, we select uniformly distributed random coefficients independently. The polynomial so obtained is irreducible with a high probability (in fact, with probability at least if is large). Further, in this case, we also have . We expect that we will have to select about polynomials before we find an irreducible one.
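The random selection strategy can be sketched for the two-element field. The code below makes several assumptions for illustration: the base field is $\mathbb{F}_2$, polynomials are bitmasks, and irreducibility is tested by naive trial division (practical only for small degrees), not by the factorisation methods of the later sections. The names `pmod`, `is_irreducible` and `random_irreducible` are hypothetical.

```python
import random

def pmod(a, m):
    """Remainder of a modulo the non-zero polynomial m in F_2[x] (bitmasks)."""
    dm = m.bit_length() - 1
    while a.bit_length() - 1 >= dm:
        a ^= m << (a.bit_length() - 1 - dm)
    return a

def is_irreducible(f):
    """Trial division by every polynomial of degree 1 .. deg(f) // 2."""
    n = f.bit_length() - 1
    if n < 1:
        return False
    for g in range(2, 1 << (n // 2 + 1)):
        if pmod(f, g) == 0:
            return False
    return True

def random_irreducible(n, rng=random):
    """Draw monic polynomials of degree n uniformly until one is irreducible."""
    tries = 0
    while True:
        tries += 1
        f = (1 << n) | rng.getrandbits(n)   # leading coefficient 1, random lower coefficients
        if is_irreducible(f):
            return f, tries

f, tries = random_irreducible(8, random.Random(2011))
print(bin(f), tries)
```

The number of attempts varies from run to run; on average it grows roughly linearly with the degree, which matches the expectation stated above.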
We have seen in Theorem 5.25 that field extensions can be obtained using irreducible polynomials. It is often useful if these polynomials have some further nice properties. The following lemma claims the existence of such polynomials.
Lemma 5.32 Let be a prime. In a finite field there exists an element which is not an -th power if and only if . If is such an element, then the polynomial is irreducible in , and so is a field with elements.
Proof. Suppose first that and let be a positive integer such that . If such that , then , while if , then . Hence, in this case, each of the elements of is an -th power.
Next we assume that , and we let be a primitive element in . Then, in , the -th powers are exactly the following elements: . Suppose now that , but . Then the order of an element is divisible by if and only if is not an -th power. Let be such an element, and let be an irreducible factor of the polynomial . Suppose that the degree of is ; clearly, . Then is a field with elements and, in , the equation holds. Therefore the order of is divisible by . Consequently, . As is not divisible by , we have . In other words . On the other hand, as , we find , and hence , which, since , can only happen if .
In certain cases, we can use the previous lemma to boost the probability of finding an irreducible polynomial.
Claim 5.33 Let $r$ be a prime such that $r \mid q - 1$. Then, for a random element $a \in \mathbb{F}_q^*$, the polynomial $x^r - a$ is irreducible in $\mathbb{F}_q[x]$ with probability at least $1 - 1/r$.
Proof. Under the conditions, the $r$-th powers in $\mathbb{F}_q^*$ constitute the cyclic subgroup with order $(q - 1)/r$. Thus a random element $a \in \mathbb{F}_q^*$ is an $r$-th power with probability $1/r$, and hence the assertion follows from Lemma 5.32.
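The counting argument can be checked numerically in a prime field. The following is a small sketch, assuming that $q = p$ is a prime, so that the non-zero elements are the residues $1, \ldots, p - 1$; the function names are illustrative only and do not come from the text.

```python
from fractions import Fraction

def rth_powers(p, r):
    """Set of r-th powers among the non-zero residues modulo the prime p."""
    return {pow(a, r, p) for a in range(1, p)}

def non_power_fraction(p, r):
    """Fraction of non-zero residues modulo p that are not r-th powers."""
    return Fraction(p - 1 - len(rth_powers(p, r)), p - 1)

# r = 3 divides 7 - 1: the cubes form a subgroup of order (7 - 1) / 3 = 2.
print(sorted(rth_powers(7, 3)))     # [1, 6]
print(non_power_fraction(7, 3))     # 2/3, that is, 1 - 1/r
# r = 3 does not divide 5 - 1: every non-zero residue is a cube.
print(non_power_fraction(5, 3))     # 0
```

For $r \mid p - 1$ the fraction of non-powers equals $1 - 1/r$, which is the probability appearing in the claim.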
Remark. Assume that , and, if , then assume also that . In this case there is an element in that is not an -th power. We claim that the residue class is not an -th power in . Indeed, by the argument in the proof of Lemma 5.32, it suffices to show that . By our assumptions, this is clear if . Now assume that , and write . Then, for all integers , we have , and so, by the assumptions,
Exercises
5.2-1 Show that the polynomial can be factored as a product of linear factors over the field .
5.2-2 Show that the polynomial is irreducible over , that is, . What is the order of the element in the residue class ring? Is it true that the element is primitive in ?
5.2-3 Determine the irreducible factors of over the field .
5.2-4 Determine the subfields of .
5.2-5 Let and be positive integers. Show that there exists a finite field containing such that and . What can we say about the number of elements in ?
5.2-6 Show that the number of irreducible polynomials with degree and leading coefficient 1 over is at most .
5.2-7 (a) Let be a field, let be an -dimensional vector space over , and let be a linear transformation whose minimal polynomial coincides with its characteristic polynomial. Show that there exists a vector such that the images are linearly independent.
(b) A set is said to be a normal basis of over , if and is a linearly independent set over . Show that has a normal basis over .
Hint. Show that a minimal polynomial of the -linear map is , and use part (a).
One of the problems that we often have to solve when performing symbolic computation is the factorisation problem. Factoring an algebraic expression means writing it as a product of simpler expressions. Experience shows that this can be very helpful in the solution of a large variety of algebraic problems. In this section, we consider a class of factorisation algorithms that can be used to factor polynomials in one variable over finite fields.
The input of the polynomial factorisation problem is a polynomial . Our aim is to compute a factorisation
of where the polynomials are pairwise relatively prime and irreducible over , and the exponents are positive integers. By Theorem 5.4, determines the polynomials and the exponents essentially uniquely.
Then it is easy to compute modulo 23 that
None of the factors , , has a root in , and so they are necessarily irreducible in .
The factorisation algorithms are important computational tools, and so they are implemented in most of the computer algebra systems (Mathematica, Maple, etc). These algorithms are often used in the area of error-correcting codes and in cryptography.
Our aim in this section is to present some of the basic ideas and building blocks that can be used to factor polynomials over finite fields. We will place an emphasis on the existence of polynomial time algorithms. The discussion of the currently best known methods is, however, outside the scope of this book.
The factorisation problem in the previous section can efficiently be reduced to the special case when the polynomial to be factored is square-free; that is, in (5.3), for all . The basis of this reduction is Lemma 5.13 and the following simple result. Recall that the derivative of a polynomial is denoted by .
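Lemma 5.34 Let $f(x) \in \mathbb{F}_q[x]$ be a polynomial. If $f'(x) = 0$, then there exists a polynomial $g(x) \in \mathbb{F}_q[x]$ such that $f(x) = g(x)^p$.
Proof. Suppose that $f(x) = \sum_{i=0}^{n} a_i x^i$. Then $f'(x) = \sum_{i=1}^{n} i a_i x^{i-1}$. If the coefficient $i a_i$ is zero in $\mathbb{F}_q$ then either $a_i = 0$ or $p \mid i$. Hence, if $f'(x) = 0$ then $f(x)$ can be written as $f(x) = \sum_{j=0}^{k} a_{pj} x^{pj}$. Let $q = p^d$; then choosing $b_j = a_{pj}^{p^{d-1}}$, we have $b_j^p = a_{pj}^{p^d} = a_{pj}$, and so $f(x) = \bigl(\sum_{j=0}^{k} b_j x^j\bigr)^p$.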
If , then, using the previous lemma, a factorisation of into square-free factors can be obtained from that of the polynomial , which has smaller degree. On the other hand, if , then, by Lemma 5.13, the polynomial is already square-free and we only have to factor into square-free factors. The division of polynomials and computing the greatest common divisor can be performed in polynomial time, by Theorem 5.12. In order to compute the polynomial , we need the solutions, in , of equations of the form with . If , then is a solution of such an equation, which, using fast exponentiation (repeated squaring, see 33.6.1), can be obtained in polynomial time.
One of the two reduction steps can always be performed if is divisible by a square of a polynomial with positive degree.
Usually a polynomial can be written as a product of square-free factors in many different ways. For the sake of uniqueness, we define the square-free factorisation of a polynomial as the factorisation
where are integers, and the polynomials are relatively prime and square-free. Hence we collect together the irreducible factors of with the same multiplicity. The following algorithm computes a square-free factorisation of . Besides the observations we made in this section, we also use Lemma 5.14. This lemma, combined with Lemma 5.13, guarantees that the product of the irreducible factors with multiplicity one of a polynomial over a finite field is .
Square-Free-Factorisation(
)
1 2 3 4 5WHILE
6DO
IF
7THEN
8 9ELSE
10 11IF
12THEN
13 14RETURN
The degree of the polynomial decreases after each execution of the main loop, and the subroutines used in this algorithm run in polynomial time. Thus the method above can be performed in polynomial time.
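The two reduction steps can be tried out over a prime field. The following Python sketch is a common textbook variant of square-free factorisation, not a transcription of the pseudocode above. It assumes that $q = p$ is prime (so that the $p$-th root of a coefficient is the coefficient itself), that $f$ is monic, and that polynomials are stored as coefficient lists with the constant term first; all function names are illustrative.

```python
def trim(a):
    """Drop trailing zero coefficients in place (lists are low degree first)."""
    while a and a[-1] == 0:
        a.pop()
    return a

def poly_divmod(a, b, p):
    """Quotient and remainder of a by a non-zero b over F_p."""
    a = trim([c % p for c in a])
    b = trim([c % p for c in b])
    quot = [0] * max(len(a) - len(b) + 1, 0)
    inv = pow(b[-1], -1, p)              # inverse of the leading coefficient
    while len(a) >= len(b):
        c, shift = a[-1] * inv % p, len(a) - len(b)
        quot[shift] = c
        for i, bc in enumerate(b):
            a[shift + i] = (a[shift + i] - c * bc) % p
        trim(a)
    return quot, a

def poly_gcd(a, b, p):
    """Monic greatest common divisor over F_p (Euclidean algorithm)."""
    a, b = trim([c % p for c in a]), trim([c % p for c in b])
    while b:
        a, b = b, poly_divmod(a, b, p)[1]
    if a:
        inv = pow(a[-1], -1, p)
        a = [c * inv % p for c in a]
    return a

def derivative(f, p):
    return trim([k * c % p for k, c in enumerate(f)][1:])

def square_free_factorisation(f, p):
    """Return pairs (h, m): f is the product of the h^m, the h are square-free."""
    f = trim([c % p for c in f])
    out = []
    c = poly_gcd(f, derivative(f, p), p)
    w = poly_divmod(f, c, p)[0]          # factors with multiplicity prime to p
    m = 1
    while len(w) > 1:
        y = poly_gcd(w, c, p)
        h = poly_divmod(w, y, p)[0]      # factors with multiplicity exactly m
        if len(h) > 1:
            out.append((h, m))
        w, c, m = y, poly_divmod(c, y, p)[0], m + 1
    if len(c) > 1:                       # what is left is a p-th power
        root = c[::p]                    # keep the coefficients of x^(pj)
        out += [(h, m * p) for h, m in square_free_factorisation(root, p)]
    return out

# x^2 (x + 1)^3 = x^5 + x^2 over F_3
print(square_free_factorisation([0, 0, 1, 0, 0, 1], 3))
```

In the example the factor $x$ is found with multiplicity 2 by the gcd step, while $(x+1)^3 = x^3 + 1$ has zero derivative and is handled by the $p$-th root step.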
Suppose that is a square-free polynomial. Now we factor as
where, for , the polynomial is a product of irreducible polynomials with degree . Though this step is not actually necessary for the solution of the factorisation problem, it is worth considering, as several of the known methods can efficiently exploit the structure of the polynomials . The following fact serves as the starting point of the distinct degree factorisation.
Theorem 5.35 The polynomial $x^{q^d} - x$ is the product of all the irreducible polynomials $f \in \mathbb{F}_q[x]$, each of which is taken with multiplicity $1$, that have leading coefficient $1$ and whose degree divides $d$.
Proof. As $(x^{q^d} - x)' = -1$, all the irreducible factors of this polynomial occur with multiplicity one. If $f \in \mathbb{F}_q[x]$ is irreducible and divides $x^{q^d} - x$, then, by Theorem 5.24, the degree of $f$ divides $d$.
Conversely, let $f \in \mathbb{F}_q[x]$ be an irreducible polynomial with degree $k$ such that $k \mid d$. Then, by Theorem 5.27, $f$ has a root in $\mathbb{F}_{q^d}$, which implies $f \mid x^{q^d} - x$.
The theorem offers an efficient method for computing the polynomials . First we separate from , and then, step by step, we separate the product of the factors with higher degrees.
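Distinct-Degree-Factorisation($f$)
1  $F \leftarrow f$
2  FOR $i \leftarrow 1$ TO $\deg f$
3    DO $h_i \leftarrow \gcd(F, x^{q^i} - x)$
4       $F \leftarrow F/h_i$
5  RETURN $h_1, \ldots, h_{\deg f}$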
If, in this algorithm, the polynomial $F(x)$ is constant, then we may stop, as the further steps will not give new factors. As the polynomial $x^{q^i} - x$ may have large degree, computing $\gcd(F(x), x^{q^i} - x)$ must be performed with particular care. The important idea here is that the residue $x^{q^i} \pmod{F(x)}$ can be computed using fast exponentiation.
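The procedure can be sketched in Python for a prime field. The sketch below assumes that $q = p$ is prime and that $f$ is monic and square-free; polynomials are coefficient lists with the constant term first, and the helper names are illustrative. The residue of $x^{p^i}$ is obtained from that of $x^{p^{i-1}}$ by one modular exponentiation, so the polynomial $x^{p^i} - x$ itself is never formed.

```python
def trim(a):
    while a and a[-1] == 0:
        a.pop()
    return a

def poly_divmod(a, b, p):
    """Quotient and remainder of a by a non-zero b over F_p."""
    a = trim([c % p for c in a])
    b = trim([c % p for c in b])
    quot = [0] * max(len(a) - len(b) + 1, 0)
    inv = pow(b[-1], -1, p)
    while len(a) >= len(b):
        c, shift = a[-1] * inv % p, len(a) - len(b)
        quot[shift] = c
        for i, bc in enumerate(b):
            a[shift + i] = (a[shift + i] - c * bc) % p
        trim(a)
    return quot, a

def poly_gcd(a, b, p):
    a, b = trim([c % p for c in a]), trim([c % p for c in b])
    while b:
        a, b = b, poly_divmod(a, b, p)[1]
    if a:
        inv = pow(a[-1], -1, p)
        a = [c * inv % p for c in a]
    return a

def poly_sub(a, b, p):
    n = max(len(a), len(b))
    a, b = a + [0] * (n - len(a)), b + [0] * (n - len(b))
    return trim([(x - y) % p for x, y in zip(a, b)])

def poly_mulmod(a, b, f, p):
    prod = [0] * (len(a) + len(b) - 1) if a and b else []
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            prod[i + j] = (prod[i + j] + x * y) % p
    return poly_divmod(prod, f, p)[1]

def poly_powmod(a, e, f, p):
    """a^e modulo f by repeated squaring."""
    result, a = poly_divmod([1], f, p)[1], poly_divmod(a, f, p)[1]
    while e:
        if e & 1:
            result = poly_mulmod(result, a, f, p)
        a = poly_mulmod(a, a, f, p)
        e >>= 1
    return result

def distinct_degree_factorisation(f, p):
    """Map i to the product of the irreducible factors of f with degree i."""
    F = trim([c % p for c in f])
    x_pow, factors = [0, 1], {}              # x_pow holds x^(p^i) mod F
    for i in range(1, len(F)):
        if len(F) <= 1:                      # F is constant: nothing is left
            break
        x_pow = poly_powmod(x_pow, p, F, p)
        h = poly_gcd(F, poly_sub(x_pow, [0, 1], p), p)
        if len(h) > 1:
            factors[i] = h
            F = poly_divmod(F, h, p)[0]
            x_pow = poly_divmod(x_pow, F, p)[1]
    return factors

# x^4 + x = x (x + 1) (x^2 + x + 1) over F_2
print(distinct_degree_factorisation([0, 1, 0, 0, 1], 2))
```

The example prints `{1: [0, 1, 1], 2: [1, 1, 1]}`, that is, $h_1 = x^2 + x$ and $h_2 = x^2 + x + 1$.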
The algorithm outlined above is suitable for testing whether a polynomial is irreducible, which is one of the important problems that we encounter when constructing finite fields. The algorithm presented here for distinct degree factorisation can solve this problem efficiently. For, it is obvious that a polynomial with degree is irreducible, if, in the factorisation (5.4), we have .
The following algorithm for testing whether a polynomial is irreducible is somewhat more efficient than the one sketched in the previous paragraph and handles correctly also the inputs that are not square-free.
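Irreducibility-Test($f$)
1  $n \leftarrow \deg f$
2  IF $x^{q^n} \not\equiv x \pmod{f}$
3    THEN RETURN “no”
4  FOR the prime divisors $r$ of $n$
5    DO IF $\gcd(x^{q^{n/r}} - x, f) \neq 1$
6      THEN RETURN “no”
7  RETURN “yes”
In line 2, we check whether $f$ divides $x^{q^n} - x$. If $f$ survives this test, then, by Theorem 5.35, we know that $f$ is square-free and the degree of each irreducible factor of $f$ must divide $n$. In line 5, we check, for each prime divisor $r$ of $n$, whether $f$ has an irreducible factor whose degree divides $n/r$. Every proper divisor of $n$ divides $n/r$ for some prime divisor $r$ of $n$, and so $f$ passes both tests if and only if each of its irreducible factors has degree $n = \deg f$, that is, if and only if $f$ is irreducible. The greatest common divisor is needed in line 5: a square-free polynomial of degree 6 with irreducible factors of degrees 1, 2 and 3 divides $x^{q^6} - x$, but it divides neither $x^{q^3} - x$ nor $x^{q^2} - x$. Using at most $\lg n + 1$ fast exponentiations modulo $f$ and at most $\lg n$ greatest common divisor computations, we can thus decide if $f$ is irreducible.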
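As an illustration, here is a Python sketch of an irreducibility test over a prime field. It assumes that $q = p$ is prime and uses a gcd condition for each prime divisor of the degree (this form is often called Rabin's test); polynomials are coefficient lists with the constant term first, and all names are illustrative.

```python
def trim(a):
    while a and a[-1] == 0:
        a.pop()
    return a

def poly_divmod(a, b, p):
    """Quotient and remainder of a by a non-zero b over F_p."""
    a = trim([c % p for c in a])
    b = trim([c % p for c in b])
    quot = [0] * max(len(a) - len(b) + 1, 0)
    inv = pow(b[-1], -1, p)
    while len(a) >= len(b):
        c, shift = a[-1] * inv % p, len(a) - len(b)
        quot[shift] = c
        for i, bc in enumerate(b):
            a[shift + i] = (a[shift + i] - c * bc) % p
        trim(a)
    return quot, a

def poly_gcd(a, b, p):
    a, b = trim([c % p for c in a]), trim([c % p for c in b])
    while b:
        a, b = b, poly_divmod(a, b, p)[1]
    if a:
        inv = pow(a[-1], -1, p)
        a = [c * inv % p for c in a]
    return a

def poly_sub(a, b, p):
    n = max(len(a), len(b))
    a, b = a + [0] * (n - len(a)), b + [0] * (n - len(b))
    return trim([(x - y) % p for x, y in zip(a, b)])

def poly_mulmod(a, b, f, p):
    prod = [0] * (len(a) + len(b) - 1) if a and b else []
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            prod[i + j] = (prod[i + j] + x * y) % p
    return poly_divmod(prod, f, p)[1]

def poly_powmod(a, e, f, p):
    result, a = poly_divmod([1], f, p)[1], poly_divmod(a, f, p)[1]
    while e:
        if e & 1:
            result = poly_mulmod(result, a, f, p)
        a = poly_mulmod(a, a, f, p)
        e >>= 1
    return result

def prime_divisors(n):
    out, d = [], 2
    while d * d <= n:
        if n % d == 0:
            out.append(d)
            while n % d == 0:
                n //= d
        d += 1
    return out + [n] if n > 1 else out

def is_irreducible(f, p):
    f = trim([c % p for c in f])
    n = len(f) - 1
    if n < 1:
        return False
    x = poly_divmod([0, 1], f, p)[1]
    def frobenius(k):                    # x^(p^k) mod f, by k p-th powers
        h = x
        for _ in range(k):
            h = poly_powmod(h, p, f, p)
        return h
    if poly_sub(frobenius(n), x, p):     # f does not divide x^(p^n) - x
        return False
    for r in prime_divisors(n):          # a factor of degree dividing n/r?
        if len(poly_gcd(f, poly_sub(frobenius(n // r), x, p), p)) > 1:
            return False
    return True

print(is_irreducible([1, 1, 0, 0, 1], 2))   # x^4 + x + 1 over F_2
print(is_irreducible([1, 0, 1], 5))         # x^2 + 1 = (x - 2)(x + 2) over F_5
```

The sketch recomputes the powers $x^{p^k}$ from scratch for each $k$, which keeps the code short; an implementation concerned with speed would reuse them.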
Theorem 5.36 If the field $\mathbb{F}_q$ is given and $n > 1$ is an integer, then the field $\mathbb{F}_{q^n}$ can be constructed using a randomised Las Vegas algorithm which runs in time polynomial in $\lg q$ and $n$.
Proof. The algorithm is the following.
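Finite-Field-Construction($q^n$)
1  FOR $i \leftarrow 0$ to $n - 1$
2    DO $a_i \leftarrow$ a random element (uniformly distributed) of $\mathbb{F}_q$
3  $f \leftarrow x^n + \sum_{i=0}^{n-1} a_i x^i$
4  IF Irreducibility-Test($f$) = “yes”
5    THEN RETURN $\mathbb{F}_q[x]/(f)$
6    ELSE RETURN “fail”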
In lines 1–3, we choose a uniformly distributed random polynomial $f(x)$ with leading coefficient $1$ and degree $n$. Then, in line 4, we efficiently check if $f(x)$ is irreducible. By Theorem 5.31, the polynomial $f(x)$ is irreducible with a reasonably high probability.
In this section we consider the special case of the factorisation problem in which is odd and the polynomial is of the form
where the are pairwise relatively prime irreducible polynomials in with the same degree , and we also assume that . Our motivation for investigating this special case is that a square-free distinct degree factorisation reduces the general factorisation problem to such a simpler problem. If is even, then Berlekamp's method, presented in Subsection 5.3.4, gives a deterministic polynomial time solution. There is a variation of the method discussed in the present section that works also for even ; see Problem 5-2.
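Lemma 5.37 Suppose that $q$ is odd. Then there are $(q^2 - 1)/2$ pairs $(c_1, c_2) \in \mathbb{F}_q \times \mathbb{F}_q$ such that exactly one of $c_1^{(q-1)/2}$ and $c_2^{(q-1)/2}$ is equal to $1$.
Proof. Suppose that $a$ is a primitive element in $\mathbb{F}_q$; that is, $a^{q-1} = 1$, but $a^k \neq 1$ for $0 < k < q - 1$. Then $\mathbb{F}_q \setminus \{0\} = \{a^s \mid s = 0, \ldots, q - 2\}$, and further, as $\bigl(a^{(q-1)/2}\bigr)^2 = 1$, but $a^{(q-1)/2} \neq 1$, we obtain that $a^{(q-1)/2} = -1$. Therefore $(a^s)^{(q-1)/2} = (-1)^s$, and so half of the elements $c \in \mathbb{F}_q \setminus \{0\}$ give $c^{(q-1)/2} = 1$, while the other half give $c^{(q-1)/2} = -1$. If $c = 0$ then clearly $c^{(q-1)/2} = 0$. Thus there are $\frac{q-1}{2} \cdot \frac{q+1}{2}$ pairs $(c_1, c_2)$ such that $c_1^{(q-1)/2} = 1$, but $c_2^{(q-1)/2} \neq 1$, and, obviously, we have the same number of pairs for which the converse is valid. Thus the number of pairs that satisfy the condition is $(q - 1)(q + 1)/2 = (q^2 - 1)/2$.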
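Theorem 5.38 Suppose that $q$ is odd and the polynomial $f(x) \in \mathbb{F}_q[x]$ is of the form (5.5) and has degree $n$. Choose a uniformly distributed random polynomial $u(x) \in \mathbb{F}_q[x]$ with degree less than $n$. (That is, choose pairwise independent, uniformly distributed scalars $u_0, \ldots, u_{n-1}$, and consider the polynomial $u(x) = \sum_{i=0}^{n-1} u_i x^i$.) Then, with probability at least $(q^{2d} - 1)/(2q^{2d}) \geq 4/9$, the greatest common divisor
$\gcd\bigl(u(x)^{(q^d - 1)/2} - 1,\ f(x)\bigr)$ is a proper divisor of $f(x)$.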
Proof. The element corresponds to an element of the residue class field . By the Chinese remainder theorem (Theorem 5.15), choosing the polynomial uniformly implies that the residues of modulo the factors are independent and uniformly distributed random polynomials. By Lemma 5.37, the probability that exactly one of the residues of the polynomial modulo and is zero is precisely . In this case the greatest common divisor in the theorem is indeed a divisor of . For, if , but this congruence is not valid modulo , then the polynomial is divisible by the factor , but not divisible by , and so its greatest common divisor with is a proper divisor of . The function
$\frac{q^{2d} - 1}{2q^{2d}} = \frac{1}{2} - \frac{1}{2q^{2d}}$ is strictly increasing in $q^d$, and it takes its smallest possible value if $q^d$ is the smallest odd prime-power, namely 3. The minimum is, thus, $1/2 - 1/18 = 4/9$.
The previous theorem suggests the following randomised Las Vegas polynomial time algorithm for factoring a polynomial of the form (5.5) to a product of two factors.
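Cantor-Zassenhaus-Odd($f, d$)
1  $n \leftarrow \deg f$
2  FOR $i \leftarrow 0$ to $n - 1$
3    DO $u_i \leftarrow$ a random element (uniformly distributed) of $\mathbb{F}_q$
4  $u \leftarrow \sum_{i=0}^{n-1} u_i x^i$
5  $g \leftarrow \gcd(u^{(q^d - 1)/2} - 1, f)$
6  IF $0 < \deg g < \deg f$
7    THEN RETURN $(g, f/g)$
8    ELSE RETURN “fail”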
If one of the polynomials in the output is not irreducible, then, as it is of the form (5.5), it can be fed, as input, back into the algorithm. This way we obtain a polynomial time randomised algorithm for factoring .
In the computation of the greatest common divisor, the residue $u(x)^{(q^d - 1)/2} \pmod{f(x)}$ should be computed using fast exponentiation.
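One attempt of the splitting step can be sketched in Python as follows. The sketch assumes that $q = p$ is an odd prime and that $f$ is a monic, square-free product of irreducible polynomials of the same known degree $d$; polynomials are coefficient lists with the constant term first, and the function names and the `rng` parameter are illustrative.

```python
import random

def trim(a):
    while a and a[-1] == 0:
        a.pop()
    return a

def poly_divmod(a, b, p):
    """Quotient and remainder of a by a non-zero b over F_p."""
    a = trim([c % p for c in a])
    b = trim([c % p for c in b])
    quot = [0] * max(len(a) - len(b) + 1, 0)
    inv = pow(b[-1], -1, p)
    while len(a) >= len(b):
        c, shift = a[-1] * inv % p, len(a) - len(b)
        quot[shift] = c
        for i, bc in enumerate(b):
            a[shift + i] = (a[shift + i] - c * bc) % p
        trim(a)
    return quot, a

def poly_gcd(a, b, p):
    a, b = trim([c % p for c in a]), trim([c % p for c in b])
    while b:
        a, b = b, poly_divmod(a, b, p)[1]
    if a:
        inv = pow(a[-1], -1, p)
        a = [c * inv % p for c in a]
    return a

def poly_sub(a, b, p):
    n = max(len(a), len(b))
    a, b = a + [0] * (n - len(a)), b + [0] * (n - len(b))
    return trim([(x - y) % p for x, y in zip(a, b)])

def poly_mulmod(a, b, f, p):
    prod = [0] * (len(a) + len(b) - 1) if a and b else []
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            prod[i + j] = (prod[i + j] + x * y) % p
    return poly_divmod(prod, f, p)[1]

def poly_powmod(a, e, f, p):
    """a^e modulo f by repeated squaring."""
    result, a = poly_divmod([1], f, p)[1], poly_divmod(a, f, p)[1]
    while e:
        if e & 1:
            result = poly_mulmod(result, a, f, p)
        a = poly_mulmod(a, a, f, p)
        e >>= 1
    return result

def cantor_zassenhaus_try(f, d, p, rng=random):
    """One random attempt; returns (g, f/g) or None for "fail"."""
    n = len(f) - 1
    u = [rng.randrange(p) for _ in range(n)]         # random u, deg u < n
    w = poly_powmod(u, (p ** d - 1) // 2, f, p)      # u^((p^d - 1)/2) mod f
    g = poly_gcd(f, poly_sub(w, [1], p), p)
    if 0 < len(g) - 1 < n:
        return g, poly_divmod(f, g, p)[0]
    return None

# (x^2 + 1)(x^2 + x + 2) = x^4 + x^3 + x + 2 over F_3, with d = 2
f = [2, 1, 0, 1, 1]
result = None
while result is None:            # repeat the Las Vegas step until it succeeds
    result = cantor_zassenhaus_try(f, 2, 3)
print(result)
```

Since each attempt succeeds with probability at least 4/9, the loop in the example is expected to terminate after a few iterations.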
Now we can conclude that the general factorisation problem (5.3) over a field with odd order can be solved using a randomised polynomial time algorithm.
Here we will describe an algorithm that reduces the problem of factoring polynomials to the problem of searching through the underlying field or its prime field. We assume that
where the are pairwise non-associate, irreducible polynomials in , and also that . The Chinese remainder theorem (Theorem 5.15) gives an isomorphism between the rings and
The isomorphism is given by the following map:
where . The most important technical tools in Berlekamp's algorithm are the -th and -th power maps in the residue class ring . Taking -th and -th powers on both sides of the isomorphism above given by the Chinese remainder theorem, we obtain the following maps:
The Berlekamp subalgebra of the polynomial is the subring of the residue class ring consisting of the fixed points of the -th power map. Further, the absolute Berlekamp subalgebra of consists of the fixed points of the -th power map. In symbols,
It is easy to see that $\mathcal{A}_p \subseteq \mathcal{A}$. The term subalgebra is used here, because both types of Berlekamp subalgebras are subrings in the residue class ring $\mathbb{F}_q[x]/(f)$ (that is, they are closed under addition and multiplication modulo $f$), and, in addition, $\mathcal{A}$ is also a linear subspace over $\mathbb{F}_q$, that is, it is closed under multiplication by the elements of $\mathbb{F}_q$. The absolute Berlekamp subalgebra $\mathcal{A}_p$ is only closed under multiplication by the elements of the prime field $\mathbb{F}_p$.
The Berlekamp subalgebra is a subspace, as the map is an -linear map of into itself, by Lemma 5.23 and Theorem 5.19. Hence a basis of can be computed as a solution of a homogeneous system of linear equations over , as follows.
For all , compute the polynomial with degree at most that satisfies . For each , such a polynomial can be determined by fast exponentiation using multiplications of polynomials and divisions with remainder. Set . The class of a polynomial with degree less than lies in the Berlekamp subalgebra if and only if
which, considering the coefficient of for , leads to the following system of homogeneous linear equations in variables:
Similarly, computing a basis of the absolute Berlekamp subalgebra over can be carried out by solving a system of homogeneous linear equations in variables over the prime field , as follows. We represent the elements of in the usual way, namely using polynomials with degree less than in . We perform the operations modulo , where is an irreducible polynomial with degree over the prime field . Then the polynomial of degree less than can be written in the form
where . Let, for all and for all , be the unique polynomial with degree at most for which . The polynomial is of the form . The criterion for being a member of the absolute Berlekamp subalgebra of with is
which, considering the coefficients of the monomials , is equivalent to the following system of equations:
This is indeed a homogeneous system of linear equations in the variables . Systems of linear equations over fields can be solved in polynomial time (see Section 31.4), the operations in the ring can be performed in polynomial time, and the fast exponentiation also runs in polynomial time. Thus the following theorem is valid.
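Theorem 5.39 Let $f \in \mathbb{F}_q[x]$. Then it is possible to compute the Berlekamp subalgebras $\mathcal{A} \leq \mathbb{F}_q[x]/(f)$ and $\mathcal{A}_p \leq \mathbb{F}_q[x]/(f)$, in the sense that an $\mathbb{F}_q$-basis of $\mathcal{A}$ and an $\mathbb{F}_p$-basis of $\mathcal{A}_p$ can be obtained, using polynomial time deterministic algorithms.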
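The linear algebra behind this computation can be sketched in Python. The sketch assumes that $q = p$ is prime, in which case the two Berlekamp subalgebras coincide; polynomials are coefficient lists with the constant term first, and the function names are illustrative. The rows of the matrix hold the residues of $x^{ip}$ modulo $f$, and the fixed points of the $p$-th power map are obtained as the solutions of a homogeneous linear system.

```python
def trim(a):
    while a and a[-1] == 0:
        a.pop()
    return a

def poly_rem(a, f, p):
    """Remainder of a modulo a non-zero f over F_p."""
    a, f = trim([c % p for c in a]), trim([c % p for c in f])
    inv = pow(f[-1], -1, p)
    while len(a) >= len(f):
        c, shift = a[-1] * inv % p, len(a) - len(f)
        for i, fc in enumerate(f):
            a[shift + i] = (a[shift + i] - c * fc) % p
        trim(a)
    return a

def poly_mulmod(a, b, f, p):
    prod = [0] * (len(a) + len(b) - 1) if a and b else []
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            prod[i + j] = (prod[i + j] + x * y) % p
    return poly_rem(prod, f, p)

def poly_powmod(a, e, f, p):
    result, a = poly_rem([1], f, p), poly_rem(a, f, p)
    while e:
        if e & 1:
            result = poly_mulmod(result, a, f, p)
        a = poly_mulmod(a, a, f, p)
        e >>= 1
    return result

def nullspace_mod_p(M, p):
    """Basis of the solutions v of M v = 0 over F_p (Gauss-Jordan)."""
    M = [row[:] for row in M]
    rows, cols, pivots, r = len(M), len(M[0]), {}, 0
    for c in range(cols):
        piv = next((i for i in range(r, rows) if M[i][c]), None)
        if piv is None:
            continue
        M[r], M[piv] = M[piv], M[r]
        inv = pow(M[r][c], -1, p)
        M[r] = [v * inv % p for v in M[r]]
        for i in range(rows):
            if i != r and M[i][c]:
                k = M[i][c]
                M[i] = [(a - k * b) % p for a, b in zip(M[i], M[r])]
        pivots[c] = r
        r += 1
    basis = []
    for free in (c for c in range(cols) if c not in pivots):
        v = [0] * cols
        v[free] = 1
        for c, row in pivots.items():
            v[c] = -M[row][free] % p
        basis.append(v)
    return basis

def berlekamp_basis(f, p):
    """Coefficient vectors g (degree < n) with g^p = g modulo f."""
    n = len(f) - 1
    Q = []
    for i in range(n):                       # row i: x^(i p) mod f
        h = poly_powmod([0, 1], i * p, f, p)
        Q.append(h + [0] * (n - len(h)))
    # g is fixed iff sum_i g_i Q[i][j] = g_j for all j, i.e. (Q - I)^T g = 0
    M = [[(Q[i][j] - (i == j)) % p for i in range(n)] for j in range(n)]
    return nullspace_mod_p(M, p)

# f = (x^2 + 1)(x^2 + x + 2) = x^4 + x^3 + x + 2 over F_3
print(berlekamp_basis([2, 1, 0, 1, 1], 3))
```

For the example the basis consists of the constant polynomial $1$ and the polynomial $x^3 + x$, so the subalgebra has dimension 2, in accordance with the two irreducible factors of $f$.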
and
The following theorem shows that the elements of the Berlekamp subalgebra can be characterised by their Chinese remainders.
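Theorem 5.40 Suppose that $f = f_1^{e_1} \cdots f_s^{e_s}$, where the $f_i$ are pairwise non-associate irreducible polynomials in $\mathbb{F}_q[x]$, and let $g \in \mathbb{F}_q[x]$. Then the residue class of $g$ modulo $f$ lies in $\mathcal{A}$ if and only if there exist elements $c_1, \ldots, c_s \in \mathbb{F}_q$ such that $g \equiv c_i \pmod{f_i^{e_i}}$ for all $i \in \{1, \ldots, s\}$, and it lies in $\mathcal{A}_p$ if and only if such elements $c_i$ can be chosen from $\mathbb{F}_p$.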
Proof. Using the Chinese remainder theorem, and equations (5.8), (5.9), we are only required to prove that
and
where is an irreducible polynomial, is an arbitrary polynomial and is a positive integer. In both of the cases, the direction is a simple consequence of Theorem 5.19. As , the implication concerning the absolute Berlekamp subalgebra follows from that concerning the Berlekamp subalgebra, and so it suffices to consider the latter.
The residue class ring is a field, and so the polynomial has at most roots in . However, we already obtain distinct roots from Theorem 5.19, namely the elements of (the constant polynomials modulo ). Thus
Hence, if , then is of the form where . Let be an arbitrary positive integer. Then
If we choose large enough so that holds, then, by the congruence above, also holds.
An element of or is said to be non-trivial if there is no element such that . By the previous theorem and the Chinese remainder theorem, this holds if and only if there are such that . Clearly a necessary condition is that , that is, must have at least two irreducible factors.
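Lemma 5.41 Let $[g(x)]_f$ be a non-trivial element of the Berlekamp subalgebra $\mathcal{A}$. Then there is an element $c \in \mathbb{F}_q$ such that the polynomial $\gcd(g(x) - c, f(x))$ is a proper divisor of $f(x)$. If $[g(x)]_f \in \mathcal{A}_p$, then there exists such an element $c$ in the prime field $\mathbb{F}_p$.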
Proof. Let and be integers such that , , and . Then, choosing , the polynomial is divisible by , but not divisible by . If, in addition, , then also .
Assume that we have a basis of $\mathcal{A}_p$ at hand. At most one of the basis elements can be trivial, as a trivial element is a scalar multiple of 1. If $f$ is not a power of an irreducible polynomial, then there will surely be a non-trivial basis element $[g(x)]_f$, and so, using the idea in the previous lemma, $f$ can be factored into two factors.
Theorem 5.42 A polynomial $f(x) \in \mathbb{F}_q[x]$ can be factored with a deterministic algorithm whose running time is polynomial in $p$, $\deg f$, and $\lg q$.
Proof. It suffices to show that can be factored to two factors within the given time bound. The method can then be repeated.
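Berlekamp-Deterministic($f$)
1  $S \leftarrow$ a basis of $\mathcal{A}_p$
2  IF $|S| > 1$
3    THEN $g \leftarrow$ a non-trivial element of $S$
4      FOR $c \in \mathbb{F}_p$
5        DO $h \leftarrow \gcd(g - c, f)$
6          IF $0 < \deg h < \deg f$
7            THEN RETURN $(h, f/h)$
8    ELSE RETURN “a power of an irreducible”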
In the first stage, in line 1, we determine a basis of the absolute Berlekamp subalgebra. The cost of this is polynomial in $\deg f$ and $\lg q$. In the second stage (lines 2–8), after taking a non-trivial basis element $[g(x)]_f$, we compute the greatest common divisors $\gcd(g(x) - c, f(x))$ for all $c \in \mathbb{F}_p$. The cost of this is polynomial in $p$ and $\deg f$.
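The second stage can be sketched in Python as follows. The sketch assumes that $q = p$ is prime and that a polynomial $g$ representing a non-trivial element of the Berlekamp subalgebra is already at hand (for instance from a basis computation); polynomials are coefficient lists with the constant term first, and the function names are illustrative.

```python
def trim(a):
    while a and a[-1] == 0:
        a.pop()
    return a

def poly_divmod(a, b, p):
    """Quotient and remainder of a by a non-zero b over F_p."""
    a = trim([c % p for c in a])
    b = trim([c % p for c in b])
    quot = [0] * max(len(a) - len(b) + 1, 0)
    inv = pow(b[-1], -1, p)
    while len(a) >= len(b):
        c, shift = a[-1] * inv % p, len(a) - len(b)
        quot[shift] = c
        for i, bc in enumerate(b):
            a[shift + i] = (a[shift + i] - c * bc) % p
        trim(a)
    return quot, a

def poly_gcd(a, b, p):
    a, b = trim([c % p for c in a]), trim([c % p for c in b])
    while b:
        a, b = b, poly_divmod(a, b, p)[1]
    if a:
        inv = pow(a[-1], -1, p)
        a = [c * inv % p for c in a]
    return a

def poly_sub(a, b, p):
    n = max(len(a), len(b))
    a, b = a + [0] * (n - len(a)), b + [0] * (n - len(b))
    return trim([(x - y) % p for x, y in zip(a, b)])

def berlekamp_split(f, g, p):
    """Search the prime field for c such that gcd(g - c, f) splits f."""
    n = len(f) - 1
    for c in range(p):
        h = poly_gcd(f, poly_sub(g, [c], p), p)
        if 0 < len(h) - 1 < n:
            return h, poly_divmod(f, h, p)[0]
    return None                      # g was trivial after all

# f = x^4 + x^3 + x + 2 over F_3; g = 2x^3 + 2x satisfies g^3 = g modulo f
print(berlekamp_split([2, 1, 0, 1, 1], [0, 2, 0, 2], 3))
```

In the example the very first candidate $c = 0$ already gives the proper divisor $x^2 + 1$, with cofactor $x^2 + x + 2$. In general up to $p$ greatest common divisors are computed, which is why the running time depends on $p$ rather than on $\lg p$.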
If there is no non-trivial basis-element, then is 1-dimensional and is the -th power of the irreducible polynomial where and can, for instance, be determined using the ideas presented in Section 5.3.1.
The time bound in the previous theorem is not polynomial in the input size, as it contains $p$ instead of $\lg p$. However, if $p$ is small compared to the other parameters (for instance in coding theory we often have $p = 2$), then the running time of the algorithm will be polynomial in the input size.
Corollary 5.43 Suppose that $p$ can be bounded by a polynomial function of $\deg f$ and $\lg q$. Then the irreducible factorisation of $f$ can be obtained in polynomial time.
The previous two results are due to E. R. Berlekamp. The most important open problem in the area discussed here is the existence of a deterministic polynomial time method for factoring polynomials. The question is mostly of theoretical interest, since the randomised polynomial time methods, such as the Cantor-Zassenhaus algorithm, are very efficient in practice.
We can obtain a good randomised algorithm using Berlekamp subalgebras. Suppose that is odd, and, as before, is the polynomial to be factored.
Let $u$ be a random element in the Berlekamp subalgebra $\mathcal{A}$. An argument similar to the one in the analysis of the Cantor-Zassenhaus algorithm shows that, provided $f(x)$ has at least two irreducible factors, the greatest common divisor $\gcd(u(x)^{(q-1)/2} - 1, f(x))$ is a proper divisor of $f(x)$ with probability at least 4/9. Now we present a variation of this idea that uses fewer random bits: instead of choosing a random element from $\mathcal{A}$, we only choose a random element from $\mathbb{F}_q$.
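Lemma 5.44 Suppose that $q$ is odd and let $a_1$ and $a_2$ be two distinct elements of $\mathbb{F}_q$. Then there are at least $(q - 1)/2$ elements $b \in \mathbb{F}_q$ such that exactly one of the elements $(a_1 - b)^{(q-1)/2}$ and $(a_2 - b)^{(q-1)/2}$ is $1$.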
Proof. Using the argument at the beginning of the proof of Lemma 5.37, one can easily see that there are elements in the set whose -th power is . It is also quite easy to check, for a given element , that there is a unique such that . Indeed, the required is the solution of a linear equation.
By the above, there are elements such that
For such a , one of the elements and is equal to and the other is equal to .
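Theorem 5.45 Suppose that $q$ is odd and the polynomial $f(x) \in \mathbb{F}_q[x]$ has at least two irreducible factors in $\mathbb{F}_q[x]$. Let $u(x)$ be a non-trivial element in the Berlekamp subalgebra $\mathcal{A}$. If we choose a uniformly distributed random element $c \in \mathbb{F}_q$, then, with probability at least $(q - 1)/(2q) \geq 1/3$, the greatest common divisor $\gcd\bigl((u(x) - c)^{(q-1)/2} - 1,\ f(x)\bigr)$ is a proper divisor of the polynomial $f(x)$.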
Proof. Let , where the factors are pairwise distinct irreducible polynomials. The element is a non-trivial element of the Berlekamp subalgebra, and so there are indices and elements such that and . Using Lemma 5.44 with and , we find, for a random element , that the probability that exactly one of the elements and is zero is at least . If, for instance, , but , then but , that is, the polynomial is divisible by , but not divisible by . Thus the greatest common divisor is a proper divisor of .
The quantity $(q - 1)/(2q) = 1/2 - 1/(2q)$ is a strictly increasing function in $q$, and so it takes its smallest value for the smallest odd prime-power, namely 3. The minimum is 1/3.
The previous theorem gives the following algorithm for factoring a polynomial to two factors.
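Berlekamp-Randomised($f$)
1  $S \leftarrow$ a basis of $\mathcal{A}$
2  IF $|S| > 1$
3    THEN $u \leftarrow$ a non-trivial element of $S$
4      $c \leftarrow$ a random element (uniformly distributed) of $\mathbb{F}_q$
5      $g \leftarrow \gcd((u - c)^{(q-1)/2} - 1, f)$
6      IF $0 < \deg g < \deg f$
7        THEN RETURN $(g, f/g)$
8        ELSE RETURN “fail”
9    ELSE RETURN “a power of an irreducible”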
Exercises
5.3-1 Let be an irreducible polynomial, and let be an element of the field . Give a polynomial time algorithm for computing .
Hint. Use the result of Exercise 5.1-6.
5.3-2 Let . Using the Distinct-Degree-Factorisation algorithm, determine the factorisation (5.4) of .
5.3-3 Follow the steps of the Cantor-Zassenhaus algorithm to factor the polynomial .
5.3-4 Let . Show that coincides with the absolute Berlekamp subalgebra of , that is, .
5.3-5 Let . Using Berlekamp's algorithm, determine the irreducible factors of : first find a non-trivial element in the Berlekamp subalgebra , then use it to factor .
Our aim in the rest of this chapter is to present the Lenstra-Lenstra-Lovász algorithm for factoring polynomials with rational coefficients. First we study a geometric problem, which is interesting also in its own right, namely finding short lattice vectors. Finding a shortest non-zero lattice vector is hard: by a result of Ajtai, if this problem could be solved in polynomial time with a randomised algorithm, then so could all the problems in the complexity class $NP$. For a lattice with dimension $n$, the lattice reduction method presented in this chapter outputs, in polynomial time, a lattice vector whose length is not greater than $2^{(n-1)/2}$ times the length of a shortest non-zero lattice vector.
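First, we recall a couple of concepts related to real vector spaces. Let $\mathbb{R}^n$ denote the collection of real vectors of length $n$. It is routine to check that $\mathbb{R}^n$ is a vector space over the field $\mathbb{R}$. The scalar product of two vectors $u = (u_1, \ldots, u_n)$ and $v = (v_1, \ldots, v_n)$ in $\mathbb{R}^n$ is defined as the number $(u, v) = u_1 v_1 + u_2 v_2 + \cdots + u_n v_n$. The quantity $|u| = \sqrt{(u, u)}$ is called the length of the vector $u$. The vectors $u$ and $v$ are said to be orthogonal if $(u, v) = 0$. A basis $b_1, \ldots, b_n$ of the space $\mathbb{R}^n$ is said to be orthonormal, if, for all $i$, $(b_i, b_i) = 1$ and, for all $i$ and $j$ such that $i \neq j$, we have $(b_i, b_j) = 0$.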
The rank and the determinant of a real matrix, and definite matrices are discussed in Section 31.1.
Definition 5.46 A set $L \subseteq \mathbb{R}^n$ is said to be a lattice, if $L$ is a subgroup with respect to addition, and $L$ is discrete, in the sense that each bounded region of $\mathbb{R}^n$ contains only finitely many points of $L$. The rank of the lattice $L$ is the dimension of the subspace generated by $L$. Clearly, the rank of $L$ coincides with the cardinality of a maximal linearly independent subset of $L$. If $L$ has rank $n$, then $L$ is said to be a full lattice. The elements of $L$ are called lattice vectors or lattice points.
Definition 5.47 Let be linearly independent elements of a lattice . If all the elements of can be written as linear combinations of the elements with integer coefficients, then the collection is said to be a basis of .
In this case, as the vectors are linearly independent, all vectors of can uniquely be written as real linear combinations of .
By the following theorem, the lattices are precisely those additive subgroups of that have bases.
Theorem 5.48 Let be linearly independent vectors in and let be the set of integer linear combinations of . Then is a lattice and the vectors form a basis of . Conversely, if is a lattice in , then it has a basis.
Obviously, is a subgroup, that is, it is closed under addition and subtraction. In order to show that it is discrete, let us assume that . This assumption means no loss of generality, as the subspace spanned by is isomorphic to . In this case, is an invertible linear map of onto itself. Consequently, both and are continuous. Hence the image of a discrete set under is also discrete. As , it suffices to show that is discrete in . This, however, is obvious: if is a bounded region in , then there is a positive integer , such that the absolute value of each of the coordinates of the elements of is at most . Thus has at most elements in .
The second assertion is proved by induction on . If , then we have nothing to prove. Otherwise, by discreteness, there is a shortest non-zero vector, say, in . We claim that the vectors of that lie on the line are exactly the integer multiples of . Indeed, suppose that is a real number and consider the vector . As usual, denotes the fractional part of . Then , yet , that is, is the difference of two vectors of , and so is itself in . This, however, contradicts the fact that was a shortest non-zero vector in . Thus our claim holds.
The claim verified in the previous paragraph shows that the theorem is valid when . Let us, hence, assume that . We may write an element of as the sum of two vectors, one of them is parallel to and the other one is orthogonal to :
Simple computation shows that , and the map is linear. Let . We show that is a lattice in the subspace, or hyperplane, formed by the vectors orthogonal to . The map is linear, and so is closed under addition and subtraction. In order to show that it is discrete, let be a bounded region in . We are required to show that only finitely many points of are in . Let be a vector such that . Let be the integer that is closest to the number and let . Clearly, and . Further, we also have that , and so the vector lies in the bounded region . However, there are only finitely many vectors in this latter region, and so also has only finitely many lattice vectors .
We have, thus, shown that is a lattice in , and, by the induction hypothesis, it has a basis. Let be lattice vectors such that the vectors form a basis of the lattice . Then, for an arbitrary lattice vector , the vector can be written in the form where the coefficients are integers. Then and, as the map is linear, we have . This, however, implies that is a lattice vector on the line , and so with some integer . Therefore , that is, is an integer linear combination of the vectors . Thus the vectors form a basis of .
A lattice is always full in the linear subspace spanned by . Thus, without loss of generality, we will consider only full lattices, and, in the sequel, by a lattice we will always mean a full lattice.
Example 5.4 Two familiar lattices in $\mathbb{R}^2$:
1. The square lattice is the lattice in $\mathbb{R}^2$ with basis $b_1 = (1, 0)$, $b_2 = (0, 1)$.
2. The triangular lattice is the lattice with basis $b_1 = (1, 0)$, $b_2 = (1/2, \sqrt{3}/2)$.
The following simple fact will often be used.
Lemma 5.49 Let be a lattice in , and let be a basis of . If we reorder the basis vectors , or if we add to a basis vector an integer linear combination of the other basis vectors, then the collection so obtained will also form a basis of .
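Let $b_1, \ldots, b_n$ be a basis in $\mathbb{R}^n$. The Gram matrix of $b_1, \ldots, b_n$ is the matrix $B = (B_{ij})$ with entries $B_{ij} = (b_i, b_j)$. The matrix $B$ is positive definite, since it is of the form $A^T A$ where $A$ is a full-rank matrix (see Theorem 31.6). Consequently, $\det B$ is a positive real number.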
Lemma 5.50 Let and be bases of a lattice and let and be the matrices and . Then the determinants of and coincide.
For all , the vector is of the form where the are integers. Let be the matrix with entries . Then, as
we have , and so . The number is a non-negative integer, since the entries of are integers. Swapping the two bases, the same argument shows that is also a non-negative integer. This can only happen if .
Definition 5.51 (The determinant of a lattice) The determinant of a lattice $L$ is $\det L = \sqrt{\det B}$ where $B$ is the Gram matrix of a basis of $L$.
By the previous lemma, is independent of the choice of the basis. The quantity has a geometric meaning, as is the volume of the solid body, the so-called parallelepiped, formed by the vectors .
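Remark 5.52 Assume that the coordinates of the vectors $b_i$ in an orthonormal basis of $\mathbb{R}^n$ are $\alpha_{i1}, \ldots, \alpha_{in}$ ($1 \leq i \leq n$). Then the Gram matrix $B$ of the vectors $b_1, \ldots, b_n$ is $B = AA^T$ where $A$ is the matrix $A = (\alpha_{ij})$. Consequently, if $b_1, \ldots, b_n$ is a basis of a lattice $L$, then $\det L = |\det A|$.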
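That the determinant does not depend on the chosen basis can be checked on a small example. The following Python sketch uses exact rational arithmetic; the two bases generate the same lattice, as the second arises from the first by a swap and by adding an integer multiple of one basis vector to the other. The function names are illustrative.

```python
from fractions import Fraction

def gram(basis):
    """Gram matrix with entries (b_i, b_j)."""
    return [[sum(x * y for x, y in zip(u, v)) for v in basis] for u in basis]

def det(M):
    """Determinant by Gaussian elimination over the rationals."""
    M = [[Fraction(x) for x in row] for row in M]
    n, d = len(M), Fraction(1)
    for c in range(n):
        piv = next((r for r in range(c, n) if M[r][c] != 0), None)
        if piv is None:
            return Fraction(0)
        if piv != c:
            M[c], M[piv] = M[piv], M[c]
            d = -d                           # a row swap changes the sign
        d *= M[c][c]
        for r in range(c + 1, n):
            k = M[r][c] / M[c][c]
            M[r] = [a - k * b for a, b in zip(M[r], M[c])]
    return d

basis_1 = [(2, 0), (1, 3)]
basis_2 = [(1, 3), (4, 6)]                   # (4, 6) = (2, 0) + 2 * (1, 3)
print(det(gram(basis_1)), det(gram(basis_2)))   # 36 36
print(det(basis_1) ** 2)                        # 36, the square of det A
```

Both Gram determinants equal 36, so the determinant of the lattice is 6, which is also the absolute value of the determinant of the coordinate matrix.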
We will need a fundamental result in convex geometry. In order to prepare for this, we introduce some simple notation. Let . The set is said to be centrally symmetric, if implies . The set is convex, if implies for all .
Theorem 5.53 (Minkowski's Convex Body Theorem) Let $L$ be a lattice in $\mathbb{R}^n$ and let $K \subseteq \mathbb{R}^n$ be a centrally symmetric, bounded, closed, convex set. Suppose that the volume of $K$ is at least $2^n \det L$. Then $K \cap L \neq \{0\}$.
Proof. By the conditions, the volume of the set is at least . Let be a basis of the lattice and let be the corresponding half-open parallelepiped. Then each of the vectors in can be written uniquely in the form where and . For an arbitrary lattice vector , we let
As the sets and are bounded, so is the set
As is discrete, only has finitely many points in ; that is, , except for finitely many . Hence is a finite set, and, moreover, the set is the disjoint union of the sets (). Therefore, the total volume of these sets is at least . For a given , we set . Consider the closure and of the sets and , respectively:
and . The total volume of the closed sets is at least as large as the volume of the set , and so these sets cannot be disjoint: there are and such that , that is, and . As is centrally symmetric, we find that . As is convex, we also have . Hence . On the other hand, the difference of two lattice points lies in .
Minkowski's theorem is sharp. For, let be an arbitrary positive number, and let be the lattice of points with integer coordinates in . Let be the set of vectors for which (). Then is bounded, closed, convex, centrally symmetric with respect to the origin, its volume is , yet .
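Corollary 5.54 Let $L$ be a lattice in $\mathbb{R}^n$. Then $L$ has a non-zero lattice vector whose length is at most $\sqrt{n}\,\sqrt[n]{\det L}$.
Proof. Let $K$ be the following centrally symmetric cube with side length $s = 2\sqrt[n]{\det L}$: $K = \{(v_1, \ldots, v_n) \in \mathbb{R}^n \mid -s/2 \leq v_i \leq s/2 \text{ for } 1 \leq i \leq n\}$.
The volume of the cube is exactly $2^n \det L$, and so it contains a non-zero lattice vector. However, the vectors in $K$ have length at most $\sqrt{n}\,\sqrt[n]{\det L}$.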
We remark that, for , we can find an even shorter lattice vector, if we replace the cube in the proof of the previous assertion by a suitable ball.
Our goal is to design an algorithm that finds a non-zero short vector in a given lattice. In this section we consider this problem for two-dimensional lattices, which is the simplest non-trivial case. Then there is an elegant, instructive, and efficient algorithm that finds short lattice vectors. This algorithm also serves as a basis for the higher-dimensional cases. Let be a lattice with basis in .
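Gauss($b_1, b_2$)
1  $(a, b) \leftarrow (b_1, b_2)$
2  FOREVER
3    DO $b \leftarrow$ the shortest lattice vector on the line $b - \lambda a$
4      IF $|b| < |a|$
5        THEN $b \leftrightarrow a$
6        ELSE RETURN $(a, b)$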
In order to analyse the procedure, the following facts will be useful.
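Lemma 5.55 Suppose that $a$ and $b$ are two linearly independent vectors in the plane $\mathbb{R}^2$, and let $L$ be the lattice generated by them. The vector $b$ is a shortest non-zero vector of $L$ on the line $\{b - \lambda a \mid \lambda \in \mathbb{R}\}$ if and only if $|(b, a)| \leq \frac{1}{2}(a, a)$. (5.10)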
Proof. We write as the sum of a vector parallel to and a vector orthogonal to :
Then, as the vectors and are orthogonal,
This quantity takes its smallest value for the integer that is the closest to the number . Hence gives the minimal value if and only if (5.10) holds.
Lemma 5.56 Suppose that the linearly independent vectors and form a basis for a lattice and that inequality (5.10) holds. Assume, further, that
Write , as in (5.11), as the sum of the vector , which is parallel to , and the vector , which is orthogonal to . Then
Further, either or is a shortest non-zero vector in .
Rearranging the last displayed line, we obtain .
The length of a vector can be computed as
which implies whenever . If and , then . Similarly, and gives . It remains to consider the case when and . As , we may assume that . In this case, however, is of the form (), and, by Lemma 5.55, the vector is a shortest lattice vector on this line.
Theorem 5.57 Let be a shortest non-zero lattice vector in . Then Gauss' algorithm terminates after iterations, and the resulting vector is a shortest non-zero vector in .
Proof. First we verify that, during the course of the algorithm, the vectors and will always form a basis for the lattice . If, in line 3, we replace by a vector of the form , then, as , the pair remains a basis of . The swap in line 5 only concerns the order of the basis vectors. Thus and is always a basis of , as we claimed.
By Lemma 5.55, inequality (5.10) holds after the first step (line 3) in the loop, and so we may apply Lemma 5.56 to the scenario before lines 4–5. This shows that if none of and is shortest, then . Thus, except perhaps for the last execution of the loop, after each swap in line 5, the length of is decreased by a factor of at least . Thus we obtain the bound for the number of executions of the loop. Lemma 5.56 implies also that the vector at the end is a shortest non-zero vector in .
Gauss' algorithm gives an efficient polynomial time method for computing a shortest vector in the lattice . The analysis of the algorithm gives the following interesting theoretical consequence.
Corollary 5.58 Let be a lattice in , and let be a shortest non-zero lattice vector in . Then .
Proof. Let be a vector in such that is linearly independent of and (5.10) holds. Then
which yields . The area of the fundamental parallelogram can be computed using the well-known formula
and so . The number can now be bounded by the previous inequality.
Let be a linearly independent collection of vectors in . For an index with , we let denote the component of that is orthogonal to the subspace spanned by . That is,
where
Clearly . The vectors span the same subspace as the vectors , and so, with suitable coefficients , we may write
and
By the latter equations, the vectors form an orthogonal system, and so
The set of the vectors is said to be the Gram-Schmidt orthogonalisation of the vectors .
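As an illustration, the orthogonalisation can be computed exactly with rational arithmetic. The sketch below assumes the usual convention that the coefficient of the $j$-th orthogonal vector in the $i$-th basis vector is the quotient of the scalar products $(b_i, b_j^*)$ and $(b_j^*, b_j^*)$. The function name `gram_schmidt` and the returned pair are choices made for this example.

```python
from fractions import Fraction

def dot(u, v):
    """Standard scalar product of two vectors."""
    return sum(x * y for x, y in zip(u, v))

def gram_schmidt(basis):
    """Return (bstar, mu) such that
    basis[i] = bstar[i] + sum(mu[i][j] * bstar[j] for j < i)."""
    n = len(basis)
    bstar = []
    mu = [[Fraction(0)] * n for _ in range(n)]
    for i, b in enumerate(basis):
        v = [Fraction(x) for x in b]
        for j in range(i):
            # Coefficient of bstar[j] in the expansion of basis[i].
            mu[i][j] = Fraction(dot(b, bstar[j])) / dot(bstar[j], bstar[j])
            v = [x - mu[i][j] * y for x, y in zip(v, bstar[j])]
        bstar.append(v)
    return bstar, mu

bstar, mu = gram_schmidt([(3, 1), (2, 2)])
print(bstar)      # orthogonal vectors with rational coordinates
print(mu[1][0])   # coefficient of the first orthogonal vector in the second basis vector
```

For a full-rank square basis such as this one, the product of the squared lengths of the orthogonal vectors equals the square of the determinant of the basis, which gives a quick sanity check of the output.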
Lemma 5.59 Let be a lattice with basis . Then
Proof. Set and , if . Then , and so
that is, where and are the Gram matrices of the collections and , respectively, and is the matrix with entries . The matrix is a lower triangular matrix with ones in the main diagonal, and so . As is a diagonal matrix, we obtain .
Corollary 5.60 (Hadamard inequality) .
Proof. The vector can be written as the sum of the vector and a vector orthogonal to , and hence .
The vector is the component of orthogonal to the subspace spanned by the vectors . Thus does not change if we subtract a linear combination of the vectors from . If, in this linear combination, the coefficients are integers, then the new sequence will be a basis of the same lattice as the original. Similarly to the first step of the loop in Gauss' algorithm, we can make the numbers in (5.15) small. The input of the following procedure is a basis of a lattice .
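Weak-Reduction($b_1, \ldots, b_n$)
1  FOR $j \leftarrow n-1$ DOWNTO $1$
2    DO FOR $i \leftarrow j+1$ TO $n$
3         $b_i \leftarrow b_i - \lambda b_j$, where $\lambda$ is the integer nearest the number $(b_i, b_j^*)/(b_j^*, b_j^*)$
4  RETURN $(b_1, \ldots, b_n)$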
Definition 5.61 (Weakly reduced basis) A basis of a lattice is said to be weakly reduced if the coefficients in (5.15) satisfy
Lemma 5.62 The basis given by the procedure Weak-Reduction is weakly reduced.
Proof. By the remark preceding the algorithm, we obtain that the vectors never change. Indeed, we only subtract linear combinations of vectors with index less than from . Hence the inner instruction does not change the value of with . The values of the do not change for either. On the other hand, the instruction achieves, with the new , that the inequality holds:
By the observations above, this inequality remains valid during the execution of the procedure.
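A minimal Python sketch of weak reduction follows. It assumes integer input vectors and uses the fact, noted in the proof, that the Gram-Schmidt vectors do not change during the procedure, so they are computed once. The helper names are illustrative only.

```python
from fractions import Fraction

def dot(u, v):
    """Standard scalar product of two vectors."""
    return sum(x * y for x, y in zip(u, v))

def nearest_int(q):
    """Integer nearest to the rational number q (ties are rounded up)."""
    q = Fraction(q)
    return (2 * q.numerator + q.denominator) // (2 * q.denominator)

def gram_schmidt_vectors(basis):
    """Return the Gram-Schmidt orthogonalisation of the basis."""
    bstar = []
    for b in basis:
        v = [Fraction(x) for x in b]
        for w in bstar:
            c = Fraction(dot(b, w)) / dot(w, w)
            v = [x - c * y for x, y in zip(v, w)]
        bstar.append(v)
    return bstar

def weak_reduction(basis):
    """Return a weakly reduced basis of the same lattice."""
    b = [list(v) for v in basis]
    bstar = gram_schmidt_vectors(b)      # invariant under the updates below
    n = len(b)
    for j in range(n - 2, -1, -1):       # j = n-1 downto 1 in the 1-based pseudocode
        for i in range(j + 1, n):
            lam = nearest_int(Fraction(dot(b[i], bstar[j])) / dot(bstar[j], bstar[j]))
            b[i] = [x - lam * y for x, y in zip(b[i], b[j])]
    return b

print(weak_reduction([(1, 0, 0), (4, 1, 0), (7, 5, 1)]))
```

Processing the index $j$ from high to low matters: subtracting a multiple of the $j$-th basis vector only affects coefficients with second index at most $j$, so coefficients that were already made small stay small.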
First we define, in an arbitrary dimension, a property of the bases that usually turns out to be useful. The definition will be of a technical nature. Later we will see that these bases are interesting, in the sense that they consist of short vectors. This property will make them widely applicable.
Definition 5.63 A basis of a lattice is said to be (Lovász-)reduced if
it is weakly reduced,
and, using the notation introduced for the Gram-Schmidt orthogonalisation,
for all .
Let us observe the analogy of the conditions above to the inequalities that we have seen when investigating Gauss' algorithm. For , and , being weakly reduced ensures that is a shortest vector on the line . The second condition is equivalent to the inequality , but here it is expressed in terms of the Gram-Schmidt basis. For a general index , the same is true, if plays the role of the vector , and plays the role of the component of the vector that is orthogonal to the subspace spanned by .
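Lovász-Reduction($b_1, \ldots, b_n$)
1  FOREVER
2    DO $(b_1, \ldots, b_n) \leftarrow$ Weak-Reduction($b_1, \ldots, b_n$)
3       find an index $i$ for which the second condition of being reduced is violated
4       IF there is such an $i$
5         THEN $b_i \leftrightarrow b_{i+1}$
6         ELSE RETURN $(b_1, \ldots, b_n)$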
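The reduction loop can be sketched in Python as follows. The sketch assumes that the second condition of Definition 5.63 is the usual Lovász condition with constant $3/4$, that is, $3|b_i^*|^2 \le 4|b_{i+1}^* + \mu_{i+1,i} b_i^*|^2$; this constant and all function names are assumptions of the example. It recomputes the Gram-Schmidt data in every round, which is simple but far from the most economical organisation.

```python
from fractions import Fraction

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def nearest_int(q):
    q = Fraction(q)
    return (2 * q.numerator + q.denominator) // (2 * q.denominator)

def gram_schmidt(basis):
    """Return (bstar, mu) for the given basis, using exact rationals."""
    n = len(basis)
    bstar, mu = [], [[Fraction(0)] * n for _ in range(n)]
    for i, b in enumerate(basis):
        v = [Fraction(x) for x in b]
        for j in range(i):
            mu[i][j] = Fraction(dot(b, bstar[j])) / dot(bstar[j], bstar[j])
            v = [x - mu[i][j] * y for x, y in zip(v, bstar[j])]
        bstar.append(v)
    return bstar, mu

def lovasz_reduction(basis):
    """Return a reduced basis of the lattice spanned by the integer basis."""
    b = [list(v) for v in basis]
    n = len(b)
    while True:
        # Line 2: weak reduction (the vectors bstar are not changed by it).
        bstar, _ = gram_schmidt(b)
        for j in range(n - 2, -1, -1):
            for i in range(j + 1, n):
                lam = nearest_int(Fraction(dot(b[i], bstar[j])) / dot(bstar[j], bstar[j]))
                b[i] = [x - lam * y for x, y in zip(b[i], b[j])]
        _, mu = gram_schmidt(b)
        # Lines 3-6: look for an index violating the second condition.
        for i in range(n - 1):
            w = [y + mu[i + 1][i] * x for x, y in zip(bstar[i], bstar[i + 1])]
            if 3 * dot(bstar[i], bstar[i]) > 4 * dot(w, w):
                b[i], b[i + 1] = b[i + 1], b[i]
                break
        else:
            return b

print(lovasz_reduction([(1, 1, 1), (-1, 0, 2), (3, 5, 6)]))
```

Since every step is an integer unimodular operation, the product of the squared lengths of the Gram-Schmidt vectors is the same before and after the reduction; this is a convenient check when experimenting with the code.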
Theorem 5.64 Suppose that in the lattice each of the pairs of the lattice vectors has an integer scalar product. Then the swap in the 5th line of the Lovász-Reduction occurs at most times where is the upper left -subdeterminant of the Gram matrix of the initial basis .
Proof. The determinant is the determinant of the Gram matrix of , and, by the observations we made at the discussion of the Gram-Schmidt orthogonalisation, . This, of course, implies that for . By the above, the procedure Weak-Reduction cannot change the vectors , and so it does not change the product either. Assume, in line 5 of the procedure, that a swap takes place. Observe that, unless , the sets do not change, and neither do the determinants . The role of the vector is taken over by the vector , whose length, because of the conditions of the swap, is at most times the length of . That is, the new is at most times the old. By the observation above, the new value of will also be at most times the old one. Then the assertion follows from the fact that the quantity remains a positive integer.
Corollary 5.65 Under the conditions of the previous theorem, the cost of the procedure Lovász-Reduction is at most arithmetic operations with rational numbers where is the maximum of and the quantities with .
Proof. It follows from the Hadamard inequality that
Hence and . By the previous theorem, this is the number of iterations in the algorithm. The cost of the Gram-Schmidt orthogonalisation is operations, and the cost of weak reduction is scalar product computations, each of which can be performed using operations (provided the vectors are represented by their coordinates in an orthogonal basis).
One can show that the length of the integers that occur during the run of the algorithm (including the numerators and the denominators of the fractions in the Gram-Schmidt orthogonalisation) will be below a polynomial bound.
Theorem 5.67 of this section gives a summary of the properties of reduced bases that turn out to be useful in their applications. We will find that a reduced basis consists of relatively short vectors. More precisely, will approximate, within a constant factor depending only on the dimension, the length of a shortest non-zero lattice vector.
Lemma 5.66 Let us assume that the vectors form a reduced basis of a lattice . Then, for ,
In particular,
Proof. Substituting , , Lemma 5.56 gives, for all , that
Thus, inequality (5.16) follows by induction.
Now we can formulate the fundamental theorem of reduced bases.
Theorem 5.67 Assume that the vectors form a reduced basis of a lattice . Then
(i) .
(ii) for all lattice vectors . In particular, the length of is not greater than times the length of a shortest non-zero lattice vector.
(iii) .
Proof. (i) Using inequality (5.17),
and so assertion (i) holds.
(ii) Let with be a lattice vector. Assume that is the last non-zero coefficient and write where is a linear combination of the vectors . Hence where lies in the subspace spanned by . As is orthogonal to this subspace,
and so assertion (ii) is valid.
(iii) First we show that . This inequality is obvious if , and so we assume that . Using the decomposition (5.14) of the vector and the fact that the basis is weakly reduced, we obtain that
Multiplying these inequalities for ,
which is precisely the inequality in (iii).
It is interesting to compare assertion (i) in the previous theorem and Corollary 5.54 after Minkowski's theorem. Here we obtain a weaker bound for the length of , but this vector can be obtained by an efficient algorithm. Essentially, the existence of the basis that satisfies assertion (iii) was first shown by Hermite using the tools in the proofs of Theorems 5.48 and 5.67. Using a Lovász-reduced basis, the cost of finding a shortest vector in a lattice with dimension is at most polynomial in the input size and in ; see Exercise 5.4-4.
Exercises
5.4-1 The triangular lattice is optimal. Show that the bound in Corollary 5.58 is sharp. More precisely, let be a full lattice and let be a shortest vector in . Verify that the inequality holds if and only if is similar to the triangular lattice.
5.4-2 The denominators of the Gram-Schmidt numbers. Let us assume that the Gram matrix of a basis has only integer entries. Show that the numbers in (5.15) can be written in the form where the are integers and is the determinant of the Gram matrix of the vectors .
5.4-3 The length of the vectors in a reduced basis. Let be a reduced basis of a lattice and let us assume that the numbers are integers. Give an upper bound depending only on and for the length of the vectors . More precisely, prove that
5.4-4 The coordinates of a shortest lattice vector. Let be a reduced basis of a lattice . Show that each of the shortest vectors in is of the form where and . Consequently, for a bounded , one can find a shortest non-zero lattice vector in polynomial time.
Hint. Assume, for some lattice vector , that . Let us write in the basis :
It follows from the assumption that each of the components of (in the orthogonal basis) is at most as long as :
Use then the inequalities and (5.17).
In this section we study the problem of factoring polynomials with rational coefficients. The input of the factorisation problem is a polynomial . Our goal is to compute a factorisation
where the polynomials are pairwise relatively prime, and irreducible over , and the numbers are positive integers. By Theorem 5.4, determines, essentially uniquely, the polynomials and the exponents .
First we reduce the problem (5.18) to another problem that can be handled more easily.
Lemma 5.68 We may assume that the polynomial has integer coefficients and it has leading coefficient .
Proof. Multiplying by the common denominator of the coefficients, we may assume that . Performing the substitution , we obtain the polynomial
which has integer coefficients and its leading coefficient is 1. Using a factorisation of , a factorisation of can be obtained efficiently.
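The reduction to a monic polynomial can be illustrated with a few lines of Python. The sketch assumes the standard substitution $y = a_n x$, where $a_n$ is the leading coefficient of an integer polynomial $f$ of degree $n \ge 1$, so that $g(y) = a_n^{n-1} f(y/a_n)$. Coefficient lists are stored lowest degree first, and the function name `make_monic` is hypothetical.

```python
def make_monic(coeffs):
    """coeffs[i] is the integer coefficient of x^i and coeffs[-1] = a_n != 0.
    Return the coefficients of g(y) = a_n^(n-1) * f(y / a_n), which is monic."""
    n = len(coeffs) - 1
    a_n = coeffs[-1]
    # The coefficient of y^i becomes a_i * a_n^(n-1-i) for i < n, and 1 for i = n.
    return [coeffs[i] * a_n ** (n - 1 - i) for i in range(n)] + [1]

# f(x) = 2x^2 + 3x + 1 = (2x + 1)(x + 1)
g = make_monic([1, 3, 2])
print(g)   # coefficients of y^2 + 3y + 2 = (y + 1)(y + 2), lowest degree first
```

Substituting $y = 2x$ into the factors $y + 1$ and $y + 2$ gives $2x + 1$ and $2(x + 1)$, from which the factorisation of the original polynomial is read off after removing the constant.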
Definition 5.69 A polynomial is said to be primitive, if the greatest common divisor of its coefficients is .
A polynomial can be written in a unique way as the product of an integer and a primitive polynomial in . Indeed, if is the greatest common divisor of the coefficients, then . Clearly, is a primitive polynomial with integer coefficients.
Lemma 5.70 (Gauss' Lemma) If are primitive polynomials, then so is the product .
Proof. We argue by contradiction and assume that is a prime number that divides all the coefficients of . Set , and let and be the smallest indices such that and . Let and consider the coefficient of in the product . This coefficient is
Both of the sums on the right-hand side of this equation are divisible by , while is not, and hence the coefficient of in cannot be divisible by after all. This, however, is a contradiction.
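Gauss' Lemma is easy to check on small examples. The following sketch computes the greatest common divisor of the coefficients (called the content here, a name not used in the text), the primitive part, and the product of two polynomials given as integer coefficient lists, lowest degree first.

```python
from functools import reduce
from math import gcd

def content(coeffs):
    """Greatest common divisor of the coefficients (non-negative)."""
    return reduce(gcd, coeffs, 0)

def primitive_part(coeffs):
    """Divide a non-zero integer polynomial by the gcd of its coefficients."""
    c = content(coeffs)
    return [a // c for a in coeffs]

def poly_mul(f, g):
    """Product of two polynomials given by coefficient lists."""
    out = [0] * (len(f) + len(g) - 1)
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            out[i + j] += a * b
    return out

f = [6, 4, 10]   # 10x^2 + 4x + 6, content 2
g = [9, 3]       # 3x + 9, content 3
product_of_parts = poly_mul(primitive_part(f), primitive_part(g))
print(content(product_of_parts))   # expected to be 1 by Gauss' Lemma
print(content(poly_mul(f, g)))     # the contents multiply
```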
Claim 5.71 Let us assume that are polynomials with rational coefficients and leading coefficient such that the product has integer coefficients. Then the polynomials and have integer coefficients.
Proof. Let us multiply and by the least common multiple and , respectively, of the denominators of their coefficients. Then the polynomials and are primitive polynomials with integer coefficients. Hence, by Gauss' Lemma, so is the product . As the coefficients of are integers, each of its coefficients is divisible by the integer . Hence , and so . Therefore and are indeed polynomials with integer coefficients.
One can show similarly, for a polynomial , that factoring in is equivalent to factoring the primitive part of in and factoring an integer, namely the greatest common divisor of the coefficients.
As we work over an infinite field, we have to pay attention to the size of the results in our computations.
Definition 5.72 The norm of a polynomial with complex coefficients is the real number .
The inequality implies that a polynomial with integer coefficients can be described using bits.
Lemma 5.73 Let be a polynomial with complex coefficients. Then, for all , we have
where is the usual conjugate of the complex number .
Proof. Let us assume that and set . Then
and hence
Performing similar computations with the right-hand side of the equation in the lemma, we obtain that
and so
The proof of the lemma is now complete.
Theorem 5.74 (Mignotte) Let us assume that the polynomials have complex coefficients and leading coefficient and that . If , then .
Proof. By the fundamental theorem of algebra, where are the complex roots of the polynomial (with multiplicity). Then there is a subset such that . First we claim, for an arbitrary set , that
If contains an integer with , then this inequality will trivially hold. Let us hence assume that for every . Set and . Applying Lemma 5.73 several times, we obtain that
where . As the leading coefficient of is 1, , and so
Let us express the coefficients of using its roots:
For an arbitrary polynomial , the inequality is valid. Therefore, using inequality (5.19), we find that
The proof is now complete.
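To get a feeling for the size of the bound, it can be evaluated on a concrete factorisation. The sketch below assumes the usual form of Mignotte's bound, $\|g\| \le 2^k \|f\|$ for a divisor $g$ of degree $k$, and avoids square roots by comparing squared norms. The function names are illustrative.

```python
from math import isqrt

def norm_squared(coeffs):
    """Square of the norm: the sum of the squares of the integer coefficients."""
    return sum(a * a for a in coeffs)

def satisfies_mignotte(f, g):
    """Check ||g|| <= 2^deg(g) * ||f|| using exact integer arithmetic."""
    k = len(g) - 1
    return norm_squared(g) <= 4 ** k * norm_squared(f)

def coefficient_bound(f, k):
    """Smallest integer that is at least 2^k * ||f|| (f must be non-zero)."""
    m = 4 ** k * norm_squared(f)
    return isqrt(m - 1) + 1

# x^4 + 4 = (x^2 + 2x + 2)(x^2 - 2x + 2); coefficient lists are lowest degree first.
f = [4, 0, 0, 0, 1]
g = [2, 2, 1]
print(satisfies_mignotte(f, g))
print(coefficient_bound(f, 2))
```

The second function gives an integer bound on the absolute values of the coefficients of any divisor of degree $k$, which is how such a bound is typically used when coefficients are recovered from residues.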
Corollary 5.75 The bit size of the irreducible factors in of an with leading coefficient is polynomial in the bit size of .
Let be an arbitrary field, and let be polynomials with degree and , respectively: , where . We recall the concept of the resultant from Chapter 3. The resultant of and is the determinant of the -matrix
The matrix above is usually referred to as the Sylvester matrix. The blank spaces in the Sylvester matrix represent zero entries.
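A small Python sketch shows how the resultant can be computed directly from this definition. It assumes the common layout of the Sylvester matrix, in which the shifted coefficient rows of the first polynomial come first and the coefficients are written from the highest degree down; other layouts change the result at most by a sign. Both polynomials are assumed to be non-constant, input lists are stored lowest degree first, and the determinant is computed by exact Gaussian elimination over the rationals.

```python
from fractions import Fraction

def sylvester_matrix(f, g):
    """Sylvester matrix of f (degree n) and g (degree m), size (n+m) x (n+m)."""
    fh, gh = list(f)[::-1], list(g)[::-1]      # highest degree first
    n, m = len(fh) - 1, len(gh) - 1
    rows = [[0] * i + fh + [0] * (m - 1 - i) for i in range(m)]
    rows += [[0] * i + gh + [0] * (n - 1 - i) for i in range(n)]
    return rows

def determinant(matrix):
    """Determinant by Gaussian elimination with exact rational arithmetic."""
    a = [[Fraction(x) for x in row] for row in matrix]
    n = len(a)
    det = Fraction(1)
    for c in range(n):
        pivot = next((r for r in range(c, n) if a[r][c] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != c:
            a[c], a[pivot] = a[pivot], a[c]
            det = -det                          # a row swap changes the sign
        det *= a[c][c]
        for r in range(c + 1, n):
            factor = a[r][c] / a[c][c]
            a[r] = [x - factor * y for x, y in zip(a[r], a[c])]
    return det

def resultant(f, g):
    return determinant(sylvester_matrix(f, g))

print(resultant([1, 0, 1], [0, 2]))    # x^2 + 1 and its derivative 2x
print(resultant([-1, 0, 1], [-1, 1]))  # x^2 - 1 and x - 1 share the root 1
```

In the first example the resultant is a power of 2, which fits the fact that $x^2 + 1 = (x+1)^2$ fails to be square-free modulo 2 while it stays square-free modulo every odd prime.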
The resultant provides information about the common factors of and . One can use it to express, particularly elegantly, the fact that two polynomials are relatively prime:
Corollary 5.76 Let be a square-free (in ), non-constant polynomial. Then is an integer. Further, assume that is a prime not dividing . Then the polynomial is square-free in if and only if does not divide .
Proof. The entries of the Sylvester matrix corresponding to and are integers, and so is its determinant. The polynomial has no multiple roots over , and so, by Exercise 5.5-1, , which gives, using (5.21), that . Let denote the polynomial reduced modulo . Then it follows from our assumptions that is precisely the residue of modulo . By Exercise 5.5-1, the polynomial is square-free precisely when , which is equivalent to . This amounts to saying that does not divide the integer .
Corollary 5.77 If is a square-free polynomial with degree , then there is a prime (that is, the absolute value of is polynomial in the bit size of ) such that the polynomial is square-free in .
Proof. By the Prime Number Theorem (Theorem 33.37), for large enough , the product of the primes in the interval is at least .
Set . If is large enough, then
where are primes not larger than , and is the leading coefficient of .
Let us suppose, for the primes , that is not square-free in . Then the product divides , and so
(In the last two inequalities, we used the Hadamard inequality, and the fact that .) This contradicts inequality (5.22), which must be valid because of the choice of .
We note that using the Prime Number Theorem more carefully, one can obtain a stronger bound for .
We present a general procedure that can be used to obtain, given a factorisation modulo a prime , a factorisation modulo of a polynomial with integer coefficients.
Theorem 5.78 (Hensel's lemma) Suppose that are polynomials with leading coefficient such that , and, in addition, and are relatively prime in . Then, for an arbitrary positive integer , there are polynomials such that
both of the leading coefficients of and are equal to ,
and ,
.
Moreover, the polynomials and satisfying the conditions above are unique modulo .
Proof. From the conditions concerning the leading coefficients, we obtain that , and, further, that and , provided the suitable polynomials and indeed exist. The existence is proved by induction on . In the initial step, and the choice and is as required.
The induction step : let us assume that there exist polynomials and that are well-defined modulo and satisfy the conditions. If the polynomials and exist, then they must satisfy the conditions imposed on and . As and are unique modulo , we may write and where and are polynomials with integer coefficients. The condition concerning the leading coefficients guarantees that and that .
By the induction hypothesis, where . The observations about the degrees of the polynomials and imply that the degree of is smaller than . Now we may compute that
As , the congruence above holds modulo . Thus and satisfy the conditions if and only if
This, however, amounts to saying, after cancelling from both sides, that
Using the congruences and we obtain that this is equivalent to the congruence
Considering the inequalities and and the fact that in the polynomials and are relatively prime, we find that equation (5.23) can be solved uniquely in . For, if and form a solution to , then, by Theorem 5.12, the polynomials
and
form a solution of (5.23). The uniqueness of the solution follows from the bounds on the degrees, and from the fact that and are relatively prime. The details of this are left to the reader.
Corollary 5.79 Assume that , and the polynomials satisfy the conditions of Hensel's lemma. Set and let be a positive integer. Then the polynomials and can be obtained using arithmetic operations modulo .
Proof. The proof of Theorem 5.78 suggests the following algorithm.
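Hensel-Lifting($f, g, h, p, t$)
1  $(u, v) \leftarrow$ is a solution, in $\mathbb{F}_p[x]$, of $u g + v h \equiv 1 \pmod p$
2  $(G, H) \leftarrow (g, h)$
3  FOR $s \leftarrow 1$ TO $t-1$
4    DO $\lambda \leftarrow (f - G \cdot H)/p^s$
5       $\delta_G \leftarrow v\lambda$ reduced modulo $g$ (in $\mathbb{F}_p[x]$)
6       $\delta_H \leftarrow u\lambda$ reduced modulo $h$ (in $\mathbb{F}_p[x]$)
7       $(G, H) \leftarrow (G + p^s \delta_G, H + p^s \delta_H)$ (in $(\mathbb{Z}/(p^{s+1}))[x]$)
8  RETURN $(G, H)$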
The polynomials and can be obtained using operations in (see Theorem 5.12 and the remark following it). An iteration consists of a constant number of operations with polynomials, and the cost of one run of the main loop is operations (modulo and ). The total cost of reaching is operations.
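The lifting procedure can be sketched in Python with naive polynomial arithmetic. Polynomials are lists of integers, lowest degree first, with the zero polynomial represented by the empty list. The sketch assumes that the two factors are monic and relatively prime modulo the prime, and that the correction terms are the products of the Bézout coefficients with the scaled error, reduced modulo the respective factor, as in the proof of Theorem 5.78. All function names are illustrative.

```python
def trim(a):
    """Remove leading zero coefficients in place and return the list."""
    while a and a[-1] == 0:
        a.pop()
    return a

def add(a, b, m):
    n = max(len(a), len(b))
    return trim([((a[i] if i < len(a) else 0) + (b[i] if i < len(b) else 0)) % m
                 for i in range(n)])

def scale(a, c, m):
    return trim([(c * x) % m for x in a])

def mul(a, b, m):
    if not a or not b:
        return []
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] = (out[i + j] + x * y) % m
    return trim(out)

def divmod_poly(a, b, p):
    """Quotient and remainder of a by b != 0 over the field of p elements."""
    a, q = a[:], [0] * max(len(a) - len(b) + 1, 1)
    inv = pow(b[-1], -1, p)
    while len(a) >= len(b):
        c, d = (a[-1] * inv) % p, len(a) - len(b)
        q[d] = c
        for i, y in enumerate(b):
            a[i + d] = (a[i + d] - c * y) % p
        trim(a)
    return trim(q), a

def ext_euclid(g, h, p):
    """Return (u, v) with u*g + v*h = 1 modulo p, for relatively prime g, h."""
    r0, r1, u0, u1, v0, v1 = g, h, [1], [], [], [1]
    while r1:
        q, r = divmod_poly(r0, r1, p)
        r0, r1 = r1, r
        u0, u1 = u1, add(u0, scale(mul(q, u1, p), -1, p), p)
        v0, v1 = v1, add(v0, scale(mul(q, v1, p), -1, p), p)
    c = pow(r0[0], -1, p)                  # r0 is a non-zero constant
    return scale(u0, c, p), scale(v0, c, p)

def hensel_lift(f, g, h, p, t):
    """Lift f = g*h (mod p) to f = G*H (mod p^t); f, g, h are monic."""
    gp, hp = scale(g, 1, p), scale(h, 1, p)
    u, v = ext_euclid(gp, hp, p)
    G, H = gp, hp
    for s in range(1, t):
        m = p ** (s + 1)
        diff = add(f, scale(mul(G, H, m), -1, m), m)    # f - G*H modulo p^(s+1)
        lam = trim([(x // p ** s) % p for x in diff])   # (f - G*H) / p^s modulo p
        dG = divmod_poly(mul(v, lam, p), gp, p)[1]
        dH = divmod_poly(mul(u, lam, p), hp, p)[1]
        G = add(G, scale(dG, p ** s, m), m)
        H = add(H, scale(dH, p ** s, m), m)
    return G, H

# x^2 - 2 = (x + 4)(x + 3) modulo 7; lift the factorisation modulo 7^3.
G, H = hensel_lift([-2, 0, 1], [4, 1], [3, 1], 7, 3)
print(G, H)
```

The arithmetic here is quadratic in the degree; the operation counts quoted in the text refer to asymptotically faster polynomial arithmetic.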
The factorisation problem (5.18) was efficiently reduced to the case in which the polynomial has integer coefficients and leading coefficient 1. We may also assume that has no multiple factors in . Indeed, in our case , and so the possible multiple factors of can be separated using the idea that we already used over finite fields as follows. By Lemma 5.13, the polynomial is already square-free, and, using Lemma 5.14, it suffices to find its factors with multiplicity one. From Claim 5.71, we can see that has integer coefficients and leading coefficient 1. Computing the greatest common divisor and dividing polynomials can be performed efficiently, and so the reduction can be carried out in polynomial time. (In the computation of the greatest common divisor, the intermediate expression swell can be avoided using the techniques used in number theory.)
In the sequel we assume that the polynomial
we want to factor is square-free, its coefficients are integers, and its leading coefficient is 1.
The fundamental idea of the Berlekamp-Zassenhaus algorithm is that we compute the irreducible factors of modulo where is a suitably chosen prime and is large enough. If, for instance, , and we have already computed the coefficients of a factor modulo , then, by Mignotte's theorem, we can obtain the coefficients of a factor in .
From now on, we will also assume that is a prime such that the polynomial is square-free in . Using linear search such a prime can be found in polynomial time (Corollary 5.77). One can even assume that is polynomial in the bit size of .
The irreducible factors in of the polynomial can be found using Berlekamp's deterministic method (Theorem 5.42). Let be polynomials, all with leading coefficient 1, such that the are the irreducible factors of the polynomial in .
Using the technique of Hensel's lemma (Theorem 5.78) and Corollary 5.79, the system can be lifted modulo . To simplify the notation, we assume now that are polynomials with leading coefficients 1 such that
and the are the irreducible factors of the polynomial in .
Let be an irreducible factor with leading coefficient 1 of the polynomial in . Then there is a uniquely determined set for which
Let be the smallest integer such that . Mignotte's bound shows that the polynomial on the right-hand side, if its coefficients are represented by the residues with the smallest absolute values, coincides with .
We found that determining the irreducible factors of is equivalent to finding minimal subsets for which there is a polynomial with leading coefficient 1 such that , the absolute values of the coefficients of are at most , and, moreover, divides . This can be checked by examining at most sets . The cost of examining a single is polynomial in the size of .
To summarise, we obtained the following method to factor, in , a square-free polynomial with integer coefficients and leading coefficient 1.
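Berlekamp-Zassenhaus($f$)
1  $p \leftarrow$ a prime such that $f \pmod p$ is square-free in $\mathbb{F}_p[x]$ and $p = O((n \lg n + 2n \lg \|f\|)^2)$
2  $\{g_1, \ldots, g_r\} \leftarrow$ the irreducible factors of $f \pmod p$ in $\mathbb{F}_p[x]$ (using Berlekamp's deterministic method)
3  $N \leftarrow \lfloor \log_p (2^{\deg f} \cdot \|f\|) \rfloor + 1$
4  $\{G_1, \ldots, G_r\} \leftarrow$ the Hensel lifting of the system $\{g_1, \ldots, g_r\}$ modulo $p^N$
5  $\mathcal{I} \leftarrow$ the collection of minimal subsets $I \neq \emptyset$ of $\{1, \ldots, r\}$ such that $g_I \leftarrow \prod_{i \in I} G_i$ reduced modulo $p^N$ divides $f$
6  RETURN $\{g_I : I \in \mathcal{I}\}$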
Theorem 5.80 Let be a square-free polynomial with integer coefficients and leading coefficient , and let be a prime number such that the polynomial is square-free in and . Then the irreducible factors of the polynomial in can be obtained by the Berlekamp-Zassenhaus algorithm. The cost of this algorithm is polynomial in , and where is the number of irreducible factors of the polynomial in .
Example 5.5 (Swinnerton-Dyer polynomials) Let
where are the first prime numbers, and the product is taken over all possible combinations of the signs and . The degree of is , and one can show that it is irreducible in . On the other hand, for all primes , the polynomial is the product of factors with degree at most 2. Therefore these polynomials represent hard cases for the Berlekamp-Zassenhaus algorithm, as we need to examine about sets to find out that is irreducible.
Our goal in this section is to present the Lenstra-Lenstra-Lovász algorithm (LLL algorithm) for factoring polynomials . This was the first polynomial time method for solving the polynomial factorisation problem over . Similarly to the Berlekamp-Zassenhaus method, the LLL algorithm starts with a factorisation of modulo and then uses Hensel lifting. In the final stages of the work, it uses lattice reduction to find a proper divisor of , provided one exists. The powerful idea of the LLL algorithm is that it replaced the search, which may have exponential complexity, in the Berlekamp-Zassenhaus algorithm by an efficient lattice reduction.
Let be a square-free polynomial with leading coefficient 1 such that , and let be a prime such that the polynomial is square-free in and .
Lemma 5.81 Suppose that where and are polynomials with integer coefficients and leading coefficient . Let with and assume that for some polynomial such that has integer coefficients and . Let us further assume that . Then in .
Proof. Let . By the assumptions,
Suppose that and . (We know that . If , then , and similarly, if , then .) Rewriting the congruence, we obtain
Considering the coefficient vectors of the polynomials and , this congruence amounts to saying that adding to the -th row of the Sylvester matrix (5.20) a suitable linear combination of the other rows results in a row in which all the elements are divisible by . Consequently, . The Hadamard inequality (Corollary 5.60) yields that , but this can only happen if . However, , and so, by (5.21), .
Set
Further, we let be a polynomial with leading coefficient 1 such that is an irreducible factor of . Set . Define the set as follows:
Clearly, is closed under addition of polynomials. We identify a polynomial with degree less than with its coefficient vector of length . Under this identification, becomes a lattice in . Indeed, it is not too hard to show (Exercise 5.5-2) that the polynomials
or, more precisely, their coefficient vectors, form a basis of .
Theorem 5.82 Let be a polynomial with degree less than such that the coefficient vector of is the first element in a Lovász-reduced basis of . Then is irreducible in if and only if .
Proof. As , it is clear that whenever is irreducible. In order to show the implication in the other direction, let us assume that is reducible and let be a proper divisor of such that is divisible by in . Using Hensel's lemma (Theorem 5.78), we conclude that is divisible by , that is, . Mignotte's theorem (Theorem 5.74) shows that
Now, if we use the properties of reduced bases (second assertion of Theorem 5.67), then we obtain
and so
We can hence apply Lemma 5.81, which gives .
Based on the previous theorem, the LLL algorithm can be outlined as follows (we only give a version for factoring to two factors). The input is a square-free polynomial with integer coefficients and leading coefficient 1 such that .
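LLL-Polynomial-Factorisation($f$)
1  $p \leftarrow$ a prime such that $f \pmod p$ is square-free in $\mathbb{F}_p[x]$ and $p = O((n \lg n + 2n \lg \|f\|)^2)$
2  $w(x) \leftarrow$ an irreducible factor of $f \pmod p$ in $\mathbb{F}_p[x]$ (using Berlekamp's deterministic method)
3  IF $\deg w = n$
4    THEN RETURN “irreducible”
5    ELSE $N \leftarrow \lceil \log_p (2^{2n^2} \|f\|^{2n}) \rceil$
6         $(g, h) \leftarrow$ Hensel-Lifting$(f, w, f/w \pmod p, p, N)$
7         $(b_1, \ldots, b_n) \leftarrow$ a basis of the lattice $L \subseteq \mathbb{R}^n$ in (5.24)
8         $(g_1, \ldots, g_n) \leftarrow$ Lovász-Reduction$(b_1, \ldots, b_n)$
9         $f^* \leftarrow \gcd(f, g_1)$
10        IF $\deg f^* > 0$
11          THEN RETURN $(f^*, f/f^*)$
12          ELSE RETURN “irreducible”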
Theorem 5.83 Using the LLL algorithm, the irreducible factors in of a polynomial can be obtained deterministically in polynomial time.
Proof. The general factorisation problem, using the method introduced at the discussion of the Berlekamp-Zassenhaus procedure, can be reduced to the case in which the polynomial is square-free and has leading coefficient 1. By the observations made there, the steps in lines 1–7 can be performed in polynomial time. In line 8, the Lovász reduction can be carried out efficiently (Corollary 5.65). In line 9, we may use a modular version of the Euclidean algorithm to avoid intermediate expression swell (see Chapter 6).
The correctness of the method is asserted by Theorem 5.82. The LLL algorithm can be applied repeatedly to factor the polynomials in the output, in case they are not already irreducible.
One can show that the Hensel lifting costs operations with moderately sized integers. The total cost of the version of the LLL algorithm above is .
Exercises
5.5-1 Let be a field and let . The polynomial has no irreducible factors with multiplicity greater than one if and only if .
Hint. In one direction, one can use Lemma 5.13, and use Lemma 5.14 in the other.
5.5-2 Show that the polynomials
form a basis of the lattice in (5.24).
Hint. It suffices to show that the polynomials () can be expressed with the given polynomials. To show this, divide by and compute the remainder.
PROBLEMS
5-1 The trace in finite fields
Let be finite fields. The definition of the trace map on is as follows: if then
a. Show that the map is -linear and its image is precisely .
Hint. Use the fact that is defined using a polynomial with degree to show that is not identically zero.
b. Let be a uniformly distributed random pair of elements from . Then the probability that is .
5-2 The Cantor-Zassenhaus algorithm for fields of characteristic 2
Let and let be a polynomial of the form
where the are pairwise relatively prime and irreducible polynomials with degree in . Also assume that .
a. Let be a uniformly distributed random polynomial with degree less than . Then the greatest common divisor
is a proper divisor of with probability at least .
Hint. Apply the previous exercise taking and , and follow the argument in Theorem 5.38.
b. Using part (a), give a randomised polynomial time method for factoring a polynomial of the form (5.25) over .
5-3 Divisors and zero divisors
Let be a field. The ring is said to be an -algebra (in case is clear from the context, is simply called an algebra), if is a vector space over , and holds for all and . It is easy to see that the rings and are -algebras.
Let be a finite-dimensional -algebra. For an arbitrary , we may consider the map defined as for . The map is -linear, and so we may speak about its minimal polynomial , its characteristic polynomial , and its trace . In fact, if is an ideal in , then is an invariant subspace of , and so we can restrict to , and we may consider the minimal polynomial, the characteristic polynomial, and the trace of the restriction.
a. Let with . Show that the residue class is a zero divisor in the ring if and only if does not divide and .
b. Let be an algebra over , and let be an element with minimal polynomial . Show that if is not irreducible over , then contains a zero divisor. To be precise, if is a non-trivial factorisation (), then and form a pair of zero divisors, that is, both of them are non-zero, but their product is zero.
5-4 Factoring polynomials over algebraic number fields
a. Let be a field with characteristic zero and let be a finite-dimensional -algebra with an identity element. Let us assume that where and are non-zero -algebras. Let be a basis of over . Show that there is a such that is not irreducible in .
Hint. This exercise is for readers who are familiar with the elements of linear algebra. Let us assume that the minimal polynomial of is the irreducible polynomial . Let be the characteristic polynomial of on the invariant subspace (for ). Here and are the sets of elements of the form and , respectively where . Because of our conditions, we can find suitable exponents such that . This implies that the trace of the map on the subspace is . Set . Obviously, , which gives . If the assertion of the exercise is false, then the latter equation holds for all , and so, as the trace is linear, it holds for all . This, however, leads to a contradiction: if (1 denotes the unity in ), then clearly and .
b. Let be an algebraic number field, that is, a field of the form where , and there is an irreducible polynomial such that . Let be a square-free polynomial and set . Show that is a finite-dimensional algebra over . More precisely, if and , then the elements of the form (, ) form a basis over .
c. Show that if is reducible over , then there are -algebras such that .
Hint. Use the Chinese remainder theorem.
d. Consider the polynomial above and suppose that a field and a polynomial are given. Assume, further, that is square-free and is not irreducible over . The polynomial can be factored to the product of two non-constant polynomials in polynomial time.
Hint. By the previous remarks, the minimal polynomial over of at least one of the elements (, ) is not irreducible in . Using the LLL algorithm, can be factored efficiently in . From a factorisation of , a zero divisor of can be obtained, and this can be used to find a proper divisor of in .
CHAPTER NOTES
The abstract algebraic concepts discussed in this chapter can be found in many textbooks; see, for instance, Hungerford's book [120].
The theory of finite fields and the related algorithms are the theme of the excellent books by Lidl and Niederreiter [163] and Shparlinski [237].
Our main algorithmic topics, namely the factorisation of polynomials and lattice reduction are thoroughly treated in the book by von zur Gathen and Gerhard [83]. We recommend the same book to the readers who are interested in the efficient methods to solve the basic problems concerning polynomials. Theorem 8.23 of that book estimates the cost of multiplying polynomials by the Schönhage-Strassen method, while Corollary 11.6 is concerned with the cost of the asymptotically fast implementation of the Euclidean algorithm. Ajtai's result about shortest lattice vectors was published in [8].
The method by Kaltofen and Shoup is a randomised algorithm for factoring polynomials over finite fields, and currently it has one of the best time bounds among the known algorithms. The expected number of -operations in this algorithm is where . Further competitive methods were suggested by von zur Gathen and Shoup, and also by Huang and Pan. The number of operations required by the latter is , if . Among the deterministic methods, the one by von zur Gathen and Shoup is the current champion. Its cost is operations in where . An important related problem is constructing the field . The fastest randomised method is by Shoup. Its cost is . For finding a square-free factorisation, Yun gave an algorithm that requires field operations in .
The best methods to solve the problem of lattice reduction and that of factoring polynomials over the rationals use modular and numerical techniques. After slightly modifying the definition of reduced bases, an algorithm using bit operations for the former problem was presented by Storjohann. (We use the original definition introduced in the paper by Lenstra, Lenstra and Lovász [161].) We also mention Schönhage's method using bit operations for factoring polynomials with integer coefficients ( is the length of the coefficients).
Besides factoring polynomials with rational coefficients, lattice reduction can also be used to solve lots of other problems: to break knapsack cryptosystems and random number generators based on linear congruences, simultaneous Diophantine approximation, to find integer linear dependencies among real numbers (this problem plays an important role in experiments that attempt to find mathematical identities). These and other related problems are discussed in the book [83].
A further exciting application area is the numerical solution of Diophantine equations. One can read about these developments in the books by Smart [245] and Gaál [79]. The difficulty of finding a shortest lattice vector was established in Ajtai's paper [8].
Finally we remark that the practical implementations of the polynomial methods involving lattice reduction are not competitive with the implementations of the Berlekamp-Zassenhaus algorithm, which, in the worst case, has exponential complexity. Nevertheless, the basis reduction performs very well in practice: in fact it is usually much faster than its theoretically proven speed. For some of the problems in the application areas listed above, we do not have another useful method.
The work of the authors was supported in part by grants NK72845 and T77476 of the Hungarian Scientific Research Fund.
Computer systems performing various mathematical computations are indispensable in modern science and technology. We are able to compute the orbits of planets and stars, command nuclear reactors, and describe and model many of the natural forces. These computations can be numerical or symbolic.
Although numerical computations may involve not only elementary arithmetical operations (addition, subtraction, multiplication, division) but also more sophisticated calculations, like computing numerical values of mathematical functions, finding roots of polynomials or computing numerical eigenvalues of matrices, these operations can only be carried out on numbers. Furthermore, in most cases these numbers are not exact. Their degree of precision depends on the floating-point arithmetic of the given computer hardware architecture.
Unlike numerical calculations, symbolic and algebraic computations operate on symbols that represent mathematical objects. These objects may be numbers such as integers, rational numbers, real and complex numbers, but may also be polynomials, rational and trigonometric functions, equations, algebraic structures such as groups, rings, ideals, algebras or elements of them, or even sets, lists, tables.
Computer systems with the ability to handle symbolic computations are called computer algebra systems or symbolic and algebraic systems or formula manipulation systems. In most cases, these systems are able to handle both numerical and graphical computations. The word “symbolic” emphasises that, during the problem-solving procedure, the objects are represented by symbols, and the adjective “algebraic” refers to the algebraic origin of the operations on these symbolic objects.
To characterise the notion “computer algebra”, one can describe it as a collection of computer programs developed basically to perform
exact representations of mathematical objects and
arithmetic with these objects.
On the other hand, computer algebra can be viewed as a discipline which has been developed in order to invent, analyse and implement efficient mathematical algorithms based on exact arithmetic for scientific research and applications.
Since computer algebra systems are able to perform error-free computations with arbitrary precision, first we have to clarify the data structures assigned to the various objects. Subsection 6.1 deals with the problems of representing mathematical objects. Furthermore, we describe the symbolic algorithms which are indispensable in modern science and practice.
The problems of natural sciences are mainly expressed in terms of mathematical equations. Research in solving symbolic linear systems is based on the well-known elimination methods. To find the solutions of non-linear systems, first we analyse different versions of the Euclidean algorithm and the method of resultants. In the mid-sixties of the last century, Bruno Buchberger presented a method in his PhD thesis for solving multivariate polynomial equations of arbitrary degree. This method is known as the Gröbner basis method. At that time, the mathematical community paid little attention to his work, but it has since become the basis of a powerful set of tools for computing with higher degree polynomial equations. This topic is discussed in Subsections 6.2 and 6.3.
The next area to be introduced is the field of symbolic integration. Although the nature of the problem was understood long ago (Liouville's principle), it was only in that Robert Risch invented an algorithm to solve the following: given an elementary function of a real variable , decide whether the indefinite integral is also an elementary function, and if so, compute the integral. We describe the method in Subsection 6.4.
At the end of this section, we offer a brief survey of the theoretical and practical relations of symbolic algorithms in Subsection 6.5, devoting an independent part to the present computer algebra systems.
In computer algebra, one encounters mathematical objects of different kinds. In order to be able to manipulate these objects on a computer, one first has to represent and store them in the memory of that computer. This can cause several theoretical and practical difficulties. We examine these questions in this subsection.
Consider the integers. We know from our studies that the set of integers is countable, but computers can only store finitely many of them. The range of values for a single-precision integer is limited by the number of distinct encodings that can be made in the computer word, which is typically or bits in length. Hence, one cannot directly use the computer's integers to represent the mathematical integers, but must be prepared to write programs to handle “arbitrarily” large integers represented by several computer integers. The term arbitrarily large does not mean infinitely large, since architectural constraints or the memory size impose a limit in any case. Moreover, one has to construct data structures over which efficient operations can be built. In fact, there are two standard ways of performing such a representation.
Radix notation (a generalisation of conventional decimal notation), in which is represented as , where the digits are single-precision integers. These integers can be chosen from the canonical digit set or from the symmetrical digit set , where base can be, in principle, any positive integer greater than . For efficiency, is chosen so that is representable in a single computer word. The length of the linear list used to represent a multiprecision integer may be dynamic (i.e. chosen appropriately for the particular integer being represented) or static (i.e. pre-specified fixed length), depending on whether the linear list is implemented using linked list allocation or using array (sequential) notation. The sign of is stored within the list, possibly as the sign of or one or more of the other entries.
Modular notation, in which is represented by its value modulo a sufficient number of large (but representable in one computer word) primes. From the images one can reconstruct using the Chinese remainder algorithm.
The modular form is fast for addition, subtraction and multiplication but is much slower for divisibility tasks. Hence, the choice of representation influences the algorithms that will be chosen. Indeed, not only the choice of representation influences the algorithms to be used but also the algorithms influence the choice of representation.
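The two representations can be contrasted in a short Python sketch. The base 10^4 and the three small prime moduli are illustrative assumptions, not the values of the example that follows, and the helper names are hypothetical.

```python
from math import prod

def to_radix(n, base):
    """Digits of a non-negative integer n in the given base, least significant first."""
    digits = []
    while True:
        n, d = divmod(n, base)
        digits.append(d)
        if n == 0:
            return digits

def from_radix(digits, base):
    """Inverse of to_radix."""
    return sum(d * base**i for i, d in enumerate(digits))

def to_modular(n, moduli):
    """Residues of n modulo pairwise coprime moduli."""
    return tuple(n % m for m in moduli)

def from_modular(residues, moduli):
    """Chinese remainder reconstruction; valid for 0 <= n < prod(moduli)."""
    M = prod(moduli)
    n = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        n += r * Mi * pow(Mi, -1, m)  # modular inverse (Python 3.8+)
    return n % M

BASE = 10**4              # assumed radix
MODULI = (97, 101, 103)   # assumed primes; their product is 1009091

a, b = 123, 456
ra, rb = to_modular(a, MODULI), to_modular(b, MODULI)
# Addition and multiplication act componentwise on the residues.
r_sum = tuple((x + y) % m for x, y, m in zip(ra, rb, MODULI))
r_prod = tuple((x * y) % m for x, y, m in zip(ra, rb, MODULI))
```

The reconstruction is correct only while the true result stays below the product of the moduli, so the moduli have to be chosen with a bound on the result in mind.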
Example 6.1 For the sake of simplicity, in the next example we work only with natural numbers. Suppose that we have a computer architecture with machine word bits in length, i.e. our computer is able to perform integer arithmetic with the integers in range . Using this arithmetic, we carry out a new arithmetic by which we are able to perform integer arithmetic with the integers in range .
Using radix representation let , and let
Then,
where the sum and the product were computed using radix notation.
Switching to modular representation we have to choose pairwise relatively prime integers from the interval such that their product is greater than . Let, for example, the primes be
where . Then, an integer from the interval can be represented by a -tuple from the interval . Therefore,
furthermore, . Hence
where addition and multiplication were carried out using modular arithmetic.
More generally, concerning the choice of representation of other mathematical objects, it is worth distinguishing three levels of abstraction:
Object level. This is the level where the objects are considered as formal mathematical objects. For example , and are all representations of the integer . On the object level, the polynomials and are considered equal.
Form level. On this level, one has to distinguish between different representations of an object. For example and are considered different representations of the same polynomial, namely the former is a product, the latter is a sum.
Data structure level. On this level, one has to consider different ways of representing an object in a computer memory. For example, we distinguish between representations of the polynomial as
an array ,
a linked list .
In order to represent objects in a computer algebra system, one has to make choices on both the form and the data structure level. Clearly, various representations are possible for many objects. The problem of “how to represent an object” becomes even more difficult when one takes into consideration other criteria, such as memory space, computation time, or readability. Let us see an example. For the polynomial
the product form is more comprehensible, but the second one is more suitable for reading off the coefficient of, say, . Two other illustrative examples are
and ,
and .
It is very hard to find any good strategy to represent mathematical objects satisfying several criteria. In practice, one object may have several different representations. This, however, gives rise to the problem of detecting equality when different representations of the same object are encountered. In addition, one has to be able to convert a given representation to others and simplify the representations.
Consider the integers. In the form level, one can represent the integers using base representation, while at the data structure level they can be represented by a linked list or as an array.
Rational numbers can be represented by two integers, a numerator and a denominator. Considering memory constraints, one needs to ensure that rational numbers are in lowest terms and also that the denominator is positive (although other choices, such as positive numerator, are also possible). This implies that a greatest common divisor computation has to be performed. Since the ring of integers is a Euclidean domain, this can be easily computed using the Euclidean algorithm. The uniqueness of the representation follows from the choice of the denominator's sign.
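A minimal sketch of this canonical form in Python, assuming the convention of a positive denominator; the function name `normalise` is hypothetical.

```python
from math import gcd

def normalise(num, den):
    """Return (num, den) in lowest terms with a positive denominator."""
    if den == 0:
        raise ZeroDivisionError("denominator must be non-zero")
    g = gcd(num, den)          # non-negative, computed by the Euclidean algorithm
    if den < 0:
        g = -g                 # dividing by -g flips both signs
    return num // g, den // g
```

With this convention two rational numbers are equal exactly when their normalised pairs are equal, so equality testing reduces to comparing two pairs of integers.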
Multivariate polynomials (elements of , where is an integral domain) can be represented in the form , where and for , one can write for . In the form level, one can consider the following levels of abstraction:
Expanded or factored representation, where the products are multiplied out or the expression is in product form. Compare
, and
.
Recursive or distributive representation (only for multivariate polynomials). In the bivariate case, the polynomial can be viewed as an element of the domain , or . Compare
,
, and
.
At the data structure level, there can be dense or sparse representation. Either all terms are considered, or only those having non-zero coefficients. Compare and . In practice, multivariate polynomials are represented mainly in the sparse way.
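The difference between the two data structures can be sketched in Python. The dense form is assumed to be a coefficient list indexed by exponent, the sparse form a dictionary from exponent to non-zero coefficient; the polynomial used is an illustrative choice.

```python
def dense_to_sparse(coeffs):
    """coeffs[i] is the coefficient of x**i; keep only the non-zero terms."""
    return {i: c for i, c in enumerate(coeffs) if c != 0}

def sparse_to_dense(terms):
    """terms maps exponent -> non-zero coefficient."""
    if not terms:
        return [0]
    dense = [0] * (max(terms) + 1)
    for e, c in terms.items():
        dense[e] = c
    return dense

def sparse_mul(p, q):
    """Product of two sparse polynomials; the cost depends only on the number of terms."""
    out = {}
    for e1, c1 in p.items():
        for e2, c2 in q.items():
            e = e1 + e2
            out[e] = out.get(e, 0) + c1 * c2
    return {e: c for e, c in out.items() if c != 0}

# x**1000 + 1 needs 1001 dense entries but only two sparse ones.
p = {1000: 1, 0: 1}
square = sparse_mul(p, p)   # x**2000 + 2*x**1000 + 1
```

For multivariate polynomials the same idea applies with tuples of exponents as dictionary keys.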
The traditional approach of representing power series of the form is to truncate at some specified point, and then to regard them as univariate polynomials. However, this is not a real representation, since many power series can have the same representation. To overcome this disadvantage, there exists a technique of representing power series by a procedure generating all coefficients (rather than by any finite list of coefficients). The generating function is a computable function such that . To perform an operation with power series, it is enough to know how to compute the coefficients of the resulting series from the coefficients of the operands. For example, the coefficients of the product of the power series and can be computed as . In that way, the coefficients are computed when they are needed. This technique is called lazy evaluation.
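Lazy evaluation of power series can be sketched as follows. The class `PowerSeries` is a hypothetical illustration: a series is stored as a function returning the coefficient of a given power, and coefficients of a product are computed by the convolution formula only when they are requested.

```python
from functools import lru_cache

class PowerSeries:
    """A power series given by a function n -> coefficient of x**n."""

    def __init__(self, coeff):
        # Cache each coefficient so that it is computed at most once.
        self._coeff = lru_cache(maxsize=None)(coeff)

    def __getitem__(self, n):
        return self._coeff(n)

    def __add__(self, other):
        return PowerSeries(lambda n: self[n] + other[n])

    def __mul__(self, other):
        # Cauchy product: h_n = sum_{i=0}^{n} f_i * g_{n-i}
        return PowerSeries(lambda n: sum(self[i] * other[n - i] for i in range(n + 1)))

geometric = PowerSeries(lambda n: 1)   # 1 + x + x**2 + ... = 1/(1 - x)
squared = geometric * geometric        # 1/(1 - x)**2, coefficients n + 1
first_five = [squared[n] for n in range(5)]
```

No coefficient of `squared` exists in memory until it is indexed, which is the point of the technique.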
Since computer algebra programs compute in a symbolic way with arbitrary accuracy, in addition to examining time complexity of the algorithms it is also important to examine their space complexity.
Footnote. We consider the running time as the number of operations executed, according to the RAM-model. Considering the Turing-machine model, and using machine words with constant length, we do not have this problem, since in this case space is always bounded by the time.
Consider the simple problem of solving a linear system having equations and unknowns with integer coefficients which require computer word of storage. Using Gaussian elimination, it is easy to see that each coefficient of the reduced linear system may need computer words of storage. In other words, Gaussian elimination suffers from exponential growth in the size of the coefficients. Note that if we applied the same method to linear systems having polynomial coefficients, we would have exponential growth both in the size of the numerical coefficients of the polynomials and in the degrees of the polynomials themselves. In spite of the observed exponential growth, the final result of the Gaussian elimination will always be of reasonable size because by Cramer's rule we know that each component of the solution to such a linear system is a ratio of two determinants, each of which requires approximately computer words. The phenomenon described above is called intermediate expression swell. This often appears in computer algebra algorithms.
Example 6.2 Using only integer arithmetic we solve the following system of linear equations:
First, we eliminate variable from the second equation. We multiply the first row by , the second by and take their sum. If we apply this strategy for the third equation to eliminate variable , we get the following system.
Now, we eliminate variable multiplying the second equation by , the third one by , then taking their sum. The result is
Continuing this process of eliminating variables, we get the following system:
After some simplification, we get that . If we apply greatest common divisor computations in each elimination step, the coefficient growth will be less drastic.
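The elimination strategy of the example (multiply two rows crosswise and subtract, using integer arithmetic only) can be sketched in Python. The 3-by-3 system below is a hypothetical one chosen so that its solution is (1, 2, 3); it is not the system of Example 6.2. The helper `max_bits` reports the bit length of the largest entry, which is one way to observe the growth on larger inputs.

```python
from fractions import Fraction

def eliminate_integer(rows):
    """Forward elimination on an augmented integer matrix using cross-multiplication."""
    m = [list(r) for r in rows]
    n = len(m)
    for i in range(n):
        p = next((k for k in range(i, n) if m[k][i] != 0), None)
        if p is None:
            raise ValueError("singular system")
        m[i], m[p] = m[p], m[i]
        for j in range(i + 1, n):
            a, b = m[i][i], m[j][i]
            # No division: entries stay integers but may roughly double in length.
            m[j] = [a * x - b * y for x, y in zip(m[j], m[i])]
    return m

def back_substitute(m):
    """Solve the triangular system exactly with rational arithmetic."""
    n = len(m)
    x = [Fraction(0)] * n
    for i in reversed(range(n)):
        s = m[i][n] - sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = Fraction(s, m[i][i])
    return x

def max_bits(m):
    """Bit length of the largest entry, a rough measure of coefficient size."""
    return max(abs(v).bit_length() for row in m for v in row)

system = [[2, 1, -1, 1],
          [-3, -1, 2, 1],
          [-2, 1, 2, 6]]
triangular = eliminate_integer(system)
solution = back_substitute(triangular)
```

Dividing each new row by the greatest common divisor of its entries, as suggested above, would be a one-line addition inside the inner loop.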
In order to avoid the intermediate expression swell phenomenon, one uses modular techniques. Instead of performing the operations in the base structure (e.g. Euclidean ring), they are performed in some factor structure, and then, the result is transformed back to (Figure 6.1). In general, modular computations can be performed efficiently, and the reconstruction steps can be made with some interpolation strategy.
Let be an integral domain and let
be arbitrary polynomials with . Let us give a necessary and sufficient condition for and sharing a common root in .
If is a field, then is a Euclidean domain. Recall that we call an integral domain Euclidean together with the function if for all , there exist such that , where or ; furthermore, for all , we have . The element is called the quotient and is called the remainder. If we are working in a Euclidean domain, we would like the greatest common divisor to be unique. For this, a unique element has to be chosen from each equivalence class obtained by multiplying by the units of the ring . (For example, in case of integers we always choose the non-negative one from the classes ) Thus, every element has a unique form
where is called the normal form of . Let us consider a Euclidean domain over a field . Let the normal form of be the corresponding normalised monic polynomial, that is, , where denotes the leading coefficient of polynomial . Let us summarise these important cases:
If then and ,
if ( is a field) then (the leading coefficient of polynomial with the convention ), and .
The following algorithm computes the greatest common divisor of two arbitrary elements of a Euclidean domain. Note that this is one of the oldest algorithms in the world, already known to Euclid around 300 B.C.
Classical-Euclidean(a, b)
1 c ← normal(a)
2 d ← normal(b)
3 WHILE d ≠ 0
4 DO r ← c rem d
5 c ← d
6 d ← r
7 RETURN normal(c)
In the ring of integers, the remainder in line 4 becomes . When , where is a field, the remainder in line 4 can be calculated by the algorithm Euclidean-Division-Univariate-Polynomials(
)
, the analysis of which is left to Exercise 6.2-1.
Figure 6.2 shows the operation of the Classical-Euclidean
algorithm in and . Note that in the program only enters the while loop with non-negative numbers and the remainder is always non-negative, so the normalisation in line 7 is not needed.
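For the ring of integers, the procedure described above (normalise both inputs, then replace the pair by divisor and remainder until the divisor vanishes) can be sketched in Python. The function names are hypothetical, and the absolute value plays the role of the normal form.

```python
def classical_euclidean(a, b):
    """Greatest common divisor of two integers."""
    c, d = abs(a), abs(b)      # normal form in the integers: absolute value
    while d != 0:
        c, d = d, c % d        # remainder of the Euclidean division
    return c                   # already non-negative, no final normalisation needed

def remainder_sequence(a, b):
    """The successive remainders computed by the loop, for inspection."""
    c, d = abs(a), abs(b)
    seq = []
    while d != 0:
        c, d = d, c % d
        seq.append(d)
    return seq
```

The last non-zero entry of the remainder sequence is the greatest common divisor whenever the loop runs at least twice.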
Before examining the running time of the Classical-Euclidean
algorithm, we deal with an extended version of it.
Extended-Euclidean(
)
1 2 3WHILE
4DO
5 6 7 8 9 10RETURN
Figure 6.2. Illustration of the operation of the Classical-Euclidean
algorithm in and . In case (a), the input is . The first two lines of the pseudocode compute the absolute values of the input numbers. The loop between lines and is executed four times, values , and in these iterations are shown in the table. The Classical-Euclidean(
,
)
algorithm outputs as result. In case (b), the input parameters are . The first two lines compute the normal form of the polynomials, and the while loop is executed three times. The output of the algorithm is the polynomial .
It is known that in the Euclidean domain , the greatest common divisor of elements can be expressed in the form with appropriate elements . However, this pair is not unique. For if are appropriate, then so are and for all :
The Classical-Euclidean
algorithm can be extended so that, besides the greatest common divisor, it also outputs an appropriate pair as discussed above.
Let , where is a Euclidean domain together with the function . The equations
are obviously fulfilled due to the initialisation in the first two lines of the pseudocode Extended-Euclidean
. We show that equations (6.3) are invariant under the transformations of the while loop of the pseudocode. Let us presume that the conditions (6.3) are fulfilled before an iteration of the loop. Then lines 4–5 of the pseudocode imply
hence, because of lines 6–7,
Lines 8–9 perform the following operations: take the values of and , then take the values of and , while takes the value of and . Thus, the equalities in (6.3) are also fulfilled after the iteration of the while loop. Since in each iteration of the loop, the series obtained in lines 8–9 is a strictly decreasing series of natural numbers, so sooner or later the control steps out of the while loop. The greatest common divisor is the last non-zero remainder in the series of Euclidean divisions, that is, in lines 8–9.
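The loop invariant can be checked directly in a small Python sketch for integers. The variable names r0, u0, v0, r1, u1, v1 are assumptions chosen for this illustration; the `assert` inside the loop tests that both triples express their remainder as a linear combination of the inputs after every iteration.

```python
def extended_euclidean(a, b):
    """Return (g, u, v) with g = gcd(a, b) >= 0 and g == u*a + v*b."""
    r0, u0, v0 = a, 1, 0
    r1, u1, v1 = b, 0, 1
    while r1 != 0:
        q = r0 // r1
        r0, u0, v0, r1, u1, v1 = (r1, u1, v1,
                                  r0 - q * r1, u0 - q * u1, v0 - q * v1)
        # Invariant: each remainder is a combination of the original inputs.
        assert r0 == u0 * a + v0 * b and r1 == u1 * a + v1 * b
    if r0 < 0:                 # normalise the sign of the result
        r0, u0, v0 = -r0, -u0, -v0
    return r0, u0, v0

g, u, v = extended_euclidean(240, 46)
```

The returned pair is only one of infinitely many valid pairs, in line with the remark on non-uniqueness above.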
Example 6.3 Let us examine the series of remainders in the case of polynomials
The values of the variables before the execution of line are
The return values are:
We can see that the size of the coefficients shows a drastic growth. One might ask why we do not normalise in every iteration of the while loop. This idea leads to the normalised version of the Euclidean algorithm for polynomials.
Extended-Euclidean-Normalised(
)
1 2 3 4 5WHILE
6DO
7 8 9 10 11 12 13 14RETURN
Example 6.4 Let us look at the series of remainders and series obtained in the Extended-Euclidean-Normalised
algorithm in case of the polynomials (6.4) and (6.5)
Before the execution of line of the pseudocode, the values of the variables are
Looking at the size of the coefficients in, the advantage of the normalised version is obvious, but we could still not avoid the growth. To get a machine architecture-dependent description and analysis of the Extended-Euclidean-Normalised
algorithm, we introduce the following notation. Let
where is the word length of the computer in bits. It is easy to verify that if and , then
We give the following theorems without proof.
Theorem 6.1 If and , then the Classical-Euclidean
and Extended-Euclidean
algorithms require machine-word arithmetic operations.
Theorem 6.2 If is a field and , then the Classical-Euclidean
, Extended-Euclidean
and Extended-Euclidean-Normalised
algorithms require elementary operations in .
Can the growth of the coefficients be due to the choice of our polynomials? Let us examine a single Euclidean division in the Extended-Euclidean-Normalised
algorithm. Let , where
and are monic polynomials, , , , and consider the case . Then
Note that the bound (6.6) is valid for the coefficients of the remainder polynomial as well, that is, . So in case , the size of the coefficients may only grow by a factor of around three in each Euclidean division. This estimate seems accurate for pseudorandom polynomials; the interested reader should look at Problem 6-1. The worst case estimate suggests that
where denotes the running time of the Extended-Euclidean-Normalised
algorithm, practically, the number of times the while loop is executed. Luckily, this exponential growth is not achieved in each iteration of the loop, and altogether the growth of the coefficients is bounded polynomially in terms of the input. Later we will see that the growth can be eliminated using modular techniques.
Summarising: after computing the greatest common divisor of the polynomials ( is a field), and have a common root if and only if their greatest common divisor is not a constant. For if is not a constant, then the roots of are also roots of and , since divides and . On the other hand, if and have a root in common, then their greatest common divisor cannot be a constant, since the common root is also a root of it.
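This criterion can be tried out with a small Python sketch over the rational numbers. Polynomials are assumed to be coefficient lists with the lowest degree first, every remainder is made monic as in the normalised variant, and all function names are hypothetical. A non-constant greatest common divisor guarantees a common root in a suitable extension field, not necessarily a rational one.

```python
from fractions import Fraction

def trim(p):
    """Drop trailing zero coefficients (the list is lowest degree first)."""
    p = list(p)
    while p and p[-1] == 0:
        p.pop()
    return p

def monic(p):
    """Divide by the leading coefficient; the zero polynomial stays []."""
    p = trim(p)
    return [Fraction(c) / p[-1] for c in p] if p else []

def poly_rem(f, g):
    """Remainder of f divided by a non-zero g over the rationals."""
    f, g = [Fraction(c) for c in trim(f)], trim(g)
    while len(f) >= len(g):
        q = f[-1] / g[-1]
        shift = len(f) - len(g)
        for i, c in enumerate(g):
            f[i + shift] -= q * c
        f = trim(f)
    return f

def poly_gcd(f, g):
    """Monic greatest common divisor of f and g."""
    c, d = monic(f), monic(g)
    while d:
        c, d = d, monic(poly_rem(c, d))
    return c

def share_nonconstant_factor(f, g):
    return len(poly_gcd(f, g)) > 1

f = [2, -3, 1]     # x**2 - 3x + 2 = (x - 1)(x - 2)
g = [-3, 2, 1]     # x**2 + 2x - 3 = (x - 1)(x + 3)
```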
If is a UFD (unique factorisation domain, where every non-zero, non-unit element can be written as a product of irreducible elements in a unique way up to reordering and multiplication by units) but not necessarily a Euclidean domain, then the situation is more complicated, since we may not have a Euclidean algorithm in . Luckily, there are several useful methods due to: (1) unique factorisation in , (2) the existence of a greatest common divisor of two or more arbitrary elements.
The first possible method is to perform the calculations in the field of fractions of . The polynomial is called a primitive polynomial if there is no prime in that divides all coefficients of . A famous lemma by Gauss says that the product of primitive polynomials is also primitive, hence, for the primitive polynomials , if and only if , where denotes the field of fractions of . So we can calculate greatest common divisors in instead of . Unfortunately, this approach is not really effective because arithmetic in the field of fractions is much more expensive than in .
A second possibility is an algorithm similar to the Euclidean algorithm: in the ring of polynomials in one variable over an integral domain, a so-called pseudo-division can be defined. Using the polynomials (6.1), (6.2), if , then there exist , such that
where or . The polynomial is called the pseudo-quotient of and and is called the pseudo-remainder. The notation is .
Then .
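Pseudo-division over the integers can be sketched in Python as follows. The sketch assumes the usual convention that the dividend is multiplied by the leading coefficient of the divisor raised to the power (degree difference + 1), so that no fractions appear. Coefficient lists are lowest degree first, and the function names and the sample polynomials are hypothetical.

```python
def trim(p):
    p = list(p)
    while p and p[-1] == 0:
        p.pop()
    return p

def pseudo_divide(f, g):
    """Return (q, r) with lc(g)**(deg f - deg g + 1) * f == q*g + r, deg r < deg g."""
    f, g = trim(f), trim(g)
    m, n = len(f) - 1, len(g) - 1
    if n < 0 or m < n:
        raise ValueError("need g != 0 and deg f >= deg g")
    b = g[-1]
    q, r = [0] * (m - n + 1), list(f)
    for k in range(m - n, -1, -1):
        lead = r[n + k]
        q = [b * c for c in q]       # scale everything by lc(g) ...
        r = [b * c for c in r]
        q[k] += lead                 # ... then cancel the term of degree n + k
        for i, c in enumerate(g):
            r[i + k] -= lead * c
    return trim(q), trim(r)

def poly_mul(p, q):
    if not p or not q:
        return []
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, c in enumerate(q):
            out[i + j] += a * c
    return trim(out)

def poly_add(p, q):
    n = max(len(p), len(q))
    return trim([(p[i] if i < len(p) else 0) + (q[i] if i < len(q) else 0)
                 for i in range(n)])

f = [5, 1, 1, 3]    # 3x**3 + x**2 + x + 5
g = [1, -3, 5]      # 5x**2 - 3x + 1
q, r = pseudo_divide(f, g)
scaled_f = [g[-1] ** (len(f) - len(g) + 1) * c for c in f]
```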
On the other hand, each polynomial can be written in a unique form
up to a unit factor, where and are primitive polynomials. In this case, is called the content, is called the primitive part of . The uniqueness of the form can be achieved by the normalisation of units. For example, in the case of integers, we always choose the positive ones from the equivalence classes of .
The following algorithm performs a series of pseudo-divisions. The algorithm uses the function , which computes the pseudo-remainder, and it assumes that we can calculate greatest common divisors in , contents and primitive parts in . The input is , where is a UFD. The output is the polynomial .
Primitive-Euclidean(
)
1 2 3WHILE
4DO
5 6 7 8 9RETURN
The operation of the algorithm is illustrated by Figure 6.3. The running time of the Primitive-Euclidean
algorithm is the same as the running time of the previous versions of the Euclidean algorithm.
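A possible Python sketch of this scheme for polynomials with integer coefficients is given below. It follows the description in the text: take primitive parts, iterate pseudo-remainders while making each one primitive, and finally multiply by the greatest common divisor of the two contents. The function names are hypothetical, coefficient lists are lowest degree first, and the sign is normalised by a positive leading coefficient.

```python
from functools import reduce
from math import gcd

def trim(p):
    p = list(p)
    while p and p[-1] == 0:
        p.pop()
    return p

def content(p):
    """Non-negative gcd of all coefficients."""
    return reduce(gcd, p, 0)

def primitive_part(p):
    p = trim(p)
    if not p:
        return []
    c = content(p)
    if p[-1] < 0:
        c = -c                      # make the leading coefficient positive
    return [x // c for x in p]

def prem(f, g):
    """Pseudo-remainder of f by g; returns f unchanged if deg f < deg g."""
    f, g = trim(f), trim(g)
    m, n = len(f) - 1, len(g) - 1
    if m < n:
        return f
    b, r = g[-1], list(f)
    for k in range(m - n, -1, -1):
        lead = r[n + k]
        r = [b * c for c in r]
        for i, c in enumerate(g):
            r[i + k] -= lead * c
    return trim(r)

def primitive_euclidean(a, b):
    c, d = primitive_part(a), primitive_part(b)
    while d:
        c, d = d, primitive_part(prem(c, d))
    gamma = gcd(content(a), content(b))
    return [gamma * x for x in c]

a = [-6, 0, 6]     # 6x**2 - 6 = 6(x - 1)(x + 1)
b = [4, 8, 4]      # 4x**2 + 8x + 4 = 4(x + 1)**2
```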
Figure 6.3. The illustration of the operation of the Primitive-Euclidean
algorithm with input . The first two lines of the program compute the primitive parts of the polynomials. The loop between lines and is executed three times, the table shows the values of , and in the iterations. In line , variable equals . The Primitive-Euclidean(
)
algorithm returns as result.
The Primitive-Euclidean
algorithm is very important because the ring of multivariate polynomials is a UFD, so we apply the algorithm recursively, e.g. in , using computations in the UFDs . In other words, the recursive view of multivariate polynomial rings leads to the recursive application of the Primitive-Euclidean
algorithm in a straightforward way.
We may note that, like above, the algorithm shows a growth in the coefficients.
Let us take a detailed look at the UFD . The bound on the size of the coefficients of the greatest common divisor is given by the following theorem, which we state without proof.
Theorem 6.3 (Landau-Mignotte) Let , , and . Then
Corollary 6.4 With the notations of the previous theorem, the absolute value of any coefficient of the polynomial is smaller than
Proof. The greatest common divisor of and obviously divides both and , and its degree is at most the minimum of their degrees. Furthermore, the leading coefficient of the greatest common divisor divides and , so it also divides .
Example 6.6 Corollary 6.4 implies that the absolute value of the coefficients of the greatest common divisor is at most for the polynomials (6.4), (6.5), and at most for the polynomials (6.7) and (6.8).
The following method describes the necessary and sufficient conditions for the common roots of (6.1) and (6.2) in the most general context. As a further advantage, it can be applied to solve algebraic equation systems of higher degree.
Let be an integral domain and its field of fractions. Let us consider the smallest extension of over which both of (6.1) and of (6.2) splits into linear factors. Let us denote the roots (in ) of the polynomial by , and the roots of by . Let us form the following product:
It is obvious that equals to if and only if for some and , that is, and have a common root. The product is called the resultant of the polynomials and . Note that the value of the resultant depends on the order of and , but the resultants obtained in the two ways can only differ in sign.
Evidently, this form of the resultant cannot be applied in practice, since it presumes that the roots are known. Let us examine the different forms of the resultant. Since
hence,
Thus,
Although it looks a lot more friendly, this form still requires the roots of at least one polynomial. Next we examine how the resultant may be expressed only in terms of the coefficients of the polynomials. This leads to the Sylvester form of the resultant.
Let us presume that polynomial in (6.1) and polynomial in (6.2) have a common root. This means that there exists a number such that
Multiply these equations by the numbers , and , respectively. We get equations from the first one and from the second one. Consider these equations as a homogeneous system of linear equations in indeterminates. This system has the obviously non-trivial solution . It is a well-known fact that a homogeneous system with as many equations as indeterminates has non-trivial solutions if and only if its determinant is zero. We get that and can only have common roots if the determinant
equals to (there are 0s everywhere outside the dotted areas). Thus, a necessary condition for the existence of common roots is that the determinant of order is 0. Below we prove that equals to the resultant of and , hence, is also a sufficient condition for common roots. The determinant (6.9) is called the Sylvester form of the resultant.
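The determinant criterion can be tried out with a short Python sketch. Coefficient lists are assumed to be highest degree first here, so that the rows of shifted coefficients match the usual layout of the Sylvester matrix; the determinant is computed exactly with rational arithmetic. The function names are hypothetical.

```python
from fractions import Fraction

def sylvester_matrix(f, g):
    """Sylvester matrix of f (degree m) and g (degree n), of order m + n."""
    m, n = len(f) - 1, len(g) - 1
    rows = []
    for i in range(n):                     # n shifted copies of the coefficients of f
        rows.append([0] * i + list(f) + [0] * (n - 1 - i))
    for i in range(m):                     # m shifted copies of the coefficients of g
        rows.append([0] * i + list(g) + [0] * (m - 1 - i))
    return rows

def det(mat):
    """Exact determinant by Gaussian elimination over the rationals."""
    a = [[Fraction(x) for x in row] for row in mat]
    n, d = len(a), Fraction(1)
    for i in range(n):
        p = next((k for k in range(i, n) if a[k][i] != 0), None)
        if p is None:
            return Fraction(0)
        if p != i:
            a[i], a[p] = a[p], a[i]
            d = -d                         # a row swap changes the sign
        d *= a[i][i]
        for k in range(i + 1, n):
            factor = a[k][i] / a[i][i]
            for j in range(i, n):
                a[k][j] -= factor * a[i][j]
    return d

def resultant(f, g):
    return det(sylvester_matrix(f, g))

common = resultant([1, -3, 2], [1, 2, -3])   # both polynomials vanish at x = 1
coprime = resultant([1, 0, 1], [1, -1])      # x**2 + 1 and x - 1
```

The cubic cost of the elimination is the reason the text later prefers variants of the Euclidean algorithm for larger degrees.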
Theorem 6.5 Using the above notation
Proof. We will proceed by induction on . If , then , so the right-hand side is . The left-hand side is a determinant of order with everywhere in the diagonal, and 0 everywhere else. Thus, , so the statement is true. In the following, presume that and the statement is true for . If we take the polynomial
instead of , then and fulfil the condition:
Since , the coefficients of and satisfy
Thus,
We transform the determinant in the following way: add times the first column to the second column, then add times the new second column to the third column, etc. This way the -s disappear from the first lines, so the first lines of and the transformed are identical. In the last rows, subtract times the second one from the first one, and similarly, always subtract times a row from the row right above it. In the end, becomes
Using the last row for expansion, we get , which implies by the induction hypothesis.
We get that , that is, polynomials and have a common root in if and only if determinant vanishes.
From an algorithmic point of view, the computation of the resultant in Sylvester form for higher degree polynomials means the computation of a large determinant. The following theorem implies that pseudo-division may simplify the computation.
Theorem 6.6 For the polynomials of (6.1) and of (6.2), in case of
Proof. Multiply the first line of the determinant (6.9) by . Let and be the uniquely determined polynomials with
where . Then multiplying row of the resultant by , row by etc., and subtracting them from the first row we get the determinant
Here is in the th column of the first row, and is in the th column of the first row.
Similarly, multiply the second row by , then multiply rows by etc., and subtract them from the second row. Continue the same way for the third, th row. The result is
After reordering the rows
Note that
thus,
and therefore
Equation (6.10) describes an important relationship. Instead of computing the possibly gigantic determinant , we perform a series of pseudo-divisions and apply (6.10) in each step. We calculate the resultant only when no more pseudo-division can be done. An important consequence of the theorem is the following corollary.
Corollary 6.7 There exist polynomials such that , with , .
Proof. Multiply the th column of the determinant form of the resultant by and add it to the last column for . Then
Using the last column for expansion and factoring and , we get the statement with the restrictions on the degrees.
The most important benefit of the resultant method, compared to the previously discussed methods, is that the input polynomials may contain symbolic coefficients as well.
Then the existence of common rational roots of and cannot be decided by variants of the Euclidean algorithm, but we can decide it with the resultant method. Such a root exists if and only if
that is, when or .
The significance of the resultant is not only that we can decide the existence of common roots of polynomials, but also that using it we can reduce the solution of algebraic equation systems to solving univariate equations.
Consider polynomials and as elements of . They have a common root if and only if
Common roots in can exist for . For each , we substitute into equations (6.11) and (6.12) (already in ) and get that the integer solutions are .
We note that the resultant method can also be applied to solve polynomial equations in several variables, but it is not really efficient. One problem is that computational space explosion occurs in the computation of the determinant. Note that computing the resultant of two univariate polynomials in determinant form using the usual Gaussian elimination requires operations, while the variants of the Euclidean algorithm are quadratic. The other problem is that computational complexity depends strongly on the order of the indeterminates. Eliminating all variables together in a polynomial equation system is much more efficient. This leads to the introduction of multivariate resultants.
All methods considered so far for the existence and calculation of common roots of polynomials are characterised by an explosion of computational space. The natural question arises: can we apply modular techniques? Below we examine the case with . Let us consider the polynomials (6.4), (6.5) and let be a prime number. Then the series of remainders in in the Classical-Euclidean algorithm is
We get that polynomials and are relatively prime in . The following theorem describes the connection between greatest common divisors in and .
Theorem 6.8 Let . Let be a prime such that and . Let furthermore , , and . Then
(1) ,
(2) if , then .
Proof. (1): Since and , thus . So
By the hypothesis , which implies
(2): Since and is non-trivial,
If , then the right-hand side of (6.13) is non-trivial, thus . But the resultant is the sum of the corresponding products of the coefficients, so , contradiction.
Corollary 6.9 There are at most a finite number of primes such that , and .
In case statement (1) of Theorem 6.8 is fulfilled, we call a “lucky prime”. We can outline a modular algorithm for the computation of the gcd.
Modular-Gcd-Bigprime( )
1 the Landau-Mignotte constant (from Corollary 6.4)
2
3 WHILE TRUE
4 DO a prime with , , and
5
6 IF and
7 THEN RETURN
8 ELSE
The first line of the algorithm requires the calculation of the Landau-Mignotte bound. The fourth line requires a “sufficiently large” prime which does not divide the leading coefficient of and . The fifth line computes the greatest common divisor of polynomials and modulo (for example with the Classical-Euclidean algorithm in ). We store the coefficients of the resulting polynomials with symmetrical representation. The sixth line examines whether and are fulfilled, in which case is the required greatest common divisor. If this is not the case, then is an “unlucky prime”, so we choose another prime. Since, by Theorem 6.8, there are only finitely many “unlucky primes”, the algorithm eventually terminates. If the primes are chosen according to a given strategy, set is not needed.
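The fifth and sixth lines can be illustrated by a short sketch. It is not the full algorithm: the prime is supplied by the caller instead of being derived from the Landau-Mignotte bound, and the candidate is assumed to be monic, which suffices for monic inputs. The polynomials below are sample data chosen for this sketch, not the polynomials (6.4), (6.5), and the function names are hypothetical.

```python
def normalize(a, p):
    """Reduce coefficients modulo p and drop leading zeros (highest degree first)."""
    a = [c % p for c in a]
    while a and a[0] == 0:
        a.pop(0)
    return a

def symmetric(c, p):
    """Representative of c modulo p with the smallest absolute value."""
    c %= p
    return c - p if c > p // 2 else c

def rem_mod(a, b, p):
    """Remainder of a on division by b in Z_p[x]; b must be non-zero modulo p."""
    a, b = normalize(a, p), normalize(b, p)
    inv = pow(b[0], -1, p)  # modular inverse, needs Python 3.8 or later
    while len(a) >= len(b):
        q = a[0] * inv % p
        for i in range(len(b)):
            a[i] = (a[i] - q * b[i]) % p
        a = normalize(a, p)
    return a

def gcd_mod(a, b, p):
    """Monic gcd of a and b in Z_p[x], in symmetric representation."""
    a, b = normalize(a, p), normalize(b, p)
    while b:
        a, b = b, rem_mod(a, b, p)
    inv = pow(a[0], -1, p)
    return [symmetric(c * inv, p) for c in a]

def divides(c, a):
    """True if the monic integer polynomial c divides a over the integers."""
    a = list(a)
    while len(a) >= len(c):
        q = a[0]
        for i in range(len(c)):
            a[i] -= q * c[i]
        a.pop(0)  # the leading coefficient has just been cancelled
    return not any(a)

a = [1, -6, 3, 10]   # (x - 2)(x + 1)(x - 5)
b = [1, 3, -6, -8]   # (x - 2)(x + 1)(x + 4)
for p in (3, 13):
    c = gcd_mod(a, b, p)
    print(p, c, divides(c, a) and divides(c, b))
```

With the prime 3 the modular gcd has degree 3 and fails the divisibility test, so 3 is an unlucky prime for this input; with 13 the candidate `[1, -1, -2]`, that is, x^2 - x - 2, divides both inputs and is accepted.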
The disadvantage of the Modular-Gcd-Bigprime algorithm is that the Landau-Mignotte constant grows exponentially in terms of the degree of the input polynomials, so we have to work with large primes. The question is how we could modify the algorithm so that we can work with “many small primes”. Since the greatest common divisor in is only unique up to a constant factor, we have to be careful with the coefficients of the polynomials in the new algorithm. So, before applying the Chinese remainder theorem for the coefficients of the modular greatest common divisors taken modulo different primes, we have to normalise the leading coefficient of . If and are the leading coefficients of and , then the leading coefficient of divides . Therefore, we normalise the leading coefficient of to in case of primitive polynomials and ; and finally take the primitive part of the resulting polynomial. Just like in the Modular-Gcd-Bigprime algorithm, modular values are stored with symmetrical representation. These observations lead to the following modular gcd algorithm using small primes.
Modular-Gcd-Smallprimes( )
1
2 a prime such that
3
4
5
6
7
8 WHILE true
9 DO IF
10 THEN IF
11 THEN RETURN
12
13 IF
14 THEN
15 IF and
16 THEN RETURN
17 a prime such that and
18
19
20
21 IF
22 THEN
23 IF
24 THEN IF
25 THEN Coeff-Build( )
26 IF
27 THEN
28 ELSE
29
30
Coeff-Build( )
1
2
3 FOR DOWNTO
4 DO
5
6
7 RETURN
We may note that the algorithm Modular-Gcd-Smallprimes does not require as many small primes as the Landau-Mignotte bound tells us. When the value of polynomial does not change during a few iterations, we test in lines 13–16 if is a greatest common divisor. The number of these iterations is stored in the variable of line 6. Note that the value of could vary according to the input polynomial. The primes used in the algorithms are preferably chosen from an (architecture-dependent) prestored list containing primes that fit in a machine word, so the use of set becomes unnecessary. Corollary 6.9 implies that the Modular-Gcd-Smallprimes algorithm always terminates.
The Coeff-Build algorithm computes the solution of the equation system obtained by taking congruence relations modulo and for the coefficients of identical degree in the input polynomials and . This is done according to the Chinese remainder theorem. It is very important to store the results in symmetrical modular representation form.
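The coefficient-wise use of the Chinese remainder theorem can be sketched as follows. The function names and the argument order are assumptions made for this illustration; the two moduli must be relatively prime, and both coefficient lists must have the same length.

```python
def symmetric(c, m):
    """Representative of c modulo m with the smallest absolute value."""
    c %= m
    return c - m if c > m // 2 else c

def crt_pair(r1, m1, r2, m2):
    """Solve x = r1 (mod m1), x = r2 (mod m2) for coprime m1, m2;
    the result is given modulo m1*m2 in symmetric representation."""
    inv = pow(m1, -1, m2)          # modular inverse, needs Python 3.8 or later
    t = (r2 - r1) * inv % m2
    return symmetric(r1 + m1 * t, m1 * m2)

def coeff_build(c1, m1, c2, m2):
    """Combine two modular images of a polynomial coefficient by coefficient."""
    return [crt_pair(x, m1, y, m2) for x, y in zip(c1, c2)]

# Images of x^2 - 7x + 10 modulo 5 and modulo 7, both in symmetric form.
print(coeff_build([1, -2, 0], 5, [1, 0, 3], 7))
```

The symmetric representation is what allows the negative coefficient -7 to be recovered; with representatives between 0 and 34 the result would contain 28 instead.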
Example 6.9 Let us examine the operation of the Modular-Gcd-Smallprimes algorithm for the previously seen polynomials (6.4), (6.5). For simplicity, we calculate with small primes. Recall that
After the execution of the first six lines of the algorithm with , we have and . Since due to line 7, lines 10–12 are executed. Polynomial is not zero, so , and will be the values after the execution. The condition in line 13 is not fulfilled, so we choose another prime; is a bad choice, but is allowed. According to lines 19–20, , . Since , we have and lines 25–30 are not executed. Polynomial is constant, so the return value in line 11 is 1, which means that polynomials and are relatively prime.
Example 6.10 In our second example, consider the already discussed polynomials
Let again . After the first six lines of the algorithm, we have . After the execution of lines 10–12, we have . Let the next prime be . So the new values are . Since , and the new value of is after lines 25–30. The value of the variable is still 1. Let the next prime be 11. Then . Polynomials and have the same degree, so we modify the coefficients of . Then and since , we get and . Let the new prime be 13. Then . The degrees of and are still equal, thus lines 25–30 are executed and the variables become . After the execution of lines 17–18, it turns out that and , so is the greatest common divisor.
We give the following theorem without proof.
Theorem 6.10 The Modular-Gcd-Smallprimes algorithm works correctly. The computational complexity of the algorithm is machine word operations, where , and is the Landau-Mignotte bound for polynomials and .
Exercises
6.2-1 Let be a commutative ring with identity element, , , furthermore, a unit, . The following algorithm performs Euclidean division for and and outputs polynomials for which and or holds.
Euclidean-Division-Univariate-Polynomials( )
1
2 FOR DOWNTO
3 DO IF
4 THEN
5
6 ELSE
7 and
8 RETURN
Prove that the algorithm uses at most
operations in .
6.2-2 What is the difference between the algorithms Extended-Euclidean and Extended-Euclidean-Normalised in ?
6.2-3 Prove that .
6.2-4 The discriminant of polynomial (, ) is the element
where denotes the derivative of with respect to . Polynomial has a multiple root if and only if its discriminant is 0. Compute for general polynomials of second and third degree.
Let be a field and be a multivariate polynomial ring in variables over . Let . First we determine a necessary and sufficient condition for the polynomials having common roots in . We can see that the problem is a generalisation of the case from the previous subsection. Let
denote the ideal generated by polynomials . Then the polynomials form a basis of ideal . The variety of an ideal is the set
The knowledge of the variety means that we also know the common roots of . The most important questions about the variety and ideal are as follows.
How “big” is ?
Given , in which case is ?
?
Fortunately, in a special basis of ideal , in the so-called Gröbner basis, these questions are easy to answer. First let us study the case . Since is a Euclidean ring,
We may assume that . Let and divide by with remainder. Then there exist unique polynomials with and . Hence,
Moreover, if are the distinct linear factors of . Unfortunately, equality (6.14) is not true in case of two or more variables. Indeed, a multivariate polynomial ring over an arbitrary field is not necessarily Euclidean, therefore we have to find a new interpretation of division with remainder. We proceed in this direction.
Recall that a partial order is a total order (or simply order) if either or for all . The total order is allowable if
(i) for all ,
(ii) for all .
It is easy to prove that any allowable order on is a well-order (namely, every nonempty subset of has a least element). With the notation already adopted consider the set
The elements of are called monomials. Observe that is closed under multiplication in , constituting a commutative monoid. The map , is an isomorphism, therefore, for an allowable total order on , we have that
(i) for all ,
(ii)
The allowable orders on are called monomial orders. If , the natural order is a monomial order, and the corresponding univariate monomials are ordered by their degree. Let us see some standard examples of higher degree monomial orders. Let
where the variables are ordered as .
Pure lexicographic order.
and .
Graded lexicographic order.
or ( and ).
Graded reverse lexicographic order.
or ( and and ).
The proof that these orders are monomial orders is left as an exercise. Observe that if , then . The graded reverse lexicographic order is often called a total degree order and it is denoted by .
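The three orders can be compared on exponent vectors with a few lines of code. The sketch below assumes the usual textbook conventions: the first variable is the largest, and in the graded reverse lexicographic order, among monomials of equal total degree, the one with the smaller exponent in the last differing variable is the larger. The key functions are named here for illustration only.

```python
def lex_key(alpha):
    """Pure lexicographic order: compare exponents from the first variable on."""
    return tuple(alpha)

def grlex_key(alpha):
    """Graded lexicographic order: total degree first, ties broken by lex."""
    return (sum(alpha),) + tuple(alpha)

def grevlex_key(alpha):
    """Graded reverse lexicographic order: total degree first; ties go to the
    monomial with the smaller exponent in the last differing variable."""
    return (sum(alpha),) + tuple(-a for a in reversed(alpha))

# Exponent vectors of x^3, x^2 z^2, x y^2 z and z^2 in the variables x > y > z.
monomials = [(3, 0, 0), (2, 0, 2), (1, 2, 1), (0, 0, 2)]
for key in (lex_key, grlex_key, grevlex_key):
    print(key.__name__, sorted(monomials, key=key, reverse=True))
```

The three outputs differ: the lexicographic order puts x^3 first, while the two graded orders start with the monomials of degree 4 and disagree on which of x^2 z^2 and x y^2 z is larger.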
Example 6.11 Let and let . Then
Let and again, . Then
Let a monomial order be given. Furthermore, we identify the vector with the monomial . Let be a non-zero polynomial, . Then are the terms of polynomial , is the multidegree of the polynomial (where the maximum is with respect to the monomial order), is the leading coefficient of , is the leading monomial of , and is the leading term of . Let and .
Example 6.12 Consider the polynomial . Let and . Then
If and , then
In this subsection, our aim is to give an algorithm for division with remainder in . Given multivariate polynomials and monomial order , we want to compute the polynomials and such that and no monomial in is divisible by any of .
Multivariate-Division-with-Remainder( )
1
2
3 FOR TO
4 DO
5 WHILE
6 DO IF divides for some
7 THEN choose such an and
8
9 ELSE
10
11 RETURN and
The correctness of the algorithm follows from the fact that in every iteration of the while cycle of lines 5–10, the following invariants hold
(i) and ,
(ii) for all ,
(iii) no monomial in is divisible by any .
The algorithm has a weakness, namely, the multivariate division with remainder is not deterministic. In line 7, we can choose arbitrarily from the appropriate values of .
Example 6.13 Let , , , the monomial order , , and in line 7, we always choose the smallest possible . Then the result of the algorithm is , , . But if we change the order of the functions and (that is, and ), then the output of the algorithm is , and .
As we have seen in the previous example, we can make the algorithm deterministic by always choosing the smallest possible in line 7. In this case, the quotients and the remainder are unique, which we can express as .
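The deterministic variant (always taking the first divisor whose leading term divides the current leading term) can be sketched as follows. The representation is an assumption of this sketch: a polynomial is a dictionary mapping exponent tuples to coefficients, and the monomial order is passed as a key function whose default is the pure lexicographic order. The sample polynomials are a standard textbook instance and are not taken from Example 6.13.

```python
from fractions import Fraction

def leading_term(p, key):
    """Exponent vector and coefficient of the largest monomial of p."""
    m = max(p, key=key)
    return m, p[m]

def add_term(p, mono, coeff):
    """Add coeff * x^mono to p in place, dropping cancelled terms."""
    c = p.get(mono, 0) + coeff
    if c == 0:
        p.pop(mono, None)
    else:
        p[mono] = c

def divide(f, divisors, key=lambda m: m):
    """Return (quotients, remainder) of f on division by the list of divisors."""
    p = dict(f)
    quotients = [dict() for _ in divisors]
    r = {}
    while p:
        mono, coeff = leading_term(p, key)
        for q, g in zip(quotients, divisors):
            g_mono, g_coeff = leading_term(g, key)
            if all(a >= b for a, b in zip(mono, g_mono)):
                # lt(g) divides lt(p): record the quotient term, cancel lt(p)
                t_mono = tuple(a - b for a, b in zip(mono, g_mono))
                t_coeff = Fraction(coeff) / g_coeff
                add_term(q, t_mono, t_coeff)
                for m, c in g.items():
                    shifted = tuple(a + b for a, b in zip(m, t_mono))
                    add_term(p, shifted, -t_coeff * c)
                break
        else:
            # no leading term divides lt(p): move it to the remainder
            add_term(r, mono, coeff)
            del p[mono]
    return quotients, r

# f = x^2 y + x y^2 + y^2, f1 = x y - 1, f2 = y^2 - 1, variables x > y.
f = {(2, 1): 1, (1, 2): 1, (0, 2): 1}
f1 = {(1, 1): 1, (0, 0): -1}
f2 = {(0, 2): 1, (0, 0): -1}
print(divide(f, [f1, f2]))
print(divide(f, [f2, f1]))
```

The two calls return different remainders (x + y + 1 and 2x + 1), which illustrates that the result depends on the order of the divisors unless they form a Gröbner basis.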
Observe that if , then the algorithm gives the answer to the ideal membership problem: if and only if the remainder is zero. Unfortunately, if , then this is not true anymore. For example, with the monomial order
and the quotients are , . On the other hand, , which shows that .
Our next goal is to find a special basis for an arbitrary polynomial ideal such that the remainder on division by that basis is unique, which gives the answer to the ideal membership problem. But does such a basis exist at all? And if it does, is it finite?
The ideal is a monomial ideal if there exists a subset such that
that is, ideal is generated by monomials.
Lemma 6.11 Let be a monomial ideal, and . Then
Proof. The direction is obvious. Conversely, let and such that . Then the sum has at least one member which contains , therefore .
The most important consequence of the lemma is that two monomial ideals are identical if and only if they contain the same monomials.
Lemma 6.12 (Dickson's lemma) Every monomial ideal is finitely generated, namely, for every , there exists a finite subset such that .
Lemma 6.13 Let be an ideal in . If is a finite subset such that , then .
Proof. Let . If is an arbitrary polynomial, then division with remainder gives , with , such that either or no term of is divisible by the leading term of any . But , hence, . This, together with Lemma 6.11, implies that , therefore .
Together with Dickson's lemma applied to , and the fact that the zero polynomial generates the zero ideal, we obtain the following famous result.
Theorem 6.14 (Hilbert's basis theorem) Every ideal is finitely generated, namely, there exists a finite subset such that and .
Corollary 6.15 (ascending chain condition) Let be an ascending chain of ideals in . Then there exists an such that .
Proof. Let . Then is an ideal, which is finitely generated by Hilbert's basis theorem. Let . With , we have .
A ring satisfying the ascending chain condition is called Noetherian. Specifically, if is a field, then is Noetherian.
Let be a monomial order on and an ideal. A finite set is a Gröbner basis of ideal with respect to if . Hilbert's basis theorem implies the following corollary.
Corollary 6.16 Every ideal in has a Gröbner basis.
It is easy to show that the remainder on division by the Gröbner basis does not depend on the order of the elements of G. Therefore, we can use the notation . Using the Gröbner basis, we can easily answer the ideal membership problem.
Theorem 6.17 Let be a Gröbner basis of ideal with respect to a monomial order and let . Then .
Proof. We prove that there exists a unique such that (1) , (2) no term of is divisible by any monomial of . The existence of such an comes from division with remainder. For the uniqueness, let for arbitrary and suppose that no term of or is divisible by any monomial of . Then , and by Lemma 6.11, is divisible by for some . This means that .
Thus, if is a Gröbner basis of , then for all ,
Unfortunately, Hilbert's basis theorem is not constructive, since it does not tell us how to compute a Gröbner basis for an ideal and basis . In the following, we investigate how the finite set can fail to be a Gröbner basis for an ideal .
Let be nonzero polynomials, , , and . The S-polynomial of and is
It is easy to see that , moreover, since , therefore .
The following theorem yields an easy method to test whether a given set is a Gröbner basis of the ideal .
Theorem 6.18 The set is the Gröbner basis of the ideal if and only if
Using the -polynomials, it is easy to give an algorithm for constructing the Gröbner basis. We present a simplified version of Buchberger's method: given a monomial order and polynomials , the algorithm yields a Gröbner basis of the ideal .
Gröbner-basis( )
1
2
3 WHILE
4 DO an arbitrary pair from
5
6
7 IF
8 THEN
9
10 RETURN
First we show the correctness of the Gröbner-basis algorithm assuming that the procedure terminates. At any stage of the algorithm, set is a basis of ideal , since initially it is, and at any other step only those elements are added to that are remainders of the -polynomials on division by . If the algorithm terminates, the remainders of all -polynomials on division by are zero, and by Theorem 6.18, is a Gröbner basis.
Next we show that the algorithm terminates. Let and be the sets corresponding to successive iterations of the while cycle (lines 3–9). Clearly, and . Hence, ideals in successive iteration steps form an ascending chain, which stabilises by Corollary 6.15. Thus, after a finite number of steps, we have . We claim that in this case. Let and . Then and either or , and using the definition of the remainder, we conclude that .
by step one, and by step two.
At the first iteration of the while cycle, let us choose the pair . Then , and . Therefore, and .
At the second iteration of the while cycle, let us choose the pair . Then , , , hence, and .
At the third iteration of the while cycle, let us choose the pair . Then , , .
At the fourth iteration, let us choose the pair . Then , , .
In the same way, the remainders of the -polynomials of all the remaining pairs on division by are zero; hence, the algorithm returns , which constitutes a Gröbner basis.
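The simplified Buchberger procedure can be sketched in a few dozen lines. The sketch rests on several assumptions that are not fixed by the text: polynomials are dictionaries from exponent tuples to coefficients, the monomial order is a key function (pure lexicographic by default), and the S-polynomial is taken with the usual normalisation in which both leading terms are scaled to the least common multiple of the leading monomials with coefficient 1. The input below is sample data, not the ideal of Example 6.14.

```python
from fractions import Fraction
from itertools import combinations

def lt(p, key):
    """Exponent vector and coefficient of the leading term of p."""
    m = max(p, key=key)
    return m, p[m]

def add_scaled(p, g, mono, coeff):
    """Return p + coeff * x^mono * g as a new polynomial."""
    out = dict(p)
    for m, c in g.items():
        mm = tuple(a + b for a, b in zip(m, mono))
        v = out.get(mm, 0) + coeff * c
        if v == 0:
            out.pop(mm, None)
        else:
            out[mm] = v
    return out

def remainder(f, G, key):
    """Remainder of f on division by the list G (first applicable divisor)."""
    p, r = dict(f), {}
    while p:
        mono, coeff = lt(p, key)
        for g in G:
            gm, gc = lt(g, key)
            if all(a >= b for a, b in zip(mono, gm)):
                shift = tuple(a - b for a, b in zip(mono, gm))
                p = add_scaled(p, g, shift, -Fraction(coeff) / gc)
                break
        else:
            r[mono] = coeff   # lt(p) is reducible by nothing in G
            del p[mono]
    return r

def s_polynomial(f, g, key):
    """S-polynomial of f and g: the leading terms cancel by construction."""
    fm, fc = lt(f, key)
    gm, gc = lt(g, key)
    lcm = tuple(max(a, b) for a, b in zip(fm, gm))
    s = add_scaled({}, f, tuple(a - b for a, b in zip(lcm, fm)), Fraction(1) / fc)
    return add_scaled(s, g, tuple(a - b for a, b in zip(lcm, gm)), -Fraction(1) / gc)

def groebner_basis(F, key=lambda m: m):
    """Simplified Buchberger procedure; returns a (not necessarily reduced) basis."""
    G = [dict(f) for f in F]
    pairs = list(combinations(range(len(G)), 2))
    while pairs:
        i, j = pairs.pop()
        r = remainder(s_polynomial(G[i], G[j], key), G, key)
        if r:
            pairs.extend((k, len(G)) for k in range(len(G)))
            G.append(r)
    return G

# Ideal generated by x y - 1 and y^2 - 1, variables x > y, lexicographic order.
F = [{(1, 1): 1, (0, 0): -1}, {(0, 2): 1, (0, 0): -1}]
G = groebner_basis(F)
print(G)
# Ideal membership test: x^2 - 1 lies in the ideal, so its remainder is zero.
print(remainder({(2, 0): 1, (0, 0): -1}, G, key=lambda m: m))
```

For this input a single new element, x - y, is added to the generators. No pair selection strategy or elimination of superfluous pairs is attempted, so the sketch is only suitable for very small examples.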
In general, the Gröbner basis computed by Buchberger's algorithm is neither unique nor minimal. Fortunately, both can be achieved by a little finesse.
Lemma 6.19 If is a Gröbner basis of , and , then is a Gröbner basis of as well.
We say that the set is a minimal Gröbner basis for ideal if it is a Gröbner basis, and for all ,
,
.
An element of a Gröbner basis is said to be reduced with respect to if no monomial of is in the ideal . A minimal Gröbner basis for is reduced if all of its elements are reduced with respect to .
Theorem 6.20 Every ideal has a unique reduced Gröbner basis.
Example 6.15 In Example 6.14 not only but also is a Gröbner basis. It is not hard to show that is a reduced Gröbner basis.
The last forty years (since Buchberger's dissertation) have not been enough to clear up entirely the algorithmic complexity of Gröbner basis computation. Implementation experiments show that we are faced with the intermediate expression swell phenomenon. Starting with a few polynomials of low degree and small coefficients, the algorithm produces a large number of polynomials with huge degrees and enormous coefficients. Moreover, in contrast to the variants of the Euclidean algorithm, the explosion cannot be kept under control. In , Kühnle and Mayr gave an exponential space algorithm for computing a reduced Gröbner basis. The polynomial ideal membership problem over is EXPSPACE-complete.
Let be polynomials over a field with (). If , then
for polynomials whose degrees are bounded by . The double exponential bound is essentially unavoidable, which is shown by several examples. Unfortunately, in case , the ideal membership problem falls into this category. Fortunately, in special cases better results are available. If (Hilbert's famous Nullstellensatz), then in case , the bound is , while for , the bound is . But the variety is empty if and only if , therefore the solvability problem of a polynomial system is in PSPACE. Several results state that under specific circumstances, the (general) ideal membership problem is also in PSPACE. Such a criterion is for example that is zero-dimensional (contains finitely many isolated points).
In spite of the exponential complexity, there are many success stories in the application of Gröbner bases: geometric theorem proving, robot kinematics and motion planning, and solving polynomial systems of equations are the most widespread application areas. In the following, we enumerate some topics where the Gröbner basis strategy has been applied successfully.
Equivalence of polynomial equations. Two sets of polynomials generate the same ideal if and only if their reduced Gröbner bases are equal with respect to an arbitrary monomial order.
Solvability of polynomial equations. The polynomial system of equations is solvable if and only if .
Finitely many solutions of polynomial equations. The polynomial system of equations has a finite number of solutions if and only if in any Gröbner basis of for every variable , there is a polynomial such that its leading term with respect to the chosen monomial order is a power of .
The number of solutions. Suppose that the system of polynomial equations has a finite number of solutions. Then the number of solutions counted with multiplicities is equal to the cardinality of the set of monomials that are not multiples of the leading monomials of the polynomials in the Gröbner basis, where any monomial order can be chosen.
Simplification of expressions.
We show an example for the last item.
Example 6.16 Let be given such that
Compute the value of . So let and be elements of and let , . Then the Gröbner basis of is
Since , the answer to the question follows.
Exercises
6.3-1 Prove that the orders and are monomial orders.
6.3-2 Let be a monomial order on , . Prove the following:
a. ,
b. if , then , where equality holds if .
6.3-3 Let .
a. Determine the order of the monomials in for the monomial orders , and with in all cases.
b. For each of the three monomial orders from (a), determine , , and .
6.3-4 Prove Dickson's lemma.
6.3-5 Compute the Gröbner basis and the reduced Gröbner basis of the ideal using the monomial order , where . Which of the following polynomials belong to : ?
The problem of indefinite integration is the following: given a function , find a function the derivative of which is , that is, ; for this relationship the notation is also used. In introductory calculus courses, one tries to solve indefinite integration problems by different methods, among which one tries to choose in a heuristic way: substitution, trigonometric substitution, integration by parts, etc. Only the integration of rational functions is usually solved by an algorithmic method.
It can be shown that indefinite integration in the general case is algorithmically unsolvable. So we only have the possibility to look for a reasonably large part that can be solved using algorithms.
The first step is the algebraisation of the problem: we discard every analytical concept and consider differentiation as a new (unary) algebraic operation connected to addition and multiplication in a given way, and we try to find the “inverse” of this operation. This approach leads to the introduction of the concept of differential algebra.
The integration routines of computer algebra systems (e.g. MAPLE), similarly to us, first try a few heuristic methods. Integrals of polynomials (or a bit more generally, finite Laurent-series) are easy to determine. This is followed by a simple table lookup process (e.g. in case of MAPLE, 35 basic integrals are used). One can, of course, use integral tables entered from books as well. Next we may look for special cases where appropriate methods are known. For example, for integrands of the form
where is a polynomial, the integral can be determined using integration by parts. When the above methods fail, a form of substitution called the “derivative-divides” method is tried: if the integrand is a composite expression, then for every sub-expression , we divide by the derivative of , and check if vanishes from the result after the substitution . Using these simple methods we can determine a surprisingly large number of integrals. To their great advantage, they can solve simple problems very quickly. If they do not succeed, we try algorithmic methods. The first one is for the integration of rational functions. As we will see, there is a significant difference between the version used in computer algebra systems and the version used in hand computations, since the aims are short running times even in complicated cases, and the simplest possible form of the result. The Risch algorithm for integrating elementary functions is based on the algorithm for the integration of rational functions. We describe the Risch algorithm, but not in full detail. In most cases, we only outline the proofs.
In this subsection, we introduce the notion of differential field and differential extension field, then we describe Hermite's method.
Let be a field of characteristic 0, with a mapping of into itself satisfying:
(1) (additivity);
(2) (Leibniz-rule).
The mapping is called a differential operator, differentiation or derivation, and is called a differential field. The set is the field of constants or constant subfield in . If , we also write . Obviously, for any constant , we have . The logarithmic derivative of an element is defined as (the “derivative of ”).
Theorem 6.21 With the notations of the previous definition, the usual rules of derivation hold:
(1) ;
(2) derivation is -linear: for , ;
(3) if , is arbitrary, then ;
(4) for and ;
(5) for (integration by parts).
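The Leibniz-rule can be checked mechanically for the usual differentiation of polynomials. The sketch below stores a polynomial as a list of coefficients with the constant term first; this representation and the helper names are chosen for illustration only.

```python
def derivative(p):
    """Usual derivative of a polynomial (coefficients, constant term first)."""
    return [i * c for i, c in enumerate(p)][1:]

def add(p, q):
    """Sum of two polynomials of possibly different lengths."""
    n = max(len(p), len(q))
    return [(p[i] if i < len(p) else 0) + (q[i] if i < len(q) else 0)
            for i in range(n)]

def multiply(p, q):
    """Product of two polynomials."""
    if not p or not q:
        return []
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

f = [1, 2, 3]       # 1 + 2x + 3x^2
g = [0, 1, 0, 4]    # x + 4x^3
left = derivative(multiply(f, g))
right = add(multiply(derivative(f), g), multiply(f, derivative(g)))
print(left, right, left == right)
```

Both sides evaluate to the coefficients of 1 + 4x + 21x^2 + 32x^3 + 60x^4. Such a check is, of course, no proof; it only illustrates the rule on one pair of polynomials.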
Example 6.17 (1) With the notations of the previous definition, the mapping on is the trivial derivation, for this we have .
(2) Let . There exists a single differential operator on with , it is the usual differentiation. For this the constants are the elements of . Indeed, for by induction, so the elements of and are also constants. We have by induction that the derivative of power functions is the usual one, thus, by linearity, this is true for polynomials, and by the differentiation rule of quotients, we get the statement. It is not difficult to calculate that for the usual differentiation, the constants are the elements of .
(3) If , where is an arbitrary field of characteristic 0, then there exists a single differential operator on with constant subfield and : it is the usual differentiation. This statement is obtained similarly to the previous one.
If is an arbitrary field of characteristic 0, and with the usual differentiation, then is not the derivative of anything. (The proof of the statement is very much like the proof of the irrationality of , but we have to work with divisibility by rather than by 2.)
The example shows that for the integration of and other similar functions, we have to extend the differential field. In order to integrate rational functions, an extension by logarithms will be sufficient.
Let be a differential field and a subfield of . If differentiation doesn't lead out of , then we say that is a differential subfield of , and is a differential extension field of . If for some we have , that is, the derivative of is the logarithmic derivative of , then we write . (We note that , just like is a relation rather than a function. In other words, is an abstract concept here and not a logarithm function to a given base.) If we can choose , we say that is logarithmic over .
Example 6.18 (1) Let , , where is a new indeterminate, and let , that is, . Then .
Analogously,
is in the differential field .
(3) Since
the integral can be considered as an element of
or an element of
as well. Obviously, it is more reasonable to choose the first possibility, because in this case there is no need to extend the base field.
Let be a field of characteristic 0, non-zero relatively prime polynomials. To compute the integral using Hermite's method, we can find polynomials with
where and is monic and square-free. The rational function is called the rational part, the expression is called the logarithmic part of the integral. The method avoids the factorisation of into linear factors (in a factor field or some larger field), and even its decomposition into irreducible factors over .
Trivially, we may assume that is monic. By Euclidean division we have , where , thus, . The integration of the polynomial part is trivial. Let us determine the square-free factorisation of , that is, find monic and pairwise relatively prime polynomials such that and . Let us construct the partial fraction decomposition (this can be achieved by the Euclidean algorithm):
where every has smaller degree than .
The Hermite-reduction is the iteration of the following step: if , then the integral is reduced to the sum of a rational function and an integral similar to the original one, with reduced by 1. Using that is square-free, we get , thus, we can obtain polynomials by the application of the extended Euclidean algorithm such that and . Hence, using integration by parts,
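As a small worked instance of one reduction step, consider the integrand $1/(x^2+1)^2$ over the rationals. The letters $h$, $s$ and $t$ below are labels chosen for this illustration (the square-free factor and the two cofactors delivered by the extended Euclidean algorithm); they are assumptions and need not coincide with the notation of the formulas above.

```latex
\begin{align*}
h &= x^2+1,\qquad h' = 2x,\qquad 1 = s\,h + t\,h'
   \quad\text{with } s = 1,\ t = -\tfrac{x}{2},\\
\int \frac{dx}{(x^2+1)^2}
  &= \int \frac{s}{h}\,dx + \int \frac{t\,h'}{h^2}\,dx
   = \int \frac{dx}{x^2+1} - \frac{t}{h} + \int \frac{t'}{h}\,dx\\
  &= \frac{x}{2(x^2+1)} + \frac{1}{2}\int \frac{dx}{x^2+1}.
\end{align*}
```

The exponent of the denominator has dropped from 2 to 1, so the remaining integral has a square-free denominator and belongs to the logarithmic part.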
It can be shown that using fast algorithms, if , then the procedure requires operations in the field , where is a bound on the number of operations needed to multiply two polynomials of degree at most .
Hermite's method has a variant that avoids the partial fraction decomposition of . If , then is square-free. If , then let
Since , there exist polynomials such that
Dividing both sides by and integrating by parts,
thus, is reduced by one.
Note that and can be determined by the method of undetermined coefficients (Horowitz's method). After division, we may assume . As it can be seen from the algorithm, we can choose and . Differentiating (6.15), we get a system of linear equations on coefficients of and coefficients of , altogether coefficients. This method in general is not as fast as Hermite's method.
The algorithm below performs the Hermite-reduction for a rational function of variable .
Hermite-Reduction( )
1
2
3 Square-Free
4 construct the partial fraction decomposition of , compute numerators belonging to
5
6
7 FOR TO
8 DO
9 FOR TO
10 DO
11 WHILE
12 DO determine and from the equation
13
14
15
16
17
18 RETURN
If for some field of characteristic 0, we want to compute the integral , where are non-zero relatively prime polynomials with , square-free and monic, we can proceed by decomposing polynomial into linear factors, , in its splitting field , then constructing the partial fraction decomposition, , over , finally integrating we get
The disadvantage of this method, as we have seen in the example of the function , is that the degree of the extension field can be too large. An extension degree as large as can occur, which leads to totally unmanageable cases. On the other hand, it is not clear either if a field extension is needed at all: for example, in case of the function , can we not compute the integral without extending the base field? The following theorem enables us to choose the degree of the field extension as small as possible.
Theorem 6.22 (Rothstein-Trager integration algorithm) Let be a field of characteristic 0, non-zero relatively prime polynomials, , square-free and monic. If is an algebraic extension of , are square-free and pairwise relatively prime monic polynomials, then the following statements are equivalent:
(1) ,
(2) The polynomial can be decomposed into linear factors over , are exactly the distinct roots of , and if . Here, is the resultant taken in the indeterminate .
Example 6.19 Let us consider again the problem of computing the integral . In this case,
the roots of which are and Thus,
The algorithm, which can be easily given based on the previous theorem, can be slightly improved: instead of computing (by calculations over the field ), can also be computed over , applying the Extended-Euclidean-Normalised algorithm. This was discovered by Trager, and independently by Lazard and Rioboo. It is not difficult to show that the running time of the complete integration algorithm obtained this way is if .
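The statement of Theorem 6.22 can be checked on a toy integrand. The sketch below takes the numerator 1 and the denominator x^2 - 1 (sample data chosen here, not the integrand of Example 6.19), evaluates the resultant of the denominator and of numerator minus c times the derivative of the denominator at candidate values c, and computes the corresponding gcd over the rationals. It does not compute the resultant as a polynomial in the new indeterminate, and all function names are assumptions of this illustration.

```python
from fractions import Fraction

def strip(p):
    """Drop leading zeros (coefficients are stored highest degree first)."""
    p = list(p)
    while p and p[0] == 0:
        p.pop(0)
    return p

def poly_rem(a, b):
    """Remainder of a on division by b over the rationals."""
    a, b = strip(a), strip(b)
    while len(a) >= len(b):
        q = Fraction(a[0]) / b[0]
        padded = b + [0] * (len(a) - len(b))
        a = strip([x - q * y for x, y in zip(a, padded)])
    return a

def monic_gcd(a, b):
    a, b = strip(a), strip(b)
    while b:
        a, b = b, poly_rem(a, b)
    return [Fraction(c) / a[0] for c in a]

def derivative(p):
    n = len(p) - 1
    return [c * (n - i) for i, c in enumerate(p[:-1])]

def determinant(matrix):
    a = [[Fraction(x) for x in row] for row in matrix]
    n, det = len(a), Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if a[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
    return det

def resultant(f, g):
    """Determinant of the Sylvester matrix built with formal degrees."""
    m, n = len(f) - 1, len(g) - 1
    rows = [[0] * i + list(f) + [0] * (n - 1 - i) for i in range(n)]
    rows += [[0] * i + list(g) + [0] * (m - 1 - i) for i in range(m)]
    return determinant(rows)

def shifted_numerator(f, g, c):
    """Coefficients of f - c * g', padded to a common length."""
    dg = derivative(g)
    n = max(len(f), len(dg))
    f = [0] * (n - len(f)) + list(f)
    dg = [0] * (n - len(dg)) + dg
    return [a - c * b for a, b in zip(f, dg)]

f, g = [1], [1, 0, -1]      # integrand 1 / (x^2 - 1)
for c in (Fraction(1, 2), Fraction(-1, 2), Fraction(1)):
    h = shifted_numerator(f, g, c)
    print(c, resultant(g, h), monic_gcd(g, h))
```

The resultant vanishes at 1/2 and -1/2 with gcds x - 1 and x + 1, in agreement with the known integral (1/2) log(x - 1) - (1/2) log(x + 1); at c = 1 the resultant is non-zero and the gcd is trivial.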
Theorem 6.23 (Lazard-Rioboo-Trager-formula) Using the notations of the previous theorem, let denote the multiplicity of as a root of the polynomial . Then
(1) ;
(2) if denotes the remainder of degree in the Extended-Euclidean-Normalised algorithm performed in on and , then .
The algorithm below is the improved Lazard-Rioboo-Trager version of the Rothstein-Trager method. We compute for the rational function of indeterminate , where is square-free and monic, and .
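Integrate-Logarithmic-Part(a, b, x)
Let r(y) ← res_x(b, a − y·b′) by the subresultant algorithm, furthermore
let w_e(x, y) be the remainder of degree e during the computation
(r_1(y), …, r_k(y)) ← Square-free(r(y))
int ← 0
FOR i ← 1 TO k
  DO IF r_i(y) ≠ 1
    THEN w(y) ← the gcd of the coefficients of w_i(x, y)
      (l(y), s(y), t(y)) ← Extended-Euclidean-Normalised(w(y), r_i(y))
      w_i(x, y) ← Primitive-Part(rem(s(y)·w_i(x, y), r_i(y)))
      (r_{i,1}(y), …, r_{i,k_i}(y)) ← Factors(r_i(y))
      FOR j ← 1 TO k_i
        DO d ← deg(r_{i,j}(y))
          c ← Solve(r_{i,j}(y) = 0, y)
          IF d = 1
            THEN int ← int + c·log(w_i(x, c))
            ELSE FOR n ← 1 TO d
              DO int ← int + c[n]·log(w_i(x, c[n]))
RETURN int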
Example 6.20 Let us consider again the problem of computing the integral . In this case,
The polynomial is irreducible in , thus, we cannot avoid the extension of . The roots of are . From the Extended-Euclidean-Normalised algorithm over , , thus, the integral is
Surprisingly, the methods found for the integration of rational functions can be generalised for the integration of expressions containing basic functions (, etc.) and their inverse. Computer algebra systems can compute the integral of remarkably complicated functions, but sometimes they fail in seemingly very simple cases, for example the expression is returned unevaluated, or the result contains a special non-elementary function, for example the logarithmic integral. This is due to the fact that in such cases, the integral cannot be given in “closed form”.
Although the basic results for integration in “closed form” had been discovered by Liouville in 1833, the corresponding algorithmic methods were only developed by Risch in 1968.
The functions usually referred to as functions in “closed form” are the ones composed of rational functions, exponential functions, logarithmic functions, trigonometric and hyperbolic functions, their inverses and -th roots (or more generally “inverses” of polynomial functions, that is, solutions of polynomial equations); that is, any nesting of the above functions is also a function in “closed form”.
One might note that while is usually given in the form , the algorithm for the integration of rational functions returns
as solution. Since trigonometric and hyperbolic functions and their inverses over can be expressed in terms of exponentials and logarithms, we can restrict our attention to exponentials and logarithms. Surprisingly, it also turns out that the only extensions needed are logarithms (besides algebraic numbers) in the general case.
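The relation between the two forms can be checked numerically. The sketch below assumes that the logarithmic form in question is (i/2) log((x + i)/(x − i)), written here as a difference of two principal-branch logarithms; the function name is hypothetical. On the real line this expression differs from arctan(x) only by the constant π/2, so both have the same derivative.

```python
import cmath
import math

def log_form_antiderivative(x):
    """(i/2) * (log(x + i) - log(x - i)), evaluated with principal branches."""
    return 0.5j * (cmath.log(complex(x, 1)) - cmath.log(complex(x, -1)))

for x in (-3.0, -0.5, 0.0, 0.25, 2.0):
    value = log_form_antiderivative(x)
    # The two forms differ only by the constant pi/2.
    assert abs(value - (math.atan(x) - math.pi / 2)) < 1e-12
```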
Let be a differential extension field of the differential field . If for a , there exists a such that , that is, the logarithmic derivative of equals the derivative of an element of , then we say that is exponential over and we write . If only the following is true: for an element , there is a such that , that is, the logarithmic derivative of is an element of , then is called hyperexponential over .
Logarithmic, exponential or hyperexponential elements may be algebraic or transcendental over .
Let be a differential extension field of the differential field . If
where for , is logarithmic, exponential or algebraic over the field
(), then is called an elementary extension of . If for , is either transcendental and logarithmic, or transcendental and exponential over , then is a transcendental elementary extension of .
Let be the differential field of rational functions with the usual differentiation and constant subfield . An elementary extension of is called a field of elementary functions, a transcendental elementary extension of is called a field of transcendental elementary functions.
Example 6.21 The function can be written in the form , where , , . Trivially, is exponential over , is exponential over and is exponential over . Since and , can be written in the simpler form . The function is not only exponential but also algebraic over , since , that is, . So . But can be put in an even simpler form:
Example 6.22 The function can be written in the form , where , , , and satisfies the algebraic equation . But can also be given in the much simpler form .
Example 6.23 The function can be written in the form , where and , so is logarithmic over , and is exponential over . But , so is algebraic over , and .
The integral of an element of a field of elementary functions will be completely characterised by Liouville's Principle in case it is an elementary function. Algebraic extensions, however, cause great difficulty if not only the constant field is extended.
Here we only deal with the integration of elements of fields of transcendental elementary functions by the Risch integration algorithm.
In practice, this means an element of the field of transcendental elementary functions , where are algebraic over and the integral is an element of the field
of elementary functions. In principle, it would be simpler to choose as constant subfield but, as we have seen in the case of rational functions, this is impossible, for we can only compute in an exact way in special fields like algebraic number fields; and we even have to keep the number and degrees of as small as possible. Nevertheless, we will deal with algebraic extensions of the constant subfield dynamically: we can imagine that the necessary extensions have already been made, while in practice, we only perform extensions when they become necessary.
After the conversion of trigonometric and hyperbolic functions (and their inverses) to exponentials (and logarithms, respectively), the integrand becomes an element of a field of elementary functions. Examples 6.21 and 6.22 show that there are functions that do not seem to be elements of a transcendental elementary extension “at first sight”, and yet they are; while Example 6.23 shows that there are functions that seem to be elements of such an extension “at first sight”, and yet they are not. The first step is to represent the integrand as an element of a field of transcendental elementary functions using the algebraic relationships between the different exponential and logarithmic functions. We will not consider how this can be done. It can be verified whether we succeeded by the following structure theorem by Risch. We omit the proof of the theorem. We will need a definition.
An element is monomial over a differential field if and have the same constant field, is transcendental over and it is either exponential or logarithmic over .
Theorem 6.24 (Structure theorem) Let be the field of constants and a differential extension field of , which has constant field . Let us assume that for all , either is algebraic over , or , with and , or , with and . Then
1. , where , is monomial over if and only if there is no product
which is an element of ;
2. , where , is monomial over if and only if there is no linear combination
which is an element of .
Products and sums are taken only over the logarithmic and exponential steps.
The most important classic result of the entire theory is the following theorem.
Theorem 6.25 (Liouville's Principle) Let be a differential field with constant field . Let be a differential extension field of with the same constant field. Let us assume that . Then there exist constants and elements such that
that is,
Note that the situation is similar to the case of rational functions.
We will not prove this theorem. Although the proof is lengthy, its idea is easy to describe. First we show that a transcendental exponential extension cannot be “eliminated”, that is, when we differentiate a rational function of it, the new element does not vanish. This is due to the fact that differentiating a polynomial of an element of the transcendental exponential extension, we get a polynomial of the same degree, and the original polynomial does not divide the derivative, except for the case when the original polynomial is a monomial. Next we show that no algebraic extension is needed to express the integral. This is essentially due to the fact that substituting an element of an algebraic extension into its minimal polynomial, we get zero, and differentiating this equation, we get the derivative of the extending element as a rational function of the element. Finally, we have to examine extensions by transcendental logarithmic elements. We show that such an extending element can be eliminated by differentiation if and only if it appears in a linear polynomial with constant leading coefficient. This is due to the fact that differentiating a polynomial of such an extending element, we get a polynomial the degree of which is either the same or is reduced by one, the latter case only being possible when the leading coefficient is constant.
Let be an algebraic number field over , and a field of transcendental elementary functions. The algorithm is recursive in : using the notation , we will integrate a function , where . (The case is the integration of rational functions.) We may assume that and are relatively prime and is monic. Besides differentiation with respect to , we will also use differentiation with respect to , which we denote by . In the following, we will only present the algorithms.
Using the notations of the previous paragraph, we first presume that is transcendental and logarithmic, , . By Euclidean division, , hence
Unlike the integration of rational functions, here the integration of the polynomial part is more difficult. Therefore, we begin with the integration of the rational part.
Let be the square-free factorisation of . Then
is obvious. It can be shown that the much stronger condition is also fulfilled. By partial fraction decomposition,
We use Hermite-reduction: using the extended Euclidean algorithm we get polynomials that satisfy and . Using integration by parts,
Continuing this procedure while , we get
where , and is a square-free and monic polynomial.
It can be shown that the Rothstein-Trager method can be applied to compute the integral . Let us calculate the resultant
It can be shown that the integral is elementary if and only if is of the form , where and . If we compute the primitive part of , choose it as and any coefficient of is not a constant, then there is no elementary integral. Otherwise, let be the distinct roots of in its factor field and let
for . It can be shown that
Let us consider a few examples.
Example 6.24 The integrand of the integral is , where . Since
is a primitive polynomial and it has a coefficient that is not constant, the integral is not elementary.
Example 6.25 The integrand of the integral is , where . Here,
which has primitive part . Every coefficient of this is constant, so the integral is elementary, , , so
The remaining problem is the integration of the polynomial part
According to Liouville's Principle is elementary if and only if
where and for , is an extension of and . We will show that can be an algebraic extension of . A similar reasoning to the proof of Liouville's Principle shows that and (that is, independent of ) for . Thus,
We also get, by the reasoning used in the proof of Liouville's Principle, that the degree of is at most . So if , then
Hence, we get the following system of equations:
where in the last equation. The solution of the first equation is simply a constant . Substituting this into the next equation and integrating both sides, we get
Applying the integration procedure recursively, the integral of can be computed, but this equation can only be solved if the integral is elementary, it uses at most one logarithmic extension, and it is exactly . If this is not fulfilled, then cannot be elementary. If it is fulfilled, then for some and , hence, and with an arbitrary integration constant . Substituting for into the next equation and rearranging, we get
so we have
after integration. The integrand on the right hand side is in , so we can call the integration procedure in a recursive way. Just like above, the equation can only be solved when the integral is elementary, it uses at most one logarithmic extension and it is exactly . Let us assume that this is the case and
where and . Then the solution is and , where is an arbitrary integration constant. Continuing the procedure, the solution of the penultimate equation is and with an integration constant . Substituting for into the last equation, after rearrangement and integration, we get
This time the only condition is that the integral should be an elementary function. If it is elementary, say
then is the coefficient of in , , and the result is
Let us consider a few examples.
Example 6.26 The integrand of the integral is , where . If the integral is elementary, then
and , , . With the unknown constant from the second equation, . Since , we get , . From the third equation . Since , after integration and , we get , , hence, .
Example 6.27 The integrand of the integral is , where and . If the integral is elementary, then
and , , . With the unknown constant from the second equation, . Since , we get , . From the third equation . Since , the equation
must hold but we know from Example 6.24 that the integral on the left hand side is not elementary.
Now we assume that is transcendental and exponential, , . By Euclidean division, , hence
We plan to use Hermite's method for the rational part. But we have to face an unpleasant surprise: although for the square-free factors
is obviously satisfied, the much stronger condition is not. For example, if , then
It can be shown, however, that this unpleasant phenomenon does not appear if , in which case . Thus, it will be sufficient to eliminate from the denominator. Let , where , and let us look for polynomials with , and . Dividing both sides by , we get
Using the notation , is a finite Laurent-series the integration of which will be no harder than the integration of a polynomial. This is not surprising if we note . Even so, the integration of the “polynomial part” is more difficult here, too. We start with the other one.
Let be the square-free factorisation of . Then, since , . Using partial fraction decomposition
Hermite-reduction goes the same way as in the logarithmic case. We get
where , and is a square-free and monic polynomial, .
It can be shown that the Rothstein-Trager method can be applied to compute the integral . Let us calculate the resultant
It can be shown that the integral is elementary if and only if is of form , where and . If we compute the primitive part of , choose it as and any coefficient of is not a constant, then there is no elementary integral. Otherwise, let be the distinct roots of in its factor field and let
for . It can be shown that
Let us consider a few examples.
Example 6.28 The integrand of the integral is , where . Since
is a primitive polynomial which only has constant coefficients, the integral is elementary, , , thus,
Example 6.29 The integrand of the integral is , where . Since
is a primitive polynomial that has a non-constant coefficient, the integral is not elementary.
The remaining problem is the integration of the “polynomial part”
According to Liouville's Principle is elementary if and only if
where and for , is an extension of and . It can be shown that can be an algebraic extension of . A similar reasoning to the proof of Liouville's Principle shows that we may assume without loss of generality that is either an element of (that is, independent of ), or it is monic and irreducible in for . Furthermore, it can be shown that there can be no non-monomial factor in the denominator of , since such a factor would also be present in the derivative. Similarly, the denominator of can have no non-monomial factor either. So we get that either , or , since this is the only irreducible monic monomial. But if , then the corresponding term of the sum is , which can be incorporated into . Hence, we get that if has an elementary integral, then
where and . The summation should be taken over the same range as in since
Comparing the coefficients, we get the system
where . The solution of the equation is simply ; if this integral is not elementary, then cannot be elementary either, but if it is, then we have determined . In the case , we have to solve a differential equation called Risch differential equation to determine . The differential equation is of the form , where the given functions and are elements of , and we are looking for solutions in . At first sight it looks as if we had replaced the problem of integration with a more difficult problem, but the linearity of the equations and the requirement that the solution be in are a great help. If any Risch differential equation fails to have a solution in , then is not elementary, otherwise
The Risch differential equation can be solved algorithmically, but we will not go into details.
Let us consider a few examples.
Example 6.30 The integrand of the integral is , where . If the integral is elementary, then , where . It is not difficult to show that the differential equation has no rational solutions, so is not elementary.
Example 6.31 The integrand of the integral is , where and . If the integral is elementary, then , where . Differentiating both sides, , thus, . Since is transcendental over , by comparing the coefficients and , which has no solutions. Therefore, is not elementary.
Example 6.32 The integrand of the integral
is
where . If the integral is elementary, then it is of the form , where
The second equation can be integrated and is elementary. The solution of the first equation is . Hence,
Exercises
6.4-1 Apply Hermite-reduction to the following function :
6.4-2 Compute the integral , where
6.4-3 Apply the Risch integration algorithm to compute the following integral:
So far in the chapter we tried to illustrate the algorithm design problems of computer algebra through the presentation of a few important symbolic algorithms. Below the interested reader will find an overview of the wider universe of the research of symbolic algorithms.
Besides the resultant method and the theory of Gröbner bases presented in this chapter, there also exist algorithms for finding real symbolic roots of non-linear equations and inequalities (Collins).
There are some remarkable algorithms in the area of symbolic solution of differential equations. There exists a decision procedure similar to the Risch algorithm for the computation of solutions in closed form of a homogeneous ordinary differential equation of second order with rational function coefficients. In the case of higher order linear equations, Abramov's procedure gives closed rational solutions of an equation with polynomial coefficients, while Bronstein's algorithm gives solutions of the form . In the case of partial differential equations, Lie's symmetry methods can be used. There also exists an algorithm for the factorisation of linear differential operators over formal power series and rational functions.
Procedures based on factorisation are of great importance in the research of computer algebra algorithms. They are so important that many date the birth of the entire research field to Berlekamp's publication on an effective algorithm for the factorisation of polynomials of one variable over finite fields of small characteristic . Later, Berlekamp extended his results to larger characteristic. In order to have similarly good running times, he introduced probabilistic elements into the algorithm. Today's computer algebra systems use Berlekamp's procedure even for large finite fields as a routine, perhaps without most of the users knowing about the probabilistic origin of the algorithm. The method will be presented in another chapter of the book. We note that there are numerous algorithms for the factorisation of polynomials over finite fields.
Not long after polynomial factorisation over finite fields was solved, Zassenhaus, taking van der Waerden's book Moderne Algebra from 1936 as a base, used Hensel's lemma for the arithmetic of p-adic numbers to extend factorisation. “Hensel-lifting” – as his procedure is now called – is a general approach for the reconstruction of factors from their modular images. Unlike interpolation, which needs multiple points from the image, Hensel-lifting only needs one point from the image. The Berlekamp–Zassenhaus algorithm for the factorisation of polynomials with integer coefficients is of fundamental importance but it has two hidden pitfalls. First, for a certain type of polynomial the running time is exponential. Unfortunately, many “bad” polynomials appear in the case of factorisation over algebraic number fields. Second, a representation problem appears for multivariable polynomials, similar to what we encountered in the Gaussian elimination of sparse matrices. The first problem was solved by a Diophantine optimisation based on the geometry of numbers, a so-called lattice reduction algorithm by Lenstra-Lenstra-Lovász [161]; it is used together with Berlekamp's method. This polynomial algorithm is completed by a procedure which ensures that the Hensel-lifting will start from a “good” modular image and that it will end “in time”. Solutions have been found for the mentioned representation problem of the factorisation of multivariable polynomials as well. This is the second area where randomisation plays a crucial role in the design of effective algorithms. We note that in practice, the Berlekamp-Zassenhaus-Hensel algorithm proves more effective than the Lenstra-Lenstra-Lovász procedure. By contrast, the problem of polynomial factorisation can be solved in polynomial time, while the best proved algorithmic bound for the factorisation of the integer is (Pollard and Strassen) in the deterministic case and (Lenstra and Pomerance) in the probabilistic case, where .
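The idea that a single modular image suffices can be illustrated in the simplest special case, the lifting of a root (that is, of a linear factor). The following is a minimal sketch, not the factor-lifting procedure used inside the Berlekamp-Zassenhaus algorithm; the function names are hypothetical, and the modular inverse via the three-argument pow with exponent -1 assumes Python 3.8 or later.

```python
def evaluate(coeffs, x, modulus):
    """Horner evaluation of a polynomial [c0, c1, ...] at x modulo `modulus`."""
    result = 0
    for c in reversed(coeffs):
        result = (result * x + c) % modulus
    return result

def hensel_lift_root(coeffs, root, p, k):
    """Lift a simple root of f modulo p to a root modulo p**k."""
    derivative = [i * c for i, c in enumerate(coeffs)][1:]
    if evaluate(coeffs, root, p) != 0 or evaluate(derivative, root, p) == 0:
        raise ValueError("need a simple root modulo p")
    modulus, r = p, root % p
    for _ in range(k - 1):
        modulus *= p
        # Newton step: r <- r - f(r) / f'(r), computed modulo the next power of p.
        inverse = pow(evaluate(derivative, r, modulus), -1, modulus)
        r = (r - evaluate(coeffs, r, modulus) * inverse) % modulus
    return r

# x^2 - 2 has the root 3 modulo 7; lift it to a root modulo 7^4.
r = hensel_lift_root([-2, 0, 1], 3, 7, 4)
assert (r * r - 2) % 7**4 == 0
```

Each step refines the root by one more power of p while keeping it congruent to the starting root modulo p, which is why one good modular image is enough.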
In fact, a new theory of heuristic or probabilistic methods in computer algebra is being born to avoid computational or memory explosion and to make algorithms with deterministically large running times more effective. In the case of probabilistic algorithms, the probability of inappropriate operations can be positive, which may result in an incorrect answer (Monte Carlo algorithms), or, although we always get the correct answer (Las Vegas algorithms), we may not get anything in polynomial time. Besides the above examples, nice results have been achieved in testing polynomial identity, irreducibility of polynomials, determining matrix normal forms (Frobenius, Hilbert, Smith), etc. Their role is likely to increase in the future.
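Polynomial identity testing is a convenient illustration of a Monte Carlo algorithm. The sketch below evaluates both sides at random points modulo a fixed prime, in the spirit of the Schwartz-Zippel lemma; the function name, the modulus and the number of trials are assumptions made for this example. A "False" answer is always correct, while a "True" answer can be wrong with a small probability.

```python
import random

PRIME = 2**61 - 1  # a Mersenne prime, used as the modulus for all evaluations

def probably_identical(f, g, num_vars, trials=20, rng=None):
    """One-sided randomised test whether two polynomial functions agree."""
    rng = rng or random.Random()
    for _ in range(trials):
        point = [rng.randrange(PRIME) for _ in range(num_vars)]
        if f(*point) % PRIME != g(*point) % PRIME:
            return False  # a witness: the polynomials definitely differ
    return True  # no witness found: identical with high probability

lhs = lambda x, y: (x + y) ** 2
rhs = lambda x, y: x * x + 2 * x * y + y * y
wrong = lambda x, y: x * x + y * y

print(probably_identical(lhs, rhs, 2))
print(probably_identical(lhs, wrong, 2))
```

The polynomials are passed as ordinary Python functions, so no expansion into a normal form is ever computed; this is what makes the approach attractive when expansion would cause memory explosion.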
So far in the chapter we gave an overview of the most important symbolic algorithms. We mentioned in the introduction that most computer algebra systems are also able to perform numeric computations: unlike traditional systems, the precision can be set by the user. In many cases, it is useful to combine the symbolic and numeric computations. Let us consider for example the symbolically computed power series solution of a differential equation. After truncation, evaluating the power series with the usual floating-point arithmetic at certain points, we get a numerical approximation of the solution. When the problem is an approximation of a physical problem, the attraction of symbolic computation is often lost, simply because it is too complicated or too slow and is neither necessary nor useful, since we are looking for a numerical solution. In other cases, when the problem cannot be dealt with using symbolic computation, the only way is the numerical approximation. This may be the case when the existing symbolic algorithm does not find a closed solution (e.g. the integral of non-elementary functions, etc.), or when a symbolic algorithm for the specified problem does not exist. Although more and more numerical algorithms have symbolic equivalents, numerical procedures play an important role in computer algebra. Let us think of differentiation and integration: sometimes traditional algorithms (integral transformation, power series approximation, perturbation methods) can be the most effective.
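The combination of exact and floating-point computation described above can be sketched on a deliberately simple case. The differential equation y' = y with y(0) = 1 and the function names are assumptions chosen for illustration: the series coefficients are computed exactly as rational numbers, and only the final evaluation of the truncated series uses floating-point arithmetic.

```python
from fractions import Fraction
import math

def exp_series_coefficients(order):
    """Exact Taylor coefficients of the solution of y' = y, y(0) = 1."""
    coeffs = [Fraction(1)]
    for n in range(order):
        # Comparing coefficients of x**n in y' = y gives (n + 1) * c[n+1] = c[n].
        coeffs.append(coeffs[-1] / (n + 1))
    return coeffs

def evaluate_float(coeffs, x):
    """Horner evaluation of the truncated series in floating-point arithmetic."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + float(c)
    return result

approx = evaluate_float(exp_series_coefficients(20), 0.5)
assert abs(approx - math.exp(0.5)) < 1e-12
```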
In the design of computer algebra algorithms, parallel architectures will play an increasing role in the future. Although many existing algorithms can be parallelised easily, it is not obvious that good sequential algorithms will perform optimally on parallel architectures as well: the optimal performance might be achieved by a completely different method.
The development of computer algebra systems is linked with the development of computer science and algorithmic mathematics. In the early period of computers, researchers of different fields began the development of the first computer algebra systems to facilitate and accelerate their symbolic computations; these systems, reconstructed and continuously updated, are present in their umpteenth versions today. General purpose computer algebra systems appeared in the seventies, and provide a wide range of built-in data structures, mathematical functions and algorithms, trying to cover a wide area of users. Because of their large demand for computational resources, their spread became explosive at the beginning of the eighties, when microprocessor-based workstations appeared. Better hardware environments, more effective resource management, the use of system-independent high-level languages and, last but not least, socio-economic demands gradually transformed general purpose computer algebra systems into market products, which also resulted in a better user interface and document preparation.
Below we list the most widely known general and special purpose computer algebra systems and libraries.
General purpose computer algebra systems: AXIOM, DERIVE, FORM, GNU-CALC, JACAL, MACSYMA, MAXIMA, MAPLE, DISTRIBUTED MAPLE, MATHCAD, MATLAB SYMBOLIC MATH TOOLBOX, SCILAB, MAS, MATHEMATICA, MATHVIEW, MOCK-MMA, MUPAD, REDUCE, RISA.
Algebra and number theory: BERGMAN, COCOA, FELIX, FERMAT, GRB, KAN, MACAULAY, MAGMA, NUMBERS, PARI, SIMATH, SINGULAR.
Algebraic geometry: CASA, GANITH.
Group theory: GAP, LIE, MAGMA, SCHUR.
Tensor analysis: CARTAN, FEYNCALC, GRG, GRTENSOR, MATHTENSOR, REDTEN, RICCI, TTC.
Computer algebra libraries: APFLOAT, BIGNUM, GNU MP, KANT, LIDIA, NTL, SACLIB, UBASIC, WEYL, ZEN.
Most general purpose computer algebra systems are characterised by
interactivity,
knowledge of mathematical facts,
a high-level, declarative programming language with the possibility of functional programming and the knowledge of mathematical objects,
extensibility towards the operating system and other programs,
integration of symbolic and numeric computations,
automatic (optimised) C and Fortran code generation,
graphical user interface,
2- and 3-dimensional graphics and animation,
possibility of editing text and automatic LaTeX conversion,
on-line help.
Computer algebra systems are also called mathematical expert systems. Today we can see an astonishing development of general purpose computer algebra systems, mainly because of their knowledge and wide application area. But it would be a mistake to underestimate special systems, which play a very important role in many research fields and, in many cases, are easier to use and more effective due to their system of notation and the low-level programming language implementation of their algorithms. It is essential that we choose the most appropriate computer algebra system to solve a specific problem.
Footnote. Declarative programming languages specify the desired result unlike imperative languages, which describe how to get the result.
PROBLEMS
6-1
The length of coefficients in the series of remainders in a Euclidean division
Generate two pseudorandom polynomials of degree in with coefficients of decimal digits. Perform a single Euclidean division in (in ) and compute the ratio of the maximal coefficients of the remainder and the original polynomial (determined by the function ). Repeat the computation times and compute the average. What is the result? Repeat the experiment with .
6-2
Simulation of the Modular-Gcd-Smallprimes algorithm
Using simulation, give an estimate of the optimal value of the variable in the Modular-Gcd-Smallprimes algorithm. Use random polynomials of different degrees and coefficient magnitudes.
6-3
Modified pseudo-Euclidean division
Let . Modify the pseudo-Euclidean division in such a way that in the equation
instead of the exponent put the smallest value such that . Replace the procedures and in the Primitive-Euclidean algorithm by the obtained procedures and . Compare the amount of memory space required by the algorithms.
6-4
Construction of reduced Gröbner basis
Design an algorithm that computes a reduced Gröbner basis from a given Gröbner-basis .
6-5
Implementation of Hermite-reduction
Implement Hermite-reduction in a chosen computer algebra language.
6-6
Integration of rational functions
Write a program for the integration of rational functions.
CHAPTER NOTES
The algorithms Classical-Euclidean and Extended-Euclidean for non-negative integers are described in [54]. A natural continuation of the theory of resultants leads to subresultants, which can help in reducing the growth of the coefficients in the Extended-Euclidean algorithm (see e.g. [82], [88]).
Gröbner bases were introduced by B. Buchberger in 1965 [35]. Several authors examined polynomial ideals before this. The most well-known is perhaps Hironaka, who used bases of ideals of power series to resolve singularities over . He was awarded the Fields Medal for his work. His method was not constructive, however. Gröbner bases have been generalised for many algebraic structures in the last two decades.
The foundations of differential algebra were laid by J. F. Ritt in 1948 [213]. The square-free factorisation algorithm used in symbolic integration can be found for example in the books [82], [88]. The importance of choosing the smallest possible extension degree in Hermite-reduction is illustrated by Example 11.11 in [88], where the splitting field has very large degree but the integral can be expressed in an extension of degree 2. The proof of the Rothstein-Trager integration algorithm can be found in [82] (Theorem 22.8). We note that the algorithm was found independently by Rothstein and Trager. The proof of the correctness of the Lazard-Rioboo-Trager formula, the analysis of the running time of the Integrate-Logarithmic-Part algorithm, an overview of the procedures that deal with the difficulties of algebraic extension steps, the determination of the hyperexponential integral (if it exists) of a hyperexponential element over , the proof of Liouville's principle and the proofs of the statements connected to the Risch algorithm can be found in the book [82].
There are many books and publications available on computer algebra and related topics. The interested reader will find mathematical description in the following general works: Caviness [42], Davenport et al. [60], von zur Gathen et al. [82], Geddes et al. [88], Knuth [144], [145], [146], Mignotte [181], Mishra [184], Pavelle et al. [201], Winkler [276].
The computer-oriented reader will find further information on computer algebra in Christensen [44], Gonnet and Gruntz [96], Harper et al. [107] and on the world wide web.
A wide range of books and articles deal with applications, e.g. Akritas [10], Cohen et al. (eds.) [50], [49], Grossman (ed.) [101], Hearn (ed.) [110], Kovács [148] and Odlyzko [195].
For the role of computer algebra systems in education see for example the works of Karian [138] and Uhl [260].
Conference proceedings: AAECC, DISCO, EUROCAL, EUROSAM, ISSAC and SYMSAC.
Computer algebra journals: Journal of Symbolic Computation—Academic Press, Applicable Algebra in Engineering, Communication and Computing—Springer-Verlag, Sigsam Bulletin—ACM Press.
The Department of Computer Algebra of the Eötvös Loránd University, Budapest takes works [82], [88], [181], [276] as a base in education.
This chapter introduces a number of cryptographic protocols and their underlying problems and algorithms. A typical cryptographic scenario is shown in Figure 7.1 (the design of Alice and Bob is due to Crépeau). Alice and Bob wish to exchange messages over an insecure channel, such as a public telephone line or via email over a computer network. Erich is eavesdropping on the channel. Knowing that their data transfer is being eavesdropped, Alice and Bob encrypt their messages using a cryptosystem.
In Section 7.1, various symmetric cryptosystems are presented. A cryptosystem is said to be symmetric if one and the same key is used for encryption and decryption. For symmetric systems, the question of key distribution is central: How can Alice and Bob agree on a joint secret key if they can communicate only via an insecure channel? For example, if Alice chooses some key and encrypts it like a message using a symmetric cryptosystem to send it to Bob, then which key should she use to encrypt this key?
This paradoxical situation is known as the secret-key agreement problem, and it was considered unsolvable for a long time. Its surprisingly simple, ingenious solution by Whitfield Diffie and Martin Hellman in 1976 is a milestone in the history of cryptography. They proposed a protocol that Alice and Bob can use to exchange a few messages, after which they both can easily determine their joint secret key. Eavesdropper Erich, however, does not have a clue about their key, even if he is able to intercept every single bit of their transferred messages. Section 7.2 presents the Diffie-Hellman secret-key agreement protocol.
It may be considered an irony of history that this protocol, which finally solved the long-standing secret-key agreement problem that is so important in symmetric cryptography, opened the door to public-key cryptography in which there is no need to distribute joint secret keys via insecure channels. In 1978, shortly after Diffie and Hellman had published their pathbreaking work in 1976, Rivest, Shamir, and Adleman developed their famous RSA system, the first public-key cryptosystem in the open literature. Section 7.3 describes the RSA cryptosystem and the related digital signature scheme. Using the latter protocol, Alice can sign her message to Bob such that he can verify that she indeed is the sender of the message. Digital signatures prevent Erich from forging Alice's messages.
The security of the Diffie-Hellman protocol rests on the assumption that computing discrete logarithms is computationally intractable. That is why modular exponentiation (the inverse function of which is the discrete logarithm) is considered to be a candidate for a one-way function. The security of RSA similarly rests on the assumption that a certain problem is computationally intractable, namely that factoring large integers is computationally hard. However, the authorised receiver Bob is able to efficiently decrypt the ciphertext by employing the factorisation of some integer he has chosen in private, which is his private “trapdoor” information.
Section 7.4 introduces a secret-key agreement protocol developed by Rivest and Sherman, which is based on so-called strongly noninvertible associative one-way functions. This protocol can be modified to yield a digital signature scheme as well.
Section 7.5 introduces the fascinating area of interactive proof systems and zero-knowledge protocols that has practical applications in cryptography, especially for authentication issues. In particular, a zero-knowledge protocol for the graph isomorphism problem is presented. On the other hand, this area is also central to complexity theory and will be revisited in Chapter 8, again in connection to the graph isomorphism problem.
Cryptography is the art and science of designing secure cryptosystems, which are used to encrypt texts and messages so that they are kept secret and unauthorised decryption is prevented, whereas the authorised receiver is able to efficiently decrypt the ciphertext received. This section presents two classical symmetric cryptosystems. In subsequent sections, some important asymmetric cryptosystems and cryptographic protocols are introduced. A “protocol” is a dialog between two (or more) parties, where a “party” may be either a human being or a computing machine. Cryptographic protocols can have various cryptographic purposes. They consist of algorithms that are jointly executed by more than one party.
Cryptanalysis is the art and science of (unauthorised) decryption of ciphertexts and of breaking existing cryptosystems. Cryptology captures both these fields, cryptography and cryptanalysis. In this chapter, we focus on cryptographic algorithms. Algorithms of cryptanalysis, which are used to break cryptographic protocols and systems, will be mentioned as well but will not be investigated in detail.
Figure 7.1 shows a typical scenario in cryptography: Alice and Bob communicate over an insecure channel that is eavesdropped by Erich, and thus they encrypt their messages using a cryptosystem.
Definition 7.1 (Cryptosystem) A cryptosystem is a quintuple $(P, C, K, \mathcal{E}, \mathcal{D})$ with the following properties:
1. $P$, $C$, and $K$ are finite sets, where $P$ is the plaintext space, $C$ is the ciphertext space, and $K$ is the key space. The elements of $P$ are called the plaintexts, and the elements of $C$ are called the ciphertexts. A message is a string of plaintext symbols.
2. $\mathcal{E} = \{E_k \mid k \in K\}$ is a family of functions $E_k : P \to C$, which are used for encryption. $\mathcal{D} = \{D_k \mid k \in K\}$ is a family of functions $D_k : C \to P$, which are used for decryption.
3. For each key $e \in K$, there exists some key $d \in K$ such that for each plaintext $p \in P$,
$$D_d(E_e(p)) = p. \tag{7.1}$$
A cryptosystem is said to be symmetric (or private-key) if either $d = e$ or if at least $d$ can “easily” be determined from $e$. A cryptosystem is said to be asymmetric (or public-key) if $d \neq e$ and it is “computationally infeasible” to determine the private key $d$ from the corresponding public key $e$.
At times, we may use distinct key spaces for encryption and decryption, with the above definition modified accordingly.
We now introduce some easy examples of classical symmetric cryptosystems. Consider the alphabet $\Sigma = \{\mathrm{A}, \mathrm{B}, \ldots, \mathrm{Z}\}$, which will be used both for the plaintext space and for the ciphertext space. We identify $\Sigma$ with $\mathbb{Z}_{26} = \{0, 1, \ldots, 25\}$ so as to be able to perform calculations with letters as if they were numbers. The number $0$ corresponds to the letter A, the number $1$ corresponds to B, and so on. This coding of plaintext or ciphertext symbols by nonnegative integers is not part of the actual encryption or decryption.
Messages are elements of $\Sigma^*$, where $\Sigma^*$ denotes the set of strings over $\Sigma$. If some message $m \in \Sigma^*$ is subdivided into blocks of length $n$ and is encrypted blockwise, as is common in many cryptosystems, then each block of $m$ is viewed as an element of $\mathbb{Z}_{26}^n$.
Example 7.1 [Shift Cipher] The first example is a monoalphabetic symmetric cryptosystem. Let $K = P = C = \mathbb{Z}_{26}$. The shift cipher encrypts messages by shifting every plaintext symbol by the same number $k$ of letters in the alphabet modulo $26$. Shifting each letter in the ciphertext back using the same key $k$, the original plaintext is recovered. For each key $k \in \mathbb{Z}_{26}$, the encryption function $E_k$ and the decryption function $D_k$ are defined by:
$$E_k(m) = (m + k) \bmod 26, \qquad D_k(c) = (c - k) \bmod 26,$$
where addition and subtraction by $k$ modulo $26$ are carried out characterwise.
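The shift cipher can be sketched in a few lines of Python. This is a minimal illustration, assuming messages consist only of the uppercase letters A to Z; the function names `shift_encrypt` and `shift_decrypt` are illustrative and do not come from the text.

```python
import string

ALPHABET = string.ascii_uppercase  # "A".."Z", identified with 0..25


def shift_encrypt(message: str, key: int) -> str:
    """Shift every letter of `message` forward by `key` positions modulo 26."""
    return "".join(ALPHABET[(ALPHABET.index(ch) + key) % 26] for ch in message)


def shift_decrypt(ciphertext: str, key: int) -> str:
    """Undo shift_encrypt by shifting every letter back by `key` positions."""
    return shift_encrypt(ciphertext, -key)


# Hypothetical message, encrypted with the key 3 and decrypted again.
ciphertext = shift_encrypt("ATTACK", 3)
recovered = shift_decrypt(ciphertext, 3)
```

Decryption reuses the encryption routine with the negated key, which mirrors the fact that the same key serves both directions in a symmetric system.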
Figure 7.2 shows an encryption of the message by the shift cipher with key . The resulting ciphertext is . Note that the particular shift cipher with key is also known as the Caesar cipher, since the Roman Emperor allegedly used this cipher during his wars to keep messages secret.
Footnote. Historic remark: Gaius Julius Caesar reports in his book De Bello Gallico that he sent an encrypted message to Q. Tullius Cicero (the brother of the famous speaker) during the Gallic Wars (58 until 50 B.C.). The system used was monoalphabetic and replaced Latin letters by Greek letters; however, it is not explicitly mentioned there if the cipher used indeed was the shift cipher with key . This information was given later by Suetonius.
This cipher is a very simple substitution cipher in which each letter is substituted by a certain letter of the alphabet.
Since the key space is very small, the shift cipher can easily be broken. It is already vulnerable to attacks in which the attacker knows the ciphertext only, simply by checking which of the $26$ possible keys reveals a meaningful plaintext, provided that the ciphertext is long enough to allow unique decryption.
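Such a ciphertext-only attack amounts to listing all candidate decryptions. The sketch below assumes uppercase letters A to Z and leaves the judgement of which candidate is "meaningful" to a human reader; the helper names are illustrative.

```python
def shift_decrypt(ciphertext: str, key: int) -> str:
    """Shift every letter back by `key` positions modulo 26."""
    return "".join(chr((ord(ch) - ord("A") - key) % 26 + ord("A")) for ch in ciphertext)


def all_candidates(ciphertext: str) -> dict:
    """Map each of the 26 possible keys to the plaintext it would yield."""
    return {key: shift_decrypt(ciphertext, key) for key in range(26)}


# Print every candidate; the attacker picks the one that reads as real text.
for key, candidate in all_candidates("DWWDFN").items():
    print(key, candidate)
```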
The shift cipher is a monoalphabetic cryptosystem, since every plaintext letter is replaced by one and the same letter in the ciphertext. In contrast, a polyalphabetic cryptosystem can encrypt the same plaintext symbols by different ciphertext symbols, depending on their position in the text. Such a polyalphabetic cryptosystem that is based on the shift cipher, yet much harder to break, was proposed by the French diplomat Blaise de Vigenère (1523 until 1596). His system builds on previous work by the Italian mathematician Leon Battista Alberti (born in 1404), the German abbot Johannes Trithemius (born in 1462), and the Italian scientist Giovanni Porta (born in 1535). It works like the shift cipher, except that the letter that encrypts some plaintext letter varies with its position in the text.
Example 7.2 [Vigenère Cipher] This symmetric polyalphabetic cryptosystem uses a so-called Vigenère square, a matrix consisting of $26$ rows and $26$ columns, see Figure 7.3. Every row has the $26$ letters of the alphabet, shifted from row to row by one position. That is, the single rows can be seen as a shift cipher obtained by the keys $0, 1, \ldots, 25$. Which row of the Vigenère square is used for encryption of some plaintext symbol depends on its position in the text.
Messages are subdivided into blocks of a fixed length $n$ and are encrypted blockwise, i.e., $K = P = C = \mathbb{Z}_{26}^n$. The block length $n$ is also called the period of the system. In what follows, the $i$th symbol in a string $w$ is denoted by $w_i$.
For each key $k \in \mathbb{Z}_{26}^n$, the encryption function $E_k$ and the decryption function $D_k$, both mapping from $\mathbb{Z}_{26}^n$ to $\mathbb{Z}_{26}^n$, are defined by:
$$E_k(b) = (b + k) \bmod 26, \qquad D_k(c) = (c - k) \bmod 26,$$
where addition and subtraction by $k$ modulo $26$ are again carried out characterwise. That is, the key $k$ is written letter by letter above the symbols of the block $b$ to be encrypted. If the last plaintext block has fewer than $n$ symbols, one uses fewer key symbols accordingly. In order to encrypt the $i$th plaintext symbol $b_i$, which has the key symbol $k_i$ sitting on top, use the $k_i$th row of the Vigenère square as in the shift cipher.
For example, choose the block length and the key . Figure 7.4 shows the encryption of the message , which consists of seven blocks, into the ciphertext by the Vigenère cipher using key . To the first plaintext letter, “H”, the key symbol “T” is assigned. The intersection of the “H” column with the “T” row of the Vigenère square yields “A” as the first letter of the ciphertext, see Figure 7.3.
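The characterwise addition of key and plaintext can be sketched as follows. The message `HELLOWORLD` and the key `TIME` are hypothetical values chosen for illustration (they are not the ones used in Figure 7.4); only the fact that “H” under key symbol “T” yields “A” is taken from the text. Uppercase letters A to Z are assumed.

```python
from itertools import cycle


def _combine(text: str, key: str, sign: int) -> str:
    """Add (sign=+1) or subtract (sign=-1) the repeated key letterwise modulo 26."""
    out = []
    for ch, k in zip(text, cycle(key)):  # the key is repeated block after block
        shift = sign * (ord(k) - ord("A"))
        out.append(chr((ord(ch) - ord("A") + shift) % 26 + ord("A")))
    return "".join(out)


def vigenere_encrypt(plaintext: str, key: str) -> str:
    return _combine(plaintext, key, +1)


def vigenere_decrypt(ciphertext: str, key: str) -> str:
    return _combine(ciphertext, key, -1)


cipher = vigenere_encrypt("HELLOWORLD", "TIME")   # hypothetical message and key
plain = vigenere_decrypt(cipher, "TIME")
```

Because `zip` stops at the end of the text, a final block shorter than the key simply uses fewer key symbols, as described above.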
There are many other classical cryptosystems, which will not be described in detail here. There are various ways to classify cryptosystems according to their properties or to the specific way they are designed. In Definition 7.1, the distinction between symmetric and asymmetric cryptosystems was explained. The two examples above (the shift cipher and the Vigenère cipher) demonstrated the distinction between monoalphabetic and polyalphabetic systems. Both are substitution ciphers, which may be contrasted with permutation ciphers (a.k.a. transposition ciphers), in which the plaintext letters are not substituted by certain ciphertext letters but move to another position in the text and remain otherwise unchanged.
Moreover, block ciphers such as the Vigenère system can be contrasted with stream ciphers, which produce a continuous stream of key symbols depending on the plaintext context to be encrypted. One can also distinguish between different types of block ciphers. An important type are the affine linear block ciphers, which are defined by affine linear encryption functions $E_{(A,\vec{b})}$ and decryption functions $D_{(A^{-1},\vec{b})}$, both mapping from $\mathbb{Z}_m^n$ to $\mathbb{Z}_m^n$. That is, they are of the following form:
$$E_{(A,\vec{b})}(\vec{x}) = A\vec{x} + \vec{b} \bmod m, \qquad D_{(A^{-1},\vec{b})}(\vec{y}) = A^{-1}(\vec{y} - \vec{b}) \bmod m. \tag{7.2}$$
Here, $(A,\vec{b})$ and $(A^{-1},\vec{b})$ are the keys used for encryption and decryption, respectively; $A$ is an $n \times n$ matrix with entries from $\mathbb{Z}_m$; $A^{-1}$ is the inverse matrix for $A$; $\vec{x}$, $\vec{y}$, and $\vec{b}$ are vectors in $\mathbb{Z}_m^n$, and all arithmetic is carried out modulo $m$. Some mathematical explanations are in order (see also Definition 7.2 in Subsection 7.1.3): An $n \times n$ matrix $A$ over the ring $\mathbb{Z}_m$ has a multiplicative inverse if and only if $\gcd(\det A, m) = 1$. The inverse matrix for $A$ is defined by $A^{-1} = (\det A)^{-1} A_{\mathrm{adj}}$, where $\det A$ is the determinant of $A$, and $A_{\mathrm{adj}}$ is the adjugate matrix for $A$. The determinant of $A$ is recursively defined: For $n = 1$ and $A = (a)$, $\det A = a$; for $n > 1$ and each $i \in \{1, 2, \ldots, n\}$, $\det A = \sum_{j=1}^{n} (-1)^{i+j} a_{i,j} \det A_{i,j}$, where $a_{i,j}$ denotes the $(i,j)$ entry of $A$ and the $(n-1) \times (n-1)$ matrix $A_{i,j}$ results from $A$ by cancelling the $i$th row and the $j$th column. The determinant of a matrix and thus its inverse (if it exists) can be computed efficiently, see Problem 7-3.
For example, the Vigenère cipher is an affine linear cipher whose key contains the identity matrix as its first component. If $\vec{b}$ in (7.2) is the zero vector, then it is a linear block cipher. A classical example is the Hill cipher, invented by Lester Hill in 1929. Here, the key space is the set of all $n \times n$ matrices $A$ with entries in $\mathbb{Z}_m$ such that $\gcd(\det A, m) = 1$. This condition guarantees the invertibility of those matrices that are allowed as keys, since the inverse matrix $A^{-1}$ is used for decryption of the messages encrypted by key $A$. For each key $A$, the Hill cipher is defined by the encryption function $E_A(\vec{x}) = A\vec{x} \bmod m$ and the decryption function $D_{A^{-1}}(\vec{y}) = A^{-1}\vec{y} \bmod m$. Thus, it is the most general linear cipher. The permutation cipher also is linear, and hence is a special case of the Hill cipher.
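The following Python sketch illustrates the Hill cipher over the alphabet A to Z (modulus 26). It computes the determinant by cofactor expansion and the inverse matrix as the inverse of the determinant times the adjugate, which is only sensible for very small matrices. The key matrix and the message are hypothetical, and all function names are illustrative.

```python
from math import gcd


def det(a):
    """Determinant by cofactor expansion along the first row (small n only)."""
    if len(a) == 1:
        return a[0][0]
    return sum(
        (-1) ** j * a[0][j] * det([row[:j] + row[j + 1:] for row in a[1:]])
        for j in range(len(a))
    )


def inverse_mod(a, m):
    """Inverse of the square matrix a over Z_m: (det a)^(-1) times the adjugate."""
    n = len(a)
    d = det(a) % m
    if gcd(d, m) != 1:
        raise ValueError("matrix is not invertible modulo m")
    d_inv = pow(d, -1, m)  # modular inverse; requires Python 3.8 or later
    if n == 1:
        return [[d_inv]]
    inv = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            minor = [row[:j] + row[j + 1:] for k, row in enumerate(a) if k != i]
            # The (j, i) entry of the adjugate is the (i, j) cofactor.
            inv[j][i] = (d_inv * (-1) ** (i + j) * det(minor)) % m
    return inv


def hill_apply(text, matrix, m=26):
    """Multiply each block of len(matrix) letters by `matrix` modulo m."""
    n = len(matrix)
    if len(text) % n != 0:
        raise ValueError("text length must be a multiple of the block length")
    out = []
    for start in range(0, len(text), n):
        block = [ord(ch) - ord("A") for ch in text[start:start + n]]
        for row in matrix:
            out.append(chr(sum(r * x for r, x in zip(row, block)) % m + ord("A")))
    return "".join(out)


key = [[3, 3], [2, 5]]                            # hypothetical key: det = 9, gcd(9, 26) = 1
cipher = hill_apply("HELP", key)                  # encryption multiplies by the key matrix
plain = hill_apply(cipher, inverse_mod(key, 26))  # decryption multiplies by its inverse
```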
Cryptanalysis aims at breaking existing cryptosystems and, in particular, at determining the decryption keys. In order to characterise the security or vulnerability of the cryptosystem considered, one distinguishes different types of attacks according to the information available for the attacker. For the shift cipher, ciphertext-only attacks were already mentioned. They are the weakest type of attack, and a cryptosystem that does not resist such attacks is not of much value.
Affine linear block ciphers such as the Vigenère and the Hill cipher are vulnerable to attacks in which the attacker knows the plaintext corresponding to some ciphertext obtained and is able to conclude the keys used. These attacks are called known-plaintext attacks. Affine linear block ciphers are even more vulnerable to chosen-plaintext attacks, in which the attacker can choose some plaintext and is then able to see which ciphertext corresponds to the plaintext chosen. Another type of attack is particularly relevant for asymmetric cryptosystems: In an encryption-key attack, the attacker merely knows the public key but does not know any ciphertext yet, and seeks to determine the private key from this information. The difference is that the attacker now has plenty of time to perform computations, whereas in the other types of attacks the ciphertext was already sent and much less time is available to decrypt it. That is why keys of much larger size are required in public-key cryptography to guarantee the security of the system used. Hence, asymmetric cryptosystems are much less efficient than symmetric cryptosystems in many practical applications.
For the attacks mentioned above, the method of frequency counts is often useful. This method exploits the redundancy of the natural language used. For example, in many natural languages, the letter “E” occurs statistically significant most frequently. On average, the “E” occurs in long, “typical” texts with a percentage of in English, of in French, and even of in German. In other languages, different letters may occur most frequently. For example, the “A” is the most frequent letter in long, “typical” Finnish texts, with a percentage of .
That the method of frequency counts is useful for attacks on monoalphabetic cryptosystems is obvious. For example, if in a ciphertext encrypting a long German text by the shift cipher, the letter occurring most frequently is “Y”, which is rather rare in German (as well as in many other languages), then it is most likely that “Y” encrypts “E”. Thus, the key used for encryption is “U” ($k = 20$), since “Y” corresponds to $24$, “E” corresponds to $4$, and $24 - 4 = 20$. In addition to counting the frequency of single letters, one can also count the frequency with which certain pairs of letters (so-called digrams) and triples of letters (so-called trigrams) occur, and so on. This kind of attack also works for polyalphabetic systems, provided the period (i.e., the block length) is known.
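The single-letter frequency count for the shift cipher can be sketched as below. The sketch assumes that the most frequent ciphertext letter encrypts “E”, which is only a heuristic and may fail on short or atypical texts; the function name is illustrative.

```python
from collections import Counter


def guess_shift_key(ciphertext: str, most_common_plain: str = "E") -> int:
    """Guess the shift key, assuming the most frequent letter encrypts `most_common_plain`."""
    letters = [ch for ch in ciphertext.upper() if "A" <= ch <= "Z"]
    most_frequent = Counter(letters).most_common(1)[0][0]
    return (ord(most_frequent) - ord(most_common_plain)) % 26


# If "Y" is the most frequent ciphertext letter, the guessed key is 20, i.e. "U".
key = guess_shift_key("YYYAB")
key_letter = chr(key + ord("A"))
```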
Polyalphabetic cryptosystems with an unknown period, however, provide more security. For example, the Vigenère cipher resisted every attempt to break it for a long time. Not until 1863, about 300 years after its discovery, did the German cryptanalyst Friedrich Wilhelm Kasiski find a method of breaking the Vigenère cipher. He showed how to determine the period, which initially is unknown, from repetitions of the same substring in the ciphertext. Subsequently, the ciphertext can be decrypted by means of frequency counts. Singh writes that the British eccentric Charles Babbage, who was considered a genius of his time by many, presumably had discovered Kasiski's method even earlier, around 1854, although he did not publish his work.
The pathbreaking work of Claude Shannon (1916 until 2001), the father of modern coding and information theory, is now considered a milestone in the history of cryptography. Shannon proved that there exist cryptosystems that guarantee perfect secrecy in a mathematically rigorous sense. More precisely, a cryptosystem guarantees perfect secrecy if and only if , the keys in are uniformly distributed, and for each plaintext and for each ciphertext there exists exactly one key with . That means that such a cryptosystem often is not useful for practical applications, since in order to guarantee perfect secrecy, every key must be at least as long as the message to be encrypted and can be used only once.
In order to understand some of the algorithms and problems to be presented later, some fundamental notions, definitions, and results from algebra and, in particular, from number theory, group theory, and graph theory are required. This concerns both the cryptosystems and zero-knowledge protocols in Chapter 7 and some of the problems to be considered in upcoming Chapter 8. The present subsection may as well be skipped for now and revisited later when the required notions and results come up. In this section, most of the proofs are omitted.
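Definition 7.2 (Group, ring, and field)
A group $G = (S, \circ)$ is defined by some nonempty set $S$ and a binary operation $\circ$ on $S$ that satisfy the following axioms:
– Closure: $(\forall x \in S)\,(\forall y \in S)\,[x \circ y \in S]$.
– Associativity: $(\forall x, y, z \in S)\,[(x \circ y) \circ z = x \circ (y \circ z)]$.
– Neutral element: $(\exists e \in S)\,(\forall x \in S)\,[e \circ x = x \circ e = x]$.
– Inverse element: $(\forall x \in S)\,(\exists x^{-1} \in S)\,[x \circ x^{-1} = x^{-1} \circ x = e]$.
The element $e$ is called the neutral element of the group $G$. The element $x^{-1}$ is called the inverse element for $x$. $M = (S, \circ)$ is said to be a monoid if $M$ satisfies associativity and closure under $\circ$, even if $M$ has no neutral element or if not every element in $M$ has an inverse. A group $G = (S, \circ)$ (respectively, a monoid $M = (S, \circ)$) is said to be commutative (or abelian) if and only if $x \circ y = y \circ x$ for all $x, y \in S$. The number of elements of a finite group $G$ is said to be the order of $G$ and is denoted by $|G|$.
$H = (T, \circ)$ is said to be a subgroup of a group $G = (S, \circ)$ (denoted by $H \leq G$) if and only if $T \subseteq S$ and $H$ satisfies the group axioms.
A ring is a triple $R = (S, +, \cdot)$ such that $(S, +)$ is an abelian group and $(S, \cdot)$ is a monoid and the distributive laws are satisfied:
$$(\forall x, y, z \in S)\,[x \cdot (y + z) = (x \cdot y) + (x \cdot z) \ \text{ and } \ (x + y) \cdot z = (x \cdot z) + (y \cdot z)].$$
A ring $R = (S, +, \cdot)$ is said to be commutative if and only if the monoid $(S, \cdot)$ is commutative. The neutral element of the group $(S, +)$ is called the zero element (the zero, for short) of the ring $R$. A neutral element of the monoid $(S, \cdot)$ is called the one element (the one, for short) of the ring $R$.
Let $R = (S, +, \cdot)$ be a ring with one. An element $x$ of $R$ is said to be invertible (or a unit of $R$) if and only if it is invertible in the monoid $(S, \cdot)$.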
A field is a commutative ring with one in which each nonzero element is invertible.
Example 7.3 (Group, ring, and field)
Let $k \in \mathbb{N}$. The set $\mathbb{Z}_k = \{0, 1, \ldots, k-1\}$ is a finite group with respect to addition modulo $k$, with neutral element $0$. With respect to addition and multiplication modulo $k$, $\mathbb{Z}_k$ is a commutative ring with one, see Problem 7-1. If $p$ is a prime number, then $\mathbb{Z}_p$ is a field with respect to addition and multiplication modulo $p$.
Let $\gcd(n, m)$ denote the greatest common divisor of two integers $m$ and $n$. For $k \in \mathbb{N}$, define the set $\mathbb{Z}_k^* = \{i \mid 1 \leq i \leq k-1 \text{ and } \gcd(i, k) = 1\}$. With respect to multiplication modulo $k$, $\mathbb{Z}_k^*$ is a finite group with neutral element $1$.
If the operation of a group is clear from the context, we omit stating it explicitly. The group $\mathbb{Z}_k^*$ from Example 7.3 will play a particular role in Section 7.3, where the RSA cryptosystem is introduced. The Euler function $\varphi$ gives the order of this group, i.e., $\varphi(k) = |\mathbb{Z}_k^*|$. The following properties of $\varphi$ follow from the definition:
$\varphi(m \cdot n) = \varphi(m) \cdot \varphi(n)$ for all $m, n \in \mathbb{N}$ with $\gcd(m, n) = 1$, and
$\varphi(p) = p - 1$ for all prime numbers $p$.
The proof of these properties is left to the reader as Exercise 7.1-3. In particular, we will apply the following fact in Subsection 7.3.1, which is a consequence of the properties above.
Claim 7.3 If $n = p \cdot q$ for distinct prime numbers $p$ and $q$, then $\varphi(n) = (p-1)(q-1)$.
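The Euler function can be checked numerically for small arguments. The sketch below computes the units modulo $k$ directly from the definition of the group in Example 7.3 (so it is far too slow for cryptographic sizes); the function names are illustrative.

```python
from math import gcd


def units(k: int) -> list:
    """Elements i with 1 <= i <= k-1 and gcd(i, k) = 1."""
    return [i for i in range(1, k) if gcd(i, k) == 1]


def phi(k: int) -> int:
    """Euler function as the number of units modulo k (brute force)."""
    return len(units(k))


# Multiplicativity for coprime arguments, and the product-of-two-primes case.
multiplicative = phi(8 * 15) == phi(8) * phi(15)
two_primes = phi(7 * 11) == (7 - 1) * (11 - 1)
```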
Euler's Theorem below is a special case (namely, for the group $\mathbb{Z}_n^*$) of Lagrange's Theorem, which says that for each group element $g$ of a finite multiplicative group $G$ of order $|G|$ and with neutral element $e$, $g^{|G|} = e$. The special case of Euler's theorem, where $n$ is a prime number not dividing $a$, is known as Fermat's Little Theorem.
Theorem 7.4 (Euler) For each $a \in \mathbb{Z}_n^*$, $a^{\varphi(n)} \equiv 1 \bmod n$.
Corollary 7.5 (Fermat's Little Theorem) If $p$ is a prime number and $a \in \mathbb{Z}_p^*$, then $a^{p-1} \equiv 1 \bmod p$.
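Euler's theorem is easy to test empirically for small moduli. The following sketch raises every unit modulo $n$ to the power given by the number of units and checks that the result is $1$; it is a sanity check, not a proof, and the function name is illustrative.

```python
from math import gcd


def euler_holds(n: int) -> bool:
    """Check a^phi(n) = 1 (mod n) for every a in 1..n-1 that is coprime to n."""
    unit_list = [a for a in range(1, n) if gcd(a, n) == 1]
    order = len(unit_list)  # number of units modulo n, i.e. the Euler function at n
    return all(pow(a, order, n) == 1 for a in unit_list)


# Fermat's Little Theorem is the special case of a prime modulus p, where phi(p) = p - 1.
fermat_ok = all(pow(a, 13 - 1, 13) == 1 for a in range(1, 13))
```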
In Section 8.4, algorithms for the graph isomorphism problem will be presented. This problem, which also is related to the zero-knowledge protocols to be introduced in Subsection 7.5.2, can be seen as a special case of certain group-theoretic problems. In particular, permutation groups are of interest here. Some examples for illustration will be presented later.
Definition 7.6 (Permutation group)
A permutation is a bijective mapping of a set onto itself. For each integer $n \geq 1$, let $[n] = \{1, 2, \ldots, n\}$. The set of all permutations of $[n]$ is denoted by $S_n$. For algorithmic purposes, permutations $\pi \in S_n$ are given as pairs $(i, \pi(i))$ from $[n] \times [n]$.
If one defines the composition of permutations as an operation on $S_n$, then $S_n$ becomes a group. For two permutations $\pi$ and $\tau$ in $S_n$, their composition $\pi\tau$ is defined to be that permutation in $S_n$ that results from first applying $\pi$ and then $\tau$ to the elements of $[n]$, i.e., $(\pi\tau)(i) = \tau(\pi(i))$ for each $i \in [n]$. The neutral element of the permutation group $S_n$ is the identical permutation, which is defined by $\mathrm{id}(i) = i$ for each $i \in [n]$. The subgroup of $S_n$ that contains $\mathrm{id}$ as its only element is denoted by $\mathbf{id}$.
For any subset $T$ of $S_n$, the permutation group $\langle T \rangle$ generated by $T$ is defined as the smallest subgroup of $S_n$ containing $T$. Subgroups $G$ of $S_n$ are represented by their generating sets, sometimes dubbed the generators of $G$. In $G$, the orbit of an element $i \in [n]$ is defined as $G(i) = \{\pi(i) \mid \pi \in G\}$.
For any subset $T$ of $[n]$, let $S_n^T$ denote the subgroup of $S_n$ that maps every element of $T$ onto itself. In particular, for $i \leq n$ and a subgroup $G$ of $S_n$, the (pointwise) stabiliser of $[i]$ in $G$ is defined by
$$G^{(i)} = \{\pi \in G \mid \pi(j) = j \text{ for each } j \in [i]\}.$$
Observe that $G^{(n)} = \mathbf{id}$ and $G^{(0)} = G$.
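Let $G$ and $H$ be permutation groups with $H \leq G$. For $\tau \in G$, $H\tau = \{\pi\tau \mid \pi \in H\}$ is said to be a right coset of $H$ in $G$. Any two right cosets of $H$ in $G$ are either identical or disjoint. Thus, the permutation group $G$ is partitioned by the right cosets of $H$ in $G$:
$$G = H\tau_1 \cup H\tau_2 \cup \cdots \cup H\tau_k. \tag{7.3}$$
Every right coset of $H$ in $G$ has the cardinality $|H|$. The set $\{\tau_1, \tau_2, \ldots, \tau_k\}$ in (7.3) is called the complete right transversal of $H$ in $G$.
The notion of pointwise stabilisers is especially important for the design of algorithms solving problems on permutation groups. The crucial structure exploited there is the so-called tower of stabilisers of a permutation group $G$:
$$\mathbf{id} = G^{(n)} \leq G^{(n-1)} \leq \cdots \leq G^{(1)} \leq G^{(0)} = G.$$
For each $i$ with $1 \leq i \leq n$, let $T_i$ be the complete right transversal of $G^{(i)}$ in $G^{(i-1)}$. Then, the union $T$ of the transversals $T_i$ is said to be a strong generator of $G$, and we have $G = \langle T \rangle$. Every $\pi \in G$ then has a unique factorisation into permutations $\tau_i \in T_i$, one from each transversal. The following basic algorithmic results about permutation groups will be useful later in Section 8.4.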
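Theorem 7.7 Let a permutation group $G \leq S_n$ be given by a generator. Then, we have:
1. For each $i \in [n]$, the orbit $G(i)$ of $i$ in $G$ can be computed in polynomial time.
2. The tower of stabilisers $\mathbf{id} = G^{(n)} \leq G^{(n-1)} \leq \cdots \leq G^{(1)} \leq G^{(0)} = G$ can be computed in time polynomial in $n$, i.e., for each $i$ with $1 \leq i \leq n$, the complete right transversals $T_i$ of $G^{(i)}$ in $G^{(i-1)}$ and thus a strong generator of $G$ can be determined efficiently.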
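The first statement of the theorem can be illustrated by a simple closure computation: starting from a point, apply the generators repeatedly until no new points appear. The sketch below is one straightforward way to do this, not necessarily the method behind the theorem. As an assumption of the sketch, points are numbered from 0 and a permutation is a tuple `g` with `g[i]` the image of `i`.

```python
def orbit(point, generators):
    """Orbit of `point` under the group generated by `generators`.

    Each generator is a tuple g where g[i] is the image of point i (0-based).
    Since the group is finite, closing under the generators alone suffices.
    """
    seen = {point}
    frontier = [point]
    while frontier:
        x = frontier.pop()
        for g in generators:
            y = g[x]
            if y not in seen:
                seen.add(y)
                frontier.append(y)
    return seen


# Hypothetical generator: the 3-cycle 0 -> 1 -> 2 -> 0 on four points, fixing point 3.
three_cycle = (1, 2, 0, 3)
orbit_of_0 = orbit(0, [three_cycle])
orbit_of_3 = orbit(3, [three_cycle])
```

Each point enters the frontier at most once and each generator is applied once per point, so the running time is linear in the number of points times the number of generators.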
The notions introduced in Definition 7.6 for permutation groups are now explained for concrete examples from graph theory. In particular, we consider the automorphism group and the set of isomorphisms between graphs. We start by introducing some useful graph-theoretical concepts.
Definition 7.8 (Graph isomorphism and graph automorphism) A graph $G$ consists of a finite set of vertices, $V(G)$, and a finite set of edges, $E(G)$, that connect certain vertices with each other. We assume that no multiple edges and no loops occur. In this chapter, we consider only undirected graphs, i.e., the edges are not oriented and can be seen as unordered vertex pairs. The disjoint union $G \cup H$ of two graphs $G$ and $H$ is defined as the graph with vertex set $V(G) \cup V(H)$ and edge set $E(G) \cup E(H)$, where we assume that the vertex sets $V(G)$ and $V(H)$ are made disjoint (by renaming if necessary).
Let $G$ and $H$ be two graphs with the same number of vertices. An isomorphism between $G$ and $H$ is an edge-preserving bijection of the vertex set of $G$ onto that of $H$. Under the convention that $V(G) = \{1, 2, \ldots, n\} = V(H)$, $G$ and $H$ are isomorphic ($G \cong H$, for short) if and only if there exists a permutation $\pi \in S_n$ such that for all vertices $i, j \in V(G)$,
$$\{i, j\} \in E(G) \iff \{\pi(i), \pi(j)\} \in E(H). \tag{7.4}$$
An automorphism of $G$ is an edge-preserving bijection of the vertex set of $G$ onto itself. Every graph has the trivial automorphism $\mathrm{id}$. By $\mathrm{Iso}(G, H)$ we denote the set of all isomorphisms between $G$ and $H$, and by $\mathrm{Aut}(G)$ we denote the set of all automorphisms of $G$. Define the problems graph isomorphism ($\mathrm{GI}$, for short) and graph automorphism ($\mathrm{GA}$, for short) by:
$$\mathrm{GI} = \{(G, H) \mid G \text{ and } H \text{ are isomorphic graphs}\}; \qquad \mathrm{GA} = \{G \mid G \text{ has a nontrivial automorphism}\}.$$
For algorithmic purposes, graphs are represented either by their vertex and edge lists or by an adjacency matrix, which has the entry $1$ at position $(i, j)$ if $\{i, j\}$ is an edge, and the entry $0$ otherwise. This graph representation is suitably encoded over the alphabet $\Sigma = \{0, 1\}$. In order to represent pairs of graphs, we use a standard bijective pairing function $(\cdot, \cdot)$ from $\Sigma^* \times \Sigma^*$ onto $\Sigma^*$ that is polynomial-time computable and has polynomial-time computable inverses.
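For very small graphs, the definition of an isomorphism can be tested by brute force: try every permutation of the vertices and keep those that map edges exactly onto edges. The sketch below does this in time proportional to $n!$, so it only serves to illustrate the definitions. It assumes vertices are numbered $1, \ldots, n$ and graphs are given as edge lists; the function names are illustrative.

```python
from itertools import permutations


def isomorphisms(n, edges_g, edges_h):
    """All permutations pi of {1..n} that map the edge set of G onto that of H."""
    eg = {frozenset(e) for e in edges_g}
    eh = {frozenset(e) for e in edges_h}
    if len(eg) != len(eh):
        return []  # different edge counts rule out any isomorphism
    result = []
    for pi in permutations(range(1, n + 1)):  # pi[v - 1] is the image of vertex v
        # With equal edge counts, mapping every edge of G to an edge of H
        # already implies the "if and only if" of the definition.
        if all(frozenset(pi[v - 1] for v in e) in eh for e in eg):
            result.append(pi)
    return result


def automorphisms(n, edges):
    """Automorphisms are the isomorphisms from a graph to itself."""
    return isomorphisms(n, edges, edges)


# Hypothetical example: two paths on three vertices with different centre vertices.
path_a = [(1, 2), (2, 3)]
path_b = [(1, 3), (3, 2)]
iso = isomorphisms(3, path_a, path_b)
```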
Example 7.4 (Graph isomorphism and graph automorphism) The graphs and shown in Figure 7.5 are isomorphic.
An isomorphism preserving adjacency of the vertices according to (7.4) is given, e.g., by , or in cycle notation by . There are three further isomorphisms between and , i.e., , see Exercise 7.1-4. However, neither nor is isomorphic to . This is immediately seen from the fact that the sequence of vertex degrees (the number of edges fanning out of each vertex) of and , respectively, differs from the sequence of vertex degrees of : For and , this sequence is , whereas it is for . A nontrivial automorphism of is given, e.g., by , or ; another one by , or . There are two more automorphisms of , i.e., , see Exercise 7.1-4.
The permutation groups , , and are subgroups of . The tower of stabilisers of consists of the subgroups , , and . In the automorphism group of , the vertices and have the orbit , the vertices and have the orbit , and vertex has the orbit .
and have the same number of elements if and only if and are isomorphic. To wit, if and are isomorphic, then follows from . Otherwise, if , then is empty but contains always the trivial automorphism . This implies (7.5) in Lemma 7.9 below, which we will need later in Section 8.4. For proving (7.6), suppose that and are connected; otherwise, we simply consider instead of and the co-graphs and , see Exercise 7.1-5. An automorphism of that switches the vertices of and , consists of an isomorphism in and an isomorphism in . Thus, , which implies (7.6) via (7.5).
Lemma 7.9 For any two graphs and , we have
If and are isomorphic graphs and if , then . Thus, is a right coset of in . Since any two right cosets are either identical or disjoint, can be partitioned into right cosets of according to (7.3):
where for all , . The set of permutations in thus is a complete right transversal of in . Denoting by the graph that results from applying the permutation to the vertices of , we have . Since there are exactly permutations in , it follows from (7.7) that
This proves the following corollary.
Corollary 7.10 If $G$ is a graph with $n$ vertices, then there are exactly $n!/|\mathrm{Aut}(G)|$ graphs isomorphic to $G$.
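The counting statement of the corollary can be confirmed by exhaustive enumeration on tiny graphs. The sketch below relabels a graph by every permutation of its vertices, collects the distinct edge sets, and compares their number with $n!$ divided by the number of automorphisms. Vertices numbered $1, \ldots, n$ and the example graphs are assumptions of the sketch.

```python
from itertools import permutations
from math import factorial


def apply(pi, edges):
    """Edge set obtained by relabelling every vertex v as pi[v - 1]."""
    return frozenset(frozenset(pi[v - 1] for v in e) for e in edges)


def relabelled_graphs(n, edges):
    """Distinct graphs on {1..n} that arise from `edges` by permuting the vertices."""
    return {apply(pi, edges) for pi in permutations(range(1, n + 1))}


def automorphism_count(n, edges):
    """Number of vertex permutations that leave the edge set unchanged."""
    original = frozenset(frozenset(e) for e in edges)
    return sum(1 for pi in permutations(range(1, n + 1)) if apply(pi, edges) == original)


cycle4 = [(1, 2), (2, 3), (3, 4), (4, 1)]  # hypothetical example: a cycle on 4 vertices
count = len(relabelled_graphs(4, cycle4))
predicted = factorial(4) // automorphism_count(4, cycle4)
```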
For the graph in Figure 7.5 from Example 7.4, there thus exist exactly isomorphic graphs. The following lemma will be used later in Section 8.4.
Lemma 7.11 Let and be two graphs with vertices. Define the set
Then, we have
Proof. If and are isomorphic, then . Hence, by Corollary 7.10, we have
Analogously, one can show that . If and are isomorphic, then
It follows that . If and are nonisomorphic, then and are disjoint sets. Thus, .
Exercises
7.1-1 Figure 7.6 shows two ciphertexts, and . It is known that both encrypt the same plaintext and that one was obtained using the shift cipher, the other one using the Vigenère cipher. Decrypt both ciphertexts.
Hint. After decryption you will notice that the plaintext obtained is a true statement for one of the two ciphertexts, whereas it is a false statement for the other ciphertext. Is perhaps the method of frequency counts useful here?
7.1-2 Prove that is a ring with respect to ordinary addition and multiplication. Is it also a field? What can be said about the properties of the algebraic structures , , and ?
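7.1-3 Prove the properties stated for Euler's function $\varphi$:
a. $\varphi(m \cdot n) = \varphi(m) \cdot \varphi(n)$ for all $m, n \in \mathbb{N}$ with $\gcd(m, n) = 1$.
b. $\varphi(p) = p - 1$ for all prime numbers $p$.
Using these properties, prove Claim 7.3.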
7.1-4 Consider the graphs , , and from Figure 7.5 in Example 7.4.
a. Determine all isomorphisms between and .
b. Determine all automorphisms of , , and .
c. For which isomorphisms between and is a right coset of in , i.e., for which is ? Determine the complete right transversals of , , and in .
d. Determine the orbit of all vertices of in and the orbit of all vertices of in .
e. Determine the tower of stabilisers of the subgroups and .
f. How many graphs with vertices are isomorphic to ?
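7.1-5 The co-graph $\overline{G}$ of a graph $G$ is defined by the vertex set $V(\overline{G}) = V(G)$ and the edge set $E(\overline{G}) = \{\{i, j\} \mid i, j \in V(G),\ i \neq j, \text{ and } \{i, j\} \notin E(G)\}$. Prove the following claims:
a. $\mathrm{Aut}(G) = \mathrm{Aut}(\overline{G})$.
b. $\mathrm{Iso}(G, H) = \mathrm{Iso}(\overline{G}, \overline{H})$.
c. $\overline{G}$ is connected if $G$ is not connected.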
The basic number-theoretic facts presented in Subsection 7.1.3 will be needed in this and the subsequent sections. In particular, recall the multiplicative group from Example 7.3 and Euler's function. The arithmetics in remainder class rings will be explained at the end of this chapter, see Problem 7-1.
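Figure 7.7 shows the Diffie-Hellman secret-key agreement protocol, which is based on exponentiation with base $r$ and modulus $p$, where $p$ is a prime number and $r$ is a primitive root of $p$. A primitive root of a number $n$ is any element $r \in \mathbb{Z}_n^*$ such that $r^d \not\equiv 1 \bmod n$ for each $d$ with $1 \leq d < \varphi(n)$. A primitive root $r$ of $n$ generates the entire group $\mathbb{Z}_n^*$, i.e., $\mathbb{Z}_n^* = \{r^i \mid 0 \leq i < \varphi(n)\}$. Recall that for any prime number $p$ the group $\mathbb{Z}_p^*$ has order $\varphi(p) = p - 1$. $\mathbb{Z}_p^*$ has exactly $\varphi(p-1)$ primitive roots, see also Exercise 7.2-1.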
Example 7.5 Consider . Since , we have , and the two primitive roots of are and . Both and generate all of , since:
Not every number has a primitive root; $8$ is the smallest such example. It is known that a number $n$ has a primitive root if and only if either $n$ is from $\{1, 2, 4\}$, or has the form $n = q^k$ or $n = 2q^k$ for some odd prime number $q$.
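Primitive roots of small numbers can be found by testing the defining condition directly. The sketch below is a brute-force check under that definition and is not meant for numbers of cryptographic size; the function name is illustrative.

```python
from math import gcd


def primitive_roots(n: int) -> list:
    """All r in Z_n^* whose powers r^d differ from 1 modulo n for 1 <= d < phi(n)."""
    unit_list = [i for i in range(1, n) if gcd(i, n) == 1]
    order = len(unit_list)  # phi(n), the order of the group of units
    return [r for r in unit_list if all(pow(r, d, n) != 1 for d in range(1, order))]


roots_of_5 = primitive_roots(5)   # a small prime modulus chosen for illustration
roots_of_8 = primitive_roots(8)   # 8 has no primitive root
without_root = [n for n in range(2, 13) if not primitive_roots(n)]
```

For a prime $p$, the length of `primitive_roots(p)` can be compared with the number of units modulo $p - 1$, in line with the count of primitive roots stated above.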
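Definition 7.12 (Discrete logarithm) Let $p$ be a prime number, and let $r$ be a primitive root of $p$. The modular exponential function with base $r$ and modulus $p$ is the function $\exp_{r,p}$ that maps from $\mathbb{Z}_{p-1}$ to $\mathbb{Z}_p^*$, and is defined by $\exp_{r,p}(a) = r^a \bmod p$. Its inverse function is called the discrete logarithm, and maps for fixed $p$ and $r$ the value $\exp_{r,p}(a)$ to $a$. We write $a = \log_r \exp_{r,p}(a) \bmod p$.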
The protocol given in Figure 7.7 works, since (in the arithmetics modulo )
so Alice indeed computes the same key as Bob. Computing this key is not hard, since modular exponentiation can be performed efficiently using the square-and-multiply method from algorithm Square-and-Multiply.
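A toy run of the key agreement can be sketched as follows. The names `alpha` and `beta` for the public values, and the numbers $p = 23$, $r = 5$, $a = 6$, $b = 15$, are assumptions chosen for illustration only; they are far too small to offer any security. Python's built-in three-argument `pow` performs the modular exponentiation.

```python
def diffie_hellman_demo(p: int, r: int, a: int, b: int):
    """Simulate both parties; a and b are the private exponents of Alice and Bob."""
    alpha = pow(r, a, p)          # Alice's public value, sent to Bob
    beta = pow(r, b, p)           # Bob's public value, sent to Alice
    key_alice = pow(beta, a, p)   # Alice raises Bob's value to her secret exponent
    key_bob = pow(alpha, b, p)    # Bob raises Alice's value to his secret exponent
    return alpha, beta, key_alice, key_bob


alpha, beta, key_alice, key_bob = diffie_hellman_demo(p=23, r=5, a=6, b=15)
```

Both keys agree because raising $r$ to the power $a$ and then $b$ gives the same result as raising it to $b$ and then $a$.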
Erich, however, has a hard time when he attempts to determine their key, since computing the discrete logarithm is considered to be a hard problem. That is why the modular exponential function is a candidate for a one-way function, a function that is easy to compute but hard to invert. Note that it is not known whether or not one-way functions indeed exist; this is one of the most challenging open research questions in cryptography. The security of many cryptosystems rests on the assumption that one-way functions indeed exist.
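Square-and-Multiply($a, b, m$)
1 $m$ is the modulus, $b$ is the base, and $a$ is the exponent
2 determine the binary expansion of the exponent $a = \sum_{i=0}^{k} a_i 2^i$, where $a_i \in \{0, 1\}$
3 compute successively $b^{2^i}$, where $0 \leq i \leq k$, using that $b^{2^{i+1}} = \left(b^{2^i}\right)^2$
4 compute $b^a = \prod_{i \,:\, a_i = 1} b^{2^i}$ in the arithmetic modulo $m$
5 as soon as a factor $b^{2^i}$ in the product and $b^{2^{i+1}}$ are determined, $b^{2^i}$ can be deleted and need not be stored
6 RETURN $b^a$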
Why can the modular exponential function be computed efficiently? Naively performed, this computation may require many multiplications, depending on the size of the exponent $a$. However, using algorithm Square-and-Multiply there is no need to perform $a - 1$ multiplications as in the naive approach; no more than $2 \log_2 a$ multiplications suffice. The square-and-multiply algorithm thus speeds modular exponentiation up by an exponential factor.
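Note that in the arithmetics modulo $m$, for base $b$ and exponent $a = \sum_{i=0}^{k} a_i 2^i$ with $a_i \in \{0, 1\}$, we have
$$b^a = b^{\sum_{i=0}^{k} a_i 2^i} = \prod_{i=0}^{k} \left(b^{2^i}\right)^{a_i} = \prod_{i=0,\ a_i=1}^{k} b^{2^i}.$$
Thus, the algorithm Square-and-Multiply is correct.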
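The square-and-multiply idea can be sketched in a few lines of Python. This is an illustration only: the function and parameter names are our own, and the loop reads the bits of the exponent from the least significant one upwards, keeping just the current power $b^{2^i}$ in memory, as suggested by the remark that earlier squares need not be stored.

```python
def square_and_multiply(base: int, exponent: int, modulus: int) -> int:
    """Compute base**exponent mod modulus for exponent >= 0 and modulus >= 1."""
    result = 1 % modulus
    square = base % modulus               # invariant: square = base^(2^i) mod modulus
    while exponent > 0:
        if exponent & 1:                  # bit a_i is 1: this factor enters the product
            result = (result * square) % modulus
        square = (square * square) % modulus   # advance to base^(2^(i+1))
        exponent >>= 1
    return result

print(square_and_multiply(3, 17, 5))
```

Python's built-in `pow(base, exponent, modulus)` computes the same value and can serve as a cross-check.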
Example 7.6 [Square-and-Multiply in the Diffie-Hellman Protocol] Alice and Bob have chosen the prime number and the primitive root of . Alice picks the secret number . In order to send her public number to Bob, Alice wishes to compute . The binary expansion of the exponent is . Alice successively computes the values:
Then, she computes . Note that Alice does not have to multiply times but merely performs four squarings and one multiplication to determine .
Suppose that Bob has chosen the secret exponent . By the same method, he can compute his part of the key, . Now, Alice and Bob determine their joint secret key according to the Diffie-Hellman protocol from Figure 7.7; see Exercise 7.2-2.
Note that the protocol is far from being secure in this case, since the prime number and the secret exponents and are much too small. This toy example was chosen just to explain how the protocol works. In practice, and should have at least bits each.
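To see both the protocol and its weakness for small parameters in action, here is a hedged sketch with toy values of our own choosing (prime 23, primitive root 5, secret exponents 6 and 15); these are not the values of Example 7.6. The helper `discrete_log` is a hypothetical brute-force search that is feasible only because the prime is tiny.

```python
p, r = 23, 5                  # assumed toy prime and a primitive root of it
a, b = 6, 15                  # assumed secret exponents of Alice and Bob

alpha = pow(r, a, p)          # Alice's public number
beta = pow(r, b, p)           # Bob's public number
key_alice = pow(beta, a, p)   # Alice's view of the joint key
key_bob = pow(alpha, b, p)    # Bob's view of the joint key

def discrete_log(base: int, value: int, prime: int) -> int:
    """Brute-force search for x with base**x = value (mod prime)."""
    for x in range(prime - 1):
        if pow(base, x, prime) == value:
            return x
    raise ValueError("value is not a power of base")

# With such a small prime, Erich recovers Alice's exponent and hence the key.
recovered_a = discrete_log(r, alpha, p)
key_erich = pow(beta, recovered_a, p)
print(alpha, beta, key_alice, key_bob, key_erich)
```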
If Erich has been listening very carefully, he knows the values $p$, $r$, $\alpha$, and $\beta$ (the prime number, the primitive root, and the two numbers sent by Alice and Bob) after Alice and Bob have executed the protocol. His aim is to determine their joint secret key $k_A = k_B$. This problem is known as the Diffie-Hellman problem. If Erich were able to determine the secret exponents $a = \log_r \alpha \bmod p$ and $b = \log_r \beta \bmod p$ efficiently, he could compute the key just like Alice and Bob and thus would have solved the Diffie-Hellman problem. Thus, this problem is no harder than the problem of computing discrete logarithms. The converse, namely that the Diffie-Hellman problem is at least as hard as computing the discrete logarithm (i.e., that the two problems are equally hard), is still just an unproven conjecture. Like many other cryptographic protocols, the Diffie-Hellman protocol currently has no proof of security.
However, since to date neither the discrete logarithm nor the Diffie-Hellman problem can be solved efficiently, this direct attack is not a practical threat. On the other hand, there do exist other, indirect attacks in which the key is determined not immediately from the values $\alpha$ and $\beta$ communicated in the Diffie-Hellman protocol. For example, Diffie-Hellman is vulnerable to the “man-in-the-middle” attack. Unlike the passive attack described above, this attack is active, since the attacker Erich aims at actively altering the protocol to his own advantage. He is the “man in the middle” between Alice and Bob, and he intercepts Alice's message $\alpha = r^a \bmod p$ to Bob and Bob's message $\beta = r^b \bmod p$ to Alice. Instead of $\alpha$ and $\beta$, he forwards his own values $\alpha_E = r^c \bmod p$ to Bob and $\beta_E = r^d \bmod p$ to Alice, where the private numbers $c$ and $d$ were chosen by Erich. Now, if Alice computes her key $k_A = (\beta_E)^a \bmod p$, which she falsely presumes to share with Bob, $k_A$ in fact is a key for future communications with Erich, who determines the same key by computing (in the arithmetics modulo $p$)
$$k_E = \alpha^d = r^{ad} = r^{da} = (\beta_E)^a = k_A.$$
Similarly, Erich can share a key with Bob, who has not the slightest idea that he in fact communicates with Erich. This raises the issue of authentication, which we will deal with in more detail later in Section 7.5 about zero-knowledge protocols.
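The man-in-the-middle attack can be simulated directly. The sketch below uses toy parameters of our own choosing (prime 23, primitive root 5, and arbitrary small secret exponents); all variable names are illustrative.

```python
p, r = 23, 5                   # assumed toy prime and primitive root
a, b = 6, 15                   # secret exponents of Alice and Bob
c, d = 3, 10                   # private numbers chosen by Erich

alpha = pow(r, a, p)           # sent by Alice, intercepted by Erich
beta = pow(r, b, p)            # sent by Bob, intercepted by Erich
alpha_e = pow(r, c, p)         # forwarded to Bob in place of alpha
beta_e = pow(r, d, p)          # forwarded to Alice in place of beta

key_alice = pow(beta_e, a, p)  # Alice believes she shares this key with Bob
key_bob = pow(alpha_e, b, p)   # Bob believes he shares this key with Alice

erich_with_alice = pow(alpha, d, p)   # Erich's key for talking to Alice
erich_with_bob = pow(beta, c, p)      # Erich's key for talking to Bob

print(key_alice == erich_with_alice, key_bob == erich_with_bob)
```

Both comparisons hold for any choice of the exponents, since $\alpha^d = r^{ad} = (\beta_E)^a$ and $\beta^c = r^{bc} = (\alpha_E)^b$ modulo $p$.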
Exercises
7.2-1 a. How many primitive roots do and have?
b. Determine all primitive roots of and , and prove that they indeed are primitive roots.
c. Show that every primitive root of and of , respectively, generates all of and .
7.2-2 a. Determine Bob's number from Example 7.6 using the algorithm Square-and-Multiply.
b. For and from Example 7.6, determine the joint secret key of Alice and Bob according to the Diffie-Hellman protocol from Figure 7.7.
The RSA cryptosystem, which is named after its inventors Ron Rivest, Adi Shamir, and Leonard Adleman [214], is the first public-key cryptosystem. It is still very popular today and is used by various cryptographic applications. Figure 7.8 shows the individual steps of the RSA protocol, which we will now describe in more detail; see also Example 7.7.
1. Key generation: Bob chooses two large prime numbers at random, $p$ and $q$ with $p \neq q$, and computes their product $n = pq$. He then chooses an exponent $e \in \mathbb{N}$ satisfying
$$1 < e < \varphi(n) = (p-1)(q-1) \quad\text{and}\quad \gcd(e, \varphi(n)) = 1$$
and computes the inverse of $e \bmod \varphi(n)$, i.e., the unique number $d$ such that
$$1 < d < \varphi(n) \quad\text{and}\quad e \cdot d \equiv 1 \bmod \varphi(n).$$
The pair $(n, e)$ is Bob's public key, and $d$ is Bob's private key.
2. Encryption: As in Section 7.1, messages are strings over an alphabet $\Sigma$. Any message is subdivided into blocks of a fixed length, which are encoded as positive integers in $|\Sigma|$-adic representation. These integers are then encrypted. Let $m < n$ be the number encoding some block of the message Alice wishes to send to Bob. Alice knows Bob's public key $(n, e)$ and encrypts $m$ as the number $c = E_{(n,e)}(m)$, where the encryption function is defined by
$$E_{(n,e)}(m) = m^e \bmod n.$$
3. Decryption: Let $c$ with $0 \leq c < n$ be the number encoding one block of the ciphertext, which is received by Bob and also by the eavesdropper Erich. Bob decrypts $c$ by using his private key $d$ and the following decryption function
$$D_d(c) = c^d \bmod n.$$
Theorem 7.13 states that the RSA protocol described above indeed is a cryptosystem in the sense of Definition 7.1. The proof of Theorem 7.13 is left to the reader as Exercise 7.3-1.
Theorem 7.13 Let $(n, e)$ be the public key and $d$ be the private key in the RSA protocol. Then, for each message $m$ with $0 \leq m < n$,
$$m = (m^e)^d \bmod n.$$
Hence, RSA is a public-key cryptosystem.
To make RSA encryption and (authorised) decryption efficient, the algorithm Square-and-Multiply is again employed for fast exponentiation.
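The three steps of the protocol, together with the statement of Theorem 7.13, can be tried out on a toy key. The sketch below assumes the small primes 61 and 53 and the exponent 17, which are our own choice and far too small for real use; the function names are illustrative, and the modular inverse is obtained with Python's built-in `pow(e, -1, phi)` (available from Python 3.8 on) instead of an explicit extended Euclidean algorithm.

```python
from math import gcd

def rsa_keygen(p: int, q: int, e: int):
    """Return ((n, e), d) for distinct primes p, q and an admissible exponent e."""
    n = p * q
    phi = (p - 1) * (q - 1)
    if not (1 < e < phi and gcd(e, phi) == 1):
        raise ValueError("e must satisfy 1 < e < phi(n) and gcd(e, phi(n)) = 1")
    d = pow(e, -1, phi)              # inverse of e modulo phi(n)
    return (n, e), d

def encrypt(m: int, public_key) -> int:
    n, e = public_key
    return pow(m, e, n)              # fast modular exponentiation

def decrypt(c: int, public_key, d: int) -> int:
    n, _ = public_key
    return pow(c, d, n)

public_key, d = rsa_keygen(61, 53, 17)
n = public_key[0]

# Check the claim of Theorem 7.13 for every message 0 <= m < n of this toy key.
round_trip_ok = all(decrypt(encrypt(m, public_key), public_key, d) == m
                    for m in range(n))
print(public_key, d, round_trip_ok)
```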
How should one choose the prime numbers $p$ and $q$ in the RSA protocol? First of all, they must be large enough, since otherwise Erich would be able to factor the number $n$ in Bob's public key $(n, e)$. Knowing the prime factors $p$ and $q$ of $n$, he could then easily determine Bob's private key $d$ using the extended Euclidean algorithm, since $d$ is the unique inverse of $e \bmod \varphi(n)$, where $\varphi(n) = (p-1)(q-1)$. To keep the prime numbers $p$ and $q$ secret, they thus must be sufficiently large. In practice, $p$ and $q$ should have at least digits each. To this end, one generates numbers of this size randomly and then checks using one of the known randomised primality tests whether the chosen numbers are primes indeed. By the Prime Number Theorem, there are about $N / \ln N$ prime numbers not exceeding $N$. Thus, the odds are good to hit a prime after reasonably few trials.
In theory, the primality of and can be decided even in deterministic polynomial time. Agrawal et al. [2], [3] recently showed that the primality problem, which is defined by
is a member of the complexity class P. Their breakthrough solved a longstanding open problem in complexity theory: Among a few other natural problems such as the graph isomorphism problem, the primality problem was considered to be one of the rare candidates of problems that are neither in P nor NP-complete.
Footnote. The complexity classes P and NP will be defined in Section 8.1 and the notion of NP-completeness will be defined in Section 8.2.
For practical purposes, however, the known randomised algorithms are more useful than the deterministic algorithm by Agrawal et al. The running time of obtained in their original paper [2], [3] could be improved to meanwhile, applying a more sophisticated analysis, but this running time is still not as good as that of the randomised algorithms.
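Miller-Rabin($n$)
1 determine the representation $n - 1 = 2^k m$, where $m$ and $n$ are odd
2 choose a number $z \in \{1, 2, \ldots, n-1\}$ at random under the uniform distribution
3 compute $x = z^m \bmod n$
4 IF $x \equiv 1 \bmod n$
5   THEN RETURN “$n$ is a prime number”
6   ELSE FOR $j \leftarrow 0$ TO $k - 1$
7     DO IF $x \equiv -1 \bmod n$
8       THEN RETURN “$n$ is a prime number”
9       ELSE $x \leftarrow x^2 \bmod n$
10 RETURN “$n$ is not a prime number”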
One of the most popular randomised primality tests is the algorithm Miller-Rabin developed by Rabin [209], which is based on the ideas underlying the deterministic algorithm of Miller [182]. The Miller-Rabin test is a so-called Monte Carlo algorithm, since the “no” answers of the algorithm are always reliable, whereas its “yes” answers have a certain error probability. An alternative to the Miller-Rabin test is the primality test of Solovay and Strassen [246]. Both primality tests run in time polynomial in the length of the input. However, the Solovay-Strassen test is less popular because it is not as efficient in practice and also less accurate than the Miller-Rabin test.
The class of problems solvable via Monte Carlo algorithms with always reliable “yes” answers is named RP, which stands for Randomised Polynomial Time. The complementary class, $\mathrm{coRP} = \{L \mid \overline{L} \in \mathrm{RP}\}$, contains all those problems solvable via Monte Carlo algorithms with always reliable “no” answers. Formally, RP is defined via nondeterministic polynomial-time Turing machines (NPTMs, for short; see Section 8.1 and in particular Definitions 8.1, 8.2, and 8.3) whose computations are viewed as random processes: For each nondeterministic guess, the machine flips an unbiased coin and follows each of the resulting two next configurations with probability $1/2$ until a final configuration is reached. Depending on the number of accepting computation paths of the given NPTM, one obtains a certain acceptance probability for each input. Errors may occur. The definition of RP requires that the error probability must be below the threshold of $1/2$ for an input to be accepted, and there must occur no error at all for an input to be rejected.
Definition 7.14 (Randomised polynomial time) The class RP consists of exactly those problems $A$ for which there exists an NPTM $M$ such that for each input $x$, if $x \in A$ then $M(x)$ accepts with probability at least $1/2$, and if $x \notin A$ then $M(x)$ accepts with probability $0$.
Theorem 7.15 The primality problem is in coRP.
Theorem 7.15 follows from the fact that, for example, the Miller-Rabin test is a Monte Carlo algorithm for the primality problem. We present a proof sketch only. We show that the Miller-Rabin test accepts the primality problem with one-sided error probability as in Definition 7.14: If the given number $n$ (represented in binary) is a prime number then the algorithm cannot answer erroneously that $n$ is not prime. For a contradiction, suppose $n$ is prime but the Miller-Rabin test halts with the output: “$n$ is not a prime number”. Hence, $z^m \not\equiv 1 \bmod n$, where $n - 1 = 2^k m$ with $m$ odd and $z$ is the randomly chosen number. Since $x$ is squared in each iteration of the loop, we sequentially test the values
$$z^m, z^{2m}, \ldots, z^{2^{k-1} m}$$
modulo $n$. By assumption, for none of these values does the algorithm say that $n$ is prime. It follows that for each $j$ with $0 \leq j \leq k - 1$,
$$z^{2^j m} \not\equiv -1 \bmod n.$$
Since $n - 1 = 2^k m$, Fermat's Little Theorem (see Corollary 7.5) implies $z^{2^k m} \equiv 1 \bmod n$. Thus, $z^{2^{k-1} m}$ is a square root of $1$ modulo $n$. Since $n$ is prime, there are only two square roots of $1$ modulo $n$, namely $\pm 1 \bmod n$, see Exercise 7.3-1. Since $z^{2^{k-1} m} \not\equiv -1 \bmod n$, we must have $z^{2^{k-1} m} \equiv 1 \bmod n$. But then, $z^{2^{k-2} m}$ again is a square root of $1$ modulo $n$. By the same argument, $z^{2^{k-2} m} \equiv 1 \bmod n$. Repeating this argument again and again, we eventually obtain $z^m \equiv 1 \bmod n$, a contradiction. It follows that the Miller-Rabin test works correctly for each prime number. On the other hand, if $n$ is not a prime number, it can be shown that the error probability of the Miller-Rabin test does not exceed the threshold of $1/4$. By increasing the number of independent trials, the error probability can be made arbitrarily close to zero, at the cost of increasing the running time of course, which still will be polynomial in $\log n$, where $\log n$ is the size of the input $n$ represented in binary.
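The test and its one-sided error can be explored with a short Python sketch. The names `miller_rabin_round` and `is_probably_prime` are our own, the special handling of numbers below 4 and of even numbers is an addition for convenience, and the default of 20 independent trials is an arbitrary choice.

```python
import random

def miller_rabin_round(n: int, z: int) -> bool:
    """One round for odd n > 2 and 1 <= z <= n - 1; True means 'n is a prime number'."""
    k, m = 0, n - 1
    while m % 2 == 0:            # write n - 1 = 2^k * m with m odd
        k += 1
        m //= 2
    x = pow(z, m, n)
    if x == 1:
        return True
    for _ in range(k):
        if x == n - 1:           # x is congruent to -1 modulo n
            return True
        x = (x * x) % n
    return False                 # this answer is always reliable

def is_probably_prime(n: int, trials: int = 20) -> bool:
    """Repeat independent rounds to shrink the error of a 'prime' answer."""
    if n < 2:
        return False
    if n in (2, 3):
        return True
    if n % 2 == 0:
        return False
    return all(miller_rabin_round(n, random.randrange(1, n)) for _ in range(trials))

print(miller_rabin_round(561, 2))    # composite 561 = 3 * 11 * 17 is exposed by z = 2
print(miller_rabin_round(2047, 2))   # composite 2047 = 23 * 89 fools a single round with z = 2
```

The second call illustrates that a single “prime” answer may be wrong, which is why the rounds are repeated with fresh random choices of $z$.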
Example 7.7 [RSA] Suppose Bob chooses the prime numbers and . Thus, , so we have . If Bob now chooses the smallest exponent possible for , namely , then is his public key. Using the extended Euclidean algorithm, Bob determines his private key , and we have ; see Exercise 7.3-2. As in Section 7.1, we identify the alphabet with the set . Messages are strings over and are encoded in blocks of fixed length as natural numbers in -adic representation. In our example, the block length is .
More concretely, any block of length with is represented by the number
Since , we have
The RSA encryption function encrypts the block (i.e., the corresponding number ) as . The ciphertext for block then is with . Thus, RSA maps blocks of length injectively to blocks of length . Figure 7.9 shows how to subdivide a message of length into blocks of length and how to encrypt the single blocks, which are represented by numbers. For example, the first block, “RS”, is turned into a number as follows: “R” corresponds to and “S” to , and we have
The resulting number is written again in -adic representation and can have the length : , where , see also Exercise 7.3-2. So, the first block, , is encrypted to yield the ciphertext “BAV”.
Decryption is also done blockwise. In order to decrypt the first block with the private key , compute , again using fast exponentiation with Square-and-Multiply
. To prevent the numbers from becoming too large, it is recommendable to reduce modulo after each multiplication. The binary expansion of the exponent is , and we obtain
as desired.
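The block encoding used in this example can be replayed in Python. The concrete primes 67 and 11 below are an assumption on our part: they are chosen because they reproduce the ciphertext “BAV” for the first block “RS” quoted above. The helper names are our own.

```python
from math import gcd

p, q = 67, 11                         # assumed primes, see the remark above
n = p * q
phi = (p - 1) * (q - 1)
e = next(x for x in range(2, phi) if gcd(x, phi) == 1)   # smallest admissible exponent
d = pow(e, -1, phi)                   # private key, Python 3.8+

def block_to_number(block: str) -> int:
    """Read a block over A..Z as a number in 26-adic representation (A = 0)."""
    value = 0
    for ch in block:
        value = value * 26 + (ord(ch) - ord("A"))
    return value

def number_to_block(value: int, length: int) -> str:
    """Write a number as a block of the given length over A..Z."""
    letters = []
    for _ in range(length):
        value, digit = divmod(value, 26)
        letters.append(chr(ord("A") + digit))
    return "".join(reversed(letters))

m = block_to_number("RS")
c = pow(m, e, n)                      # encryption of the first block
cipher_block = number_to_block(c, 3)  # ciphertext blocks have length 3
plain_block = number_to_block(pow(c, d, n), 2)
print(n, e, d, m, c, cipher_block, plain_block)
```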
The public-key cryptosystem RSA from Figure 7.8 can be modified so as to produce digital signatures. This protocol is shown in Figure 7.10. It is easy to see that this protocol works as desired; see Exercise 7.3-2. This digital signature protocol is vulnerable to “chosen-plaintext attacks” in which the attacker can choose a plaintext and obtains the corresponding ciphertext. This attack is described, e.g., in [217].
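Figure 7.10 is not reproduced here. Assuming it follows the usual scheme in which the signer applies the private exponent to the message and the verifier applies the public exponent to the signature, the protocol can be sketched as follows. The toy key $(n, e) = (737, 7)$ with private key $283$ and the function names are assumptions made for illustration.

```python
n, e, d = 737, 7, 283          # assumed toy key; 7 * 283 = 1981 = 3 * 660 + 1

def sign(message: int) -> int:
    """Signer: raise the message to the private exponent."""
    return pow(message, d, n)

def verify(message: int, signature: int) -> bool:
    """Verifier: raise the signature to the public exponent and compare."""
    return pow(signature, e, n) == message % n

message = 460
signature = sign(message)
print(verify(message, signature), verify(message + 1, signature))
```

A signature verifies only for the message it was produced for; changing the message makes the comparison fail.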
As mentioned above, the security of the RSA cryptosystem crucially depends on the assumption that large numbers cannot be factored in a reasonable amount of time. Despite much effort in the past, no efficient factoring algorithm has been found until now. Thus, it is widely believed that there is no such algorithm and the factoring problem is hard. A rigorous proof of this hypothesis, however, has not been found either. And even if one could prove this hypothesis, this would not imply a proof of security of RSA. Breaking RSA is at most as hard as the factoring problem; however, the converse is not known to hold. That is, it is not known whether these two problems are equally hard. It may be possible to break RSA without factoring .
We omit listing potential attacks on the RSA system here. Rather, the interested reader is referred to the comprehensive literature on this subject; note also Problem 7-4 at the end of this chapter. We merely mention that for each of the currently known attacks on RSA, there are suitable countermeasures, rules of thumb that either completely prevent a certain attack or make its probability of success negligibly small. In particular, it is important to take great care when choosing the prime numbers $p$ and $q$, the modulus $n$, the public exponent $e$, and the private key $d$.
Finally, since the factoring attacks on RSA play a particularly central role, we briefly sketch two such attacks. The first one is based on Pollard's $(p-1)$ method [207]. This method is effective for composite numbers $n$ having a prime factor $p$ such that the prime factors of $p - 1$ are all small. Under this assumption, a multiple $\nu$ of $p - 1$ can be determined without knowing $p$. By Fermat's Little Theorem (see Corollary 7.5), it follows that $a^{\nu} \equiv 1 \bmod p$ for all integers $a$ coprime with $p$. Hence, $p$ divides $a^{\nu} - 1$. If $n$ does not divide $a^{\nu} - 1$, then $\gcd(a^{\nu} - 1, n)$ is a nontrivial divisor of $n$. Thus, the number $n$ can be factored.
How can the multiple $\nu$ of $p - 1$ be determined? Pollard's $(p-1)$ method uses as candidates for $\nu$ the products of prime powers below a suitably chosen bound $S$:
$$\nu = \prod_{q \text{ is prime},\ q^k \leq S} q^k.$$
If all prime powers dividing $p - 1$ are less than $S$, then $\nu$ is a multiple of $p - 1$. The algorithm determines $\gcd(a^{\nu} - 1, n)$ for a suitably chosen base $a$. If no nontrivial divisor of $n$ is found, the algorithm is restarted with a new bound $S' > S$.
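The method can be sketched as follows. This is a simplified illustration, not a tuned implementation: for each prime below the bound only the largest power not exceeding the bound is used, the base defaults to 2, and the names `primes_up_to` and `pollard_p_minus_1` are our own.

```python
from math import gcd

def primes_up_to(bound: int):
    """Sieve of Eratosthenes: all primes <= bound."""
    sieve = [True] * (bound + 1)
    sieve[0:2] = [False, False]
    for i in range(2, int(bound ** 0.5) + 1):
        if sieve[i]:
            for j in range(i * i, bound + 1, i):
                sieve[j] = False
    return [i for i, is_prime in enumerate(sieve) if is_prime]

def pollard_p_minus_1(n: int, bound: int, base: int = 2):
    """Return a nontrivial divisor of n, or None if this bound and base fail."""
    x = base % n
    for q in primes_up_to(bound):
        power = q
        while power * q <= bound:      # largest power of q not exceeding the bound
            power *= q
        x = pow(x, power, n)           # the exponent is built up factor by factor
    divisor = gcd(x - 1, n)
    return divisor if 1 < divisor < n else None

print(pollard_p_minus_1(299, 5))       # 299 = 13 * 23 and 13 - 1 = 2^2 * 3
print(pollard_p_minus_1(299, 11))      # bound too generous: both factors are caught at once
```

In the second call the exponent is a multiple of both $13 - 1$ and $23 - 1$, so the greatest common divisor is $n$ itself and no nontrivial divisor is found; a different bound or base is then needed.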
Other factoring methods, such as the quadratic sieve, are described, e.g., in [219], [249]. They use the following simple idea. Suppose $n$ is the number to be factored. Using the sieve, determine numbers $a$ and $b$ such that:
$$a^2 \equiv b^2 \bmod n \quad\text{and}\quad a \not\equiv \pm b \bmod n. \tag{7.10}$$
Hence, $n$ divides $a^2 - b^2 = (a - b)(a + b)$ but neither $a - b$ nor $a + b$. Thus, $\gcd(a - b, n)$ is a nontrivial factor of $n$.
There are also sieve methods other than the quadratic sieve. These methods are distinguished by the particular way in which they determine the numbers $a$ and $b$ such that (7.10) is satisfied. A prominent example is the “general number field sieve”, see [160].
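The final step shared by these methods, turning a pair $(a, b)$ satisfying (7.10) into a factor, is easy to demonstrate. The sketch below does not implement any sieve: as a stand-in it finds a suitable pair by Fermat's elementary search for $a^2 - n = b^2$, which is practical only when $n$ has two factors of similar size. The function names are our own.

```python
from math import gcd, isqrt

def factor_from_squares(a: int, b: int, n: int) -> int:
    """Given a^2 = b^2 (mod n) and a != +-b (mod n), return a nontrivial factor of n."""
    if (a * a - b * b) % n != 0:
        raise ValueError("a^2 and b^2 are not congruent modulo n")
    if (a - b) % n == 0 or (a + b) % n == 0:
        raise ValueError("a is congruent to +-b modulo n")
    return gcd(a - b, n)

def find_squares_naively(n: int):
    """Fermat's search: smallest a >= sqrt(n) such that a^2 - n is a perfect square."""
    a = isqrt(n)
    if a * a < n:
        a += 1
    while True:
        b = isqrt(a * a - n)
        if b * b == a * a - n:
            return a, b
        a += 1

n = 8051
a, b = find_squares_naively(n)
factor = factor_from_squares(a, b, n)
print(a, b, factor, n // factor)
```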
Exercises
7.3-1 a. Prove Theorem 7.13.
Hint. Show $(m^e)^d \equiv m \bmod p$ and $(m^e)^d \equiv m \bmod q$ using Corollary 7.5, Fermat's Little Theorem. Since $p$ and $q$ are prime numbers with $p \neq q$ and $n = pq$, the claim $(m^e)^d \equiv m \bmod n$ now follows from the Chinese remainder theorem.
b. The proof sketch of Theorem 7.15 uses the fact that any prime number $n$ can have only two square roots of $1$ modulo $n$, namely $\pm 1 \bmod n$. Prove this fact.
Hint. It may be helpful to note that $r$ is a square root of 1 modulo $n$ if and only if $n$ divides $(r - 1)(r + 1)$.
7.3-2 a. Let and be the values from Example 7.7. Show that the extended Euclidean algorithm indeed provides the private key , the inverse of .
b. Consider the plaintext in Figure 7.9 from Example 7.7 and its RSA encryption. Determine the encoding of this ciphertext by letters of the alphabet for each of the 17 blocks.
c. Decrypt each of the ciphertext blocks in Figure 7.9 and show that the original message is obtained indeed.
d. Prove that the RSA digital signature protocol from Figure 7.10 works.
Rivest, Rabi, and Sherman proposed protocols for secret-key agreement and digital signatures. The secret-key agreement protocol given in Figure 7.11 is due to Rivest and Sherman. Rabi and Sherman modified this protocol to a digital signature protocol, see Exercise 7.4-1.
The Rivest-Sherman protocol is based on a total, strongly noninvertible, associative one-way function. Informally put, a one-way function is a function that is easy to compute but hard to invert. One-way functions are central cryptographic primitives and many cryptographic protocols use them as their key building blocks. To capture the notion of noninvertibility, a variety of models and, depending on the model used, various candidates for one-way functions have been proposed. In most cryptographic applications, noninvertibility is defined in the average-case complexity model. Unfortunately, it is not known whether such one-way functions exist; the security of the corresponding protocols is merely based on the assumption of their existence. Even in the less challenging worst-case model, in which so-called “complexity-theoretic” one-way functions are usually defined, the question of whether any type of one-way function exists remains an open issue after many years of research.
A total (i.e., everywhere defined) function $\sigma$ mapping from $\mathbb{N} \times \mathbb{N}$ to $\mathbb{N}$ is associative if and only if $(x \sigma y) \sigma z = x \sigma (y \sigma z)$ holds for all $x, y, z \in \mathbb{N}$, where we use the infix notation $x \sigma y$ instead of the prefix notation $\sigma(x, y)$. This property implies that the above protocol works:
$$k_A = x \sigma (y \sigma z) = (x \sigma y) \sigma z = k_B,$$
so Alice and Bob indeed compute the same secret key.
The notion of strong noninvertibility is not to be defined formally here. Informally put, $\sigma$ is said to be strongly noninvertible if $\sigma$ is not only a one-way function, but even if in addition to the function value one of the corresponding arguments is given, it is not possible to compute the other argument efficiently. This property implies that the attacker Erich, knowing $y$ and $x \sigma y$ and $y \sigma z$, is not able to compute the secret numbers $x$ and $z$, from which he could easily determine the secret key $k_A = k_B$.
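The message flow of the key agreement can be illustrated with a stand-in operation. The sketch assumes that Alice picks two numbers and publishes one of them together with their combination, and that Bob answers with the combination of the published number and his own secret. Ordinary integer multiplication is used for the operation only because it is total and associative; it is emphatically not a one-way function (division inverts it), so the sketch demonstrates correctness of the key agreement, not security.

```python
def sigma(x: int, y: int) -> int:
    """Toy associative operation. NOT one-way and NOT strongly noninvertible."""
    return x * y

# Alice chooses x and y; x stays secret, y and sigma(x, y) are sent to Bob.
x, y = 1234, 5678
x_sigma_y = sigma(x, y)

# Bob chooses a secret z and sends sigma(y, z) to Alice.
z = 9012
y_sigma_z = sigma(y, z)

key_alice = sigma(x, y_sigma_z)      # x sigma (y sigma z)
key_bob = sigma(x_sigma_y, z)        # (x sigma y) sigma z
print(key_alice == key_bob)

# With this toy operation Erich breaks the scheme at once, which is exactly
# what strong noninvertibility is meant to rule out.
erich_x = x_sigma_y // y
```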
Exercises
7.4-1 Modify the Rivest-Sherman protocol for secret-key agreement from Figure 7.11 to a protocol for digital signatures.
7.4-2 a. Try to give a formal definition of the notion of “strong noninvertibility” that is defined only informally above. Use the worst-case complexity model.
b. Suppose is a partial function from to , i.e., may be undefined for some pairs in . Give a formal definition of “associativity” for partial functions. What is wrong with the following (flawed) attempt of a definition: “A partial function is said to be associative if and only if holds for all for which each of the four pairs , , , and is in the domain of .”
Hint. A comprehensive discussion of these notions can be found in [111], [113], [114].
In Section 7.2, the “man-in-the-middle” attack on the Diffie-Hellman protocol was mentioned. The problem here is that Bob has not verified the true identity of his communication partner before executing the protocol. While he assumes that he is communicating with Alice, he in fact exchanges messages with Erich. In other words, Alice's task is to convince Bob of her true identity without any doubt. This cryptographic task is called authentication. Unlike digital signatures, whose purpose is to authenticate electronically transmitted documents such as emails, electronic contracts, etc., the goal now is to authenticate individuals such as human or computer parties participating in a cryptographic protocol.
Footnote. Here, an “individual” or a “party” is not necessarily a human being; it may also be a computer program that automatically executes a protocol with another computer program.
In order to authenticate herself, Alice might try to prove her identity by means of some secret information known to her alone, say by giving her PIN (“Personal Identification Number”) or any other private information that no one knows but her. However, there is a catch. To prove her identity, she would have to give her secret away to Bob. But then, it no longer is a secret! Bob, knowing her secret, might pretend to be Alice in another protocol he executes with Chris, a third party. So the question is how to prove knowledge of a secret without giving it away. This is what zero-knowledge is all about. Zero-knowledge protocols are special interactive proof systems, which were introduced by Goldwasser, Micali, and Rackoff and, independently, by Babai and Moran. Babai and Moran's notion (which is essentially equivalent to the interactive proof systems proposed by Goldwasser et al.) is known as Arthur-Merlin games, which we will now describe informally.
Merlin and Arthur wish to jointly solve a problem , i.e., they wish to jointly decide whether or not a given input belongs to . The mighty wizard Merlin is represented by an NP machine , and the impatient King Arthur is represented by a randomised polynomial-time Turing machine . To make their decision, they play the following game, where they are taking turns to make alternating moves. Merlin's intention is always to convince Arthur that belongs to (no matter whether or not that indeed is the case). Thus, each of Merlin's moves consists in presenting a proof for “ ”, which he obtains by simulating , where is the input and describes all previous moves in this game. That is, the string encodes all previous nondeterministic choices of and all previous random choices of .
King Arthur, however, does not trust Merlin. Of course, he cannot check the mighty wizard's proofs all alone; this task simply exceeds his computational power. But he knows Merlin well enough to express some doubts. So, he replies with a nifty challenge to Merlin by picking some details of his proofs at random and requiring certificates for them that he can verify. In order to satisfy Arthur, Merlin must convince him with overwhelming probability. Thus, each of Arthur's moves consists in the simulation of , where again is the input and describes the previous history of the game.
The idea of Arthur-Merlin games can be captured via alternating existential and probabilistic quantifiers, where the former formalise Merlin's NP computation and the latter formalise Arthur's randomised polynomial-time computation.
Footnote. This is similar to the well-known characterisation of the levels of the polynomial hierarchy via alternating and quantifiers, see Section 8.4 and in particular item 3 Theorem 8.11.
In this way, a hierarchy of complexity classes can be defined, the so-called Arthur-Merlin hierarchy. We here present only the class MA from this hierarchy, which corresponds to an Arthur-Merlin game with two moves, with Merlin moving first.
Definition 7.16 (MA in the Arthur-Merlin hierarchy) The class MA contains exactly those problems for which there exists an NP machine and a randomised polynomial-time Turing machine such that for each input :
If then there exists a path of such that accepts with probability at least (i.e., Arthur cannot refute Merlin's proof for “ ”, and Merlin thus wins).
If then for each path of , rejects with probability at least (i.e., Arthur is not taken in by Merlin's wrong proofs for “ ” and thus wins).
Analogously, the classes AM, MAM, AMA,... can be defined, see Exercise 7.5-1.
In Definition 7.16, the probability threshold of for Arthur to accept or to reject, respectively, is chosen at will and does not appear to be large enough at first glance. In fact, the probability of success can be amplified using standard techniques and can be made arbitrarily close to one. In other words, one might have chosen even a probability threshold as low as , for an arbitrary fixed constant , and would still have defined the same class. Furthermore, it is known that for a constant number of moves, the Arthur-Merlin hierarchy collapses down to AM:
It is an open question whether or not any of the inclusions is a strict one.
A similar model, which can be used as an alternative to the Arthur-Merlin games, is that of the interactive proof systems mentioned above. The two notions use different terminology: Merlin now is called the “prover” and Arthur the “verifier”. Also, their communication is not interpreted as a game but rather as a protocol. Another difference between the two models, which appears to be crucial at first, is that Arthur's random bits are public (so, in particular, Merlin knows them), whereas the random bits of the verifier in an interactive proof system are private. However, Goldwasser and Sipser [95] proved that, in fact, it does not matter whether the random bits are private or public, so Arthur-Merlin games essentially are equivalent to interactive proof systems.
If one allows a polynomial number of moves (instead of a constant number), then one obtains the complexity class IP. Note that interactive proof systems are also called IP protocols. By definition, IP contains all of NP. In particular, the graph isomorphism problem is in IP. We will see later that IP also contains problems from coNP that are believed not to be in NP. In particular, the proof of Theorem 8.16 shows that the complement of the graph isomorphism problem is in AM and thus in IP. A celebrated result by Shamir [232] says that IP equals PSPACE, the class of problems solvable in polynomial space.
Let us now turn back to the problem of authentication mentioned above, and to the related notion of zero-knowledge protocols. Here is the idea. Suppose Arthur and Merlin play one of their games. So, Merlin sends hard proofs to Arthur. Merlin alone knows how to get such proofs. Being a wise wizard, he keeps this knowledge secret. And he uses his secret to authenticate himself in the communication with Arthur.
Now suppose that malicious wizard Marvin wishes to fool Arthur by pretending to be Merlin. He disguises as Merlin and uses his magic to look just like him. However, he does not know Merlin's secret of how to produce hard proofs. His magic is no more powerful than that of an ordinary randomised polynomial-time Turing machine. Still, he seeks to simulate the communication between Merlin and Arthur. An interactive proof system has the zero-knowledge property if the information communicated between Marvin and Arthur cannot be distinguished from the information communicated between Merlin and Arthur. Note that Marvin, who does not know Merlin's secret, cannot introduce any information about this secret into the simulated IP protocol. Nonetheless, he is able to perfectly copy the original protocol, so no one can tell a difference. Hence, the (original) protocol itself must have the property that it does not leak any information whatsoever about Merlin's secret. If there is nothing to put in, there can be nothing to take out.
Definition 7.17 (Zero-knowledge protocol) Let be any set in IP, and let be an interactive proof system for , where is an NPTM and is a randomised polynomial-time Turing machine. The IP protocol is a zero-knowledge protocol for if and only if there exists a randomised polynomial-time Turing machine such that simulates the original protocol and, for each , the tuples and representing the information communicated in and in , respectively, are identically distributed over the random choices in and in , respectively.
The notion defined above is called “honest-verifier perfect zero-knowledge” in the literature, since (a) it is assumed that the verifier is honest (which may not necessarily be true in cryptographic applications though), and (b) it is required that the information communicated in the simulated protocol perfectly coincides with the information communicated in the original protocol. Assumption (a) may be somewhat too idealistic, and assumption (b) may be somewhat too strict. That is why also other variants of zero-knowledge protocols are studied, see the notes at the end of this chapter.
Let us consider a concrete example now. As mentioned above, the graph isomorphism problem (GI, for short) is in NP, and the complementary problem $\overline{\mathrm{GI}}$ is in AM, see the proof of Theorem 8.16. Thus, both problems are contained in IP. We now describe a zero-knowledge protocol for GI that is due to Goldreich, Micali, and Wigderson [91]. Figure 7.12 shows this IP protocol between the prover Merlin and the verifier Arthur.
Although there is no efficient algorithm known for GI, Merlin can solve this problem, since GI is in NP. However, there is no need for him to do so in the protocol. He can simply generate a large graph $G_0$ with $n$ vertices and a random permutation $\pi \in S_n$. Then, he computes the graph $G_1 = \pi(G_0)$ and makes the pair $(G_0, G_1)$ public. The isomorphism $\pi$ between $G_0$ and $G_1$ is kept secret as Merlin's private information.
Of course, Merlin cannot send $\pi$ to Arthur, since he does not want to give his secret away. Rather, to prove that the two graphs, $G_0$ and $G_1$, indeed are isomorphic, Merlin randomly chooses an isomorphism $\rho$ under the uniform distribution and a bit $a \in \{0, 1\}$ and computes the graph $H = \rho(G_a)$. He then sends $H$ to Arthur whose response is a challenge for Merlin: Arthur sends a random bit $b \in \{0, 1\}$, chosen under the uniform distribution, to Merlin and requests to see an isomorphism $\sigma$ between $G_b$ and $H$. Arthur accepts if and only if $\sigma$ indeed satisfies $\sigma(G_b) = H$.
The protocol works, since Merlin knows his secret isomorphism $\pi$ and his random permutation $\rho$: It is no problem for Merlin to compute the isomorphism $\sigma$ between $G_b$ and $H$ and thus to authenticate himself. The secret $\pi$ is not given away. Since $G_0$ and $G_1$ are isomorphic, Arthur accepts with probability one. The case of two nonisomorphic graphs does not need to be considered here, since Merlin has chosen isomorphic graphs $G_0$ and $G_1$ in the protocol; see also the proof of Theorem 8.16.
Now suppose Marvin wishes to pretend to be Merlin when communicating with Arthur. He does know the graphs $G_0$ and $G_1$, but he doesn't know the secret isomorphism $\pi$. Nonetheless, he tries to convince Arthur that he does know $\pi$. If Arthur's random bit $b$ happens to be the same as his bit $a$, to which Marvin committed before he sees $b$, then Marvin wins. However, if $b \neq a$, then computing $\sigma$, which is the composition of $\rho$ with $\pi$ or with $\pi^{-1}$, requires knowledge of $\pi$. Since GI is not known to be efficiently solvable (not even by a randomised polynomial-time Turing machine), Marvin cannot determine the isomorphism $\pi$ for sufficiently large graphs $G_0$ and $G_1$. But without knowing $\pi$, all he can do is guess. His chances of hitting a bit $b$ with $b = a$ are at most $1/2$. Of course, Marvin can always guess, so his success probability is exactly $1/2$. If Arthur challenges him in $r$ independent rounds of this protocol again and again, the probability of Marvin's success will be only $2^{-r}$. Already for $r = 20$, this probability is negligibly small: Marvin's probability of success is then less than one in one million.
It remains to show that the protocol from Figure 7.12 is a zero-knowledge protocol. Figure 7.13 shows a simulated protocol with Marvin who does not know Merlin's secret $\pi$ but pretends to know it. The information communicated in one round of the protocol has the form of a triple: $(H, b, \sigma)$. If Marvin is lucky enough to choose a random bit $a$ with $a = b$, he can simply send $\sigma = \rho$ and wins: Arthur (or any third party watching the communication) will not notice the fraud. On the other hand, if $a \neq b$ then Marvin's attempt to cheat will be uncovered. However, that is no problem for the malicious wizard: He simply deletes this round from the protocol and repeats. Thus, he can produce a sequence of triples of the form $(H, b, \sigma)$ that is indistinguishable from the corresponding sequence of triples in the original protocol between Merlin and Arthur. It follows that Goldreich, Micali, and Wigderson's protocol for GI is a zero-knowledge protocol.
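One round of the protocol between an honest prover and the verifier can be simulated as follows. This is a sketch under our own modelling assumptions: graphs are sets of two-element vertex sets, permutations are Python lists, the example graph on five vertices is arbitrary, and all function names are illustrative.

```python
import random

def apply_perm(perm, edges):
    """Relabel every vertex v of an undirected graph as perm[v]."""
    return frozenset(frozenset((perm[u], perm[v])) for u, v in edges)

def compose(outer, inner):
    """Return the permutation that maps v to outer[inner[v]]."""
    return [outer[inner[v]] for v in range(len(inner))]

def inverse(perm):
    inv = [0] * len(perm)
    for v, w in enumerate(perm):
        inv[w] = v
    return inv

def random_perm(n, rng):
    perm = list(range(n))
    rng.shuffle(perm)
    return perm

def honest_round(graphs, pi, n, rng):
    """One round between Merlin, who knows pi, and Arthur; True if Arthur accepts."""
    a = rng.randrange(2)                 # Merlin's random bit
    rho = random_perm(n, rng)            # Merlin's random permutation
    h = apply_perm(rho, graphs[a])       # Merlin sends H = rho(G_a)
    b = rng.randrange(2)                 # Arthur's challenge bit
    if a == b:
        sigma = rho
    elif b == 0:                         # a = 1: H = rho(pi(G_0))
        sigma = compose(rho, pi)
    else:                                # a = 0: H = rho(pi^{-1}(G_1))
        sigma = compose(rho, inverse(pi))
    return apply_perm(sigma, graphs[b]) == h   # Arthur checks sigma(G_b) = H

rng = random.Random(2011)
n = 5
g0 = frozenset(frozenset(edge) for edge in
               [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (0, 2)])
pi = random_perm(n, rng)                 # Merlin's secret isomorphism
g1 = apply_perm(pi, g0)
results = [honest_round((g0, g1), pi, n, rng) for _ in range(20)]
print(all(results))
```

A prover without knowledge of the secret permutation can answer correctly only in the branch where the two bits coincide, which is the source of the success probability of one half per round.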
Exercises
7.5-1 Arthur-Merlin hierarchy:
a. Analogously to MA from Definition 7.16, define the other classes AM, MAM, AMA, ... of the Arthur-Merlin hierarchy.
b. What is the inclusion structure between the classes MA, coMA, AM, coAM, and the classes of the polynomial hierarchy defined in Definition 8.10 of Subsection 8.4.1?
7.5-2 Zero-knowledge protocol for graph isomorphism:
a. Consider the graphs and from Example 7.4 in Section 7.1.3. Execute the zero-knowledge protocol from Figure 7.12 with the graphs and and the isomorphism . Use an isomorphism of your choice, and try all possibilities for the random bits and . Repeat this Arthur-Merlin game with an unknown isomorphism chosen by somebody else.
b. Modify the protocol from Figure 7.12 such that, with only one round of the protocol, Marvin's success probability is less than .
PROBLEMS
Let and . We say is congruent to modulo (, for short) if and only if divides the difference . For example, and . The congruence modulo defines an equivalence relation on , i.e., it is reflexive (), symmetric (if then ), and transitive ( and imply ).
The set is the remainder class of . For example, the remainder class of is
We represent the remainder class of by the smallest natural number in . For instance, represents the remainder class of . The set of all remainder classes is . On , addition is defined by , and multiplication is defined by . For example, in the arithmetics modulo , we have and . Prove that in the arithmetics modulo :
a. is a commutative ring with one;
b. , which is defined in Example 7.3, is a multiplicative group;
c. is a field for each prime number .
d. Prove that the neutral element of a group and the inverse of each group element are unique.
e. Let be a commutative ring with one. Prove that the set of all invertible elements in forms a multiplicative group. Determine this group for the ring .
The graph isomorphism problem can be solved efficiently on special graph classes, such as on the class of trees. An (undirected) tree is a connected graph without cycles. (A cycle is a sequence of pairwise incident edges that returns to the point of origin.) The leaves of a tree are the vertices with degree one. The tree isomorphism problem is defined by
Design an efficient algorithm for this problem.
Hint. Label the vertices of the given pair of trees successively by suitable number sequences. Compare the resulting sequences of labels in the single loops of the algorithm. Starting from the leaves of the trees and then working step by step towards the center of the trees, the algorithm halts as soon as all vertices are labelled. This algorithm can be found, e.g., in [150].
Design an efficient algorithm in pseudocode for computing the determinant of a matrix. Implement your algorithm in a programming language of your choice. Can the inverse of a matrix be computed efficiently?
a. Consider the RSA cryptosystem from Figure 7.8. For the sake of efficiency, the public exponent $e = 3$ has been popular. However, this choice is dangerous. Suppose Alice, Bob, and Chris encrypt the same message $m$ with the same public exponent $e = 3$, but perhaps with distinct moduli, $n_A$, $n_B$, and $n_C$. Erich intercepts the resulting three ciphertexts: $c_i = m^3 \bmod n_i$ for $i \in \{A, B, C\}$. Then, Erich can easily decrypt the message $m$. How?
Hint. Erich knows the Chinese remainder theorem, which also was useful in Exercise 7.3-1. A recommended value for the public exponent is $e = 2^{16} + 1$, since its binary expansion has only two ones. Thus, the square-and-multiply algorithm runs fast for this $e$.
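Why an exponent with only two ones in its binary expansion is cheap can be illustrated by counting operations. The following is a minimal sketch of left-to-right square-and-multiply, assuming the exponent meant in the hint is $2^{16} + 1 = 65537$; the base and modulus in the usage lines are arbitrary illustrative values, not RSA parameters from the text.

```python
def square_and_multiply(base: int, exponent: int, modulus: int):
    """Left-to-right binary exponentiation.

    Returns (base**exponent mod modulus, number of squarings,
    number of multiplications by the base).
    """
    result = 1
    squarings = 0
    multiplications = 0
    for bit in bin(exponent)[2:]:          # binary digits, most significant first
        result = (result * result) % modulus
        squarings += 1
        if bit == "1":                     # one extra multiplication per 1-bit
            result = (result * base) % modulus
            multiplications += 1
    return result, squarings, multiplications

# Illustrative values only: 1000003 is a hypothetical modulus.
value, squarings, multiplications = square_and_multiply(5, 2**16 + 1, 1000003)
print(value, squarings, multiplications)
```

The number of squarings equals the bit length of the exponent, while the number of additional multiplications equals the number of ones in its binary expansion, which is why sparse exponents are attractive.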
b. The attack described above can be extended to ciphertexts that are related with each other as follows. Let and be known, , and suppose that messages are sent and are intercepted by Erich. Further, suppose that and . How can attacker Erich now determine the original message ?
Hint. Apply so-called lattice reduction techniques (see, e.g., Micciancio and Goldwasser [179]). The attack mentioned here is due to Håstad [109] and has been strengthened later by Coppersmith [53].
c. How can the attacks described above be prevented?
CHAPTER NOTES
Singh's book [241] gives a nice introduction to the history of cryptology, from its ancient roots to modern cryptosystems. For example, you can find out there about recent discoveries related to the development of RSA and Diffie-Hellman in the nonpublic sector. Ellis, Cocks, and Williamson from the Communications Electronics Security Group (CESG) of the British Government Communications Headquarters (GCHQ) proposed the RSA system from Figure 7.8 and the Diffie-Hellman protocol from Figure 7.7 even earlier than Rivest, Shamir, and Adleman and at about the same time as but independent of Diffie and Hellman, respectively. RSA and Diffie-Hellman are described in probably every book about cryptography written since their invention. A more comprehensive list of attacks against RSA than that of Section 7.3 can be found in, e.g., [29], [136], [187], [217], [219], [233].
Primality tests such as the Miller-Rabin test and factoring algorithms are also described in many books, e.g., in [92], [219], [223], [249].
The notion of strongly noninvertible associative one-way functions, on which the secret-key agreement protocol from Figure 7.11 is based, is due to Rivest and Sherman. The modification of this protocol to a digital signature scheme is due to Rabi and Sherman. In their paper [208], they also proved that commutative, associative one-way functions exist if and only if $\mathrm{P} \neq \mathrm{NP}$. However, the one-way functions they construct are neither total nor strongly noninvertible, even if $\mathrm{P} \neq \mathrm{NP}$ is assumed. Hemaspaandra and Rothe [114] proved that total, strongly noninvertible, commutative, associative one-way functions exist if and only if $\mathrm{P} \neq \mathrm{NP}$. Further investigations on this topic can be found in [28], [111], [113], [116].
The notion of interactive proof systems and zero-knowledge protocols is due to Goldwasser, Micali, and Rackoff [94]. One of the best and most comprehensive sources on this field is Chapter 4 in Goldreich's book [92]; see also the books [152], [198], [219] and the surveys [93], [90], [217]. Arthur-Merlin games were introduced by Babai and Moran [19], [18] and have been investigated in many subsequent papers. Variants of the notion of zero-knowledge, which differ from the notion in Definition 7.17 in their technical details, are extensively discussed in, e.g., [92] and also in, e.g., [90], [93], [217].
In Chapter 7, efficient algorithms were introduced that are important for cryptographic protocols. Designing efficient algorithms of course is a central task in all areas of computer science. Unfortunately, many important problems have resisted all attempts in the past to devise efficient algorithms solving them. Well-known examples of such problems are the satisfiability problem for boolean formulas and the graph isomorphism problem.
One of the most important tasks in complexity theory is to classify such problems according to their computational complexity. Complexity theory and algorithmics are two sides of the same coin; they complement each other in the following sense. While in algorithmics one seeks to find the best upper bound for some problem, an important goal of complexity theory is to obtain the best possible lower bound for the same problem. If the upper and the lower bound coincide, the problem has been classified.
The proof that some problem cannot be solved efficiently often appears to be “negative” and not desirable. After all, we wish to solve our problems and we wish to solve them fast. However, there is also some “positive” aspect of proving lower bounds that, in particular, is relevant in cryptography (see Chapter 7). Here, we are interested in the applications of inefficiency: A proof that certain problems (such as the factoring problem or the discrete logarithm) cannot be solved efficiently can support the security of some cryptosystems that are important in practical applications.
In Section 8.1, we provide the foundations of complexity theory. In particular, the central complexity classes P and NP are defined. The question of whether or not these two classes are equal is still open after more than three decades of intense research. Up to now, neither a proof of the inequality $\mathrm{P} \neq \mathrm{NP}$ (which is widely believed) could be achieved, nor were we able to prove the equality of P and NP. This question led to the development of the beautiful and useful theory of NP-completeness.
One of the best understood NP-complete problems is $\mathrm{SAT}$, the satisfiability problem of propositional logic: Given a Boolean formula $\varphi$, does there exist a satisfying assignment for $\varphi$, i.e., does there exist an assignment of truth values to $\varphi$'s variables that makes $\varphi$ true? Due to its NP-completeness, it is very unlikely that there exist efficient deterministic algorithms for $\mathrm{SAT}$. In Section 8.3, we present a deterministic and a randomised algorithm for $\mathrm{SAT}$ that both run in exponential time. Even though these algorithms are asymptotically inefficient (which is to say that they are useless in practice for large inputs), they are useful for sufficiently small inputs of sizes still relevant in practice. That is, they outperform the naive deterministic exponential-time algorithm for $\mathrm{SAT}$ in that they considerably increase the input size for which the algorithm's running time still is tolerable.
In Section 8.4, we come back to the graph isomorphism problem, which was introduced in Section 7.1.3 (see Definition 7.8) and which was useful in Section 7.5.2 with regard to the zero-knowledge protocols. This problem is one of the few natural problems in NP that (under the plausible assumption that $\mathrm{P} \neq \mathrm{NP}$) may be neither efficiently solvable nor NP-complete. In this regard, this problem is special among the problems in NP. Evidence for the hypothesis that the graph isomorphism problem may be neither in P nor NP-complete comes from the theory of lowness, which is introduced in Section 8.4. In particular, we present Schöning's result that $\mathrm{GI}$ is contained in the low hierarchy within NP. This result provides strong evidence that $\mathrm{GI}$ is not NP-complete. We also show that $\mathrm{GI}$ is contained in the complexity class SPP and thus is low for certain probabilistic complexity classes. Informally put, a set is low for a complexity class if it does not provide any useful information when used as an “oracle” in computations. For proving the lowness of $\mathrm{GI}$, certain group-theoretic algorithms are useful.
As mentioned above, complexity theory is concerned with proving lower bounds. The difficulty in such proofs is that it is not enough to analyse the runtime of just one concrete algorithm for the problem considered. Rather, one needs to show that every algorithm solving the problem has a runtime no better than the lower bound to be proven. This includes also algorithms that have not been found as yet. Hence, it is necessary to give a formal and mathematically precise definition of the notion of algorithm.
Since the 1930s, a large variety of formal algorithm models has been proposed. All these models are equivalent in the sense that each such model can be transformed (via an algorithmic procedure) into any other such model. Loosely speaking, one might consider this transformation as some sort of compilation between distinct programming languages. The equivalence of all these algorithm models justifies Church's thesis, which says that each such model captures precisely the somewhat vague notion of “intuitive computability”. The algorithm model that is most common in complexity theory is the Turing machine, which was introduced in 1936 by Alan Turing (1912–1954) in his pathbreaking work [259]. The Turing machine is a very simple abstract model of a computer. In what follows, we describe this model by defining its syntax and its semantics, and introduce at the same time two basic computation paradigms: determinism and nondeterminism. It makes sense to first define the more general model of nondeterministic Turing machines. Deterministic Turing machines then are easily seen to be a special case.
First, we give some technical details and describe how Turing machines work. A Turing machine has $k$ infinite work tapes that are subdivided into cells. Every cell may contain one letter of the work alphabet. If a cell does not contain a letter, we indicate this by a special blank symbol, denoted by $\Box$. The computation is done on the work tapes. Initially, a designated input tape contains the input string, and all other cells contain the blank. If a computation halts, its result is the string contained in the designated output tape.
Footnote. One can require, for example, that the input tape is a read-only and the output tape is a write-only tape. Similarly, one can specify a variety of further variations of the technical details, but we do not pursue this any further here.
To each tape belongs a head that accesses exactly one cell of this tape. In each step of the computation, the head can change the symbol currently read and then moves to the left or to the right or does not move at all. At the same time, the current state of the machine, which is stored in its “finite control” can change. Figure 8.1 displays a Turing machine with one input and two work tapes.
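Definition 8.1 (Syntax of Turing machines) A nondeterministic Turing machine with $k$ tapes (a $k$-tape NTM, for short) is a $7$-tuple $M = (\Sigma, \Gamma, Z, \delta, z_0, \Box, F)$, where $\Sigma$ is the input alphabet, $\Gamma$ is the work alphabet, $Z$ is a finite set of states disjoint with $\Gamma$, $\delta : Z \times \Gamma^k \to \mathfrak{P}(Z \times \Gamma^k \times \{L, R, N\}^k)$ is the transition function, $z_0 \in Z$ is the initial state, $\Box \in \Gamma - \Sigma$ is the blank symbol, and $F \subseteq Z$ is the set of final states. Here, $\mathfrak{P}(S)$ denotes the power set of set $S$, i.e., the set of all subsets of $S$.
For readability, we write $(z, a) \to (z', b, x)$ instead of $(z', b, x) \in \delta(z, a)$ with $z, z' \in Z$, $x \in \{L, R, N\}$, and $a, b \in \Gamma$. This transition has the following meaning. If the current state is $z$ and the head currently reads a symbol $a$, then:
$a$ is replaced by $b$,
$z'$ is the new state, and
the head moves according to $x \in \{L, R, N\}$, i.e., the head either moves one cell to the left (if $x = L$), or one cell to the right (if $x = R$), or it does not move at all (if $x = N$).
The special case of a deterministic Turing machine with $k$ tapes ($k$-tape DTM, for short) is obtained by requiring that the transition function $\delta$ maps from $Z \times \Gamma^k$ to $Z \times \Gamma^k \times \{L, R, N\}^k$.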
For $k = 1$, we obtain the one-tape Turing machine, abbreviated simply by NTM and DTM, respectively. Every $k$-tape NTM or $k$-tape DTM can be simulated by a Turing machine with only one tape, where the runtime is at most squared. Turing machines can be considered both as acceptors (which accept languages) and as transducers (which compute functions).
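Definition 8.2 (Semantics of Turing machines) Let $M = (\Sigma, \Gamma, Z, \delta, z_0, \Box, F)$ be an NTM. A configuration of $M$ is a string $k \in \Gamma^* Z \Gamma^*$, where $k = \alpha z \beta$ is interpreted as follows: $\alpha\beta$ is the current tape inscription, the head reads the first symbol of $\beta$, and $z$ is the current state of $M$. On the set $K_M = \Gamma^* Z \Gamma^*$ of all configurations of $M$, define a binary relation $\vdash_M$, which describes the transition from a configuration $k \in K_M$ into a configuration $k' \in K_M$ according to $\delta$. For any two strings $\alpha = a_1 a_2 \cdots a_m$ and $\beta = b_1 b_2 \cdots b_n$ in $\Gamma^*$, where $m \geq 0$ and $n \geq 1$, and for all $z \in Z$, define
$$\alpha z \beta \vdash_M \begin{cases} a_1 a_2 \cdots a_m z' c\, b_2 \cdots b_n & \text{if } (z, b_1) \to (z', c, N),\ m \geq 0,\ n \geq 1, \\ a_1 a_2 \cdots a_m c\, z' b_2 \cdots b_n & \text{if } (z, b_1) \to (z', c, R),\ m \geq 0,\ n \geq 2, \\ a_1 a_2 \cdots a_{m-1} z' a_m c\, b_2 \cdots b_n & \text{if } (z, b_1) \to (z', c, L),\ m \geq 1,\ n \geq 1. \end{cases}$$
Two special cases need to be considered separately:
1. If $n = 1$ and $(z, b_1) \to (z', c, R)$ (i.e., $M$'s head moves to the right and reads a $\Box$ symbol), then $a_1 a_2 \cdots a_m z b_1 \vdash_M a_1 a_2 \cdots a_m c\, z' \Box$.
2. If $m = 0$ and $(z, b_1) \to (z', c, L)$ (i.e., $M$'s head moves to the left and reads a $\Box$ symbol), then $z b_1 b_2 \cdots b_n \vdash_M z' \Box c\, b_2 \cdots b_n$.
The initial configuration of $M$ on input $x$ is always $z_0 x$. The final configurations of $M$ on input $x$ have the form $\alpha z \beta$ with $z \in F$ and $\alpha, \beta \in \Gamma^*$. Let $\vdash_M^*$ be the reflexive, transitive closure of $\vdash_M$: For $k, k' \in K_M$, we have $k \vdash_M^* k'$ if and only if there is a finite sequence $k_0, k_1, \ldots, k_t$ of configurations in $K_M$ such that
$$k = k_0 \vdash_M k_1 \vdash_M \cdots \vdash_M k_t = k',$$
where possibly $k = k_0 = k_t = k'$. If $k_0 = z_0 x$ is the initial configuration of $M$ on input $x$, then this sequence of configurations is a finite computation of $M(x)$, and we say $M$ halts on input $x$. The language accepted by $M$ is defined by
$$L(M) = \{x \in \Sigma^* \mid z_0 x \vdash_M^* \alpha z \beta \text{ with } z \in F \text{ and } \alpha, \beta \in \Gamma^*\}.$$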
For NTMs, any configuration may be followed by more than one configuration. Thus, they have a computation tree, whose root is labelled by the initial configuration and whose leaves are labelled by the final configurations. Note that trees are special graphs (recall Definition 7.8 in Section 7.1.3 and Problem 7-2), so they have vertices and edges. The vertices of a computation tree are the configurations of $M$ on input $x$. For any two configurations $k$ and $k'$ from the set $K_M$ of configurations, there is exactly one directed edge from $k$ to $k'$ if and only if $k \vdash_M k'$. A path in the computation tree of $M(x)$ is a sequence of configurations $k_0 \vdash_M k_1 \vdash_M \cdots \vdash_M k_t \vdash_M \cdots$. The computation tree of an NTM can have infinite paths on which a halting configuration is never reached. For DTMs, each non-halting configuration has a unique successor configuration. Thus, the computation tree of a DTM degenerates to a linear chain.
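The linear chain of configurations of a deterministic machine can be made concrete with a small simulator. The following Python sketch is not the machine of any figure in this chapter: the transition table `EVEN_AS`, the state names, and the use of `_` for the blank are all hypothetical choices made for illustration. Configurations are printed in the form $\alpha z \beta$, with the state written in brackets in front of the symbol currently read.

```python
BLANK = "_"  # stands in for the blank symbol

def run_dtm(delta, start, finals, word, max_steps=10_000):
    """Simulate a one-tape DTM and return (accepted, list of configurations).

    delta maps (state, symbol) to (new_state, written_symbol, move),
    where move is "L", "R" or "N".
    """
    tape = list(word) or [BLANK]
    head, state = 0, start
    configs = []
    for _ in range(max_steps):
        configs.append("".join(tape[:head]) + f"[{state}]" + "".join(tape[head:]))
        if state in finals:
            return True, configs
        key = (state, tape[head])
        if key not in delta:                 # no successor configuration
            return False, configs
        state, tape[head], move = delta[key]
        if move == "R":
            head += 1
            if head == len(tape):            # moving right onto a fresh blank
                tape.append(BLANK)
        elif move == "L":
            if head == 0:                    # moving left onto a fresh blank
                tape.insert(0, BLANK)
            else:
                head -= 1
    raise RuntimeError("step limit reached")

# Hypothetical example machine: accepts words over {a} of even length.
EVEN_AS = {
    ("even", "a"): ("odd", "a", "R"),
    ("odd", "a"): ("even", "a", "R"),
    ("even", BLANK): ("acc", BLANK, "N"),
}

accepted, chain = run_dtm(EVEN_AS, "even", {"acc"}, "aa")
print(accepted, chain)
```

Because the table maps each pair of state and symbol to at most one triple, every configuration has at most one successor, so the returned list is exactly the chain described above. A nondeterministic variant would map each pair to a set of triples and explore a tree instead.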
Example 8.1 Consider the language . A Turing machine accepting is given by
where the transitions are stated in Figure 8.2. Figure 8.3 provides the meaning of the individual states and the intention corresponding to each state. See also Exercise 8.1-2.
In order to classify problems according to their computational complexity, we need to define complexity classes. Each such class is defined by a given resource function and contains all problems that can be solved by a Turing machine that requires no more of a resource (e.g., computation time or memory space) than is specified by the resource function. We consider only the resource time here, i.e., the number of steps, as a function of the input size, needed to solve (or to accept) the problem. Further, we consider only the traditional worst-case complexity model. That is, among all inputs of size $n$, those that require the maximum resource are decisive; one thus assumes the worst case to occur. We now define deterministic and nondeterministic time complexity classes.
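Definition 8.3 (Deterministic and nondeterministic time complexity)
Let $M$ be a DTM with $L(M) \subseteq \Sigma^*$ and let $x \in \Sigma^*$ be an input. Define the time function of $M(x)$, which maps from $\Sigma^*$ to $\mathbb{N}$, as follows:
$$\mathrm{Time}_M(x) = \begin{cases} m & \text{if } M(x) \text{ has exactly } m + 1 \text{ configurations}, \\ \text{undefined} & \text{otherwise}. \end{cases}$$
Define the function $\mathrm{time}_M : \mathbb{N} \to \mathbb{N}$ by:
$$\mathrm{time}_M(n) = \begin{cases} \max_{x : |x| = n} \mathrm{Time}_M(x) & \text{if } \mathrm{Time}_M(x) \text{ is defined for all } x \text{ with } |x| = n, \\ \text{undefined} & \text{otherwise}. \end{cases}$$
Let $M$ be an NTM with $L(M) \subseteq \Sigma^*$ and let $x \in \Sigma^*$ be an input. Define the time function of $M(x)$, which maps from $\Sigma^*$ to $\mathbb{N}$, as follows, where $\mathrm{Time}_M(x, \alpha)$ denotes the number of steps of $M(x)$ on the computation path $\alpha$:
$$\mathrm{NTime}_M(x) = \begin{cases} \min\{\mathrm{Time}_M(x, \alpha) \mid M(x) \text{ accepts on path } \alpha\} & \text{if } x \in L(M), \\ \text{undefined} & \text{otherwise}. \end{cases}$$
Define the function $\mathrm{ntime}_M : \mathbb{N} \to \mathbb{N}$ by
$$\mathrm{ntime}_M(n) = \begin{cases} \max_{x : |x| = n} \mathrm{NTime}_M(x) & \text{if } \mathrm{NTime}_M(x) \text{ is defined for all } x \text{ with } |x| = n, \\ \text{undefined} & \text{otherwise}. \end{cases}$$
Let $t$ be a computable function that maps from $\mathbb{N}$ to $\mathbb{N}$. Define the deterministic and nondeterministic complexity classes with time function $t$ by
$$\mathrm{DTIME}(t) = \{A \mid A = L(M) \text{ for some DTM } M \text{ and } \mathrm{time}_M(n) \leq t(n) \text{ for all } n \in \mathbb{N}\},$$
$$\mathrm{NTIME}(t) = \{A \mid A = L(M) \text{ for some NTM } M \text{ and } \mathrm{ntime}_M(n) \leq t(n) \text{ for all } n \in \mathbb{N}\}.$$
Let $\mathrm{IPol}$ be the set of all polynomials. Define the complexity classes P and NP as follows:
$$\mathrm{P} = \bigcup_{t \in \mathrm{IPol}} \mathrm{DTIME}(t) \quad \text{and} \quad \mathrm{NP} = \bigcup_{t \in \mathrm{IPol}} \mathrm{NTIME}(t).$$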
Why are the classes P and NP so important? Obviously, exponential-time algorithms cannot be considered efficient in general. Garey and Johnson compare the rates of growth of some particular polynomial and exponential time functions for certain input sizes relevant in practice, see Figure 8.4. They assume that a computer executes one million operations per second. Then all algorithms bounded by a polynomial run in a “reasonable” time for inputs of size up to $n = 60$, whereas for example an algorithm with time bound $3^n$ takes more than $6$ years already for the modest input size of $n = 30$. For $n = 40$ it takes almost $4000$ centuries, and for $n = 50$ a truly astronomic amount of time.
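These orders of magnitude are easy to recompute. The sketch below assumes one operation per microsecond (one million operations per second, as stated) and a year of 365 days; it further assumes that the exponential bound in question is $3^n$ and uses $n^5$ as a representative polynomial. The helper name `years_needed` is illustrative.

```python
MICROSECONDS_PER_YEAR = 365 * 24 * 3600 * 10**6  # one operation per microsecond

def years_needed(operations: int) -> float:
    """Running time in years at one million operations per second."""
    return operations / MICROSECONDS_PER_YEAR

# Exponential bound 3^n for modest input sizes.
for n in (30, 40, 50):
    print(f"3^{n}: {years_needed(3**n):.3g} years")

# A polynomial bound for comparison: n^5 at n = 60, in seconds.
print(f"60^5: {60**5 / 10**6:.1f} seconds")
```

Under these assumptions, $3^{30}$ operations take about six and a half years and $3^{40}$ operations several thousand centuries, while $60^5$ operations finish in a matter of minutes.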
The last decades have seen an impressive development of computer and hardware technology. Figure 8.5 (taken from [81]) shows that this is not enough to provide an essentially better runtime behaviour for exponential-time algorithms, even assuming that the previous trend in hardware development continues. What would happen if one had a computer that is $100$ times or even $1000$ times as fast as current computers are? For functions $t_i$, $1 \leq i \leq 6$, let $N_i$ be the maximum size of inputs that can be solved by a $t_i(n)$ time-bounded algorithm within one hour. Figure 8.5 (also taken from [81]) shows that a computer $1000$ times faster than today's computers increases $N_5$ for $t_5(n) = 2^n$ by only an additive value close to $10$. In contrast, using computers with the same increase in speed, an $n^5$ time-bounded algorithm can handle problem instances about four times as large as before.
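The contrast between an additive and a multiplicative gain follows from solving $t(n') = s \cdot t(n)$ for $n'$, where $s$ is the speed-up factor. The following sketch assumes the time bounds $n^5$ and $2^n$ and a speed-up of $1000$; the function names are illustrative.

```python
import math

def new_size_polynomial(old_size: float, degree: int, speedup: float) -> float:
    """Largest solvable size for an n^degree bound after the speed-up."""
    # (n')^d = s * n^d  implies  n' = n * s^(1/d): a multiplicative gain
    return old_size * speedup ** (1 / degree)

def new_size_exponential(old_size: float, base: float, speedup: float) -> float:
    """Largest solvable size for a base^n bound after the speed-up."""
    # base^(n') = s * base^n  implies  n' = n + log_base(s): an additive gain
    return old_size + math.log(speedup, base)

print(new_size_polynomial(1.0, 5, 1000))    # factor for n^5
print(new_size_exponential(0.0, 2, 1000))   # additive term for 2^n
```

For $n^5$ the factor is $1000^{1/5}$, close to four, whereas for $2^n$ the gain is the additive term $\log_2 1000$, close to ten, independent of the original instance size.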
Intuitively, the complexity class P contains the efficiently solvable problems. The complexity class NP contains many problems important in practice but currently not efficiently solvable. Examples are the satisfiability problem and the graph isomorphism problem that will be dealt with in more detail later in this chapter. The question of whether or not the classes P and NP are equal is still open. This famous P-versus-NP question gave rise to the theory of NP-completeness, which is briefly introduced in Section 8.2.
Exercises
8.1-1 Can Church's thesis ever be proven formally?
8.1-2 Consider the Turing machine in Example 8.1.
a. What are the sequences of configurations of for inputs and , respectively?
b. Prove that is correct, i.e., show that .
c. Estimate the running time of .
d. Show that the graph isomorphism problem and the graph automorphism problem introduced in Definition 7.8 are both in NP.
The theory of NP-completeness provides methods to prove lower bounds for problems in NP. An NP problem is said to be complete in NP if it belongs to the hardest problems in this class, i.e., if it is at least as hard as any NP problem. The complexity of two given problems can be compared by polynomial-time reductions. Among the different types of reduction one can consider, we focus on the polynomial-time many-one reducibility in this section. In Section 8.4, more general reducibilities will be introduced, such as the polynomial-time Turing reducibility and the polynomial-time (strong) nondeterministic Turing reducibility.
Definition 8.4 (Reducibility, NP-Completeness) A set $A$ is reducible to a set $B$ (in symbols, $A \leq^p_m B$) if and only if there exists a polynomial-time computable function $r$ such that for all $x \in \Sigma^*$, $x \in A \iff r(x) \in B$. A set $B$ is said to be $\leq^p_m$-hard for NP if and only if $A \leq^p_m B$ for each set $A \in \mathrm{NP}$. A set $B$ is said to be $\leq^p_m$-complete in NP (NP-complete, for short) if and only if $B$ is $\leq^p_m$-hard for NP and $B \in \mathrm{NP}$.
Reductions are efficient algorithms that can be used to show that problems are not efficiently solvable. That is, if one can efficiently transform a hard problem into another problem via a reduction, the hardness of the former problem is inherited by the latter problem. At first glance, it might seem that infinitely many efficient algorithms are required to prove some problem $B$ NP-hard, namely one reduction from each of the infinitely many NP problems to $B$. However, an elementary result says that it is enough to find just one such reduction, from some NP-complete problem $V$. Since the $\leq^p_m$-reducibility is transitive (see Exercise 8.2-2), the NP-hardness of $V$ implies the NP-hardness of $B$ via the reduction $A \leq^p_m V \leq^p_m B$ for each NP problem $A$.
In 1971, Stephen Cook found a first such NP-complete problem: the satisfiability problem of propositional logic, $\mathrm{SAT}$ for short. For many NP-completeness results, it is useful to start from the special problem $3\text{-}\mathrm{SAT}$, the restriction of the satisfiability problem in which each given Boolean formula is in conjunctive normal form and each clause contains exactly three literals. $3\text{-}\mathrm{SAT}$ is also NP-complete.
Definition 8.5 (Satisfiability problem) The Boolean constants false and true are represented by $0$ and $1$. Let $x_1, x_2, \ldots, x_m$ be Boolean variables, i.e., $x_i \in \{0, 1\}$ for each $i$. Variables and their negations are called literals. A Boolean formula $\varphi$ is satisfiable if and only if there is an assignment to the variables in $\varphi$ that makes the formula true. A Boolean formula $\varphi$ is in conjunctive normal form (CNF, for short) if and only if $\varphi$ is of the form $\varphi(x_1, x_2, \ldots, x_m) = \bigwedge_{i=1}^{n} \left( \bigvee_{j=1}^{k_i} \ell_{i,j} \right)$, where the $\ell_{i,j}$ are literals over $\{x_1, x_2, \ldots, x_m\}$. The disjunctions $\bigvee_{j=1}^{k_i} \ell_{i,j}$ of literals are called the clauses of $\varphi$. A Boolean formula $\varphi$ is in $k$-CNF if and only if $\varphi$ is in CNF and each clause of $\varphi$ contains exactly $k$ literals. Define the following two problems:
$$\mathrm{SAT} = \{\varphi \mid \varphi \text{ is a satisfiable Boolean formula in CNF}\},$$
$$3\text{-}\mathrm{SAT} = \{\varphi \mid \varphi \text{ is a satisfiable Boolean formula in 3-CNF}\}.$$
Example 8.2 (Boolean formulas) Consider the following two satisfiable Boolean formulas (see also Exercise 8.2-1):
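Here, the first formula is in 3-CNF, so it is in $3\text{-}\mathrm{SAT}$. However, the second formula is not in 3-CNF, since its first clause contains four literals. Thus, the second formula is in $\mathrm{SAT}$ but not in $3\text{-}\mathrm{SAT}$.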
Theorem 8.6 states the above-mentioned result of Cook.
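Theorem 8.6 (Cook) The problems $\mathrm{SAT}$ and $3\text{-}\mathrm{SAT}$ are NP-complete.
The proof that $\mathrm{SAT}$ is NP-complete is omitted. The idea is to encode the computation of an arbitrary NP machine $M$ running on input $x$ into a Boolean formula $\varphi_{M,x}$ such that $\varphi_{M,x}$ is satisfiable if and only if $M$ accepts $x$.
$\mathrm{SAT}$ is a good starting point for many other NP-completeness results. In fact, in many cases it is very useful to start with its restriction $3\text{-}\mathrm{SAT}$. To give an idea of how such proofs work, we now show that $\mathrm{SAT} \leq^p_m 3\text{-}\mathrm{SAT}$, which implies that $3\text{-}\mathrm{SAT}$ is NP-complete. To this end, we need to find a reduction $r$ that transforms any given Boolean formula $\varphi$ in CNF into another Boolean formula $\hat{\varphi}$ in $3$-CNF (i.e., with exactly three literals per clause) such that
$$\varphi \text{ is satisfiable} \iff \hat{\varphi} \text{ is satisfiable}. \qquad (8.1)$$
Let $\varphi(x_1, x_2, \ldots, x_n)$ be the given formula with clauses $C_1, C_2, \ldots, C_m$. Construct the formula $\hat{\varphi}$ from $\varphi$ as follows. The variables of $\hat{\varphi}$ are
$\varphi$'s variables $x_1, x_2, \ldots, x_n$ and
for each clause $C_j$ of $\varphi$, a number $p_j$ of additional variables $y^j_1, y^j_2, \ldots, y^j_{p_j}$ as needed, where $p_j$ depends on the structure of $C_j$ according to the case distinction below.
Now, define $\hat{\varphi} = \hat{C}_1 \wedge \hat{C}_2 \wedge \cdots \wedge \hat{C}_m$, where each subformula $\hat{C}_j$ of $\hat{\varphi}$ is constructed from the clause $C_j$ of $\varphi$ as follows. Suppose that $C_j = (z_1 \vee z_2 \vee \cdots \vee z_k)$, where each $z_i$ is a literal over $\{x_1, x_2, \ldots, x_n\}$. Distinguish the following four cases.
Case 1: $k = 1$. Use two additional variables and define $\hat{C}_j = (z_1 \vee y^j_1 \vee y^j_2) \wedge (z_1 \vee \neg y^j_1 \vee y^j_2) \wedge (z_1 \vee y^j_1 \vee \neg y^j_2) \wedge (z_1 \vee \neg y^j_1 \vee \neg y^j_2)$.
Case 2: $k = 2$. Use one additional variable and define $\hat{C}_j = (z_1 \vee z_2 \vee y^j_1) \wedge (z_1 \vee z_2 \vee \neg y^j_1)$.
Case 3: $k = 3$. No additional variable is needed; define $\hat{C}_j = C_j$.
Case 4: $k \geq 4$. Use $k - 3$ additional variables and define $\hat{C}_j = (z_1 \vee z_2 \vee y^j_1) \wedge (\neg y^j_1 \vee z_3 \vee y^j_2) \wedge \cdots \wedge (\neg y^j_{k-4} \vee z_{k-2} \vee y^j_{k-3}) \wedge (\neg y^j_{k-3} \vee z_{k-1} \vee z_k)$.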
It remains to show that (a) the reduction is polynomial-time computable, and (b) the equivalence (8.1) is true. Both claims are easy to see; the details are left to the reader as Exercise 8.2-4.
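A reduction of this kind can be sketched in Python and checked by brute force on small formulas. The sketch assumes the standard clause-splitting construction: clauses with one or two literals are padded with fresh variables in both polarities, clauses with three literals are kept, and longer clauses are chained through fresh link variables. Literals are encoded as nonzero integers ($i$ for $x_i$, $-i$ for its negation); this encoding and the names `to_3cnf` and `satisfiable` are assumptions of the sketch, not notation from the text.

```python
from itertools import product

def to_3cnf(clauses, num_vars):
    """Transform a CNF formula with non-empty clauses into 3-CNF.

    Returns (new_clauses, new_num_vars); fresh variables get the
    indices num_vars + 1, num_vars + 2, ...
    """
    out = []
    next_var = num_vars

    def fresh():
        nonlocal next_var
        next_var += 1
        return next_var

    for clause in clauses:
        k = len(clause)
        if k == 1:                      # pad with two fresh variables
            z, y1, y2 = clause[0], fresh(), fresh()
            out += [[z, y1, y2], [z, -y1, y2], [z, y1, -y2], [z, -y1, -y2]]
        elif k == 2:                    # pad with one fresh variable
            y = fresh()
            out += [list(clause) + [y], list(clause) + [-y]]
        elif k == 3:                    # already has three literals
            out.append(list(clause))
        else:                           # chain through k - 3 fresh variables
            ys = [fresh() for _ in range(k - 3)]
            out.append([clause[0], clause[1], ys[0]])
            for i in range(1, k - 3):
                out.append([-ys[i - 1], clause[i + 1], ys[i]])
            out.append([-ys[-1], clause[-2], clause[-1]])
    return out, next_var

def satisfiable(clauses, num_vars):
    """Brute-force satisfiability test; only usable for tiny formulas."""
    for bits in product([False, True], repeat=num_vars):
        if all(any(bits[abs(l) - 1] == (l > 0) for l in c) for c in clauses):
            return True
    return False

phi = [[1], [-1, 2], [1, -2, 3, -4, 5]]
phi_hat, n_hat = to_3cnf(phi, 5)
print(len(phi_hat), n_hat, satisfiable(phi, 5), satisfiable(phi_hat, n_hat))
```

Each clause with $k$ literals produces at most $\max(4, k - 2)$ new clauses, so the output size is linear in the input size. The brute-force check only illustrates the equivalence (8.1) on examples and is of course no substitute for the proof.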
Thousands of problems have been proven NP-complete by now. A comprehensive collection can be found in the work of Garey and Johnson [81].
Exercises
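8.2-1 Find a satisfying assignment for each of the two Boolean formulas from Example 8.2.
8.2-2 Show that the $\leq^p_m$-reducibility is transitive: $(A \leq^p_m B \wedge B \leq^p_m C) \Longrightarrow A \leq^p_m C$.
8.2-4 Consider the reduction $\mathrm{SAT} \leq^p_m 3\text{-}\mathrm{SAT}$. Prove the following: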
a. the reduction is polynomial-time computable, and
b. the equivalence (8.1) holds.
By Theorem 8.6, $\mathrm{SAT}$ and $3\text{-}\mathrm{SAT}$ are both NP-complete. Thus, if $\mathrm{SAT}$ were in P, it would immediately follow that $\mathrm{P} = \mathrm{NP}$, which is considered unlikely. Thus, it is very unlikely that there is a polynomial-time deterministic algorithm for $\mathrm{SAT}$ or $3\text{-}\mathrm{SAT}$. But what is the runtime of the best deterministic algorithms for them? And what about randomised algorithms? Let us focus on the problem $3\text{-}\mathrm{SAT}$ in this section.
The “naive” deterministic algorithm for $3\text{-}\mathrm{SAT}$ works as follows: Given a Boolean formula $\varphi$ with $n$ variables, sequentially check the $2^n$ possible assignments to the variables of $\varphi$. Accept as soon as a satisfying assignment is found, otherwise reject. Obviously, this algorithm runs in time $\mathcal{O}(2^n)$, up to a polynomial factor for evaluating the formula. Can this upper bound be improved?
Yes, it can. We will present a slightly better deterministic algorithm for $3\text{-}\mathrm{SAT}$ that still runs in exponential time, namely in time $\tilde{\mathcal{O}}(1.9129^n)$, where the $\tilde{\mathcal{O}}$ notation neglects polynomial factors as is common for exponential-time algorithms.
Footnote. The result presented here is not the best result known, but see Figure 8.7 for further improvements.
The point of such an improvement is that an $\tilde{\mathcal{O}}(c^n)$ algorithm, where $c \in (1, 2)$ is a constant, can deal with larger instances than the naive $\tilde{\mathcal{O}}(2^n)$ algorithm in the same amount of time before the exponential growth rate eventually hits and the running time becomes infeasible. For example, if $c = \sqrt{2} \approx 1.414$ then $\tilde{\mathcal{O}}\left(c^{2n}\right) = \tilde{\mathcal{O}}(2^n)$. Thus, this algorithm can deal with inputs twice as large as the naive algorithm in the same amount of time. Doubling the size of inputs that can be handled by some algorithm can be quite important in practice.
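The gain in feasible input size can be quantified for any base: solving $c^{n'} = 2^n$ gives $n' = n / \log_2 c$. The short sketch below assumes that polynomial factors are ignored and compares a bound $c^n$ with the naive bound $2^n$; the function name `size_gain` is illustrative.

```python
import math

def size_gain(c: float) -> float:
    """Factor by which the feasible input size grows when a 2^n bound
    is replaced by a c^n bound (polynomial factors ignored)."""
    return 1 / math.log2(c)

for c in (math.sqrt(2), 1.9129, 4 / 3):
    print(f"c = {c:.4f}: inputs about {size_gain(c):.2f} times as large")
```

For $c = \sqrt{2}$ the factor is exactly two, whereas a base close to two, such as $1.9129$, yields only a gain of a few percent.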
The deterministic algorithm for $3\text{-}\mathrm{SAT}$ is based on a simple “backtracking” idea. Backtracking is useful for problems whose solutions consist of $n$ components each having more than one choice possibility. For example, a solution of $3\text{-}\mathrm{SAT}$ is composed of the $n$ truth values of a satisfying assignment, and each such truth value can be either true (represented by $1$) or false (represented by $0$).
The idea is the following. Starting from the initially empty solution (i.e., the empty truth assignment), we seek to construct by recursive calls to our backtracking algorithm, step by step, a larger and larger partial solution until eventually a complete solution is found, if one exists. In the resulting recursion tree, the root is marked by the empty solution, and the leaves are marked with complete solutions of the problem.
Footnote. The inner vertices of the recursion tree represent the recursive calls of the algorithm, its root is the first call, and the algorithm terminates at the leaves without any further recursive call.
If the current branch in the recursion tree is “dead” (which means that the subtree underneath it cannot lead to a correct solution), one can prune this subtree and “backtrack” to pursue another extension of the partial solution constructed so far. This pruning may save time in the end.
Backtracking-SAT($\varphi, \beta$)
1  IF ($\beta$ assigns truth values to all variables of $\varphi$)
2    THEN RETURN $\varphi(\beta)$
3    ELSE IF ($\beta$ makes one of the clauses of $\varphi$ false)
4      THEN RETURN 0
5        ▷ “dead branch”
6      ELSE IF Backtracking-SAT($\varphi, \beta 0$)
7        THEN RETURN 1
8        ELSE RETURN Backtracking-SAT($\varphi, \beta 1$)
The inputs of algorithm Backtracking-SAT are a Boolean formula $\varphi$ and a partial assignment $\beta$ to some of the variables of $\varphi$. This algorithm returns a Boolean value: 1, if the partial assignment $\beta$ can be extended to a satisfying assignment to all variables of $\varphi$, and 0 otherwise. Partial assignments are here considered to be strings of length at most $n$ over the alphabet $\{0, 1\}$.
The first call of the algorithm is Backtracking-SAT($\varphi, \lambda$), where $\lambda$ denotes the empty assignment. If it turns out that the partial assignment constructed so far makes one of the clauses of $\varphi$ false, it cannot be extended to a satisfying assignment of $\varphi$. Thus, the subtree underneath the corresponding vertex in the recursion tree can be pruned; see also Exercise 8.3-1.
To estimate the runtime of Backtracking-SAT, note that this algorithm can be specified so as to select the variables in an “intelligent” order that minimises the number of steps needed to evaluate the variables in any clause. Consider an arbitrary, fixed clause $C_j$ of the given formula $\varphi$. Each satisfying assignment $\beta$ of $\varphi$ assigns truth values to the three variables in $C_j$. There are $2^3 = 8$ possibilities to assign a truth value to these variables, and one of them can be excluded certainly: the assignment that makes $C_j$ false. The corresponding vertex in the recursion tree of Backtracking-SAT($\varphi, \beta$) thus leads to a “dead” branch, so we prune the subtree underneath it.
Depending on the structure of $\varphi$, there may exist further “dead” branches whose subtrees can also be pruned. However, since we are trying to find an upper bound in the worst case, we do not consider these additional “dead” subtrees. It follows that
$$\tilde{\mathcal{O}}\left(\left(2^3 - 1\right)^{n/3}\right) = \tilde{\mathcal{O}}\left(\sqrt[3]{7}^{\,n}\right) \approx \tilde{\mathcal{O}}(1.9129^n)$$
is an upper bound for Backtracking-SAT in the worst case. This bound slightly improves upon the trivial $\tilde{\mathcal{O}}(2^n)$ upper bound of the “naive” algorithm for $3\text{-}\mathrm{SAT}$.
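The backtracking idea can be sketched in a few lines of Python. This is a simplified illustration, not a tuned solver: it extends the partial assignment in the fixed variable order $x_1, x_2, \ldots$ rather than in the “intelligent” order assumed in the runtime analysis, and it tests for a falsified clause before testing for completeness, which yields the same result. Literals are encoded as nonzero integers ($i$ for $x_i$, $-i$ for its negation); this encoding and the function name are assumptions of the sketch.

```python
def backtracking_sat(clauses, num_vars, beta=()):
    """Return 1 if the partial assignment beta can be extended to a
    satisfying assignment of the CNF formula, and 0 otherwise.

    beta is a tuple of bits assigning values to x_1, ..., x_len(beta).
    """
    def value(lit):
        idx = abs(lit) - 1
        if idx >= len(beta):
            return None                       # variable still unassigned
        return beta[idx] == (1 if lit > 0 else 0)

    # A clause is falsified if all of its literals are assigned and false.
    if any(all(value(l) is False for l in clause) for clause in clauses):
        return 0                              # dead branch: prune
    if len(beta) == num_vars:
        return 1                              # complete and no clause falsified
    if backtracking_sat(clauses, num_vars, beta + (0,)):
        return 1
    return backtracking_sat(clauses, num_vars, beta + (1,))

# (x1 or x2 or x3) and (not x1 or not x2 or not x3)
print(backtracking_sat([[1, 2, 3], [-1, -2, -3]], 3))
```

The first call uses the empty tuple as the empty assignment. Pruning happens as soon as some clause has all three of its variables assigned and evaluates to false, which is exactly the situation counted in the $7^{n/3}$ estimate.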
As mentioned above, the deterministic time complexity of $3\text{-}\mathrm{SAT}$ can be improved even further. For example, Monien and Speckenmeyer [186] proposed a divide-and-conquer algorithm with runtime $\tilde{\mathcal{O}}(1.618^n)$. Dantsin et al. [59] designed a deterministic “local search with restart” algorithm whose runtime is $\tilde{\mathcal{O}}(1.481^n)$, which was further improved by Brueggemann and Kern [34] in 2004 to a $\tilde{\mathcal{O}}(1.473^n)$ bound.
There are also randomised algorithms that have an even better runtime. One will be presented now, a “random-walk” algorithm that is due to Schöning [228].
A random walk can be done on a specific structure, such as in the Euclidean space, on an infinite grid or on a given graph. Here we are interested in random walks occurring on a graph that represents a certain stochastic automaton. To describe such automata, we first introduce the notion of a finite automaton.
A finite automaton can be represented by its transition graph, whose vertices are the states of the finite automaton, and the transitions between states are directed edges marked by the symbols of the alphabet $\Sigma$. One designated vertex is the initial state in which the computation of the automaton starts. In each step of the computation, the automaton reads one input symbol (proceeding from left to right) and moves to the next state along the edge marked by the symbol read. Some vertices represent final states. If such a vertex is reached after the entire input has been read, the automaton accepts the input, and otherwise it rejects. In this way, a finite automaton accepts a set of input strings, which is called its language.
A stochastic automaton $S$ is a finite automaton whose edges are additionally marked by probabilities. If the edge from $u$ to $v$ in the transition graph of $S$ is marked by $p_{u,v}$, where $0 \leq p_{u,v} \leq 1$, then $S$ moves from state $u$ to state $v$ with probability $p_{u,v}$. The process of random transitions of a stochastic automaton is called a Markov chain in probability theory. Of course, the acceptance of strings by a stochastic automaton depends on the transition probabilities.
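Random-SAT($\varphi$)
1  FOR $i \leftarrow 1$ TO $\lceil (4/3)^n \rceil$    ▷ $n$ is the number of variables in $\varphi$
2    DO randomly choose an assignment $\beta \in \{0, 1\}^n$ under the uniform distribution
3      FOR $j \leftarrow 1$ TO $n$
4        IF ($\varphi(\beta) = 1$)
5          THEN RETURN the satisfying assignment $\beta$ to $\varphi$
6          ELSE choose a clause $C = (x \vee y \vee z)$ with $C(\beta) = 0$
7            randomly choose a literal $\ell \in \{x, y, z\}$ under the uniform distribution
8            determine the bit $\beta_\ell \in \{0, 1\}$ in $\beta$ assigning $\ell$
9            swap $\beta_\ell$ to $1 - \beta_\ell$ in $\beta$
10  RETURN “$\varphi$ is not satisfiable”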
Here, we are not interested in recognising languages by stochastic automata, but rather we will use them to describe a random walk by the randomised algorithm Random-SAT. Given a Boolean formula $\varphi$ with $n$ variables, Random-SAT tries to find a satisfying assignment to $\varphi$'s variables, if one exists.
On input $\varphi$, Random-SAT starts by guessing a random initial assignment $\beta$, where each bit of $\beta$ takes on the values $0$ and $1$ with probability $1/2$ each. Suppose $\varphi$ is satisfiable. Let $\tilde{\beta}$ be an arbitrary fixed satisfying assignment of $\varphi$. Let $X$ be a random variable that expresses the Hamming distance between $\beta$ and $\tilde{\beta}$, i.e., $X$ gives the number of bits in which $\beta$ and $\tilde{\beta}$ differ. Clearly, $X$ can take on the values $j \in \{0, 1, \ldots, n\}$ and is distributed according to the binomial distribution with parameters $n$ and $1/2$. That is, the probability for $X = j$ is exactly $\binom{n}{j} 2^{-n}$.
Random-SAT now checks whether the initial assignment $\beta$ already satisfies $\varphi$, and if so, it accepts. Otherwise, if $\beta$ does not satisfy $\varphi$, there must exist a clause in $\varphi$ not satisfied by $\beta$. Random-SAT now picks any such clause, randomly chooses under the uniform distribution some literal in this clause, and “flips” the corresponding bit in the current assignment $\beta$. This procedure is repeated $n$ times. If the current assignment $\beta$ still does not satisfy $\varphi$, Random-SAT restarts with a new initial assignment, and repeats this entire procedure $t$ times, where $t = \lceil (4/3)^n \rceil$.
Figure 8.6 shows a stochastic automaton $S$, whose edges are not marked by symbols but only by transition probabilities. The computation of Random-SAT on input $\varphi$ can be viewed as a random walk on $S$ as follows. Starting from the initial state $s$, which will never be reached again later, Random-SAT first moves to one of the states $j \in \{0, 1, \ldots, n\}$ according to the binomial distribution with parameters $n$ and $1/2$. This is shown in the upper part of Figure 8.6 for a formula with $n = 6$ variables. Reaching such a state $j$ means that the randomly chosen initial assignment $\beta$ and the fixed satisfying assignment $\tilde{\beta}$ have Hamming distance $j$. As long as $j \neq 0$, Random-SAT changes one bit $\beta_\ell$ to $1 - \beta_\ell$ in the current assignment $\beta$, searching for a satisfying assignment in each iteration of the inner loop. In the random walk on $S$, this corresponds to moving one step to the left to state $j - 1$ or moving one step to the right to state $j + 1$, where only states less than or equal to $n$ can be reached.
The fixed assignment $\tilde{\beta}$ satisfies $\varphi$, so it sets at least one literal in each clause of $\varphi$ to true. If we fix exactly one of the literals satisfied by $\tilde{\beta}$ in each clause, say $\ell$, then Random-SAT makes a step to the left if and only if $\ell$ was chosen by Random-SAT. Hence, the probability of moving from state $j > 0$ to state $j - 1$ is $1/3$, and the probability of moving from state $j > 0$ to state $j + 1$ is $2/3$.
If the state $0$ is reached eventually after at most $n$ iterations of this process, $\beta$ and $\tilde{\beta}$ have Hamming distance $0$, so $\beta$ satisfies $\varphi$ and Random-SAT returns $\beta$ and halts accepting. Of course, one might also hit a satisfying assignment (distinct from $\tilde{\beta}$) in some state $j \neq 0$. But since this would only increase the acceptance probability, we neglect this possibility here.
If this process is repeated $n$ times unsuccessfully, then the initial assignment was chosen so badly that Random-SAT discards it and restarts the above process from scratch with a new initial assignment. The entire procedure is repeated at most $t$ times, where $t = \lceil (4/3)^n \rceil$. If it is still unsuccessful after $t$ trials, Random-SAT rejects its input.
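The procedure described above can be sketched in Python as follows. This is only an illustrative sketch, not the authors' implementation: the clause encoding (lists of nonzero integers, where `k` stands for the variable $x_k$ and `-k` for its negation), the function names, and the restart count $\lceil (4/3)^n \rceil$ with $n$ flips per restart are assumptions made for this example.

```python
import math
import random


def satisfies(clauses, beta):
    """Return True iff the assignment beta makes every clause true.

    clauses: list of clauses, each a list of nonzero ints
             (k means x_k, -k means NOT x_k) -- an assumed encoding.
    beta:    dict mapping variable index (1..n) to bool.
    """
    return all(any(beta[abs(lit)] == (lit > 0) for lit in clause) for clause in clauses)


def random_sat(clauses, n, rng=None):
    """Sketch of Random-SAT: t restarts, each followed by at most n bit flips."""
    rng = rng or random.Random()
    t = math.ceil((4 / 3) ** n)  # assumed number of restarts (outer loop)
    for _ in range(t):
        # Uniformly random initial assignment.
        beta = {v: rng.random() < 0.5 for v in range(1, n + 1)}
        for _ in range(n):  # inner loop: limited local search
            if satisfies(clauses, beta):
                return beta
            # Pick some clause that beta falsifies (one must exist here).
            falsified = next(
                c for c in clauses if not any(beta[abs(l)] == (l > 0) for l in c)
            )
            lit = rng.choice(falsified)          # uniform choice of a literal
            beta[abs(lit)] = not beta[abs(lit)]  # flip the corresponding bit
    # Reported as "not satisfiable"; this answer may be wrong for satisfiable inputs.
    return None
```

A returned assignment is always satisfying, whereas the answer `None` is only correct with a certain probability when the formula is in fact satisfiable.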
Since the probability of moving away from state $0$ to the right is larger than the probability of moving toward $0$ to the left, one might be tempted to think that the success probability of Random-SAT is very small. However, one should not underestimate the chance of reaching a state close to $0$ already with the initial step from $s$. The closer to $0$ the random walk starts, the higher is the probability of reaching $0$ by random steps to the left or to the right.
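To make this intuition concrete, the following sketch computes exactly, by dynamic programming over rational numbers, the probability that the walk reaches the leftmost state within a given number of steps. The state numbering ($0$ to $n$, target state $0$), the step probabilities $1/3$ (left) and $2/3$ (right), and in particular the behaviour at the right boundary are assumptions of this sketch.

```python
from fractions import Fraction
from functools import lru_cache


def reach_zero_probability(start, steps, n):
    """Exact probability that the walk started in state `start` hits state 0
    within `steps` moves (left w.p. 1/3, right w.p. 2/3, states 0..n)."""
    left, right = Fraction(1, 3), Fraction(2, 3)

    @lru_cache(maxsize=None)
    def f(j, s):
        if j == 0:
            return Fraction(1)
        if s == 0:
            return Fraction(0)
        if j == n:  # assumption: at the right boundary the walk can only move left
            return f(j - 1, s - 1)
        return left * f(j - 1, s - 1) + right * f(j + 1, s - 1)

    return f(start, steps)
```

For instance, comparing `reach_zero_probability(1, 6, 6)` with `reach_zero_probability(3, 6, 6)` shows how quickly the chances drop as the starting state moves away from the target.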
We only give a rough sketch of estimating the success probability (assuming that $\varphi$ is satisfiable) and the runtime of Random-SAT. For convenience, suppose that 3 divides $n$. Let $p_i$ be the probability for the event that Random-SAT reaches the state $0$ within $n$ steps after the initial step, under the condition that it reaches some state $i \le n/3$ with the initial step from $s$. For example, if the state $n/3$ is reached with the initial step and if no more than $n/3$ steps are done to the right, then one can still reach the final state $0$ by a total of at most $n$ steps. If one does more than $n/3$ steps to the right starting from state $n/3$, then the final state cannot be reached within $n$ steps. In general, starting from state $i$ after the initial step, no more than $(n-i)/2$ steps to the right may be done. As noted above, a step to the right is done with probability $2/3$, and a step to the left is done with probability $1/3$. It follows that
$$p_i = \binom{n}{\frac{n-i}{2}} \left(\frac{2}{3}\right)^{\frac{n-i}{2}} \left(\frac{1}{3}\right)^{n - \frac{n-i}{2}}. \qquad (8.2)$$
Now, let $q_i$ be the probability for the event that Random-SAT reaches some state $i \le n/3$ with the initial step. Clearly, we have
$$q_i = \binom{n}{i} \cdot 2^{-n}. \qquad (8.3)$$
Finally, let $p$ be the probability for the event that Random-SAT reaches the final state $0$ within the inner loop. Of course, this event can occur also when starting from a state $j > n/3$. Thus,
$$p \ge \sum_{i=0}^{n/3} p_i \cdot q_i.$$
Approximating this sum by the entropy function and estimating the binomial coefficients from (8.2) and (8.3) in the single terms by Stirling's formula, we obtain the lower bound $\Omega\left((3/4)^n\right)$ for $p$.
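The sum in this estimate can be evaluated exactly for small $n$. The sketch below assumes the terms $p_i = \binom{n}{(n-i)/2} (2/3)^{(n-i)/2} (1/3)^{n-(n-i)/2}$ and $q_i = \binom{n}{i} 2^{-n}$ for $0 \le i \le n/3$; these formulas, the function name, and the treatment of odd $n - i$ are assumptions of this example rather than part of the original analysis.

```python
from fractions import Fraction
from math import comb


def success_lower_bound(n):
    """Sum of p_i * q_i over 0 <= i <= n/3.

    Assumption: only indices i with (n - i) even are summed, so that
    (n - i) / 2 is an integer; the remaining terms are simply dropped,
    which can only make the computed bound smaller.
    """
    total = Fraction(0)
    for i in range(0, n // 3 + 1):
        if (n - i) % 2:
            continue
        r = (n - i) // 2                       # number of steps to the right
        p_i = comb(n, r) * Fraction(2, 3) ** r * Fraction(1, 3) ** (n - r)
        q_i = Fraction(comb(n, i), 2 ** n)     # binomial start distribution
        total += p_i * q_i
    return total
```

Printing `float(success_lower_bound(n)) / (3 / 4) ** n` for a few multiples of 3 gives a feeling for the polynomial factor hidden in the asymptotic bound.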
To reduce the error probability, Random-SAT performs a total of $t$ independent trials, each starting with a new initial assignment $\beta$. For each trial, the probability of success (i.e., the probability of finding a satisfying assignment of $\varphi$, if one exists) is at least $(3/4)^n$, so the error is bounded by $1 - (3/4)^n$. Since the trials are independent, these error probabilities multiply, which gives an error of $\left(1 - (3/4)^n\right)^t \le e^{-1}$. Thus, the total probability of success of Random-SAT is at least $1 - 1/e \approx 0.632$ if $\varphi$ is satisfiable. On the other hand, Random-SAT does not make any error at all if $\varphi$ is unsatisfiable; in this case, the output is: “$\varphi$ is not satisfiable”.
The particular choice of this value of $t$ can be explained as follows. The runtime of a randomised algorithm that performs independent trials, such as Random-SAT, is roughly reciprocal to the success probability of one trial, $p \approx (3/4)^n$. In particular, the error probability (i.e., the probability that in none of the $t$ trials a satisfying assignment of $\varphi$ is found even though $\varphi$ is satisfiable) can be estimated by $(1 - p)^t \le e^{-t \cdot p}$. If a fixed error of $\varepsilon$ is not to be exceeded, it is enough to choose $t$ such that $e^{-t \cdot p} \le \varepsilon$; equivalently, such that $t \ge \ln(1/\varepsilon)/p$. Up to constant factors, this can be accomplished by choosing $t = \lceil (4/3)^n \rceil$. Hence, the runtime of the algorithm is in $\tilde{\mathcal{O}}\left((4/3)^n\right)$.
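The relation between the number of trials and the error bound can be checked numerically. The following sketch uses the elementary estimate $(1-p)^t \le e^{-tp}$; the function names and the sample values ($n = 10$, a per-trial success probability of $(3/4)^n$) are chosen here purely for illustration.

```python
import math


def trials_needed(p, eps):
    """Smallest t with exp(-t * p) <= eps, i.e. t >= ln(1/eps) / p."""
    return math.ceil(math.log(1 / eps) / p)


def error_bound(p, t):
    """Probability that t independent trials with success probability p all fail."""
    return (1 - p) ** t


if __name__ == "__main__":
    n = 10
    p = (3 / 4) ** n                  # assumed success probability of one trial
    t = math.ceil((4 / 3) ** n)       # number of restarts used by the algorithm
    print(t, error_bound(p, t), trials_needed(p, 0.01))
```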
Exercises
8.3-1 Start the algorithm Backtracking-SAT
for the Boolean formula and construct step by step a satisfying assignment of . What does the resulting recursion tree look like?
In this section, we need some of the group-theoretic and graph-theoretic foundations presented in Section 7.1.3. In particular, recall the notion of permutation group from Definition 7.6 and the graph isomorphism problem (GI, for short) and the graph automorphism problem (GA, for short) from Definition 7.8; see also Example 7.4 in Chapter 7. We start by providing some more background from complexity theory.
In Section 8.2, we have seen that the problems SAT and 3-SAT are NP-complete. Clearly, $\mathrm{P} = \mathrm{NP}$ if and only if every NP problem (including the NP-complete problems) is in P, which in turn is true if and only if some NP-complete problem is in P. So, no NP-complete problem can be in P if $\mathrm{P} \neq \mathrm{NP}$. An interesting question is whether, under the plausible assumption that $\mathrm{P} \neq \mathrm{NP}$, every NP problem is either in P or NP-complete. Or, assuming $\mathrm{P} \neq \mathrm{NP}$, can there exist NP problems that are neither in P nor NP-complete? A result by Ladner [157] answers this question.
Theorem 8.7 (Ladner) If $\mathrm{P} \neq \mathrm{NP}$ then there exist sets in NP that are neither in P nor NP-complete.
The problems constructed in the proof of Theorem 8.7 are not overly natural problems. However, there are also good candidates for “natural” problems that are neither in P nor NP-complete. One such candidate is the graph isomorphism problem. To provide some evidence for this claim, we now define two hierarchies of complexity classes, the low hierarchy and the high hierarchy, both introduced by Schöning [226]. First, we need to define the polynomial hierarchy, which builds upon NP. And to this end, we need a more flexible reducibility than the (polynomial-time) many-one reducibility $\le_m^p$ from Definition 8.4, namely the Turing reducibility $\le_T^p$. We will also define the (polynomial-time) nondeterministic Turing reducibility, $\le_T^{\mathrm{NP}}$, and the (polynomial-time) strong nondeterministic Turing reducibility, $\le_{sT}^{\mathrm{NP}}$. These two reducibilities are important for the polynomial hierarchy and for the high hierarchy, respectively. Turing reducibilities are based on the notion of oracle Turing machines, which we now define.
Definition 8.8 (Oracle Turing machine) An oracle set is a set of strings. An oracle Turing machine $M$, with oracle $B$, is a Turing machine that has a special worktape, which is called the oracle tape or query tape. In addition to other states, $M$ contains a special query state, $s_?$, and the two answer states $s_{\mathrm{yes}}$ and $s_{\mathrm{no}}$. During a computation on some input, if $M$ is not in the query state $s_?$, it works just like a regular Turing machine. However, when $M$ enters the query state $s_?$, it interrupts its computation and queries its oracle about the string $q$ currently written on the oracle tape. Imagine the oracle $B$ as some kind of “black box”: $B$ answers the query of whether or not it contains $q$ within one step of $M$'s computation, regardless of how difficult it is to decide the set $B$. If $q \in B$, then $M$ changes its current state into the new state $s_{\mathrm{yes}}$ and continues its computation. Otherwise, if $q \notin B$, $M$ continues its computation in the new state $s_{\mathrm{no}}$. We say that the computation of $M$ on input $x$ is performed relative to the oracle $B$, and we write $M^B(x)$.
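The language accepted by $M^B$ is denoted $L(M^B)$. We say a language $L$ is represented by an oracle Turing machine $M$ if and only if $L = L(M^{\emptyset})$. We say a class $\mathcal{C}$ of languages is relativizable if and only if it can be represented by oracle Turing machines relative to the empty oracle. For any relativizable class $\mathcal{C}$ and for any oracle $B$, define the class $\mathcal{C}$ relative to $B$ by
$$\mathcal{C}^B = \{ L(M^B) \mid M \text{ is an oracle Turing machine representing some set in } \mathcal{C} \}.$$
For any class $\mathcal{B}$ of oracle sets, define $\mathcal{C}^{\mathcal{B}} = \bigcup_{B \in \mathcal{B}} \mathcal{C}^B$.
Let NPOTM (respectively, DPOTM) be a shorthand for nondeterministic (respectively, deterministic) polynomial-time oracle Turing machine. For example, the following classes can be defined:
$$\mathrm{NP}^{\mathrm{NP}} = \bigcup_{B \in \mathrm{NP}} \{ L(M^B) \mid M \text{ is an NPOTM} \}, \qquad \mathrm{P}^{\mathrm{NP}} = \bigcup_{B \in \mathrm{NP}} \{ L(M^B) \mid M \text{ is a DPOTM} \}.$$
For the empty oracle set $\emptyset$, we obtain the unrelativized classes $\mathrm{NP} = \mathrm{NP}^{\emptyset}$ and $\mathrm{P} = \mathrm{P}^{\emptyset}$, and we then write NPTM instead of NPOTM and DPTM instead of DPOTM.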
In particular, oracle Turing machines can be used for prefix search. Let us consider an example.
Example 8.3 (Prefix search by an oracle Turing machine)
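Suppose we wish to find the smallest solution of the graph isomorphism problem, which is in NP; see Definition 7.8 in Subsection 7.1.3. Let $G$ and $H$ be two given graphs with $n \ge 1$ vertices each. An isomorphism between $G$ and $H$ is called a solution of “$(G,H) \in \mathrm{GI}$”. The set of isomorphisms between $G$ and $H$, $\mathrm{ISO}(G,H)$, contains all solutions of “$(G,H) \in \mathrm{GI}$”.
Our goal is to find the lexicographically smallest solution if $(G,H) \in \mathrm{GI}$; otherwise, we output the empty string $\lambda$ to indicate that $(G,H) \notin \mathrm{GI}$. That is, we wish to compute the function $f$ defined by $f(G,H) = \min\{\pi \mid \pi \in \mathrm{ISO}(G,H)\}$ if $(G,H) \in \mathrm{GI}$, and $f(G,H) = \lambda$ if $(G,H) \notin \mathrm{GI}$, where the minimum is to be taken according to the lexicographic order on $S_n$. More precisely, we view a permutation $\pi \in S_n$ as the string $\pi(1)\pi(2)\cdots\pi(n)$ of length $n$ over the alphabet $[n] = \{1, 2, \ldots, n\}$, and we write $\pi < \sigma$ for $\pi, \sigma \in S_n$ if and only if there is a $j \in [n]$ such that $\pi(i) = \sigma(i)$ for all $i < j$ and $\pi(j) < \sigma(j)$.
From a permutation $\sigma \in S_n$, one obtains a partial permutation by cancelling some pairs $(i, \sigma(i))$. A partial permutation can be represented by a string over the alphabet $[n] \cup \{*\}$, where $*$ indicates an undefined position. Let $k \le n$. A prefix of length $k$ of $\sigma \in S_n$ is a partial permutation of $\sigma$ containing each pair $(i, \sigma(i))$ with $i \le k$, but none of the pairs $(i, \sigma(i))$ with $i > k$. In particular, for $k = 0$, the empty string $\lambda$ is the (unique) length $0$ prefix of $\sigma$. For $k = n$, the total permutation $\sigma$ is the (unique) length $n$ prefix of itself. Suppose that $\pi$ is a prefix of length $k < n$ of $\sigma \in S_n$ and that $w = i_1 i_2 \cdots i_{|w|}$ is a string over $[n]$ of length $|w| \le n - k$ with none of the $i_j$ occurring in $\pi$. Then, $\pi w$ denotes the partial permutation that extends $\pi$ by the pairs $(k+1, i_1), (k+2, i_2), \ldots, (k+|w|, i_{|w|})$. If in addition $\sigma(k+j) = i_j$ for $1 \le j \le |w|$, then $\pi w$ is also a prefix of $\sigma$. For our prefix search, given two graphs $G$ and $H$, we define the set of prefixes of the isomorphisms in $\mathrm{ISO}(G,H)$ by
$$\text{Pre-Iso} = \{ (G, H, \pi) \mid G \text{ and } H \text{ are graphs with } n \text{ vertices each and } (\exists w \in [n]^*)\,[\,|w| = n - |\pi| \text{ and } \pi w \in \mathrm{ISO}(G,H)\,] \}.$$
Note that, for $n \ge 1$, the empty string $\lambda$ does not encode a permutation in $S_n$. Furthermore, $\mathrm{ISO}(G,H) = \emptyset$ if and only if $(G, H, \lambda) \notin \text{Pre-Iso}$, which in turn is true if and only if $(G,H) \notin \mathrm{GI}$.
Starting from the empty string, we will construct, position by position, the smallest isomorphism between the two given graphs (if one exists). Below, we present a DPOTM $N$ that, using the NP set Pre-Iso as its oracle, computes the function $f$ by prefix search; see also Exercise 8.4-2. Denoting the class of functions computable in polynomial time by FP, we thus have shown that $f$ is in $\mathrm{FP}^{\text{Pre-Iso}}$. Since Pre-Iso is in NP (see Exercise 8.4-2), it follows that $f$ is in $\mathrm{FP}^{\mathrm{NP}}$.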
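N-Pre-Iso($G, H$)
 1  IF $(G, H, \lambda) \notin \text{Pre-Iso}$
 2    THEN RETURN $\lambda$
 3    ELSE $\pi \leftarrow \lambda$
 4         $j \leftarrow 0$
 5         WHILE $j < n$    ▷ $G$ and $H$ have $n$ vertices each.
 6           DO $i \leftarrow 1$
 7              WHILE $(G, H, \pi i) \notin \text{Pre-Iso}$
 8                DO $i \leftarrow i + 1$
 9              $\pi \leftarrow \pi i$
10              $j \leftarrow j + 1$
11         RETURN $\pi$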
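The prefix search of this example can be imitated in Python. In the sketch below, the NP oracle is replaced by a brute-force stand-in that simply tries all extensions of a prefix, so the code is exponential and only meant to illustrate the query pattern; the graph encoding (vertex set $\{1, \ldots, n\}$, edges as a set of `frozenset` pairs) and all function names are assumptions of this example.

```python
from itertools import permutations


def is_isomorphism(g_edges, h_edges, pi):
    """pi is a tuple with pi[i-1] = image of vertex i; edges are sets of frozensets."""
    mapped = {frozenset(pi[v - 1] for v in e) for e in g_edges}
    return mapped == h_edges


def pre_iso_oracle(g_edges, h_edges, n, prefix):
    """Brute-force stand-in for the oracle: can `prefix` be extended to an isomorphism?"""
    rest = [v for v in range(1, n + 1) if v not in prefix]
    return any(
        is_isomorphism(g_edges, h_edges, tuple(prefix) + w) for w in permutations(rest)
    )


def smallest_isomorphism(g_edges, h_edges, n):
    """Prefix search: lexicographically smallest isomorphism, or () if there is none."""
    if not pre_iso_oracle(g_edges, h_edges, n, ()):
        return ()
    pi = ()
    while len(pi) < n:
        i = 1
        # Smallest next image that still leads to some isomorphism.
        while i in pi or not pre_iso_oracle(g_edges, h_edges, n, pi + (i,)):
            i += 1
        pi = pi + (i,)
    return pi
```

Each position of the result is fixed after at most $n$ oracle queries, so the search asks at most $n^2 + 1$ queries in total; only the stand-in oracle makes this sketch slow.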
Example 8.3 shows that Turing machines computing functions can also be equipped with an oracle, and that function classes such as FP can also be relativizable. We now return to oracle machines accepting languages and use them to define several reducibilities. All reducibilities considered here are polynomial-time computable.
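Definition 8.9 (Turing reducibilities) Let $\Sigma = \{0, 1\}$ be a binary alphabet, let $A$ and $B$ be sets of strings over $\Sigma$, and let $\mathcal{C}$ be any complexity class. The set of complements of sets in $\mathcal{C}$ is defined by $\mathrm{co}\mathcal{C} = \{ \overline{L} \mid L \in \mathcal{C} \}$.
Define the following reducibilities:
Turing reducibility: $A \le_T^p B \iff A = L(M^B)$ for some DPOTM $M$.
Nondeterministic Turing reducibility: $A \le_T^{\mathrm{NP}} B \iff A = L(M^B)$ for some NPOTM $M$.
Strong nondeterministic Turing reducibility: $A \le_{sT}^{\mathrm{NP}} B \iff A \in \mathrm{NP}^B \cap \mathrm{coNP}^B$.
Let $\le_r$ be one of the reducibilities defined above. We call a set $B$ $\le_r$-hard for $\mathcal{C}$ if and only if $A \le_r B$ for each set $A \in \mathcal{C}$. A set $B$ is said to be $\le_r$-complete in $\mathcal{C}$ if and only if $B$ is $\le_r$-hard for $\mathcal{C}$ and $B \in \mathcal{C}$.
$\mathrm{P}^{\mathcal{C}} = \{ A \mid (\exists B \in \mathcal{C})\,[A \le_T^p B] \}$ is the closure of $\mathcal{C}$ under the $\le_T^p$-reducibility.
$\mathrm{NP}^{\mathcal{C}} = \{ A \mid (\exists B \in \mathcal{C})\,[A \le_T^{\mathrm{NP}} B] \}$ is the closure of $\mathcal{C}$ under the $\le_T^{\mathrm{NP}}$-reducibility.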
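Using the $\le_T^p$-reducibility and the $\le_T^{\mathrm{NP}}$-reducibility introduced in Definition 8.9, we now define the polynomial hierarchy, and the low and the high hierarchy within NP.
Definition 8.10 (Polynomial hierarchy) Define the polynomial hierarchy PH inductively as follows: $\Delta_0^p = \Sigma_0^p = \Pi_0^p = \mathrm{P}$, $\Delta_{i+1}^p = \mathrm{P}^{\Sigma_i^p}$, $\Sigma_{i+1}^p = \mathrm{NP}^{\Sigma_i^p}$ and $\Pi_{i+1}^p = \mathrm{co}\Sigma_{i+1}^p$ for $i \ge 0$, and $\mathrm{PH} = \bigcup_{k \ge 0} \Sigma_k^p$.
In particular, $\Delta_1^p = \mathrm{P}^{\mathrm{P}} = \mathrm{P}$ and $\Sigma_1^p = \mathrm{NP}^{\mathrm{P}} = \mathrm{NP}$ and $\Pi_1^p = \mathrm{coNP}$. The following theorem, which is stated without proof, provides some properties of the polynomial hierarchy; see Problem 8-2.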
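Theorem 8.11 (Meyer and Stockmeyer) For all $i \ge 1$, the following holds:
1. $\Sigma_{i-1}^p \cup \Pi_{i-1}^p \subseteq \Delta_i^p \subseteq \Sigma_i^p \cap \Pi_i^p$.
2. $\Sigma_i^p$, $\Pi_i^p$, $\Delta_i^p$, and PH are closed under $\le_m^p$-reductions. $\Delta_i^p$ is even closed under $\le_T^p$-reductions.
3. $\Sigma_i^p$ contains exactly those sets $A$ for which there exist a set $B$ in P and a polynomial $p$ such that for each $x \in \Sigma^*$:
$$x \in A \iff (\exists^p w_1)(\forall^p w_2) \cdots (Q^p w_i)\,[(x, w_1, w_2, \ldots, w_i) \in B],$$
where the quantifiers $\exists^p$ and $\forall^p$ are polynomially length-bounded, and $Q^p = \exists^p$ if $i$ is odd, and $Q^p = \forall^p$ if $i$ is even.
4. If $\Sigma_{i-1}^p = \Sigma_i^p$ for some $i$, then PH collapses to $\Sigma_{i-1}^p$.
5. If $\Sigma_i^p = \Pi_i^p$ for some $i$, then PH collapses to $\Sigma_i^p = \Pi_i^p$.
6. There are $\le_m^p$-complete problems for each of the classes $\Sigma_i^p$, $\Pi_i^p$, and $\Delta_i^p$. In contrast, if PH has a $\le_m^p$-complete problem, then PH collapses to a finite level, i.e., $\mathrm{PH} = \Sigma_k^p = \Pi_k^p$ for some $k$.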
Definition 8.12 (Low hierarchy and high hierarchy within NP) For $k \ge 0$, define the $k$th level of the
low hierarchy in NP by $\mathrm{Low}_k = \{ L \in \mathrm{NP} \mid \Sigma_k^{p,L} \subseteq \Sigma_k^p \}$;
high hierarchy in NP by $\mathrm{High}_k = \{ H \in \mathrm{NP} \mid \Sigma_{k+1}^p \subseteq \Sigma_k^{p,H} \}$.
Informally put, a set $L$ is in $\mathrm{Low}_k$ if and only if it is useless as an oracle for a $\Sigma_k^p$ computation. All information contained in $L \in \mathrm{Low}_k$ can be computed by the $\Sigma_k^p$ machine itself without the help of an oracle. On the other hand, a set $H$ in $\mathrm{High}_k$ is so rich and provides such useful information that the computational power of a $\Sigma_k^p$ machine is increased by the maximum amount an NP set can provide: it “jumps” to the next level of the polynomial hierarchy. That is, by using an oracle $H$ from $\mathrm{High}_k$, a $\Sigma_k^p$ machine can simulate any $\Sigma_{k+1}^p$ computation. Thus, $H$ is as useful for a $\Sigma_k^p$ machine as any NP-complete set.
For $k = 0$, the question of whether or not $\Sigma_k^p \neq \Sigma_{k+1}^p$ is nothing other than the P-versus-NP question. Theorem 8.13 lists some important properties of these hierarchies, omitting the proofs; see [226] and Problem 8-2. For the class coAM mentioned in the first item of Theorem 8.13, the reader is referred to the definition of the Arthur-Merlin hierarchy introduced in Subsection 7.5.1; cf. Definition 7.16. Ladner's theorem (Theorem 8.7) is a special case (for $n = 0$) of item 7 in Theorem 8.13.
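Theorem 8.13 (Schöning)
1. $\mathrm{Low}_0 = \mathrm{P}$ and $\mathrm{Low}_1 = \mathrm{NP} \cap \mathrm{coNP}$ and $\mathrm{NP} \cap \mathrm{coAM} \subseteq \mathrm{Low}_2$.
2. $\mathrm{High}_0 = \{ H \mid H \text{ is } \le_T^p\text{-complete in NP} \}$.
3. $\mathrm{High}_1 = \{ H \mid H \text{ is } \le_{sT}^{\mathrm{NP}}\text{-complete in NP} \}$.
4. $\mathrm{Low}_0 \subseteq \mathrm{Low}_1 \subseteq \cdots \subseteq \mathrm{Low}_k \subseteq \cdots \subseteq \mathrm{LH} \subseteq \mathrm{NP}$, where $\mathrm{LH} = \bigcup_{k \ge 0} \mathrm{Low}_k$.
5. $\mathrm{High}_0 \subseteq \mathrm{High}_1 \subseteq \cdots \subseteq \mathrm{High}_k \subseteq \cdots \subseteq \mathrm{HH} \subseteq \mathrm{NP}$, where $\mathrm{HH} = \bigcup_{k \ge 0} \mathrm{High}_k$.
6. For each $n \ge 0$, $\mathrm{Low}_n \cap \mathrm{High}_n$ is nonempty if and only if $\Sigma_n^p = \Sigma_{n+1}^p = \cdots = \mathrm{PH}$.
7. For each $n \ge 0$, NP contains sets that are neither in $\mathrm{Low}_n$ nor in $\mathrm{High}_n$ if and only if $\Sigma_n^p \neq \Sigma_{n+1}^p$.
8. There exist sets in NP that are neither in LH nor in HH if and only if PH is a strictly infinite hierarchy, i.e., if and only if PH does not collapse to a finite level.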
We now turn to the result that the graph isomorphism problem (GI) is in $\mathrm{Low}_2$, the second level of the low hierarchy. This result provides strong evidence against the NP-completeness of GI, as can be seen as follows. If GI were NP-complete, then GI would be in $\mathrm{High}_0 \subseteq \mathrm{High}_2$, since by Theorem 8.13 $\mathrm{High}_0$ contains exactly the $\le_T^p$-complete NP sets and in particular the $\le_m^p$-complete sets in NP. Again by Theorem 8.13, we have that $\mathrm{Low}_2 \cap \mathrm{High}_2$ is nonempty if and only if PH collapses to $\Sigma_2^p$, which is considered very unlikely.
To prove the lowness of the graph isomorphism problem, we first need a technical prerequisite, the so-called hashing lemma, stated here as Lemma 8.15. Hashing is a method widely used in computer science for dynamic data management. The idea is the following. Assign to every data set some (short) key in a unique manner. The set of all potential keys, called the universe $U$, is usually very large. In contrast, the set $V \subseteq U$ of those keys actually used is often much smaller. A hash function $h : U \to T$ maps the elements of $U$ to the hash table $T = \{0, 1, \ldots, k-1\}$. Hash functions are many-to-one mappings. That is, distinct keys from $U$ can be mapped to the same address in $T$. However, if possible, one wishes to map two distinct keys from $V$ to distinct addresses in $T$. That is, one seeks to avoid collisions on the set of keys actually used. If possible, a hash function thus should be injective on $V$.
Among the various known hashing techniques, we are here interested in universal hashing, which was introduced by Carter and Wegman [41] in 1979. The idea is to randomly choose a hash function from a suitable family of hash functions. This method is universal in the sense that it no longer depends on a particular set $V$ of keys actually used. Rather, it seeks to avoid collisions with high probability on all sufficiently small sets $V$. The probability here is with respect to the random choice of the hash function.
In what follows, we assume that keys are encoded as strings over the alphabet $\Sigma = \{0, 1\}$. The set of all length $n$ strings in $\Sigma^*$ is denoted by $\Sigma^n$.
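Definition 8.14 (Hashing) Let $\Sigma = \{0, 1\}$, and let $m$ and $t$ be positive integers with $t > m$. A hash function $h : \Sigma^t \to \Sigma^m$ is a linear mapping determined by a Boolean $t \times m$ matrix $B_h = (b_{i,j})_{i,j}$, where $b_{i,j} \in \{0, 1\}$. For $x \in \Sigma^t$ and $1 \le j \le m$, the $j$th bit of $y = h(x)$ in $\Sigma^m$ is given by $y_j = (b_{1,j} \wedge x_1) \oplus (b_{2,j} \wedge x_2) \oplus \cdots \oplus (b_{t,j} \wedge x_t)$, where $\oplus$ denotes the logical exclusive-or operation, i.e.,
$$a_1 \oplus a_2 \oplus \cdots \oplus a_n = 1 \iff |\{ i \mid a_i = 1 \}| \equiv 1 \bmod 2.$$
Let $\mathcal{H}_{t,m}$ be a family of hash functions for the parameters $t$ and $m$:
$$\mathcal{H}_{t,m} = \{ h : \Sigma^t \to \Sigma^m \mid B_h \text{ is a Boolean } t \times m \text{ matrix} \}.$$
On $\mathcal{H}_{t,m}$, we assume the uniform distribution: A hash function $h$ is chosen from $\mathcal{H}_{t,m}$ by picking the bits $b_{i,j}$ in $B_h$ independently and uniformly distributed.
Let $V \subseteq \Sigma^t$. For any subfamily $\hat{\mathcal{H}}$ of $\mathcal{H}_{t,m}$, we say there is a collision on $V$ if
$$(\exists v \in V)\,(\forall h \in \hat{\mathcal{H}})\,(\exists x \in V)\,[\,v \neq x \wedge h(v) = h(x)\,].$$
Otherwise, $\hat{\mathcal{H}}$ is said to be collision-free on $V$.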
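A small Python sketch may help to read this definition. It assumes that a hash function is given by a Boolean $t \times m$ matrix, that output bit $j$ is the exclusive-or of the products $b_{i,j} \wedge x_i$, and that a collision on a key set means that some key collides with another key under every hash function of the family; the function names and the list-of-rows matrix encoding are choices made for this example.

```python
def linear_hash(matrix, x):
    """Apply the hash function given by a Boolean t x m matrix to a bit tuple x of length t.

    Output bit j is the XOR over i of (matrix[i][j] AND x[i]).
    """
    t, m = len(matrix), len(matrix[0])
    return tuple(
        sum(matrix[i][j] & x[i] for i in range(t)) % 2 for j in range(m)
    )


def has_collision(family, keys):
    """Collision predicate: some key v collides with another key under EVERY hash function."""
    return any(
        all(
            any(v != x and linear_hash(b, v) == linear_hash(b, x) for x in keys)
            for b in family
        )
        for v in keys
    )
```

Note the order of the quantifiers: adding more hash functions to the family can only make a collision in this sense less likely, because a single function that separates a key from all others already rules that key out.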
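A collision on $V$ means that none of the hash functions in a subfamily $\hat{\mathcal{H}}$ is injective on $V$. The following lemma says that, if $V$ is sufficiently small, a randomly chosen subfamily of $\mathcal{H}_{t,m}$ is collision-free with high probability. In contrast, if $V$ is too large, collisions cannot be avoided. The proof of Lemma 8.15 is omitted.
Lemma 8.15 (Hashing lemma) Let $t, m \in \mathbb{N}$ be fixed parameters, where $t > m$. Let $V \subseteq \Sigma^t$ and let $\hat{\mathcal{H}} = (h_1, h_2, \ldots, h_{m+1})$ be a family of hash functions randomly chosen from $\mathcal{H}_{t,m}$ under the uniform distribution. Let
$$C(V) = \{ \hat{\mathcal{H}} \mid (\exists v \in V)\,(\forall h \in \hat{\mathcal{H}})\,(\exists x \in V)\,[\,v \neq x \wedge h(v) = h(x)\,] \}$$
be the event that for $\hat{\mathcal{H}}$ a collision occurs on $V$. Then, the following two statements hold:
1. If $|V| \le 2^{m-1}$, then $C(V)$ occurs with probability at most $1/4$.
2. If $|V| > (m+1) 2^m$, then $C(V)$ occurs with probability $1$.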
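In Section 7.5, the Arthur-Merlin hierarchy has been defined, and it was mentioned that this hierarchy collapses to its second level. Here, we are interested in the class coAM, cf. Definition 7.16 in Subsection 7.5.1.
Theorem 8.16 (Schöning) GI is in $\mathrm{Low}_2$.
Proof. By Theorem 8.13, every $\mathrm{NP} \cap \mathrm{coAM}$ set is in $\mathrm{Low}_2$. Thus, to prove that GI is in $\mathrm{Low}_2$, it is enough to show that GI is in coAM. Let $G$ and $H$ be two graphs with $n$ vertices each. We wish to apply Lemma 8.15. A first idea is to use
$$A(G,H) = \{ (F, \varphi) \mid F \cong G \text{ and } \varphi \in \mathrm{Aut}(F) \} \cup \{ (F, \varphi) \mid F \cong H \text{ and } \varphi \in \mathrm{Aut}(F) \}$$
as the set $V$ from that lemma. By Lemma 7.11, we have $|A(G,H)| = n!$ if $G \cong H$, and $|A(G,H)| = 2n!$ if $G \not\cong H$.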
The coAM machine we wish to construct for GI is polynomial-time bounded. Thus, the parameters $t$ and $m$ from the hashing lemma must be polynomial in $n$. So, to apply the hashing lemma, we would have to choose a polynomial $m = m(n)$ such that
$$n! \le 2^{m-1} < (m+1) 2^m < 2n! \qquad (8.4)$$
This would guarantee that the set $A(G,H)$ would be large enough to tell two isomorphic graphs $G$ and $H$ apart from two nonisomorphic graphs $G$ and $H$. Unfortunately, it is not possible to find a polynomial $m$ that satisfies (8.4). Thus, we define a different set $V$, which yields a gap large enough to distinguish isomorphic graphs from nonisomorphic graphs.
Define $V = A(G,H)^n = A(G,H) \times A(G,H) \times \cdots \times A(G,H)$ ($n$ times). Now, (8.4) changes to
$$(n!)^n \le 2^{m-1} < (m+1) 2^m < (2n!)^n \qquad (8.5)$$
and this inequality can be satisfied (for sufficiently large $n$) by choosing $m = m(n) = 1 + \lceil n \log n! \rceil$, which is polynomial in $n$ as desired.
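Construct a coAM machine $M$ for GI as follows. Given the graphs $G$ and $H$ each having $n$ vertices, $M$ first computes the parameter $m$. The set $V = A(G,H)^n$ contains $n$-tuples of pairs each having the form $(F, \varphi)$, where $F$ is a graph with $n$ vertices, and where $\varphi$ is a permutation in the automorphism group $\mathrm{Aut}(F)$. The elements of $V$ can be suitably encoded as strings over the alphabet $\Sigma = \{0, 1\}$ of length $t(n)$, for a suitable polynomial $t = t(n)$. All computations performed so far are deterministic.
Then, $M$ performs Arthur's probabilistic move by randomly choosing a family $\hat{\mathcal{H}} = (h_1, h_2, \ldots, h_{m+1})$ of hash functions from $\mathcal{H}_{t,m}$ under the uniform distribution. Each hash function $h_i \in \hat{\mathcal{H}}$ is represented by a Boolean $t \times m$ matrix. Thus, the $m+1$ hash functions in $\hat{\mathcal{H}}$ can be represented as a string $z_{\hat{\mathcal{H}}} \in \Sigma^*$ of length $p(n)$ for a suitable polynomial $p$. Modify the collision predicate $C(V)$ defined in the hashing lemma as follows:
$$B = \{ (G, H, z_{\hat{\mathcal{H}}}) \mid (\exists v \in V)\,(\forall i : 1 \le i \le m+1)\,(\exists x \in V)\,[\,v \neq x \wedge h_i(v) = h_i(x)\,] \}.$$
Note that the $\forall$ quantifier in $B$ ranges over only polynomially many $i$ and can thus be evaluated in deterministic polynomial time. It follows that the two $\exists$ quantifiers in $B$ can be merged into a single polynomially length-bounded $\exists$ quantifier. By Theorem 8.11, $B$ is a set in $\Sigma_1^p = \mathrm{NP}$. Let $N$ be an NPTM for $B$. For the string $z_{\hat{\mathcal{H}}}$ that encodes $m+1$ randomly picked hash functions from $\mathcal{H}_{t,m}$, $M$ now simulates the computation of $N(G, H, z_{\hat{\mathcal{H}}})$. This corresponds to Merlin's move. Finally, $M$ accepts its input $(G, H)$ if and only if $N(G, H, z_{\hat{\mathcal{H}}})$ accepts.
We now estimate the probability (taken over the random choice of the hash functions in $\hat{\mathcal{H}}$) that $M$ accepts its input $(G, H)$. If $G$ and $H$ are isomorphic, then $|A(G,H)| = n!$ by Lemma 7.11. Inequality (8.5) implies $|V| = (n!)^n \le 2^{m-1}$. By Lemma 8.15, the probability that $(G, H, z_{\hat{\mathcal{H}}})$ is in $B$ (and that $M(G,H)$ thus accepts) is at most $1/4$. However, if $G$ and $H$ are nonisomorphic, Lemma 7.11 implies that $|A(G,H)| = 2n!$. Inequality (8.5) now gives $|V| = (2n!)^n > (m+1) 2^m$. By Lemma 8.15, the probability that $(G, H, z_{\hat{\mathcal{H}}})$ is in $B$ and $M(G,H)$ thus accepts is $1$. It follows that GI is in coAM as desired.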
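The gap used in this proof can be inspected numerically. The sketch below assumes that the amplified set has $(n!)^n$ elements for isomorphic graphs and $(2 \cdot n!)^n$ elements for nonisomorphic graphs, and that the parameter is chosen as $m(n) = 1 + \lceil n \log_2 n! \rceil$; both the formula for $m$ and the function names are assumptions of this example. All arithmetic is done with exact integers.

```python
from math import factorial


def hashing_parameter(n):
    """m(n) = 1 + ceil(n * log2(n!)), computed exactly with integers."""
    x = factorial(n) ** n
    return 1 + (x - 1).bit_length()  # (x - 1).bit_length() == ceil(log2(x)) for x >= 1


def gap_inequality_holds(n):
    """Check (n!)^n <= 2^(m-1) < (m+1) * 2^m < (2 * n!)^n for m = m(n)."""
    m = hashing_parameter(n)
    small = factorial(n) ** n          # size of the set for isomorphic graphs
    large = (2 * factorial(n)) ** n    # size of the set for nonisomorphic graphs
    return small <= 2 ** (m - 1) < (m + 1) * 2 ** m < large
```

Under these assumptions the chain of inequalities fails for very small $n$ (for example $n = 5$) and holds for $n = 10$, which matches the usual reading that the choice of $m$ works for all sufficiently large $n$.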
The probabilistic complexity class RP was introduced in Definition 7.14 in Subsection 7.3.1. In this section, two other probabilistic complexity classes are important that we will now define: PP and SPP, which stand for Probabilistic Polynomial Time and Stoic Probabilistic Polynomial Time, respectively.
Definition 8.17 (PP and SPP) The class PP contains exactly those problems $A$ for which there exists an NPTM $M$ such that for each input $x$: If $x \in A$ then $M(x)$ accepts with probability at least $1/2$, and if $x \notin A$ then $M(x)$ accepts with probability less than $1/2$.
For any NPTM $M$ running on input $x$, let $\mathrm{acc}_M(x)$ denote the number of accepting computation paths of $M(x)$ and let $\mathrm{rej}_M(x)$ denote the number of rejecting computation paths of $M(x)$. Define $\mathrm{gap}_M(x) = \mathrm{acc}_M(x) - \mathrm{rej}_M(x)$.
The class SPP contains exactly those problems $A$ for which there exists an NPTM $M$ such that for each input $x$: $(x \in A \Rightarrow \mathrm{gap}_M(x) = 1)$ and $(x \notin A \Rightarrow \mathrm{gap}_M(x) = 0)$.
In other words, an SPP machine is “stoic” in the sense that its “gap” (i.e., the difference between its accepting and rejecting computation paths) can take on only two out of an exponential number of possible values, namely $1$ and $0$. Unlike PP, SPP is a so-called “promise class”, since an SPP machine $M$ “promises” that $\mathrm{gap}_M(x) \in \{0, 1\}$ for each $x$.
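The counting notions behind these two classes can be illustrated with a toy model. In the sketch below, a nondeterministic machine is modelled (as an assumption of this example) by a deterministic predicate `accepts(x, guess)` together with a fixed number of binary guesses, so that every bit tuple corresponds to exactly one computation path; the function name and this modelling are hypothetical and serve only to make the numbers of accepting and rejecting paths and their difference tangible.

```python
from itertools import product


def acc_rej_gap(accepts, x, num_guess_bits):
    """Count accepting and rejecting paths of a toy nondeterministic machine.

    `accepts(x, guess)` is a hypothetical deterministic predicate describing
    whether the path that guesses the bit tuple `guess` accepts input x.
    Returns the triple (acc, rej, gap) with gap = acc - rej.
    """
    acc = sum(1 for g in product((0, 1), repeat=num_guess_bits) if accepts(x, g))
    rej = 2 ** num_guess_bits - acc
    return acc, rej, acc - rej
```

In this model all paths are equally likely, so accepting with probability at least $1/2$ is the same as having a nonnegative gap, whereas the stoic condition restricts the gap to the two values $0$ and $1$.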
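The notion of lowness can be defined for any relativizable complexity class $\mathcal{C}$: A set $A$ is said to be $\mathcal{C}$-low if and only if $\mathcal{C}^A = \mathcal{C}$. In particular, for each $k$, the $k$th level $\mathrm{Low}_k$ of the low hierarchy within NP (see Definition 8.12) contains exactly the NP sets that are $\Sigma_k^p$-low. It is known that all sets in SPP are PP-low. This and other useful properties of SPP are listed in the following theorem without proof; see also [71], [150], [151].
Theorem 8.18
1. SPP is PP-low, i.e., $\mathrm{PP}^{\mathrm{SPP}} = \mathrm{PP}$.
2. SPP is self-low, i.e., $\mathrm{SPP}^{\mathrm{SPP}} = \mathrm{SPP}$.
3. Let $A$ be a set in NP via some NPTM $N$ and let $L$ be a set in $\mathrm{SPP}^A$ via some NPOTM $M$ such that, for each input $x$, $M^A(x)$ asks only queries $q$ satisfying $\mathrm{acc}_N(q) \le 1$. Then, $L$ is in SPP.
4. Let $A$ be a set in NP via some NPTM $N$ and let $f$ be a function in $\mathrm{FP}^A$ via some DPOTM $M$ such that, for each input $x$, $M^A(x)$ asks only queries $q$ satisfying $\mathrm{acc}_N(q) \le 1$. Then, $f$ is in $\mathrm{FP}^{\mathrm{SPP}}$.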
The following theorem says that the lexicographically smallest permutation in a right coset (see Definition 7.6 in Section 7.1.3) can be determined efficiently. The lexicographic order on $S_n$ is defined in Example 8.3.
Theorem 8.19 Let $\mathfrak{G} \le S_n$ be a permutation group with $\mathfrak{G} = \langle G \rangle$ and let $\pi$ be a permutation in $S_n$. There is a polynomial-time algorithm that, given $(G, \pi)$, computes the lexicographically smallest permutation in the right coset $\mathfrak{G}\pi$ of $\mathfrak{G}$ in $S_n$.
Proof. We now state our algorithm LERC for computing the lexicographically smallest permutation in the right coset $\mathfrak{G}\pi$ of $\mathfrak{G}$ in $S_n$, where the permutation group $\mathfrak{G}$ is given by a generator $G$, see Definition 7.6 in Subsection 7.1.3.
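LERC($G, \pi$)
1  compute the tower of stabilisers $\mathrm{id} = \mathfrak{G}^{(n)} \le \mathfrak{G}^{(n-1)} \le \cdots \le \mathfrak{G}^{(1)} \le \mathfrak{G}^{(0)} = \mathfrak{G}$ in $\mathfrak{G}$
2  $\varphi_0 \leftarrow \pi$
3  FOR $i \leftarrow 0$ TO $n - 1$
4    DO $x \leftarrow i + 1$
5       compute the element $y$ in the orbit $\mathfrak{G}^{(i)}(x)$ for which $\varphi_i(y)$ is minimum
6       compute a permutation $\tau_i$ in $\mathfrak{G}^{(i)}$ such that $\tau_i(x) = y$
7       $\varphi_{i+1} \leftarrow \tau_i \varphi_i$
8  RETURN $\varphi_n$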
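By Theorem 7.7, the tower of stabilisers $\mathrm{id} = \mathfrak{G}^{(n)} \le \mathfrak{G}^{(n-1)} \le \cdots \le \mathfrak{G}^{(1)} \le \mathfrak{G}^{(0)} = \mathfrak{G}$ of $\mathfrak{G}$ can be computed in polynomial time. More precisely, for each $i$ with $1 \le i \le n$, the complete right transversals $T_i$ of $\mathfrak{G}^{(i)}$ in $\mathfrak{G}^{(i-1)}$ are determined, and thus a strong generator $S = \bigcup_{i=1}^{n-1} T_i$ of $\mathfrak{G}$.
Note that $\varphi_0 = \pi$ and $\mathfrak{G}^{(n-1)} = \mathfrak{G}^{(n)} = \mathrm{id}$. Thus, to prove that the algorithm works correctly, it is enough to show that for each $i$ with $0 \le i < n - 1$, the lexicographically smallest permutation of $\mathfrak{G}^{(i)}\varphi_i$ is contained in $\mathfrak{G}^{(i+1)}\varphi_{i+1}$. By induction, it follows that $\mathfrak{G}^{(n)}\varphi_n = \{\varphi_n\}$ also contains the lexicographically smallest permutation of $\mathfrak{G}\pi = \mathfrak{G}^{(0)}\varphi_0$. Thus, algorithm LERC indeed outputs the lexicographically smallest permutation $\varphi_n$ of $\mathfrak{G}\pi$.
To prove the above claim, let us denote the orbit of an element $x \in [n]$ in a permutation group $\mathfrak{H} \le S_n$ by $\mathfrak{H}(x)$. Let $\tau_i$ be the permutation in $\mathfrak{G}^{(i)}$ that maps $i+1$ onto the element $y$ in the orbit $\mathfrak{G}^{(i)}(i+1)$ for which $\varphi_i(y) = x$ is the minimal element in the set $\{ \varphi_i(z) \mid z \in \mathfrak{G}^{(i)}(i+1) \}$.
By Theorem 7.7, the orbit $\mathfrak{G}^{(i)}(i+1)$ can be computed in polynomial time. Since $\mathfrak{G}^{(i)}(i+1)$ contains at most $n - i$ elements, $y$ can be determined efficiently. Our algorithm ensures that $\varphi_{i+1} = \tau_i \varphi_i$. Since every permutation in $\mathfrak{G}^{(i)}$ maps each element of $[i]$ onto itself, and since $\tau_i \in \mathfrak{G}^{(i)}$, it follows that for each $j$ with $1 \le j \le i$, for each $\tau \in \mathfrak{G}^{(i)}$, and for each $\sigma \in \mathfrak{G}^{(i+1)}$,
$$(\sigma \varphi_{i+1})(j) = \varphi_{i+1}(j) = (\tau_i \varphi_i)(j) = \varphi_i(j) = (\tau \varphi_i)(j).$$
In particular, it follows for the lexicographically smallest permutation $\mu$ in $\mathfrak{G}^{(i)}\varphi_i$ that every permutation from $\mathfrak{G}^{(i+1)}\varphi_{i+1}$ must coincide with $\mu$ on the first $i$ elements, i.e., on $[i]$.
Moreover, for each $\sigma \in \mathfrak{G}^{(i+1)}$ and for the element $x = \varphi_i(y)$ defined above, we have
$$(\sigma \varphi_{i+1})(i+1) = \varphi_{i+1}(i+1) = (\tau_i \varphi_i)(i+1) = x.$$
Clearly, $\mathfrak{G}^{(i+1)}\varphi_{i+1} = \{ \varphi \in \mathfrak{G}^{(i)}\varphi_i \mid \varphi(i+1) = x \}$. The claim now follows from the fact that $\mu(i+1) = x$ for the lexicographically smallest permutation $\mu$ of $\mathfrak{G}^{(i)}\varphi_i$.
Thus, LERC is a correct algorithm. It is easy to see that it is also efficient.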
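For very small groups, the greedy idea behind this algorithm can be compared against exhaustive search. The sketch below is not the polynomial-time algorithm: it enumerates the whole group instead of computing a tower of stabilisers via strong generators, and it assumes that products of permutations are read from left to right, so that $(\tau\varphi)(x) = \varphi(\tau(x))$. Permutations are encoded as tuples with entry $x-1$ holding the image of $x$; all names are chosen for this example.

```python
def compose(tau, phi):
    """Left-to-right product tau*phi: first apply tau, then phi (assumed convention)."""
    return tuple(phi[tau[x] - 1] for x in range(len(tau)))


def generate_group(generators, n):
    """All elements of the group generated by `generators` (exponential in general)."""
    identity = tuple(range(1, n + 1))
    elements, frontier = {identity}, [identity]
    while frontier:
        g = frontier.pop()
        for s in generators:
            h = compose(g, s)
            if h not in elements:
                elements.add(h)
                frontier.append(h)
    return elements


def lerc_greedy(generators, pi):
    """Greedy coset search: fix the images of 1, 2, ..., n one position at a time."""
    n = len(pi)
    group = generate_group(generators, n)
    phi = tuple(pi)
    for i in range(n):
        x = i + 1
        # Pointwise stabiliser of {1, ..., i}, obtained here by filtering.
        stab = [g for g in group if all(g[a] == a + 1 for a in range(i))]
        # Element of the stabiliser sending x to the y in its orbit with minimal phi(y).
        tau = min(stab, key=lambda g: phi[g[x - 1] - 1])
        phi = compose(tau, phi)
    return phi


def lerc_bruteforce(generators, pi):
    """Reference answer: minimum over the whole right coset."""
    return min(compose(g, tuple(pi)) for g in generate_group(generators, len(pi)))
```

On small inputs both functions return the same permutation, which is a useful sanity check for the greedy argument; a practical implementation would replace `generate_group` and the filtering step by a strong generating set.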
Theorem 8.19 can easily be extended to Corollary 8.20, see Problem 8-3.
Corollary 8.20 Let $\mathfrak{G} \le S_n$ be a permutation group with $\mathfrak{G} = \langle G \rangle$, and let $\pi$ and $\psi$ be two given permutations in $S_n$. There exists a polynomial-time algorithm that, given $(G, \pi, \psi)$, computes the lexicographically smallest permutation of $\psi\mathfrak{G}\pi$.
We now prove Theorem 8.21, the main result of this section.
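Theorem 8.21 (Arvind and Kurur) GI is in SPP.
Proof. Define the (functional) problem AUTO as follows: Given a graph $G$, compute a strong generator of the automorphism group $\mathrm{Aut}(G)$; see Definition 7.6 and the subsequent paragraph and Definition 7.8 for these notions. By Mathon's [176] result, the problems AUTO and GI are Turing-equivalent (see also [151]), i.e., AUTO is in $\mathrm{FP}^{\mathrm{GI}}$ and GI is in $\mathrm{P}^{\mathrm{AUTO}}$. Thus, it is enough to show that AUTO is in $\mathrm{FP}^{\mathrm{SPP}}$ because the self-lowness of SPP stated in Theorem 8.18 implies that GI is in $\mathrm{P}^{\mathrm{AUTO}} \subseteq \mathrm{SPP}^{\mathrm{SPP}} \subseteq \mathrm{SPP}$, which will complete the proof. So, our goal is to find an $\mathrm{FP}^{\mathrm{SPP}}$ algorithm for AUTO. Given a graph $G$, this algorithm has to compute a strong generator $S = \bigcup_{i=1}^{n} T_i$ for $\mathrm{Aut}(G)$, where
$$\mathrm{id} = \mathrm{Aut}(G)^{(n)} \le \mathrm{Aut}(G)^{(n-1)} \le \cdots \le \mathrm{Aut}(G)^{(1)} \le \mathrm{Aut}(G)^{(0)} = \mathrm{Aut}(G)$$
is the tower of stabilisers of $\mathrm{Aut}(G)$ and $T_i$, $1 \le i \le n$, is a complete right transversal of $\mathrm{Aut}(G)^{(i)}$ in $\mathrm{Aut}(G)^{(i-1)}$.
Starting with the trivial case, $\mathrm{Aut}(G)^{(n)} = \mathrm{id}$, we build step by step a strong generator for $\mathrm{Aut}(G)^{(i)}$, where $i$ is decreasing. Eventually, we will thus obtain a strong generator for $\mathrm{Aut}(G)^{(0)} = \mathrm{Aut}(G)$. So suppose a strong generator $S_i$ for $\mathrm{Aut}(G)^{(i)}$ has already been found. We now describe how to determine a complete right transversal $T_{i-1}$ of $\mathrm{Aut}(G)^{(i)}$ in $\mathrm{Aut}(G)^{(i-1)}$ by our $\mathrm{FP}^{\mathrm{SPP}}$ algorithm. Define the oracle set
$$A = \{ (G, S, i, j, \pi) \mid S \subseteq \mathrm{Aut}(G) \text{ and } \langle S \rangle \text{ is a pointwise stabiliser of } [i] \text{ in } \mathrm{Aut}(G),\ \pi \text{ is a partial permutation that pointwise stabilises } [i-1],\ \pi(i) = j, \text{ and there is a } \tau \in \mathrm{Aut}(G)^{(i-1)} \text{ with } \tau(i) = j \text{ and } \mathrm{LERC}(S, \tau) \text{ extends } \pi \}.$$
By Theorem 8.19, the lexicographically smallest permutation LERC$(S, \tau)$ of the right coset $\langle S \rangle \tau$ can be determined in polynomial time by our algorithm. The partial permutation $\pi$ belongs to the input $(G, S, i, j, \pi)$, since we wish to use $A$ as an oracle in order to find the lexicographically smallest permutation by prefix search; cf. Example 8.3.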
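Consider the following NPTM $N$ for $A$:
N($G, S, i, j, \pi$)
1  verify that $S \subseteq \mathrm{Aut}(G)^{(i)}$
2  nondeterministically guess a permutation $\tau \in S_n$    ▷ $G$ has $n$ vertices
3  IF $\tau \in \mathrm{Aut}(G)^{(i-1)}$ and $\tau(i) = j$ and $\tau$ extends $\pi$ and $\tau = $ LERC$(S, \tau)$
4    THEN accept and halt
5    ELSE reject and halt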
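Thus, $A$ is in NP. Note that if $\tau(i) = j$ then $\sigma(i) = j$, for each permutation $\sigma$ in the right coset $\langle S \rangle \tau$.
We now show that if $\langle S \rangle = \mathrm{Aut}(G)^{(i)}$ then the number of accepting computation paths of $N$ on input $(G, S, i, j, \pi)$ is either $0$ or $1$. In general, $\mathrm{acc}_N(G, S, i, j, \pi) \in \{ 0, |\mathrm{Aut}(G)^{(i)}| / |\langle S \rangle| \}$.
Suppose $(G, S, i, j, \pi)$ is in $A$ and $\langle S \rangle = \mathrm{Aut}(G)^{(i)}$. If $\tau(i) = j$ for some $\tau \in \mathrm{Aut}(G)^{(i-1)}$ and $j > i$, then the right coset $\langle S \rangle \tau$ contains exactly those permutations in $\mathrm{Aut}(G)^{(i-1)}$ that map $i$ to $j$. Thus, the only accepting computation path of $N(G, S, i, j, \pi)$ corresponds to the unique lexicographically smallest permutation $\sigma = $ LERC$(S, \tau)$. If, on the other hand, $\langle S \rangle$ is a strict subgroup of $\mathrm{Aut}(G)^{(i)}$, then $\mathrm{Aut}(G)^{(i)}\tau$ can be written as the disjoint union of $k = |\mathrm{Aut}(G)^{(i)}| / |\langle S \rangle|$ right cosets of $\langle S \rangle$. In general, $N(G, S, i, j, \pi)$ thus possesses $k$ accepting computation paths if $(G, S, i, j, \pi)$ is in $A$, and otherwise it has no accepting computation path.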
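M-A($G$)
 1  set $T_i \leftarrow \{\mathrm{id}\}$ for each $i$, $0 \le i \le n - 2$    ▷ $G$ has $n$ vertices
 2    ▷ $T_i$ will be a complete right transversal of $\mathrm{Aut}(G)^{(i+1)}$ in $\mathrm{Aut}(G)^{(i)}$.
 3  set $S_i \leftarrow \emptyset$ for each $i$, $0 \le i \le n - 2$
 4  set $S_{n-1} \leftarrow \{\mathrm{id}\}$    ▷ $S_i$ will be a strong generator for $\mathrm{Aut}(G)^{(i)}$.
 5  FOR $i \leftarrow n - 1$ DOWNTO 1    ▷ $S_i$ is already found at the start of the $i$th iteration, and $S_{i-1}$ will now be computed.
 6    DO let $\pi : [i-1] \to [n]$ be the partial permutation with $\pi(a) = a$ for each $a \in [i-1]$
 7         ▷ For $i = 1$, $\pi$ is the nowhere defined partial permutation.
 8       FOR $j \leftarrow i + 1$ TO $n$
 9         DO $\hat{\pi} \leftarrow \pi j$, i.e., $\hat{\pi}$ extends $\pi$ by the pair $(i, j)$ with $\hat{\pi}(i) = j$
10            IF $(G, S_i, i, j, \hat{\pi}) \in A$
11              THEN ▷ Construct the smallest permutation in $\mathrm{Aut}(G)^{(i-1)}$ mapping $i$ to $j$ by prefix search.
12                FOR $k \leftarrow i + 1$ TO $n$
13                  DO find the element $\ell$ not in the image of $\hat{\pi}$ with $(G, S_i, i, j, \hat{\pi}\ell) \in A$
14                     $\hat{\pi} \leftarrow \hat{\pi}\ell$
15                ▷ Now, $\hat{\pi}$ is a total permutation in $\mathrm{Aut}(G)^{(i-1)}$
16                $T_{i-1} \leftarrow T_{i-1} \cup \{\hat{\pi}\}$    ▷ now, $T_{i-1}$ is a complete right transversal of $\mathrm{Aut}(G)^{(i)}$ in $\mathrm{Aut}(G)^{(i-1)}$
17       $S_{i-1} \leftarrow S_i \cup T_{i-1}$
18  RETURN $S_0$    ▷ $S_0$ is a strong generator for $\mathrm{Aut}(G) = \mathrm{Aut}(G)^{(0)}$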
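The above algorithm is an $\mathrm{FP}^A$ algorithm $M^A$ for AUTO. The DPOTM $M$ makes only queries $q = (G, S_i, i, j, \pi)$ to its oracle $A$ for which $\langle S_i \rangle = \mathrm{Aut}(G)^{(i)}$. Thus, $\mathrm{acc}_N(q) \le 1$ for each query $q$ actually asked. By item 4 of Theorem 8.18, it follows that AUTO is in $\mathrm{FP}^{\mathrm{SPP}}$.
The claim that the output $S_0$ of $M^A(G)$ indeed is a strong generator for $\mathrm{Aut}(G) = \mathrm{Aut}(G)^{(0)}$ can be shown by induction on $n$. The induction base is $n - 1$, and of course $S_{n-1} = \{\mathrm{id}\}$ generates $\mathrm{Aut}(G)^{(n-1)} = \mathrm{id}$.
For the induction step, assume that prior to the $i$th iteration a strong generator $S_i$ for $\mathrm{Aut}(G)^{(i)}$ has already been found. We now show that after the $i$th iteration the set $S_{i-1} = S_i \cup T_{i-1}$ is a strong generator for $\mathrm{Aut}(G)^{(i-1)}$. For each $j$ with $i + 1 \le j \le n$, the oracle query “$(G, S_i, i, j, \hat{\pi}) \in A$?” checks whether there is a permutation in $\mathrm{Aut}(G)^{(i-1)}$ mapping $i$ to $j$. By prefix search, which is performed by making suitable oracle queries to $A$ again, the lexicographically smallest permutation $\hat{\pi}$ in $\mathrm{Aut}(G)^{(i-1)}$ with $\hat{\pi}(i) = j$ is constructed. Note that, as claimed above, only queries $q$ satisfying $\mathrm{acc}_N(q) \le 1$ are made to $A$, since $S_i$ is a strong generator for $\mathrm{Aut}(G)^{(i)}$, so $\langle S_i \rangle = \mathrm{Aut}(G)^{(i)}$. By construction, after the $i$th iteration, $T_{i-1}$ is a complete right transversal of $\mathrm{Aut}(G)^{(i)}$ in $\mathrm{Aut}(G)^{(i-1)}$. It follows that $S_{i-1} = S_i \cup T_{i-1}$ is a strong generator for $\mathrm{Aut}(G)^{(i-1)}$. Eventually, after $n - 1$ iterations, a strong generator $S_0$ for $\mathrm{Aut}(G) = \mathrm{Aut}(G)^{(0)}$ is found.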
From Theorem 8.21 and the first two items of Theorem 8.18, we obtain Corollary 8.22.
Corollary 8.22 GI is low for both SPP and PP, i.e., $\mathrm{SPP}^{\mathrm{GI}} = \mathrm{SPP}$ and $\mathrm{PP}^{\mathrm{GI}} = \mathrm{PP}$.
Exercises
8.4-1 By Definition 8.9, $A \le_T^p B$ if and only if $A \in \mathrm{P}^B$. Show that $A \le_T^p B$ if and only if $\mathrm{P}^A \subseteq \mathrm{P}^B$.
8.4-2 Show that the set Pre-Iso defined in Example 8.3 is in NP. Moreover, prove that the machine $N$ defined in Example 8.3 runs in polynomial time, i.e., show that $N$ is a DPOTM.
PROBLEMS
8-1
A strong NPOTM is an NPOTM $M$ with three types of final states, i.e., the set $F$ of final states of $M$ is partitioned into $F_a$ (accepting states), $F_r$ (rejecting states), and $F_?$ (“don't know” states) such that the following holds: If $x \in A$ then $M^B(x)$ has at least one computation path halting in an accepting state from $F_a$ but no computation path halting in a rejecting state from $F_r$. If $x \notin A$ then $M^B(x)$ has at least one computation path halting in a rejecting state from $F_r$ but no computation path halting in an accepting state from $F_a$. In both cases $M^B(x)$ may have computation paths halting in “don't know” states from $F_?$. In other words, strong NPOTMs are machines that never lie. Prove the following two assertions:
(a) $A \le_{sT}^{\mathrm{NP}} B$ if and only if there exists a strong NPOTM $M$ with $A = L(M^B)$.
(b) $A \le_{sT}^{\mathrm{NP}} B$ if and only if $\mathrm{NP}^A \subseteq \mathrm{NP}^B$.
Hint. Look at Exercise 8.4-1.
8-2
Prove the assertions from Theorems 8.11 and 8.13. (Some are far from being trivial!)
8-3
Modification of the proofs
Modify the proof of Theorem 8.19 so as to obtain Corollary 8.20.
CHAPTER NOTES
Parts of the Chapters 7 and 8 are based on the book [219] that provides the proofs omitted here, such as those of Theorems 8.11, 8.13, and 8.18 and of Lemma 8.15, and many more details.
More background on complexity theory can be found in the books [112], [198], [266], [271]. A valuable source on the theory of NP-completeness is still the classic [81] by Garey and Johnson. The $\le_T^p$-reducibility was introduced by Cook [51], and the $\le_m^p$-reducibility by Karp [140]. A deep and profound study of polynomial-time reducibilities is due to Ladner, Lynch, and Selman [157].
Exercise 8.4-1 and Problem 8-1 are due to Selman [231].
Dantsin et al. [59] obtained an upper bound of $\tilde{\mathcal{O}}(1.481^n)$ for the deterministic time complexity of 3-SAT, which was further improved by Brueggemann and Kern [34] to $\tilde{\mathcal{O}}(1.4726^n)$. The randomised algorithm presented in Subsection 8.3.2 is due to Schöning [228]; it is based on a “limited local search with restart”. For $k$-SAT with $k \ge 4$, the algorithm by Paturi et al. [200] is slightly better than Schöning's algorithm. Iwama and Tamaki [131] combined the ideas of Schöning [228] and Paturi et al. [200] to obtain a bound of $\tilde{\mathcal{O}}(1.324^n)$ for $k$-SAT with $k \in \{3, 4\}$. For $k$-SAT with $k \ge 5$, their algorithm is not better than that by Paturi et al. [200].
Figure 8.7 gives an overview of some algorithms for the satisfiability problem.
For a thorough, comprehensive treatment of the graph isomorphism problem the reader is referred to the book by Köbler, Schöning, and Torán [152], particularly under complexity-theoretic aspects. Hoffman [115] investigates group-theoretic algorithms for the graph isomorphism problem and related problems.
The polynomial hierarchy was introduced by Meyer and Stockmeyer [178], [250]. In particular, Theorem 8.11 is due to them. Schöning [226] introduced the low hierarchy and the high hierarchy within NP. The results stated in Theorem 8.13 are due to him [226]. He also proved that is in , see [227]. Köbler et al. [150], [151] obtained the first lowness results of for probabilistic classes such as . These results were improved by Arvind and Kurur [12] who proved that is even in . The class generalises Valiant's class , see [261]. So-called “promise classes” such as and have been thoroughly studied in a number of papers; see, e.g., [12], [30], [71], [150], [151], [210], [218]. Lemma 8.15 is due to Carter and Wegman [41].
The author is grateful to Uwe Schöning for his helpful comments on an earlier version of this chapter and for sending the slides of one of his talks that helped simplify the probability analysis for the algorithm Random-SAT sketched in Subsection 8.3; a more comprehensive analysis can be found in Schöning's book [229]. Thanks are also due to Dietrich Stoyan, Robert Stoyan, Sigurd Assing, Gábor Erdélyi and Holger Spakowski for proofreading previous versions of Chapters 7 and 8. This work was supported in part by the Deutsche Forschungsgemeinschaft (DFG) under grants RO 1202/9-1 and RO 1202/9-3 and by the Alexander von Humboldt Foundation in the TransCoop program.
In on-line computation an algorithm must make its decisions based only on past events, without reliable information about the future. Such methods are called on-line algorithms. On-line algorithms have many applications in different areas such as computer science, economics and operations research.
The first results in this area appeared around 1970, and since 1990 more and more researchers have worked on problems related to on-line algorithms. Many subfields have been developed and investigated. Nowadays new results in the area are presented at the most important conferences on algorithms. This chapter does not give a detailed overview of the results, because that is not possible in this framework. The goal of the chapter is to show some of the main methods of analysing and developing on-line algorithms by presenting some subareas in more detail.
In the next section we define the basic notions used in the analysis of on-line algorithms. After giving the most important definitions we present one of the best-known on-line problems, the on-line $k$-server problem, and some of the related results. Then we deal with a new area by presenting on-line problems belonging to computer networks. In the next section the on-line bin packing problem and its multidimensional generalisations are presented. Finally in the last section of this chapter we show some basic results concerning the area of on-line scheduling.
Since an on-line algorithm makes its decisions based on partial information without knowing the whole instance in advance, we cannot expect it to give the optimal solution which can be given by an algorithm having full information. An algorithm which knows the whole input in advance is called an off-line algorithm.
There are two main methods to measure the performance of on-line algorithms. One possibility is to use average case analysis where we hypothesise some distribution on events and we study the expected total cost.
The disadvantage of this approach is that usually we do not have any information about the distribution of the possible inputs. In this chapter we do not use the average case analysis.
Another approach is worst case analysis, which is called competitive analysis. In this case we compare the objective function value of the solution produced by the on-line algorithm to the optimal off-line objective function value.
In case of on-line minimisation problems an on-line algorithm is called $C$-competitive, if the cost of the solution produced by the on-line algorithm is at most $C$ times the optimal off-line cost for each input. The competitive ratio of an algorithm is the smallest $C$ for which the algorithm is $C$-competitive.
For an arbitrary on-line algorithm ALG we denote the objective function value achieved on input $I$ by $ALG(I)$. The optimal off-line objective function value on $I$ is denoted by $OPT(I)$. Using this notation we can define the competitiveness as follows.
Algorithm ALG is $C$-competitive, if $ALG(I) \le C \cdot OPT(I)$ is valid for each input $I$.
There are two further versions of the competitiveness which are often used. For a minimisation problem an algorithm ALG is called weakly $C$-competitive, if there exists such a constant $B$ that $ALG(I) \le C \cdot OPT(I) + B$ is valid for each input $I$.
The weak competitive ratio of an algorithm is the smallest $C$ for which the algorithm is weakly $C$-competitive.
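The two definitions can be tried out on measured costs. The following minimal Python sketch assumes the usual inequality $ALG(I) \le C \cdot OPT(I) + B$ for weak competitiveness (with $B = 0$ giving ordinary competitiveness). The function names and the sample cost pairs are hypothetical and serve only as an illustration. A finite sample can refute a claimed ratio but can never prove it.

```python
from typing import Iterable, Tuple

def is_weakly_competitive(costs: Iterable[Tuple[float, float]],
                          c: float, b: float = 0.0) -> bool:
    """True if ALG(I) <= c * OPT(I) + b holds for every (ALG(I), OPT(I)) pair."""
    return all(alg <= c * opt + b for alg, opt in costs)

def empirical_ratio(costs: Iterable[Tuple[float, float]]) -> float:
    """Largest observed ALG(I) / OPT(I); only a lower bound on the true ratio."""
    return max(alg / opt for alg, opt in costs if opt > 0)

# Hypothetical measurements (ALG(I), OPT(I)) on three inputs.
samples = [(3.0, 2.0), (10.0, 4.0), (101.0, 50.0)]

print(is_weakly_competitive(samples, c=2.5))         # True
print(is_weakly_competitive(samples, c=2.0))         # False, since 10 > 2 * 4
print(is_weakly_competitive(samples, c=2.0, b=2.0))  # True with additive constant 2
print(empirical_ratio(samples))                      # 2.5
```

The third call shows why the additive constant matters: the same algorithm fails the strict test with $C = 2$ but passes the weak one.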
A further version of the competitive ratio is the asymptotic competitive ratio. For minimisation problems the asymptotic competitive ratio $R_{ALG}^{\infty}$ of algorithm ALG can be defined as follows:
$$R_{ALG}^{n} = \sup\left\{ \frac{ALG(I)}{OPT(I)} \;\middle|\; OPT(I) = n \right\}, \qquad R_{ALG}^{\infty} = \limsup_{n \to \infty} R_{ALG}^{n}.$$
An algorithm is called asymptotically $C$-competitive if its asymptotic competitive ratio is at most $C$.
The main property of the asymptotic ratio is that it considers the performance of the algorithm under the assumption that the size of the input tends to $\infty$. This means that this ratio is not affected by the behaviour of the algorithm on small inputs.
Similar definitions can be given for maximisation problems. In that case algorithm ALG is called $C$-competitive, if $ALG(I) \ge C \cdot OPT(I)$ is valid for each input $I$, and the algorithm is weakly $C$-competitive if there exists such a constant $B$ that $ALG(I) \ge C \cdot OPT(I) + B$ is valid for each input $I$. The asymptotic ratio for maximisation problems can be given as follows:
$$R_{ALG}^{n} = \inf\left\{ \frac{ALG(I)}{OPT(I)} \;\middle|\; OPT(I) = n \right\}, \qquad R_{ALG}^{\infty} = \liminf_{n \to \infty} R_{ALG}^{n}.$$
The algorithm is called asymptotically $C$-competitive if its asymptotic ratio is at least $C$.
Many papers are devoted to randomised on-line algorithms, in which case the objective function value achieved by the algorithm is a random variable, and the expected value of this variable is used in the definition of the competitive ratio. Since we consider only deterministic on-line algorithms in this chapter, we do not detail the notions related to randomised on-line algorithms.
One of the best-known on-line problems is the on-line $k$-server problem. To give the definition of the general problem the notion of metric spaces is needed. A pair $(M, d)$ (where $M$ contains the points of the space and $d$ is the distance function defined on the set $M \times M$) is called a metric space if the following properties are valid:
$d(x, y) \ge 0$ for all $x, y \in M$,
$d(x, y) = d(y, x)$ for all $x, y \in M$,
$d(x, y) + d(y, z) \ge d(x, z)$ for all $x, y, z \in M$,
$d(x, y) = 0$ holds if and only if $x = y$.
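For a finite set of points the four properties can be verified by brute force. The sketch below is only an illustration: the helper name `is_metric` and the sample points are our own assumptions, and a small tolerance is used because of floating-point arithmetic.

```python
from itertools import product

def is_metric(points, d, tol=1e-12):
    """Check the four metric-space properties on a finite list of points."""
    for x, y in product(points, repeat=2):
        if d(x, y) < -tol:                      # non-negativity
            return False
        if abs(d(x, y) - d(y, x)) > tol:        # symmetry
            return False
        if (d(x, y) <= tol) != (x == y):        # d(x, y) = 0 iff x = y
            return False
    for x, y, z in product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z) + tol:   # triangle inequality
            return False
    return True

line_points = [0.0, 1.0, 3.5]
print(is_metric(line_points, lambda a, b: abs(a - b)))     # True: the real line
print(is_metric(line_points, lambda a, b: (a - b) ** 2))   # False: triangle inequality fails
```

The squared distance fails because $3.5^2 = 12.25 > 1^2 + 2.5^2 = 7.25$.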
In the $k$-server problem a metric space is given, and there are $k$ servers which can move in the space. The decision maker has to satisfy a list of requests appearing at the points of the metric space by sending a server to the point where the request appears.
The problem is on-line which means that the requests arrive one by one, and we must satisfy each request without any information about the further requests. The goal is to minimise the total distance travelled by the servers. In the remaining parts of the section the multiset which contains the points where the servers are is called the configuration of the servers. We use multisets, since different servers can be at the same points of the space.
The first important results for the $k$-server problem were achieved by Manasse, McGeoch and Sleator. They developed the following algorithm called Balance, which we denote by BAL. During the procedure the servers are in different points. The algorithm stores for each server the total distance travelled by the server. The servers and the points in the space where the servers are located are denoted by $s_1, \ldots, s_k$. Let the total distance travelled by the server $s_i$ be $D_i$. After the arrival of a request at point $P$ algorithm BAL uses server $i$ for which the value $D_i + d(s_i, P)$ is minimal. This means that the algorithm tries to balance the distances travelled by the servers. Therefore the algorithm maintains the server configuration $S = \{s_1, \ldots, s_k\}$ and the distances travelled by the servers, which have starting values $D_1 = \cdots = D_k = 0$. The behaviour of the algorithm on input $I = P_1, \ldots, P_n$ can be given by the following pseudocode:
BAL($I$)
FOR $j \leftarrow 1$ TO $n$
  DO $i \leftarrow \operatorname{argmin}_{l} \{D_l + d(s_l, P_j)\}$
     serve the request with server $i$
     $D_i \leftarrow D_i + d(s_i, P_j)$
     $s_i \leftarrow P_j$
Example 9.1 Consider the two dimensional Euclidean space as the metric space. The points are two dimensional real vectors , and the distance between and is . Suppose that there are two servers which are located at points and at the beginning. Therefore at the beginning , , . Suppose that the first request appears at point . Then , thus the second server is used to satisfy the request and after the action of the server , , . Suppose that the second request appears at point , so , thus again the second server is used, and after serving the request , , . Suppose that the third request appears at point , so , thus the first server is used, and after serving the request , , .
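The rule of Balance can also be written as a short Python sketch. It assumes that the server chosen is the one minimising the sum of its travelled distance and its distance from the request. The function names, the tie-breaking rule (smallest index) and the sample coordinates are our own illustrative choices, not data taken from the text.

```python
import math

def euclid(p, q):
    """Euclidean distance between two points given as coordinate tuples."""
    return math.dist(p, q)

def balance(servers, requests, dist=euclid):
    """Serve each request with the server minimising D_i + d(s_i, P).

    Returns the final positions, the distances travelled and, for each
    request, the index of the server used (ties go to the smallest index).
    """
    pos = list(servers)
    travelled = [0.0] * len(pos)
    used = []
    for p in requests:
        i = min(range(len(pos)), key=lambda l: travelled[l] + dist(pos[l], p))
        travelled[i] += dist(pos[i], p)
        pos[i] = p
        used.append(i)
    return pos, travelled, used

# Hypothetical instance: two servers in the plane and three requests.
pos, travelled, used = balance([(0, 0), (1, 1)], [(1, 4), (2, 4), (1, 4)])
print(used)        # indices (0-based) of the servers used
print(travelled)   # total distance travelled by each server
```

In this instance the second server is used twice; after that its accumulated distance makes the first server the cheaper choice, which is exactly the balancing effect.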
The algorithm is efficient for some particular metric spaces, as shown by the following statement. References to the proof of the theorem can be found in the chapter notes at the end of the chapter.
Theorem 9.1 Algorithm Balance is weakly $k$-competitive for the metric spaces containing $k+1$ points.
The following statement shows that there is no on-line algorithm which is better than $k$-competitive for the general $k$-server problem.
Theorem 9.2 There is no metric space containing at least $k+1$ points where an on-line algorithm exists with smaller competitive ratio than $k$.
Proof. Consider an arbitrary metric space containing at least $k+1$ points and an arbitrary on-line algorithm, say ONL. Denote the points of the starting configuration of ONL by $P_1, P_2, \ldots, P_k$, and let $P_{k+1}$ be another point of the metric space. Consider the following long list of requests $I = Q_1, \ldots, Q_n$. The next request appears at the point among $P_1, P_2, \ldots, P_{k+1}$ where ONL has no server.
First calculate the value $ONL(I)$. The algorithm does not have any servers at point $Q_{j+1}$ after serving $Q_j$, thus the request that appeared at $Q_j$ is served by the server located at point $Q_{j+1}$. Therefore the cost of serving $Q_j$ is $d(Q_j, Q_{j+1})$, which yields
$$ONL(I) = \sum_{j=1}^{n} d(Q_j, Q_{j+1}),$$
where $Q_{n+1}$ denotes the point from which the server was sent to serve $Q_n$. (This is the point where the $(n+1)$-th request would appear.) Now consider the cost $OPT(I)$. Instead of calculating the optimal off-line cost we define $k$ different off-line algorithms, and we use the mean of the costs resulting from these algorithms. Since the cost of each off-line algorithm is at least as much as the optimal off-line cost, the calculated mean is an upper bound for the optimal off-line cost.
We define the following $k$ off-line algorithms, denoted by $OFF_1, \ldots, OFF_k$. Suppose that the servers are at points $P_1, \ldots, P_{j-1}, P_{j+1}, \ldots, P_{k+1}$ in the starting configuration of $OFF_j$. We can move the servers into this starting configuration using an extra constant cost $C_j$.
The algorithms satisfy the requests as follows. If an algorithm $OFF_j$ has a server at point $Q_i$, then none of the servers moves. Otherwise the request is served by the server located at point $Q_{i-1}$. The algorithms are well-defined: if $Q_i$ does not contain a server, then each of the other points $P_1, P_2, \ldots, P_{k+1}$ contains a server, thus there is a server located at $Q_{i-1}$. Moreover $Q_1 = P_{k+1}$, thus at the beginning each algorithm has a server at the requested point.
We show that the servers of algorithms $OFF_1, \ldots, OFF_k$ are always in different configurations. At the beginning this property is valid because of the definition of the algorithms. Now consider the step where a request is served. Call the algorithms which do not move a server for serving the request stable, and the other algorithms unstable. The server configurations of the stable algorithms remain unchanged, so these configurations remain different from each other. Each unstable algorithm moves a server from point $Q_{i-1}$. This point is the place of the last request, thus the stable algorithms have a server at it. Therefore, an unstable algorithm and a stable algorithm cannot have the same configuration after serving the request. Furthermore, each unstable algorithm moves a server from $Q_{i-1}$ to $Q_i$, thus the server configurations of the unstable algorithms remain different from each other.
So at the arrival of the request at point $Q_i$ the servers of the algorithms are in different configurations. On the other hand, each configuration has a server at point $Q_{i-1}$, therefore there is only one configuration where there is no server located at point $Q_i$. Consequently, the cost of serving $Q_i$ is $d(Q_{i-1}, Q_i)$ for one of the algorithms and $0$ for the other algorithms.
Therefore
$$\sum_{j=1}^{k} OFF_j(I) = C + \sum_{i=2}^{n} d(Q_i, Q_{i-1}),$$
where $C = \sum_{j=1}^{k} C_j$ is an absolute constant which is independent of the input (this is the cost of moving the servers to the starting configuration of the defined algorithms).
On the other hand, the optimal off-line cost cannot be larger than the cost of any of the above defined algorithms, thus $k \cdot OPT(I) \le \sum_{j=1}^{k} OFF_j(I)$. This yields
$$k \cdot OPT(I) \le C + \sum_{i=2}^{n} d(Q_i, Q_{i-1}) \le C + ONL(I),$$
which shows that the weak competitive ratio of ONL cannot be smaller than $k$, since the value $OPT(I)$ can be arbitrarily large as the length of the input increases.
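The adversary used in the proof can be simulated on a small instance. The following Python sketch is only an illustration under our own assumptions: a uniform metric on three points, $k = 2$ servers, and a hypothetical on-line rule that moves the servers in round-robin order. The optimal off-line cost is computed by dynamic programming over server configurations, using only moves that bring a server to the requested point.

```python
def offline_opt(start, requests, d):
    """Optimal off-line cost, by dynamic programming over configurations."""
    best = {tuple(sorted(start)): 0.0}
    for r in requests:
        nxt = {}
        for conf, cost in best.items():
            if r in conf:                       # a server is already at r
                moves = [(conf, cost)]
            else:                               # try moving each server to r
                moves = [(tuple(sorted(conf[:i] + (r,) + conf[i + 1:])),
                          cost + d(s, r)) for i, s in enumerate(conf)]
            for new_conf, new_cost in moves:
                if new_cost < nxt.get(new_conf, float("inf")):
                    nxt[new_conf] = new_cost
        best = nxt
    return min(best.values())

def adversary(points, start, choose_server, d, n):
    """Always request the single point where the on-line algorithm has no server."""
    conf, requests, online_cost = list(start), [], 0.0
    for _ in range(n):
        hole = next(p for p in points if p not in conf)
        i = choose_server(conf, hole)           # index of the server to move
        online_cost += d(conf[i], hole)
        conf[i] = hole
        requests.append(hole)
    return requests, online_cost

def round_robin(k):
    """Hypothetical on-line rule: move the servers in cyclic order."""
    counter = {"next": 0}
    def choose(conf, request):
        i = counter["next"]
        counter["next"] = (i + 1) % k
        return i
    return choose

uniform = lambda x, y: 0.0 if x == y else 1.0
k, n = 2, 12
requests, onl = adversary([0, 1, 2], [0, 1], round_robin(k), uniform, n)
opt = offline_opt([0, 1], requests, uniform)
print(requests, onl, opt)
```

Every request costs the on-line algorithm one unit, while the averaging argument of the proof guarantees that the off-line optimum is at most $n/k + 1$ in this uniform space, so the ratio approaches $k$ as $n$ grows.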
Many interesting results in connection with this problem appeared during the following years. For the general case the first constant-competitive algorithm (-competitive) was developed by Fiat, Rabani and Ravid. Later Koutsoupias and Papadimitriou analysed an algorithm based on the work function technique and proved that it is $(2k-1)$-competitive. They could not determine the exact competitive ratio of the algorithm, but it is a widely believed hypothesis that the algorithm is $k$-competitive. Determining the competitive ratio of the algorithm, or developing a $k$-competitive algorithm, is still among the most important open problems in the area of on-line algorithms. We present the work function algorithm below.
Denote the starting configuration of the on-line servers by $A_0$. Then after the $t$-th request the work function value belonging to multiset $X$ is the minimal cost needed to serve the first $t$ requests starting at configuration $A_0$ and ending at configuration $X$. This value is denoted by $w_t(X)$. The Work-Function algorithm is based on the above defined work function. Suppose that $A_{t-1}$ is the server configuration before the arrival of the $t$-th request, and denote the place of the $t$-th request by $R_t$. The Work-Function algorithm uses server $s$ to serve the request for which the value $w_{t-1}(A_{t-1} \setminus \{P\} \cup \{R_t\}) + d(P, R_t)$ is minimal, where $P$ denotes the point where the server is actually located.
Example 9.2 Consider the metric space containing three points , and with the distances , , . Suppose that we have two servers and the starting configuration is . In this case the starting work function values are , , , , , . Suppose that the first request appears at point . Then and , thus algorithm Work Function
uses the server from point to serve the request.
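On a small finite metric space the work function can be computed by brute force. The Python sketch below is an illustration under our own assumptions: the three-point space with distances 1, 2 and 3 is a hypothetical instance, the starting values are taken as minimum matching costs from the starting configuration, and the update uses the standard recurrence $w_t(X) = \min_{x \in X} \{ w_{t-1}(X \setminus \{x\} \cup \{R_t\}) + d(R_t, x) \}$.

```python
from itertools import combinations_with_replacement, permutations

# Hypothetical three-point metric space (an assumption for illustration).
DIST = {("A", "B"): 1, ("B", "C"): 2, ("A", "C"): 3}

def d(x, y):
    if x == y:
        return 0
    return DIST[(x, y)] if (x, y) in DIST else DIST[(y, x)]

def matching_cost(xs, ys):
    """Cheapest way to move servers from multiset xs to multiset ys."""
    return min(sum(d(x, y) for x, y in zip(xs, perm)) for perm in permutations(ys))

def swap(conf, out, into):
    """Configuration obtained by replacing one copy of `out` with `into`."""
    lst = list(conf)
    lst.remove(out)
    lst.append(into)
    return tuple(sorted(lst))

def initial_work_function(points, start):
    k = len(start)
    return {c: matching_cost(start, c)
            for c in combinations_with_replacement(sorted(points), k)}

def update(w, r):
    """Work function after one more request at point r."""
    return {c: min(w[swap(c, x, r)] + d(r, x) for x in set(c)) for c in w}

def wfa_choice(w_prev, conf, r):
    """Point of the server that the Work-Function rule sends to r."""
    return min(sorted(set(conf)), key=lambda p: w_prev[swap(conf, p, r)] + d(p, r))

w0 = initial_work_function("ABC", ("A", "B"))
print(w0)
print(wfa_choice(w0, ("A", "B"), "C"))
w1 = update(w0, "C")
print(w1[("A", "B")])
```

The brute-force matching and the enumeration of all multisets are exponential in $k$, so this sketch is only meant for checking small examples by hand.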
The following statement is valid for the algorithm.
Theorem 9.3 The Work-Function algorithm is weakly $(2k-1)$-competitive.
Besides the general problem many particular cases have been investigated. If the distance of any pair of points is $1$, then we obtain the on-line paging problem as a special case. Another well investigated metric space is the line. The points of the line are considered as real numbers, and the distance of points $a$ and $b$ is $|a - b|$. In this special case a $k$-competitive algorithm was developed by Chrobak and Larmore, which algorithm is called Double-Coverage. A request at point $P$ is served by server $s$ which is the closest to $P$. Moreover, if there are servers also on the opposite side of $P$, then the closest server among them moves distance $d(s, P)$ into the direction of $P$. Hereafter we denote the Double-Coverage algorithm by DC. The input of the algorithm is the list of requests which is a list of points (real numbers) denoted by $I = P_1, \ldots, P_n$ and the starting configuration of the servers is denoted by $S = (s_1, \ldots, s_k)$ which contains points (real numbers) too. We assume that the servers are indexed so that $s_1 \le s_2 \le \cdots \le s_k$. The algorithm can be defined by the following pseudocode:
DC($I$, $S$)
FOR $j \leftarrow 1$ TO $n$
  DO $i \leftarrow |\{l : s_l \le P_j\}|$
     IF $i = 0$
       THEN the request is served by the first server
            $s_1 \leftarrow P_j$
     ELSE IF $i = k$
       THEN the request is served by the $k$-th server
            $s_k \leftarrow P_j$
     ELSE IF $P_j - s_i \le s_{i+1} - P_j$
       THEN the request is served by the $i$-th server
            $d \leftarrow P_j - s_i$
            $s_{i+1} \leftarrow s_{i+1} - d$
            $s_i \leftarrow P_j$
     ELSE the request is served by the $(i+1)$-th server
            $d \leftarrow s_{i+1} - P_j$
            $s_i \leftarrow s_i + d$
            $s_{i+1} \leftarrow P_j$
Example 9.3 Suppose that there are three servers located at points . If the next request appears at point , then DC
uses the closest server to serve the request. The locations of the other servers remain unchanged, the cost of serving the request is and the servers are at points . If the next request appears at point , then DC
uses the closest server to serve the request, but there are servers on the opposite side of the request, thus also travels distance into the direction of . Therefore the cost of serving the request is and the servers will be at points .
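The Double-Coverage rule described above can be sketched in a few lines of Python. The function name and the sample positions are hypothetical. The sketch keeps the server positions sorted and, when the request lies between two servers, moves both neighbours by the distance of the closer one.

```python
import bisect

def double_coverage(servers, requests):
    """Run the Double-Coverage rule on the real line.

    Returns the final (sorted) server positions and the total distance
    travelled by all servers.
    """
    s = sorted(servers)
    total = 0.0
    for r in requests:
        i = bisect.bisect_right(s, r)     # number of servers at or left of r
        if i == 0:                        # all servers are to the right of r
            total += s[0] - r
            s[0] = r
        elif i == len(s):                 # all servers are at or left of r
            total += r - s[-1]
            s[-1] = r
        else:                             # r lies between s[i-1] and s[i]
            step = min(r - s[i - 1], s[i] - r)
            total += 2 * step             # both neighbours move `step`
            s[i - 1] += step
            s[i] -= step
    return s, total

# Hypothetical instance: three servers, two requests.
final_positions, cost = double_coverage([0.0, 2.0, 6.0], [7.0, 3.0])
print(final_positions, cost)
```

Since both neighbours move the same distance towards the request, the order of the servers never changes, and the closer one ends exactly at the request.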
The following statement, which can be proved by the potential function technique, is valid for algorithm DC. This technique is often used in the analysis of on-line algorithms.
Theorem 9.4 Algorithm DC is weakly $k$-competitive on the line.
Proof. Consider an arbitrary sequence of requests and denote this input by $I$. During the analysis of the procedure we suppose that one off-line optimal algorithm and DC are running in parallel on the input. We also suppose that each request is served first by the off-line algorithm and then by the on-line algorithm. The servers of the on-line algorithm and also the positions of the servers (which are real numbers) are denoted by $s_1, \ldots, s_k$, and the servers of the optimal off-line algorithm and also the positions of the servers are denoted by $x_1, \ldots, x_k$. We suppose that for the positions $s_1 \le s_2 \le \cdots \le s_k$ and $x_1 \le x_2 \le \cdots \le x_k$ are always valid; this can be achieved by swapping the notations of the servers.
We prove the theorem by the potential function technique. The potential function assigns a value to the actual positions of the servers, so the on-line and off-line costs are compared using the changes of the potential function. Let us define the following potential function:
$$\Phi = k \sum_{i=1}^{k} |x_i - s_i| + \sum_{i<j} (s_j - s_i).$$
The following statements are valid for the potential function.
While OPT is serving a request the increase of the potential function is not more than $k$ times the distance travelled by the servers of OPT.
While DC is serving a request, the decrease of $\Phi$ is at least as much as the cost of serving the request.
If the above properties are valid, then one can prove the theorem easily. In this case $\Phi_f - \Phi_0 \le k \cdot OPT(I) - DC(I)$, where $\Phi_f$ and $\Phi_0$ are the final and the starting values of the potential function. Furthermore, $\Phi$ is nonnegative, so we obtain that $DC(I) \le k \cdot OPT(I) + \Phi_0$, which yields that the algorithm is weakly $k$-competitive ($\Phi_0$ does not depend on the input sequence, only on the starting position of the servers).
Now we prove the properties of the potential function.
First consider the case when one of the off-line servers travels distance $d$. The first part of the potential function increases at most by $kd$. The second part does not change, thus we proved the first property of the potential function.
Consider the servers of DC. Suppose that the request appears at point $P$. Since the request is first served by OPT, $x_j = P$ for some $j$. The following two cases are distinguished depending on the positions of the on-line servers.
First suppose that the on-line servers are on the same side of $P$. We can assume that the positions of the servers are not smaller than $P$, since the other case is completely similar. In this case $s_1$ is the closest server and DC sends $s_1$ to $P$ and the other on-line servers do not move. Therefore the cost of DC is $d(s_1, P)$. In the first sum of the potential function only $|x_1 - s_1|$ changes; it decreases by $d(s_1, P)$, thus the first part decreases by $k\,d(s_1, P)$. The second sum is increasing; the increase is $(k-1)\,d(s_1, P)$, thus the value of $\Phi$ decreases by $d(s_1, P)$.
Assume that there are servers on both sides of $P$; suppose that the closest servers are $s_i$ and $s_{i+1}$. We assume that $s_i$ is closer to $P$, the other case is completely similar. In this case the cost of DC is $2d(s_i, P)$. Consider the first sum of the potential function. The $i$-th and the $(i+1)$-th part are changing. Since $x_j = P$ for some $j$, one of the $i$-th and the $(i+1)$-th parts decreases by $d(s_i, P)$ and the increase of the other one is at most $d(s_i, P)$, thus the first sum does not increase. The change of the second sum of $\Phi$ is
$$d(s_i, P)\bigl((i-1) - (k-i) + (k-i-1) - i\bigr) = -2d(s_i, P).$$
Thus we proved that the second property of the potential function is also valid in this case.
Exercises
9.2-1 Suppose that is a metric space. Prove that is also a metric space where .
9.2-2 Consider the greedy algorithm which serves each request by the server which is closest to the place of the request. Prove that the algorithm is not constant competitive for the line.
9.2-3 Prove that for arbitrary -element multisets and and for arbitrary the inequality is valid, where is the cost of the minimal matching of and , (the minimal cost needed to move the servers from configuration to configuration ).
9.2-4 Consider the line as a metric space. Suppose that the servers of the on-line algorithm are at points , and the servers of the off-line algorithm are at points . Calculate the value of the potential function used in the proof of Theorem 9.4. How does this potential function change, if the on-line server moves from point to point ?
The theory of computer networks has become one of the most significant areas of computer science. In the planning of computer networks many optimisation problems arise and most of these problems are actually on-line, since neither the traffic nor the changes in the topology of a computer network can be precisely predicted. Recently some researchers working in the area of on-line algorithms have defined some on-line mathematical models for problems related to computer networks. In this section we consider this area; we present three problems and show the basic results. First the data acknowledgement problem is considered, then we present the web caching problem, and the section is closed by the on-line routing problem.
In the communication of a computer network the information is sent in packets. If the communication channel is not completely safe, then the arrival of the packets is acknowledged. The data acknowledgement problem is to determine the time of sending acknowledgements. An acknowledgement can acknowledge many packets, but waiting for a long time can cause the resending of the packets, which results in congestion of the network. On the other hand, sending an acknowledgement immediately after the arrival of each packet would again cause congestion of the network. The first optimisation model for determining the sending times of the acknowledgements was developed by Dooly, Goldman and Scott in 1998. We present the developed model and some of the basic results.
In the mathematical model of the data acknowledgement problem the input is the list of the arrival times $a_1, \ldots, a_n$ of the packets. The decision maker has to determine when to send acknowledgements; these times are denoted by $t_1, \ldots, t_k$. In the optimisation model the cost function is:
$$k + \sum_{j=1}^{k} \nu_j,$$
where $k$ is the number of the sent acknowledgements and $\nu_j = \sum_{t_{j-1} < a_i \le t_j} (t_j - a_i)$ is the total latency collected by the $j$-th acknowledgement. We consider the on-line problem which means that at time $t$ the decision maker only knows the arrival times of the packets already arrived and has no information about the further packets. We denote the set of the unacknowledged packets at the arrival time $a_i$ by $\sigma_i$.
For the solution of the problem the class of the alarming algorithms has been developed. An alarming algorithm works as follows. At the arrival time $a_j$ an alarm is set for time $a_j + e_j$. If no packet arrives before time $a_j + e_j$, then an acknowledgement is sent at time $a_j + e_j$ which acknowledges all of the unacknowledged packets. Otherwise at the arrival of the next packet at time $a_{j+1}$ the alarm is reset for time $a_{j+1} + e_{j+1}$. Below we analyse an algorithm from this class in detail. This algorithm sets the alarm to collect total latency $1$ by the acknowledgement. The algorithm is called Alarm. We obtain the above defined rule from the general definition using the solution of the following equation as value $e_j$:
$$1 = |\sigma_j| e_j + \sum_{a_i \in \sigma_j} (a_j - a_i).$$
Example 9.4 Consider the following example. The first packet arrives at time (), so Alarm
sets an alarm with value for time . Suppose that the next arrival time is . This arrival is before the alarm time, thus the first packet has not been acknowledged yet and we reset the alarm with value for time . Suppose that the next arrival time is . This arrival is before the alarm time, thus the first two packets have not been acknowledged yet and we reset the alarm with value for time . Suppose that the next arrival time is . No packet arrived before the alarm time , thus at that time the first three packets were acknowledged and the alarm is set for the new packet with value for time .
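The alarm rule can be simulated directly. The sketch below assumes that Alarm sets the alarm so that the acknowledgement collects total latency exactly 1, which gives the alarm time $(1 + \sum_{a_i \in \sigma} a_i)/|\sigma|$ for the set $\sigma$ of unacknowledged packets. The function names and the arrival times are hypothetical, and a packet arriving exactly at the alarm time is treated as arriving after the acknowledgement (our own tie-breaking choice). Exact fractions are used to avoid rounding issues.

```python
from fractions import Fraction

def alarm(arrivals):
    """Acknowledgement times of the Alarm rule for sorted arrival times."""
    ack_times, pending, deadline = [], [], None
    for a in arrivals:
        if pending and deadline <= a:      # the alarm went off before this packet
            ack_times.append(deadline)
            pending = []
        pending.append(a)
        # Solve sum(t - a_i) = 1 over the pending packets for the alarm time t.
        deadline = (1 + sum(pending)) / len(pending)
    if pending:                            # the last alarm also goes off
        ack_times.append(deadline)
    return ack_times

def total_cost(arrivals, ack_times):
    """Number of acknowledgements plus the total latency of all packets."""
    latency = sum(min(t for t in ack_times if t >= a) - a for a in arrivals)
    return len(ack_times) + latency

# Hypothetical arrival times.
arrivals = [Fraction(0), Fraction(1, 2), Fraction(5, 8), Fraction(1)]
acks = alarm(arrivals)
print(acks, total_cost(arrivals, acks))
```

Each acknowledgement collects latency exactly 1, so the total cost is always twice the number of acknowledgements sent.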
Theorem 9.5 Algorithm Alarm is 2-competitive.
Proof. Suppose that algorithm Alarm sends $k$ acknowledgements. These acknowledgements divide the time into $k$ time intervals. The cost of the algorithm is $2k$, since $k$ is the cost of the acknowledgements, and the alarm is set to have total latency $1$ for each acknowledgement.
Suppose that the optimal off-line algorithm sends $k^*$ acknowledgements. If $k^* \ge k$, then $OPT(I) \ge k = ALARM(I)/2$ is obviously valid, thus we obtain that the algorithm is $2$-competitive. If $k^* < k$, then at least $k - k^*$ time intervals among the ones defined by the acknowledgements of algorithm Alarm do not contain any of the off-line acknowledgements. In each of these intervals the packets wait at least until the end of the interval, so the off-line latency collected there is at least $1$. This yields that the off-line total latency is at least $k - k^*$, thus we obtain that $OPT(I) \ge k^* + (k - k^*) = k$, which proves that Alarm is $2$-competitive.
As the following theorem shows, algorithm Alarm has the smallest possible competitive ratio.
Theorem 9.6 There is no on-line algorithm for the data acknowledgement problem which has smaller competitive ratio than $2$.
Proof. Consider an arbitrary on-line algorithm and denote it by ONL. Analyse the following input. Consider a long sequence of packets where the packets always arrive immediately after the time when ONL sends an acknowledgement. The on-line cost of a sequence containing $2n$ packets is $ONL(I_{2n}) = 2n + t_{2n}$, since the cost resulting from the acknowledgements is $2n$, and the latency of the $i$-th acknowledgement is $t_i - t_{i-1}$, where the value $t_0 = 0$ is used.
Consider the following two off-line algorithms. ODD sends the acknowledgements after the odd numbered packets and EVEN sends the acknowledgements after the even numbered packets.
The costs achieved by these algorithms are
$$EVEN(I_{2n}) = n + \sum_{i=0}^{n-1} (t_{2i+1} - t_{2i})$$
and
$$ODD(I_{2n}) = n + 1 + \sum_{i=1}^{n} (t_{2i} - t_{2i-1}).$$
Therefore $EVEN(I_{2n}) + ODD(I_{2n}) = ONL(I_{2n}) + 1$. On the other hand, none of the costs achieved by ODD and EVEN is smaller than the optimal off-line cost, thus $OPT(I_{2n}) \le \min\{EVEN(I_{2n}), ODD(I_{2n})\}$, which yields that $ONL(I_{2n})/OPT(I_{2n}) \ge 2 - 1/OPT(I_{2n})$. From this inequality it follows that the competitive ratio of ONL is not smaller than 2, because using a sufficiently long sequence of packets the value $OPT(I_{2n})$ can be arbitrarily large.
The file caching problem is a generalisation of the caching problem presented in the chapter on memory management. World-wide-web browsers use caches to store some files. This makes it possible to use the stored files if a user wants to see some web-page many times during a short time interval. If the cache becomes full, then some files must be eliminated to make space for the new file. The file caching problem models this scenario; the goal is to find good strategies for determining which files should be eliminated. It differs from the standard paging problem in the fact that the files have size and retrieval cost (the problem is reduced to paging if each size and each retrieval cost is $1$). So the following mathematical model describes the problem.
There is a given cache of size $k$ and the input is a sequence of pages. Each page $g$ has a size denoted by $s(g)$ and a retrieval cost denoted by $c(g)$. The pages arrive from a list one by one and after the arrival of a page the algorithm has to place it into the cache. If the page is not contained in the cache and there is not enough space to put it into the cache, then the algorithm has to delete some pages from the cache to make enough space for the requested page. If the required page is in the cache, then the cost of serving the request is $0$, otherwise the cost is $c(g)$. The aim is to minimise the total cost. The problem is on-line which means that for the decisions (which pages should be deleted from the cache) only the earlier pages and decisions can be used; the algorithm has no information about the further pages. We assume that the size of the cache and also the sizes of the pages are positive integers.
For the solution of the problem and for its special cases many algorithms have been developed. Here we present algorithm Landlord which was developed by Young.
The algorithm stores a credit value $0 \le cr(f) \le c(f)$ for each page $f$ which is contained in the current cache. In the rest of the section the set of the pages in the current cache of Landlord is denoted by $LA$. If Landlord has to retrieve a page $g$ then the following steps are performed.
Landlord($LA$, $g$)
IF $g$ is not contained in $LA$
  THEN WHILE there is not enough space for $g$
         DO $\Delta \leftarrow \min_{f \in LA} cr(f)/s(f)$
            for each $f \in LA$ let $cr(f) \leftarrow cr(f) - \Delta \cdot s(f)$
            evict some pages with $cr(f) = 0$
       place $g$ into cache $LA$ and let $cr(g) \leftarrow c(g)$
  ELSE reset $cr(g)$ to any value between $cr(g)$ and $c(g)$
Example 9.5 Suppose that and contains the following three pages: with , with and with . Suppose that the next requested page is , with parameters and . Therefore, there is not enough space for it in the cache, so some pages must be evicted. Landlord
determines the value and changes the credits as follows: and , thus is evicted from cache . There is still not enough space for in the cache. The new value is and the new credits are: , thus is evicted from the cache. Then there is enough space for , thus it is placed into cache with the credit value .
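One request of Landlord can be sketched in Python as follows. The sketch rests on several assumptions of our own: the cache is a dictionary mapping a page name to a `[size, credit]` pair, every page whose credit reaches zero is evicted (the algorithm only requires evicting some of them), a hit resets the credit to the full retrieval cost, and the requested page is never larger than the cache. The page names and numbers are hypothetical.

```python
from fractions import Fraction

def landlord_request(cache, capacity, page, size, cost):
    """Serve one request and return the retrieval cost paid.

    cache maps a page name to [size, credit] and is modified in place.
    Assumes size <= capacity.
    """
    if page in cache:
        cache[page][1] = cost            # hit: reset credit (upper end of the range)
        return 0
    while sum(s for s, _ in cache.values()) + size > capacity:
        # Charge every page "rent" proportional to its size.
        delta = min(cr / s for s, cr in cache.values())
        evicted = [name for name, (s, cr) in cache.items() if cr / s == delta]
        for entry in cache.values():
            entry[1] -= delta * entry[0]
        for name in evicted:             # these pages now have credit 0
            del cache[name]
    cache[page] = [size, cost]           # miss: load the page with full credit
    return cost

# Hypothetical cache of size 10 holding three pages.
cache = {"g1": [2, Fraction(1)], "g2": [4, Fraction(3)], "g3": [3, Fraction(3)]}
paid = landlord_request(cache, 10, "g4", 4, Fraction(4))
print(paid, cache)
```

Resetting the credit to the full cost on a hit makes the rule behave like a cost-aware version of LRU; choosing the lower end of the allowed range would make it behave more like FIFO.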
Landlord is weakly $k$-competitive, but a stronger statement is also true. For the web caching problem an on-line algorithm ALG is called $(C, k, h)$-competitive, if there exists such a constant $B$ that $ALG_k(I) \le C \cdot OPT_h(I) + B$ is valid for each input, where $ALG_k(I)$ is the cost of ALG using a cache of size $k$ and $OPT_h(I)$ is the optimal off-line cost using a cache of size $h$. The following statement holds for algorithm Landlord.
Theorem 9.7 If $h \le k$, then algorithm Landlord is $(k/(k-h+1), k, h)$-competitive.
Proof. Consider an arbitrary input sequence of pages and denote the input by $I$. We use the potential function technique. During the analysis of the procedure we suppose that an off-line optimal algorithm with cache size $h$ and Landlord with cache size $k$ are running in parallel on the input. We also suppose that each page is placed first into the off-line cache by the off-line algorithm and then it is placed into $LA$ by the on-line algorithm. We denote the set of the pages contained in the actual cache of the optimal off-line algorithm by $OPT$. Consider the following potential function
$$\Phi = (h-1) \sum_{f \in LA} cr(f) + k \sum_{f \in OPT} \bigl(c(f) - cr(f)\bigr),$$
where $cr(f) = 0$ is used for the pages not contained in $LA$.
The changes of the potential function during the different steps are as follows.
OPT places $g$ into its cache.
In this case OPT has cost $c(g)$. In the potential function only the second part may change. On the other hand, $cr(g) \ge 0$, thus the increase of the potential function is at most $k \cdot c(g)$.
Landlord decreases the credit value for each $f \in LA$.
In this case for each $f \in LA$ the decrease of $cr(f)$ is $\Delta \cdot s(f)$, thus the decrease of $\Phi$ is
$$\Delta \bigl((h-1)\, s(LA) - k\, s(OPT \cap LA)\bigr),$$
where $s(LA)$ and $s(OPT \cap LA)$ denote the total size of the pages contained in sets $LA$ and $OPT \cap LA$, respectively. At the time when this step is performed, OPT has already placed page $g$ into its cache $OPT$, but the page is not contained in cache $LA$. Therefore $s(OPT \cap LA) \le h - s(g)$. On the other hand, this step is performed if there is not enough space for the page in $LA$, thus $s(LA) > k - s(g)$, which yields $s(LA) \ge k - s(g) + 1$, because the sizes are positive integers. Therefore we obtain that the decrease of $\Phi$ is at least
$$\Delta \bigl((h-1)(k - s(g) + 1) - k(h - s(g))\bigr).$$
Since $s(g) \ge 1$ and $k \ge h$, this decrease is at least $\Delta \bigl((h-1)(k-1+1) - k(h-1)\bigr) = 0$.
Landlord evicts a page $f$ from cache $LA$.
Since Landlord only evicts pages having credit $0$, during this step $\Phi$ remains unchanged.
Landlord places page $g$ into cache $LA$ and sets the value $cr(g) = c(g)$.
The cost of Landlord is $c(g)$. On the other hand, $g$ was not contained in cache $LA$ before the performance of this step, thus $cr(g) = 0$ was valid. Furthermore, first OPT places the page into its cache, thus $g \in OPT$ is also valid. Therefore the decrease of $\Phi$ is $-(h-1)c(g) + k\,c(g) = (k-h+1)\,c(g)$.
Landlord resets the credit of a page $g \in LA$ to a value between $cr(g)$ and $c(g)$.
In this case $g \in OPT$ is valid, since OPT places page $g$ into its cache first. Value $cr(g)$ is not decreased and $k > h - 1$, thus $\Phi$ cannot increase during this step.
Hence the potential function has the following properties.
If OPT places a page into its cache, then the increase of the potential function is at most $k$ times the cost of OPT.
If Landlord places a page into its cache, then the decrease of $\Phi$ is $(k-h+1)$ times the cost of Landlord.
During the other steps $\Phi$ does not increase.
By the above properties we obtain that $\Phi_f - \Phi_0 \le k \cdot OPT_h(I) - (k-h+1) \cdot LANDLORD_k(I)$, where $\Phi_0$ and $\Phi_f$ are the starting and final values of the potential function. The potential function is nonnegative, thus we obtain that $(k-h+1) \cdot LANDLORD_k(I) \le k \cdot OPT_h(I) + \Phi_0$, which proves that Landlord is $(k/(k-h+1), k, h)$-competitive.
In computer networks the congestion of the communication channels decreases the speed of the communication and may cause loss of information. Thus congestion control is one of the most important problems in the area of computer networks. A related important problem is the routing of the communication, where we have to determine the path of the messages in the network. Since we have no information about the future traffic of the network, routing is an on-line problem. Here we present two on-line optimisation models for the routing problem.
The mathematical model
The network is given by a graph, each edge $e$ has a maximal available bandwidth denoted by $u(e)$, and the number of edges is denoted by $m$. The input is a sequence of requests, where the $j$-th request is given by a vector $(s_j, t_j, r_j, d_j, b_j)$, which means that to satisfy the request bandwidth $r_j$ must be reserved on a path from $s_j$ to $t_j$ for time duration $d_j$, and the benefit of serving the request is $b_j$. Hereafter, we assume that $d_j = \infty$, and we omit the value of $d_j$ from the requests. The problem is on-line, which means that after the arrival of a request the algorithm has to make its decision without any information about the further requests. We consider the following two models.
Load balancing model: In this model all requests must be satisfied. Our aim is to minimise the maximum of the overload of the edges. The overload is the ratio of the total bandwidth assigned to the edge and the available bandwidth. Since each request is served, the benefit plays no role in this model.
Throughput model: In this model the decision maker is allowed to reject some requests. The sum of the bandwidths reserved on an edge cannot be more than the available bandwidth. The goal is to maximise the sum of the benefits of the accepted requests. We investigate this model in detail. It is important to note that this is a maximisation problem, thus the notion of competitiveness is used in the form defined for maximisation problems.
Below we define the exponential algorithm. We need the following notations to define and analyse the algorithm. Let $P_i$ denote the path which is assigned to the accepted request $i$. Let $A$ denote the set of requests accepted by the on-line algorithm. In this case $l_e(j) = \sum_{i \in A,\, i < j,\, e \in P_i} r_i / u(e)$ is the ratio of the total reserved bandwidth and the available bandwidth on $e$ before the arrival of request $j$.
The basic idea of the exponential algorithm is the following. The algorithm assigns a cost to each edge $e$, which is exponential in $l_e(j)$, and chooses the path which has the minimal cost. Below we define and analyse the exponential algorithm for the throughput model. Let $\mu$ be a constant which depends on the parameters of the problem; its value will be given later. Let $c_e(j) = \mu^{l_e(j)}$ for each request $j$ and edge $e$. The exponential algorithm performs the following steps after the arrival of a request $(s_j, t_j, r_j, b_j)$.
EXP($s_j, t_j, r_j, b_j$)
let $U_j$ be the set of the paths from $s_j$ to $t_j$
$P_j \leftarrow \arg\min_{P \in U_j} C(P)$, where $C(P) = \sum_{e \in P} \frac{r_j}{u(e)}\, c_e(j)$
IF $C(P_j) \le 2 m b_j$
   THEN reserve bandwidth $r_j$ on path $P_j$
   ELSE reject the request
Note. If we modify this algorithm to accept each request, then we obtain an exponential algorithm for the load balancing model.
Example 9.6 Consider the network which contains vertices and edges , where the available bandwidths of the edges are , , , . Suppose that and that the reserved bandwidths are: on path , on path , on path , on path . The next request is to reserve bandwidth on some path between and . Therefore values are: , , , . There are two paths between and and the costs are
The minimal cost belongs to path . Therefore, if , then the request is accepted and the bandwidth is reserved on path . Otherwise the request is rejected.
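The admission step of the exponential algorithm can be sketched in Python as follows. This is a minimal illustration, not the chapter's own code. It assumes that the cost of a path is the sum of (requested bandwidth / capacity) times `mu` raised to the current relative load of each edge, and it finds the cheapest path with Dijkstra's algorithm. The names `build_graph`, `exp_route` and `threshold_factor` are hypothetical, and `threshold_factor` is a stand-in for the constant that multiplies the benefit in the acceptance test of EXP.

```python
import heapq
from math import inf

def build_graph(edges):
    """edges: iterable of (u, v, edge_id, capacity) for an undirected network."""
    graph = {}
    for a, b, eid, cap in edges:
        graph.setdefault(a, []).append((b, eid, cap))
        graph.setdefault(b, []).append((a, eid, cap))
    return graph

def exp_route(graph, load, s, t, r, b, mu, threshold_factor):
    """Try to route a request of bandwidth r and benefit b from s to t.

    load maps edge_id -> relative load (reserved / capacity) and is updated
    in place on acceptance. Returns the list of edge ids of the chosen path,
    or None if the request is rejected.
    """
    dist = {s: 0.0}
    prev = {}
    heap = [(0.0, s)]
    done = set()
    while heap:
        d, v = heapq.heappop(heap)
        if v in done:
            continue
        done.add(v)
        if v == t:
            break
        for w, eid, cap in graph.get(v, []):
            # exponential edge cost, scaled by the relative size of the request
            nd = d + (r / cap) * mu ** load[eid]
            if nd < dist.get(w, inf):
                dist[w] = nd
                prev[w] = (v, eid, cap)
                heapq.heappush(heap, (nd, w))
    # reject if t is unreachable or the cheapest path is too expensive
    if t not in dist or dist[t] > threshold_factor * b:
        return None
    path = []
    v = t
    while v != s:
        v_prev, eid, cap = prev[v]
        path.append(eid)
        load[eid] += r / cap  # reserve the bandwidth on this edge
        v = v_prev
    path.reverse()
    return path
```

Dropping the rejection branch gives the variant that accepts every request, which corresponds to the load balancing model.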
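To analyse the algorithm consider an arbitrary input sequence $I$, and let $n$ denote the number of requests in $I$. Let $A$ denote the set of the requests accepted by EXP, and $Q$ the set of the requests which are accepted by OPT and rejected by EXP. Furthermore let $P_j^*$ denote the path reserved by OPT for each request $j$ accepted by OPT. Define the value $l_e(n+1) = \sum_{i \in A,\, e \in P_i} r_i / u(e)$ for each edge $e$; this value gives the ratio of the reserved bandwidth and the available bandwidth for $e$ at the end of the on-line algorithm. Furthermore, let $c_e(n+1) = \mu^{l_e(n+1)}$ for each $e$.
Let $\mu = 4mPB$, where $B$ is an upper bound on the benefits and for each request $j$ and each edge $e$ the inequality
$$\frac{1}{P} \le \frac{r_j}{u(e)} \le \frac{1}{\lg \mu}$$
is valid. In this case the following statements hold.
Lemma 9.8 The solution given by algorithm EXP is feasible, i.e. the sum of the reserved bandwidths is not more than the available bandwidth for each edge.
Proof. We prove the statement by contradiction. Suppose that there is an edge $e$ where the available bandwidth is violated. Let $j$ be the first accepted request which violates the available bandwidth on $e$.
The inequality $r_j / u(e) \le 1 / \lg \mu$ is valid for $j$ and $e$ (it is valid for all edges and requests). Furthermore, after the acceptance of request $j$ the sum of the bandwidths is greater than the available bandwidth on edge $e$, thus we obtain that $l_e(j) > 1 - r_j / u(e) \ge 1 - 1 / \lg \mu$. On the other hand, this yields that the inequality
$$C(P_j) = \sum_{e' \in P_j} \frac{r_j}{u(e')}\, c_{e'}(j) \ge \frac{r_j}{u(e)}\, c_e(j) > \frac{r_j}{u(e)}\, \mu^{1 - 1/\lg \mu}$$
holds for the value $C(P_j)$ used in algorithm EXP. Using the assumption on $P$ we obtain that $r_j / u(e) \ge 1/P$, and $\mu^{1 - 1/\lg \mu} = \mu / 2$, thus from the above inequality we obtain that
$$C(P_j) > \frac{1}{P} \cdot \frac{\mu}{2} = 2mB.$$
This is a contradiction, since $b_j \le B$ and therefore EXP would reject the request. This proves the statement of the lemma.
Lemma 9.9 For the solution given by OPT the following inequality holds:
$$\sum_{j \in Q} b_j \le \frac{1}{2m} \sum_{e \in E} c_e(n+1),$$
where $E$ denotes the set of edges.
Proof. Since EXP rejected each $j \in Q$, the inequality $b_j < \frac{1}{2m} \sum_{e \in P} \frac{r_j}{u(e)}\, c_e(j)$ holds for each $j \in Q$, and this inequality is valid for all paths $P$ between $s_j$ and $t_j$, in particular for $P_j^*$. Therefore
$$\sum_{j \in Q} b_j \le \frac{1}{2m} \sum_{j \in Q} \sum_{e \in P_j^*} \frac{r_j}{u(e)}\, c_e(j).$$
On the other hand, $c_e(j) \le c_e(n+1)$ holds for each $e$, thus we obtain that
$$\sum_{j \in Q} b_j \le \frac{1}{2m} \sum_{e \in E} c_e(n+1) \Bigl( \sum_{j \in Q:\, e \in P_j^*} \frac{r_j}{u(e)} \Bigr).$$
The sum of the bandwidths reserved by OPT is at most the available bandwidth for each $e$, thus $\sum_{j \in Q:\, e \in P_j^*} r_j / u(e) \le 1$.
Consequently
$$\sum_{j \in Q} b_j \le \frac{1}{2m} \sum_{e \in E} c_e(n+1),$$
which is the inequality we wanted to prove.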
Lemma 9.10 For the solution given by algorithm EXP the following inequality holds:
Proof. It is enough to show that the inequality is valid for each request . On the other hand,
Since , if , and because of the assumptions , we obtain that
Summarising the bounds given above we obtain that
Since EXP accepts the requests with the property , the above inequality proves the required statement.
With the help of the above lemmas we can prove the following theorem.
Theorem 9.11 Algorithm EXP is -competitive, if , where is an upper bound on the benefits, and for all edges and requests
Proof. From Lemma 9.8 it follows that the algorithm results in a feasible solution where the available bandwidths are not violated. Using the notations defined above we obtain that the benefit of algorithm EXP on the input is , and the benefit of OPT is at most . Therefore by Lemma 9.9 and Lemma 9.10 it follows that
which proves the theorem.
Exercises
9.3-1 Consider the modified version of the data acknowledgement problem with the objective function , where is the number of acknowledgements and is the maximal latency of the -th acknowledgement. Prove that algorithm Alarm is also 2-competitive in this modified model.
9.3-2 Represent the special case of the web caching problem, where for each page as a special case of the -server problem. Define the metric space which can be used.
9.3-3 In the web caching problem cache of size contains three pages with the following sizes and credits: . We want to retrieve a page of size and retrieval cost . The optimal off-line algorithm OPT with cache of size has already placed the page into its cache, so its cache contains the pages and . Which pages are evicted by Landlord to place ? In what way does the potential function defined in the proof of Theorem 9.7 change?
9.3-4 Prove that if in the throughput model no bounds are given for the ratios , then there is no constant-competitive on-line algorithm.
In this section we consider the on-line bin packing problem and its multidimensional generalisations. First we present some fundamental results of the area. Then we define the multidimensional generalisations and present some details from the area of on-line strip packing.
In the bin packing problem the input is a list of items, where the $i$-th item is given by its size $a_i \in (0, 1]$. The goal is to pack the items into unit size bins and minimise the number of the bins used. More formally, we have to divide the items into groups where each group has the property that the total size of its items is at most $1$, and the goal is to minimise the number of groups. This problem also appears in the area of memory management.
In this section we investigate the on-line problem, which means that the decision maker has to decide about the packing of the $i$-th item based on the values $a_1, \dots, a_i$ without any information about the further items.
Algorithm Next-Fit, bounded space algorithms
First we consider the model where the number of the open bins is limited. The $k$-bounded space model means that if the number of open bins reaches bound $k$, then the algorithm can open a new bin only after closing some of the bins, and the closed bins cannot be used for packing further items into them. If only one bin can be open, then the evident algorithm packs the item into the open bin if it fits, otherwise it closes the bin, opens a new one and puts the item into it. This algorithm is called the Next-Fit (NF) algorithm. We do not present the pseudocode of the algorithm, since it can be found in this book in the chapter about memory management. The asymptotic competitive ratio of algorithm NF is determined by the following theorem.
Theorem 9.12 The asymptotic competitive ratio of NF is 2.
Proof. Consider an arbitrary sequence of items, denote it by $\sigma$. Let $OPT(\sigma)$ denote the number of bins used by OPT and $m = NF(\sigma)$ the number of bins used by NF. Furthermore, let $S_i$, $i = 1, \dots, m$, denote the total size of the items packed into the $i$-th bin by NF.
Then $S_i + S_{i+1} > 1$, since in the opposite case the first item of the $(i+1)$-th bin fits into the $i$-th bin, which contradicts the definition of the algorithm. Therefore the total size of the items is more than $\lfloor m/2 \rfloor$.
On the other hand the optimal off-line algorithm cannot put items with total size more than $1$ into the same bin, thus we obtain that $OPT(\sigma) > \lfloor m/2 \rfloor$. This yields that $m \le 2 \cdot OPT(\sigma) - 1$, thus
$$\frac{NF(\sigma)}{OPT(\sigma)} \le 2 - \frac{1}{OPT(\sigma)}.$$
Consequently, we proved that the algorithm is asymptotically $2$-competitive.
Now we prove that the bound is tight. Consider the following sequence for each $n$, denoted by $\sigma_n$. The sequence contains $4n - 2$ items, the size of the $(2i-1)$-th item is $1/2$, the size of the $2i$-th item is $1/(4n-2)$, $i = 1, \dots, 2n-1$. Algorithm NF puts the $(2i-1)$-th and the $2i$-th items into the $i$-th bin for each bin, thus $NF(\sigma_n) = 2n - 1$. The optimal off-line algorithm puts pairs of size $1/2$ items into the first $n - 1$ bins and it puts one size $1/2$ item and the small items into the $n$-th bin, thus $OPT(\sigma_n) = n$. Since $NF(\sigma_n)/OPT(\sigma_n) = 2 - 1/n$ and this function tends to $2$ as $n$ tends to $\infty$, we proved that the asymptotic competitive ratio of the algorithm is at least $2$.
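The behaviour of Next-Fit on such a worst-case family can be checked with a short Python sketch. The function names are hypothetical. The sequence generator assumes a family that alternates items of size 1/2 with small items of size 1/(4n-2), which is one standard way to push Next-Fit toward ratio 2. Exact rational arithmetic is used so that the test "fits into the bin" is not affected by rounding.

```python
from fractions import Fraction

def next_fit(sizes):
    """Return the number of unit-capacity bins used by Next-Fit."""
    bins = 0
    level = None  # level of the single open bin, None before the first item
    for a in sizes:
        if level is None or level + a > 1:
            # the item does not fit: close the open bin and open a new one
            bins += 1
            level = a
        else:
            level += a
    return bins

def tight_sequence(n):
    """Assumed worst-case family: 2n-1 pairs (1/2, 1/(4n-2))."""
    small = Fraction(1, 4 * n - 2)
    seq = []
    for _ in range(2 * n - 1):
        seq.extend([Fraction(1, 2), small])
    return seq
```

For this family Next-Fit opens one bin per pair, while an off-line packing can combine the items of size 1/2 in pairs and collect all small items next to the remaining one.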
If $k > 1$, then there are better algorithms than NF for the $k$-bounded space model. The best known bounded space on-line algorithms belong to the family of harmonic algorithms, where the basic idea is that the interval $(0, 1]$ is divided into subintervals and each item has a type, which is the subinterval containing its size. Items of different types are packed into different bins. The algorithm runs several NF algorithms simultaneously, one for the items of each type.
Algorithm First-Fit and the weight function technique
In this section we present the weight function technique, which is often used in the analysis of bin packing algorithms. We illustrate this method by analysing algorithm First-Fit (FF).
Algorithm FF can be used when the number of open bins is not bounded. The algorithm puts the item into the first opened bin where it fits. If the item does not fit into any of the bins, then a new bin is opened and the algorithm puts the item into it. The pseudocode of the algorithm is also presented in the chapter on memory management. The asymptotic competitive ratio of the algorithm is bounded above by the following theorem.
Theorem 9.13 FF is asymptotically 1.7-competitive.
Proof. In the proof we use the weight function technique, whose idea is that a weight is assigned to each item to measure in some way how difficult it can be to pack that item. The weight function and the total size of the items are used to bound the off-line and on-line objective function values. We use the following weight function
Let for any set of items. The properties of the weight function are summarised in the following two lemmas. Both lemmas can be proven by case analysis based on the sizes of the possible items. The proofs are long and contain many technical details, therefore we omit them.
Lemma 9.14 If is valid for a set of items, then also holds.
Lemma 9.15 For an arbitrary list of items .
Using these lemmas we can prove that the algorithm is asymptotically 1.7-competitive. Consider an arbitrary list of items. The optimal off-line algorithm can pack the items of the list into bins. The algorithm packs items with total size at most into each bin, thus from Lemma 9.14 it follows that . On the other hand considering Lemma 9.15 we obtain that , which yields that , and this inequality proves that the algorithm is asymptotically 1.7-competitive.
It is important to note that the bound is tight, i.e. it is also true that the asymptotic competitive ratio of FF is $1.7$. Many algorithms have been developed with smaller asymptotic competitive ratio than $1.7$; the best algorithm known at present is asymptotically $1.5888$-competitive.
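As an illustration of the rule described above, First-Fit can be sketched in a few lines of Python. The function name and the sample items are hypothetical and chosen only to show a case where keeping all bins open helps. The quadratic scan over the bins is kept for clarity, it is not an efficient implementation.

```python
from fractions import Fraction

def first_fit(sizes):
    """Pack items into unit bins with First-Fit; return the list of bin levels."""
    levels = []
    for a in sizes:
        for i, level in enumerate(levels):
            if level + a <= 1:
                # first opened bin with enough room
                levels[i] = level + a
                break
        else:
            # the item fits into no open bin: open a new one
            levels.append(a)
    return levels

# Hypothetical sample: First-Fit fills the first bin with 3/5 + 2/5,
# whereas a one-open-bin strategy would already have closed that bin.
items = [Fraction(3, 5), Fraction(1, 2), Fraction(2, 5), Fraction(1, 2)]
print(len(first_fit(items)))
```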
Lower bounds
In this part we consider the techniques for proving lower bounds on the possible asymptotic competitive ratio. First we present a simple lower bound and then we show how the idea of the proof can be extended into a general method.
Theorem 9.16 No on-line algorithm for the bin packing problem can have smaller asymptotic competitive ratio than 4/3.
Proof. Let $A$ be an arbitrary on-line algorithm. Consider the following sequence of items. Let $0 < \varepsilon < 1/12$, let $n$ be an even integer, and let $L_1$ be a list of $n$ items of size $1/3 + \varepsilon$, and $L_2$ be a list of $n$ items of size $1/2 + \varepsilon$. The input is started by $L_1$. Then $A$ packs two items or one item into each bin. Denote the number of bins containing two items by $k$. In this case the number of the used bins is $A(L_1) = k + (n - 2k) = n - k$. On the other hand, the optimal off-line algorithm can pack pairs of items into the bins, thus $OPT(L_1) = n/2$.
Now suppose that the input is the combined list $L_1 L_2$. The algorithm is an on-line algorithm, therefore it does not know at the beginning whether the input is $L_1$ or $L_1 L_2$, thus it also uses $k$ bins for packing two items from the part $L_1$. Therefore among the items of size $1/2 + \varepsilon$ only $n - 2k$ can be paired with earlier items and each of the other ones needs a separate bin. Thus $A(L_1 L_2) \ge (n - k) + (n - (n - 2k)) = n + k$. On the other hand, the optimal off-line algorithm can pack a smaller (size $1/3 + \varepsilon$) item and a larger (size $1/2 + \varepsilon$) item into each bin, thus $OPT(L_1 L_2) = n$.
So we obtained that there is a list $L$ for algorithm $A$ where
$$\frac{A(L)}{OPT(L)} \ge \max\left\{ \frac{n-k}{n/2},\ \frac{n+k}{n} \right\} \ge \frac{4}{3}.$$
Moreover, for the above constructed lists $OPT(L)$ is at least $n/2$, which can be arbitrarily large. Therefore the above inequality proves that the asymptotic competitive ratio of $A$ is at least $4/3$, and this is what we wanted to prove.
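The case analysis in this proof can be replayed numerically. The sketch below is only an illustration under stated assumptions: the list length `n` is even, the first list consists of items slightly larger than 1/3, the second of items slightly larger than 1/2, and `k` is the number of bins into which the algorithm puts two items of the first list. The function names are hypothetical.

```python
from fractions import Fraction

def forced_ratio(n, k):
    """Larger of the two cost ratios an adversary can force for given n and k.

    n must be even and 0 <= 2k <= n.
    """
    ratio_first_list = Fraction(n - k, n // 2)  # on-line bins / optimal bins after list 1
    ratio_both_lists = Fraction(n + k, n)       # lower bound after both lists
    return max(ratio_first_list, ratio_both_lists)

def best_choice_of_k(n):
    """Smallest ratio the algorithm can hope for over all choices of k."""
    return min(forced_ratio(n, k) for k in range(n // 2 + 1))
```

Whatever `k` the algorithm chooses, one of the two prefixes yields a ratio of at least 4/3, and `k = n/3` balances the two terms when `n` is divisible by 3.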
The fundamental idea of the above proof is that a long sequence (in this proof $L_1 L_2$) is considered, and depending on the behaviour of the algorithm a prefix of the sequence is selected as input for which the ratio of the costs is maximal. It is a natural extension to consider more complicated sequences. Many lower bounds have been proven based on different sequences. On the other hand, the computations which are necessary to analyse the sequence have become more and more difficult. Below we show how the analysis of such sequences can be interpreted as a mixed integer programming problem, which makes it possible to use computers to develop lower bounds.
Consider the following sequence of items. Let $L = L_1 L_2 \dots L_k$, where $L_i$ contains $n_i = \alpha_i n$ identical items of size $a_i$. If algorithm $A$ is asymptotically $C$-competitive, then the inequality
$$C \ge \limsup_{n \to \infty} \frac{A(L_1 \dots L_j)}{OPT(L_1 \dots L_j)}$$
is valid for each $j$. It is enough to consider an algorithm for which the technique can achieve the minimal lower bound, thus our aim is to determine the value
$$R = \min_{A} \max_{j = 1, \dots, k} \limsup_{n \to \infty} \frac{A(L_1 \dots L_j)}{OPT(L_1 \dots L_j)},$$
which gives a lower bound on the possible asymptotic competitive ratio. We can determine this value as an optimal solution of a mixed integer programming problem. To define this problem we need the following definitions.
The contents of a bin can be described by the packing pattern of the bin, which gives how many elements are contained in the bin from each subsequence. Formally, a packing pattern is a $k$-dimensional vector $p = (p_1, \dots, p_k)$, where coordinate $p_j$ is the number of elements contained in the bin from subsequence $L_j$. For the packing patterns the constraint $\sum_{j=1}^{k} a_j p_j \le 1$ must hold. (This constraint ensures that the items described by the packing pattern fit into the bin.)
Classify the set $T$ of the possible packing patterns. For each $j$ let $T_j$ be the set of the patterns for which the first positive coordinate is the $j$-th one. (Pattern $p$ belongs to class $T_j$ if $p_i = 0$ for each $i < j$ and $p_j > 0$.)
Consider the packing produced by $A$. Each bin is packed by some packing pattern, therefore the packing can be described by the packing patterns. For each $p \in T$ denote the number of bins which are packed by the pattern $p$ by $n(p)$. The packing produced by the algorithm is given by variables $n(p)$.
Observe that the bins which are packed by a pattern from class $T_j$ receive their first element from subsequence $L_j$. Therefore we obtain that the number of bins opened by $A$ to pack the elements of subsequence $L_1 \dots L_j$ can be given by variables $n(p)$ as follows
$$A(L_1 \dots L_j) = \sum_{i=1}^{j} \sum_{p \in T_i} n(p).$$
Consequently, for a given $n$ the required value $R$ can be determined by the solution of the following mixed integer programming problem.
$$\min R$$
$$\sum_{p \in T} p_j\, n(p) = n_j, \quad 1 \le j \le k,$$
$$\sum_{i=1}^{j} \sum_{p \in T_i} n(p) \le R \cdot OPT(L_1 \dots L_j), \quad 1 \le j \le k,$$
$$n(p) \in \{0, 1, \dots\}, \quad p \in T.$$
The first constraints describe that the algorithm has to pack all items. The second constraints describe that $R$ is at least as large as the ratio of the on-line and off-line costs for the subsequences considered.
The set $T$ of the possible packing patterns and also the optimal solutions $OPT(L_1 \dots L_j)$ can be determined by the list $L_1 L_2 \dots L_k$.
In this problem the number and the value of the variables can be large, thus instead of the problem its linear programming relaxation is considered. Moreover, we are interested in the solution under the assumption that $n$ tends to $\infty$, and it can be proven that the integer programming problem and its linear programming relaxation give the same bound in this case.
The best currently known bound was proven by this method and it states that no on-line algorithm can have smaller asymptotic competitive ratio than $1.5401$.
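The packing patterns and their classes can be enumerated mechanically for a concrete list of item sizes. The following Python sketch is illustrative only. It assumes that a pattern is a non-zero integer vector whose weighted sum of item sizes is at most 1 and that a pattern's class is the index of its first positive coordinate. The function names and the sample sizes (1/3 + 1/20 and 1/2 + 1/20) are hypothetical.

```python
from fractions import Fraction
from itertools import product

def packing_patterns(sizes):
    """All non-zero vectors p with sum(sizes[j] * p[j]) <= 1."""
    limits = [int(1 // a) for a in sizes]  # at most floor(1 / a_j) copies of item type j
    patterns = []
    for p in product(*(range(limit + 1) for limit in limits)):
        if any(p) and sum(a * c for a, c in zip(sizes, p)) <= 1:
            patterns.append(p)
    return patterns

def classify(patterns):
    """Group patterns by the (1-based) index of their first positive coordinate."""
    classes = {}
    for p in patterns:
        j = next(i for i, c in enumerate(p) if c > 0)
        classes.setdefault(j + 1, []).append(p)
    return classes

sizes = [Fraction(1, 3) + Fraction(1, 20), Fraction(1, 2) + Fraction(1, 20)]
print(classify(packing_patterns(sizes)))
```

The number of patterns grows quickly with the number of item types, which is one reason why the resulting programs are solved by computer.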
The bin packing problem has three different multidimensional generalisations: the vector packing, the box packing and the strip packing models. We consider only the strip packing problem in detail. For the other generalisations we give only the model. In the vector packing problem the input is a list of $d$-dimensional vectors, and the algorithm has to pack these vectors into the minimal number of bins. A packing is legal for a bin if for each coordinate the sum of the values of the elements packed into the bin is at most $1$. In the on-line version the vectors arrive one by one and the algorithm has to assign the vectors to the bins without any information about the further vectors. In the box packing problem the input is a list of $d$-dimensional boxes and the goal is to pack the items into the minimal number of $d$-dimensional unit cubes without overlapping. In the on-line version the items arrive one by one and the algorithm has to pack them into the cubes without any information about the further items.
On-line strip packing
In the strip packing problem there is a set of two dimensional rectangles, defined by their widths and heights, and the task is to pack them into a vertical strip of width $w$ without rotation, minimising the total height of the strip. We assume that the widths of the rectangles are at most $w$ and the heights of the rectangles are at most $1$. This problem appears in many situations. Usually, scheduling of tasks with a shared resource involves two dimensions, namely the resource and the time. We can consider the widths as the resources and the heights as the times. Our goal is to minimise the total amount of time used. Some applications can be found in computer scheduling problems. We consider the on-line version where the rectangles arrive from a list one by one and we have to pack each rectangle into the vertical strip without any information about the further items. Most of the algorithms developed for the strip packing problem belong to the class of shelf algorithms. We consider this family of algorithms below.
Shelf algorithms
A basic way of packing into the strip is to define shelves and pack the rectangles onto the shelves. By shelf we mean a rectangular part of the strip. Shelf packing algorithms place each rectangle onto one of the shelves. Once the algorithm has decided which shelf will contain the rectangle, the rectangle is placed onto the shelf as far to the left as possible without overlapping the other rectangles placed onto the shelf earlier. Therefore, after the arrival of a rectangle, the algorithm has to make two decisions. The first decision is whether to create a new shelf or not. If the algorithm creates a new shelf, it also has to decide the height of the new shelf. The created shelves always start from the top of the previous shelf. The first shelf is placed at the bottom of the strip. The algorithm also has to choose which shelf to put the rectangle onto. Hereafter we will say that it is possible to pack a rectangle onto a shelf, if there is enough room for the rectangle on the shelf. It is obvious that if a rectangle is higher than a shelf, we cannot place it onto the shelf.
We consider only one algorithm in detail. This algorithm was developed and analysed by Baker and Schwarz in 1983 and it is called the Next-Fit-Shelf ($NFS_r$) algorithm. The algorithm depends on a parameter $r < 1$. For each $j$ there is at most one active shelf with height $r^j$. We define how the algorithm works below.
After the arrival of a rectangle $p_i = (w_i, h_i)$ choose a value for $k$ which satisfies $r^{k+1} < h_i \le r^k$. If there is an active shelf with height $r^k$ and it is possible to pack the rectangle onto it, then pack it there. If there is no active shelf with height $r^k$, or it is not possible to pack the rectangle onto the active shelf with height $r^k$, then create a new shelf with height $r^k$, put the rectangle onto it, and let this new shelf be the active shelf with height $r^k$ (if we had an active shelf with height $r^k$ earlier, then we close it).
Example 9.7 Let . Suppose that the size of the first item is . Therefore, it is assigned to a shelf of height . We define a shelf of height at the bottom of the strip; this will be the active shelf with height and we place the item into the left corner of this shelf. Suppose that the size of the next item is . In this case it is placed onto a shelf of height . There is no active shelf with this height so we define a new shelf of height on top of the previous shelf. This will be the active shelf of height and the item is placed onto its left corner. Suppose that the size of the next item is . This item is placed onto a shelf of height . It is not possible to pack it onto the active shelf, thus we close the active shelf and we define a new shelf of height on top of the previous shelf. This will be the active shelf of height and the item is placed into its left corner. Suppose that the size of the next item is . This item is placed onto a shelf of height . We can pack it onto the active shelf of height , thus we pack it onto that shelf as far to the left as possible.
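The shelf rule can be traced with a small Python sketch. It is a minimal illustration under explicit assumptions: the parameter `r` lies strictly between 0 and 1, a rectangle of height `h` (with 0 < h ≤ 1) is assigned to the shelf class `k` for which `r**(k+1) < h <= r**k`, and the strip width defaults to 1. The class name, the method names and the sample rectangles in the usage note are hypothetical and are not the values of the example above.

```python
from fractions import Fraction

class NextFitShelf:
    def __init__(self, r, strip_width=Fraction(1)):
        if not 0 < r < 1:
            raise ValueError("r must lie strictly between 0 and 1")
        self.r = Fraction(r)
        self.width = Fraction(strip_width)
        self.top = Fraction(0)  # total height used by all shelves so far
        self.active = {}        # k -> [bottom of the active shelf, used width]

    def shelf_class(self, h):
        """Return the k with r**(k+1) < h <= r**k."""
        k = 0
        while self.r ** (k + 1) >= h:
            k += 1
        return k

    def place(self, w, h):
        """Pack a w x h rectangle; return the (x, y) of its lower-left corner."""
        if not (0 < w <= self.width and 0 < h <= 1):
            raise ValueError("rectangle does not satisfy the size assumptions")
        k = self.shelf_class(h)
        shelf = self.active.get(k)
        if shelf is None or shelf[1] + w > self.width:
            # no usable active shelf of this height: open a new one on top
            shelf = [self.top, Fraction(0)]
            self.active[k] = shelf  # the previous active shelf, if any, is closed
            self.top += self.r ** k
        x, y = shelf[1], shelf[0]
        shelf[1] += w
        return x, y
```

With `r = 1/2`, a rectangle of height 3/4 goes to a shelf of height 1 and a rectangle of height 1/4 or 3/16 goes to a shelf of height 1/4. Calling `place` repeatedly and reading `top` afterwards gives the height of the packing.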
For the competitive ratio of $NFS_r$ the following statements are valid.
Theorem 9.17 Algorithm $NFS_r$ is $\left(\frac{2}{r} + \frac{1}{r(1-r)}\right)$-competitive. Algorithm $NFS_r$ is asymptotically $2/r$-competitive.
Proof. First we prove that the algorithm is $\left(\frac{2}{r} + \frac{1}{r(1-r)}\right)$-competitive. Consider an arbitrary list of rectangles and denote it by $L$. Let $H_A$ denote the sum of the heights of the shelves which are active at the end of the packing, and let $H_C$ be the sum of the heights of the other shelves. Let $h$ be the height of the highest active shelf ($h = r^j$ for some $j$), and let $H$ be the height of the highest rectangle. Since the algorithm created a shelf with height $h$, we have $H > rh$. As there is at most $1$ active shelf for each height,
$$H_A \le h \sum_{i=0}^{\infty} r^i = \frac{h}{1-r} \le \frac{H}{r(1-r)}.$$
Consider the shelves which are not active at the end. Consider the shelves of height $h r^i$ for each $i$, and denote the number of the closed shelves by $n_i$. Let $S$ be one of these shelves with height $h r^i$. The next shelf $S'$ with height $h r^i$ contains one rectangle which would not fit onto $S$. Therefore, the total width of the rectangles packed onto $S$ and $S'$ is at least $w$. Furthermore, the height of these rectangles is at least $h r^{i+1}$, thus the total area of the rectangles packed onto $S$ and $S'$ is at least $h w r^{i+1}$. If we pair the shelves of height $h r^i$ for each $i$ in this way, using the active shelf if the number of the shelves of the considered height is odd, we obtain that the total area of the rectangles assigned to the shelves of height $h r^i$ is at least $w n_i h r^{i+1} / 2$. Thus the total area of the rectangles is at least $\sum_{i=0}^{\infty} w n_i h r^{i+1} / 2$, and this yields that $OPT(L) \ge \sum_{i=0}^{\infty} n_i h r^{i+1} / 2$. On the other hand, the total height of the closed shelves is $H_C = \sum_{i=0}^{\infty} n_i h r^i$, and we obtain that $H_C \le 2\,OPT(L)/r$.
Since $OPT(L) \ge H$ is valid we proved the required inequality
$$NFS_r(L) \le OPT(L) \left( \frac{2}{r} + \frac{1}{r(1-r)} \right).$$
Since the heights of the rectangles are bounded by $1$, $H$ and $H_A$ are bounded by a constant, so we obtain the result about the asymptotic competitive ratio immediately.
Besides this algorithm some other shelf algorithms have been investigated for the solution of the problem. We can interpret the basic idea of $NFS_r$ as follows. We define classes of items belonging to types of shelves, and the rectangles assigned to the classes are packed by the classical bin packing algorithm NF. It is a natural idea to use other bin packing algorithms. The best shelf algorithm known at present was developed by Csirik and Woeginger in 1997. That algorithm uses the harmonic bin packing algorithm to pack the rectangles assigned to the classes.
Exercises
9.4-1 Suppose that the size of the items is bounded above by . Prove that under this assumption the asymptotic competitive ratio of NF is .
9.4-2 Suppose that the size of the items is bounded above by . Prove Lemma 9.15 under this assumption.
9.4-3 Suppose that the sequence of items is given by a list , where contains items of size , contains items of size , contains items of size . Which packing patterns can be used? Which patterns belong to class ?
9.4-4 Consider the version of the strip packing problem where one can lengthen the rectangles keeping the area fixed. Consider the extension of $NFS_r$ which lengthens each rectangle before the packing to the same height as the shelf which is chosen to pack it onto. Prove that this algorithm is -competitive.
The area of scheduling theory has a huge literature. The first result in on-line scheduling belongs to Graham, who analysed the List scheduling algorithm in 1966. Graham did not use the terminology which was later developed in the area of on-line algorithms, and he did not consider the algorithm as an on-line algorithm; he analysed it as an approximation algorithm.
From the area of scheduling we only recall the definitions which are used in this chapter.
In a scheduling problem we have to find an optimal schedule of jobs. We consider the parallel machines case, where $m$ machines are given, and we can use them to schedule the jobs. In the most fundamental model each job has a known processing time and to schedule the job we have to assign it to a machine, and we have to give its starting time and a completion time, where the difference between the completion time and the starting time is the processing time. No machine may simultaneously run two jobs.
Concerning the machine environment three different models are considered. If the processing time of a job is the same for each machine, then we call the machines identical machines. If each machine has a speed $s_i$, the jobs have a processing weight $p_j$ and the processing time of job $j$ on the $i$-th machine is $p_j / s_i$, then we call the machines related machines. If the processing time of job $j$ is given by an arbitrary positive vector $P_j = (p_j(1), \dots, p_j(m))$, where the processing time of the job on the $i$-th machine is $p_j(i)$, then we call the machines unrelated machines.
Many objective functions are considered for scheduling problems, but here we consider only models where the goal is the minimisation of the makespan (the maximal completion time).
In the next subsection we define the two most fundamental on-line scheduling models, and in the following two subsections we consider these models in detail.
Probably the following models are the most fundamental examples of on-line machine scheduling problems.
LIST model
In this model we have a fixed number of machines denoted by $M_1, M_2, \dots, M_m$, and the jobs arrive from a list. This means that the jobs and their processing times are revealed to the on-line algorithm one by one. When a job is revealed, the on-line algorithm has to assign the job to a machine with a starting time and a completion time irrevocably.
By the load of a machine, we mean the sum of the processing times of all jobs assigned to the machine. Since the objective is to minimise the maximal completion time, it is enough to consider the schedules where the jobs are scheduled on the machines without idle time. For these schedules the maximal completion time on each machine equals the load of the machine. Therefore this scheduling problem is reduced to a load balancing problem, i.e. the algorithm has to assign the jobs to the machines minimising the maximum load, which is the makespan in this case.
Example 9.8 Consider the LIST model and two identical machines. Consider the following sequence of jobs where the jobs are given by their processing time: . The on-line algorithm first receives job from the list, and the algorithm has to assign this job to one of the machines. Suppose that the job is assigned to machine . After that the on-line algorithm receives job from the list, and the algorithm has to assign this job to one of the machines. Suppose that the job is assigned to machine . After that the on-line algorithm receives job from the list, and the algorithm has to assign this job to one of the machines. Suppose that the job is assigned to machine . Finally, the on-line algorithm receives job from the list, and the algorithm has to assign this job to one of the machines. Suppose that the job is assigned to machine . Then the loads on the machines are and , and we can give a schedule where the maximal completion times on the machines are the loads: we can schedule the jobs on the first machine in time intervals and , and we can schedule the jobs on the second machine in time intervals and .
TIME model
In this model there is again a fixed number of machines. Each job has a processing time and a release time. A job is revealed to the on-line algorithm at its release time. For each job the on-line algorithm has to choose which machine it will run on and assign a start time. No machine may simultaneously run two jobs. Note that the algorithm is not required to assign a job immediately at its release time. However, if the on-line algorithm assigns a job at time $t$ then it cannot use information about jobs released after time $t$ and it cannot start the job before time $t$. Our aim is to minimise the makespan.
Example 9.9 Consider the TIME model with two related machines. Let be the first machine with speed , and be the second machine with speed . Consider the following input , where the jobs are given by the (processing time, release time) pairs. Thus a job arrives at time with processing time , and the algorithm can either start to process it on one of the machines or wait for jobs with larger processing time. Suppose that the algorithm waits till time and then it starts to process the job on machine . The completion time of the job is . At time three further jobs arrive, and at that time only can be used. Suppose that the algorithm starts to process job on this machine. At time both jobs are completed. Suppose that the remaining jobs are started on machines and , and the completion times are and , thus the makespan achieved by the algorithm is . Observe that an algorithm which starts the first job immediately at time can make a better schedule with makespan . But it is important to note that in some cases it can be useful to wait for larger jobs before starting a job.
The first algorithm in this model has been developed by Graham. Algorithm LIST
assigns each job to the machine where the actual load is minimal. If there are more machines with this property, it uses the machine with the smallest index. This means that the algorithm tries to balance the loads on the machines. The competitive ratio of this algorithm is determined by the following theorem.
Theorem 9.18 The competitive ratio of algorithm LIST is $2 - 1/m$ in the case of identical machines.
Proof. First we prove that the algorithm is $(2 - 1/m)$-competitive. Consider an arbitrary input sequence denoted by $\sigma = \{j_1, \dots, j_n\}$, and denote the processing times by $p_1, \dots, p_n$. Consider the schedule produced by LIST. Let $j_l$ be a job with maximal completion time. Investigate the starting time $S_l$ of this job. Since LIST chooses the machine with minimal load, the load was at least $S_l$ on each of the machines when $j_l$ was scheduled. Therefore we obtain that
$$S_l \le \frac{1}{m} \sum_{j \ne l} p_j = \frac{1}{m} \left( \sum_{j=1}^{n} p_j - p_l \right).$$
This yields that
$$\mathrm{LIST}(\sigma) = S_l + p_l \le \frac{1}{m} \sum_{j=1}^{n} p_j + \frac{m-1}{m} p_l.$$
On the other hand, OPT also processes all of the jobs, thus we obtain that $\mathrm{OPT}(\sigma) \ge \frac{1}{m} \sum_{j=1}^{n} p_j$. Furthermore, $j_l$ is scheduled on one of the machines of OPT, thus $\mathrm{OPT}(\sigma) \ge p_l$. Due to these bounds we obtain that
$$\mathrm{LIST}(\sigma) \le \left( 1 + \frac{m-1}{m} \right) \mathrm{OPT}(\sigma),$$
which proves that LIST is $(2 - 1/m)$-competitive.
Now we prove that the bound is tight. Consider the following input. It contains $m(m-1)$ jobs with processing time $1/m$ and one job with processing time $1$. LIST assigns $m-1$ small jobs to each machine and the last large job is assigned to $M_1$. Therefore its makespan is $1 + (m-1)/m$. On the other hand, the optimal algorithm schedules the large job on $M_1$ and $m$ small jobs on each of the other machines, and its makespan is $1$. Thus the ratio of the makespans is $2 - 1/m$, which shows that the competitive ratio of LIST is at least $2 - 1/m$.
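The tightness argument can be checked numerically. The sketch below assumes the standard lower-bound instance for LIST, namely $m(m-1)$ jobs of size $1/m$ followed by one job of size $1$, and uses exact fractions so that ties between machine loads are resolved exactly as in the proof. The helper names are hypothetical.

```python
from fractions import Fraction


def list_makespan(jobs, m):
    """Makespan of LIST (least-loaded machine, smallest index on ties)."""
    loads = [Fraction(0)] * m
    for p in jobs:
        i = loads.index(min(loads))  # first machine with minimal load
        loads[i] += p
    return max(loads)


def tight_instance(m):
    """Assumed worst-case input: m(m-1) small jobs, then one unit job."""
    return [Fraction(1, m)] * (m * (m - 1)) + [Fraction(1)]


def ratio_on_tight_instance(m):
    # The optimal makespan of this instance is 1: the unit job runs alone
    # on one machine and m small jobs run on each of the other machines.
    return list_makespan(tight_instance(m), m) / 1


if __name__ == "__main__":
    for m in range(2, 6):
        print(m, ratio_on_tight_instance(m))
```

For every $m$ the printed ratio equals $2 - 1/m$, matching the theorem.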
Although it is hard to imagine any other algorithm for the on-line case, many other algorithms have been developed. The competitive ratios of the better algorithms tend to numbers smaller than $2$ as the number of machines tends to $\infty$. Most of these algorithms are based on the following idea. The jobs are scheduled keeping the load uniform on most of the machines, but in contrast to LIST, the loads are kept low on some of the machines, keeping the possibility of using these machines for scheduling large jobs which may arrive later.
Below we consider the more general cases where the machines are not identical. Here LIST may perform very badly, since the processing time of a job can be very large on the machine where the actual load is minimal. However, we can easily change the greedy idea of LIST as follows. The extended algorithm is called Greedy, and it assigns the job to the machine where the load including the processing time of the job is minimal. If several machines attain this minimal value, the algorithm chooses among them the machine where the processing time of the job is minimal; if several machines have this property as well, it chooses the one with the smallest index.
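The three-level tie-breaking rule maps naturally onto a lexicographic key. The following is a minimal sketch for the unrelated machines model, where each job is given as a vector of processing times, one per machine; the name `greedy_schedule` and the input values are assumptions made for illustration.

```python
def greedy_schedule(jobs, m):
    """Sketch of Greedy for unrelated machines.

    jobs[j][i] is the processing time of job j on machine i.
    Each job goes to the machine minimising, in this order:
    (load after adding the job, processing time of the job, index).
    Returns (assignment, makespan) with 0-based machine indices.
    """
    loads = [0] * m
    assignment = []
    for times in jobs:
        i = min(range(m), key=lambda k: (loads[k] + times[k], times[k], k))
        loads[i] += times[i]
        assignment.append(i)
    return assignment, max(loads)


if __name__ == "__main__":
    # Hypothetical input: two machines, four jobs.
    print(greedy_schedule([(1, 2), (1, 2), (1, 3), (1, 3)], 2))
```

For related machines the same function can be used by passing, for a job of processing weight $w$, the vector of values $w / v_i$ where $v_i$ is the speed of machine $i$.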
Example 9.10 Consider the case of related machines where there are machines and the speeds are , . Suppose that the input is , where the jobs are defined by their processing weight. The load after the first job is on machine and on the other machines, thus is assigned to . The load after job is on all of the machines, and its processing time is minimal on machine , thus Greedy
assigns it to . The load after job is on and , and on , thus the job is assigned to . The load after job is on , on , and on , thus the job is assigned to . Finally, the load after job is on , on and , and on , thus the job is assigned to .
Example 9.11 Consider the case of unrelated machines with two machines and the following input: , where the jobs are defined by the vectors of processing times. The load after job is on and on , thus the job is assigned to . The load after job is on and also on , thus the job is assigned to , because it has smaller processing time. The load after job is on and , thus the job is assigned to because it has smaller processing time. Finally, the load after job is on and on , thus the job is assigned to .
The competitive ratio of the algorithm is determined by the following theorems.
Theorem 9.19 The competitive ratio of algorithm Greedy is $m$ in the case of unrelated machines.
Proof. First we prove that the competitive ratio of the algorithm is at least $m$. Consider the following input sequence. Let $\varepsilon > 0$ be an arbitrarily small number. The sequence contains $m$ jobs. The processing time of job $\sigma_1$ is $1$ on machine $M_1$, $1 + \varepsilon$ on machine $M_m$, and $\infty$ on the other machines ($p_1(1) = 1$, $p_1(i) = \infty$ for $i = 2, \dots, m-1$, $p_1(m) = 1 + \varepsilon$). For job $\sigma_i$, $i = 2, \dots, m$, the processing time is $i$ on machine $M_i$, $1 + \varepsilon$ on machine $M_{i-1}$ and $\infty$ on the other machines ($p_i(i) = i$, $p_i(i-1) = 1 + \varepsilon$, $p_i(j) = \infty$ if $j \ne i$ and $j \ne i-1$).
In this case job $\sigma_i$ is scheduled on $M_i$ by Greedy and the makespan is $m$. On the other hand, the optimal off-line algorithm schedules $\sigma_1$ on $M_m$, and $\sigma_i$ is scheduled on $M_{i-1}$ for the other jobs, thus the optimal makespan is $1 + \varepsilon$. The ratio of the makespans is $m / (1 + \varepsilon)$. This ratio tends to $m$ as $\varepsilon$ tends to $0$, and this proves that the competitive ratio of the algorithm is at least $m$.
Now we prove that the algorithm is $m$-competitive. Consider an arbitrary input sequence, denote the makespan in the optimal off-line schedule by $L^*$ and let $L(k)$ denote the maximal load in the schedule produced by Greedy after scheduling the first $k$ jobs. Since the processing time of the $i$-th job is at least $\min_j p_i(j)$, and the load is at most $L^*$ on each of the $m$ machines in the off-line optimal schedule, we obtain that $m L^* \ge \sum_{i=1}^{n} \min_j p_i(j)$.
We prove by induction that the inequality $L(k) \le \sum_{i=1}^{k} \min_j p_i(j)$ is valid. Since the first job is assigned to the machine where its processing time is minimal, the statement is obviously true for $k = 1$. Let $1 \le k < n$ be an arbitrary number and suppose that the statement is true for $k$. Consider the $(k+1)$-th job. Let $M_l$ be the machine where the processing time of this job is minimal. If we assign the job to $M_l$, then the load on this machine is at most $L(k) + p_{k+1}(l) \le \sum_{i=1}^{k+1} \min_j p_i(j)$ by the induction hypothesis.
On the other hand, the maximal load in the schedule produced by Greedy cannot be more than the maximal load in the case when the job is assigned to $M_l$, thus $L(k+1) \le \sum_{i=1}^{k+1} \min_j p_i(j)$, which means that we proved the inequality for $k + 1$.
Therefore we obtain that $m L^* \ge \sum_{i=1}^{n} \min_j p_i(j) \ge L(n)$, which yields that the algorithm is $m$-competitive.
To investigate the case of the related machines consider an arbitrary input. Let $L$ and $L^*$ denote the makespans achieved by Greedy and OPT, respectively. The analysis of the algorithm is based on the following lemmas which give bounds on the loads of the machines.
Lemma 9.20 The load on the fastest machine is at least $L - L^*$.
Proof. Consider the schedule produced by Greedy. Consider a job $J$ which causes the makespan (its completion time is maximal). If this job is scheduled on the fastest machine, then the lemma follows immediately, i.e. the load on the fastest machine is $L$. Suppose that $J$ is not scheduled on the fastest machine. The optimal maximal load is $L^*$, thus the processing time of $J$ on the fastest machine is at most $L^*$. On the other hand, the completion time of $J$ is $L$, thus at the time when the job was scheduled the load was at least $L - L^*$ on the fastest machine, otherwise Greedy would have assigned $J$ to the fastest machine.
Lemma 9.21 If the loads are at least on all machines having a speed of at least then the loads are at least on all machines having a speed of at least .
Proof. If , then the statement is obviously valid. Suppose that . Consider the jobs which are scheduled by Greedy on the machines having a speed of at least in the time interval . The total processing weight of these jobs is at least times the total speed of the machines having a speed of at least . This yields that there exists a job among them which is assigned by OPT to a machine having a speed smaller than (otherwise the optimal off-line makespan would be larger than ). Let be such a job.
Since OPT schedules on a machine having a speed smaller than , the processing weight of is at most . This yields that the processing time of is at most on the machines having a speed of at least . On the other hand, Greedy produces a schedule where the completion time of is at least , thus at the time when the job was scheduled the loads were at least on the machines having a speed of at most (otherwise Greedy would assign to one of these machines).
Now we can prove the following statement.
Theorem 9.22 The competitive ratio of algorithm Greedy is $\Theta(\lg m)$ in the case of the related machines.
Proof. First we prove that Greedy is $O(\lg m)$-competitive. Consider an arbitrary input. Let $L$ and $L^*$ denote the makespans achieved by Greedy and OPT, respectively.
Let be the speed of the fastest machine. Then by Lemma 9.20 the load on this machine is at least . Then using Lemma 9.21 we obtain that the loads are at least on the machines having a speed of at least . Therefore the loads are at least on the machines having a speed of at least . Denote the set of the machines having a speed of at most by .
Denote the sum of the processing weights of the jobs by . OPT
can find a schedule of the jobs which has maximal load , and there are at most machines having smaller speed than , thus
On the other hand, Greedy
schedules the same jobs, thus the load on some machine not included in is smaller than in the schedule produced by Greedy
(otherwise we would obtain that the sum of the processing weights is greater than ).
Therefore we obtain that
which yields that , which proves that Greedy
is -competitive.
Now we prove that the competitive ratio of the algorithm is at least . Consider the following set of machines: contains one machine with speed and contains machines with speed . For each , contains machines with speed , and contains machines. Observe that the number of jobs of processing weight which can be scheduled during time unit is the same on the machines of and on the machines of . It is easy to calculate that , if , thus the number of machines is .
Consider the following input sequence. In the first phase jobs arrive having processing weight , in the second phase jobs arrive having processing weight , in the -th phase jobs arrive with processing weight , and the sequence ends with the -th phase, which contains one job with processing weight . An off-line algorithm can schedule the jobs of the -th phase on the machines of set achieving maximal load , thus the optimal off-line cost is at most .
Investigate the behaviour of algorithm Greedy
on this input. The jobs of the first phase can be scheduled on the machines of during time unit, and it takes also time unit to process these jobs on the machines of . Thus Greedy
schedules these jobs on the machines of , and each load is on these machines after the first phase. Then the jobs of the second phase are scheduled on the machines of , the jobs of the third phase are scheduled on the machines of and so on. Finally, the jobs of the -th and -th phase are scheduled on the machine of set . Thus the cost of Greedy
is , (this is the load on the machine of set ). Since , we proved the required statement.
In this model we investigate only one algorithm. The basic idea is to divide the jobs into groups by the release time and to use an optimal off-line algorithm to schedule the jobs of the groups. This algorithm is called the interval scheduling algorithm and we denote it by INTV. Let $t_0$ be the release time of the first job, and $i = 0$. The algorithm is defined by the following pseudocode:
INTV($I$)
  WHILE NOT end of sequence
    let $H_i$ be the set of the unscheduled jobs released till $t_i$
    let $OFF_i$ be an optimal off-line schedule of the jobs of $H_i$
    schedule the jobs as it is determined by $OFF_i$, starting the schedule at $t_i$
    let $q_i$ be the maximal completion time
    IF a new job is released in time interval $(t_i, q_i]$ or the sequence is ended
      THEN $t_{i+1} \leftarrow q_i$
      ELSE let $t_{i+1}$ be the release time of the next job
    $i \leftarrow i + 1$
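The interval idea can be sketched in Python for identical machines. Two things here are assumptions made for illustration: the optimal off-line schedule of each group is found by brute force over all assignments (exponential, so suitable only for tiny inputs), and the function names `optimal_makespan` and `intv_makespan` are hypothetical. Only the makespan is returned, since that is the quantity the analysis uses.

```python
from itertools import product


def optimal_makespan(times, m):
    """Brute-force optimal makespan on m identical machines."""
    best = None
    for assign in product(range(m), repeat=len(times)):
        loads = [0] * m
        for p, machine in zip(times, assign):
            loads[machine] += p
        if best is None or max(loads) < best:
            best = max(loads)
    return best


def intv_makespan(jobs, m):
    """Sketch of INTV; jobs is a non-empty list of
    (processing_time, release_time) pairs."""
    pending = sorted(jobs, key=lambda job: job[1])
    t = pending[0][1]                      # release time of the first job
    while True:
        batch = [job for job in pending if job[1] <= t]
        pending = [job for job in pending if job[1] > t]
        # schedule the whole batch optimally, starting at time t
        q = t + optimal_makespan([p for p, _ in batch], m)
        if not pending:                    # the sequence has ended
            return q
        next_release = pending[0][1]
        # continue at q if something arrived meanwhile, else wait
        t = q if next_release <= q else next_release


if __name__ == "__main__":
    print(intv_makespan([(2, 0), (1, 0), (3, 1), (1, 10)], 2))
```

On the hypothetical input above the first group finishes at time 2, the job released at time 1 then occupies the interval from 2 to 5, and the algorithm idles until the last job is released at time 10, giving makespan 11.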
Example 9.12 Consider two identical machines. Suppose that the sequence of jobs is , where the jobs are defined by the (processing time, release time) pairs. In the first iteration are scheduled: an optimal off-line algorithm schedules on machine and on machine , so the jobs are completed at time . Since no new job has been released in the time interval , the algorithm waits for a new job until time . Then the second iteration starts: and are scheduled on and respectively in the time interval . During this time interval has been released, thus at time the next iteration starts and INTV schedules on in the time interval .
The following statement holds for the competitive ratio of algorithm INTV
.
Theorem 9.23 In the TIME model algorithm INTV
is 2-competitive.
Proof. Consider an arbitrary input and the schedule produced by INTV
. Denote the number of iterations by . Let , , and let denote the optimal off-line cost. In this case . This inequality is obvious if . If , then the inequality holds, because also the optimal off-line algorithm has to schedule the jobs which are scheduled in the -th iteration by INTV
, and INTV
uses an optimal off-line schedule in each iteration. On the other hand, . To prove this inequality first observe that the release time is at least for the jobs scheduled in the -th iteration (otherwise the algorithm would schedule them in the -th iteration).
Therefore also the optimal algorithm has to schedule these jobs after time . On the other hand, it takes at least time units to process these jobs, because INTV uses an optimal off-line algorithm in the iterations. The makespan of the schedule produced by INTV is , and we have shown that , thus we proved that the algorithm is $2$-competitive.
Some other algorithms have also been developed in the TIME model. Vestjens proved that the on-line LPT algorithm is $3/2$-competitive. This algorithm schedules the longest unscheduled, released job at each time when some machine is available. The following lower bound for the possible competitive ratios of the on-line algorithms is also given by Vestjens.
Theorem 9.24 The competitive ratio of any on-line algorithm is at least $1.3473$ in the TIME model for minimising the makespan.
Proof. Let $\alpha \approx 0.3473$ be the solution of the equation $\alpha^3 - 3\alpha + 1 = 0$ which belongs to the interval $[1/3, 1/2]$. We prove that no on-line algorithm can have smaller competitive ratio than $1 + \alpha$. Consider an arbitrary on-line algorithm, denote it by ALG. Investigate the following input sequence.
At time one job arrives with processing time . Let be the time when the algorithm starts to process the job on one of the machines. If , then the sequence is finished and , which proves the statement. So we can suppose that .
The release time of the next job is and its processing time is . Denote its starting time by . If , then we end the sequence with jobs having release time , and processing time . In this case an optimal off-line algorithm schedules the first two jobs on the same machine and the last jobs on the other machines starting them at time , thus its cost is . On the other hand, the on-line algorithm must schedule one of the last jobs after the completion of the first or the second job, thus in this case, which yields that the competitive ratio of the algorithm is at least . Therefore we can suppose that .
At time further jobs arrive with processing times and one job with processing time . The optimal off-line algorithm schedules the second and the last jobs on the same machine, and the other jobs are scheduled one by one on the other machines and the makespan of the schedule is . Since before time none of the last jobs is started by ALG
, after this time ALG
must schedule at least two jobs on one of the machines and the maximal completion time is at least . Since , the ratio is minimal if , and in this case the ratio is , which proves the theorem.
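The constant in the theorem can be reproduced numerically. The sketch below assumes that the equation defining $\alpha$ in the proof is $\alpha^3 - 3\alpha + 1 = 0$ on the interval $[1/3, 1/2]$ (the form usually quoted for this bound) and locates the root by bisection; the function name is hypothetical.

```python
def solve_alpha(lo=1 / 3, hi=1 / 2, tol=1e-12):
    """Bisection for the root of a^3 - 3a + 1 in [1/3, 1/2].

    The polynomial is positive at 1/3 (value 1/27) and negative at 1/2
    (value -3/8), so the interval brackets exactly one sign change.
    """
    def f(a):
        return a ** 3 - 3 * a + 1

    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


if __name__ == "__main__":
    alpha = solve_alpha()
    print(alpha, 1 + alpha)
```

Under this assumption the root is close to $0.3473$, so the lower bound $1 + \alpha$ is close to $1.3473$.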
Exercises
9.5-1 Prove that the competitive ratio is at least for any on-line algorithm in the case of two identical machines.
9.5-2 Prove that LIST is not constant-competitive in the case of unrelated machines.
9.5-3 Prove that the modification of INTV which uses a -approximation schedule (a schedule with at most times more cost than the optimal cost) instead of the optimal off-line schedule in each step is -competitive.
PROBLEMS
Consider the special case of the $k$-server problem, where the distance between each pair of points is $1$. (This problem is equivalent to the on-line paging problem.) Analyse the algorithm which serves the requests not having a server on their place by the server which was used least recently. (This algorithm is equivalent to the LRU paging algorithm.) Prove that the algorithm is $k$-competitive.
Consider the following alarming algorithm for the data acknowledgement problem. ALARM2 is obtained from the general definition with the values . Prove that the algorithm is not constant-competitive.
Prove that no on-line algorithm can have smaller competitive ratio than using a sequence which contains items of size , , , where is a small positive number.
9-4
Strip packing with modifiable rectangles
Consider the following version of the strip packing problem. In the new model the algorithms are allowed to lengthen the rectangles keeping the area fixed. Develop a -competitive algorithm for the solution of the problem.
Consider the algorithm in the TIME model which starts the longest released, unscheduled job at each time when a machine is available. This algorithm is called on-line LPT. Prove that the algorithm is $3/2$-competitive.
CHAPTER NOTES
More details about the results on on-line algorithms can be found in the books [31], [73].
The first results about the $k$-server problem (Theorems 9.1 and 9.2) were published by Manasse, McGeoch and Sleator in [172]. The presented algorithm for the line (Theorem 9.3) was developed by Chrobak, Karloff, Payne and Viswanathan (see [45]). Later Chrobak and Larmore extended the algorithm for trees in [46]. The first constant-competitive algorithm for the general problem was developed by Fiat, Rabani and Ravid (see [72]). The best known algorithm is based on the work function technique. The first work function algorithm for the problem was developed by Chrobak and Larmore in [47]. Koutsoupias and Papadimitriou proved in [147] that the work function algorithm is $(2k-1)$-competitive.
The first mathematical model for the data acknowledgement problem and the first results (Theorems 9.5 and 9.6) were presented by Dooly, Goldman, and Scott in [64]. On-line algorithms with lookahead property are presented in [124]. Albers and Bals considered a different objective function in [11]. Karlin, Kenyon and Randall investigated randomised algorithms for the data acknowledgement problem in [139]. The Landlord algorithm was developed by Young in [280]. A detailed description of the results in the area of on-line routing can be found in the survey [162] written by Leonardi. The exponential algorithm for the load balancing model is investigated by Aspnes, Azar, Fiat, Plotkin and Waarts in [13]. The exponential algorithm for the throughput objective function is applied by Awerbuch, Azar and Plotkin in [16].
A detailed survey about the theory of on-line bin packing was written by Csirik and Woeginger (see [56]). The algorithms NF and FF are analysed with competitive analysis by Johnson, Demers, Ullman, Garey and Graham in [133], [134]; further results can be found in the PhD thesis of Johnson ([132]). Our Theorem 9.12 is a special case of Theorem 1 in [126] and Theorem 9.13 is a special case of Theorems 5.8 and 5.9 of the book [48] and Corollary 20.13 in the twentieth chapter of this book [22]. Van Vliet applied the packing patterns to prove lower bounds for the possible competitive ratios in [264], [265]. For the on-line strip packing problem a shelf algorithm was developed and analysed by Baker and Schwarz in [20]. Later further shelf packing algorithms were developed; the best shelf packing algorithm for the strip packing problem was developed by Csirik and Woeginger in [57].
A detailed survey about the results in the area of on-line scheduling was written by Sgall ([239]). The first on-line result is the analysis of algorithm LIST, which was published by Graham in [97]. Many further algorithms were developed and analysed for the case of identical machines; the algorithm with the smallest competitive ratio (it tends to 1.9201 as the number of machines tends to $\infty$) was developed by Fleischer and Wahl in [76]. The lower bound for the competitive ratio of Greedy in the related machines model was proved by Cho and Sahni in [43]. The upper bound, the related machines case and a more sophisticated exponential function based algorithm were presented by Aspnes, Azar, Fiat, Plotkin and Waarts in [13]. A summary of further results about the applications of the exponential function technique in the area of on-line scheduling can be found in the paper of Azar ([17]). The interval algorithm presented in the TIME model and Theorem 9.23 are based on the results of Shmoys, Wein and Williamson (see [236]). A detailed description of further results (on-line LPT, lower bounds) in the TIME model can be found in the PhD thesis of Vestjens [262]. We presented only the most fundamental on-line scheduling models in the chapter, although an interesting model has been developed recently, where the number of the machines is not fixed, and the algorithm is allowed to purchase machines. The model is investigated in papers [65], [66], [123], [125].
Problem 9-1 is based on [244], Problem 9-2 is based on [64], Problem 9-3 is based on [278], Problem 9-4 is based on [122] and Problem 9-5 is based on [262].
In many situations in engineering and economy the conflicting interests of several decision makers have to be taken into account simultaneously, and the outcome of the situation depends on the actions of these decision makers. One of the most popular modeling methodologies for such situations is based on game theory.
Let $N$ denote the number of decision makers (who will be called players), and for each $k = 1, 2, \dots, N$ let $S_k$ be the set of all feasible actions of player $P_k$. The elements $s_k \in S_k$ are called strategies of player $P_k$, and $S_k$ is the strategy set of this player. In any realization of the game each player selects a strategy; the vector $s = (s_1, s_2, \dots, s_N)$ ($s_k \in S_k$, $k = 1, 2, \dots, N$) is then called a simultaneous strategy vector of the players. For each $s \in S = S_1 \times S_2 \times \dots \times S_N$ each player has an outcome which is assumed to be a real value. This value can be imagined as the utility function value of the particular outcome, in which this function represents how player $P_k$ evaluates the outcomes of the game. If $f_k(s_1, \dots, s_N)$ denotes this value, then $f_k : S \to \mathbb{R}$ is called the payoff function of player $P_k$. The value $f_k(s)$ is called the payoff of player $P_k$ and $(f_1(s), \dots, f_N(s))$ is called the payoff vector. The number of players, the sets of strategies and the payoff functions completely determine and define the $N$-person game. We will also use the notation $G = \{N; S_1, \dots, S_N; f_1, \dots, f_N\}$ for this game.
The solution of game $G$ is the Nash-equilibrium, which is a simultaneous strategy vector $s^* = (s_1^*, \dots, s_N^*)$ such that for all $k$,
1. $s_k^* \in S_k$;
2. for all $s_k \in S_k$,
$$f_k(s_1^*, \dots, s_{k-1}^*, s_k, s_{k+1}^*, \dots, s_N^*) \le f_k(s_1^*, \dots, s_{k-1}^*, s_k^*, s_{k+1}^*, \dots, s_N^*). \qquad (10.1)$$
Condition 1 means that the $k$-th component of the equilibrium is a feasible strategy of player $P_k$, and condition 2 shows that none of the players can increase its payoff by unilaterally changing its strategy. In other words, it is in the interest of all players to keep the equilibrium, since if any player departs from the equilibrium, its payoff does not increase.
Game $G$ is called finite if the number of players is finite and all strategy sets $S_k$ contain finitely many strategies. The most famous two-person finite game is the prisoner's dilemma, which is the following.
Example 10.1 The players are two prisoners who committed a serious crime, but the prosecutor has insufficient evidence to convict them. The prisoners are held in separate cells and cannot communicate, and the prosecutor wants them to cooperate with the authorities in order to get the needed additional information. So $N = 2$, and the strategy sets for both players have two elements: cooperating ($C$), or not cooperating ($NC$). It is told to both prisoners privately that if he is the only one to confess, then he will get only a light sentence of $1$ year, while the other will go to prison for a period of 10 years. If both confess, then each will get a 5 year prison sentence, and if neither of them confesses, then they will be convicted of a less severe crime with a sentence of 2 years each. The objective of both players is to minimize the time spent in prison, or equivalently to maximize its negative. Figure 10.1 shows the payoff values, where the rows correspond to the strategies of player $P_1$, the columns show the strategies of player $P_2$, and for each strategy pair the first number is the payoff of player $P_1$, and the second number is the payoff of player $P_2$. Comparing the payoff values, it is clear that only $(C, C)$ can be an equilibrium, since
$$f_2(NC, NC) = -2 < f_2(NC, C) = -1, \quad f_1(NC, C) = -10 < f_1(C, C) = -5, \quad f_2(C, NC) = -10 < f_2(C, C) = -5.$$
The strategy pair $(C, C)$ is really an equilibrium, since
$$f_1(C, C) = -5 > f_1(NC, C) = -10, \quad f_2(C, C) = -5 > f_2(C, NC) = -10.$$
In this case we have a unique equilibrium.
The existence of an equilibrium is not guaranteed in general, and if equilibrium exists, it might not be unique.
Example 10.2 Modify the payoff values of Figure 10.1 as shown in Figure 10.2. It is easy to see that no equilibrium exists:
If all payoff values are identical, then we have multiple equilibria: any strategy pair is an equilibrium.
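Let $N$ denote the number of players, and for the sake of notational convenience let $s_k^{(1)}, \dots, s_k^{(n_k)}$ denote the feasible strategies of player $P_k$. That is, $S_k = \{s_k^{(1)}, \dots, s_k^{(n_k)}\}$. A strategy vector $s^* = (s_1^{(i_1)}, \dots, s_N^{(i_N)})$ is an equilibrium if and only if for all $k = 1, 2, \dots, N$ and $j \in \{1, 2, \dots, n_k\} \setminus \{i_k\}$,
$$f_k(s_1^{(i_1)}, \dots, s_{k-1}^{(i_{k-1})}, s_k^{(j)}, s_{k+1}^{(i_{k+1})}, \dots, s_N^{(i_N)}) \le f_k(s_1^{(i_1)}, \dots, s_N^{(i_N)}). \qquad (10.2)$$
Notice that in the case of finite games inequality (10.1) reduces to (10.2).
In applying the enumeration method, inequality (10.2) is checked for all possible strategy $N$-tuples $s^* = (s_1^{(i_1)}, \dots, s_N^{(i_N)})$ to see if (10.2) holds for all $k$ and $j$. If it does, then $s^*$ is an equilibrium, otherwise not. If during the process of checking for a particular $s^*$ we find a $k$ and $j$ such that (10.2) is violated, then $s^*$ is not an equilibrium and we can omit checking further values of $k$ and $j$. This algorithm is very simple, it consists of $N + 2$ nested loops with variables $i_1, i_2, \dots, i_N$, $k$ and $j$.
The maximum number of comparisons needed equals
$$\left( \prod_{k=1}^{N} n_k \right) \left( \sum_{k=1}^{N} (n_k - 1) \right),$$
however in practical cases it might be much lower, since if (10.2) is violated with some $j$, then the comparison must stop for the same strategy vector.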
The algorithm can formally be given as follows:
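Prisoner-Enumeration($n_1, \dots, n_N$)
 1 FOR $i_1 \leftarrow 1$ TO $n_1$
 2   DO FOR $i_2 \leftarrow 1$ TO $n_2$
 3     $\vdots$
 4       DO FOR $i_N \leftarrow 1$ TO $n_N$
 5         DO $key \leftarrow 0$
 6           FOR $k \leftarrow 1$ TO $N$
 7             DO FOR $j \leftarrow 1$ TO $n_k$
 8               DO IF (10.2) fails
 9                 THEN $key \leftarrow 1$ and go to 10
10           IF $key = 0$
11             THEN $(s_1^{(i_1)}, \dots, s_N^{(i_N)})$ is equilibrium
12 RETURN $(s_1^{(i_1)}, \dots, s_N^{(i_N)})$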
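The enumeration method can be sketched compactly in Python. The nested loops over $i_1, \dots, i_N$ become one loop over the Cartesian product of the index sets, and the early exit on the first violated inequality is provided by `any`. The function name, the 0-based strategy indices and the payoff table are assumptions made for illustration; the table encodes the prisoner's dilemma with index 0 for confessing and the light sentence taken to be 1 year.

```python
from itertools import product


def pure_equilibria(sizes, payoff):
    """Enumerate all equilibria of a finite game.

    sizes[k] is the number of strategies of player k, and
    payoff(k, s) is the payoff of player k at the index tuple s.
    """
    players = range(len(sizes))
    equilibria = []
    for s in product(*(range(n) for n in sizes)):
        # does some player gain by a unilateral deviation from s?
        violated = any(
            payoff(k, s[:k] + (j,) + s[k + 1:]) > payoff(k, s)
            for k in players
            for j in range(sizes[k])
            if j != s[k]
        )
        if not violated:
            equilibria.append(s)
    return equilibria


# Prisoner's dilemma: 0 = confess, 1 = do not confess (negative years).
years = {
    (0, 0): (-5, -5),
    (0, 1): (-1, -10),
    (1, 0): (-10, -1),
    (1, 1): (-2, -2),
}

if __name__ == "__main__":
    print(pure_equilibria([2, 2], lambda k, s: years[s][k]))
```

The only tuple reported is `(0, 0)`, where both prisoners confess. If all payoffs are identical, every strategy tuple is returned.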
Consider next the two-person case, $N = 2$, and introduce the $n_1 \times n_2$ real matrices $A^{(1)}$ and $A^{(2)}$ with $(i, j)$ elements $f_1(s_1^{(i)}, s_2^{(j)})$ and $f_2(s_1^{(i)}, s_2^{(j)})$, respectively. Matrices $A^{(1)}$ and $A^{(2)}$ are called the payoff matrices of the two players. A strategy vector $(s_1^{(i)}, s_2^{(j)})$ is an equilibrium if and only if the $(i, j)$ element in matrix $A^{(1)}$ is the largest in its column, and in matrix $A^{(2)}$ it is the largest in its row. In the case when $f_2 = -f_1$, the game is called zero-sum, and $A^{(2)} = -A^{(1)}$, so the game can be completely described by the payoff matrix $A^{(1)}$ of the first player. In this special case a strategy vector $(s_1^{(i)}, s_2^{(j)})$ is an equilibrium if and only if the element $a_{ij}$ of $A^{(1)}$ is the largest in its column and the smallest in its row. In the zero-sum cases the equilibria are also called the saddle points of the games. Clearly, the enumeration method to find equilibria becomes simpler, since we have to deal with a single matrix only.
The simplified algorithm is as follows:
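Equilibrium
 1 FOR $i \leftarrow 1$ TO $n_1$
 2   DO FOR $j \leftarrow 1$ TO $n_2$
 3     DO $key \leftarrow 0$
 4       FOR $l \leftarrow 1$ TO $n_1$
 5         DO IF $a_{lj} > a_{ij}$
 6           THEN $key \leftarrow 1$
 7             go to 12
 8       FOR $l \leftarrow 1$ TO $n_2$
 9         DO IF $a_{il} < a_{ij}$
10           THEN $key \leftarrow 1$
11             go to 12
12       IF $key = 0$
13         THEN RETURN $(s_1^{(i)}, s_2^{(j)})$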
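For the zero-sum case the test reduces to a column-maximum and row-minimum check on a single matrix. The sketch below returns all saddle points rather than only the first one; the function name and both sample matrices are assumptions made for illustration.

```python
def saddle_points(a):
    """All (i, j) such that a[i][j] is the largest element of column j
    and the smallest element of row i (0-based indices)."""
    points = []
    for i, row in enumerate(a):
        for j, value in enumerate(row):
            column_max = max(r[j] for r in a)
            if value == column_max and value == min(row):
                points.append((i, j))
    return points


if __name__ == "__main__":
    print(saddle_points([[3, 1, 4], [5, 2, 6], [0, -1, 7]]))
    print(saddle_points([[1, -1], [-1, 1]]))
```

In the first hypothetical matrix the entry 2 in the middle is both the minimum of its row and the maximum of its column, so it is the unique saddle point. The second matrix (a matching-pennies type game) has none, which illustrates that an equilibrium need not exist.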
Many finite games have the common feature that they can be represented by a finite directed tree with the following properties:
there is a unique root of the tree (which is not the endpoint of any arc), and the game starts at this node;
to each node of the tree a player is assigned and if the game reaches this node at any time, then this player will decide on the continuation of the game by selecting an arc originating from this node. Then the game moves to the endpoint of the chosen arc;
to each terminal node (in which no arc originates) an $N$-dimensional real vector is assigned which gives the payoff values for the players if the game terminates at this node;
each player knows the tree, the nodes he is assigned to, and all payoff values at the terminal nodes.
For example, the game of chess satisfies the above properties with $N = 2$. The nodes of the tree are all possible configurations on the chessboard twice: once with the white player and once with the black player assigned to it. The arcs represent all possible moves of the assigned player from the originating configurations. The endpoints are those configurations in which the game terminates. The payoff values are from the set $\{1, 0, -1\}$, where $1$ means win, $-1$ represents loss, and $0$ shows that the game ends with a tie.
Theorem 10.1 All games represented by finite trees have at least one equilibrium.
Proof. We present the proof of this result here, since it suggests a practical algorithm to find equilibria. The proof goes by induction with respect to the number of nodes of the game tree. If the game has only one node, then clearly it is the only equilibrium.
Assume next that the theorem holds for any tree with fewer than $n$ nodes ($n \ge 2$), and consider a game $T_0$ with $n$ nodes. Let $R$ be the root of the tree and let $r_1, r_2, \dots, r_m$ ($m < n$) be the nodes connected to $R$ by an arc. If $T_1, T_2, \dots, T_m$ denote the disjoint subtrees of $T_0$ with roots $r_1, r_2, \dots, r_m$, then each subtree has fewer than $n$ nodes, so each of them has an equilibrium. Assume that player $P_k$ is assigned to $R$. Let $e_1, e_2, \dots, e_m$ be the equilibrium payoffs of player $P_k$ on the subtrees $T_1, T_2, \dots, T_m$ and let $e_j = \max\{e_1, e_2, \dots, e_m\}$. Then player $P_k$ will move to node $r_j$ from the root, and then the equilibrium continues with the equilibrium obtained on the subtree $T_j$.
We note that not all equilibria can be obtained by this method, however the payoff vectors of all equilibria, which can be obtained by this method, are identical.
The proof of the theorem suggests a dynamic programming-type algorithm which is called backward induction. It can be extended to the more general case when the tree has chance nodes from which the continuations of the game are random according to given discrete distributions.
The solution algorithm can be formally presented as follows. Assume that the nodes are numbered so that each arc connects nodes $i$ and $j$ only for $i < j$. The root has to get the smallest number $1$, and the largest number $n$ is given to one of the terminal nodes. For each node $i$ let $J(i)$ denote the set of all nodes $j$ such that there is an arc from $i$ to $j$. For each terminal node $i$, $J(i)$ is empty, and let $p^{(i)} = (p_1^{(i)}, \dots, p_N^{(i)})$ denote the payoff vector associated to this node. Finally, we denote the player assigned to node $i$ by $K_i$ for all $i$. The algorithm starts at the last node $n$ and moves backward in the order $n, n-1, n-2, \dots, 2$ and $1$. Node $n$ is an endpoint, so vector $p^{(n)}$ has already been assigned. If in the process the next node $i$ is an endpoint, then $p^{(i)}$ is already given, otherwise we find the largest among the values $p_{K_i}^{(j)}$, $j \in J(i)$. Assume that the maximal value occurs at node $j_i$; then we assign $p^{(i)} = p^{(j_i)}$ to node $i$, and move to node $i - 1$. After all payoff vectors $p^{(n)}, p^{(n-1)}, \dots, p^{(2)}$ and $p^{(1)}$ are determined, vector $p^{(1)}$ gives the equilibrium payoffs and the equilibrium path is obtained by nodes:
$$1 \to i_1 = j_1 \to i_2 = j_{i_1} \to i_3 = j_{i_2} \to \dots$$
until an endpoint is reached, when the equilibrium path terminates.
At each nonterminal node the number of comparisons equals the number of arcs starting at that node minus $1$. Therefore the total number of comparisons in the algorithm is the total number of arcs minus the number of nonterminal nodes.
This algorithm can be formally given as follows:
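Backward-Induction
1 FOR $i \leftarrow n - 1$ TO $1$
2   DO $p_{K_i}^{(j_i)} \leftarrow \max\{p_{K_i}^{(l)} : l \in J(i)\}$ (node $i$ is skipped if $J(i)$ is empty)
3     $p^{(i)} \leftarrow p^{(j_i)}$
4 print sequence $1, i_1 (= j_1), i_2 (= j_{i_1}), i_3 (= j_{i_2}), \dots$ until an endpoint is reached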
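Backward induction can be sketched in Python as follows. The representation is an assumption made for illustration: nodes are numbered from 1 so that every arc leads from a smaller to a larger number, `children[i]` lists the successors of node `i` (empty for terminal nodes), `player[i]` is the 0-based index of the player deciding at a nonterminal node, and `payoff[i]` is the payoff tuple of a terminal node. The small tree in the example is hypothetical.

```python
def backward_induction(children, player, payoff):
    """Return (equilibrium payoff vector, equilibrium path)."""
    n = max(children)
    value = dict(payoff)          # payoff vectors, known at terminal nodes
    choice = {}                   # best successor of each nonterminal node
    for i in range(n, 0, -1):     # process nodes n, n-1, ..., 1
        if children[i]:
            # the player at node i compares his own payoff component
            best = max(children[i], key=lambda j: value[j][player[i]])
            choice[i] = best
            value[i] = value[best]
    path = [1]
    while path[-1] in choice:     # follow the choices from the root
        path.append(choice[path[-1]])
    return value[1], path


if __name__ == "__main__":
    children = {1: [2, 3], 2: [4, 5], 3: [], 4: [], 5: []}
    player = {1: 0, 2: 1}
    payoff = {3: (2, 1), 4: (3, 0), 5: (1, 2)}
    print(backward_induction(children, player, payoff))
```

In this example the second player would choose node 5 at node 2, which is worth only 1 to the first player, so the first player moves directly to node 3 and the equilibrium payoff vector is (2, 1).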
Example 10.3 Figure 10.3 shows a finite tree. In the circle at each nonterminal node we indicate the player assigned to that node. The payoff vectors are also shown at all terminal nodes. We have three players, so the payoff vectors have three elements.
First we number the nodes such that the beginning of each arc has a smaller number than its endpoint. We indicated these numbers in a box under each node. All nodes for are terminal nodes, as we start the backward induction with node . Since player is assigned to this node we have to compare the third components of the payoff vectors and associated to the endpoints of the two arcs originating from node . Since , player will select the arc to node as his best choice. Hence , and . Then we check node . By comparing the third components of vectors and it is clear that player will select node , so , and . In the graph we also indicated the choices of the players by thicker arcs. Continuing the procedure in the same way for nodes we finally obtain the payoff vector and equilibrium path .
Exercises
10.1-1 An entrepreneur (E) enters a market which is controlled by a chain store (C). Their competition is a two-person game. The strategies of the chain store are soft (S), when it allows the competitor to operate, or tough (T), when it tries to drive out the competitor. The strategies of the entrepreneur are staying in (I) or leaving (L) the market. The payoff tables of the two players are assumed to be
Find the equilibrium.
10.1-2 A salesman sells a piece of equipment, which has 3 parts, to a buyer under the following conditions. If all parts are good, then the customer pays $ to the salesman, otherwise the salesman has to pay $ to the customer. Before selling the equipment, the salesman is able to check any one or more of the parts, but checking any one costs him $ . Consider a two-person game in which player is the salesman with strategies (how many parts he checks before selling the equipment), and player is the equipment with strategies (how many parts are defective). Show that the payoff matrix of player is given as below when we assume that the different parts can be defective with equal probability.
10.1-3 Assume that in the previous problem the payoff of the second player is the negative of the payoff of the salesman. Give a complete description of the number of equilibria as a function of the parameter values . Determine the equilibria in all cases.
10.1-4 Assume that the payoff function of the equipment is its value ( if all parts are good, and zero otherwise) in the previous exercise. Is there an equilibrium point?
10.1-5 Exercise 10.1-1 can be represented by the tree shown in Figure 10.4.
Find the equilibrium with backward induction.
10.1-6 Show that in the one-player case backward induction reduces to the classical dynamic programming method.
10.1-7 Assume that in the tree of a game some nodes are so-called “chance nodes” from which the game continues with given probabilities assigned to the possible next nodes. Show the existence of the equilibrium for this more general case.
10.1-8 Consider the tree given in Figure 10.3, and double the payoff values of player , change the sign of the payoff values of player , and do not change those for Player . Find the equilibrium of this new game.
If the strategy sets are connected subsets of finite dimensional Euclidean Spaces and the payoff functions are continuous, then the game is considered continuous.
It is very intuitive and useful from an algorithmic point of view to reformulate the equilibrium concept as follows. For all players and define the mapping:
which is the set of the best choices of player with given strategies , of the other players. Note that does not depend on , it depends only on all other strategies , . There is no guarantee that maximum exists for all . Let be the subset of such that exists for all and . A simultaneous strategy vector is an equilibrium if and only if , and for all . By introducing the best reply mapping, we can further simplify the above reformulation:
Theorem 10.2 Vector is equilibrium if and only if and .
Hence we have shown that the equilibrium problem of -person games is equivalent to finding fixed points of certain point-to-set mappings.
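The fixed point characterization is easiest to see in a finite two-person game, where the best reply sets can be enumerated directly. The sketch below is a finite illustration only (the surrounding text treats general strategy sets). The function names and the payoff tables are hypothetical: `f1[i][j]` and `f2[i][j]` are assumed to hold the payoffs of players 1 and 2 when player 1 selects strategy `i` and player 2 selects strategy `j`.

```python
def best_replies_of_player1(f1, j):
    """Set of best replies of player 1 against strategy j of player 2."""
    column = [row[j] for row in f1]
    best = max(column)
    return {i for i, v in enumerate(column) if v == best}


def best_replies_of_player2(f2, i):
    """Set of best replies of player 2 against strategy i of player 1."""
    best = max(f2[i])
    return {j for j, v in enumerate(f2[i]) if v == best}


def pure_equilibria(f1, f2):
    """All pairs (i, j) that are fixed points of the best reply mapping,
    that is, i is a best reply to j and j is a best reply to i."""
    return [(i, j)
            for i in range(len(f1))
            for j in range(len(f1[0]))
            if i in best_replies_of_player1(f1, j)
            and j in best_replies_of_player2(f2, i)]


# Hypothetical payoff tables of a prisoner's-dilemma type game.
f1 = [[-2, -10], [-1, -5]]
f2 = [[-2, -1], [-10, -5]]
equilibria = pure_equilibria(f1, f2)

# Hypothetical game without a pure equilibrium (matching pennies).
g1 = [[1, -1], [-1, 1]]
g2 = [[-1, 1], [1, -1]]
no_equilibria = pure_equilibria(g1, g2)
```

The second game shows that the fixed point set may be empty, which is why existence theorems need additional assumptions.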
The most frequently used existence theorems of equilibria are based on fixed point theorems such as the theorems of Brouwer, Kakutani, Banach, Tarski etc. Any algorithm for finding fixed points can be successfully applied for computing equilibria.
The most popular existence result is a straightforward application of the Kakutani fixed point theorem.
Theorem 10.3 Assume that in an -person game
1. the strategy sets are nonempty, closed, bounded, convex subsets of finite dimensional Euclidean spaces;
for all ,
2. the payoff functions are continuous on ;
3. is concave in with all fixed .
Then there is at least one equilibrium.
Example 10.4 Consider a 2-person game, , with strategy sets , and payoff functions , and . We will first find the best responses of both players. Both payoff functions are concave parabolas in their variables with vertices
For all and these values are clearly feasible strategies, so
So is equilibrium if and only if it satisfies equations:
It is easy to see that the unique solution is
which is therefore the unique equilibrium of the game.
Example 10.5 Consider a portion of a sea-channel, assume it is the unit interval . Player is a submarine hiding in location , player is an airplane dropping a bomb at certain location resulting in a damage to the submarine. Hence a special two-person game is defined in which , and . With fixed , is maximal if , therefore the best response of player is . Player wants to minimize which occurs if is as large as possible, which implies that
Clearly, there is no such that and , consequently no equilibrium exists.
Define the aggregation function as:
for all and from and some
Theorem 10.4 Vector is an equilibrium if and only if
for all .
Proof. Assume first that is an equilibrium, then inequality (10.1) holds for all and . Adding the -multiples of these relations for we immediately have (10.5).
Assume next that (10.5) holds for all . Select any and , define , and apply inequality (10.5). All but the -th terms cancel and the remaining term shows that inequality (10.1) holds. Hence is an equilibrium.
Introduce function , then clearly is an equilibrium if and only if
Relation (10.6) is known as Fan's inequality. It can be rewritten as a variational inequality (see “Iterative computation of equilibrium” later), or as a fixed point problem. We show here the second approach. For all define
Since for all , relation (10.6) holds if and only if , that is is a fixed-point of mapping . Therefore any method to find fixed point is applicable for computing equilibria.
The computation cost depends on the type and size of the fixed point problem and also on the selected method.
Example 10.6 Consider again the problem of Example 10.4. In this case
so the aggregate function has the form with :
Therefore
and
Notice that this function is strictly concave in and , and is separable in these variables. At the stationary points:
implying that at the optimum
since both right hand sides are feasible. At the fixed point
giving the unique solution:
Assume that for all
where is a vector variable vector valued function which is continuously differentiable in an open set containing . Assume furthermore that for all , the payoff function is continuously differentiable in on with any fixed .
If is an equilibrium, then for all , is the optimal solution of problem:
By assuming that at the Kuhn-Tucker regularity condition is satisfied, the solution has to satisfy the Kuhn-Tucker necessary condition:
where is an -element column vector, is its transpose, is the gradient of (as a row vector) with respect to and is the Jacobian of function .
Theorem 10.5 If is an equilibrium, then there are vectors such that relations (10.9) are satisfied.
Relations (10.9) for give a (usually large) system of equations and inequalities for the unknowns and (). Any equilibrium (if exists) has to be among the solutions. If in addition for all , all components of are concave, and is concave in , then the Kuhn-Tucker conditions are also sufficient, and therefore all solutions of (10.9) are equilibria.
The computation cost in solving system (10.9) depends on its type and the chosen method.
Example 10.7 Consider again the two-person game of the previous example. Clearly,
so we have
Simple differentiation shows that
therefore the Kuhn-Tucker conditions can be written as follows:
Notice that is concave in , is concave in , and all constraints are linear, therefore all solutions of this equality-inequality system are really equilibria. By systematically examining the combination of cases
and
it is easy to see that there is a unique solution
By introducing slack and surplus variables the Kuhn-Tucker conditions can be rewritten as a system of equations with some nonnegative variables. The nonnegativity conditions can be formally eliminated by considering them as squares of some new variables, so the result becomes a system of (usually) nonlinear equations without additional constraints. There is a large set of numerical methods for solving such systems.
Assume that all conditions of the previous section hold. Consider the following optimization problem:
The two first constraints imply that the objective function is nonnegative, so is the minimal value of it. Therefore system (10.9) has feasible solution if and only if the optimal value of the objective function of problem (10.10) is zero, and in this case any optimal solution satisfies relations (10.9).
Theorem 10.6 The -person game has equilibrium only if the optimal value of the objective function is zero. Then any equilibrium is optimal solution of problem (10.10). If in addition all components of are concave and is concave in for all , then any optimal solution of problem (10.10) is equilibrium.
Hence the equilibrium problem of the -person game has been reduced to finding the optimal solutions of this (usually nonlinear) optimization problem. Any nonlinear programming method can be used to solve the problem.
The computation cost in solving the optimization problem (10.10) depends on its type and the chosen method. For example, if (10.10) is an LP and is solved by the simplex method, then the number of operations is exponential in the worst case. However, in particular cases the procedure terminates with far fewer operations.
Example 10.8 In the case of the previous problem the optimization problem has the following form:
Notice that the solution , and is feasible with zero objective function value, so it is also optimal. Hence it is a solution of system (10.9) and consequently an equilibrium.
We have seen earlier that a finite game does not necessarily have an equilibrium. Even if it does, when the game is repeated many times the players may wish to introduce some randomness into their actions in order to confuse the other players, and to seek an equilibrium in the stochastic sense. This idea can be modeled by introducing probability distributions as the strategies of the players and the expected payoff values as their new payoff functions.
Keeping the notation of Section 10.1 assume that we have players, the finite strategy set of player is . In the mixed extension of this finite game each player selects a discrete probability distribution on its strategy set and in each realization of the game an element of is chosen according to the selected distribution. Hence the new strategy set of player is
which is the set of all -element probability vectors. The new payoff function of this player is the expected value:
Notice that the original “pure” strategies can be obtained by selecting as the -th basis vector. This is a continuous game and as a consequence of Theorem 10.3 it has at least one equilibrium. Hence if a finite game is without an equilibrium, its mixed extension has always at least one equilibrium, which can be obtained by using the methods outlined in the previous sections.
Example 10.9 Consider the two-person case in which , and as in section 10.1 introduce matrices and with elements and . In this special case
The constraints of can be rewritten as:
so we may select
The optimization problem (10.10) now reduces to the following:
where , and , .
Notice this is a quadratic optimization problem. Computation cost depends on the selected method. Observe that the problem is usually nonconvex, so there is the possibility of stopping at a local optimum.
Mixed extensions of two-person finite games are called bimatrix games. They were already examined in Example 10.9. For notational convenience introduce the simplifying notation:
We will show that problem (10.15) can be rewritten as quadratic programming problem with linear constraints.
Consider the objective function first. Let
then the objective function can be rewritten as follows:
The last two constraints also simplify:
implying that
so we may rewrite the objective function again:
since
Hence we have the following quadratic programming problem:
where the last two conditions are obtained from (10.17) and the nonnegativity of vectors , .
Theorem 10.7 Vectors and are equilibria of the bimatrix game if and only if with some and , is optimal solution of problem (10.18). The optimal value of the objective function is zero.
This is a quadratic programming problem. Computation cost depends on the selected method. Since it is usually nonconvex, the algorithm might terminate at local optimum. We know that at the global optimum the objective function must be zero, which can be used for optimality check.
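The optimality check mentioned above can be sketched as follows. Since the displayed form of problem (10.18) is not reproduced here, the sketch assumes the standard formulation in which the objective is x^T(A+B)y - alpha - beta with constraints Ay <= alpha·1 and B^T x <= beta·1. Under this assumption the best feasible alpha and beta for given x and y are the largest components of Ay and B^T x, the objective is never positive, and it is zero exactly at an equilibrium. The function names and the matrices are hypothetical.

```python
def mat_vec(M, v):
    """Product M v for a matrix given as a list of rows."""
    return [sum(m * c for m, c in zip(row, v)) for row in M]


def vec_mat(v, M):
    """Product v^T M as a list."""
    return [sum(v[i] * M[i][j] for i in range(len(M)))
            for j in range(len(M[0]))]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def equilibrium_gap(A, B, x, y):
    """Objective value x^T(A+B)y - alpha - beta with the smallest
    feasible alpha = max(Ay) and beta = max(B^T x).
    The value is <= 0, and it equals 0 iff (x, y) is an equilibrium."""
    Ay = mat_vec(A, y)
    xB = vec_mat(x, B)
    expected_payoff_1 = dot(x, Ay)   # x^T A y
    expected_payoff_2 = dot(xB, y)   # x^T B y
    return expected_payoff_1 + expected_payoff_2 - max(Ay) - max(xB)


# Hypothetical bimatrix game with two pure and one mixed equilibrium.
A = [[2, 0], [0, 1]]
B = [[1, 0], [0, 2]]
gap_pure = equilibrium_gap(A, B, [1, 0], [1, 0])
gap_mixed = equilibrium_gap(A, B, [2 / 3, 1 / 3], [1 / 3, 2 / 3])
gap_bad = equilibrium_gap(A, B, [1, 0], [0, 1])
```

A numerical solver that stops at a point with a clearly negative gap has found only a local optimum, not an equilibrium.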
and
Then
so problem (10.18) has the form:
where and . We also know from Theorem 10.7 that the optimal objective function value is zero, therefore any feasible solution with zero objective function value is necessarily optimal. It is easy to see that the solutions
are all optimal, so they provide equilibria.
One might apply relations (10.9) to find equilibria by solving the equality-inequality system instead of solving an optimization problem. In the case of bimatrix games problem (10.9) simplifies as
which can be proved along the lines of the derivation of the quadratic optimization problem.
The computation cost of the solution of system (10.19) depends on the particular method being selected.
Example 10.11 Consider again the bimatrix game of the previous example. Substitute the first and second constraints and into the third and fourth condition to have
It is easy to see that the solutions given in the previous example solve this system, so they are equilibria.
We can also rewrite the equilibrium problem of bimatrix games as an equality-inequality system with mixed variables. Assume first that all elements of and are between and . This condition is not really restrictive, since by using linear transformations
where , and is the matrix all elements of which equal , the equilibria remain the same and all matrix elements can be transformed into interval .
Theorem 10.8 Vectors , are an equilibrium if and only if there are real numbers and zero-one vectors , and such that
where denotes the vector with all unit elements.
Proof. Assume first that , is an equilibrium, then with some and , (10.19) is satisfied. Define
Since all elements of and are between and , the values and are also between and . Notice that
which implies that (10.20) holds.
Assume next that (10.20) is satisfied. Then
If , then , (where is the -th basis vector), and if , then . Therefore
implying that . We can similarly show that . Thus (10.19) is satisfied, so, is an equilibrium.
The computation cost of the solution of system (10.20) depends on the particular method being selected.
Example 10.12 In the case of the bimatrix game introduced earlier in Example 10.10 we have the following:
Notice that all three solutions given in Example 10.10 satisfy these relations with
and
respectively.
In the special case of , bimatrix games are called matrix games and they are represented by matrix . Sometimes we refer to the game as matrix game. Since , the quadratic optimization problem (10.18) becomes linear:
From this formulation we see that the set of the equilibrium strategies is a convex polyhedron. Notice that variables and can be separated, so we have the following result.
Theorem 10.9 Vectors and give an equilibrium of the matrix game if and only if with some and , and are optimal solutions of the linear programming problems:
Notice that at the optimum, . The optimal value is called the value of the matrix game.
Solving problems (10.22) requires an exponential number of operations in the worst case if the simplex method is chosen. With a polynomial algorithm (such as an interior point method) the number of operations is only polynomial.
Example 10.13 Consider the matrix game with matrix:
In this case problems 10.22 have the form:
The application of the simplex method shows that the optimal solutions are , , , and .
We can also obtain the equilibrium by finding feasible solutions of a certain set of linear constraints. Since at the optimum of problem (10.21), , vectors , and scalars and are optimal solutions if and only if
The first phase of the simplex method has to be used to solve system (10.23), where the number of operations might be exponential. However, in most practical examples far fewer operations are needed.
Example 10.14 Consider again the matrix game of the previous example. In this case system (10.23) has the following form:
It is easy to see that , , satisfy these relations, so , is an equilibrium.
Consider now a matrix game with matrix . The main idea of this method is that at each step both players determine their best pure strategy choices against the average strategies of the other player of all previous steps. Formally the method can be described as follows.
Let be the initial (mixed) strategy of player . Select (the st basis vector) such that
In any further step , let
and select so that
Let then
and select so that
By repeating the general step for two sequences are generated: , and . We have the following result:
Theorem 10.10 Any cluster point of these sequences is an equilibrium of the matrix game. Since all and are probability vectors, they are bounded. Therefore there is at least one cluster point.
Assume, matrix is . In (10.24) we need multiplications. In (10.25) and (10.27) multiplications and divisions. In (10.26) and (10.28) multiplications. If we make iteration steps, then the total number of multiplications and divisions is:
The formal algorithm is as follows:
Matrix-Equilibrium(
)
1 2 define such that 3 4 5 6 define such that 7 8 9 define such that 10 11IF
and 12THEN
is equilibrium 13ELSE
go back to 4
Here is a user selected error tolerance.
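The method of fictitious play can be sketched as follows. This is a minimal sketch under stated assumptions, not a transcription of Matrix-Equilibrium: it runs a fixed number of steps instead of using the error tolerance test, it returns the averages of the same length for both players, ties are broken by taking the smallest index, and player 1 is assumed to maximize x^T A y while player 2 minimizes it. The helper `value_bounds` is an addition for illustration; it returns a lower and an upper bound of the value of the game, which enclose the value for any pair of mixed strategies.

```python
def fictitious_play(A, x1, steps):
    """Fictitious play for the matrix game A, started from strategy x1.

    Returns the average strategies of the two players after `steps` steps.
    """
    m, n = len(A), len(A[0])
    x_sum = [float(v) for v in x1]   # running sum of x^1, ..., x^k
    y_sum = [0.0] * n                # running sum of y^1, ..., y^k

    def reply_of_player2():
        # Pure strategy minimizing the payoff against the average of the x's
        # (scaling the average by a positive factor does not change argmin).
        pay = [sum(x_sum[i] * A[i][j] for i in range(m)) for j in range(n)]
        return min(range(n), key=pay.__getitem__)

    def reply_of_player1():
        # Pure strategy maximizing the payoff against the average of the y's.
        pay = [sum(A[i][j] * y_sum[j] for j in range(n)) for i in range(m)]
        return max(range(m), key=pay.__getitem__)

    y_sum[reply_of_player2()] += 1.0          # first step: y^1
    for _ in range(2, steps + 1):             # general step k = 2, 3, ...
        x_sum[reply_of_player1()] += 1.0      # x^k against average y
        y_sum[reply_of_player2()] += 1.0      # y^k against average x
    return [v / steps for v in x_sum], [v / steps for v in y_sum]


def value_bounds(A, x, y):
    """Lower bound min_j x^T A e_j and upper bound max_i e_i^T A y."""
    m, n = len(A), len(A[0])
    lower = min(sum(x[i] * A[i][j] for i in range(m)) for j in range(n))
    upper = max(sum(A[i][j] * y[j] for j in range(n)) for i in range(m))
    return lower, upper


# Hypothetical game with a saddle point in pure strategies (value 1).
saddle = [[2, 1], [0, -1]]
xs, ys = fictitious_play(saddle, [1, 0], steps=50)

# Hypothetical game without a pure saddle point (value 0).
pennies = [[1, -1], [-1, 1]]
xp, yp = fictitious_play(pennies, [1, 0], steps=2000)
lower, upper = value_bounds(pennies, xp, yp)
```

The difference between the two bounds can serve as an alternative stopping criterion, since it measures how far the current averages are from an equilibrium.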
Example 10.15 We applied the above method for the matrix game of the previous example and started the procedure with . After 100 steps we obtained and . Comparing it to the true values of the equilibrium strategies we see that the error is below , showing the very slow convergence of the method.
A matrix game with skew-symmetric matrix is called symmetric. In this case and the two linear programming problems are identical. Therefore at the optimum , and the equilibrium strategies of the two players are the same. Hence we have the following result:
Theorem 10.11 A vector is equilibrium of the symmetric matrix game if and only if
To solve system (10.29), the first phase of the simplex method is needed. The number of operations is exponential in the worst case, but in practical cases it is usually much smaller.
Example 10.16 Consider the symmetric matrix game with matrix . In this case relations (10.29) simplify as follows:
Clearly the only solution is and , that is the first pure strategy.
We will see in the next subsection that linear programming problems are equivalent to symmetric matrix games, so any method for solving such games can also be applied to linear programming problems and thus serves as an alternative to the simplex method. As we will see next, symmetry is not a strong assumption, since any matrix game is equivalent to a symmetric matrix game.
Consider therefore a matrix game with matrix , and construct the skew-symmetric matrix
where all components of vector equal 1. Matrix games and are equivalent in the following sense. Assume that , which is not a restriction, since by adding the same constant to all elements of they become positive without changing equilibria.
1. If is an equilibrium strategy of matrix game then with , and is an equilibrium of matrix game with value ;
2. If is an equilibrium of matrix game and is the value of the game, then
is equilibrium strategy of matrix game .
Proof. Assume first that is an equilibrium strategy of game , then , , , so
First we show that , that is . If , then (since is a probability vector) and , contradicting the second inequality of (10.30). If , then , and by the third inequality of (10.30), must have at least one positive component which makes the first inequality impossible.
Next we show that . From (10.30) we have
and by adding these inequalities we see that
and combining this relation with the third inequality of (10.30) we see that .
Select , then , so both , and are probability vectors, furthermore from (10.30),
So by selecting and , and are feasible solutions of the pair (10.22) of linear programming problems with , therefore is an equilibrium of matrix game . Part 2. can be proved in a similar way, the details are not given here.
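The construction can be checked numerically on a small game. The sketch below assumes that the skew-symmetric matrix has the block form P = [[0, A, -1], [-A^T, 0, 1], [1^T, -1^T, 0]], which is the form consistent with the three inequalities used in the proof, and that an equilibrium z of a symmetric game is characterized by z >= 0, 1^T z = 1 and Pz <= 0. The displayed formulas are not reproduced here, so both forms are assumptions, and the function names and the matrix A are hypothetical.

```python
def symmetrize(A):
    """Build the (m+n+1) x (m+n+1) skew-symmetric matrix P from A."""
    m, n = len(A), len(A[0])
    size = m + n + 1
    P = [[0.0] * size for _ in range(size)]
    for i in range(m):
        for j in range(n):
            P[i][m + j] = A[i][j]        # block  A
            P[m + j][i] = -A[i][j]       # block -A^T
    for i in range(m):
        P[i][size - 1] = -1.0            # block -1
        P[size - 1][i] = 1.0             # block  1^T
    for j in range(n):
        P[m + j][size - 1] = 1.0         # block  1
        P[size - 1][m + j] = -1.0        # block -1^T
    return P


def embed(x, y, v):
    """Map an equilibrium (x, y) with value v of game A to z = (ax, ay, av)
    with a = 1 / (2 + v)."""
    a = 1.0 / (2.0 + v)
    return [a * xi for xi in x] + [a * yj for yj in y] + [a * v]


def is_symmetric_equilibrium(P, z, tol=1e-12):
    """Check z >= 0, sum(z) = 1 and P z <= 0 up to the tolerance."""
    Pz = [sum(p * c for p, c in zip(row, z)) for row in P]
    return (all(c >= -tol for c in z)
            and abs(sum(z) - 1.0) <= tol
            and all(c <= tol for c in Pz))


# Hypothetical positive matrix; x = y = (1/2, 1/2) is an equilibrium, v = 3/2.
A = [[2, 1], [1, 2]]
P = symmetrize(A)
z = embed([0.5, 0.5], [0.5, 0.5], 1.5)
ok = is_symmetric_equilibrium(P, z)
```

Dividing the first two blocks of z by a recovers x and y, which is the direction of the equivalence used when a symmetric solver is applied to a general matrix game.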
In this section we will show that linear programming problems can be solved by finding the equilibrium strategies of symmetric matrix games and hence, any method for finding the equilibria of symmetric matrix games can be applied instead of the simplex method.
Consider the primal-dual linear programming problem pair:
Construct the skew-symmetric matrix:
Theorem 10.13 Assume is an equilibrium strategy of the symmetric matrix game with . Then
are optimal solutions of the primal and dual problems, respectively.
Proof. If is an equilibrium strategy, then , that is,
Since and , both vectors , and are nonnegative, and by dividing the first two relations of (10.32) by ,
showing that and are feasible for the primal and dual, respectively. From the last condition of (10.32) we have
However
consequently, , showing that the primal and dual objective functions are equal. The duality theorem implies the optimality of and .
Example 10.17 Consider the linear programming problem:
First we have to rewrite the problem as a primal problem. Introduce the new variables:
and multiply the -type constraint by . Then the problem becomes the following:
Hence
and so matrix becomes:
The fictitious play method is an iterative algorithm in which at each step the players adjust their strategies based on the opponent's strategies. This method can therefore be considered as the realization of a discrete system where the strategy selections of the players are the state variables. For symmetric matrix games John von Neumann introduced a continuous systems approach in which the players continuously adjust their strategies. This method can be applied to general matrix games, since, as we have seen earlier, any matrix game is equivalent to a symmetric matrix game. The method can also be used to solve linear programming problems, as we have seen earlier that any primal-dual pair can be reduced to the solution of a symmetric matrix game.
Let now be a skew-symmetric matrix. The strategy of player , is considered as the function of time . Before formulating the dynamism of the system, introduce the following notation:
For arbitrary probability vector solve the following nonlinear initial-value problem:
Since the right-hand side is continuous, there is at least one solution. The right hand side of the equation can be interpreted as follows. Assume that . If player selects strategy , then player is able to obtain a positive payoff by choosing the pure strategy , which results in a negative payoff for player . However if player increases to one by choosing the same strategy its payoff becomes zero, so it increases. Hence it is in the interest of player to increase . This is exactly what the first term represents. The second term is needed to ensure that remains a probability vector for all .
The computation of the right hand side of equations (10.34) for all requires multiplications. The total computation cost depends on the length of solution interval, on the selected step size, and on the choice of the differential equation solver.
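The initial-value problem can be integrated with any standard solver. The sketch below assumes that system (10.34) has the classical form s_j'(t) = phi(u_j(s)) - Phi(s) s_j with u_i(s) = e_i^T P s, phi(u) = max{0, u} and Phi(s) equal to the sum of the values phi(u_i(s)). The displayed formulas are not reproduced here, so this form is an assumption, although it matches the interpretation of the two terms given above. The classical fourth order Runge–Kutta scheme, the step size, and the skew-symmetric matrix are choices made only for this illustration.

```python
def rhs(P, s):
    """Right-hand side phi(u_j(s)) - Phi(s) * s_j of the assumed system."""
    n = len(s)
    u = [sum(P[i][j] * s[j] for j in range(n)) for i in range(n)]
    phi = [max(0.0, ui) for ui in u]
    total = sum(phi)                       # Phi(s)
    return [phi[j] - total * s[j] for j in range(n)]


def rk4_step(P, s, h):
    """One step of the classical fourth order Runge-Kutta method."""
    k1 = rhs(P, s)
    k2 = rhs(P, [si + 0.5 * h * ki for si, ki in zip(s, k1)])
    k3 = rhs(P, [si + 0.5 * h * ki for si, ki in zip(s, k2)])
    k4 = rhs(P, [si + h * ki for si, ki in zip(s, k3)])
    return [si + h / 6.0 * (a + 2.0 * b + 2.0 * c + d)
            for si, a, b, c, d in zip(s, k1, k2, k3, k4)]


def von_neumann(P, s0, t_end, h):
    """Integrate from the probability vector s0 on the interval [0, t_end]."""
    s = [float(v) for v in s0]
    for _ in range(int(round(t_end / h))):
        s = rk4_step(P, s, h)
    return s


# Hypothetical symmetric game (rock-paper-scissors type skew-symmetric matrix).
P = [[0, 1, -1], [-1, 0, 1], [1, -1, 0]]
s = von_neumann(P, [1, 0, 0], t_end=50.0, h=0.05)
largest_payoff = max(sum(P[i][j] * s[j] for j in range(3)) for i in range(3))
```

The quantity `largest_payoff` is the largest component of Ps; it is nonpositive at an equilibrium strategy, so its size indicates how far the computed vector is from equilibrium.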
Theorem 10.14 Assume that is a positive strictly increasing sequence converging to , then any cluster point of the sequence is equilibrium strategy, furthermore there is a constant such that
Proof. First we have to show that is a probability vector for all . Assume that with some and , . Define
Since is continuous and , clearly , and for all , . Then for all ,
and the Lagrange mean-value theorem implies that with some ,
which is a contradiction. Hence is nonnegative. Next we show that for all . Let , then
so satisfies the homogeneous equation
with the initial condition . Hence for all , , showing that is a probability vector.
Assume that for some , . Then
By multiplying both sides by and adding the resulting equations for we have:
The first term is zero, since is skew-symmetric. Notice that this equation remains valid even as except the break-points (where the derivative of does not exist) since (10.36) remains true.
Assume next that with a positive , . Then for all , . Since equation (10.37) can be rewritten as
with
we see that satisfies a homogeneous equation with zero initial solution at , so the solution remains zero for all . Therefore showing that , that is, is equilibrium strategy.
If for all , then , and clearly
that is
Integrate both sides in interval to have
with , which implies that
By using the Cauchy–Schwarz inequality we get
which is valid even at the break points because of the continuity of functions . And finally, take a sequence with increasingly converging to . The sequence is bounded (being probability vectors), so there is at least one cluster point . From (10.40), by letting we have that showing that is an equilibrium strategy.
Example 10.18 Consider the matrix game with matrix
which was the subject of our earlier Example 10.13. In order to apply the method of von Neumann we first have to find an equivalent symmetric matrix game. The application of the method given in Theorem 10.12 requires that the matrix be positive. Without changing the equilibria we can add 2 to all matrix elements to have
and by using the method we get the skew-symmetric matrix
The differential equations (10.34) were solved by using the th order Runge–Kutta method in the interval with the step size and initial vector . From we get the approximations
of the equilibrium strategies of the original game. Comparing these values to the exact values:
we see that the maximum error is about .
Consider an -person continuous game and assume that all conditions presented at the beginning of Subsection 10.2.3 are satisfied. In addition, assume that for all , is bounded, all components of are concave and is concave in with any fixed . Under these conditions there is at least one equilibrium (Theorem 10.3). The uniqueness of the equilibrium is not true in general, even if all are strictly concave in . Such an example is shown next.
Example 10.19 Consider a two-person game with and . Clearly both payoff functions are strictly concave and there are infinitely many equilibria: .
Select an arbitrary nonnegative vector and define function
where , and is the gradient (as a row vector) of with respect to . The game is said to be diagonally strictly concave if for all , and for some ,
Theorem 10.15 Under the above conditions the game has exactly one equilibrium.
Proof. The existence of the equilibrium follows from Theorem 10.3. In proving uniqueness assume that and are both equilibria, and both satisfy relations (10.9). Therefore for ,
and the second equation can be rewritten as
where and are the th components of and , respectively. Multiplying (10.43) by for and by for and adding the resulting equalities for we have
Notice that the sum of the first two terms is positive by the diagonally strict concavity of the game, the concavity of the components of implies that
and
Therefore from (10.44) we have
where we used the fact that for all and ,
This is an obvious contradiction, which completes the proof.
In practical cases the following result is very useful in checking diagonally strict concavity of -person games.
Theorem 10.16 Assume is convex, is twice continuously differentiable for all , and is negative definite with some , where is the Jacobian of . Then the game is diagonally strictly concave.
Proof. Let , . Then for all , and
Integrate both sides in to have
and by premultiplying both sides by we see that
completing the proof.
Example 10.20 Consider a simple two-person game with strategy sets , and payoff functions
and
Clearly all conditions, except diagonally strict concavity, are satisfied. We will use Theorem 10.16 to show this additional property. In this case
so
with Jacobian
We will show that
is negative definite with some . For example, select , then this matrix becomes
with characteristic polynomial
having negative eigenvalues , .
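For two-dimensional problems the test of Theorem 10.16 can be automated with the closed-form eigenvalues of a symmetric 2 × 2 matrix. The sketch below is an illustration only: the function names are hypothetical, and the Jacobian `J` used in the usage lines is a hypothetical constant matrix, not the Jacobian of Example 10.20.

```python
import math


def symmetrized(J):
    """Return J + J^T for a 2 x 2 matrix given as a list of rows."""
    return [[J[i][j] + J[j][i] for j in range(2)] for i in range(2)]


def eigenvalues_sym_2x2(M):
    """Eigenvalues (smaller first) of a symmetric 2 x 2 matrix."""
    a, b, d = M[0][0], M[0][1], M[1][1]
    # The discriminant (a - d)^2 + 4 b^2 is nonnegative for symmetric M.
    root = math.sqrt((a - d) ** 2 + 4.0 * b * b)
    return ((a + d - root) / 2.0, (a + d + root) / 2.0)


def is_negative_definite(M):
    """A symmetric matrix is negative definite iff all eigenvalues are < 0."""
    return max(eigenvalues_sym_2x2(M)) < 0


# Hypothetical constant Jacobian.
J = [[-2, 1], [0, -2]]
M = symmetrized(J)
eigenvalues = eigenvalues_sym_2x2(M)
negative_definite = is_negative_definite(M)
```

When the Jacobian depends on the strategy vector, the check has to hold at every point of the strategy set, so in that case a test at finitely many sample points is only a necessary condition.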
We have seen earlier in Theorem 10.4 that is an equilibrium if and only if
for all , where is the aggregation function (10.4). In the following analysis we assume that the -person game satisfies all conditions presented at the beginning of Subsection 10.2.9 and (10.42) holds with some positive .
We first show the equivalence of (10.45) and a variational inequality.
Theorem 10.17 A vector satisfies (10.45) if and only if
for all , where is defined in (10.41).
Proof. Assume satisfies (10.45). Then as function of obtains maximum at , therefore
for all , and since is , we proved that satisfies (10.46).
Assume next that satisfies (10.46). By the concavity of in and the diagonally strict concavity of the game we have
so satisfies (10.45).
Hence any method available for solving variational inequalities can be used to find equilibria.
Next we construct a special two-person game, the equilibrium problem of which is equivalent to the equilibrium problem of the original -person game.
Theorem 10.18 Vector satisfies (10.45) if and only if is an equilibrium of the two-person game where
Assume first that satisfies (10.45). Then it satisfies (10.46) as well, so
We need in addition to show that
On the contrary, assume that with some , . Then
where we used (10.42) and (10.46). This is a clear contradiction.
Assume next that is an equilibrium of game . Then for any ,
The first part can be rewritten as
showing that (10.46) is satisfied, so is (10.45).
Consider the following iteration procedure.
Let be arbitrary, and solve problem
Let denote an optimal solution and define . If , then for all ,
so by Theorem 10.17, is an equilibrium. Since , we assume that . In the general step we have already vectors , and scalars . Then the next vector and next scalar are the solutions of the following problem:
Notice that
and
so we know that .
The formal algorithm is as follows:
Continuous-Equilibrium
1 2 solve problem (10.47), let be optimal solution 3IF
4THEN
is equilibrium 5RETURN
6 7 solve problem (10.48), let be optimal solution 8IF
9THEN
is equilibrium 10RETURN
11ELSE
go to 5
Before stating the convergence theorem of the algorithm we notice that in the special case when the strategy sets are defined by linear inequalities (that is, all functions are linear) then all constraints of problem (10.48) are linear, so at each iteration step we have to solve a linear programming problem.
In this linear case the simplex method has to be used in each iteration step with exponential computational cost, so the overall cost is also exponential (with prefixed number of steps).
Theorem 10.19 There is a subsequence of generated by the method that converges to the unique equilibrium of the -person game.
Proof. The proof consists of several steps.
First we show that as . Since at each new iteration an additional constraint is added to (10.48), sequence is nonincreasing. Since it is also nonnegative, it must be convergent. Sequence is bounded, since it is from the bounded set , so it has a convergent subsequence . Notice that from (10.48) we have
where the right hand side tends to zero. Thus and since the entire sequence is monotonic, the entire sequence converges to zero.
Let next be an equilibrium of the -person game, and define
By (10.42), for all . Define the indices so that
then for all , ?
which implies that
where we used again problem (10.48). From this relation we conclude that as . And finally, notice that function satisfies the following properties:
1. is continuous in ;
2. if (as it was shown just below relation (10.49));
3. if for a convergent sequence , , then necessarily .
By applying property with sequence it is clear that so . Thus the proof is complete.
Exercises
10.2-1 Consider a 2-person game with strategy sets , and payoff functions and . Show the existence of a unique equilibrium point by computing it. Show that Theorem 10.3 cannot be applied to prove existence.
10.2-2 Consider the “price war” game in which two firms are price setting. Assume that and are the strategies of the players, and the payoff functions are:
by assuming that . Is there an equilibrium? How many equilibria were found?
10.2-3 A portion of the sea is modeled by the unit square in which a submarine is hiding. The strategy of the submarine is the hiding place . An airplane drops a bomb in a location , which is its strategy. The payoff of the airplane is the damage caused by the bomb, and the payoff of the submarine is its negative. Does this 2-person game have an equilibrium?
10.2-4 In the second-price auction one unit of an item is sold to bidders. They value the item as . Each of them offers a price for the item simultaneously without knowing the offers of the others. The bidder with the highest offer will get the item, but he has to pay only the second highest price. So the strategy of bidder is , so , and the payoff function for this bidder is:
What is the best response function of bidder ? Does this game have an equilibrium?
10.2-5 Formulate Fan's inequality for Exercise 10.2-1.
10.2-6 Formulate and solve Fan's inequality for Exercise 10.2-2.
10.2-7 Formulate and solve Fan's inequality for Exercise 10.2-4.
10.2-8 Consider a 2-person game with strategy sets , and payoff functions
Formulate Fan's inequality.
10.2-9 Let , , . Formulate the Kuhn-Tucker conditions to find the equilibrium. Solve the resulting system of inequalities and equations.
10.2-10 Consider a 3-person game with , , and . Formulate the Kuhn-Tucker condition.
10.2-11 Formulate and solve system (10.9) for exercise 10.2-8.
10.2-12 Repeat the previous problem for the game given in Exercise 10.2-1.
10.2-13 Rewrite the Kuhn-Tucker conditions for exercise 10.2-8 into the optimization problem (10.10) and solve it.
10.2-14 Formulate the mixed extension of the finite game given in Exercise 10.1-1.
10.2-15 Formulate and solve optimization problem (10.10) for the game obtained in the previous problem.
10.2-16 Formulate the mixed extension of the game introduced in Exercise 10.2-3. Formulate and solve the corresponding linear optimization problems (10.22) with , , .
10.2-17 Use fictitious play method for solving the matrix game of exercise 10.2-16.
10.2-18 Generalize the fictitious play method for bimatrix games.
10.2-19 Generalize the fictitious play method for the mixed extensions of finite -person games.
10.2-20 Solve the bimatrix game with matrices and with the method you have developed in Exercise 10.2-18.
10.2-21 Solve the symmetric matrix game by linear programming.
10.2-22 Repeat exercise 10.2-21 with the method of fictitious play.
10.2-23 Develop the Kuhn-Tucker conditions (10.9) for the game given in Exercise 10.2-21 above.
10.2-24 Repeat Exercises 10.2-21, 10.2-22 and 10.2-23 for the matrix game (First find the equivalent symmetric matrix game!).
10.2-25 Formulate the linear programming problem to solve the matrix game with matrix .
10.2-26 Formulate a linear programming solver based on the method of fictitious play and solve the LP problem:
10.2-27 Solve the LP problem given in Example 8.17 by the method of fictitious play.
10.2-28 Solve Exercise 10.2-21 by the method of von Neumann.
10.2-29 Solve Exercise 10.2-24 by the method of von Neumann.
10.2-30 Solve Exercise 10.2-17 by the method of von Neumann.
10.2-31 Check the solution obtained in the previous exercises by verifying that all constraints of (10.21) are satisfied with zero objective function.
Hint. What and should be selected?
10.2-32 Solve exercise 10.2-26 by the method of von Neumann.
10.2-33 Let , , . Show that both payoff functions are strictly concave in and respectively. Prove that there are infinitely many equilibria, that is, the strict concavity of the payoff functions does not imply the uniqueness of the equilibrium.
10.2-34 Can matrix games be strictly diagonally concave?
10.2-35 Consider a two-person game with strategy sets , and payoff functions , . Show that this game satisfies all conditions of Theorem 10.16.
10.2-36 Solve the problem of the previous exercise by algorithm (10.47)–(10.48).
The previous sections presented a general methodology; however, special methods are available for almost all special classes of games. In the following parts of this chapter a special game, the oligopoly game, will be examined. It describes a real-life economic situation in which -firms produce a homogeneous good for a market, or offer the same service. This model is known as the classical Cournot model. The firms are the players. The strategy of each player is its production level with strategy set , where is its capacity limit. It is assumed that the market price depends on the total production level offered to the market: , and the cost of each player depends on its own production level: . The profit of each firm is given as
In this way an -person game is defined.
It is usually assumed that functions and are twice continuously differentiable, furthermore
;
;
for all , and . Under assumptions 1–3. the game satisfies all conditions of Theorem 10.3, so there is at least one equilibrium.
Notice that with the notation , the payoff function of player can be rewritten as
Since is a compact set and this function is strictly concave in , with fixed there is a unique profit maximizing production level of player , which is its best reply and is denoted by .
It is easy to see that there are three cases: if , if , and otherwise is the unique solution of the monotonic equation
Assume that . Then implicit differentiation with respect to shows that
showing that
Notice that from assumptions 2. and 3.,
which is also true for the other two cases except for the break points.
As in Subsection 10.2.1 we can introduce the best reply mapping:
and look for its fixed points. An alternative is to introduce a dynamic process which converges to the equilibrium.
Similarly to the method of fictitious play a discrete system can be developed in which each firm selects its best reply against the actions of the competitors chosen at the previous time period:
Based on relation (10.52) we see that for the right hand side mapping is a contraction, so it converges, however if , then no convergence can be established. Consider next a slight modification of this system: with some :
for . Clearly the steady-states of this system are the equilibria, and it can be proved that if is sufficiently small, then sequences are all convergent to the equilibrium strategies.
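The damped adjustment process can be tried out numerically. The sketch below is only an illustration under stated assumptions: the formula of system (10.55) is not reproduced in this text, so it is read here as "move each output a fraction alpha of the way towards the current best reply", and the linear price `A - B*s`, the linear cost `b*x`, the capacity `cap` and the function names are all hypothetical choices made for the example.

```python
def best_reply(s_others, A=10.0, B=1.0, b=1.0, cap=5.0):
    """Profit-maximising output against the total rival output s_others.

    Hypothetical linear model: price A - B*s, cost b*x, capacity cap.
    """
    x = (A - b - B * s_others) / (2.0 * B)   # interior first-order condition
    return min(max(x, 0.0), cap)             # clip to the strategy set [0, cap]


def adjust(x, alpha=0.5, steps=200):
    """Iterate x_k <- x_k + alpha * (best reply of firm k - x_k) for all firms."""
    x = list(x)
    for _ in range(steps):
        total = sum(x)
        x = [xk + alpha * (best_reply(total - xk) - xk) for xk in x]
    return x


if __name__ == "__main__":
    print(adjust([0.0, 5.0]))        # duopoly: both outputs approach 3
    print(adjust([0.0, 1.0, 5.0]))   # three firms: all outputs approach 2.25
```

In this particular linear example the undamped choice `alpha=1` with three firms makes a symmetric deviation from the equilibrium flip sign at every step instead of shrinking, which is in line with the remark that convergence of the simple process cannot be established for more than two firms.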
Consider next the continuous counterpart of model (10.55), when (similarly to the method of von Neumann) continuous time scales are assumed:
The following result shows the convergence of this process.
Theorem 10.20 Under assumptions 1–3, system (10.56) is asymptotically stable, that is, if the initial values are selected close enough to the equilibrium, then as , converges to the equilibrium strategy for all .
Proof. It is sufficient to show that the eigenvalues of the Jacobian of the system have negative real parts. Clearly the Jacobian is as follows:
where at the equilibrium. From (10.52) we know that for all . In order to compute the eigenvalues of we will need a simple but very useful fact. Assume that and are -element real vectors. Then
where is the identity matrix. This relation can be easily proved by using finite induction with respect to . By using (10.58), the characteristic polynomial of can be written as
where we used the notation
The roots of the first factor are all negative: , and the other eigenvalues are the roots of equation
Notice that by adding the terms with identical denominators this equation becomes
with , and the are different. If denotes the left hand side then clearly the values are the poles,
so strictly decreases locally. The graph of the function is shown in Figure 10.5. Notice first that (10.59) is equivalent to a polynomial equation of degree , so there are real or complex roots. The properties of function indicate that there is one root below , and one root between each and . Therefore all roots are negative, which completes the proof.
The general discrete model (10.55) can be examined in the same way. If for all , then model (10.55) reduces to the simple dynamic process (10.54).
Example 10.21 Consider now a -person oligopoly with price function
strategy sets , and cost functions
The profit of firm is therefore the following:
The best reply of player can be obtained as follows. Following the method outlined at the beginning of “Best reply mappings.” we have the following three cases. If , then is the best choice. If , then is the optimal decision. Otherwise is the solution of equation
where the only positive solution is
After the best replies are found, we can easily construct any of the methods presented before.
Consider an -firm oligopoly with price function and cost functions (). Introduce the following function
and define
for and let
Notice that if , then all elements of are also in this interval, therefore is a single-dimensional point-to-set mapping. Clearly is an equilibrium of the -firm oligopoly game if and only if is a fixed point of mapping and for all , . Hence the equilibrium problem has been reduced to finding fixed points of one-dimensional mappings. This is a significant reduction in the difficulty of the problem, since best replies are -dimensional mappings.
If conditions 1–3 are satisfied, then has exactly one element for all and :
where is the unique solution of the monotonic equation
in the interval . In the third case, the left hand side is positive at , negative at , and by conditions 2–3, it is strictly decreasing, so there is a unique solution.
In the entire interval , is nonincreasing. In the first two cases it is constant and in the third case strictly decreasing. Consider finally the single-dimensional equation
At the left hand side is nonnegative, at it is nonpositive, and is strictly decreasing. Therefore there is a unique solution (that is, fixed point of mapping ), which can be obtained by any method known to solve single-dimensional equations.
Let be the initial interval for the solution of equation (10.65). After bisection steps the accuracy becomes , which will be smaller than an error tolerance if .
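The step count of the bisection method can be made concrete. The following sketch assumes the usual reading of the accuracy statement above, namely that an interval of length `b - a` has length `(b - a) / 2**k` after `k` halvings; the function names and the test function are hypothetical.

```python
import math


def bisection_steps(a, b, eps):
    """Smallest integer k with (b - a) / 2**k < eps."""
    return max(0, math.floor(math.log2((b - a) / eps)) + 1)


def bisect(g, a, b, eps):
    """Halve [a, b] until it is shorter than eps.

    g(a) and g(b) must have opposite signs (or one of them is zero).
    Returns the midpoint of the final interval and the number of halvings.
    """
    steps = 0
    while b - a >= eps:
        mid = (a + b) / 2.0
        if g(a) * g(mid) <= 0:
            b = mid          # a sign change (or a root) lies in [a, mid]
        else:
            a = mid          # otherwise it lies in [mid, b]
        steps += 1
    return (a + b) / 2.0, steps


if __name__ == "__main__":
    root, used = bisect(lambda s: 27.0 - 4.0 * s, 0.0, 15.0, 1e-6)
    print(root, used, bisection_steps(0.0, 15.0, 1e-6))
```

The predicted number of steps depends only on the interval length and the tolerance, not on the function, which is why it can be computed before the iteration starts.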
Oligopoly-Equilibrium( )
1 solve equation (10.65) for
2 FOR to
3   DO solve equation (10.64), and let
4 is equilibrium
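The steps of Oligopoly-Equilibrium can be sketched for a concrete case. Everything specific below is an assumption made for illustration: the price is taken to be linear, `p(s) = A - B*s`, each cost is `c_k(x) = b_k*x`, and the per-firm equation is read as the first-order condition `p(s) + x*p'(s) - c_k'(x) = 0` restricted to the capacity interval. The function names are hypothetical.

```python
def firm_output(s, A, B, b_k, cap_k):
    """Output of firm k that is consistent with total output s.

    Interior case: solve p(s) + x*p'(s) - c_k'(x) = 0 for x, where
    p(s) = A - B*s and c_k(x) = b_k*x; then clip to [0, cap_k].
    """
    x = (A - B * s - b_k) / B
    return min(max(x, 0.0), cap_k)


def oligopoly_equilibrium(A, B, b, caps, eps=1e-10):
    """Step 1: bisection for the total output s; steps 2-3: recover each x_k."""
    def excess(s):
        return sum(firm_output(s, A, B, bk, ck) for bk, ck in zip(b, caps)) - s

    lo, hi = 0.0, sum(caps)          # excess(lo) >= 0 >= excess(hi)
    while hi - lo > eps:
        mid = (lo + hi) / 2.0
        if excess(mid) > 0:          # excess is strictly decreasing in s
            lo = mid
        else:
            hi = mid
    s = (lo + hi) / 2.0
    return s, [firm_output(s, A, B, bk, ck) for bk, ck in zip(b, caps)]


if __name__ == "__main__":
    s, x = oligopoly_equilibrium(10.0, 1.0, [1.0, 1.0, 1.0], [5.0, 5.0, 5.0])
    print(s, x)                      # total close to 6.75, each output close to 2.25
```

Only one scalar equation is solved iteratively; the individual outputs then follow from a single evaluation per firm, which is the point of the reduction to a one-dimensional fixed point problem.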
Example 10.22 Consider the 3-person oligopoly examined in the previous example. From (10.63) we have
where is the unique solution of equation
The first case occurs for , the second case never occurs, and in the third case there is a unique positive solution:
And finally equation (10.65) has the special form
A simple program based on the bisection method gives the solution and then equation (10.66) gives the equilibrium strategies , , .
Notice first that in the case of -player oligopolies , so we select
and since the payoff functions are
the Kuhn-Tucker conditions (10.9) have the following form. The components of the 2-dimensional vectors will be denoted by and . So we have for ,
One might either look for feasible solutions of these relations or rewrite them as the optimization problem (10.10), which has the following special form in this case:
Computational cost in solving (10.69) or (10.70) depends on the type of functions and . No general characterization can be given.
Example 10.23 In the case of the three-person oligopoly introduced in Example 10.21 we have
A professional optimization software was used to obtain the optimal solutions:
and all .
If is an equilibrium of an -person oligopoly, then with fixed maximizes the payoff of player . Assuming that conditions 1–3 are satisfied, is concave in , so maximizes if and only if at the equilibrium
So introduce the slack variables
and
Then clearly at the equilibrium
and by the definition of the slack variables
and if we add the nonnegativity conditions
then we obtain a system of nonlinear relations (10.71)–(10.75) which are equivalent to the equilibrium problem.
We can next show that relations (10.71)–(10.75) can be rewritten as a nonlinear complementarity problem, for the solution of which standard methods are available. For this purpose introduce the notation
then system (10.72)–(10.75) can be rewritten as
This problem is the usual formulation of nonlinear complementarity problems. Notice that the last condition requires that in each component either or or both must be zero.
The computational cost in solving problem (10.76) depends on the type of the involved functions and the choice of method.
Example 10.24 In the case of the 3-person oligopoly introduced and examined in the previous examples we have:
In this section -player oligopolies will be examined under the special condition that the price and all cost functions are linear:
where , , and are positive, but . Assume again that the strategy set of player is the interval . In this special case
for all , therefore
and relations (10.71)–(10.75) become more special:
where we changed their order. Introduce the following vectors and matrices:
Then the above relations can be summarized as:
Next we prove that matrix is negative definite. With any nonzero vector ,
which proves the assertion.
Observe that relations (10.79) are the Kuhn-Tucker conditions of the strictly concave quadratic programming problem:
and since the feasible set is a bounded linear polyhedron and the objective function is strictly concave, the Kuhn-Tucker conditions are sufficient and necessary. Consequently a vector is an equilibrium if and only if it is the unique optimal solution of problem (10.80). There are standard methods to solve problem (10.80) known from the literature.
Since (10.80) is a convex quadratic programming problem, several algorithms are available. Their costs are different, so the computational cost depends on the particular method being selected.
Example 10.25 Consider now a duopoly (two-person oligopoly) where the price function is and the cost functions are and with capacity limits . That is,
Therefore,
so the quadratic programming problem can be written as:
It is easy to see by simple differentiation that the global optimum of the objective function without the constraints is reached at and . These values, however, satisfy the constraints, so they are the optimal solutions. Hence they provide the unique equilibrium of the duopoly.
Exercises
10.3-1 Consider a duopoly with , and costs . Examine the convergence of the iteration scheme (10.55).
10.3-2 Select , , and
Show that there are infinitely many equilibria:
10.3-3 Consider the duopoly of Exercise 10.3-1 above. Find the best reply mappings of the players and determine the equilibrium.
10.3-4 Consider again the duopoly of the previous problem.
(a) Construct the one-dimensional fixed point problem of mapping (10.62) and solve it to obtain the equilibrium.
(b) Formulate the Kuhn-Tucker equations and inequalities (10.69).
(c) Formulate the complementarity problem (10.76) in this case.
CHAPTER NOTES
The Nobel Prize in economics was given only once, in 1994, in the field of game theory. One of the winners was John Nash, who received this honor for his equilibrium concept, which was introduced in 1951 [190].
Backward induction is a more restrictive equilibrium concept. It was developed by Kuhn and can be found in [154]. Since it is a more restrictive equilibrium concept, every such equilibrium is also a Nash equilibrium.
The existence and computation of equilibria can be reduced to those of fixed points. The different variants of fixed point theorems, such as those of Brouwer [33], Kakutani [135] and Tarski [255], are successfully used to prove existence in many game classes. The article [194] uses the fixed point theorem of Kakutani. The books [254] and [77] discuss computer methods for computing fixed points. The most popular existence result is the well known theorem of Nikaido and Isoda [194].
The Fan inequality is discussed in the book of Aubin [15]. The Kuhn-Tucker conditions are presented in the book of Martos [175]. By introducing slack and surplus variables the Kuhn-Tucker conditions can be rewritten as a system of equations. For their computer solutions well known methods are available ([254] and [175]).
The reduction of bimatrix games to mixed optimization problems is presented in the papers of Mills [183] and Shapiro [235]. The reduction to quadratic programming problem is given in [173].
The method of fictitious play is discussed in the paper of Robinson [215]. In order to use the Neumann method we have to solve a system of nonlinear ordinary differential equations. The Runge–Kutta method is the most popular procedure for doing it. It can be found in [254].
The paper of Rosen [216] introduces diagonally strictly concave games. The computer method to find the equilibria of -person concave games is introduced in Zuhovitsky et al. [283].
The different extensions and generalizations of the classical Cournot model can be found in the books of Okuguchi and Szidarovszky [196], [197]. The proof of Theorem 10.20 is given in [253]. For the proof of Lemma 10.58 see the monograph [197]. The bisection method is described in [254]. The paper [137] contains methods which are applicable to solve nonlinear complementarity problems. The solution of problem (10.80) is discussed in the book of Hadley [105].
The book of von Neumann and Morgenstern [193] is considered the classical textbook of game theory. There is a large variety of game theory textbooks (see for example [77]).
The recursive definition of the Fibonacci numbers is well-known: if is the Fibonacci number then
We are interested in an explicit form of the numbers for all natural numbers . Actually, the problem is to solve an equation where the unknown is given recursively, in which case the equation is called a recurrence equation. The solution can be considered as a function over natural numbers, because is defined for all . Such recurrence equations are also known as difference equations, but could be named as discrete differential equations for their similarities to differential equations.
Definition 11.1 A k order recurrence equation is an equation of the form
where has to be given in an explicit form.
For a unique determination of , initial values must be given. Usually these values are . These can be considered as initial conditions. In case of the equation for Fibonacci-numbers, which is of second order, two initial values must be given.
A sequence satisfying equation (11.1) and the corresponding initial conditions is called a particular solution. If all particular solutions of equation (11.1) can be obtained from the sequence , by a suitable choice of the constants , then this sequence is a general solution.
Solving recurrence equations is not an easy task. In this chapter we will discuss methods which can be used in special cases. For simplicity of writing we will use the notation instead of as it appears in several books (sequences can be considered as functions over natural numbers).
The chapter is divided into three sections. In Section 11.1 we deal with solving linear recurrence equations, in Section 11.2 with generating functions and their use in solving recurrence equations and in Section 11.3 we focus our attention on the numerical solution of recurrence equations.
If the recurrence equation is of the form
where are functions defined over natural numbers, , and has to be given explicitly, then the recurrence equation is linear. If is the zero function, then the equation is homogeneous, otherwise nonhomogeneous. If all the functions are constant, the equation is called a linear recurrence equation with constant coefficients.
Let the equation be
where are real constants, . If initial conditions are given (usually ), then the general solution of this equation can be uniquely given.
To solve the equation let us consider its characteristic equation
a polynomial equation with real coefficients. This equation has roots in the field of complex numbers. It can easily be seen after a simple substitution that if is a real solution of the characteristic equation, then is a solution of (11.2), for arbitrary .
The general solution of equation (11.2) is
where () are the linearly independent solutions of equation (11.2). The constants can be determined from the initial conditions by solving a system of equations.
The linearly independent solutions are supplied by the roots of the characteristic equation in the following way. A fundamental solution of equation (11.2) can be associated with each root of the characteristic equation. Let us consider the following cases.
Distinct real roots. Let be distinct real roots of the characteristic equation. Then
are solutions of equation (11.2), and
is also a solution, for arbitrary constants . If , then (11.4) is the general solution of the recurrence equation.
Example 11.1 Solve the recurrence equation
The corresponding characteristic equation is
with the solutions
These are distinct real solutions, so the general solution of the equation is
The constants and can be determined using the initial conditions. From , the following system of equations can be obtained.
The solution of this system of equations is . Therefore the general solution is
which is the th Fibonacci number .
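The closed form obtained in this example can be checked against the recursive definition. The sketch below assumes the usual initial values F(0) = 0 and F(1) = 1 and the characteristic equation r² = r + 1, which leads to the well-known Binet formula; floating-point arithmetic limits the check to moderate indices.

```python
import math


def fib_closed_form(n):
    """Binet's formula, rounded to the nearest integer (floats: small n only)."""
    sqrt5 = math.sqrt(5.0)
    phi = (1.0 + sqrt5) / 2.0      # the two roots of r**2 = r + 1
    psi = (1.0 - sqrt5) / 2.0
    return round((phi ** n - psi ** n) / sqrt5)


def fib_by_recurrence(n):
    """Iterate F(n+2) = F(n+1) + F(n) from F(0) = 0, F(1) = 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


if __name__ == "__main__":
    print([fib_closed_form(n) for n in range(10)])
    print([fib_by_recurrence(n) for n in range(10)])
```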
Multiple real roots. Let be a real root of the characteristic equation with multiplicity . Then
are solutions of equation (11.2) (fundamental solutions corresponding to ), and
is also a solution, for any constants . If the characteristic equation has no other solutions, then (11.5) is a general solution of the recurrence equation.
Example 11.2 Solve the recurrence equation
The characteristic equation is
with a solution with multiplicity 2. Then
is a general solution of the recurrence equation.
From the initial conditions we have
From this system of equations , so the general solution is
Distinct complex roots. If the complex number , written in trigonometric form, is a root of the characteristic equation, then its conjugate is also a root, because the coefficients of the characteristic equation are real numbers. Then
are solutions of equation (11.2) and
is also a solution, for any constants and . If these are the only solutions of a second order characteristic equation, then (11.6) is a general solution.
Example 11.3 Solve the recurrence equation
The corresponding characteristic equation is
with roots and . These can be written in trigonometric form as and . Therefore
is a general solution of the recurrence equation. From the initial conditions
Therefore . Hence the general solution is
Multiple complex roots. If the complex number written in trigonometric form as is a root of the characteristic equation with multiplicity , then its conjugate is also a root with multiplicity . Then
and
are solutions of the recurrence equation (11.2). Then
is also a solution, where are arbitrary constants, which can be determined from the initial conditions. This solution is general if the characteristic equation has no other roots.
Example 11.4 Solve the recurrence equation
The characteristic equation is
which can be written as . The complex numbers and are double roots. The trigonometric form of these are
respectively. Therefore the general solution is
From the initial conditions we obtain
that is
Solving this system of equations , , and . Thus the general solution is
Using these four cases all linear homogeneous equations with constant coefficients can be solved, if we can solve their characteristic equations.
Example 11.5 Solve the recurrence equation
The characteristic equation is
with roots 2, and . Therefore the general solution is
After determining the constants we obtain
The general solution. The characteristic equation of the th order linear homogeneous equation (11.2) has roots in the field of complex numbers, which are not necessarily distinct. Let these roots be the following:
real, with multiplicity (),
real, with multiplicity (),
real, with multiplicity (),
complex, with multiplicity (),
complex, with multiplicity (),
complex, with multiplicity ().
Since the equation has roots, .
In this case the general solution of equation (11.2) is
where
are constants, which can be determined from the initial conditions.
The above statements can be summarised in the following theorem.
Theorem 11.2 Let be an integer and real numbers with . The general solution of the linear recurrence equation (11.2) can be obtained as a linear combination of the terms , where are the roots of the characteristic equation (11.3) with multiplicity () and the coefficients of the linear combination depend on the initial conditions.
The proof of the theorem is left to the Reader (see Exercise 11.1-5).
The algorithm for the general solution is the following.
Linear-Homogeneous
1 determine the characteristic equation of the recurrence equation
2 find all roots of the characteristic equation with their multiplicities
3 find the general solution (11.7) based on the roots
4 determine the constants of (11.7) using the initial conditions, if these exist
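The four steps of Linear-Homogeneous can be sketched for the second-order case with real characteristic roots. The equation is written here as x(n+2) = p·x(n+1) + q·x(n), which is a notation chosen for the example and not necessarily the form of (11.2); the function name is hypothetical, and complex roots are deliberately left out.

```python
import math


def solve_second_order(p, q, x0, x1):
    """Closed form for x(n+2) = p*x(n+1) + q*x(n), q != 0, real roots only.

    Returns a function n -> x(n).
    """
    if q == 0:
        raise ValueError("q = 0: the equation is not of second order")
    # Steps 1-2: characteristic equation r**2 - p*r - q = 0 and its roots.
    disc = p * p + 4.0 * q
    if disc < 0:
        raise ValueError("complex roots: use the trigonometric form instead")
    if disc == 0:
        # Double root r: general solution (c1 + c2*n) * r**n.
        r = p / 2.0
        c1 = x0
        c2 = x1 / r - c1
        return lambda n: (c1 + c2 * n) * r ** n
    # Distinct real roots: general solution c1*r1**n + c2*r2**n.
    r1 = (p + math.sqrt(disc)) / 2.0
    r2 = (p - math.sqrt(disc)) / 2.0
    # Step 4: solve c1 + c2 = x0 and c1*r1 + c2*r2 = x1.
    c2 = (x1 - x0 * r1) / (r2 - r1)
    c1 = x0 - c2
    return lambda n: c1 * r1 ** n + c2 * r2 ** n


if __name__ == "__main__":
    fib = solve_second_order(1, 1, 0, 1)
    print([round(fib(n)) for n in range(10)])
    double = solve_second_order(4, -4, 1, 4)     # double root r = 2
    print([double(n) for n in range(5)])
```

Higher orders follow the same pattern, with a polynomial root finder in step 2 and a linear system of matching size in step 4.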
Consider the linear nonhomogeneous recurrence equation with constant coefficients
where are real constants, , and is not the zero function.
The corresponding linear homogeneous equation (11.2) can be solved using Theorem 11.2. If a particular solution of equation (11.8) is known, then equation (11.8) can be solved.
Theorem 11.3 Let be an integer, real numbers, . If is a particular solution of the linear nonhomogeneous equation (11.8) and is a general solution of the linear homogeneous equation (11.2), then
is a general solution of the equation (11.8).
The proof of the theorem is left to the Reader (see Exercise 11.1-6).
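The statement of Theorem 11.3 can be illustrated on a small made-up equation, x(n+1) = 3·x(n) + 2, which is not taken from the text. Its homogeneous part has the general solution C·3ⁿ, the constant sequence −1 is a particular solution because −1 = 3·(−1) + 2, and so the sum C·3ⁿ − 1 should reproduce every solution once C is fitted to the initial value.

```python
def general_solution(C, n):
    """Homogeneous solution C * 3**n plus the particular solution -1."""
    return C * 3 ** n - 1


def iterate(x0, n):
    """Apply x(n+1) = 3*x(n) + 2 exactly n times starting from x0."""
    x = x0
    for _ in range(n):
        x = 3 * x + 2
    return x


if __name__ == "__main__":
    x0 = 5
    C = x0 + 1                       # from x(0) = C - 1
    print([general_solution(C, n) for n in range(6)])
    print([iterate(x0, n) for n in range(6)])
```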
Example 11.6 Solve the recurrence equation
First we solve the homogeneous equation
and obtain the general solution
since the roots of the characteristic equation are and 1. It is easy to see that
is a solution of the nonhomogeneous equation. Therefore the general solution is
The constants and can be determined using the initial conditions. Thus,
A particular solution can be obtained using the method of variation of constants. However, there are cases when there is an easier way of finding a particular solution. In Figure 11.1 we can see types of functions , for which a particular solution can be obtained in the given form in the table. The constants can be obtained by substitutions.
In the previous example , so the first case can be used with and . Therefore we try to find a particular solution of the form . After substitution we obtain , thus the particular solution is
Exercises
11.1-1 Solve the recurrence equation
(Here is the optimal number of moves in the problem of the Towers of Hanoi.)
11.1-2 Analyse the problem of the Towers of Hanoi if discs have to be moved from stick to stick in such a way that no disc can be moved directly from to and vice versa.
Hint. Show that if the optimal number of moves is denoted by , and , then .
11.1-3 Solve the recurrence equation
11.1-4 Solve the linear nonhomogeneous recurrence equation
Hint. Try to find a particular solution of the form .
11.1-5 Prove Theorem 11.2.
11.1-6 Prove Theorem 11.3.
Generating functions can be used, among others, to solve recurrence equations, count objects (e.g. binary trees), prove identities and solve partition problems. Counting the number of objects can be done by stating and solving recurrence equations. These equations are usually not linear, and generating functions can help us in solving them.
Associate a series with the infinite sequence the following way
This is called the generating function of the sequence .
For example, in case of the Fibonacci numbers this generating function is
Multiplying both sides of the equation by , then by , we obtain
If we subtract the second and the third equations from the first one term by term, then use the defining formula of the Fibonacci numbers, we get
that is
The correctness of these operations can be proved mathematically, but here we do not want to go into details. The formulae obtained using generating functions can usually also be proved using other methods. Let us consider the following generating functions
The generating functions and are equal if and only if for all natural numbers.
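The closed form of the Fibonacci generating function can be sanity-checked by expanding it back into a power series. The sketch assumes the initial values F(0) = 0 and F(1) = 1, for which the subtraction described above gives F(z)(1 − z − z²) = z, that is, F(z) = z/(1 − z − z²); the helper name is hypothetical.

```python
def series_of_rational(num, den, terms):
    """First `terms` power-series coefficients of num(z) / den(z).

    num and den are coefficient lists in increasing powers of z, and
    den[0] must be 1. Uses sum_j den[j] * a[n-j] = num[n] for every n.
    """
    if den[0] != 1:
        raise ValueError("this sketch assumes den[0] == 1")
    a = []
    for n in range(terms):
        rhs = num[n] if n < len(num) else 0
        acc = sum(den[j] * a[n - j] for j in range(1, min(n, len(den) - 1) + 1))
        a.append(rhs - acc)
    return a


if __name__ == "__main__":
    # z / (1 - z - z**2)
    print(series_of_rational([0, 1], [1, -1, -1], 10))
```

Two generating functions are equal exactly when all coefficients agree, so comparing a finite prefix is only a plausibility check, not a proof.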
Now we define the following operations with the generating functions: addition, multiplication by real number, shift, multiplication, derivation and integration.
Addition and multiplication by real number.
Shift. The generating function
represents the sequence , while the generating function
represents the sequence .
Multiplication. If and are generating functions, then
where .
Special case. If for all natural numbers , then
If, in addition, for all , then
Derivation.
Example 11.8 After differentiating both sides of the generating function
we obtain
Integration.
After integrating both sides we get
Multiplying the above generating functions we obtain
where are the so-called harmonic numbers.
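The multiplication rule and the harmonic-number identity can be checked on truncated series. The sketch below assumes that the two series being multiplied are 1/(1 − z), with all coefficients equal to 1, and ln(1/(1 − z)), with coefficients 1/n for n ≥ 1; exact fractions avoid rounding. The function name is hypothetical.

```python
from fractions import Fraction


def multiply(a, b):
    """Cauchy product of two truncated series: c[n] = sum_k a[k] * b[n-k]."""
    terms = min(len(a), len(b))
    return [sum(a[k] * b[n - k] for k in range(n + 1)) for n in range(terms)]


N = 6
geometric = [1] * N                                        # 1/(1-z)
log_series = [0] + [Fraction(1, n) for n in range(1, N)]   # ln(1/(1-z))

if __name__ == "__main__":
    print(multiply(geometric, geometric))    # coefficients of 1/(1-z)**2
    print(multiply(geometric, log_series))   # partial sums of 1/n
```

Multiplying by 1/(1 − z) always turns a coefficient sequence into its sequence of partial sums, which is the special case of the product formula mentioned above.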
Changing the arguments. Let represent the sequence , then represents the sequence . The following statement holds
which can also be obtained by substituting by in . We can obtain the sum of the odd power terms in the same way,
Using generating functions we can obtain interesting formulae. For example, let . Then , which is the generating function of the Fibonacci numbers. From this
The coefficient of on the left-hand side is , that is the th Fibonacci number, while the coefficient of on the right-hand side is
after using the binomial formula in each term. Hence
Remember that the binomial formula can be generalised for all real , namely
which is the generating function of the binomial coefficients for a given . Here is a generalisation of the number of combinations for any real number , that is
We can obtain useful formulae using this generalisation for negative . Let
Since, by a simple computation, we get
the following formula can be obtained
Then
and
where is a natural number.
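Two of the identities of this part can be verified numerically. Since the displayed formulas are not reproduced here, the sketch relies on their standard forms, which are assumptions about what the text states: the Fibonacci number F(n+1) equals the sum of the binomial coefficients C(n − k, k), and the coefficient of zⁿ in 1/(1 − z)^(m+1) is C(n + m, m). The function names are hypothetical.

```python
from math import comb


def fib(n):
    """F(n) from F(0) = 0, F(1) = 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def fib_by_binomials(n):
    """F(n+1) written as the sum of C(n-k, k) over 0 <= k <= n/2."""
    return sum(comb(n - k, k) for k in range(n // 2 + 1))


def inverse_power_coeff(m, n):
    """Coefficient of z**n in 1/(1-z)**(m+1), for a natural number m."""
    return comb(n + m, m)


if __name__ == "__main__":
    print([fib_by_binomials(n) for n in range(10)])
    print([inverse_power_coeff(2, n) for n in range(6)])   # 1/(1-z)**3
```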
If the generating function of the general solution of a recurrence equation to be solved can be expanded in such a way that the coefficients are in closed form, then this method is successful. Let the recurrence equation be
To solve it, let us consider the generating function
If (11.14) can be written as and can be solved for , and if can then be expanded into a series in such a way that can be written in closed form, then equation (11.14) can be solved.
Now we give a general method for solving linear nonhomogeneous recurrence equations. After this we give three examples for the nonlinear case. In the first two examples the number of elements in some sets of binary trees, while in the third example the number of leaves of binary trees is computed. The corresponding recurrence equations (11.15), (11.17) and (11.18) will be solved using generating functions.
Multiply both sides of equation (11.8) by . Then
Summing up both sides of the equation term by term we get
Then
Let
The equation can be written as
This can be solved for . If is a rational fraction, then it can be decomposed into partial (elementary) fractions which, after expanding them into series, will give us the general solution of the original recurrence equation. We can also try to use the expansion into series in the case when the function is not a rational fraction.
Example 11.11 Solve the following equation using the above method
After multiplying and summing we have
and
Since , after decomposing the right-hand side into partial fractions, the solution of the equation is
After differentiating the generating function
term by term we get
Thus
therefore
Footnote. For decomposing the fraction into partial fractions we can use the Method of Undetermined Coefficients.
Let us denote by the number of binary trees with vertices. Then , , (see Figure 11.2). Let . (We will see later that this is a good choice.)
In a binary tree with vertices, there are altogether vertices in the left and right subtrees. If the left subtree has vertices and the right subtree has vertices, then there exist such binary trees. Summing over , we obtain exactly the number of binary trees. Thus for any natural number the recurrence equation in is
This can also be written as
Multiplying both sides by , then summing over all , we obtain
Let be the generating function of the numbers . The left-hand side of (11.16) is exactly (because ). The right-hand side looks like a product of two generating functions. To see which functions are in consideration, let us use the notation
Then the right-hand side of (11.16) is exactly , which is . Therefore
Solving this equation for gives
We have to choose the negative sign because . Thus
Therefore . The numbers are also called the Catalan numbers.
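The agreement between the recurrence and the closed form can be checked directly. The sketch below uses the convolution recurrence described above, with the convention that there is one empty tree, and compares it with the standard closed form of the Catalan numbers, C(2n, n)/(n + 1); the function names are hypothetical.

```python
from math import comb


def binary_tree_counts(n_max):
    """b[n] = number of binary trees with n vertices, for n = 0..n_max."""
    b = [1]                                   # b[0] = 1: the empty tree
    for n in range(n_max):
        # a root plus a left subtree of k vertices and a right one of n-k
        b.append(sum(b[k] * b[n - k] for k in range(n + 1)))
    return b


def catalan(n):
    """Closed form C(2n, n) / (n + 1)."""
    return comb(2 * n, n) // (n + 1)


if __name__ == "__main__":
    print(binary_tree_counts(8))
    print([catalan(n) for n in range(9)])
```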
Remark. In the previous computation we used the following formula that can be proved easily
Let us count the number of leaves (vertices with degree 1) in the set of all binary trees of vertices. Denote this number by . We remark that the root is not considered a leaf even if it is of degree 1. It is easy to see that , . Let and , conventionally. Later we will see that this choice of the initial values is appropriate.
As in the case of numbering the binary trees, consider the binary trees of vertices having vertices in the left subtree and vertices in the right subtree. There are such left subtrees and right subtrees. If we consider such a left subtree and all such right subtrees, then together there are leaves in the right subtrees. So for a given there are leaves. After summing we have
By an easy computation we get
This is a recurrence equation, with solution . Let
Multiplying both sides of (11.17) by and summing gives
Since and ,
Thus
and since
we have
After the computations
and
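The leaf count can be cross-checked by evaluating the counting argument directly. The sketch below is an illustration under stated assumptions, not a transcription of the formulas of the text: it assumes the recurrence $f_n = 2\sum_{k=0}^{n-1} b_{n-1-k} f_k$ for $n \ge 2$ (every leaf lies either in the left or in the right subtree, and the two cases are symmetric), the conventional initial values $f_0 = 0$, $f_1 = 1$, and $b_n$ denoting the number of binary trees with $n$ vertices. The comparison with $\binom{2n-2}{n-1}$ is an observation about the computed values.

```python
from math import comb


def count_leaves(n_max):
    """Return [f_0, ..., f_n_max], the total number of leaves over all binary trees."""
    b = [1]  # b[n]: number of binary trees with n vertices, b_0 = 1
    for n in range(1, n_max + 1):
        b.append(sum(b[k] * b[n - 1 - k] for k in range(n)))
    f = [0, 1]  # conventional initial values f_0 = 0, f_1 = 1
    for n in range(2, n_max + 1):
        # leaves of the subtree with k vertices, combined with every possible
        # other subtree on n - 1 - k vertices; the factor 2 covers left and right
        f.append(2 * sum(b[n - 1 - k] * f[k] for k in range(n)))
    return f[: n_max + 1]


leaves = count_leaves(8)
print(leaves[1:6])  # [1, 2, 6, 20, 70]
print(all(leaves[n] == comb(2 * n - 2, n - 1) for n in range(1, 9)))  # True
```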
A bit harder problem: how many binary trees are there with vertices and leaves? Let us denote this number by . It is easy to see that , if . By a simple reasoning the case can be solved. The result is for any natural number . Let , conventionally. We will see later that this choice of the initial value is appropriate. Let us consider, as in case of the previous problems, the left and right subtrees. If the left subtree has vertices and leaves, then the right subtree has vertices and leaves. The number of these trees is . Summing over and gives
For solving this recurrence equation the generating function
will be used. Multiplying both sides of equation (11.18) by and summing over , we get
Changing the order of summation gives
Thus
or
Step by step, we can write the following:
Let us try to find the solution in the form
where , , . Substituting in (11.19) gives a recursion for the numbers
We solve this equation using the generating function method. If , then , and so . Let . If is the generating function of the numbers , then, using the formula of multiplication of the generating functions we obtain
thus
Since , only the negative sign can be chosen. After expanding the generating function we get
From this
Since for , it can be proved easily that . Thus
Using the formula
therefore
Thus
or
When solving linear nonhomogeneous equations using generating functions, the solution is usually obtained by expanding a rational fraction. The Z-transform method can help us in expanding such a function. Let $P(z)/Q(z)$ be a rational fraction, where the degree of $P(z)$ is less than the degree of $Q(z)$. If the roots of the denominator are known, the rational fraction can be expanded into partial fractions using the Method of Undetermined Coefficients.
Let us first consider the case when the denominator has distinct roots . Then
It is easy to see that
But
where . Now, by expanding this partial fraction, we get
Denote the coefficient of by , then , so
or
After the transformation and using we obtain
where
Thus in the expansion of the coefficient of is
If is a root of the polynomial , then is a root of . E.g. if
In case of multiple roots, e.g. if has multiplicity , its contribution to the solution is
Here is the derivative of order of the function .
All these can be summarised in the following algorithm. Suppose that the coefficients of the equation are in array , and the constants of the solution are in array .
Linear-Nonhomogeneous(
)
1 let be the equation, where
is a rational fraction; multiply both sides by , and sum over all
2 transform the equation into the form , where
, and are polynomials
3 use the transformation , and let the result be
, where and are polynomials
4 denote the roots of by
, with multiplicity , ,
, with multiplicity , ,
, with multiplicity , ;
then the general solution of the original equation is
, where
.
5 RETURN
If we substitute by in the generating function, the result is the so-called Z-transform, for which similar operations can be defined as for the generating functions. The residue theorem for the Z-transform gives the same result. The name of the method is derived from this observation.
Example 11.12 Solve the recurrence equation
Multiplying both sides by and summing we obtain
or
Thus
After the transformation we get
where the roots of the denominator are 1 with multiplicity 1 and 2 with multiplicity 2. Thus
Therefore the general solution is
Example 11.13 Solve the recurrence equation
Multiplying by and summing gives
so
that is
Then
The roots of the denominator are and . Let us compute and :
Since
raising to the th power gives
Exercises
11.2-1 How many binary trees are there with vertices and no empty left and right subtrees?
11.2-2 How many binary trees are there with vertices, in which each vertex which is not a leaf, has exactly two descendants?
11.2-3 Solve the following recurrent equation using generating functions.
( is the number of moves in the problem of the Towers of Hanoi.)
11.2-4 Solve the following recurrent equation using the Z-transform method.
11.2-5 Solve the following system of recurrence equations:
where .
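Using the following function we can solve linear recurrence equations numerically. The equation is given in the form
$$a_0 x_n + a_1 x_{n+1} + \cdots + a_k x_{n+k} = f(n),$$
where $a_0, a_k \neq 0$, $k \ge 1$. The coefficients $a_0, a_1, \ldots, a_k$ are kept in array $A$, the initial values $x_0, x_1, \ldots, x_{k-1}$ in array $X$. To find $x_n$ we compute step by step the values $x_k, x_{k+1}, \ldots, x_n$, keeping the previous $k$ values of the sequence in the first $k$ positions of $X$ (i.e. in the positions with indices $0, 1, \ldots, k-1$).
Recurrence($A, X, k, n, f$)
 1 FOR $j \leftarrow k$ TO $n$
 2   DO $v \leftarrow A[0] \cdot X[0]$
 3     FOR $i \leftarrow 1$ TO $k-1$
 4       DO $v \leftarrow v + A[i] \cdot X[i]$
 5     $v \leftarrow (f(j-k) - v)/A[k]$
 6     IF $j \neq n$
 7       THEN FOR $i \leftarrow 0$ TO $k-2$
 8              DO $X[i] \leftarrow X[i+1]$
 9            $X[k-1] \leftarrow v$
10 RETURN $v$
Lines 2–5 compute the values $x_j$ ($j = k, k+1, \ldots, n$) (using the previous $k$ values), denoted by $v$ in the algorithm. In lines 7–9, if $n$ is not yet reached, we copy the last $k$ values into the first $k$ positions of $X$. In line 10 $x_n$ is obtained. It is easy to see that the computation time is $\Theta(kn)$, if we disregard the time to compute the values of the function.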
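The procedure can be sketched in Python as follows. This is a minimal illustration, not the book's code: it assumes the equation has the form $a_0 x_m + a_1 x_{m+1} + \cdots + a_k x_{m+k} = f(m)$ with $a_k \neq 0$, and the names `recurrence`, `a`, `x_init` and `f` are chosen for this example only.

```python
def recurrence(a, x_init, n, f):
    """Compute x_n for a[0]*x_m + a[1]*x_{m+1} + ... + a[k]*x_{m+k} = f(m).

    a      -- coefficients a_0..a_k (a[k] must be nonzero)
    x_init -- initial values x_0..x_{k-1}
    f      -- right-hand side as a function of m
    """
    k = len(a) - 1
    if k < 1 or len(x_init) != k or a[k] == 0:
        raise ValueError("need k >= 1, k initial values and a nonzero leading coefficient")
    x = list(x_init)  # sliding window holding the last k values
    if n < k:
        return x[n]
    v = None
    for j in range(k, n + 1):
        # x_j follows from the equation written for m = j - k
        v = (f(j - k) - sum(a[i] * x[i] for i in range(k))) / a[k]
        if j != n:
            x = x[1:] + [v]  # shift the window by one position
    return v


# Fibonacci numbers: x_{m+2} - x_{m+1} - x_m = 0, x_0 = 0, x_1 = 1
print(recurrence([-1, -1, 1], [0, 1], 10, lambda m: 0))  # 55.0
# Towers of Hanoi: h_{m+1} - 2 h_m = 1, h_0 = 0
print(recurrence([-2, 1], [0], 5, lambda m: 1))  # 31.0
```

Only the last $k$ values are stored, so the memory use is independent of $n$, in line with the description above.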
Exercises
11.3-1 How many additions, subtractions, multiplications and divisions are required by the algorithm Recurrence while it computes using the data given in Example 11.4?
PROBLEMS
11-1
Existence of a solution of a homogeneous equation using generating function
Prove that a linear homogeneous equation cannot be solved using generating functions (because is obtained) if and only if for all .
11-2
Complex roots in case of Z-transform
What happens if the roots of the denominator are complex when applying the Z-transform method? The solution of the recurrence equation must be real. Does the method ensure this?
CHAPTER NOTES
Recurrence equations are discussed in detail by Agarwal [1], Elaydi [69], Flajolet and Sedgewick [230], Greene and Knuth [99], Mickens [180], and also in the recent books by Drmota [67] and by Flajolet and Sedgewick [75]. Knuth [144] and Graham, Knuth and Patashnik [98] deal with generating functions. In the book of Vilenkin [263] there are a lot of simple and interesting problems about recurrences and generating functions.
In [167] Lovász also presents problems on generating functions. Counting the binary trees is from Knuth [144], counting the leaves in the set of all binary trees and counting the binary trees with vertices and leaves are from Zoltán Kása [141].
This title refers to a fast developing interdisciplinary area between mathematics, computers and applications. The subject is also often called Computational Science and Engineering. Its aim is the efficient use of computer algorithms to solve engineering and scientific problems. One can say with a certain simplification that our subject is related to numerical mathematics, software engineering, computer graphics and applications. Here we can deal only with some basic elements of the subject such as the fundamentals of floating point computer arithmetic, error analysis, the basic numerical methods of linear algebra and related mathematical software.
Let be the exact value and let be an approximation of (). The error of the approximation is defined by the formula (sometimes with opposite sign). The quantity is called an (absolute) error (bound) of approximation , if . For example, the error of the approximation is at most . In other words, the error bound of the approximation is . The quantities and (and accordingly and ) may be vectors or matrices. In such cases the absolute value and relation operators must be understood componentwise. We also measure the error by using matrix and vector norms. In such cases, the quantity is an error bound, if the inequality holds.
The absolute error bound can be irrelevant in many cases. For example, an approximation with error bound has no value in estimating a quantity of order . The goodness of an approximation is measured by the relative error ( for vectors and matrices), which compares the error bound to the approximated quantity. Since the exact value is generally unknown, we use the approximate relative error (). The committed error is proportional to the quantity , which can be neglected, if the absolute value (norm) of and is much greater than . The relative error is often expressed in percentages.
In practice, the (absolute) error bound is used as a substitute for the generally unknown true error.
In the classical error analysis we assume input data with given error bounds, exact computations (operations) and seek for the error bound of the final result. Let and be exact values with approximations and , respectively. Assume that the absolute error bounds of approximations and are and , respectively. Using the classical error analysis approach we obtain the following error bounds for the four basic arithmetic operations:
We can see that division by a number close to $0$ can make the absolute error arbitrarily large. Similarly, if the result of a subtraction is close to $0$, then its relative error can become arbitrarily large. One has to avoid these cases. The subtraction operation in particular can be quite dangerous.
Example 12.1 Calculate the quantity with approximations and whose common absolute and relative error bounds are and , respectively. One obtains the approximate value , whose relative error bound is
that is . The true relative error is about . Yet it is too big, since it is approximately times bigger than the relative error of the initial data. We can avoid the subtraction operation by using the following trick
Here the numerator is exact, while the absolute error of the denominator is . Hence the relative error (bound) of the quotient is about . The latter result is in agreement with the relative error of the initial data and it is substantially smaller than the one obtained with the direct subtraction operation.
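The effect of cancellation is easy to reproduce. The sketch below is an illustration with assumed data, not necessarily the numbers of the example: it takes the difference $\sqrt{1996} - \sqrt{1995}$, rounds both square roots to four significant digits, and compares direct subtraction with the algebraically equivalent quotient $1/(\sqrt{1996} + \sqrt{1995})$. The helper `round_sig` is a name introduced here.

```python
from math import sqrt


def round_sig(value, digits):
    """Round value to the given number of significant decimal digits."""
    return float(f"{value:.{digits}g}")


x, y = 1996.0, 1995.0
a = round_sig(sqrt(x), 4)  # approximation of sqrt(1996)
b = round_sig(sqrt(y), 4)  # approximation of sqrt(1995)

# Reference value computed in double precision without cancellation.
reference = (x - y) / (sqrt(x) + sqrt(y))

direct = a - b              # subtraction of two nearly equal numbers
stable = (x - y) / (a + b)  # exact numerator, no cancellation in the denominator

rel_direct = abs(direct - reference) / abs(reference)
rel_stable = abs(stable - reference) / abs(reference)
print(f"direct subtraction: relative error {rel_direct:.1e}")
print(f"rewritten quotient: relative error {rel_stable:.1e}")
```

With these inputs the direct difference loses most of its significant digits, while the rewritten form keeps a relative error of the same order as that of the rounded inputs.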
The first order error terms of twice differentiable functions can be obtained by their first order Taylor polynomial:
The numerical sensitivity of functions at a given point is characterised by the condition number, which is the ratio of the relative errors of approximate function value and the input data (the Jacobian matrix of functions is denoted by at the point ):
We can consider the condition number as the magnification number of the input relative error. Therefore the function is considered numerically stable (or well-conditioned) at the point , if is “small”. Otherwise is considered numerically unstable (ill-conditioned). The condition number depends on the point . A function can be well-conditioned at point , while it is ill-conditioned at point . The term “small” is relative. It depends on the problem, the computer and the required precision.
The condition number of matrices can be defined as the upper bound of a function condition number. Let us define the mapping by the solution of the equation (, ), that is, let . Then and
The upper bound of the right side is called the condition number of the matrix . This bound is sharp, since there exists a vector such that .
Let us investigate the calculation of the function value $f(x)$. If we calculate the approximation $\hat{y}$ instead of the exact value $y = f(x)$, then the forward error is $\Delta y = \hat{y} - y$. If for a value $x + \Delta x$ the equality $\hat{y} = f(x + \Delta x)$ holds, that is, $\hat{y}$ is the exact function value of the perturbed input data $\hat{x} = x + \Delta x$, then $\Delta x$ is called the backward error. The connection of the two concepts is shown in Figure 12.1.
The continuous line shows the exact value, while the dashed one indicates the computed value. The analysis of the backward error is called backward error analysis. If there exists more than one backward error, then the estimation of the smallest one is the most important.
An algorithm for computing the value $y = f(x)$ is called backward stable, if for any $x$ it gives a computed value $\hat{y}$ with small backward error $\Delta x$. Again, the term “small” is relative to the problem environment.
The connection of the forward and backward errors is described by the approximate rule of thumb
which means that
This inequality indicates that the computed solution of an ill-conditioned problem may have a big relative forward error. An algorithm is said to be forward stable if the forward error is small. A forward stable method is not necessarily backward stable. If the forward error and the condition number are small, then the algorithm is forward stable.
Example 12.2 Consider the function the condition number of which is . For the condition number is big. Therefore the relative forward error is big for .
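The behaviour described in the example can be explored numerically. The sketch below assumes the scalar form $c(f, x) = |x f'(x) / f(x)|$ of the condition number and uses $f(x) = \ln x$ purely as an illustrative choice; the function name `condition_number` is introduced for this example.

```python
import math


def condition_number(f, df, x):
    """Relative condition number |x * f'(x) / f(x)| of a scalar function at x."""
    return abs(x * df(x) / f(x))


def dlog(t):
    """Derivative of the natural logarithm."""
    return 1.0 / t


near_one = condition_number(math.log, dlog, 1.001)
far_from_one = condition_number(math.log, dlog, 100.0)
print(f"c(ln, 1.001) = {near_one:.1f}")
print(f"c(ln, 100)   = {far_from_one:.3f}")

# A relative input perturbation of 1e-8 is magnified by about the condition number.
x, rel_in = 1.001, 1e-8
rel_out = abs(math.log(x * (1 + rel_in)) - math.log(x)) / abs(math.log(x))
print(f"observed magnification near 1: {rel_out / rel_in:.1f}")
```

Near $x = 1$ the logarithm is close to zero, so a tiny relative change of the input produces a relative change of the result that is about a thousand times larger, while far from $1$ the same function is well-conditioned.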
The classical error analysis investigates only the effects of the input data errors and assumes exact arithmetic operations. Digital computers, however, represent numbers with a finite number of digits; the arithmetic computations are carried out on the elements of a finite set $F$ of such numbers, and the results of operations must also belong to $F$. Hence the computer representation of the numbers may add further errors to the input data, and the results of arithmetic operations may also be subject to further rounding. If the result of an operation belongs to $F$, then we have the exact result. Otherwise we have three cases:
(i) rounding to representable (nonzero) number;
(ii) underflow (rounding to $0$);
(iii) overflow (in case of results whose moduli are too large).
Most scientific and engineering calculations are done in floating point arithmetic, whose generally accepted model is the following:
Definition 12.1 The set of floating point numbers is given by
where
– is the base (or radix) of the number system,
– is the mantissa in the number system with base ,
– is the exponent,
– is the length of mantissa (the precision of arithmetic),
– is the smallest exponent (underflow exponent),
– is the biggest exponent (overflow exponent).
The parameters of the three most often used number systems are indicated in the following table
The mantissa can be written in the form
We can observe that the condition $1/\beta \le m < 1$ implies the inequality $1 \le d_1 \le \beta - 1$ for the first digit $d_1$. The remaining digits must satisfy $0 \le d_i \le \beta - 1$ ($i = 2, \ldots, t$). Such arithmetic systems are called normalized. The zero digit and the dot are not represented. If $\beta = 2$, then the first digit is $1$, which is also unrepresented. Using the representation (12.2) we can give the set $F$ in the form
where and .
Example 12.3 The set contains elements and its positive elements are given by
The elements of are not equally distributed on the real line. The distance of two consecutive numbers in is . Since the elements of are of the form , the distance of two consecutive numbers in is changing with the exponent. The maximum distance of two consecutive floating point numbers is , while the minimum distance is .
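A small floating point set can be enumerated explicitly to see the uneven spacing. The sketch below is an illustration under assumptions: it takes the normalized model in which the mantissa has $t$ digits in base $\beta$ with a nonzero leading digit, and it uses the toy parameters $\beta = 2$, $t = 3$ and exponents from $-1$ to $2$. The function name and the parameter names `e_min`, `e_max` are chosen here.

```python
from fractions import Fraction


def floating_point_set(beta, t, e_min, e_max):
    """All numbers +-m * beta**e with a normalized t-digit mantissa, plus zero."""
    numbers = {Fraction(0)}
    # normalized mantissas m = k / beta**t satisfy 1/beta <= m < 1
    for k in range(beta ** (t - 1), beta ** t):
        m = Fraction(k, beta ** t)
        for e in range(e_min, e_max + 1):
            value = m * Fraction(beta) ** e
            numbers.add(value)
            numbers.add(-value)
    return sorted(numbers)


F = floating_point_set(2, 3, -1, 2)
positive = [v for v in F if v > 0]
gaps = [hi - lo for lo, hi in zip(positive, positive[1:])]

print(len(F))                        # 33
print(positive[0], positive[-1])     # 1/4 7/2
print(min(gaps), max(gaps))          # 1/16 1/2
```

Exact rational arithmetic is used so that the enumeration itself is free of rounding. The gap between neighbours doubles each time the exponent increases by one, which is the changing distance described above.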
For the mantissa we have , since
Using this observation we can easily prove the following result on the range of floating point numbers.
Theorem 12.2 If , , then , where
Let and denote any of the four arithmetic operations . The following cases are possible:
(1) (exact result),
(2) (arithmetic overflow),
(3) (arithmetic underflow),
(4) , (not representable result).
In the last two cases the floating point arithmetic is rounding the result to the nearest floating point number in . If two consecutive floating point numbers are equally distant from , then we generally round to the greater number. For example, in a five digit decimal arithmetic, the number is rounded to the number .
Let . It is clear that . Let . The denotes an element of nearest to . The mapping is called rounding. The quantity is called the rounding error. If , then the rounding error is at most . The quantity is called the unit roundoff. The quantity is the relative error bound of .
Proof. Without loss of generality we can assume that . Let , be two consecutive numbers such that
Either or holds. Since holds in both cases, we have
either or . It follows that
Hence , where . A simple arrangement yields
Since , we proved the claim.
Thus we proved that the relative error of the rounding is bounded in floating point arithmetic and the bound is the unit roundoff .
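Another quantity used to measure the rounding errors is the so-called machine epsilon $\epsilon_M$. The number $\epsilon_M$ is the distance between $1$ and its nearest neighbour greater than $1$. The following algorithm determines $\epsilon_M$ in the case of binary base.
Machine-Epsilon
 1 $x \leftarrow 1$
 2 WHILE $1 + x > 1$
 3   DO $x \leftarrow x/2$
 4 $\epsilon_M \leftarrow 2x$
 5 RETURN $\epsilon_M$
In the MATLAB system $\epsilon_M \approx 2.2204 \times 10^{-16}$.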
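The halving loop can be tried directly in Python, whose `float` type is an IEEE double on common platforms. The sketch below is an illustration of the idea (halve a trial value until adding it to 1 no longer changes the result); the function name is chosen here, and the comparison with `sys.float_info.epsilon` assumes round-to-nearest double arithmetic.

```python
import sys


def machine_epsilon():
    """Distance between 1.0 and the next larger float, found by repeated halving."""
    x = 1.0
    while 1.0 + x > 1.0:
        x /= 2.0
    # the loop stops at the first x with 1 + x == 1, one halving too far
    return 2.0 * x


eps = machine_epsilon()
print(eps)
print(eps == sys.float_info.epsilon)  # True for IEEE double precision
```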
For the results of floating point arithmetic operations we assume the following (standard model):
The IEEE arithmetic standard satisfies this assumption. It is an important consequence of the assumption that for the relative error of arithmetic operations satisfies
Hence the relative error of the floating point arithmetic operations is small.
There exist computer floating point arithmetics that do not comply with the standard model (12.4). The usual reason for this is that the arithmetic lacks a guard digit in subtraction. For simplicity we investigate the subtraction in a three digit binary arithmetic. In the first step we equate the exponents:
If the computation is done with four digits, the result is the following
from which the normalized result is . Observe that the subtracted number is unnormalised. The temporary fourth digit of the mantissa is called a guard digit. Without a guard digit the computations are the following:
Hence the normalized result is with a relative error of . Several CRAY computers and pocket calculators lack guard digits.
Without the guard digit the floating point arithmetic operations satisfy only the weaker conditions
Assume that we have a guard digit and the arithmetic complies with standard model (12.4). Introduce the following notations:
The following results hold:
where denotes the error (matrix) of the actual operation.
The standard floating point arithmetics have many special properties. It is an important property that the addition is not associative because of the rounding.
Example 12.4 If , , then using MATLAB and AT386 type PC we obtain
We can have a similar result on Pentium1 machine with the choice .
The example also indicates that different (numerical) processors may produce different computational results for the same calculations. The commutativity can also be lost in addition. Consider the computation of the sum $\sum_{i=1}^{n} x_i$. The usual algorithm is the recursive summation.
Recursive-Summation($n, x$)
 1 $s \leftarrow 0$
 2 FOR $i \leftarrow 1$ TO $n$
 3   DO $s \leftarrow s + x_i$
 4 RETURN $s$
for . The recursive summation algorithm (and MATLAB) gives the result
If the summation is done in the reverse (increasing) order, then the result is
If the two values are compared with the exact result , then we can see that the second summation gives a better result. In this case the sum of smaller numbers gives significant digits to the final result unlike in the first case.
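The dependence on the order of summation can be reproduced with data chosen so that every rounding step is predictable. The sketch below is an illustration with assumed data (one term equal to 1 and 1024 terms equal to $2^{-53}$), not the series of the example; `recursive_sum` is a name introduced here for plain left-to-right summation in double precision.

```python
def recursive_sum(values):
    """Plain left-to-right summation."""
    s = 0.0
    for v in values:
        s += v
    return s


small = [2.0 ** -53] * 1024        # their exact sum is 2**-43
large_first = [1.0] + small        # decreasing order
small_first = small + [1.0]        # increasing order

print(recursive_sum(large_first))  # 1.0: every small term is rounded away
print(recursive_sum(small_first) == 1.0 + 2.0 ** -43)  # True: the exact sum
```

When the large term comes first, each addition of $2^{-53}$ is lost to rounding. When the small terms are accumulated first, their sum is large enough to survive the final addition.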
The last example indicates that the summation of a large number of data varying in modulus and sign is a complicated task. The following algorithm of W. Kahan is one of the most interesting procedures to solve the problem.
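Compensated-Summation($n, x$)
 1 $s \leftarrow 0$
 2 $e \leftarrow 0$
 3 FOR $i \leftarrow 1$ TO $n$
 4   DO $t \leftarrow s$
 5     $y \leftarrow x_i + e$
 6     $s \leftarrow t + y$
 7     $e \leftarrow (t - s) + y$
 8 RETURN $s$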
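Compensated summation can be sketched in Python as follows. This is an illustration of the commonly published form of Kahan's idea (a running correction term recovers the low-order part lost in each addition); the function names and the test data are chosen for this example, and `math.fsum` from the standard library serves only as an accurate reference.

```python
import math


def naive_sum(values):
    """Plain left-to-right summation."""
    s = 0.0
    for v in values:
        s += v
    return s


def compensated_sum(values):
    """Kahan-style summation with a running error correction e."""
    s = 0.0
    e = 0.0
    for v in values:
        t = s
        y = v + e          # add the correction carried over from the last step
        s = t + y
        e = (t - s) + y    # part of y that was lost when forming s
    return s


data = [1.0] + [1e-16] * 100_000
reference = math.fsum(data)

print(naive_sum(data))                              # 1.0
print(abs(compensated_sum(data) - reference) < 1e-15)  # True
```

The naive loop never leaves 1.0 because each term is below half the spacing of floats near 1, whereas the corrected loop agrees with the reference to within a few units in the last place.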
The ANSI/IEEE Standard 754-1985 of a binary ($\beta = 2$) floating point arithmetic system was published in 1985. The standard specifies the basic arithmetic operations, comparisons, rounding modes, the arithmetic exceptions and their handling, and conversion between the different arithmetic formats. The square root is included as a basic operation. The standard does not deal with the exponential and transcendental functions. The standard defines two main floating point formats:
In both formats one bit is reserved as a sign bit. Since the floating point numbers are normalized and the first digit is always , this bit is not stored. This hidden bit is denoted by the “ ” in the table.
The arithmetic standard contains the handling of arithmetic exceptions.
(The numbers of the form , are called subnormal numbers.) The IEEE arithmetic is a closed system. Every arithmetic operation has a result, whether it is expected mathematically or not. The exceptional operations raise a signal and continue. The arithmetic standard conforms with the standard model (12.4).
The first hardware implementation of the IEEE standard was the Intel 8087 mathematical coprocessor. Since then it is generally accepted and used.
Remark. In single precision we have about 7 significant digits of precision in the decimal system. For double precision we have approximately 16 decimal digits of precision. There also exists an extended precision format of 80 bits, where $t = 64$ and the exponent has $15$ bits.
Exercises
12.1-1 The measured values of two resistors are and . We connect the two resistors parallel and obtain the circuit resistance . Calculate the relative error bounds of the initial data and the approximate value of the resistance . Evaluate the absolute and relative error bounds and , respectively in the following three ways:
(i) Estimate first using only the absolute error bounds of the input data, then estimate the relative error bound .
(ii) Estimate first the relative error bound using only the relative error bounds of the input data, then estimate the absolute error bound .
(iii) Consider the circuit resistance as a two variable function .
12.1-2 Assume that is calculated with the absolute error bound . The following two expressions are theoretically equal:
(i)
(ii) .
Which expression can be calculated with less relative error and why?
12.1-3 Consider the arithmetic operations as two variable functions of the form , where .
(i) Derive the error bounds of the arithmetic operations from the error formula of two variable functions.
(ii) Derive the condition numbers of these functions. When are they ill-conditioned?
(iii) Derive error bounds for the power function assuming that both the base and the exponent have errors. What is the result if the exponent is exact?
(iv) Let , and . Determine the smallest and the greatest value of as a function of such that the relative error bound of should be at most .
12.1-4 Assume that the number () is calculated in a 24 bit long mantissa and the exponential function is also calculated with 24 significant bits. Estimate the absolute error of the result. Estimate the relative error without using the actual value of .
12.1-5 Consider the floating point number set and show that
(i) Every arithmetic operation can result in arithmetic overflow;
(ii) Every arithmetic operation can result in arithmetic underflow.
12.1-6 Show that the following expressions are numerically unstable for :
(i)
(ii) ;
(iii) .
Calculate the values of the above expressions for and estimate the error. Manipulate the expressions into numerically stable ones and estimate the error as well.
12.1-7 How many elements does the set have? How many subnormal numbers can we find?
12.1-8 If , then and equality holds if and only if . Is it true numerically? Check the inequality experimentally for various data (small and large numbers, numbers close to each other or different in magnitude).
The general form of linear algebraic systems with unknowns and equations is given by
This system can be written in the more compact form
where
The system is called underdetermined if $m < n$. For $m > n$, the system is called overdetermined. Here we investigate only the case $m = n$, when the coefficient matrix $A$ is square. We also assume that the inverse matrix $A^{-1}$ exists (or equivalently $\det(A) \neq 0$). Under this assumption the linear system $Ax = b$ has exactly one solution: $x = A^{-1}b$.
Definition 12.4 The matrix $A = [a_{ij}]_{i,j=1}^{n}$ is upper triangular if $a_{ij} = 0$ for all $i > j$. The matrix $A$ is lower triangular if $a_{ij} = 0$ for all $i < j$.
For example the general form of the upper triangular matrices is the following:
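We note that the diagonal matrices are both lower and upper triangular. It is easy to show that $\det(A) = a_{11} a_{22} \cdots a_{nn}$ holds for upper or lower triangular matrices. It is easy to solve linear systems with triangular coefficient matrices. Consider the following upper triangular linear system:
$$\begin{aligned} a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n &= b_1 \\ a_{22}x_2 + \cdots + a_{2n}x_n &= b_2 \\ &\ \ \vdots \\ a_{nn}x_n &= b_n \end{aligned}$$
This can be solved by the so-called back substitution algorithm.
Back-Substitution($A, b, n$)
 1 $x_n \leftarrow b_n / a_{nn}$
 2 FOR $i \leftarrow n-1$ DOWNTO $1$
 3   DO $x_i \leftarrow \left(b_i - \sum_{j=i+1}^{n} a_{ij} x_j\right) / a_{ii}$
 4 RETURN $(x_1, \ldots, x_n)$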
The solution of lower triangular systems is similar.
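Back substitution can be sketched in Python as follows. This is a minimal illustration with zero-based indices and nested lists as the matrix format; both are choices of this example, as are the function name and the test system.

```python
def back_substitution(u, b):
    """Solve U x = b for an upper triangular matrix U with nonzero diagonal."""
    n = len(b)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # last equation first
        if u[i][i] == 0:
            raise ZeroDivisionError("zero diagonal entry: the matrix is singular")
        # subtract the contribution of the already computed unknowns
        s = sum(u[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / u[i][i]
    return x


u = [[2.0, 1.0, -1.0],
     [0.0, 3.0, 2.0],
     [0.0, 0.0, 4.0]]
b = [1.0, 12.0, 12.0]
print(back_substitution(u, b))  # [1.0, 2.0, 3.0]
```

For a lower triangular system the same loop runs from the first equation to the last and sums over the columns to the left of the diagonal.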
The Gauss method. The Gauss method or Gaussian elimination (GE) consists of two phases:
I. The linear system is transformed to an equivalent upper triangular system using elementary operations (see Figure 12.2).
II. The obtained upper triangular system is then solved by the back substitution algorithm.
The first phase is often called the elimination or forward phase. The second phase of GE is called the backward phase. The elementary operations are of the following three types:
1. Add a multiple of one equation to another equation.
2. Interchange two equations.
3. Multiply an equation by a nonzero constant.
The elimination phase of GE is based on the following observation. Multiply equation by and subtract it from equation :
If , then by choosing , the coefficient of becomes in the new equivalent equation, which replaces equation . Thus we can eliminate variable (or coefficient ) from equation .
The Gauss method eliminates the coefficients (variables) under the main diagonal of $A$ in a systematic way. First variable $x_1$ is eliminated from equations $i = 2, \ldots, n$ using equation $1$, then $x_2$ is eliminated from equations $i = 3, \ldots, n$ using equation $2$, and so on.
Assume that the unknowns are eliminated in the first columns under the main diagonal and the resulting linear system has the form
If , then multiplying row by and subtracting it from equation we obtain
Since for , we eliminated the coefficient (variable ) from equation . Repeating this process for we can eliminate the coefficients under the main diagonal entry . Next we denote by the element of matrix and by the vector . The Gauss method has the following form (where the pivoting discussed later is also included):
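Gauss-Method($A, b$)
 1 ▷ Forward phase:
 2 $n \leftarrow rows[A]$
 3 FOR $k \leftarrow 1$ TO $n-1$
 4   DO {pivoting and interchange of rows and columns}
 5     FOR $i \leftarrow k+1$ TO $n$
 6       DO $\gamma_{ik} \leftarrow A[i,k]/A[k,k]$
 7         $A[i, k+1:n] \leftarrow A[i, k+1:n] - \gamma_{ik} * A[k, k+1:n]$
 8         $b_i \leftarrow b_i - \gamma_{ik} b_k$
 9 ▷ Backward phase: see the back substitution algorithm.
10 RETURN $x$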
The algorithm overwrites the original matrix and vector . It does not write however the zero entries under the main diagonal since these elements are not necessary for the second phase of the algorithm. Hence the lower triangular part of matrix can be used to store information for the decomposition of matrix .
The above version of the Gauss method can be performed only if the elements occurring in the computation are not zero. For this and numerical stability reasons we use the Gaussian elimination with pivoting.
If , then we can interchange row with another row, say , so that the new entry () at position should be nonzero. If this is not possible, then all the coefficients are zero and . In the latter case has no unique solution. The element is called the pivot element. We can always select new pivot elements by interchanging the rows. The selection of the pivot element has a great influence on the reliability of the computed results. The simple fact that we divide by the pivot element indicates this influence. We recall that is proportional to . It is considered advantageous if the pivot element is selected so that it has the greatest possible modulus. The process of selecting the pivot element is called pivoting. We mention the following two pivoting processes.
Partial pivoting: At the step, interchange the rows of the matrix so the largest remaining element, say , in the column is used as pivot. After the pivoting we have
Complete pivoting: At the step, interchange both the rows and columns of the matrix so that the largest element, say , in the remaining matrix is used as pivot. After the pivoting we have
Note that the interchange of two columns implies the interchange of the corresponding unknowns. The significance of pivoting is well illustrated by the following
Example 12.6 The exact solution of the linear system
is and . The MATLAB program gives the result , and this is the best available result in standard double precision arithmetic. Solving this system with the Gaussian elimination without pivoting (also in double precision) we obtain the catastrophic result and . Using partial pivoting with the Gaussian elimination we obtain the best available numerical result .
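The effect of pivoting can be reproduced with a few lines of Python. The sketch below is an illustration, not the book's program: the function name, the flag `partial_pivoting` and the $2 \times 2$ test system with a tiny leading coefficient $10^{-17}$ are assumptions of this example, chosen to be of the kind discussed above.

```python
def gauss_solve(a, b, partial_pivoting=True):
    """Solve A x = b by Gaussian elimination followed by back substitution."""
    n = len(b)
    a = [row[:] for row in a]  # work on copies, keep the inputs intact
    b = b[:]
    # forward phase
    for k in range(n - 1):
        if partial_pivoting:
            # bring the entry of largest modulus in column k to the diagonal
            p = max(range(k, n), key=lambda i: abs(a[i][k]))
            if p != k:
                a[k], a[p] = a[p], a[k]
                b[k], b[p] = b[p], b[k]
        if a[k][k] == 0.0:
            raise ZeroDivisionError("zero pivot element")
        for i in range(k + 1, n):
            gamma = a[i][k] / a[k][k]
            for j in range(k + 1, n):
                a[i][j] -= gamma * a[k][j]
            b[i] -= gamma * b[k]
    # backward phase
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / a[i][i]
    return x


a = [[1e-17, 1.0],
     [1.0, 1.0]]
b = [1.0, 2.0]
print(gauss_solve(a, b, partial_pivoting=False))  # [0.0, 1.0]
print(gauss_solve(a, b, partial_pivoting=True))   # [1.0, 1.0]
```

Without pivoting the huge multiplier $1/10^{-17}$ wipes out the information in the second equation, and the first unknown comes out as zero. With the rows interchanged the multiplier is tiny and both unknowns are obtained to full working accuracy.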
Remark 12.5 Theoretically we do not need pivoting in the following cases: 1. If is symmetric and positive definite ( is positive definite , , ). 2. If is diagonally dominant in the following sense:
In case of symmetric and positive definite matrices we use the Cholesky method which is a special version of the Gauss-type methods.
During the Gaussian elimination we obtain a sequence of equivalent linear systems
where
Note that matrices are stored in the place of . The last coefficient matrix of phase I has the form
where is the pivot element. The growth factor of pivot elements is given by
Wilkinson proved that the error of the computed solution is proportional to the growth factor and the bounds
and
hold for complete and partial pivoting, respectively. Wilkinson conjectured that for complete pivoting. This has been proved by researchers for small values of . Statistical investigations on random matrices () indicate that the average of is for the partial pivoting and for the complete pivoting. Hence the case hardly occurs in the statistical sense.
We remark that Wilkinson constructed a linear system on which for the partial pivoting. Hence Wilkinson's bound for is sharp in the case of partial pivoting. There also exist examples of linear systems concerning discretisations of differential and integral equations, where is increasing exponentially if Gaussian elimination is used with partial pivoting.
The growth factor can be very large, if the Gaussian elimination is used without pivoting. For example, , if
Operation counts. The Gauss method gives the solution of the linear system $Ax = b$ ($A \in \mathbb{R}^{n \times n}$) in a finite number of steps and arithmetic operations ($+, -, *, /$). The amount of necessary arithmetic operations is an important characteristic of the direct linear system solvers, since the CPU time is largely proportional to the number of arithmetic operations. It was also observed that the numbers of additive and multiplicative operations are nearly the same in the numerical algorithms of linear algebra. For measuring the cost of such algorithms C. B. Moler introduced the concept of flop.
Definition 12.6 One (old) flop is the computational work necessary for the operation $s = x + y * z$ (1 addition + 1 multiplication). One (new) flop is the computational work necessary for any of the arithmetic operations $+, -, *, /$.
The new flop can be used if the computational times of additive and multiplicative operations are approximately the same. Two new flops equal one old flop. Here we use the notion of old flop.
For the Gauss method a simple counting gives the number of additive and multiplicative operations.
Theorem 12.7 The computational cost of the Gauss method is $n^3/3 + \Theta(n^2)$ flops.
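The cubic growth of the cost can be confirmed by counting the operations of the elimination loops. The sketch below is an illustration under a stated convention: it counts only multiply-add pairs (old flops) in the row updates, the right-hand side updates and the back substitution, and ignores divisions and pivot searches. The function name is chosen here.

```python
def gauss_old_flops(n):
    """Number of multiply-add pairs in Gaussian elimination and back substitution."""
    forward = 0
    for k in range(1, n):                # elimination step k = 1, ..., n-1
        for i in range(k + 1, n + 1):    # rows below the pivot row
            forward += n - k             # update of the entries in columns k+1..n
            forward += 1                 # update of the right-hand side entry
    backward = sum(n - i for i in range(1, n + 1))  # one pair per x_j used
    return forward + backward


for n in (10, 100, 1000):
    count = gauss_old_flops(n)
    print(n, count, round(count / (n ** 3 / 3), 4))
```

Under this convention the count equals $(n-1)n(n+1)/3 + n(n-1)/2$, so its ratio to $n^3/3$ tends to 1 as $n$ grows.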
V. V. Klyuyev and N. Kokovkin-Shcherbak proved that if only elementary row and column operations (multiplication of row or column by a number, interchange of rows or columns, addition of a multiple of row or column to another row or column) are allowed, then the linear system cannot be solved in less than flops.
Using fast matrix inversion procedures we can solve the linear system in flops. These theoretically interesting algorithms are not used in practice since they are considered as numerically unstable.
The -decomposition. In many cases it is easier to solve a linear system if the coefficient matrix can be decomposed into the product of two triangular matrices.
Definition 12.8 The matrix has an -decomposition, if , where is lower and is upper triangular matrix.
The -decomposition is not unique. If a nonsingular matrix has an -decomposition, then it has a particular -decomposition, where the main diagonal of a given component matrix consists of 's. Such triangular matrices are called unit (upper or lower) triangular matrices. The decomposition is unique, if is set to be lower unit triangular or is set to be unit upper triangular.
The LU-decomposition of nonsingular matrices is closely related to the Gaussian elimination method. If $A = LU$, where $L$ is unit lower triangular, then (), where is given by the Gauss algorithm. The matrix $U$ is the upper triangular part of the matrix we obtain at the end of the forward phase. The matrix $L$ can also be derived from this matrix, if the columns of the lower triangular part are divided by the corresponding main diagonal elements. Recall that the first phase of the Gaussian elimination does not annihilate the matrix elements under the main diagonal. It is clear that a nonsingular matrix has an LU-decomposition if and only if holds for each pivot element for the Gauss method without pivoting.
Definition 12.9 A matrix whose every row and column has one and only one non-zero element, that element being $1$, is called a permutation matrix.
In case of partial pivoting we permute the rows of the coefficient matrix (multiply by a permutation matrix on the left) so that holds for a nonsingular matrix. Hence we have
Theorem 12.10 If $A$ is nonsingular then there exists a permutation matrix $P$ such that $PA$ has an LU-decomposition.
The algorithm of LU-decomposition is essentially the Gaussian elimination method. If pivoting is used then the interchange of rows must also be executed on the elements under the main diagonal and the permutation matrix must be recorded. A vector containing the actual order of the original matrix rows is obviously sufficient for this purpose.
The LU- and Cholesky-methods. Let $A = LU$ and consider the equation $Ax = b$. Since $Ax = LUx = L(Ux) = b$, we can decompose $Ax = b$ into the equivalent linear system $Ly = b$ and $Ux = y$, where $L$ is lower triangular and $U$ is upper triangular.
LU-Method($A, b$)
1 Determine the LU-decomposition $A = LU$.
2 Solve $Ly = b$.
3 Solve $Ux = y$.
4 RETURN $x$
Remark. In case of partial pivoting we obtain the decomposition $PA = LU$ and we use $Pb$ instead of $b$.
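The three steps of the method, together with the remark on partial pivoting, can be sketched as follows. This is an illustrative implementation under our own naming (`lu_decompose`, `lu_solve`, and the index vector `perm` are assumptions, not identifiers from the text); it records the row order in a vector instead of forming a permutation matrix.

```python
import numpy as np


def lu_decompose(A):
    """Return (perm, L, U) with A[perm] = L @ U and L unit lower triangular."""
    U = np.array(A, dtype=float)
    n = U.shape[0]
    L = np.eye(n)
    perm = np.arange(n)
    for k in range(n - 1):
        # partial pivoting: largest entry of column k on or below the diagonal
        p = k + int(np.argmax(np.abs(U[k:, k])))
        if U[p, k] == 0.0:
            raise ValueError("matrix is singular")
        if p != k:
            U[[k, p], :] = U[[p, k], :]
            L[[k, p], :k] = L[[p, k], :k]   # already stored multipliers move too
            perm[[k, p]] = perm[[p, k]]
        L[k + 1:, k] = U[k + 1:, k] / U[k, k]
        U[k + 1:, k + 1:] -= np.outer(L[k + 1:, k], U[k, k + 1:])
        U[k + 1:, k] = 0.0
    return perm, L, U


def lu_solve(perm, L, U, b):
    """Solve A x = b from the factors: L y = P b, then U x = y."""
    n = len(perm)
    bp = np.asarray(b, dtype=float)[perm]
    y = np.zeros(n)
    for i in range(n):                  # forward substitution
        y[i] = bp[i] - L[i, :i] @ y[:i]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):      # back substitution
        x[i] = (y[i] - U[i, i + 1:] @ x[i + 1:]) / U[i, i]
    return x


A = np.array([[0.0, 2.0, 1.0], [1.0, 1.0, 0.0], [2.0, 1.0, 1.0]])
b = np.array([3.0, 2.0, 4.0])
perm, L, U = lu_decompose(A)
x = lu_solve(perm, L, U, b)
print(x)
```

Once the factors are available, `lu_solve` can be called again for further right-hand sides without repeating the decomposition.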
In the first phase of the Gauss method we produce decomposition and the equivalent linear system with upper triangular coefficient matrix. The latter is solved in the second phase. In the LU-method we decompose the first phase of the Gauss method into two steps. In the first step we obtain only the decomposition . In the second step we produce the vector . The third step of the algorithm is identical with the second phase of the original Gauss method.
The LU-method is especially advantageous if we have to solve several linear systems with the same coefficient matrix:
In such a case we determine the LU-decomposition of matrix only once, and then we solve the linear systems , (, ). The computational cost of this process is flops.
The inversion of a matrix can be done as follows:
1. Determine the LU-decomposition .
2. Solve , ( is the unit vector ).
The inverse of is given by . The computational cost of the algorithm is flops.
This implementation of the LU-method has been known since the 1960s. Vector contains the indices of the rows. At the start we set (). When exchanging rows we exchange only those components of vector that correspond to the rows.
LU-Method-with-Pointers(
)
1 2 3FOR
TO
4DO
compute index such that . 5IF
6THEN
exchange the components and . 7FOR
TO
8DO
9 10FOR
TO
11DO
12FOR
TO
13DO
14 15FOR
DOWNTO
16DO
17FOR
TO
18 19 20RETURN
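The pointer technique can be illustrated with the following sketch. It is not a transcription of the pseudocode above (whose formulas did not survive extraction) but a hedged reconstruction of the idea: the rows of the matrix are never moved, only the entries of an index vector, here called `rows`, are exchanged. All names are our own.

```python
import numpy as np


def lu_solve_with_pointers(A, b):
    """Solve A x = b by in-place LU with partial pivoting,
    exchanging row pointers instead of matrix rows."""
    A = np.array(A, dtype=float)
    b = np.array(b, dtype=float)
    n = len(b)
    rows = list(range(n))   # rows[k] = physical row acting as logical row k
    for k in range(n - 1):
        # pivot search among the logical rows k..n-1
        t = max(range(k, n), key=lambda i: abs(A[rows[i], k]))
        if A[rows[t], k] == 0.0:
            raise ValueError("matrix is singular")
        rows[k], rows[t] = rows[t], rows[k]     # exchange pointers only
        pk = rows[k]
        for i in range(k + 1, n):
            ri = rows[i]
            A[ri, k] /= A[pk, k]                # multiplier stored in place
            A[ri, k + 1:] -= A[ri, k] * A[pk, k + 1:]
    y = np.zeros(n)
    for i in range(n):                          # forward substitution
        ri = rows[i]
        y[i] = b[ri] - A[ri, :i] @ y[:i]
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):              # back substitution
        ri = rows[i]
        x[i] = (y[i] - A[ri, i + 1:] @ x[i + 1:]) / A[ri, i]
    return x


A = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 10.0]])
b = np.array([6.0, 15.0, 25.0])
print(lu_solve_with_pointers(A, b))
```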
If $A$ is symmetric and positive definite, then it can be decomposed in the form $A = LL^T$, where $L$ is a lower triangular matrix. The $LL^T$-decomposition is called the Cholesky-decomposition. In this case we can save approximately half of the storage for and half of the computational cost of the LU-decomposition ($LL^T$-decomposition). Let
Observing that only the first elements may be nonzero in the column of we obtain that
This gives the formulae
Using the notation () we can formulate the Cholesky-method as follows.
Cholesky-Method(
)
1 2FOR
TO
3DO
4FOR
TO
5DO
6RETURN
The lower triangular part of contains . The computational cost of the algorithm is flops and square roots. The algorithm, which can be considered as a special case of the Gauss-methods, does not require pivoting, at least in principle.
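A column-oriented version of the method can be sketched as follows. This is an illustration under our own conventions (a separate output matrix `L` instead of overwriting the input, and the function name `cholesky_lower`), not the book's exact listing.

```python
import math
import numpy as np


def cholesky_lower(A):
    """Return lower triangular L with A = L @ L.T for symmetric positive definite A."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    L = np.zeros((n, n))
    for k in range(n):
        # diagonal entry: a_kk minus the squares already placed in row k
        d = A[k, k] - L[k, :k] @ L[k, :k]
        if d <= 0.0:
            raise ValueError("matrix is not positive definite")
        L[k, k] = math.sqrt(d)
        # entries below the diagonal in column k
        for i in range(k + 1, n):
            L[i, k] = (A[i, k] - L[i, :k] @ L[k, :k]) / L[k, k]
    return L


A = np.array([[4.0, 2.0, 2.0], [2.0, 5.0, 3.0], [2.0, 3.0, 6.0]])
L = cholesky_lower(A)
print(L)
```

The failure branch (`d <= 0`) is a cheap practical test of positive definiteness, which is one reason the method needs no pivoting in principle.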
It often happens that linear systems have banded coefficient matrices.
Definition 12.11 Matrix is banded with lower bandwidth and upper bandwidth if
The possibly non-zero elements () form a band like structure. Schematically has the form
The banded matrices yield very efficient algorithms if and are significantly less than . If a banded matrix with lower bandwidth and upper bandwidth has an LU-decomposition, then both $L$ and $U$ are banded with lower bandwidth and upper bandwidth , respectively.
Next we give the LU-method for banded matrices in three parts.
The-LU-Decomposition-of-Banded-Matrix(
)
1FOR
TO
2DO
FOR
TO
3DO
4FOR
TO
5DO
6RETURN
Entry is overwritten by , if and by , if . The computational cost of is flops, where
The following algorithm overwrites by the solution of equation .
Solution-of-Banded-Unit-Lower-Triangular-System(
)
1FOR
TO
2DO
3RETURN
The total cost of the algorithm is flops. The next algorithm overwrites vector by the solution of .
Solution-of-Banded-Upper-Triangular-System(
)
1FOR
DOWNTO
2DO
3RETURN
The computational cost is flops.
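The three parts (decomposition, unit lower triangular solve, upper triangular solve) can be sketched together as follows. We assume the usual convention that lower bandwidth `p` and upper bandwidth `q` mean $a_{ij} = 0$ for $i > j + p$ and for $j > i + q$; the symbols `p`, `q` and the function names are our assumptions, and no pivoting is performed. For simplicity the matrix is stored as a full array; only the loop bounds exploit the band.

```python
import numpy as np


def banded_lu(A, p, q):
    """In-place LU without pivoting; the strictly lower part holds L
    (unit diagonal implied), the upper part holds U."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    for k in range(n - 1):
        for i in range(k + 1, min(k + p, n - 1) + 1):
            A[i, k] /= A[k, k]
            for j in range(k + 1, min(k + q, n - 1) + 1):
                A[i, j] -= A[i, k] * A[k, j]
    return A


def banded_solve(LU, p, q, b):
    """Solve L y = b, then U x = y, touching only entries inside the band."""
    n = len(b)
    x = np.array(b, dtype=float)
    for i in range(n):                        # unit lower triangular system
        for j in range(max(0, i - p), i):
            x[i] -= LU[i, j] * x[j]
    for i in range(n - 1, -1, -1):            # upper triangular system
        for j in range(i + 1, min(i + q, n - 1) + 1):
            x[i] -= LU[i, j] * x[j]
        x[i] /= LU[i, i]
    return x


n = 6
T = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # tridiagonal, p = q = 1
b = T @ np.ones(n)
x = banded_solve(banded_lu(T, 1, 1), 1, 1, b)
print(x)
```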
Assume that is symmetric, positive definite and banded with lower bandwidth . The banded version of the Cholesky-method is given by
Cholesky-decomposition-of-Banded-Matrices(
)
1FOR
TO
2DO
FOR
TO
3DO
4 5RETURN
The elements are overwritten by (). The total amount of work is given by flops and square roots.
Remark. If has lower bandwidth and upper bandwidth and partial pivoting takes place, then the upper bandwidth of increases up to .
There are several iterative methods for solving linear systems of algebraic equations. The best known iterative algorithms are the classical Jacobi-, the Gauss-Seidel- and the relaxation methods. The greatest advantage of these iterative algorithms is their easy implementation to large systems. At the same time they usually have slow convergence. However for parallel computers the multisplitting iterative algorithms seem to be efficient.
Consider the iteration
where and . It is known that converges for all , if and only if the spectral radius of satisfies ( is an eigenvalue of ). In case of convergence , that is we obtain the solution of the equation . The speed of convergence depends on the spectral radius . The smaller the spectral radius , the faster the convergence.
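The role of the spectral radius can be illustrated with a small experiment. The sketch below assumes that the iteration has the standard form $x_{i+1} = Bx_i + f$ (the displayed formula is missing from this copy) and builds $B$ and $f$ from the classical Jacobi splitting of a small system; the test matrix and all names are hypothetical.

```python
import numpy as np


def spectral_radius(B):
    """Largest modulus of the eigenvalues of B."""
    return float(np.max(np.abs(np.linalg.eigvals(B))))


def fixed_point_iteration(B, f, x0, steps):
    """Iterate x <- B x + f a fixed number of times."""
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x = B @ x + f
    return x


# Jacobi splitting of A x = b as one way to obtain B and f
A = np.array([[4.0, 1.0], [2.0, 5.0]])
b = np.array([1.0, 2.0])
D = np.diag(np.diag(A))
B = np.eye(2) - np.linalg.solve(D, A)   # B = I - D^{-1} A
f = np.linalg.solve(D, b)               # f = D^{-1} b

rho = spectral_radius(B)
x = fixed_point_iteration(B, f, np.zeros(2), 100)
print(rho, x)
```

Here the spectral radius is about 0.32, so each step reduces the error by roughly that factor.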
Consider now the linear system
where is nonsingular. The matrices form a multisplitting of if
(i) , ,
(ii) is nonsingular, ,
(iii) is non-negative diagonal matrix, ,
(iv) .
Let be a given initial vector. The multisplitting iterative method is the following.
Multisplitting-Iteration(
)
1 2WHILE
exit condition =false
3DO
4FOR
TO
5DO
6 7RETURN
It is easy to show that and
Thus the condition of convergence is . The multisplitting iteration is a true parallel algorithm because we can solve linear systems in parallel in each iteration (synchronised parallelism). The bottleneck of the algorithm is the computation of iterate .
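A serial sketch of the iteration follows. It assumes the standard multisplitting form $x^{(k+1)} = \sum_i D_i M_i^{-1}(N_i x^{(k)} + b)$ with $A = M_i - N_i$, and it uses one particular choice of splitting as an assumption: $M_i$ equals $A$ on the index block $S_i$ and the diagonal of $A$ elsewhere, while $D_i$ has ones on the diagonal positions in $S_i$. The function name and the argument `blocks` are ours.

```python
import numpy as np


def multisplitting_iteration(A, b, blocks, x0, steps):
    """Multisplitting iteration with a non-overlapping block splitting."""
    n = A.shape[0]
    splittings = []
    for S in blocks:
        S = np.asarray(S)
        M = np.diag(np.diag(A)).astype(float)
        M[np.ix_(S, S)] = A[np.ix_(S, S)]   # block of A on S, diagonal elsewhere
        N = M - A                            # so that A = M - N
        D = np.zeros((n, n))
        D[S, S] = 1.0                        # weights: 1 on S, 0 elsewhere
        splittings.append((M, N, D))
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        # the solves below are independent and could run in parallel
        x = sum(D @ np.linalg.solve(M, N @ x + b) for M, N, D in splittings)
    return x


n = 6
A = 4.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = A @ np.arange(1.0, n + 1.0)
x = multisplitting_iteration(A, b, [[0, 1, 2], [3, 4, 5]], np.zeros(n), 100)
print(x)
```

Since the blocks form a partition, the weighting matrices sum to the identity, as condition (iv) presumably requires.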
The selection of matrices and is such that the solution of the linear system should be cheap. Let be a partition of , that is , () and . Furthermore let () be such that for at least one .
The non-overlapping block Jacobi splitting of is given by
for .
Define now the simple splitting
where is nonsingular,
It can be shown that
holds for the non-overlapping block Jacobi multisplitting.
The overlapping block Jacobi multisplitting of is defined by
for .
A nonsingular matrix $A$ is called an $M$-matrix, if $a_{ij} \le 0$ ($i \ne j$) and all the elements of $A^{-1}$ are nonnegative.
Theorem 12.12 Assume that $A$ is a nonsingular $M$-matrix, is a non-overlapping, is an overlapping block Jacobi multisplitting of $A$, where the weighting matrices are the same. Then we have
where and .
We can observe that both iteration procedures are convergent and the convergence of the overlapping multisplitting is not slower than that of the non-overlapping procedure. The theorem remains true if we use block Gauss-Seidel multisplittings instead of the block Jacobi multisplittings. In this case we replace the above defined matrices and with their lower triangular parts.
The multisplitting algorithm has multi-stage and asynchronous variants as well.
We analyse the direct and inverse errors. We use the following notations and concepts. The exact (theoretical) solution of is denoted by , while any approximate solution is denoted by . The direct error of the approximate solution is given by . The quantity is called the residual error. For the exact solution , while for the approximate solution
We use various models to estimate the inverse error. In the most general case we assume that the computed solution satisfies the linear system , where and . The quantities and are called the inverse errors.
One has to distinguish between the sensitivity of the problem and the stability of the solution algorithm. By sensitivity of a problem we mean the sensitivity of the solution to changes in the input parameters (data). By the stability (or sensitivity) of an algorithm we mean the influence of computational errors on the computed solution. We measure the sensitivity of a problem or algorithm in various ways. One such characterization is the “condition number”, which compares the relative errors of the input and output values.
The following general principles are used when applying any algorithm:
- We use only stable or well-conditioned algorithms.
- We cannot solve an unstable (ill-posed or ill-conditioned) problem with a general purpose algorithm, in general.
Assume that we solve the perturbed equation
instead of the original . Let and investigate the difference of the two solutions.
Theorem 12.13 If is nonsingular and , then
where is the condition number of .
Here we can see that the condition number of may strongly influence the relative error of the perturbed solution . A linear algebraic system is said to be well-conditioned if is small, and ill-conditioned, if is big. It is clear that the terms “small” and “big” are relative and the condition number depends on the norm chosen. We identify the applied norm if it is essential for some reason. For example . The next example gives possible geometric characterization of the condition number.
Example 12.7 The linear system
is ill-conditioned (). The two lines whose intersection point is the solution of the system are almost parallel. Therefore if we perturb the right hand side, the new intersection point of the two lines will be far from the previous one.
The inverse error is in the sensitivity model under investigation. Theorem 12.13 gives an estimate of the direct error which conforms with the thumb rule. It follows that we can expect a small relative error of the perturbed solution , if the condition number of is small.
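The effect of a large condition number can be reproduced with a hypothetical system of two almost parallel lines (our own example, not the one of Example 12.7). The sketch perturbs only the right-hand side and compares the observed relative change of the solution with the standard bound $\kappa(A)\,\|\delta b\|/\|b\|$ in the 2-norm.

```python
import numpy as np

A = np.array([[1.0, 1.0], [1.0, 1.0001]])   # two almost parallel lines
b = np.array([2.0, 2.0001])                 # exact solution (1, 1)
db = np.array([0.0, 1e-4])                  # small perturbation of b

x = np.linalg.solve(A, b)
x_pert = np.linalg.solve(A, b + db)         # close to (0, 2)

cond = np.linalg.cond(A)                    # 2-norm condition number
rel_in = np.linalg.norm(db) / np.linalg.norm(b)
rel_out = np.linalg.norm(x_pert - x) / np.linalg.norm(x)
print(cond, rel_in, rel_out, cond * rel_in)
```

A relative input change of a few times $10^{-5}$ moves the solution by about 100 percent, which is still within the bound.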
Example 12.8 Consider the linear system with
Let . Then and , but .
Consider now the perturbed linear system
instead of . It can be proved that for this perturbation model there exists more than one inverse error, among which is the inverse error with minimal spectral norm, provided that .
The following theorem establishes that for small relative residual error the relative inverse error is also small.
Theorem 12.14 Assume that is the approximate solution of , and . If , then the matrix satisfies and .
If the relative inverse error and the condition number of are small, then the relative residual error is small.
Theorem 12.15 If , , , and , then
If is ill-conditioned, then Theorem 12.15 is not true.
Example 12.9 Let , and , (). Then and . Let
Then and , which is not small.
In the most general case we solve the perturbed equation
instead of . The following general result holds.
Theorem 12.16 If is nonsingular, and , then
This theorem implies the following “thumb rule”.
Thumb rule. Assume that . If the entries of and are accurate to about decimal places and , where , then the entries of the computed solution are accurate to about decimal places.
The assumption of Theorem 12.16 guarantees that the matrix is nonsingular. The inequality is equivalent with the inequality and the distance of from the nearest singular matrix is just . Thus we can give a new characterization of the condition number:
Thus if a matrix is ill-conditioned, then it is close to a singular matrix. Earlier we defined the condition numbers of matrices as the condition number of the mapping .
Let us introduce the following definition.
Definition 12.17 A linear system solver is said to be weakly stable on a matrix class , if for all well-conditioned and for all , the computed solution of the linear system has small relative error .
Putting together Theorems 12.13–12.16 we obtain the following.
Theorem 12.18 (Bunch) A linear system solver is weakly stable on a matrix class , if for all well-conditioned and for all , the computed solution of the linear system satisfies any of the following conditions:
(1) is small;
(2) is small;
(3) There exists such that and are small.
The estimate of Theorem 12.16 can be used in practice if we know estimates of and . If no estimates are available, then we can only make a posteriori error estimates.
In the following we study the componentwise error estimates. We first give an estimate for the absolute error of the approximate solution using the components of the inverse error.
Theorem 12.19 (Bauer, Skeel) Let be nonsingular and assume that the approximate solution of satisfies the linear system . If , and are such that , , , and , then
If (), and
then we obtain the estimate
The quantity is said to be Skeel-norm, although it is not a norm in the earlier defined sense. The Skeel-norm satisfies the inequality
Therefore the above estimate is not worse than the traditional one that uses the standard condition number.
The inverse error can be estimated componentwise by the following result of Oettli and Prager. Let and . Assume that and . Furthermore let
Theorem 12.20 (Oettli, Prager) The computed solution satisfies a perturbed equation with and , if
We do not need the condition number to apply this theorem. In practice the entries and are proportional to the machine epsilon.
Theorem 12.21 (Wilkinson) The approximate solution of obtained by the Gauss method in floating point arithmetic satisfies the perturbed linear equation
with
where denotes the growth factor of the pivot elements and is the unit roundoff.
Since is small in practice, the relative error
is also small. Therefore Theorem 12.18 implies that the Gauss method is weakly stable both for full and partial pivoting.
Wilkinson's theorem implies that
For a small condition number we can assume that . Using Theorems 12.21 and 12.16 (case ) we obtain the following estimate of the direct error:
The obtained result supports the thumb rule in the case of the Gauss method.
Example 12.10 Consider the following linear system whose coefficients can be represented exactly:
Here is big, but is negligible. The exact solution of the problem is , . MATLAB gives the approximate solution , with the relative error
Since and , the result essentially corresponds to the Wilkinson theorem or the thumb rule. The Wilkinson theorem gives the bound
for the inverse error. If we use the Oettli-Prager theorem with the choice and , then we obtain the estimate . Since , this estimate is better than that of Wilkinson.
Scaling and preconditioning. Several matrices that occur in applications are ill-conditioned if their order is large. For example the famous Hilbert-matrix
has , if . There exist matrices with integer entries that can be represented exactly in standard IEEE754 floating point arithmetic while their condition number is approximately .
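The growth of the condition number of the Hilbert matrix is easy to observe. The sketch below assumes the usual definition $h_{ij} = 1/(i+j-1)$ and uses the 2-norm condition number computed by NumPy; for larger orders the computed value is itself unreliable because the matrix is numerically singular.

```python
import numpy as np


def hilbert(n):
    """Hilbert matrix of order n with entries 1 / (i + j - 1), i, j = 1..n."""
    i = np.arange(1, n + 1)
    return 1.0 / (i[:, None] + i[None, :] - 1)


for n in (3, 5, 8, 10):
    print(n, np.linalg.cond(hilbert(n)))
```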
We have two main techniques to solve linear systems with large condition numbers. Either we use multiple precision arithmetic or decrease the condition number. There are two known forms of decreasing the condition number.
1. Scaling. We replace the linear system with the equation
where and are diagonal matrices.
We apply the Gauss method to this scaled system and get the solution . The quantity defines the requested solution. If the condition number of the matrix is smaller then we expect a smaller error in and consequently in . Various strategies are given to choose the scaling matrices and . One of the best known strategies is the balancing which forces every column and row of to have approximately the same norm. For example, if
where is the row vector of , the Euclidean norms of the rows of will be and the estimate
holds with . This means that optimally scales the rows of in an approximate sense.
The next example shows that the scaling may lead to bad results.
Example 12.11 Consider the matrix
for . It is easy to show that . Let
Then the scaled matrix
has the condition number , which is a very large value for small .
2. Preconditioning. The preconditioning is very close to scaling. We rewrite the linear system with the equivalent form
where matrix is such that is smaller and is easily solvable.
The preconditioning is often used with iterative methods on linear systems with symmetric and positive definite matrices.
A posteriori error estimates. The a posteriori estimate of the error of an approximate solution is necessary to get some information on the reliability of the obtained result. There are plenty of such estimates. Here we show three estimates whose computational cost is flops. This cost is acceptable when comparing to the cost of direct or iterative methods ( or per iteration step).
Theorem 12.22 (Auchmuty) Let be the approximate solution of . Then
where .
The error constant depends on and the direction of error vector . Furthermore
The error constant takes the upper value only in exceptional cases. The computational experiments indicate that the average value of grows slowly with the order of and it depends more strongly on than the condition number of . The following experimental estimate
seems to hold with a high degree of probability.
The famous LINPACK program package uses the following process to estimate . We solve the linear systems and . Then the estimate of is given by
Since
we can interpret the process as an application of the power method of the eigenvalue problem. The estimate can be used with the and -norms. The entries of vector are possibly with random signs.
If the linear system is solved by the LU-method, then the solution of further linear systems costs flops per system. Thus the total cost of the LINPACK estimate remains small. Having the estimate we can easily estimate and the error of the approximate solution (cf. Theorem 12.16 or the thumb rule). We remark that several similar processes are known in the literature.
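The two-solve idea can be sketched as follows. This is a simplified illustration, not the actual LINPACK routine: the right-hand side `d` is filled with random signs, the 1-norm is used, and the systems are solved with `numpy.linalg.solve` rather than with stored LU factors. The function name is ours. Whatever `d` is chosen, the returned ratio is a lower bound for $\|A^{-1}\|_1$.

```python
import numpy as np


def inverse_norm_estimate(A, rng):
    """Estimate ||A^{-1}||_1 from the two solves A^T y = d and A z = y."""
    n = A.shape[0]
    d = rng.choice([-1.0, 1.0], size=n)     # entries +1 or -1 with random signs
    y = np.linalg.solve(A.T, d)
    z = np.linalg.solve(A, y)
    return np.linalg.norm(z, 1) / np.linalg.norm(y, 1)


rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8)) + 8.0 * np.eye(8)
estimate = inverse_norm_estimate(A, rng)
exact = np.linalg.norm(np.linalg.inv(A), 1)
print(estimate, exact, estimate * np.linalg.norm(A, 1))
```

Multiplying the estimate by $\|A\|_1$ gives an estimate of the condition number.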
We use the Oettli-Prager theorem in the following form. Let be the residual error, and are given such that and . Let
where $0/0$ is set to $0$, and $\rho/0$ is set to $\infty$, if $\rho \ne 0$. Symbol denotes the component of the vector . If , then there exists a matrix and a vector for which
holds and
Moreover is the smallest number for which and exist with the above properties. The quantity measures the relative inverse error in terms of and . If for a given , and , the quantity is small, then the perturbed problem (and its solution) are close to the original problem (and its solution). In practice, the choice and is preferred.
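The componentwise inverse error can be computed directly from the residual. The sketch assumes the standard Oettli-Prager formula $\omega = \max_i |r_i| / (E|\tilde{x}| + f)_i$ with the conventions $0/0 = 0$ and $\rho/0 = \infty$ for $\rho \ne 0$, and it defaults to the common choice $E = |A|$, $f = |b|$; the function and parameter names are ours.

```python
import numpy as np


def componentwise_backward_error(A, b, x, E=None, f=None):
    """Oettli-Prager componentwise backward error of the approximate solution x."""
    E = np.abs(A) if E is None else E
    f = np.abs(b) if f is None else f
    r = np.abs(b - A @ x)                 # componentwise residual
    denom = E @ np.abs(x) + f
    omega = 0.0
    for ri, di in zip(r, denom):
        if ri == 0.0:
            continue                      # convention 0/0 = 0
        if di == 0.0:
            return float("inf")           # convention rho/0 = infinity
        omega = max(omega, ri / di)
    return omega


A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
x_good = np.linalg.solve(A, b)
x_bad = x_good + 1e-3
print(componentwise_backward_error(A, b, x_good))
print(componentwise_backward_error(A, b, x_bad))
```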
Denote by the approximate solution of and let be the residual error at the point . The precision of the approximate solution can be improved with the following method.
Iterative-Refinement(
)
1 2 3 4WHILE
5DO
6 Compute the approximate solution of with the -method. 7 8 9RETURN
There are other variants of this process. We can use other linear solvers instead of the LU-method.
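One common variant is mixed-precision refinement, sketched below under our own assumptions: the systems are solved in single precision (`float32`), while the residual is accumulated in double precision. For brevity each correction calls `numpy.linalg.solve` again; a practical code would reuse the LU factors computed once.

```python
import numpy as np


def iterative_refinement(A, b, steps=5):
    """Refine a single-precision solution using double-precision residuals."""
    A32 = A.astype(np.float32)
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(steps):
        r = b - A @ x                                    # residual in double precision
        d = np.linalg.solve(A32, r.astype(np.float32))   # correction in single precision
        x = x + d.astype(np.float64)
    return x


rng = np.random.default_rng(0)
A = rng.standard_normal((30, 30)) + 30.0 * np.eye(30)    # well-conditioned test matrix
x_true = np.arange(1.0, 31.0)
b = A @ x_true

x0 = np.linalg.solve(A.astype(np.float32), b.astype(np.float32)).astype(np.float64)
x = iterative_refinement(A, b)
print(np.linalg.norm(x0 - x_true), np.linalg.norm(x - x_true))
```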
Let be the smallest bound of relative inverse error with
Furthermore let
Theorem 12.23 (Skeel) If , then for sufficiently large we have
This result often holds after the first iteration, i.e. for . Jankowski and Wozniakowski investigated the iterative refinement for any method which produces an approximate solution with relative error less than . They showed that the iterative refinement improves the precision of the approximate solution even in single precision arithmetic and makes the method weakly stable.
Exercises
12.2-1 Prove Theorem 12.7.
12.2-2 Consider the linear systems and , where
and . Which equation is more sensitive to the perturbation of ? What should be the relative error of in the more sensitive equation in order to get the solutions of both equations with the same precision?
12.2-3 Let , and
Solve the linear systems for . Explain the results.
12.2-4 Let be a matrix and choose the band matrix consisting of the main and the neighbouring two subdiagonals of as a preconditioning matrix. How much does the condition number of improve if (i) is a random matrix; (ii) is a Hilbert matrix?
12.2-5 Let
and assume that is the common error bound of every component of . Give the sharpest possible error bounds for the solution of the equation and for the sum .
12.2-6 Consider the linear system with the approximate solution .
(i) Give an error bound for , if holds exactly and both and are nonsingular.
(ii) Let
and consider the solution of . Give (if possible) a relative error bound for the entries of such that the integer part of every solution component remains constant within the range of this relative error bound.
The set of complex -vectors will be denoted by . Similarly, denotes the set of complex matrices.
Definition 12.24 Let be an arbitrary matrix. The number is the eigenvalue of if there is vector () such that
Vector is called the (right) eigenvector of that belongs to the eigenvalue .
Equation can be written in the equivalent form , where is the unit matrix of appropriate size. The latter homogeneous linear system has a nonzero solution if and only if
Equation (12.38) is called the characteristic equation of matrix . The roots of this equation are the eigenvalues of matrix . Expanding we obtain a polynomial of degree :
This polynomial is called the characteristic polynomial of . It follows from the fundamental theorem of algebra that any matrix has exactly eigenvalues with multiplicities. The eigenvalues may be complex or real. Therefore one needs to use complex arithmetic for eigenvalue calculations. If the matrix is real and the computations are done in real arithmetic, the complex eigenvalues and eigenvectors can be determined only with special techniques.
If is an eigenvector, (), then is also eigenvector. The number of linearly independent eigenvectors that belong to an eigenvalue does not exceed the multiplicity of in the characteristic equation (12.38). The eigenvectors that belong to different eigenvalues are linearly independent.
The following results give estimates for the size and location of the eigenvalues.
Theorem 12.25 Let be any eigenvalue of matrix . The upper estimate holds in any induced matrix norm.
Theorem 12.26 (Gershgorin) Let ,
and
Then for any eigenvalue of we have .
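The theorem can be checked numerically. The sketch assumes the usual row version of the Gershgorin discs, with centres $a_{ii}$ and radii $\sum_{j \ne i} |a_{ij}|$; the test matrix and the function names are hypothetical.

```python
import numpy as np


def gershgorin_discs(A):
    """Return centres a_ii and radii sum_{j != i} |a_ij| of the row discs."""
    A = np.asarray(A)
    centers = np.diag(A)
    radii = np.sum(np.abs(A), axis=1) - np.abs(centers)
    return centers, radii


def in_union_of_discs(z, centers, radii, tol=1e-12):
    """True if the complex number z lies in at least one disc."""
    return bool(np.any(np.abs(z - centers) <= radii + tol))


A = np.array([[4.0, 1.0, 0.0], [0.5, 3.0, 0.5], [0.0, 1.0, -2.0]])
centers, radii = gershgorin_discs(A)
for lam in np.linalg.eigvals(A):
    print(lam, in_union_of_discs(lam, centers, radii))
```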
For certain matrices the solution of the characteristic equation (12.38) is very easy. For example, if is a triangular matrix, then its eigenvalues are the entries of the main diagonal. In most cases however the computation of all eigenvalues and eigenvectors is a very difficult task. Those transformations of matrices that keep the eigenvalues unchanged have practical significance for this problem. Later we see that the eigenvalue problem of transformed matrices is simpler.
Definition 12.27 The matrices are similar if there is a matrix such that . The mapping is said to be similarity transformation of .
Theorem 12.28 Assume that . Then the eigenvalues of and are the same. If is the eigenvector of , then is the eigenvector of .
Similar matrices have the same eigenvalues.
The difficulty of the eigenvalue problem also stems from the fact that the eigenvalues and eigenvectors are very sensitive (unstable) to changes in the matrix entries. The eigenvalues of and the perturbed matrix may differ from each other significantly. Besides, the multiplicity of the eigenvalues may also change under perturbation. The following theorems and examples show the high sensitivity of the eigenvalue problem.
Theorem 12.29 (Ostrowski, Elsner) For every eigenvalue of matrix there exists an eigenvalue of the perturbed matrix such that
We can observe that the eigenvalues are changing continuously and the size of change is proportional to the root of .
Example 12.12 Consider the following perturbed Jordan matrix of the size :
The characteristic equation is , which gives the different eigenvalues
instead of the original eigenvalue with multiplicity . The size of change is , which corresponds to Theorem 12.29. If , and , then the perturbation size of the eigenvalues is . This is a significant change relative to the input perturbation .
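The root-type sensitivity can be observed on a small case. The sketch assumes the usual form of the perturbed Jordan block: the value $\mu$ on the diagonal, ones on the superdiagonal and the perturbation $\epsilon$ in the lower left corner. The parameters $r = 4$, $\mu = 0$, $\epsilon = 10^{-8}$ are our own choice, smaller than the ones quoted above so that rounding errors do not mask the effect.

```python
import numpy as np


def perturbed_jordan(r, mu, eps):
    """r x r Jordan block with eigenvalue mu and eps placed in position (r, 1)."""
    J = mu * np.eye(r) + np.diag(np.ones(r - 1), k=1)
    J[r - 1, 0] = eps
    return J


r, mu, eps = 4, 0.0, 1e-8
lam = np.linalg.eigvals(perturbed_jordan(r, mu, eps))
# every eigenvalue moves by about eps**(1/r) = 1e-2, a million times eps
print(np.abs(lam - mu), eps ** (1.0 / r))
```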
For special matrices and perturbations we may have much better perturbation bounds.
Theorem 12.30 (Bauer, Fike) Assume that is diagonalisable, that is a matrix exists such that . Denote an eigenvalue of . Then
This result is better than that of Ostrowski and Elsner. Nevertheless , which is generally unknown, can be very big.
The eigenvalues are continuous functions of the matrix entries. This is also true for the normalized eigenvectors if the eigenvalues are simple. The following example shows that this property does not hold for multiple eigenvalues.
The eigenvalues of are and . Vector is the eigenvector belonging to . Vector is the eigenvector belonging to . If , then
while the eigenvectors do not have limit.
We study the numerical solution of the eigenvalue problem in the next section. Unfortunately it is very difficult to estimate the goodness of numerical approximations. From the fact that holds with a certain error we cannot conclude anything in general.
Example 12.14 Consider the matrix
where is small. The eigenvalues of are , while the corresponding eigenvectors are . Let be an approximation of the eigenvalues and let be the approximate eigenvector. Then
If , then the residual error underestimates the true error by five orders of magnitude.
Remark 12.31 We can define the condition number of eigenvalues for simple eigenvalues:
where and are the right and left eigenvectors, respectively. For multiple eigenvalues the condition number is not finite.
We investigate only the real eigenvalues and eigenvectors of real matrices. The methods under consideration can be extended to the complex case with appropriate modifications.
This method is due to von Mises. Assume that has exactly different real eigenvalues. Then the eigenvectors belonging to the corresponding eigenvalues are linearly independent. Assume that the eigenvalues satisfy the condition
and let be a given vector. This vector is a unique linear combination of the eigenvectors, that is . Assume that and compute the sequence (). The initial assumptions imply that
Let be an arbitrary vector such that . Then
Given the initial vector , the power method has the following form.
Power-Method(
)
1 2WHILE
exit condition = FALSE 3DO
4 5 Select vector such that 6 7 8RETURN
It is clear that
The convergence here means that , that is the action line of tends to the action line of . There are various strategies to select . We can select , where is defined by . If we select , then will be identical with the Rayleigh quotient . This choice gives an approximation of that has the minimal residual norm (Example 12.14 shows that this choice is not necessarily the best option).
The speed of convergence depends on the quotient . The method is very sensitive to the choice of the initial vector . If , then the process does not converge to the dominant eigenvalue . For certain matrix classes the power method converges with probability if the initial vector is randomly chosen. In case of complex eigenvalues or multiple we have to use modifications of the algorithm. The speed of convergence can be accelerated if the method is applied to the shifted matrix , where is an appropriately chosen number. The shifted matrix has the eigenvalues and the corresponding convergence factor . The latter quotient can be made smaller than with the proper selection of .
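A compact version of the method, using the Rayleigh quotient as the eigenvalue estimate and 2-norm normalisation, might look as follows. The stopping rule (relative change of the estimate), the tolerance and the test matrix are our own assumptions and only one of several reasonable choices.

```python
import numpy as np


def power_method(A, x0, tol=1e-12, max_iter=10_000):
    """Return (eigenvalue estimate, unit eigenvector estimate) for the dominant eigenvalue."""
    x = np.array(x0, dtype=float)
    x /= np.linalg.norm(x)
    lam = 0.0
    for _ in range(max_iter):
        v = A @ x
        lam_new = x @ v                    # Rayleigh quotient, since ||x||_2 = 1
        x = v / np.linalg.norm(v)
        if abs(lam_new - lam) <= tol * abs(lam_new):
            return lam_new, x
        lam = lam_new
    return lam, x


A = np.array([[2.0, 1.0], [1.0, 3.0]])
lam, x = power_method(A, [1.0, 1.0])
print(lam, x)
```

For this matrix the eigenvalues are $(5 \pm \sqrt{5})/2$, so the convergence factor is about 0.38.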
The usual exit condition of the power method is
If we simultaneously apply the power method to the transposed matrix and , then the quantity
gives an estimate for the condition number of (see Remark 12.31). In such a case we use the exit condition
The power method is very useful for large sparse matrices. It is often used to determine the largest and the smallest eigenvalue. We can approximate the smallest eigenvalue as follows. The eigenvalues of are . The eigenvalue will be the eigenvalue with the largest modulus. We can approximate this value by applying the power method to . This requires only a small modification of the algorithm. We replace line 4. with the following:
Solve equation for
The modified algorithm is called the inverse power method. It is clear that and hold under appropriate conditions. If we use the LU-method to solve , we can avoid the inversion of .
If the inverse power method is applied to the shifted matrix , then the eigenvalues of are . If approaches, say, to , then . Hence the inequality
holds for the eigenvalues of the shifted matrix. The speed of convergence is determined by the quotient
If is close enough to , then is very small and the inverse power iteration converges very fast. This property can be exploited in the calculation of approximate eigenvectors if an approximate eigenvalue, say , is known. Assuming that , we apply the inverse power method to the shifted matrix . In spite of the fact that matrix is nearly singular and the linear equation cannot be solved with high precision, the algorithm gives very often good approximations of the eigenvectors.
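The shifted inverse iteration can be sketched as follows. The shift is called `sigma` here (our notation), the number of steps is fixed for simplicity, and each step calls `numpy.linalg.solve`; a practical code would factor the shifted matrix once and reuse the factors. The shift must not coincide exactly with an eigenvalue, otherwise the shifted matrix is singular.

```python
import numpy as np


def inverse_power_method(A, sigma, x0, steps=50):
    """Approximate the eigenpair of A whose eigenvalue is closest to sigma."""
    n = A.shape[0]
    B = A - sigma * np.eye(n)
    x = np.array(x0, dtype=float)
    x /= np.linalg.norm(x)
    for _ in range(steps):
        v = np.linalg.solve(B, x)          # solve (A - sigma I) v = x, no explicit inverse
        x = v / np.linalg.norm(v)
    return x @ A @ x, x                    # Rayleigh quotient and eigenvector estimate


A = np.array([[2.0, 1.0], [1.0, 3.0]])
lam, x = inverse_power_method(A, 1.3, [1.0, 0.0])
print(lam, x)
```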
Finally we note that in principle the von Mises method can be modified to determine all eigenvalues and eigenvectors.
We need the following definition and theorem.
Definition 12.32 The matrix $Q$ is said to be orthogonal if $Q^TQ = I$.
Theorem 12.33 (QR-decomposition) Every matrix $A$ having linearly independent column vectors can be decomposed in the product form $A = QR$, where $Q$ is orthogonal and $R$ is an upper triangular matrix.
We note that the QR-decomposition can be applied for solving linear systems of equations, similarly to the LU-decomposition. If the QR-decomposition of $A$ is known, then the equation $Ax = b$ can be written in the equivalent form $Rx = Q^Tb$. Thus we have to solve only an upper triangular linear system.
There are several methods to determine the QR-decomposition of a matrix. In practice the Givens-, the Householder- and the MGS-methods are used.
The MGS (Modified Gram-Schmidt) method is a stabilised, but algebraically equivalent version of the classical Gram-Schmidt orthogonalisation algorithm. The basic problem is the following: We seek for an orthonormal basis of the subspace
where () are linearly independent vectors. That is we determine the linearly independent vectors such that
and
The basic idea of the classical Gram-Schmidt-method is the following:
Let and . Assume that vectors are already computed and orthonormal. Assume that vector is such that , that is holds for . Since are orthonormal, () and . After normalisation we obtain .
The algorithm is formalised as follows.
CGS-Orthogonalization(
)
1FOR
TO
2DO
FOR
TO
3DO
4 5 6 7RETURN
The algorithm overwrites vectors by the orthonormal vectors . The connection with the QR-decomposition follows from the relation . Since
we can write that
The numerically stable MGS method is given in the following form
MGS-Orthogonalisation(
)
1FOR
TO
2DO
3 4FOR
TO
5DO
6 7RETURN
The algorithm overwrites vectors by the orthonormal vectors . The MGS method is more stable than the CGS algorithm. Björck proved that for the computed matrix satisfies
where is the unit roundoff.
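The difference in stability between the two variants shows up clearly on a nearly rank-deficient matrix. The sketch below implements both methods in our own notation (columns of `A` are the input vectors, and both functions return the factors `Q` and `R`) and measures the loss of orthogonality on a Läuchli-type test matrix, which is our choice of example.

```python
import numpy as np


def cgs(A):
    """Classical Gram-Schmidt: projections use the original column."""
    A = np.array(A, dtype=float)
    n, m = A.shape
    Q = np.zeros((n, m))
    R = np.zeros((m, m))
    for k in range(m):
        R[:k, k] = Q[:, :k].T @ A[:, k]
        v = A[:, k] - Q[:, :k] @ R[:k, k]
        R[k, k] = np.linalg.norm(v)
        Q[:, k] = v / R[k, k]
    return Q, R


def mgs(A):
    """Modified Gram-Schmidt: each new q_k is removed from all later columns at once."""
    Q = np.array(A, dtype=float)
    n, m = Q.shape
    R = np.zeros((m, m))
    for k in range(m):
        R[k, k] = np.linalg.norm(Q[:, k])
        Q[:, k] /= R[k, k]
        for j in range(k + 1, m):
            R[k, j] = Q[:, k] @ Q[:, j]
            Q[:, j] -= R[k, j] * Q[:, k]
    return Q, R


def orthogonality_loss(Q):
    return float(np.max(np.abs(Q.T @ Q - np.eye(Q.shape[1]))))


e = 1e-8
A = np.array([[1.0, 1.0, 1.0], [e, 0.0, 0.0], [0.0, e, 0.0], [0.0, 0.0, e]])
Qc, Rc = cgs(A)
Qm, Rm = mgs(A)
print(orthogonality_loss(Qc), orthogonality_loss(Qm))
```

Both factorisations reproduce the matrix, but only the MGS factor stays close to orthogonal on this example.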
Today the QR-method is the most important numerical algorithm to compute all eigenvalues of a general matrix. It can be shown that the QR-method is a generalisation of the power method. The basic idea of the method is the following: Starting from $A_1 = A$ we compute the sequence $A_{k+1} = Q_k^T A_k Q_k$, where $Q_k$ is orthogonal, $A_{k+1}$ is orthogonally similar to $A_k$ ($A$) and the lower triangular part of $A_k$ tends to a diagonal matrix, whose entries will be the eigenvalues of $A$. Here $Q_k$ is the orthogonal factor of the QR-decomposition $A_k = Q_kR_k$. Therefore $A_{k+1} = Q_k^T(Q_kR_k)Q_k = R_kQ_k$. The basic algorithm is given in the following form.
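QR-Method($A$)
$k \leftarrow 1$
$A_1 \leftarrow A$
WHILE exit condition = FALSE
  DO compute the $QR$-decomposition $A_k = Q_k R_k$
     $A_{k+1} \leftarrow R_k Q_k$
     $k \leftarrow k + 1$
RETURN $A_k$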
The following result holds.
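Theorem 12.34 (Parlett) If the matrix $A$ is diagonalisable, $X^{-1} A X = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$, the eigenvalues satisfy
$$|\lambda_1| > |\lambda_2| > \cdots > |\lambda_n| > 0$$
and $X^{-1}$ has an $LU$-decomposition, then the lower triangular part of $A_k$ converges to a diagonal matrix whose entries are the eigenvalues of $A$.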
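A minimal Python sketch of the basic iteration is given below. It rests on several assumptions that are not taken from the text: the $QR$-decomposition is computed by modified Gram-Schmidt, the exit condition is simply a fixed number of steps, and the names `qr_mgs`, `matmul` and `qr_method` are illustrative.

```python
import math


def qr_mgs(a):
    """QR-decomposition of a square, nonsingular matrix by modified Gram-Schmidt."""
    n = len(a)
    cols = [[a[i][j] for i in range(n)] for j in range(n)]
    r = [[0.0] * n for _ in range(n)]
    for k in range(n):
        r[k][k] = math.sqrt(sum(x * x for x in cols[k]))
        cols[k] = [x / r[k][k] for x in cols[k]]
        for j in range(k + 1, n):
            r[k][j] = sum(x * y for x, y in zip(cols[k], cols[j]))
            cols[j] = [y - r[k][j] * x for x, y in zip(cols[k], cols[j])]
    q = [[cols[j][i] for j in range(n)] for i in range(n)]
    return q, r


def matmul(x, y):
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


def qr_method(a, steps=50):
    """Repeat A_k = Q_k R_k, A_{k+1} = R_k Q_k for a fixed number of steps."""
    ak = [row[:] for row in a]
    for _ in range(steps):
        q, r = qr_mgs(ak)
        ak = matmul(r, q)
    return ak


# Symmetric test matrix with eigenvalues 3 and 1
result = qr_method([[2.0, 1.0], [1.0, 2.0]])
```

For this matrix the eigenvalues have distinct moduli, so the subdiagonal entry of `result` tends to zero and the diagonal approaches the eigenvalues in decreasing order of modulus.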
In general, matrices $A_k$ do not necessarily converge to a given matrix. If $A$ has $p$ eigenvalues of the same modulus, the matrices $A_k$ approach a block upper triangular form, that is, an upper triangular matrix with a $p \times p$ block on the diagonal, where the entries of this submatrix do not converge. However the eigenvalues of this submatrix will converge. This submatrix can be identified and properly handled. A real matrix may have real and complex eigenvalues. If there is a complex eigenvalue, then there is a corresponding conjugate eigenvalue as well. For pairs of complex conjugate eigenvalues $p$ is at least $2$. Hence the sequence $A_k$ will show this phenomenon.
The $QR$-decomposition is very expensive. Its cost is $\Theta(n^3)$ flops for general $n \times n$ matrices. If $A$ has upper Hessenberg form, the cost of the $QR$-decomposition is $\Theta(n^2)$ flops.
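Definition 12.35 The matrix $A = (a_{ij}) \in \mathbb{C}^{n \times n}$ has upper Hessenberg form, if $a_{ij} = 0$ for all $i > j + 1$, that is, if all entries below the first subdiagonal are zero.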
The following theorem guarantees that if $A$ has upper Hessenberg form, then every $A_k$ of the $QR$-method also has upper Hessenberg form.
Theorem 12.36 If $A$ has upper Hessenberg form and $A = QR$, then $RQ$ also has upper Hessenberg form.
We can transform a matrix $A$ to a similar matrix of upper Hessenberg form in many ways. One of the cheapest ways, whose cost is of order $n^3$ flops, is based on the Gauss elimination method. Considering the advantages of the upper Hessenberg form, an efficient implementation of the $QR$-method first requires the similarity transformation of $A$ to upper Hessenberg form.
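The convergence of the $QR$-method, similarly to the power method, depends on the quotients $|\lambda_{i+1} / \lambda_i|$. The eigenvalues of the shifted matrix $A - \sigma I$ are $\lambda_1 - \sigma, \lambda_2 - \sigma, \ldots, \lambda_n - \sigma$. The corresponding eigenvalue ratios are $|(\lambda_{i+1} - \sigma) / (\lambda_i - \sigma)|$. A proper selection of $\sigma$ can speed up the convergence.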
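The usual form of the $QR$-method includes the transformation to upper Hessenberg form and the shifting.
Shifted-QR-Method($A$)
$H_1 \leftarrow U^{-1} A U$ ($H_1$ is of upper Hessenberg form)
$k \leftarrow 1$
WHILE exit condition = FALSE
  DO compute the $QR$-decomposition $H_k - \sigma_k I = Q_k R_k$
     $H_{k+1} \leftarrow R_k Q_k + \sigma_k I$
     $k \leftarrow k + 1$
RETURN $H_k$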
In practice the $QR$-method is used in shifted form. There are various strategies to select $\sigma_k$. The most often used selection is given by $\sigma_k = h_{nn}^{(k)}$, the last diagonal entry of $H_k$.
The eigenvectors of $A$ can also be determined by the $QR$-method. For this we refer to the literature.
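The effect of shifting can be illustrated with a small experiment. The sketch below makes simplifying assumptions that are not in the text: the shift is kept fixed during the iteration, the input is taken to be already of upper Hessenberg form, the $QR$-decomposition is again computed by modified Gram-Schmidt, and the exit condition is that all entries below the diagonal are smaller than a tolerance. All function names are illustrative.

```python
import math


def qr_mgs(a):
    """QR-decomposition of a square, nonsingular matrix by modified Gram-Schmidt."""
    n = len(a)
    cols = [[a[i][j] for i in range(n)] for j in range(n)]
    r = [[0.0] * n for _ in range(n)]
    for k in range(n):
        r[k][k] = math.sqrt(sum(x * x for x in cols[k]))
        cols[k] = [x / r[k][k] for x in cols[k]]
        for j in range(k + 1, n):
            r[k][j] = sum(x * y for x, y in zip(cols[k], cols[j]))
            cols[j] = [y - r[k][j] * x for x, y in zip(cols[k], cols[j])]
    return [[cols[j][i] for j in range(n)] for i in range(n)], r


def matmul(x, y):
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


def shifted_qr(h, sigma, tol=1e-10, max_steps=1000):
    """Iterate H - sigma*I = QR, H <- RQ + sigma*I; return (H, steps used)."""
    n = len(h)
    h = [row[:] for row in h]
    for step in range(max_steps):
        if max(abs(h[i][j]) for i in range(n) for j in range(i)) < tol:
            return h, step
        shifted = [[h[i][j] - (sigma if i == j else 0.0) for j in range(n)]
                   for i in range(n)]
        q, r = qr_mgs(shifted)
        h = matmul(r, q)
        for i in range(n):
            h[i][i] += sigma
    return h, max_steps


a = [[2.0, 1.0], [1.0, 2.0]]              # eigenvalues 3 and 1
_, plain_steps = shifted_qr(a, 0.0)       # ratio 1/3
h, shifted_steps = shifted_qr(a, 0.9)     # ratio 0.1/2.1
```

With the shift $0.9$ the ratio of the shifted eigenvalues is much smaller than $1/3$, so the tolerance is reached in noticeably fewer steps. The shift must not coincide with an eigenvalue in this simple sketch, because the Gram-Schmidt factorisation assumes a nonsingular matrix.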
Exercises
12.3-1 Apply the power method to the matrix with the initial vector . What is the result of the th step?
12.3-2 Apply the power method, the inverse power method and the $QR$-method to the matrix
12.3-3 Apply the shifted $QR$-method to the matrix of the previous exercise with the choice $\sigma_k = \sigma$ ($\sigma$ is fixed).
We have plenty of devices and tools that support efficient coding and implementation of numerical algorithms. One aim of such developments is to free programmers from writing programs for frequently occurring problems. This is usually done by writing safe, reliable and standardised routines that can be downloaded from (public) program libraries. We just mention the LINPACK, EISPACK, LAPACK, VISUAL NUMERICS (former IMSL) and NAG libraries. Another direction of development is to produce software that works as a programming language and makes programming very easy. Such software systems are MATLAB and SciLab.
The main purpose of the BLAS (Basic Linear Algebra Subprograms) programs is the standardisation and efficient implementation of the most frequent matrix-vector operations. Although the BLAS routines were published in FORTRAN they can be accessed in optimised machine code form as well. The BLAS routines have three levels:
- BLAS 1 (1979),
- BLAS 2 (1988),
- BLAS 3 (1989).
These levels correspond to the computation cost of the implemented matrix operations. The BLAS routines are considered the best implementations of the given matrix operations. The selection of the levels and individual BLAS routines strongly influences the efficiency of the program. A sparse version of BLAS also exists.
We note that the BLAS 3 routines were developed mainly for block parallel algorithms. The standard linear algebra packages LINPACK, EISPACK and LAPACK are built from BLAS routines. The parallel versions can be found in the SCALAPACK package. These programs can be found in the public NETLIB library:
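Let $x, y \in \mathbb{R}^n$, $\alpha \in \mathbb{R}$. The BLAS 1 routines are the programs of the most important vector operations ($z = \alpha x$, $z = x + y$, $dot = x^T y$), the computation of $\|x\|_2$, the swapping of variables, rotations and the saxpy operation which is defined by
$$z = \alpha x + y.$$
The word saxpy means “scalar alpha $x$ plus $y$”. The saxpy operation is implemented in the following way.
Saxpy($\alpha, x, y$)
$n \leftarrow$ number of elements of $x$
FOR $i \leftarrow 1$ TO $n$
  DO $z[i] \leftarrow \alpha x[i] + y[i]$
RETURN $z$
The saxpy is a software driven operation. The cost of BLAS 1 routines is $\Theta(n)$ flops.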
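As a minimal sketch, the saxpy operation can be written in Python as follows. The function name and the use of plain lists are illustrative choices; an optimised BLAS implementation works quite differently internally.

```python
def saxpy(alpha, x, y):
    """Return z = alpha * x + y for two vectors of equal length."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return [alpha * xi + yi for xi, yi in zip(x, y)]


z = saxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])  # [12.0, 24.0, 36.0]
```

The single pass over the two vectors makes the linear cost of the operation visible.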
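The matrix-vector operations of BLAS 2 require $\Theta(n^2)$ flops. These operations are $y = \alpha A x + \beta y$, $y = Ax$, $y = A^{-1} x$, $y = A^T x$, $A \leftarrow A + x y^T$ and their variants. Certain operations work only with triangular matrices. We analyse two operations in detail. The “outer or dyadic product” update
$$A \leftarrow A + x y^T \qquad (A \in \mathbb{R}^{m \times n},\ x \in \mathbb{R}^m,\ y \in \mathbb{R}^n)$$
can be implemented in two ways.
The rowwise or “$ij$” variant:
Outer-Product-Update-Version“ij”($A, x, y$)
$m \leftarrow$ number of rows of $A$
FOR $i \leftarrow 1$ TO $m$
  DO $A[i,:] \leftarrow A[i,:] + x[i]\, y^T$
RETURN $A$
The notation “:” denotes all allowed indices. In our case this means the indices $1 \le j \le n$. Thus $A[i,:]$ denotes the $i$th row of matrix $A$.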
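The columnwise or “$ji$” variant:
Outer-Product-Update-Version“ji”($A, x, y$)
$n \leftarrow$ number of columns of $A$
FOR $j \leftarrow 1$ TO $n$
  DO $A[:,j] \leftarrow A[:,j] + y[j]\, x$
RETURN $A$
Here $A[:,j]$ denotes the $j$th column of matrix $A$. Observe that both variants are based on the saxpy operation.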
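The two variants of the outer product update can be compared in a short Python sketch. Matrices are stored as lists of rows here, and both function names are illustrative; the functions return an updated copy instead of overwriting the input.

```python
def outer_update_ij(a, x, y):
    """Row-wise variant: row i receives x[i] * y (one saxpy per row)."""
    return [[aij + x[i] * yj for aij, yj in zip(a[i], y)]
            for i in range(len(a))]


def outer_update_ji(a, x, y):
    """Column-wise variant: column j receives y[j] * x (one saxpy per column)."""
    result = [row[:] for row in a]
    for j in range(len(y)):
        for i in range(len(x)):
            result[i][j] += y[j] * x[i]
    return result


a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
x = [1.0, 2.0, 3.0]
y = [10.0, 20.0]
by_rows = outer_update_ij(a, x, y)
by_cols = outer_update_ji(a, x, y)
```

Both functions compute the same matrix and differ only in the order in which the entries are visited, which is what matters for memory access in an optimised implementation.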
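The gaxpy operation is defined by
$$z = y + Ax \qquad (A \in \mathbb{R}^{m \times n},\ x \in \mathbb{R}^n,\ y \in \mathbb{R}^m).$$
The word gaxpy means “general $A$ $x$ plus $y$”. The gaxpy operation is also software driven and implemented in the following way:
Gaxpy($A, x, y$)
$n \leftarrow$ number of columns of $A$
$z \leftarrow y$
FOR $j \leftarrow 1$ TO $n$
  DO $z \leftarrow z + x[j]\, A[:,j]$
RETURN $z$
Observe that the computation is done columnwise and the gaxpy operation is essentially a generalised saxpy.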
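A column-oriented gaxpy can be sketched in Python as below. The function name and the list-of-rows storage are assumptions of this example.

```python
def gaxpy(a, x, y):
    """Return z = y + A x, adding one scaled column of A at a time."""
    z = list(y)
    for j in range(len(x)):
        # saxpy step: z <- z + x[j] * (column j of A)
        for i in range(len(z)):
            z[i] += x[j] * a[i][j]
    return z


z = gaxpy([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], [1.0, 1.0], [1.0, 1.0, 1.0])
# z == [4.0, 8.0, 12.0]
```

Each pass of the outer loop is one saxpy with the scalar `x[j]` and the `j`th column of the matrix.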
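The BLAS 3 routines are the implementations of $\Theta(n^3)$ matrix-matrix and matrix-vector operations such as the operations $C \leftarrow \alpha AB + \beta C$, $C \leftarrow \alpha AB^T + \beta C$, $B \leftarrow \alpha T^{-1} B$ ($T$ is upper triangular) and their variants. BLAS 3 operations can be implemented in several forms. For example, the matrix product $C = AB$ can be implemented at least in three ways. Let $A \in \mathbb{R}^{m \times r}$, $B \in \mathbb{R}^{r \times n}$.
Matrix-Product-Dot-Version($A, B$)
$m \leftarrow$ number of rows of $A$
$r \leftarrow$ number of columns of $A$
$n \leftarrow$ number of columns of $B$
$C[1:m, 1:n] \leftarrow 0$
FOR $i \leftarrow 1$ TO $m$
  DO FOR $j \leftarrow 1$ TO $n$
       DO FOR $k \leftarrow 1$ TO $r$
            DO $C[i,j] \leftarrow C[i,j] + A[i,k]\, B[k,j]$
RETURN $C$
This algorithm computes $c_{ij}$ as the dot (inner) product of the $i$th row of $A$ and the $j$th column of $B$. This corresponds to the original definition of matrix products.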
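Now let $A$, $B$ and $C$ be partitioned columnwise as follows:
$$A = [a_1, \ldots, a_r] \ (a_i \in \mathbb{R}^m), \qquad B = [b_1, \ldots, b_n] \ (b_i \in \mathbb{R}^r), \qquad C = [c_1, \ldots, c_n] \ (c_i \in \mathbb{R}^m).$$
Then we can write $c_j$ as the linear combination of the columns of $A$, that is
$$c_j = \sum_{k=1}^{r} b_{kj}\, a_k \qquad (j = 1, \ldots, n).$$
Hence the product can be implemented with saxpy operations.
Matrix-Product-Gaxpy-Variant($A, B$)
$m \leftarrow$ number of rows of $A$
$r \leftarrow$ number of columns of $A$
$n \leftarrow$ number of columns of $B$
$C[1:m, 1:n] \leftarrow 0$
FOR $j \leftarrow 1$ TO $n$
  DO FOR $k \leftarrow 1$ TO $r$
       DO FOR $i \leftarrow 1$ TO $m$
            DO $C[i,j] \leftarrow C[i,j] + A[i,k]\, B[k,j]$
RETURN $C$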
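The following equivalent form of the “$jki$”-algorithm shows that it is indeed a gaxpy based process.
Matrix-Product-with-Gaxpy-Call($A, B$)
$m \leftarrow$ number of rows of $A$
$n \leftarrow$ number of columns of $B$
$C[1:m, 1:n] \leftarrow 0$
FOR $j \leftarrow 1$ TO $n$
  DO $C[:,j] \leftarrow$ Gaxpy($A, B[:,j], C[:,j]$)
RETURN $C$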
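Consider now the partitions $A = [a_1, \ldots, a_r]$ ($a_i \in \mathbb{R}^m$) and
$$B = \begin{bmatrix} b_1^T \\ \vdots \\ b_r^T \end{bmatrix} \qquad (b_i \in \mathbb{R}^n).$$
Then $C = AB = \sum_{k=1}^{r} a_k b_k^T$.
Matrix-Product-Outer-Product-Variant($A, B$)
$m \leftarrow$ number of rows of $A$
$r \leftarrow$ number of columns of $A$
$n \leftarrow$ number of columns of $B$
$C[1:m, 1:n] \leftarrow 0$
FOR $k \leftarrow 1$ TO $r$
  DO FOR $j \leftarrow 1$ TO $n$
       DO FOR $i \leftarrow 1$ TO $m$
            DO $C[i,j] \leftarrow C[i,j] + A[i,k]\, B[k,j]$
RETURN $C$
The inner loop realizes a saxpy operation: it adds a multiple of $a_k$ to the $j$th column of matrix $C$.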
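The three loop orderings can be placed side by side in Python. This is only a sketch with illustrative function names and list-of-rows storage; it shows that the orderings give the same product and differ only in how the innermost loop walks through the data.

```python
def zeros(m, n):
    return [[0.0] * n for _ in range(m)]


def matmul_dot(a, b):
    """ijk order: each c[i][j] is a dot product of a row of A and a column of B."""
    m, r, n = len(a), len(b), len(b[0])
    c = zeros(m, n)
    for i in range(m):
        for j in range(n):
            for k in range(r):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_gaxpy(a, b):
    """jki order: column j of C is built as a gaxpy with the columns of A."""
    m, r, n = len(a), len(b), len(b[0])
    c = zeros(m, n)
    for j in range(n):
        for k in range(r):
            for i in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_outer(a, b):
    """kji order: C is accumulated as a sum of r outer products."""
    m, r, n = len(a), len(b), len(b[0])
    c = zeros(m, n)
    for k in range(r):
        for j in range(n):
            for i in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
b = [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]
products = [matmul_dot(a, b), matmul_gaxpy(a, b), matmul_outer(a, b)]
```

All three functions return the matrix `[[58.0, 64.0], [139.0, 154.0]]` for this input.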
Mathematical software systems are programming tools that support easy programming in concise (possibly mathematical) form within an integrated program development system. Such systems were developed primarily for solving mathematical problems. By now they have been extended so that they can be applied in many other fields. For example, Nokia uses MATLAB in the testing and quality control of mobile phones. We give a short review of MATLAB in the next section. We also mention the widely used MAPLE, DERIVE and MATHEMATICA systems.
The MATLAB software was named after the expression MATrix LABoratory. The name indicates that the matrix operations are very easy to make. The initial versions of MATLAB had only one data type: the complex matrix. In the later versions high dimension arrays, cells, records and objects also appeared. The MATLAB can be learned quite easily and even a beginner can write programs for relatively complicated problems.
The coding of matrix operations is similar to their standard mathematical form. For example if $A$ and $B$ are two matrices of the same size, then their sum is given by the command C = A + B. As a programming language MATLAB contains only four control structures known from other programming languages:
– the simple statement Z = expression,
– the if statement of the form
if expression, commands {else commands} end,
– the for loop of the form
for the values of the loop variable, commands end,
– the while loop of the form
while expression, commands end.
The MATLAB has an extremely large number of built in functions that help efficient programming. We mention the following ones as a sample.
– max(A) selects the maximum element in every column of $A$,
– [v, s] = eig(A) returns the approximate eigenvalues and eigenvectors of $A$,
– the command A\b returns the numerical solution of the linear system $Ax = b$.
The entrywise operations and partitioning of matrices can be done very efficiently in MATLAB. For example, the statement
A([2,3],:) = 1./A([3,2],:)
exchanges the second and third rows of $A$ while it takes the reciprocal of each element of these rows.
The above examples only illustrate the possibilities and easy programming of MATLAB. These examples require much more programming effort in other languages, for example in PASCAL. The built in functions of MATLAB can be easily supplemented by other programs.
The higher number versions of MATLAB include more and more functions and special libraries (tool boxes) to solve special problems such as optimisation, statistics and so on.
There is a built in automatic technique to store and handle sparse matrices that makes the MATLAB competitive in solving large computational problems. The recent versions of MATLAB offer very rich graphic capabilities as well. There is an extra interval arithmetic package that can be downloaded from the WEB site
There is a possibility to build certain C and FORTRAN programs into MATLAB. Finally we mention that the system has an extremely well written help system.
PROBLEMS
12-1
Without overflow
Write a MATLAB program that computes the norm $\|x\|_2 = \left( \sum_{i=1}^{n} x_i^2 \right)^{1/2}$ without overflow in all cases when the result itself does not overflow. It is also required that the error of the final result is not greater than that of the original formula.
12-2
Estimate
Equation has the solution . The perturbed equation has the solutions . Give an estimate for the perturbation .
12-3
Double word length
Consider an arithmetic system that has double word length such that every number represented with digits are stored in two digit word. Assume that the computer can only add numbers with digits. Furthermore assume that the machine can recognise overflow.
(i) Find an algorithm that add two positive numbers of digit length.
(ii) If the representation of numbers requires the sign digit for all numbers, then modify algorithm (i) so that it can add negative and positive numbers both of the same sign. We can assume that the sum does not overflow.
12-4
Auchmuty theorem
Write a MATLAB program for the Auchmuty error estimate (see Theorem 12.22) and perform the following numerical testing.
(i) Solve the linear systems , where is a given matrix, , () are random vectors such that . Compare the true errors (), and the estimated errors , where is the approximate solution of . What is the minimum, maximum and average of numbers ? Use graphic for the presentation of the results. Suggested values are , and .
(ii) Analyse the effect of condition number and size.
(iii) Repeat problems (i) and (ii) using LINPACK and BLAS.
12-5
Hilbert matrix
Consider the linear system , where and is the fourth order Hilbert matrix, that is . is ill-conditioned. The inverse of is approximated by
Thus an approximation of the true solution is given by . Although the true solution is also integer is not an acceptable approximation. Apply the iterative refinement with instead of to find an acceptable integer solution.
12-6
Consistent norm
Let be a consistent norm and consider the linear system
(i) Prove that if is singular, then .
(ii) Show that for the 2-norm equality holds in (i), if and .
(iii) Using the result of (i) give a lower bound to , if
12-7
Cholesky-method
Use the Cholesky-method to solve the linear system , where
Also give the exact Cholesky-decomposition and the true solution of . The approximate Cholesky-factor satisfies the relation . It can be proved that in a floating point arithmetic with -digit mantissa and base the entries of satisfy the inequality , where
Give a bound for the relative error of the approximate solution , if and (IBM3033).
12-8
Bauer-Fike theorem
Let
(i) Analyze the perturbation of the eigenvalues for .
(ii) Compare the estimate of Bauer-Fike theorem to the matrix .
12-9
Eigenvalues
Using the MATLAB eig routine compute the eigenvalues of for various (random) matrices and order . Also compute the eigenvalues of the perturbed matrices , where are random matrices with entries from the interval (). What is the maximum perturbation of the eigenvalues? How precise is the Bauer-Fike estimate? Suggested values are and . How do the results depend on the condition number and the order ? Display the maximum perturbations and the Bauer-Fike estimates graphically.
CHAPTER NOTES
The a posteriori error estimates of linear algebraic systems are not completely reliable. Demmel, Diament and Malajovich [61] showed that for the number estimators there are always cases when the estimate is unreliable (the error of the estimate exceeds a given order). The first appearance of the iterative improvement is due to Fox, Goodwin, Turing and Wilkinson (1946). Experience shows that the decrease of the residual error is not monotone.
Young [279], Hageman and Young [103] give an excellent survey of the theory and application of iterative methods. Barett, Berry et al. [23] give a software oriented survey of the subject. Frommer [78] concentrates on the parallel computations.
The convergence of the $QR$-method is a delicate matter. It is analyzed in great depth and much better results than Theorem 12.34 exist in the literature. There are $QR$-like methods that involve double shifting. Batterson [24] showed that there exists a Hessenberg matrix with complex eigenvalues such that convergence cannot be achieved even with multiple shifting.
Several other methods are known for solving eigenvalue problems (see, e.g. [269], [273]). The $LR$-method is one of the best known ones. It is very effective on positive definite Hermitian matrices. The $LR$-method computes the Cholesky-decomposition $A_k = L_k L_k^*$ and sets $A_{k+1} = L_k^* L_k$.
[1] Difference Equations and Inequalities. Marcel Dekker, New York. 2000.
[3] PRIMES is in P. http://www.cse.iitk.ac.in/users/manindra/. 2002.
[4] Some properties of fix-free codes, In Proceedings of the 1st International Seminarium on Coding Theory and Combinatorics. 1996 (Thahkadzor, Armenia). 20–23.
[5] The Theory of Parsing, Translation and Compiling Vol. I. Prentice-Hall. 1972.
[6] The Theory of Parsing, Translation and Compiling Vol. II. Prentice-Hall. 1973.
[7] Compilers, Principles, Techniques and Tools. Addison-Wesley. 1986.
[8] The Shortest Vector Problem in $L_2$ is NP-hard for Randomized Reduction, In Proceedings of the 30th Annual ACM Symposium on Theory of Computing. 1998. 10–18.
[9] Sorting in $c \log n$ parallel steps. Combinatorica. 1983. 1–19.
[10] Elements of Computer Algebra with Applications. John Wiley & Sons. 1989.
[11] Dynamic TCP acknowledgement, penalizing long delays, In Proceedings of the 25th ACM-SIAM Symposium on Discrete Algorithms. 2003. 47–55.
[12] Graph isomorphism is in SPP, In Proceedings of the 43rd IEEE Symposium on Foundations of Computer Science. IEEE Computer Society Press. 2002. 743–750.
[13] On-line load balancing with applications to machine scheduling and virtual circuit routing. Journal of the ACM. 1997. 486–504.
[14] Theoretische Informatik. Pearson Studium. 2002.
[15] Mathematical Methods of Game and Economic Theory. North-Holland. 1979.
[16] Throughput-competitive online routing, In Proceedings of the 34th Annual Symposium on Foundations of Computer Science. 1993. 32–40.
[17] On-line load balancing, Lecture Notes in Computer Science. Springer-Verlag. 1998. 178–195.
[18] Trading group theory for randomness, In Proceedings of the 17th ACM Symposium on Theory of Computing. ACM Press. 1985. 421–429.
[19] Arthur-Merlin games: A randomized proof system, and a hierarchy of complexity classes. Journal of Computer and Systems Sciences. 1988. 254–276.
[20] Shelf algorithms for two dimensional packing problems. SIAM Journal on Computing. 1983. 508–525.
[21] Universal data compression based on the Burrows-Wheeler transform: theory and practice. IEEE Transactions on Computers. 2000. 1043–1053.
[23] Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods. SIAM. 1994.
[24] Convergence of the shifted QR algorithm on normal matrices. Numerische Mathematik. 1990. 341–352.
[25] Modeling for text compression. Communications of the ACM. 1989. 557–591.
[26] Text Compression. Prentice Hall. 1990.
[27] The asymptotic number of labeled graphs with given degree sequences. Journal of Combinatorial Theory, Series A. 1978. 296–307.
[28] One-way functions in worst-case cryptography: Algebraic and security properties are on the house. SIGACT News. 1999. 25–40.
[29] Twenty years of attacks on the RSA cryptosystem. Notices of the AMS. 1999. 203–213.
[30] Restrictive acceptance suffices for equivalence problems. London Mathematical Society Journal of Computation and Mathematics. 2000. 86–95.
[32] Theory of Computation - Formal Languages, Automata, and Complexity. The Benjamin/Cummings Publishing Company. 1989.
[34] An Improved Deterministic Local Search Algorithm for 3-SAT. Theoretical Computer Science. 2004. 303–313.
[35] Ein Algorithmus zum Auffinden der Basiselemente des Restklassenringes nach einem nulldimensionalen Polynomideal. PhD dissertation, Leopold-Franzens-Universität, Innsbruck. 1965.
[36] A block-sorting lossless data compression algorithm. Research Report 124, http://gatekeeper.dec.com/pub/DEC/SRC/research-reports/abstracts/src-rr-124.html. 1994.
[37] The Calgary/Canterbury Text Compression. ftp://ftp.cpsc.ucalgary.ca/pub/projects/text.compression.corpus. 2004.
[38] The Canterbury Corpus. http://corpus.canterbury.ac.nz. 2004.
[39] Randomness conductors and constant-degree lossless expanders, In Proceedings of the 34th ACM Symposium on Theory of Computing. IEEE Computer Society. 2001. 443–452.
[40] Theory of Finite Automata. Prentice Hall, Englewood Cliffs, New Jersey. 1989.
[41] Universal Classes of Hash Functions. Journal of Computer and System Sciences. 1979. 143–154.
[43] Bounds for list schedules on uniform processors. SIAM Journal on Computing. 1980. 91–103.
[44] Resources for Computer Algebra. Computers in Physics. 1984. 308–315.
[45] New results on the server problem. SIAM Journal on Discrete Mathematics. 1991. 172–181.
[46] An optimal algorithm for $k$-servers on trees. SIAM Journal on Computing. 1991. 144–148.
[47] The server problem and on-line games, DIMACS Series in Discrete Mathematics and Theoretical Computer Science. American Mathematical Society. 1992. 11–64.
[48] Computer and Job Shop Scheduling. John Wiley & Sons. 1976.
[49] Computer Algebra for Industry: Problem Solving in Practice. John Wiley & Sons. 1993.
[50] Computer Algebra for Industry 2, Problem Solving in Practice. John Wiley & Sons. 1995.
[51] The Complexity of Theorem Proving Procedures, In Proceedings of the 3rd Annual ACM Symposium on Theory of Computing. ACM Press. 1971. 151–158.
[52] Engineering a Compiler. Morgan Kaufmann Publishers. 2004.
[53] Small solutions to polynomial equations, and low exponent RSA vulnerabilities. Journal of Cryptology. 1997. 233–260.
[54] Introduction to Algorithms (3rd edition, second corrected printing). The MIT Press/McGraw-Hill. 2010.
[55] Elements of Information Theory. John Wiley & Sons. 1991.
[56] On-line packing and covering problems, Lecture Notes in Computer Science. Springer-Verlag. 1998. 147–177.
[57] Shelf algorithms for on-line strip packing. Information Processing Letters. 1997. 171–175.
[58] Coding Theorems for Discrete Memoryless Systems. Akadémiai Kiadó. 1981.
[59] A deterministic $(2 - 2/(k+1))^n$ algorithm for $k$-SAT based on local search. Theoretical Computer Science. 2002. 69–83.
[60] Computer Algebra: Systems and Algorithms for Algebraic Computation. Academic Press. 2000.
[61] On the complexity of computing error bounds. Foundations of Computational Mathematics. 2001. 101–125.
[62] Lower bound for the redundancy of self-correcting arrangements of unreliable functional elements. Problems of Information Transmission (translated from Russian). 1977. 59–65.
[63] Upper bound for the redundancy of self-correcting arrangements of unreliable elements. Problems of Information Transmission (translated from Russian). 1977. 201–208.
[64] On-line analysis of the TCP acknowledgement delay problem. Journal of the ACM. 2001. 243–273.
[65] Better online algorithms for scheduling with machine cost. SIAM Journal on Computing. 2004. 1035–1051.
[66] New upper and lower bounds for online scheduling with machine cost. Discrete Optimization. 2010. 125–135.
[67] Random Trees. SpringerWienNewYork. 2009.
[68] Problem Solving in Automata, Languages, and Complexity. John Wiley & Sons. 2001.
[69] An Introduction to Difference Equations. Springer-Verlag. 1999 (2nd edition).
[70] Universal lossless source coding with the Burrows-Wheeler transform. IEEE Transactions on Information Theory. 2002. 1061–1081.
[71] Gap-definable counting classes. Journal of Computer and System Sciences. 1994. 116–148.
[72] Competitive $k$-server algorithms. Journal of Computer and System Sciences. 1994. 410–428.
[73] Online Algorithms. The State of Art. Springer-Verlag. 1998.
[74] Crafting a Compiler. The Benjamin/Cummings Publishing Company. 1988.
[75] Analytic Combinatorics. Addison-Wesley. 2009.
[76] On-line scheduling revisited. Journal of Scheduling. 2000. 343–353.
[77] Introduction to the Theory of Games: Concepts, Methods and Applications. Kluwer Academic Publishers. 1999.
[78] Lösung linearer Gleichungssysteme auf Parallelrechnern. Vieweg Verlag. 1990.
[79] Diophantine Equations and Power Integral Bases: New Computational Methods. Birkhäuser Boston. 2002.
[80] Low-density Parity-check Codes. The MIT Press. 1963.
[81] Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman. 1979.
[82] Modern Computer Algebra (first edition). Cambridge University Press. 1993.
[83] Modern Computer Algebra (second edition). Cambridge University Press. 2003.
[84] Reliable cellular automata with self-organization. Journal of Statistical Physics. 2001. 45–267, See also www.arXiv.org/abs/math.PR/0003117 and The Proceedings of the 1997 Symposium on the Theory of Computing.
[85] Lower bounds for the complexity of reliable Boolean circuits with noisy gates. IEEE Transactions on Information Theory. 1994. 579–583.
[86] A simple three-dimensional real-time reliable cellular array. Journal of Computer and System Sciences. 1988. 125–147.
[87] Algebraic Theory of Automata. Akadémiai Kiadó, Budapest. 1972.
[88] Algorithms for Computer Algebra. Kluwer Academic Publishers. 1992.
[89] Deterministic Generalized Automata. Theoretical Computer Science. 1999. 191–208.
[90] Randomness, interactive proofs, and zero-knowledge - A survey, In R. Herken The Universal Turing Machine: A Half-Century Survey. Oxford University Press. 1988. 377–405.
[91] Proofs that yield nothing but their validity or all languages in NP. Journal of the ACM. 1991. 691–729.
[92] Foundations of Cryptography. Cambridge University Press. 2001.
[93] Interactive proof systems, In J. Hartmanis (Ed.) Computational Complexity Theory, AMS Short Course Lecture Notes: Introductory Survey Lectures. Proceedings of Symposia in Applied Mathematics. American Mathematical Society. 1989. 108–128.
[94] The knowledge complexity of interactive proof systems. SIAM Journal on Computing. 1989. 186–208.
[95] Private coins versus public coins in interactive proof systems, In S. Micali (Ed.), Randomness and Computation, Advances in Computing Research. JAI Press. 1989. 73–90, A preliminary version appeared in Proc. 18th Ann. ACM Symp. on Theory of Computing, 1986.
[96] Computer Algebra Systems, In A. Ralston, E. D. Reilly, D. Hemmendinger (Eds.) Encyclopedia of Computer Science. Nature Publishing Group. 4th edition, 2000. 287–301.
[97] Bounds for certain multiprocessor anomalies. The Bell System Technical Journal. 1966. 1563–1581.
[98] Concrete Mathematics. Addison-Wesley. 1994 (2nd edition).
[99] Mathematics for the Analysis of Algorithms. Birkhäuser. 1990 (3rd edition).
[100] Compiler Construction for Digital Computers. John Wiley & Sons. 1971.
[101] Symbolic Computation: Applications to Scientific Computing, Frontiers in Applied Mathematics, Vol. 5. SIAM. 1989.
[102] Modern Compiler Design. John Wiley & Sons. 2000.
[103] Applied Iterative Methods. Academic Press. 1981.
[104] Mathematics of Information and Coding. American Mathematical Society. 2002.
[105] Nonlinear and Dynamic Programming. Addison-Wesley. 1964.
[106] Introduction to Information Theory and Data Compression. Chapman & Hall. 2003 (2nd edition).
[107] A Guide to Computer Algebra Systems. John Wiley & Sons. 1991.
[108] Introduction to Formal Language Theory. Addison-Wesley. 1978.
[109] Solving simultaneous modular equations of low degree. SIAM Journal on Computing. 1988. 336–341, Special issue on cryptography.
[110] Future Directions for Research in Symbolic Computation, SIAM Reports on Issues in the Mathematical Sciences. SIAM. 1990.
[111] Enforcing and defying associativity, commutativity, totality, and strong noninvertibility for one-way functions in complexity theory, In M. Coppo, M. Coppo et al. (Eds.) ICTCS 2005, Lecture Notes in Computer Science. Springer Verlag. 2005. 265–279.
[112] The Complexity Theory Companion, EATCS Texts in Theoretical Computer Science. Springer-Verlag. 2002.
[113] If P $\neq$ NP then Some Strongly Noninvertible Functions are Invertible. Theoretical Computer Science. 2006. 54–62.
[114] Creating strong, total, commutative, associative one-way functions from any one-way function in complexity theory. Journal of Computer and Systems Sciences. 1999. 648–659.
[115] Group-Theoretic Algorithms and Graph Isomorphism, Lecture Notes in Computer Science. Springer-Verlag. 1982.
[116] Tight lower bounds on the ambiguity in strong, total, associative, one-way functions. Journal of Computer and System Sciences. 2004. 657–674.
[117] Introduction to Automata Theory, Languages, and Computation. Addison-Wesley. 2006 (in German: Einführung in Automatentheorie, Formale Sprachen und Komplexitätstheorie, Pearson Studium, 2002). 3rd edition.
[118] Introduction to Automata Theory, Languages, and Computation. Addison-Wesley. 1979.
[119] A Method for the Construction of Minimum-Redundancy Codes. Proceedings of the IRE. (1952). 1098–1101.
[120] Abstract Algebra: An Introduction. Saunders College Publishers. 1990.
[121] Compilers, Their Design and Construction using Pascal. John Wiley & Sons. 1985.
[122] Online strip packing with modifiable boxes. Operations Research Letters. 2001. 79–86.
[123] Online scheduling with general machine cost functions. Discrete Applied Mathematics. 2009. 2070–2077.
[124] On time lookahead algorithms for the online data acknowledgement problem, In Proceedings of MFCS 2007 32nd International Symposium on Mathematical Foundations of Computer Science, Lecture Notes in Computer Science. Springer-Verlag. 2007. 288–297.
[125] Scheduling with machine cost, In Proceedings of APPROX'99, Lecture Notes in Computer Science. 1999. 168–176.
[126] Performance bounds for simple bin packing algorithms. Annales Universitatis Scientiarum Budapestinensis de Rolando Eötvös Nominatae, Sectio Computatorica. 1984. 77–82.
[127] Informatikai algoritmusok 1 (Algorithms of Informatics, Vol. 1). ELTE Eötvös Kiadó. 2004. Electronic version: ELTE Informatikai Kar, 2005.
[128] Informatikai algoritmusok 2 (Algorithms of Informatics, Vol. 2). ELTE Eötvös Kiadó. 2005.
[129] Algorithms of Informatics, Vol. 1. mondAt Kiadó. 2007. Electronic version: AnTonCom, Budapest, 2010.
[130] Algorithms of Informatics, Vol. 2. mondAt Kiadó. 2007. Electronic version: AnTonCom, Budapest, 2010.
[131] Improved Upper Bounds for 3-SAT, In J. Munro (Ed.) Proceedings of the Fifteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 328–329. Society for Industrial and Applied Mathematics. 2004.
[132] Near-Optimal Bin Packing Algorithms. PhD thesis, MIT Department of Mathematics. 1973.
[133] Fast algorithms for bin packing. Journal of Computer and System Sciences. 1974. 272–314.
[134] Worst-case performance-bounds for simple one-dimensional bin packing algorithms. SIAM Journal on Computing. 1974. 299–325.
[135] A generalization of Brouwer's fixed point theorem. Duke Mathematical Journal. 1941. 457–459.
[136] The secure use of RSA. CryptoBytes. 1995. 7–13.
[137] The nonlinear complementarity problems with applications. I, II. Journal of Optimization Theory and Applications. 1969. 87–98 and 167–181.
[138] Use of Symbolic Computation in Probability and Statistics, In Z. Karian (Ed.) Symbolic Computation in Undergraduate Mathematics Education. Mathematical Association of America, Number 24 in Notes of Mathematical Association of America. 1992.
[139] Dynamic TCP acknowledgement and other stories about $e/(e-1)$, In Proceedings of the 31st Annual ACM Symposium on Theory of Computing. 2001. 502–509.
[140] Reducibility Among Combinatorial Problems, In R. E. Miller, J. W. Thatcher Complexity of Computer Computations. Plenum Press. 1972. 85–103.
[141] Combinatorica cu aplicatii (Combinatorics with Applications). Presa Universitara Clujeana. 2003.
[142] Automata and Formal Languages. Prentice Hall. 1995.
[143] Generating random regular graphs, In Proceedings of the Thirty Fifth ACM Symposium on Theory of Computing. 2003. 213–222.
[144] Fundamental Algorithms, The Art of Computer Programming, Vol. 1. Addison-Wesley. 1968 (3rd updated edition).
[145] Seminumerical Algorithms, The Art of Computer Programming, Vol. 2. Addison-Wesley. 1969 (3rd corrected edition).
[146] Sorting and Searching, The Art of Computer Programming, Vol. 3. Addison-Wesley. 1973 (3rd corrected edition).
[147] On the $k$-server conjecture. Journal of the ACM. 1995. 971–983.
[148] Computer Algebra: Impact and Perspectives. Nieuw Archief voor Wiskunde. 1999. 29–55.
[149] Automata and Computability. Springer-Verlag. 1995.
[150] Turing machines with few accepting computations and low sets for PP. Journal of Computer and System Sciences. 1992. 272–286.
[152] The Graph Isomorphism Problem: Its Structural Complexity. Birkhäuser. 1993.
[153] The performance of universal encoding. IEEE Transactions on Information Theory. 1981. 199–207.
[154] Contributions to the Theory of Games. II. Princeton University Press. 1953.
[155] Space efficient linear time computation of the Burrows and Wheeler transformation, In I. Althöfer, N. Cai, G. Dueck, L. Khachatrian, M. Pinsker, A. Sárközy, I. Wegener, Z. Zhang (Eds.): Numbers, Information and Complexity. Kluwer Academic Publishers. 2000. 375–383.
[156] Information storage in a memory assembled from unreliable components. Problems of Information Transmission (translated from Russian). 1973. 254–264.
[157] A comparison of polynomial time reducibilities. Theoretical Computer Science. 1975. 103–124.
[158] An introduction to arithmetic coding. IBM Journal of Research and Development. 1984. 135–149.
[160] The Development of the Number Field Sieve, Lecture Notes in Mathematics. Springer-Verlag. 1993.
[162] On-line network routing, Lecture Notes in Computer Science. Springer-Verlag. 1998. 242–267.
[163] Introduction to Finite Fields and Their Applications. Cambridge University Press. 1986.
[164] An Introduction to Formal Languages and Automata. Jones and Bartlett Publishers. 2001.
[165] An Introduction to Formal Language and Automata. Jones and Bartlett Publishers, Inc. 2006. 4th edition.
[166] Algebraic Combinatorics on Words. Cambridge University Press. 2002.
[168] Algoritmusok (Algorithms). Műszaki Könyvkiadó és Tankönyvkiadó, Budapest. 1978 and 1987.
[169] Compilers and interpreters, A. B. Tucker (ed.) Computer Science Handbook, pages 99/1–99/30. Chapman & Hall/CRC. 2004.
[171] Writing Compilers and Interpreters. Addison-Wesley. 1991.
[172] Competitive algorithms for server problems. Journal of Algorithms. 1990. 208–230.
[173] Two-person zero-sum games and quadratic programming. Journal of Mathematical Analysis and its Applications. 1964. 348–355.
[174] Mathematical Theory of Computation. McGraw-Hill Book Co. 1974.
[175] Nonlinear Programming Theory and Methods. Akadémiai Kiadó. 1975.
[176] A note on the graph isomorphism counting problem. Information Processing Letters. 1979. 131–132.
[177] Automata and Languages: Theory and Applications. Springer-Verlag. 2000.
[178] The equivalence problem for regular expressions with squaring requires exponential space, In Proceedings of the 13th IEEE Symposium on Switching and Automata Theory. 1972. 125–129.
[179] Complexity of Lattice Problems: A Cryptographic Perspective, The Kluwer International Series in Engineering and Computer Science. Kluwer Academic Publishers. 2002.
[180] Difference Equations. Theory and Applications. Van Nostrand Reinhold. 1990.
[181] Mathematics for Computer Algebra. Springer. 1992.
[182] Riemann's Hypothesis and Tests for Primality. Journal of Computer and System Sciences. 1976. 300–317.
[183] Equilibrium points of finite games. SIAM Journal of Applied Mathematics. 1976. 397–402.
[184] Algorithmic Algebra. Springer. 1993.
[185] An Introduction to Formal Language Theory. Springer-Verlag. 1988.
[187] Protocol failures in cryptosystems, In G. Simmons (Ed.): Contemporary Cryptology: The Science of Information Integrity. IEEE Computer Society Press. 1992. 541–558.
[188] Randomized Algorithms. Cambridge University Press. 1995.
[189] Advanced Compiler Design and Implementation. Morgan Kaufmann Publishers. 1997.
[190] Noncooperative games. Annals of Mathematics. 1951. 286–295.
[192] Probabilistic logics and the synthesis of reliable organisms from unreliable components, In C. Shannon, P. J. McCarthy (Eds.) Automata Studies. Princeton University Press. 1956. 43–98.
[193] Theory of Games and Economic Behavior. Princeton University Press. 1947 (2nd edition).
[194] Note on noncooperative games. Pacific Journal of Mathematics. 1955. 807–815.
[195] Applications of Symbolic Mathematics to Mathematics. Kluwer Academic Publishers. 1985.
[196] Expectation and Stability of Oligopoly Models. Springer. 1976.
[197] The Theory of Oligopoly with Multi-Product Firms. Springer. 1999 (2nd edition).
[198] Computational Complexity. Addison-Wesley. 1994.
[199] Source Coding Algorithms for Fast Data Compression. Stanford University. 1976.
[200] An improved exponential-time algorithm for k-SAT, In Proceedings of the 39th IEEE Symposium on Foundations of Computer Science, pages 628–637. IEEE Computer Society Press. 1998.
[201] Computer Algebra. Scientific American. 1981. 102–113.
[203] On a lower bound for the redundancy of reliable networks with noisy gates. IEEE Transactions on Information Theory. 1991. 639–643.
[204] Analysis of error correction by majority voting, In S. Micali (Ed.) Randomness in Computation. JAI Press. 1989. 171–198.
[206] The Art of Compiler Design, Theory and Practice. Prentice Hall. 1992.
[207] Theorems on factorization and primality testing. Proceedings of the Cambridge Philosophical Society. 1974. 521–528.
[208] An observation on associative one-way functions in complexity theory. Information Processing Letters. 1997. 239–244.
[209] Probabilistic Algorithms for Testing Primality. Journal of Number Theory. 1980. 128–138.
[210] Upward separation for FewP and related classes. Information Processing Letters. 1994. 175–180 (Corrigendum appears in the same journal, 74(1–2):89, 2000).
[211] Reliable computation with noisy circuits and decision trees-a general lower bound, In Proceedings of the 32nd IEEE FOCS Symposium. 1991. 602–611.
[212] Generalized Kraft inequality and arithmetic coding. IBM Journal of Research and Development. 1976. 198–203.
[213] Integration in Finite Terms. Columbia University Press. 1948.
[214] A Method for Obtaining Digital Signatures and Public-Key Cryptosystems. Communications of the ACM. 1978. 120–126.
[215] An iterative method of solving a game. Annals of Mathematics. 1951. 296–301.
[216] Existence and uniqueness of equilibrium points for concave n-person games. Econometrica. 1965. 520–534.
[217] Some Facets of Complexity Theory and Cryptography: A Five-Lecture Tutorial. ACM Computing Surveys. 2002. 504–549.
[218] A promise class at least as hard as the polynomial hierarchy. Journal of Computing and Information. 1995. 92–107.
[219] Complexity Theory and Cryptology. An Introduction to Cryptocomplexity, EATCS Texts in Theoretical Computer Science. Springer-Verlag. 2005.
[220] Handbook of Formal Languages, Volumes I–III. Springer-Verlag. 1997.
[221] Theory of Automata. Pergamon Press. 1969.
[222] Formal Languages. Academic Press. 1987 (2nd updated edition).
[223] Public-Key Cryptography, EATCS Monographs on Theoretical Computer Science, Vol. 23. Springer-Verlag. 1996 (2nd edition).
[224] Data Compression. Springer-Verlag. 2004 (3rd edition).
[225] Introduction to Data Compression. Morgan Kaufmann Publishers. 2000 (2nd edition).
[226] A low and a high hierarchy within NP. Journal of Computer and System Sciences. 1983. 14–28.
[227] Graph isomorphism is in the low hierarchy. Journal of Computer and System Sciences. 1987. 312–323.
[228] A probabilistic algorithm for k-SAT based on limited local search and restart, In Proceedings of the 40th IEEE Symposium on Foundations of Computer Science. IEEE Computer Society Press. 1999. 410–414.
[230] An Introduction to the Analysis of Algorithms. Addison-Wesley. 1996.
[231] Polynomial time enumeration reducibility. SIAM Journal on Computing. 1978. 440–457.
[232] IP = PSPACE. Journal of the ACM. 1992. 869–877.
[234] The synthesis of two-terminal switching circuits. The Bell System Technical Journal. 1949. 59–98.
[235] Note on a computation method in the theory of games. Communications on Pure and Applied Mathematics. 1958. 587–593.
[236] Scheduling parallel machines online. SIAM Journal on Computing. 1995. 1313–1331.
[237] Finite Fields: Theory and Computation. The Meeting Point of Number Theory, Computer Science, Coding Theory, and Cryptography. Kluwer Academic Publishers. 1999.
[238] Automata Theory. World Scientific Publishing Company. 1999.
[239] On-line scheduling, Lecture Notes in Computer Science. Springer-Verlag. 1998. 196–231.
[240] Theory of Formal Languages with Applications. World Scientific Publishing Company. 1999.
[242] Introduction to the Theory of Computation. PWS Publishing Company. 1997.
[244] Amortized efficiency of list update and paging rules. Communications of the ACM. 1985. 202–208.
[245] The Algorithmic Resolution of Diophantine Equations, In London Mathematical Society Student Text. Cambridge University Press. 1998.
[246] A fast Monte Carlo test for primality. SIAM Journal on Computing. 1977. 84–85.
[247] Linear-time encodable and decodable error-correcting codes, In Proceedings of the 27th ACM STOC Symposium. 1995. 387–397 (further IEEE Transactions on Information Theory 42(6): 1723–1732).
[248] Highly fault-tolerant parallel computation, In Proceedings of the 37th IEEE Foundations of Computer Science Symposium. 1996. 154–163.
[249] Cryptography: Theory and Practice. CRC Press. 2002 (2nd edition).
[250] The polynomial-time hierarchy. Theoretical Computer Science. 1977. 1–22.
[251] Finite Automata, Formal Logic, and Circuit Complexity. Birkhäuser. 1994.
[252] Languages and Machines. Addison-Wesley. 1997.
[254] Principles and Procedures of Numerical Analysis. Plenum Press. 1998.
[255] A lattice-theoretical fixpoint theorem and its application. Pacific Journal of Mathematics. 1955. 285–308.
[256] JPEG 2000 - Image Compression, Fundamentals, Standards and Practice. Society for Industrial and Applied Mathematics. 1983.
[257] Reliable information storage in memories designed from unreliable components. The Bell System Technical Journal. 1968. 2299–2337.
[258] Compiler Writing. McGraw-Hill Book Co. 1985.
[259] On computable numbers, with an application to the Entscheidungsproblem. Proceedings of the London Mathematical Society, ser. 2. 1936.
[261] The relative complexity of checking and evaluating. Information Processing Letters. 1976. 20–23.
[262] On-line machine scheduling. PhD thesis, Eindhoven University of Technology. 1997.
[264] An improved lower bound for on-line bin packing algorithms. Information Processing Letters. 1992. 277–284.
[265] Lower and upper bounds for on-line bin packing and scheduling heuristics. PhD thesis, Erasmus University, Rotterdam. 1995.
[266] Computational Complexity. D. Reidel Publishing Company. 1986 (and Kluwer Academic Publishers, 2001).
[267] Compiler Construction. Springer-Verlag. 1984.
[268] The JPEG still picture compression standard. Communications of the ACM. 1991. 30–44.
[269] Bulge exchanges in algorithms of QR type. SIAM Journal on Matrix Analysis and Applications. 1998. 1074–1096.
[270] Formal Language: A Practical Introduction. Franklin, Beedle & Associates, Inc. 2008.
[271] Vorlesungen zur Komplexitätstheorie. B. G. Teubner Verlagsgesellschaft. 2000.
[272] Codes and Cryptography. Oxford University Press. 1988.
[273] Convergence of the LR, QR, and related algorithms. The Computer Journal. 1965. 77–84.
[274] The context-tree weighting method: basic properties. IEEE Transactions on Information Theory. 1995. 653–664.
[275] The context-tree weighting method: basic properties. IEEE Information Theory Society Newsletter. 1997. 1 and 20–27.
[276] Polynomial Algorithms in Computer Algebra. Springer-Verlag. 1990.
[277] Arithmetic coding for sequential data compression. Communications of the ACM. 1987. 520–540.
[278] New algorithms for bin packing. Journal of the ACM. 1980. 207–227.
[279] Iterative Solution of Large Linear Systems. Academic Press. 1971.
[280] On-line file caching. Algorithmica. 2002. 371–383.
[281] A universal algorithm for sequential data compression. IEEE Transactions on Information Theory. 1977. 337–343.
[282] Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory. 1978. 530–536.