2$.
\end{gyak}
\vspace{-1mm}
\begin{fld}
\vspace{-1mm}
\ujfld{External attributes}{Maier\nevindex{Maier, David} calls attribute $A$
an \ki{external attribute}\felindex{attribute!external} in the functional
dependency $X\rightarrow Y$ with respect to the family of dependencies $F$
over schema $R$, if at least one of the following two conditions holds:}
\begin{enumerate}
\item $\left(F-\{X\rightarrow Y\}\right)\cup \{X\rightarrow (Y-A)\}\models
X\rightarrow Y$, or
\item $\left(F-\{X\rightarrow Y\}\right)\cup \{(X-A)\rightarrow Y\}\models
X\rightarrow Y$.
\end{enumerate}
Design an $O(n^2)$ running time algorithm whose input is a schema $(R,F)$ and
whose output is a set of dependencies $G$, equivalent to $F$, that contains no
external attributes.
\ujfld{The order of the elimination steps in the construction of minimal cover
is important}{In the procedure \textsc{Minimal-cover}($R,F$) the set of
functional dependencies was changed in two
ways:\felindex{minimal-cover@\textsc{Minimal-cover}}} either by dropping
redundant dependencies, or by dropping redundant attributes from the left hand
sides of the dependencies. If the latter method is applied first, until no
more attributes can be dropped from any left hand side, and then the former, a
minimal cover is indeed obtained, according to
Proposition~\ref{106all}. Prove that if the first method is applied first and
then the second, each until no further application is possible, then the
obtained set of dependencies is not necessarily a minimal cover of $F$.
\ujfld{BCNF subschema}{Prove that the following problem is coNP-complete:
Given a relational schema $R$ with set of functional dependencies $F$,
furthermore $S\subset R$, decide whether $(S,\pi_S(F))$ is in
BCNF.\felindex{normal form!BCNF}}
\ujfld{3NF is hard to recognise}{Let $(R,F)$ be a relational schema, where $F$
is the system of functional dependencies. \newline
The \ki{$k$ size key problem} is the following: given a natural number $k$,
determine whether there exists a key of size at most $k$.\newline
The \ki{prime attribute problem} is the following: for a given $A\in R$,
determine whether it is a prime attribute. \felindex{normal
form!third@3NF}\felindex{attribute!prime}}\label{3nfnp}
\begin{description}
\item[\textit{a.}] Prove that the $k$ size key problem is
NP-complete. \textit{Hint.} Reduce the \ki{vertex cover} problem to the
$k$ size key problem.
\item[\textit{b.}] Prove that the prime attribute problem is NP-complete by
reducing the $k$ size key problem to it.
\item[\textit{c.}] Prove that determining whether the relational schema
$(R,F)$ is in 3NF is NP-complete. \textit{Hint.} Reduce the prime attribute
problem to it.
\end{description}
\vspace{-3mm}
\ujfld{Running time of \textsc{Dependency-basis}} {Give an implementation of
procedure \textsc{Dependency-basis}, whose running time is $O(|M|\cdot
|R|^3)$. \felindex{dependency-basis@\textsc{Dependency-basis}}}\label{a76biz}
\end{fld}
\vspace{-2mm}
\begin{fejmegj}
The relational data model was introduced by Codd \cite{codd}\nevindex{Codd,
Edgar F. (1923--2003)} in 1970. Functional dependencies were treated in his paper of 1972
\cite{cod72b}; their axiomatisation was completed by
Armstrong\nevindex{Armstrong, William Ward} \cite{arm74}. The logical
implication problem for functional dependencies was investigated by
Beeri\nevindex{Beeri, Catriel} and Bernstein\nevindex{Bernstein, P. A.}
\cite{bb79}, furthermore by
Maier\nevindex{Maier, David} \cite{mai80}. In that paper Maier also treats the
possible definitions of minimal covers, their connections and the complexity
of their computation. Maier, Mendelzon\nevindex{Mendelzon,
Alberto O.} and Sagiv\nevindex{Sagiv, Yehoshua} found a method to decide logical
implications among general dependencies \cite{mms79}.
Beeri, Fagin and Howard\nevindex{Howard,
J. H.} proved that axiom system (A1)--(A8) is sound and complete for
functional and multivalued dependencies taken together \cite{bfh77}. Yu and
Johnson \cite{yujo} constructed \nevindex{Yu, C. T.}\nevindex{Johnson,
D. T.} a relational schema where $|F|=k$ and the number of keys is
$k!$. Békéssy\nevindex{Békéssy, András} and Demetrovics\nevindex{Demetrovics,
János} \cite{bede} gave a simple and beautiful proof of the statement that
at most $k!$ keys can be obtained from $k$ functional dependencies, thus Yu
and Johnson's construction is extremal.
Armstrong-relations were introduced and studied by Fagin\nevindex{Fagin, R.}
\cite{fag82a,fag82b}, furthermore by Beeri, Fagin,
Dowd\nevindex{Dowd, M.} and Statman\nevindex{Statman, R.} \cite{bdfs84}.
Multivalued dependencies were independently discovered by
Zaniolo\nevindex{Zaniolo, C.} \cite{zan76}, Fagin
\cite{fag77b} and Delobel\nevindex{Delobel, C.} \cite{del78}.
The necessity of normal forms was recognised by Codd while studying update
anomalies \cite{cod71, cod72a}. The Boyce--Codd normal form was introduced in
\cite{cod74}\nevindex{Boyce, Raymond F.}. The definition of the third normal
form used in this chapter was given by Zaniolo
\cite{zan82}. Complexity of decomposition into subschemata in certain normal
forms was studied by Lucchesi\nevindex{Lucchesi,
C. L.} and Osborne \cite{lo78}, Beeri and Bernstein \cite{bb79}, furthermore
Tsou\nevindex{Tsou, D. M.} and Fischer\nevindex{Fischer, P. C.} \cite{tf82}.
Theorems~\ref{beeri80} and \ref{beeri80ff} are results of Beeri
\cite{bee80}. Theorem~\ref{tetabu} is from a paper of Aho\nevindex{Aho, Alfred V.}, Beeri and Ullman\nevindex{Ullman,
Jeffrey David} \cite{abu79}.
Theorems~\ref{jdnonax} and \ref{jdkomplex} are from the book of Abiteboul\nevindex{Abiteboul, Serge},
Hull\nevindex{Hull, Richard} and Vianu\nevindex{Vianu, Victor} \cite{ahv}, the
non-existence of finite axiomatisation of join dependencies is Petrov's result\nevindex{Petrov,
S. V.} \cite{pet89}.
Branching dependencies were introduced by Demetrovics, Katona\nevindex{Katona, Gyula O. H.}
and Sali\nevindex{Sali, Attila}, who studied the existence of
Armstrong-relations and the size of minimal Armstrong-relations
\cite{dks1,dks2,dks3,sasa}. Katona showed in an invited talk (ICDT'92,
Berlin) that no finite axiomatisation of branching dependencies exists, but
the result was never published.
Possibilities of axiomatisation of numerical dependencies were investigated by
Grant\nevindex{Grant, John} and Minker\nevindex{Minker, Jack} \cite{grmi85a,grmi85b}.
A good introduction to the concepts of this chapter can be found in the books
of Abiteboul, Hull and Vianu \cite{ahv}, of Ullman \cite{ull88}, and of
Thalheim\nevindex{Thalheim, Bernhardt} \cite{tha91}.
\end{fejmegj}
\index{relational data bases|)}
\chapter[ Query Rewriting in Relational Databases]
{Query Rewriting in Relational Databases}\nevindex{Demetrovics, János}\nevindex{Sali, Attila}
\index{query rewriting|(}
In the chapter ``Relational database design'' basic concepts of relational
databases were introduced, such as relational schema\index{relational schema},
relation\index{relation} and
instance\index{relation!instance}\index{instance}. Databases were studied from
the designer's point of view; the main question was how to avoid redundant
data storage and the various anomalies arising during the use of the database.
In the present chapter the schema is considered to be given and the focus is on
fast and efficient ways of answering user queries. First, basic (theoretical)
query languages\index{query!language} and their connections are reviewed in
Section~\ref{sec:Lekerdezesek}.
In the second part of this chapter (Section~\ref{sec:Nezetek})
views\index{view} are considered. Informally, a view is nothing else but the
result of a query. The use of views for query efficiency, for providing
physical data independence, and for data integration is explained.
Finally, the third part of the present chapter (Section~\ref{sec:Atiras})
introduces query rewriting.
\section{Queries\label{sec:Lekerdezesek}}
Consider the database of cinemas in Budapest. Assume that the schema consists
of three relations:
\begin{equation}\label{moziséma}
\textbf{CinePest}=\{\textit{Film,Theater,Show}\} \koz .
\end{equation}
The schemata of individual relations are as follows:
\begin{equation}\label{mozisémarel}
\begin{array}{rcl}
\textit{Film}&=&\{\textbf{T}\textit{itle},\textbf{D}\textit{irector},\textbf{A}\textit{ctor}\} \koz ,\\
\textit{Theater}&=&\{\textbf{Th}\textit{eater},\textbf{A}\textit{ddress},\textbf{P}\textit{hone}\} \koz ,\\
\textit{Show}&=&\{\textbf{Th}\textit{eater},\textbf{T}\textit{itle},\textbf{Ti}\textit{me}\} \koz .\end{array}
\end{equation}
Possible instances of each relation are shown in
Figure~\ref{mozpeld}.
\begin{figure}[!t]
\begin{center}
\large{\textit{Film}}\\ \vspace{0.3cm}
\begin{tabular}{|l|l|l|}
\hline
\textbf{T}\textit{itle}&\textbf{D}\textit{irector}&\textbf{A}\textit{ctor} \\
\hline\hline
Control&Antal, Nimród&Csányi, Sándor\\
Control&Antal, Nimród&Mucsi, Zoltán\\
Control&Antal, Nimród&Pindroch, Csaba\\
\vdots&\vdots&\vdots\\
Rashomon&Akira Kurosawa&Toshiro Mifune \\
Rashomon&Akira Kurosawa&Machiko Kyo\\
Rashomon&Akira Kurosawa&Mori Masayuki\\
\hline
\end{tabular}\\ \vspace{0.7cm}
\large{\textit{Theater}}\\ \vspace{0.3cm}
\begin{tabular}{|l|l|l|}
\hline
\textbf{Th}\textit{eater}&\textbf{A}\textit{ddress}&\textbf{P}\textit{hone}\\
\hline\hline
Bem&II., Margit Blvd. 5/b.& 316-8708\\
Corvin&VIII., Corvin alley 1.&459-5050\\
Európa&VII., Rákóczi st. 82.&322-5419\\
Művész&VI., Teréz blvd. 30.&332-6726\\
\vdots&\vdots&\vdots\\
Uránia&VIII., Rákóczi st. 21.&486-3413\\
Vörösmarty&VIII., Üllői st. 4.&317-4542
\\
\hline\end{tabular}
\\ \vspace{0.7cm}
\large{\textit{Show}}\\ \vspace{0.3cm}
\begin{tabular}{|l|l|l|}
\hline
\textbf{Th}\textit{eater}&\textbf{T}\textit{itle}&\textbf{Ti}\textit{me}\\
\hline\hline
Bem&Rashomon&19:00\\
Bem&Rashomon&21:30\\
Uránia&Control&18:15\\
Művész&Rashomon&16:30\\
Művész&Control&17:00\\
\vdots&\vdots&\vdots\\
Corvin&Control&10:15\\
\hline\end{tabular}
\end{center}
\caption{The database \textbf{CinePest}.}\label{mozpeld}\abraindex{relation!instance}
\end{figure}
Typical user queries could be:
\begin{description}
\item{\textbf{\thechapter.\thequery}} Who is the director of
``Control''?\stepcounter{query}
\item{\textbf{\thechapter.\thequery}} \newcounter{negynegy}
\setcounter{negynegy}{\value{query}}List the names and addresses of those
theatres where Kurosawa films are played.\stepcounter{query}
\item{\textbf{\thechapter.\thequery}} Give the names of directors who played
a part in some of their films.\stepcounter{query}
\end{description}
These queries define a mapping from the relations of database schema
\textbf{CinePest} to some other schema (in the present case to schemata of
single relations). Formally, \ki{query}\inddef{query} and \ki{query
mapping}\inddef{query!mapping} should be distinguished. The former is a
syntactic concept, the latter is a mapping from the set of instances over the
input schema to the set of instances over the output schema, that is
determined by the query according to some semantic interpretation. However,
for both concepts the word ``query'' is used for the sake of convenience,
which one is meant
will be clear from the context.
\begin{defi}\label{queryekviv}
Queries $q_1$ and $q_2$ over schema $R$ are said to be
\ki{equivalent,}\inddef{query!equivalent} in notation $q_1\equiv q_2$, if they
have the same output schema and for every instance $\mathcal{I}$ over schema
$R$ $q_1(\mathcal{I})=q_2(\mathcal{I})$ holds.
\end{defi}
In the remainder of this chapter the most important query languages are
reviewed. The
expressive powers of query languages need to be compared.
\begin{defi}\label{nyelvekviv}
Let $\mathcal{Q}_1$ and $\mathcal{Q}_2$ be query languages (with appropriate
semantics). $\mathcal{Q}_1$ is \ki{dominated by} $\mathcal{Q}_2$
($\mathcal{Q}_1$ is \ki{weaker} than $\mathcal{Q}_2$), in notation
$\mathcal{Q}_1\sqsubseteq\mathcal{Q}_2$, if for every query $q_1$ of
$\mathcal{Q}_1$
there exists a query $q_2\in\mathcal{Q}_2$ such that $q_1\equiv q_2$.
$\mathcal{Q}_1$ and $\mathcal{Q}_2$ are \ki{equivalent,}\inddef{query!language!equivalent} if $\mathcal{Q}_1\sqsubseteq\mathcal{Q}_2$ and
$\mathcal{Q}_1\sqsupseteq\mathcal{Q}_2$.
\end{defi}
\begin{pelda}{\textit{Query.}}\label{ex411}
Consider Question~\thechapter.\thenegynegy. As a first try the following
solution is obtained:
\begin{description}
\item{\key{if}} there exist in relations \textit{Film}, \textit{Theater} and
\textit{Show} tuples $(x_T,\textrm{``Akira Kurosawa''},x_{A})$,
$(x_{Th},x_{Ad},x_P)$ and $(x_{Th},x_T,x_{Ti})$
\item{\key{then}} put the tuple
$(\textit{Theater}:x_{Th},\textit{Address}:x_{Ad})$ into the output relation.
\end{description}
$x_T,x_{A},x_{Th},x_{Ad},x_P,x_{Ti}$ denote distinct variables that take
their values from the domains of the corresponding attributes,
respectively. Using the same variable in different tuples implicitly marks
where identical values must stand.
\end{pelda}
\subsection{Conjunctive queries}
Conjunctive queries are the simplest kind of queries, but they are the
easiest to handle and have the most useful properties. Three equivalent forms
will be studied: two of them based on logic, the third one of algebraic
nature. The name comes from first order logic expressions that contain only
existential quantifiers ($\exists$) and consist of atomic formulae
connected by logical ``and'', that is, conjunction.
\subsubsection{Datalog -- rule based queries}
The tuple $(x_1,x_2,\ldots ,x_m) $ is called \ki{free
tuple}\inddef{tuple!free}\inddef{free tuple} if the $x_i$'s are variables or
constants. This is a generalisation of a tuple of a relational instance.
For example, the tuple $(x_T,\textrm{``Akira Kurosawa''},x_{A})$ in
Example~\ref{ex411} is a free tuple.
\begin{defi}\label{rule}
Let $\mathbf{R}$ be a relational database schema. \ki{Rule based conjunctive
query}\inddef{query!conjunctive!rule based} is an expression of the
following form
\begin{equation}\label{szabaly}
\var{ans}(u)\la R_1(u_1),R_2(u_2),\ldots ,R_n(u_n) \koz ,
\end{equation}
where $n\ge 0$, $R_1,R_2,\ldots ,R_n$ are relation names from $\mathbf{R}$,
\var{ans} is a relation name not in $\mathbf{R}$, $u,u_1,u_2,\ldots ,u_n$ are free
tuples. Every variable occurring in $u$ must occur in one of $u_1,u_2,\ldots
,u_n$, as well.
\end{defi}
The rule based conjunctive query is also called a
\ki{rule}\index{rule} for the sake of simplicity. $\var{ans}(u)$ is the
\ki{head}\inddef{rule!head} of the rule,
$ R_1(u_1),R_2(u_2),\ldots ,R_n(u_n)$ is the
\ki{body}\inddef{rule!body} of the rule, $R_i(u_i)$ is called a
\ki{(relational) atom}\inddef{atom!relational}. It is assumed that
each variable of the head also occurs in some atom of the body.
A rule can be considered a tool that tells how to deduce
newer and newer \ki{facts,} that is, tuples to include in the output
relation. If the variables of the rule can be assigned values such
that each atom $R_i(u_i)$ is true (that is, the appropriate tuple is
contained in the relation $R_i$), then tuple $u$ is added to the
relation \var{ans}. Since all variables of the head occur in some atoms of
the body, one never has to consider \ki{infinite} domains,\inddef{infinite domain} as the
variables can take values from the actual instance
queried. Formally, let $\mathcal I$ be an instance over relational
schema $\mathbf{R}$, furthermore let $q$ be the query given by rule
(\ref{szabaly}). Let $var(q)$ denote the set of variables occurring in
$q$, and let $dom({\mathcal I})$ denote the set of constants that
occur in $\mathcal I$. The \ki{image of $\mathcal I$ under
$q$}\inddef{image under a query} is given by
\begin{equation}\label{q-kep}
q({\mathcal I})=\{\nu(u)\vert \nu\colon var(q)\ra dom({\mathcal I})\mbox{\ and\
} \nu(u_i)\in R_i \mbox{\ for\ } i=1,2,\ldots ,n\} \koz .
\end{equation}
An immediate way of calculating $q({\mathcal I})$ is to consider all
possible valuations $\nu$ in some order. There are more efficient
algorithms, either by equivalent rewriting of the query, or by using
some indices.
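The direct evaluation of (\ref{q-kep}) by trying all valuations can be sketched in Python. Representing relations as sets of tuples and marking variables with an \texttt{x} prefix are conventions of this illustration only, not of the text:

```python
from itertools import product

def eval_rule(head, body, instance):
    """Naive evaluation of a rule-based conjunctive query.

    head: free tuple of the rule head, e.g. ("xTh", "xAd")
    body: list of atoms (relation_name, free_tuple)
    instance: dict mapping relation names to sets of tuples
    Variables are strings with an 'x' prefix; everything else is a constant.
    """
    def is_var(t):
        return isinstance(t, str) and t.startswith("x")

    variables = sorted({t for _, u in body for t in u if is_var(t)})
    # dom(I): the set of constants occurring in the instance
    domain = {c for rel in instance.values() for t in rel for c in t}
    answers = set()
    # consider every valuation nu: var(q) -> dom(I)
    for values in product(domain, repeat=len(variables)):
        nu = dict(zip(variables, values))
        subst = lambda u: tuple(nu.get(t, t) for t in u)
        # nu(u_i) must lie in R_i for every atom of the body
        if all(subst(u) in instance[r] for r, u in body):
            answers.add(subst(head))
    return answers
```

This naive enumeration takes time exponential in the number of variables; the more efficient evaluations mentioned above avoid it by rewriting the query or using indices.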
An important difference between atoms of the body and the head is that
relations $R_1,R_2,\ldots ,R_n$ are considered given, (physically)
stored, while relation \var{ans} is not; it is calculated with
the help of the rule. This justifies the names: the $R_i$'s are
\ki{extensional relations}\inddef{relation!extensional} and \var{ans} is an
\ki{intensional relation.}\inddef{relation!intensional}
Query $q$ over schema $\mathbf{R}$ is \ki{monotone,}\inddef{query!monotone} if
for instances $\mathcal{I}$ and $\mathcal{J}$ over $\mathbf{R}$,
$\mathcal{I}\subseteq\mathcal{J}$ implies $q(\mathcal{I})\subseteq
q(\mathcal{J})$. $q$ is \ki{satisfiable,}\inddef{query!satisfiable} if there
exists an instance $\mathcal{I}$, such that $q(\mathcal{I})\ne\emptyset$. The
proof of the next simple observation is left to the Reader
(Exercise~\ref{konjmon}).
\begin{allit}\label{422} Rule based queries are monotone and satisfiable.
\end{allit}
Proposition~\ref{422} shows the limitations of rule based queries. For example,
the query \textit{Which theatres play only Kurosawa films?} is obviously not
monotone, hence cannot be expressed by rules of form (\ref{szabaly}).
\subsubsection{Tableau queries.}
If the difference between variables and constants is not considered, then the
body of a rule can be viewed as an instance over the schema. This leads to a
tabular form of conjunctive queries that is most similar to the visual
queries (QBE: Query By
Example)\index{QBE} of the database management system Microsoft
Access\index{Microsoft!Access}.
\begin{defi}\label{tableau}
A \ki{tableau}\inddef{query!tableau} over the schema $\mathbf{R}$ is a
generalisation of an instance over $\mathbf{R}$, in the sense that variables
may occur in the tuples besides constants. The pair $(\mathbf{T},u)$ is a
\ki{tableau query} if $\mathbf{T}$ is a tableau and $u$ is a free tuple such
that all variables of $u$ occur in $\mathbf{T}$, as well. The free tuple $u$
is the \ki{summary.}\inddef{query!tableau!summary}
\end{defi}
The summary row $u$ of tableau query $(\mathbf{T},u)$ shows which tuples form
the result of the query. The essence of the procedure is that the pattern
given by tableau $\mathbf{T}$ is searched for in the database, and if it is
found, then the tuple corresponding to the summary is included in the output
relation.
More precisely, the mapping $\nu\colon var(\mathbf{T})\ra dom({\mathcal I})$
is an \ki{embedding} of tableau $(\mathbf{T},u)$ into instance $\mathcal I$,
if $\nu(\mathbf{T})\subseteq{\mathcal I}$. The output relation of tableau
query $(\mathbf{T},u)$ consists of all tuples $\nu(u)$ such that $\nu$ is an
embedding of tableau $(\mathbf{T},u)$ into instance $\mathcal I$.
\begin{pelda}{\textit{Tableau query}}
Let $\mathbf{T}$ be the following tableau.
\[
\begin{array}{r|lll}
\textit{Film}&\textit{Title}&\textit{Director}&\textit{Actor}\\
\cline{2-4}
&x_T&\textrm{``Akira Kurosawa''}&x_{A}\\
\multicolumn{4}{c}{}\\
\textit{Theater}&\textit{Theater}&\textit{Address}&\textit{Phone}\\
\cline{2-4}
&x_{Th}&x_{Ad}&x_P\\
\multicolumn{4}{c}{}\\
\textit{Show}&\textit{Theater}&\textit{Title}&\textit{Time}\\
\cline{2-4}
&x_{Th}&x_T&x_{Ti}
\end{array}
\]
The tableau query $(\mathbf{T},\langle \textit{Theater}\colon x_{Th},\textit{Address} \colon
x_{Ad}\rangle)$ answers question \thechapter.\thenegynegy. of the introduction.
\end{pelda}
The syntax of tableau queries is similar to that of rule based queries. It
will be useful later that conditions for one query to contain another one can
be easily formulated in the language of tableau queries.
\subsubsection{Relational algebra$^*$.}
A database consists of relations, and a relation is a set of tuples. The
result of a query is also a relation with a given attribute set. It is a
natural idea that output of a query could be expressed by algebraic and other
operations on relations. The \ki{relational algebra$^*$}\inddef{relational
algebra$^*$} consists of the following operations.\footnote{ The relational
algebra$^*$ is the \ki{monotone} part of the (full) relational algebra
introduced later.}
\begin{description}
\item{\textit{Selection:}}\inddef{selection} It is of form either
$\sigma_{A=c}$ or $\sigma_{A=B}$, where $A$ and $B$ are attributes while $c$
is a constant. The operation can be applied to every relation $R$ that
has attribute $A$ (and $B$); its result is the relation \var{ans} that has the
same set of attributes as $R$ and consists of all tuples that satisfy
the \ki{selection condition.}\inddef{selection!condition}
\item{\textit{Projection:}}\inddef{projection} The form of the operation is
$\pi_{A_1,A_2,\ldots ,A_n},$ $n\ge 0,$ where $A_i$'s are distinct
attributes. It can be applied to all such relations whose attribute set
includes each $A_i$, and its result is the relation \var{ans} that has attribute
set $\{A_1,A_2,\ldots ,A_n\},$
\[
\var{ans}=\{t[A_1,A_2,\ldots ,A_n]\vert t\in R\} \koz ,
\]
that is it consists of the restrictions of tuples in $R$ to the attribute set
$\{A_1,A_2,\ldots ,A_n\}$.
\item{\textit{Natural join:}}\inddef{join!natural}\inddef{natural join} This
operation has been defined earlier in chapter ``Relational database
design''. Its notation is $\Join,$ its input consists of two (or more)
relations $R_1,$ $R_2$, with attribute sets $V_1,$ $V_2$, respectively. The
attribute set of the output relation is $V_1\cup V_2.$
\[
R_1\Join R_2=\{t\textrm{\ tuple\ over\ }V_1\cup V_2\vert \exists
v\in R_1,\exists w\in R_2, t[V_1]=v\textrm{\ and\ }t[V_2]=w\} \koz .
\]
\item{\textit{Renaming:}}\inddef{renaming} Attribute renaming is nothing else
but an injective mapping from a finite set of attributes $U$ into the set of
all attributes. Attribute renaming $f$ can be given by the list of pairs
$(A,f(A))$, where $A\ne f(A),$ which is usually written in the form
$A_1A_2\ldots A_n\ra B_1B_2\ldots B_n$. The \ki{renaming operator}
$\delta_f$ maps inputs over $U$ to outputs over $f[U]$. If $R$ is a
relation over $U$, then
\[
\delta_f(R)=\{v\ \textrm{over\ } f[U]\vert \exists u\in R,v(f(A))=u(A),\
\forall A\in U\} \koz .
\]
\vspace{-4mm}
\end{description}
\vspace{-6mm}
Relational algebra$^*$ queries are obtained by finitely many applications of
the operations above from \ki{relational algebra base queries,} which are
\begin{description}
\item{\textit{Input relation:}} $R.$
\item{\textit{Single constant:}} $\{\langle A:a\rangle\},$ where
$a$ is a constant, $A$ is an attribute name.
\end{description}
\begin{pelda}{\textit{Relational algebra$^*$ query.}} The question
\thechapter.\thenegynegy. of the introduction can be expressed with
relational algebra operations as follows.
\[
\pi_{Theater,Address}\left((\sigma_{\textit{Director}=\textrm{``Akira Kurosawa''}}(\textit{Film})\Join\textit{Show})\Join\textit{Theater}\right) \koz .
\]
\end{pelda}
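As an illustration, the four operators of relational algebra$^*$ can be sketched in Python, with a relation represented as a list of dictionaries mapping attribute names to values (a convention of this sketch, not of the text):

```python
def select_eq(rel, attr, const):
    """sigma_{A=c}: tuples of rel whose value on attribute A equals c."""
    return [t for t in rel if t[attr] == const]

def project(rel, attrs):
    """pi_{A1,...,An}: restrict each tuple to attrs, removing duplicates."""
    out = []
    for t in rel:
        r = {a: t[a] for a in attrs}
        if r not in out:
            out.append(r)
    return out

def natural_join(r1, r2):
    """r1 |><| r2: merge pairs of tuples that agree on common attributes."""
    common = set(r1[0]) & set(r2[0]) if r1 and r2 else set()
    return [{**v, **w} for v in r1 for w in r2
            if all(v[a] == w[a] for a in common)]

def rename(rel, f):
    """delta_f: rename attributes by the injective mapping f (a dict)."""
    return [{f.get(a, a): t[a] for a in t} for t in rel]
```

Composing `select_eq`, `natural_join` and `project` exactly as in the displayed expression yields the theatre--address pairs of Question~\thechapter.\thenegynegy.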
The mapping given by a relational algebra$^*$ query can be easily defined via
induction on the operation tree. It is easy to see (Exercise~\ref{emptyquery})
that non-satisfiable queries can be given using relational algebra$^*$. There
exists no rule based or tableau query equivalent to such a non-satisfiable
query. Nevertheless, the following is true.
\begin{tetel}\label{ekvivalenciatetel}
Rule based queries, tableau queries and satisfiable relational algebra$^*$
queries are equivalent query languages.
\end{tetel}
The proof of Theorem~\ref{ekvivalenciatetel} consists of three main parts:
\begin{enumerate}
\item Rule based $\equiv$ Tableau
\item Satisfiable relational algebra$^*$ $\sqsubseteq$ Rule based
\item Rule based $\sqsubseteq$ Satisfiable relational algebra$^*$
\end{enumerate}
The first (easy) step is left to the Reader (Exercise~\ref{szab=tab}). For the
second step, it has to be seen first that the language of rule based queries
is closed under composition. More precisely, let $\mathbf{R}=\{R_1,R_2,\ldots
,R_n\}$ be a database, $q$ be a query over $\mathbf{R}$. If the output
relation of $q$ is $S_1$, then in a subsequent query $S_1$ can be used in the
same way as any extensional relation of $\mathbf{R}$. Thus relation $S_2$ can
be defined, then with its help relation $S_3$ can be defined, and so
on. Relations $S_i$ are \ki{intensional}\index{relation!intensional}
relations. The \ki{conjunctive query
program}\inddef{query!conjunctive!program} $P$ is a list of rules
\begin{equation}\label{konjprog}
\begin{array}{rcl}
S_1(u_1)&\la&body_1\\
S_2(u_2)&\la&body_2\\
&\vdots& \\
S_m(u_m)&\la&body_m \koz ,
\end{array}\end{equation}
where $S_i$'s are pairwise distinct and not contained in $\mathbf{R}$. In rule
body $body_i$ only relations $R_1,R_2,\ldots R_n$ and $S_1,S_2,\ldots
,S_{i-1}$ can occur. $S_m$ is considered to be the output relation of $P$; its
evaluation is done by computing the results of the rules one-by-one in
order. It is not hard to see that with appropriate renaming of the variables,
$P$ can be substituted by a single rule, as shown in the following example.
\begin{pelda}{\textit{Conjunctive query program.}}\label{peld:konjpr}
Let $\mathbf{R}=\{Q,R\}$, and consider the following conjunctive query program
\begin{equation}\label{konjprpeld}
\begin{array}{rcl}
S_1(x,z)&\la&Q(x,y),R(y,z,w)\\
S_2(x,y,z)&\la&S_1(x,w),R(w,y,v),S_1(v,z)\\
S_3(x,z)&\la&S_2(x,u,v),Q(v,z) \koz .
\end{array}\end{equation}
$S_2$ can be written using $Q$ and $R$ only by the first two rules of
(\ref{konjprpeld})
\begin{equation}\label{s2-q,r}
S_2(x,y,z)\la Q(x,y_1),R(y_1,w,w_1),R(w,y,v),Q(v,y_2),R(y_2,z,w_2) \koz .
\end{equation}
It is apparent that some variables had to be renamed to avoid unwanted
interaction of rule bodies. Substituting expression (\ref{s2-q,r}) into the
third rule of (\ref{konjprpeld}) in place of $S_2$, and appropriately renaming
the variables
\begin{equation}\label{s3-q,r}
S_3(x,z)\la Q(x,y_1),R(y_1,w,w_1),R(w,u,v_1),Q(v_1,y_2),R(y_2,v,w_2),Q(v,z).
\end{equation}
is obtained.
\end{pelda}
Thus it is enough to realise each single relational algebra$^*$ operation by
an appropriate rule.
\begin{description}
\item{$P\Join Q$:} Let $\overrightarrow{x}$ denote the list of variables (and
constants) corresponding to the common attributes of $P$ and $Q$, let
$\overrightarrow{y}$ denote the variables (and constants) corresponding to
the attributes occurring only in $P$, while $\overrightarrow{z}$ denotes
the variables corresponding to $Q$'s own attributes. Then rule
$\var{ans}(\overrightarrow{x},\overrightarrow{y},\overrightarrow{z}) \la
P(\overrightarrow{x},\overrightarrow{y}),Q(\overrightarrow{x},\overrightarrow{z})$
gives exactly relation $P\Join Q$.
\item{$\sigma_F(R)$:} Assume that $R=R(A_1,A_2,\ldots
,A_n)$ and the selection condition $F$ is of form either $A_i=a$ or
$A_i=A_j$, where $A_i,A_j$ are attributes and $a$ is a constant. Then
\[\var{ans}(x_1,\ldots,x_{i-1},a,x_{i+1},\ldots ,x_n)\la R( x_1,\ldots
,x_{i-1},a,x_{i+1},\ldots ,x_n) \koz ,\]
respectively,
\[
\begin{array}{l}\var{ans}(x_1,\ldots
,x_{i-1},y,x_{i+1},\ldots ,x_{j-1},y,x_{j+1},\ldots ,x_n)\la\\
\hskip5cm
R(x_1,\ldots
,x_{i-1},y,x_{i+1},\ldots ,x_{j-1},y,x_{j+1},\ldots ,x_n)
\end{array}
\]
are the rules sought. The satisfiability of the relational algebra$^*$ query
is used here. Indeed, during composition of operations we never obtain an
expression where two distinct constants should be equated.
\item{$\pi_{A_{i_1},A_{i_2},\ldots ,A_{i_m}}(R)$:} If $R=R(A_1,A_2,\ldots
,A_n)$, then \[\var{ans}(x_{i_1},x_{i_2},\ldots ,x_{i_m})\la R(x_1,x_2,\ldots
,x_n)\] works.
\item{$A_1A_2\ldots A_n\ra B_1B_2\ldots B_n$:} The renaming operation of
relational algebra$^*$ can be achieved by renaming the appropriate
variables, as it was shown in Example~\ref{peld:konjpr}.
\end{description}
For the proof of the third step let us consider rule
\begin{equation}\label{szab2algebr}
\var{ans}(\overrightarrow{x})\la
R_1(\overrightarrow{x_1}),R_2(\overrightarrow{x_2}),\ldots
,R_n(\overrightarrow{x_n}) \koz .
\end{equation}
By renaming the attributes of relations $R_i$'s, we may assume without loss of
generality that all attribute names are distinct. Then $R=R_1\Join
R_2\Join\cdots\Join R_n$ can be constructed, which is really a direct product,
since the attribute names are distinct. The constants and multiple
occurrences of variables of rule (\ref{szab2algebr}) can be simulated by
appropriate selection operators. The final result is obtained by projecting to
the set of attributes corresponding to the variables of relation \var{ans}.
\subsection{Extensions}\label{subsec:kiterjesztesek}
Conjunctive queries are a class of query languages that has many good
properties. However, the set of expressible questions is rather
narrow. Consider the following.
\begin{description}
\item{\textbf{\thechapter.\thequery.}}\newcounter{cegyenlo}\setcounter{cegyenlo}{\value{query}}\stepcounter{query}
List those pairs of persons such that one directed the other in a film and,
vice versa, the other also directed the first in a film.
\item{\textbf{\thechapter.\thequery.}}\newcounter{cunio}\setcounter{cunio}{\value{query}}\stepcounter{query}
Which theatres show ``La Dolce Vita'' or ``Rashomon''?
\item{\textbf{\thechapter.\thequery.}}\newcounter{cnegal}\setcounter{cnegal}{\value{query}}\stepcounter{query}
Which are those films of Hitchcock that Hitchcock did not play a part in?
\item{\textbf{\thechapter.\thequery.}}\newcounter{cnegalotharom}\setcounter{cnegalotharom}{\value{query}}\stepcounter{query}
List those films whose every actor played in some film of Fellini.
\item{\textbf{\thechapter.\thequery.}}\newcounter{crekurzio}\setcounter{crekurzio}{\value{query}}\stepcounter{query}
Let us recall the game ``Chain-of-Actors''. The first player names an
actor/actress, the next names another one who played in some film together
with the first. The game is continued in this way: each time a new
actor/actress must be named who played together with the previous one. The
winner is the player who can extend the chain last. List those
actors/actresses who can be reached by ``Chain-of-Actors'' starting with
``Marcello Mastroianni''.
\end{description}
\subsubsection{Equality atoms.}
Question \textbf{\thechapter.\thecegyenlo.} can be easily answered if
equalities are also allowed in rule bodies, besides relational atoms:
\begin{equation}\label{=megenged}
\var{ans}(y_1,y_2)\la Film(x_1,y_1,z_1),Film(x_2,y_2,z_2),y_1=z_2,y_2=z_1.
\end{equation}
Allowing equalities raises two problems. First, the result of the query could
become infinite. For example, the rule based query
\begin{equation}\label{vegtelen}
\var{ans}(x,y)\la R(x),y=z
\end{equation}
results in an infinite number of tuples, since variables $y$ and $z$ are not
bound by relation $R$, thus there can be an infinite number of valuations
that satisfy the rule body. Hence, the concept of \ki{domain restricted} query
is introduced. Rule based query $q$ is \ki{domain
restricted,}\inddef{query!conjunctive!domain
restricted}\inddef{domain restricted} if all variables that occur in the
rule body also occur in some relational atom.
The second problem is that equality atoms may cause the body of a rule to
become unsatisfiable, in contrast to Proposition~\ref{422}. For example, query
\begin{equation}\label{a=b?}
\var{ans}(x)\la R(x), x=a,x=b
\end{equation}
is domain restricted, however if $a$ and $b$ are distinct constants, then the
answer will be empty. It is easy to check whether a rule based query with
equality atoms is satisfiable.
\begin{alg}{Satisfiable($q$)}\inddef{satisfiable@\textsc{Satisfiable}}
1 \' Compute the transitive closure of equalities of the body of $q$. \\
2 \' \key{if} \= Two distinct constants should be equal by transitivity\\
3 \' \> \key{then} \= \key{return} \= ``Not satisfiable.''\\
4 \' \> \key{else} \> \key{return} \> ``Satisfiable.''
\end{alg}
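The transitive closure computed in line 1 of \textsc{Satisfiable} can be maintained with a union-find structure. Below is a Python sketch; representing terms as strings and passing the set of constants explicitly are assumptions of this illustration:

```python
def satisfiable(equalities, constants):
    """Decide satisfiability of the equality atoms of a rule body.

    equalities: list of pairs (s, t); each term is a variable or a constant.
    constants: the set of terms that are constants.
    The body is unsatisfiable iff transitivity forces two distinct
    constants into the same equality class.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for s, t in equalities:  # union the equality classes of s and t
        rs, rt = find(s), find(t)
        if rs != rt:
            if rs in constants:
                parent[rt] = rs  # keep a constant as representative
            else:
                parent[rs] = rt
    # collect the classes and check each contains at most one constant
    classes = {}
    for x in parent:
        classes.setdefault(find(x), set()).add(x)
    return all(len(cls & constants) <= 1 for cls in classes.values())
```

With path compression and union by rank this runs in nearly linear time in the number of equality atoms.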
It is also true (Exercise~\ref{45c}) that if a rule based query $q$
that contains equality
atoms is satisfiable, then there exists another rule based query
$q'$ without equalities that is equivalent to $q$.
\subsubsection{Disjunction -- union.}
The question \textbf{\thechapter.\thecunio}. cannot be expressed with
conjunctive queries. However, if the union operator is added to
relational algebra, then \textbf{\thechapter.\thecunio}. can be
expressed in that extended relational algebra:
\begin{equation}\label{eq:unioalg}
\pi_{\textit{Theater}}\left(\sigma_{\textit{Title}=\textrm{``La Dolce Vita''}}(\textit{Show})\cup\sigma_{\textit{Title}=\textrm{``Rashomon''}}(\textit{Show})\right) \koz .
\end{equation}
Rule based queries are also capable of expressing question
\textbf{\thechapter.\thecunio}. if it is allowed that the same
relation is in the head of many distinct rules:
\begin{equation}\label{eq:unioszabaly}
\begin{array}{l}
\var{ans}(x_{Th})\la \textit{Show}(x_{Th},\textrm{``La Dolce Vita''},x_{Ti}) \koz ,\\
\var{ans}(x_{Th})\la \textit{Show}(x_{Th},\textrm{``Rashomon''},x_{Ti}) \koz .
\end{array}
\end{equation}
The \ki{non-recursive datalog program} is a generalisation of
this. \inddef{datalog!non-recursive}
\begin{defi}\label{def:nrdatalog}
A \ki{non-recursive datalog program} over schema $\mathbf{R}$ is a set
of rules
\begin{equation}\label{eq:nrdatalog}
\begin{array}{rcl}
S_1(u_1)&\la&body_1\\
S_2(u_2)&\la&body_2\\
&\vdots& \\
S_m(u_m)&\la&body_m \koz ,
\end{array}\end{equation}
where no relation of $\mathbf{R}$ occurs in a head, the same relation
may occur in the head of several rules, furthermore there exists an
ordering $r_1,r_2,\ldots ,r_m$ of the rules such that the relation in
the head of $r_i$ does not occur in the body of any rule $r_j$ for $j\le i$.
\end{defi}
The semantics of the non-recursive datalog program
(\ref{eq:nrdatalog}) is similar to the conjunctive query program
(\ref{konjprog}). The rules are evaluated in the order $r_1,r_2,\ldots
,r_m$ of Definition~\ref{def:nrdatalog}, and if a relation occurs in
more than one head then the union of the sets of tuples given by those
rules is taken.
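This evaluation can be sketched on the two-rule program (\ref{eq:unioszabaly}); each rule body is evaluated in turn, and the tuples produced for the same head relation are accumulated, that is, their union is taken. The \textit{Show}(\textit{Theater},\textit{Title},\textit{Time}) instance below is invented for illustration.

```python
# Invented sample instance of Show(Theater, Title, Time).
show = {
    ("Corvin", "La Dolce Vita", "19:00"),
    ("Urania", "Rashomon", "20:30"),
    ("Corvin", "Psycho", "21:00"),
}

# Both rules have ans in the head, so their results are unioned.
ans = set()
for wanted_title in ("La Dolce Vita", "Rashomon"):
    ans |= {(theater,) for (theater, title, time) in show
            if title == wanted_title}

print(sorted(ans))   # [('Corvin',), ('Urania',)]
```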
The union of tableau queries $(\mathbf{T}_i,u)$, $i=1,2,\ldots ,n$, is
denoted by $(\{\mathbf{T}_1,\mathbf{T}_2,\ldots
,\mathbf{T}_n\},u)$. It is evaluated by individually computing the
result of each tableau query $(\mathbf{T}_i,u)$, then the union of
them is taken. The following holds.
\begin{tetel}\label{tet:nrekviv}
The language of non-recursive datalog programs with unique output
relation and the relational algebra extended with the union operator
are equivalent.
\end{tetel}
The proof of Theorem~\ref{tet:nrekviv} is similar to that of
Theorem~\ref{ekvivalenciatetel} and it is left to the Reader
(Exercise~\ref{gy:nrekviv}). Let us note that the expressive power of
the union of tableau queries is weaker. This is caused by the
requirement that each tableau must have the same summary row. For example,
the non-recursive datalog program query
\begin{equation}\label{eq:tablanemnrdatalog}
\begin{array}{l}
\var{ans}(a)\la\\
\var{ans}(b)\la
\end{array}\end{equation}
cannot be realised as union of tableau queries.
\subsubsection{Negation.}
The query \textbf{\thechapter.\thecnegal}. is obviously not
monotone. Indeed, suppose that in relation \textit{Film} there exist tuples
about Hitchcock's film \textit{Psycho}, for example
(``Psycho'',``A.~Hitchcock'',``A.~Perkins''),
(``Psycho'',``A.~Hitchcock'',``J.~Leigh''), $\ldots$ , however, the
tuple (``Psycho'',``A.~Hitchcock'',``A.~Hitchcock'') is not
included. Then the tuple (``Psycho'') occurs in the output of query
\textbf{\thechapter.\thecnegal}. With some effort one can realize
however, that Hitchcock appears in the film \textit{Psycho}, as ``a
man in cowboy hat''. If the tuple
(``Psycho'',``A.~Hitchcock'',``A.~Hitchcock'') is added to relation
\textit{Film} as a consequence, then the instance of schema
\textbf{CinePest} gets larger, but the output of query
\textbf{\thechapter.\thecnegal}. becomes smaller.
It is not too hard to see that the query languages discussed so far
are monotone, hence query \textbf{\thechapter.\thecnegal}. cannot be
formulated with non-recursive datalog program or with some of its
equivalents. Nevertheless, if the difference ($-$) operator is also added to
relational algebra, then it becomes capable of expressing queries of
type \textbf{\thechapter.\thecnegal}. For example,
\begin{equation}\label{eq:cnegal}
\pi_{\textit{Title}}\left(\sigma_{\textit{Director=\textrm{``A.~Hitchcock''}}}(\textit{Film})\right)-\pi_{\textit{Title}}\left(\sigma_{\textit{Actor}=\textrm{``A.~Hitchcock''}}(\textit{Film})\right)
\end{equation}
realises exactly query \textbf{\thechapter.\thecnegal}. Hence, the (full)
relational algebra consists of operations
$\{\sigma,\pi,\Join,\delta,\cup,-\}$. The importance of the relational
algebra is shown by the fact that Codd\nevindex{Codd, Edgar F. (1923--2003)} calls
a query language $\mathcal Q$ \ki{relationally complete,}\inddef{query
language!relationally complete} exactly if for all relational
algebra query $q$ there exists $q'\in\mathcal{Q}$, such that
$q\equiv q'$.
If \ki{negative literals,}\inddef{literal!negative} that is, atoms of
the form $\lnot R(u)$, are also allowed in rule bodies, then the
obtained \ki{non-recursive datalog with
negation,}\inddef{datalog!non-recursive!with negation} in notation
\ki{nr-datalog$^{\lnot}$}, is relationally complete.
\begin{defi}\label{def:nrdlognegrule}
A non-recursive datalog$^{\lnot}$ (nr-datalog$^{\lnot}$) rule is of
form
\begin{equation}\label{eq:nrdlognegrule}
q:\quad S(u)\la L_1,L_2,\ldots ,L_n \koz ,
\end{equation}
where $S$ is a relation, $u$ is a free tuple\index{free tuple},
$L_i$'s are \ki{literals,}\inddef{literal} that is expressions of the
form $R(v)$ or $\lnot R(v)$, such that $v$ is a free tuple for
$i=1,2,\ldots ,n$. $S$ does not occur in the body of the rule. The
rule is \ki{domain restricted,}\index{rule!domain restricted} if each
variable $x$ that occurs in the rule also occurs in a \ki{positive
literal}\inddef{literal!positive} (expression of the form $R(v)$)
of the body. Every nr-datalog$^{\lnot}$ rule is considered domain
restricted, unless it is specified otherwise.
\end{defi}
The semantics of rule (\ref{eq:nrdlognegrule}) is as follows. Let
$\mathbf{R}$ be a relational schema that contains all relations occurring
in the body of $q$, furthermore, let $\mathcal I$ be an instance over
$\mathbf{R}$. The \ki{image of $\mathcal I$ under $q$} is
\newlength{\acs}
\setlength{\acs}{\arraycolsep}
\setlength{\arraycolsep}{0pt}
\begin{equation}\label{eq:nrdlogsemantics}
\begin{array}{ll}
q(\mathcal{I})=\{\nu(u)\vert\ \nu\textrm{\ is\ a\ valuation\ of\
}&\textrm{the\ variables\ and\ for\ }i=1,2,\ldots ,n\\
&\nu(u_i)\in\mathcal{I}(R_i),\ \textrm{if}\ L_i=R_i(u_i)\textrm{\ and}\\
&\nu(u_i)\not\in\mathcal{I}(R_i),\ \textrm{if}\ L_i=\lnot R_i(u_i)\}.
\end{array}\end{equation}
\setlength{\arraycolsep}{\acs}
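A direct reading of (\ref{eq:nrdlogsemantics}): enumerate the valuations satisfying the positive literals, then discard those under which some negated atom holds. The sketch below is illustrative; terms starting with \texttt{?} play the role of variables, and the sample \textit{Film} instance is invented.

```python
def eval_rule(head, positives, negatives, instance):
    """Evaluate one domain-restricted nr-datalog¬ rule.
    head: tuple of terms; positives/negatives: lists of
    (relation, terms); terms starting with '?' are variables."""
    def match(terms, row, env):
        env = dict(env)
        for term, value in zip(terms, row):
            if term.startswith("?"):
                if env.setdefault(term, value) != value:
                    return None           # variable already bound differently
            elif term != value:
                return None               # constant mismatch
        return env

    # all valuations nu satisfying the positive literals
    envs = [{}]
    for rel, terms in positives:
        envs = [e2 for e in envs for row in instance[rel]
                if (e2 := match(terms, row, e)) is not None]
    # keep nu only if no negated atom is present in the instance
    return {tuple(e.get(t, t) for t in head) for e in envs
            if all(tuple(e.get(t, t) for t in terms) not in instance[rel]
                   for rel, terms in negatives)}

film = {("Psycho", "A. Hitchcock", "A. Perkins"),
        ("Psycho", "A. Hitchcock", "J. Leigh")}
print(eval_rule(("?x",),
                [("Film", ("?x", "A. Hitchcock", "?z"))],
                [("Film", ("?x", "A. Hitchcock", "A. Hitchcock"))],
                {"Film": film}))   # {('Psycho',)}
```

Adding the tuple (``Psycho'',``A.~Hitchcock'',``A.~Hitchcock'') to the instance makes the negative literal fail for the valuation binding \texttt{?x} to ``Psycho'', so the answer shrinks to the empty set, matching the non-monotone behaviour described earlier.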
An \ki{nr-datalog$^{\lnot}$
program}\inddef{nr-datalog@nr-datalog$^{\lnot}$ program} over schema
$\mathbf{R}$ is a collection of nr-datalog$^{\lnot}$ rules
\begin{equation}\label{eq:nrdlognegprog}
\begin{array}{rcl}
S_1(u_1)&\la&body_1\\
S_2(u_2)&\la&body_2\\
&\vdots& \\
S_m(u_m)&\la&body_m \koz ,
\end{array}\end{equation}
where relations of schema $\mathbf{R}$ do not occur in heads of
rules, the same relation may appear in more than one rule head,
furthermore there exists an ordering $r_1,r_2,\ldots ,r_m$ of the
rules such that the relation of the head of rule $r_i$ does not occur
in the body of any rule $r_j$ if $j\le i$.
The computation of the result of nr-datalog$^{\lnot}$ program
(\ref{eq:nrdlognegprog}) applied to instance $\mathcal I$ over schema
$\mathbf{R}$ can be done in the same way as the evaluation of
non-recursive datalog program (\ref{eq:nrdatalog}), with the
difference that the individual nr-datalog$^{\lnot}$ rules should be
interpreted according to (\ref{eq:nrdlogsemantics}).
\begin{pelda}{\textit{Nr-datalog$^{\lnot}$ program.}} Let us assume
that all films that are included in relation \textit{Film} have only
one director. (It is not always true in real life!) The
nr-datalog$^{\lnot}$ rule
\begin{equation}\label{eq:51megvalosit}
\var{ans}(x)\la \textit{Film}(x,\textrm{``A.~Hitchcock''},z),\lnot
\textit{Film}(x,\textrm{``A.~Hitchcock''},\textrm{``A.~Hitchcock''})
\end{equation}
expresses query \textbf{\thechapter.\thecnegal}.
Query \textbf{\thechapter.\thecnegalotharom.} is realised by the
nr-datalog$^{\lnot}$ program
\begin{equation}\label{eq:jancsojo}
\begin{array}{rcl}
\textit{Fellini-actor}(z)&\la&\textit{Film}(x,\textrm{``F.~Fellini''},z)\\
\textit{Not-the-answer}(x)&\la&\textit{Film}(x,y,z),\lnot\textit{Fellini-actor}(z)\\
\textit{Answer}(x)&\la&\textit{Film}(x,y,z),\lnot\textit{Not-the-answer}(x)
\koz .
\end{array}
\end{equation}
One has to be careful in writing nr-datalog$^{\lnot}$ programs. If the
first two rules of program (\ref{eq:jancsojo}) were to be merged like
in Example~\ref{peld:konjpr}
\begin{equation}\label{eq:jancsorossz}
\begin{array}{rcl}
\textit{Bad-not-ans}(x)&\la&\textit{Film}(x,y,z),\lnot\textit{Film}(x',\textrm{``F.~Fellini''},z),\textit{Film}(x',\textrm{``F.~Fellini''},z')\\
\textit{Answer}(x)&\la&\textit{Film}(x,y,z),\lnot\textit{Bad-not-ans}(x),
\end{array}
\end{equation}
then (\ref{eq:jancsorossz}) answers the following query (assuming that
all films have a unique director)
\begin{description}
\item{\textbf{\thechapter.\thequery.}}\stepcounter{query} List all
those films whose every actor played in \ki{each} film of Fellini,
\end{description}
instead of query \textbf{\thechapter.\thecnegalotharom.}
\end{pelda}
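The divergence of programs (\ref{eq:jancsojo}) and (\ref{eq:jancsorossz}) can be checked mechanically on a tiny invented instance with two Fellini films; the data below is made up for illustration only.

```python
# Film(Title, Director, Actor) -- invented toy data, two Fellini films.
film = {("La Strada", "F. Fellini", "G. Masina"),
        ("8 1/2", "F. Fellini", "M. Mastroianni"),
        ("Duo", "X. Other", "G. Masina")}

titles = {t for (t, d, a) in film}
fellini_films = {t for (t, d, a) in film if d == "F. Fellini"}
fellini_actors = {a for (t, d, a) in film if d == "F. Fellini"}

# Program (eq:jancsojo): drop films having a non-Fellini actor.
not_the_answer = {t for (t, d, a) in film if a not in fellini_actors}
answer = titles - not_the_answer

# Merged program (eq:jancsorossz): a film is dropped as soon as one of
# its actors missed *some* Fellini film.
bad_not_ans = {t for (t, d, a) in film
               for f in fellini_films
               if (f, "F. Fellini", a) not in film}
merged_answer = titles - bad_not_ans

print(sorted(answer))         # ['8 1/2', 'Duo', 'La Strada']
print(sorted(merged_answer))  # []
```

Here every actor of every film is a Fellini actor, so the correct program returns all three titles; but since no actor played in both Fellini films, the merged program returns the empty set.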
It is easy to see that every satisfiable nr-datalog$^{\lnot}$ program
that contains equality atoms can be replaced by one without
equalities. Furthermore, the following proposition is true.
\begin{allit}\label{all:522}
The satisfiable (full) relational algebra and the
nr-datalog$^{\lnot}$ programs with a single output relation are
equivalent query languages.
\end{allit}
\subsubsection{Recursion.}
Query \textbf{\thechapter.\thecrekurzio.} cannot be formulated using the
query languages introduced so far. Some \ki{a priori} information
would be needed about how long a \textit{chain-of-actors} could be
formed starting with a given actor/actress. Let us assume that the
maximum length of a chain-of-actors starting from ``Marcello
Mastroianni'' is 117. (It would be interesting to know the \ki{real}
value!) Then, the following non-recursive datalog program gives the
answer.
\begin{equation}\label{eq:szinlanc}
\begin{array}{rcl}
\textit{Film-partner}(z_1,z_2)&\la&\textit{Film}(x,y,z_1),\textit{Film}(x,y,z_2),z_1\ne z_2 \koz .
\end{array}
\end{equation}
The natural join of two relations can be computed by comparing all
pairs of their tuples.
\begin{alg}{Full-Tuplewise-Join($R_1,R_2$)}
1 \' $S\la\emptyset$\\
2 \' \key{for} \= all $u\in R_1$\\
3 \' \> \key{do} \key{for} \= all $v\in R_2$\\
4 \' \> \> \key{do} \key{if} \= $u$ and $v$ can be joined\\
5 \' \> \> \> \key{then} $S\la S\cup\{u\Join v\}$\\
6 \' \key{return} $S$
\end{alg}
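The nested-loop computation of \textsc{Full-Tuplewise-Join} can be sketched in executable form; attribute-name lists stand in for the schemata, and the sample relations are illustrative.

```python
def full_tuplewise_join(r1, schema1, r2, schema2):
    """Nested-loop natural join: O(|r1| * |r2|) pairs are examined."""
    common = [a for a in schema1 if a in schema2]
    out_schema = schema1 + [a for a in schema2 if a not in schema1]
    s = set()
    for u in r1:
        for v in r2:
            du, dv = dict(zip(schema1, u)), dict(zip(schema2, v))
            if all(du[a] == dv[a] for a in common):   # u and v joinable?
                s.add(tuple(du[a] if a in du else dv[a]
                            for a in out_schema))
    return s

r1 = {(1, "a"), (2, "b")}
r2 = {("a", "x"), ("a", "y")}
print(sorted(full_tuplewise_join(r1, ["A", "B"], r2, ["B", "C"])))
# [(1, 'a', 'x'), (1, 'a', 'y')]
```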
It is clear that the running time of \textsc{Full-Tuplewise-Join} is
$O(|R_1|\times |R_2|).$ Thus, the order in which a query is executed is
important, since during the computation natural joins of relations of
various sizes must be calculated. In case of tableau queries the
\ki{Homomorphism Theorem}\index{homomorphism theorem} gives a
possibility of a query rewriting that uses fewer joins than the
original.
Let $q_1,q_2$ be queries over schema $\mathbf{R}$. $q_2$ \ki{contains}
$q_1$, in notation $q_1\sqsubseteq q_2,$ if for all instances
$\mathcal{I}$ over schema $\mathbf{R}$ $q_1(\mathcal{I})\subseteq
q_2(\mathcal{I})$ holds. $q_1\equiv q_2$ according to
Definition~\ref{queryekviv} iff $q_1\sqsubseteq q_2$ and
$q_1\sqsupseteq q_2$. A generalisation of valuations will be
needed. \ki{Substitution}\inddef{substitution} is a mapping from the
set of variables to the union of sets of variables and sets of
constants that is extended to constants as identity. Extension of
substitution to free tuples and tableaux can be defined naturally.
\begin{defi}\label{def:homom}
Let $q=(\mathbf{T},u)$ and $q'=(\mathbf{T'},u')$ be two tableau
queries over schema $\mathbf{R}$. Substitution $\theta$ is a
\ki{homomorphism}\inddef{query!homomorphism} from $q'$ to $q$, if
$\theta(\mathbf{T'})=\mathbf{T}$ and $\theta(u')=u$.
\end{defi}
\begin{tetel}[Homomorphism Theorem]\label{tet:homom}
Let $q=(\mathbf{T},u)$ and $q'=(\mathbf{T'},u')$ be two tableau
queries over schema $\mathbf{R}$. $q\sqsubseteq q'$ if and only if
there exists a homomorphism from $q'$ to $q$.
\end{tetel}
\begin{biz}
Assume first, that $\theta$ is a homomorphism from $q'$ to $q$, and
let $\mathcal{I}$ be an instance over schema $\mathbf{R}$. Let $w\in
q(\mathcal{I})$. This holds exactly if there exists a valuation $\nu$
that maps tableau $\mathbf{T}$ into $\mathcal{I}$ and $\nu(u)=w$. It
is easy to see that $\theta\circ\nu$ maps tableau $\mathbf{T'}$ into
$\mathcal{I}$ and $\theta\circ\nu(u')=w$, that is $w\in
q'(\mathcal{I}).$ Hence, $w\in q(\mathcal{I})\Longrightarrow w\in
q'(\mathcal{I}),$ which in turn is equivalent with $q\sqsubseteq q'$.
On the other hand, let us assume that $q\sqsubseteq q'$. The idea of
the proof is that both $q$ and $q'$ are applied to the ``instance''
$\mathbf{T}$. The output of $q$ is the free tuple $u$, hence the output of
$q'$ also contains $u$, that is, there exists an embedding $\theta$ of
$\mathbf{T'}$ into $\mathbf{T}$ that maps $u'$ to $u$. To formalise
this argument the instance $\mathcal{I}_{\mathbf{T}}$ isomorphic to
$\mathbf{T}$ is constructed.
Let $V$ be the set of variables occurring in $\mathbf{T}$. For all
$x\in V$ let $a_x$ be a constant that differs from all constants
appearing in $\mathbf{T}$ or $\mathbf{T'}$, furthermore $x\ne
x'\Longrightarrow a_x\ne a_{x'}$. Let $\mu$ be the valuation that maps
$x\in V$ to $a_x$, furthermore let
$\mathcal{I}_{\mathbf{T}}=\mu(\mathbf{T})$. $\mu$ is a bijection from
$V$ to $\mu(V)$ and there are no constants of $\mathbf{T}$ appearing
in $\mu(V)$, hence $\mu^{-1}$ is well defined on the constants occurring
in $\mathcal{I}_{\mathbf{T}}$.
It is clear that $\mu(u)\in q(\mathcal{I}_{\mathbf{T}})$, thus, using
$q\sqsubseteq q'$, $\mu(u)\in q'(\mathcal{I}_{\mathbf{T}})$ is
obtained. That is, there exists a valuation $\nu$ that embeds tableau
$\mathbf{T'}$ into $\mathcal{I}_{\mathbf{T}}$, such that
$\nu(u')=\mu(u)$. It is not hard to see that $\nu\circ\mu^{-1}$ is a
homomorphism from $q'$ to $q$.
\end{biz}
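The Homomorphism Theorem reduces containment testing to a search for a suitable substitution, which can be sketched by backtracking over row assignments. The sketch below is illustrative; terms starting with \texttt{?} are variables, everything else is a constant.

```python
def find_homomorphism(t_src, u_src, t_dst, u_dst):
    """Search theta with theta(u_src) = u_dst and theta mapping every
    row of t_src to some row of t_dst; theta fixes constants.
    Success means (t_dst, u_dst) is contained in (t_src, u_src)."""
    def unify(terms, row, theta):
        theta = dict(theta)
        for a, b in zip(terms, row):
            if a.startswith("?"):
                if theta.setdefault(a, b) != b:
                    return None           # conflicting binding
            elif a != b:                  # constants map to themselves
                return None
        return theta

    def search(rows, theta):
        if not rows:
            return theta
        for row in t_dst:                 # try each target row
            t2 = unify(rows[0], row, theta)
            if t2 is not None:
                t3 = search(rows[1:], t2)
                if t3 is not None:
                    return t3
        return None

    theta0 = unify(u_src, u_dst, {})
    return None if theta0 is None else search(list(t_src), theta0)

t1 = [("?x", "?y")]                       # q1 = (t1, ("?x",))
t2 = [("?x", "?y"), ("?y", "?z")]         # q2 = (t2, ("?x",))
print(find_homomorphism(t1, ("?x",), t2, ("?x",)) is not None)  # True
print(find_homomorphism(t2, ("?x",), t1, ("?x",)) is not None)  # False
```

Here a homomorphism exists from $q_1$ to $q_2$ but not conversely, so by the theorem $q_2\sqsubseteq q_1$ holds while $q_1\sqsubseteq q_2$ fails.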
\subsubsection{Query optimisation by tableau minimisation.}
According to Theorem~\ref{ekvivalenciatetel} tableau queries and
satisfiable relational algebra (without subtraction) are
equivalent. The proof shows that the relational algebra expression
equivalent with a tableau query is of the form
$\pi_{\overrightarrow{\j}}(\sigma_F(R_1\Join R_2\Join\cdots\Join
R_k)),$ where $k$ is the number of rows of the tableau. It implies that
if the number of joins is to be minimised, then the number of rows of
an equivalent tableau must be minimised.
The tableau query $(\mathbf{T},u)$ is
\ki{minimal,}\inddef{query!tableau!minimal} if there exists no tableau
query $(\mathbf{S},v)$ that is equivalent with $(\mathbf{T},u)$ and
$|\mathbf{S}|<|\mathbf{T}|$, that is, $\mathbf{S}$ has fewer rows. It may be
surprising, but true, that a minimal tableau query equivalent
with $(\mathbf{T},u)$ can be obtained by simply dropping some rows
from $\mathbf{T}$.
\begin{tetel}\label{tet:minimaltabla}
Let $q=(\mathbf{T},u)$ be a tableau query. There exists a subset
$\mathbf{T'}$ of $\mathbf{T}$, such that query $q'=(\mathbf{T'},u)$ is
minimal and equivalent with $q=(\mathbf{T},u)$.
\end{tetel}
\begin{biz}
Let $(\mathbf{S},v)$ be a minimal query equivalent with $q$. According
to the Homomorphism Theorem there exist homomorphisms $\theta$ from $q$
to $(\mathbf{S},v)$, and $\lambda$ from $(\mathbf{S},v)$ to $q$. Let $
\mathbf{T'}=\theta\circ\lambda(\mathbf{T})$. It is easy to check that
$(\mathbf{T'},u)\equiv q$ and $|\mathbf{T'}|\le |\mathbf{S}|$. But
$(\mathbf{S},v)$ is minimal, hence $(\mathbf{T'},u)$ is minimal, as well.
\end{biz}
\begin{pelda}{\textit{ Application of tableau minimisation}} Consider
the relational algebra expression
\begin{equation}\label{eq:rel-to-opt}
q=\pi_{AB}\left(\sigma_{B=5}(R)\right)\Join\pi_{BC}\left(\pi_{AB}(R)\Join\pi_{AC}\left(\sigma_{B=5}(R)\right)\right)
\end{equation}
over the schema $\mathbf{R}$ of attribute set $\{A,B,C\}$. The tableau
query corresponding to $q$ is the following tableau $\mathbf{T}$:
\begin{equation}\label{eq:megf-tabl}
\begin{array}{r|lll}
R&A&B&C\\
\cline{2-4}
&x&5&z_1\\
&x_1&5&z_2\\
&x_1&5&z\\
\cline{2-4}
u&x&5&z
\end{array}
\end{equation}
A homomorphism is sought that maps some rows of $\mathbf{T}$ to
other rows of $\mathbf{T}$, thus ``folding'' the
tableau. The first row cannot be eliminated, since a homomorphism is
the identity on the free tuple $u$, thus $x$ must be mapped to itself. The
situation is similar with the third row, as the image of $z$ is itself
under any homomorphism. However, the second row can be eliminated by
mapping $z_2$ to $z$ and fixing all other variables: this takes the
second row into the third one. Thus, the minimal
tableau equivalent with $\mathbf{T}$ consists of the first and third
rows of $\mathbf{T}$. Transforming back to a relational algebra
expression,
\begin{equation}\label{eq:eredm-lekerd}
\pi_{AB}(\sigma_{B=5}(R))\Join\pi_{BC}(\sigma_{B=5}(R))
\end{equation}
is obtained. Query (\ref{eq:eredm-lekerd}) contains one less join
operator than query (\ref{eq:rel-to-opt}).
\end{pelda}
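The fold can be verified mechanically: the substitution sending $z_2$ to $z$, and fixing the other variables, maps the second row onto the third and fixes the summary row, so the image of the tableau is its first and third rows. A small check (variable names as in the example):

```python
# Rows of tableau T from the example; "5" is the constant.
T = [("x", "5", "z1"), ("x1", "5", "z2"), ("x1", "5", "z")]
u = ("x", "5", "z")

theta = {"z2": "z"}                              # the folding substitution
subst = lambda row: tuple(theta.get(t, t) for t in row)

assert subst(u) == u                             # summary row stays fixed
assert {subst(r) for r in T} == {T[0], T[2]}     # image: rows 1 and 3
print("fold verified")
```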
The next theorem states that the question of tableau containment and
equivalence is NP-complete, hence tableau minimisation is an NP-hard
problem.
\begin{tetel}\label{tet:tabla-np-telj}
For given tableau queries $q$ and $q'$ the following decision
problems are NP-complete:
\begin{description}
\item{\thechapter.\thequery.}\newcounter{countsubtabl}\setcounter{countsubtabl}{\value{query}}\stepcounter{query}
$q\sqsubseteq q'$?
\item{\thechapter.\thequery.}\newcounter{countekvtabl}\setcounter{countekvtabl}{\value{query}}\stepcounter{query} $q\equiv q'$?
\item{\thechapter.\thequery.}\newcounter{countreszekvtabl}\setcounter{countreszekvtabl}{\value{query}}\stepcounter{query}
Assume that $q'$ is obtained from $q$ by removing some free
tuples. Is it true then that $q\equiv q'$?
\end{description}
\end{tetel}
\begin{biz} The \textsc{Exact-Cover}\inddef{\textsc{Exact-cover}}
problem will be reduced to the various tableau problems. The
input of \textsc{Exact-Cover} problem is a finite set
$X=\{x_1,x_2,\ldots ,x_n\}$, and a collection of its subsets
$\mathcal{S}=\{S_1,S_2,\ldots ,S_m\}$. It has to be determined
whether there exists $\mathcal{S}'\subseteq\mathcal{S}$, such
that subsets in $\mathcal{S}'$ cover $X$ exactly (that is, for all $x\in
X$ there exists exactly one $S\in\mathcal{S}'$ such that $x\in
S$). \textsc{Exact-Cover} is known to be an NP-complete problem.
Let $\mathcal{E}=(X,\mathcal{S})$ be an input of
\textsc{Exact-Cover}. A construction is sketched that produces a pair
$q_{\mathcal{E}},q_{\mathcal{E}}'$ of tableau queries to
$\mathcal{E}$ in polynomial time. This construction can be used to
prove the various NP-completeness results.
Let the schema $\mathbf{R}$ consist of the pairwise distinct
attributes $A_1,A_2,\ldots ,A_n,B_1,B_2,\ldots ,B_m$.
$q_{\mathcal{E}}=(\mathbf{T}_\mathcal{E},t)$ and
$q_{\mathcal{E}}'=(\mathbf{T}_\mathcal{E}',t)$ are tableau queries
over schema $\mathbf{R}$ such that the summary row of both of them is
free tuple $t=\langle A_1:a_1,A_2:a_2,\ldots ,A_n:a_n\rangle$, where
$a_1,a_2,\ldots ,a_n$ are pairwise distinct variables.
Let $b_1,b_2,\ldots ,b_m,c_1,c_2,\ldots ,c_m$ be another set of
pairwise distinct variables. Tableau $ \mathbf{T}_\mathcal{E}$
consists of $n$ rows, one corresponding to each element of $X$. In the
row of $x_i$, $a_i$ stands in the column of attribute $A_i$, while $b_j$ stands in
the column of attribute $B_j$ for all such $j$ that $x_i\in S_j$
holds. In the other positions of tableau $ \mathbf{T}_\mathcal{E}$
pairwise distinct new variables stand.
Similarly, $ \mathbf{T}_\mathcal{E}'$ consists of $m$ rows, one
corresponding to each element of $\mathcal{S}$. In the row of $S_j$,
$a_i$ stands in the column of attribute $A_i$ for all such $i$ that $x_i\in
S_j$, furthermore $c_{j'}$ stands in the column of attribute $B_{j'}$
for all $j'\ne j$. In the other positions of tableau
$\mathbf{T}_\mathcal{E}'$ pairwise distinct new variables stand.
The NP-completeness of problem \thechapter.\thecountsubtabl. follows
from the fact that $X$ has an exact cover using sets of $\mathcal{S}$ if and
only if $ q_{\mathcal{E}}'\sqsubseteq q_{\mathcal{E}}$ holds. The
proof, and the proof of the NP-completeness of problems
\thechapter.\thecountekvtabl. and
\thechapter.\thecountreszekvtabl. are left to the Reader (Exercise~\ref{gy:fedes}).
\end{biz}
\begin{gyak}
\ujgyak\label{konjmon}\gyakindex{query!rule
based}\gyakindex{query!monotone}\gyakindex{query!satisfiable}
Prove Proposition~\ref{422}, that is every rule based query $q$ is
monotone and satisfiable. \textit{Hint.} For the proof of
satisfiability let $K$ be the set of constants occurring in query
$q$, and let $a\not\in K$ be another constant. For every relation
schema $R_i$ in
rule (\ref{szabaly}) construct all tuples $(a_1,a_2,\ldots ,a_r),$
where $a_i\in K\cup\{a\}$, and $r$ is the arity of $R_i$. Let
$\mathcal I$ be the instance obtained this way. Prove that
$q(\mathcal{I})$ is nonempty.
\ujgyak\label{emptyquery}\gyakindex{query!empty}\gyakindex{query!relational
algebra}
Give a relational schema $\mathbf{R}$ and a relational algebra query
$q$ over schema $\mathbf{R}$ whose result is empty for every
instance over $\mathbf{R}$.
\ujgyak\label{szab=tab}\gyakindex{query!rule
based}\gyakindex{query!tableau}
Prove that the languages of rule based queries and tableau queries are
equivalent.
\ujgyak\label{45c}
Prove that every rule based query $q$ with equality atoms is either
equivalent with the empty query $q^{\emptyset}$, or there exists a
rule based query $q'$ without equality atoms such that $q\equiv
q'$. Give a polynomial time algorithm that determines for a rule
based query $q$ with equality atoms whether $q\equiv q^{\emptyset}$
holds, and if not, then constructs a rule based query $q'$ without
equality atoms, such that $q\equiv q'$.
\ujgyak\label{gy:nrekviv}
Prove Theorem~\ref{tet:nrekviv} by generalising the proof idea of
Theorem~\ref{ekvivalenciatetel}.
\ujgyak\label{gy:fixpont}
Let $P$ be a datalog program, $\mathcal{I}$ be an instance over
$\var{edb}(P)$, and let $C(P,\mathcal{I})$ be the (finite) set of
constants occurring in $\mathcal{I}$ and $P$. Let
$\mathbf{B}(P,\mathcal{I})$ be the following instance over $\var{sch}(P)$:
\begin{enumerate}
\item For every relation $R$ of $\var{edb}(P)$ the fact $R(u)$ is in
$\mathbf{B}(P,\mathcal{I})$ iff it is in $\mathcal{I}$, furthermore
\item for every relation $R$ of $\var{idb}(P)$ every fact $R(u)$ constructed
from constants of $C(P,\mathcal{I})$ is in
$\mathbf{B}(P,\mathcal{I})$.
\end{enumerate}
Prove that
\begin{equation}\label{eq:gy:fixpont}
\mathcal{I}\subseteq T_P(\mathcal{I})\subseteq
T_P^2(\mathcal{I})\subseteq T_P^3(\mathcal{I})\subseteq\cdots\subseteq
\mathbf{B}(P,\mathcal{I}).
\end{equation}
\ujgyak\label{gy:snaiv}
Give an example of an input, that is, a datalog program $P$ and an
instance $\mathcal{I}$ over $\var{edb}(P)$, such that the same tuple
is produced by distinct runs of the loop of
\textsc{Semi-naiv-datalog}\gyakindex{semi-naiv-datalog@\textsc{Semi-naiv-datalog}}.
\ujgyak\label{gy:jfsnaiv}
Prove that procedure
\textsc{Improved-Semi-Naiv-Datalog}\gyakindex{\textsc{Improved-Semi-Naiv-Datalog}}
stops in finite time for all inputs, and gives a correct result. Give an
example of an input on which \textsc{Improved-Semi-Naiv-Datalog}
calculates fewer rows multiple times than
\textsc{Semi-naiv-datalog}.
\ujgyak\label{gy:fedes}
\begin{enumerate}
\item Prove that for tableau queries
$q_{\mathcal{E}}=(\mathbf{T}_\mathcal{E},t)$ and
$q_{\mathcal{E}}'=(\mathbf{T}_\mathcal{E}',t)$ of the proof of
Theorem~\ref{tet:tabla-np-telj} there exists a homomorphism from
$(\mathbf{T}_\mathcal{E},t)$ to $(\mathbf{T}_\mathcal{E}',t)$ if and
only if the \textsc{Exact-Cover} problem
$\mathcal{E}=(X,\mathcal{S})$ has a solution.
\item Prove that the decision problems
\thechapter.\thecountekvtabl. and
\thechapter.\thecountreszekvtabl. are NP-complete.
\end{enumerate}
\end{gyak}
\section{Views\label{sec:Nezetek}}
A database system architecture has three main levels:
\begin{itemize}
\item physical layer;
\item logical layer;
\item outer (user) layer.
\end{itemize}
The goal of the separation of levels is to achieve data independence
and the convenience of users. The three views on Figure~\ref{abr:szintek}
show possible user interfaces: multirelational, universal relation and
graphical interface.
\epskepnagy{figs/szintek.eps}{The three levels of database architecture.\label{abr:szintek}}{}
The \ki{physical layer}\inddef{database architecture!layer!physical}
consists of the actually stored data files and the dense and sparse
indices built over them.
The separation of the \ki{logical layer}\inddef{database
architecture!layer!logical} from the physical layer makes it
possible for the user to concentrate on the logical dependencies of
the data, which approximates the image of the reality to be
modelled better. The logical layer consists of the database schema
description together with the various integrity
constraints\index{integrity constraint}, dependencies. This is the
layer where the database administrators work with the system. The
connection between the physical layer and the logical layer is
maintained by the database engine.
The goal of the separation of the logical layer and the \ki{outer
layer}\inddef{database architecture!layer!outer} is that the end users can
see the database according to their (narrow) needs and
requirements. For example, a very simple view of the outer layer of
a bank database could be the automatic teller machine, or a much
more complex view could be the credit history of a client for loan
approval.
\subsection{View as a result of a query}
The question is how the views of the different layers can be given.
If a query given by a relational algebra expression is considered as a
formula that will be applied to relational instances, then the
\ki{view}\inddef{view} is obtained. Datalog rules show the difference
between views and relations well. The relations defined by rules are
called \ki{intensional}\index{relation!intensional}, because these
relations need not exist on external storage devices,
that is, exist extensionally, in contrast to the \ki{extensional}
relations.\index{relation!extensional}
\begin{defi}\label{def:nezet}
An expression $\mathcal{V}$ given in some query language
$\mathcal{Q}$ over schema $\mathbf{R}$ is called a \ki{view}.
\end{defi}
Similarly to intensional relations, views can be used in the definition
of queries or other views, as well.
\begin{pelda}{\textit{ SQL view.}}\label{pel:sql-nezet}
Views in the database manipulation language SQL can be given in the
following way.\index{SQL} Suppose that the only interesting data for
us from schema \textbf{CinePest} is where and when Kurosawa's films
are shown. The view \textbf{KurosawaTimes} is given by the SQL command
\begin{alg}{KurosawaTimes}
1 \' \key{create} \= \key{view} KurosawaTimes \key{as}\\
2 \' \> \key{select} Theater, Time\\
3 \' \> \key{from} Film, Show\\
4 \' \> \key{where} Film.Title=Show.Title \key{and}
Film.Director="Akira Kurosawa"
\end{alg}
Written in relational algebra it is as follows.
\begin{equation}\label{eq:sqlnezet}
\textit{KurosawaTimes(Theater, Time)}=\pi_{\textit{Theater},
\textit{Time}}(\textit{Show}\Join\sigma_{\textit{Director}=\textrm{"Akira Kurosawa"}}(\textit{Film}))
\end{equation}
Finally, the same as a datalog rule is:
\begin{equation}\label{eq:sqldlog}
\textit{KurosawaTimes}(x_{Th}, x_{Ti})\la
\textit{Show}(x_{Th},x_T,x_{Ti}),\textit{Film}(x_T,\textrm{"Akira Kurosawa"},x_{A}) \koz .
\end{equation}
Line 2 of \textsc{KurosawaTimes} marks the projection used,
line 3 marks which two relations are to be joined, and finally the
condition of line 4 shows that it is a natural join, not a direct product.
\end{pelda}
Having defined view $\mathcal{V}$, it can be used in further
queries or view definitions like any other (extensional) relation.
\subsubsection{Advantages of using views}
\begin{itemize}
\item Automatic data hiding: data that is not part of the view
  is not shown to the user, thus the user cannot read or modify
  it without having proper access rights. So by providing
  access to the database through views, a simple but effective
  security mechanism is created.
\item Views provide simple ``macro capabilities''. Using the view
\textit{KurosawaTimes} defined in Example~\ref{pel:sql-nezet} it is
easy to find those theatres where Kurosawa films are shown in the
morning:
\begin{equation}\label{eq:kuro-delelott}
\textit{KurosawaMorning}(Theater)\la \textit{KurosawaTimes}(Theater,
x_{Ti}),x_{Ti}<12 \koz .
\end{equation}
  Of course, the user could include the definition of
  \textit{KurosawaTimes} in the code directly, however, convenience
  is the main consideration here, in close similarity with macros.
\item Views make it possible that the same data could be seen in
different ways by different users at the same time.
\item Views provide \ki{logical data independence.}\inddef{data
independence!logical} The essence of logical data independence is
that users and their programs are protected from the structural
changes of the database schema. It can be achieved by defining the
relations of the schema before the structural change as views in the
new schema.
\item Views make controlled data input possible. The \key{with check
option} clause of command \key{create view} is to do this in SQL.
\end{itemize}
\subsubsection{Materialised view.}
A view could be used in several different queries. In such cases it is
useful if the tuples of the relation(s) defined by the view need not be
calculated again and again; instead, the output of the
query defining the view is stored and simply read in at further uses. Such a
stored output is called a \ki{materialised
view.}\inddef{view!materialised}
\begin{gyak}
\ujgyak\label{gy:581} Consider the following schema:
\begin{center}
\begin{tabular}{l}
\textit{FilmStar}(\textit{Name},\textit{Address},\textit{Gender},\textit{BirthDate})\\
\textit{FilmMogul}(\textit{Name},\textit{Address},\textit{Certificate\#},\textit{Assets})\\
\textit{Studio}(\textit{Name},\textit{Address},\textit{PresidentCert\#}) \koz .
\end{tabular}\end{center}
Relation \textit{FilmMogul} contains data of the big people in film
business (studio presidents, producers, etc.). The attribute names
speak for themselves, \textit{Certificate\#} is the number of the
certificate of the filmmogul, \textit{PresidentCert\#} is the
certificate number of the president of the studio.
Give the definitions of the following views using datalog rules,
relational algebra expressions, furthermore SQL:
\begin{enumerate}
\item \textit{RichMogul}: Lists the names, addresses, certificate
  numbers and assets of those filmmoguls whose asset value is over 1
  million dollars.
\item \textit{StudioPresident}: Lists the names, addresses and
certificate numbers of those filmmoguls, who are studio presidents,
as well.
\item \textit{MogulStar}: Lists the names, addresses, certificate
  numbers and assets of those people who are filmstars and filmmoguls
  at the same time.
\end{enumerate}
\ujgyak\label{gy:ex434}
Give the definitions of the following views over schema
\textbf{CinePest} using datalog rules,
relational algebra expressions, furthermore SQL:
\begin{enumerate}
\item \textit{Marilyn}(\textit{Title}): Lists the titles of Marilyn
Monroe's films.
\item
  \textit{CorvinInfo}(\textit{Title},\textit{Time},\textit{Phone}):
  Lists the titles and show times of films shown in theatre
  \textit{Corvin}, together with the phone number of the theatre.
\end{enumerate}
\end{gyak}
\section{Query rewriting\label{sec:Atiras}}
Answering queries using views, in other words query rewriting using
views, has recently become a problem attracting much attention and
interest. The reason is its applicability in a wide range of data
manipulation problems: query optimisation, providing physical data
independence, data and information integration, furthermore planning
data warehouses.
The problem is the following. Assume that query $Q$ is given over
schema $\mathbf{R}$, together with views $V_1,V_2,\ldots ,V_n$. Can
one answer $Q$ using only the results of views $V_1,V_2,\ldots ,V_n$?
Or, which is the largest set of tuples that can be determined knowing
the views? If the views and the relations of the base schema can be
accessed both, what is the cheapest way to compute the result of $Q$?
\subsection{Motivation}
Before query rewriting algorithms are studied in detail, some
motivating examples of applications are given. The following
university database is used throughout this section.
\begin{equation}\label{eq:egyetem-sema}
\textbf{University}=\{\textit{Professor,Course,Teaches,Registered,Major,Affiliation,Advisor}\}
\koz .
\end{equation}
The schemata of the individual relations are as follows:
\begin{equation}\label{eq:egyetem-relaciok}
\begin{array}{rcl}
\textit{Professor}&=&\{\textit{PName,Area}\}\\
\textit{Course}&=&\{\textit{C-Number,Title}\}\\
\textit{Teaches}&=&\{\textit{PName,C-Number,Semester,Evaluation}\}\\
\textit{Registered}&=&\{\textit{Student,C-Number,Semester}\}\\
\textit{Major}&=&\{\textit{Student,Department}\}\\
\textit{Affiliation}&=&\{\textit{PName,Department}\}\\
\textit{Advisor}&=&\{\textit{PName,Student}\} \koz .
\end{array}
\end{equation}
It is supposed that professors, students and departments are uniquely
identified by their names. Tuples of relation \textit{Registered}
show which student took which course in which semester, while
\textit{Major} shows which department a student has chosen as major
(for the sake of convenience it is assumed that each department offers
one subject as possible major).
\subsubsection{Query optimisation.}
If the computation necessary to answer a query was performed
in advance and the result is stored in some materialised view, then
the view can be used to speed up answering the query.
Consider the query that looks for pairs (\textit{Student,Title}), where the
student registered for a Ph.D.-level course taught by a professor of the
\textit{Database} area (the C-number of graduate courses is
at least 400, and the Ph.D.-level courses are those with C-number at least
500).
\begin{equation}\label{eq:opti-pelda}
\begin{array}{rcl}
val(x_S,x_T)&\la& \textit{Teaches}(x_P,x_C,x_{Se},y_1),
\textit{Professor}(x_P,\textrm{``database''}),\\
& & \textit{Registered}(x_S,x_C,x_{Se}),
\textit{Course}(x_C,x_T),x_C\ge 500 \koz .
\end{array}
\end{equation}
Suppose that the following materialised view is available that contains the registration data of graduate courses.
\begin{equation}\label{eq:valaszthato}
\textit{Graduate}(x_S,x_T,x_C,x_{Se})\la
\textit{Registered}(x_S,x_C,x_{Se}),\textit{Course}(x_C,x_T),x_C\ge 400 \koz .
\end{equation}
View \textit{Graduate} can be used to answer query (\ref{eq:opti-pelda}).
\begin{equation}\label{eq:mater-hasznal}
\begin{array}{rcl}
val(x_S,x_T)&\la&
\textit{Teaches}(x_P,x_C,x_{Se},y_1),\textit{Professor}(x_P,\textrm{``database''}),\\
& &\textit{Graduate}(x_S,x_T,x_C,x_{Se}),x_C\ge 500 \koz .
\end{array}
\end{equation}
It will be faster to compute (\ref{eq:mater-hasznal}) than to compute
(\ref{eq:opti-pelda}), because the natural join of relations
\textit{Registered} and \textit{Course} has already been performed by view
\textit{Graduate}, which furthermore filtered out the undergraduate courses
(these make up the bulk of registration data at most universities). It is
worth noting that view \textit{Graduate} could be used even though it does
not agree \ki{syntactically} with any part of query (\ref{eq:opti-pelda}).
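The effect of the rewriting can be made concrete with a small sketch. The Python fragment below uses assumed toy relation contents (not from the text) to evaluate query (\ref{eq:opti-pelda}) directly and then through the materialised view \textit{Graduate} of (\ref{eq:valaszthato}); both plans return the same pairs.

```python
# Toy instances for the University schema (assumed sample data).
teaches = {("smith", 520, "F01", "good"), ("jones", 410, "F01", "ok"),
           ("smith", 300, "F01", "ok")}
professor = {("smith", "database"), ("jones", "graphics")}
registered = {("ann", 520, "F01"), ("bob", 410, "F01"), ("eve", 300, "F01")}
course = {(520, "XML"), (410, "AI"), (300, "Intro")}

# Direct evaluation of query (eq:opti-pelda).
direct = {(s, t)
          for (p, c, se, _e) in teaches
          for (p2, area) in professor if p2 == p and area == "database"
          for (s, c2, se2) in registered if c2 == c and se2 == se
          for (c3, t) in course if c3 == c and c >= 500}

# Materialised view Graduate (eq:valaszthato): registrations in courses >= 400.
graduate = {(s, t, c, se)
            for (s, c, se) in registered
            for (c2, t) in course if c2 == c and c >= 400}

# Rewriting (eq:mater-hasznal): join Teaches and Professor with the view.
via_view = {(s, t)
            for (p, c, se, _e) in teaches
            for (p2, area) in professor if p2 == p and area == "database"
            for (s, t, c2, se2) in graduate
            if c2 == c and se2 == se and c >= 500}
```

The join of \textit{Registered} and \textit{Course} is performed only once, when the view is materialised.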
On the other hand, it may happen that the original query can be answered
faster. If relations \textit{Registered} and \textit{Course} have an index on
attribute \textit{C-Number}, but no index is built for
\textit{Graduate}, then it could be faster to answer query
(\ref{eq:opti-pelda}) directly from the database relations. Thus, the real
challenge is not only to decide whether a materialised view could logically be
used to answer some query, but a thorough cost analysis must also be performed
to decide when it is worth using the existing views.
\subsubsection{Physical data independence.}
One of the principles underlying modern database systems is the separation
between the logical view of data and the physical view of data. With the
exception of horizontal or vertical partitioning of relations into multiple
files, relational database systems are still largely based on a one-to-one
correspondence between relations in the schema and the files in which they are
stored. In object-oriented systems, maintaining the separation is necessary
because the logical schema contains significant redundancy, and does not
correspond to a good physical layout. Maintaining physical data independence
becomes more crucial in applications where the logical model is introduced as
an intermediate level after the physical representation has already been
determined. This is common in storage of XML data in relational databases, and
in data integration. \index{XML} In fact, the STORED system stores XML
documents in a relational database, and uses views to describe the mapping
from XML into relations in the database.
To maintain physical data independence, a widely accepted method is to use
views over the logical schema as mechanism for describing the physical
storage of data. In particular, Tsatalos, Solomon and
Ioannidis\nevindex{Tsatalos, Odysseas G.}\nevindex{Solomon, Marvin
H.}\nevindex{Ioannidis, Yannis E.} use GMAPs (\textit{Generalised
Multi-level Access Paths})\inddef{gmap@GMAP}\inddef{generalised
multi-level@Generalised Multi-level Access Path} to describe data storage.
A GMAP describes the physical organisation and the indexes of the storage
structure. The first clause of the GMAP (\key{as}) describes the actual data
structure used to store a set of tuples (e.g., a B\textsuperscript{+}-tree,
hash index, etc.) The remaining clauses describe the content of the structure,
much like a view definition. The \key{given} and \key{select} clauses
describe the available attributes, where the \key{given} clause describes the
attributes on which the structure is indexed. The definition of the view,
given in the \key{where} clause uses infix notation over the conceptual
model.
In the example shown in Figure~\ref{abr:gmap}, the GMAP G1 stores a set of
pairs containing students and the departments in which they major, and these
pairs are indexed by a B\textsuperscript{+}-tree on attribute
\textit{Student.name}. The GMAP G2 stores an index from names of students to
the numbers of courses in which they are registered. The GMAP G3 stores an
index from course numbers to departments whose majors are enrolled in the
course.
\begin{figure}
\begin{tabbing}
\' \key{def.gmap} \= G1 \key{as} b$^+$-tree \key{by}\\
\' \> \key{given} \ \ \= \textit{Student.name}\\
\' \> \key{select} \>\textit{Department}\\
\' \> \key{where} \>\textit{Student} major \textit{Department}\\
\' \\
\' \key{def.gmap} \= G2 \key{as} b$^+$-tree \key{by}\\
\' \> \key{given} \ \ \= \textit{Student.name}\\
\' \> \key{select} \>\textit{Course.c-number}\\
\' \> \key{where} \>\textit{Student} registered \textit{Course}\\
\' \\
\' \key{def.gmap} \= G3 \key{as} b$^+$-tree \key{by}\\
\' \> \key{given} \ \ \= \textit{Course.c-number}\\
\' \> \key{select} \>\textit{Department}\\
\' \> \key{where} \>\textit{Student} registered \textit{Course} \key{and}
\textit{Student} major \textit{Department}
\end{tabbing}
\caption{GMAPs for the university domain.}\label{abr:gmap}
\end{figure}
Given that the data is stored in structures described by the GMAPs, the
question that arises is how to use these structures to answer queries. Since
the logical content of the GMAPs are described by views, answering a query
amounts to finding a way of rewriting the query using these views. If there are
multiple ways of answering the query using the views, we would like to find
the cheapest one. Note that in contrast to the query optimisation context, we
\ki{must} use the views to answer the given query, because all the data is
stored in the GMAPs.
Consider the following query, which asks for names of students registered for
Ph.D.-level courses and the departments in which these students are majoring.
\setlength{\arraycolsep}{0pt}
\begin{equation}\label{eq:gmappelda}
\begin{array}{rcl}
ans(\textit{Student,Department})&\la&
\textit{Registered}(\textit{Student,C-number,y}),\\
& &\textit{Major}(\textit{Student,Department}),\\
& &\textit{C-number}\ge 500 \koz .
\end{array}
\end{equation}
\setlength{\arraycolsep}{\acs}
The query can be answered in two ways. First, since \textit{Student.name}
uniquely identifies a student, we can take the join of G1 and G2, then
apply the selection $\textit{Course.c-number}\ge 500$, and finally a
projection eliminates the unnecessary attributes. The other execution plan
is to join G3 with G2 and select $\textit{Course.c-number}\ge 500$. In
fact, this solution may even be more efficient, because G3 has an index on the
course number and therefore the intermediate joins may be much smaller.
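The two plans can be sketched on assumed toy GMAP contents. On this particular instance (one registration per course) the two plans coincide; the sketch is illustrative only, the data is not from the text.

```python
# Toy GMAP contents (assumed sample data).
g1 = {("ann", "cs"), ("bob", "math")}   # Student.name -> Department (major)
g2 = {("ann", 520), ("bob", 300)}       # Student.name -> Course.c-number
g3 = {(520, "cs"), (300, "math")}       # Course.c-number -> Department

# Plan 1: join G1 and G2 on the student, then select c-number >= 500.
plan1 = {(s, d) for (s, d) in g1 for (s2, c) in g2 if s2 == s and c >= 500}

# Plan 2: join G3 with G2 on the course number, same selection.
plan2 = {(s, d) for (s, c) in g2 for (c2, d) in g3 if c2 == c and c >= 500}
```

A cost-based optimiser would choose between the plans based on available indexes and intermediate result sizes.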
\subsubsection{Data integration.}
A \ki{data integration system}\inddef{data integration system} (also known as \ki{mediator system}\inddef{mediator system}) provides a \ki{uniform} query interface to a multitude of autonomous heterogeneous data sources. Prime examples of data integration applications include enterprise integration, querying multiple sources on the World-Wide Web, and integration of data from distributed scientific experiments.
To provide a uniform interface, a data integration system exposes to the user a \ki{mediated schema.}\inddef{schema!mediated} A mediated schema is a set of \ki{virtual} relations\inddef{relation!virtual}, in the sense that they are not stored anywhere. The mediated schema is designed manually for a particular data integration application. To be able to answer queries, the system must also contain a set of \ki{source descriptions.}\inddef{source description} A description of a data source specifies the contents of the source, the attributes that can be found in the source, and the (integrity) constraints on the content of the source.
A widely adopted approach for specifying source descriptions is to describe the contents of a data source as a \ki{view} over the mediated schema. This approach facilitates the addition of new data sources and the specification of constraints on the contents of sources.
In order to answer a query, a data integration system needs to translate a query formulated on the mediated schema into one that refers directly to the schemata of the data sources. Since the contents of the data sources are described as views, the translation problem amounts to finding a way to answer a query using a set of views.
Consider as an example the case where the mediated schema exposed to the user is schema
\textbf{University}, except that the relations \textit{Teaches} and \textit{Course}
have an additional attribute identifying the university at which the course is being taught:
\begin{equation}\label{eq:egyet-kieg}
\begin{array}{rcl}
\textit{Course}&=&\{\textit{C-number,Title,Univ}\}\\
\textit{Teaches}&=&\{\textit{PName,C-number,Semester,Evaluation,Univ}\}\\
\end{array}\end{equation}
Suppose we have the following two data sources. The first source provides a listing of all the courses entitled ``Database Systems'' taught anywhere and their instructors. This source can be described by the following view definition:
\setlength{\arraycolsep}{0pt}
\begin{equation}\label{eq:dbkurzusok}
\begin{array}{rcll}
DBcourse(\textit{Title,PName,C-number,Univ})&\la&
\textit{Course}(&\textit{C-number,Title,Univ}), \\
& &\textit{Teaches}(&\textit{PName,C-number,Semester,}\\
& & &Evaluation,Univ),\\
& & \textit{Title}=&\textrm{``Database Systems''} \koz .
\end{array}
\end{equation}
\setlength{\arraycolsep}{\acs}
The second source lists Ph.D.-level courses being taught at The Ohio State University (OSU), and is described by the following view definition:
\setlength{\arraycolsep}{0pt}
\begin{equation}\label{eq:bmephd}
\begin{array}{rcll}
OSUPhD(\textit{Title,PName,C-number,Univ})&\la&\textit{Course}(&\textit{C-number,Title,Univ}),\\
& &\textit{Teaches}(&\textit{PName,C-number,Semester,}\\
& & &Evaluation,Univ),\\
& &Univ=&\textrm{``OSU''},\textit{C-number}\ge 500 \koz .
\end{array}
\end{equation}
\setlength{\arraycolsep}{\acs}
If we were to ask the data integration system who teaches courses titled ``Database Systems'' at OSU, it would be able to answer the query by applying a selection on the source \textit{DBcourse}:
\begin{equation}
ans(\textit{PName})\la
DBcourse(\textit{Title,PName,C-number,Univ}),Univ=\textrm{``OSU''} \koz .
\end{equation}
On the other hand, suppose we ask for all the graduate-level courses (not just in databases) being offered at OSU. Given that only these two sources are available, the data integration system cannot find \ki{all} tuples in the answer to the query. Instead, the system can attempt to find the maximal set of tuples in the answer available from the sources. In particular, the system can obtain graduate \ki{database} courses at OSU from the \textit{DBcourse} source, and the Ph.D.-level courses at OSU from the \textit{OSUPhD} source. Hence, the following non-recursive datalog program gives the maximal set of answers that can be obtained from the two sources:
\setlength{\arraycolsep}{0pt}
\begin{equation}\label{eq:legbovebb}
\begin{array}{rcl}
ans(\textit{Title,C-number})&\la&DBcourse(\textit{Title,PName,C-number,Univ}),\\
& &Univ=\textrm{``OSU''},\textit{C-number}\ge 400\\
ans(\textit{Title,C-number})&\la&OSUPhD(\textit{Title,PName,C-number,Univ}) \koz .
\end{array}
\end{equation}
\setlength{\arraycolsep}{\acs}
Note that courses that are neither Ph.D.-level courses nor database courses will not be returned as answers. Whereas in the contexts of query optimisation and maintaining physical data independence the focus is on finding a query expression that is \ki{equivalent} with the original query, here the goal is to find a query expression that provides the \ki{maximal answers} obtainable from the views.
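Program (\ref{eq:legbovebb}) can be sketched directly as the union of the tuples contributed by its two rules. The source extensions below are assumed toy data, not from the text.

```python
# Assumed source extensions (toy data).
dbcourse = {("Database Systems", "smith", 420, "OSU"),
            ("Database Systems", "jones", 320, "OSU"),
            ("Database Systems", "kim", 600, "UW")}
osuphd = {("Logic", "lee", 520, "OSU")}

# Rule 1: graduate database courses at OSU, from the first source.
# Rule 2: Ph.D.-level OSU courses, from the second source.
# The maximal obtainable answer is the union of the two rules' outputs.
ans = ({(t, c) for (t, p, c, u) in dbcourse if u == "OSU" and c >= 400} |
       {(t, c) for (t, p, c, u) in osuphd})
```

Undergraduate database courses at OSU (here the course numbered 320) and courses at other universities are correctly excluded.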
\subsubsection{Semantic data caching.}
If the database is accessed via client-server architecture, the data cached at the client can be semantically modelled as results of certain queries, rather than at the physical level as a set of data pages or tuples. Hence, deciding which data needs to be shipped from the server in order to answer a given query requires an analysis of which parts of the query can be answered by the cached views.
\subsection{Complexity problems of query rewriting}
In this section the \ki{theoretical} complexity of query rewriting is studied. Mostly conjunctive queries are considered. \ki{Minimal} and \ki{complete} rewritings will be distinguished.\index{query!rewriting!minimal}\index{query!rewriting!complete} It will be shown that if the query is conjunctive, furthermore the materialised views are also given as results of conjunctive queries, then the rewriting problem is \textit{NP-complete}, assuming that neither the query nor the view definitions contain comparison atoms. Conjunctive queries will be considered in rule-based form for the sake of convenience.
Assume that query $Q$ is given over schema $\mathbf{R}$.
\begin{defi}\label{def:atiras}
The conjunctive query $Q'$ is a \ki{rewriting}\inddef{query!rewriting}\inddef{query!rewriting!equivalent} of query $Q$ using views $\mathcal{V}=V_1,V_2,\ldots ,V_m$, if
\begin{itemize}
\item $Q$ and $Q'$ are equivalent, and
\item $Q'$ contains one or more literals from $\mathcal{V}$.
\end{itemize}
$Q'$ is said to be \ki{locally minimal}\inddef{query!rewriting!locally minimal}
if no literal can be removed from $Q'$ without violating the equivalence.
The rewriting is \ki{globally minimal,}\inddef{query!rewriting!globally minimal}
if there exists no rewriting using a smaller number of literals.
(The comparison atoms $=,\ne,\le,<$ are not counted in the number of literals.)
\end{defi}
\begin{pelda}{\textit{ Query rewriting.}}\label{pel:pelda22}
Consider the following query $Q$ and view $V$.
\begin{equation}\label{eq:exa22}
\begin{array}{rcl}
Q\colon q(X,U)&\la&p(X,Y),p_0(Y,Z),p_1(X,W),p_2(W,U)\\
V\colon v(A,B)&\la&p(A,C),p_0(C,B),p_1(A,D) \koz .
\end{array}
\end{equation}
$Q$ can be rewritten using $V$:
\begin{equation}\label{eq:qvesszo}
Q'\colon q(X,U)\la v(X,Z),p_1(X,W),p_2(W,U) \koz .
\end{equation}
View $V$ replaces the first two literals of query $Q$. Note that the body of the view contains the third literal of the query, as well. However, that literal cannot be removed from the rewriting, since variable $D$ does not occur in the head of $V$: if literal $p_1$ were removed, too, then the natural join of $p_1$ and $p_2$ would no longer be enforced.
\end{pelda}
Since in some of the applications, for example data integration or data warehouses, the database relations are inaccessible and only the views can be accessed, the concept of \ki{complete rewriting} is introduced.
\begin{defi}\label{def:teljesatiras}
A rewriting $Q'$ of query $Q$ using views $\mathcal{V}=V_1,V_2,\ldots ,V_m$ is called a \ki{complete rewriting,}\inddef{query!rewriting!complete} if $Q'$ contains only literals of $\mathcal{V}$ and comparison atoms.
\end{defi}
\begin{pelda}{\textit{ Complete rewriting.}}\label{pel:pelda25}
Assume that besides view $V$ of Example~\ref{pel:pelda22} the following view is given, as well:
\begin{equation}\label{eq:exa25}
V_2\colon v_2(A,B)\la p_1(A,C),p_2(C,B),p_0(D,E) \koz .
\end{equation}
A complete rewriting of query $Q$ is:
\begin{equation}\label{eq:q2vesszo}
Q''\colon q(X,U)\la v(X,Z),v_2(X,U) \koz .
\end{equation}
It is important to see that this rewriting cannot be obtained \ki{step-by-step,} first using only $V$ and then trying to incorporate $V_2$ (or in the opposite order), since relation $p_0$ of $V_2$ does not occur in $Q'$. Thus, in order to find the complete rewriting, the use of the two views must be considered \ki{in parallel,} at the same time.
\end{pelda}
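The equivalence of $Q$ and the complete rewriting $Q''$ can be checked on a small instance. The Python sketch below (assumed toy data) evaluates $Q$ directly, materialises $V$ and $V_2$, and then evaluates $Q''$ over the views; the two answers coincide.

```python
# Toy instance (assumed data) for the relations of Examples above.
p = {(1, 2), (5, 6)}
p0 = {(2, 3), (6, 7)}
p1 = {(1, 4), (5, 9)}
p2 = {(4, 8)}

# Direct evaluation of Q: q(X,U) <- p(X,Y), p0(Y,Z), p1(X,W), p2(W,U).
q_direct = {(x, u)
            for (x, y) in p for (y2, z) in p0 if y2 == y
            for (x2, w) in p1 if x2 == x
            for (w2, u) in p2 if w2 == w}

# Materialised views V and V_2.
v = {(a, b) for (a, c) in p for (c2, b) in p0 if c2 == c
     for (a2, d) in p1 if a2 == a}
v2 = {(a, b) for (a, c) in p1 for (c2, b) in p2 if c2 == c
      if p0}                    # p0(D,E) only requires p0 to be non-empty

# Complete rewriting Q'': join of the two views on the first column.
q_rewritten = {(x, u) for (x, z) in v for (x2, u) in v2 if x2 == x}
```

Of course, agreement on one instance is only a sanity check; the equivalence itself follows from the Homomorphism Theorem discussed below.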
There is a close connection between finding a rewriting and the problem of
query containment. The latter was discussed for tableau queries in
section \ref{sec:tartalmazas}. Homomorphism\index{query!homomorphism} between
tableau queries can be defined for rule based queries, as well. The only
difference is that it is not required in this section that a homomorphism maps
the head of one rule to the head of the other. (The head of a rule corresponds
to the summary row of a tableau.) According to
Theorem~\ref{tet:tabla-np-telj} it is NP-complete to decide whether
conjunctive query $Q_1$ contains another conjunctive query $Q_2$. This remains
true in the case when $Q_2$ may contain comparison atoms, as well. However, if
both $Q_1$ and $Q_2$ may contain comparison atoms, then the existence of a
homomorphism from $Q_1$ to $Q_2$ is only a sufficient but not necessary
condition for the containment of queries, which is a $\Pi^p_2$-complete
problem in that case. The discussion of this latter complexity class is beyond
the scope of this chapter, thus it is omitted. The next proposition gives a
necessary and sufficient condition for the existence of a rewriting of query
$Q$ using view $V$.
\begin{allit}\label{all:atir-ekviv}
Let $Q$ and $V$ be conjunctive queries that may contain comparison
atoms. There exists a rewriting of query
$Q$ using view $V$ if and only if $\pi_{\emptyset}(Q)\subseteq
\pi_{\emptyset}(V)$, that is the projection of $V$ to the empty attribute set
contains that of $Q$.
\end{allit}
\begin{biz}
Observe that $\pi_{\emptyset}(Q)\subseteq \pi_{\emptyset}(V)$ is equivalent
with the following proposition: If the output of $V$ is empty for some
instance, then the same holds for the output of $Q$, as well.
Assume first that there exists a rewriting, that is, a rule equivalent with
$Q$ that contains $V$ in its body. If $r$ is an instance such that the
output of $V$ over $r$ is empty, then every rule that includes $V$ in its
body gives an empty result over $r$, too.
In order to prove the other direction, assume that if the output of $V$ is
empty for some instance, then the same holds for the output of $Q$, as
well. Let
\begin{equation}\label{eq:31bizqv}
\begin{array}{rcl}
Q\colon
q(\tilde{x})&\la&q_1(\tilde{x}),q_2(\tilde{x}),\ldots
,q_m(\tilde{x})\\
V\colon
v(\tilde{a})&\la&v_1(\tilde{a}),v_2(\tilde{a}),\ldots
,v_n(\tilde{a}) \koz .
\end{array}
\end{equation}
Let $\tilde{y}$ be a list of variables disjoint from variables of
$\tilde{x}$. Then the query $Q'$ defined by
\begin{equation}
Q'\colon q'(\tilde{x})\la q_1(\tilde{x}),q_2(\tilde{x}),\ldots
,q_m(\tilde{x}),v_1(\tilde{y}),v_2(\tilde{y}),\ldots
,v_n(\tilde{y})
\end{equation}
satisfies $Q\equiv Q'$. It is clear that $Q'\subseteq Q$. For the other
inclusion, let $r$ be an instance. If the output of $V$ over $r$ is empty,
then by the assumption the output of $Q$ is empty as well, so $Q\subseteq Q'$
holds trivially over $r$. Otherwise, fix a valuation of the variables of
$\tilde{y}$ that satisfies the body of $V$ over $r$; then every valuation of
the variables of $\tilde{x}$ that yields a tuple in the output of $Q$ yields,
together with the fixed valuation of $\tilde{y}$, the same tuple in the
output of $Q'$.
\end{biz}
As a corollary of Theorem~\ref{tet:tabla-np-telj} and
Proposition~\ref{all:atir-ekviv} the following theorem is obtained.
\begin{tetel}\label{tet:atirNP}
Let $Q$ be a conjunctive query that may contain comparison atoms, and let
$\mathcal{V}$ be a set of views. If the views in $\mathcal{V}$ are given by
conjunctive queries that do not contain comparison atoms, then it is
NP-complete to decide whether there exists a rewriting of $Q$ using
$\mathcal{V}$.
\end{tetel}
The proof of Theorem~\ref{tet:atirNP} is left for the Reader
(Exercise~\ref{gy:atirNPbiz}).
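The membership in NP rests on guessing and verifying a homomorphism. A backtracking search for homomorphisms between bodies of conjunctive queries without comparison atoms can be sketched in Python as follows; the encoding of literals as (predicate, argument-tuple) pairs is an assumption of this sketch, and the search takes exponential time in the worst case, in line with the NP-completeness results above.

```python
def homomorphisms(body1, body2):
    """Yield variable mappings sending every literal of body1 to some literal
    of body2.  Variables are strings starting with an upper-case letter;
    anything else is treated as a constant."""
    def is_var(t):
        return isinstance(t, str) and t[:1].isupper()

    def extend(i, sub):
        if i == len(body1):            # all literals of body1 mapped
            yield dict(sub)
            return
        pred, args = body1[i]
        for pred2, args2 in body2:     # try every literal of body2 as image
            if pred2 != pred or len(args2) != len(args):
                continue
            new = dict(sub)
            ok = True
            for a, b in zip(args, args2):
                if is_var(a):
                    if new.setdefault(a, b) != b:   # conflicting image
                        ok = False
                        break
                elif a != b:                        # constants must match
                    ok = False
                    break
            if ok:
                yield from extend(i + 1, new)

    yield from extend(0, {})

# Bodies of the first three literals of Q and of view V from the earlier
# rewriting example: a homomorphism into the body of V exists.
q_body = [("p", ("X", "Y")), ("p0", ("Y", "Z")), ("p1", ("X", "W"))]
v_body = [("p", ("A", "C")), ("p0", ("C", "B")), ("p1", ("A", "D"))]
h = next(homomorphisms(q_body, v_body))
```

Adding the literal $p_2(W,U)$ to the searched body makes the search fail, since $V$ contains no $p_2$ literal.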
The proof of Proposition~\ref{all:atir-ekviv} uses new variables. However,
according to the next lemma, this is not necessary. Another important
observation is that when a locally minimal rewriting is sought, it is enough
to consider a subset of the database relations occurring in the original
query; new database relations need not be introduced.
\begin{lemma}\label{lem:lemma33}
Let $Q$ be a conjunctive query that does not contain comparison atoms
\begin{equation}
Q\colon q(\tilde{X})\la p_1(\tilde{U}_1),p_2(\tilde{U}_2),\ldots
,p_n(\tilde{U}_n) \koz ,
\end{equation}
furthermore let $\mathcal{V}$ be a set of views that do not contain comparison
atoms either.
\begin{enumerate}
\item If $Q'$ is a locally minimal rewriting of $Q$ using $\mathcal{V}$, then
the set of database literals in $Q'$ is isomorphic to a subset of database
literals occurring in $Q$.
\item If
\begin{equation}
q(\tilde{X})\la p_1(\tilde{U}_1),p_2(\tilde{U}_2),\ldots
,p_n(\tilde{U}_n),v_1(\tilde{Y}_1),v_2(\tilde{Y}_2),\ldots ,v_k(\tilde{Y}_k)
\end{equation}
is a rewriting of $Q$ using the views, then there exists a rewriting
\begin{equation}
q(\tilde{X})\la p_1(\tilde{U}_1),p_2(\tilde{U}_2),\ldots
,p_n(\tilde{U}_n),v_1(\tilde{Y'}_1),v_2(\tilde{Y'}_2),\ldots ,v_k(\tilde{Y'}_k)
\end{equation}
such that $\{ \tilde{Y'}_1\cup\cdots
\cup\tilde{Y'}_k\}\subseteq\{\tilde{U}_1\cup\cdots
\cup\tilde{U}_n\}$, that is the rewriting does not introduce new variables.
\end{enumerate}
\end{lemma}
The details of the proof of Lemma~\ref{lem:lemma33} are left for the Reader
(Exercise~\ref{gy:l33biz}). The next lemma is of fundamental importance: A
minimal rewriting of $Q$ using $\mathcal{V}$ cannot \ki{increase} the number
of literals.
\begin{lemma}\label{lem:nemnovel}
Let $Q$ be a conjunctive query and $\mathcal{V}$ a set of views given by
conjunctive queries, both without comparison atoms. If the body of $Q$
contains $p$ literals and $Q'$ is a locally minimal rewriting of $Q$ using
$\mathcal{V}$, then $Q'$ contains at most $p$ literals.
\end{lemma}
\begin{biz}
Replacing the view literals of $Q'$ by their definitions, query $Q''$ is
obtained. Let $\varphi$ be a homomorphism from the body of $Q$ to $Q''$. The
existence of $\varphi$ follows from $Q\equiv Q''$ by the Homomorphism Theorem
(Theorem~\ref{tet:homom}). Each of the literals $l_1,l_2,\ldots ,l_p$ of the
body of $Q$ is mapped to at most one literal obtained from the expansion of
view definitions. If $Q'$ contains more than $p$ view literals, then the
expansion of some view literals in the body of $Q''$ is disjoint from the
image of $\varphi$. These view literals can be removed from the body of $Q'$
without changing the equivalence.
\end{biz}
\noindent Based on Lemma~\ref{lem:nemnovel} the following theorem can be
stated about the complexity of minimal rewritings.
\begin{tetel}\label{tet:atirbony}
Let $Q$ be a conjunctive query and $\mathcal{V}$ a set of views given by
conjunctive queries, both without comparison atoms. Let the body of $Q$
contain $p$ literals.
\begin{enumerate}
\item It is NP-complete to decide whether there exists a rewriting $Q'$ of $Q$ using
$\mathcal{V}$ that uses at most $k \ (\le p)$ literals.
\item It is NP-complete to decide whether there exists a rewriting $Q'$ of $Q$
using $\mathcal{V}$ that uses at most $k \ (\le p)$ database literals.
\item It is NP-complete to decide whether there exists a complete rewriting
of $Q$ using $\mathcal{V}$.
\end{enumerate}
\end{tetel}
\begin{biz}
The first statement is proved; the proofs of the other two are
similar. According to Lemmas~\ref{lem:nemnovel} and \ref{lem:lemma33}, only
such rewritings need to be considered that have at most as many literals as
the query itself, contain a subset of the literals of the query and do not
introduce new variables. Such a rewriting and the homomorphisms proving the
equivalence can be tested in polynomial time, hence the problem is in NP.
In order to prove that it is NP-hard, Theorem~\ref{tet:atirNP} is used. For a
given query $Q$ and view $V$, let $V'$ be the view whose head is the same as the
head of $V$, but whose body is the conjunction of the bodies of $Q$ and
$V$. It is easy to see that there exists a rewriting using $V'$ with a single
literal if and only if there exists a rewriting (with no restriction) using
$V$.
\end{biz}
\subsection{Practical algorithms}
In this section only complete rewritings are studied. This is not a real
restriction, since if database relations are also to be used, then views
mirroring the database relations one-to-one can be introduced. The concept of
\ki{equivalent}\index{query!rewriting!equivalent} rewriting introduced in
Definition~\ref{def:atiras} is appropriate if the goal of the rewriting is
query optimisation or providing physical data independence. However, in the
context of data integration or data warehouses equivalent rewritings cannot
always be sought, since not all the necessary data may be available. Thus, the
concept of maximally contained rewriting is introduced; in contrast to
equivalent rewritings, it depends on the query language used.
\begin{defi}\label{def:maxiatiras}
Let $Q$ be a query, $\mathcal{V}$ be a set of views, $\mathcal{L}$ be a query
language. $Q'$ is a \ki{maximally contained
rewriting}\inddef{query!rewriting!maximally contained} of $Q$ with respect to
$\mathcal{L}$, if
\begin{enumerate}
\item $Q'$ is a query of language $\mathcal{L}$ using only views from
$\mathcal{V}$,
\item $Q$ contains $Q'$,
\item if query $Q_1\in\mathcal{L}$ satisfies $Q'\sqsubseteq
Q_1\sqsubseteq Q$, then $Q'\equiv Q_1$.
\end{enumerate}
\end{defi}
\subsubsection{Query optimisation using materialised views.}
Before discussing how a traditional optimiser can be modified in order to use
materialised views instead of database relations, it is necessary to survey
when a view can be used to answer a given query. Essentially, view $V$ can be
used to answer query $Q$ if the intersection of the sets of database
relations in the body of $V$ and in the body of $Q$ is non-empty, and
furthermore some of the attributes selected by $V$ are also selected by $Q$.
Besides this, in the case of an equivalent rewriting, if $V$ contains
comparison atoms for attributes that also occur in $Q$, then the view must
apply a logically equivalent or weaker condition than the query.
If a logically stronger condition is applied in the view, then it can still be
part of a (maximally) contained rewriting. This is best shown via an
example. Consider the query $Q$ over schema \textbf{University} that lists
those (professor, student, semester) triples where the professor is the
advisor of the student and, in the given semester, the student registered for
some course taught by the professor.
\setlength{\arraycolsep}{0pt}
\begin{equation}\label{eq:hasznalhato-e}
\begin{array}{rcl}
Q\colon
q(\textit{PName,Student,Semester})&\la&\textit{Registered}(\textit{Student,C-number,Semester}),\\
& &\textit{Advisor}(\textit{PName,Student}),\\
& &\textit{Teaches}(\textit{PName,
C-number,Semester},x_E),\\
& &\textit{Semester}\ge\textrm{``Fall2000''} \koz .
\end{array}\end{equation}
\setlength{\arraycolsep}{\acs}
View $V_1$ below can be used to answer $Q$, since it uses the same join
condition for relations \textit{Registered} and \textit{Teaches} as $Q$, as is
shown by variables of the same name. Furthermore, $V_1$ selects the attributes
\textit{Student, PName, Semester}, which are necessary in order to properly join
with relation \textit{Advisor} and for the select clause of the query. Finally,
the predicate $\textit{Semester}>\textrm{``Fall1999''}$ is weaker than the
predicate $\textit{Semester}\ge\textrm{``Fall2000''}$ of the query.
\setlength{\arraycolsep}{0pt}
\begin{equation}\label{eq:hasznalhatov1}
\begin{array}{rcl}
V_1\colon v_1(\textit{Student,PName,Semester})&\la&\textit{Teaches}(\textit{PName,C-number,Semester},x_E),\\
&
&\textit{Registered}(\textit{Student,C-number,Semester}),\\
& &\textit{Semester}>\textrm{``Fall1999''}
\koz .
\end{array}\end{equation}
\setlength{\arraycolsep}{\acs}
The following four views illustrate how minor modifications to $V_1$ change
the usability in answering the query.
\setlength{\arraycolsep}{0pt}
\begin{equation}\label{eq:hasznalhatov2}
\begin{array}{rcl}
V_2\colon v_2(\textit{Student,Semester})&\la&\textit{Teaches}(x_P,\textit{C-number,Semester},x_E),\\
& &\textit{Registered}(\textit{Student,C-number,Semester}),\\
& &\textit{Semester}>\textrm{``Fall1999''} \koz .
\end{array}\end{equation}
\begin{equation}\label{eq:hasznalhatov3}
\begin{array}{rcl}
V_3\colon v_3(\textit{Student,PName,Semester})&\la&\textit{Teaches}(\textit{PName,
C-number},x_S,x_E),\\
& &\textit{Registered}(\textit{Student,C-number,Semester}),\\
& &\textit{Semester}>\textrm{``Fall1999''} \koz .
\end{array}\end{equation}
\begin{equation}\label{eq:hasznalhatov4}
\begin{array}{rcl}
V_4\colon v_4(\textit{Student,PName,Semester})&\la&\textit{Teaches}(\textit{PName,
C-number,Semester},x_E),\\
&
&\textit{Advisor}(\textit{PName},x_{St}),\textit{Professor}(\textit{PName},x_{A}),\\
& &\textit{Registered}(\textit{Student,C-number,Semester}),\\
& &\textit{Semester}>\textrm{``Fall1999''} \koz .
\end{array}\end{equation}
\begin{equation}\label{eq:hasznalhatov5}
\begin{array}{rcl}
V_5\colon v_5(\textit{Student,PName,Semester})&\la&\textit{Teaches}(\textit{PName,
C-number,Semester},x_E),\\
& &\textit{Registered}(\textit{Student,C-number,Semester}),\\
& &\textit{Semester}>\textrm{``Fall2001''} \koz .
\end{array}\end{equation}
\setlength{\arraycolsep}{\acs}
View $V_2$ is similar to $V_1$, except that it does not select the attribute
\textit{PName} from relation \textit{Teaches}, which is needed for the join with
relation \textit{Advisor} and for the selection of the query. Hence, to
use $V_2$ in the rewriting, it has to be joined with relation \textit{Teaches}
again. Still, if the join of relations \textit{Registered} and \textit{Teaches} is
very selective, then employing $V_2$ may actually result in a more efficient
query execution plan.
In view $V_3$ the join of relations \textit{Registered} and \textit{Teaches} is
over attribute \textit{C-number} only; the equality of variables \textit{Semester}
and $x_S$ is not required. Since variable $x_S$ is not selected by $V_3$, the
join predicate cannot be applied in the rewriting, and therefore there is
little gain in using $V_3$.
View $V_4$ considers only the professors who have at least one area of
research. Hence, the view applies an additional condition that does not exist
in the query, and cannot be used in an equivalent rewriting unless
union and negation are allowed in the rewriting language. However, if there is
an integrity
constraint stating that every professor has at least one area of research,
then an optimiser should be able to realise that $V_4$ is usable.
Finally, view $V_5$ applies a stronger predicate than the query, and is
therefore usable for a contained rewriting, but not for an equivalent
rewriting of the query.
\paragraph{System-R style optimisation}
Before discussing the changes to traditional optimisation, the
principles underlying the \ki{System-R style optimiser}\inddef{System-R style
optimiser} are first recalled briefly. System-R takes a bottom-up approach to
building query execution plans. In the first phase, it concentrates on plans
of size 1, i.e., chooses the best access paths to every table mentioned in the
query. In phase $n$, the algorithm considers plans of size $n$, by combining
plans obtained in the previous phases (of sizes $k$ and $n-k$). The algorithm
terminates after constructing plans that cover all the relations in the
query. The efficiency of System-R stems from the fact that it partitions query
execution plans into \ki{equivalence classes,} and only considers a single
execution plan for every equivalence class. Two plans are in the same
equivalence class if they
\begin{itemize}
\item cover the same set of relations in the query (and therefore are also of
the same size), and
\item produce the answer in the same interesting order.
\end{itemize}
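The bottom-up enumeration described above can be sketched as follows. The cost model, the single join method, and the omission of interesting orders are simplifying assumptions of this illustration, not part of System-R itself.

```python
# A minimal sketch of System-R style bottom-up plan enumeration with
# equivalence classes. An equivalence class is identified here solely by
# the set of relations covered (interesting orders are omitted); the
# cost functions are hypothetical placeholders.
def enumerate_plans(relations, access_cost, join_cost):
    """relations: list of names; access_cost: name -> cost;
    join_cost: (left_cost, right_cost) -> cost of the joined plan."""
    # Phase 1: best access path per single relation.
    best = {frozenset([r]): access_cost[r] for r in relations}
    n = len(relations)
    for size in range(2, n + 1):
        for k in range(1, size):
            # combine plans of sizes k and size - k
            for left, lcost in list(best.items()):
                if len(left) != k:
                    continue
                for right, rcost in list(best.items()):
                    if len(right) != size - k or left & right:
                        continue
                    cls = left | right
                    cost = join_cost(lcost, rcost)
                    # keep only the cheapest plan per equivalence class
                    if cls not in best or cost < best[cls]:
                        best[cls] = cost
    # terminate once the plans cover all relations of the query
    return best[frozenset(relations)]
```

With three relations and a toy additive cost model, the call below explores all join orders but stores only one plan per class of covered relations.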
In our context, the query optimiser builds query execution plans by accessing
a set of views, rather than a set of database relations. Hence, in addition to
the meta-data that the query optimiser has about the materialised views (e.g.,
statistics, indexes) the optimiser is also given as input the query
expressions defining the views. The additional issues that the optimiser needs
to consider in the presence of materialised views are as follows.
\begin{description}
\item{\textbf{A.}} In the first iteration the algorithm needs to decide which
views are \ki{relevant} to the query according to the conditions illustrated
above. The corresponding step is trivial in a traditional optimiser: a
relation is relevant to the query if it is in the body of the query
expression.
\item{\textbf{B.}} Since the query execution plans involve joins over views,
rather than joins over database relations, plans can no longer be neatly
partitioned into equivalence classes which can be explored in increasing
size. This observation implies several changes to the traditional algorithm:
\begin{enumerate}
\item \ki{Termination testing:} the algorithm needs to distinguish \ki{partial
query execution plans} of the query from \ki{complete query execution
plans.} The enumeration of the possible join orders terminates when there
are no more unexplored partial plans. In contrast, in the traditional
setting the algorithm terminates after considering the equivalence classes
that include all the relations of the query.
\item \ki{Pruning of plans:} a traditional optimiser compares pairs of
plans \ki{within} one equivalence class and saves only the cheapest one for
each class. In our context, the query optimiser needs to compare
\ki{any pair} of plans generated thus far. A plan $p$ is pruned if there is
another plan $p'$ that
\begin{enumerate}
\item is cheaper than $p$, and
\item has greater or equal contribution to the query than $p$. Informally, a
plan $p'$ contributes more to the query than plan $p$ if it covers more of
the relations in the query and selects more of the necessary attributes.
\end{enumerate}
\item \ki{Combining partial plans:} in the traditional setting, when two
partial plans are combined, the join predicates that involve both plans are
explicit in the query, and the enumeration algorithm need only consider the
most efficient way to apply these predicates. However, in our case, it may
not be obvious a priori which join predicate will yield a correct rewriting
of the query, since views are joined rather than database relations
directly. Hence, the enumeration algorithm needs to consider several
alternative join predicates. Fortunately, in practice, the number of join
predicates that need to be considered can be significantly pruned using
meta-data about the schema. For example, there is no point in trying to join
a string attribute with a numeric one. Furthermore, in some cases knowledge
of integrity constraints and the structure of the query can be used to
reduce the number of join predicates to be considered. Finally, after
considering all the possible join predicates, the optimiser also needs
to check whether the resulting plan is still a partial solution to the query.
\end{enumerate}
\end{description}
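The modified pruning rule of item 2 above can be sketched as follows. Modelling a plan's contribution as the pair (covered relations, selected attributes) is a simplification assumed here for illustration.

```python
# Sketch of the modified pruning rule: any pair of plans is compared,
# not only plans within one equivalence class.
def dominates(p2, p1):
    """p1, p2: dicts with keys 'cost', 'covers' (set), 'attrs' (set)."""
    return (p2["cost"] < p1["cost"]           # (a) cheaper, and
            and p2["covers"] >= p1["covers"]  # (b) contributes at least
            and p2["attrs"] >= p1["attrs"])   #     as much to the query

def prune(plans):
    """Keep only the plans not dominated by any other plan."""
    return [p for p in plans
            if not any(dominates(q, p) for q in plans if q is not p)]
```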
The following table summarises the comparison of the traditional optimiser
versus one that exploits materialised views.
\begin{center}
\begin{tabular}{p{6cm}p{6cm}}
{\large \textbf{Conventional optimiser}}&{\large \textbf{Optimiser using views
}}\\
\ki{Iteration 1}&\ki{Iteration 1}\\
a) Find all possible access paths. & a1) Find all views that are relevant to
the query.\newline
a2) Distinguish between partial and complete solutions to the query.\\
b) Compare their cost and keep the least expensive. & b)
Compare all pairs of views. If one has neither greater contribution nor a
lower cost than the other, prune it.\\
c) If the query has one relation, \key{stop}. & c) If there are no partial solutions, \key{stop}.\\
\ki{Iteration 2}&\ki{Iteration 2}\\
For each query join: & \\
a) Consider joining the relevant access paths found in the previous iteration
using all possible join methods. & a1) Consider joining all partial
solutions found in the previous iteration using all possible equi-join
methods and trying all possible subsets of join predicates.\newline a2)
Distinguish between partial and complete solutions to the query. \\
b) Compare the cost of the resulting join plans and keep the least expensive. &
b) If any newly generated solution is either not relevant to the query, or
dominated by another, prune it.\\
c) If the query has two relations, \key{stop}. & c) If there are no partial solutions, \key{stop}.\\
\ki{Iteration 3}&\ki{Iteration 3}\\
\hfill\vdots\hfill\hbox{ }&\hfill\vdots\hfill\hbox{ }
\end{tabular}
\end{center}
Another method of equivalent rewriting is the use of transformation rules. The
common theme of the works in this area is that replacing some part of the
query with a view is considered just another transformation available to the
optimiser. These methods are not discussed in detail here.
The query optimisers discussed above were designed to handle cases where the
number of views is relatively small (i.e., comparable to the size of the
database schema), and where equivalent rewriting is required. In
contrast, the context of data integration requires the consideration of a
large number of views, since each data source is described by one or more
views. In addition, the view definitions may contain many complex predicates,
whose goal is to express fine-grained distinctions between the contents of
different data sources. Furthermore, in the context of data integration it is
often assumed that the views are not complete, i.e., they may contain only a
subset of the tuples satisfying their definition. In the following, some
algorithms for answering queries using views are described that were developed
specifically for the context of data integration.
\subsubsection{The Bucket Algorithm.}
The goal of the Bucket Algorithm\inddef{bucket algorithm@Bucket
Algorithm} is to reformulate a user query that is posed
on a mediated (virtual) schema into a query that refers directly to the
available sources. Both the query and the sources are described by conjunctive
queries that may include atoms of arithmetic comparison predicates. The set of
comparison atoms of query $Q$ is denoted by $C(Q)$.
Since the number of possible rewritings may be exponential in the size of the
query, the main idea underlying the bucket algorithm is that the number of
query rewritings that need to be considered can be drastically reduced if each
\ki{subgoal}\inddef{subgoal}\inddef{query!conjunctive!subgoal} -- that is,
each relational atom of the query -- is first considered in isolation,
determining which views may be relevant to it.
The algorithm proceeds as follows. First, a \ki{bucket}\inddef{bucket} is
created for each subgoal of the query $Q$ that is not in $C(Q)$, containing
the views that are relevant to answering that particular subgoal. In the
second step, all those conjunctive query rewritings are considered that
include one conjunct (view) from each bucket. For each rewriting $V$ obtained,
it is checked whether it is semantically correct, that is, whether
$V\sqsubseteq Q$ holds, or whether it can be made semantically correct by
adding comparison atoms.
Finally, the remaining plans are minimised by pruning redundant subgoals.
Algorithm \textsc{Create-Bucket}\inddef{\textsc{Create-Bucket}}
executes the first step described above. Its input is a set of source
descriptions $\mathcal{V}$ and a conjunctive query $Q$ in the form
\begin{equation}\label{eq:vodorinput}
Q\colon Q(\tilde{X})\la R_1(\tilde{X}_1), R_2(\tilde{X}_2),\ldots ,
R_m(\tilde{X}_m),C(Q) \koz .
\end{equation}
\begin{algN}{Create-Bucket($Q$,$\mathcal{V}$)}
1 \' \key{for} \= $i\la 1$ \key{to} $m$\\
2 \' \> \key{do} \= $\textit{Bucket}[i]\la\emptyset$\\
3 \' \> \> \key{for} \= all $V\in\mathcal{V}$\\
4 \'\` $\rhd$ $V$ is of form $V\colon V(\tilde{Y})\la S_1(\tilde{Y}_1),\ldots ,
S_n(\tilde{Y}_n),C(V)$.\\
5 \' \> \> \> \key{do} \= \key{for} \= $j\la 1$ \key{to} $n$\\
6 \'\> \> \> \> \key{if} \= $R_i=S_j$\\
7 \' \> \> \> \> \> \key{then} \= let $\phi$ be a mapping defined on the
variables \\
\' \> \> \> \> \> \> of $V$ as follows:\\
8 \' \> \> \> \> \> \> \key{if} \= $y$ is the $k$\textsuperscript{th} variable
of $\tilde{Y}_j$ and $y\in\tilde{Y}$\\
9 \' \> \> \> \> \> \> \> \key{then} \= $\phi(y)=x_k$, where $x_k$ is the
$k$\textsuperscript{th} variable of
$\tilde{X}_i$\\
10 \' \> \> \> \> \> \> \> \key{else} \> $\phi(y)$ is a new variable that \\
\' \> \> \> \> \> \> \> \> does not appear in $Q$ or $V$.\\
11 \' \> \> \> \> \> \> $Q'()\la R_1(\tilde{X}_1),\ldots ,R_m(\tilde{X}_m),C(Q),
S_1(\phi(\tilde{Y}_1)), \ldots ,$ \\
\' \> \> \> \> \> \> $S_n(\phi(\tilde{Y}_n)),\phi(C(V))$\\
12 \' \> \> \> \> \> \> \key{if} \= \textsc{Satisfiable}$^{\ge}$($Q'$)\\
13 \' \> \> \> \> \> \> \> \key{then} add $\phi(V)$ to
$\textit{Bucket}[i]$. \\
14 \' \key{return} \textit{Bucket}
\end{algN}
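As an illustration, the pseudocode above can be transcribed almost line by line into Python. The representation of queries and views as triples and the stubbed-out satisfiability test are assumptions of this sketch, not part of the original algorithm.

```python
# A near line-by-line transcription of Create-Bucket. Queries and views
# are triples (head_vars, subgoals, comparisons), each subgoal a
# (relation, argument_list) pair; the Satisfiable test of line 12 is a
# caller-supplied stub.
import itertools

_fresh = itertools.count()

def create_bucket(query, views, satisfiable=lambda q: True):
    q_head, q_subgoals, q_comp = query
    buckets = []
    for r_i, x_args in q_subgoals:            # one bucket per subgoal
        bucket = []
        for view in views:
            v_head, v_subgoals, v_comp = view
            for s_j, y_args in v_subgoals:
                if r_i != s_j:                # line 6: relations must match
                    continue
                phi = {}                      # lines 7-10: the mapping phi
                for k, y in enumerate(y_args):
                    if y in v_head:
                        phi[y] = x_args[k]    # align with the query subgoal
                for _, args in v_subgoals:    # remaining variables of V
                    for y in args:            # get fresh names
                        phi.setdefault(y, "z%d" % next(_fresh))
                # line 11: Q' = query body plus the mapped view body
                q_prime = (q_subgoals
                           + [(s, [phi[a] for a in args])
                              for s, args in v_subgoals],
                           q_comp + [[phi.get(t, t) for t in c]
                                     for c in v_comp])
                if satisfiable(q_prime):      # line 12
                    bucket.append((view, phi))
        buckets.append(bucket)
    return buckets
```

Run on the query and views of the bucket-algorithm example below, this reproduces the bucket table: $V_1$ and $V_3$ land in the \textit{cite} buckets, $V_2$ and $V_3$ in the \textit{sameArea} bucket.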
Procedure \textsc{Satisfiable}$^{\ge}$ is the extension of
\textsc{Satisfiable} described in Section~\ref{subsec:kiterjesztesek} to the
case when comparison atoms may occur besides equality atoms. The only
necessary change is that for every variable $y$ occurring in comparison atoms
it must be checked whether all predicates involving $y$ are simultaneously
satisfiable.
The running time of \textsc{Create-Bucket} is a polynomial function of the
sizes of $Q$ and $\mathcal{V}$. Indeed, the kernel of the nested loops of
lines 3 and 5 runs $n\sum_{V\in\mathcal{V}}|V|$ times. The commands of lines
6--13 require constant time, except for line 12. The condition of the
\key{if} command in line 12 can be checked in polynomial time.
In order to prove the correctness of procedure \textsc{Create-Bucket}, one
should check under what condition a view $V$ is put into
$\textit{Bucket}[i]$. In line 6 it is checked whether relation $R_i$ appears
as a subgoal in $V$. If not, then obviously $V$ cannot provide usable
information for subgoal $R_i$ of $Q$. If $R_i$ is a subgoal of $V$, then in
lines 9--10 a mapping is constructed that, applied to the variables,
establishes the correspondence between subgoals $S_j$ and $R_i$, in accordance
with the relations occurring in the heads of $Q$ and $V,$ respectively.
Finally, in line 12 it is checked whether the comparison atoms contradict the
correspondence constructed.
In the second step, having constructed the buckets using
\textsc{Create-Bucket}, the bucket algorithm finds a set of \ki{conjunctive
query rewritings,} each of them being a conjunctive query that includes one
conjunct from every bucket. Each of these conjunctive query rewritings
represents one way of obtaining part of the answer to $Q$ from the views. The
result of the bucket algorithm is defined to be the union of the conjunctive
query rewritings (since each of the rewritings may contribute different
tuples). A given conjunctive query $Q'$ is a \ki{conjunctive
query rewriting,}\inddef{query!rewriting!conjunctive} if
\begin{enumerate}
\item $Q'\sqsubseteq Q$, or
\item $Q'$ can be extended with comparison atoms, so that the previous
property holds.
\end{enumerate}
\begin{pelda}{\textit{Bucket algorithm.}}\label{pel:vodoralgoritmus}
Consider the following query $Q$ that lists those articles $x$ for which there
exists another article $y$ of the same area such that $x$ and $y$ mutually
cite each other. There are three views (sources) available, $V_1,V_2,V_3$.
\begin{equation}\label{eq:qv456}
\begin{array}{lcl}
Q(x)&\la&\textit{cite}(x,y),\textit{cite}(y,x),\textit{sameArea}(x,y)\\
V_1(a)&\la&\textit{cite}(a,b),\textit{cite}(b,a)\\
V_2(c,d)&\la&\textit{sameArea}(c,d)\\
V_3(f,h)&\la&\textit{cite}(f,g),\textit{cite}(g,h),\textit{sameArea}(f,g) \koz
.
\end{array}
\end{equation}
In the first step, applying \textsc{Create-Bucket}, the following buckets are
constructed.
\begin{equation}\label{eq:vodrokqv456}
\begin{array}{c|c|c}
\textit{cite}(x,y)&\textit{cite}(y,x)&\textit{sameArea}(x,y)\\
\hline
V_1(x)&V_1(x)&V_2(x)\\
V_3(x)&V_3(x)&V_3(x)
\end{array}
\end{equation}
In the second step the algorithm constructs a conjunctive query $Q'$ from each
element of the Cartesian product of the buckets, and checks whether $Q'$ is
contained in $Q$. If yes, it is added to the answer.
In our case, it tries to match $V_1$ with the other views, but no correct
rewriting is obtained this way. The reason is that $b$ does not appear in the
head of $V_1$, hence the join condition of $Q$ -- variables $x$ and $y$ also
occur in relation \textit{sameArea} -- cannot be applied. Then
rewritings containing $V_3$ are considered, and it is recognised that by
equating the variables in the head of $V_3$ a contained rewriting is
obtained. Finally, the algorithm finds that combining $V_3$ and $V_2$
also yields a rewriting. However, simple checking shows that this latter
is redundant, that is, $V_2$ can be pruned. Thus, the
result of the bucket algorithm for query (\ref{eq:qv456}) is the
following (actually equivalent) rewriting
\begin{equation}\label{eq:qv6ekviv}
Q'(x)\la V_3(x,x) \koz .
\end{equation}
\end{pelda}
The strength of the bucket algorithm is that it exploits the
predicates in the query to prune significantly the number of
candidate conjunctive rewritings that need to be considered. Checking
whether a view should belong to a bucket can be done in time
polynomial in the size of the query and view definition when the
predicates involved are arithmetic comparisons. Hence, if the data
sources (i.e., the views) are indeed distinguished by having different
comparison predicates, then the resulting buckets will be relatively
small.
The main disadvantage of the bucket algorithm is that the Cartesian
product of the buckets may still be large. Furthermore, the second
step of the algorithm needs to perform a query containment test for
every candidate rewriting, which is NP-complete even when no
comparison predicates are involved.
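The second phase just discussed is a plain, if potentially expensive, loop over the Cartesian product of the buckets; the containment test itself, NP-complete in general, is left as a caller-supplied parameter in this sketch.

```python
# Sketch of the second phase of the bucket algorithm: every element of
# the Cartesian product of the buckets is a candidate rewriting, kept
# only if the supplied containment test accepts it (possibly after
# adding comparison atoms, which the stub must handle itself).
from itertools import product

def candidate_rewritings(buckets, contained_in_query):
    """buckets: list of lists of view atoms; contained_in_query:
    predicate implementing the test Q' contained-in Q."""
    return [combo for combo in product(*buckets)
            if contained_in_query(combo)]
```

With the buckets of the example above and a containment stub that accepts only the $V_3$-only combination, the single (equivalent) rewriting survives.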
\subsubsection{Inverse-rules algorithm.}
The Inverse-rules algorithm\index{inverse-rules algorithm} is a procedure that can be applied more
generally than the Bucket algorithm. In polynomial time, it finds a maximally
contained rewriting for any query given by an arbitrary recursive datalog
program that does not contain negation.
The first question is whether, for a given datalog program $\mathcal{P}$ and
set of conjunctive queries $\mathcal{V}$, there exists a
datalog program $\mathcal{P}_v$ equivalent with $\mathcal{P}$, whose
\textit{edb} relations are the relations $v_1,v_2,\ldots ,v_n$ of
$\mathcal{V}$. Unfortunately, this is algorithmically
undecidable. Surprisingly, the best, maximally contained rewriting
can nevertheless be constructed. In the case when there exists a
datalog program $\mathcal{P}_v$ equivalent with $\mathcal{P}$, the
algorithm finds it, since a maximally contained rewriting contains
$\mathcal{P}_v$ as well. This seemingly contradicts the fact that
the existence of an equivalent rewriting is algorithmically undecidable;
however, it is undecidable whether the result of the inverse-rules
algorithm is really equivalent to the original query.
\begin{pelda}{\textit{ Equivalent rewriting.}}\label{pel:feketeut}
Consider the following datalog program $\mathcal{P}$, where
\textit{edb} relations \textit{edge} and
\textit{black} contain the edges and the black-coloured vertices of a graph $G$,
respectively.
\begin{equation}\label{eq:feketeut}
\begin{array}{lrcl}
\mathcal{P}\colon&q(X,Y)&\la&\textit{edge}(X,Z),\textit{edge}(Z,Y),black(Z)\\
&q(X,Y)&\la&\textit{edge}(X,Z),black(Z), q(Z,Y) \koz .
\end{array}
\end{equation}
It is easy to check that $\mathcal{P}$ lists the endpoints of such
paths (more precisely walks) of graph $G$ whose inner points are all
black. Assume that only the following two views can be accessed.
\begin{equation}\label{eq:feketenez}
\begin{array}{rcl}
v_1(X,Y)&\la&\textit{edge}(X,Y),black(X)\\
v_2(X,Y)&\la&\textit{edge}(X,Y),black(Y)
\end{array}
\end{equation}
$v_1$ stores the edges whose tail is black, while $v_2$ stores those
whose head is black.
There exists an equivalent rewriting $\mathcal{P}_v$ of datalog
program $\mathcal{P}$ that uses only views $v_1$ and $v_2$ as
\textit{edb} relations:
\begin{equation}\label{eq:feketatir}
\begin{array}{lrcl}
\mathcal{P}_v\colon&q(X,Y)&\la&v_2(X,Z),v_1(Z,Y)\\
&q(X,Y)&\la&v_2(X,Z), q(Z,Y)
\end{array}
\end{equation}
However, if only $v_1$ or only $v_2$ is accessible, then an equivalent
rewriting is not possible, since only those paths are obtainable whose
starting, respectively ending, vertex is black.
\end{pelda}
In order to describe the Inverse-rules Algorithm, it is necessary to
introduce the \ki{Horn rule,}\inddef{Horn rule} which is a
generalisation of the \ki{datalog program}\index{datalog!program} and
the \ki{datalog rule.}\index{datalog!rule} If \ki{function symbols} are
also allowed in the free tuple $u_i$ of
rule (\ref{eq:dlog}) in Definition~\ref{def:dlog}, besides variables
and constants, then a \ki{Horn rule} is obtained. A \ki{logic
program}\inddef{logic program} is a collection of Horn rules. In
this sense a logic program without function symbols is a datalog
program. The concepts of \textit{edb} and \textit{idb} can be defined for
logic programs in the same way as for datalog programs.
The Inverse-rules Algorithm consists of two steps. First, a logic
program is constructed that may contain function symbols. However,
these will not occur in recursive rules, thus in the second step they
can be eliminated and the logic program can be transformed into a
datalog program.
\begin{defi}\label{def:nezetinverz}
The \ki{inverse}\inddef{view!inverse} $v^{-1}$ of view $v$ given by
\begin{equation}\label{eq:vdef}
v(X_1,\ldots ,X_m)\la v_1(\tilde{Y}_1),\ldots , v_n(\tilde{Y}_n)
\end{equation}
is the following collection of Horn rules. A rule corresponds to every
subgoal $v_i(\tilde{Y}_i)$; its body is the literal $v(X_1,\ldots
,X_m)$. The head of the rule is $v_i(\tilde{Z}_i)$, where
$\tilde{Z}_i$ is obtained from $\tilde{Y}_i$ by preserving the variables
appearing in the head of rule (\ref{eq:vdef}), while function symbol
$f_Y(X_1,\ldots ,X_m)$ is written in place of every variable $Y$ not
appearing in the head. Distinct function symbols correspond to distinct
variables. The inverse of a set $\mathcal{V}$ of views is the set
$\{v^{-1}\colon v\in\mathcal{V}\}$, where distinct function symbols
occur in the inverses of distinct rules.
\end{defi}
The idea behind the definition of inverses is that if a tuple $(x_1,\ldots
,x_m)$ appears in a view $v$, for some constants $x_1,\ldots ,x_m$, then
there is a valuation of every variable $Y$ not appearing in the head that
makes the body of the rule true. This ``unknown'' valuation is denoted
by the function symbol $f_Y(X_1,\ldots ,X_m)$.
\begin{pelda}{\textit{ Inverse of views.}}\label{pel:nezinverz}
Let $\mathcal{V}$ be the following collection of views.
\begin{equation}\label{eq:nezinverz}
\begin{array}{lcl}
v_1(X,Y)&\la&\textit{edge}(X,Z),\textit{edge}(Z,W),\textit{edge}(W,Y)\\
v_2(X)&\la&\textit{edge}(X,Z) \koz .
\end{array}
\end{equation}
Then $\mathcal{V}^{-1}$ consists of the following rules.
\begin{equation}\label{eq:nezinverzszab}
\begin{array}{lcl}
\textit{edge}(X,f_{1,Z}(X,Y))&\la&v_1(X,Y)\\
\textit{edge}(f_{1,Z}(X,Y),f_{1,W}(X,Y))&\la&v_1(X,Y)\\
\textit{edge}(f_{1,W}(X,Y),Y)&\la&v_1(X,Y)\\
\textit{edge}(X,f_{2,Z}(X))&\la&v_2(X) \koz .
\end{array}
\end{equation}
\end{pelda}
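The inverse construction of Definition~\ref{def:nezetinverz} can be turned into a short procedure: one Horn rule per subgoal of the view, with every non-head variable replaced by a function term. Encoding function terms as ("f_<view>_<var>", head-variables) tuples is an assumption of this sketch.

```python
# Sketch of the view-inverse construction. A view is given by its name,
# its head variables, and its subgoals as (relation, argument_list)
# pairs; each produced rule is a ((relation, new_args), (name, head))
# pair, with function terms for the non-head variables.
def invert_view(name, head_vars, subgoals):
    rules = []
    for rel, args in subgoals:
        new_args = []
        for y in args:
            if y in head_vars:
                new_args.append(y)        # head variable: kept as-is
            else:                         # non-head variable: replaced
                new_args.append(("f_%s_%s" % (name, y), tuple(head_vars)))
        rules.append(((rel, new_args), (name, list(head_vars))))
    return rules
```

Applied to $v_1$ and $v_2$ of the example, this reproduces the four rules of $\mathcal{V}^{-1}$, with the same function symbol $f_{1,Z}$ shared by the first two rules.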
Now, the maximally contained rewriting of datalog program
$\mathcal{P}$ using views $\mathcal{V}$ can easily be constructed for
given $\mathcal{P}$ and $\mathcal{V}$.
First, those rules are deleted from $\mathcal{P}$ that contain an
\textit{edb} relation not appearing in the definition of any view
from $\mathcal{V}$. The rules of $\mathcal{V}^{-1}$ are added to the
datalog program $\mathcal{P}^-$
obtained this way, thus forming the logic program
$(\mathcal{P}^-,\mathcal{V}^{-1})$. Note that the remaining
\textit{edb} relations of $\mathcal{P}$ are \textit{idb} relations in
the logic program $(\mathcal{P}^-,\mathcal{V}^{-1})$, since they appear in
the heads of the rules of $\mathcal{V}^{-1}$. The names of
\textit{idb} relations are arbitrary, so they can be renamed so that
they do not coincide with the names of the \textit{edb} relations
of $\mathcal{P}$. However, for the sake of better understanding, this is
not done in the following example.
\begin{pelda}{\textit{ Logic program.}}\label{pel:logiprog}
Consider the following datalog program that calculates the transitive
closure of relation \textit{edge}.
\begin{equation}\label{eq:tranzlez}
\begin{array}{lrcl}
\mathcal{P}\colon&q(X,Y)&\la&\textit{edge}(X,Y)\\
&q(X,Y)&\la&\textit{edge}(X,Z),q(Z,Y)
\end{array}
\end{equation}
Assume that only the following materialised view is accessible, which
stores the endpoints of the paths of length two.
\begin{equation}
v(X,Y)\la\textit{edge}(X,Z),\textit{edge}(Z,Y) \koz .
\end{equation}
If only this view is usable, then the most that can be expected is listing
the endpoints of paths of even length. Since the unique \textit{edb}
relation of datalog program $\mathcal{P}$ is \textit{edge}, which also
appears in the definition of $v$, the logic program
$(\mathcal{P}^-,\mathcal{V}^{-1})$ is obtained by adding the rules of
$\mathcal{V}^{-1}$ to $\mathcal{P}$.
\begin{equation}\label{eq:tranzlezinv}
\begin{array}{llcl}
(\mathcal{P}^-,\mathcal{V}^{-1})\colon&q(X,Y)&\la&\textit{edge}(X,Y)\\
&q(X,Y)&\la&\textit{edge}(X,Z),q(Z,Y)\\
&\textit{edge}(X,f(X,Y))&\la&v(X,Y)\\
&\textit{edge}(f(X,Y),Y)&\la&v(X,Y) \koz .
\end{array}\end{equation}
Let the instance of the \textit{edb} relation \textit{edge} of datalog
program $\mathcal{P}$ be the graph $G$ shown in
Figure~\ref{abr:grafg}.
\epskepnagy{figs/grafg.eps}{The graph $G$.\label{abr:grafg}}{}
Then $(\mathcal{P}^-,\mathcal{V}^{-1})$ introduces three new
constants, $f(a,c),f(b,d)$ and $f(c,e)$. The \textit{idb} relation
\textit{edge} of the logic program is the graph $G'$ shown in
Figure~\ref{abr:grafv-1}.
\epskepnagy{figs/grafv-1.eps}{The graph $G'$.\label{abr:grafv-1}}{}
$\mathcal{P}^-$ computes the transitive closure of graph $G'$. Note
that those pairs in the transitive closure that do not contain any of
the new constants are exactly the endpoints of even paths of $G$.
\end{pelda}
The result of logic program $(\mathcal{P}^-,\mathcal{V}^{-1})$ in
Example~\ref{pel:logiprog} can be calculated by procedure
\textsc{Naiv-Datalog}, for example. However, it is not true for logic
programs in general that the algorithm terminates. Indeed, consider
the logic program
\begin{equation}\label{eq:vegtelen}
\begin{array}{lcl}
q(X)&\la&p(X)\\
q(f(X))&\la&q(X) \koz .
\end{array}
\end{equation}
If the \textit{edb} relation $p$ contains the constant $a$, then the
output of the program is the infinite sequence
$a,f(a),f(f(a)),f(f(f(a))),\ldots$. In contrast, the output of
the logic
program $(\mathcal{P}^-,\mathcal{V}^{-1})$ given by the Inverse-rules
Algorithm is guaranteed to be finite, thus the computation terminates
in finite time.
\begin{tetel}\label{tet:evalveges}
For an arbitrary datalog program $\mathcal{P}$ and set of conjunctive
views $\mathcal{V}$, and for finite instances of the views, there
exists a unique minimal fixpoint of the logic program
$(\mathcal{P}^-,\mathcal{V}^{-1})$; furthermore, procedures
\textsc{Naiv-Datalog}\index{\textsc{Naiv-Datalog}} and
\textsc{Semi-Naiv-Datalog}\index{\textsc{Semi-Naiv-Datalog}}
give this minimal fixpoint as output.
\end{tetel}
The essence of the proof of Theorem~\ref{tet:evalveges} is that
function symbols are only introduced by inverse rules, which are in
turn not recursive, thus terms containing nested function symbols are
not produced. The details of the proof are left for the Reader
(Exercise~\ref{gy:evalvegesbiz}).
Even if started from the \textit{edb} relations of a database, the
output of a logic program may contain tuples that have function
symbols. Thus, a filter is introduced that eliminates the unnecessary
tuples. Let database $D$ be the instance of the \textit{edb} relations
of datalog program $\mathcal{P}$.
$\mathcal{P}(D)\hspace{-4pt}\downarrow$ denotes the set of those
tuples from $\mathcal{P}(D)$ that do not contain function
symbols. Let $\mathcal{P}\hspace{-3pt}\downarrow$ denote the program
that computes $\mathcal{P}(D)\hspace{-3pt}\downarrow$ for a given
instance $D$. The proof of the following theorem exceeds the
limitations of the present chapter.
\begin{tetel}\label{tet:p-v-1maxtart}
For arbitrary datalog program $\mathcal{P}$ and set of conjunctive
views $\mathcal{V}$, the logic program
$(\mathcal{P}^-,\mathcal{V}^{-1})\hspace{-3pt}\downarrow$ is a
maximally contained rewriting of $\mathcal{P}$ using
$\mathcal{V}$. Furthermore, $(\mathcal{P}^-,\mathcal{V}^{-1})$ can be
constructed in polynomial time of the sizes of $\mathcal{P}$ and
$\mathcal{V}$.
\end{tetel}
The meaning of Theorem~\ref{tet:p-v-1maxtart} is that the simple
procedure of adding the inverses of the view definitions to a datalog
program results in a logic program that uses the views as much as
possible. It is easy to see that $(\mathcal{P}^-,\mathcal{V}^{-1})$ can be
constructed in time polynomial in the sizes of $\mathcal{P}$ and
$\mathcal{V}$, since a unique inverse rule must be constructed for
every subgoal of each view in $\mathcal{V}$.
In order to completely solve the rewriting problem however, a datalog
program needs to be produced that is equivalent with the logic program
$(\mathcal{P}^-,\mathcal{V}^{-1})\hspace{-3pt}\downarrow$. The key to
this is the observation that
$(\mathcal{P}^-,\mathcal{V}^{-1})\hspace{-3pt}\downarrow$ contains
only finitely many function symbols, furthermore during a bottom-up
evaluation such as \textsc{Naiv-Datalog} and its variants, nested
function symbols are not produced. With proper bookkeeping the
appearance of function symbols can be kept track of, without actually
producing those tuples that contain them.
The transformation is done bottom-up as in procedure
\textsc{Naiv-Datalog}. The function symbol $f(X_1,\ldots
,X_k)$ appearing in an \textit{idb} relation of $\mathcal{V}^{-1}$ is
replaced by the list of variables $X_1,\ldots
,X_k$. At the same time the name of the \textit{idb} relation needs to be
marked, to remember that the list $X_1,\ldots
,X_k$ belongs to function symbol $f(X_1,\ldots
,X_k)$. Thus, new ``temporary'' relation names are
introduced. Consider the rule
\begin{equation}\label{eq:megjelolni}
\textit{edge}(X,f(X,Y))\la v(X,Y)
\end{equation}
of the logic program (\ref{eq:tranzlezinv}) in
Example~\ref{pel:logiprog}. It is replaced by rule
\begin{equation}\label{eq:helyebe}
\textit{edge}^{\langle 1,f(2,3)\rangle}(X,X,Y)\la v(X,Y)
\end{equation}
Notation $\langle 1,f(2,3)\rangle$ means that the first argument of
$\textit{edge}^{\langle 1,f(2,3)\rangle}$ is the same as the first
argument of $\textit{edge}$, while the second and third arguments of
$\textit{edge}^{\langle 1,f(2,3)\rangle}$ together with function symbol
$f$ give the second argument of $\textit{edge}$. If a function symbol
would become an argument of an \textit{idb} relation of
$\mathcal{P}^-$ during the bottom-up evaluation of
$(\mathcal{P}^-,\mathcal{V}^{-1})$, then a new rule is added to the
program with appropriately marked relation names.
\begin{pelda}{\textit{ Transformation of logic program into datalog
program.}}\label{pel:logiboldatalog}
The logic program Example~\ref{pel:logiprog} is transformed to the following
datalog program by the procedure sketched above. The different phases
of the bottom-up execution of \textsc{Naiv-Datalog} are separated
by lines.
\begin{equation}\label{eq:logiboldatalog}
\begin{array}{lcl}
\textit{edge}^{\langle 1,f(2,3)\rangle}(X,X,Y)&\la&v(X,Y)\\
\textit{edge}^{\langle f(1,2),3\rangle}(X,Y,Y)&\la&v(X,Y)\\
\hline
q^{\langle 1,f(2,3)\rangle}(X,Y_1,Y_2)&\la&\textit{edge}^{\langle
1,f(2,3)\rangle}(X,Y_1,Y_2)\\
q^{\langle f(1,2),3\rangle}(X_1,X_2,Y)&\la&\textit{edge}^{\langle
f(1,2),3\rangle}(X_1,X_2,Y)\\
\hline
q(X,Y)&\la&\textit{edge}^{\langle 1,f(2,3)\rangle}(X,Z_1,Z_2),q^{\langle
f(1,2),3\rangle}(Z_1,Z_2,Y)\\
q^{\langle f(1,2),f(3,4)\rangle}(X_1,X_2,Y_1,Y_2)&\la&\textit{edge}^{\langle
f(1,2),3\rangle}(X_1,X_2,Z),q^{\langle 1,f(2,3)\rangle}(Z,Y_1,Y_2)\\
\hline
q^{\langle f(1,2),3\rangle}(X_1,X_2,Y)&\la&\textit{edge}^{\langle
f(1,2),3\rangle}(X_1,X_2,Z),q(Z,Y)\\
q^{\langle 1,f(2,3)\rangle}(X,Y_1,Y_2)&\la&\textit{edge}^{\langle
1,f(2,3)\rangle}(X,Z_1,Z_2),q^{\langle
f(1,2),f(3,4)\rangle}(Z_1,Z_2,Y_1,Y_2)
\end{array}
\end{equation}
\end{pelda}
The datalog program obtained clearly shows which arguments could
involve function symbols in the original logic program. However, some
rules containing function symbols never produce tuples without
function symbols during the evaluation of the output of the program.
Relation $p$ is called \ki{significant,} if in the precedence
graph\index{precedence graph} of
Definition~\ref{def:fuggraf}\footnote{ Here the definition of the precedence
graph needs to be extended to the \textit{edb} relations of the
datalog program, as well.} there exists a directed path from $p$ to
the output relation $q$. If $p$ is not significant, then the tuples of $p$
are not needed to compute the output of the program,
thus $p$ can be eliminated from the program.
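The significance test is a plain reachability question on the precedence graph and can be sketched as a backward search from the output relation. Representing rules as (head, body-relations) pairs and the shorthand relation names in the test are assumptions of this sketch.

```python
# Sketch of the significance test: relation p is significant iff the
# output relation is reachable from p along precedence-graph edges
# (body relation -> head relation).
def significant_relations(rules, output):
    """rules: list of (head_relation, [body_relations]). Returns the
    set of relations from which `output` is reachable, found by a
    backward search from `output`."""
    preds = {}                        # head -> relations feeding it
    for head, body in rules:
        preds.setdefault(head, set()).update(body)
    seen, stack = {output}, [output]
    while stack:
        r = stack.pop()
        for p in preds.get(r, ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen
```

On the rule structure of program (\ref{eq:logiboldatalog}), abbreviating the marked relations, the search correctly excludes $q^{\langle 1,f(2,3)\rangle}$ and $q^{\langle f(1,2),f(3,4)\rangle}$.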
\begin{pelda}{\textit{ Eliminating non-significant
relations.}}\label{pel:nemfontos}
There exists no directed path in the precedence graph of the datalog
program obtained in Example~\ref{pel:logiboldatalog}, from relations
$q^{\langle 1,f(2,3)\rangle}$ and $q^{\langle f(1,2),f(3,4)\rangle}$
to the output relation $q$ of the program, thus they are not
significant, i.e., they can be eliminated together with the rules
that involve them. The following datalog program is obtained:
\begin{equation}\label{eq:nemfontos}
\begin{array}{lcl}
\textit{edge}^{\langle 1,f(2,3)\rangle}(X,X,Y)&\la&v(X,Y)\\
\textit{edge}^{\langle f(1,2),3\rangle}(X,Y,Y)&\la&v(X,Y)\\
q^{\langle f(1,2),3\rangle}(X_1,X_2,Y)&\la&\textit{edge}^{\langle
f(1,2),3\rangle}(X_1,X_2,Y)\\
q^{\langle f(1,2),3\rangle}(X_1,X_2,Y)&\la&\textit{edge}^{\langle
f(1,2),3\rangle}(X_1,X_2,Z),q(Z,Y)\\
q(X,Y)&\la&\textit{edge}^{\langle 1,f(2,3)\rangle}(X,Z_1,Z_2),q^{\langle
f(1,2),3\rangle}(Z_1,Z_2,Y) \koz .
\end{array}
\end{equation}
\end{pelda}
One more simplification step can be performed, which does not decrease
the number of necessary derivations during the computation of the output,
but avoids redundant data copying. If $p$ is a relation in
the datalog program that is defined by a single rule, which in turn
contains a single relation in its body, then $p$ can be removed from
the program and replaced by the relation in the body of the rule
defining $p$, having equated the variables accordingly.
\begin{pelda}{\textit{ Avoiding unnecessary data copying.}}\label{pel:nemmasol}
In Example~\ref{pel:logiboldatalog} the relations
$\textit{edge}^{\langle 1,f(2,3)\rangle}$ and $\textit{edge}^{\langle
f(1,2),3\rangle}$ are each defined by a single rule, and both of these
rules have a single relation in their bodies. Hence, program
(\ref{eq:nemfontos}) can be simplified further.
\begin{equation}\label{eq:nemmasol}
\begin{array}{lcl}
q^{\langle f(1,2),3\rangle}(X,Y,Y)&\la&v(X,Y)\\
q^{\langle f(1,2),3\rangle}(X,Z,Y)&\la&v(X,Z),q(Z,Y)\\
q(X,Y)&\la&v(X,Z),q^{\langle f(1,2),3\rangle}(X,Z,Y) \koz .
\end{array}
\end{equation}
\end{pelda}
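The copy-elimination step can be mechanised. The sketch below is illustrative only: it assumes rules are given as (head, body) pairs of (name, variable-tuple) atoms, and that every body variable of an inlined rule also occurs in its head, as in the example above. Repeated head variables of the inlined relation force equalities among the variables of the rule in which it occurs, implemented here with a small union-find.

```python
from collections import defaultdict

def inline_copy_rules(rules):
    """Inline every relation defined by a single rule whose body is a
    single atom; the defining rules themselves are dropped."""
    defs = defaultdict(list)
    for head, body in rules:
        defs[head[0]].append((head, body))
    # Relations defined by exactly one rule with a single body atom.
    copy = {n: rs[0] for n, rs in defs.items()
            if len(rs) == 1 and len(rs[0][1]) == 1}

    out = []
    for head, body in rules:
        if head[0] in copy:
            continue                      # the defining rule disappears
        parent = {}                       # union-find over the rule's variables
        def find(v):
            parent.setdefault(v, v)
            while parent[v] != v:
                parent[v] = parent[parent[v]]
                v = parent[v]
            return v
        new_body = []
        for name, args in body:
            if name in copy:
                (_, hvars), [(bname, bvars)] = copy[name]
                m = {}
                for hv, a in zip(hvars, args):
                    if hv in m:           # repeated head variable: equate
                        parent[find(m[hv])] = find(a)
                    else:
                        m[hv] = a
                new_body.append((bname, tuple(m[v] for v in bvars)))
            else:
                new_body.append((name, args))
        canon = lambda atom: (atom[0], tuple(find(v) for v in atom[1]))
        out.append((canon(head), [canon(a) for a in new_body]))
    return out
```

Applied to program (\ref{eq:nemfontos}) encoded in this representation, the three rules of (\ref{eq:nemmasol}) are obtained, up to renaming of variables.
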
The datalog program obtained in the two simplification steps above is
denoted by $(\mathcal{P}^-,\mathcal{V}^{-1})^{\textit{datalog}}$. It
is clear that there exists a one-to-one correspondence between the
bottom-up evaluations of $(\mathcal{P}^-,\mathcal{V}^{-1})$ and
$(\mathcal{P}^-,\mathcal{V}^{-1})^{\textit{datalog}}$. Since the
function symbols in
$(\mathcal{P}^-,\mathcal{V}^{-1})^{\textit{datalog}}$ are kept track
of, it is guaranteed that the output instance obtained is exactly the
set of those tuples in the output of
$(\mathcal{P}^-,\mathcal{V}^{-1})$ that do not contain function
symbols.
\begin{tetel}\label{tet:dlogekviv}
For an arbitrary datalog program $\mathcal{P}$ that does not contain
negations and an arbitrary set of conjunctive views $\mathcal{V}$, the
logic program $(\mathcal{P}^-,\mathcal{V}^{-1})\hspace{-3pt}\downarrow$ is
equivalent to the datalog program
$(\mathcal{P}^-,\mathcal{V}^{-1})^{\textit{datalog}}$.
\end{tetel}
\subsubsection{MiniCon.}
The main disadvantage of the Bucket Algorithm \index{bucket
algorithm@Bucket Algorithm}
is that it considers each of the subgoals in isolation, and therefore
misses most of the interactions between the subgoals of the
views. Thus, the buckets may contain many unusable views, and the
second phase of the algorithm may become very expensive.

The advantage of the Inverse-rules Algorithm\index{inverse-rules
algorithm@Inverse-rules Algorithm} is its conceptual simplicity and
modularity. The inverses of the views need to be computed only once;
then they can be applied to arbitrary queries given by datalog
programs. On the other hand, much of the computational advantage of
exploiting the materialised views can be lost. Using the rewriting
produced by the algorithm for actually evaluating queries
from the views has a significant drawback: it insists on
recomputing the extensions of the database relations.

The \textsc{MiniCon} algorithm addresses the limitations of the
previous two algorithms. The key idea underlying the algorithm is a
change of perspective: instead of building rewritings for each of the
query \ki{subgoals,}\index{query!subgoal} it is considered how each of
the \ki{variables}\inddef{query!variables of} in the query can
interact with the available views. The result is that the second phase
of \textsc{MiniCon} needs to consider drastically fewer combinations
of views. In the following we return to
conjunctive queries, and for ease of understanding only views
that do not contain constants are considered.

The \textsc{MiniCon} algorithm starts out like the Bucket Algorithm,
considering which views contain subgoals that correspond to subgoals
in the query. However, once the algorithm finds a partial variable
mapping from a subgoal $g$ in the query to a subgoal $g_1$ in a view
$V$, it changes perspective and looks at the variables in the
query. The algorithm considers the join predicates in the query --
which are specified by multiple occurrences of the same variable --
and finds the minimal additional set of subgoals that \ki{must} be
mapped to subgoals in $V$, given that $g$ will be mapped to
$g_1$. This set of subgoals and mapping information is called a
\ki{MiniCon Description (MCD)}\index{MiniCon
Description}\index{MCD}. In the second phase the MCDs are combined
to produce query rewritings. The construction of the MCDs makes the
most expensive part of the Bucket Algorithm obsolete, namely the
checking of containment between the rewritings and the query, because
the rule generating the MCDs ensures that their join gives a
correct result.

For a given mapping $\tau\colon Var(Q)\longrightarrow Var(V)$, subgoal
$g_1$ of view $V$ is said to \ki{cover} a subgoal $g$ of query $Q$
if $\tau(g)=g_1$. Here $Var(Q)$ and $Var(V)$ denote the sets of
variables of the query and of the view, respectively. In order
to prove that a rewriting produces only tuples that belong to the output
of the query, a homomorphism must be exhibited from the query onto the
rewriting. An MCD can be considered as a part of such a homomorphism,
and these parts can then be put together easily.

The rewriting of query $Q$ is a union of conjunctive queries using the
views. Some of the variables may be equated in the heads of some of
the views as in the equivalent rewriting (\ref{eq:qv6ekviv}) of
Example~\ref{pel:vodoralgoritmus}. Thus, it is useful to introduce the
concept of \ki{head homomorphism.}\inddef{head homomorphism} The
mapping $h\colon Var(V)\longrightarrow Var(V)$ is a
\ki{head homomorphism,} if it is the identity on variables that do not
occur in the head of $V$, but it may equate variables of the head. For
every variable $x$ of the head of $V$, $h(x)$ also appears in the
head of $V$, furthermore $h(x)=h(h(x))$. Now, the exact definition of
an MCD can be given.
\begin{defi}\label{def:mcd}
The quadruple $C=(h_C,V(\tilde{Y})_C,\varphi_C,G_C)$ is a \ki{MiniCon
Description (MCD)}\inddef{MiniCon Description (MCD)}\inddef{MCD} for
query $Q$ over view $V$, where
\begin{itemize}
\item $h_C$ is a head homomorphism over $V$,
\item $V(\tilde{Y})_C$ is obtained from $V$ by applying $h_C$, that is
$\tilde{Y}=h_C(\tilde{A})$, where $\tilde{A}$ is the set of
variables appearing in the head of $V$,
\item $\varphi_C$ is a partial mapping from $Var(Q)$ to $h_C(Var(V))$,
\item $G_C$ is a set of subgoals of $Q$ that are covered by some
subgoal of $h_C(V)$ using the mapping $\varphi_C$ (note: not all such
subgoals are necessarily included in $G_C$).
\end{itemize}
\end{defi}
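The head-homomorphism conditions used in the definition above can be checked mechanically. A minimal sketch follows, with variables represented as strings and the mapping $h$ as a dictionary; the example data is the head homomorphism of $V_3$ appearing in table (\ref{eq:MCDkesz}) below.

```python
def is_head_homomorphism(h, head_vars, all_vars):
    """Check the defining conditions of a head homomorphism on a view:
    identity outside the head, and an idempotent map of head variables
    into head variables."""
    for v in all_vars - head_vars:
        if h.get(v, v) != v:
            return False          # must be the identity outside the head
    for v in head_vars:
        if h[v] not in head_vars or h[h[v]] != h[v]:
            return False          # h(x) is a head variable and h(h(x)) = h(x)
    return True

# The head homomorphism of V_3: f -> f, h -> f (both head variables).
print(is_head_homomorphism({"f": "f", "h": "f"}, {"f", "h"}, {"f", "h"}))  # True
```

Note that a mapping merely permuting head variables, such as swapping $f$ and $h$, is rejected because it is not idempotent.
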
The procedure constructing MCDs is based on the following proposition.
\begin{allit}\label{all:property1}
Let $C$ be a MiniCon Description over view $V$ for query $Q$. $C$ can
be used for a non-redundant rewriting of $Q$ if the following conditions hold:
\begin{description}
\item{\textbf{C1.}} for every variable $x$ that is in the head of $Q$
and is in the domain of $\varphi_C$, as well, $\varphi_C(x)$ appears in
the head of $h_C(V)$, furthermore
\item{\textbf{C2.}} if $\varphi_C(y)$ does not appear in the head of
$h_C(V)$, then for every subgoal $g$ of $Q$ that contains $y$:
\begin{enumerate}
\item every variable of $g$ appears in the domain of $\varphi_C$, and
\item $\varphi_C(g)\in h_C(V)$.
\end{enumerate}
\end{description}
\end{allit}
Clause \textbf{C1} is the same as in the Bucket Algorithm. Clause \textbf{C2}
means that if a variable $x$ is part of a join predicate that is not enforced
by the view, then $x$ must be in the head of the view, so that the join
predicate can be applied by another subgoal in the rewriting. The procedure
\textsc{Form-MCDs} gives the usable MiniCon Descriptions for a conjunctive
query $Q$ and set of conjunctive views $\mathcal{V}$.
\begin{algN}{Form-MCDs($Q,\mathcal{V}$)}\inddef{form-MCDs@\textsc{Form-MCDs}}
1 \' $\mathcal{C}\la\emptyset$\\
2 \' \key{for} \= each subgoal $g$ of $Q$\\
3 \' \> \key{do} \= \key{for} $V\in\mathcal{V}$\\
4 \' \> \> \key{do} \= \key{for} every subgoal $v\in V$\\
5 \' \> \> \> \key{do} \= Let $h$ be the least restrictive head homomorphism on
$V$,\\
\> \> \> \> such that there exists a mapping $\varphi$ with
$\varphi(g)=h(v)$.\\
6 \' \> \> \> \>\key{if} \= $\varphi$ and $h$ exist\\
7 \' \> \> \> \> \> \key{then} \= Add to $\mathcal{C}$ any new MCD $C$, \\
\' \> \> \> \> \> \> that can be constructed where:\\
8 \' \> \> \> \> \> \> (a) \= $\varphi_C$ (respectively, $h_C$) is \\
\' \> \> \> \> \> \> \> an extension of $\varphi$ (respectively, $h$),\\
9 \' \> \> \> \> \> \> (b) \= $G_C$ is the minimal subset of subgoals of $Q$ such that \\
\' \> \> \> \> \> \> \> $G_C$, $\varphi_C$ and $h_C$ satisfy
Proposition~\ref{all:property1}, and\\
10 \' \> \> \> \> \> \> (c) \= It is not possible to extend $\varphi$ and $h$ to $\varphi_C'$ \\
\' \> \> \> \> \> \> \> and $h_C'$ such that (b) is satisfied, \\
\' \> \> \> \> \> \> \> and $G_C'$ as defined in (b), is a subset of
$G_C$.\\
11 \' \key{return} $\mathcal{C}$
\end{algN}
Consider again query (\ref{eq:qv456}) and the views of
Example~\ref{pel:vodoralgoritmus}. Procedure \textsc{Form-MCDs} considers
subgoal $\textit{cite}(x,y)$ of the query first. It does not create an MCD
for view $V_1$, because clause \textbf{C2} of Proposition~\ref{all:property1}
would be violated. Indeed, the condition would require that subgoal
$\textit{sameArea}(x,y)$ be also covered by $V_1$ using the mapping
$\varphi(x)=a$, $\varphi(y)=b$, since $b$ is not in the head of
$V_1$.\footnote{ The case of $\varphi(x)=b$, $\varphi(y)=a$ is similar.} For
the same reason, no MCD will be created for $V_1$ even when the other
subgoals of the query are considered. In a sense, the MiniCon Algorithm
shifts some of the work done by the combination step of the Bucket Algorithm
to the phase of creating the MCDs by using \textsc{Form-MCDs}. The following
table shows the output of procedure \textsc{Form-MCDs}.
\begin{equation}\label{eq:MCDkesz}
\begin{array}{l|l|l|l}
V(\tilde{Y})&h&\varphi&G\\
\hline
V_2(c,d)&c\rightarrow c,d\rightarrow d&x\rightarrow c,y\rightarrow d&3\\
V_3(f,f)&f\rightarrow f,h\rightarrow f&x\rightarrow f,y\rightarrow f&1,2,3
\end{array}
\end{equation}
Procedure \textsc{Form-MCDs} includes in $G_C$ only the \ki{minimal} set of
subgoals that are necessary in order to satisfy
Proposition~\ref{all:property1}. As a result, the second phase of the
MiniCon Algorithm needs to consider only combinations of MCDs
that cover \ki{pairwise disjoint subsets} of the subgoals of the query.
\begin{allit}\label{all:property2}
Given a query $Q$, a set of views $\mathcal{V}$, and the set of MCDs
$\mathcal{C}$ for $Q$ over the views $\mathcal{V}$, the only combinations of
MCDs that can result in non-redundant rewritings of $Q$ are of the form
$C_1,\ldots ,C_l$, where
\begin{description}
\item{\textbf{C3.}} $G_{C_1}\cup\cdots\cup G_{C_l}$ contains all the subgoals
of $Q$, and
\item{\textbf{C4.}} for every $i\ne j$, $G_{C_i}\cap G_{C_j}=\emptyset$.
\end{description}
\end{allit}
The fact that only such sets of MCDs need to be considered that provide
partitions of the subgoals in the query reduces the search space of the
algorithm drastically.
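Conditions \textbf{C3} and \textbf{C4} say that the chosen MCDs must partition the subgoals of the query, i.e., form an exact cover. A minimal backtracking sketch, with subgoals numbered $0,\ldots ,n-1$ and each MCD represented only by its set $G_C$:

```python
def mcd_partitions(goal_sets, n_subgoals):
    """Enumerate index lists of MCDs whose G_C sets form a partition of
    the query's subgoals (conditions C3 and C4)."""
    goals = frozenset(range(n_subgoals))

    def extend(chosen, covered, start):
        if covered == goals:            # C3: every subgoal is covered
            yield list(chosen)
            return
        for i in range(start, len(goal_sets)):
            if covered & goal_sets[i]:  # C4: sets must stay pairwise disjoint
                continue
            yield from extend(chosen + [i], covered | goal_sets[i], i + 1)

    yield from extend([], frozenset(), 0)

# The two MCDs of table (eq:MCDkesz) have G = {3} and G = {1,2,3},
# renumbered here as {2} and {0,1,2}: only the second MCD alone
# partitions the three subgoals.
print(list(mcd_partitions([frozenset({2}), frozenset({0, 1, 2})], 3)))
# [[1]]
```
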
In order to formulate procedure \textsc{Combine-MCDs}, another notation needs
to be introduced. The mapping $\varphi_C$ of MCD $C$ may map a set of
variables of $Q$ onto the same variable of $h_C(V)$. A representative
of this set is chosen arbitrarily, with the only restriction that if
the set contains variables from the head of $Q$, then one of those must
be chosen. Let $EC_{\varphi_C}(x)$ denote the representative variable of the
set containing $x$. In the following, the MiniCon Description $C$ is
considered extended with $EC_{\varphi_C}$, as a quintuple
$(h_C,V(\tilde{Y}),\varphi_C,G_C,EC_{\varphi_C})$. If the MCDs $C_1,\ldots
,C_k$ are to be combined, and for some $i\ne j$ both
$EC_{\varphi_{C_i}}(x)=EC_{\varphi_{C_i}}(y)$ and
$EC_{\varphi_{C_j}}(y)=EC_{\varphi_{C_j}}(z)$ hold, then in the conjunctive
rewriting obtained by the join, $x$, $y$ and $z$ will be mapped to the same
variable. Let $S_C$ denote the equivalence relation determined on the
variables of $Q$ by two variables being equivalent if they are mapped onto the
same variable by $\varphi_C$, that is, $xS_Cy\iff
EC_{\varphi_C}(x)=EC_{\varphi_C}(y)$. Let $\mathcal{C}$ be the set of MCDs
obtained as the output of \textsc{Form-MCDs}.
\begin{algN}{Combine-MCDs($\mathcal{C}$)}\inddef{combine-MCDs@\textsc{Combine-MCDs}}
1 \' \textit{Answer}$\la\emptyset$\\
2 \' \key{for} \= $\{C_1,\ldots ,C_n\}\subseteq\mathcal{C}$ such that
$G_{C_1},\ldots ,G_{C_n}$ is a partition of the subgoals of $Q$\\
3 \' \> \key{do} \= Define a mapping $\Psi_i$ on $\tilde{Y}_i$ as follows:\\
4 \' \> \> \key{if} \= there exists a variable $x$ in $Q$ such that $\varphi_i(x)=y$\\
5 \' \> \> \> \key{then} \= $\Psi_i(y)=x$\\
6 \' \> \> \> \key{else} \> $\Psi_i(y)$ is a fresh copy of $y$\\
7 \' \> Let $S$ be the transitive closure of $S_{C_1}\cup \cdots \cup S_{C_n}$\\
8 \' \> \` $\rhd$ $S$ is an equivalence relation of variables of $Q$.\\
9 \' \> Choose a representative for each equivalence class of $S$.\\
10 \' \> Define mapping $EC$ as follows:\\
11 \' \> \key{if} \= $x\in Var(Q)$\\
12 \' \> \> \key{then} \= $EC(x)$ is the representative of the equivalence
class of $x$ under $S$\\
13 \' \> \> \key{else} \> $EC(x)=x$\\
14 \' \> Let $Q'$ be given as $Q'(EC(\tilde{X}))\la V_{C_1}(EC(\Psi_1(\tilde{Y}_1))),
\dots ,V_{C_n}(EC(\Psi_n(\tilde{Y}_n)))$\\
15 \' \> \textit{Answer}$\la \textit{Answer}\cup\{Q'\}$\\
16 \' \key{return} \textit{Answer}
\end{algN}
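Lines 7--13 of \textsc{Combine-MCDs} compute the transitive closure of the relations $S_{C_1},\ldots ,S_{C_n}$ and choose class representatives. A union-find sketch of this step, with each $S_{C_i}$ given as lists of variable groups; a variable from the head of $Q$ is preferred as representative, matching the restriction placed on $EC_{\varphi_C}$:

```python
def merge_equivalences(groups, head_vars):
    """Merge variable groups into one equivalence relation (the
    transitive closure of their union) and return the map EC sending
    each variable to its class representative."""
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for group in groups:
        for v in group[1:]:
            parent[find(group[0])] = find(v)

    chosen = {}                      # root -> preferred representative
    for v in list(parent):
        r = find(v)
        if r not in chosen or (v in head_vars and chosen[r] not in head_vars):
            chosen[r] = v            # prefer a head variable of Q
    return {v: chosen[find(v)] for v in parent}

# Hypothetical data: x ~ y from one MCD, y ~ z from another; x is a
# head variable of Q, so it represents the merged class {x, y, z}.
print(merge_equivalences([["x", "y"], ["y", "z"]], {"x"}))
# {'x': 'x', 'y': 'x', 'z': 'x'}
```
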
The following theorem summarises the properties of the MiniCon Algorithm.
\begin{tetel}\label{tet:miniconjo}
Given a conjunctive query $Q$ and conjunctive views $\mathcal{V}$, both
without comparison predicates and constants, the MiniCon Algorithm produces
the union of conjunctive queries that is a maximally contained rewriting of
$Q$ using $\mathcal{V}$.
\end{tetel}
The complete proof of Theorem~\ref{tet:miniconjo} exceeds the scope of
the present chapter. However, in Problem~\ref{fel:miniconjo} the reader is
asked to prove that the union of the conjunctive queries obtained as output of
\textsc{Combine-MCDs} is contained in $Q$.

It must be noted that the running times of the Bucket Algorithm, the
Inverse-rules Algorithm and the MiniCon Algorithm are the same in the worst
case: $O(nmM^n)$, where $n$ is the number of subgoals in the query, $m$ is
the maximal number of subgoals in a view, and $M$ is the number of views.
However, practical test runs show that in the case of a large number of views
(hundreds of views) the MiniCon Algorithm is significantly faster than the
other two.
\begin{gyak}
\ujgyak\label{gy:atirNPbiz}
Prove Theorem~\ref{tet:atirNP} using Proposition~\ref{all:atir-ekviv} and
Theorem~\ref{tet:tabla-np-telj}.
\ujgyak\label{gy:l33biz}
Prove the two statements of Lemma~\ref{lem:lemma33}. \textit{Hint.} For the
first statement, substitute the definitions of the views
$v_i(\tilde{Y}_i)$ into $Q'$. Minimise the obtained query $Q''$ using
Theorem~\ref{tet:minimaltabla}. For the second statement use
Proposition~\ref{all:atir-ekviv} to prove that there exists a homomorphism $h_i$
from the body of the conjunctive query defining view $v_i(\tilde{Y}_i)$ to the
body of $Q$. Show that $\tilde{Y'}_i=h_i(\tilde{Y}_i)$ is a good choice.
\ujgyak\label{gy:evalvegesbiz}
Prove Theorem~\ref{tet:evalveges} using the fact that datalog programs
have a unique minimal fixpoint.
\end{gyak}
\begin{fld}
\ujfld{MiniCon is correct}{}\label{fel:miniconjo}
Prove that the output of the MiniCon Algorithm is correct. \textit{Hint.} It
is enough to show that for each conjunctive query $Q'$ given in line 14 of
\textsc{Combine-MCDs} $Q'\sqsubseteq Q$ holds. For the latter, construct a
homomorphism from $Q$ to $Q'$.
\ujfld{$(\mathcal{P}^-,\mathcal{V}^{-1})\hspace{-3pt}\downarrow$ is correct}{}
Prove that each tuple produced by logic program
$(\mathcal{P}^-,\mathcal{V}^{-1})\hspace{-3pt}\downarrow$ is contained in the
output of $\mathcal{P}$ (part of the proof of
Theorem~\ref{tet:p-v-1maxtart}). \textit{Hint.} Let $t$ be a tuple in the
output of $(\mathcal{P}^-,\mathcal{V}^{-1})$ that does not contain function
symbols. Consider the derivation tree of $t$. Its leaves are literals, since
they are extensional relations of program
$(\mathcal{P}^-,\mathcal{V}^{-1})$. If these leaves are removed from the tree,
then the leaves of the remaining tree are \textit{edb} relations of
$\mathcal{P}$. Prove that the tree obtained is the derivation tree of $t$ in
datalog program $\mathcal{P}$.
\ujfld{Datalog views}{}
This problem tries to justify why only conjunctive views were considered. Let
$\mathcal{V}$ be a set of views, $Q$ be a query. For a given instance
$\mathcal{I}$ of the views the tuple $t$ is a \ki{certain
answer}\felindex{certain answer} of query $Q$, if for any database instance
$\mathcal{D}$ such that $\mathcal{I}\subseteq\mathcal{V}(\mathcal{D})$, $t\in
Q(\mathcal{D})$ holds, as well.
\begin{description}
\item[\textit{a.}] Prove that if the views of $\mathcal{V}$ are given by
datalog programs, query $Q$ is conjunctive and may contain non-equality
($\ne$) predicates, then the question whether for a given instance
$\mathcal{I}$ of the views tuple $t$ is a certain answer of $Q$ is
algorithmically undecidable. \textit{Hint.} Reduce to this question the
\ki{Post Correspondence Problem,}\felindex{Post Correspondence Problem}
which is the following: given two lists of words $\{w_1,w_2,\ldots
,w_n\}$ and $\{w'_1,w'_2,\ldots ,w'_n\}$ over the alphabet $\{a,b\}$,
decide whether there exists a sequence of indices $i_1,i_2,\ldots ,i_k$
(repetitions allowed) such that
\begin{equation}\label{eq:postproblem}
w_{i_1}w_{i_2}\cdots w_{i_k}=w'_{i_1}w'_{i_2}\cdots w'_{i_k} \koz .
\end{equation}
The Post Correspondence Problem is a well-known algorithmically undecidable
problem. Let the view $V$ be given by the following datalog program:
\begin{equation}\label{eq:postnezet}
\begin{array}{lcl}
V(e,e)&\la&S(e,e,e)\\
V(X,Y)&\la&V(X_0,Y_0),S(X_0,X_1,\alpha_1),\ldots ,S(X_{g-1},X,\alpha_g),\\
& &S(Y_0,Y_1,\beta_1),\ldots ,S(Y_{h-1},Y,\beta_h)\\
& &\textrm{where\ } w_i=\alpha_1\ldots\alpha_g\textrm{\ and \ }w'_i=\beta_1
\ldots\beta_h\\
& &\textrm{is\ a\ rule\ for\ all\ }i\in\{1,2,\ldots ,n\}\\
S(X,Y,Z)&\la&P(X,X,Y),P(X,Y,Z) \koz .
\end{array}
\end{equation}
Furthermore, let $Q$ be the following conjunctive query.
\begin{equation}\label{eq:postlekerd}
Q(c)\la P(X,Y,Z),P(X,Y,Z'),Z\ne Z' \koz .
\end{equation}
Show that for the instance $\mathcal{I}$ of $V$ that is given by
$\mathcal{I}(V)=\{\langle e,e\rangle\}$ and $\mathcal{I}(S)=\{\}$, the tuple
$\langle c\rangle$ is a certain answer of query $Q$ if and only if the Post
Correspondence Problem with sets $\{w_1,w_2,\ldots
,w_n\}$ and $\{w'_1,w'_2,\ldots ,w'_n\}$ has \textit{no} solution.
\item[\textit{b.}] In contrast to the undecidability result of \textit{a.}, if
$\mathcal{V}$ is a set of conjunctive views and query $Q$ is given by
datalog program $\mathcal{P}$, then it is easy to decide about an arbitrary
tuple $t$ whether it is a certain answer of $Q$ for a given view instance
$\mathcal{I}$. Prove that the datalog program
$(\mathcal{P}^-,\mathcal{V}^{-1})^{\textit{datalog}}$ gives exactly the
tuples of the certain answer of $Q$ as output.
\end{description}
\end{fld}
\begin{fejmegj}
There are several dimensions along which the treatments of the problem
``answering queries using views'' can be classified. Figure~\ref{abr:taxonomy}
shows the taxonomy of the work.
\epskepnagy{figs/taxonomy.eps}{A taxonomy of work on answering queries using
views.\label{abr:taxonomy}}{width=13cm}
The most significant distinction between the different works is whether their
goal is data integration or whether it is query optimisation and maintenance
of physical data independence. The key difference between these two classes of
works is the output of the algorithm for answering queries using views. In
the former case, given a query $Q$ and a set of views $\mathcal{V}$, the goal
of the algorithm is to produce an expression $Q'$ that references the views
and is either equivalent to or contained in $Q$. In the latter case, the
algorithm must go further and produce a (hopefully optimal) query execution
plan for answering $Q$ using the views (and possibly the database
relations). Here the rewriting must be equivalent to $Q$ in order to ensure
the correctness of the plan.
The similarity between these two bodies of work is that they are concerned
with the core issue of whether a rewriting of a query is equivalent to or
contained in the query. However, while logical correctness suffices for the
data integration context, it does not in the query optimisation context where
we also need to find the \textit{cheapest} plan using the views. The
complication arises because the optimisation algorithms need to consider views
that do not contribute to the \textit{logical} correctness of the rewriting,
but do reduce the cost of the resulting plan. Hence, while the reasoning
underlying the algorithms in the data integration context is mostly logical,
in the query optimisation case it is both logical and cost-based. On the other
hand, an aspect stressed in data integration context is the importance of
dealing with a large number of views, which correspond to data sources. In the
context of query optimisation it is generally assumed (not always!) that the
number of views is roughly comparable to the size of the schema.
The works on query optimisation can be classified further into System-R style
optimisers and transformational optimisers. Examples of the former are works
of Chaudhuri,
Krishnamurty, Potomianos and Shim \cite{CKPS95};\nevindex{Shim,
Kyuseok}\nevindex{Krishnamurty, Ravi}\nevindex{Chaudhuri,
Surajit}\nevindex{Potomianos, Spyros} Tsatalos, Solomon, and
Ioannidis\nevindex{Tsatalos, Odysseas G.}\nevindex{Solomon, Marvin
H.}\nevindex{Ioannidis, Yannis E.} \cite{TSI96}. Papers of Florescu,
Raschid, and Valduriez\nevindex{Florescu, Daniela
D.}\nevindex{Raschid, Louiqa}\nevindex{Valduriez, Patrick}
\cite{FRV96}; Bello et al. \cite{BDD98};\nevindex{Bello, Randall G.}\nevindex{Dias,
Karl}\nevindex{Downing, Alan}\nevindex{Feenan, James J.}\nevindex{Finnerty,
James L.}\nevindex{Norcott, William D.}\nevindex{Sun, Harry}\nevindex{Witkowski,
Andrew}\nevindex{Ziauddin, Mohamed} Deutsch, Popa and Tannen\nevindex{Deutsch,
Alin}\nevindex{Popa, Lucian}\nevindex{Tannen, Val}
\cite{DPT99}, Zaharioudakis et al. \cite{ZCL+00},\nevindex{Zaharioudakis,
Markos}\nevindex{Cochrane, Roberta}\nevindex{Lapis, George}\nevindex{Pirahesh,
Hamid}\nevindex{Urata, Monica} furthermore Goldstein\nevindex{Goldstein,
Jonathan} and
Larson\nevindex{Larson, Per--\AA ke} \cite{GL01} belong to the latter.
Rewriting algorithms in the data integration context are studied in works of
Yang\nevindex{Yang, H. Z.} and
Larson \cite{YL87}; Levy, Mendelzon, Sagiv and Srivastava
\cite{LMSS95}\nevindex{Levy, Alon Y.}\nevindex{Mendelzon, Alberto
O.}\nevindex{Sagiv, Yehoshua}\nevindex{Srivastava, Divesh};
Qian\nevindex{Qian, Xiaolei}
\cite{Qia96}; furthermore Lambrecht, Kambhampati and
Gnanaprakasam\nevindex{Lambrecht, Eric}\nevindex{Kambhampati,
Subbarao}\nevindex{Gnanaprakasam, Senthil} \cite{LKG99}. The Bucket
Algorithm was introduced by Levy, Rajaraman and Ordille \nevindex{Rajaraman,
Anand}\nevindex{Ordille, Joann J.}\cite{LRO96b}. The Inverse-rules Algorithm
was invented by Duschka and Genesereth
\cite{DG97a,DG97b}\nevindex{Duschka, Oliver M.}\nevindex{Genesereth, Michael
R.}. The MiniCon Algorithm was developed by Pottinger\nevindex{Pottinger,
Rachel} and
Halevy\nevindex{Halevy, Alon Y.}
\cite{PL00,PH01}.
Query answering algorithms and the complexity of the problem are studied in
papers of Abiteboul\nevindex{Abiteboul, Serge} and Duschka \cite{AD98};
Grahne\nevindex{Grahne, Gösta} and Mendelzon
\cite{GM99a}; furthermore Calvanese, De Giacomo, Lenzerini and
Vardi\nevindex{Calvanese, Diego}\nevindex{De Giacomo, Giuseppe}
\nevindex{Lenzerini, Maurizio}\nevindex{Vardi, Moshe Y.}
\cite{CGLV00a}.
The STORED system was developed by Deutsch, Fernandez\nevindex{Fernandez,
Mary} and
Suciu\nevindex{Suciu, Dan}
\cite{DFS99}. Semantic caching is discussed in the paper of Yang, Karlapalem
and Li
\cite{YKL97}\nevindex{Yang, Jian}\nevindex{Karlapalem, Kamalakar}\nevindex{Li,
Qing}. Extensions of the rewriting problem are studied in \cite{Bun97,Flo96,
FW97,KW96,YKL97}.\nevindex{Buneman, Peter}\nevindex{Friedman,
Marc}\nevindex{Weld, Daniel S.}\nevindex{Kwok, Cody T.}
Surveys of the area can be found in works of Abiteboul \cite{Abi97}, Florescu,
Levy and
Mendelzon \cite{FLM98}, Halevy \cite{Lev00,Hal01}, furthermore
Ullman \cite{Ull97}\nevindex{Ullman, Jeffrey David}.
Research of the authors was (partially) supported by Hungarian National
Research Fund (OTKA) grants Nos. T034702, T037846 and T042706.
\end{fejmegj}
\index{query rewriting|)}