\chapter[Systolic Systems]{Systolic Systems}
\index{systolic!system|(}\index{systolic array|(}\nevindex{Zehendner, Eberhard}
Systolic arrays probably constitute a perfect kind of special-purpose computer.
In their simplest appearance, they may provide only one operation, that is repeated over
and over again. Yet systolic arrays show an abundance of practice-oriented applications,
mainly in fields dominated by iterative procedures: numerical mathematics,
combinatorial optimisation, linear algebra, algorithmic graph theory, image and
signal processing, speech and text processing, et cetera.
The reason is that a systolic array can be tailored so accurately to the structure of its one and
only algorithm that the time and place of each executed operation are fixed once and for all,
and communicating cells are permanently and directly connected, with no switching required.
The algorithm has in fact become hardwired. \index{systolic!algorithm|(}Systolic algorithms
in this respect are considered to be \inddef{hardware algorithm}\ki{hardware algorithms.}
Please note that the term \emph{systolic algorithms} usually does not refer
to a set of concrete algorithms for solving a single specific computational problem,
as for instance \emph{sorting}; this is quite in contrast to terms like \emph{sorting algorithms}.
Rather, systolic algorithms constitute a special style of specification, programming, and computation.
So algorithms from many different areas of application can be \index{systolic}\emph{systolic} in style,
though probably not every well-known algorithm from such an area is suited to systolic computation.
Hence, this chapter does not intend to present \emph{all} systolic algorithms,
nor will it introduce even the most important systolic algorithms from any field of application.
Instead, with a few simple but typical examples, we try to lay the foundations for the reader's
general understanding of systolic algorithms.
The rest of this chapter is organised as follows:
Section \ref{sysex1} shows some basic concepts of systolic systems by means of an introductory example.
Section \ref{sysex2} explains how systolic arrays formally emerge from space-time transformations.
Section \ref{sysex3} deals with input/output schemes.
Section \ref{sysex4} is devoted to all aspects of control in systolic arrays.
In Section \ref{sysexs5} we study the class of linear systolic arrays, raising further questions.
\section{Basic concepts of systolic systems}\label{sysex1}
\vspace{-1mm}
\index{pipelining|(}\index{parallelism!massive|(}
The designation \inddef{systolic}\ki{systolic} follows from the operational principle of the systolic architecture. The systolic style is characterised by an intensive application of both \emph{pipelining} and \emph{parallelism}, controlled by a global and completely synchronous \index{clock signal}clock. \inddef{data stream}\ki{Data streams} pulsate rhythmically through the communication network, as blood is driven from the heart through the veins of the body. Here, pipelining is not constrained to a single space axis but concerns \emph{all data streams}, possibly \emph{moving in different directions} and \emph{intersecting} in the cells of the systolic array.
\index{pipelining|)}
A \inddef{systolic!system}\ki{systolic system} typically consists of a \index{host computer}\emph{host computer}, and the actual \emph{systolic array}. Conceptually, the host computer is of minor importance, just controlling the operation of the systolic array and supplying the data. The \inddef{systolic array}\ki{systolic array} can be understood as a specialised network of cells rapidly performing data-intensive computations, supported by \emph{massive parallelism}. A \inddef{systolic!algorithm}\ki{systolic algorithm} is the program collaboratively executed by the cells of a systolic array.
\index{parallelism!massive|)}\index{systolic!algorithm|)}
Systolic arrays may appear very differently, but usually share a couple of key features:
discrete time scheme, synchronous operation, regular (frequently two-dimensional) geometric layout,
communication limited to directly neighbouring cells, and spartan control mechanisms.
In this section, we explain fundamental phenomena in the context of systolic arrays, driven by a running example. A computational problem usually allows several solutions, each implemented by a specific systolic array. Among these, the most attractive designs (in whatever respect) may be very complex. Note, however, that in this educational text we are less interested in advanced solutions, but strive to present important concepts compactly and intuitively.
\vspace{2mm}
\subsection{An introductory example: matrix product}\index{matrix product}
\vspace{-1mm}
Figure \ref{exonea} shows a \inddef{systolic array!rectangular}\ki{rectangular} systolic
array consisting of 15 \inddef{cell}\ki{cells} for multiplying a $3 \times N$ matrix
$A$ by an $N \times 5$ matrix $B$. The \emph{parameter} $N$ is not reflected
in the \index{systolic array!structure of}\emph{structure} of this particular
systolic array, but in the \index{input/output scheme}\emph{input scheme} and
the \index{running time!of systolic algorithm}\emph{running time} of the algorithm.
The input scheme depicted is based on the special choice of parameter $N=4$.
Therefore, Figure \ref{exonea} gives a solution to the following problem instance:
\begin{equation*}
A \cdot B = C \koz ,
\end{equation*}
%
\vspace{-2mm}
where
%
\vspace{-2mm}
\begin{align*}\label{exmmult1}\inddef{matrix product}
A & = \left(\begin{array}{llll}
a_{11} & a_{12} & a_{13} & a_{14} \\
a_{21} & a_{22} & a_{23} & a_{24} \\
a_{31} & a_{32} & a_{33} & a_{34}
\end{array}\right) \koz , \\[1.5ex]
B & = \left(\begin{array}{lllll}
b_{11} & b_{12} & b_{13} & b_{14} & b_{15} \\
b_{21} & b_{22} & b_{23} & b_{24} & b_{25} \\
b_{31} & b_{32} & b_{33} & b_{34} & b_{35} \\
b_{41} & b_{42} & b_{43} & b_{44} & b_{45}
%%%\end{array}\right) \koz , \\[1.5ex]
\end{array}\right) \koz ,
\vspace{-2mm}
\end{align*}
\vspace{-4mm}
\begin{align*}
\vspace{-4mm}
\hspace{12mm} C & = \left(\begin{array}{lllll}
c_{11} & c_{12} & c_{13} & c_{14} & c_{15} \\
c_{21} & c_{22} & c_{23} & c_{24} & c_{25} \\
c_{31} & c_{32} & c_{33} & c_{34} & c_{35}
\end{array}\right) \koz ,
\intertext{and}
c_{ij} & = \sum\limits_{k=1}^4 a_{ik} \cdot b_{kj} \qquad (1 \le i \le 3, 1 \le j \le 5) \koz .
\end{align*}
The cells of the systolic array can exchange data through \inddef{link}\ki{links,} drawn as
arrows between the cells in Figure \ref{exonea}(a). \inddef{cell!boundary}\ki{Boundary cells}
of the systolic array can also communicate with the \index{outside world}\emph{outside world}.
All cells of the systolic array share a common
\inddef{connection pattern}\ki{connection pattern} for communicating with their environment.
The completely regular \inddef{systolic array!structure of}\ki{structure} of the systolic array
(placement and connection pattern of the cells) induces \index{data flow!regular}\emph{regular data flows}
along all connecting directions.
Figure \ref{exonea}(b) shows the \inddef{cell!structure of}\ki{internal structure} of a cell. We find a \emph{multiplier}, an \emph{adder}, three \emph{registers}, and four \emph{ports}, plus some wiring between these units. Each \inddef{port}\ki{port} represents an interface to some external link that is attached to the cell. All our cells are of the same structure.
Each of the registers \textit{A}, \textit{B}, \textit{C} can store a single data item.
The designations of the registers are suggestive here, but arbitrary in principle.
Registers \textit{A} and \textit{B} get their values from \inddef{port!input}\ki{input ports,} shown in
Figure \ref{exonea}(b) as small circles on the left and upper borders of the cell, respectively.
The current values of registers \textit{A} and \textit{B} are used as operands of the multiplier and,
at the same time, are passed through \inddef{port!output}\ki{output ports} of the cell,
see the circles on the right and lower borders, respectively. The result of the multiplication is
supplied to the adder, with the second operand originating from register \textit{C}.
The result of the addition eventually overwrites the past value of register \textit{C}.
\epskepnagyfrag{figs/exonea}
{\label{exonea}\abraindex{systolic array!rectangular}
Rectangular systolic array for \abraindex{matrix product}matrix product.
\textbf{(a)} \abraindex{systolic array!structure of}Array structure
and \abraindex{input/output scheme}input scheme.
\textbf{(b)} \abraindex{cell!structure of}Cell structure.}{10.6cm}
\subsection{Problem parameters and array parameters}\index{problem parameter|(}\index{systolic array!parameter of|(}
The 15 cells of the systolic array are organised as a rectangular pattern of three rows by five columns, exactly
as with matrix $C$. Also, these dimensions directly correspond to the number of rows of matrix $A$ and
the number of columns of matrix $B$. The \inddef{systolic array!size of}\ki{size of the systolic array,}
therefore, \emph{corresponds} to the \emph{size of some data structures} for the problem to solve.
If we had to multiply an $N_1 \times N_3$ matrix $A$ by an $N_3 \times N_2$ matrix $B$ in the general
case, then we would need a systolic array with $N_1$ rows and $N_2$ columns.
The quantities $N_1,N_2,N_3$ are parameters of the problem to solve,
because the number of operations to perform depends on each of them; they are thus
\inddef{problem parameter}\ki{problem parameters.} The size of the systolic array,
in contrast, depends on the quantities $N_1$ and $N_2$, only. For this reason,
$N_1$ and $N_2$ become also \inddef{systolic array!parameter of}\ki{array parameters,}
for this particular systolic array, whereas $N_3$ is \emph{not} an array parameter.
\textit{Remark.} For matrix product, we will see another systolic array in Section \ref{sysex2},
with dimensions dependent on all three problem parameters $N_1,N_2,N_3$.
An $N_1 \times N_2$ systolic array as shown in Figure \ref{exonea} would also permit multiplying an $M_1 \times M_3$ matrix $A$ by an $M_3 \times M_2$ matrix $B$, where $M_1 \le N_1$ and $M_2 \le N_2$. This is important if we intend to use the same systolic array for the multiplication of matrices of varying dimensions. Then we would operate only on a properly dimensioned rectangular subarray, consisting of $M_1$ rows and $M_2$ columns and located, for instance, in the upper left corner of the complete array. The remaining cells would also work, but without any contribution to the solution of the whole problem; they should do no harm, of course.
\index{problem parameter|)}\index{systolic array!parameter of|)}
\subsection{Space coordinates}\label{Raumkoordinaten1}\index{space coordinates|(}
Now let's assume that we want to assign unique \inddef{space coordinates}\ki{space coordinates}
to each cell of a systolic array, for characterising the geometric position of the cell relative
to the whole array. In a \emph{rectangular} systolic array, we simply can use the
respective row and column numbers, for instance.
The cell marked with $c_{11}$ in Figure \ref{exonea} thus would get the coordinates (1,1), the cell marked
with $c_{12}$ would get the coordinates (1,2), cell $c_{21}$ would get (2,1), and so on.
For the remainder of this section, we take space coordinates constructed in such a way for granted.
In principle it does not matter where the coordinate origin lies, where the axes are pointing to, which direction in space corresponds to the first coordinate, and which to the second. In the system presented above, the order of the coordinates has been chosen corresponding to the designation of the matrix components. Thus, the first coordinate stands for the rows, numbered top to bottom from position 1; the second coordinate stands for the columns, numbered left to right, also from position 1.
Of course, we could have made a completely different choice for the coordinate system. But the presented system perfectly matches our particular systolic array: the indices of a matrix element $c_{ij}$ computed in a cell agree with the coordinates of this cell. The entered rows of the matrix $A$ carry the same number as the first coordinate of the cells they pass; correspondingly for the second coordinate, concerning the columns of the matrix $B$. All links (and thus all passing data flows) are in parallel to some axis, and towards ascending coordinates.
It is not always so clear how expressive space coordinates can be determined; we refer to the systolic array from Figure \ref{extwoa}(a) as an example. But however the coordinate system is chosen, it is important that the regular structure of the systolic array is obviously reflected in the coordinates of the cells. Therefore, almost always integral coordinates are used. Moreover, the coordinates of cells with minimum Euclidean distance should differ in one component only, and then by 1.
\index{space coordinates|)}
\subsection{Serialising generic operators}
\index{serialisation|(}\index{generic operator|(}
Each active cell $(i,j)$ from Figure \ref{exonea} computes exactly the element $c_{ij}$ of the result matrix $C$.
Therefore, the cell must evaluate the \inddef{dot product}\ki{dot product}
\begin{equation*}
\sum_{k=1}^4 a_{ik} \cdot b_{kj} \koz .
\end{equation*}
This is done iteratively: in each step, a product $a_{ik} \cdot b_{kj}$ is calculated and
added to the current partial sum for $c_{ij}$. Obviously, the partial sum has to be
\index{clear}\emph{cleared}---or set to another initial value, if required---before starting
the accumulation. Inspired by the classical notation of imperative programming languages,
the general proceeding could be specified in pseudocode as follows:
\begin{alg}{Matrix-Product$(N_1,N_2,N_3)$}\inddef{\textsc{Matrix-Product}}
1 \' \key{for} \= $i \leftarrow 1$ \key{to} $N_1$ \\
2 \' \> \key{do} \key{for} \= $j \leftarrow 1$ \key{to} $N_2$ \\
3 \' \> \> \key{do} \= $c(i,j) \leftarrow 0$ \\
4 \' \> \> \> \key{for} \= $k \leftarrow 1$ \key{to} $N_3$ \\
5 \' \> \> \> \> \key{do} $c(i,j) \leftarrow c(i,j) + a(i,k) \cdot b(k,j)$ \\
6 \' \key{return} $C$
\end{alg}
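The pseudocode translates directly into an executable program. The following Python sketch (the function name and the list-of-lists matrix representation are our own choices for illustration) mirrors the three nested loops, with the usual shift from the 1-based indices of the pseudocode to Python's 0-based indexing:

```python
def matrix_product(A, B):
    """Sequential matrix product, following algorithm Matrix-Product.

    A is an N1 x N3 matrix, B an N3 x N2 matrix, both given as lists
    of lists; the result C is an N1 x N2 matrix. Indices are 0-based
    here, whereas the pseudocode uses 1-based indices.
    """
    N1, N3, N2 = len(A), len(B), len(B[0])
    C = [[0] * N2 for _ in range(N1)]
    for i in range(N1):
        for j in range(N2):
            C[i][j] = 0                       # clear the accumulator
            for k in range(N3):               # serialised sum over k
                C[i][j] += A[i][k] * B[k][j]
    return C
```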
If $N_1=N_2=N_3=N$, we have to perform $N^3$ multiplications, additions, and assignments, each. Hence the \index{running time!of systolic algorithm}\emph{running time} of this algorithm is of order $\Theta(N^3)$ for any sequential processor.
The sum operator $\sum$ is one of the so-called \inddef{generic operator}\ki{generic operators,} that combine an arbitrary number of operands. In the systolic array from Figure \ref{exonea}, all additions contributing to a particular sum are performed in the same cell. However, there are plenty of examples where the individual operations of a generic operator are spread over several cells---see, for instance, the systolic array from Figure \ref{extwoa}.
\textit{Remark.} Further examples of generic operators are: product, minimum, maximum, as well as the Boolean operators \textsc{and}, \textsc{or}, and \textsc{exclusive or}.
Thus, generic operators usually have to be \inddef{serialisation}\ki{serialised} before the calculations to perform can be assigned to the cells of the systolic array. Since the distribution of the individual operations to the cells is not unique, generic operators generally must be treated differently from simple operators of fixed arity, such as the dyadic addition.
\index{serialisation|)}\index{generic operator|)}
\subsection{Assignment-free notation}\index{assignment-free notation|(}
Instead of using an imperative style as in algorithm \textsc{Matrix-Product},
we prefer to describe systolic programs by an \inddef{assignment-free notation}\ki{assignment-free notation}
which is based on an \index{equational calculus}\emph{equational calculus}. Thus we
avoid \index{side effect}\emph{side effects} and are able to
\index{parallelism!directly express}\emph{directly express parallelism}. For instance, we may be bothered
by the reuse of the program variable $c(i,j)$ from algorithm \textsc{Matrix-Product}.
So, we replace $c(i,j)$ with a sequence of \inddef{instance}\ki{instances} $c(i,j,k)$
that stand for the successive states of $c(i,j)$. This approach yields a so-called
\inddef{recurrence equation}\ki{recurrence equation.} We are now able to state the general matrix
product from algorithm \textsc{Matrix-Product} by the following assignment-free expressions:
{\small
\begin{equation}\label{exmmult2}
\begin{array}{@{}ll@{}}
\multicolumn{2}{@{}l@{}}{\textit{input operations}} \\[1ex]
c(i,j,0) = 0 & 1 \le i \le N_1, 1 \le j \le N_2 \koz . \\[3ex]
\multicolumn{2}{@{}l@{}}{\textit{calculations}} \\[1ex]
c(i,j,k) = c(i,j,k-1) + a(i,k) \cdot b(k,j) & 1 \le i \le N_1, 1 \le j \le N_2, 1 \le k \le N_3 \koz . \\[3ex]
\multicolumn{2}{@{}l@{}}{\textit{output operations}} \\[1ex]
c_{ij} = c(i,j,N_3) & 1 \le i \le N_1, 1 \le j \le N_2 \koz .
\end{array}\normalsize
\end{equation}}
System (\ref{exmmult2}) explicitly describes the fine structure of the
executed \index{systolic!algorithm}systolic algorithm. The first equation specifies
all \inddef{data!input}\ki{input data,} the third equation all \inddef{data!output}\ki{output data.}
The systolic array implements these equations
by \inddef{operation!input/output}\ki{input/output operations.} Only the second equation
corresponds to real calculations.
Each equation of the system is accompanied, on the right side, by a \inddef{quantification}\ki{quantification.}
The quantification states the set of values the \inddef{iteration!variable}\ki{iteration variables}
$i$ and $j$ (and, for the second equation, also $k$) should take. Such a set is called
a \inddef{domain}\ki{domain.} The iteration variables $i,j,k$ of the second
equation can be combined in an \inddef{iteration!vector}\ki{iteration vector}
$(i,j,k)$. For the input/output equations, the iteration vector would consist of the components
$i$ and $j$, only. To get a closed representation, we augment this vector by a third component
$k$, that takes a fixed value. Inputs then are characterised by $k=0$, outputs by $k=N_3$.
Overall we get the following system:
{\small
\begin{equation}\label{exmmult2a}
\begin{array}{@{}ll@{}}
\multicolumn{2}{@{}l@{}}{\textit{input operations}} \\[1ex]
c(i,j,k) = 0 & 1 \le i \le N_1, 1 \le j \le N_2, k = 0 \koz . \\[3ex]
\multicolumn{2}{@{}l@{}}{\textit{calculations}} \\[1ex]
c(i,j,k) = c(i,j,k-1) + a(i,k)\cdot b(k,j) & 1 \le i \le N_1, 1 \le j \le N_2, 1 \le k \le N_3 \koz . \\[3ex]
\multicolumn{2}{@{}l@{}}{\textit{output operations}} \\[1ex]
c_{ij} = c(i,j,k) & 1 \le i \le N_1, 1 \le j \le N_2, k = N_3 \koz .
\end{array}\normalsize
\end{equation}}
Note that although the domains for the input/output equations now are formally also of dimension 3,
as a matter of fact they are only two-dimensional in the classical geometric sense.
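Such a system of recurrence equations can be evaluated mechanically, one instance per point of the respective domain. A small Python sketch (all names are our own) makes the single-assignment character of system (\ref{exmmult2a}) explicit: every instance $c(i,j,k)$ is written exactly once.

```python
def evaluate_recurrence(A, B):
    """Evaluate system (exmmult2a): each instance c[(i, j, k)] is a
    separate, write-once value; indices are 1-based, matching the
    quantifications of the system."""
    N1, N3, N2 = len(A), len(B), len(B[0])
    c = {}
    # input operations: k = 0
    for i in range(1, N1 + 1):
        for j in range(1, N2 + 1):
            c[(i, j, 0)] = 0
    # calculations: 1 <= k <= N3
    for k in range(1, N3 + 1):
        for i in range(1, N1 + 1):
            for j in range(1, N2 + 1):
                c[(i, j, k)] = c[(i, j, k - 1)] + A[i - 1][k - 1] * B[k - 1][j - 1]
    # output operations: k = N3
    return [[c[(i, j, N3)] for j in range(1, N2 + 1)] for i in range(1, N1 + 1)]
```

Note that the dictionary holds all instances side by side; nothing is overwritten, which is precisely what makes the notation free of side effects.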
\index{assignment-free notation|)}
\subsection{Elementary operations}\index{operation!elementary|(}
From equations as in system (\ref{exmmult2a}), we can directly infer the atomic operations to be performed
in the cells of the systolic array. We find these operations by instantiating each equation
of the system with all points of the respective domain. If an equation contains several suboperations
corresponding to one point of the domain, these are seen as a
\inddef{compound operation}\ki{compound operation,} and are always processed together
by the same cell in one working cycle.
In the second equation of system (\ref{exmmult2a}), for instance, we find the multiplication
$a(i,k) \cdot b(k,j)$ and the successive addition $c(i,j,k)=c(i,j,k-1)+\cdots$.
The corresponding \inddef{operation!elementary}\ki{elementary operations}---multiplication
and addition---are indeed executed together as a \index{multiply-add}\emph{multiply-add}
compound operation by the cell of the systolic
array shown in Figure \ref{exonea}(b).
Now we can assign to each elementary operation a designation, also called its \emph{coordinates}.
A straightforward method to define suitable coordinates is provided by the
\index{iteration!vector}iteration vectors $(i,j,k)$ used in the quantifications.
Applying this concept to system (\ref{exmmult2}), we can for instance assign
the tuple of coordinates $(i,j,k)$ to the calculation $c(i,j,k) = c(i,j,k-1) + a(i,k) \cdot b(k,j)$.
The same tuple $(i,j,k)$ is assigned to the input operation $c(i,j,k) = 0$,
but with setting $k=0$. By the way: all domains are disjoint in this example.
If we always use the iteration vectors as designations for the calculations
and the input/output operations, there is no further need to distinguish between coordinates
and iteration vectors. Note, however, that this decision also mandates
that all operations belonging to a certain point of the domain together
constitute a \index{compound operation}compound operation---even when they appear
in different equations and possibly are not related. For simplicity, we always use
the iteration vectors as coordinates in the sequel.
\index{operation!elementary|)}
\subsection{Discrete timesteps}\index{systolic!timestep|(}
The various elementary operations always happen in
\inddef{systolic!timestep}\inddef{timestep!discrete}\ki{discrete timesteps} in the systolic cells.
All these timesteps driving a systolic array are of equal duration.
Moreover, all cells of a systolic array work completely \inddef{synchronous}\ki{synchronously,}
i.e., they all start and finish their respective communication and calculation steps
at the same time. Successive timesteps controlling a cell seamlessly follow each other.
\textit{Remark.} But haven't we learned from Albert Einstein\nevindex{Einstein, Albert (1879--1955)}
that strict \index{simultaneity}simultaneity is physically impossible?
Indeed, all we need here are cells that operate almost simultaneously. Technically this is guaranteed
by providing to all systolic cells a common \inddef{clock signal}\ki{clock signal}
that switches all registers of the array. Within the bounds of the usually achievable accuracy,
the communication between the cells happens sufficiently synchronised, and thus no loss of data occurs concerning send and receive operations. Therefore, it should be justified to assume a conceptional simultaneity for theoretical reasoning.
Now we can slice the physical time into units of a timestep, and number the timesteps consecutively. The origin on the time axis can be arbitrarily chosen, since time is running synchronously for all cells. A reasonable decision would be to take $t=0$ as the time of the first input in any cell. Under this regime, the elementary compound operation of system (\ref{exmmult2}) designated by $(i,j,k)$ would be executed at time $i+j+k-3$. On the other hand, it would be equally justified to assign the time $i+j+k$ to the coordinates $(i,j,k)$; because this change would only induce a global time shift by three time units.
So let us assume for the following that the execution of an instance $(i,j,k)$ starts at time $i+j+k$. The first calculation in our example then happens at time $t=3$, the last at time $t=N_1+N_2+N_3$. The \index{running time!of systolic algorithm}\emph{running time} thus amounts to $N_1+N_2+N_3-2$ timesteps.
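The claimed running time can be checked by enumerating the calculation domain under the time function $t(i,j,k)=i+j+k$. A small Python sketch (the function name is our own choice):

```python
def schedule_span(N1, N2, N3):
    """Return (first, last, running_time) for the time function
    t(i, j, k) = i + j + k over the calculation domain
    1 <= i <= N1, 1 <= j <= N2, 1 <= k <= N3."""
    times = [i + j + k
             for i in range(1, N1 + 1)
             for j in range(1, N2 + 1)
             for k in range(1, N3 + 1)]
    first, last = min(times), max(times)
    return first, last, last - first + 1
```

For the introductory example with $N_1=3$, $N_2=5$, $N_3=4$, this yields the first calculation at $t=3$, the last at $t=12$, and a running time of $10 = N_1+N_2+N_3-2$ timesteps.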
\index{systolic!timestep|)}
\subsection{External and internal communication}\index{communication!internal}\index{communication!external}
Normally, the data needed for calculation by the systolic array initially are not yet located inside
the cells of the array. Rather, they must be infused into the array from
the \inddef{outside world}\ki{outside world.} The outside world in this case is
a \inddef{host computer}\ki{host computer,} usually
a \index{scalar control processor}\emph{scalar control processor} accessing a
central \index{data storage}\emph{data storage}. The control processor, at the right time,
fetches the necessary data from the storage, passes them to the systolic array in a suitable way,
and eventually writes back the calculated results into the storage.
Each cell $(i,j)$ must access the operands $a_{ik}$ and $b_{kj}$ during the timestep concerning index value $k$. But only the cells of the leftmost column of the systolic array from Figure \ref{exonea} get the items of the matrix $A$ directly as \index{data!input}\emph{input data} from the outside world. All other cells must be provided with the required values $a_{ik}$ from a neighbouring cell. This is done via the horizontal \index{link}\emph{links} between neighbouring cells, see Figure \ref{exonea}(a). The item $a_{ik}$ successively passes the cells $(i,1)$, $(i,2)$, \ldots, $(i,N_2)$. Correspondingly, the value $b_{kj}$ enters the array at cell $(1,j)$, and then flows through the vertical links, reaching the cells $(2,j)$, $(3,j)$, \ldots up to cell $(N_1,j)$. An arrowhead in the figure shows in which \index{link!directed}\emph{direction} the link is oriented.
Frequently, it is considered problematic to transmit a value over large distances within a single \index{timestep}\emph{timestep}, in a distributed or parallel architecture. Now suppose that, in our example, cell $(i,j)$ got the value $a_{ik}$ during timestep $t$ from cell $(i,j-1)$, or from the outside world. For the reasons described above, $a_{ik}$ is not passed from cell $(i,j)$ to cell $(i,j+1)$ in the same timestep $t$, but one timestep later, i.e., at time $t+1$. This also holds for the values $b_{kj}$. The \inddef{delay}\ki{delay} is visualised in the detail drawing of the \index{cell}cell from Figure \ref{exonea}(b): input data flowing through a cell always pass one register, and each passed register induces a delay of exactly one timestep.
\textit{Remark.} For \index{systolic!architecture}systolic architectures, it is mandatory that any path between two cells contains at least one register---even when forwarding data to a neighbouring cell, only. All registers in the cells are synchronously switched by the global \index{clock signal}clock signal of the systolic array. This results in the characteristic rhythmical traffic on all links of the systolic array. Because of the analogy with pulsating veins, the medical term \emph{systole} has been reused for the name of the concept.
To elucidate the delayed forwarding of values, we augment system (\ref{exmmult2}) with further equations. Repeatedly \emph{used} values like $a_{ik}$ are represented by separate \index{instance}\emph{instances}, one for each access. The result of this proceeding---that is very characteristic for the design of \index{systolic!algorithm}systolic algorithms---is shown as system (\ref{exmmult4}).
{\small
\begin{equation}\label{exmmult4}
\begin{array}{@{}ll@{}}
\multicolumn{2}{@{}l@{}}{\textit{input operations}} \\[1ex]
a(i,j,k) = a_{ik} & 1 \le i \le N_1, j = 0, 1 \le k \le N_3 \koz , \\[1ex]
b(i,j,k) = b_{kj} & i = 0, 1 \le j \le N_2, 1 \le k \le N_3 \koz , \\[1ex]
c(i,j,k) = 0 & 1 \le i \le N_1, 1 \le j \le N_2, k = 0 \koz . \\[3ex]
\multicolumn{2}{@{}l@{}}{\textit{calculations and forwarding}} \\[1ex]
a(i,j,k) = a(i,j-1,k) & 1 \le i \le N_1, 1 \le j \le N_2, 1 \le k \le N_3 \koz , \\[1ex]
b(i,j,k) = b(i-1,j,k) & 1 \le i \le N_1, 1 \le j \le N_2, 1 \le k \le N_3 \koz , \\[1ex]
\begin{array}{@{}l@{}}
c(i,j,k) = c(i,j,k-1) \\[0.5ex]
\quad {} + a(i,j-1,k) \cdot b(i-1,j,k) \\
\end{array} & 1 \le i \le N_1, 1 \le j \le N_2, 1 \le k \le N_3 \koz . \\[4.5ex]
\multicolumn{2}{@{}l@{}}{\textit{output operations}} \\[1ex]
c_{ij} = c(i,j,k) & 1 \le i \le N_1, 1 \le j \le N_2, k = N_3 \koz .
\end{array}\normalsize
\end{equation}}
Each of the partial sums $c(i,j,k)$ in the progressive evaluation of $c_{ij}$ is calculated in a certain
timestep, and then used only once, namely in the next timestep. Therefore, cell $(i,j)$ must provide a
register (named \textit{C} in Figure \ref{exonea}(b)) where the value of $c(i,j,k)$ can be stored for
one timestep. Once the old value is no longer needed, the register holding $c(i,j,k)$ can be
overwritten with the new value $c(i,j,k+1)$. When eventually the dot product is completed,
the register contains the value $c(i,j,N_3)$, that is the final result $c_{ij}$. Before performing
any computation, the register has to be \inddef{clear}\ki{cleared,} i.e., preloaded with a
zero value---or any other desired value.
In contrast, there is no need to store the values $a_{ik}$ and $b_{kj}$ permanently in cell $(i,j)$.
As we can learn from Figure \ref{exonea}(a), each row of the matrix $A$ is delayed by one
timestep with respect to the preceding row. And so are the columns of the matrix $B$.
Thus the values $a(i,j-1,k)$ and $b(i-1,j,k)$ arrive at cell $(i,j)$ exactly when the calculation
of $c(i,j,k)$ is due. They are put into the registers \textit{A} and \textit{B}, respectively, then
immediately fetched from there for the multiplication, and in the same cycle forwarded to the neighbouring cells.
The values $a_{ik}$ and $b_{kj}$ are of no further use for cell $(i,j)$ after they have been
multiplied, and need not be stored there any longer. So \textit{A} and \textit{B} are overwritten
with new values during the next timestep.
It should be obvious from this exposition that we urgently need to make economic use of the memory
contained in a cell. Any calculation and any communication must be coordinated in space and time
in such a way that storing of values is limited to the shortest-possible time interval.
This goal can be achieved by immediately using and forwarding the received values. Besides the overall
structure of the systolic array, choosing an
appropriate\index{input/output scheme}\emph{input/output scheme} and placing the corresponding number
of \index{delay}\emph{delays} in the cells essentially facilitates the desired coordination.
Figure \ref{exonea}(b) in this respect shows the smallest possible delay by one timestep.
Geometrically, the input scheme of the example resulted from \index{skew}\emph{skewing} the matrices $A$ and $B$.
Thereby some places in the \inddef{input stream}\ki{input streams} for matrix $A$ became vacant and
had to be filled with zero values; otherwise, the calculation of the $c_{ij}$ would have been garbled.
The \index{input stream!length of}lengths of the input streams depend
on the \index{problem parameter}\emph{problem parameter} $N_3$.
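The interplay of skewed, zero-padded input streams, one-timestep register delays, and stationary accumulation can be confirmed with a cycle-accurate simulation. The following Python sketch is our own reconstruction of the array's behaviour; the naming and the zero-padding convention for vacant stream positions are assumptions, not part of the original design description:

```python
def systolic_matrix_product(A, B):
    """Cycle-accurate sketch of the rectangular systolic array:
    each cell (i, j) holds registers a, b, c; every link carries one
    register delay; input streams are skewed and zero-padded."""
    N1, N3, N2 = len(A), len(B), len(B[0])
    # register files, 1-based cell indexing; index-0 rows/columns unused
    ra = [[0] * (N2 + 1) for _ in range(N1 + 1)]
    rb = [[0] * (N2 + 1) for _ in range(N1 + 1)]
    rc = [[0] * (N2 + 1) for _ in range(N1 + 1)]
    for t in range(1, N1 + N2 + N3 + 1):
        new_a = [row[:] for row in ra]       # next state of the A registers
        new_b = [row[:] for row in rb]       # next state of the B registers
        for i in range(1, N1 + 1):
            for j in range(1, N2 + 1):
                if j == 1:                   # boundary cell: host input,
                    k = t - i - 1            # skewed stream for row i
                    a_in = A[i - 1][k - 1] if 1 <= k <= N3 else 0
                else:                        # else value from left neighbour
                    a_in = ra[i][j - 1]
                if i == 1:                   # skewed stream for column j
                    k = t - j - 1
                    b_in = B[k - 1][j - 1] if 1 <= k <= N3 else 0
                else:                        # value from upper neighbour
                    b_in = rb[i - 1][j]
                new_a[i][j] = a_in           # latch for forwarding next step
                new_b[i][j] = b_in
                rc[i][j] += a_in * b_in      # stationary accumulation
        ra, rb = new_a, new_b
    return [[rc[i][j] for j in range(1, N2 + 1)] for i in range(1, N1 + 1)]
```

After the last timestep, register \textit{C} of cell $(i,j)$ holds exactly $c_{ij}$; the padded zeros contribute nothing to the sums.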
As can be seen in Figure \ref{exonea}, the items of matrix $C$ are calculated
\inddef{stationary!matrix}\ki{stationary,} i.e., all additions contributing to an item $c_{ij}$ happen
in the same cell. \inddef{stationary!variable}\ki{Stationary variables} don't move at all during
the calculation in the systolic array. Stationary results eventually must be forwarded to
a \inddef{systolic array!border of}\ki{border} of the array in a supplementary action, to be
delivered to the \index{outside world}outside world. Moreover, it is necessary to initialise the register
for item $c_{ij}$. Performing these extra tasks requires considerable expenditure of runtime and hardware.
We will further study this problem in Section \ref{Steuerung}.
\subsection{Pipelining}
\index{pipelining|(}
The characteristic operating style with globally synchronised discrete timesteps of equal duration
and the strict separation in time of the cells by registers suggest systolic arrays to be special cases
of \emph{pipelined} systems. Here, the registers of the cells correspond to the well-known
\emph{pipeline registers.} However, classical pipelines come as linear structures only, whereas
systolic arrays frequently extend into more spatial dimensions---as visible in our example.
With some justification, a \index{systolic array!multi-dimensional}\emph{multi-dimensional systolic array}
can be regarded as a set of interconnected linear pipelines. Hence it should be apparent
that basic properties of one-dimensional pipelining also apply to multi-dimensional systolic arrays.
A typical effect of pipelining is the reduced \inddef{utilisation}\ki{utilisation} at startup
and during shut-down of the operation. Initially, the pipe is empty; no pipeline stage is active.
Then, the first stage receives data and starts working; all other stages are still idle.
During the next timestep, the first stage passes data to the second stage and itself receives
new data; only these two stages do some work. More and more stages become active until all
stages process data in every timestep; the pipeline is now fully utilised for the first time.
After a series of timesteps at maximum load, with duration dependent
on the \index{data stream!length of}length of the data stream, the input sequence ceases;
the first stage of the pipeline therefore runs out of work. In the next timestep,
the second stage stops working, too. And so on, until eventually all stages have fallen asleep again.
Phases of reduced activity diminish the average performance of the whole pipeline, and the relative
impact of this drop in productivity grows with the number of pipeline stages
in relation to the \index{data stream!length of}length of the data stream.
We now study this phenomenon in some depth by analysing the two-dimensional systolic array from Figure \ref{exonea}. As expected, we find a lot of idling cells when starting or finishing the calculation. In the first timestep, only cell $(1,1)$ performs some useful work; all other cells in fact do calculations that work like null operations---and that is what they are supposed to do in this phase. In the second timestep, cells $(1,2)$ and $(2,1)$ start doing real work, see Figure \ref{exoneb}(a). Data is flooding the array until eventually all cells are busy. After the last true data item has left cell $(1,1)$, the latter no longer contributes to the calculation but merely reproduces the finished value of $c_{11}$. Step by step, more and more cells drop off. Finally, only cell $(N_1,N_2)$ makes a last necessary computation step; Figure \ref{exoneb}(b) shows this concluding timestep.
\index{pipelining|)}
\epskepnagyfrag{figs/exoneb}{\label{exoneb} Two \abraindex{snapshot}\emph{snapshots} for the systolic array from Figure \ref{exonea}.}{12.6cm}
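The startup and shut-down losses can be quantified by relating useful operations to available cell-timesteps. The following small Python calculation is illustrative only (the function name is our own, not from the text):

```python
# Illustrative: fraction of cell-timesteps doing useful work in the N1 x N2
# array. Useful operations: N1*N2*N3 multiply-add steps; capacity: N1*N2
# cells times the running time N1+N2+N3-2.
def utilisation(n1, n2, n3):
    return (n1 * n2 * n3) / (n1 * n2 * (n1 + n2 + n3 - 2))
```

For $N_1=3$, $N_2=5$, $N_3=4$, only 40 per cent of the cell-timesteps carry useful work; the ratio approaches 1 as the data stream grows long in relation to the number of stages.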
\begin{gyak}
\ujgyak
What must be changed in the input scheme from Figure \ref{exonea}(a) to multiply a
$2 \times 6$ matrix by a $6 \times 3$ matrix on the same systolic array?
Could the calculations be organised such that the result matrix would emerge in the lower right corner
of the systolic array?
\ujgyak
Why is it necessary to clear spare slots in the input streams for matrix $A$,
as shown in Figure \ref{exonea}? Why haven't we done the same for matrix $B$ also?
\ujgyak
If the systolic array from Figure \ref{exonea} should be interpreted as a pipeline:
how many stages would you suggest to adequately describe the behaviour?
\end{gyak}
\section{Space-time transformation and systolic arrays}\label{sysex2}
\index{space-time!transformation|(}
Although the approach taken in the preceding section should be sufficient for a basic understanding of the topic, we have to work harder to describe and judge the properties of systolic arrays in a quantitative and precise way. In particular the solution of \index{parametric problem}\emph{parametric problems} requires a solid mathematical framework. So, in this section, we study central concepts of a formal theory on \index{uniform algorithm}\emph{uniform algorithms}, based on \emph{linear transformations}.
\subsection{Further example: matrix product}
\index{matrix product}
System (\ref{exmmult4}) can be computed by a multitude of other systolic arrays, besides that from Figure \ref{exonea}. In Figure \ref{extwoa}, for example, we see such an alternative systolic array. Whereas the same function is evaluated by both architectures, the appearance of the array from Figure \ref{extwoa} is very different:
%
\begin{itemize}
\item The number of cells now is considerably larger, altogether 36, instead of 15.
\item The \index{systolic array!shape of}shape of the array is \inddef{systolic array!hexagonal}\ki{hexagonal,} instead of \index{systolic array!rectangular}rectangular.
\item Each cell now has three \index{port!input}input ports and three \index{port!output}output ports.
\item The input scheme is clearly different from that of Figure \ref{exonea}(a).
\item And finally: the matrix $C$ here also flows through the whole array.
\end{itemize}
The cell structure from Figure \ref{extwoa}(b) at first sight does not seem to differ essentially
from that in Figure \ref{exonea}(b). But the differences matter: there are no \emph{cyclic paths} in the new cell, thus \emph{stationary variables} can no longer appear. Instead, the cell is provided with three input ports and three output ports, passing items of all three matrices through the cell. The direction of communication at the ports on the right and left borders of the cell has changed, as well as the assignment of the matrices to the ports.
\epskepnagyfrag{figs/extwoa}{\label{extwoa} \abraindex{systolic array!hexagonal}Hexagonal systolic array for \abraindex{matrix product}matrix product. \textbf{(a)} \abraindex{systolic array!structure of}Array structure and principle of the data input/output. \textbf{(b)} \abraindex{cell!structure of}Cell structure.}{9.8cm}
\subsection{The space-time transformation as a global view}
How is system (\ref{exmmult4}) related to Figure \ref{extwoa}? No doubt you were able to fully understand the operation of the systolic array from Section \ref{sysex1} without any special aid. But for the present example this is considerably more difficult---so now you may be sufficiently motivated for the use of a mathematical formalism.
We can assign two fundamental measures to each elementary operation of an algorithm for describing the execution in the systolic array: the time \emph{when} the operation is performed, and the position of the cell \emph{where} the operation is performed. As will become clear in the sequel,
after fixing the so-called \emph{space-time transformation} there are hardly any degrees of freedom left for further design: practically all features of the intended systolic array strictly follow from the chosen space-time transformation.
As for the systolic array from Figure \ref{exonea}, the execution of an instance $(i,j,k)$
in the systolic array from Figure \ref{extwoa} happens at time $t=i+j+k$.
We can represent this expression as the dot product of a \inddef{time vector}\ki{time vector}
\begin{equation}\label{zeitvektor}
\pi = \left(\begin{array}{ccc} 1 & 1 & 1 \end{array}\right)
\end{equation}
and the \index{iteration!vector}iteration vector
\begin{equation}
v = \left(\begin{array}{c} i \\ j \\ k \end{array}\right) \koz ,
\end{equation}
hence
\begin{equation}\label{zeitabbildung}
t = \pi \cdot v \koz ;
\end{equation}
so in this case
\begin{equation}\label{zeitabbildung1}
t =
\left(\begin{array}{ccc} 1 & 1 & 1 \end{array}\right) \cdot
\left(\begin{array}{c} i \\ j \\ k \end{array}\right)
= i+j+k \koz .
\end{equation}
The space coordinates $z=(x,y)$ of the executed operations in the example from Figure \ref{exonea} can be inferred as $z=(i,j)$ from the iteration vector $v=(i,j,k)$ according to our decision in Subsection \ref{Raumkoordinaten1}. The chosen map is a \inddef{projection}\ki{projection} of the space $\R^3$ along the $k$ axis. This linear map can be described by a \inddef{projection!matrix}\ki{projection matrix}
%
\begin{equation}\label{projektionsmatrix1}
P = \left(\begin{array}{rrr}
1 & 0 & 0 \\
0 & 1 & 0
\end{array}\right) \koz .
\end{equation}
%
To find the \index{space coordinates}space coordinates, we multiply the \index{projection!matrix}projection matrix $P$ by the \index{iteration!vector}iteration vector $v$, written as
%
\begin{equation}\label{projektion}
z = P \cdot v \koz .
\end{equation}
The \inddef{projection!direction}\ki{projection direction} can be represented
by any vector $u$ perpendicular to all rows of the projection matrix,
\begin{equation}\label{projektionsvektor}
P \cdot u = \vec{0} \koz .
\end{equation}
For the projection matrix $P$ from (\ref{projektionsmatrix1}), one of the possible
\inddef{projection!vector}\ki{projection vectors} would be $u=(0,0,1)$.
Projections are very popular for describing the space coordinates when designing a systolic array.
Also in our example from Figure \ref{extwoa}(a), the space coordinates are generated
by projecting the iteration vector. Here, a feasible projection matrix is given by
\begin{equation}\label{projektionsmatrix2}
P = \left(\begin{array}{rrr}
0 & -1 & 1 \\
-1 & 1 & 0
\end{array}\right) \koz .
\end{equation}
A corresponding projection vector would be $u=(1,1,1)$.
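Both claims are easy to verify mechanically. The following Python snippet (a sketch of ours, not from the text) checks condition (\ref{projektionsvektor}) for the two projection matrices:

```python
# Check P . u = 0, i.e. the projection vector u lies in the null space of
# the projection matrix P, for both examples of the text.
def matvec(P, u):
    return tuple(sum(P[r][c] * u[c] for c in range(3)) for r in range(len(P)))

P_rect = [[1, 0, 0], [0, 1, 0]]     # projection matrix (projektionsmatrix1)
P_hex  = [[0, -1, 1], [-1, 1, 0]]   # projection matrix (projektionsmatrix2)
```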
We can combine the projection matrix and the time vector into a matrix $T$
that fully describes the \inddef{space-time!transformation}\ki{space-time transformation,}
\begin{equation}\label{raumzeit}
\left(\begin{array}{r} z \\ t \end{array}\right) =
\left(\begin{array}{r} P \\ \pi \end{array}\right) \cdot v =
T \cdot v \koz .
\end{equation}
The first and second rows of $T$ are constituted by the projection matrix $P$,
the third row by the time vector $\pi$.
For the example from Figure \ref{exonea}, the matrix $T$ giving the space-time transformation reads as
\begin{equation}\label{raumzeit1}
T = \left(\begin{array}{rrr}
1 & 0 & 0 \\
0 & 1 & 0 \\
1 & 1 & 1
\end{array}\right) \koz ;
\end{equation}
for the example from Figure \ref{extwoa} we have
\begin{equation}\label{raumzeit2}
T = \left(\begin{array}{rrr}
0 & -1 & 1 \\
-1 & 1 & 0 \\
1 & 1 & 1
\end{array}\right) \koz .
\end{equation}
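Applying $T$ to a concrete iteration vector yields cell and timestep at once. A minimal Python sketch (naming is our own):

```python
# Multiply a 3x3 space-time matrix by an iteration vector v = (i, j, k);
# the first two components give the cell (x, y), the third the timestep t.
def space_time(T, v):
    return tuple(sum(T[r][c] * v[c] for c in range(3)) for r in range(3))

T_rect = [[1, 0, 0], [0, 1, 0], [1, 1, 1]]    # matrix (raumzeit1)
T_hex  = [[0, -1, 1], [-1, 1, 0], [1, 1, 1]]  # matrix (raumzeit2)
```

For instance, in the rectangular array the instance $(2,3,1)$ runs on cell $(2,3)$ at timestep $6$.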
Space-time transformations may be understood as a \emph{global view} to the systolic system.
Applying a space-time transformation---that is linear, here, and described by a matrix $T$---to
a system of recurrence equations directly yields the external features of the systolic array,
i.e., its \inddef{systolic array!architecture of}\ki{architecture}---consisting of
\index{space coordinates}space coordinates, \index{connection pattern}connection pattern,
and \index{cell!structure of}cell structure.
\textit{Remark.} Instead of purely linear maps, we alternatively may consider general affine maps,
additionally providing a translative component, $T \cdot v + h$. However, as long as we treat
all iteration vectors with a common space-time transformation, affine maps are not really required.
\subsection{Parametric space coordinates}\label{Raumkoordinaten2}\index{space coordinates!parametric|(}
If the domains are numerically given and, in particular, contain only a few points, we can easily calculate
the concrete set of space coordinates via equation (\ref{projektion}). But when
the \inddef{domain!parametric}domains are specified parametrically as in system (\ref{exmmult4}),
the positions of the cells must be determined by \index{symbolic evaluation}\emph{symbolic evaluation}.
The following explanation especially dwells on this problem.
Suppose that each cell of the systolic array is represented geometrically by a point with space
coordinates $z=(x,y)$ in the two-dimensional space $\R^2$. From each iteration vector $v$ of the
domain $S$, by equation (\ref{projektion}) we get the space coordinates $z$ of a certain processor,
$z=P\cdot v$: the operations denoted by $v$ are projected onto cell $z$. The set
$P(S)=\{ P \cdot v : v \in S \}$ of space coordinates states the positions of all cells in the
systolic array necessary for correct operation.
To our advantage, we normally use domains that can be described as the set of all integer points
inside a convex region, here a subset of $\R^3$---called \inddef{domain!dense convex}\ki{dense convex domains.}
The convex hull of such a domain with a finite number of domain points is a \emph{polytope},
with domain points as vertices. Polytopes map to polytopes again by arbitrary linear transformations.
Now we can make use of the fact that each projection is a linear transformation.
Vertices of the destination polytope then are images of vertices of the source polytope.
\textit{Remark.} But not all vertices of a source polytope need to be projected
to vertices of the destination polytope, see for instance Figure \ref{extwob}.
\epskepnagyfrag{figs/extwob}{\label{extwob} Image of a rectangular domain under projection.
Most interior points have been suppressed for clarity. Images of previous vertex points are shaded.}{11.5cm}
When projected by an integer matrix $P$, the \emph{lattice} $\Z^3$ maps to the lattice $\Z^2$
if $P$ can be extended by an integer time vector $\pi$ to a
\emph{unimodular} \index{space-time!matrix}space-time matrix $T$.
Practically any \index{domain!dense convex}dense convex domain, apart from some exceptions
irrelevant to usual applications, thereby maps to another dense convex set of space coordinates,
which is completely characterised by the vertices of the hull polytope. To determine
the \index{systolic array!shape of}\emph{shape} and the \index{systolic array!size of}\emph{size}
of the systolic array, it is therefore sufficient to apply the matrix $P$ to the vertices
of the convex hull of $S$.
\textit{Remark.} Any square integer matrix with determinant $\pm 1$ is
called \inddef{matrix!unimodular}\ki{unimodular.} Unimodular matrices have unimodular inverses.
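Unimodularity is cheap to test: a $3 \times 3$ determinant suffices. As a quick illustration of ours, the space-time matrix (\ref{raumzeit1}) passes the test:

```python
# Determinant of a 3x3 integer matrix; the matrix is unimodular iff the
# result is +1 or -1.
def det3(T):
    return (T[0][0] * (T[1][1] * T[2][2] - T[1][2] * T[2][1])
          - T[0][1] * (T[1][0] * T[2][2] - T[1][2] * T[2][0])
          + T[0][2] * (T[1][0] * T[2][1] - T[1][1] * T[2][0]))
```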
We apply this method to the integer domain
\begin{equation}\label{traegermenge}
S = [1,N_1] \times [1,N_2] \times [1,N_3]
\end{equation}
from system (\ref{exmmult4}). The vertices of the convex hull here are
\begin{equation}\label{ecken1}
\begin{split}
& (1,1,1), (N_1,1,1), (1,N_2,1), (1,1,N_3), \\
& (1,N_2,N_3), (N_1,1,N_3), (N_1,N_2,1), (N_1,N_2,N_3) \koz .
\end{split}
\end{equation}
For the projection matrix $P$ from (\ref{projektionsmatrix2}), the vertices of the corresponding
image have the positions
\begin{equation}\label{ecken2}
\begin{split}
& (N_3-1,0), (N_3-1,1-N_1), (0,1-N_1) \koz , \\
& (1-N_2,N_2-N_1), (1-N_2,N_2-1), (N_3-N_2,N_2-1) \koz .
\end{split}
\end{equation}
Since $S$ has eight vertices, but the image $P(S)$ only six, it is obvious that two vertices of $S$ have
become \emph{interior points} of the image, and thus are of no relevance
for the size of the array; namely the vertices $(1,1,1)$ and $(N_1,N_2,N_3)$.
This phenomenon is sketched in Figure \ref{extwob}.
The settings $N_1=3$, $N_2=5$, and $N_3=4$ yield the vertices $(3,0)$, $(3,-2)$, $(0,-2)$, $(-4,2)$,
$(-4,4)$, and $(-1,4)$. We see that space coordinates in principle can be negative. Moreover, the choice
of an origin---that here lies in the interior of the polytope---might not always be obvious.
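These numbers can be reproduced mechanically. The Python fragment below (our own sketch) projects all eight vertices of $S$ by the matrix (\ref{projektionsmatrix2}):

```python
# Project the eight vertices of S = [1,N1] x [1,N2] x [1,N3] by the matrix
# (projektionsmatrix2); with P = [[0,-1,1],[-1,1,0]], the image of (i,j,k)
# is (k - j, j - i).
N1, N2, N3 = 3, 5, 4
corners = [(i, j, k) for i in (1, N1) for j in (1, N2) for k in (1, N3)]
images = [(k - j, j - i) for (i, j, k) in corners]
```

Six of the eight images are the hexagon vertices listed above; the images of $(1,1,1)$ and $(N_1,N_2,N_3)$, namely $(0,0)$ and $(-1,2)$, lie in the interior.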
As the image of the projection, we get a systolic array with
\index{systolic array!hexagonal}\emph{hexagonal} \index{systolic array!shape of}shape and
parallel opposite \index{systolic array!border of}borders. On these, we find $N_1$, $N_2$, and
$N_3$ integer points, respectively; cf. Figure \ref{extwoc}. Thus, as opposed to our first example,
\emph{all} \index{problem parameter}problem parameters here are also \index{systolic array!parameter of}array
parameters.
\epskepnagyfrag{figs/extwoc}{\label{extwoc} Partitioning of the space coordinates.}{3.8cm}
The area function of this region is of order $\Theta(N_1 \cdot N_2 + N_1 \cdot N_3 + N_2 \cdot N_3)$,
and thus depends on all three matrix dimensions. So this is quite different
from the situation in Figure \ref{exonea}(a), where the area function---for the same problem---is of order
$\Theta(N_1 \cdot N_2)$.
Improving on this approximate calculation, we finally count the exact number of cells.
For this process, it might be helpful to partition the entire region into subregions
for which the number of cells comprised can be easily determined; see Figure \ref{extwoc}.
The points (0,0), $(N_3-1,0)$, $(N_3-1,1-N_1)$, and $(0,1-N_1)$ are the vertices of a rectangle
with $N_1 \cdot N_3$ cells. If we translate this point set up by $N_2-1$ cells and
left by $N_2-1$ cells, we exactly cover the whole region. Each shift by one cell up
and left contributes just another $N_1+N_3-1$ cells. Altogether this yields
$N_1 \cdot N_3 + (N_2-1) \cdot (N_1+N_3-1) = N_1 \cdot N_2 + N_1 \cdot N_3 + N_2
\cdot N_3 - (N_1+N_2+N_3) + 1$ cells.
For $N_1=3$, $N_2=5$, and $N_3=4$ we thereby get a number of 36 cells,
as we have already learned from Figure \ref{extwoa}(a).
\index{space coordinates!parametric|)}
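The closed formula can be cross-checked by brute force---project every point of the domain and count the distinct images. A small Python sketch of ours:

```python
# Enumerate the cells of the hexagonal array by projecting the whole domain
# with the matrix (projektionsmatrix2), and compare with the closed formula.
def cell_count(n1, n2, n3):
    cells = {(k - j, j - i)                  # image of (i, j, k) under P
             for i in range(1, n1 + 1)
             for j in range(1, n2 + 1)
             for k in range(1, n3 + 1)}
    return len(cells)

def cell_formula(n1, n2, n3):
    return n1 * n2 + n1 * n3 + n2 * n3 - (n1 + n2 + n3) + 1
```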
\subsection{Symbolically deriving the running time}\index{systolic!algorithm|(}\index{running time!of
systolic algorithm|(}
The running time of a systolic algorithm can be symbolically calculated by an approach similar
to that in Subsection \ref{Raumkoordinaten2}. The time transformation according
to formula (\ref{zeitabbildung}) is a linear map, as well. We find the timesteps
of the first and the last calculations as the minimum and the maximum, respectively, of the set
$\pi(S)=\{ \pi \cdot v : v \in S \}$ of execution timesteps. Following the discussion above,
it thereby suffices to vary $v$ over the vertices of the convex hull of $S$.
The \inddef{running time!of systolic algorithm}\ki{running time} is then given by the formula
\begin{equation}\label{gesamtdauer}
t_\Sigma = 1 + \max \pi(S) - \min \pi(S) \koz .
\end{equation}
Adding one is mandatory here, since the first as well as the last timestep belong to the calculation.
For the example from Figure \ref{extwoa}, the vertices of the polytope as enumerated in (\ref{ecken1}) are mapped by (\ref{zeitabbildung1}) to the set of images
\begin{equation*}
\{ 3, 2+N_1, 2+N_2, 2+N_3, 1+N_1+N_2, 1+N_1+N_3, 1+N_2+N_3, N_1+N_2+N_3 \} \koz .
\end{equation*}
With the basic assumption $N_1,N_2,N_3 \ge 1$, we get a minimum of 3 and a maximum of $N_1+N_2+N_3$, thus a running time of $N_1+N_2+N_3-2$ timesteps, as for the systolic array from Figure \ref{exonea}---no surprise, since the domains and the time vectors agree.
For the special problem parameters $N_1=3$, $N_2=5$, and $N_3=4$, a running time of $12-3+1=10$ timesteps can be derived.
If $N_1=N_2=N_3=N$, the systolic algorithm shows a running time of order $\Theta(N)$, using $\Theta(N^2)$ systolic cells.
\index{systolic!algorithm|)}\index{running time!of systolic algorithm|)}
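Since it suffices to evaluate $\pi \cdot v$ at the vertices of the convex hull, the running time is obtained from eight dot products. A minimal Python sketch of ours:

```python
# Running time via formula (gesamtdauer): evaluate pi . v = i + j + k at
# the eight vertices of S, take max - min, and add one.
def running_time(n1, n2, n3):
    times = [i + j + k for i in (1, n1) for j in (1, n2) for k in (1, n3)]
    return max(times) - min(times) + 1
```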
\subsection{How to unravel the communication topology}\label{verbindungen}
\index{communication topology!of systolic array|(}
The \inddef{communication topology!of systolic array}\ki{communication topology}
of the systolic array is induced by applying the space-time transformation to
the \index{data dependence}\emph{data dependences} of the algorithm.
Each data dependence results from a direct use of a variable instance to calculate
another instance of the same variable, or an instance of another variable.
\textit{Remark.} In contrast to the general situation where a data dependence analysis
for imperative programming languages has to be performed by highly optimising compilers,
data dependences here are always \emph{flow dependences}. This is a direct consequence
of the \index{assignment-free notation}assignment-free notation employed by us.
The \inddef{data dependence}\ki{data dependences} can be read off the quantified equations
in our \index{assignment-free notation}assignment-free notation by comparing their right and
left sides. For example, we first analyse the equation
$c(i,j,k) = c(i,j,k-1) + a(i,j-1,k) \cdot b(i-1,j,k)$ from system (\ref{exmmult4}).
The value $c(i,j,k)$ is calculated from the values $c(i,j,k-1)$, $a(i,j-1,k)$, and $b(i-1,j,k)$.
Thus we have a \inddef{data flow}\ki{data flow} from $c(i,j,k-1)$ to $c(i,j,k)$,
a data flow from $a(i,j-1,k)$ to $c(i,j,k)$, and a data flow from $b(i-1,j,k)$ to $c(i,j,k)$.
All properties of such a data flow that matter here can be covered by
a \inddef{dependence vector}\ki{dependence vector,} which is the iteration vector
of the calculated variable instance minus the iteration vector of the correspondingly used variable instance.
The iteration vector for $c(i,j,k)$ is $(i,j,k)$; that for $c(i,j,k-1)$ is $(i,j,k-1)$. Thus, as the difference vector, we find
\begin{equation}\label{dC}
d_C=
\left(\begin{array}{c} i \\ j \\ k \end{array}\right) -
\left(\begin{array}{c} i \\ j \\ k-1 \end{array}\right) =
\left(\begin{array}{r} 0 \\ 0 \\ 1 \end{array}\right) \koz .
\end{equation}
Correspondingly, we get
\begin{equation}\label{dA}
d_A=
\left(\begin{array}{c} i \\ j \\ k \end{array}\right) -
\left(\begin{array}{c} i \\ j-1 \\ k \end{array}\right) =
\left(\begin{array}{r} 0 \\ 1 \\ 0 \end{array}\right)
\end{equation}
and
\begin{equation}\label{dB}
d_B=
\left(\begin{array}{c} i \\ j \\ k \end{array}\right) -
\left(\begin{array}{c} i-1 \\ j \\ k \end{array}\right) =
\left(\begin{array}{r} 1 \\ 0 \\ 0 \end{array}\right) \koz .
\end{equation}
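The arithmetic behind (\ref{dC})--(\ref{dB}) is a plain componentwise difference; in Python (a sketch of ours):

```python
# Dependence vector: iteration vector of the calculated variable instance
# minus iteration vector of the used variable instance.
def dependence(v_calc, v_used):
    return tuple(a - b for a, b in zip(v_calc, v_used))
```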
In the equation $a(i,j,k) = a(i,j-1,k)$ from system (\ref{exmmult4}),
we cannot directly recognise which is the calculated variable instance,
and which is the used variable instance. This example elucidates the difference
between \emph{equations} and \emph{assignments}. When fixing that $a(i,j,k)$ should
follow from $a(i,j-1,k)$ by a \inddef{copy operation}\ki{copy operation,} we get
the same dependence vector $d_A$ as in (\ref{dA}). Correspondingly for the equation $b(i,j,k) = b(i-1,j,k)$.
A variable instance with iteration vector $v$ is calculated in cell $P\cdot v$. If for this
calculation another variable instance with iteration vector $v'$ is needed, implying a data
dependence with dependence vector $d=v-v'$, the used variable instance is provided by cell $P\cdot v'$.
Therefore, we need a communication from cell $z'=P\cdot v'$ to cell $z=P\cdot v$. In systolic arrays,
all communication has to be via direct static links between the communicating cells.
Due to the linearity of the transformation from (\ref{projektion}), we have
$z-z'=P\cdot v-P\cdot v'=P\cdot(v-v')=P\cdot d$.
If $P\cdot d=\vec{0}$, communication happens exclusively inside the calculating cell,
i.e., in time, only---and not in space. Passing values in time is via registers of the calculating cell.
For $P\cdot d\neq\vec{0}$, on the other hand, a communication between different cells is needed. Then a link along the \inddef{flow direction}\ki{flow direction} $P\cdot d$ must be provided from/to all cells of the systolic array. The vector $-P\cdot d$, oriented in counter-flow direction, leads from space point $z$ to space point $z'$.
If there is more than one dependence vector $d$, we need an appropriate \index{link}\emph{link} for each of them at every cell. Take for example the formulas (\ref{dC}), (\ref{dA}), and (\ref{dB}) together with (\ref{projektionsmatrix2}), then we get $P \cdot d_A = (-1,1)$, $P \cdot d_B = (0,-1)$, and $P \cdot d_C = (1,0)$. In Figure \ref{extwoa}(a), terminating at every cell, we see three links corresponding to the various vectors $P \cdot d$. This results in a \inddef{communication topology!hexagonal}\ki{hexagonal communication topology}---instead of the \inddef{communication topology!orthogonal}\ki{orthogonal communication topology} from the first example.
\index{communication topology!of systolic array|)}
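The three flow directions stated above each follow from a single matrix-vector product; a Python sketch (naming is our own):

```python
# Flow directions P . d for the hexagonal array: the projection matrix
# (projektionsmatrix2) applied to the dependence vectors (dA), (dB), (dC).
P = [[0, -1, 1], [-1, 1, 0]]
deps = {"A": (0, 1, 0), "B": (1, 0, 0), "C": (0, 0, 1)}
flow = {m: tuple(sum(P[r][c] * d[c] for c in range(3)) for r in range(2))
        for m, d in deps.items()}
```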
\subsection{Inferring the structure of the cells}\index{cell!structure of|(}
Now we apply the space-related techniques from Subsection \ref{verbindungen} to time-related questions. A variable instance with iteration vector $v$ is calculated in timestep $\pi\cdot v$. If this calculation uses another variable instance with iteration vector $v'$, the former had been calculated in timestep $\pi\cdot v'$. Hence communication corresponding to the dependence vector $d=v-v'$ must take exactly $\pi\cdot v-\pi\cdot v'$ timesteps.
Since (\ref{zeitabbildung}) describes a linear map, we have $\pi\cdot v-\pi\cdot v'=\pi\cdot(v-v')=\pi\cdot d$. According to the systolic principle, each communication must involve at least one register. The dependence vectors $d$ are fixed, and so the choice of a \index{time vector}time vector $\pi$ is constrained by
\begin{equation}\label{kausal}
\pi \cdot d \ge 1 \koz .
\end{equation}
In case $P\cdot d=\vec{0}$, we must provide \inddef{register}\ki{registers} for \index{stationary!variable}stationary variables in all cells. But each register is overwritten with a new value in every timestep. Hence, if $\pi\cdot d\ge 2$, the old value must be carried on to a further register. Since this is repeated for $\pi\cdot d$ timesteps, the cell needs exactly $\pi\cdot d$ registers per stationary variable. The values of the stationary variable successively pass all these registers before eventually being used. If $P\cdot d\neq\vec{0}$, the transport of values analogously goes by $\pi\cdot d$ registers, though these are not required to belong all to the same cell.
For each dependence vector $d$, we thus need an appropriate number of registers. In Figure \ref{extwoa}(b), we see three \index{port!input}input ports at the cell, corresponding to the dependence vectors $d_A$, $d_B$, and $d_C$, since for these we have $P \cdot d\neq\vec{0}$. Moreover, $\pi \cdot d = 1$ due to (\ref{zeitabbildung1}) and (\ref{zeitvektor}). Thus, we need one register per dependence vector. Finally, the regularity of system (\ref{exmmult4}) forces three \index{port!output}output ports for every cell, opposite to the corresponding input ports.
Good news: we can infer in general that each \index{cell}cell needs only a few registers, because the number of dependence vectors $d$ is statically bounded for a system like (\ref{exmmult4}), and for each of the dependence vectors the amount of registers $\pi\cdot d$ has a fixed and usually small value.
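For the running example, the register count per dependence vector is immediate; an illustrative Python check of ours:

```python
# With time vector pi = (1, 1, 1), every dependence vector of system
# (exmmult4) satisfies the causality constraint pi . d >= 1 (kausal) with
# equality, so one register per dependence vector suffices.
pi = (1, 1, 1)
deps = [(0, 0, 1), (0, 1, 0), (1, 0, 0)]     # d_C, d_A, d_B
regs = [sum(p * x for p, x in zip(pi, d)) for d in deps]
```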
The three \index{port}input and output ports at every cell now permit the use of three moving matrices.
Very differently from Figure \ref{exonea}, a dot product $\sum_{k=1}^4 a_{ik} \cdot b_{kj}$
here is not calculated within a single cell, but dispersed over the systolic array.
As a prerequisite, we had to dissolve the sum into a sequence of single additions.
We call this principle a \inddef{generic operator!distributed}\ki{distributed generic operator.}
Apart from the three input ports with their registers, and the three output ports,
Figure \ref{extwoa}(b) shows a multiplier chained to an adder. Both units are induced in each
cell by applying the transformation (\ref{projektion}) to the domain $S$ of the equation
$c(i,j,k) = c(i,j,k-1) + a(i,j-1,k) \cdot b(i-1,j,k)$ from system (\ref{exmmult4}).
According to this equation, the addition has to follow the calculation of the product,
so the order of the hardware operators
as seen in Figure \ref{extwoa}(b) is implied.
The source cell for each of the used operands follows from the projection of the corresponding dependence vector. Here, variable $a(i,j-1,k)$ is related to the dependence vector $d_A=(0,1,0)$. The projection $P\cdot d_A=(-1,1)$ constitutes the \index{flow direction}\emph{flow direction} of matrix $A$. Thus the value to be used has to be expected, as observed by the calculating cell, in opposite direction $(1,-1)$, in this case from the port in the lower left corner of the cell, passing through register \textit{A}.
All the same, $b(i-1,j,k)$ comes from the right via register \textit{B}, and $c(i,j,k-1)$ from above through register \textit{C}. The calculated values $a(i,j,k)$, $b(i,j,k)$, and $c(i,j,k)$ are output into the opposite directions through the appropriate ports: to the upper right, to the left, and downwards.
If alternatively we use the projection matrix $P$ from (\ref{projektionsmatrix1}), then for $d_C$ we get the direction $(0,0)$. The formula $\pi \cdot d_C = 1$ results in the requirement of exactly one register \textit{C} for each item of the matrix $C$. This register provides the value $c(i,j,k-1)$ for the calculation of $c(i,j,k)$, and after this calculation receives the value $c(i,j,k)$. All this reasoning matches with the cell from Figure \ref{exonea}(b). Figure \ref{exonea}(a) correspondingly shows no links for matrix $C$ between the cells: for the matrix is \index{stationary!matrix}\emph{stationary}.
\index{cell!structure of|)}
\begin{gyak}
\ujgyak
Each projection vector\gyakindex{projection!vector} $u$ induces several corresponding
projection matrices\gyakindex{projection!matrix} $P$.
\begin{description}
\item[\textit {a.}] Show that
%
\begin{equation*}\label{projektionsmatrix3}
P =\left(\begin{array}{rrr} 0 & 1 & -1 \\ -1 & 0 & 1 \end{array}\right)
\end{equation*}
%
also is a projection matrix fitting with projection vector $u=(1,1,1)$.
\item[\textit {b.}] Use this projection matrix to transform the domain from system (\ref{exmmult4}).
\item[\textit {c.}] The resulting space coordinates differ from those in Subsection
\ref{Raumkoordinaten2}. Why are the two point sets nevertheless topologically equivalent?
\item[\textit {d.}] Analyse the cells in both arrangements for common and differing features.
\end{description}
\vspace{-2mm}
\ujgyak
Apply all techniques from Section \ref{sysex2} to system (\ref{exmmult4}), employing a space-time matrix
\begin{equation*}\label{raumzeit3}
T = \left(\begin{array}{rrr} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 1 & 1 & 1 \end{array}\right).
\end{equation*}
\end{gyak}\index{space-time!transformation|)}
\section{Input/output schemes}\label{easchemata}\label{sysex3}\index{input/output scheme|(}
In Figure \ref{extwoa}(a), the \emph{input/output scheme} is only sketched
by the \index{flow direction}\emph{flow directions} for the matrices $A,B,C$.
The necessary details to understand the input/output operations are now provided by Figure \ref{extwod}.
\epskepnagyfrag{figs/extwod}{\label{extwod} Detailed \abraindex{input/output scheme}input/output scheme for the systolic array from Figure \ref{extwoa}(a).}{10.9cm}
The \inddef{input/output scheme}\ki{input/output scheme} in
Figure \ref{extwod} shows some new phenomena when compared with Figure \ref{exonea}(a).
The input and output cells belonging to any matrix are no longer threaded all on a single straight line;
now, for each matrix, they lie along two adjacent \index{systolic array!border of}\emph{borders},
that additionally may differ in the number of links to the outside world.
The data structures from Figure \ref{extwod} also differ from those in Figure \ref{exonea}(a)
in the angle of inclination. Moreover, the matrices $A$ and $B$ from Figure \ref{extwod}
arrive at the \index{cell!boundary}\emph{boundary cells} with only one third
of the \index{data rate}\emph{data rate}, compared to Figure \ref{exonea}(a).
Spending some effort, even here it might be possible in principle to construct---item by item---the
appropriate input/output scheme fitting the present systolic array. But it is much safer
to apply a formal derivation. The following subsections are devoted to the presentation
of the various methodical steps for achieving our goal.
\subsection{From data structure indices to iteration vectors}\index{iteration!vector|(}
First, we need to construct a formal relation between the abstract data structures and
the concrete variable instances in the assignment-free representation.
Each item of the matrix $A$ can be characterised by a row index $i$ and a column index $k$.
These \inddef{data structure index}\ki{data structure indices} can be combined
into a \inddef{data structure vector}\ki{data structure vector} $w=(i,k)$.
Item $a_{ik}$ in system (\ref{exmmult4}) corresponds to the instances $a(i,j,k)$, with any $j$.
The coordinates of these instances all lie on a line along direction $q=(0,1,0)$ in space $\R^3$.
Thus, in this case, the formal change from data structure vector $(i,k)$ to coordinates $(i,j,k)$
can be described by the transformation
\begin{equation}\label{datalayout1a}
\left(\begin{array}{c}i\\ j\\ k\end{array}\right)=
\left(\begin{array}{rr}1&0\\ 0&0\\ 0&1\end{array}\right)\cdot
\left(\begin{array}{c}i\\ k\end{array}\right)+
j\cdot\left(\begin{array}{r}0\\ 1\\ 0\end{array}\right)+
\left(\begin{array}{r}0\\ 0\\ 0\end{array}\right) \koz .
\end{equation}
In system (\ref{exmmult4}), the coordinate vector $(i,j,k)$ of every variable instance
equals the iteration vector of the domain point representing the calculation of this variable instance.
Thus we also may interpret formula (\ref{datalayout1a}) as a relation between
\index{data structure vector}\emph{data structure vectors} and \emph{iteration vectors}.
Abstractly, the desired iteration vectors $v$ can be inferred from the data structure vector $w$ by the formula
\begin{equation}\label{datalayout1}
v = H \cdot w + \lambda \cdot q + p \koz .
\end{equation}
The affine vector $p$ is needed in more general cases, though it is always the null vector in our example.
Because of $b(i,j,k)=b_{kj}$, the representation for matrix $B$ correspondingly is
%
\begin{equation}\label{datalayout1b}
\left(\begin{array}{c}i\\ j\\ k\end{array}\right)=
\left(\begin{array}{rr}0&0\\ 0&1\\ 1&0\end{array}\right)\cdot
\left(\begin{array}{c}k\\ j\end{array}\right)+
i\cdot\left(\begin{array}{r}1\\ 0\\ 0\end{array}\right)+
\left(\begin{array}{r}0\\ 0\\ 0\end{array}\right) \koz .
\end{equation}
Concerning matrix $C$, each variable instance $c(i,j,k)$ may denote a different value.
Nevertheless, all instances $c(i,j,k)$ for a fixed index pair $(i,j)$ can be regarded
as belonging to the same matrix item $c_{ij}$, since they all stem from the serialisation
of the sum operator for the calculation of $c_{ij}$. Thus, for matrix $C$,
following formula (\ref{datalayout1}) we may set
\begin{equation}\label{datalayout1c}
\left(\begin{array}{c}i\\ j\\ k\end{array}\right)=
\left(\begin{array}{rr}1&0\\ 0&1\\ 0&0\end{array}\right)\cdot
\left(\begin{array}{c}i\\ j\end{array}\right)+
k\cdot\left(\begin{array}{r}0\\ 0\\ 1\end{array}\right)+
\left(\begin{array}{r}0\\ 0\\ 0\end{array}\right) \koz .
\end{equation}
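All three layouts are instances of the generic formula (\ref{datalayout1}). As a cross-check, the following minimal sketch (plain Python, no libraries; the helper name \texttt{iteration\_vector} is ours, not part of any established library) evaluates $v = H \cdot w + \lambda \cdot q + p$ for the layouts of $A$ and $B$:

```python
def iteration_vector(H, w, lam, q, p):
    """v = H*w + lam*q + p, computed componentwise with plain lists/tuples."""
    return tuple(sum(H[r][c] * w[c] for c in range(len(w)))
                 + lam * q[r] + p[r] for r in range(len(q)))

p0 = (0, 0, 0)                                   # affine part, null here
# Layout of A: item a_{ik} spreads along direction q = (0,1,0).
print(iteration_vector([[1, 0], [0, 0], [0, 1]], (2, 3), 5, (0, 1, 0), p0))
# -> (2, 5, 3): the instance a(2,5,3) of item a_{23}
# Layout of B: item b_{kj} spreads along direction q = (1,0,0).
print(iteration_vector([[0, 0], [0, 1], [1, 0]], (4, 2), 7, (1, 0, 0), p0))
# -> (7, 2, 4): the instance b(7,2,4) of item b_{42}
```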
\index{iteration!vector|)}
\subsection{Snapshots of data structures}\index{snapshot|(}
Each of the three matrices $A, B, C$ is spanned by two directions
with regard to the \index{data structure index}\emph{data structure indices}:
along a row, and along a column. The difference vector (0,1) thereby describes a move from
an item to the next item of the same row, i.e., in the next column: $(0,1)=(x,y+1)-(x,y)$.
Correspondingly, the difference vector (1,0) stands for sliding from an item to the next item
in the same column and next row: $(1,0)=(x+1,y)-(x,y)$.
\index{input/output scheme}\emph{Input/output schemes} of the appearance shown in Figures
\ref{exonea}(a) and \ref{extwod} denote \inddef{snapshot}\ki{snapshots:}
all positions of data items depicted, with respect to the entire systolic array, are related to a common timestep.
As we can notice from Figure \ref{extwod}, the rectangular shapes of the abstract data structures
are mapped to parallelograms in the snapshot, due to the linearity of the applied space-time transformation.
These parallelograms can be described by difference vectors along their borders, too.
Next we will translate difference vectors $\Delta w$ from data structure vectors into
spatial difference vectors $\Delta z$ for the snapshot. To do so, by choosing the parameter
$\lambda$ in formula (\ref{datalayout1}), we pick a pair of iteration vectors $v,v'$ that are mapped
to the same timestep under our space-time transformation. For the moment it is not important
which concrete timestep we thereby get. Thus, we set up
\begin{equation}\label{datalayout2}
\pi \cdot v = \pi \cdot v'
\quad \text{with} \quad
v = H \cdot w + \lambda \cdot q + p
\quad \text{and} \quad
v' = H \cdot w' + \lambda' \cdot q + p \koz ,
\end{equation}
implying
\begin{equation}\label{datalayout3}
\pi \cdot H \cdot (w-w') + (\lambda - \lambda') \cdot \pi \cdot q = 0 \koz ,
\end{equation}
and thus
\begin{equation}\label{datalayout4}
\Delta \lambda = (\lambda - \lambda') = \frac{- \pi \cdot H \cdot (w-w')}{\pi \cdot q} \koz .
\end{equation}
Due to the linearity of all transformations used, the desired spatial difference vector $\Delta z$
hence follows from the difference vector of the data structure $\Delta w = w - w'$ as
\begin{equation}\label{datalayout5}
\Delta z = P \cdot \Delta v = P \cdot H \cdot \Delta w + \Delta \lambda \cdot P \cdot q \koz ,
\end{equation}
or
\begin{equation}\label{datalayout6}
\Delta z = P \cdot H \cdot \Delta w - \frac{\pi \cdot H \cdot \Delta w}{\pi \cdot q} \cdot P \cdot q \koz .
\end{equation}
With the aid of formula (\ref{datalayout6}), we now can determine the spatial difference vectors
$\Delta z$ for matrix $A$. As mentioned above, we have
\begin{equation*}
H = \left(\begin{array}{rr}1&0\\ 0&0\\ 0&1\end{array}\right),\quad
q = \left(\begin{array}{r}0\\ 1\\ 0\end{array}\right),\quad
P = \left(\begin{array}{rrr}
0 & -1 & 1 \\
-1 & 1 & 0
\end{array}\right),\quad
\pi = \left(\begin{array}{rrr}
1 & 1 & 1
\end{array}\right) \koz .
\end{equation*}
Noting $\pi \cdot q = 1$, we get
\begin{equation*}
\Delta z = \left(\begin{array}{rr}
0 & 1 \\
-1 & 0
\end{array}\right)
\cdot \Delta w + \Delta \lambda \cdot \left(\begin{array}{r} -1 \\ 1 \end{array}\right)
\quad \text{with} \quad
\Delta \lambda = - \left(\begin{array}{rr} 1 & 1 \end{array}\right) \cdot \Delta w \koz .
\end{equation*}
For the rows, we have the difference vector $\Delta w = (0,1)$, yielding the spatial difference vector
$\Delta z = (2,-1)$. Correspondingly, from $\Delta w = (1,0)$ for the columns we get $\Delta z = (1,-2)$.
If we check with Figure \ref{extwod}, we see that the rows of $A$ in fact run along
the vector $(2,-1)$, the columns along the vector $(1,-2)$.
Similarly, we get $\Delta z = (-1,2)$ for the rows of $B$,
and $\Delta z = (1,1)$ for the columns of $B$; as well as $\Delta z = (-2,1)$ for the rows of $C$,
and $\Delta z = (-1,-1)$ for the columns of $C$.
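These vectors can also be checked mechanically. The sketch below (plain Python; the helper names are ours) implements formula (\ref{datalayout6}) and reproduces the spatial difference vectors of all three matrices:

```python
def matvec(M, v):
    return tuple(sum(row[c] * v[c] for c in range(len(v))) for row in M)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

P, pi = [[0, -1, 1], [-1, 1, 0]], (1, 1, 1)

def delta_z(H, q, dw):
    """Formula (datalayout6): P*H*dw - (pi*H*dw)/(pi*q) * P*q."""
    Hdw = matvec(H, dw)
    dlam = -dot(pi, Hdw) / dot(pi, q)
    return tuple(a + dlam * b for a, b in zip(matvec(P, Hdw), matvec(P, q)))

layouts = {"A": ([[1, 0], [0, 0], [0, 1]], (0, 1, 0)),
           "B": ([[0, 0], [0, 1], [1, 0]], (1, 0, 0)),
           "C": ([[1, 0], [0, 1], [0, 0]], (0, 0, 1))}
for name, (H, q) in layouts.items():
    print(name, delta_z(H, q, (0, 1)), delta_z(H, q, (1, 0)))  # rows, columns
# A (2.0, -1.0) (1.0, -2.0)
# B (-1.0, 2.0) (1.0, 1.0)
# C (-2.0, 1.0) (-1.0, -1.0)
```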
With these tools, we are now able to reliably generate appropriate input/output
schemes---though separately for each matrix at the moment.\index{snapshot|)}
\subsection{Superposition of input/output schemes}\index{input/output scheme!superposition of|(}
Now, the shapes of the matrices $A,B,C$ for the snapshot have been fixed.
But we still have to adjust the matrices relative to the systolic array---and thus,
also relative to each other. Fortunately, there is a simple graphical method for this task.
We first choose an arbitrary iteration vector, say $v=(1,1,1)$. We map it with the projection
matrix $P$ to the cell where the corresponding calculation takes place,
\begin{equation*}
z = \left(\begin{array}{rrr}
0 & -1 & 1 \\
-1 & 1 & 0
\end{array}\right) \cdot
\left(\begin{array}{r} 1 \\ 1 \\ 1 \end{array}\right) =
\left(\begin{array}{r} 0 \\ 0 \end{array}\right) \koz .
\end{equation*}
The iteration vector (1,1,1) represents the calculations $a(1,1,1)$, $b(1,1,1)$, and $c(1,1,1)$;
these in turn correspond to the data items $a_{11}$, $b_{11}$, and $c_{11}$.
We now lay the input/output schemes for the matrices $A,B,C$ on the systolic array such that
the entries $a_{11}$, $b_{11}$, and $c_{11}$ are all located in cell $z=(0,0)$.
In principle, we would be done now. Unfortunately, our input/output schemes overlap
with the cells of the systolic array, and are therefore hard to perceive.
Thus, we simultaneously retract the input/output schemes of all matrices in counter-flow direction,
place by place, until there is no more overlap. With this method,
we get exactly the input/output scheme from Figure \ref{extwod}.
As an alternative to this nice graphical method, we also could formally calculate
an overlap-free placement of the various input/output schemes.
Only after specifying the input/output schemes can we correctly calculate
the number of timesteps effectively needed. The first relevant timestep starts
with the first input operation. The last relevant timestep ends with the last output of a result.
For the example, we determine from Figure \ref{extwod} the beginning of the calculation
with the input of the data item $b_{11}$ in timestep 0, and the end of the calculation
after output of the result $c_{35}$ in timestep 14. Altogether, we count 15
timesteps---five more than needed for the real calculations alone.
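The "five more" can be double-checked: every calculation at domain point $v=(i,j,k)$ happens in timestep $\pi \cdot v = i+j+k$. A short sketch in plain Python (with the problem parameters $N_1=3$, $N_2=5$, $N_3=4$ of the example) shows that the real calculations alone span only 10 timesteps:

```python
N1, N2, N3 = 3, 5, 4          # problem parameters of the running example
steps = {i + j + k            # timestep pi . v of each calculation
         for i in range(1, N1 + 1)
         for j in range(1, N2 + 1)
         for k in range(1, N3 + 1)}
span = max(steps) - min(steps) + 1
print(min(steps), max(steps), span)   # 3 12 10: ten calculation timesteps
print(15 - span)                      # 5 extra timesteps caused by input/output
```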
\index{input/output scheme!superposition of|)}
\subsection{Data rates induced by space-time transformations}\index{data rate|(}
The input schemes of the matrices $A$ and $B$ from Figure \ref{exonea}(a) have a dense layout:
if we drew the borders of the matrices shown in the figure, they would enclose no spare places.
Not so in Figure \ref{extwod}. In any \index{data stream!input}input data stream,
each data item is followed by two spare places there. For the input matrices this means:
the \index{cell!boundary}\emph{boundary cells} of the systolic array receive a proper data item
only every third timestep.
This property is a direct result of the employed space-time transformation.
In both examples, the abstract data structures themselves are dense. But how closely the various items
are packed in the input/output scheme depends on the absolute value
of the determinant of the transformation matrix $T$: in every
\index{data stream!input/output}input/output data stream,
the proper items follow each other with a spacing of exactly
$|\mathrm{det}(T)|$ places. Indeed $|\mathrm{det}(T)|=1$ for Figure \ref{exonea}; as for Figure
\ref{extwod}, we now can explain the loose spacing as a practical consequence of $|\mathrm{det}(T)|=3$.
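For the running example this is easy to verify: stacking $P$ and $\pi$ gives the transformation matrix $T$, whose determinant can be computed by cofactor expansion (plain Python sketch; the helper name is ours):

```python
T = [[0, -1, 1],     # first two rows: the projection matrix P
     [-1, 1, 0],
     [1, 1, 1]]      # last row: the time vector pi

def det3(M):
    """Determinant of a 3x3 matrix, expanded along the first row."""
    return (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
          - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
          + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))

print(abs(det3(T)))   # 3: proper items are spaced three places apart
```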
What to do with spare places like those in Figure \ref{extwod}? Although each cell of the systolic array
from Figure \ref{extwoa} in fact does useful work only every third timestep,
it would be nonsense to pause during two out of three timesteps.
Strictly speaking, we can argue that values on places marked with dots in Figure \ref{extwod}
have no influence on the calculation of the shown items $c_{ij}$, because they never reach
an active cell at the time of the calculation of a variable $c(i,j,k)$. Thus, we may simply fill
the spare places with arbitrary values, without any danger of disturbing the result.
It is even feasible to execute three different matrix products at the same time on the systolic array from
Figure \ref{extwoa}, without interference. This will be our topic in Subsection \ref{verschraenkt}.
\index{data rate|)}
\subsection{Input/output expansion}\label{eaerweitert}
\index{input/output expansion|(}\index{input/output scheme!extended|(}
When further studying Figure \ref{extwod}, we can identify another problem.
Check, for example, the itinerary of $c_{22}$ through the cells of the systolic array.
According to the space-time transformation, the calculations contributing
to the value of $c_{22}$ happen in the cells $(-1,0)$, $(0,0)$, $(1,0)$, and $(2,0)$.
But the input/output scheme from Figure \ref{extwod} tells us that $c_{22}$
first also passes through cell $(-2,0)$, and eventually visits cell $(3,0)$, too.
This may be interpreted as some \inddef{spurious operation}\ki{spurious calculations}
being introduced into the system (\ref{exmmult4}) by the used space-time transformation, here,
for example, at the new domain points (2,2,0) and (2,2,5).
The reason for this phenomenon is that
the \index{domain!of input/output operations}\emph{domains of the input/output operations}
are not parallel to the chosen \index{projection!direction}\emph{projection direction}.
Thus, some input/output operations are projected onto cells that do not belong
to the \inddef{systolic array!boundary of}\ki{boundary} of the systolic array.
But in the interior of the systolic array, no input/output operation can be performed directly.
The problem can be solved by extending the trajectory, in flow or counter flow direction,
from these inner cells up to the boundary of the systolic array.
But thereby we introduce some new calculations, and possibly also some new domain points.
This technique is called \inddef{input/output expansion}\ki{input/output expansion.}
We must prevent the additional calculations taking place in the cells (-2,0) and (3,0)
from corrupting the correct value of $c_{22}$. For the matrix product, this is quite
easy---though the general case is more difficult. The \index{generic operator}generic sum
operator has a neutral element, namely zero. Thus, if we can guarantee that
by the new calculations only zero is added, there will be no harm. All we have to do is
to always provide at least one zero operand to every
\index{spurious operation}spurious multiplication; this can be achieved by filling the appropriate
input slots with zero items.
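The effect is easy to see in miniature. In the sketch below (plain Python; the numeric values are invented for illustration), the two padded steps contribute nothing, because zero is neutral for the sum operator:

```python
def fma(c, a, b):
    return c + a * b                  # the cell's compound multiply-add

c = 0
c = fma(c, 0, 42)                     # spurious step, e.g. in cell (-2,0): zero item of A
for a, b in [(1, 2), (3, 4), (5, 6), (7, 8)]:
    c = fma(c, a, b)                  # the N3 = 4 proper calculation steps
c = fma(c, 0, 99)                     # spurious step, e.g. in cell (3,0)
print(c)   # 100 = 2 + 12 + 30 + 56, unchanged by the padded steps
```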
\epskepnagyfrag{figs/extwoe}{\label{extwoe} \abraindex{input/output scheme!extended}Extended
input/output scheme, correcting Figure \ref{extwod}.}{12.2cm}
Figure \ref{extwoe} shows an example of a properly
\index{input/output scheme!extended}extended input/output scheme. Preceding and following the items
of matrix $A$, the necessary zero items have been filled in. Since the entered zeroes count
as data items, the input/output scheme from Figure \ref{extwod} has been retracted again
by one place. The calculation now begins already in timestep $-1$, but ends as
before with timestep 14. Thus we need 16 timesteps altogether.
\index{input/output expansion|)}\index{input/output scheme!extended|)}
\subsection{Coping with stationary variables}\index{stationary!variable|(}
Let us come back to the example from Figure \ref{exonea}(a). For inputting the items of matrices
$A$ and $B$, no expansion is required, since these items are always used
in \index{cell!boundary}\emph{boundary cells} first. But not so with matrix $C$!
The items of $C$ are calculated in stationary variables, hence always in the same cell.
Thus most results $c_{ij}$ are produced in inner cells of the systolic array,
from where they have to be moved---in a separate action---to \emph{boundary cells}
of the systolic array.
Although this new challenge, on the face of it, appears very similar to the problem from
Subsection \ref{eaerweitert}, and thus very easy to solve, the situation here is in fact completely
different. It is not sufficient to extend existing data flows forward or backward
up to the boundary of the systolic array. Since for stationary variables the dependence vector
is projected to the null vector, which constitutes no extensible direction,
there can be no spatial flow induced by this dependence. Possibly we can construct some
auxiliary extraction paths, but usually there are many degrees of freedom. Moreover, we then need
a \emph{control mechanism} inside the cells. For all these reasons, the problem is treated
further in Section \ref{Steuerung}.\index{stationary!variable|)}
\subsection{Interleaving of calculations}\label{verschraenkt}\index{interleaving|(}
As can be easily noticed, the \index{utilisation}\emph{utilisation} of the systolic array
from Figure \ref{extwoa} with input/output scheme from Figure \ref{extwoe} is quite poor.
Even without any deeper study of the starting phase and the closing phase,
we cannot ignore that the average utilisation of the array is below one third---after all,
each cell makes a proper contribution to the calculation in at most every third timestep.
A simple technique to improve this behaviour is to \inddef{interleaving}\ki{interleave}
calculations. If we have three independent matrix products,
we can successively input their respective data, delayed by only one timestep,
without any changes to the systolic array or its cells. Figure \ref{extwof} shows
a snapshot of the systolic array, with parts of the corresponding input/output scheme.
Now we must check by a formal derivation whether this idea really works.
To this end, we slightly modify system (\ref{exmmult4}). We augment the variables
and the domains by a fourth dimension, needed to distinguish the three matrix products:
{\small
\begin{equation}\label{exmmult5}
\begin{array}{@{}ll@{}}
\multicolumn{2}{@{}l@{}}{\textit{input operations}} \\[1ex]
a(i,j,k,l) = a_{ik}^l & 1 \le i \le N_1, j = 0, 1 \le k \le N_3, 1 \le l \le 3 \koz , \\[1ex]
b(i,j,k,l) = b_{kj}^l & i = 0, 1 \le j \le N_2, 1 \le k \le N_3, 1 \le l \le 3 \koz , \\[1ex]
c(i,j,k,l) = 0 & 1 \le i \le N_1, 1 \le j \le N_2, k = 0, 1 \le l \le 3 \koz . \\[3ex]
\multicolumn{2}{@{}l@{}}{\textit{calculations and forwarding}} \\[1ex]
a(i,j,k,l) = a(i,j-1,k,l) & 1 \le i \le N_1, 1 \le j \le N_2, 1 \le k \le N_3, 1 \le l \le 3 \koz , \\[1ex]
b(i,j,k,l) = b(i-1,j,k,l) & 1 \le i \le N_1, 1 \le j \le N_2, 1 \le k \le N_3, 1 \le l \le 3 \koz , \\[1ex]
\begin{array}{@{}l@{}}
c(i,j,k,l) = c(i,j,k-1,l) \\[0.5ex]
\quad {} + a(i,j-1,k,l) \\[0.5ex]
\quad {} \cdot b(i-1,j,k,l) \\
\end{array} & 1 \le i \le N_1, 1 \le j \le N_2, 1 \le k \le N_3, 1 \le l \le 3 \koz . \\[6.5ex]
\multicolumn{2}{@{}l@{}}{\textit{output operations}} \\[1ex]
c_{ij}^l = c(i,j,k,l) & 1 \le i \le N_1, 1 \le j \le N_2, k = N_3, 1 \le l \le 3 \koz .
\end{array}\normalsize
\end{equation}}
\epskepnagyfrag{figs/extwof}{\label{extwof} \abraindex{interleaving}Interleaved calculation of three
matrix products on the systolic array from Figure \ref{extwoa}.}{10.3cm}
Obviously, in system (\ref{exmmult5}), problems with different values of $l$ are not related.
Now we must preserve this property in the systolic array. A suitable
\index{space-time!matrix}\emph{space-time matrix} would be
\begin{equation}\label{raumzeit4}
T = \left(\begin{array}{rrrr}
0 & -1 & 1 & 0 \\
-1 & 1 & 0 & 0 \\
1 & 1 & 1 & 1
\end{array}\right) \koz .
\end{equation}
Notice that $T$ is not square here. But for calculating the space coordinates,
the fourth dimension of the \index{iteration!vector}iteration vector is completely irrelevant,
and thus can simply be neutralised by corresponding zero entries in the fourth column
of the first and second rows of $T$.
The last row of $T$ again constitutes the \index{time vector}\emph{time vector} $\pi$.
Appropriate choice of $\pi$ embeds the three problems to solve into the space-time continuum,
avoiding any intersection. Corresponding instances of the iteration vectors of the three problems
are projected to the same cell with a respective spacing of one timestep, because the fourth entry of $\pi$ equals 1.
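This claim can be checked numerically. In the following sketch (plain Python; the helper name is ours), applying the non-square $T$ from (\ref{raumzeit4}) to iteration vectors that differ only in $l$ yields identical cells with timesteps spaced one apart:

```python
T = [[0, -1, 1, 0],
     [-1, 1, 0, 0],
     [1, 1, 1, 1]]

def space_time(v):
    """Map a 4-dimensional iteration vector to (cell, timestep)."""
    z1, z2, t = (sum(r[c] * v[c] for c in range(4)) for r in T)
    return (z1, z2), t

for l in (1, 2, 3):
    print(space_time((2, 3, 1, l)))
# ((-2, 1), 7), ((-2, 1), 8), ((-2, 1), 9): same cell, one timestep apart
```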
\index{utilisation|(}Finally, we calculate the average \emph{utilisation}---with or without
interleaving---for the concrete problem parameters $N_1=3$, $N_2=5$, and $N_3=4$.
For a single matrix product, we have to perform $N_1 \cdot N_2 \cdot N_3=60$ calculations,
considering a multiplication and a corresponding addition as a \index{compound operation}compound
operation, i.e., counting both together as only one calculation; input/output operations
are not counted at all. The systolic array has 36 cells.
Without interleaving, our systolic array altogether takes 16 timesteps for calculating
a single matrix product, resulting in an average utilisation of $60 / (16 \cdot 36) \approx 0.104$
calculations per timestep and cell. When applying the described interleaving technique,
the calculation of all three matrix products needs only two timesteps more,
i.e., 18 timesteps altogether. But the number of calculations performed thereby has tripled,
so we get an average utilisation of the cells amounting to $3 \cdot 60 / (18 \cdot 36) \approx 0.278$
calculations per timestep and cell. Thus, by interleaving, we were able to raise
the utilisation of the cells to 267 per cent of its former value!
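The arithmetic behind these utilisation figures, restated as a plain Python sketch:

```python
N1, N2, N3, cells = 3, 5, 4, 36
calcs = N1 * N2 * N3                     # 60 compound operations per product
u_single = calcs / (16 * cells)          # one product in 16 timesteps
u_triple = 3 * calcs / (18 * cells)      # three interleaved products in 18 timesteps
print(round(u_single, 3))                # 0.104
print(round(u_triple, 3))                # 0.278
print(round(100 * u_triple / u_single))  # 267 per cent of the former utilisation
```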
\begin{gyak}
\ujgyak
From equation (\ref{datalayout6}), formally derive the spatial difference vectors of
matrices $B$ and $C$ for the input/output scheme shown in Figure \ref{extwod}.
\ujgyak
Augmenting Figure \ref{extwod}, draw an extended input/output scheme that forces
both operands of all spurious multiplications to zero.
\ujgyak
Apply the techniques presented in Section \ref{easchemata} to the systolic array from Figure \ref{exonea}.
\hardgyak
Prove the properties claimed in Subsection \ref{verschraenkt} for the special space-time
transformation (\ref{raumzeit4}) with respect to system (\ref{exmmult5}).
\end{gyak}\index{utilisation|)}\index{interleaving|)}\index{input/output scheme|)}
\section{Control}\label{Steuerung}\label{sysex4}
\epskepnagyfrag{figs/exthreea}{\label{exthreea} \abraindex{reset}Resetting registers
via \abraindex{cell!with global control}global control. \textbf{(a)} Array structure.
\textbf{(b)} Cell structure.}{11.6cm}
So far we have assumed that each cell of a systolic array behaves in completely
the same way during every timestep. Admittedly there are some relevant examples of such
systolic arrays. However, in general the cells successively have to work in
several \inddef{operation mode}\ki{operation modes,} switched by some control mechanism.
In the sequel, we study some typical situations for exerting control.
\subsection{Cells without control}\index{cell!without control|(}
The cell from Figure \ref{extwoa}(b) contains the registers \textit{A}, \textit{B}, and \textit{C},
which---when activated by the global \index{clock signal}clock signal---accept the data
applied to their inputs and then reliably reproduce these values at their outputs
for one clock cycle. Apart from this system-wide activity, the function calculated
by the cell is invariant for all timesteps: a \emph{fused multiply-add} operation
is applied to the three input operands $A$, $B$, and $C$, with result passed
to a neighbouring cell; during the same cycle, the operands $A$ and $B$ are also forwarded
to two other neighbouring cells. So in this case, the \inddef{cell!without control}cell
needs \emph{no control} at all.
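The timestep behaviour of such an uncontrolled cell can be summarised in a one-line sketch (plain Python; the function name is our own, chosen for illustration):

```python
def cell_step(a_in, b_in, c_in):
    """One timestep of the uncontrolled cell: a fused multiply-add on the
    three inputs, with A and B forwarded unchanged to the neighbours."""
    return a_in, b_in, c_in + a_in * b_in

print(cell_step(2, 3, 4))   # (2, 3, 10)
```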
The \emph{initial values} $c(i,j,0)$ for the execution of the generic sum operator---which
could also be different from zero here---are provided to the systolic array via
the \index{input stream}\emph{input streams}, see Figure \ref{extwoe}; the \emph{final results}
$c(i,j,N_3)$ continue to flow in the same direction up to the boundary of the array. Therefore,
the input/output activities for the cell from Figure \ref{extwoa}(b) constitute an intrinsic part
of the normal cell function. The price to pay for this extremely simple cell function
without any control is a restriction in all three dimensions of the matrices:
on a systolic array like that from Figure \ref{extwoa}, with \emph{fixed}
\index{systolic array!parameter of}array parameters $N_1,N_2,N_3$, an $M_1 \times M_3$ matrix
$A$ can only be multiplied by an $M_3 \times M_2$ matrix $B$ if the relations $M_1 \le N_1$,
$M_2 \le N_2$, and $M_3 \le N_3$ hold.
\index{cell!without control|)}
\subsection{Global control}\index{cell!with global control|(}
In this respect, constraints for the array from Figure \ref{exonea} are not so restrictive:
though the \index{problem parameter}problem parameters $M_1$ and $M_2$ are also bounded
by $M_1 \le N_1$ and $M_2 \le N_2$, there is no constraint for $M_3$.
Problem parameters unconstrained in spite of fixed array parameters can only emerge in time
but not in space, thus mandating the use of \index{stationary!variable}\emph{stationary variables}.
Before a new calculation can start, each register assigned to a stationary variable
has to be \emph{reset} to an initial state independent from the previously performed calculations.
For instance, concerning the systolic cell from Figure \ref{extwoa}(b), this should be the case
for register \textit{C}. By a \inddef{cell!with global control}\emph{global} signal similar
to the clock, register \textit{C} can be cleared in all cells at the same time,
i.e., reset to a zero value. To prevent a corruption of the reset by the current values
of \textit{A} or \textit{B}, at least one of the registers \textit{A} or \textit{B} must
be \index{clear}\emph{cleared} at the same time, too. Figure \ref{exthreea} shows an array
structure and a cell structure implementing this idea.\index{cell!with global control|)}
\subsection{Local control}\label{lokalgesteuert}\index{cell!with local control|(}
\epskepnagyfrag{figs/exthreeb}{\label{exthreeb} \abraindex{input/output scheme}Output scheme
with \abraindex{output!delayed}delayed output of results.}{3.9cm}
Unfortunately, for the matrix product the principle of global control is not
sufficient without further measures, for the systolic array presented in Figure \ref{exonea}
lacks another essential property: the results $c_{ij}$ are not passed to the boundary but stay in the cells.
At first sight, it seems quite simple to forward the results to the boundary:
when the calculation of an item $c_{ij}$ is finished, the links from cell $(i,j)$
to the neighbouring cells $(i,j+1)$ and $(i+1,j)$ are no longer needed to forward items
of the matrices $A$ and $B$. These links can be reused then for any other purpose.
For example, we could pass all items of $C$ through the downward-directed links
to the lower border of the systolic array.
But it turns out that leading through results from the upper cells is hampered
by ongoing calculations in the lower parts of the array. If the result $c_{ij}$,
finished in timestep $i+j+N_3$, were passed to cell $(i+1,j)$ in the next timestep,
a conflict would arise between two values: since only one value per timestep
can be sent from cell $(i+1,j)$ via the lower port, we would be forced to hold back either
$c_{ij}$ or $c_{i+1\;j}$, the result just finished in cell $(i+1,j)$. This effect would
spread downward over all cells.
To fix the problem, we could \inddef{slow down}\ki{slow down} the forwarding of
items $c_{ij}$. If it took two timesteps for $c_{ij}$ to pass a cell, no
collisions could occur. The results then stage a procession through the same link,
each separated from the next by one timestep. From the lower boundary cell of a column,
the \index{host computer}host computer first receives the result of the bottom row,
then that of the penultimate row; this procedure continues until eventually we see the
result of the top row. Thus we get the output scheme shown in Figure \ref{exthreeb}.
How can a cell recognise when to change from forwarding items of matrix $B$ to passing
items of matrix $C$ through the lower port? We can solve this task by an automaton
combining \index{control!local/global}global control with local control in the cell:
If we send a global signal to all cells at exactly the moment when the last items of $A$ and
$B$ are input to cell $(1,1)$, each cell can start a countdown process: in each successive timestep,
we decrement a counter initially set to the number of the remaining calculation steps.
At that moment, cell $(i,j)$ still has to perform $i+j-1$ calculations before changing
to \ki{propagation mode.}\inddef{propagation mode} Later, the already mentioned
global reset signal switches the cell back to
\ki{calculation mode.}\inddef{calculation mode}\index{mode!calculation}\index{mode!propagation}
Figure \ref{exthreec} presents a systolic array implementing this local/global principle.
Basically, the array structure and the communication topology have been preserved.
But each cell can run in one of two states now, switched by a \emph{control logic}:
\epskepnagyfrag{figs/exthreec}{\label{exthreec} Combined \abraindex{control!local/global}local/global
control. \textbf{(a)} Array structure. \textbf{(b)} Cell structure.}{12cm}
\begin{enumerate}
\item
In \textit{calculation mode}, as before, the result of the addition is written
to register \textit{C}. At the same time, the value in register \textit{B}---i.e.,
the operand used for the multiplication---is forwarded through the lower port of the cell.
\item
In \textit{propagation mode}, registers \textit{B} and \textit{C} are connected in series.
In this mode, the only function of the cell is to guide each value received at the upper
port down to the lower port, thereby enforcing a \index{delay}\emph{delay} of two timesteps.
\end{enumerate}
The first value output from cell $(i,j)$ in \emph{propagation mode}
is the currently calculated value $c_{ij}$, stored in register \textit{C}.
All further output values are results forwarded from cells above. A formal description of the
algorithm implemented in Figure \ref{exthreec} is given by the assignment-free system (\ref{exmmult6}).
{\small
\begin{equation}\label{exmmult6}
\begin{array}{@{}ll@{}}
\multicolumn{2}{@{}l@{}}{\textit{input operations}} \\[1ex]
a(i,j,k) = a_{ik} & 1 \le i \le N_1, j = 0, 1 \le k \le N_3 \koz, \\[1ex]
b(i,j,k) = b_{kj} & i = 0, 1 \le j \le N_2, 1 \le k \le N_3 \koz , \\[1ex]
c(i,j,k) = 0 & 1 \le i \le N_1, 1 \le j \le N_2, k = 0 \koz . \\[3ex]
\multicolumn{2}{@{}l@{}}{\textit{calculations and forwarding}} \\[1ex]
a(i,j,k) = a(i,j-1,k) & 1 \le i \le N_1, 1 \le j \le N_2, 1 \le k \le N_3 \koz , \\[1ex]
b(i,j,k) = b(i-1,j,k) & 1 \le i \le N_1, 1 \le j \le N_2, 1 \le k \le N_3 \koz , \\[1ex]
\begin{array}{@{}l@{}}
c(i,j,k) = c(i,j,k-1) \\[0.5ex]
\quad {} + a(i,j-1,k) \\[0.5ex]
\quad {} \cdot b(i-1,j,k) \\
\end{array} & 1 \le i \le N_1, 1 \le j \le N_2, 1 \le k \le N_3 \koz . \\[6.5ex]
\multicolumn{2}{@{}l@{}}{\textit{propagation}} \\[1ex]
b(i,j,k) = c(i,j,k-1) & 1 \le i \le N_1, 1 \le j \le N_2, 1+N_3 \le k \le i+N_3 \koz , \\[1ex]
c(i,j,k) = b(i-1,j,k) & 1 \le i \le N_1, 1 \le j \le N_2, 1+N_3 \le k \le i-1+N_3 \koz , \\[3ex]
\multicolumn{2}{@{}l@{}}{\textit{output operations}} \\[1ex]
c_{1+N_1+N_3-k,j} = b(i,j,k) & i = N_1, 1 \le j \le N_2, 1 + N_3 \le k \le N_1 + N_3 \koz .
\end{array}\normalsize
\end{equation}}
It remains to explain how the control signals in a cell are generated in this model. As a prerequisite,
the cell must contain a \inddef{state flip-flop}\ki{state flip-flop} indicating the
current \index{operation mode}\emph{operation mode}. The output of this flip-flop is
connected to the control inputs of both multiplexors, see Figure \ref{exthreec}(b).
The global reset signal clears the state flip-flop, as well as the registers \textit{A}
and \textit{C}: the cell now works in \emph{calculation mode}.
The global ready signal starts the countdown in all cells, so in every timestep
the counter is diminished by 1. The counter is initially set to the precalculated
value $i+j-1$, dependent on the position of the cell. When the counter reaches zero,
the flip-flop is set: the cell switches to \emph{propagation mode}.
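The interplay of the countdown, the state flip-flop, and the two modes can be sketched behaviourally (plain Python; the names, and the simplification that the countdown starts with the cell's first step rather than with the global ready signal, are our own assumptions, abstracting from register-level timing):

```python
class Cell:
    def __init__(self, i, j):
        self.B = self.C = 0
        self.countdown = i + j - 1     # remaining calculation steps
        self.propagate = False         # the state flip-flop

    def step(self, a_in, b_in):
        """One timestep; returns the value emitted through the lower port."""
        if not self.propagate:
            out = self.B               # calculation mode: forward old B downward
            self.C += a_in * b_in      # fused multiply-add into register C
            self.B = b_in
            self.countdown -= 1
            if self.countdown == 0:
                self.propagate = True  # flip-flop set: switch modes
            return out
        out = self.C                   # propagation mode: lower port emits C
        self.C = self.B                # B and C connected in series,
        self.B = b_in                  # giving a delay of two timesteps
        return out

cell = Cell(1, 1)                      # cell (1,1): a single calculation step
print([cell.step(a, b) for a, b in [(2, 3), (0, 5), (0, 9)]])
# [0, 6, 3]: the finished c_11 = 6 is the first value output in propagation mode
```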
If we refrain from a direct reset of register \textit{C}, the last value passed
from register \textit{B} to register \textit{C} of a cell before the reset can be used
as a freely choosable initial value for the next dot product to be evaluated in the cell.
We then even calculate, as already in the systolic array from Figure \ref{extwoa}, the more general problem
\begin{equation}\label{matrixproduktallgemein}
C = A \cdot B + D \koz ,
\end{equation}
detailed by the following equation system:
\vspace{-4mm}
{\small
\begin{equation}\label{exmmult7}
\begin{array}{@{}ll@{}}
\multicolumn{2}{@{}l@{}}{\textit{input operations}} \\[1ex]
a(i,j,k) = a_{ik} & 1 \le i \le N_1, j = 0, 1 \le k \le N_3 \koz , \\[1ex]
b(i,j,k) = b_{kj} & i = 0, 1 \le j \le N_2, 1 \le k \le N_3 \koz , \\[1ex]
c(i,j,k) = d_{ij} & 1 \le i \le N_1, 1 \le j \le N_2, k = 0 \koz . \\[3ex]
\multicolumn{2}{@{}l@{}}{\textit{calculations and forwarding}} \\[1ex]
a(i,j,k) = a(i,j-1,k) & 1 \le i \le N_1, 1 \le j \le N_2, 1 \le k \le N_3 \koz , \\[1ex]
b(i,j,k) = b(i-1,j,k) & 1 \le i \le N_1, 1 \le j \le N_2, 1 \le k \le N_3 \koz , \\[1ex]
\begin{array}{@{}l@{}}
c(i,j,k) = c(i,j,k-1) \\[0.5ex]
\quad {} + a(i,j-1,k) \\[0.5ex]
\quad {} \cdot b(i-1,j,k) \\
\end{array} & 1 \le i \le N_1, 1 \le j \le N_2, 1 \le k \le N_3 \koz . \\[6.5ex]
\multicolumn{2}{@{}l@{}}{\textit{propagation}} \\[1ex]
b(i,j,k) = c(i,j,k-1) & 1 \le i \le N_1, 1 \le j \le N_2, 1+N_3 \le k \le i+N_3 \koz , \\[1ex]
c(i,j,k) = b(i-1,j,k) & 1 \le i \le N_1, 1 \le j \le N_2, 1+N_3 \le k \le i-1+N_3 \koz . \\[3ex]
\multicolumn{2}{@{}l@{}}{\textit{output operations}} \\[1ex]
c_{1+N_1+N_3-k,j} = b(i,j,k) & i = N_1, 1 \le j \le N_2, 1 + N_3 \le k \le N_1 + N_3 \koz .
\end{array}\normalsize
\end{equation}}
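As a plausibility check, system (\ref{exmmult7}) can be executed literally as a recurrence. The following Python sketch (an illustration only; the function name is ours) evaluates the equations in an order compatible with the data dependences, yielding $C = A \cdot B + D$:

```python
def simulate_exmmult7(A, B, D):
    """Evaluate the recurrence equations of system (exmmult7) literally.
    A is N1 x N3, B is N3 x N2, D is N1 x N2; the result is C = A*B + D."""
    N1, N3, N2 = len(A), len(B), len(B[0])
    a, b, c = {}, {}, {}
    for i in range(1, N1 + 1):                 # input operations
        for k in range(1, N3 + 1):
            a[i, 0, k] = A[i - 1][k - 1]
    for j in range(1, N2 + 1):
        for k in range(1, N3 + 1):
            b[0, j, k] = B[k - 1][j - 1]
    for i in range(1, N1 + 1):
        for j in range(1, N2 + 1):
            c[i, j, 0] = D[i - 1][j - 1]
    C = [[None] * N2 for _ in range(N1)]
    for k in range(1, N1 + N3 + 1):            # k outermost: c depends on k-1
        for i in range(1, N1 + 1):             # i ascending: b, c depend on i-1
            for j in range(1, N2 + 1):         # j ascending: a depends on j-1
                if k <= N3:                    # calculations and forwarding
                    a[i, j, k] = a[i, j - 1, k]
                    b[i, j, k] = b[i - 1, j, k]
                    c[i, j, k] = c[i, j, k - 1] + a[i, j - 1, k] * b[i - 1, j, k]
                else:                          # propagation
                    if k <= i - 1 + N3:
                        c[i, j, k] = b[i - 1, j, k]
                    if k <= i + N3:
                        b[i, j, k] = c[i, j, k - 1]
                    if i == N1:                # output operations
                        C[N1 + N3 - k][j - 1] = b[i, j, k]
    return C
```

Note that the propagation equations are only evaluated where their quantifications hold; exactly there the referenced instances exist, confirming the consistency of the equation system.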
\index{cell!with local control|)}
\subsection{Distributed control}\label{verteiltgesteuert}\index{cell!with distributed control|(}
The method sketched in Figure \ref{exthreec} still has the following drawbacks:
\begin{enumerate}
\item
The systolic array uses global control signals, requiring a high technical accuracy.
\item
Each cell needs a counter with counting register, introducing a considerable hardware expense.
\item
The initial value of the counter varies between the cells. Thus, each cell must be
individually designed and implemented.
\item
The input data of any successive problem must wait outside the cells until all results
from the current problem have left the systolic array.
\end{enumerate}
These disadvantages can be avoided, if \index{control signal!propagated}\emph{control signals}
are \emph{propagated like data}---meaning a \inddef{cell!with distributed control}\ki{distributed control.}
To this end, we preserve the connections of the registers \textit{B} and \textit{C}
with the multiplexors from Figure \ref{exthreec}(b), but do not generate any control signals in the cells;
also, there will be no global reset signal. Instead, a cell receives the necessary control signal
from one of the neighbours, stores it in a new one-bit register \textit{S},
and appropriately forwards it to further neighbouring cells.
The primary control signals are generated by the \index{host computer}\emph{host computer},
and infused into the systolic array by \index{cell!boundary}\emph{boundary cells}, only.
Figure \ref{exthreed}(a) shows the required array structure, Figure \ref{exthreed}(b) the modified cell structure.
Switching to the \emph{propagation mode} occurs successively down each column, one cell at a time,
delayed by one timestep per cell. The delay introduced by register \textit{S} is therefore sufficient.
Reset to the \emph{calculation mode} is performed via the same control wire,
and thus also happens with a delay of one timestep per cell. But since the results $c_{ij}$
sink down at only half speed, we have to delay the reset sufficiently: if a cell is
switched to \emph{calculation mode} in timestep $t$, it goes to \emph{propagation mode}
in timestep $t+N_3$, and is reset back to \emph{calculation mode} in timestep $t+N_1+N_3$.
We thus see that in a systolic array, distributed control induces a macroscopic timing
behaviour different from that of local or global control. Whereas the systolic array from
Figure \ref{exthreed} can start the calculation of a new problem (\ref{matrixproduktallgemein})
every $N_1+N_3$ timesteps, the systolic array from Figure \ref{exthreec} must wait for
$2 \cdot N_1+ N_2 + N_3 - 2$ timesteps. This time interval between successive problem starts, $N_1+N_3$ or
$2 \cdot N_1+ N_2 + N_3 - 2$, respectively, is called the \inddef{period}\ki{period,} its reciprocal being
the \inddef{throughput}\ki{throughput.}
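In code, the two period formulas from the text read as follows (the helper names are ours):

```python
def period_distributed(N1, N2, N3):
    # systolic array from Figure exthreed (distributed control)
    return N1 + N3

def period_local(N1, N2, N3):
    # systolic array from Figure exthreec (local control via counters)
    return 2 * N1 + N2 + N3 - 2

def throughput(period):
    # reciprocal of the period: problems started per timestep
    return 1.0 / period
```

For $N_1 = N_2 = N_3 = 100$, the distributed variant starts a new problem every 200 timesteps, the variant with local control only every 398 timesteps.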
The system formed by (\ref{exmmult8a}) and (\ref{exmmult8b}), split into two parts
for typesetting reasons only, formally describes the relations between distributed control
and calculations. We thereby assume an infinite, densely packed sequence of matrix product
problems, the additional iteration variable $l$ being unbounded. The equation
headed \emph{variables with alias} states nothing but pure identity relations.
{\small
\begin{equation}\label{exmmult8a}
\begin{array}{@{}l@{}l@{}}
\multicolumn{2}{@{}l@{}}{\textit{control}} \\[1ex]
s(i,j,k,l) = 0 & i = 0, 1 \le j \le N_2, 1 \le k \le N_3 \koz , \\[1ex]
s(i,j,k,l) = 1 & i = 0, 1 \le j \le N_2, 1 + N_3 \le k \le N_1+N_3 \koz , \\[1ex]
s(i,j,k,l) = s(i-1,j,k,l) & 1 \le i \le N_1, 1 \le j \le N_2, 1 \le k \le N_1+N_3 \koz . \\[3ex]
\multicolumn{2}{@{}l@{}}{\textit{input operations}} \\[1ex]
a(i,j,k,l) = a_{ik}^l & 1 \le i \le N_1, j = 0, 1 \le k \le N_3 \koz , \\[1ex]
b(i,j,k,l) = b_{kj}^l & i = 0, 1 \le j \le N_2, 1 \le k \le N_3 \koz , \\[1ex]
b(i,j,k,l) = d_{N_1+N_3+1-k,j}^{l+1} & i = 0, 1 \le j \le N_2, 1 + N_3 \le k \le N_1 + N_3 \koz . \\[3ex]
\multicolumn{2}{@{}l@{}}{\textit{variables with alias}} \\[1ex]
c(i,j,k,l) = c(i,j,N_1+N_3,l-1) & 1 \le i \le N_1, 1 \le j \le N_2, k = 0 \koz . \\[3ex]
\end{array}\normalsize
\end{equation}}
{\small
\begin{equation}\label{exmmult8b}
\begin{array}{@{}l@{}l@{}}
\multicolumn{2}{@{}l@{}}{\textit{calculations and forwarding}} \\[1ex]
a(i,j,k,l) = a(i,j-1,k,l) & 1 \le i \le N_1, 1 \le j \le N_2, 1 \le k \le N_1 + N_3 \koz , \\[1ex]
b(i,j,k,l) = \left\{\begin{array}{@{}l}
b(i-1,j,k,l), \\[0.5ex]
\text{if $s(i-1,j,k,l) = 0$} \\[1ex]
c(i,j,k-1,l), \\[0.5ex]
\text{if $s(i-1,j,k,l) = 1$} \\
\end{array}\right. & 1 \le i \le N_1, 1 \le j \le N_2, 1 \le k \le N_1 + N_3 \koz , \\[6.25ex]
c(i,j,k,l) = \left\{\begin{array}{@{}l}
c(i,j,k-1,l) \\[0.5ex]
{} + a(i,j-1,k,l) \\[0.5ex]
{} \cdot b(i-1,j,k,l), \\[0.5ex]
\text{if $s(i-1,j,k,l) = 0$} \\[1ex]
b(i-1,j,k,l), \\[0.5ex]
\text{if $s(i-1,j,k,l) = 1$} \\
\end{array}\right. & 1 \le i \le N_1, 1 \le j \le N_2, 1 \le k \le N_1 + N_3 \koz . \\[11.5ex]
\multicolumn{2}{@{}l@{}}{\textit{output operations}} \\[1ex]
c_{1+N_1+N_3-k,j}^l = b(i,j,k,l) & i = N_1, 1 \le j \le N_2, 1 + N_3 \le k \le N_1 + N_3 \koz .
\end{array}\normalsize
\end{equation}}
\epskepnagyfrag{figs/exthreed}{\label{exthreed} \abraindex{matrix product}Matrix product on
a \abraindex{systolic array!rectangular}rectangular systolic array, with output of results
and \abraindex{cell!with distributed control}distributed control.
\textbf{(a)} \abraindex{systolic array!structure of}Array structure.
\textbf{(b)} \abraindex{cell!structure of}Cell structure.}{12cm}
Formula (\ref{raumzeit5}) shows the corresponding space-time matrix. Note that one entry
of $T$ is not constant but depends on the \index{problem parameter}\emph{problem parameters}:
\begin{equation}\label{raumzeit5}
T = \left(\begin{array}{cccc}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
1 & 1 & 1 & N_1+N_3
\end{array}\right) \koz .
\end{equation}
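The parameter-dependent entry shows up in the time component of the mapping: by the last row of $T$, iteration point $(i,j,k)$ of problem instance $l$ is executed at timestep $t = i+j+k+(N_1+N_3) \cdot l$. A small sketch (hypothetical helper name):

```python
def timestep(i, j, k, l, N1, N3):
    # last row of T from (raumzeit5): t = i + j + k + (N1 + N3) * l
    return i + j + k + (N1 + N3) * l
```

Successive problem instances are thus offset by exactly the period $N_1+N_3$.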
Interestingly, the cells in a row also switch one timestep later for each position moved to the
right. Sacrificing some regularity, we could use this circumstance to relieve
the \index{host computer}\emph{host computer} by applying control to the systolic array at cell
(1,1), only. We therefore would have to change the control scheme in the following way:
{\small
\begin{equation}\label{exmmult9}
\begin{array}{@{}ll@{}}
\multicolumn{2}{@{}l@{}}{\textit{control}} \\[1ex]
s(i,j,k,l) = 0 & i = 1, j = 0, 1 \le k \le N_3 \koz , \\[1ex]
s(i,j,k,l) = 1 & i = 1, j = 0, 1 + N_3 \le k \le N_1+N_3 \koz , \\[1ex]
s(i,j,k,l) = s(i-1,j,k,l) & 1 \le i \le N_1, 1 \le j \le N_2, 1 \le k \le N_1+N_3 \koz , \\[2ex]
\multicolumn{2}{@{}l@{}}{\textit{\ldots}} \\[2ex]
\multicolumn{2}{@{}l@{}}{\textit{variables with alias}} \\[1ex]
s(i,j,k,l) = s(i+1,j-1,k,l) & i = 0, 1 \le j \le N_2, 1 \le k \le N_1+N_3 \koz , \\[2ex]
\multicolumn{2}{@{}l@{}}{\textit{\ldots}}
\end{array}\normalsize
\end{equation}}
\epskepnagyfrag{figs/exthreee}{\label{exthreee} Matrix product on a rectangular systolic array,
with output of results and distributed control. \textbf{(a)} Array structure.
\textbf{(b)} Cell on the upper border. \textbf{(c)} Regular cell.}{12cm}
Figure \ref{exthreee} shows the result of this modification. We now need cells of two kinds:
cells on the upper border of the systolic array must be like that in
Figure \ref{exthreee}(b); all other cells would be as before, see Figure \ref{exthreee}(c).
Moreover, the \index{communication topology!of systolic array}\emph{communication topology}
on the upper border of the systolic array would be slightly different from that in the regular area.
\index{cell!with distributed control|)}
\subsection{The cell program as a local view}\index{cell!program|(}
The chosen \index{space-time!transformation}\emph{space-time transformation} widely determines
the architecture of the systolic array. Mapping recurrence equations to space-time
coordinates yields an explicit view of the \emph{geometric properties} of the systolic
array, but gives no real insight into the \emph{function of the cells}. In contrast,
the processes performed inside a cell can be directly expressed by a
\inddef{cell!program}\ki{cell program.} This approach is particularly of interest if dealing
with a \index{systolic array!programmable}\emph{programmable systolic array},
consisting of cells indeed controlled by a repetitive program.
Like the \inddef{global view}\ki{global view,} i.e., the
\index{systolic array!structure of}\emph{structure} of the systolic array,
the \inddef{local view}\ki{local view} given by a \emph{cell program} in fact
is already fixed by the space-time transformation. But this local view is only
induced implicitly here; thus an explicit representation, suitable as a cell program,
must be extracted by a further mathematical transformation.
In general, we denote instances of program variables with the aid
of \inddef{index expression}\ki{index expressions} that refer to iteration variables.
Take, for instance, the equation
\begin{equation*}
c(i,j,k) = c(i,j,k-1) + a(i,j-1,k) \cdot b(i-1,j,k) \quad 1 \le i \le N_1, 1 \le j \le N_2, 1 \le k \le N_3
\end{equation*}
from system (\ref{exmmult4}). The instance $c(i,j,k-1)$ of the program variable $c$ is specified
using the index expressions $i$, $j$, and $k-1$, which can be regarded as functions
of the iteration variables $i,j,k$.
As we have noticed, the set of iteration vectors $(i,j,k)$ from the quantification becomes
a set of space-time coordinates $(x,y,t)$ when applying a space-time transformation
(\ref{raumzeit}) with transformation matrix $T$ from (\ref{raumzeit2}),
\begin{equation}\label{injektiveRZA}
\left(\begin{array}{c} x \\ y \\ t \end{array}\right)
= T \cdot \left(\begin{array}{c} i \\ j \\ k \end{array}\right)
= \left(\begin{array}{rrr}
0 & -1 & 1 \\
-1 & 1 & 0 \\
1 & 1 & 1
\end{array}\right) \cdot \left(\begin{array}{c} i \\ j \\ k \end{array}\right) \koz .
\end{equation}
Since each cell is denoted by space coordinates $(x,y)$, and the cell program must refer
to the current time $t$, the \index{iteration!variable}iteration variables $i,j,k$
in the index expressions for the program variables are not suitable, and must be
translated into the new coordinates $x,y,t$. Therefore, using the inverse of the space-time
transformation from (\ref{injektiveRZA}), we express the iteration variables $i,j,k$
as functions of the space-time coordinates $(x,y,t)$,
\begin{equation}\label{UmkehrRZA}
\left(\begin{array}{c} i \\ j \\ k \end{array}\right)
= T^{-1} \cdot \left(\begin{array}{c} x \\ y \\ t \end{array}\right)
= \frac{1}{3} \cdot \left(\begin{array}{rrr}
-1 & -2 & 1 \\
-1 & 1 & 1 \\
2 & 1 & 1
\end{array}\right) \cdot \left(\begin{array}{c} x \\ y \\ t \end{array}\right) \koz .
\end{equation}
The existence of such an inverse transformation is guaranteed if the space-time transformation
is injective on the domain---and it always should be: otherwise, several instances would have to be
calculated by the same cell in the same timestep. In the example, invertibility is guaranteed
by the square, nonsingular matrix $T$, even without reference to the domain.
With respect to the \index{time vector}time vector $\pi$ and any
\index{projection!vector}projection vector $u$, the property $\pi \cdot u \neq 0$ is sufficient.
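The inverse in (\ref{UmkehrRZA}) can be checked mechanically; a small Python sketch with exact rational arithmetic:

```python
from fractions import Fraction

# transformation matrix T from (injektiveRZA)
T = [[0, -1, 1],
     [-1, 1, 0],
     [1, 1, 1]]
# T^{-1} from (UmkehrRZA): the common factor 1/3 kept exact
T_inv = [[Fraction(v, 3) for v in row]
         for row in ([-1, -2, 1], [-1, 1, 1], [2, 1, 1])]

def matmul(P, Q):
    """Product of two 3x3 matrices."""
    return [[sum(P[r][m] * Q[m][c] for m in range(3)) for c in range(3)]
            for r in range(3)]

# T * T^{-1} must be the identity matrix
identity = [[int(r == c) for c in range(3)] for r in range(3)]
```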
Replacing \index{iteration!variable}iteration variables by space-time coordinates,
which might be interpreted as a \inddef{transformation!of domain}\ki{transformation of the domain,}
frequently yields very unpleasant index expressions. Here, for example, from $c(i,j,k-1)$ we get
\begin{equation*}
c((-x-2 \cdot y+t)/3,(-x+y+t)/3,(2 \cdot x+y+t-3)/3) \koz .
\end{equation*}
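Despite their unpleasant appearance, these expressions are exact on the transformed domain. A small Python check (hypothetical helper; exact arithmetic via \texttt{fractions}): mapping $(i,j,k) = (1,2,3)$ with $T$ yields $(x,y,t) = (1,1,6)$, from which the operand index $(i,j,k-1)$ is recovered.

```python
from fractions import Fraction

def operand_index(x, y, t):
    """Index (i, j, k-1) of the operand instance c(i,j,k-1), expressed
    in space-time coordinates via the inverse transformation (UmkehrRZA)."""
    i = Fraction(-x - 2 * y + t, 3)
    j = Fraction(-x + y + t, 3)
    k = Fraction(2 * x + y + t, 3)
    return i, j, k - 1
```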
But, by a successive \inddef{transformation!of index set}\ki{transformation of the index sets,}
we can relabel the instances of the program variables such that the reference to cell and time
appears more evident. In particular, it seems worthwhile to transform the equation system back
into \inddef{output normal form}\ki{output normal form,} i.e., to denote the results
calculated during timestep $t$ in cell $(x,y)$ by instances $(x,y,t)$ of the program variables.
We best gain a real understanding of this approach via an abstract mathematical formalism,
that we can fit to our special situation.
Therefore, let
\begin{equation}\label{quantification1}
r(\psi_r(v)) = \mathcal{F}(\ldots,s(\psi_s(v)),\ldots) \qquad v \in S
\end{equation}
be a quantified equation over a domain $S$, with program variables $r$ and $s$.
The \inddef{index function}\ki{index functions} $\psi_r$ and $\psi_s$ generate the instances
of the program variables as tuples of index expressions.
By transforming the domain with a function $\varphi$ that is injective on $S$,
equation (\ref{quantification1}) becomes
\begin{equation}\label{quantification2}
r(\psi_r(\varphi^{-1}(e))) = \mathcal{F}(\ldots,s(\psi_s(\varphi^{-1}(e))),\ldots) \qquad e \in \varphi(S) \koz ,
\end{equation}
where $\varphi^{-1}$ is a function that constitutes an inverse of $\varphi$ on $\varphi(S)$.
The new index functions are $\psi_r\circ\varphi^{-1}$ and $\psi_s\circ\varphi^{-1}$.
\index{transformation!of index set}Transformations of index sets do not touch the domain;
they can be applied to each program variable separately, since only the instances
of this program variable are renamed, and in a consistent way. With such
renamings $\vartheta_r$ and $\vartheta_s$, equation (\ref{quantification2}) becomes
\begin{equation}\label{quantification3}
r(\vartheta_r(\psi_r(\varphi^{-1}(e)))) =
\mathcal{F}(\ldots,s(\vartheta_s(\psi_s(\varphi^{-1}(e)))),\ldots) \qquad e \in \varphi(S) \koz .
\end{equation}
If output normal form is desired, $\vartheta_r\circ\psi_r\circ\varphi^{-1}$ has to be the identity.
In the simplest case (as in our example), $\psi_r$ is the identity,
and $\psi_s$ is an affine transformation of the form $\psi_s(v)=v-d$,
with constant $d$---the already known \index{dependence vector}\emph{dependence vector}.
$\psi_r$ then can be represented in the same way, with $d=\vec{0}$.
\index{transformation!of domain}Transformation of the domains happens by the space-time
transformation $\varphi(v)=T\cdot v$, with an invertible matrix $T$. For all index transformations,
we choose the same $\vartheta=\varphi$. Thus equation (\ref{quantification3}) becomes
\begin{equation}\label{quantification4}
r(e) = \mathcal{F}(\ldots,s(e-T\cdot d),\ldots) \qquad e \in T(S) \koz .
\end{equation}
For the generation of a \emph{cell program}, we have to know the following information
for every timestep: the operation to perform, the source of the data, and the destination
of the results---known from assembler programs as \texttt{opc}, \texttt{src}, \texttt{dst}.
The operation to perform (\texttt{opc}) follows directly from the function $\mathcal{F}$.
For a cell with control, we must also find the timesteps when to perform this individual
function $\mathcal{F}$. The set of these timesteps, as a function of the space coordinates, can \linebreak
be determined by projecting the set $T(S)$ onto the time axis; for a general polyhedral
$S$ with the aid of a \nevindex{Fourier, Jean Baptiste Joseph (1768--1830)}\nevindex{Motzkin,
Theodore Samuel}\index{Fourier-Motzkin elimination}\emph{Fourier-Motzkin elimination}, for example.
In system (\ref{quantification4}), we get a new dependence vector $T\cdot d$,
consisting of two components: a (vectorial) spatial part, and a (scalar) temporal part.
The \emph{spatial} part $\Delta z$, as a difference vector, specifies \emph{which}
neighbouring cell has calculated the operand. We can directly translate this information,
concerning the input of operands to cell $z$, into a port specifier with port
position $-\Delta z$, serving as the \texttt{src} operand of the instruction.
In the same way, the cell calculating the operand, with position $z-\Delta z$,
must write this value to a port with port position $\Delta z$,
used as the \texttt{dst} operand in the instruction.
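For our example, reading the dependence vectors off the operand instances $a(i,j-1,k)$, $b(i-1,j,k)$, $c(i,j,k-1)$ of system (\ref{exmmult4}) and applying $T$ from (\ref{injektiveRZA}) gives the following sketch (helper name is ours):

```python
# transformation matrix T from (injektiveRZA)
T = [[0, -1, 1],
     [-1, 1, 0],
     [1, 1, 1]]

def transformed_dependence(d):
    """Split T*d into its spatial part (delta z) and temporal part (delta t)."""
    Td = [sum(T[r][c] * d[c] for c in range(3)) for r in range(3)]
    return Td[:2], Td[2]

# dependence vectors of the operands a(i,j-1,k), b(i-1,j,k), c(i,j,k-1)
deps = {"a": (0, 1, 0), "b": (1, 0, 0), "c": (0, 0, 1)}
```

Each variable flows in a distinct direction, and the temporal part is $\Delta t = 1$ throughout, so no buffering is required in this array.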
The \emph{temporal} part of $T\cdot d$ specifies, as a time difference $\Delta t$,
\emph{when} the calculation of the operand has been performed. If $\Delta t=1$,
this information is irrelevant, because the reading cell $z$ always gets
the output of the immediately preceding timestep from neighbouring cells.
However, for $\Delta t>1$, the value must be buffered for $\Delta t-1$ timesteps,
either by the \emph{producer} cell $z-\Delta z$, or by the \emph{consumer} cell $z$---or by both,
sharing the burden. This need can be realised in the cell program, for example,
with $\Delta t-1$ copy instructions executed by the producer cell $z-\Delta z$,
preserving the value of the operand until its final output from the cell
by passing it through $\Delta t-1$ registers.
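The buffering for $\Delta t > 1$ amounts to a shift register of depth $\Delta t - 1$ in the producer cell; a minimal Python model (the initial register contents, here \texttt{None}, are an assumption):

```python
from collections import deque

def delay_line(outputs, delta_t):
    """Pass each produced value through delta_t - 1 copy registers,
    so the consumer sees it delta_t - 1 timesteps later."""
    registers = deque([None] * (delta_t - 1))
    seen = []
    for value in outputs:
        registers.append(value)           # producer writes its new result
        seen.append(registers.popleft())  # oldest value leaves the cell
    return seen
```

For $\Delta t = 1$ the delay line degenerates to a direct connection, in accordance with the remark above.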
Applying this method to system (\ref{exmmult8a} and \ref{exmmult8b}), with transformation
matrix $T$ as in (\ref{raumzeit5}), yields
{\small
\begin{equation}\label{exmmult11}
\begin{array}{@{}ll@{}}
s(x,y,t) = s(x-1,y,t-1) & \\[1ex]
a(x,y,t) = a(x,y-1,t-1) & \\[1ex]
b(x,y,t) = \left\{\begin{array}{@{}l}
b(x-1,y,t-1), \\[0.5ex]
\text{if $s(x-1,y,t-1) = 0$} \\[1ex]
c(x,y,t-1), \\[0.5ex]
\text{if $s(x-1,y,t-1) = 1$} \\
\end{array}\right. & \\[6.25ex]
c(x,y,t) = \left\{\begin{array}{@{}l}
c(x,y,t-1) + a(x,y-1,t-1) \cdot b(x-1,y,t-1) \koz , \\[0.5ex]
\text{if $s(x-1,y,t-1) = 0$} \\[1ex]
b(x-1,y,t-1), \\[0.5ex]
\text{if $s(x-1,y,t-1) = 1 \koz .$} \\
\end{array}\right. &
\end{array}\normalsize
\end{equation}}
The iteration variable $l$, being relevant only for the input/output scheme,
can be set to a fixed value prior to the transformation.
The cell program for the systolic array from Figure \ref{exthreed},
performed once in every timestep, reads as follows:
\begin{algN}{Cell-Program}\inddef{\textsc{Cell-Program}}
1 \' $\mathtt{S} \leftarrow \mathtt{C}(-1,0)(0)$ \\
2 \' $\mathtt{A} \leftarrow \mathtt{C}(0,-1)$ \\
3 \' $\mathtt{B} \leftarrow \mathtt{C}(-1,0)(1:N)$ \\
4 \' $\mathtt{C}(1,0)(0) \leftarrow \mathtt{S}$ \\
5 \' $\mathtt{C}(0,1) \leftarrow \mathtt{A}$
\algnewpageN
6 \' \key{if} \= $\mathtt{S}=1$ \\
7 \' \> \key{then} \= $\mathtt{C}(1,0)(1:N) \leftarrow \mathtt{C}$ \\
8 \' \> \> $\mathtt{C} \leftarrow \mathtt{B}$ \\
9 \' \> \key{else} \> $\mathtt{C}(1,0)(1:N) \leftarrow \mathtt{B}$ \\
10 \' \> \> $\mathtt{C} \leftarrow \mathtt{C} + \mathtt{A} \cdot \mathtt{B}$
\end{algN}
The port specifiers stand for local input/output to/from the cell. For each, a pair of qualifiers
is derived from the geometric position of the ports relative to the centre of the cell. Port
$\mathtt{C}(0,-1)$ is situated on the left border of the cell, $\mathtt{C}(0,1)$ on the right
border; $\mathtt{C}(-1,0)$ is above the centre, $\mathtt{C}(1,0)$ below. Each port specifier
can be augmented by a bit range: $\mathtt{C}(-1,0)(0)$ stands for bit 0 of the port, only;
$\mathtt{C}(-1,0)(1:N)$ denotes the bits 1 to $N$. The designations $\mathtt{A}$,
$\mathtt{B}$, \ldots{} without port qualifiers stand for registers of the cell.
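The cell program can be transcribed almost literally into an executable step function; a Python sketch (our own encoding: the pair \texttt{north} bundles the control bit and the $B$ word arriving from port $\mathtt{C}(-1,0)$, \texttt{west} carries the $A$ word from $\mathtt{C}(0,-1)$):

```python
def cell_step(regs, north, west):
    """One timestep of Cell-Program; returns the updated registers,
    the pair sent through the south ports C(1,0), and the value
    sent through the east port C(0,1)."""
    S = north[0]                  # 1: S <- C(-1,0)(0)
    A = west                      # 2: A <- C(0,-1)
    B = north[1]                  # 3: B <- C(-1,0)(1:N)
    south_ctrl = S                # 4: C(1,0)(0) <- S
    east = A                      # 5: C(0,1) <- A
    C = regs["C"]
    if S == 1:                    # 6: propagation mode
        south_data = C            # 7: C(1,0)(1:N) <- C
        C = B                     # 8: C <- B
    else:                         # calculation mode
        south_data = B            # 9: C(1,0)(1:N) <- B
        C = C + A * B             # 10: C <- C + A * B
    return {"S": S, "A": A, "B": B, "C": C}, (south_ctrl, south_data), east
```

Port writes become visible to neighbouring cells only in the following timestep, which realises the register delays assumed by system (\ref{exmmult11}).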
By application of matrix $T$ from (\ref{raumzeit1}) to system (\ref{exmmult7}), we get
{\small
\begin{equation}\label{exmmult10}
\begin{array}{@{}ll@{}}
a(x,y,t) = a(x,y-1,t-1) & 1+x+y \le t \le x+y+N_3 \koz , \\[1ex]
b(x,y,t) = b(x-1,y,t-1) & 1+x+y \le t \le x+y+N_3 \koz , \\[1ex]
\begin{array}{@{}l@{}}
c(x,y,t) = c(x,y,t-1) \\[0.5ex]
\quad {} + a(x,y-1,t-1) \\[0.5ex]
\quad {} \cdot b(x-1,y,t-1) \\
\end{array} & 1+x+y \le t \le x+y+N_3 \koz , \\[6ex]
b(x,y,t) = c(x,y,t-1) & x+y+1+N_3 \le t \le 2 \cdot x+y+N_3 \koz , \\[1ex]
c(x,y,t) = b(x-1,y,t-1) & x+y+1+N_3 \le t \le 2 \cdot x+y-1+N_3 \koz .
\end{array}\normalsize
\end{equation}}
Now the advantages of distributed control become obvious. The cell program for (\ref{exmmult11}) can be written with reference only to the current timestep $t$. Thus we need no reaction to global control signals, no counting register, no counting operations, and no coding of the local cell coordinates.
\begin{gyak}
\ujgyak
Specify appropriate input/output schemes for performing, on the systolic arrays
presented in Figures \ref{exthreec} and \ref{exthreed}, two evaluations of system
(\ref{exmmult7}) that follow each other closest in time.
\ujgyak
How could we change the systolic array from Figure \ref{exthreed}, to efficiently support
the calculation of matrix products with parameters $M_1