\chapter[Parallel Computations]{Parallel
Computations}\index{parallel computations|(}\nevindex{Fohry, Claudia}\nevindex{Iványi, Antal Miklós}
Parallel computing is concerned with solving a problem faster by using
multiple processors in parallel. These processors may belong to a single
machine, or to different machines that communicate through a network. In
either case, the use of parallelism requires splitting the problem into tasks
that can be solved simultaneously.
In the following, we will take a brief look at the history of parallel
computing, and then discuss reasons why parallel computing is harder than
sequential computing. We explain differences from the related subjects of
distributed and concurrent computing, and mention typical application areas.
Finally, we outline the rest of this chapter.
Although the history of parallel computing can be traced back even longer,
the first parallel computer is commonly said to be Illiac IV, an experimental
64-processor machine that became operational in 1972. The parallel computing
area boomed in the late 80s and early 90s when several new companies were
founded to build parallel machines of various types. Unfortunately, software
was difficult to develop and non-portable at that time. Therefore, the
machines were only adopted in the most compute-intensive areas of science and
engineering, a market too small to compensate for the high development
costs. Thus many of the companies had to give up.
On the positive side, people soon discovered that cheap parallel computers can
be built by interconnecting standard PCs and workstations. As networks became
faster, these so-called \emph{clusters} soon achieved speeds of the same order
as the special-purpose machines. At present, the Top 500 list, a regularly
updated survey of the most powerful computers worldwide,
consists of 42\% clusters. Parallel computing also profits from the increasing
use of multiprocessor machines which, while designed as servers for the web
etc., can as well be deployed in parallel computing. Finally, software
portability problems have been solved by establishing widely used standards
for parallel programming. The most important standards, MPI and OpenMP, will
be explained in Subsections \ref{subsec:MPI} and \ref{subsec:OpenMP} of this
chapter.
In summary, there is now an affordable hardware basis for parallel
computing. Nevertheless, the area has not yet entered the mainstream, which is
largely due to difficulties in developing parallel software. Whereas writing a
sequential program requires finding an algorithm, that is, a sequence of
elementary operations that solves the problem, and formulating the algorithm
in a programming language, parallel computing poses additional challenges:
\begin{itemize}
\item Elementary operations must be grouped into tasks that can be solved
concurrently.
\item The tasks must be scheduled onto processors.
\item Depending on the architecture, data must be distributed to memory
modules.
\item Processes and threads must be managed, i.e., started, stopped and so on.
\item Communication and synchronisation must be organised.
\end{itemize}
Of course, it is not sufficient to find any grouping, schedule etc. that work,
but it is necessary to find solutions that lead to fast programs. Performance
measures and general approaches to performance optimisation will be discussed
in Section \ref{sec:perfMeasures}, where we will also elaborate on the items
above. Unlike in sequential computing, different parallel architectures and
programming models favour different algorithms.
In consequence, the design of parallel algorithms is more complex than the
design of sequential algorithms. To cope with this complexity, algorithm
designers often use simplified models. For instance, the Parallel Random
Access Machine (see Subsection \ref{subsec:PRAM}) provides a model in which
opportunities and limitations of parallelisation can be studied, but it
ignores communication and synchronisation costs.
We will now contrast parallel computing with the related fields of distributed
and concurrent computing. Like
parallel computing, \emph{distributed computing} uses interconnected
processors and divides a problem into tasks, but the purpose of division is
different. Whereas in parallel computing, tasks are executed \emph{at the same
time}, in distributed computing tasks are executed \emph{at different
locations}, using \emph{different resources}. These goals overlap, and many
applications can be classified as both parallel and distributed, but the focus
is different. Parallel computing emphasises homogeneous architectures, and
aims at speeding up applications, whereas distributed computing deals with
heterogeneity and openness, so that applications profit from the inclusion of
different kinds of resources. Parallel applications are typically stand-alone
and predictable, whereas distributed applications consist of components that
are brought together at runtime.
\emph{Concurrent computing} is not bound to the existence of multiple
processors, but emphasises the fact that several sub-computations are in
progress at the same time. The most important issue is guaranteeing
correctness for any execution order, which can be parallel or interleaved.
Thus, the relation between concurrency and parallelism is comparable to the
situation of reading several books at a time. Reading the books concurrently
corresponds to having a bookmark in each of them and to keep track of all
stories while switching between books. Reading the books in parallel, in
contrast, requires looking into all books at the same time (which is probably
impossible in practice). Thus, a concurrent computation may or may not be
parallel, but a parallel computation is almost always concurrent. An exception
is data parallelism, in which the instructions of a single program are applied
to different data in parallel. This approach is followed by SIMD
architectures, as described below.
Due to the emphasis on speed, typical application areas of parallel computing are
science and engineering, especially numerical solvers and simulations. These
applications tend to have high and increasing computational demands, since
more computing power allows one to work with more detailed models that yield
more accurate results. A second reason for using parallel machines is their
higher memory capacity, due to which more data fit into a fast memory level
such as cache.
The rest of this chapter is organised as follows: In Section \ref{sec:archs}, we
give a brief overview and classification of current parallel
architectures. Then, we introduce basic concepts such as task and process,
and discuss performance measures and general approaches to
the improvement of efficiency in Section \ref{sec:perfMeasures}. Next,
Section \ref{sec:parProg} describes parallel programming models, with focus on the
popular MPI and OpenMP standards. After having given this general background,
the rest of the chapter delves into the subject of parallel algorithms from a
more theoretical perspective. Based on example algorithms, techniques for
parallel algorithm design are introduced. Unlike in sequential computing,
there is no universally accepted model for parallel algorithm design and
analysis, but various models are used depending on purpose. Each of the models
represents a different compromise between the conflicting goals of accurately
reflecting the structure of real architectures on one hand, and keeping
algorithm design and analysis simple on the other. Section \ref{sec:models} gives an
overview of the models, Section \ref{sec:perfAnal} introduces the basic concepts
of parallel algorithmics, and Sections \ref{sec:PRAM} and \ref{sec:mesh} explain deterministic example
algorithms for the PRAM and mesh computational models.
\section{Parallel architectures}\label{sec:archs}
A simple, but well-known classification of parallel architectures was given
in 1972 by Michael Flynn. It distinguishes four classes of computers:
SISD, SIMD, MISD, and MIMD architectures, as follows:
\begin{itemize}
\item SI stands for ``single instruction'', that is, the machine carries out a
single instruction at a time.
\item MI stands for ``multiple instruction'', that is, different processors
may carry out different instructions at a time.
\item SD stands for ``single data'', that is, only one data item is processed
at a time.
\item MD stands for ``multiple data'', that is, multiple data items may be
processed at a time.
\end{itemize}
SISD computers are von Neumann machines. MISD computers have probably never
been built. Early parallel computers were SIMD, but today most parallel
computers are MIMD. Although the scheme is of limited classification power,
the abbreviations are widely used.
The following more detailed classification distinguishes parallel
machines into SIMD, SMP, ccNUMA, nccNUMA, NORMA, clusters, and grids.
\subsection{SIMD architectures} As depicted in Figure \ref{fig:SIMD},
a SIMD computer is composed of a powerful control processor and several less
powerful processing elements (PEs). The PEs are typically arranged as a mesh
so that each PE can communicate with its immediate neighbours. A program is a
single thread of instructions. The control processor, like the processor of a
sequential machine, repeatedly reads a next instruction and decodes it. If the
instruction is sequential, the control processor carries out the instruction
on data in its own memory. If the instruction is parallel, the control
processor broadcasts the instruction to the various PEs, and these
simultaneously apply the instruction to different data in their respective
memories. As an example, let the instruction be {\sf LD reg, 100}. Then, all
processors load the contents of memory address 100 to {\sf reg}, but memory
address 100 is physically different for each of them. Thus, all processors
carry out the same instruction, but read different values (therefore
``SIMD''). For a statement of the form \ {\sf if} \ \emph{test} \ {\sf then} \
\emph{if\_branch} \ {\sf else} \ \emph{else\_branch}, \ first all processors
carry out the test simultaneously, then some carry out \emph{if\_branch} while
the rest sits idle, and finally the rest carries out \emph{else\_branch} while
the former sit idle. In consequence, SIMD computers are only suited for
applications with a regular structure. The architectures have been important
historically, but nowadays have almost disappeared.
\begin{figure}
\begin{center}
\epsfig{file=simd.eps,height=6cm}
\caption{SIMD architecture.}
\label{fig:SIMD}
\end{center}
\end{figure}
\subsection{Symmetric multiprocessors}
Symmetric multiprocessors (SMP) contain multiple processors that are connected
to a single memory. Each processor may access each memory location through
standard load/store operations of the hardware. Therefore, programs, including
the operating system, need to be stored only once. The memory can be physically
divided into modules, but the access time is the same for each pair of a
processor and a memory module (therefore ``symmetric''). The processors are
connected to the memory by a bus (see Figure~\ref{fig:busSMP}), by a crossbar,
or by a network of switches. In either case, there is a delay for
memory accesses which, partially due to competition for network resources,
grows with the number of processors.
\begin{figure}
\begin{center}
\epsfig{file=smp.eps,height=3cm}
\caption{Bus-based SMP architecture.}
\label{fig:busSMP}
\end{center}
\end{figure}
In addition to main memory, each processor has one or several levels of cache
with faster access. Between memory and cache, data are moved in units of cache
lines. Storing a data item in multiple caches (and writing to it) gives rise
to coherency problems. In particular, we speak of \emph{false sharing} if
several processors access the same cache line, but use different portions of
it. Since coherency mechanisms work at the granularity of cache lines, each
processor assumes that the other has modified its data, and therefore
the cache line is sent back and forth.
\subsection{Cache-coherent NUMA architectures} NUMA stands for
\textbf{N}on-\textbf{U}niform \textbf{M}emory \textbf{A}ccess,
and contrasts with the symmetry property of the previous
class. The general structure of ccNUMA architectures is depicted in
Figure \ref{fig:ccNUMA}.
\begin{figure}
\begin{center}
\epsfig{file=ccNumaCD.eps,height=4.5cm}
\caption{ccNUMA architecture.}
\label{fig:ccNUMA}
\end{center}
\end{figure}
As shown in the figure, each processor owns a \emph{local memory}, which can
be accessed faster than the rest, called \emph{remote memory}. All memory is
accessed through standard load/store operations, and hence programs, including
the operating system, need to be stored only once. As in SMPs, each processor
owns one or several levels of cache; cache coherency is taken care of by the
hardware.
\subsection{Non-cache-coherent NUMA architectures}
nccNUMA (\textbf{n}on \textbf{c}ache \textbf{c}oherent \textbf{N}on-\textbf{U}niform
\textbf{M}emory \textbf{A}ccess) architectures
differ from ccNUMA architectures in that the hardware puts into a processor's
cache only data from local memory. Access to remote memory can still be
accomplished through standard load/store operations, but it is now up to the
operating system to first move the corresponding page to local memory. This
difference simplifies hardware design, and thus nccNUMA machines scale to
higher processor numbers. On the downside, the operating system gets more
complicated, and the access time to remote memory grows. The overall structure
of Figure~\ref{fig:ccNUMA} applies to nccNUMA architectures as well.
\vspace{-1.2mm}
\subsection{No remote memory access architectures} NORMA
(\textbf{NO} \textbf{R}emote \textbf{M}emory \textbf{A}ccess) architectures
differ from the previous class in that the remote memory must be accessed
through slower I/O operations as opposed to load/store operations. Each node,
consisting of processor, cache and local memory, as depicted in
Figure~\ref{fig:ccNUMA}, holds an own copy of the operating system, or at least
of central parts thereof. Whereas SMP, ccNUMA, and nccNUMA architectures
%Tanenbaum checken
are commonly classified as shared memory machines, SIMD architectures, NORMA
architectures, clusters, and grids (see below) fall under the heading of
distributed memory.
\vspace{-1.2mm}
\subsection{Clusters} According to Pfister, a cluster is a type
of parallel or distributed system that consists of a collection of
interconnected whole computers that are used as a single, unified computing
resource. Here, the term ``whole computer'' denotes a PC, workstation or,
increasingly important, SMP, that is, a node that consists of processor(s),
memory, possibly peripherals, and an operating system. The use as a single,
unified computing resource is also denoted as single system image (SSI). For
instance, we speak of SSI if it is possible to log into the system instead
of into individual nodes, or if there is a single file system. Obviously, the
SSI property is gradual, and hence the borderline to distributed systems is
fuzzy. The borderline to NORMA architectures is fuzzy as well, where the
classification depends on the degree to which the system is designed as a
whole instead of built from individual components.
Clusters can be classified according to their use for parallel computing, high
throughput computing, or high availability. Parallel computing clusters can be
further divided into \emph{dedicated clusters}, which are solely built for the
use as parallel machines, and \emph{campus-wide clusters}, which are
distributed systems with part-time use as a cluster. Dedicated clusters
typically do not contain peripherals in their nodes, and are interconnected
through a high-speed network. Nodes of campus-wide clusters, in contrast, are
often desktop PCs, and the standard network is used for intra-cluster
communication.
\subsection{Grids}
A grid is a hardware/software infrastructure for
shared usage of resources and problem solution. Grids enable
coordinated access to resources such as processors, memories, data, devices,
and so on. Parallel computing is one out of several emerging application
areas. Grids differ from other parallel architectures in that they are large,
heterogeneous, and dynamic. Management is complicated by the fact that grids
cross organisational boundaries.
\section{Performance in practice}\label{sec:perfMeasures}
As explained in the introduction, parallel computing splits a problem into
\textit{tasks}\index{task} that are solved independently. The tasks are implemented as
either \textit{processes}\index{process} or \textit{threads}.\index{thread} A detailed discussion of these
concepts can be found in operating system textbooks such as
Tanenbaum. Briefly stated, processes are programs in
execution. For each process, information about resources such as memory
segments, files, and signals is stored, whereas threads exist within processes
such that multiple threads share resources. In particular, threads of a
process have access to shared memory, while processes (usually) communicate
through explicit message exchange. Each thread owns a separate program counter and other
register values, as well as a stack for local variables. Processes can be
considered as units for resource usage, whereas threads are units for
execution on the CPU. As less information needs to be stored, it is faster to
create, destroy and switch between threads than it is for processes.
Whether threads or processes are used depends on the architecture. On
shared-memory machines, threads are usually faster, although processes may be
used for program portability. On distributed memory machines, only processes
are a priori available. Threads can be used if there is a software layer
(distributed shared memory) that implements a shared memory abstraction, but
these threads have higher communication costs.
Whereas the notion of tasks is problem-related, the notions of processes and
threads refer to implementation. When designing an algorithm, one typically
identifies a large number of tasks that can potentially be run in parallel,
and then maps several of them onto the same process or thread.
Parallel programs can be written in two styles that can also be mixed: With
\ki{data parallelism,}\inddef{data parallelism}\index{parallelism|data}
the same operation is applied to different data at a
time. The operation may be a machine instruction, as in SIMD architectures, or
a complex operation such as a function application. In the latter case,
different processors carry out different instructions at a time. With
\ki{task parallelism,}\inddef{task parallelism}\index{parallelism!task}
in contrast, the processes/threads carry out
different tasks. Since a function may have an {\sf if} or {\sf case}
statement as the outermost construct, the borderline between data parallelism
and task parallelism is fuzzy.
Parallel programs that are implemented with processes can be further
classified as using \textit{Single Program Multiple Data}
(SPMD)\index{Single Program Multiple Data}\index{SPMD|see{Single Program Multiple Data}} or
\textit{Multiple Program Multiple Data}\index{Multiple Program Multiple
Data}\index{MPMD|see{Multiple Program Multiple Data}} (MPMD) coding styles. With SPMD, all
processes run the same program, whereas with MPMD they run different
programs. MPMD programs are task-parallel, whereas SPMD programs may be either
task-parallel or data-parallel. In SPMD mode, task parallelism is expressed
through conditional statements.
As the central goal of parallel computing is to run programs faster,
performance measures play an important role in the field. An obvious measure
is execution time, yet more frequently the derived measure of speedup is
used. For a given problem, speedup is defined by
\[
\var{speedup}(p) = \frac{T_1}{T_p} \koz ,
\]
where $T_1$ denotes the running time of the fastest sequential algorithm, and
$T_p$ denotes the running time of the parallel algorithm on $p$
processors. Depending on context, speedup may alternatively refer to using $p$
processes or threads instead of $p$ processors. A related, but less frequently
used measure is efficiency, defined by
\[
\var{efficiency}(p)=\frac{\var{speedup}(p)}{p} \koz .
\]
Unrelated to this definition, the term efficiency is also used informally as a
synonym for good performance.
Figure~\ref{fig:speedup} shows ideal, typical, and super-linear speedup
curves. The ideal curve reflects the assumption that an execution that uses
twice as many processors requires half of the time. Hence, ideal speedup
corresponds to an efficiency of one. Super-linear speedup may arise due to
cache effects, that is, the use of multiple processors increases the total
cache size, and thus more data accesses can be served from cache instead of
from slower main memory.
\epskepnagy{speedupsmallp-tony.eps}{Ideal, typical, and super-linear
speedup curves.\label{fig:speedup}}{height=4cm}
Typical speedup stays below ideal speedup, and grows up to some number of
processors. Beyond that, use of more processors slows down the program. The
difference between typical and ideal speedups has several reasons:
\begin{itemize}
\item \emph{Amdahl's law}\nevindex{Amdahl, Gene M.}
states that each program contains a serial portion $s$ that is not amenable
to parallelisation. Hence, $T_p>s$, and thus $\var{speedup}(p) < T_1/s$, that
is, the speedup is bounded from above by a constant, regardless of the
number of processors.
\item Communication and synchronisation take additional time that a
sequential program does not spend.
\item The load can not always be balanced evenly, so that some processors
sit idle while others compute.
\end{itemize}
\section{Parallel programming}\label{sec:parProg}
\subsection{MPI programming}\label{subsec:MPI}
Figure \ref{fig:mpiProg} shows a simple MPI program in which process 0 sends
a message to process 1, using the basic functions {\sf MPI\_Init}, {\sf
MPI\_Comm\_rank}, {\sf MPI\_Comm\_size}, {\sf MPI\_Send}, {\sf MPI\_Recv},
and {\sf MPI\_Finalize}.
\begin{figure}
\begin{center}
\begin{verbatim}
#include <stdio.h>
#include <string.h>
#include "mpi.h"
int main (int argc, char **argv) {
  char msg[20];
  int me, total, tag=99;
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &me);
  MPI_Comm_size(MPI_COMM_WORLD, &total);
  if (me==0) {
    strcpy(msg, "Hello");
    MPI_Send(msg, strlen(msg)+1, MPI_CHAR, 1, tag,
             MPI_COMM_WORLD);
  }
  else if (me==1) {
    MPI_Recv(msg, 20, MPI_CHAR, 0, tag, MPI_COMM_WORLD,
             &status);
    printf("Received: %s\n", msg);
  }
  MPI_Finalize();
  return 0;
}
\end{verbatim}
\caption{A simple MPI program.}
\label{fig:mpiProg}
\end{center}
\end{figure}
Although the above functions are sufficient to write simple programs, many
more functions help to improve the efficiency and/or the structure of MPI
programs. In particular, MPI-1 supports the following classes of functions:
\begin{itemize}
\item {\em Alternative functions for pairwise communication:} The base {\sf
MPI\_Send} function, also called standard mode send, returns if either the
message has been delivered to the receiver, or the message has been buffered
by the system. This decision is left to MPI. Variants of {\sf MPI\_Send}
enforce one of the alternatives: In synchronous mode, the send function only
returns when the receiver has started receiving the message, thus
synchronising in both directions. In buffered mode, the system is required
to store the message if the receiver has not yet issued {\sf MPI\_Recv}.
On both the sender and receiver sides, the functions for standard,
synchronous, and buffered modes each come in blocking and nonblocking
variants. Blocking variants have been described above. Nonblocking variants
return immediately after having been called, to let the sender/receiver
continue with program execution while the system accomplishes communication
in the background. Nonblocking communications must be completed by a call to
{\sf MPI\_Wait} or {\sf MPI\_Test} to make sure the communication has
finished and the buffer may be reused. Variants of the completion functions
allow waiting for multiple outstanding requests.
MPI programs can deadlock, for instance if a process $P_0$ first issues a send
to process $P_1$ and then a receive from $P_1$; and $P_1$ does the same with
respect to $P_0$. As a possible way out, MPI supports a combined send/receive
function.
In many programs, a pair of processes repeatedly exchanges data with the same
buffers. To reduce communication overhead in these cases, a kind of address
label, called persistent communication, can be used. Finally, the MPI functions
{\sf MPI\_Probe} and {\sf MPI\_Iprobe} allow a process to inspect the size and
other characteristics of a message before receiving it.
\item {\em Functions for datatype handling:} In simple forms of message
passing, an array of equally-typed data (e.g. {\sf float}) is
exchanged. Beyond that, MPI allows combining data of different types in a
single message, and sending data from non-contiguous buffers such as every
second element of an array. For these purposes, MPI defines two alternative
classes of functions: user-defined data types describe a pattern of data
positions/types, whereas packaging functions help to put several data into a
single buffer. MPI supports heterogeneity by automatically converting data
if necessary.
\item {\em Collective communication functions:} These functions support
frequent patterns of communication such as broadcast (one process sends a data
item to all other processes). Although any pattern can be implemented by a
sequence of sends/receives, collective functions should be preferred since
they improve program compactness/understandability, and often have an
optimised implementation. Moreover, implementations can exploit specifics of
an architecture, and so a program that is ported to another machine may run
efficiently on the new machine as well, by using the optimised implementation
of that machine.
\item {\em Group and communicator management functions:} As mentioned above,
the send and receive functions contain a communicator argument that
describes a group of processes. Technically, a communicator is a distributed
data structure that tells each process how to reach the other processes of
its group, and contains additional information called attributes. The same
group may be described by different communicators. A message exchange only
takes place if the communicator arguments of {\sf MPI\_Send} and {\sf
MPI\_Recv} match. Hence, the use of communicators partitions the messages of
a program into disjoint sets that do not influence each other. This way,
communicators help to structure programs, and contribute to correctness. For
libraries that are implemented with MPI, communicators allow separating
library traffic from the traffic of the application
program. Groups/communicators are necessary to express collective
communications. The attributes in the data structure may contain
application-specific information such as an error handler. In addition to
the (intra)communicators described so far, MPI supports intercommunicators
for communication between different process groups.
\end{itemize}
MPI-2 adds four major groups of functions:
\begin{itemize}
\item {\em Dynamic process management functions: } With these functions, new
MPI processes can be started during a program run. Additionally,
independently started MPI programs (each consisting of multiple processes)
can get into contact with each other through a client/server mechanism.
\item {\em One-sided communication functions:} One-sided communication is a
type of shared-memory communication in which a group of processes agrees to
use part of their private address spaces as a common resource. Communication
is accomplished by writing into and reading from that shared
memory. One-sided communication differs from other shared-memory programming
models such as OpenMP in that explicit function calls are required for the
memory access.
\item {\em Parallel I/O functions:} A large set of functions allows multiple
processes to collectively read from or write to the same file.
\item {\em Collective communication functions for intercommunicators:} These
functions generalise the concept of collective communication to
intercommunicators. For instance, a process of one group may broadcast a
message to all processes of another group.
\end{itemize}
\subsection{OpenMP programming}\label{subsec:OpenMP}
OpenMP derives its name from being an open standard for multiprocessing, that
is, for architectures with a shared memory. Because of the shared memory, we
speak of threads (as opposed to processes) in this section.
Shared-memory communication is fundamentally different from message passing:
Whereas message passing immediately involves two processes, shared-memory
communication uncouples the processes by inserting a medium in-between. We
speak of read/write instead of send/receive, that is, a thread writes into
memory, and another thread later reads from it. The threads need not know each
other, and a written value may be read by several threads. Reading and writing
may be separated by an arbitrary amount of time. Unlike in message passing,
synchronisation must be organised explicitly, to let a reader know when the
writing has finished, and to avoid concurrent manipulation of the same data by
different threads.
OpenMP is one type of shared-memory programming, while others include
one-sided communication as outlined in Subsection \ref{subsec:MPI}, and threads
programming as outlined in Subsection \ref{subsec:otherProg}. OpenMP differs from other
models in that it enforces a fork-join structure, which is depicted in
Figure \ref{fig:openMP}. A program starts execution as a single thread, called
master thread, and later creates a team of threads in a so-called parallel
region. The master thread is part of the team. Parallel regions may be nested,
but the threads of a team must finish together. As shown in the figure, a
program may contain several parallel regions in sequence, with possibly
different numbers of threads.
\epskepnagy{openMPstruc.eps}{Structure of an OpenMP program.\label{fig:openMP}}{height=5cm}
As another characteristic, OpenMP uses compiler directives as opposed to
library functions. Compiler directives are hints that a compiler may or may
not take into account. In particular, a sequential compiler ignores the
directives. OpenMP supports incremental parallelisation, in which one starts
from a sequential program, inserts directives at the most performance-critical
sections of code, later inserts more directives if necessary, and so on.
OpenMP was introduced in 1998; version 2.0 appeared in 2002. In addition
to compiler directives, OpenMP uses a few library functions and environment
variables. The standard is available for C, C++, and Fortran.
Programming OpenMP is easier than programming MPI since the compiler does part
of the work. An OpenMP programmer chooses the number of threads, and then
specifies work sharing in one of the following ways:
\begin{itemize}
\item {\em Explicitly:} A thread can request its own number by calling the
library function {\sf omp\_get\_thread\_num}. Then, a conditional statement
evaluating this number explicitly assigns tasks to the threads, similarly to
SPMD-style MPI programs.
\item {\em Parallel loop:} The compiler directive {\sf \#pragma omp parallel
for} indicates that the following {\sf for} loop may be executed in parallel
so that each thread carries out several iterations (tasks). An example is
given in Figure~\ref{fig:bspOpenMP}. The programmer can influence the work
sharing by specifying parameters such as {\sf schedule(static)} or {\sf
schedule(dynamic)}. Static scheduling means that each thread gets an
approximately equal-sized block of consecutive iterations. Dynamic
scheduling means that first each thread is assigned one iteration, and
then, repeatedly, a thread that has finished an iteration gets the next
one, as in the master/slave paradigm described before for MPI. Unlike
with master/slave, the compiler decides which thread carries out which
tasks, and inserts the necessary communications.
\item {\em Task-parallel sections:} The directive {\sf \#pragma omp parallel
sections} allows specifying a list of tasks that are assigned to the
available threads.
\end{itemize}
Threads communicate through shared memory, that is, they write to or read from
shared variables. Only some of the variables are shared, while others are
private to a particular thread. Whether a variable is private or shared is
determined by rules that the programmer can override.
\begin{figure}
\begin{center}
\begin{verbatim}
#include <omp.h>
#define N 100
double a[N][N], b[N], c[N];
int main() {
  int i, j;
  double h;
  /* initialisation omitted */
  omp_set_num_threads(4);
  #pragma omp parallel for shared(a,b,c) private(j,h)
for (i=0; i \key{then} \= $y_1 \leftarrow x_1$ \\
3 \' \> \> \key{return} $Y$ \\
4 \' \key{if} \= $p > 1$ \\
5 \' \> \key{then} \= $P_i$ \= \key{ in } \= \key{parallel for} $i \leftarrow 1 \key{ to } p/2$ \\
\> \> \> \key{ do } \> compute recursively $y_1, y_2, \ldots, y_{p/2}$, \\
\> \> \> \> the prefixes belonging to $x_1, x_2, \ldots, x_{p/2}$ \\
\> \> $P_i$ \= \key{ in } \= \key{parallel for} $i \leftarrow p/2 + 1 \key{ to } p$ \\
\> \> \> \key{ do } \> compute recursively $y_{p/2+1}, y_{p/2+2}, \ldots, y_p$, \\
\> \> \> \> the prefixes belonging to $x_{p/2+1}, x_{p/2+2}, \ldots, x_p$ \\
6 \' \> \> $P_i$ \= \key{ in } \= \key{parallel for} $i \leftarrow p/2 + 1 \key{ to } p$ \\
\> \> \> \key{ do } \> read $y_{p/2}$ from the global memory and compute $y_i \leftarrow y_{p/2} \oplus y_i$ \\
7 \' \key{return} $Y$
\end{alg}
\begin{pelda} \textit{Calculation of prefixes of 8 elements on 8 processors.}
Let $n = 8$ and $p = 8$. The input data of the prefix calculation are
12, 3, 6, 8, 11, 4, 5 and 7, the associative operation is the addition.
The run of the \textit{recursive algorithm}\index{recursive algorithm}\index{algorithm!recursive}
consists of \textit{rounds}.\index{round} In the first round (step 5) the first four processors
get the input data 12, 3, 6, 8, and compute recursively the prefixes 12, 15, 21, 29 as output.
At the same time the other four processors get the input data
11, 4, 5, 7, and compute the prefixes 11, 15, 20, 27.
According to the recursive structure $P_1, \ P_2, \ P_3$ and $P_4$ work as follows. $P_1$ and $P_2$ get
$x_1$ and $x_2$, resp., while $P_3$ and $P_4$ get $x_3$ and $x_4$ as input. Recursion means for $P_1$ and
$P_2$ that $P_1$ gets $x_1$ and $P_2$ gets $x_2$, computing at first $y_1 = x_1$ and $y_2 = x_2$; then
$P_2$ updates $y_2 = y_1 \oplus y_2$. After this $P_3$ computes $y_3 = y_2 \oplus y_3$ and
$P_4$ computes $y_4 = y_2 \oplus y_4$.
While $P_1, \ P_2, \ P_3$ and $P_4$, according to step 5, compute the final values $y_1, \ y_2, \ y_3$
and $y_4$, $P_5, \ P_6, \ P_7$ and $P_8$ compute the local provisional values of
$y_5, \ y_6, \ y_7$ and $y_8$.
In the second round (step 6) the first four processors are idle, while
the second four processors compute the final values of $y_5, \ y_6, \ y_7$ and $y_8$,
adding $y_4 = 29$ to the provisional values 11, 15, 20 and 27 and obtaining
40, 44, 49 and 56.
\end{pelda}
In the remaining part of the section we use the notation $W(n)$ instead of $W(n,p)$ and give the number
of used processors in verbal form. If $p = n$, then we usually prefer to use $p$.
\begin{tetel} Algorithm \textsc{CREW-Prefix} uses $\Theta(\lg p)$ time on $p$ \textsc{CREW PRAM} processors
to compute the prefixes of $p$ elements.
\end{tetel}
\begin{biz}
Lines 1--3 and line 7 require $\Theta(1)$ steps, while lines 4--6 require $W(p/2) + \Theta(1)$ steps.
So we get the following recurrence:
\begin{equation}
W(p) = W(p/2) + \Theta(1).
\end{equation}
The solution of this recurrence is $W(p) = \Theta(\lg p)$.
\end{biz}
\textsc{CREW-Prefix} is not work-optimal, since its work is $\Theta(p \lg p)$
and a sequential algorithm requiring only $O(p)$ time is known, but it is
work-effective, since all sequential prefix algorithms require $\Omega(p)$ time.
\subsubsection{An EREW PRAM algorithm.}
In the following algorithm we use only exclusive reads and writes,
therefore it can be implemented on the EREW PRAM model. Its input is
the number of processors $p$ and the sequence $X = x_1, \ x_2, \ \ldots, \ x_p$, and its output is
the sequence $Y = y_1, \ y_2, \ \ldots, \ y_p$ containing the prefixes.
\begin{alg}{EREW-Prefix$(X)$}\inddef{EREW-Prefix}
1 \' $Y[1] \leftarrow X[1]$ \\
2 \' $P_i$ \= \key{in parallel for} \= $i \leftarrow 2 \key{ to } p$ \\
3 \' \> \key{do} $Y[i] \leftarrow X[i - 1] \oplus X[i]$ \\
4 \' $k \leftarrow 2$ \\
5 \' \key{while} \= $k < p$ \\
6 \' \> \key{do} \= $P_i$ \= \key{in parallel for} $i \leftarrow k + 1 \key{ to } p$ \\
7 \' \> \> \key{do} \= $Y[i] \leftarrow Y[i - k] \oplus Y[i]$ \\
8 \' \> \> $k \leftarrow k + k$ \\
9 \' \key{return} $Y$
\end{alg}
\begin{tetel} Algorithm \textsc{EREW-Prefix} computes the prefixes of $p$ elements
on $p$ \textsc{EREW PRAM} processors in $\Theta(\lg p)$ time.
\end{tetel}
\begin{biz}
The commands in lines 1--3 and 9 are executed in $O(1)$ time. Lines 5--8 are executed as many times
as the assignment in line 8 doubles $k$ before reaching $p$, that is, $\Theta(\lg p)$ times.
\end{biz}
\subsubsection{A work-optimal algorithm.}
Next we consider a recursive work-optimal algorithm, which uses $p/\lg p$ CREW PRAM processors.
Input is the length of the input sequence $(p)$ and the sequence $X = x_1, \ x_2, \ \ldots, \ x_p$, output is
the sequence $Y = y_1, \ y_2, \ \ldots, \ y_p$, containing the computed prefixes.
\begin{algN}{Optimal-Prefix$(p,X)$}\inddef{Optimal-Prefix}
1 \' $P_i$ \= \key{in parallel for} $i \leftarrow 1 \key{ to } p/\lg p$ \\
2 \' \> \key{do} \= compute sequentially $z_{(i - 1) \lg p + 1}, z_{(i - 1) \lg p + 2}, \ldots, z_{i \lg p},$ \\
\' \> \> the prefixes of the following $\lg p$ input data: \\
\' \> \> $x_{(i - 1)\lg p + 1}, x_{(i - 1) \lg p + 2}, \ldots, x_{i \lg p}$ \\
3 \' $P_i$ \= \key{in parallel for} $i \leftarrow 1 \key{ to } p/\lg p$ \\
4 \' \> \key{do} \= using \textsc{CREW-Prefix} compute $w_{\lg p}, w_{2 \lg p}, w_{3 \lg p}, \ldots, w_p$, \\
\' \> \> the prefixes of the following $p/\lg p$ elements: \\
\' \> \> $z_{\lg p}, z_{2 \lg p}, z_{3 \lg p}, \ldots, z_p$ \\
5 \' $P_i$ \= \key{in parallel for} $i \leftarrow 2 \key{ to } p/\lg p$ \\
6 \' \> \key{do} \= \key{for} \= $j \leftarrow 1 \key{ to } \lg p$ \\
7 \' \> \> \> \key{do} $Y[(i-1) \lg p+j] \leftarrow w_{(i-1) \lg p} \oplus z_{(i-1) \lg p +j}$ \\
8 \' $P_1$ \= \key{for} \= $j \leftarrow 1 \key{ to } \lg p$ \\
9 \' \> \> \key{do} $Y[j] \leftarrow z_j$ \\
10 \' \key{return} $Y$
\end{algN}
This algorithm runs in logarithmic time. The following two formulas help to show it:
\begin{equation}
z_{(i - 1)\lg p + k} = \sum_{j = (i - 1)\lg p + 1}^{(i - 1)\lg p + k} x_j \ \ (k = 1, \ 2, \ \ldots, \ \lg p)
\end{equation}
and
\begin{equation}
w_{i \lg p} = \sum_{j = 1}^{i} z_{j \lg p} \ (i = 1, \ 2, \ \ldots, \ p/\lg p),
\end{equation}
where summing goes using the corresponding associative operation.
\begin{tetel} [parallel prefix computation in $\Theta(\lg p)$ time] Algorithm \textsc{Optimal-Prefix}
computes the prefixes of $p$ elements on $p/\lg p$ \textsc{CREW PRAM} processors in $\Theta(\lg p)$
time.\label{theorem:optprefix}
\end{tetel}
\begin{biz}
Lines 1--2 run in $\Theta(\lg p)$ time, lines 3--4 run in
$O(\lg(p/\lg p)) = O(\lg p)$ time, and lines 5--9 run in $\Theta(\lg p)$ time.
\end{biz}
This theorem implies that the work of \textsc{Optimal-Prefix} is $\Theta(p)$, therefore
\textsc{Optimal-Prefix} is a work-optimal algorithm.
\parkep{boritoEN}{Computation of prefixes of 16 elements using \textsc{Optimal-Prefix}. \label{fig:borito}}
Let the elements of the sequence $X = x_1, x_2, \ldots, x_p$ be the elements of the
alphabet $\Sigma$. Then the input data of the prefix computation are the elements of the sequence $X$,
and the \ki{prefix problem} is the computation of the elements
$x_1, \ x_1 \oplus x_2, \ \ldots, x_1 \oplus x_2 \oplus x_3 \oplus \ldots \oplus x_p$.
These computable elements are called \ki{prefixes.}\inddef{prefix}
We remark that in some books on parallel programming the elements of the sequence $X$
themselves are often called prefixes.
\begin{pelda} \textit{Associative operations.} If $\Sigma$ is the set of integers, $\oplus$ denotes the addition
and the sequence of the input data is 3, -5, 8, 2, 5, 4, then the prefixes are
3, -2, 6, 8, 13, 17. If the alphabet and the input data are the same but the operation is multiplication, then
the output data (prefixes) are 3, -15, -120,
-240, -1200, -4800. If the operation is the minimum (it is also associative), then the prefixes are
3, -5, -5, -5, -5, -5. The last prefix equals the smallest input element.
\end{pelda}
The prefix problem can be solved sequentially in $O(p)$ time, and any
sequential algorithm A needs $N(p,\rm{A}) = \Omega(p)$ time. There exist
work-effective parallel algorithms
solving the prefix problem.
Our first parallel algorithm is \textsc{CREW-Prefix}\index{CREW PRAM}, which uses $p$ CREW PRAM processors and
requires $\Theta(\lg p)$ time. Then we continue with algorithm \textsc{EREW-Prefix}\index{EREW-Prefix},
having similar qualitative characteristics, but running on EREW PRAM model too.
These algorithms solve the prefix problem more quickly than the sequential algorithms, but
the order of their work is larger.
\newpage
Algorithm \textsc{Optimal-Prefix}\index{\textsc{Optimal-Prefix}} requires
only $\left \lceil p/\lg p \right \rceil$ CREW PRAM processors\index{CREW PRAM}
and, in spite of the reduced number of processors, requires only $O(\lg p)$ time. So its
work is $O(p)$, its efficiency is $\Theta(1)$, and it is work-optimal.
The speedup of the algorithm is $\Theta(p/\lg p)$.
\subsection{Ranking}
The input of the \ki{list ranking problem}\inddef{list ranking problem} is a list represented
by an array $A[1 \twodots p]$: each element contains the index of its
\ki{right neighbour}\inddef{right neighbour}
(and maybe further data). The task is to determine the rank of the elements. The
\ki{rank} of an element is defined as the number of elements to the right of it in the list.
Since the further data are not necessary to find the solution, for simplicity we suppose that the elements of the
array contain only the index of the right neighbour. This index is called a \ki{pointer.}
The pointer of the rightmost element equals zero.
\begin{pelda} \textit{Input of list ranking.\label{pelda:6elem}}
Let $A[1 \twodots 6]$ be the array represented in the first row of Figure \ref{fig:tomb6}.
Then the right neighbour of the element $A[1]$ is $A[5]$, the right neighbour of $A[2]$ is
$A[4]$. $A[4]$ is the last element, therefore its rank is 0. The rank of $A[2]$ is 1, since only one element,
$A[4]$, is to the right of it. The rank of $A[1]$ is 4, since the elements $A[5], A[3], A[2]$ and
$A[4]$ are to the right of it. The second row of Figure \ref{fig:tomb6} shows the elements
of $A$ in decreasing order of their ranks.
\end{pelda}
\parkep{2.2E}{Input data of list ranking and the result of the ranking. \label{fig:tomb6}}
The list ranking problem can be solved in linear time using a sequential algorithm.
At first we determine \ki{the head of the list}\inddef{head of a list}, which is
the unique $A[i]$ with the property that no index $j \ (1 \leq j \leq p)$ exists with
$A[j] = i$. In our case the head of $A$ is $A[6]$. The head of the list has rank $p - 1,$
its right neighbour has rank $p - 2, \ \ldots,$ and finally the rank of the last element is zero.
In this subsection we present a deterministic list ranking algorithm which uses $p$ EREW PRAM processors
and requires $\Theta(\lg p)$ time in the worst case. The pseudocode of algorithm \textsc{Det-Ranking} is as follows.
The input of the algorithm is the number of the elements to be ranked $(p)$, the array $N[1 \twodots p]$
containing the index of the right neighbour of the elements of $A$, output is the array $R[1 \twodots p]$
containing the computed ranks.
\newpage
\begin{algN}{Det-Ranking$(p,N)$}\inddef{\textsc{Det-Ranking}}
1 \' $P_i$ \= \key{in parallel for} $i \leftarrow 1 \key{ to } p$ \\
2 \' \> \key{do} \= \key{if} $N[i] = 0$ \\
3 \' \> \> \key{then} \= $R[i] \leftarrow 0$ \\
4 \' \> \> \key{else} \> $R[i] \leftarrow 1$ \\
5 \' \key{for} \= $j \leftarrow 1$ \key{ to } $\lceil \lg p \rceil$ \\
6 \' \> \key{do} $P_i$ \= \key{in parallel for} $i \leftarrow 1$ \key{to} $p$ \\
7 \' \> \> \key{do} \key{if} \= $N[i] \neq 0$ \\
8 \' \> \> \> \key{then} \= $R[i] \leftarrow R[i] + R[N[i]]$ \\
9 \' \> \> \> \> $N[i] \leftarrow N[N[i]]$ \\
10 \' \key{return} $R$
\end{algN}
The basic idea behind the algorithm \textsc{Det-Ranking} is the \ki{pointer jumping.}\inddef{pointer jumping}
According to this algorithm at the beginning each element contains the index of its right neighbour, and
accordingly its provisional rank equals 1 (with the exception of the last element of the list,
whose rank equals zero).\index{\textsc{Det-Ranking}}
This initial state is represented in the first row of Figure \ref{fig:6pelda}.
\parkep{2.3E}{Work of algorithm \textsc{Det-Ranking}\abraindex{\textsc{Det-Ranking}}
on the data of Example \ref{pelda:6elem}. \label{fig:6pelda}}
Then the algorithm modifies the elements so that each element points to the right neighbour of its right
neighbour (if it exists, otherwise to the end of the list). This state is represented in the second row of
Figure \ref{fig:6pelda}.
If we have $p$ processors, then it can be done in $O(1)$ time.
After this each element (with the exception of the last one) points to the element whose original distance
was two. In the next step of the pointer jumping the elements point to the element whose original
distance was 4 (if there is no such element, then to the last one), as shown in
the third row of Figure \ref{fig:6pelda}.
In the next step the pointers point to the neighbour at distance 8 (or
to the last element, if there is no element at distance 8), according to the last row of Figure
\ref{fig:6pelda}.
In each step of the algorithm each element updates the information on the number of elements between
itself and the element pointed to by the pointer. Let $R[i]$ and $N[i]$ denote the rank and the neighbour
field of the element $A[i]$, resp. The initial value of $R[i]$ is 1 for the majority of
the elements, but is 0 for the rightmost element ($R[4] = 0$ in the first line of Figure \ref{fig:6pelda}).
During the pointer jumping $R[i]$ gets the new value
$R[i] + R[N[i]]$, if $N[i] \neq 0$.
E.g. in the second row of Figure \ref{fig:6pelda} $R[1] = 1 + 1 = 2,$
since its previous rank is 1, and the rank of its right neighbour is also 1.
After this $N[i]$ will be modified to point to $N[N[i]]$. E.g. in the second row of
Figure \ref{fig:6pelda} $N[1] = 3,$ since the right neighbour of the right neighbour of $A[1]$ is $A[3]$.
\begin{tetel} Algorithm \textsc{Det-Ranking} computes the ranks of an array consisting of $p$ elements
on $p$ \textsc{EREW PRAM} processors in $\Theta{(\lg p)}$ time.
\label{theorem:detranking}\end{tetel}
Since the work of \textsc{Det-Ranking} is $\Theta(p \lg p)$, this algorithm is not work-optimal,
but it is work-efficient.
The list ranking problem corresponds to a prefix problem on the list in which each element equals 1, except the
last element of the list, which equals 0. One can easily modify \textsc{Det-Ranking} to get a prefix
algorithm.
\subsection{Merge}\index{merge|(}
The input of the \ki{merging problem} is two sorted sequences $X_1$ and $X_2$ and the output is one sorted
sequence $Y$ containing the elements of the input.
If the total length of the input sequences is $p$, then the merging problem can be solved in $O(p)$
time using a sequential processor. Since we have to investigate all elements and write them
into the corresponding positions of $Y$, the running time
of any algorithm is $\Omega(p)$. We get this lower bound even in the case when we count only
the number of necessary comparisons.
\subsubsection{Merge in logarithmic time.}
Let $X_1 = x_1, x_2, \ldots , x_m$ and $X_2 = x_{m + 1}, x_{m + 2},
\ldots, x_{2m}$ be the input sequences. For the sake of simplicity let
$m$ be a power of two and let the elements be different.
To merge two sequences of length $m$ it is enough to know the ranks of the keys, since then we can write the
keys---using $p = 2m$ processors---into the corresponding memory locations with one parallel write operation.
The running time of the following algorithm is logarithmic, therefore it is called
\textsc{Logarithmic-Merge}.
\begin{tetel} Algorithm \textsc{Logarithmic-Merge} merges two sequences of length $m$
on $2m$ \textsc{CREW PRAM} processors in $\Theta(\lg m)$ time.
\end{tetel} \index{\textsc{Logarithmic-Merge}}\index{\textsc{Logarithmic-Merge}}
\begin{biz} Let the rank of element $x$ be $r_1$ in $X_1$ and $r_2$ in $X_2$. If $x = x_j \in X_1$, then
$r_1 = j$. If we assign a single processor $P$ to the element $x$, then it can determine,
using binary search, the number $q$ of elements in $X_2$ which are smaller than $x$. If $q$ is known,
then $P$ computes the rank of $x$ in the union of $X_1$ and $X_2$ as $j + q$. If $x$ belongs to
$X_2$, the method is the same.
Summarising the time requirements we get that, using one CREW PRAM processor per element,
that is, $2m$ processors in total, the running time is $\Theta(\lg m)$.
\end{biz}
This algorithm is not work-optimal, only work-efficient.
\subsubsection{Odd-even merging algorithm.}
The following recursive algorithm \textsc{Odd-Even-Merge}
follows the classical \textit{divide-and-conquer}\index{divide-and-conquer} principle.
Let $X_1 = x_1, x_2, \ldots, x_m$ and $X_2 = x_{m + 1}, x_{m + 2},
\ldots, x_{2m}$ be the two input sequences. We suppose that $m$
is a power of 2 and the elements of the arrays are different.
The output of the algorithm is the sequence $Y = y_1, \ldots, y_{2m}$, containing the merged elements.
This algorithm requires $2m$ EREW PRAM processors.
\begin{algN}{Odd-Even-Merge$(X_1,X_2)$}\inddef{\textsc{Odd-Even-Merge}}
1 \' \key{if} \= $m = 1$ \\
2 \' \> \key{then} get $Y$ by merging $x_1$ and $x_2$ with one comparison \\
3 \' \> \key{return} $Y$ \\
4 \' \key{if} \= $m > 1$ \\
5 \' \> \key{then} \= $P_i$ \= \key{in parallel for} $i \leftarrow 1 \key{ to } m$ \\
6 \' \> \> \> \key{do} \= merge recursively $X_1^{odd} = x_1, x_3, \ldots, x_{m - 1}$ and \\
7 \' \> \> \> \> $X_2^{odd} = x_{m+1}, x_{m+3}, \ldots, x_{2m - 1}$ to get $L_1 = l_1, l_2, \ldots, l_m$ \\
8 \' \> \> $P_i$ \= \key{in parallel for} $i \leftarrow m+1 \key{ to } 2m$ \\
9 \' \> \> \> \key{do} \= merge recursively $X_1^{even} = x_2, x_4, \ldots, x_m$ and \\
10 \' \> \> \> \> $X_2^{even} = x_{m+2}, x_{m+4}, \ldots, x_{2m}$ to get $L_2 = l_{m + 1}, l_{m + 2}, \ldots, l_{2m}$ \\
11 \' \> \> $P_i$ \= \key{in parallel for} $i \leftarrow 1 \key{ to } m$ \\
12 \' \> \> \> \key{do} \= $y_{2i-1} \leftarrow l_i$ \\
13 \' \> \> \> \> $y_{2i} \leftarrow l_{m+i}$ \\
14 \' \> \> \> \> \key{if} \= $y_{2i} > y_{2i + 1}$ \\
15 \' \> \> \> \> \> \key{then} \= $z \leftarrow y_{2i}$ \\
16 \' \> \> \> \> \> \> $y_{2i} \leftarrow y_{2i+1}$ \\
17 \' \> \> \> \> \> \> $y_{2i+1} \leftarrow z$ \\
18 \' \key{return} $Y$
\end{algN}
\vspace{-4mm}
\begin{pelda} \textit{Merge of twice eight numbers.} Let $X_1$ = 1, 5, 8, 11, 13, 16, 21, 26 and $X_2$ =
3, 9, 12, 18, 23, 27, 31, 65. Figure \ref{fig:8meg8} shows the merge of the 16 numbers.
At first elements of $X_1$ with odd indices form the sequence $X^{odd}_1$ and elements with even indices form
the sequence $X^{even}_1$, and in the same way we get the sequences $X^{odd}_2$ and $X^{even}_2$. Then comes the
recursive merge of the two odd sequences resulting $L_1$ and the recursive merge of the even sequences
resulting $L_2$.
After this \textsc{Odd-Even-Merge} shuffles $L_1$ and $L_2$, resulting in the sequence $Y = y_1, \ldots, y_{2m}$:
the elements of $Y$ with odd indices come from $L_1$ and the elements with even indices come from $L_2.$
Finally we compare each element of $Y$ with even index with the next element (that is, $Y[2]$ with $Y[3]$,
$Y[4]$ with $Y[5]$, etc.) and, if necessary (that is, if they are not in the correct order), they are exchanged.
\end{pelda}
\parkep{ia9.14EN}{Sorting of 16 numbers by algorithm \textsc{Odd-Even-Merge}.\label{fig:8meg8}}
\vspace{-4mm}
\begin{tetel} [merging in $\Theta(\lg m)$ time] Algorithm \textsc{Odd-Even-Merge}
merges two sequences of length $m$ elements in $\Theta(\lg m)$ time using $2m$ \textsc{EREW PRAM} processors.
\end{tetel}
\newpage
\begin{biz} Let $W(m)$ denote the running time of the algorithm. The recursive merges require
$W(m/2)$ time, and the final shuffle with the compare-exchange step requires $\Theta(1)$ time. Therefore we get the recurrence
\begin{equation}
W(m) = W(m/2) + \Theta(1),
\end{equation}
having the solution $W(m) = \Theta(\lg m)$.
\end{biz}
We prove the correctness of this algorithm using the \ki{zero-one principle.}\inddef{zero-one principle}
A comparison-based sorting algorithm is \ki{oblivious,}\inddef{oblivious algorithm}
if the sequence of comparisons is fixed (the elements to be compared do not depend on the results of
the earlier comparisons). This definition means that the sequence of the pairs of elements to be compared,
$(i_1, j_1), \ (i_2, j_2), \ldots, \ (i_m, j_m)$, is given in advance.
\begin{tetel} [zero-one principle] If an oblivious comparison-based sorting algorithm correctly sorts
an arbitrary 0-1 sequence of length $n$, then it also correctly sorts
any sequence of length $n$ consisting of arbitrary keys.
\end{tetel}
\begin{biz} Let A be a comparison-based oblivious sorting algorithm
and let $S$ be a sequence of elements sorted incorrectly by A.
Suppose that A sorts the elements of $S$ in increasing order.
Then the incorrectly sorted output sequence $S'$ contains an element $x$ in the $i$-th $(1 \leq i \leq n - 1)$
position in spite of the fact that $S$ contains at least
$i$ keys smaller than $x$.
Let $x$ be the first (having the smallest index) such element of $S'$.
Substitute in the input sequence the elements smaller than $x$ by 0's and
the remaining elements by 1's. This modified sequence is a 0-1 sequence, therefore A
sorts it correctly. This observation implies that in the sorted 0-1 sequence at least
$i$ 0's precede the 1 written in the place of $x$.
Now colour the elements of the input sequence smaller than $x$ red,
and the remaining elements blue (in the original and in the
transformed sequence too). We can show by induction that
the coloured sequences are identical at the start and remain identical after each comparison.
According to the colours we have three types of comparisons: blue-blue, red-red and blue-red. If
the compared elements have the same colour, then in both sequences (whether or not an exchange takes place)
the colours of the positions remain the same. If we compare elements of different colours, then in
both sequences the red element ends up at the position with the smaller index. So finally
we get a contradiction, proving the assertion of the theorem.
\end{biz}
\begin{pelda} \textit{A non comparison-based sorting algorithm.}
Let $x_1, x_2, \ldots, x_n$ be a bit sequence. We can sort this sequence simply by counting
the zeros: if we count $z$ zeros, then we write $z$ zeros followed by
$n - z$ ones. Of course, this algorithm does not sort sequences of arbitrary keys correctly.
Since this algorithm is not comparison-based, this fact does not contradict the zero-one principle.
But merge is sorting, and \textsc{Odd-Even-Merge} is an oblivious sorting algorithm.
\end{pelda}
\begin{tetel} Algorithm \textsc{Odd-Even-Merge} correctly merges sequences consisting of arbitrary numbers.
\end{tetel}
\begin{biz} Let $X_1$ and $X_2$ be sorted 0-1 sequences of length $m$.
Let $q_1$ ($q_2$) be the number of zeros at the beginning of $X_1$ ($X_2$).
Then the number of zeros in $X_1^{odd}$ equals
$\lceil q_1/2 \rceil$, while the number of zeros in $X_1^{even}$ is
$\lfloor q_1/2 \rfloor$. Therefore the number of zeros in $L_1$ equals
$z_1 = \lceil q_1/2 \rceil + \lceil q_2/2 \rceil$ and the number of zeros in $L_2$ equals
$z_2 = \lfloor q_1/2 \rfloor + \lfloor q_2/2 \rfloor$.
The difference of $z_1$ and $z_2$ is at most 2. This difference equals 2 exactly when
$q_1$ and $q_2$ are both odd numbers. Otherwise the difference is at most 1.
Suppose that $z_1 - z_2 = 2$ (the proof in the other cases is similar).
In this case $L_1$ contains two more zeros than $L_2$. When the algorithm shuffles $L_1$ and $L_2$,
the resulting sequence $Y$ begins with $2z_2 + 1$ zeros and ends with ones, and between them there is a short
``dirty" pair 1, 0. The compare-exchange in the last step of the algorithm repairs this pair, so the whole sequence becomes sorted.
\end{biz}
\subsubsection{A work-optimal merge algorithm.}
Algorithm \textsc{Work-Optimal-Merge} uses only $\lceil 2m/\lg m \rceil$ processors,
but solves the merging in logarithmic time.
This algorithm divides the original problem into $\lceil m/\lg m \rceil$ parts so
that each part contains approximately $\lg m$ elements.
Let $X_1 = x_1,x_2,\ldots,x_m$ and $X_2=x_{m+1},x_{m+2},\ldots,x_{m+m}$ be the input sequences.
Divide $X_1$ into $M = \lceil m/\lg m \rceil$ parts so that each part contains at most
$\lceil \lg m \rceil$ elements. Let the parts be denoted by $A_1, A_2, \ldots, A_M$,
and let the largest element of $A_i$ be $l_i \ (i = 1, 2, \ldots, M)$.
Assign a processor to each element $l_i$. These processors determine (by binary search)
the correct place (according to the sorting) of $l_i$ in $X_2$. These places divide
$X_2$ into $M + 1$ parts (some of these parts can be empty). Let
these parts be denoted by $B_1, \ B_2, \ \ldots, \ B_{M+1}$.
We call $B_i$ the subset corresponding to $A_i$ in $X_2$ (see Figure \ref{fig:optmer}).
\parkep{ia9.15EN}{A work-optimal merge algorithm \textsc{Optimal-Merge}.\label{fig:optmer}}
The algorithm obtains the merged sequence by merging first
$A_1$ with $B_1$, then $A_2$ with $B_2$, and so on, and finally
joining these merged sequences.
\begin{tetel} Algorithm \textsc{Optimal-Merging} merges two sorted sequences of length $m$
in $O(\lg m)$ time on $\lceil 2m/\lg m \rceil$ \textsc{CREW PRAM} processors.
\end{tetel}
\begin{biz} We use the previous algorithm.
The length of the parts $A_i$ is at most $\lceil \lg m \rceil$, but the length of the parts $B_i$ can be much larger.
Therefore we repeat the partition. Let
$A_i, \ B_i$ be an arbitrary pair. If $|B_i| = O(\lg m)$, then $A_i$ and
$B_i$ can be merged using one processor in $O(\lg m)$ time.
But if $|B_i| = \omega (\lg m)$, then divide $B_i$ into $\lceil |B_i|/\lg m \rceil$
parts---then each part contains at most $\lceil \lg m \rceil$ keys.
Assign a processor to each part. This assigned processor finds the subset corresponding to its subsequence
in $A_i$: $O(\lg \lg m)$ time is sufficient to do this. So the merge of $A_i$ and $B_i$
can be reduced to $\lceil |B_i|/\lg m \rceil$ subproblems, where each subproblem is the merge of two sequences of
length $O(\lg m)$.
The number of processors used is $\sum_{i = 1}^{M} \left \lceil
|B_i|/\lg m \right \rceil$, which is at most $m / \lg m + M$, that is,
not larger than $2M$.
\end{biz}\index{Merge|)}
This theorem implies the following corollary.
\begin{kov} \textsc{Optimal-Merging} is work-optimal.
\end{kov}
\subsection{Selection}\index{Selection|(}
In the \ki{selection problem} $n \ge 2$ elements and a positive integer $i \ (1 \le i \le n)$ are given
and the $i$-th smallest element is to be selected.\inddef{selection}
Since selection requires the investigation of all elements, and our operations
can handle at most two elements, $N(n) = \Omega(n)$.
Since a sequential algorithm A requiring only $W(n,\textrm{A}) = O(n)$ time is known,
A is asymptotically optimal.
The \ki{search problem}\inddef{search problem} is similar:
there the algorithm has to decide whether a given element appears in the given sequence, and if so,
where. Here a negative answer is also possible, and the features of each element decide
whether it meets the requirements or not.
We investigate three special cases and present work-efficient algorithms to solve them.
\subsubsection{Selection in constant time using $n^2$ processors.}
Let $i = n$, that is we wish to select the largest key. Algorithm
\textsc{Quadratic-Select} solves this task in $\Theta(1)$ time using $n^2$ CRCW processors.
The input ($n$ different keys) is the sequence $X = x_1, \ x_2, \ \ldots, \ x_n$, and
the selected largest element is returned as $y$.
\begin{algN}{Quadratic-Select($X$)}\inddef{\textsc{Quadratic-Select}}
1 \' \key{if} \= $n = 1$ \= \\
2 \' \> \key{then} \= $y \leftarrow x_1$ \\
3 \' \> \> \key{return} $y$ \\
4 \' $P_{ij}$ \= \key{in parallel for} $i \leftarrow 1 \key{ to } n, \ j \leftarrow 1 \key{ to } n$ \\
\> \key{do} \= \key{if} \= $x_i < x_j$ \\
5 \' \> \> \> \key{then} \= $x_{i,j} \leftarrow \textsc{false}$ \\
6 \' \> \> \> \key{else} \> $x_{i,j} \leftarrow \textsc{true}$ \\
7 \' $P_{i1}$ \= \key{in parallel for} $i \leftarrow 1 \key{ to } n$ \\
8 \' \> \key{do} $L_i \leftarrow \textsc{true}$ \\
9 \' $P_{ij}$ \= \key{in parallel for} $i \leftarrow 1 \key{ to } n, \ j \leftarrow 1 \key{ to } n$ \\
10 \' \> \key{if} \= $x_{i,j} = \textsc{false}$ \\
11 \' \> \> \key{then} $L_i \leftarrow \textsc{false}$ \\
12 \' $P_{i1}$ \= \key{in parallel for} $i \leftarrow 1 \key{ to } n$ \\
13 \' \> \key{do} \= \key{if} \= $L_i = \textsc{true}$ \\
14 \' \> \> \> \key{then} $y \leftarrow x_i$ \\
15 \' \key{return} $y$
\end{algN}
In the first round (lines 4--6) the keys are compared in a parallel manner, using all the $n^2$ processors
$P_{ij} \ (1 \le i, j \le n)$: processor $P_{ij}$ computes the logical value $x_{i,j}$, which is
\textsc{true} if and only if $x_i \geq x_j$.
We suppose that the keys are different. If the elements are not different, then we can use instead
of $x_i$ the pair $(x_i, i)$ (this solution requires an additional $\lg n$ bits per key).
Since there is a unique key for which all these values are \textsc{true},
this unique key can be found using the logical operations in lines 7--11.
\begin{tetel} [selection in $\Theta(1)$ time] Algorithm \textsc{Quadratic-Select} determines the largest key
of $n$ different keys in $\Theta(1)$ time using $n^2$ \textsc{CRCW} common \textsc{PRAM}
processors.\label{theorem:quadratic}
\end{tetel}
\begin{biz} First and third rounds require unit time, the second round requires $\Theta(1)$ time,
so the total running time is $\Theta(1)$.
\end{biz}
The speedup of this algorithm is $\Theta (n)$. The work of the algorithm is
$W = \Theta (n^2)$. So the efficiency is
$E = \Theta(n)/\Theta (n^2) = \Theta(1/n)$. It follows that this algorithm is not
work-optimal; it is not even work-effective.
\subsubsection{Selection in $O(\lg \lg n)$ time on $n$ processors.}
Now we show that the maximal element among $n$ keys can be found using only
$n$ common CRCW PRAM processors in $\Theta(\lg \lg n)$ time.
The used technique is the \textit{divide-and-conquer}.\index{divide-and-conquer}
For simplicity let $n$ be a square number.
The input and the output are the same as at the previous algorithm.
\begin{alg}{Quick-Selection$(X)$}\inddef{\textsc{Quick-Selection}}
1 \' \key{if} \= $p = 1$ \\
2 \' \> \key{then} \= $y \leftarrow x_1$ \\
3 \' \> \> \key{return} $y$ \\
4 \' \key{if} \= $p > 1$ \\
5 \' \> \key{then} \= divide the input into groups $G_1, G_2, \ldots, G_{a}$ and \\
\' \> \> divide the processors into groups $Q_1, Q_2, \ldots, Q_{a}$ \\
6 \' $Q_i$ \= \key{in parallel for} $i \leftarrow 1 \key{ to } a$ \\
7 \' \> \key{do} determine recursively the maximal element $M_i$ of the group $G_i$ \\
8 \' $y \leftarrow$ \textsc{Quadratic-Select}$(M)$ \\
9 \' \key{return} $y$
\end{alg}
The algorithm divides the input into $\sqrt p = a$ groups $(G_1, G_2, \ldots, G_{a})$ so, that
each group contains $a$ elements $(x_{(i - 1)a + 1}, x_{(i - 1)a + 2}, \ldots, x_{ia})$,
and divides the processors into $a$ groups
$(Q_1, Q_2, \ldots, Q_{a})$ so, that group $Q_i$ contains $a$ processors
$P_{(i - 1)a + 1}, P_{(i - 1)a + 2}, \ldots, P_{ia}$.
Then the group of processors $Q_i$ computes recursively the maximum $M_i$ of group $G_i$.
Finally the previous algorithm \textsc{Quadratic-Select} gets as input the sequence
$M = M_1, \ldots, M_a$ and finds the maximum y of the input sequence $X$.
\begin{tetel} [selection in $\Theta(\lg \lg p)$ time] Algorithm \textsc{Quick-Select} determines the largest of
$p$ different elements in $\Theta(\lg \lg p)$ time using $p$ common \textsc{CRCW PRAM} processors.
\end{tetel}
\begin{biz} Let the running time of the algorithm be denoted by $W(p)$.
The recursive computation of the group maxima requires $W(\sqrt{p})$ time, and the call of
\textsc{Quadratic-Select} requires $\Theta(1)$ time, since the $\sqrt{p}$ group maxima
can be handled by the $(\sqrt{p})^2 = p$ available processors. Therefore $W(p)$
satisfies the recursive equation
\begin{equation}
W(p) = W(\sqrt{p}) + \Theta(1),
\end{equation}
having the solution $\Theta(\lg \lg p)$.
\end{biz}
The total work of algorithm \textsc{Quick-Selection} is $\Theta (p \lg \lg p)$, so
its efficiency is $\Theta(p)/\Theta (p \lg \lg p) = \Theta (1/\lg \lg p)$,
therefore \textsc{Quick-Selection} is not work-optimal, only work-effective.
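The recursive structure above can be illustrated by the following sequential Python sketch (the function name is ours; the recursive calls on the groups are the ones the PRAM executes in parallel, and the final maximum over the roughly $\sqrt{p}$ candidates stands in for the constant-time \textsc{Quadratic-Select}):

```python
import math

def quick_selection_max(x):
    # Sequential sketch of the Quick-Selection recursion: split the p
    # elements into groups of size about sqrt(p), find the maximum of
    # each group recursively, then take the maximum of the candidates
    # (on a common CRCW PRAM this last step is Quadratic-Select, O(1)).
    p = len(x)
    if p <= 2:
        return max(x)
    a = math.isqrt(p - 1) + 1                    # ceil(sqrt(p)); group size
    groups = [x[i:i + a] for i in range(0, p, a)]
    candidates = [quick_selection_max(g) for g in groups]  # parallel on a PRAM
    return max(candidates)                       # plays the role of Quadratic-Select
```

The recursion depth of this sketch is $\Theta(\lg \lg p)$, in accordance with the theorem.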
\vspace{-2mm}
\subsubsection{Selection from integer numbers.}
If the problem is to find the maximum of $n$ keys when the keys consist of
one bit, then the problem can be solved using a logical \textsc{or} operation,
and so requires only constant time using $n$ processors.
Now we try to extend this observation.
Let $c$ be a given positive integer constant, and suppose the keys are integers
belonging to the interval $[0, n^c]$. Then the keys can be represented using at
most $c \lg n$ bits. For simplicity we suppose that all keys are given as binary numbers
of length $c \lg n$ bits.
The following algorithm \textsc{Integer-Selection} requires only constant time
and $n$ CRCW \index{CRCW} PRAM processors to find the maximum.
The basic idea is to partition the bits $b_1, b_2, \ldots, b_{c \lg n}$ of the numbers into
parts of length $(\lg n)/2$. The $i$-th part contains the bits
$b_{(i - 1)(\lg n)/2 + 1}$, $b_{(i - 1)(\lg n)/2 + 2}, \ldots, b_{i(\lg n)/2}$,
and the number of parts is $2c$. Figure \ref{fig:intmax} shows the partition.
\parkep{ia9.16}{Selection of maximal integer number.\label{fig:intmax}}
The input of \textsc{Integer-Selection} is the number of processors $n$ and
the sequence $X = x_1, \ x_2, \ \ldots, \ x_n$ containing different integer numbers, and the output
is the maximal number $y$.
\begin{alg}{Integer-Selection$(n,X)$}\inddef{\textsc{Integer-Selection}}
1 \' \textbf{for} \= $i \leftarrow 1 \key{ to } 2c$ \\
2 \' \> \key{do} \= compute the maximum $(M)$ of the remaining numbers on the base of \\
\> \> their $i$-th part\\
3 \' \> \> delete the numbers whose $i$-th part is smaller than $M$\\
4 \' $y \leftarrow$ one of the remaining numbers \\
5 \' \key{return} $y$
\end{alg}
The algorithm starts by searching for the maximum on the basis of the first part of the numbers. Then it
deletes the numbers whose first part is smaller than this maximum. This is repeated for the second, \ldots,
last part of the numbers. Any of the non-deleted numbers is maximal.
\begin{tetel} [selection from integer numbers] If the numbers are integers drawn
from the interval $[0, n^c],$ then algorithm \textsc{Integer-Selection} determines
the largest number among $n$ numbers for any positive $c$ in $\Theta(1)$ time using $n$
\textsc{CRCW PRAM} processors.
\end{tetel}
\begin{biz} Suppose that we start by selecting the numbers whose
first part, that is, whose $(\lg n)/2$ most significant bits, is maximal. Let $M$ denote this maximum of the first parts.
The numbers whose first part is smaller than $M$ are certainly not maximal, therefore they can be deleted.
If we execute this basic operation for all parts (that is, $2c$ times), then exactly those numbers are deleted
which are not maximal, and all maximal elements remain.
If a key contains at most $(\lg n)/2$ bits, then its value is at most $\sqrt{n} - 1$. So in each round algorithm
\textsc{Integer-Selection} determines the maximum of integer numbers taken from the
interval $[0,\sqrt{n} - 1]$. The algorithm assigns a processor to each number and uses $\sqrt{n}$
common memory locations $(M_0, M_1, \ldots, M_{\sqrt{n} - 1})$, containing
initially $-\infty$. In one step processor $P_i$ writes $k_i$ into $M_{k_i}$. Later the maximum of all numbers
can be determined from $\sqrt{n}$ memory cells using $n$ processors by Theorem \ref{theorem:quadratic}
in constant time.\index{Selection|)}
\end{biz}
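The filtering idea of \textsc{Integer-Selection} can be checked with a short sequential Python sketch (the function name and the bit arithmetic are ours; on the PRAM every round costs only constant time):

```python
import math

def integer_selection(x, c):
    # Sequential sketch of Integer-Selection: each key (an integer in
    # [0, n^c]) is viewed as 2c parts of (lg n)/2 bits each; in every
    # round only the numbers whose current part (most significant
    # part first) is maximal survive.
    n = len(x)
    part_bits = max(1, int(math.log2(n)) // 2)   # (lg n)/2 bits per part
    rounds = 2 * c
    mask = (1 << part_bits) - 1
    remaining = list(x)
    for i in range(rounds):
        shift = (rounds - 1 - i) * part_bits
        m = max((v >> shift) & mask for v in remaining)
        remaining = [v for v in remaining if (v >> shift) & mask == m]
    return remaining[0]                          # any surviving number is maximal
```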
\subsubsection{General selection.}
Let the sequence $X = x_1, \ x_2, \ldots, x_n$ contain different numbers, and let the problem be
to select the $k$-th smallest element of $X$.
Suppose we have $p = n^2/\lg n$ CREW PRAM processors.
\begin{alg}{General-Selection$(X)$}\inddef{\textsc{General-Selection}}
1 \' divide the $n^2/\lg n$ processors into $n$ groups $G_1,\ldots,G_n$ so that group $G_i$ \\
\' contains the processors $P_{i,1},P_{i,2},\ldots,P_{i,n/\lg n}$ and divide \\
\' the $n$ elements into $n/\lg n$ groups $(X_1, X_2,\ldots,X_{n/\lg n})$ so that group $X_j$ \\
\' contains the elements $x_{(j-1)\lg n+1},x_{(j-1)\lg n+2},\ldots,x_{j\lg n}$ \\
2 \' $P_{i,j}$ \= \key{in parallel for} $i \leftarrow 1 \key{ to } n, \ j \leftarrow 1 \key{ to } n/\lg n$ \\
3 \' \> \key{do} determine $h_{ij}$ (how many elements of $X_j$ are smaller than $x_i$) \\
4 \' $G_i$ \= \key{in parallel for} $i \leftarrow 1 \key{ to } n$ \\
5 \' \> \key{do} using \textsc{Optimal-Prefix} determine $s_i$ \\
\' \> (how many elements of $X$ are smaller than $x_i$) \\
6 \' $P_{i,1}$ \= \key{in parallel for} $i \leftarrow 1 \key{ to } n$ \\
7 \' \> \key{do} \= \key{if} \= $s_i = k - 1$ \\
8 \' \> \> \> \key{then return} $x_i$
\end{alg}
\begin{tetel}[general selection] The algorithm \textsc{General-Selection}
determines the $k$-th smallest of $n$ different numbers in $\Theta(\lg n)$ time
using $n^2/\lg n$ processors.\label{theorem:genselect}
\end{tetel}
\begin{biz}
In lines 2--3 $P_{i,j}$ works as a sequential processor, therefore these lines require $\Theta(\lg n)$ time.
Lines 4--5 require $\Theta(\lg n)$ time according to Theorem \ref{theorem:optprefix}. Lines 6--8 can be executed
in constant time, so the total running time is $\Theta(\lg n)$.
\end{biz}
The work of \textsc{General-Selection} is $\Theta(n^2)$,
therefore this algorithm is not work-effective.
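The rank computation behind \textsc{General-Selection} is easy to reproduce sequentially. In the Python sketch below (names are ours) the group-wise counts $h_{ij}$, which the PRAM sums with \textsc{Optimal-Prefix}, are collapsed into one loop:

```python
def general_selection(x, k):
    # Sequential sketch of General-Selection: s_i counts how many
    # elements of X are smaller than x_i (the PRAM computes the group
    # counts h_ij in parallel and sums them with Optimal-Prefix);
    # the k-th smallest element is the x_i with s_i = k - 1.
    for xi in x:
        s_i = sum(1 for xj in x if xj < xi)
        if s_i == k - 1:
            return xi
```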
\index{selection|)}
\subsection{Sorting}\index{sorting|(}
Given a sequence $X = x_1, x_2, \ldots, x_n$ the \ki{sorting problem}\inddef{sorting problem} is
to rearrange the elements of $X$ e.g. in increasing order.
It is well known that any sequential comparison-based sorting algorithm $A$ needs
$N(n,A) = \Omega (n \lg n)$ comparisons, and there are comparison-based sorting algorithms
with $O(n \lg n)$ running time.
There are also algorithms, using special operations or sorting numbers with special features,
which solve the sorting problem in linear time. If we have to investigate all elements of $X$
and the permitted operations can handle at most two elements, then $N(n) = \Omega(n)$.
So there are asymptotically optimal sequential algorithms both among the comparison-based
and among the non-comparison-based sorting algorithms.
In this subsection we consider three different sorting algorithms.
\subsubsection{Sorting in logarithmic time using $n^2$ processors.}
Using the ideas of algorithms \textsc{Quadratic-Select} and \textsc{Optimal-Prefix}
we can sort $n$ elements using $n^2$ processors in $\Theta(\lg n)$ time.
\begin{algN}{Quadratic-Sort($X$)}\inddef{\textsc{Quadratic-Sort}}
1 \' \key{if} \= $n = 1$ \= \\
2 \' \> \key{then} \= $y_1 \leftarrow x_1$ \\
3 \' \> \> \key{return} $Y$ \\
4 \' $P_{ij}$ \= \key{in parallel for} $i \leftarrow 1 \key{ to } n, \ j \leftarrow 1 \key{ to } n$ \\
\> \key{do} \= \key{if} \= $x_i < x_j$ \\
5 \' \> \> \> \key{then} \= $x_{i,j} \leftarrow 0$ \\
6 \' \> \> \> \key{else} \> $x_{i,j} \leftarrow 1$ \\
7 \' divide \= the processors into $n$ groups $(G_1,G_2,\ldots,G_n)$ so, that group $G_i$ contains \\
\' \> processors $P_{i,1},P_{i,2},\ldots,P_{i,n}$ \\
8 \' $G_{i}$ \= \key{in parallel for} $i \leftarrow 1 \key{ to } n$ \\
9 \' \> \key{do} compute $s_i = x_{i,1}+x_{i,2}+\cdots+x_{i,n}$ \\
10 \' $P_{i1}$ \= \key{in parallel for} $i \leftarrow 1 \key{ to } n$ \\
11 \' \> \key{do} $y_{s_i} \leftarrow x_i$ \\
12 \' \key{return} $Y$
\end{algN}
In lines 4--7 the algorithm compares all pairs of the elements (as \textsc{Quadratic-Select} does),
then in lines 7--9 (in a similar way as \textsc{Optimal-Prefix} works) it counts
how many elements of $X$ are not larger than the investigated $x_i$, and finally in lines 10--12
one processor of each group writes the final result into the corresponding memory cell.
\begin{tetel} [sorting in $\Theta(\lg n)$ time] Algorithm \textsc{Quadratic-Sort} sorts $n$ elements using
$n^2$ \textsc{CRCW PRAM} processors in $\Theta(\lg n)$ time.
\end{tetel}
\begin{biz}
Lines 8--9 require $\Theta(\lg n)$ time, and the remaining lines require only constant time.
\end{biz}
Since the work of \textsc{Quadratic-Sort} is $\Theta(n^2 \lg n)$, this algorithm is not work-effective.
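\textsc{Quadratic-Sort} is a rank sort; the following sequential Python sketch (0-indexed, with our own names) performs the same $n^2$ comparisons that the PRAM executes in a single parallel step:

```python
def quadratic_sort(x):
    # Sequential sketch of Quadratic-Sort (rank sorting) for distinct
    # elements: s_i counts the elements not larger than x_i (the row
    # sum of the comparison matrix), and x_i is written to the output
    # position given by this rank; on the PRAM all comparisons take
    # one step and each row sum Theta(lg n) time.
    n = len(x)
    y = [None] * n
    for i in range(n):
        s_i = sum(0 if x[i] < x[j] else 1 for j in range(n))  # 1 <= s_i <= n
        y[s_i - 1] = x[i]
    return y
```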
\subsubsection{Odd-even algorithm with $O(\lg^2 n)$ running time.}
The next algorithm uses the \textsc{Odd-Even-Merge}\index{\textsc{Odd-Even-Merge}} algorithm and the
classical \textit{divide-and-conquer} principle.\index{divide-and-conquer}
The input is the sequence $X = x_1, \ \ldots, \ x_n$, containing the numbers to be sorted, and the output is
the sequence $Y = y_1, \ \ldots, \ y_n$, containing the sorted numbers.
\begin{algN}{Odd-Even-Sort($X$)}\inddef{\textsc{Odd-Even-Sort}}
1 \' \key{if} \= $n = 1$ \\
2 \' \> \key{then} $Y \leftarrow X$ \\
3 \' \key{if} \= $n > 1$ \\
4 \' \> \key{then} \= let $X_1 = x_1, x_2, \ldots, x_{n/2}$ and
$X_2 = x_{n/2 + 1}, x_{n/2+2}, \ldots, x_n$. \\
5 \' \> $P_i$ \= \key{in parallel for} $i \leftarrow 1 \key{ to } n/2$ \\
6 \' \> \> \key{do} sort recursively $X_1$ to get $Y_1$ \\
7 \' \> $P_i$ \= \key{in parallel for} $i \leftarrow n/2+1 \key{ to } n$ \\
8 \' \> \> \key{do} sort recursively $X_2$ to get $Y_2$ \\
9 \' \> $P_i$ \= \key{in parallel for} $i \leftarrow 1 \key{ to } n$ \\
10 \' \> \> \key{do} merge $Y_1$ and $Y_2$ using \textsc{Odd-Even-Merge}$(Y_1,Y_2)$ \\
11 \' \key{return} $Y$
\end{algN}
The running time of this EREW PRAM algorithm is $O(\lg ^2 n)$.
\begin{tetel} [sorting in $\Theta(\lg^2 n)$ time] Algorithm \textsc{Odd-Even-Sort} sorts $n$ elements
in $\Theta(\lg^2 n)$ time using $n$ \textsc{EREW PRAM} processors.
\end{tetel}
\begin{biz} Let $W(n)$ be the running time of the algorithm. Lines 3--4 require
$\Theta(1)$ time, lines 5--8 require $W(n/2)$ time, lines 9--10 require
$\Theta(\lg n)$ time, and line 11 requires $\Theta(1)$ time. Therefore $W(n)$ satisfies the recurrence
\begin{equation}
W(n) = \Theta(1) + W(n/2) + \Theta(\lg n),
\end{equation}
having the solution $W(n) = \Theta(\lg^2 n)$.
\end{biz}
\begin{pelda} \textit{Sorting on 16 processors.} Sort using 16 processors the following numbers:
62, 19, 8, 5, 1, 13, 11, 16, 23, 31, 9, 3, 18, 12, 27, 34. At first the input is
halved: the first 8 processors get the sequence
$X_1 = 62, 19, 8, 5, 1, 13, 11, 16$, while the other 8 processors get
$X_2 = 23, 31, 9, 3, 18, 12, 27, 34$. The output of the first 8 processors is
$Y_1 = 1, 5, 8, 11, 13, 16, 19, 62$, while the output of the second 8 processors is
$Y_2 = 3, 9, 12, 18, 23, 27, 31, 34$. The merged final result is
$Y = 1, 3, 5, 8, 9, 11, 12, 13, 16, 18, 19, 23, 27, 31, 34, 62$.
\end{pelda}
The work of the algorithm is $\Theta (n \lg ^2 n)$, its efficiency is
$\Theta(1/\lg n)$, and its speedup is $\Theta \left (n/\lg n \right )$.
The algorithm is not work-optimal, but it is work-effective.
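For completeness, here is a sequential Python sketch of \textsc{Odd-Even-Sort} together with a simplified recursive formulation of Batcher's \textsc{Odd-Even-Merge} (the function names are ours, and the input length is assumed to be a power of two):

```python
def odd_even_merge(a, b):
    # Recursive formulation of Batcher's odd-even merge of two sorted
    # sequences of equal power-of-two length; on an EREW PRAM all
    # compare-exchanges of one level run in parallel.
    if len(a) == 1:
        return [min(a[0], b[0]), max(a[0], b[0])]
    e = odd_even_merge(a[0::2], b[0::2])     # merge even-indexed elements
    o = odd_even_merge(a[1::2], b[1::2])     # merge odd-indexed elements
    out = [e[0]]
    for i in range(len(o) - 1):              # interleave with compare-exchange
        out += [min(o[i], e[i + 1]), max(o[i], e[i + 1])]
    out.append(o[-1])
    return out

def odd_even_sort(x):
    # Divide-and-conquer sort built on odd_even_merge, as in
    # algorithm Odd-Even-Sort.
    if len(x) == 1:
        return list(x)
    h = len(x) // 2
    return odd_even_merge(odd_even_sort(x[:h]), odd_even_sort(x[h:]))
```

On the input of the example above the sketch reproduces the merged final result.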
\subsubsection{Algorithm of Preparata with $\Theta(\lg n)$ running time.}
If we have more processors, then the running time can be decreased.
The following recursive algorithm due to Preparata
uses $n \lg n$ CREW PRAM processors and $\Theta(\lg n)$ time.
Input is the sequence $X = x_1, x_2, \ldots, x_n$, and the output is
the sequence $Y = y_1, y_2, \ldots, y_n$ containing the sorted elements.
\begin{alg}{Preparata(X)}\inddef{\textsc{Preparata}}
1 \' \key{if} \= $n \leq 20$ \\
2 \' \> \key{then} sort $X$ using $n$ processors and \textsc{Odd-Even-Sort} \\
3 \' \> \key{return} $Y$ \\
4 \' divide the $n$ elements into $\lg n$ parts $(X_1, \ X_2, \ \ldots, \ X_{\lg n})$ so that each part \\
\' contains $n/\lg n$ elements, and divide the processors into $\lg n$ groups \\
\' $(G_1, \ G_2, \ \ldots, \ G_{\lg n})$ so that each group contains $n$ processors \\
5 \' $G_i$ \= \key{in parallel for} $i \leftarrow 1 \key{ to } \lg n$ \\
6 \' \> \key{do} \= sort the part $X_i$ recursively to get a sorted sequence $S_i$ \\
7 \' divide \= the processors into $(\lg n)^2$ groups $(H_{1,1}, H_{1,2}, \ldots, H_{\lg n, \lg n})$, each \\
\' \> containing $n/\lg n$ processors\\
8 \' $H_{i,j}$ \= \key{in parallel for} $i \leftarrow 1 \key{ to } \lg n, j \leftarrow 1 \key{ to } \lg n$ \\
9 \' \> \key{do} merge $S_i$ and $S_j$ \\
10 \' divide the processors into $n$ groups $(J_1, \ J_2, \ldots, J_n )$ so, that each group \\
\' contains $\lg n$ processors \\
11 \' $J_i$ \= \key{in parallel for} $i \leftarrow 1 \key{ to } n$ \\
12 \' \> \key{do} \= determine the rank of element $x_i$ in $X$ using the local ranks \\
\' \> \> received in line 9 and using the algorithm \textsc{Optimal-Prefix} \\
13 \' \> \> $y_i \leftarrow$ the element of $X$ having rank $i$ \\
14 \' \key{return} $Y$
\end{alg}
This algorithm uses the \textit{divide-and-conquer}\index{divide-and-conquer} principle.
It divides the input into $\lg n$ parts, then merges each pair of parts.
These merges yield the local ranks of the elements. The global rank of an element can be computed by
summing up these local ranks.
\begin{tetel} [sorting in $\Theta(\lg n)$ time] Algorithm \textsc{Preparata} sorts $n$ elements in $\Theta(\lg n)$
time using $n \lg n$ \textsc{CREW PRAM} processors.
\end{tetel}
\begin{biz} Let the running time be $W(n)$. Lines 4--6 require $W(n/\lg n)$ time, and lines 7--12
together require $\Theta(\lg \lg n)$ time. Therefore $W(n)$ satisfies the equation
\begin{equation}
W(n) = W \left (n/\lg n \right ) + \Theta(\lg \lg n),
\end{equation}
having the solution $W(n) = \Theta(\lg n)$.
\end{biz}
The work of \textsc{Preparata} is the same as the work of \textsc{Odd-Even-Sort},
but the speedup is better: $\Theta(n)$. The efficiency of both algorithms is
$\Theta(1/\lg n)$.
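The rank computation at the heart of \textsc{Preparata} -- the global rank as the sum of the $\lg n$ local ranks -- can be sketched sequentially as follows (names are ours; the pairwise merges that produce the local ranks on the PRAM are replaced by direct counting):

```python
import math

def preparata_sort(x):
    # Sequential sketch of the ranking idea of Preparata's algorithm:
    # split X into lg n parts; the global rank of an element is the
    # sum of its local ranks in the parts (obtained on the PRAM by
    # merging each pair of sorted parts).
    n = len(x)
    g = max(1, int(math.log2(n)))            # number of parts, lg n
    size = -(-n // g)                        # ceil(n / g) elements per part
    parts = [x[i:i + size] for i in range(0, n, size)]
    y = [None] * n
    for xi in x:
        rank = sum(sum(1 for v in part if v < xi) for part in parts)
        y[rank] = xi                         # global rank = sum of local ranks
    return y
```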
\index{sorting|)}
\begin{gyak}
\ujgyak\label{ex:global}
The memory cell $M_1$ of the global memory contains some data. Design an algorithm,
which copies this data to the memory cells $M_2, M_3,$ $\ldots,$ $M_n$ in $O(\lg n)$
time, using $n$ EREW PRAM processors.
\ujgyak
Design an algorithm which solves the previous Exercise \ref{ex:global} using only $n/\lg n$
EREW PRAM\gyakindex{EREW PRAM} processors, preserving the $O(\lg n)$ running time.
\ujgyak
Design an algorithm having $O(\lg \lg n)$ running time and determining the maximum
of $n$ numbers using $n/\lg \lg n$ common CRCW PRAM\gyakindex{CRCW PRAM} processors.
\ujgyak
Let $X$ be a sequence containing $n$ keys. Design an algorithm to determine the rank of any key $k \in X$
using $n/\lg n$ CREW PRAM processors and $O(\lg n)$ time.
\vspace{-4mm}
\ujgyak
Design an algorithm having $O(1)$ running time which decides, using $n$ common CRCW PRAM processors,
whether a given array $A[1 \twodots n]$ contains the element 5, and if so,
returns the largest index $i$ for which $A[i] = 5$ holds.
\vspace{-4mm}
\ujgyak
Design an algorithm to merge \gyakindex{merge} two sorted sequences of length $n$
in $O(1)$ time, using $n^2$ CREW PRAM processors.
\ujgyak
Determine the running time,\gyakindex{speedup} speedup, work,\gyakindex{work} and
efficiency\gyakindex{efficiency} of all algorithms, discussed in this section.
\index{relative speed}\index{efficiency}\index{work}
\end{gyak}
\section{Mesh algorithms}\label{sec:mesh}\index{mesh|(}
To illustrate another model of computation we present two algorithms
solving the prefix problem\index{prefix problem} on meshes.
\subsection{Prefix on chain}\index{prefix computation on chain}
Suppose that processor $P_i \ (i = 1, \ 2, \ \ldots, \ p)$ of the chain
$\mathcal{L} = \{ P_1, P_2, \ldots, P_p \}$ stores element $x_i$ in its local memory,
and after the parallel computations the prefix $y_i$ will be stored in the
local memory of $P_i$.
At first we introduce a naive algorithm. Its input is the sequence of elements
$X = x_1, \ x_2, \ \ldots, \ x_p$, and its output is the sequence
$Y = y_1, \ y_2, \ \ldots, y_p$, containing the prefixes.
\begin{alg}{Chain-Prefix(X)}\inddef{\textsc{Chain-Prefix}}
1 \' $P_1$ computes and stores $y_1 \leftarrow x_1$, then sends $y_1$ to $P_2$ \\
2 \' \key{for} \= $i \leftarrow 2 \key{ to } p-1$ \\
3 \' \> \key{do} \= $P_i$ gets $y_{i-1}$ from $P_{i-1}$, then computes and stores \\
4 \' \> \> $y_i \leftarrow y_{i-1} \oplus x_i$, and sends $y_i$ to $P_{i+1}$ \\
5 \' $P_p$ gets $y_{p-1}$ from $P_{p-1}$, then computes and stores $y_p \leftarrow y_{p-1} \oplus x_p$ \\
\end{alg}
To tell the truth, this is not a real parallel algorithm.
\begin{tetel} Algorithm \textsc{Chain-Prefix} determines the prefixes of $p$ elements
using the chain $\mathcal{L}$ in $\Theta(p)$ time.
\end{tetel}
\begin{biz}
The cycle in lines 2--4 requires $\Theta(p)$ time, while lines 1 and 5 require $\Theta(1)$ time.
\end{biz}
Since the prefixes can be determined in $O(p)$ time using a sequential processor, and
$w(p,p,\textsc{Chain-Prefix}) = pW(p,p,\textsc{Chain-Prefix}) = \Theta(p^2)$, algorithm
\textsc{Chain-Prefix} is not work-effective.
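The behaviour of \textsc{Chain-Prefix} is easy to simulate sequentially, since the partial result simply travels down the chain (the function name and the operator argument are ours):

```python
def chain_prefix(x, op):
    # Sequential simulation of Chain-Prefix: P_1 stores y_1 = x_1;
    # every further processor receives the prefix of its left
    # neighbour, combines it with its own element and passes the
    # result on, so all p prefixes are ready after Theta(p) steps.
    y = [x[0]]
    for xi in x[1:]:
        y.append(op(y[-1], xi))
    return y
```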
\subsection{Prefix on square}\index{prefix computation on square}
An algorithm similar to \textsc{Chain-Prefix} can be developed for a square too.
Let us consider a square of size $a \times a$. We need an
\ki{indexing} of the processors.\inddef{indexing of processors}
There are many different indexing schemes, but for the next algorithm
\textsc{Square-Prefix} one of the simplest solutions is sufficient, the
\ki{row-major indexing scheme,}\inddef{row-major indexing scheme}
where processor $P_{i,j}$ gets the index $a(i - 1) + j$.
The input and the output are the same, as in the case of \textsc{Chain-Prefix}.
The processors $P_{(i-1)a+1}, \ P_{(i-1)a+2}, \ \ldots, \ P_{(i-1)a+a}$ form the processor row
$R_i \ (1 \leq i \leq a)$, and the processors $P_{j}, \ P_{a+j}, \ \ldots, \ P_{(a-1)a+j}$ form
the processor column $C_j \ (1 \leq j \leq a)$.
The input stored by the processors of row $R_i$ is denoted by $X_i$, and the similar
output is denoted by $Y_i$.
The algorithm works in 3 rounds. In the first round (lines 1--8) the processor rows $R_i \ (1\leq i \leq a)$
compute the row-local prefixes (working as the processors of \textsc{Chain-Prefix}). In the second round (lines 9--17)
the column $C_a$ computes the prefixes using the results of the first round, and
the processors $P_{j,a} \ (1\leq j \leq a-1)$ of this column send the computed prefix
to the neighbour $P_{j+1,a}$. Finally, in the third round the rows $R_i \ (2 \leq i \leq a)$
determine the final prefixes.
\begin{algN}{Square-Prefix(X)}\inddef{\textsc{Square-Prefix}}
1 \' $P_{j,1}$ \= \key{in parallel for} $j \leftarrow 1 \key{ to } a$ \\
2 \' \> \key{do} sends $y_{j,1} = x_{j,1}$ to $P_{j,2}$
\algnewpage
3 \' $P_{j,i}$ \= \key{in parallel for} $j \leftarrow 1 \key{ to } a$ \\
4 \' \> \key{for} \= $i \leftarrow 2 \key{ to } a-1$ \\
5 \' \> \> \key{do} \= gets $y_{j,i-1}$ from $P_{j,i-1}$, then computes and \\
6 \' \> \> \> stores $y_{j,i} \leftarrow y_{j,i-1} \oplus x_{j,i}$, and sends $y_{j,i}$ to $P_{j,i+1}$ \\
7 \' $P_{j,a}$ \= \key{in parallel for} $j \leftarrow 1 \key{ to } a$ \\
8 \' \> \key{do} gets $y_{j,a-1}$ from $P_{j,a-1}$, then computes and stores $y_{j,a} \leftarrow y_{j,a-1} \oplus x_{j,a}$ \\
9 \' $P_{1,a}$ sends $y_{1,a}$ to $P_{2,a}$ \\
10 \' \key{for} \= $j \leftarrow 2 \key{ to } a-1$ \\
11 \' \> \key{do} \= $P_{j,a}$ gets $y_{j-1,a}$ from $P_{j-1,a}$, then computes and \\
12 \' \> \> stores $y_{j,a} \leftarrow y_{j-1,a} \oplus y_{j,a}$, and sends $y_{j,a}$ to $P_{j+1,a}$ \\
13 \' $P_{a,a}$ gets $y_{a-1,a}$ from $P_{a-1,a}$, then computes and stores $y_{a,a} \leftarrow y_{a-1,a} \oplus y_{a,a}$ \\
14 \' $P_{j,a}$ \= \key{in parallel for} $j \leftarrow 1 \key{ to } a-1$ \\
15 \' \> \key{do} sends $y_{j,a}$ to $P_{j+1,a}$ \\
16 \' $P_{j,a}$ \= \key{in parallel for} $j \leftarrow 2 \key{ to } a$ \\
17 \' \> \key{do} sends the received $y_{j-1,a}$ to $P_{j,a-1}$ \\
18 \' $P_{j,i}$ \= \key{in parallel for} $j \leftarrow 2 \key{ to } a$ \\
19 \' \> \key{for} \= $i \leftarrow a-1 \key{ downto } 2$ \\
20 \' \> \> \key{do} \= gets $y_{j-1,a}$ from $P_{j,i+1}$, then computes and \\
21 \' \> \> \> stores $y_{j,i} \leftarrow y_{j-1,a} \oplus y_{j,i}$, and sends $y_{j-1,a}$ to $P_{j,i-1}$ \\
22 \' $P_{j,1}$ \= \key{in parallel for} $j \leftarrow 2 \key{ to } a$ \\
23 \' \> \key{do} gets $y_{j-1,a}$ from $P_{j,2}$, then computes and stores $y_{j,1} \leftarrow y_{j-1,a} \oplus y_{j,1}$ \\
\end{algN}
\begin{tetel} Algorithm \textsc{Square-Prefix} solves the prefix problem using a square of size
$a \times a$ and row-major indexing in $3a + 2 = \Theta(a)$ time.
\end{tetel}
\begin{biz}
In the first round lines 1--2 contain 1 parallel operation, lines 3--6 require $a-1$ operations, and lines 7--8
again 1 operation, that is, altogether $a + 1$ operations. In a similar way, in the third round lines 18--23
require $a + 1$ time units, and in round 2 lines 9--17 require $a$ time units. The sum of the necessary time units
is $3a + 2$.
\end{biz}
\index{mesh|)}
\begin{pelda} \textit{Prefix computation on a square of size $4 \times 4$.}
Figure \ref{fig:fourfourprefix}(a) shows the 16 input elements. In the first round \textsc{Square-Prefix}
computes the row-local prefixes; part (b) of the figure shows the results. Then in the second
round only the processors
of the fourth column work and determine the column-local prefixes -- the results are in part (c) of the figure.
Finally, in the third round the algorithm determines the final results shown in part (d) of the figure.
\end{pelda}
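The three rounds of \textsc{Square-Prefix} can be simulated sequentially as follows (a sketch with our own names; the input is given in row-major order, and $\oplus$ is passed as a parameter):

```python
def square_prefix(x, a, op):
    # Sequential simulation of Square-Prefix on an a-by-a square.
    y = [list(x[i * a:(i + 1) * a]) for i in range(a)]
    for row in y:                      # round 1: row-local prefixes
        for i in range(1, a):
            row[i] = op(row[i - 1], row[i])
    for j in range(1, a):              # round 2: prefixes down the last column
        y[j][a - 1] = op(y[j - 1][a - 1], y[j][a - 1])
    for j in range(1, a):              # round 3: combine the prefix of the
        carry = y[j - 1][a - 1]        # preceding rows with rows 2..a
        for i in range(a - 1):
            y[j][i] = op(carry, y[j][i])
    return [v for row in y for v in row]
```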
\parkep{14.12EN}{Prefix computation on square.\label{fig:fourfourprefix}}
\begin{fejmegj}
Basic sources of this chapter are for architectures and models the book of Leopold \cite{Leopold01}, and
the book of Sima, Fountaine and Kacsuk \cite{SimaFo98}, for parallel
programming the book due to Kumar et al. \cite{Kumar03} and \cite{Leopold01},
for parallel algorithms the books of Berman and
Paul, \cite{BermanPa97}\nevindex{Paul, Jerome}\nevindex{Berman, Ken A.} Cormen, Leiserson and Rivest
\cite{CormenLe90},\nevindex{Cormen, Thomas H.}\nevindex{Leiserson, Charles E.}\nevindex{Rivest, Ronald Lewis}
the book written by Horowitz, Sahni and Rajasekaran
\cite{HorowitzSa98}\nevindex{Sahni, Sartaj}\nevindex{Horowitz, Ellis}\nevindex{Rajasekaran, Sanguthevar} and
the book \cite{Ivanyi03},\nevindex{Iványi, Antal Miklós} and the recent book due to
Casanova,\nevindex{Casanova, Henri} Legrand\nevindex{Legrand, Arnaud} and Robert\nevindex{Robert, Yves} \cite{Casanova2009}.
The website \cite{top500} contains the Top 500 list, a regularly
updated survey of the most powerful computers worldwide; at present 42\% of its entries are clusters.
The described classifications of computers were proposed by Flynn \cite{Flynn66}\nevindex{Flynn, Michael J.}
and Leopold \cite{Leopold01}.\nevindex{Fohry, Claudia} Figures \ref{fig:SIMD}, \ref{fig:busSMP},
\ref{fig:ccNUMA}, \ref{fig:speedup}, \ref{fig:cols}, \ref{fig:openMP} are taken from the book
of Leopold \cite{Leopold01}, the program \ref{fig:mpiProg} from the book written by Gropp et al.
\cite{MPI}.\nevindex{Gropp, William}
The clusters are characterised using the book of Pfister \cite{Pfister98},\nevindex{Pfister, Gregory F.}
grids are presented on the basis of the book and manuscript of Foster and Kesselman
\cite{FosterKe04,physioGrid}.\nevindex{Foster, Ian}\nevindex{Kesselman, Carl}
The problems of shared memory are dealt with in the book written by Hwang and Xu \cite{Hwang98}, the book due to
Kleiman, Shah, and Smaalders \cite{KleimanSh96}\nevindex{Kleiman, S.}\nevindex{Shah, D.}\nevindex{Smaalders, B.},
and the textbook of Tanenbaum and van Steen \cite{Tavan}.\nevindex{Tanenbaum, Andrew S.}\nevindex{van Steen, Maarten}
Details on concepts such as tasks, processes and threads can be found in many textbooks, e.g. in
\cite{SGG00,Tanenbaum01}.\nevindex{Tanenbaum, Andrew S.}\nevindex{Woodhull, Albert S.}\nevindex{Silberschatz, Avi}
\nevindex{Gagne, Greg}\nevindex{Galvin, Peter Baer} Decomposition of the tasks into smaller parts
is analysed by Tanenbaum and van Steen \cite{Tavan}.\nevindex{van Steen, Maarten}
The laws concerning the speedup were described by Amdahl \cite{Amdahl67},
Gustafson \cite{Gustafson88} and Brent
\cite{Brent74}.\nevindex{Amdahl, Gene M.}\nevindex{Gustafson, John L.}\nevindex{Brent, R. P.}
Kandemir, Ramanujam and Choudhary review the different methods of the improvement
of locality \cite{Kandemir00}.\nevindex{Kandemir, Mahmut Taylan}
\nevindex{Ramanujam, J.}\nevindex{Choudhary, Alok} Wolfe \cite{Wolfe96} analyses in detail
the connection between the transformation of the data and the program code.\nevindex{Wolfe, Michael J.}
In connection with code optimisation the book published by Kennedy and Allen
\cite{KennedyAl01} is a useful source.\nevindex{Kennedy, Ken (1946--2007)}\nevindex{Allen, Randy}
The MPI programming model is presented according to Gropp, Snir, Nitzberg, and Lusk
\cite{MPI},\nevindex{Gropp, William}\nevindex{Snir, Marc}\nevindex{Nitzberg, Bill}\nevindex{Lusk, Ewing}
while the base of the description of the OpenMP model is the paper due to Chandra,\nevindex{Chandra, Rohit}
Dagum,\nevindex{Dagum, Leo} Kohr,\nevindex{Kohr, Dave} Maydan,\nevindex{Maydan, Dror}
McDonald\nevindex{McDonald, Jeff} and Menon\nevindex{Menon, Ramesh} \cite{CDK00}, and furthermore
a review found on the internet \cite{openMPweb}.
Lewis and Berg \cite{LewisBe98} discuss pthreads, while Oaks and Wong \cite{OaksWo99} discuss the Java threads in
detail.\nevindex{Lewis, Bil}\nevindex{Berg, Daniel J.}\nevindex{Oaks, Scott}\nevindex{Wong, Henry} A description of
\textit{High Performance Fortran}\index{High Performance Fortran}\index{HPF|see{High Performance Fortran}}
can be found in the book by Koelbel et al. \cite{Koelbel94}.\nevindex{Koelbel, C. H.}
\nevindex{Loveman, D. B.}\nevindex{Schreiber, Robert S.}\nevindex{Steele Guy L., Jr.}\nevindex{Zosel, Mary E.}
Among others, Wolfe \cite{Wolfe96} studied parallelising compilers.\nevindex{Wolfe, Michael J.}
The concept of PRAM is due to Fortune and Wyllie and is known since
1978 \cite{FortuneWy78}.\nevindex{Fortune, S.}\nevindex{Wyllie, J.}
BSP was proposed in 1990 by Valiant \cite{Valiant90}.\nevindex{Valiant, L. G.}
LogP has been suggested as an alternative of BSP by Culler et al. in 1993
\cite{CullerKapa96}.\nevindex{Karp, Richard M.}\nevindex{Culler, D. E.}\nevindex{Patterson, D.}\nevindex{Sahay, A.}
\nevindex{Santos, E. E.}\nevindex{Schauser, K. E.}\nevindex{Subramonian, E.}
\nevindex{von Eicken, T.} QSM was introduced in 1999 by Gibbons, Matias and Ramachandran
\cite{GibbonsMa99}.\nevindex{Gibbons, P. B.}\nevindex{Matias, Y.}\nevindex{Ramachandran, V.}
The majority of the pseudocode conventions used in Section \ref{sec:PRAM},\nevindex{Cormen, Thomas H.}\nevindex{Leiserson, Charles E.}
the description of crossover points, and the comparison of different methods of matrix multiplication
can be found in \cite{CormenLe2009}.\nevindex{Rivest, Ronald Lewis}\nevindex{Stein, Clifford}
Readers interested in further programming models, such as skeletons, parallel functional programming,
coordination languages and parallel mobile agents, can find a detailed description in \cite{Leopold01}.
Further problems and parallel algorithms are analysed in
the books of Leighton \cite{Leighton92,Leighton01},\nevindex{Leighton, F. Tom} in the chapter
\textit{Memory Management} of this book \cite{BaloghIvanyi07},\nevindex{Balogh, Ádám}\nevindex{Iványi, Antal Miklós}
and in the book of Horowitz, Sahni and Rajasekaran \cite{HorowitzSa98}.
\nevindex{Sahni, Sartaj}\nevindex{Horowitz, Ellis}\nevindex{Rajasekaran, Sanguthevar}
A model of scheduling of parallel processes is discussed in \cite{Gacs2004,Ivanyi2009,Winkler2000}.
Cost-optimal parallel merge is analysed by Wu\nevindex{Wu, Jie} and Olariu\nevindex{Olariu, Stephen} in \cite{Wu04}.
New ideas of parallel sorting (such as the application of multiple comparisons to get a constant time sorting algorithm)
can be found in the paper of Gasarch,\nevindex{Gasarch, William} Golub,\nevindex{Golub, Evan}
and Kruskal \cite{Gararch03}.\nevindex{Kruskal, Clyde}\index{parallel computations|)}
\end{fejmegj}