% This file is part of the MMIXware package (c) Donald E Knuth 1999
@i boilerplate.w %<< legal stuff: PLEASE READ IT BEFORE MAKING ANY CHANGES!

\def\title{ERISCAL}
\def\ERISC{\.{ELTE RISC}}
\def\MMIX{\.{MMIX}}
\def\ERISCAL{\.{ERISCAL}}
\def\MMIXAL{\.{MMIXAL}}
\def\Hex#1{\hbox{$^{\scriptscriptstyle\#}$\tt#1}} % experimental hex constant
\def\<#1>{\hbox{$\langle\,$#1$\,\rangle$}}\let\is=\longrightarrow
\def\bull{\smallbreak\textindent{$\bullet$}}
@s and normal @q unreserve a C++ keyword @>
@s or normal @q unreserve a C++ keyword @>
@s xor normal @q unreserve a C++ keyword @>

\ifx\exotic+
  \font\heb=heb8 at 10pt
  \font\rus=lhwnr8
  \input unicode
  \unicodeptsize=8pt
\fi

@* Definition of ERISCAL.
This program takes input written in \ERISCAL, the \ERISC\ assembly language,
and translates it
@^assembly language@>
into binary files that can be loaded and executed
on \ERISC\ simulators or hardware. \ERISCAL\ is much simpler than the
``industrial strength'' assembly languages that computer manufacturers
usually provide, because it is primarily intended for simple
demonstration programs. Yet it tries to have enough features to serve
also as the back end of compilers for \CEE/ and other high-level languages.

Instructions for using the program appear at the end of this document.
First we will discuss the input and output languages in detail; then we'll
consider the translation process, step by step; then we'll put everything
together.

@ A program in \ERISCAL\ consists of a series of {\it lines}, each of which
usually contains a single instruction. However, lines with no instructions
are possible, and so are lines with two or more instructions.

Each instruction has three parts called its label field, opcode field,
and operand field; these fields are separated from each other by one or
more spaces. The label field, which is often empty, consists of all
characters up to the first blank space.
The opcode field, which is never empty, runs from the first nonblank
after the label to the next blank space. The operand field, which again
might be empty, runs from the next nonblank character (if any) to the
first blank or semicolon that isn't part of a string or character constant.
If the operand field is followed by a semicolon, possibly with intervening
blanks, a new instruction begins immediately after the semicolon; otherwise
the rest of the line is ignored. The end of a line is treated as a blank space
for the purposes of these rules, with the additional proviso that
string or character constants are not allowed to extend from one line
to another.

The label field must begin with a letter or a digit; otherwise the entire
line is treated as a comment. Popular ways to introduce comments,
either at the beginning of a line or after the operand field, are to
precede them by the character \.\% as in \TeX, or by \.{//} as in \CPLUSPLUS/;
\ERISCAL\ is not very particular. However, Lisp-style comments introduced
by single semicolons will fail if they follow an instruction, because
they will be assumed to introduce another instruction.

@ \ERISCAL\ has no built-in macro capability, nor does it know how to
include header files and such things. But users can run their files
through a standard \CEE/ preprocessor to obtain \ERISCAL\ programs in which
macros and such things have been expanded. (Caution: The preprocessor also
removes \CEE/-style comments, unless it is told not to do so.)
Literate programming tools could also be used for preprocessing.
@^C preprocessor@>
@^literate programming@>

If a line begins with the special form `\.\# \<integer> \<string>',
this program interprets it as a {\it line directive\/} emitted by a
preprocessor. For example,
$$\leftline{\indent\.{\# 13 "foo.mms"}}$$
means that the following line was line 13 in the user's source file
\.{foo.mms}.
Line directives allow us to correlate errors with the user's
original file; we also pass them to the output, for use by simulators
and debuggers.
@^line directives@>

@ \ERISCAL\ deals primarily with {\it symbols\/} and {\it constants}, which it
interprets and combines to form machine language instructions and data.
Constants are simplest, so we will discuss them first.

A {\it decimal constant\/} is a sequence of digits, representing a number in
radix~10. A~{\it hexadecimal constant\/} is a sequence of hexadecimal digits,
preceded by~\.\#, representing a number in radix~16:
$$\vbox{\halign{$#$\hfil\cr
\<digit>\is\.0\mid\.1\mid\.2\mid\.3\mid\.4\mid
    \.5\mid\.6\mid\.7\mid\.8\mid\.9\cr
\<hex digit>\is\<digit>\mid\.A\mid\.B\mid\.C\mid\.D\mid\.E\mid\.F\mid
    \.a\mid\.b\mid\.c\mid\.d\mid\.e\mid\.f\cr
\<decimal constant>\is\<digit>\mid\<decimal constant>\<digit>\cr
\<hex constant>\is\.\#\<hex digit>\mid\<hex constant>\<hex digit>\cr
}}$$
Constants whose value is $2^{32}$ or more are reduced modulo $2^{32}$.

@ A {\it character constant\/} is a single character enclosed in single quote
marks; it denotes the {\mc ASCII} or Unicode number
@^Unicode@>
corresponding to that character. For example, \.{'a'} represents the
constant \.{\#61}, also known as~\.{97}. The quoted character can be
anything except the character that the \CEE/ library calls \.{\\n} or
{\it newline}; that character should be represented as \.{\#a}.
$$\vbox{\halign{$#$\hfil\cr
\<character constant>\is\.'\<single byte character except newline>\.'\cr
\<constant>\is\<decimal constant>\mid\<hex constant>\mid
    \<character constant>\cr}}$$
Notice that \.{'''} represents a single quote, the code \.{\#27}; and
\.{'\\'} represents a backslash, the code \.{\#5c}. \ERISCAL~characters are
never ``quoted'' by backslashes as in the \CEE/~language.

In the present implementation
a character constant will always be at most 255, since wyde character
input is not supported.
\ifx\exotic+ But if the input were in Unicode one could write, say,
\.'{\heb\char"40}\.' or \.'{\rus ZH}\.' for \.{\#05d0} or \.{\#0416}.
\fi
The present program does not support Unicode directly because
basic software for inputting and outputting 16-bit characters was still
in a primitive state at the time of writing. But the data structures below
are designed so that a change to Unicode will not be difficult when the
time is ripe.

@ A {\it string constant\/} like \.{"Hello"} is an abbreviation for
a sequence of one or more character constants separated by commas:
\.{'H','e','l','l','o'}.
Any character except newline or the double quote mark~\." can appear
between the double quotes of a string constant.
\ifx\exotic+ Similarly,
\."\Uni1.08:24:24:-1:20% Unicode char "9ad8
<002000001800000806ffffff00000002004003ffe00300e00300c00300c003ffc0%
0300c02000043ffffe30000e31008c31ffcc3181cc31818c31818c31ff8c31818c3%
0007c300018>%
\thinspace\Uni1.08:24:24:-1:20% Unicode char "5fb7
<1c038018030018030631ffff30060067860446fffe86ccce0ccccc0ccccc18cccc%
18fffc38c00c38001878fffc58040098030818398618b18318b00b19b0081b300c1%
b3ffc181ff8>%
\thinspace\Uni1.08:24:24:-1:20% Unicode char "7eb3
<0601c00e01800c018018018018218231bfff61b187433186ff3186c631860c3186%
18334630332663b6367e341660380600300600300603b0061e3006f03006c030060%
0303e00300c>%
\kern.1em\." is an abbreviation for
\.'\Uni1.08:24:24:-1:20% Unicode char "9ad8
<002000001800000806ffffff00000002004003ffe00300e00300c00300c003ffc0%
0300c02000043ffffe30000e31008c31ffcc3181cc31818c31818c31ff8c31818c3%
0007c300018>%
\.{','}\Uni1.08:24:24:-1:20% Unicode char "5fb7
<1c038018030018030631ffff30060067860446fffe86ccce0ccccc0ccccc18cccc%
18fffc38c00c38001878fffc58040098030818398618b18318b00b19b0081b300c1%
b3ffc181ff8>%
\.{','}\Uni1.08:24:24:-1:20% Unicode char "7eb3
<0601c00e01800c018018018018218231bfff61b187433186ff3186c631860c3186%
18334630332663b6367e341660380600300600300603b0061e3006f03006c030060%
0303e00300c>%
\.' (namely \.{\#9ad8,\#5fb7,\#7eb3}) when Unicode is supported.
@^Unicode@>
\fi

@ A {\it symbol\/} in \ERISCAL\ is any sequence of letters and digits,
beginning with a letter. A~colon~`\.:' or underscore symbol `\.\_'
is regarded as a letter, for purposes of this definition.
All extended-ASCII characters like `{\tt \'e}',
whose 8-bit code exceeds 126, are also treated as letters.
$$\vbox{\halign{$#$\hfil\cr
\<letter>\is\.A\mid\.B\mid\cdots\mid\.Z\mid\.a\mid\.b\mid\cdots\mid\.z\mid
    \.:\mid\.\_\mid\<{character with code value $>126$}>\cr
\<symbol>\is\<letter>\mid\<symbol>\<letter>\mid\<symbol>\<digit>\cr
}}$$

In future implementations, when \ERISCAL\ is used with Unicode,
@^Unicode@>
all wyde characters whose 16-bit code exceeds 126 will be regarded as letters;
thus \ERISCAL\ symbols will be able to involve Greek letters or Chinese
characters or thousands of other glyphs.

@ A symbol is said to be {\it fully qualified\/} if it begins with a colon.
Every symbol that is not fully qualified is an abbreviation for the fully
qualified symbol obtained by placing the {\it current prefix\/} in front of it;
the current prefix is always fully qualified. At the beginning of an
\ERISCAL\ program the current prefix is simply the single character~`\.:',
but the user can change it with the \.{PREFIX} command. For example,
$$\vbox{\halign{&\quad\tt#\hfil\cr
ADD&x,y&\% means ADD :x,:y\cr
PREFIX&Foo:&\% current prefix is :Foo:\cr
ADD&x,y&\% means ADD :Foo:x,:Foo:y\cr
PREFIX&Bar:&\% current prefix is :Foo:Bar:\cr
ADD&:x,y&\% means ADD :x,:Foo:Bar:y\cr
PREFIX&:&\% current prefix reverts to :\cr
ADD&x,Foo:Bar:y&\% means ADD :x,:Foo:Bar:y\cr
}}$$
This mechanism allows large programs to avoid conflicts between symbol names,
when parts of the program are independent and/or written by different users.
The current prefix conventionally ends with a colon, but this convention
need not be obeyed.
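
As a minimal sketch of the qualification rule just described, the following
\CEE/ fragment resolves a symbol against the current prefix. The names
\.{qualify}, \.{set\_prefix}, \.{current\_prefix}, and the limit \.{MAX\_SYM}
are hypothetical, chosen only for illustration; they are not taken from the
\ERISCAL\ program itself.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_SYM 256                 /* arbitrary limit for this sketch */

char current_prefix[MAX_SYM] = ":"; /* initial prefix is the single colon */

/* Store the fully qualified form of sym in out (capacity MAX_SYM). */
void qualify(const char *sym, char *out)
{
    if (sym[0] == ':') {            /* already fully qualified */
        assert(strlen(sym) < MAX_SYM);
        strcpy(out, sym);
    } else {                        /* abbreviation: prepend the prefix */
        assert(strlen(current_prefix) + strlen(sym) < MAX_SYM);
        snprintf(out, MAX_SYM, "%s%s", current_prefix, sym);
    }
}

/* PREFIX command: its operand is qualified against the old prefix. */
void set_prefix(const char *operand)
{
    char tmp[MAX_SYM];
    qualify(operand, tmp);
    strcpy(current_prefix, tmp);
}
```

Replaying the table above, \.{set\_prefix("Foo:")} followed by
\.{set\_prefix("Bar:")} leaves the current prefix at \.{:Foo:Bar:}, because
the operand \.{Bar:} is not fully qualified and is therefore itself
abbreviated; \.{set\_prefix(":")} restores the initial state.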

@ A {\it local symbol\/} is a decimal digit followed by one of the
letters \.B, \.F, or~\.H, meaning ``backward,'' ``forward,'' or ``here'':
$$\vbox{\halign{$#$\hfill\cr
\<local operand>\is\<digit>\,\.B\mid\<digit>\,\.F\cr
\<local label>\is\<digit>\,\.H\cr
}}$$
The \.B and \.F forms are permitted only in the operand field of \ERISCAL\
instructions; the \.H form is permitted only in the label field. A local
operand such as~\.{2B} stands for the last local label~\.{2H}
in instructions before the current one, or 0 if \.{2H} has not yet appeared
as a label. A~local operand such as~\.{2F} stands for the first \.{2H}
in instructions after the current one. Thus, in a sequence such as
$$\vbox{\halign{\tt#\cr
2H JMP 2F\cr 2H JMP 2B\cr}}$$
the first instruction jumps to the second and the second jumps to the first.

Local symbols are useful for references to nearby points of a program, in
cases where no meaningful name is appropriate. They can also be useful
in special situations where a redefinable symbol is needed; for example,
an instruction like
$$\.{9H IS 9B+1}$$
will maintain a running counter.

@ Each symbol receives a value called its {\it equivalent\/} when it
appears in the label field of an instruction; it is said to be {\it defined\/}
after its equivalent has been established. A few symbols, like \.{Fopen},
are predefined because they refer to fixed constants associated with the
\ERISC\ hardware or its rudimentary operating system; otherwise every
symbol should be defined exactly once. The two appearances of `\.{2H}' in the
example above do not violate this rule, because the second `\.{2H}' is not the
same symbol as the first.

A predefined symbol can be redefined (given a new equivalent). After it
has been redefined it acts like an ordinary symbol and cannot be
redefined again. A complete list of the predefined symbols appears
in the program below.
@^predefined symbols@>

Equivalents are either {\it pure\/} or {\it register numbers}.
A pure equivalent is an unsigned wyde, but a register number
equivalent is a nybble value, between 0 and~15.
A dollar sign is used to change a pure number into a register number;
for example, `\.{\$15}' means register number~15.

@ Constants and symbols are combined into {\it expressions\/} in a simple way:
$$\vbox{\halign{$#$\hfil\cr
\<primary expression>\is\<constant>\mid\<symbol>\mid\<local operand>\mid
    \.{@@}\mid\cr
\hskip12pc\.(\<expression>\.)\mid\<unary operator>\<primary expression>\cr
\<term>\is\<primary expression>\mid
    \<term>\<strong operator>\<primary expression>\cr
\<expression>\is\<term>\mid\<expression>\<weak operator>\<term>\cr
\<unary operator>\is\.*\mid\.+\mid\.-\mid\.\~\mid\.\$\mid\.\&\cr
\<strong operator>\is\.*\mid\./\mid\.{//}\mid\.\%\mid\.{<<}\mid\.{>>}
    \mid\.\&\cr
\<weak operator>\is\.+\mid\.-\mid\.{\char'174}\mid\.\^\cr
}}$$
Each expression has a value that is either pure, a register number,
or an indirect version of one of these. The
character \.{@@} stands for the current location, which is always pure.
The unary operators \.*, \.+, \.-, \.\~, \.\$, and \.\& mean, respectively,
``indirectize,'' ``relativize,'' ``subtract from zero,''
``complement the bits,'' ``change from pure value
to register number,'' and ``take the serial number.'' Only the first of these,
\.*, can be applied to a register number. The last unary operator, \.\&,
applies only to symbols, and it is of interest primarily to system programmers;
it converts a symbol to the unique positive integer that is used to identify
it in the binary file output by \ERISCAL.

The unusual operator \.+ makes a value relative by subtracting the
current location from it. A relative value is useful when we want to add
it to the current location; it is mainly used in the \.{SRC} field of
instructions.

Another unusual operator, \.*, simply informs the assembler that the value
of this field will be used indirectly, i.e., that the addressing mode is~3.
Its main use is in instructions such as \.{SETL \$4,*\$5}, which sets \$4
from the address contained in \$5; but it is also used, for example, in
\.{SETL \$4,*xxxx}, meaning that \$4 is set from the \$0-relative data
\.{xxxx} contained in the wyde after the instruction.
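
The two-level grammar above can be sketched as a small recursive-descent
evaluator in \CEE/. This is only an illustration under stated assumptions,
not the method used by the \ERISCAL\ program: it handles pure values only;
the names \.{eval\_pure}, \.{primary}, \.{term}, \.{expression}, and
\.{cursor} are hypothetical; the 32-bit word is an assumption; and symbols,
\.{@@}, \.{//}, and the unary operators \.*, \.+, \.\$, \.\& are omitted.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>

typedef uint32_t word;           /* assumed width of a pure value */
#define WORD_BITS 32u

static const char *cursor;       /* next unread character of the operand */

static word expression(void);

/* primary: decimal or #hex constant, (expression), -primary, ~primary */
static word primary(void)
{
    word v = 0;
    if (*cursor == '(') {
        cursor++;
        v = expression();
        assert(*cursor == ')');
        cursor++;
    } else if (*cursor == '-') {
        cursor++;
        v = (word)0 - primary();          /* subtract from zero */
    } else if (*cursor == '~') {
        cursor++;
        v = ~primary();                   /* complement the bits */
    } else if (*cursor == '#') {
        cursor++;
        assert(isxdigit((unsigned char)*cursor));
        while (isxdigit((unsigned char)*cursor)) {
            int c = tolower((unsigned char)*cursor++);
            v = v * 16u + (word)(isdigit(c) ? c - '0' : c - 'a' + 10);
        }
    } else {
        assert(isdigit((unsigned char)*cursor));
        while (isdigit((unsigned char)*cursor))
            v = v * 10u + (word)(*cursor++ - '0');
    }
    return v;
}

/* term: primaries joined left to right by strong operators */
static word term(void)
{
    word v = primary();
    for (;;) {
        if (cursor[0] == '<' && cursor[1] == '<') {
            cursor += 2;
            word s = primary();
            v = s < WORD_BITS ? v << s : 0;
        } else if (cursor[0] == '>' && cursor[1] == '>') {
            cursor += 2;
            word s = primary();
            v = s < WORD_BITS ? v >> s : 0;
        } else if (*cursor == '*') {
            cursor++;
            v *= primary();
        } else if (*cursor == '/' || *cursor == '%') {
            char op = *cursor++;
            word d = primary();
            assert(d > 0);       /* this sketch rejects a zero divisor */
            v = (op == '/') ? v / d : v % d;
        } else if (*cursor == '&') {
            cursor++;
            v &= primary();
        } else {
            return v;
        }
    }
}

/* expression: terms joined left to right by weak operators */
static word expression(void)
{
    word v = term();
    for (;;) {
        if (*cursor == '+') { cursor++; v += term(); }
        else if (*cursor == '-') { cursor++; v -= term(); }
        else if (*cursor == '|') { cursor++; v |= term(); }
        else if (*cursor == '^') { cursor++; v ^= term(); }
        else return v;
    }
}

/* Evaluate a blank-free operand consisting of one pure expression. */
word eval_pure(const char *text)
{
    cursor = text;
    word v = expression();
    assert(*cursor == '\0');
    return v;
}
```

With these rules \.{eval\_pure("1<<24+2<<16+3<<8+4")} yields \.{\#01020304},
since each shift is completed before any addition is performed.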
@^serial number@>

Binary operators come in two flavors, strong and weak. The strong ones
are essentially concerned with multiplication or division: \.{x*y},
\.{x/y}, \.{x//y}, \.{x\%y}, \.{x<<y}, \.{x>>y}, and \.{x\&y}
stand respectively for
$(x\times y)\bmod2^{64}$ (multiplication), $\lfloor x/y\rfloor$ (division),
$\lfloor2^{64}x/y\rfloor$ (fractional division), $x\bmod y$ (remainder),
$(x\times2^y)\bmod2^{64}$ (left~shift), $\lfloor x/2^y\rfloor$
(right shift), and $x\mathbin{\char`\&}y$ (bitwise and) on unsigned wydes.
Division is legal only if $y>0$; fractional division is
legal only if $x<y$.

The weak binary operations \.{x+y}, \.{x-y}, \.{x\char'174 y}, and \.{x\^y}
stand respectively for addition, subtraction, bitwise or, and bitwise
exclusive-or. These operations can be applied to register numbers only in
four contexts: $\<register>+\<pure>$, $\<pure>+\<register>$,
$\<register>-\<pure>$ and $\<register>-\<register>$.
For example, if \.{x} denotes \.{\$1} and \.{y} denotes \.{\$10}, then
\.{x+3} and \.{3+x} denote \.{\$4}, and \.{y-x} denotes the pure value \.{9}.

Register numbers within expressions are allowed to be
arbitrary wydes, but a register number assigned as the equivalent of
a symbol should not exceed 15.

(Incidentally, one might ask why the designer of \ERISCAL\ did not simply
adopt the existing rules of \CEE/ for expressions. The primary reason is that
the designers of \CEE/ chose to give \.{<<}, \.{>>}, and \.\& a lower
precedence than~\.+; but in \ERISCAL\ we want to be able to write things like
\.{o<<24+x<<16+y<<8+z} or \.{@@+yz<<2} or \.{@@+(\#100-@@)\&\#ff}.
Since the conventions of \CEE/ were inappropriate, it was better to make a
clean break, not pretending to have a close relationship with that language.
The new rules are quite easily memorized, because \ERISCAL\ has just two levels
of precedence, and the strong binary operations are all essentially
multiplicative by nature while the weak binary operations are essentially
additive.)

@ A symbol is called a {\it future reference\/} until it has been defined.
\ERISCAL\ restricts the use of future references, so that programs can
be assembled quickly in one pass over the input; therefore all expressions
can be evaluated when the \ERISCAL\ processor first sees them.
The restrictions are easily stated: Future references cannot be used
in expressions together with unary or binary operators (except the
unary~\.+, which does nothing); moreover, future references
can appear as operands only in instructions that have relative addresses
(namely branches, probable branches, \.{JMP}, \.{PUSHJ}, \.{GETA})
or in wyde constants (the pseudo-operation \.{OCTA}).
Thus, for example, one can say \.{JMP}~\.{1F} or \.{JMP}~\.{1B-4},
but not \.{JMP}~\.{1F-4}.

@ We noted earlier that each \ERISCAL\ instruction contains
a label field, an opcode field, and an operand field. The label field is
either empty or a symbol or local label; when it is nonempty, the
symbol or local label receives an equivalent. The operand field is
either empty or a sequence of expressions separated by commas; when it
is empty, it is equivalent to the simple operand field~`\.0'.
$$\vbox{\halign{$#$\hfil\cr
\\is\