% This file is part of the MMIXware package (c) Donald E Knuth 1999 @i boilerplate.w %<< legal stuff: PLEASE READ IT BEFORE MAKING ANY CHANGES! \def\title{ERISCAL} \def\ERISC{\.{ELTE RISC}} \def\MMIX{\.{MMIX}} \def\ERISCAL{\.{ERISCAL}} \def\MMIXAL{\.{MMIXAL}} \def\Hex#1{\hbox{$^{\scriptscriptstyle\#}$\tt#1}} % experimental hex constant \def\<#1>{\hbox{$\langle\,$#1$\,\rangle$}}\let\is=\longrightarrow \def\bull{\smallbreak\textindent{$\bullet$}} @s and normal @q unreserve a C++ keyword @> @s or normal @q unreserve a C++ keyword @> @s xor normal @q unreserve a C++ keyword @> \ifx\exotic+ \font\heb=heb8 at 10pt \font\rus=lhwnr8 \input unicode \unicodeptsize=8pt \fi @* Definition of ERISCAL. This program takes input written in \ERISCAL, the \ERISC\ assembly language, and translates it @^assembly language@> into binary files that can be loaded and executed on \ERISC\ simulators or hardwares. \ERISCAL\ is much simpler than the ``industrial strength'' assembly languages that computer manufacturers usually provide, because it is primarily intended for the simple demonstration programs. Yet it tries to have enough features to serve also as the back end of compilers for \CEE/ and other high-level languages. Instructions for using the program appear at the end of this document. First we will discuss the input and output languages in detail; then we'll consider the translation process, step by step; then we'll put everything together. @ A program in \ERISCAL\ consists of a series of {\it lines}, each of which usually contains a single instruction. However, lines with no instructions are possible, and so are lines with two or more instructions. Each instruction has three parts called its label field, opcode field, and operand field; these fields are separated from each other by one or more spaces. The label field, which is often empty, consists of all characters up to the first blank space. 
The opcode field, which is never empty, runs from the first nonblank after the label to the next blank space. The operand field, which again might be empty, runs from the next nonblank character (if any) to the first blank or semicolon that isn't part of a string or character constant. If the operand field is followed by a semicolon, possibly with intervening blanks, a new instruction begins immediately after the semicolon; otherwise the rest of the line is ignored. The end of a line is treated as a blank space for the purposes of these rules, with the additional proviso that string or character constants are not allowed to extend from one line to another. The label field must begin with a letter or a digit; otherwise the entire line is treated as a comment. Popular ways to introduce comments, either at the beginning of a line or after the operand field, are to precede them by the character \.\% as in \TeX, or by \.{//} as in \CPLUSPLUS/; \ERISCAL\ is not very particular. However, Lisp-style comments introduced by single semicolons will fail if they follow an instruction, because they will be assumed to introduce another instruction.

@ \ERISCAL\ has no built-in macro capability, nor does it know how to include header files and such things. But users can run their files through a standard \CEE/ preprocessor to obtain \ERISCAL\ programs in which macros and such things have been expanded. (Caution: The preprocessor also removes \CEE/-style comments, unless it is told not to do so.) Literate programming tools could also be used for preprocessing. @^C preprocessor@> @^literate programming@> If a line begins with the special form `\.\#\ \<integer>\ \<string>', this program interprets it as a {\it line directive\/} emitted by a preprocessor. For example, $$\leftline{\indent\.{\# 13 "foo.mms"}}$$ means that the following line was line 13 in the user's source file \.{foo.mms}.
Line directives allow us to correlate errors with the user's original file; we also pass them to the output, for use by simulators and debuggers. @^line directives@>

@ \ERISCAL\ deals primarily with {\it symbols\/} and {\it constants}, which it interprets and combines to form machine language instructions and data. Constants are simplest, so we will discuss them first. A {\it decimal constant\/} is a sequence of digits, representing a number in radix~10. A~{\it hexadecimal constant\/} is a sequence of hexadecimal digits, preceded by~\.\#, representing a number in radix~16:
$$\vbox{\halign{$#$\hfil\cr
\<digit>\is\.0\mid\.1\mid\.2\mid\.3\mid\.4\mid \.5\mid\.6\mid\.7\mid\.8\mid\.9\cr
\<hex digit>\is\<digit>\mid\.A\mid\.B\mid\.C\mid\.D\mid\.E\mid\.F\mid \.a\mid\.b\mid\.c\mid\.d\mid\.e\mid\.f\cr
\<decimal constant>\is\<digit>\mid\<decimal constant>\<digit>\cr
\<hex constant>\is\.\#\<hex digit>\mid\<hex constant>\<hex digit>\cr
}}$$
Constants whose value is $2^{32}$ or more are reduced modulo $2^{32}$.

@ A {\it character constant\/} is a single character enclosed in single quote marks; it denotes the {\mc ASCII} or Unicode number @^Unicode@> corresponding to that character. For example, \.{'a'} represents the constant \.{\#61}, also known as~\.{97}. The quoted character can be anything except the character that the \CEE/ library calls \.{\\n} or {\it newline}; that character should be represented as \.{\#a}.
$$\vbox{\halign{$#$\hfil\cr
\<character constant>\is\.'\<single byte character except newline>\.'\cr
\<constant>\is\<decimal constant>\mid\<hex constant>\mid\<character constant>\cr}}$$
Notice that \.{'''} represents a single quote, the code \.{\#27}; and \.{'\\'} represents a backslash, the code \.{\#5c}. \ERISCAL~characters are never ``quoted'' by backslashes as in the \CEE/~language. In the present implementation a character constant will always be at most 255, since wyde character input is not supported. \ifx\exotic+ But if the input were in Unicode one could write, say, \.'{\heb\char"40}\.' or \.'{\rus ZH}\.' for \.{\#05d0} or \.{\#0416}.
\fi The present program does not support Unicode directly because basic software for inputting and outputting 16-bit characters was still in a primitive state at the time of writing. But the data structures below are designed so that a change to Unicode will not be difficult when the time is ripe. @ A {\it string constant\/} like \.{"Hello"} is an abbreviation for a sequence of one or more character constants separated by commas: \.{'H','e','l','l','o'}. Any character except newline or the double quote mark~\." can appear between the double quotes of a string constant. \ifx\exotic+ Similarly, \."\Uni1.08:24:24:-1:20% Unicode char "9ad8 <002000001800000806ffffff00000002004003ffe00300e00300c00300c003ffc0% 0300c02000043ffffe30000e31008c31ffcc3181cc31818c31818c31ff8c31818c3% 0007c300018>% \thinspace\Uni1.08:24:24:-1:20% Unicode char "5fb7 <1c038018030018030631ffff30060067860446fffe86ccce0ccccc0ccccc18cccc% 18fffc38c00c38001878fffc58040098030818398618b18318b00b19b0081b300c1% b3ffc181ff8>% \thinspace\Uni1.08:24:24:-1:20% Unicode char "7eb3 <0601c00e01800c018018018018218231bfff61b187433186ff3186c631860c3186% 18334630332663b6367e341660380600300600300603b0061e3006f03006c030060% 0303e00300c>% \kern.1em\." is an abbreviation for \.'\Uni1.08:24:24:-1:20% Unicode char "9ad8 <002000001800000806ffffff00000002004003ffe00300e00300c00300c003ffc0% 0300c02000043ffffe30000e31008c31ffcc3181cc31818c31818c31ff8c31818c3% 0007c300018>% \.{','}\Uni1.08:24:24:-1:20% Unicode char "5fb7 <1c038018030018030631ffff30060067860446fffe86ccce0ccccc0ccccc18cccc% 18fffc38c00c38001878fffc58040098030818398618b18318b00b19b0081b300c1% b3ffc181ff8>% \.{','}\Uni1.08:24:24:-1:20% Unicode char "7eb3 <0601c00e01800c018018018018218231bfff61b187433186ff3186c631860c3186% 18334630332663b6367e341660380600300600300603b0061e3006f03006c030060% 0303e00300c>% \.' (namely \.{\#9ad8,\#5fb7,\#7eb3}) when Unicode is supported. 
@^Unicode@> \fi

@ A {\it symbol\/} in \ERISCAL\ is any sequence of letters and digits, beginning with a letter. A~colon~`\.:' or underscore symbol `\.\_' is regarded as a letter, for purposes of this definition. All extended-ASCII characters like `{\tt \'e}', whose 8-bit code exceeds 126, are also treated as letters.
$$\vbox{\halign{$#$\hfil\cr
\<letter>\is\.A\mid\.B\mid\cdots\mid\.Z\mid\.a\mid\.b\mid\cdots\mid\.z\mid \.:\mid\.\_\mid\<{character with code value $>126$}>\cr
\<symbol>\is\<letter>\mid\<symbol>\<letter>\mid\<symbol>\<digit>\cr
}}$$
In future implementations, when \ERISCAL\ is used with Unicode, @^Unicode@> all wyde characters whose 16-bit code exceeds 126 will be regarded as letters; thus \ERISCAL\ symbols will be able to involve Greek letters or Chinese characters or thousands of other glyphs.

@ A symbol is said to be {\it fully qualified\/} if it begins with a colon. Every symbol that is not fully qualified is an abbreviation for the fully qualified symbol obtained by placing the {\it current prefix\/} in front of it; the current prefix is always fully qualified. At the beginning of an \ERISCAL\ program the current prefix is simply the single character~`\.:', but the user can change it with the \.{PREFIX} command. For example,
$$\vbox{\halign{&\quad\tt#\hfil\cr
ADD&x,y&\% means ADD :x,:y\cr
PREFIX&Foo:&\% current prefix is :Foo:\cr
ADD&x,y&\% means ADD :Foo:x,:Foo:y\cr
PREFIX&Bar:&\% current prefix is :Foo:Bar:\cr
ADD&:x,y&\% means ADD :x,:Foo:Bar:y\cr
PREFIX&:&\% current prefix reverts to :\cr
ADD&x,Foo:Bar:y&\% means ADD :x,:Foo:Bar:y\cr
}}$$
This mechanism allows large programs to avoid conflicts between symbol names, when parts of the program are independent and/or written by different users. The current prefix conventionally ends with a colon, but this convention need not be obeyed.
@ A {\it local symbol\/} is a decimal digit followed by one of the letters \.B, \.F, or~\.H, meaning ``backward,'' ``forward,'' or ``here'':
$$\vbox{\halign{$#$\hfill\cr
\<local operand>\is\<digit>\,\.B\mid\<digit>\,\.F\cr
\<local label>\is\<digit>\,\.H\cr
}}$$
The \.B and \.F forms are permitted only in the operand field of \ERISCAL\ instructions; the \.H form is permitted only in the label field. A local operand such as~\.{2B} stands for the last local label~\.{2H} in instructions before the current one, or 0 if \.{2H} has not yet appeared as a label. A~local operand such as~\.{2F} stands for the first \.{2H} in instructions after the current one. Thus, in a sequence such as
$$\vbox{\halign{\tt#\cr
2H JMP 2F\cr
2H JMP 2B\cr}}$$
the first instruction jumps to the second and the second jumps to the first. Local symbols are useful for references to nearby points of a program, in cases where no meaningful name is appropriate. They can also be useful in special situations where a redefinable symbol is needed; for example, an instruction like $$\.{9H IS 9B+1}$$ will maintain a running counter.

@ Each symbol receives a value called its {\it equivalent\/} when it appears in the label field of an instruction; it is said to be {\it defined\/} after its equivalent has been established. A few symbols, like \.{Fopen}, are predefined because they refer to fixed constants associated with the \ERISC\ hardware or its rudimentary operating system; otherwise every symbol should be defined exactly once. The two appearances of `\.{2H}' in the example above do not violate this rule, because the second `\.{2H}' is not the same symbol as the first. A predefined symbol can be redefined (given a new equivalent). After it has been redefined it acts like an ordinary symbol and cannot be redefined again. A complete list of the predefined symbols appears in the program below. @^predefined symbols@> Equivalents are either {\it pure\/} or {\it register numbers}.
A pure equivalent is an unsigned wyde, but a register number equivalent is a nybble value, between 0 and~15. A dollar sign is used to change a pure number into a register number; for example, `\.{\$15}' means register number~15.

@ Constants and symbols are combined into {\it expressions\/} in a simple way:
$$\vbox{\halign{$#$\hfil\cr
\<primary expression>\is\<constant>\mid\<symbol>\mid\<local operand>\mid \.{@@}\mid\cr
\hskip12pc\.(\<expression>\.)\mid\<unary operator>\<primary expression>\cr
\<term>\is\<primary expression>\mid \<term>\<strong operator>\<primary expression>\cr
\<expression>\is\<term>\mid\<expression>\<weak operator>\<term>\cr
\<unary operator>\is\.*\mid\.+\mid\.-\mid\.\~\mid\.\$\mid\.\&\cr
\<strong operator>\is\.*\mid\./\mid\.{//}\mid\.\%\mid\.{<<}\mid\.{>>} \mid\.\&\cr
\<weak operator>\is\.+\mid\.-\mid\.{\char'174}\mid\.\^\cr
}}$$
Each expression has a value that is either pure, a register number, or an indirect version of one of these. The character \.{@@} stands for the current location, which is always pure. The unary operators \.*, \.+, \.-, \.\~, \.\$, and \.\& mean, respectively, ``indirectize,'' ``relativize,'' ``subtract from zero,'' ``complement the bits,'' ``change from pure value to register number,'' and ``take the serial number.'' Only the first of these, \.*, can be applied to a register number. The last unary operator, \.\&, applies only to symbols, and it is of interest primarily to system programmers; it converts a symbol to the unique positive integer that is used to identify it in the binary file output by \ERISCAL. The unusual operator \.+ makes a relative value by subtracting the current location from its operand. A relative value is useful if we want to add it to the current location; it is mainly used in the \.{SRC} field of instructions. Another unusual operator, \.*, simply tells the assembler that the value of this field will be used indirectly, i.e., that the addressing mode is~3. Its main use is in instructions such as \.{SETL} \$4,*\$5, which sets \$4 from the address contained in \$5; but it is also used in, for example, \.{SETL} \$4,*xxxx, meaning that \$4 is set from the \$0-relative data \.{xxxx} contained in the wyde after the instruction.
@^serial number@> Binary operators come in two flavors, strong and weak. The strong ones are essentially concerned with multiplication or division: \.{x*y}, \.{x/y}, \.{x//y}, \.{x\%y}, \.{x<<y}, \.{x>>y}, and \.{x\&y} stand respectively for $(x\times y)\bmod2^{64}$ (multiplication), $\lfloor x/y\rfloor$ (division), $\lfloor2^{64}x/y\rfloor$ (fractional division), $x\bmod y$ (remainder), $(x\times2^y)\bmod2^{64}$ (left~shift), $\lfloor x/2^y\rfloor$ (right shift), and $x\mathbin{\char`\&}y$ (bitwise and) on unsigned wydes. Division is legal only if $y>0$; fractional division is legal only if $x<y$. The weak binary operations \.{x+y}, \.{x-y}, \.{x\char'174 y}, and \.{x\^y} stand respectively for addition, subtraction, bitwise or, and bitwise exclusive-or on unsigned wydes. They can be applied to register numbers only in four contexts: $\<register>+\<pure>$, $\<pure>+\<register>$, $\<register>-\<pure>$ and $\<register>-\<register>$. For example, if \.{x} denotes \.{\$1} and \.{y} denotes \.{\$10}, then \.{x+3} and \.{3+x} denote \.{\$4}, and \.{y-x} denotes the pure value \.{9}. Register numbers within expressions are allowed to be arbitrary wydes, but a register number assigned as the equivalent of a symbol should not exceed 15. (Incidentally, one might ask why the designer of \ERISCAL\ did not simply adopt the existing rules of \CEE/ for expressions. The primary reason is that the designers of \CEE/ chose to give \.{<<}, \.{>>}, and \.\& a lower precedence than~\.+; but in \ERISCAL\ we want to be able to write things like \.{o<<24+x<<16+y<<8+z} or \.{@@+yz<<2} or \.{@@+(\#100-@@)\&\#ff}. Since the conventions of \CEE/ were inappropriate, it was better to make a clean break, not pretending to have a close relationship with that language. The new rules are quite easily memorized, because \ERISCAL\ has just two levels of precedence, and the strong binary operations are all essentially multiplicative by nature while the weak binary operations are essentially additive.)

@ A symbol is called a {\it future reference\/} until it has been defined. \ERISCAL\ restricts the use of future references, so that programs can be assembled quickly in one pass over the input; therefore all expressions can be evaluated when the \ERISCAL\ processor first sees them.
The restrictions are easily stated: Future references cannot be used in expressions together with unary or binary operators (except the unary~\.+, which does nothing); moreover, future references can appear as operands only in instructions that have relative addresses (namely branches, probable branches, \.{JMP}, \.{PUSHJ}, \.{GETA}) or in wyde constants (the pseudo-operation \.{WYDE}). Thus, for example, one can say \.{JMP}~\.{1F} or \.{JMP}~\.{1B-4}, but not \.{JMP}~\.{1F-4}.

@ We noted earlier that each \ERISCAL\ instruction contains a label field, an opcode field, and an operand field. The label field is either empty or a symbol or local label; when it is nonempty, the symbol or local label receives an equivalent. The operand field is either empty or a sequence of expressions separated by commas; when it is empty, it is equivalent to the simple operand field~`\.0'.
$$\vbox{\halign{$#$\hfil\cr
\<instruction>\is\<label>\<opcode>\<operand list>\cr
\<label>\is\<empty>\mid\<symbol>\mid\<local label>\cr
\<operand list>\is\<empty>\mid\<expression list>\cr
\<expression list>\is\<expression>\mid\<expression list>\.,\<expression>\cr
}}$$
The opcode field contains either a symbolic \ERISC\ operation name (like \.{ADD}), or an {\it alias operation}, or a {\it pseudo-operation}. Alias operations are alternate names for \ERISC\ operations whose standard names are inappropriate in certain contexts. Pseudo-operations do not correspond directly to \ERISC\ commands, but they govern the assembly process in important ways. There are ?????? alias operations: \smallskip
$$\vbox{\halign{$#$\hfil\cr
\<opcode>\is\<symbolic \ERISC\ operation>\mid\<alias operation>\cr
\hskip12pc\mid\<pseudo-operation>\cr
\<symbolic \ERISC\ operation>\is\.{LZ}\mid\cdots\mid\.{JMP}\cr
\<alias operation>\is\.{XXX}\mid\cdots\mid\.{ZZZ}\cr
\<pseudo-operation>\is\.{IS}\mid\.{LOC}\mid\.{PREFIX}\mid\.{DATA}\mid \.{CODE}\mid\.{BSPEC}\mid\.{ESPEC}\mid\.{WYDE}\cr
}}$$

@ \ERISC\ operations like \.{ADD} require exactly two expressions as operands.

@ In all cases when the opcode corresponds to an \ERISC\ operation, the \ERISCAL\ instruction tells the assembler to carry out three steps: (1)~Define the equivalent of the label field to be the current location, if the label is nonempty; (2)~Evaluate the operands and assemble the specified \ERISC\ instruction into the current location; (3)~Increase the current location by~1.

@ Now let's consider the pseudo-operations, starting with the simplest cases.

\bull\<label> \.{IS} \<expression> defines the value of the label to be the value of the expression, which must not be a future reference. The expression may be either pure or a register number.

\bull\<label> \.{LOC} \<expression> first defines the label to be the value of the current location, if the label is nonempty. Then the current location is changed to the value of the expression, which must be pure.

\smallskip For example, `\.{LOC} \.{\#1000}' will start assembling subsequent instructions or data in the location whose hexa\-decimal value is \Hex{1000}. `\.X~\.{LOC}~\.{@@+500}' defines \.X to be the address of the first of 500 wydes in memory; assembly will continue at location $\.X+500$. The operation of aligning the current location to a multiple of~256, if it is not already aligned in that way, can be expressed as `\.{LOC}~\.{@@+(256-@@)\&255}'. A less trivial example arises if we want to emit instructions and data into two separate areas of memory, but we want to intermix them in the \ERISCAL\ source file. We could start by defining \.{8H} and \.{9H} to be the starting addresses of the instruction and data segments, respectively. Then, a sequence of instructions could be enclosed in `\.{LOC}~\.{8B}; \dots; \.{8H}~\.{IS}~\.{@@}'; a sequence of data could be enclosed in `\.{LOC}~\.{9B}; \dots; \.{9H}~\.{IS}~\.{@@}'. Any number of such sequences could then be combined. Instead of the two pseudo-instructions `\.{8H}~\.{IS}~\.{@@;} \.{LOC}~\.{9B}' one could in fact write simply `\.{8H}~\.{LOC}~\.{9B}' when switching from instructions to data.

\bull \.{PREFIX} \<symbol> redefines the current prefix to be the given symbol (fully qualified). The label field should be blank.

@ The next pseudo-operations assemble wydes of data.

\bull \<label> \.{WYDE} \<expression list> defines the label to be the current location, if the label field is nonempty; then it assembles one wyde for each expression in the expression list, and advances the current location by the number of wydes. The expressions should all be pure numbers that fit in one wyde. String constants are often used in such expression lists. For example, if the current location is \Hex{1000}, the instruction \.{WYDE}~\.{"Hello",0} assembles six wydes containing the constants \.{'H'}, \.{'e'}, \.{'l'}, \.{'l'}, \.{'o'}, and~\.0 into locations \Hex{1000}, \dots,~\Hex{1005}, and advances the current location to \Hex{1006}.

@ Global registers play an important role when an \ERISC\ program starts, because we can give them starting values.

\bull \<label> \.{GREG} \<expression> allocates a new global register, and assigns its number as the equivalent of the label. At the beginning of assembly, the current global threshold~G is~\$0. Each distinct \.{GREG} instruction increases~G by~1. The value of the expression will be loaded into the global register at the beginning of the program, unless the ABI of the given operating system and/or source language dictates otherwise. Register \$0 always has a defined starting value, the value of the label \.{:Main}. When \ERISCAL\ programs use subroutines with a memory stack in addition to the built-in register stack, they usually begin with the instructions `\.{sp}~\.{GREG}~\.{0;fp}~\.{GREG}~\.0'; these instructions allocate a {\it stack pointer\/} \.{sp=\$1} and a {\it frame pointer\/} \.{fp=\$2}. Usually with `\.{lp}~\.{GREG}~\.{0;}' we also give a name to a {\it link pointer\/} \.{lp=\$3} for return addresses. However, subroutine libraries are free to implement any conventions for registers and stacks that they like.
@^stack pointer@> @^frame pointer@> @^link pointer@>

@ If our program will run on an \ERISC\ processor embedded in hardware supporting Harvard architecture instead of von Neumann architecture, we need two more pseudo-instructions. (In this case we have to use the \.{-h} compiler option.)

\bull \.{CODE} stops generating data for the data segment and begins generating code for the code segment; it has no effect if the \.{-h} option is not given.

\bull \.{DATA} stops generating code for the code segment and begins generating data for the data segment; it has no effect if the \.{-h} option is not given.

Note that with the \.{-h} option it is also possible to generate code in the data segment, but such code cannot be executed there (it can only be read and written); and it is possible to generate data in the code segment, but such data cannot be read or written there, only executed. Nevertheless, it is possible to write the contents of a register \.{\$s} into the code segment with a `\.{PUSH}~\.{\$0,\$s}' instruction and to execute it later; we do not recommend this trick in user programs, because some operating systems do not save the contents of the code segment when paging.

@ Finally, there are two pseudo-instructions to pass information and hints to the loading routine and/or to debuggers that will be using the assembled program.

\bull \.{BSPEC} \<expression> begins ``special mode''; the \<expression> should have a value that fits in two bytes, and the label field should be blank.

\bull \.{ESPEC} ends ``special mode''; the operand field is ignored, and the label field should be blank.

\smallskip\noindent All material assembled between \.{BSPEC} and \.{ESPEC} is passed directly to the output, but not loaded as part of the assembled program. Ordinary \ERISC\ instructions cannot appear in special mode; only the pseudo-operations \.{IS}, \.{PREFIX}, \.{WYDE} are allowed. The operand of \.{BSPEC} should have a value that fits in a wyde; this value identifies the kind of data that follows.
(For example, \.{BSPEC}~\.0 might introduce information about subroutine calling conventions at the current location, and \.{BSPEC}~\.1 might introduce line numbers from a high-level-language program that was compiled into the code at the current place. System routines often need to pass such information through an assembler to the operating system, hence \ERISCAL\ provides a general-purpose conduit.)

@ A program should begin at the special symbolic location \.{Main} @.Main@> (more precisely, at the address corresponding to the fully qualified symbol \.{:Main}). This symbol always has serial number~1, and it must always be defined. @^serial number@> Locations should not receive assembled data more than once. (More precisely, the loader will load the bitwise~xor of all the data assembled for each wyde position; but the general rule ``do not load two things into the same wyde'' is safest.) All locations that do not receive assembled data are initially zero, except that the operating system may put command-line data and debugger data into the data segment above the stack. (The rudimentary \ERISC\ operating system starts a program with the number of command-line arguments in and a pointer to the beginning of an array of argument pointers in the stack pointed to by \$2.)

@* Binary ERO output. When the \ERISCAL\ processor assembles a file called \.{foo.ers}, it produces a binary output file called \.{foo.ero}. (The suffix \.{ers} stands for ``\ERISC\ symbolic,'' and \.{ero} stands for ``\ERISC\ object.'') Such \.{ero} files have a simple structure consisting of a sequence of wydes. Some of the wydes are instructions to a loading routine; others are data to be loaded. @^object files@> Loader instructions are distinguished from wydes of data by their first three (most significant) nybbles, which have the special escape-code value \Hex{0e5}, called |ero| in the program below. This code value corresponds to \ERISC's instruction \.{RESUME}, which is unlikely to occur in wydes of data.
The last nybble of a loader instruction is the loader opcode, called the {\it lopcode}. @^lopcodes@>

@d ero 0x0e5

@ When a wyde of the \.{ero} file does not begin with the escape code, it is loaded into the current location~$\lambda$, and $\lambda$ is increased by one. More exactly, there may be two current locations, one for the code segment and one for the data segment, and we may switch between them. Both start at zero. The current line number is also increased by~1, if it is nonzero. When a wyde does begin with the escape code, its last nybble is the lopcode defining a loader instruction. There are fourteen lopcodes:

\bull |lop_quote|: $\Hex{0}$. Treat the next wyde as an ordinary wyde, even if it begins with the escape code.

\bull |lop_seg|: $\Hex{1}$. Switch between the data segment and the code segment.

\bull |lop_skip|: $\Hex{2}$. Increase the current location by the next wyde.

\bull |lop_fixw|: $\Hex{3}$. Load (by \.{XOR}) the value of the current location into wyde P, where P~is the 16-bit address defined by the next wyde. (The wyde at~P was previously assembled as zero because of a future reference.)

\bull |lop_fixr|: $\Hex{4}$. Load (by \.{XOR}) the next wyde called $\delta$ into the \.{SRC} field of the wyde in location~P, where P~is the address that precedes the current location by $\delta$. (This nybble was previously loaded by an \ERISC\ instruction with a relative address. Its \.{SRC} field was previously assembled as zero because of a future reference.)

\bull |lop_fixwx|: $\Hex{5}$. Proceed as in |lop_fixw|, but load the current location to the other segment.

\bull |lop_fixrx|: $\Hex{6}$. Proceed as in |lop_fixr|, but load the current location to the other segment.

\bull |lop_file|: $\Hex{9}$. Set the current file number to the upper half of the next wyde and the current line number to~zero. The lower half of the next wyde gives the length of the file name. The following wydes are the characters of the file name.
If this file number has occurred previously, the file name has length zero.

\bull |lop_line|: $\Hex{a}$. Set the current line number to the next wyde. If the line number is nonzero, the current file and current line should correspond to the source location that generated the next data to be loaded, for use in diagnostic messages. (The \ERISCAL\ program gives precise line numbers to the sources of wydes in the code segment, which tend to be instructions, but not to the sources of wydes assembled in data segments.)

\bull |lop_spec|: $\Hex{b}$. Begin special data of the type given by the next wyde. The subsequent wydes, continuing until the next loader operation other than |lop_quote|, comprise the special data. A |lop_quote| instruction allows wydes of special data to begin with the escape code.

\bull |lop_pre|: $\Hex{c}$. A~|lop_pre| instruction, which defines the ``preamble,'' must be the first wyde of every \.{ero} file. The higher byte of the next wyde specifies the version number of \.{ero} format, currently~1; other version numbers may be defined later, but version~1 should always be supported as described in the present document. The lower byte of the next wyde specifies how many wydes following a |lop_pre| command provide additional information that might be of interest to system routines. If it is nonzero, the first two wydes of additional information, in big-endian order, record the time that this \.{ero} file was created, measured in seconds since 00:00:00 Greenwich Mean Time on 1~Jan~1970.

\bull |lop_post|: $\Hex{d}$. This instruction begins the {\it postamble}, which follows all instructions and data to be loaded. It causes \$0, \$1, \dots,~\$15 to be initially set to the values of the next 16 wydes.

\bull |lop_stab|: $\Hex{e}$. This instruction must appear immediately after the wydes following~|lop_post|. It is followed by the symbol table, which lists the equivalents of all user-defined symbols in a compact form that will be described later.

\bull |lop_end|: $\Hex{f}$. This instruction, together with the wyde that follows it, must form the very last two wydes of each \.{ero} file. The next wyde states exactly how many wydes appear between it and the |lop_stab| command. (Therefore a program can easily find the symbol table without reading forward through the entire \.{ero} file.)

\smallskip A separate routine called \.{EROtype} is available to translate binary \.{ero} files into human-readable form.

@d lop_quote 0x0 /* the quotation lopcode */
@d lop_seg 0x1 /* the segment change lopcode */
@d lop_skip 0x2 /* the skip lopcode */
@d lop_fixw 0x3 /* the wyde-fix lopcode */
@d lop_fixr 0x4 /* the relative-fix lopcode */
@d lop_fixwx 0x5 /* extended wyde-fix lopcode */
@d lop_fixrx 0x6 /* extended relative-fix lopcode */
@d lop_file 0x9 /* the file name lopcode */
@d lop_line 0xa /* the file position lopcode */
@d lop_spec 0xb /* the special hook lopcode */
@d lop_pre 0xc /* the preamble lopcode */
@d lop_post 0xd /* the postamble lopcode */
@d lop_stab 0xe /* the symbol table lopcode */
@d lop_end 0xf /* the end-it-all lopcode */

@ Many readers will have noticed that \ERISCAL\ has no facilities for relocatable output, nor does \.{ero} format support such features. Knuth's first drafts of \.{MMIXAL} and \.{mmo} did allow relocatable objects, with external linkages, but the rules were substantially more complicated and therefore inconsistent with the goals of {\sl The Art of Computer Programming}. His \.{MMIXAL} design might actually prove to be superior to the current practice, now that computer memory is significantly cheaper than it used to be, because one-pass assembly and loading are extremely fast when relocatability and external linkages are disallowed. Different program modules can be assembled together about as fast as they could be linked together under a relocatable scheme, and they can communicate with each other in much more flexible ways.
Debugging tools are enhanced when open-source libraries are combined with user programs, and such libraries will certainly improve in quality when their source form is accessible to a larger community of users.

@* Basic data types. This program for the 16-bit ELTE RISC architecture is based on 32-bit integer arithmetic, because it is essential that it can be rewritten for the \.{scc} compiler running on \ERISC. The definition of type \&{wyde} should be changed when that is done. @^system dependencies@>

@=
typedef unsigned int wyde; /* assumes that an int is at least 16 bits wide */
typedef unsigned int tetra; /* assumes that an int is exactly 32 bits wide */
typedef enum {@!false,@!true}@+@!bool;

@ @=
wyde zero_wyde; /* |zero_wyde=0| */
wyde neg_one=-1; /* |neg_one=-1| */
wyde aux; /* auxiliary output of a subroutine */
bool overflow; /* set by certain subroutines for signed arithmetic */

@ Left and right shifts are not difficult.

@=
wyde shift_left @,@,@[ARGS((wyde,int))@];@+@t}\6{@>
wyde shift_left(y,s) /* shift left by $s$ bits, where $0\le s\le16$ */
  wyde y; int s;
{
  while (s>=8) y<<=8,s-=8;
  y<<=s;
  return y&0xffff;
}
@#
wyde shift_right @,@,@[ARGS((wyde,int,int))@];@+@t}\6{@>
wyde shift_right(y,s,u) /* shift right by $s$ bits, where $0\le s\le16$; arithmetically if $u=0$ */
  wyde y; int s,u;
{
  wyde sign;
  y&=0xffff;
  sign=(!u && (y&0x8000))? 0xffff: 0; /* copies of the sign bit to shift in */
  if (s>=16) return sign;
  return ((y>>s)|(sign<<(16-s)))&0xffff;
}

@* Multiplication. We need to multiply two unsigned 16-bit integers, obtaining an unsigned 32-bit product. It is easy to do this on a 16-bit machine by using Algorithm 4.3.1M of {\sl Seminumerical Algorithms}, with $b=2^8$. @^multiprecision multiplication@> The following subroutine returns the lower half of the product, and puts the upper half into a global wyde called |aux|.
@= wyde wmult @,@,@[ARGS((wyde,wyde))@];@+@t}\6{@> wyde wmult(y,z) wyde y,z; { wyde u,v,t; wyde acc; u=y&0xff; v=z&0xff; t=u*v; acc=t&0xff; t>>=8; /* low times low */ y>>=8; y&=0xff; v*=y; v+=t; t=v&0xff; v>>=8; /* high times low */ z>>=8; z&=0xff; u*=z; u+=t; t=u&0xff; u>>=8; /* low times high */ aux=y*z; aux+=u; acc+=t; /* high times high */ return acc; } @ Division inputs the high half of a dividend in the global variable~|aux| and returns the remainder in~|aux|. Long division of an unsigned 32-bit integer by an unsigned 16-bit integer is, of course, one of the most challenging routines needed for \ERISC\ arithmetic. The following program, based on Algorithm 4.3.1D of {\sl Seminumerical Algorithms}, computes wydes $q$ and $r$ such that $(2^{16}x+y)=qz+r$ and $0\le r @= wyde wdiv @,@,@[ARGS((wyde,wyde,wyde))@];@+@t}\6{@> wyde wdiv(x,y,z) wyde x,y,z; { int j,n; wyde zl,zh,c,q,m,t; x&=0xffff;@+ y&=0xffff;@+ z&=0xffff; if (x>=z) @+{ @+aux=y&0xffff; @+return x&0xffff; @+} n=0; while (!(z&(1<<15))) { z<<=1; c=y>>15; x<<=1; x+=c; y<<=1; ++n; } zl=z&0xff; z-=zl; zh=z>>8; aux=0; for (j=1;j>=0;j--) { if (x>=z) q=0xff; else q=x/zh; /*approx q-digit */ m=zl*q; t=(m&0xff)<<8; m>>8; /* q times low part */ c=0; t=y-t; if (t>y) ++c; y=t; /* multiple back */ t=x-c; c=0; if (t>x) ++c; x=t-m; if (x>t) ++c; t=x; x=t-zh*q; if(x>t) ++c; while (c) { /* add back while carry */ --q; t=0; y+=(zl<<8); if (y<(zl<<8)) ++t; x+=t; if (x>8; y<<=8; } x>>=n; return x; } @ Here's a rudimentary check to see if arithmetic is in trouble. @ Future versions of this program will work with symbols formed from Unicode characters, but the present code limits itself to an 8-bit subset. @^Unicode@> The type \&{Char} is defined here in order to ease the later transition: At present, \&{Char} is the same as \&{char}, but \&{Char} can be changed to a 16-bit type in the Unicode version. 
Other changes will also be necessary when the transition to Unicode is made; for example, some calls of |fprintf| will become calls of |fwprintf|, and some occurrences of \.{\%s} will become \.{\%ls} in print formats. The switchable type name \&{Char} provides at least a first step towards a brighter future with Unicode. @= typedef char Char; /* bytes that will become wydes some day */ @ While we're talking about classic systems versus future systems, we might as well define the |ARGS| macro, which makes function prototypes available on {\mc ANSI \CEE/} systems without making them uncompilable on older systems. Each subroutine below is declared first with a prototype, then with an old-style definition. @= #ifdef __STDC__ #define ARGS(list) list #else #define ARGS(list) () #endif @* Basic input and output. Input goes into a buffer that is normally limited to 72 characters. This limit can be raised, by using the \.{-b} option when invoking the assembler; but short buffers will keep listings from becoming unwieldy, because a symbolic listing adds 19 characters per~line. 
@= if (buf_size<72) buf_size=72; buffer=(Char*)calloc(buf_size+1,sizeof(Char)); lab_field=(Char*)calloc(buf_size+1,sizeof(Char)); op_field=(Char*)calloc(buf_size,sizeof(Char)); operand_list=(Char*)calloc(buf_size,sizeof(Char)); err_buf=(Char*)calloc(buf_size+60,sizeof(Char)); if (!buffer || !lab_field || !op_field || !operand_list || !err_buf) panic("No room for the buffers"); @.No room...@> @ @= Char *buffer; /* raw input of the current line */ Char *buf_ptr; /* current position within |buffer| */ Char *lab_field; /* copy of the label field of the current instruction */ Char *op_field; /* copy of the opcode field of the current instruction */ Char *operand_list; /* copy of the operand field of the current instruction */ Char *err_buf; /* place where dynamic error messages are sprinted */ @ @= if (!fgets(buffer,buf_size+1,src_file)) break; ++line_no; line_listed=false; j=strlen(buffer); if (buffer[j-1]=='\n') buffer[j-1]='\0'; /* remove the newline */ else if ((j=fgetc(src_file))!=EOF) @; if (buffer[0]=='#') @; buf_ptr=buffer; @ @= { while(j!='\n' && j!= EOF) j=fgetc(src_file); if (!long_warning_given) { long_warning_given=true; err("*trailing characters of long input line have been dropped"); @.trailing characters...@> fprintf(stderr, "(say `-b ' to increase the length of my input buffer)\n"); }@+else err("*trailing characters dropped"); } @ @= int cur_file; /* index of the current file in |filename| */ int line_no; /* current position in the file */ bool line_listed; /* have we listed the buffer contents? */ bool long_warning_given; /* have we given the hint about \.{-b}? */ @ We keep track of source file name and line number at all times, for error reporting and for synchronization data in the object file. Up to 256 different source file names can be remembered. @= Char *filename[257]; /* source file names, including those in line directives */ int filename_count; /* how many |filename| entries have we filled? 
*/ @ If the current line is a line directive, it will also be treated as a comment by the assembler. @= { for (p=buffer+1;isspace(*p);p++); for (j=0;isdigit(*p);p++) j=10*j+*p-'0'; for (;isspace(*p);p++); if (*p=='\"') { if (!filename[filename_count]) { filename[filename_count]=(Char*)calloc(FILENAME_MAX+1,sizeof(Char)); if (!filename[filename_count]) panic("Capacity exceeded: Out of filename memory"); @.Capacity exceeded...@> } for (p++,k=0;*p && *p!='\"' && k= #ifndef FILENAME_MAX #define FILENAME_MAX 256 #endif @ @= register Char *p,*q; /* the place where we're currently scanning */ @ The next several subroutines are useful for preparing a listing of the assembled results. In such a listing, which the user can request with a command-line option, we fill the leftmost 19 columns with a representation of the output that has been assembled from the input in the buffer. Sometimes the assembled output requires more than one line, because we have room to output only a tetrabyte per line. The |flush_listing_line| subroutine is called when we have finished generating one line's worth of assembled material. Its parameter is a string to be printed between the assembled material and the buffer contents, if the input line hasn't yet been echoed. The length of this string should be 19 minus the number of characters already printed on the current line of the listing. @= void flush_listing_line @,@,@[ARGS((char*))@];@+@t}\6{@> void flush_listing_line(s) char *s; { if (line_listed) fprintf(listing_file,"\n"); else { fprintf(listing_file,"%s%s\n",s,buffer); line_listed=true; } } @ Only the two least significant hex digits of a location are shown on the listing, unless the other digits have changed. The following subroutine prints an extra line when a change needs to be shown. 
@= void update_listing_loc @,@,@[ARGS((void))@];@+@t}\6{@> void update_listing_loc() { if (cur_seg!=listing_seg || ((cur_loc^listing_loc)&0xff00)) { fprintf(listing_file,"%01x%04x:",cur_seg,cur_loc); flush_listing_line(" "); } listing_seg=cur_seg;@+ listing_loc=cur_loc; } @ @= wyde cur_loc; /* current location of assembled output */ wyde cur_seg=0; /* current segment of assembled output */ wyde cur_code_loc; /* current location of assembled output */ wyde cur_data_loc; /* current location of assembled output */ wyde listing_loc; /* current location on the listing */ wyde listing_seg; /* current segment on the listing */ unsigned char hold_buf[4]; /* assembled nybbles */ unsigned char held_bits; /* which nybbles of |hold_buf| are active? */ unsigned char listing_bits; /* which of them haven't been listed yet? */ bool spec_mode; /* are we between |BSPEC| and |ESPEC|? */ wyde spec_mode_loc; /* number of wydes in the current special output */ @ When nybbles are assembled, they are placed into the |hold_buf|. Furthermore, |listing_bits| is increased by |0x10<= void listing_clear @,@,@[ARGS((void))@];@+@t}\6{@> void listing_clear() { register int j; if (spec_mode) fprintf(listing_file," "); else { update_listing_loc(); fprintf(listing_file,"%02x: ",listing_loc); } for (j=0;j<4;j++) if (listing_bits&(0x10<>1]>>4:hold_buf[j>>1]&0xf); flush_listing_line(" "); listing_bits=0; } @ Error messages are written to |stderr|. If the message begins with `\.*' it is merely a warning; if it begins with `\.!' it is fatal; otherwise the error is probably serious enough to make manual correction necessary, yet it is not tragic. Errors and warnings appear also on the optional listing file. 
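The prefix convention just described can be summarized in a small standalone sketch. The type |severity| and the function |classify_message| are hypothetical names introduced only for illustration; the assembler itself simply tests the first character of the message where it is needed.

```c
/* Hypothetical names for illustration only; not part of ERISCAL. */
typedef enum { SEV_WARNING, SEV_ERROR, SEV_FATAL } severity;

/* Classify a message by its first character, following the convention
   described in the text: '*' marks a warning, '!' marks a fatal error,
   and anything else is an ordinary error. */
static severity classify_message(const char *message)
{
  if (message[0] == '*') return SEV_WARNING;
  if (message[0] == '!') return SEV_FATAL;
  return SEV_ERROR;
}
```

In both marked cases the text that is actually printed starts at |message+1|, so the marker character never reaches the user.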
@d err(m) {@+report_error(m);@+if (m[0]!='*') goto bypass;@+} @d derr(m,p) {@+sprintf(err_buf,m,p); report_error(err_buf);@+if (err_buf[0]!='*') goto bypass;@+} @d dderr(m,p,q) {@+sprintf(err_buf,m,p,q); report_error(err_buf);@+if (err_buf[0]!='*') goto bypass;@+} @d panic(m) {@+sprintf(err_buf,"!%s",m);@+report_error(err_buf);@+} @d dpanic(m,p) {@+err_buf[0]='!';@+sprintf(err_buf+1,m,p);@+ report_error(err_buf);@+} @= void report_error @,@,@[ARGS((char*))@];@+@t}\6{@> void report_error(message) char *message; { if (!filename[cur_file]) filename[cur_file]="(nofile)"; if (message[0]=='*') fprintf(stderr,"\"%s\", line %d warning: %s\n", filename[cur_file],line_no,message+1); else if (message[0]=='!') fprintf(stderr,"\"%s\", line %d fatal error: %s\n", filename[cur_file],line_no,message+1); else { fprintf(stderr,"\"%s\", line %d: %s!\n", filename[cur_file],line_no,message); err_count++; } if (listing_file) { if (!line_listed) flush_listing_line("****************** "); if (message[0]=='*') fprintf(listing_file, "************ warning: %s\n",message+1); else if (message[0]=='!') fprintf(listing_file, "******** fatal error: %s!\n",message+1); else fprintf(listing_file, "********** error: %s!\n",message); } if (message[0]=='!') exit(-2); } @ @= int err_count; /* this many errors were found */ @ Output to the binary |obj_file| occurs four nybbles at a time. The nybbles are assembled in small buffers, not output as single wyde, because we want the output to be big-endian even when the assembler is running on a little-endian machine. 
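The byte order can be illustrated by a minimal sketch. The helper |put_wyde_be| is a hypothetical name, not a routine of this program; it only shows how a 16-bit value is split into two bytes, most significant first, so that the bytes handed to |fwrite| come out in the same order on every host.

```c
/* Hypothetical helper, not part of ERISCAL: store the low 16 bits of w
   into buf[0..1], most significant byte first (big-endian), regardless
   of the byte order of the machine running this code. */
static void put_wyde_be(unsigned int w, unsigned char buf[2])
{
  buf[0] = (unsigned char)((w >> 8) & 0xff);  /* high byte goes out first */
  buf[1] = (unsigned char)(w & 0xff);         /* low byte goes out second */
}
```

Writing the value with a single |fwrite| of an integer variable would instead emit the bytes in host order, which is exactly what the small buffers avoid.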
@^big-endian versus little-endian@> @^little-endian versus big-endian@> @d ero_write(buf) if (fwrite(buf,1,2,obj_file)!=2) dpanic("Can't write on %s",obj_file_name) @.Can't write...@> @= unsigned char lop_quote_command[2]={(ero>>4)&0xff,((ero<<4)&0xf0)+lop_quote}; unsigned char ero_buf[2]; int ero_ptr; @ @= void ero_clear @,@,@[ARGS((void))@]; void ero_out @,@,@[ARGS((void))@]; void ero_wyde @,@,@[ARGS((wyde))@]; void ero_trie_wyde @,@,@[ARGS((wyde))@]; void ero_lop @,@,@[ARGS((char,unsigned char,unsigned char))@]; void ero_lopp @,@,@[ARGS((char,wyde))@]; void ero_clear() /* clears |hold_buf|, when |held_bits!=0| */ { if (hold_buf[0]==(ero>>4) && (hold_buf[1]>>4)==(ero&0xf)) ero_write(lop_quote_command); ero_write(hold_buf); if (listing_file && listing_bits) listing_clear(); held_bits=0; hold_buf[0]=hold_buf[1]=0; ++ero_cur_loc; if (ero_line_no) ++ero_line_no; } @# void ero_out() { if (held_bits) ero_clear(); ero_write(ero_buf); } @# void ero_wyde(t) /* output a wyde */ wyde t; { ero_buf[0]=(t>>8)&0xff;@+ ero_buf[1]=t&0xff; ero_out(); } @# void ero_trie_wyde(t) /* output a trie wyde */ wyde t; { ero_buf[0]=(t>>8)&0xff;@+ ero_buf[1]=t&0xff; ero_out(); ++ero_ptr; } @# void ero_lop(x,y,z) /* output a loader operation */ char x; unsigned char y,z; { ero_buf[0]=(ero>>4)&0xff;@+ ero_buf[1]=((ero<<4)&0xf0)+(x&0xf); ero_out(); ero_buf[0]=y&0xff;@+ ero_buf[1]=z&0xff; ero_out(); } @# void ero_lopp(x,yz) /* output a loader operation with wyde operand */ char x; wyde yz; { ero_buf[0]=(ero>>4)&0xff;@+ ero_buf[1]=((ero<<4)&0xf0)+(x&0xf); ero_out(); ero_buf[0]=(yz>>8)&0xff;@+ ero_buf[1]=yz&0xff; ero_out(); } @ The |ero_seg| subroutine makes the current segment in the object file equal to |cur_seg|. @= void ero_seg @,@,@[ARGS((void))@];@+@t}\6{@> void ero_seg() { if (held_bits) ero_clear(); if (ero_cur_seg!=cur_seg) ero_wyde((ero<<4)+lop_seg); ero_cur_seg=cur_seg; } @ The |ero_loc| subroutine makes the current location in the object file equal to |cur_loc|.
@= void ero_loc @,@,@[ARGS((void))@];@+@t}\6{@> void ero_loc() { wyde w; ero_seg(); w=(cur_loc-ero_cur_loc)&0xffff; if (w) ero_lopp(lop_skip,w); ero_cur_loc=cur_loc; } @ Similarly, the |ero_sync| subroutine makes sure that the current file and line number in the output file agree with |cur_file| and |line_no|. @= void ero_sync @,@,@[ARGS((void))@];@+@t}\6{@> void ero_sync() { register int j; register unsigned char *p; if (cur_file!=ero_cur_file) { if (filename_passed[cur_file]) ero_lop(lop_file,cur_file,0); else { ero_lop(lop_file,cur_file,strlen(filename[cur_file])); for (p=filename[cur_file];*p;p++) { ero_buf[0]=(*p>>8)&0xff; ero_buf[1]=*p&0xff; ero_out(); } filename_passed[cur_file]=1; } ero_cur_file=cur_file; ero_line_no=0; } if (line_no!=ero_line_no) { if (line_no>=0x10000) panic("I can't deal with line numbers exceeding 65535"); @.I can't deal with...@> ero_lopp(lop_line,line_no); ero_line_no=line_no; } } @ @= wyde ero_cur_loc; /* current location in the object file */ wyde ero_cur_seg; /* current segment in the object file */ int ero_line_no; /* current line number in the \.{ero} output so far */ int ero_cur_file; /* index of the current file in the \.{ero} output so far */ char filename_passed[256]; /* has a filename been recorded in the output? */ @ Here is a basic subroutine that assembles a wyde starting at |cur_loc|. The |x_bits| parameter tells which wydes, if any, are part of a future reference. @= void assemble @,@,@[ARGS((wyde,unsigned char))@];@+@t}\6{@> void assemble(dat,x_bits) wyde dat; unsigned char x_bits; /* These nybbles will be listed as x */ { register int j,l; if (spec_mode) l=spec_mode_loc; else { l=cur_loc; @; } hold_buf[0]=(dat>>8)&0xff; hold_buf[1]=dat&0xff; listing_bits|=x_bits+0xf; held_bits=0xf; if (listing_file) listing_clear(); ero_clear(); if (spec_mode) ++spec_mode_loc; else ++cur_loc; } @ @= if (cur_seg!=ero_cur_seg) ero_seg(); if ((cur_loc^ero_cur_loc)&0xffff) ero_loc(); @* The symbol table. 
Symbols are stored and retrieved by means of a {\it ternary search trie}, following ideas of Bentley and Sedgewick. (See {\sl ACM--SIAM Symp.\ on Discrete Algorithms\/ \bf8} (1997), 360--369; R.~Sedgewick, {\sl Algorithms in C\/} (Reading, Mass.:\ Addison--Wesley, 1998), \S15.4.) Each trie node stores a character, @^Bentley, Jon Louis@> @^Sedgewick, Robert@> and there are branches to subtries for the cases where a given character is less than, equal to, or greater than the character in the trie. There also is a pointer to a symbol table entry if a symbol ends at the current node. @s sym_tab_struct int @= typedef struct ternary_trie_struct { unsigned short ch; /* the (possibly wyde) character stored here */ struct ternary_trie_struct *left, *mid, *right; /* downward in the ternary trie */ struct sym_tab_struct *sym; /* equivalents of symbols */ } trie_node; @ We allocate trie nodes in chunks of 1000 at a time. @= trie_node* new_trie_node @,@,@[ARGS((void))@];@+@t}\6{@> trie_node* new_trie_node() { register trie_node *t=next_trie_node; if (t==last_trie_node) { t=(trie_node*)calloc(1000,sizeof(trie_node)); if (!t) panic("Capacity exceeded: Out of trie memory"); @.Capacity exceeded...@> last_trie_node=t+1000; } next_trie_node=t+1; return t; } @ @= trie_node *trie_root; /* root of the trie */ trie_node *op_root; /* root of subtrie for opcodes */ trie_node *next_trie_node, *last_trie_node; /* allocation control */ trie_node *cur_prefix; /* root of subtrie for unqualified symbols */ @ The |trie_search| subroutine starts at a given node of the trie and finds a given string in its middle subtrie, inserting new nodes if necessary. The string ends with the first nonletter or nondigit; the location of the terminating character is stored in global variable~|terminator|. 
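The three-way branching can be seen in isolation in the following sketch of a read-only lookup. It is only an illustration under stated assumptions: the type |tst_node| and the function |tst_find| are hypothetical names, the string is taken to end at its null byte rather than at the first nonletter or nondigit, and no nodes are inserted, unlike the assembler's own search.

```c
#include <stddef.h>

/* Illustrative node type; the fields follow the description in the text,
   but this is not the assembler's own declaration. */
typedef struct tst_node {
  unsigned short ch;                    /* character stored at this node */
  struct tst_node *left, *mid, *right;  /* less than, next level, greater than */
} tst_node;

/* Hypothetical read-only lookup: follow s through the middle subtrie of t.
   Returns the node reached by the last character of s, or NULL if some
   character is absent; an empty string yields t itself. */
static tst_node *tst_find(tst_node *t, const char *s)
{
  for (; *s; s++) {
    unsigned short c = (unsigned char)*s;
    t = t->mid;                         /* descend to the next level */
    while (t && t->ch != c)             /* binary search among this level's nodes */
      t = (c < t->ch) ? t->left : t->right;
    if (!t) return NULL;                /* character not present */
  }
  return t;
}
```

Each character of the key costs one step along a |mid| link plus a short binary search through |left| and |right| links, which is why common prefixes are shared between symbols.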
@d isletter(c) (isalpha(c)||c=='_'||c==':'||(unsigned int)(c)>126) @= trie_node *trie_search @,@,@[ARGS((trie_node*,Char*))@]; Char *terminator; /* where the search ended */ trie_node *trie_search(t,s) trie_node *t; Char *s; { register trie_node *tt=t; register Char *p=s; while (1) { if (!isletter(*p) && !isdigit(*p)) { terminator=p;@+return tt; } if (tt->mid) { tt=tt->mid; while (*p!=tt->ch) { if (*p<tt->ch) { if (tt->left) tt=tt->left; else { tt->left=new_trie_node();@+tt=tt->left;@+goto store_new_char; } }@+else { if (tt->right) tt=tt->right; else { tt->right=new_trie_node();@+tt=tt->right;@+goto store_new_char; } } } p++; }@+else { tt->mid=new_trie_node();@+tt=tt->mid; store_new_char: tt->ch=*p++; } } } @ Symbol table nodes hold the serial numbers and equivalents of defined symbols. They also hold ``fixup information'' for undefined symbols; this will allow the loader to correct any previously assembled instructions that refer to such symbols when they are eventually defined. In the symbol table node for a defined symbol, the |link| field has one of the special codes |DEFINED| or |REGISTER| or |PREDEFINED|, and the |equiv| field holds the defined value. The |serial| number is a unique identifier for all user-defined symbols. In the symbol table node for an undefined symbol, the |equiv| field is ignored. The |link| field points to the first node of fixup information; that node is, in turn, a symbol table node that might link to other fixups.
The |serial| number in a fixup node is either 0 or 1 or 2, meaning respectively ``fixup the wyde pointed to by |equiv|'' or ``fixup the relative address in the YZ field of the instruction pointed to by |equiv|'' or ``fixup the relative address in the XYZ field of the instruction pointed to by |equiv|.'' @s sym_node int @s bool int @d DEFINED (sym_node*)1 /* code value for wyde equivalents */ @d REGISTER (sym_node*)2 /* code value for register-number equivalents */ @d PREDEFINED (sym_node*)3 /* code value for not-yet-used equivalents */ @d seg_bit 1 /* |serial| code bit for data segment */ @d nyb_bit 2 /* |serial| code bit for signed nybble fixup */ @d rel_bit 4 /* |serial| code bit for relative fixup */ @= typedef struct sym_tab_struct { int serial; /* serial number of symbol; type number for fixups */ struct sym_tab_struct *link; /* |DEFINED| status or link to fixup */ wyde equiv; /* the equivalent value */ wyde seg; /* the segment: 0 for code, 1 for data, if it is */ } sym_node; @ The allocation of new symbol table nodes proceeds in chunks, like the allocation of trie nodes. But in this case we also have the possibility of reusing old fixup nodes that are no longer needed. @d recycle_fixup(pp) pp->link=sym_avail, sym_avail=pp @= sym_node* new_sym_node @,@,@[ARGS((bool))@];@+@t}\6{@> sym_node* new_sym_node(serialize) bool serialize; /* should the new node receive a unique serial number? 
*/ { register sym_node *p=sym_avail; if (p) { sym_avail=p->link;@+p->link=NULL;@+p->serial=0;@+p->equiv=zero_wyde; @+p->seg=zero_wyde; }@+else { p=next_sym_node; if (p==last_sym_node) { p=(sym_node*)calloc(1000,sizeof(sym_node)); if (!p) panic("Capacity exceeded: Out of symbol memory"); @.Capacity exceeded...@> last_sym_node=p+1000; } next_sym_node=p+1; } if (serialize) p->serial=++serial_number; return p; } @ @= int serial_number; sym_node *sym_root; /* root of the sym */ sym_node *next_sym_node, *last_sym_node; /* allocation control */ sym_node *sym_avail; /* stack of recycled symbol table nodes */ @ We initialize the trie by inserting all the predefined symbols. Opcodes are given the prefix \.{\^}, to distinguish them from ordinary symbols; this character nicely divides uppercase letters from lowercase letters. @= trie_root=new_trie_node(); cur_prefix=trie_root; op_root=new_trie_node(); trie_root->mid=op_root; trie_root->ch=':'; op_root->ch='^'; @; @; @ Most of the assembly work can be table driven, based on bits that are stored as the ``equivalents'' of opcode symbols like \.{\^ADD}. @d arg_num_bits 0x3 /* number of arguments: 0,1,2,>=3? */ @d immed_bit 0x4 /* immediate addressing is allowed? */ @d dest_bit 0x8 /* destination modified addressing is allowed? */ @d reg_bit 0x10 /* register addressing is allowed? */ @d indir_bit 0x20 /* indirect addressings is allowed? */ @d no_label_bit 0x40 /* should the label be blank? */ @d spec_bit 0x80 /* is this opcode allowed in \.{SPEC} mode? 
*/ @= typedef struct { Char *name; /* symbolic opcode */ int bits; /* treatment of operands */ } op_spec; @# typedef enum { @!IS=0x01,@!LOC,@!PREFIX,@!BSPEC=0x11,@!ESPEC,@!WYDE,@!GREG=0x21, @!CODE,@!DATA}@+@!pseudo_op; @ @= op_spec op_init_table[]={@/ {"LZ", 0x003e}, @.LZ@> {"TRAP", 0xe506}, @.LZ@> {"RESUME", 0xe512}, @.LZ@> {"JMP", 0xf43e}, @.JMP@> {"IS", (IS<<8)+0x81}, @.IS@> {"LOC", (LOC<<8)+0x01}, @.LOC@> {"PREFIX", (PREFIX<<8)+0xc1}, @.PREFIX@> {"WYDE", (WYDE<<8)+0x83},@/ @.WYDE@> {"GREG", (GREG<<8)+0x81}, @.GREG@> {"CODE", (CODE<<8)+0x00}, @.CODE@> {"DATA", (DATA<<8)+0x00}, @.DATA@> {"BSPEC", (BSPEC<<8)+0x41}, @.BSPEC@> {"ESPEC", (ESPEC<<8)+0xc0}};@/ @.ESPEC@> int op_init_size; /* the number of items in |op_init_table| */ @ @= op_init_size=(sizeof op_init_table)/sizeof(op_spec); for (j=0;jsym=new_sym_node(false); pp->link=PREDEFINED; pp->equiv=op_init_table[j].bits; pp->seg=0; } @ @= register trie_node *tt; register sym_node *pp,*qq; @ @= typedef struct { Char* name; wyde h,l; }@+predef_spec; @ @= predef_spec predefs[]={ {"Inf",1,0xff00},@/ @.Inf@> {"StdIn",0,0}, @.StdIn@> {"StdOut",0,1}, @.StdOut@> {"StdErr",0,2},@/ @.StdErr@> {"TextRead",0,0}, @.TextRead@> {"TextWrite",0,1}, @.TextWrite@> {"BinaryRead",0,2}, @.BinaryRead@> {"BinaryWrite",0,3}, @.BinaryWrite@> {"BinaryReadWrite",0,4},@/ @.BinaryReadWrite@> {"Halt",0,0}, @.Halt@> {"Fopen",0,1}, @.Fopen@> {"Fclose",0,2}, @.Fclose@> {"Fread",0,3}, @.Fread@> {"Fgets",0,4}, @.Fgets@> {"Fgetws",0,5}, @.Fgetws@> {"Fwrite",0,6}, @.Fwrite@> {"Fputs",0,7}, @.Fputs@> {"Fputws",0,8}, @.Fputws@> {"Fseek",0,9}, @.Fseek@> {"Ftell",0,10}}; @.Ftell@> int predef_size; @^predefined symbols@> @ @= predef_size=(sizeof predefs)/sizeof(predef_spec); for (j=0;jsym=new_sym_node(false); pp->link=PREDEFINED; pp->seg=harvard&predefs[j].h, pp->equiv=predefs[j].l; } @ We place \.{Main} into the trie at the beginning of assembly, so that it will show up as an undefined symbol if the user specifies no starting point. 
@.Main@> @= trie_search(trie_root,"Main")->sym=new_sym_node(true); @ At the end of assembly we traverse the entire symbol table, visiting each symbol in lexicographic order and transmitting the trie structure to the output file. We detect any undefined future references at this time. The order of traversal has a simple recursive pattern: To traverse the subtrie rooted at~|t|, we $$\vbox{\halign{#\hfil\cr traverse |t->left|, if the left subtrie is nonempty;\cr visit |t->sym|, if this symbol table entry is present;\cr traverse |t->mid|, if the middle subtrie is nonempty;\cr traverse |t->right|, if the right subtrie is nonempty.\cr }}$$ This pattern leads to a compact representation in the \.{ero} file, usually requiring fewer than two wydes per trie node plus the wydes needed to encode the equivalents and serial numbers. Each node of the trie is encoded as a ``master wyde'' followed by the encodings of the left subtrie, character, equivalent, middle subtrie, and right subtrie. If possible, we put the character |ch| and part or all of the equivalent into the master wyde. 
The master wyde is the sum of $$ \vbox{\halign{#\hfil\cr \Hex{8000}, if the left subtrie is nonempty;\cr \Hex{4000}, if the middle subtrie is nonempty;\cr \Hex{2000}, if the right subtrie is nonempty;\cr \qquad and one of the following values:\cr \Hex{0xyz}, if the symbol's equivalent is \$0 plus |x|\cr \qquad and the character code is |yz|;\cr \Hex{1xyz}, if |xyz=(s<<10)+ch|, where |s| is the symbol's segment and the\cr \qquad character code is at most 10 bits (so the most significant bit of |x| is 0);\cr \Hex{1xyz}, if the symbol is undefined and |xyz=(1<<11)+ch|, where the\cr \qquad character code is at most 10 bits (so the most significant bits of |x| are 10);\cr \Hex{1c0z}, if the symbol's equivalent is \$0 plus |z|, and |ch| is in a\cr \qquad separate wyde (so the most significant bits of the second wyde are 110);\cr \Hex{1e0z}, if the symbol's segment is |z|, and |ch| is in a separate wyde\cr \qquad (so the bits of the second wyde are 1110);\cr \Hex{1f0z}, if the symbol is undefined, and |ch| is in a separate wyde\cr \qquad (so the bits of the second wyde are 1111);\cr}} $$ the character is omitted if the middle subtrie and the equivalent are both empty. Symbol equivalents are followed by the serial number, represented as a wyde. @ First we prune the trie by removing all predefined symbols that the user did not redefine. @= trie_node* prune @,@,@[ARGS((trie_node*))@];@+@t}\6{@> trie_node* prune(t) trie_node* t; { register int useful=0; if (t->sym) { if (t->sym->serial) useful=1; else t->sym=NULL; } if (t->left) { t->left=prune(t->left); if (t->left) useful=1; } if (t->mid) { t->mid=prune(t->mid); if (t->mid) useful=1; } if (t->right) { t->right=prune(t->right); if (t->right) useful=1; } if (useful) return t; else return NULL; } @ Then we output the trie by following the recursive traversal pattern. @= void out_stab @,@,@[ARGS((trie_node*))@];@+@t}\6{@> void out_stab(t) trie_node* t; { register int m=0; /* master wyde */ register int s=1; /* defined serial?
*/ register int c=1; /* out character in separate wyde? */ register sym_node *pp; if (t->ch>0x3ff) m+=0x1f00; else m+=t->ch,c=0; if (t->left) m+=0x8000; if (t->mid) m+=0x4000; if (t->right) m+=0x2000; if (t->sym) { if (t->sym->link==REGISTER) if (t->ch<0xff) m+=0x1000+t->ch+(((t->sym->equiv)&0xf)<<7),c=0; else m+=0x1c00+((t->sym->equiv)&0xf); else if (t->sym->link==DEFINED) if (t->ch<0x3ff) m+=0x1800+(t->sym->seg<<10)+t->ch,c=0; else m+=0x1e00+t->sym->seg; else if (t->sym->link || t->sym->serial==1) @@; else s=0; } else s=0; ero_trie_wyde(m); if (t->left) out_stab(t->left); if (m&0x4000 || c || s) @mid|@>; if (t->right) out_stab(t->right); } @ We make room for symbols up to 999 bytes long. Strictly speaking, the program should check if this limit is exceeded; but really! @= Char sym_buf[1000]; Char *sym_ptr; @ A global variable called |sym_buf| holds all characters on middle branches to the current trie node; |sym_ptr| is the first currently unused character in |sym_buf|. @^Unicode@> @mid|@>= { if (c) ero_trie_wyde(t->ch); *sym_ptr++=(t->ch>0xff? '?': t->ch); /* Unicode? not yet */ if (s && t->sym->link) { if (listing_file) @; ero_trie_wyde(t->sym->serial); } if (t->mid) out_stab(t->mid); sym_ptr--; } @ The initial `\.:' of each fully qualified symbol is omitted here, since most users of \ERISCAL\ will probably not need the \.{PREFIX} feature. One consequence of this omission is that the one-character symbol~`\.:' itself, which is allowed by the rules of \ERISCAL, is printed as the null string. @= { *sym_ptr='\0'; fprintf(listing_file," %s = ",sym_buf+1); pp=t->sym; if (pp->link==DEFINED) fprintf(listing_file,"#%01x%04x",pp->seg,pp->equiv); else if (pp->link==REGISTER) fprintf(listing_file,"$%02d",pp->equiv); else fprintf(listing_file,"?"); fprintf(listing_file," (%d)\n",pp->serial); } @ @= { *sym_ptr=(t->ch>0xff? '?' : t->ch); /* Unicode? 
not yet */ *(sym_ptr+1)='\0'; fprintf(stderr,"undefined symbol: %s\n",sym_buf+1); @.undefined symbol@> err_count++,s=0; if (t->ch<0x3ff) m+=0x1800+t->ch,c=1; else m+=0x1fff; } @ @= op_root->mid=NULL; /* annihilate all the opcodes */ prune(trie_root); sym_ptr=sym_buf; if (listing_file) fprintf(listing_file,"\nSymbol table:\n"); ero_wyde((ero<<4)+lop_stab); out_stab(trie_root); ero_lopp(lop_end,ero_ptr); @* Expressions. The most intricate part of the assembly process is the task of scanning and evaluating expressions in the operand field. Fortunately, \ERISCAL's expressions have a simple structure that can be handled easily with a stack-based approach. Two stacks hold pending data as the operand field is scanned and evaluated. The |op_stack| contains operators that have not yet been performed; the |val_stack| contains values that have not yet been used. After an entire operand list has been scanned, the |op_stack| will be empty and the |val_stack| will hold the operand values needed to assemble the current instruction. @ Entries on |op_stack| have one of the constant values defined here, and they have one of the precedence levels defined here. Entries on |val_stack| have |equiv|, |link|, and |status| fields; the |link| points to a trie node if the expression is a symbol that has not yet been subjected to any operations. @= typedef enum {@!indirectize,@!relativize,@!negate,@!serialize, @!complement,@!registerize,@| @!plus,@!minus,@!times,@!over,@!frac,@!mod,@!shl,@!shr,@!and,@!or,@!xor,@| @!outer_lp,@!outer_rp,@!inner_lp,@!inner_rp} @!stack_op; typedef enum {@!zero,@!weak,@!strong,@!unary} @!prec; typedef enum {@!pure,@!reg_val,@!undefined,@!rel_undefined,@| @!ind_pure,@!ind_reg_val,@!ind_undefined,@!ind_rel_undefined} @!stat; typedef struct { wyde equiv; /* current value */ trie_node *link; /* trie reference for symbol */ stat status; /* |pure|, |reg_val|, |undefined|, ... 
*/ } val_node; @ @d top_op op_stack[op_ptr-1] /* top entry on the operator stack */ @d top_val val_stack[val_ptr-1] /* top entry on the value stack */ @d next_val val_stack[val_ptr-2] /* next-to-top entry of the value stack */ @= stack_op *op_stack; /* stack for pending operators */ int op_ptr; /* number of items on |op_stack| */ val_node *val_stack; /* stack for pending operands */ int val_ptr; /* number of items on |val_stack| */ prec precedence[]={unary,unary,unary,unary,unary,unary,@| weak,weak,strong,strong,strong,strong,strong,strong,strong,weak,weak,@| zero,zero,zero,zero}; /* precedences of the respective |stack_op| values */ stack_op rt_op; /* newly scanned operator */ wyde acc; /* temporary accumulator */ @ @= op_stack=(stack_op*)calloc(buf_size,sizeof(stack_op)); val_stack=(val_node*)calloc(buf_size,sizeof(val_node)); if (!op_stack || !val_stack) panic("No room for the stacks"); @.No room...@> @ The operand field of an instruction will have been copied into a separate \&{Char} array called |operand_list| when we reach this part of the program. @= p=operand_list; val_ptr=0; /* |val_stack| is empty */ op_stack[0]=outer_lp, op_ptr=1; /* |op_stack| contains an ``outer left parenthesis'' */ while (1) { @; scan_close: @; while (precedence[top_op]>=precedence[rt_op]) @; hold_op: op_stack[op_ptr++]=rt_op; } operands_done:@; @ A comment that follows an empty operand list needs to be detected here. 
@= scan_open:@+if (isletter(*p)) @@; else if (isdigit(*p)) { if (*(p+1)=='F') @@; else if (*(p+1)=='B') @@; else @; }@+else@+ switch(*p++) { case '#': @;@+break; case '\'': @;@+break; case '\"': @;@+break; case '@@': @;@+break; case '*': op_stack[op_ptr++]=indirectize;@+goto scan_open; case '+': op_stack[op_ptr++]=relativize;@+goto scan_open; case '-': op_stack[op_ptr++]=negate;@+goto scan_open; case '&': op_stack[op_ptr++]=serialize;@+goto scan_open; case '~': op_stack[op_ptr++]=complement;@+goto scan_open; case '$': op_stack[op_ptr++]=registerize;@+goto scan_open; case '(': op_stack[op_ptr++]=inner_lp;@+goto scan_open; default: if (p==operand_list+1) { /* treat operand list as empty */ operand_list[0]='0', operand_list[1]='\0', p=operand_list; goto scan_open; } if (*(p-1)) derr("syntax error at character `%c'",*(p-1)) derr("syntax error after character `%c'",*(p-2)) @.syntax error...@> } @ @= { if (*p==':') tt=trie_search(trie_root,p+1); else tt=trie_search(cur_prefix,p); p=terminator; symbol_found: val_ptr++; pp=tt->sym; if (!pp) pp=tt->sym=new_sym_node(true); top_val.link=tt, top_val.equiv=pp->equiv; if (pp->link==PREDEFINED) pp->link=DEFINED; top_val.status=(pp->link==DEFINED? pure: pp->link==REGISTER? reg_val: undefined); } @ @= { tt=&forward_local_host[*p-'0'];@+ p+=2;@+ goto symbol_found; } @ @= { tt=&backward_local_host[*p-'0'];@+ p+=2;@+ goto symbol_found; } @ Statically allocated variables |forward_local_host[j]| and |backward_local_host[j]| masquerade as nodes of the trie. @= trie_node forward_local_host[10], backward_local_host[10]; sym_node forward_local[10], backward_local[10]; @ Initially \.{0H}, \.{1H}, \dots, \.{9H} are defined to be zero. @= for (j=0;j<10;j++) { forward_local_host[j].sym=&forward_local[j]; backward_local_host[j].sym=&backward_local[j]; backward_local[j].link=DEFINED; } @ We have already checked to make sure that the character constant is legal. 
@= acc=*p; p+=2; goto constant_found; @ @= acc=*p; if (*p=='\"') { p++; acc=0; err("*null string is treated as zero") @.null string...@> }@+else if (*(p+1)=='\"') p+=2; else *p='\"', *--p=','; goto constant_found; @ @= acc=*p-'0'; for (p++;isdigit(*p);p++) { acc+=(acc<<2); acc=(acc<<1)+(*p-'0'); } constant_found: val_ptr++; top_val.link=NULL; top_val.equiv=acc; top_val.status=pure; @ @= if (!isxdigit(*p)) err("illegal hexadecimal constant"); @.illegal hexadecimal constant@> acc=0; for (;isxdigit(*p);p++) { acc=(acc<<4)+(*p-'0'); if (*p>='a') acc+='0'-'a'+10; else if (*p>='A') acc+='0'-'A'+10; } goto constant_found; @ @= acc=cur_loc; goto constant_found; @ @= switch(*p++) { case '+': rt_op=plus;@+break; case '-': rt_op=minus;@+break; case '*': rt_op=times;@+break; case '/':@+if (*p!='/') rt_op=over; else p++,rt_op=frac;@+break; case '%': rt_op=mod;@+break; case '<': rt_op=shl;@+goto sh_check; case '>': rt_op=shr; sh_check: p++;@+if (*(p-1)==*(p-2)) break; derr("syntax error at `%c'",*(p-2)); @.syntax error...@> case '&': rt_op=and;@+break; case '|': rt_op=or;@+break; case '^': rt_op=xor;@+break; case ')': rt_op=inner_rp;@+break; case '\0': case ',': rt_op=outer_rp;@+break; default: derr("syntax error at `%c'",*(p-1)); } @ @= switch(op_stack[--op_ptr]) { case inner_lp:@+if (rt_op==inner_rp) goto scan_close; err("*missing right parenthesis");@+break; @.missing right parenthesis@> case outer_lp:@+if (rt_op==outer_rp) { if ((top_val.status==reg_val || top_val.status==ind_reg_val) &&@| top_val.equiv>0xf) { err("*register number too large, will be reduced mod 16"); @.register number...@> top_val.equiv &= 0xf; } if (!*(p-1)) goto operands_done; else rt_op=outer_lp;@+goto hold_op; /* comma */ }@+else { op_ptr++; err("*missing left parenthesis"); @.missing left parenthesis@> goto scan_close; } @t\4@>@@; @t\4@>@@; } @ Now we come to the part where equivalents are changed by unary or binary operators found in the expression being scanned. 
The most typical operator, and in some ways the fussiest one to deal with, is binary addition. Once we've written the code for this case, the other cases almost take care of themselves. @= case plus:@+if (top_val.status>=ind_pure) err("cannot add an indirect quantity"); @.cannot add...@> if (next_val.status>=ind_pure) err("cannot add to an indirect quantity"); if (top_val.status>=undefined) err("cannot add an undefined quantity"); if (next_val.status>=undefined) err("cannot add to an undefined quantity"); if (top_val.status==reg_val && next_val.status==reg_val) err("cannot add two register numbers"); next_val.equiv+=top_val.equiv; fin_bin: next_val.status=(top_val.status==next_val.status? pure: reg_val); val_ptr--; delink: top_val.link=NULL;@+break; @ @d unary_check(verb) if (top_val.status!=pure) derr("can %s pure values only",verb) @= case indirectize:@+if (top_val.status if (top_val.status==pure) top_val.equiv-=cur_loc; else top_val.status=rel_undefined;@+goto delink; case negate: unary_check("negate"); @.can negate...@> top_val.equiv=zero_wyde-top_val.equiv;@+goto delink; case complement: unary_check("complement"); @.can complement...@> top_val.equiv=~top_val.equiv; goto delink; case registerize: unary_check("registerize"); @.can registerize...@> top_val.status=reg_val;@+goto delink; case serialize:@+if (!top_val.link) err("can take serial number of symbol only"); @.can take serial number...@> top_val.equiv=top_val.link->sym->serial; top_val.status=pure;@+goto delink; @ @d binary_check(verb) if (top_val.status!=pure || next_val.status!=pure) derr("can %s pure values only",verb) @= case minus:@+if (top_val.status>=ind_pure) err("cannot subtract an indirect quantity"); @.cannot subtract...@> if (top_val.status>=undefined) err("cannot subtract an undefined quantity"); if (next_val.status>=ind_pure) err("cannot subtract from an indirect quantity"); if (next_val.status>=undefined) err("cannot subtract from an undefined quantity"); if (top_val.status==reg_val && 
next_val.status!=reg_val) err("cannot subtract register number from pure value"); next_val.equiv-=top_val.equiv;@+goto fin_bin; case times: binary_check("multiply"); @.can multiply...@> next_val.equiv=wmult(next_val.equiv,top_val.equiv);@+goto fin_bin; case over: case mod: binary_check("divide"); @.can divide...@> if (top_val.equiv==0) err("*division by zero"); @.division by zero@> next_val.equiv=wdiv(zero_wyde,next_val.equiv,top_val.equiv); if (op_stack[op_ptr]==mod) next_val.equiv=aux; goto fin_bin; case frac: binary_check("compute a ratio of"); @.can compute...@> if (next_val.equiv>=top_val.equiv) err("*illegal fraction"); @.illegal fraction@> next_val.equiv=wdiv(next_val.equiv,zero_wyde,top_val.equiv);@+goto fin_bin; case shl: case shr: binary_check("compute a bitwise shift of"); if (top_val.equiv>15) next_val.equiv=zero_wyde; else if (op_stack[op_ptr]==shl) next_val.equiv<<=top_val.equiv; else next_val.equiv>>=top_val.equiv; goto fin_bin; case and: binary_check("compute bitwise and of"); next_val.equiv&=top_val.equiv; goto fin_bin; case or: binary_check("compute bitwise or of"); next_val.equiv|=top_val.equiv; goto fin_bin; case xor: binary_check("compute bitwise xor of"); next_val.equiv^=top_val.equiv; goto fin_bin; @* Assembling an instruction. Now let's move up from the expression level to the instruction level. We get to this part of the program at the beginning of a line, or after a semicolon at the end of an instruction earlier on the current line. Our current position in the buffer is the value of |buf_ptr|. 
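As a simplified illustration of the field structure, the following C sketch splits one instruction into its label, opcode, and operand fields and reports whether another instruction follows a semicolon. It is only a model: the names |fields|, |take_field|, |skip_blanks|, and |split_instruction| and the buffer sizes are assumptions of the sketch, and it ignores comment lines, label syntax checks, and string or character constants, all of which the real code must handle.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

typedef struct {
  char label[32], opcode[32], operand[96]; /* arbitrary sizes */
} fields;

/* Copy one blank- or semicolon-delimited field, truncating if necessary. */
static const char *take_field(const char *p, char *dst, size_t cap)
{
  size_t n = 0;
  while (*p && *p != ';' && !isspace((unsigned char)*p)) {
    if (n + 1 < cap) dst[n++] = *p;
    p++;
  }
  dst[n] = '\0';
  return p;
}

static const char *skip_blanks(const char *p)
{
  while (isspace((unsigned char)*p)) p++;
  return p;
}

/* Split the instruction at |p|. The result points just past a terminating
   semicolon, or is NULL when the rest of the line is to be ignored. */
const char *split_instruction(const char *p, fields *f)
{
  memset(f, 0, sizeof *f);
  if (!isspace((unsigned char)*p)) /* a leading blank means an empty label */
    p = take_field(p, f->label, sizeof f->label);
  p = take_field(skip_blanks(p), f->opcode, sizeof f->opcode);
  p = take_field(skip_blanks(p), f->operand, sizeof f->operand);
  p = skip_blanks(p);
  return *p == ';' ? p + 1 : NULL;
}
```

Calling |split_instruction| repeatedly on its own result, until it returns |NULL|, visits every instruction on a line.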
@= p=buf_ptr;@+ buf_ptr=""; @; @; @; buf_ptr=p; if (spec_mode && !(op_bits&spec_bit)) derr("cannot use `%s' in special mode",op_field); @.cannot use...@> if ((op_bits&no_label_bit) && lab_field[0]) { derr("*label field of `%s' instruction is ignored",op_field); lab_field[0]='\0'; } @.label field...ignored@> @; if (opcode==GREG) @; if (lab_field[0]) @; @; bypass:@; @ @= if (!*p) goto bypass; q=lab_field; if (!isspace(*p)) { if (!isdigit(*p)&&!isletter(*p)) goto bypass; /* comment */ for (*q++=*p++;isdigit(*p)||isletter(*p);p++,q++) *q=*p; if (*p && !isspace(*p)) derr("label syntax error at `%c'",*p); @.label syntax error...@> } *q='\0'; if (isdigit(lab_field[0]) && (lab_field[1]!='H' || lab_field[2])) derr("improper local label `%s'",lab_field); @.improper local label...@> for (p++;isspace(*p);p++); @ We copy the opcode field to a special buffer because we might want to refer to the symbolic opcode in error messages. @= q=op_field;@+ while (isletter(*p)||isdigit(*p)) *q++=*p++; *q='\0'; if (!isspace(*p) && *p && op_field[0]) derr("opcode syntax error at `%c'",*p); @.opcode syntax error...@> pp=trie_search(op_root,op_field)->sym; if (!pp) { if (op_field[0]) derr("unknown operation code `%s'",op_field); @.unknown operation code@> if (lab_field[0]) derr("*no opcode; label `%s' will be ignored",lab_field); @.no opcode...@> goto bypass; } opcode=(pp->equiv>>8)&0xff, op_bits=pp->equiv&0xff; while (isspace(*p)) p++; @ @= wyde opcode; /* numeric code for \ERISC\ operation or \ERISCAL\ pseudo-op */ wyde op_bits; /* flags describing an operator's special characteristics */ wyde arg_num; /* number of arguments: 0,1,2,>=3 */ @ We copy the operand field to a special buffer so that we can change string constants while scanning them later. 
@= q=operand_list; while (*p) { if (*p==';') break; if (*p=='\'') { *q++=*p++; if (!*p) err("incomplete character constant"); @.incomplete...constant@> *q++=*p++; if (*p!='\'') err("illegal character constant"); @.illegal character constant@> }@+else if (*p=='\"') { for (*q++=*p++;*p && *p!='\"';p++,q++) *q=*p; if (!*p) err("incomplete string constant"); } *q++=*p++; if (isspace(*p)) break; } while (isspace(*p)) p++; if (*p==';') p++; else p=""; /* if not followed by semicolon, rest of the line is a comment */ if (q==operand_list) *q++='0'; /* change empty operand field to `\.0' */ *q='\0'; @ @= { if (greg==15) err("too many global registers") @.too many global registers@> else { ++greg; greg_val[greg]=val_stack[0].equiv; } } @ If the label is, say \.{2H}, we will already have used the old value of \.{2B} when evaluating the operands. Furthermore, an operand of \.{2F} will have been treated as undefined, which it still is. Symbols can be defined more than once, but only if each definition gives them the same equivalent value. A warning message is given when a predefined symbol is being redefined, if its predefined value has already been used. 
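The behavior of local labels described here can be modeled in a few lines of C. The sketch below is an illustration under simplifying assumptions, not the data structure of the program: the type |local_labels| and the functions |locals_init|, |use_backward|, |use_forward|, |define_here|, and |count_undefined| are hypothetical, and forward references are merely counted instead of being kept on a fixup list.

```c
/* Hypothetical bookkeeping for the ten local labels 0H..9H. */
typedef struct {
  unsigned backward[10]; /* value currently denoted by jB */
  int pending[10];       /* number of unresolved references to jF */
} local_labels;

/* Initially every jB is defined to be zero and no jF is pending. */
void locals_init(local_labels *t)
{
  int j;
  for (j = 0; j < 10; j++) t->backward[j] = 0, t->pending[j] = 0;
}

/* An operand jB always has a value: the most recent definition of jH. */
unsigned use_backward(const local_labels *t, int j)
{
  return t->backward[j];
}

/* An operand jF is undefined for now; remember that it needs fixing. */
void use_forward(local_labels *t, int j)
{
  t->pending[j]++;
}

/* A label jH at |loc| resolves all pending jF references and becomes
   the new meaning of jB; the number of resolved references is returned. */
int define_here(local_labels *t, int j, unsigned loc)
{
  int fixed = t->pending[j];
  t->pending[j] = 0;
  t->backward[j] = loc;
  return fixed;
}

/* At the end of assembly, any jF still pending is an error. */
int count_undefined(const local_labels *t)
{
  int j, n = 0;
  for (j = 0; j < 10; j++) if (t->pending[j]) n++;
  return n;
}
```

Because operands are evaluated before the label is defined, an instruction labeled \.{2H} whose operand is \.{2B} corresponds to calling |use_backward| first and |define_here| afterwards, so the operand sees the old value.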
@= { sym_node *new_link=DEFINED; acc=cur_loc; if (opcode==IS) { cur_loc=val_stack[0].equiv; if (val_stack[0].status==reg_val) new_link=REGISTER; }@+else if (opcode==GREG) cur_loc=greg, new_link=REGISTER; @; if (pp->link==DEFINED || pp->link==REGISTER) { if (pp->seg!=cur_seg || pp->equiv!=cur_loc || pp->link!=new_link) { if (pp->serial) derr("symbol `%s' is already defined",lab_field); @.symbol...already defined@> pp->serial=++serial_number; derr("*redefinition of predefined symbol `%s'",lab_field); @.redefinition...@> } }@+ else if (pp->link==PREDEFINED) pp->serial=++serial_number; else if (pp->link) { if (new_link==REGISTER) err("future reference cannot be to a register"); @.future reference cannot...@> do @@;@+while (pp->link); } if (isdigit(lab_field[0])) pp=&backward_local[lab_field[0]-'0']; pp->equiv=cur_loc;@+pp->seg=cur_seg;@+ pp->link=new_link; @; if (listing_file && (opcode==IS || opcode==LOC)) @; cur_loc=acc; } @ @= if (!isdigit(lab_field[0])) for (j=0;jsym==pp) { val_stack[j].status=(new_link==REGISTER? 
reg_val: pure); val_stack[j].equiv=cur_loc; } @ @= if (isdigit(lab_field[0])) pp=&forward_local[lab_field[0]-'0']; else { if (lab_field[0]==':') tt=trie_search(trie_root,lab_field+1); else tt=trie_search(cur_prefix,lab_field); pp=tt->sym; if (!pp) pp=tt->sym=new_sym_node(true); } @ @= { qq=pp->link; pp->link=qq->link; @@; recycle_fixup(qq); } @ @= { wyde w; int s; s=qq->serial; if ((s&seg_bit)!=cur_seg) /* different segment; parentheses are needed because |!=| binds tighter than |&| */ dderr("location #%01x%04x is in a different segment",qq->seg,qq->equiv) else { @@; if (s&rel_bit) { k=0; w=cur_loc-qq->equiv; if (s&nyb_bit) { if(!(w&0x8000)) if (w<0x8) ero_lopp(lop_fixr,w); else k=1; else if (w>=0xfff8) ero_lopp(lop_fixr,w&0xf); else k=1; }@+else ero_lopp(lop_fixr,w); if (k) dderr("relative address in location #%01x%04x is too far away", qq->seg,qq->equiv); } else { k=0; w=qq->equiv; if (s&nyb_bit) { if(!(w&0x8000)) if (w<0x8) ero_lopp(lop_fixw,w); else k=1; else if (w>=0xfff8) ero_lopp(lop_fixw,w&0xf); else k=1; }@+else ero_lopp(lop_fixr,w); if (k) dderr("defined nybble in location #%01x%04x is too large", qq->seg,qq->equiv); } } } @ @= if (new_link==DEFINED) { fprintf(listing_file,"(%04x)",cur_loc); flush_listing_line(" "); }@+else { fprintf(listing_file,"($%02d)",cur_loc&0xf); flush_listing_line(" "); } @ @= future_bits=0; arg_num=op_bits&arg_num_bits; if (arg_num==3) @@; else@+switch (arg_num) { case 0:@+if (val_ptr>1) derr("opcode `%s' needs no operand",op_field); @.opcode...operand(s)@> @; break; case 1:@+if (val_ptr>1) derr("opcode `%s' needs one operand",op_field); @.opcode...operand(s)@> @; break; case 2:@+if (val_ptr!=2) derr("opcode `%s' must have two operands",op_field)@; @; break; default: derr("too many operands for opcode `%s'",op_field); @.too many operands...@> } @ The many-operand operator is |WYDE|.
@= for (j=0;j; if (val_stack[j].status==undefined || val_stack[j].status==rel_undefined) assemble(0,0xf0); else assemble(val_stack[j].equiv,0); } @ @= if (val_stack[j].status>=ind_pure) { err("*indirect number used as a constant")@; val_stack[j].status-=ind_pure; } if (val_stack[j].status==reg_val) err("*register number used as a constant")@; @.register number...@> else if (val_stack[j].status==undefined) { pp=val_stack[j].link->sym; qq=new_sym_node(false); qq->link=pp->link; pp->link=qq; qq->serial=cur_seg; qq->equiv=cur_loc; } else if (val_stack[j].status==rel_undefined) { pp=val_stack[j].link->sym; qq=new_sym_node(false); qq->link=pp->link; pp->link=qq; qq->serial=cur_seg+rel_bit; qq->equiv=cur_loc; } @ Individual fields of an instruction are placed into global variables |x|, |y|, |z|. @= wyde x,y,z; /* pieces for assembly */ int future_bits; /* places where there are future references */ char addr_mode; /* places where there are the addressing mode bits */ @ @= z=0; /* Presuppose one-wyde code */ @; switch (addr_mode) { case 0: @@; @@; @; break; case 1: @@; @@; @@; break; case 2: @@; @@; @@; break; case 3: @@; @@; @@; break; } assemble_DST: @; assemble_inst: assemble((x<<12)+((opcode+addr_mode)<<4)+y,future_bits); if (z) @; @ @= if (!((1<<(2+addr_mode))&op_bits)) dderr("addressing mode %01d is not allowed by `%s'",addr_mode,op_field); @ @= acc=val_stack[1].equiv; val_stack[1].equiv=val_stack[0].equiv; val_stack[0].equiv=acc; acc=val_stack[1].status; val_stack[1].status=val_stack[0].status; val_stack[0].status=acc; @ @= if (opcode==0xe5) { /* TRAP */ if (val_stack[0].status!=reg_val||val_stack[0].equiv==0) derr("DST field of `%s' should be a nonzero register",op_field); @.DST field...nonzero register@> if (val_stack[1].status!=pure) derr("SRC field of `%s' should be a number",op_field); @.SRC field...a number@> if (val_stack[1].equiv>0xf) err("SRC field doesn't fit in one unsigned nybble"); @.SRC field...unsigned nybble@> y=val_stack[1].equiv&0xf;@+ break; } @ 
@= if (opcode==0xe5) { /* RESUME */ if (val_stack[0].status!=reg_val||val_stack[0].equiv) derr("DST field of `%s' should be the zero register",op_field); @.DST field...the zero register@> if (val_stack[1].equiv>0xf) err("*SRC field doesn't fit in one unsigned nybble"); @.SRC field doesn't fit...@> y=val_stack[1].equiv&0xf,x=0,addr_mode=0;@+ goto assemble_inst; } else if (opcode==0xc5) /* MOR */ addr_mode=0; else if (opcode==0xe6) /* SADD */ addr_mode=0; @ @= { if (opcode==0xd5) { /* PUSH */ if (val_stack[1].status!=reg_val) derr("SRC field of `%s' should be a register",op_field); @.SRC field...a register@> if (val_stack[1].equiv>0xf) err("*SRC field doesn't fit in one unsigned nybble"); @.SRC field doesn't fit...@> y=val_stack[1].equiv&0xf,addr_mode=0;@+ goto assemble_DST; } else if (opcode==0xe7) /* POP */ addr_mode=0; } @ @= { if (val_stack[1].status==undefined) @@; else if (val_stack[1].status==rel_undefined) @@; else if (val_stack[1].status==reg_val) derr("*SRC field of `%s' should not be a register number",op_field) @.SRC field...register number@> else { if (val_stack[1].equiv<0xfff8 && val_stack[1].equiv>0x7) /* a signed nybble holds $-8$ (|0xfff8|) through 7 */ err("*SRC field doesn't fit in one signed nybble"); @.SRC field doesn't fit...@> y=val_stack[1].equiv&0xf; } } @ @= if (val_stack[1].status!=reg_val) derr("*SRC field of `%s' should be a register number",op_field); if (val_stack[1].equiv>0xf) err("*SRC field doesn't fit in one unsigned nybble"); @.SRC field doesn't fit...@> y=val_stack[1].equiv&0xf; @ @= if (val_stack[1].status>=ind_pure) addr_mode=3; else if (val_stack[1].status==reg_val) if (val_stack[0].status==reg_val) addr_mode=2; else addr_mode=1; else addr_mode=0; @ @= val_stack[1].status-=ind_pure; if (val_stack[1].status==reg_val) { if (!val_stack[1].equiv) derr("*SRC field of `%s' should not be *$0",op_field); @.SRC field...register number@> @; } else y=0,z=1; @ @= if (val_stack[0].status!=reg_val) derr("*DST field of `%s' should be a register number",op_field); if (val_stack[0].equiv>0xf) 
err("*DST field doesn't fit in one nybble"); @.DST field doesn't fit...@> x=val_stack[0].equiv&0xf; @ @= { pp=val_stack[0].link->sym; qq=new_sym_node(false); qq->link=pp->link; pp->link=qq; qq->serial=cur_seg+nyb_bit; qq->equiv=cur_loc; y=0; future_bits=0x80; goto assemble_DST; } @ @= { pp=val_stack[0].link->sym; qq=new_sym_node(false); qq->link=pp->link; pp->link=qq; qq->serial=cur_seg+nyb_bit+rel_bit; qq->equiv=cur_loc; y=0; future_bits=0x80; goto assemble_DST; } @ @= if (val_stack[1].status==pure) assemble(val_stack[1].equiv,0); else if (val_stack[1].status==undefined) @@; else /* |val_stack[1].status==rel_undefined| */ @; @ @= { pp=val_stack[1].link->sym; qq=new_sym_node(false); qq->link=pp->link; pp->link=qq; qq->serial=cur_seg; qq->equiv=cur_loc; assemble(0,0xf0); } @ @= { pp=val_stack[1].link->sym; qq=new_sym_node(false); qq->link=pp->link; pp->link=qq; qq->serial=cur_seg+rel_bit; qq->equiv=cur_loc; assemble(0,0xf0); } @ @= switch(opcode) { /* Pseudo operations */ case CODE: if (harvard && cur_seg) { cur_data_loc=cur_loc; cur_loc=cur_code_loc; cur_seg=0; }@+goto bypass; case DATA: if (harvard && !cur_seg) { cur_code_loc=cur_loc; cur_loc=cur_data_loc; cur_seg=harvard; }@+goto bypass; case LOC: cur_loc=val_stack[0].equiv; case IS: goto bypass; case PREFIX:@+if (!val_stack[0].link) err("not a valid prefix"); @.not a valid prefix@> cur_prefix=val_stack[0].link;@+goto bypass; case GREG:@+if (listing_file) @; goto bypass; case BSPEC:@+if (val_stack[0].equiv>0xffff) err("*operand of `BSPEC' doesn't fit in a wyde"); @.operand of `BSPEC'...@> ero_loc();@+ero_sync(); ero_lopp(lop_spec,val_stack[0].equiv); spec_mode=true;@+spec_mode_loc=0;@+ goto bypass; case ESPEC: spec_mode=false;@+goto bypass; } @ @= wyde greg_val[256]; /* initial values of global registers */ @ @= fprintf(listing_file,"($%02d=#%04x",greg,val_stack[0].equiv); flush_listing_line(" "); @* Running the program. 
On a \UNIX/-like system, the command
$$\.{eriscal [options] sourcefilename}$$
will assemble the \ERISCAL\ program in file \.{sourcefilename}, writing any error messages on the standard error file. (Nothing is written to the standard output.) The options, which may appear in any order, are:
\bull\.{-o objectfilename}\quad Send the output to a binary file called \.{objectfilename}. If no \.{-o} specification is given, the object file name is obtained from the input file name by changing the final letter from `\.s' to~`\.o', or by appending `\.{.ero}' if \.{sourcefilename} doesn't end with~\.s.
\bull\.{-l listingname}\quad Output a listing of the assembled input and output to a text file called \.{listingname}.
\bull\.{-h}\quad Allow a Harvard-type architecture, assembling data into a space separate from instructions.
\bull\.{-b bufsize}\quad Allow up to \.{bufsize} characters per line of input.
@ Here, finally, is the overall structure of this program. @c #include #include #include #include #include @# @@; @@; @@; @@; @# int main(argc,argv) int argc;@+ char *argv[]; { register int j,k; /* all-purpose integers */ @; @; @; while(1) { @; while(1) { @; if (!*buf_ptr) break; } if (listing_file) { if (listing_bits) listing_clear(); else if (!line_listed) flush_listing_line(" "); } } @; } @ @= for (j=1;j argv[0],"[-l listingname] [-b#] [-h] [-o objectfilename]"); exit(-1); } src_file_name=argv[j]; @ @= src_file=fopen(src_file_name,"r"); if (!src_file) dpanic("Can't open the source file %s",src_file_name); @.Can't open...@> if (!obj_file_name[0]) { j=strlen(src_file_name); if (src_file_name[j-1]=='s') { strcpy(obj_file_name,src_file_name);@+ obj_file_name[j-1]='o'; } else sprintf(obj_file_name,"%s.ero",src_file_name); } obj_file=fopen(obj_file_name,"wb"); if (!obj_file) dpanic("Can't open the object file %s",obj_file_name); if (listing_name[0]) { listing_file=fopen(listing_name,"w"); if (!listing_file) dpanic("Can't open the listing file %s",listing_name); } @ @= char 
*src_file_name; /* name of the \ERISCAL\ input file */ char obj_file_name[FILENAME_MAX+1]; /* name of the binary output file */ char listing_name[FILENAME_MAX+1]; /* name of the optional listing file */ FILE *src_file, *obj_file, *listing_file; bool harvard; /* 1 if separate data memory, else 0 */ int buf_size; /* maximum number of characters per line of input */ tetra present_time; /* THE time */ @ @= @; filename[0]=src_file_name; filename_count=1; @; @ @= ero_lopp(lop_pre,3); ero_wyde(0x101); present_time=time(NULL); ero_wyde(present_time>>16); ero_wyde(present_time&0xffff); ero_cur_file=-1; @ @= @; @; @; if (err_count) if (err_count>1) fprintf(stderr,"(%d errors were found.)\n",err_count); else fprintf(stderr,"(One error was found.)\n"); exit(err_count); @ @= int greg=0; /* global register allocator */ @ @= ero_lopp(lop_post,greg+1); greg_val[0]=trie_search(trie_root,"Main")->sym->equiv; for (j=0;j<=greg;j++) ero_wyde(greg_val[j]); @ @= for (j=0;j<10;j++) if (forward_local[j].link) ++err_count,fprintf(stderr,"undefined local symbol %dF\n",j); @.undefined local symbol@> @* Index.