summaryrefslogtreecommitdiff
path: root/internal.doc
diff options
context:
space:
mode:
Diffstat (limited to 'internal.doc')
-rw-r--r--internal.doc268
1 files changed, 268 insertions, 0 deletions
diff --git a/internal.doc b/internal.doc
new file mode 100644
index 0000000..f04152a
--- /dev/null
+++ b/internal.doc
@@ -0,0 +1,268 @@
+Internals of the Netwide Assembler
+==================================
+
+The Netwide Assembler is intended to be a modular, re-usable x86
+assembler, which can be embedded in other programs, for example as
+the back end to a compiler.
+
+The assembler is composed of modules. The interfaces between them
+look like:
+
+ +---- parser.c ----+
+ | | |
+ | float.c |
+ | |
+ +--- assemble.c ---+
+ | | |
+ nasm.c ---+ insnsa.c +--- nasmlib.c
+ | |
+ +---- labels.c ----+
+ | |
+ +--- outform.c ----+
+ | |
+ +----- *out.c -----+
+
+In other words, each of `parser.c', `assemble.c', `labels.c',
+`outform.c' and each of the output format modules `*out.c' are
+independent modules, which do not inter-communicate except through
+the main program.
+
+The Netwide *Disassembler* is not intended to be particularly
+portable or reusable or anything, however. So I won't bother
+documenting it here. :-)
+
+nasmlib.c
+---------
+
+This is a library module; it contains simple library routines which
+may be referenced by all other modules. Among these are a set of
+wrappers around the standard `malloc' routines, which will report a
+fatal error if they run out of memory, rather than returning NULL.
+
+parser.c
+--------
+
+This contains a source-line parser. It parses `canonical' assembly
+source lines, containing some combination of the `label', `opcode',
+`operand' and `comment' fields: it does not process directives or
+macros. It exports two functions: `parse_line' and `cleanup_insn'.
+
+`parse_line' is the main parser function: you pass it a source line
+in ASCII text form, and it returns you an `insn' structure
+containing all the details of the instruction on that line. The
+parameters it requires are:
+
+- The location (segment, offset) where the instruction on this line
+ will eventually be placed. This is necessary in order to evaluate
+ expressions containing the Here token, `$'.
+
+- A function which can be called to retrieve the value of any
+ symbols the source line references.
+
+- Which pass the assembler is on: an undefined symbol only causes an
+ error condition on pass two.
+
+- The source line to be parsed.
+
+- A structure to fill with the results of the parse.
+
+- A function which can be called to report errors.
+
+Some instructions (DB, DW, DD for example) can require an arbitrary
+amount of storage, and so some of the members of the resulting
+`insn' structure will be dynamically allocated. The other function
+exported by `parser.c' is `cleanup_insn', which can be called to
+deallocate any dynamic storage associated with the results of a
+parse.
+
+names.c
+-------
+
+This doesn't count as a module - it defines a few arrays which are
+shared between NASM and NDISASM, so it's a separate file which is
+#included by both parser.c and disasm.c.
+
+float.c
+-------
+
+This is essentially a library module: it exports one function,
+`float_const', which converts an ASCII representation of a
+floating-point number into an x86-compatible binary representation,
+without using any built-in floating-point arithmetic (so it will run
+on any platform, portably). It calls nothing, and is called only by
+`parser.c'. Note that the function `float_const' must be passed an
+error reporting routine.
+
+assemble.c
+----------
+
+This module contains the code generator: it translates `insn'
+structures as returned from the parser module into actual generated
+code which can be placed in an output file. It exports two
+functions, `assemble' and `insn_size'.
+
+`insn_size' is designed to be called on pass one of assembly: it
+takes an `insn' structure as input, and returns the amount of space
+that would be taken up if the instruction described in the structure
+were to be converted to real machine code. `insn_size' also requires
+to be told the location (as a segment/offset pair) where the
+instruction would be assembled, the mode of assembly (16/32 bit
+default), and a function it can call to report errors.
+
+`assemble' is designed to be called on pass two: it takes all the
+parameters that `insn_size' does, but has an extra parameter which
+is an output driver. `assemble' actually converts the input
+instruction into machine code, and outputs the machine code by means
+of calling the `output' function of the driver.
+
+insnsa.c
+--------
+
+This is another library module: it exports one very big array of
+instruction translations. It has to be a separate module so that DOS
+compilers, with less memory to spare than typical Unix ones, can
+cope with it.
+
+labels.c
+--------
+
+This module contains a label manager. It exports six functions:
+
+`init_labels' should be called before any other function in the
+module. `cleanup_labels' may be called after all other use of the
+module has finished, to deallocate storage.
+
+`define_label' is called to define new labels: you pass it the name
+of the label to be defined, and the (segment,offset) pair giving the
+value of the label. It is also passed an error-reporting function,
+and an output driver structure (so that it can call the output
+driver's label-definition function). `define_label' mentally
+prepends the name of the most recently defined non-local label to
+any label beginning with a period.
+
+`define_label_stub' is designed to be called in pass two, once all
+the labels have already been defined: it does nothing except to
+update the "most-recently-defined-non-local-label" status, so that
+references to local labels in pass two will work correctly.
+
+`declare_as_global' is used to declare that a label should be
+global. It must be called _before_ the label in question is defined.
+
+Finally, `lookup_label' attempts to translate a label name into a
+(segment,offset) pair. It returns non-zero on success.
+
+The label manager module is (theoretically :) restartable: after
+calling `cleanup_labels', you can call `init_labels' again, and
+start a new assembly with a new set of symbols.
+
+outform.c
+---------
+
+This small module contains a set of routines to manage a list of
+output formats, and select one given a keyword. It contains three
+small routines: `ofmt_register' which registers an output driver as
+part of the managed list, `ofmt_list' which lists the available
+drivers on stdout, and `ofmt_find' which tries to find the driver
+corresponding to a given name.
+
+The output modules
+------------------
+
+Each of the output modules, `binout.o', `elfout.o' and so on,
+exports only one symbol, which is an output driver data structure
+containing pointers to all the functions needed to produce output
+files of the appropriate type.
+
+The exception to this is `coffout.o', which exports _two_ output
+driver structures, since COFF and Win32 object file formats are very
+similar and most of the code is shared between them.
+
+nasm.c
+------
+
+This is the main program: it calls all the functions in the above
+modules, and puts them together to form a working assembler. We
+hope. :-)
+
+Segment Mechanism
+-----------------
+
+In NASM, the term `segment' is used to separate the different
+sections/segments/groups of which an object file is composed.
+Essentially, every address NASM is capable of understanding is
+expressed as an offset from the beginning of some segment.
+
+The defining property of a segment is that if two symbols are
+declared in the same segment, then the distance between them is
+fixed at assembly time. Hence every externally-declared variable
+must be declared in its own segment, since none of the locations of
+these are known, and so no distances may be computed at assembly
+time.
+
+The special segment value NO_SEG (-1) is used to denote an absolute
+value, e.g. a constant whose value does not depend on relocation,
+such as the _size_ of a data object.
+
+Apart from NO_SEG, segment indices all have their least significant
+bit clear, if they refer to actual in-memory segments. For each
+segment of this type, there is an auxiliary segment value, defined
+to be the same number but with the LSB set, which denotes the
+segment-base value of that segment, for object formats which support
+it (Microsoft .OBJ, for example).
+
+Hence, if `textsym' is declared in a code segment with index 2, then
+referencing `SEG textsym' would return zero offset from
+segment-index 3. Or, in object formats which don't understand such
+references, it would return an error instead.
+
+The next twist is SEG_ABS. Some symbols may be declared with a
+segment value of SEG_ABS plus a 16-bit constant: this indicates that
+they are far-absolute symbols, such as the BIOS keyboard buffer
+under MS-DOS, which always resides at 0040h:001Eh. Far-absolutes are
+handled with care in the parser, since they are supposed to evaluate
+simply to their offset part within expressions, but applying SEG to
+one should yield its segment part. A far-absolute should never find
+its way _out_ of the parser, unless it is enclosed in a WRT clause,
+in which case Microsoft 16-bit object formats will want to know
+about it.
+
+Porting Issues
+--------------
+
+We have tried to write NASM in portable ANSI C: we do not assume
+little-endianness or any hardware characteristics (in order that
+NASM should work as a cross-assembler for x86 platforms, even when
+run on other, stranger machines).
+
+Assumptions we _have_ made are:
+
+- We assume that `short' is at least 16 bits, and `long' at least
+ 32. This really _shouldn't_ be a problem, since Kernighan and
+ Ritchie tell us we are entitled to do so.
+
+- We rely on having more than 6 characters of significance on
+ externally linked symbols in the NASM sources. This may get fixed
+ at some point. We haven't yet come across a linker brain-dead
+ enough to get it wrong anyway.
+
+- We assume that `fopen' using the mode "wb" can be used to write
+ binary data files. This may be wrong on systems like VMS, with a
+ strange file system. Though why you'd want to run NASM on VMS is
+ beyond me anyway.
+
+That's it. Subject to those caveats, NASM should be completely
+portable. If not, we _really_ want to know about it.
+
+Porting Non-Issues
+------------------
+
+The following is _not_ a portability problem, although it looks like
+one.
+
+- When compiling with some versions of DJGPP, you may get errors
+ such as `warning: ANSI C forbids braced-groups within
+ expressions'. This isn't NASM's fault - the problem seems to be
+ that DJGPP's definitions of the <ctype.h> macros include a
+ GNU-specific C extension. So when compiling using -ansi and
+ -pedantic, DJGPP complains about its own header files. It isn't a
+ problem anyway, since it still generates correct code.