ARM Symbolic Debug Table Format
===============================

Acknowledgement
---------------

This design is based on work originally done for Acorn Computers Ltd. by Topexpress Ltd.

Introduction
------------

This document specifies the format of symbolic debugging data generated by ARM compilers, which is used by the ARM Symbolic Debugger to support high level language oriented, interactive debugging.

For each separate compilation unit (called a section) the compiler produces debugging data, and a special area in the object code (see the description of ARM Object Format for an explanation of areas and their attributes). Debugging data are position independent, containing only relative references to other debugging data within the same section, and relocatable references to other compiler-generated areas.

Debugging data areas are combined by the ARM linker into a single contiguous section of a program image. For details of the ARM linker's capabilities, see the User Manual. The linker's principal output format is described separately.

Since the debugging section is position-independent, the debugger can move it to a safe location before the image starts executing. If the image is not executed under debugger control, the debugging data are simply overwritten.

The format of debugging data allows for a variable amount of detail. This potentially allows the user to trade off among memory used, disc space used, execution time, and debugging detail.

Assembly-language level debugging is also supported, though in this case the debugging tables are generated by the linker. If required, the assembler can generate debugging table entries relating code addresses to source lines. Low-level debugging tables appear in an extra section item, as if generated by an independent compilation (see "Section Items"). Low-level and high-level debugging are orthogonal facilities, though the debugger allows the user to move smoothly between levels if both sets of debugging data are present in an image.

Terminology
-----------

A byte is 8 bits, usually considered unsigned. A word is 32 bits (4 bytes), often considered signed. A half word, also called a short, is 16 bits (2 bytes). Half words are unused, except in the long form of lineinfo items.

Order of Debugging Data
-----------------------

A debug data area consists of a series of items. The arrangement of these items mimics the structure of the high-level language program itself.
For each debug area, the first item is a section item, giving global information about the compilation, including a code identifying the language, and flags indicating the amount of detail included in the debugging tables.

Each datum, function, procedure, etc., definition in the source program has a corresponding debug data item; these items appear in an order corresponding to the order of definitions in the source. This means that any nested structure in the source program is preserved in the debugging data, and the debugger can use this structure to make deductions about the scope of various source-level objects. For procedure definitions, two debug items are needed: a procedure item to mark the definition itself, and an endproc item to mark the end of the procedure's body and the end of any nested definitions. If procedure definitions are nested then the procedure/endproc brackets are nested too. Variable and type definitions made at the outermost level appear outside of all procedure/endproc items.

Information about the relationship between the executable code and source files is collected together and appears as a fileinfo item, which is always the final item in a debugging area. Because of the C language's #include facility, the executable code produced from an outer-level source file may be separated into disjoint pieces interspersed with that produced from the included files. Therefore, source files are considered to be collections of fragments, each corresponding to a contiguous area of executable code, and the fileinfo item is a list with an entry for each file, each in turn containing a list with an entry for each fragment. The fileinfo field in the section item addresses the fileinfo item itself. In each procedure item there is a fileentry field, which refers to the file-list entry for the source file containing the procedure's start; there is a separate one in the endproc item because the procedure's end may not be in the same source file.

Endian-ness and the Encoding of Debugging Data
----------------------------------------------

The ARM can be configured to use either a little-endian memory system (the least significant byte of each 4-byte word has the lowest address), or a big-endian memory system (the most significant byte of each 4-byte word has the lowest address).

In general, the code to be generated varies according to the byte-sex (or endian-ness) of the target, and the linker has insufficient information to change the byte sex of an object file. Therefore, object files are encoded using the byte order of the intended target, independently of the byte order of the host system on which the compiler or assembler runs. The ARM linker accepts inputs having either byte order, but rejects mixed sex inputs, and generates its output using the same byte order.

This means that producers of debugging tables must be prepared to generate them in either byte order, as required. In turn, this requires definitions to be very clear about when a 4-byte word is being used (which will require reversal on output or input when cross-sex compiling or debugging), and when a sequence of bytes is being used (which requires no special treatment provided it is written and read as a sequence of bytes in address order).

Representation of Data Types
----------------------------

Several of the debugging data items (e.g. procedure and variable) have a type word field to identify their data type. This field contains, in the most significant 24 bits, a code to identify a base type, and in the least significant 8 bits, a pointer count: 0 to denote the type itself; 1 to denote a pointer to the type; 2 to denote a pointer to a pointer to...; etc.
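
The packing of a type word can be illustrated with a short Python sketch. This is an illustration only, not part of the format definition: the function names are hypothetical, the word is assumed to have already been read in the target's byte order, and the 24-bit code is assumed to be a signed (two's complement) quantity.

```python
def decode_type_word(word):
    """Split a 32-bit type word into (type_code, pointer_count).

    Hypothetical helper. Assumes the 24-bit code is signed.
    """
    word &= 0xFFFFFFFF
    pointer_count = word & 0xFF      # least significant 8 bits
    code = word >> 8                 # most significant 24 bits
    if code & 0x800000:              # sign-extend the 24-bit code
        code -= 0x1000000
    return code, pointer_count


def encode_type_word(code, pointer_count):
    """Pack a type code and a pointer count into a 32-bit type word."""
    if not 0 <= pointer_count <= 0xFF:
        raise ValueError("pointer count must fit in 8 bits")
    if not -0x800000 <= code <= 0x7FFFFF:
        raise ValueError("type code must fit in 24 signed bits")
    return ((code & 0xFFFFFF) << 8) | pointer_count
```

For instance, a type code of 12 with a pointer count of 1 packs to the word 0x00000C01.
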
For simple types the code is a non-negative integer as follows (all codes are decimal):

    void                      0

    signed integers
        single byte          10
        half-word            11
        word                 12

    unsigned integers
        single byte          20
        half-word            21
        word                 22

    floating point
        float                30
        double               31
        long double          32

    complex
        single complex       41
        double complex       42

    functions
        function            100

For compound types (arrays, structures, etc.) there is a special kind of debug data item (array, struct, etc.) to give details such as array bounds and field types. The type code for compound types is negative, the negation of the (byte) offset of the debug item from the start of the debugging area.

If a type has been given a name in a source program, it will give rise to a type debugging data item which contains the name and a type word as defined above. If necessary, there will also be a debugging data item, such as an array or struct item, to define the type itself. In that case, the type word will refer to this item.

Set types in Pascal are not treated in detail: the only information recorded for them is the total size occupied by the object in bytes. Neither are Pascal variables supported by the debugger, since their behaviour under debugger control is unlikely to be helpful to the user.

Fortran character types are supported by special kinds of debugging data item, the format of which is specific to each Fortran compiler.

Representation of Source File Positions
---------------------------------------

Several of the debugging data items have a sourcepos field to identify a position in the source file. This field contains a line number and character position within the line packed into a single word. The most significant 10 bits encode the character offset (0-based) from the start of the line and the least significant 22 bits give the line number.

Debugging Data Items in Detail
------------------------------

The Code and Length Field
.........................
The first word of each debugging data item contains the byte length of the item (encoded in the most significant 16 bits), and a code identifying the kind of item (in the least significant 16 bits). The defined codes are:

     1   section
     2   procedure/function definition
     3   endproc
     4   variable
     5   type
     6   struct
     7   array
     8   subrange
     9   set
    10   fileinfo
    11   contiguous enumeration
    12   discontiguous enumeration
    13   procedure/function declaration
    14   begin naming scope
    15   end naming scope
    16   bitfield

The meaning of the second and subsequent words of each item is defined below.

If a debugger encounters a code it does not recognise, it should use the length field to skip the item entirely. This discipline allows the debugging tables to be extended without invalidating existing debuggers.

Text Names in Items
...................

Where items include a string field, the string is packed into successive bytes beginning with a length byte, and padded at the end to a word boundary with 0 bytes. The length of a string is in the range [0..255] bytes.

Offsets in File and Addresses in Memory
.......................................

Where an item contains a field giving an offset in the debugging data area (usually to address another item), this means a byte offset from the start of the debugging data for the whole section (in other words, from the start of the section item).

When the same structure is used to map debugging data in memory, an offset field may be used to hold a pointer to another debug item in memory, rather than its offset in the debug area.

Section Items
.............

A section item is the first item of each section of the debugging data. After its code and length word it contains the fields listed below.

First there are 4 flag bytes:

    lang         a byte identifying the source language
    flags        a byte describing the level of detail
    unused       an unused byte
    asdversion   a byte version number of the debugging data

The following language byte codes are defined:

    LANG_NONE      0   Low-level debugging data only
    LANG_C         1   C source level debugging data
    LANG_PASCAL    2   Pascal source level debugging data
    LANG_FORTRAN   3   Fortran-77 source level debugging data
    LANG_ASM       4   ARM Assembler line number data

All other codes are reserved to ARM.

The flags byte uses the following mask values:

    1   debugging data contains line-number information
    2   debugging data contains information about top-level variables
    3   both of the above

The asdversion byte should be set to 2, the version of this definition.

The flag bytes are followed by the following word-sized fields:

    codestart       address of first instruction in this section
    datastart       address of start of static data for this section
    codesize        byte size of executable code in this section
    datasize        byte size of the static data in this section
    fileinfo        offset in the debugging area of the fileinfo item for this section (0 if no fileinfo item present)
    debugsize       total byte length of debug data for this section
    name or nsyms   string or integer

codestart and datastart are addresses, relocated by the linker.

The fileinfo field, nominally an offset, is also used as a pointer when this structure is mapped in memory. The fileinfo field is 0 if no source file information is present.

The name field contains the program name for Pascal and Fortran programs. For C programs it contains a name derived by the compiler from the root file name (notionally a module name).
In each case, the name is similar to a variable name in the source language. For a low-level debugging section (language = 0), the field is instead treated as a 4 byte integer, nsyms, giving the number of symbols following.

For linker-generated debugging data, the fields have the following values:

    language    0
    codestart   Image$$RO$$Base
    datastart   Image$$RW$$Base
    codesize    Image$$RO$$Limit - Image$$RO$$Base
    datasize    Image$$RW$$Limit - Image$$RW$$Base
    fileinfo    0
    nsyms       number of symbols in the following debugging data
    debugsize   total size of the low-level debugging data including the size of this section item

For linker-generated debugging data, the section item is followed by nsyms symbol items, each consisting of 2 words:

    sym     flags + byte offset in string table of symbol name
    value   the value of the symbol

sym encodes an index into the string table in the 24 least significant bits, and the following flag values in the 8 most significant bits:

    ASD_ABSSYM     0             if the symbol is absolute
    ASD_GLOBSYM    0x01000000L   if the symbol is global
    ASD_TEXTSYM    0x02000000L   if the symbol names code
    ASD_DATASYM    0x04000000L   if the symbol names data
    ASD_ZINITSYM   0x06000000L   if the symbol names 0-initialised data

Note that the linker reduces all symbol values to absolute values, so that the flag values record the history, or origin, of the symbol in the image.

Immediately following the symbol table is the string table, in standard AOF format. It consists of:

* a length word
* the strings themselves, each terminated by a NUL (0)

The length word includes the size of the length word, so no offset into the string table is less than 4. The end of the string table is padded with NULs to the next word boundary (so the length is a multiple of 4).

Procedure Items
...............

A procedure item appears once for each procedure or function definition in the source program. Any definitions within the procedure have their related debugging data items between the procedure item and its matching endproc item.
After its code and length field, a procedure item contains the following word-sized fields:

    type        the return type if this is a function, else 0 (see "Representation of Data Types")
    args        the number of arguments
    sourcepos   the source position of the procedure's start (see "Representation of Source File Positions")
    startaddr   address of 1st instruction of procedure prologue
    entry       address of 1st instruction of the procedure body (see note below)
    endproc     offset of the related endproc item (in file) or pointer to related endproc item (in memory)
    fileentry   offset of the file list entry for the source file (in file) or a pointer to it (in memory)
    name        string

The entry field addresses the first instruction following the procedure prologue. That is, the first address at which a high-level breakpoint could sensibly be set. The startaddr field addresses the start of the prologue. That is, the instruction at which control arrives when the procedure is called.

Label Items
...........

A label in a source program is represented by a special procedure item with no matching endproc (the endproc field is 0 to denote this). Pascal and Fortran numerical labels are converted by their respective compilers into strings prefixed by "$n".

For Fortran77, multiple entry points to the same procedure each give rise to a separate procedure item, all of which have the same endproc offset referring to the unique, matching endproc item.

Endproc Items
.............

An endproc item marks the end of the debugging data items belonging to a particular procedure. It also contains information relating to the procedure's return.
After its code and length field, an endproc item contains the following word-sized fields:

    sourcepos   position in the source file of the procedure's end (see "Representation of Source File Positions")
    endpoint    address of the code byte AFTER the compiled code for the procedure
    fileentry   offset of the file-list entry for the procedure's end (in file) or a pointer to it (in memory)
    nreturns    number of procedure return points (may be 0)
    retaddrs    array of addresses of procedure return code

If the procedure body is an infinite loop, there will be no return point, so nreturns will be 0. Otherwise each member of retaddrs should point to a suitable location at which a breakpoint may be set "at the exit of the procedure". When execution reaches this point, the current stack frame should still be for this procedure.

Variable Items
..............

A variable item contains debugging data relating to a source program variable, or a formal argument to a procedure (the first variable items in a procedure always describe its arguments).
After its code and length field, a variable item contains the following word-sized fields:

    type           type of this variable (see "Representation of Data Types")
    sourcepos      the source position of the variable (see "Representation of Source File Positions")
    storageclass   a word encoding the variable's storage class
    location       see explanation below
    name           string

The following codes define the storage classes of variables:

    1   external variables (or Fortran common)
    2   static variables private to one section
    3   automatic variables
    4   register variables
    5   Pascal 'var' arguments
    6   Fortran arguments
    7   Fortran character arguments

The meaning of the location field of a variable item depends on the storage class: it contains an absolute address for static and external variables (relocated by the linker); a stack offset (an offset from the frame pointer) for automatic and var-type arguments; an offset into the argument list for Fortran arguments; and a register number for register variables (the 8 floating point registers are numbered 16..23).

No account is taken of variables which ought to be addressed by positive offsets from the stack pointer rather than negative offsets from the frame pointer.

The sourcepos field is used by the debugger to distinguish between different definitions having the same name (e.g. identically named variables in disjoint source-level naming scopes such as nested blocks in C).

Type Items
...........

A type item is used to describe a named type in the source language (e.g. a typedef in C). After its code and length field, a type item contains two fields:

    type   a type word (described in "Representation of Data Types")
    name   string

Struct Items
............

A struct item is used to describe a structured data type (e.g. a struct in C or a record in Pascal). After its code and length field, a struct item contains the following word-sized fields:

    fields          the number of fields in the structure
    size            total byte size of the structure
    fieldtable...   an array of struct field items

Each struct field item has the following word-sized fields:

    offset   byte offset of this field within the structure
    type     a type word (described in "Representation of Data Types")
    name     string

Union types are described by struct items in which all fields have 0 offsets.

C bit fields are not treated in full detail: a bit field is simply represented by an integer starting on the appropriate word boundary (so that the word contains the whole field).

Array Items
...........

An array item is used to describe a one-dimensional array. Multi-dimensional arrays are described as "arrays of arrays". Which dimension comes first is dependent on the source language (it is different for C and Fortran).

After its code and length field, an array item contains the following word-sized fields:

    size         total byte size of the array
    flags        (see below)
    basetype     a type word (described in "Representation of Data Types")
    lowerbound   constant value or location of variable
    upperbound   constant value or location of variable

If the size field is zero, debugger operations affecting the whole array, rather than individual elements of it, are forbidden.

The following mask values are defined for the flags field:

    ARRAY_UNDEF_LBOUND    1   lower bound is undefined
    ARRAY_CONST_LBOUND    2   lower bound is a constant
    ARRAY_UNDEF_UBOUND    4   upper bound is undefined
    ARRAY_CONST_UBOUND    8   upper bound is a constant
    ARRAY_VAR_LBOUND     16   lower bound is a variable
    ARRAY_VAR_UBOUND     32   upper bound is a variable

A bound is described as undefined when no information about it is available.

A bound is described as constant when its value is known at compile time. In this case, the corresponding bound field gives its value.

If a bound is described as variable, the offset field identifies a variable debug item describing the location containing the bound. In a debug area in an object file, the offset field contains the offset from the start of the debug area to the variable item; in memory it contains a pointer to the corresponding variable item.
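
The interpretation of the array flags can be sketched in Python as follows. This is only an illustration: the function name and the returned dictionary layout are hypothetical, and treating a bound with no flag set as undefined is an assumption rather than something the format states.

```python
ARRAY_UNDEF_LBOUND = 1
ARRAY_CONST_LBOUND = 2
ARRAY_UNDEF_UBOUND = 4
ARRAY_CONST_UBOUND = 8
ARRAY_VAR_LBOUND = 16
ARRAY_VAR_UBOUND = 32


def describe_array_bounds(flags, lowerbound, upperbound):
    """Classify both bounds of an array item (hypothetical helper).

    Returns {"lower": (kind, value), "upper": (kind, value)} where kind is
    "constant", "variable" or "undefined". For a variable bound the value
    is the offset of (or pointer to) the variable item describing it.
    """
    def classify(const_mask, var_mask, field):
        if flags & const_mask:
            return ("constant", field)
        if flags & var_mask:
            return ("variable", field)
        # Assumption: anything else, including no flag at all, is undefined.
        return ("undefined", None)

    return {
        "lower": classify(ARRAY_CONST_LBOUND, ARRAY_VAR_LBOUND, lowerbound),
        "upper": classify(ARRAY_CONST_UBOUND, ARRAY_VAR_UBOUND, upperbound),
    }
```
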
Note that a variable item may be used to describe a location known to the compiler, which need not correspond to a source language variable.

Subrange Items
..............

A subrange item is used to describe a subrange type in Pascal. It also serves to describe enumerated types in C, and scalars in Pascal (in which case the base type is understood to be an unsigned integer of appropriate size).

After its code and length field, a subrange item contains the following word-sized fields:

    sizeandtype   see below
    lb            low bound of subrange
    hb            high bound of subrange

The sizeandtype field encodes the byte size of the container for the subrange (1, 2 or 4) in its least significant 16 bits, and a simple type code (see "Representation of Data Types") in its most significant 16 bits. The type code refers to the base type of the subrange. (For example, a subrange 256..511 of unsigned short might be held in 1 byte).

Set Items
.........

A set item is used to describe a Pascal set type. Currently, the description is only partial. After its code and length field, a set item consists of a single word:

    size   byte size of the object

Enumeration Items
.................

An enumeration item describes a Pascal or C enumerated type.
After its code and length word, the description of a contiguous enumeration contains the following fields:

    type        a type word describing the type of the container for the enumeration (see "Representation of Data Types")
    count       the cardinality of the enumeration
    base        the first (lowest) value (may be negative)
    nametable   a character array containing count names (see "Text Names in Items")

The description of a discontiguous enumeration (such as the C enumeration enum bits {bit0=1, bit1=2, bit2=4, bit3=8, bit4=16}) contains the following fields after its code and length word:

    type        as above
    count       as above
    nametable   a table of count (value, name) pairs

Each nametable entry has the following format (which is variable in length):

    val    the enumerated value (1/2/4/8/16 in the example)
    name   the name of the enumerated element (may be several words long)

Function Declaration Items
..........................

After its code and length word, a function declaration item contains the following fields:

    type       a type word (described in "Representation of Data Types") describing the return type of the function or procedure
    argcount   the number of arguments to the function
    args       a sequence of argcount argument description items

Each argument description item contains the following:

    type   a type word (described in "Representation of Data Types") describing the type of the argument
    name   the name of the argument (may be several words)

An argument descriptor need not be named; in this case the length of the name is zero, and the name field is a single zero word.

Begin and End Naming Scope Items
.................................

These debug items are used to mark the beginning and end of a naming scope. They must be properly nested in the debug area. In each case, after the code and length word, there is one word-sized field:

    codeaddress   address of the start/end of scope (which of the two is determined by the code word)

Bitfield Items
..............

A bitfield item describes an individual bitfield member of a C structure.
After its code and length word, a bitfield item contains the following fields:

    type        a type word describing the type of the bitfield (see "Representation of Data Types")
    container   a type word describing the type of the container for the bitfield
    size        a byte giving the size of the bitfield, in bits
    offset      a byte giving the offset of the bitfield within the container
    zero        2 zero bytes

The offset is the offset of the least significant bit of the bitfield from the least significant bit of its container.

Fileinfo Items
..............

A fileinfo item appears once per section, after all other debugging data items. If the fileinfo item is too large for its length to be encoded in 16 bits, its length field must be written as 0 (since this is the last item in a section and the section header contains the length of the whole section, the length field is strictly redundant).

Each source file is described by a sequence of fragments. Each fragment describes a contiguous region of the file, within which the addresses of compiled code increase monotonically with source file position. The order in which fragments appear in the sequence is not necessarily related to the source file positions to which they refer.

Note that for compilations which make no use of the #include facility, the list of fragments may have only one entry, and all line-number information can be contiguous.

After its code and length word, the fileinfo item is a sequence of file entry items with the following format:

    len             length of this entry in bytes (including the length of the following fragments)
    date            date and time when the file was last modified (may be 0, indicating not available, or unused)
    filename        string (or "" if the name is not known)
    fragment data   see below

If present, the date field contains the number of seconds since the beginning of 1970 (the Unix date origin).

Following the final file entry item is a single 0 word marking the end of the sequence.
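
As an illustration of how the file entry sequence might be walked, here is a minimal Python sketch. It is not a definitive reader: the function names are hypothetical, len is assumed to be measured from the start of the entry (so that it includes the len word itself), strings are assumed to start on a word boundary, and the fragment data is not decoded, only located.

```python
import struct


def read_string(buf, off):
    """Read a length-prefixed string padded with 0 bytes to a word boundary.

    Returns (text, offset_of_next_field). Assumes 'off' is word-aligned.
    """
    n = buf[off]                                   # length byte, 0..255
    text = buf[off + 1:off + 1 + n].decode("latin-1")
    size = (1 + n + 3) & ~3                        # round up to 4 bytes
    return text, off + size


def iter_file_entries(buf, off, big_endian=False):
    """Yield (filename, date, fragment_data_offset) for each file entry.

    'off' is the offset of the first file entry, just after the fileinfo
    item's code and length word. Stops at the terminating 0 word.
    """
    word = ">I" if big_endian else "<I"
    while True:
        (length,) = struct.unpack_from(word, buf, off)
        if length == 0:                            # end-of-sequence marker
            return
        (date,) = struct.unpack_from(word, buf, off + 4)
        name, frag_off = read_string(buf, off + 8)
        yield name, date, frag_off
        off += length                              # assumption: len covers the whole entry
```

Using len to step from one entry to the next, rather than parsing the fragments, mirrors the way the length field of an item is used to skip items a debugger does not recognise.
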
The fragment data is a word giving the number of following fragments followed by a sequence of fragment items:

    n              number of fragments following
    fragments...   n fragment items

Each fragment item consists of 5 words, followed by a sequence of byte pairs and half word pairs, formatted as follows:

    size          length of this fragment in bytes (including length of following lineinfo items)
    firstline     linenumber
    lastline      linenumber
    codestart     pointer to the start of the fragment's executable code
    codesize      byte size of the code in the fragment
    lineinfo...   a variable number of bytes matching line numbers to code addresses

Each lineinfo item describes a source statement and consists of a pair of (unsigned) bytes, possibly followed by two or three (unsigned) half words (each half word has the byte ordering appropriate to the target memory system's endian-ness or byte sex).

The short form (pair of bytes) lineinfo item is as follows:

    codeinc   # bytes of code generated by this statement
    lineinc   # source space occupied by this statement

lineinc describes how to calculate the source position (line, column) of the next statement from the source position of this one. If lineinc is in the range 0 <= lineinc < 64, the new position is (line+lineinc, 1). If lineinc >= 64, the new position is (line, column+lineinc-64).

The number of bytes of code generated for a statement may be zero, provided the line increment is non-zero (such an item may describe a block end or block start, for example). It is not possible to describe a statement which generates no code and no line number increment, as that encoding is used as an escape to the long form lineinfo items described below.

If codeinc is greater than 255, or lineinc is required to describe a line number change greater than 63 or a column change greater than 191, then both bytes are written to describe 0 increments, and the real values are given in the following two or three (unsigned) half words.
(Note that there are two ways to describe 0 increments: 0 lines and 0 columns, which serves to discriminate between the two half word and three half word forms).

If the starting column for the next statement is 1, the two half word form is used, which in effect is a triple of half words as follows:

    zero      2 zero bytes
    lineinc   # source lines occupied by this statement
    codeinc   # bytes of code generated by this statement

Note that the order of the lineinc and codeinc half words is the reverse of the corresponding bytes.

If the starting column for the next statement is not 1, the three half word form is used, which in effect is a quadruple of half words, as follows:

    codeinc = 0, lineinc = 64
    lineinc   # source lines occupied by this statement
    codeinc   # bytes of code generated by this statement
    newcol    starting column for the next statement

Note as above that the order of the lineinc and codeinc half words is the reverse of the corresponding bytes. Note also that the column item here is the absolute column number for the next statement, and not an increment as in the two byte form.

(This encoding of lineinfo items is an incompatible change from the previous format (version 2): in that format, lineinc in a two byte lineinfo item always describes a line increment, and accordingly, there is no four half word form. Programs interpreting asd tables should interpret lineinfo items differently according to the table format in the section item.)
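
The lineinfo encoding described above can be summarised by the following Python sketch of a decoder. It is a reading of the rules in this section, not a reference implementation: the function name and the shape of the result are hypothetical, the first statement is assumed to start at column 1 of the fragment's firstline, and the input is assumed to hold exactly the lineinfo bytes with no trailing padding.

```python
import struct


def decode_lineinfo(data, first_line, code_start, big_endian=False):
    """Decode lineinfo bytes into a list of (address, line, column, codeinc).

    Each tuple gives the code address and source position at which a
    statement starts, and the number of code bytes it generated.
    """
    half = ">H" if big_endian else "<H"
    line, col, addr = first_line, 1, code_start   # assumption: start at column 1
    out = []
    i = 0
    while i + 2 <= len(data):
        codeinc, lineinc = data[i], data[i + 1]   # short form: two bytes
        i += 2
        if codeinc == 0 and lineinc in (0, 64):
            # Escape to the long form: (0, 0) is followed by two half
            # words, (0, 64) by three.
            count = 2 if lineinc == 0 else 3
            if i + 2 * count > len(data):
                raise ValueError("truncated long form lineinfo item")
            halves = [struct.unpack_from(half, data, i + 2 * k)[0]
                      for k in range(count)]
            i += 2 * count
            line_delta, codeinc = halves[0], halves[1]   # lineinc comes first
            next_line = line + line_delta
            next_col = 1 if count == 2 else halves[2]    # newcol is absolute
        elif lineinc < 64:
            next_line, next_col = line + lineinc, 1
        else:
            next_line, next_col = line, col + lineinc - 64
        out.append((addr, line, col, codeinc))
        addr += codeinc
        line, col = next_line, next_col
    return out
```

A caller would typically pass the firstline and codestart fields of the enclosing fragment item as first_line and code_start.
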