Diffstat (limited to 'Documentation/design-docs')
 Documentation/design-docs/first-class-structs.md          | 651
 Documentation/design-docs/inline-size-estimates.md        | 526
 Documentation/design-docs/inlining-plans.md               | 438
 Documentation/design-docs/longs-on-32bit-arch.md          | 117
 Documentation/design-docs/removing-embedded-statements.md | 180
 5 files changed, 1912 insertions(+), 0 deletions(-)
diff --git a/Documentation/design-docs/first-class-structs.md b/Documentation/design-docs/first-class-structs.md
new file mode 100644
index 0000000000..fd6a3762c4
--- /dev/null
+++ b/Documentation/design-docs/first-class-structs.md
@@ -0,0 +1,651 @@

First Class Structs
===================

Objectives
----------
Primary Objectives
* Avoid forcing structs to the stack if they are only assigned to/from, or passed to/returned
  from a call or intrinsic
  - Including SIMD types as well as other pointer-sized-or-less struct types
  - Enable enregistration of structs that have no field accesses
* Optimize these types as effectively as any other basic type
  - Value numbering, especially for types that are used in intrinsics (e.g. SIMD)
  - Register allocation

Secondary Objectives
* No “swizzling” or lying about struct types – they are always struct types
  - No confusing use of GT_LCL_FLD to refer to the entire struct as a different type

Struct-Related Issues in RyuJIT
-------------------------------
The following issues illustrate some of the motivation for improving the handling of value types
(structs) in RyuJIT:

* VSO Bug 98404: .NET JIT x86 - poor code generated for value type initialization
  * This is a simple test case that should generate simply `xor eax, eax; ret` on x86 and x64, but
    instead generates many unnecessary copies. It is addressed by full enregistration of
    structs that fit into a register:

```C#
struct foo { public byte b1, b2, b3, b4; }
static foo getfoo() { return new foo(); }
```

* [\#1133 JIT: Excessive copies when inlining](https://github.com/dotnet/coreclr/issues/1133)
  * The scenario given in this issue involves a struct that is larger than 8 bytes, so
    it is not impacted by the fixed-size types. However, by enabling assertion propagation
    for struct types (which, in turn is made easier by using normal assignments), the
    excess copies can be eliminated.
  * Note that these copies are not generated when passing and returning scalar types,
    and it may be worth considering (in future) whether we can avoid adding them
    in the first place.

* [\#1161 RyuJIT properly optimizes structs with a single field if the field type is int but not if it is double](https://github.com/dotnet/coreclr/issues/1161)
  * This issue arises because we never promote a struct with a single double field, due to
    the fact that such a struct may be passed or returned in a general purpose register.
    This issue could be addressed independently, but should "fall out" of improved heuristics
    for when to promote and enregister structs.

* [\#1636 Add optimization to avoid copying a struct if passed by reference and there are no
  writes to and no reads after passed to a callee](https://github.com/dotnet/coreclr/issues/1636).
  * This issue is nearly the same as the above, except that in this case the desire is to
    eliminate unneeded copies locally (i.e. not just due to inlining), in the case where
    the struct may or may not be passed or returned directly.
  * Unfortunately, there is not currently a scenario or test case for this issue.

* [\#3144 Avoid marking tmp as DoNotEnregister in tmp=GT_CALL() where call returns a
  enregisterable struct in two return registers](https://github.com/dotnet/coreclr/issues/3144)
  * This issue could be addressed without First Class Structs.
    However, it will be easier with struct assignments that are normalized as regular assignments,
    and should be done along with the streamlining of the handling of ABI-specific struct passing
    and return values.

* [\#3539 RyuJIT: Poor code quality for tight generic loop with many inlineable calls](https://github.com/dotnet/coreclr/issues/3539)
  (factor x8 slower than non-generic few calls loop).
  * I am still investigating this issue.

* [\#5556 RuyJIT: structs in parameters and enregistering](https://github.com/dotnet/coreclr/issues/5556)
  * This also requires further investigation, but requires us to "Add support in prolog to extract fields, and
    remove the restriction of not promoting incoming reg structs that have more than one field" - see
    [Dependent Work Items](https://github.com/dotnet/coreclr/blob/master/Documentation/design-docs/first-class-structs.md#dependent-work-items)

Normalizing Struct Types
------------------------
We would like to facilitate full enregistration of structs with the following properties:
1. Its fields are infrequently accessed, and
2. The entire struct fits into a register, and
3. Its value is used or defined in a register
(i.e. as an argument to or return value from calls or intrinsics).

In RyuJIT, the concept of a type is very simplistic (which helps support the high throughput
of the JIT). Rather than a symbol table to hold the properties of a type, RyuJIT primarily
deals with types as simple values of an enumeration. When more detailed information is
required about the structure of a type, we query the type system, across the JIT/EE interface.
This is generally done only during the importer (translation from MSIL to the RyuJIT IR), and
during struct promotion analysis. As a result, struct types are treated as an opaque type
(TYP_STRUCT) of unknown size and structure.

In order to treat fully-enregisterable struct types as "first class" types in RyuJIT, we
create new types with fixed size and structure:
* TYP_SIMD8, TYP_SIMD12, TYP_SIMD16 and (where supported by the target) TYP_SIMD32
  - These types already exist, and represent some already-completed steps toward First Class Structs.
* TYP_STRUCT1, TYP_STRUCT2, TYP_STRUCT4, TYP_STRUCT8 (on 64-bit systems)
  - These types are new, and will be used where struct types of the given size are passed and/or
    returned in registers.

We want to identify and normalize these types early in the compiler, before any decisions are
made regarding whether they are constrained to live on the stack and whether and how they are
promoted (scalar replaced) or copied.

One issue that arises is that it becomes necessary to know the size of any struct type that
we encounter, even if we may not actually need to know the size in order to generate code.
The major cause of additional queries seems to be for field references. It is possible to
defer some of these cases. I don't know what the throughput impact will be to always do the
normalization, but in principle I think it is worth doing because the alternative would be
to transform the types later (e.g. during morph) and use a contextual tree walk to see if we
care about the size of the struct. That would likely be a messier analysis.
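
To make the intended mapping concrete, here is a minimal sketch of how the importer might
select one of the proposed fixed-size types from a struct's exact size. The helper name is
hypothetical, and the `TYP_STRUCT*` values are the new types proposed above (they do not
exist in RyuJIT today); `var_types` and `_TARGET_64BIT_` are the existing JIT type enum and
target define:

```C++
// Illustrative sketch only: map a struct's exact size to one of the
// proposed fixed-size struct types, falling back to opaque TYP_STRUCT.
var_types getFixedSizeStructType(unsigned size)
{
    switch (size)
    {
        case 1:  return TYP_STRUCT1;
        case 2:  return TYP_STRUCT2;
        case 4:  return TYP_STRUCT4;
#ifdef _TARGET_64BIT_
        case 8:  return TYP_STRUCT8;
#endif
        default: return TYP_STRUCT; // too large, or oddly sized, to enregister
    }
}
```

As described above, SIMD types would continue to be normalized to the `TYP_SIMD*` types
before any such size-based mapping applies.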

Current Struct IR Phase Transitions
-----------------------------------

There are three phases in the JIT that make changes to the representation of struct tree
nodes and lclVars:

* Importer
  * All struct type lclVars have TYP_STRUCT
  * All struct assignments/inits are block ops
  * All struct call args are ldobj
  * Other struct nodes have TYP_STRUCT
* Struct promotion
  * Fields of promoted structs become separate lclVars (scalar promoted) with primitive types
* Global morph
  * All struct nodes are transformed to block ops
    - Besides call args
  * Some promoted structs are forced to stack
    - Become “dependently promoted”
  * Call args
    - Morphed to GT_LCL_FLD if passed in a register
    - Treated in various ways otherwise (inconsistent)

Proposed Approach
-----------------
The most fundamental change with first class structs is that struct assignments become
just a special case of assignment. The existing block ops (GT_INITBLK, GT_COPYBLK,
GT_COPYOBJ, GT_LDOBJ) are eliminated. Instead, the block operations in the incoming MSIL
are translated into assignments to or from a new GT_OBJ node.

New fixed-size struct types are added: (TYP_STRUCT[1|2|4|8]), which are somewhat similar
to the (existing) SIMD types (TYP_SIMD[8|16|32]). As is currently done for the SIMD types,
these types are normalized in the importer.

Conceptually, struct nodes refer to the object, not the address. This is important, as
the existing block operations all take address operands, meaning that any lclVar involved
in an assignment (including initialization) will be in an address-taken context in the JIT,
requiring special analysis to identify the cases where the address is only taken in order
to assign to or from the lclVar. This further allows for consistency in the treatment of
structs and simple types - even potentially enabling optimizations of non-enregisterable
structs.

### Struct promotion

* Struct promotion analysis
  * Aggressively promote pointer-sized fields of structs used as args or returns
  * Consider FULL promotion of pointer-size structs
    * If there are fewer field references than calls or returns

### Assignments
* Struct assignments look like any other assignment
* GenTreeAsg (GT_ASG) extends GenTreeOp with:

```C++
// True if this assignment is a volatile memory operation.
bool IsVolatile() const { return (gtFlags & GTF_BLK_VOLATILE) != 0; }
bool gtAsgGcUnsafe;

// What code sequence we will be using to encode this operation.
enum
{
    AsgKindInvalid,
    AsgKindDirect,
    AsgKindHelper,
    AsgKindRepInstr,
    AsgKindUnroll,
} gtAsgKind;
```

### Struct “objects” as lvalues
* Lhs of a struct assignment is a block node or lclVar
* Block nodes represent the address and “shape” info formerly on the block copy:
  * GT_BLK and GT_STORE_BLK (GenTreeBlk)
    * Has a (non-tree node) size field
    * Addr() is op1
    * Data() is op2
  * GT_OBJ and GT_STORE_OBJ (GenTreeObj extends GenTreeBlk)
    * gtClass, gtGcPtrs, gtGcPtrCount, gtSlots
  * GT_DYN_BLK and GT_STORE_DYN_BLK (GenTreeDynBlk extends GenTreeBlk)
    * Additional child gtDynamicSize

### Struct “objects” as rvalues
After morph, structs on rhs of assignment are either:
* The tree node for the object: e.g. call, retExpr
* GT_IND of an address (e.g. GT_LEA)

The lhs provides the “shape” for the assignment. Note: it has been suggested that these could
remain as GT_BLK nodes, but I have not given that any deep consideration.
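
For orientation, here is a rough C++ sketch of the block-node hierarchy the lvalue bullets
above describe. It assumes the JIT's existing `GenTree`/`GenTreeOp` classes and EE interface
types; the member and accessor details are illustrative, not the actual declarations:

```C++
// Illustrative sketch of the proposed block-node hierarchy (not actual code).
// GenTreeBlk carries the size and "shape" info that previously lived on the
// block-copy operators themselves.
struct GenTreeBlk : public GenTreeOp      // GT_BLK / GT_STORE_BLK
{
    unsigned gtBlkSize;                   // size is a plain field, not a tree node
    GenTree* Addr() { return gtOp1; }     // address of the block
    GenTree* Data() { return gtOp2; }     // value being stored (for stores)
};

struct GenTreeObj : public GenTreeBlk     // GT_OBJ / GT_STORE_OBJ
{
    CORINFO_CLASS_HANDLE gtClass;         // the struct type
    BYTE*                gtGcPtrs;        // GC layout of the slots
    unsigned             gtGcPtrCount;
    unsigned             gtSlots;
};

struct GenTreeDynBlk : public GenTreeBlk  // GT_DYN_BLK / GT_STORE_DYN_BLK
{
    GenTree* gtDynamicSize;               // size computed at run time
};
```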

### Preserving Struct Types in Trees

Prior to morphing, all nodes that may represent a struct type will have a class handle.
After morphing, some will become GT_IND.

### Structs As Call Arguments

All struct args imported as GT_OBJ, transformed as follows during morph:
* P_FULL promoted locals:
  * Remain as GT_LCL_VAR nodes, with the appropriate fixed-size struct type.
  * Note that these may or may not be passed in registers.
* P_INDEP promoted locals:
  * These are the ones where the fields don’t match the reg types
    GT_STRUCT (or something) for aggregating multiple fields into a single register
  * Op1 is a lclVar for the first promoted field
  * Op2 is the lclVar for the next field, OR another GT_STRUCT
  * Bit offset for the second child
* All other cases (non-locals, OR P_DEP or non-promoted locals):
  * GT_LIST of GT_IND for each half

### Struct Return

The return of a struct value from the current method is represented as follows:
* GT_RET(GT_OBJ) initially
* GT_OBJ morphed, and then transformed similarly to call args

Proposed Struct IR Phase Transitions
------------------------------------

* Importer
  * Struct assignments are imported as GT_ASG
  * Struct type is normalized to TYP_STRUCT* or TYP_SIMD*
* Struct promotion
  * Fields of promoted structs become separate lclVars (as is)
  * Enregisterable structs (including Pair Types) may be promoted to P_FULL (i.e. fully enregistered)
  * As a future optimization, we may "restructure" multi-register argument or return values as a
    synthesized struct of appropriately typed fields, which can then be promoted in the normal manner.
* Global morph
  * All struct type local variables remain as simple GT_LCL_VAR nodes.
  * All other struct nodes are transformed to GT_IND (rhs of assignment) or remain as GT_OBJ
    * In Lowering, GT_OBJ will be changed to GT_BLK if there are no gc ptrs. This could be done
      earlier, but there are some places where the object pointer is desired.
    * It is not actually clear if there is a great deal of value in the GT_BLK, but it was added
      to be more compatible with existing code that expects block copies with gc pointers to be
      distinguished from those that do not.
  * Promoted structs are forced to stack ONLY if address taken
  * Call arguments
    * Fixed-size enregisterable structs: GT_LCL_VAR or GT_OBJ of appropriate type.
    * Multi-register arguments: GT_LIST of register-sized operands:
      * GT_LCL_VAR if there is a promoted field that exactly matches the register size and type
        (note that, if we have performed the optimization mentioned above in struct promotion,
        we may have a GT_LCL_VAR of a synthesized struct field).
      * GT_LCL_FLD if there is a matching field in the struct that has not been promoted.
      * GT_IND otherwise. Note that if this is a struct local that does not have a matching field,
        this will force the local to live on the stack.
* Lowering
  * Pair types (e.g. TYP_LONG on 32-bit targets) are decomposed as needed to expose register requirements.
    Although these are not strictly structs, their handling is similar.
    * Computations are decomposed into their constituent parts when they independently write
      separate registers.
    * TYP_LONG lclVars (and TYP_DOUBLE on ARM) are split (similar to promotion/scalar replacement of
      structs) if and only if they are register candidates.
    * Other TYP_LONG/TYP_DOUBLE lclVars are loaded into independent registers either via:
      * Single GT_LCL_VAR that will translate into a pair load instruction (ldp), with two register
        targets, or
      * GT_LCL_FLD (current approach) or GT_IND (probably a better approach)
  * Calls and loads that target multiple registers
    * Existing gtLsraInfo has the capability to specify multiple destination registers
    * Additional work is required in LSRA to handle these correctly
    * If HFAs can be return values (not just call args), then we may need to support up to 4 destination
      registers for LSRA

Sample IR
---------
### Bug 98404
#### Before

The `getfoo` method initializes a struct of 4 bytes.
The dump of the (single) local variable is included to show the change from `struct (8)` to
`struct4`, as the "exact size" of the struct is 4 bytes.
Here is the IR after Import:

```
; V00 loc0  struct ( 8)

   ▌  stmtExpr  void  (top level) (IL 0x000...  ???)
   │     ┌──▌  const     int    4
   └──▌  initBlk   void
      │  ┌──▌  const     int    0
      └──▌  <list>    void
         └──▌  addr      byref
            └──▌  lclVar    struct V00 loc0

   ▌  stmtExpr  void  (top level) (IL 0x008...  ???)
   └──▌  return    int
      └──▌  lclFld    int    V00 loc0 [+0]
```
This is how it currently looks just before code generation:
```
   ▌  stmtExpr  void  (top level) (IL 0x000...0x003)
   │     ┌──▌  const     int    0 REG rax $81
   │  ├──▌  &lclVar   byref  V00 loc0 d:3 REG NA
   └──▌  storeIndir int    REG NA

   ▌  stmtExpr  void  (top level) (IL 0x008...0x009)
   │  ┌──▌  lclFld    int    V00 loc0 u:3[+0] (last use) REG rax $180
   └──▌  return    int    REG NA $181
```
And here is the resulting code:
```
  push     rax
  xor      rax, rax
  mov      qword ptr [V00 rsp], rax
  xor      eax, eax
  mov      dword ptr [V00 rsp], eax
  mov      eax, dword ptr [V00 rsp]
  add      rsp, 8
  ret
```
#### After
Here is the IR after Import with the prototype First Class Struct changes.
Note that the fixed-size struct variable is assigned and returned just as for a scalar type.

```
; V00 loc0  struct4

   ▌  stmtExpr  void  (top level) (IL 0x000...  ???)
   │  ┌──▌  const     int    0
   └──▌  =         struct4 (init)
      └──▌  lclVar    struct4 V00 loc0

   ▌  stmtExpr  void  (top level) (IL 0x008...  ???)
   └──▌  return    struct4
      └──▌  lclVar    struct4 V00 loc0
```
And here is the resulting IR just prior to code generation:
```
   ▌  stmtExpr  void  (top level) (IL 0x008...0x009)
   │  ┌──▌  const     struct4 0 REG rax $81
   └──▌  return    struct4 REG NA $140
```
Finally, here is the resulting code that we were hoping to achieve:
```
  xor      eax, eax
```

### Issue 1133:
#### Before

Here is the IR after Inlining for the `TestValueTypesInInlinedMethods` method that invokes a
sequence of methods that are inlined, creating a sequence of copies.
Because this struct type does not fit into a single register, the types do not change (and
therefore the local variable table is not shown).

```
   ▌  stmtExpr  void  (top level) (IL 0x000...0x003)
   │     ┌──▌  const     int    16
   └──▌  initBlk   void
      │  ┌──▌  const     int    0
      └──▌  <list>    void
         └──▌  addr      byref
            └──▌  lclVar    struct V00 loc0

   ▌  stmtExpr  void  (top level) (IL 0x008...  ???)
   │     ┌──▌  const     int    16
   └──▌  copyBlk   void
      │  ┌──▌  addr      byref
      │  │  └──▌  lclVar    struct V00 loc0
      └──▌  <list>    void
         └──▌  addr      byref
            └──▌  lclVar    struct V01 tmp0

   ▌  stmtExpr  void  (top level) (IL 0x008...  ???)
   │     ┌──▌  const     int    16
   └──▌  copyBlk   void
      │  ┌──▌  addr      byref
      │  │  └──▌  lclVar    struct V01 tmp0
      └──▌  <list>    void
         └──▌  addr      byref
            └──▌  lclVar    struct V02 tmp1

   ▌  stmtExpr  void  (top level) (IL 0x008...  ???)
   │     ┌──▌  const     int    16
   └──▌  copyBlk   void
      │  ┌──▌  addr      byref
      │  │  └──▌  lclVar    struct V02 tmp1
      └──▌  <list>    void
         └──▌  addr      byref
            └──▌  lclVar    struct V03 tmp2

   ▌  stmtExpr  void  (top level) (IL 0x008...  ???)
   └──▌  call help long   HELPER.CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE
      ├──▌  const     long   0x7ff918494e10
      └──▌  const     int    1

   ▌  stmtExpr  void  (top level) (IL 0x008...  ???)
   │     ┌──▌  const     int    16
   └──▌  copyBlk   void
      │  ┌──▌  addr      byref
      │  │  └──▌  lclVar    struct V03 tmp2
      └──▌  <list>    void
         │  ┌──▌  const     long   8 Fseq[#FirstElem]
         └──▌  +         byref
            └──▌  field     ref    s_dt

   ▌  stmtExpr  void  (top level) (IL 0x00E...  ???)
   └──▌  return    void
```
And here is the resulting code:
```
sub      rsp, 104
xor      rax, rax
mov      qword ptr [V00 rsp+58H], rax
mov      qword ptr [V00+0x8 rsp+60H], rax
xor      rcx, rcx
lea      rdx, bword ptr [V00 rsp+58H]
vxorpd   ymm0, ymm0
vmovdqu  qword ptr [rdx], ymm0
vmovdqu  ymm0, qword ptr [V00 rsp+58H]
vmovdqu  qword ptr [V01 rsp+48H], ymm0
vmovdqu  ymm0, qword ptr [V01 rsp+48H]
vmovdqu  qword ptr [V02 rsp+38H], ymm0
vmovdqu  ymm0, qword ptr [V02 rsp+38H]
vmovdqu  qword ptr [V03 rsp+28H], ymm0
mov      rcx, 0x7FF918494E10
mov      edx, 1
call     CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE
mov      rax, 0x1FAC6EB29C8
mov      rax, gword ptr [rax]
add      rax, 8
vmovdqu  ymm0, qword ptr [V03 rsp+28H]
vmovdqu  qword ptr [rax], ymm0
add      rsp, 104
ret
```

#### After
After fginline:
(note that the obj node will become a blk node downstream).
```
   ▌  stmtExpr  void  (top level) (IL 0x000...0x003)
   │  ┌──▌  const     int    0
   └──▌  =         struct (init)
      └──▌  lclVar    struct V00 loc0

   ▌  stmtExpr  void  (top level) (IL 0x008...  ???)
   │  ┌──▌  lclVar    struct V00 loc0
   └──▌  =         struct (copy)
      └──▌  lclVar    struct V01 tmp0

   ▌  stmtExpr  void  (top level) (IL 0x008...  ???)
   │  ┌──▌  lclVar    struct V01 tmp0
   └──▌  =         struct (copy)
      └──▌  lclVar    struct V02 tmp1

   ▌  stmtExpr  void  (top level) (IL 0x008...  ???)
   │  ┌──▌  lclVar    struct V02 tmp1
   └──▌  =         struct (copy)
      └──▌  lclVar    struct V03 tmp2

   ▌  stmtExpr  void  (top level) (IL 0x008...  ???)
   └──▌  call help long   HELPER.CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE
      ├──▌  const     long   0x7ff9184b4e10
      └──▌  const     int    1

   ▌  stmtExpr  void  (top level) (IL 0x008...  ???)
   │  ┌──▌  lclVar    struct V03 tmp2
   └──▌  =         struct (copy)
      └──▌  obj(16)   struct
         │  ┌──▌  const     long   8 Fseq[#FirstElem]
         └──▌  +         byref
            └──▌  field     ref    s_dt

   ▌  stmtExpr  void  (top level) (IL 0x00E...  ???)
   └──▌  return    void
```
Here is the IR after fgMorph:
Note that copy propagation has propagated the zero initialization through to the final store.
```
   ▌  stmtExpr  void  (top level) (IL 0x000...0x003)
   │  ┌──▌  const     int    0
   └──▌  =         struct (init)
      └──▌  lclVar    struct V00 loc0

   ▌  stmtExpr  void  (top level) (IL 0x008...  ???)
   │  ┌──▌  const     struct 0
   └──▌  =         struct (init)
      └──▌  lclVar    struct V01 tmp0

   ▌  stmtExpr  void  (top level) (IL 0x008...  ???)
   │  ┌──▌  const     struct 0
   └──▌  =         struct (init)
      └──▌  lclVar    struct V02 tmp1

   ▌  stmtExpr  void  (top level) (IL 0x008...  ???)
   │  ┌──▌  const     struct 0
   └──▌  =         struct (init)
      └──▌  lclVar    struct V03 tmp2

   ▌  stmtExpr  void  (top level) (IL 0x008...  ???)
   └──▌  call help long   HELPER.CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE
      ├──▌  const     long   0x7ffc8bbb4e10
      └──▌  const     int    1

   ▌  stmtExpr  void  (top level) (IL 0x008...  ???)
   │  ┌──▌  const     struct 0
   └──▌  =         struct (init)
      └──▌  obj(16)   struct
         │  ┌──▌  const     long   8 Fseq[#FirstElem]
         └──▌  +         byref
            └──▌  indir     ref
               └──▌  const(h)  long   0x2425b6229c8 static Fseq[s_dt]

   ▌  stmtExpr  void  (top level) (IL 0x00E...  ???)
   └──▌  return    void
```
After liveness analysis the dead stores have been eliminated:
```
   ▌  stmtExpr  void  (top level) (IL 0x008...  ???)
   └──▌  call help long   HELPER.CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE
      ├──▌  const     long   0x7ffc8bbb4e10
      └──▌  const     int    1

   ▌  stmtExpr  void  (top level) (IL 0x008...  ???)
   │  ┌──▌  const     struct 0
   └──▌  =         struct (init)
      └──▌  obj(16)   struct
         │  ┌──▌  const     long   8 Fseq[#FirstElem]
         └──▌  +         byref
            └──▌  indir     ref
               └──▌  const(h)  long   0x2425b6229c8 static Fseq[s_dt]

   ▌  stmtExpr  void  (top level) (IL 0x00E...  ???)
   └──▌  return    void
```
And here is the resulting code, going from a code size of 129 bytes down to 58.
```
sub      rsp, 40
mov      rcx, 0x7FFC8BBB4E10
mov      edx, 1
call     CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE
xor      rax, rax
mov      rdx, 0x2425B6229C8
mov      rdx, gword ptr [rdx]
add      rdx, 8
vxorpd   ymm0, ymm0
vmovdqu  qword ptr [rdx], ymm0
add      rsp, 40
ret
```

Work Items
----------
This is a preliminary breakdown of the work into somewhat separable tasks. Those whose descriptions
are prefaced by '*' have been prototyped in an earlier version of the JIT, and that work is now
being re-integrated and tested, but may require some cleanup and/or phasing with other work items
before a PR is submitted.

### Mostly-Independent work items
1. *Replace block ops with assignments & new nodes.

2. *Add new fixed-size types, and normalize them in the importer (might be best to do this with or
   after #1, but not really dependent)

3. LSRA
   * Enable support for multiple destination regs, call nodes that return a struct in multiple
     registers (for x64/ux, and for arm)
   * Handle multiple destination regs for ldp on arm64 (could be done before or concurrently with the above).
     Note that this work item is specifically intended for call arguments. It is likely the case that
     utilizing ldp for general-purpose code sequences would be handled separately.

4. X64/ux: aggressively promote lclVar struct incoming or outgoing args with two 8-byte fields

5. X64/ux:
   * modify the handling of multireg struct args to use GT_LIST of GT_IND
   * remove the restriction to NOT promote things that are multi-reg args, as long as they match
     (i.e. two 8-byte fields). Pass those using GT_LIST of GT_LCL_VAR.
   * stop adding extra lclVar copies

6. Arm64:
   * Promote 16-byte struct lclVars that are incoming or outgoing register arguments only if they
     have 2 8-byte fields (DONE). Pass those using GT_LIST of GT_LCL_VAR (as above for x64/ux).
     Note that, if the types do not match, e.g. a TYP_DOUBLE field that will be passed in an integer
     register, it will require special handling in Lowering and LSRA, as is currently done in the
     TYP_SIMD8 case.
   * For other cases, pass as GT_LIST of GT_IND (DONE)
     * The GT_LIST would be created in fgMorphArgs(). Then in Lower, putarg_reg nodes will be inserted
       between the GT_LIST and the list item (GT_LCL_VAR or GT_IND). (DONE)
   * Add support for HFAs.

### Dependent work items:

7. *(Depends on 1 & 2): Fully enregister TYP_STRUCT[1|2|3|4|8] with no field accesses.

8. *(Depends on 1 & 2): Enable value numbering and assertion propagation for struct types.

9. (Depends on 1 & 2, mostly to avoid conflicts): Add support in prolog to extract fields, and
   remove the restriction of not promoting incoming reg structs that have more than one field.
   Note that SIMD types are already reassembled in the prolog.
10. (Not really dependent, but probably best done after 1, 2, 5, 6): Add support for assembling
    non-matching fields into registers for call args and returns. This includes producing the
    appropriate IR, which may simply be shifts and or's of the appropriate fields.
    This would either be done during `fgMorphArgs()` and the `GT_RETURN` case of `fgMorphSmpOp()`
    or as described below in
    [Extracting and Assembling Structs](#Extract-Assemble).

11. (Not really dependent, but probably best done after 1, 2, 5, 6): Add support for extracting the
    fields for the returned struct value of a call, producing the appropriate IR, which may simply
    be shifts and and's.
    This would either be done during the morphing of the call itself, or as described below in
    [Extracting and Assembling Structs](#Extract-Assemble).

12. (Depends on 3, may replace the second part of 6): For arm64, add support for loading non-promoted
    or non-local structs with ldp
    * Either using TYP_STRUCT and special case handling, OR adding TYP_STRUCT16

13. (Depends on 7, 9, 10, 11): Enhance struct promotion to allow full enregistration of structs,
    even if some fields are accessed, if there are more call/return references than field references.
    This work item should address issue #1161, by removing the automatic non-promotion
    of structs with a single double field, and adding appropriate heuristics for when it
    should be allowed.

Related Work Item
-----------------
These changes are somewhat orthogonal, though will likely have merge issues if done in parallel with any of
the above:
* Unified API for ABI info
  * Pass/Return info:
    * Num regs used for passing
    * Per-slot location (reg num / REG_STK)
    * Per-slot type (for reg “slots”)
    * Starting stack slot offset (if passed on stack)
    * By reference?
    * etc.
  * We should be able to unify HFA handling into this model
  * For arg passing, the API for creating the argEntry should take an arg state that keeps track of
    what regs have been used, and handles the backfilling case for ARM

Open Design Questions
---------------------
### <a name="Extract-Assemble"/>Extracting and Assembling Structs

Should the IR for extracting and assembling struct arguments from or to argument or return registers
be generated directly during the morphing of call arguments and returns, or should this capability
be handled in a more general fashion in `fgMorphCopyBlock()`?
The latter seems desirable for its general applicability.

One way to handle this might be (see also the sketch after this list):

1. Whenever you have a case of mismatched structs (call args, call node, or return node),
   create a promoted temp of the "fake struct type", e.g. for arm you would introduce three
   new temps: one for the struct, and one for each of its TYP_LONG promoted fields.
2. Add an assignment to or from the temp (e.g. as a setup arg node), BUT the structs on
   both sides of that assignment can now be promoted.
3. Add code to fgMorphCopyBlock to handle the extraction and assembling of structs.
4. The promoted fields of the temp would be preferenced to the appropriate argument or return registers.
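
As a concrete illustration of the shift-and-or assembly that work items 10 and 11 call for,
here is a C-style sketch of what the generated IR must compute when two promoted 4-byte fields
travel in a single 8-byte register (function and variable names are illustrative, not actual
JIT temps):

```C++
#include <stdint.h>

// Assembling: pack two promoted 4-byte fields into one 8-byte register value.
uint64_t assembleFields(uint32_t field0, uint32_t field1)
{
    // field1 lives at bit offset 32 within the register.
    return (uint64_t)field0 | ((uint64_t)field1 << 32);
}

// Extracting: recover the promoted fields from a struct returned in a register.
void extractFields(uint64_t reg, uint32_t* field0, uint32_t* field1)
{
    *field0 = (uint32_t)reg;         // low 32 bits
    *field1 = (uint32_t)(reg >> 32); // high 32 bits
}
```

Whether this logic is produced in `fgMorphArgs()`/`fgMorphSmpOp()` or centralized in
`fgMorphCopyBlock()` is exactly the open question above.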
diff --git a/Documentation/design-docs/inline-size-estimates.md b/Documentation/design-docs/inline-size-estimates.md
new file mode 100644
index 0000000000..12286c7f5d
--- /dev/null
+++ b/Documentation/design-docs/inline-size-estimates.md
@@ -0,0 +1,526 @@

# Inline Size Estimates

Note this is work in progress....

Inlining is a heuristic-based optimization. There are some cases where
it's obvious that a particular inline is good or bad for performance,
but more typically, things are not that clear. In those cases where
the benefit is in doubt, the compiler will rely on heuristics, typically
based on various estimates of an inline's size impact, speed impact,
or other important factors.

In this writeup we consider approaches to what should be the simplest
of these estimates: the size impact of an inline.

## Background

There are a number of interesting facets to the inline size estimate
problem, but let's start with some generalities. Suppose we have some
size estimate for the caller `C` (ignoring for the moment exactly what
we are measuring), say `CallerSize`, and some size estimate for the
callee `E`, say `CalleeSize`. Now suppose we are contemplating what
would happen if `E` is inlined into `C` to create `C'`. We'd like some
sort of size estimate `CallerSize'`. The simplest estimate is that
`CallerSize'` = `CallerSize + CalleeSize`.

```
(1) CallerSize' = CallerSize + CalleeSize
```

However, calling conventions impose some additional code overhead on
both the caller and callee. The caller must set up arguments in
registers or on the stack, and if there is a return value, might need
to move it or store it somewhere. It also might need to spill values
that are stored in caller-saved registers around the call.

The callee must set up a stack frame and possibly spill callee-save
registers it intends to use, and then ultimately must undo all this.

When we inline `E` into `C`, none of this calling convention setup is
necessary. So one can imagine that there is some additional size
savings from inlining, one that depends on the calling convention and
the number and kind of arguments passed from caller to callee.

```
(2) CallerSize' = CallerSize + CalleeSize - Overhead
```

Note that it's entirely possible that `Overhead > CalleeSize`, so
that `CallerSize' < CallerSize`, that is, the inline not only results
in faster code but also in smaller code. Indeed this state of affairs
is increasingly common with the advent of modern programming styles
that emphasize building functionality out of lots of small procedures.

Alternatively, we can compute the size impact as the size change of C
because of the inline of E:

```
(3) SizeImpact = CallerSize' - CallerSize
```

or, with a bit of math,

```
(4) SizeImpact = CalleeSize - Overhead
```

Now let's look at some actual data to see how well this simple model
for `SizeImpact` holds up. The data below shows the measured size
impact of a single inline into various calling methods in mscorlib.
This data was obtained by first compiling all the methods without
inlining to obtain a baseline size as measured by the encoded
instruction size in bytes. The inliner was then modified so it would
only do a single inline per caller, with the candidate chosen by a
parameter `k` that could be varied externally. mscorlib was then
repeatedly compiled with varying values of `k`, and the inlines
performed and the subsequent modified caller sizes recorded. This data
was then permuted and histogrammed to show the variety of `SizeImpact`
values for a fixed `E`. Since the calling convention and number and
kind of parameters is fixed across all observations for a fixed `E`,
the better model would predict `SizeImpact` is the same in every
case.

In this first example, case `E` is `System.Object:.ctor()`, which is
close to the simplest possible callee -- it takes one argument and has
no return value. In all the observed cases the `SizeImpact` is
negative, but it's far from the same in every case.

```
Inlining for System.Object:.ctor():this (size 6)
Instances 1160 Mean SizeImpact -12.53 Min -52 Max -5
Distribution
  less than -50:   1
  [-50,-41]:   0   1   1   0   0   1   0   0   1   1
  [-40,-31]:   0   1   0   0   1   0   0   3   1   2
  [-30,-21]:   1   3   2   2   1   8  76   5  22  18
  [-20,-11]:  38  23  98   7  29   4   8  27  14  21
  [-10, -1]:   1 362 373   0   0   3   0   0   0   0
  [  0,  9]:   0   0   0   0   0   0   0   0   0   0
  [ 10, 19]:   0   0   0   0   0   0   0   0   0   0
  [ 20, 29]:   0   0   0   0   0   0   0   0   0   0
  [ 30, 39]:   0   0   0   0   0   0   0   0   0   0
  [ 40, 49]:   0   0   0   0   0   0   0   0   0   0
greater than 49:   0
```
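
To see how far a fixed-`Overhead` model is from this reality, we can invert equation `(4)`
against the histogram. With `CalleeSize` = 6, each observation implies its own overhead
value (a back-of-the-envelope reading of the numbers above, not additional measurement):

```
Overhead = CalleeSize - SizeImpact

mean case:  Overhead = 6 - (-12.53) = 18.53 bytes
best case:  Overhead = 6 - (-52)    = 58 bytes
worst case: Overhead = 6 - (-5)     = 11 bytes
```

If the model held, all 1160 instances would imply one and the same overhead; instead the
implied per-site overhead ranges from 11 to 58 bytes.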
It should be evident from this set of observations that `SizeImpact`
cannot be completely characterized by a simple formula like `(4)`. For
this inlinee, inlining always saves at least 5 bytes, roughly half the time
it saves 8 or 9 bytes, on average it saves about 12.5 bytes, but it often
saves considerably more.

Other inlinees show similar spreads in `SizeImpact`. Here we see a case
where sometimes the `SizeImpact` is negative and other times it's
positive.

```
Inlining for System.Threading.CancellationToken:get_IsCancellationRequested():bool:this (size 29)
Instances 42 SizeImpact Mean 11.33 Min -20 Max 28
Distribution
  less than -50:   0
  [-50,-41]:   0   0   0   0   0   0   0   0   0   0
  [-40,-31]:   0   0   0   0   0   0   0   0   0   0
  [-30,-21]:   0   0   0   0   0   0   0   0   0   0
  [-20,-11]:   1   0   0   0   0   0   0   1   0   0
  [-10, -1]:   0   2   0   0   0   0   0   0   1   2
  [  0,  9]:   0   0   1   1   0   2   0   1   4   1
  [ 10, 19]:   1   0   1   0   4   1   2   1   2   6
  [ 20, 29]:   0   0   2   0   2   0   0   0   3   0
  [ 30, 39]:   0   0   0   0   0   0   0   0   0   0
  [ 40, 49]:   0   0   0   0   0   0   0   0   0   0
greater than 49:   0
```

Not all inlinee `SizeImpacts` exhibit such wide distributions. Some spread
just a little:

```
Inlining for System.Environment:GetResourceString(ref):ref (size 15)
Instances 2238 SizeImpact Mean 0.01 Min -3 Max 6
Distribution
  less than -50:   0
  [-50,-41]:   0   0   0   0   0   0   0   0   0   0
  [-40,-31]:   0   0   0   0   0   0   0   0   0   0
  [-30,-21]:   0   0   0   0   0   0   0   0   0   0
  [-20,-11]:   0   0   0   0   0   0   0   0   0   0
  [-10, -1]:   0   0   0   0   0   0   0   7   0   7
  [  0,  9]:2212   3   0   5   0   0   4   0   0   0
  [ 10, 19]:   0   0   0   0   0   0   0   0   0   0
  [ 20, 29]:   0   0   0   0   0   0   0   0   0   0
  [ 30, 39]:   0   0   0   0   0   0   0   0   0   0
  [ 40, 49]:   0   0   0   0   0   0   0   0   0   0
greater than 49:   0
```

and some not at all:

```
Inlining for System.DateTime:get_Ticks():long:this (size 15)
Instances 129 SizeImpact Mean 0.00 Min 0 Max 0
Distribution
  less than -50:   0
  [-50,-41]:   0   0   0   0   0   0   0   0   0   0
  [-40,-31]:   0   0   0   0   0   0   0   0   0   0
  [-30,-21]:   0   0   0   0   0   0   0   0   0   0
  [-20,-11]:   0   0   0   0   0   0   0   0   0   0
  [-10, -1]:   0   0   0   0   0   0   0   0   0   0
  [  0,  9]: 129   0   0   0   0   0   0   0   0   0
  [ 10, 19]:   0   0   0   0   0   0   0   0   0   0
  [ 20, 29]:   0   0   0   0   0   0   0   0   0   0
  [ 30, 39]:   0   0   0   0   0   0   0   0   0   0
  [ 40, 49]:   0   0   0   0   0   0   0   0   0   0
greater than 49:   0
```

So it appears there must be other -- perhaps many -- factors that
influence the `SizeImpact` of a particular inline. The question is,
what factors are important, how do we measure them, and how do they
combine to influence the overall result? The remainder of this
document will look into some of the ways we can answer this question.

## Some Ground Rules

Before we move on, however, we should address what kind of information
will actually be available to observe and use in building a heuristic
model.

It's not realistic to feed the heuristic on the actual encoded size of
methods.
While compiling a method `C`, obtaining the encoded size of
`C` is problematic. It's also quite likely that the encoded size of
the inlinee `E` is unknown, except possibly in some AOT scenarios
where one could generally arrange for methods to be compiled in
bottom-up order (that is, callees before callers). While one could
imagine spending compile time doing experimental compilations of `C`'s
and `E`'s to see what code size results, this is probably too costly
to be practical.

Second, even if we could obtain the actual size of prospective inline
candidates, we might not want to use this data. The final code
sequence emitted by the compiler depends intimately on details of the
target architecture, runtime conventions (ABIs), and capabilities of the
compiler phases that run after inlining. If we allow feedback into the
heuristics by incorporating data from these "downstream" sources, we
introduce various forms of coupling that have important
consequences. For example, it would be challenging to work on a
downstream phase (say, register allocation) if some trivial change to
allocation in an inlinee `E` perturbs inlining decisions for other
methods like `C`. Likewise we may or may not want to allow runtime
conventions to influence inlining -- if our application can run cross
platform (say in both Windows and Linux) we might not want the various
versions to exhibit vastly different inlining patterns.

For that reason, we intend to restrict the information we can use in
forming heuristics to things that are generally true of the caller and
callee, and the general capabilities of the downstream phases. For the
caller and callee this largely means information that can be obtained
by inspecting either the input to the compiler, or the compiler's IR
in the stages leading up to inlining, and perhaps (if available) some
carefully chosen information obtained from actual compilation of the
callee.

At the same time, we still intend for the predictions made by the
heuristics to be indicative of the final binary size. We have no other
reliable source of truth to guide us.

Given all this, we can restate the central question as: how best can
the compiler estimate the ultimate size impact of an inline while
restricting itself to considering features that are generally true of
caller, callee, and the capabilities of the compiler?

## Building a Heuristic Manually

The tried-and-true approach is to build the heuristic manually. There
are two prongs to the approach: the first is case analysis of actual
behavior, and the second is modelling based on the compiler writer's
experience and intuition.

### Some Case Analysis

#### Case 1: Maximal Savings

It's always instructive to look at actual behavior, so let's consider
some of the cases. From the table above for `System.Object:.ctor()` we
see a number of cases where there were large savings in byte size. The
most significant of these is the following:

```
Inline System.Object:.ctor():this
into System.IO.UnmanagedMemoryAccessor:.ctor(ref,long,long,int):this
CalleeSize = 6
CallerSize = 72
CallerSize' = 24
SizeImpact = -48
```

where `System.Object:.ctor():this` is simply:

```Assembly
; System.Object:.ctor():this
; Total bytes of code 6, prolog size 5
       0F1F440000           nop
       C3                   ret
```

As an aside, one might wonder why there are 5 bytes of nops here --
they are put there to support the *rejit* feature used by the
application insights framework.

The caller's source code shows it simply delegates immediately to
another method to initialize the object:

```C#
    UnmanagedMemoryAccessor(
       SafeBuffer buffer, Int64 offset,
       Int64 capacity, FileAccess access) {
       Initialize(buffer, offset, capacity, access);
    }
```

Here's the code before the inline:

```Assembly
; BEFORE
; System.IO.UnmanagedMemoryAccessor:.ctor(ref,long,long,int):this
; Total bytes of code 72, prolog size 10
       4156                 push     r14
       57                   push     rdi
       56                   push     rsi
       55                   push     rbp
       53                   push     rbx
       4883EC20             sub      rsp, 32
       488BF1               mov      rsi, rcx
       488BFA               mov      rdi, rdx
       498BD8               mov      rbx, r8
       498BE9               mov      rbp, r9
       488BCE               mov      rcx, rsi
       E800000000           call     System.Object:.ctor():this
       448B742470           mov      r14d, dword ptr [rsp+70H]
       4489742470           mov      dword ptr [rsp+70H], r14d
       488BCE               mov      rcx, rsi
       488BD7               mov      rdx, rdi
       4C8BC3               mov      r8, rbx
       4C8BCD               mov      r9, rbp
       488D0500000000       lea      rax, [(reloc)]
       4883C420             add      rsp, 32
       5B                   pop      rbx
       5D                   pop      rbp
       5E                   pop      rsi
       5F                   pop      rdi
       415E                 pop      r14
       48FFE0               rex.jmp  rax
```

and here's the code after the inline:

```Assembly
; AFTER
; System.IO.UnmanagedMemoryAccessor:.ctor(ref,long,long,int):this
; Total bytes of code 24, prolog size 5
       0F1F440000           nop
       90                   nop
       8B442428             mov      eax, dword ptr [rsp+28H]
       89442428             mov      dword ptr [rsp+28H], eax
       488D0500000000       lea      rax, [(reloc)]
       48FFE0               rex.jmp  rax
```

In both cases this method ends up tail calling to `Initialize` (via the
`rex.jmp`), passing it the exact set of arguments that
was passed to the method.

In the before case, the un-inlined call changes the method from a leaf
to a non-leaf. Leaf functions don't need to set up their own frame,
while non-leaves do. So some of the extra code in the prolog and
epilog is the frame setup. Beyond that, however, the un-inlined callee
forces the compiler to move the arguments that will ultimately be
passed to the tail-called method out of the volatile registers and
into preserved registers before the call, then back again to
the proper argument registers after the call. And in order to use the
callee-saves, these registers must be pushed in the prolog and popped
in the epilog.

In the after case, no frame setup is required, no registers need to be
preserved across the call, so no registers need saving and
restoring. The two `mov`s that remain appear to be gratuitous. The
first `nop` is there for rejit padding. Not sure why the second one is
there.

So we can now see why this particular case has such extreme
`SizeImpact`: the un-inlined call triggers frame creation and a fair
amount of shuffling to accommodate the potential side effects of the
call.

#### Case 2: Typical Savings

In the data above, the inline of `System.Object:.ctor()` typically
saves either 8 or 9 bytes of code. Let's look at one such case.

```
Inline System.Object:.ctor():this
into System.Security.SecurityElement:.ctor():this
CalleeSize = 6
CallerSize = 15
CallerSize' = 6
SizeImpact = -9
```

Here are before and after listings for the caller:

```Assembly
; BEFORE
; System.Security.SecurityElement:.ctor():this
; Total bytes of code 15, prolog size 5
       0F1F440000           nop
       488D0500000000       lea      rax, [(reloc)]
       48FFE0               rex.jmp  rax
```

```Assembly
; AFTER
; System.Security.SecurityElement:.ctor():this
; Total bytes of code 6, prolog size 5
       0F1F440000           nop
       C3                   ret
```

In this case the call site was initially handled via a tail call that
required a 7 byte `lea` and a 3 byte `rex.jmp`.
With the inline, all
that's left is a single byte `ret`. This case covers 154 of the 187
instances where inlining `System.Object:.ctor()` saved 9 bytes.

### Cases 3, 4: Size May Decrease or Increase

Now let's look at a set of examples where an inline can either
decrease or increase the caller size. Here's the good case:

```
Inline System.IntPtr:op_Explicit(long):long
into System.Runtime.InteropServices.Marshal:WriteIntPtr(ref,int,long)
CalleeSize = 35
CallerSize = 50
CallerSize' = 22
SizeImpact = -28
```

Here's the callee code:

SIGH -- turns out the above has the wrong callee size -- there are
several overloaded versions and my script picks up the wrong one. The
true callee is only 9 bytes long:

```Assembly
; System.IntPtr:op_Explicit(long):long
; Total bytes of code 9, prolog size 5
       0F1F440000           nop
       488BC1               mov      rax, rcx
       C3                   ret
```

At any rate the `SizeImpact` is still correct.

Here's the before caller code:

```Assembly
; BEFORE
; System.Runtime.InteropServices.Marshal:WriteIntPtr(ref,int,long)
; Total bytes of code 50, prolog size 12
       55                   push     rbp
       57                   push     rdi
       56                   push     rsi
       4883EC20             sub      rsp, 32
       488D6C2430           lea      rbp, [rsp+30H]
       488BF1               mov      rsi, rcx
       8BFA                 mov      edi, edx
       498BC8               mov      rcx, r8
       E800000000           call     System.IntPtr:op_Explicit(long):long
       4C8BC0               mov      r8, rax
       8BD7                 mov      edx, edi
       488BCE               mov      rcx, rsi
       488D0500000000       lea      rax, [(reloc)]
       488D65F0             lea      rsp, [rbp-10H]
       5E                   pop      rsi
       5F                   pop      rdi
       5D                   pop      rbp
       48FFE0               rex.jmp  rax        // tail call WriteInt64
```

and the after caller code:

```Assembly
; AFTER
; System.Runtime.InteropServices.Marshal:WriteIntPtr(ref,int,long)
; Total bytes of code 22, prolog size 10
       55                   push     rbp
       4883EC20             sub      rsp, 32
       488D6C2420           lea      rbp, [rsp+20H]
       E800000000           call     System.Runtime.InteropServices.Marshal:WriteInt64(ref,int,long)
       90                   nop
       488D6500             lea      rsp, [rbp]
       5D                   pop      rbp
       C3                   ret
```

Somewhat confusingly, the inline has stopped the jit from making a
tail call, and the size savings should be even greater than they
are. There appears to be some kind of code generation issue within the
jit. The IL for the callee is

```
IL_0000  0f 00           ldarga.s     0x0
IL_0002  7b 8f 02 00 04  ldfld        0x400028F
IL_0007  6e              conv.u8
IL_0008  2a              ret
```

and the JIT rejects the tail call because

```
Rejecting tail call late for call [000008]: Local address taken
```

so presumably the `ldarga.s` in the inlinee is causing trouble.

Now let's see if that same inlinee can cause a size increase. Here's a
case:

(NOTE this is calling a different overload..., need to get this
straightened out)

```
Inline System.IntPtr:op_Explicit(long):long [35]
into EventData:get_DataPointer():long:this [18] size 35 delta 17
CalleeSize = 35
CallerSize = 18
CallerSize' = 35
SizeImpact = 17
```

```Assembly
; BEFORE
; EventData:get_DataPointer():long:this
; Total bytes of code 18, prolog size
       0F1F440000           nop
       488B09               mov      rcx, qword ptr [rcx]
       488D0500000000       lea      rax, [(reloc)]
       48FFE0               rex.jmp  rax
```

```Assembly
; AFTER
; EventData:get_DataPointer():long:this
; Total bytes of code 35, prolog size 5
       4883EC28             sub      rsp, 40
       488B11               mov      rdx, qword ptr [rcx]
       33C9                 xor      rcx, rcx
       48894C2420           mov      qword ptr [rsp+20H], rcx
       488D4C2420           lea      rcx, bword ptr [rsp+20H]
       E800000000           call     System.IntPtr:.ctor(long):this
       488B442420           mov      rax, qword ptr [rsp+20H]
       4883C428             add      rsp, 40
       C3                   ret
```

Here the un-inlined case made a tail call.
The inlined case was unable
to do the same optimization for some reason, though it looks like it
should have been able to do so as well.

diff --git a/Documentation/design-docs/inlining-plans.md b/Documentation/design-docs/inlining-plans.md
new file mode 100644
index 0000000000..2279c16e37
--- /dev/null
+++ b/Documentation/design-docs/inlining-plans.md
@@ -0,0 +1,438 @@

# Proposed Plans for Inlining

This document outlines a plan for improving the inlining capabilities
of RyuJit as well as adding inlining capabilities to LLILC.

## Background

Inlining is a key optimization, directly reducing call overhead and
indirectly exposing a wider scope for powerful intraprocedural
optimizations.

From an implementation standpoint, an inliner has the following aspects:

* Machinery: transformations on the compiler's IR that physically
incorporate the code for the callee into the caller.

* Legality: a set of rules that describe when inlining is legal (that
is, preserves program semantics). Here, the legality rules are largely
dictated to us by the CoreCLR runtime.

* Ability: a set of rules that describe whether the machinery is able
to perform a legal inline and whether the result can be handled by
downstream phases.

* Profitability: a set of heuristic rules that describe which legal
and able inlines are desirable.

The general consensus is that the inlining in RyuJit can be improved,
while carefully maintaining RyuJit's current lean compilation
overhead. It is anticipated that this work will encompass Machinery,
Ability, and Profitability.

LLILC does no inlining today. Since we aspire to have LLILC be a
high-performance .Net code generator, we need to enable inlining in
LLILC. LLILC can likely leverage much of LLVM's built-in inlining
Machinery and Ability, but will need new code for Legality and
Profitability.

We envision various scenarios for .Net Code generation that impact
inlining: a first-tier JIT compiler, a higher-tier JIT compiler, a
fast AOT compiler, and an optimizing AOT compiler. Each scenario calls
for inlining, but the tradeoffs are different. For a given scenario,
we may choose to deploy RyuJit or LLILC or both depending on the
scenario requirements and code generator capabilities.

There are also aspects of inlining that are runtime and target machine
specific. Each runtime and target architecture will require some
specialized tuning to accommodate the different costs of operations at
runtime.

Taking this all into account, the proposed scope of the work
encompasses two different compiler code bases, a number of codegen
scenarios, and a number of target architectures and runtime
conventions. The goal is to come up with one conceptual framework that
covers all these cases and avoids relying too much on time-intensive
manual investigation and tuning.

We will measure the quality of inlining along three axes: the time
spent generating code (aka *throughput* -- TP), the time
spent executing the code (aka *code quality* -- CQ), and the size
of the generated code (CS).

We also desire that the resulting decision-making machinery (in particular
the Profitability heuristic) is expressed in terms that are sensible to
compiler writers.

## The Proposal

### Machinery

RyuJit's machinery may or may not require updates depending on how many
key inlines are rejected because of lack of ability. This is something
that requires further investigation.
We may also decide to pursue
enhancements here to ensure that in cascaded inline cases the deeper
callsites are provided up-to-date information.

LLILC should be able to leverage LLVM's existing inline machinery. In
JIT-like scenarios, LLILC's reader will need some updates to properly
handle cases where a method is being "read" as an inline candidate
rather than as a root method to be compiled. For AOT scenarios we ultimately
will want to do something more elaborate where all of the methods
being compiled are represented as LLVM IR.

### Legality

As mentioned above, the legality constraints are generally imposed by
the CoreCLR runtime.

These constraints are already captured in the RyuJit source
code. However, we generally prefer to at least logically perform all
the legality checks before any of the ability or profitability checks, so
that we can reason more simply about why a particular inline didn't
happen. This may not be entirely practical, since some of the
profitability or ability early outs may be helping TP by keeping
RyuJit from doing a lot of exploratory work before ultimately
rejecting an inline candidate.

LLILC lacks legality checking, and this needs to be implemented.

### Ability

RyuJit currently bails out of some legal inlining cases because of
internal restrictions. For instance, if a callee modifies one of its
parameters, RyuJit won't inline it because of limitations in how the
IR is incorporated. We expect to work on reducing or removing key
ability limiters in RyuJit.

Assuming the LLVM inlining framework is robust, LLILC shouldn't have
too many ability constraints.

### Profitability

Given the wide range of scenarios, targets, and the two distinct code
bases, we propose to heavily rely on machine learning techniques for
building and tuning the Profitability components of the inliners.

The general approach we suggest is based on past experiences we've had
developing inliners for various compilers and is inspired by the CGO
paper [Automatic Construction of Inlining Heuristics using Machine
Learning](http://dl.acm.org/citation.cfm?id=2495914) by Kulkarni,
Cavazos, Wimmer, and Simon.

Kulkarni et al. treat profitability as an unsupervised learning
problem, and create a well-performing heuristic black box (neural
network) using evolutionary programming techniques. They then turn
around and use this black box as an oracle to label inline instances,
and from this guide a supervised machine learning algorithm to produce
a decision tree that expresses the profitability heuristics in terms
sensible to the compiler writer.

The inputs to the heuristic black box are various facts and estimates
about an inline candidate, based on observations of the caller, callee,
and call site (collectively, *features*). These features may be booleans,
enumerations (categories), integers, or floating-point values. For example:

* (boolean) `CalleeIsLeaf`, `CallerSiteInTry`
* (enumeration) `CalleeReturnCorType`
* (int) `CalleeILSize`,`CallSiteProfileWeight`
* (float) `CalleeLoadStoreFrequency`

The output is a yes/no inlining decision for a particular candidate.

The evolutionary process is governed by a fitness metric that expresses the
appropriate balance of TP, CQ, and CS for the scenario. For instance, in a JIT
scenario we likely will want to keep TP and CS close to the current
baseline values. For AOT we may want to allow higher TP and larger CS if we
can demonstrate CQ improvements.
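
To make the shape of this pipeline concrete, here is a small C++ sketch of the pieces
described above: a feature vector, the black-box decision function, and a scenario-weighted
fitness combiner. All names and weights are illustrative assumptions, not actual RyuJit or
LLILC code:

```C++
// Illustrative only: the kinds of inputs and outputs the learned
// Profitability heuristic would have.
struct InlineFeatures
{
    bool   CalleeIsLeaf;
    bool   CallerSiteInTry;
    int    CalleeReturnCorType;       // an enumeration (category)
    int    CalleeILSize;
    double CallSiteProfileWeight;
    double CalleeLoadStoreFrequency;
};

// The trained black box (e.g. a small neural network) reduces to a
// yes/no answer for each candidate; its internals are learned offline.
bool shouldInline(const InlineFeatures& f);

// A fitness metric balancing TP, CQ, and CS regressions; the weights
// encode the scenario (a JIT would penalize TP and CS heavily).
double fitness(double tpCost, double cqCost, double csCost)
{
    const double wTP = 0.4, wCQ = 0.2, wCS = 0.4; // made-up weights
    return -(wTP * tpCost + wCQ * cqCost + wCS * csCost); // lower cost is fitter
}
```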

## Implications

For this approach to work, we will need to make frequent and robust
measurements of the key fitness metrics across a range of
representative benchmarks. So this presumes:

* we can identify and capture these key benchmarks
* we can develop methodology for doing appropriate measurements
* we can do these measurements at appropriate scale

### The Benchmarks

Clearly in a machine learning approach the actual observations made
are crucial. We need a solid, representative benchmark set. But this
is a general problem with a heuristic based optimization and not an
inherent limitation of this approach -- one can only measure what one
can measure.

We plan on capturing the results in a standard schema, and hope that with
this and the regularization of the metric data (see next section)
interested 3rd parties can perhaps supply their own benchmark results.

### The Metrics

CS is typically very easy to measure and the measurements are "noise
free". So the measurement only needs to be done once.

If TP and CQ are both measured in terms of time (or cycles) then the
measurements will be inherently noisy. To reduce the variance caused
by noise some repeated runs may be necessary, and it might also be
necessary to run on dedicated, matched hardware. This severely limits
scalability.

We instead propose to measure both TP and CQ in terms of instructions
retired. While this may not be entirely noise-free, we expect the
measurements to be low noise and fairly insensitive to other
computations on the machine. Thus this should give us decent scale
(for instance we can run scenarios in the cloud on VMs with varying
underlying hardware, etc).

For TP, instructions retired should be a reasonable proxy for time,
since we expect the CPI for the compiler proper to be relatively
static. For CQ, we realize there may not be as strong a correlation
with time, but typically the inliner operates early in the compilation
process and inlining is mainly focused on reducing instruction count
and less about reducing cycle stalls.

We'll do some experimentation early on to see how well these proxy
metrics hold up.

### Analysis

To help prioritize work on profitability and ability (and perhaps find
areas where we might discuss trying to remove legality constraints),
we need some mechanism to correlate inlining outcomes with success and
failure reasons.

We propose to develop a textual, machine-readable inline log schema
that describes a set of inlining decisions. This schema can be
produced by a compiler as it makes decisions, and contains sufficient
correlating data so that we can then match it up with runtime
observations. For example, we might end up with the following sort of
per-method output (though more likely it will be embedded in some kind of
xml or json markup):

```
  Inlining for 9811 Lookup
    [o] 22825 Lookup@Lattice
      [o] 22827 ??0?$interior_ptr@PAVCell
      [x] 22826 Lookup@Lattice@@@Z (reason: SizeLimit)
    [o] 21728 ?get_CellList
```

where `[o]` is a successful inline, `[x]` a failed inline, and
indentation shows the inlining tree. For .Net compilation we'll need
some kind of persistent ID for methods, which may not be all that easy
to come by.

This inline log can also be read back by the code generator to enable
*inline replay* and force the inliner to perform a particular pattern
of inlines. This is useful for things like performance investigations,
bug isolation, and for trying to have one compiler emulate the
inlining behavior of another. Since the logs are textual they can also
be edited by hand if necessary to perform experiments.
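
As a sketch of what the marked-up form might look like, here is one possible XML encoding of
the log fragment above. The schema, element names, and attribute values are purely
illustrative; no such schema exists yet:

```
<inlining root="9811" name="Lookup">
  <inline id="22825" name="Lookup@Lattice" result="ok">
    <inline id="22827" name="??0?$interior_ptr@PAVCell" result="ok"/>
    <inline id="22826" name="Lookup@Lattice@@@Z" result="failed" reason="SizeLimit"/>
  </inline>
  <inline id="21728" name="?get_CellList" result="ok"/>
</inlining>
```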

### Obtaining Runtime Data

If we can obtain runtime data about the frequency of call
instructions, we can run this through the log to accumulate failure
reasons and create a failure reason table like the following
(categories and data here totally made up):

```
FailedReason         | Count    | [%]   | Sites  | [%]
---------------------|----------|-------|--------|------
SpeculativeNonGround | 79347955 | 37.8% |    417 | 14.1%
NonGroundShared      | 28756634 | 13.7% |    217 | 07.4%
SizeLimit            | 22066033 | 10.5% |    760 | 25.8%
ExternFunction       | 21923402 | 10.5% |    300 | 10.2%
... etc ...
```

This can be used to prioritize inlining work. For instance, if
`SpeculativeNonGround` was an ability issue, and we wanted to improve
CQ, we'd give it high priority to fix, since it impacts inlining most
frequently.

### Robustness

Since we'll likely be altering inlining decisions, we may uncover bugs
or performance blemishes in the compilers.

It may make sense to build up a random inlining methodology (RyuJit
evidently has something like this) to try and stress test the compiler
in the face of inlining changes. Log replay can be used with automatic
bisection tools to try and isolate the particular inline that exposes
a bug.

Performance issues are harder to spot automatically, and will likely
require some amount of manual analysis. However, we may be able to study
inline instances that are not well-predicted by the inlining models -- for
instance, cases where the CS is much greater than predicted, or CQ is lower --
and see if the modelling shortfall is an indicator of problems within the
compiler itself.

## Development Stages

Given the above, here's a rough draft outline for how the
functionality could be built.

### Preliminaries -- Measurement

1. Make some benchmarks available for measurement.

2. Demonstrate we can reliably measure the metrics of interest: CS,
CQ, and TP (as bytes, instructions, and instructions) when jitting and
pre-jitting the benchmarks. Get some idea of the noise levels for CQ
and TP. Hopefully they are small.

### Preliminaries -- RyuJit

1. Refactor RyuJit code to clearly separate out legality, ability,
and profitability aspects of the inliner. Hopefully this can be done
with no CQ changes and no TP impact.

2. Develop the inline log format.

3. Produce inline logs from RyuJit.

4. Develop methodology for runtime correlation.

5. Produce failure reason analysis for the benchmark set. Decide
which if any abilities need improving.

6. Implement inline log replay for RyuJit.

7. Modify logs by hand or via a tool and replay them to assess the
landscape of TP/CQ/CS tradeoffs from inliner changes.

### Preliminaries -- LLILC

1. Add code to capture legality constraints when reading in a method.

2. Generalize LLILC's MSIL reader to read in an inlinee in
the context of the caller.

3. Write a simple LLVM pass (or specialize an existing pass) to
perform inlining. Initially just implement a very simple profitability
heuristic. Verify it all works correctly.

4. Implement inline log generation by LLILC.

5. Implement inline log replay by LLILC.

6. Modify logs by hand or via a tool and replay them to assess the
landscape of TP/CQ/CS tradeoffs from inliner changes.
### Develop the Heuristics (JIT case)

1. Start by trying to produce code size estimates for a given inline
candidate. Measure actual code size impacts for individual inlining
decisions.

2. Determine features to surface and produce a labelled data set of
(feature*, size impact). Use this to build a machine-learning model
estimator for code size.

3. We probably can't just plug the new size estimate into the existing
heuristic framework and expect good results (generally, better information
in does not always result in better decisions out). So instead build up a
parallel (off by default) heuristic path where we use the new size
estimator.

4. Surface features for the CQ and TP impacts. Build a pluggable model for
the black box heuristic in the compiler. Likely this means that the model
code is written by a tool from some textual description.

5. Run evolutionary experiments to produce an optimized black box heuristic.
This involves: (a) randomly or otherwise generating an initial population of
models; (b) running each model on the benchmarks, gathering metrics; (c)
combining metrics across benchmarks to produce fitness (one possible scheme
is sketched at the end of the Vetting section below); (d) using fitness to
create a new set of models using various genetic approaches; (e) iterating
until satisfied with the result.

   All of this setup should be automated and available in the CI system.

6. Use the winning model to produce a labelled data set of (feature*,
inlining decision) for all the cases in the benchmarks. Build a decision
tree from this via standard techniques. Decide if any pruning is
appropriate.

7. Back-validate the results via measurement to show that (a) the final
heuristic works as expected and metrics match expectations on the training
examples, and (b) the results are not over-fitted, by cross-validating on
untrained benchmarks. Flight this to various third parties for independent
confirmation.

8. Enable the new heuristics as defaults.

9. Repeat the above process for each code base, scenario, architecture, etc.

### AOT Considerations

In the AOT case (particularly the optimized AOT) we will likely need to
develop a different set of heuristics.

The current frameworks for AOT (NGEN and Ready2Run) give the code generator
the opportunity to preview the methods to be compiled. We'll take advantage
of that to build a call graph and orchestrate compilation of methods in a
"bottom-up" fashion (callees before callers). Because of that, when
inlining, the caller typically has the opportunity to look at a richer set
of data about the callee -- for instance, a detailed usage summary for
parameters and similar.

The inliner also has the opportunity to plan inlining over a wider scope,
perhaps adjusting heuristics for the relative importance of callers or the
shape of the call graph or similar.

## Vetting and Revalidation

With an evaluation framework in place, we can now run carefully controlled
experiments to see if further enhancements to the inliner are warranted. For
example, the order in which candidates are considered may have an impact --
ideally the inliner would greedily look at the best candidates first. Other
feature data may prove important, for instance actual or synthetic profile
data.

The rest of the compiler also changes over time. Periodically we should
rerun the heuristic development to see if the current set of heuristics is
still good, especially if major new capabilities are added to the compiler.

We should also continually strive to improve the benchmark set.
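Combining metrics across benchmarks into a single fitness value (step 5(c)
in the heuristics plan above) is one of the trickier pieces, as the Risks
section below notes. As a hedged sketch only -- the weights, the
baseline-relative ratios, and the choice of a geometric mean are all
illustrative assumptions -- fitness might be computed like this:

```C#
using System;
using System.Collections.Generic;

static class FitnessSketch
{
    // Each benchmark reports CQ, CS, and TP as ratios versus the baseline
    // heuristic: cq > 1.0 is an improvement, while cs and tp > 1.0 are
    // regressions. A geometric mean keeps one outlier benchmark from
    // dominating, and the weights express how much we value CQ relative
    // to code size and throughput.
    public static double Combine(IReadOnlyList<(double cq, double cs, double tp)> runs,
                                 double wCq = 1.0, double wCs = 0.5, double wTp = 0.5)
    {
        double logSum = 0.0;
        foreach (var (cq, cs, tp) in runs)
            logSum += wCq * Math.Log(cq) - wCs * Math.Log(cs) - wTp * Math.Log(tp);
        return Math.Exp(logSum / runs.Count); // > 1.0 means a net improvement
    }
}
```

Whatever the exact scheme, it needs to be monotone in the right directions
and stable under benchmark noise, or the evolutionary search will end up
chasing measurement artifacts.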
## Risks

The plan calls for the inliner's profitability logic to be constructed via
machine learning. This exposes us to a number of risks:

* We may not understand how to successfully apply ML techniques.
* We may run into one of the common ML pitfalls, like having too many
features, or strongly correlated features.
* We may not be using the right features.
* Some of the features may themselves be estimates (e.g. code size).
* The benchmark set may not be sufficiently representative.
* We may overfit the benchmark set. The plan here is to cross-validate and
not use all our benchmarks as training examples.
* The training process may be too slow to be practical.
* There may be too much measurement noise.
* The resulting models may not be stable -- meaning that small changes in
the input programs might trigger large changes in the inlining decisions.
* The supervised learning output may still be black-box-like in practice.
* The downstream phases of the compiler might not adapt well to inlining
changes or may offer an uneven performance surface.
* We've found it tricky in the past to produce a good combined fitness
metric from a large set of benchmarks and metrics.
* It's generally not possible to build a new heuristic which is better in
every way than the old one -- some metrics on some benchmarks will
inevitably regress.

However, it's worth pointing out that a manually constructed heuristic faces
many of these same risks.

## Notes

See this [early
draft](https://github.com/AndyAyersMS/coreclr/commit/035054402a345f643d9dee0ec31dbdf5fadbb17c),
now orphaned by a squash, for some interesting comments.

diff --git a/Documentation/design-docs/longs-on-32bit-arch.md b/Documentation/design-docs/longs-on-32bit-arch.md
new file mode 100644
index 0000000000..d04efb59fa
--- /dev/null
+++ b/Documentation/design-docs/longs-on-32bit-arch.md
@@ -0,0 +1,117 @@

# Modeling Longs on 32-bit Architectures

The challenge here is to model long operations in a way that reflects
register lifetimes for the lo and hi halves of a 64-bit integer accurately
enough that we can achieve good register allocation.

## Background

The current liveness model supports:
* Internal registers (interfere with all sources, but no defs)
* Sources that are not last use (interfere with all others)
* Sources that are last use and NOT DelayFree (interfere with all sources, but no defs)
* Sources that are last use AND DelayFree (interfere with all sources and defs)
* Kills (interfere with all uses but no defs)

Previously (eons ago) when we were working on arm32, the model had all but
the last def interfering with the uses and internal registers (i.e. they
went live during the first location for the node).
* This supports the ability to reuse inputs, e.g. for long adds, subs or
addresses, after the first def register goes live.
* This doesn’t work for supporting multi-reg call nodes:
  * All defs need to occur AFTER the kills, and they do not interfere with
any sources.

Considerations for an expanded model:
* Must be “pay for play” (not make LSRA more expensive for non-long code
generation)
* It is not practical to precisely model nodes that generate multiple
instructions:
  * In the case of longs, we may have one source operand that could be
overwritten by the first result, but other operands that need to remain
live.
  * However, on x86 it will be really important not to waste registers.

## Decomposition

This is the approach currently being taken. In the initial implementation,
the reordering of nodes to support the required "valid tree traversal"
property of the IR was causing incorrect code, because for instructions like
`add` the carry bit needs to be immediately consumed by the computation of
the hi bits.
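To make the carry constraint concrete, here is a minimal sketch (a
hypothetical helper, not JIT code) of what a decomposed 64-bit add computes.
On x86 the lo half maps to `add`, which sets the carry flag, and the hi half
maps to `adc`, which must consume that flag before any intervening
instruction clobbers it:

```C#
// Hypothetical illustration of decomposing a 64-bit add into 32-bit halves.
static void AddLong(uint yLo, int yHi, uint zLo, int zHi,
                    out uint xLo, out int xHi)
{
    xLo = yLo + zLo;                 // maps to `add` (sets the carry flag)
    int carry = (xLo < yLo) ? 1 : 0; // the carry held in the flags register
    xHi = yHi + zHi + carry;         // maps to `adc`; must follow immediately
}
```

Any node scheduled between the `add` and the `adc` can destroy the carry,
which is why naive reordering during decomposition produced incorrect code.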
## Decomposition with temps

In order to preserve the current decomposition approach and not change the
fundamental tree ordering properties, it is necessary to evaluate
sub-expressions into temps, to keep the lo and hi computations together.

There are concerns about this, because it requires generating a number of
extra temps in the case of nested expressions. However, mikedn has done some
experimentation
[here](https://github.com/mikedn/coreclr/blob/decompose/src/jit/lower.cpp#L424)
that indicates that this approach may not be as problematic as we feared.

The basic idea is that whenever we need to decompose hi/lo operations but
keep them together (i.e. can’t preserve the tree traversal/linear order
invariants), we create a temp.

## Richer Liveness Model (No Decomposition)

The idea here would be to retain `TYP_LONG` nodes in the IR, and to find a
way to extend the liveness model used by Lowering, LSRA and CodeGen to
ensure good register allocation.

In the table below, there are several operations that have different
register lifetime characteristics.
Each is modeled with a sequence of "Locations" at which the changes in
register lifetimes can be viewed as happening at the same time.
All lifetimes participating in a given Location are considered to interfere.
Note that the simplest model is that all the uses happen at one Location,
and then the def(s) happen at the next Location (i.e. the def does not
interfere with the uses).
The model becomes more complicated when we have constraints such as RMW
(read-modify-write) operands, and even more so when we are actually modeling
multiple target instructions with a single IR node.

To avoid even more complication than is already inherent in this issue, we
will assume that the evaluation order is fixed during Lowering (and in these
examples we are assuming that the lo operand is evaluated first).

In future we may want to consider the implications of relaxing that
constraint, though note that the Decomposition approach also requires a
predetermined evaluation order.

| Operation     | Location 1 | Location 2 | Location 3 |
|---------------|:----------:|:----------:|:----------:|
| x = y         | use y.lo   | def x.lo   |            |
|               | use y.hi   | def x.hi   |            |
|               |            |            |            |
| x = y + z     | use y.lo   | def x.lo   | def x.hi   |
|               | use z.lo   | use y.hi   |            |
|               | y.hi live  | use z.hi   |            |
|               | z.hi live  |            |            |
|               |            |            |            |
| x = *p        | use p      | def x.hi   |            |
| (non-ldp)     | def x.lo   |            |            |
|               |            |            |            |
| x = *p        | use p      | def x.lo   |            |
| (ldp)         |            | def x.hi   |            |
|               |            |            |            |
| x = \*(p+i*8) | use p      | def x.hi   |            |
| (non-ldp)     | use i      |            |            |
|               | def x.lo   |            |            |
|               |            |            |            |
| x = call      | use args   | def x.lo   |            |
|               | kill (end) | def x.hi   |            |

Both of the non-ldp load cases (`ldp`, load pair, is available on Arm but
not on x86) take advantage of the fact that all the sources must remain live
with the first def, so the first def can share a Location with the uses.
However, that approach doesn't reduce the Location requirement for the
`x = y + z` case, where, if we assume that all inputs must be live for the
first def, we keep the lo registers live longer than necessary, and
therefore can't reuse them for the x.lo (or other) results.
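As a rough, hypothetical illustration of that last point, the following
compares the peak number of simultaneously live lifetimes for `x = y + z`
under the three-Location model from the table against a simpler model in
which all sources stay live with the first def (the lifetime sets are
assumptions read off the table, not LSRA output):

```C#
using System;
using System.Linq;

static class PressureSketch
{
    // Lifetimes live at each Location for "x = y + z" under the table's
    // three-Location model: y.lo and z.lo die before x.lo is defined.
    static readonly string[][] ThreeLocations =
    {
        new[] { "y.lo", "z.lo", "y.hi", "z.hi" },
        new[] { "x.lo", "y.hi", "z.hi" },
        new[] { "x.hi" },
    };

    // The simpler model: all sources remain live with the first def.
    static readonly string[][] SourcesLiveAtFirstDef =
    {
        new[] { "y.lo", "z.lo", "y.hi", "z.hi", "x.lo" },
        new[] { "x.lo", "y.hi", "z.hi", "x.hi" },
    };

    static void Main()
    {
        Console.WriteLine(ThreeLocations.Max(l => l.Length));        // 4
        Console.WriteLine(SourcesLiveAtFirstDef.Max(l => l.Length)); // 5
    }
}
```

The extra simultaneously live register in the simpler model is exactly the
reuse opportunity the richer Location model is trying to recover, which
matters most on register-starved x86.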
TO DO:
* Extend the above to include more operations, and to show the actual
instructions generated, to determine how well we can approximate the
"optimal" liveness model.

## True Linear IR

It would be really nice to break the “valid tree traversal” restriction, and
allow arbitrary (or at least more flexible) reordering of nodes when in
linear form.
* Gets rid of embedded statements!!
* Affects (at least) the rationalizer and LSRA, which implicitly depend on
this property in their use of stacks.
* Probably has many unknown other impacts.

diff --git a/Documentation/design-docs/removing-embedded-statements.md b/Documentation/design-docs/removing-embedded-statements.md
new file mode 100644
index 0000000000..c7d63e7926
--- /dev/null
+++ b/Documentation/design-docs/removing-embedded-statements.md
@@ -0,0 +1,180 @@

## Removing Embedded Statements From the RyuJIT Backend IR

Pat Gavlin ([*pagavlin@microsoft.com*](mailto:pagavlin@microsoft.com))

July 2016

### Abstract

RyuJIT’s IR comes in two forms: that produced and manipulated in the front
end, and that expected and manipulated by the back end. The boundary between
these two forms of IR consists of rationalization and lowering, both of
which transform the front-end IR into back-end IR. For the purposes of this
paper, the relevant differences between the two IRs revolve around ordering
constructs within basic blocks: top-level statements, trees, comma nodes,
and embedded statements. The latter two constructs are used to represent
arbitrary (often side-effecting) code that executes at a specific point in a
tree but does not otherwise participate in the tree’s dataflow.
Unfortunately, representational challenges with embedded statements make
them difficult to understand and error-prone to manipulate. This paper
proposes that we remove all statements--embedded and otherwise--from the
backend IR by chaining the last linearly-threaded node within a statement to
the first linearly-threaded node within its successor (and vice versa), as
well as by removing certain constraints on the shape of the nodes within a
block.

### Review: IR ordering semantics

As previously mentioned, RyuJIT uses two forms of IR: the front-end IR
(referred to hereafter as HIR) and the back-end IR (referred to hereafter as
LIR). Aside from using different representations for operations such as
stores, HIR and LIR differ in their ordering constructs within basic blocks.

Within a basic block, the HIR is ordered first by statements. Each statement
consists of a single tree that performs the computation associated with that
statement. The nodes of that tree are executed in the order produced by a
left-to-right post-order visit of the tree’s nodes (with the exception of
binary operator nodes that have the GTF\_REVERSE\_OPS flag set, which
reverses the order of their operand trees). As such, the edges in a tree
represent both ordering and edges in a dataflow graph where every definition
has a single use, with some exceptions:

- Edges to nodes that represent defs of storage locations (e.g. edges to
  nodes on the LHS of assignment operators) often represent neither ordering
  nor a use-to-def relationship: the def of the location happens as part of
  the execution of its parent, and the edge does not represent a use of an
  SDSU temp.

- Edges to unused values do not represent a use-to-def relationship: these
  edges exist only for ordering.

The primary source of the latter is comma nodes, which are used in the HIR
specifically to insert arbitrary code into a tree’s execution that does not
otherwise participate in its SDSU dataflow.
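For readers unfamiliar with comma nodes, the following is a loose C# analogy
(a hypothetical helper, not JIT code) of their evaluation semantics: the
first operand runs purely for its side effects, and only the second
operand's value flows to the parent:

```C#
using System;

static class CommaAnalogy
{
    // Loose analogy for a comma node: evaluate `effect` for its side
    // effects, discard its result, then yield `value` to the parent.
    static T Comma<T>(Action effect, Func<T> value)
    {
        effect();
        return value();
    }

    static void Main()
    {
        int[] a = new int[4];
        // The store executes inside the larger expression, but its edge
        // carries ordering only -- it is not an SDSU use.
        int x = Comma(() => a[0] = 42, () => a[0]) + 1;
        Console.WriteLine(x); // 43
    }
}
```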
Similar to the HIR, the LIR is ordered first by statements. Each statement
consists of a single tree and a linear threading of nodes, which represent
the SDSU dataflow and the computation, respectively, that are associated
with that statement. The linear threading must contain each node that is
present in the tree, and all nodes that make up the tree must be present in
the linear threading in the same order in which they would have been
executed by the HIR’s tree ordering. Additional nodes may be present in the
linear threading in the form of embedded statements, which are
sub-statements whose nodes form a contiguous subsequence of their containing
statement’s execution order, but do not participate in the containing
statement’s SDSU dataflow (i.e. the embedded statement’s nodes are not
present in the containing statement’s tree). Embedded statements otherwise
have the same ordering semantics as top-level statements, and are used for
the same purpose as comma nodes in the HIR: they allow the compiler to
insert arbitrary code into a statement’s execution that does not otherwise
participate in its SDSU dataflow.

As mentioned, comma nodes and embedded statements are used for the same
purpose in the HIR and LIR, respectively. Each construct represents a
contiguous sequence of code that executes within the context of a tree but
that does not participate in its SDSU dataflow. Of the two, however,
embedded statements are generally more difficult to work with: since they
are not present in their containing statement’s tree, the developer must
take particular care when processing LIR in order to avoid omitting the
nodes that comprise embedded statements from analyses such as those required
for code motion (e.g. the analysis required to make an addressing mode
“contained”) and to avoid violating the tree order constraint placed upon
the LIR’s linear threading.

The problems with manipulating embedded statements have two relatively clear
solutions: either replace embedded statements with a construct that is
represented in its parent statement’s tree, or remove the tree order
constraint and move the LIR to a linear ordering that is constrained only by
the requirements of dataflow and side-effect consistency. We believe that
the latter is preferable to the former, as it is consistent with the overall
direction that the LIR has taken thus far and reduces the number of concepts
bound up in tree edges, which would thereafter represent only the SDSU
dataflow for a node. This would clarify the meaning of the IR as well as the
analysis required to perform the transformations required by the backend.

### Approach

The approach that we propose is outlined below.

#### 1. Add utilities for working with LIR.

Efficiently working with the new IR shape will require the creation of
utilities for manipulating, analyzing, validating, and displaying
linearly-ordered LIR. Of these, validation and display of the LIR are
particularly interesting. Validation is likely to check at least the
following properties:

1. The linear threading is loop-free.
2. All SDSU temps (i.e. temps represented by edges) are in fact singly-used.
3. All defs of SDSU temps occur before the corresponding use.
4. All SDSU defs that are used are present in the linear IR.
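As a rough sketch of what such validation might look like -- the `Node`
shape and field names here are hypothetical stand-ins, not RyuJIT's actual
types -- properties 2 and 3 can be checked with a single forward walk of the
threading:

```C#
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a linearly threaded LIR node.
class Node
{
    public Node Next;                              // the linear threading
    public List<Node> Operands = new List<Node>(); // edges to the defs this node uses
    public bool ProducesValue;                     // whether this node defines an SDSU temp
}

static class LirValidation
{
    // Checks that every SDSU temp is used at most once and that each def
    // appears in the threading before its use (properties 2 and 3 above).
    public static void Validate(Node first)
    {
        var seenDefs = new HashSet<Node>();
        var usedDefs = new HashSet<Node>();
        for (Node n = first; n != null; n = n.Next)
        {
            foreach (Node operand in n.Operands)
            {
                if (!seenDefs.Contains(operand))
                    throw new Exception("use precedes its def");
                if (!usedDefs.Add(operand))
                    throw new Exception("SDSU temp used more than once");
            }
            if (n.ProducesValue)
                seenDefs.Add(n);
        }
    }
}
```

Property 1 could be handled by a standard two-pointer cycle check over the
`Next` links, and property 4 by comparing the set of used defs against the
set of nodes actually visited.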
In the short term, we propose that the LIR be displayed using the linear
dump that was recently added to the JIT. These dumps would be configured
such that all of the information typically present in today’s tree-style
dumps (e.g. node flags, value numbers, etc.) is also present in the linear
dump.

#### 2. Stop generating embedded statements in the rationalizer.

This can be approached in a few different ways:

1. Remove commas from the IR before rationalizing. There is already
   infrastructure in flight to perform this transformation, so this is
   attractive from a dev-hours point of view. Unfortunately, this approach
   carries both throughput and code quality risks due to the addition of
   another pass and the creation of additional local vars.

2. Change the rationalizer such that it simply does not produce embedded
   statements. This requires that the rationalizer either be able to operate
   on both HIR and LIR or simply linearize commas as it goes.

3. Remove embedded statements from the IR with a linearization pass between
   the rationalizer and lowering.

We will move forward with option 2, as it is the most attractive from a
throughput and code quality standpoint: it does not add additional passes
(as does option 3), nor does it introduce additional local vars (as does
option 1).

#### 3. Refactor decomposition and lowering to work with linear LIR.

The bulk of the work in this step involves moving these passes from
statement- and tree-ordered walks to linear walks and refactoring any code
that uses the parent stack to instead use a helper that can calculate the
def-to-use edge for a node. It will also be necessary to replace embedded
statement insertion with simple linear IR insertion, which is
straightforward.

##### 3.i. Decomposition

Transitioning decomposition should be rather simple. This pass walks the
nodes of each statement in execution order, decomposing 64-bit operations
into equivalent sequences of 32-bit operations as it goes. Critically, the
rewrites performed by decomposition are universally expansions of a single
operator into a contiguous sequence of operators whose results are combined
to form the ultimate result. This sort of transformation is easily performed
on linear IR by inserting the new nodes before the node being replaced.
Furthermore, because nodes will appear in the linear walk in the same
relative order in which they appear in the execution-order tree walk,
rewrites performed in linear order will occur in the same order in which
they were originally performed.

##### 3.ii. Lowering

The picture for lowering is much the same as the picture for decomposition.
Like decomposition, all rewrites are performed in execution order, and most
rewrites are simple expansions of a single node into multiple nodes. There
is one notable exception, however: the lowering of add nodes for the x86
architecture examines the parent stack provided by tree walks in order to
defer lowering until it may be possible to form an address mode. In this
case, the use of the parent stack can be replaced with a helper that finds
the node that uses the def produced by the add node (a sketch of such a
helper follows section 3.iii).

##### 3.iii. General issues

Both decomposition and lowering make use of a utility to fix up information
present in the per-call-node side table that tracks extra information about
the call’s arguments when necessary. This helper can be replaced by instead
fixing up the side table entries when the call is visited by each pass.
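To make the def-to-use helper mentioned in sections 3 and 3.ii concrete,
here is a minimal sketch, reusing the hypothetical `Node` shape from the
validation sketch above. Because temps are SDSU, a forward scan from the def
finds the unique consumer:

```C#
// Hypothetical sketch of the def-to-use helper: scan forward through the
// linear threading from a def until we reach the node that consumes it.
static Node FindUse(Node def)
{
    for (Node n = def.Next; n != null; n = n.Next)
        foreach (Node operand in n.Operands)
            if (operand == def)
                return n; // SDSU: the first consumer is the only consumer
    return null; // the def is an unused value
}
```

For the x86 address-mode case in 3.ii, lowering of an add would call such a
helper to see whether the add's value feeds a memory operand before deciding
to fold it into an addressing mode.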
#### 4. Remove uses of statements and the tree-order invariant from LSRA.

LSRA currently depends on the tree order invariant so that it can build a
stack to track data associated with the defs consumed by an operator.
Removing the tree order invariant requires that LSRA be able to accommodate
situations where the contents of this stack, as produced by a linear walk,
would no longer contain correct information due to the insertion of new
nodes that produce values that are live across the operator and some subset
of its operands. Analysis indicates that a simple map from defs to the
necessary information should be sufficient.

#### 5. Remove uses of statements from the rest of the backend.

The rest of the backend--i.e. codegen--has no known dependencies on tree
order, so there are no concerns with respect to the change in ordering
semantics. However, statement nodes are used to derive IL offsets for debug
information. There are a number of alternative methods of tracking IL
offsets to consider:

1. Insert “IL offset” nodes into the linearly-ordered LIR. These nodes would
   then be noted by code generation and used to emit the necessary
   IP-mapping entries in the same way that statements are today. This has
   the disadvantage of requiring additional work when performing code motion
   on LIR if the IL offset for a particular node needs to be kept up to
   date. However, today’s backend does not perform this sort of code motion,
   and even if it did, optimized debugging is not currently a scenario, so a
   loss of debugging fidelity may be acceptable.

2. Use a side table to map from each node to its IL offset. Although this
   increases the total working set of the backend, it has the advantage of
   making it easy to keep IL offsets correct in the face of code motion.

3. Add IL offset information directly to each node. This has the
   disadvantage of increasing the working set of the entire compiler.

Unless we expect to require correct debug information in the face of code
motion in the future, our recommendation is the first option, which comes at
minimal size and implementation cost.

#### 6. Remove contained nodes from the LIR’s execution order.

Once statements are removed from the LIR, there is one major issue left to
address: the presence of contained nodes in the LIR’s execution order. The
execution of these nodes logically happens as part of the node in which they
are contained, but the nodes remain physically present in the IR at their
original locations. As a result, the backend must often take care to skip
contained nodes when manipulating the LIR in ways that must take execution
order into account. Instead of leaving such nodes in the LIR, these nodes
should be removed from execution order and instead be represented as
(probably unordered) trees referenced only by the containing node.

### Conclusion and Future Directions

The changes suggested in this paper move RyuJIT’s LIR from a linear view of
tree-ordered nodes, with certain nodes represented only in execution order,
to a linearly ordered sequence of nodes. Furthermore, with this change, tree
edges in LIR would represent either uses of SDSU temps that have a place in
execution order (where the edge points from the use of the temp to the def)
or uses of unordered expression trees that execute as part of the parent
node. This form should be easier to work with, due to the removal of the
tree order constraint and embedded statements, and due to its similarity to
other linear IR designs.