Diffstat (limited to 'Documentation/design-docs')
-rw-r--r--  Documentation/design-docs/first-class-structs.md            651
-rw-r--r--  Documentation/design-docs/inline-size-estimates.md          526
-rw-r--r--  Documentation/design-docs/inlining-plans.md                 438
-rw-r--r--  Documentation/design-docs/longs-on-32bit-arch.md            117
-rw-r--r--  Documentation/design-docs/removing-embedded-statements.md   180
5 files changed, 1912 insertions, 0 deletions
diff --git a/Documentation/design-docs/first-class-structs.md b/Documentation/design-docs/first-class-structs.md
new file mode 100644
index 0000000000..fd6a3762c4
--- /dev/null
+++ b/Documentation/design-docs/first-class-structs.md
@@ -0,0 +1,651 @@
+First Class Structs
+===================
+
+Objectives
+----------
+Primary Objectives
+* Avoid forcing structs to the stack if they are only assigned to/from, or passed to/returned
+ from a call or intrinsic
+ - Including SIMD types as well as other pointer-sized-or-less struct types
+ - Enable enregistration of structs that have no field accesses
+* Optimize these types as effectively as any other basic type
+ - Value numbering, especially for types that are used in intrinsics (e.g. SIMD)
+ - Register allocation
+
+Secondary Objectives
+* No “swizzling” or lying about struct types – they are always struct types
+ - No confusing use of GT_LCL_FLD to refer to the entire struct as a different type
+
+Struct-Related Issues in RyuJIT
+-------------------------------
+The following issues illustrate some of the motivation for improving the handling of value types
+(structs) in RyuJIT:
+
+* VSO Bug 98404: .NET JIT x86 - poor code generated for value type initialization
+ * This is a simple test case that should simply generate `xor eax, eax; ret` on x86 and x64, but
+ instead generates many unnecessary copies. It is addressed by full enregistration of
+ structs that fit into a register:
+
+```C#
+struct foo { public byte b1, b2, b3, b4; }
+static foo getfoo() { return new foo(); }
+```
+
+* [\#1133 JIT: Excessive copies when inlining](https://github.com/dotnet/coreclr/issues/1133)
+ * The scenario given in this issue involves a struct that is larger than 8 bytes, so
+ it is not impacted by the fixed-size types. However, by enabling assertion propagation
+ for struct types (which, in turn is made easier by using normal assignments), the
+ excess copies can be eliminated.
+ * Note that these copies are not generated when passing and returning scalar types,
+ and it may be worth considering (in future) whether we can avoid adding them
+ in the first place.
+
+* [\#1161 RyuJIT properly optimizes structs with a single field if the field type is int but not if it is double](https://github.com/dotnet/coreclr/issues/1161)
+ * This issue arises because we never promote a struct with a single double field, due to
+ the fact that such a struct may be passed or returned in a general purpose register.
+ This issue could be addressed independently, but should "fall out" of improved heuristics
+ for when to promote and enregister structs.
+
+* [\#1636 Add optimization to avoid copying a struct if passed by reference and there are no
+ writes to and no reads after passed to a callee](https://github.com/dotnet/coreclr/issues/1636).
+ * This issue is nearly the same as the above, except that in this case the desire is to
+ eliminate unneeded copies locally (i.e. not just due to inlining), in the case where
+ the struct may or may not be passed or returned directly.
+ * Unfortunately, there is not currently a scenario or test case for this issue.
+
+* [\#3144 Avoid marking tmp as DoNotEnregister in tmp=GT_CALL() where call returns a
+ enregisterable struct in two return registers](https://github.com/dotnet/coreclr/issues/3144)
+ * This issue could be addressed without First Class Structs. However,
+ it will be easier with struct assignments that are normalized as regular assignments, and
+ should be done along with the streamlining of the handling of ABI-specific struct passing
+ and return values.
+
+* [\#3539 RyuJIT: Poor code quality for tight generic loop with many inlineable calls](https://github.com/dotnet/coreclr/issues/3539)
+(factor x8 slower than non-generic few calls loop).
+ * I am still investigating this issue.
+
+* [\#5556 RuyJIT: structs in parameters and enregistering](https://github.com/dotnet/coreclr/issues/5556)
+ * This also requires further investigation; addressing it requires us to "Add support in prolog to extract fields, and
+ remove the restriction of not promoting incoming reg structs that have more than one field" - see [Dependent Work Items](https://github.com/dotnet/coreclr/blob/master/Documentation/design-docs/first-class-structs.md#dependent-work-items)
+
+Normalizing Struct Types
+------------------------
+We would like to facilitate full enregistration of structs with the following properties:
+1. Its fields are infrequently accessed, and
+2. The entire struct fits into a register, and
+3. Its value is used or defined in a register
+(i.e. as an argument to or return value from calls or intrinsics).
+
+In RyuJIT, the concept of a type is very simplistic (which helps support the high throughput
+of the JIT). Rather than a symbol table to hold the properties of a type, RyuJIT primarily
+deals with types as simple values of an enumeration. When more detailed information is
+required about the structure of a type, we query the type system, across the JIT/EE interface.
+This is generally done only during the importer (translation from MSIL to the RyuJIT IR), and
+during struct promotion analysis. As a result, struct types are treated as an opaque type
+(TYP_STRUCT) of unknown size and structure.
+
+In order to treat fully-enregisterable struct types as "first class" types in RyuJIT, we
+ create new types with fixed size and structure:
+* TYP_SIMD8, TYP_SIMD12, TYP_SIMD16 and (where supported by the target) TYP_SIMD32
+ - These types already exist, and represent some already-completed steps toward First Class Structs.
+* TYP_STRUCT1, TYP_STRUCT2, TYP_STRUCT4, TYP_STRUCT8 (on 64-bit systems)
+ - These types are new, and will be used where struct types of the given size are passed and/or
+ returned in registers.
+
+We want to identify and normalize these types early in the compiler, before any decisions are
+made regarding whether they are constrained to live on the stack and whether and how they are
+promoted (scalar replaced) or copied.
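+
+As a rough illustration, the importer-time normalization could amount to a simple mapping
+from a struct's exact size to one of the new fixed-size types. The sketch below is purely
+illustrative (the helper name and the standalone enum are invented; the real code would use
+the JIT's `var_types` enumeration):
+
+```C++
+// Minimal sketch (not actual RyuJIT code): map a struct's exact size to one
+// of the proposed fixed-size struct types. The enum stands in for the JIT's
+// var_types; TYP_STRUCT1/2/4/8 are the new types proposed above.
+enum var_types_sketch { TYP_STRUCT, TYP_STRUCT1, TYP_STRUCT2, TYP_STRUCT4, TYP_STRUCT8 };
+
+var_types_sketch NormalizeStructType(unsigned structSize, bool is64Bit)
+{
+    switch (structSize)
+    {
+        case 1:  return TYP_STRUCT1;
+        case 2:  return TYP_STRUCT2;
+        case 4:  return TYP_STRUCT4;
+        case 8:  return is64Bit ? TYP_STRUCT8 : TYP_STRUCT;
+        default: return TYP_STRUCT; // sizes 3, 5-7, or larger stay opaque
+    }
+}
+```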
+
+One issue that arises is that it becomes necessary to know the size of any struct type that
+we encounter, even if we may not actually need to know the size in order to generate code.
+The major cause of additional queries seems to be for field references. It is possible to
+defer some of these cases. I don't know what the throughput impact will be to always do the
+normalization, but in principle I think it is worth doing because the alternative would be
+to transform the types later (e.g. during morph) and use a contextual tree walk to see if we
+care about the size of the struct. That would likely be a messier analysis.
+
+Current Struct IR Phase Transitions
+-----------------------------------
+
+There are three phases in the JIT that make changes to the representation of struct tree
+nodes and lclVars:
+
+* Importer
+ * All struct type lclVars have TYP_STRUCT
+ * All struct assignments/inits are block ops
+ * All struct call args are ldobj
+ * Other struct nodes have TYP_STRUCT
+* Struct promotion
+ * Fields of promoted structs become separate lclVars (scalar promoted) with primitive types
+* Global morph
+ * All struct nodes are transformed to block ops
+ - Besides call args
+ * Some promoted structs are forced to stack
+ - Become “dependently promoted”
+ * Call args
+ - Morphed to GT_LCL_FLD if passed in a register
+ - Treated in various ways otherwise (inconsistent)
+
+Proposed Approach
+-----------------
+The most fundamental change with first class structs is that struct assignments become
+just a special case of assignment. The existing block ops (GT_INITBLK, GT_COPYBLK,
+ GT_COPYOBJ, GT_LDOBJ) are eliminated. Instead, the block operations in the incoming MSIL
+ are translated into assignments to or from a new GT_OBJ node.
+
+New fixed-size struct types are added: (TYP_STRUCT[1|2|4|8]), which are somewhat similar
+to the (existing) SIMD types (TYP_SIMD[8|16|32]). As is currently done for the SIMD types,
+these types are normalized in the importer.
+
+Conceptually, struct nodes refer to the object, not the address. This is important, as
+the existing block operations all take address operands, meaning that any lclVar involved
+in an assignment (including initialization) will be in an address-taken context in the JIT,
+requiring special analysis to identify the cases where the address is only taken in order
+to assign to or from the lclVar. This further allows for consistency in the treatment of
+structs and simple types - even potentially enabling optimizations of non-enregisterable
+structs.
+
+### Struct promotion
+
+* Struct promotion analysis
+ * Aggressively promote pointer-sized fields of structs used as args or returns
+ * Consider FULL promotion of pointer-size structs
+ * If there are fewer field references than calls or returns
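+
+Purely as a sketch of the heuristic described above (the function and its inputs are
+hypothetical, not an existing RyuJIT API):
+
+```C++
+// Sketch: fully promote a pointer-sized (or smaller) struct when its uses as a
+// whole value (call arguments and returns) outnumber accesses to its fields.
+bool ShouldFullyPromote(unsigned structSize, unsigned fieldRefCount, unsigned callOrReturnUses)
+{
+    return (structSize <= sizeof(void*)) && (fieldRefCount < callOrReturnUses);
+}
+```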
+
+### Assignments
+* Struct assignments look like any other assignment
+* GenTreeAsg (GT_ASG) extends GenTreeOp with:
+
+```C++
+// True if this assignment is a volatile memory operation.
+bool IsVolatile() const { return (gtFlags & GTF_BLK_VOLATILE) != 0; }
+bool gtAsgGcUnsafe;
+
+// What code sequence we will be using to encode this operation.
+enum
+{
+ AsgKindInvalid,
+ AsgKindDirect,
+ AsgKindHelper,
+ AsgKindRepInstr,
+ AsgKindUnroll,
+} gtAsgKind;
+```
+
+### Struct “objects” as lvalues
+* Lhs of a struct assignment is a block node or lclVar
+* Block nodes represent the address and “shape” info formerly on the block copy:
+ * GT_BLK and GT_STORE_BLK (GenTreeBlk)
+ * Has a (non-tree node) size field
+ * Addr() is op1
+ * Data() is op2
+ * GT_OBJ and GT_STORE_OBJ (GenTreeObj extends GenTreeBlk)
+ * gtClass, gtGcPtrs, gtGcPtrCount, gtSlots
+ * GT_DYN_BLK and GT_STORE_DYN_BLK (GenTreeDynBlk extends GenTreeBlk)
+ * Additional child gtDynamicSize
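+
+For reference, here is a rough C++ sketch of the block-node shapes just listed. The field and
+accessor names follow the bullets above, but the declarations themselves are illustrative, not
+the actual GenTree definitions:
+
+```C++
+#include <cstdint>
+
+// Illustrative sketch only. GenTreeBlk carries the (non-tree) size plus the
+// address and data operands; GenTreeObj adds GC layout; GenTreeDynBlk adds a
+// dynamically computed size operand.
+struct GenTree     { /* common node fields elided */ };
+struct GenTreeOp : GenTree { GenTree* gtOp1; GenTree* gtOp2; };
+
+struct GenTreeBlk : GenTreeOp              // GT_BLK / GT_STORE_BLK
+{
+    unsigned gtBlkSize;                    // size is a plain field, not a tree node
+    GenTree* Addr() { return gtOp1; }      // address of the struct "object"
+    GenTree* Data() { return gtOp2; }      // value being stored (for the store forms)
+};
+
+struct GenTreeObj : GenTreeBlk             // GT_OBJ / GT_STORE_OBJ
+{
+    void*    gtClass;                      // class handle
+    uint8_t* gtGcPtrs;                     // GC layout of the struct
+    unsigned gtGcPtrCount;
+    unsigned gtSlots;
+};
+
+struct GenTreeDynBlk : GenTreeBlk          // GT_DYN_BLK / GT_STORE_DYN_BLK
+{
+    GenTree* gtDynamicSize;                // additional child: size computed at runtime
+};
+```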
+
+### Struct “objects” as rvalues
+After morph, structs on rhs of assignment are either:
+* The tree node for the object: e.g. call, retExpr
+* GT_IND of an address (e.g. GT_LEA)
+
+The lhs provides the “shape” for the assignment. Note: it has been suggested that these could
+remain as GT_BLK nodes, but I have not given that any deep consideration.
+
+### Preserving Struct Types in Trees
+
+Prior to morphing, all nodes that may represent a struct type will have a class handle.
+After morphing, some will become GT_IND.
+
+### Structs As Call Arguments
+
+All struct args imported as GT_OBJ, transformed as follows during morph:
+* P_FULL promoted locals:
+ * Remain as GT_LCL_VAR nodes, with the appropriate fixed-size struct type.
+ * Note that these may or may not be passed in registers.
+* P_INDEP promoted locals:
+ * These are the ones where the fields don’t match the reg types; a new node
+ (GT_STRUCT, or something similar) aggregates multiple fields into a single register
+ * Op1 is a lclVar for the first promoted field
+ * Op2 is the lclVar for the next field, OR another GT_STRUCT
+ * Bit offset for the second child
+* All other cases (non-locals, OR P_DEP or non-promoted locals):
+ * GT_LIST of GT_IND for each half
+
+### Struct Return
+
+The return of a struct value from the current method is represented as follows:
+* GT_RET(GT_OBJ) initially
+* GT_OBJ morphed, and then transformed similarly to call args
+
+Proposed Struct IR Phase Transitions
+------------------------------------
+
+* Importer
+ * Struct assignments are imported as GT_ASG
+ * Struct type is normalized to TYP_STRUCT* or TYP_SIMD*
+* Struct promotion
+ * Fields of promoted structs become separate lclVars (as is)
+ * Enregisterable structs (including Pair Types) may be promoted to P_FULL (i.e. fully enregistered)
+ * As a future optimization, we may "restructure" multi-register argument or return values as a
+ synthesized struct of appropriately typed fields, which can then be promoted in the normal manner.
+* Global morph
+ * All struct type local variables remain as simple GT_LCL_VAR nodes.
+ * All other struct nodes are transformed to GT_IND (rhs of assignment) or remain as GT_OBJ
+ * In Lowering, GT_OBJ will be changed to GT_BLK if there are no gc ptrs. This could be done
+ earlier, but there are some places where the object pointer is desired.
+ * It is not actually clear if there is a great deal of value in the GT_BLK, but it was added
+ to be more compatible with existing code that expects block copies with gc pointers to be
+ distinguished from those that do not.
+ * Promoted structs are forced to stack ONLY if address taken
+ * Call arguments
+ * Fixed-size enregisterable structs: GT_LCL_VAR or GT_OBJ of appropriate type.
+ * Multi-register arguments: GT_LIST of register-sized operands:
+ * GT_LCL_VAR if there is a promoted field that exactly matches the register size and type
+ (note that, if we have performed the optimization mentioned above in struct promotion,
+ we may have a GT_LCL_VAR of a synthesized struct field).
+ * GT_LCL_FLD if there is a matching field in the struct that has not been promoted.
+ * GT_IND otherwise. Note that if this is a struct local that does not have a matching field,
+ this will force the local to live on the stack.
+* Lowering
+ * Pair types (e.g. TYP_LONG on 32-bit targets) are decomposed as needed to expose register requirements.
+ Although these are not strictly structs, their handling is similar.
+ * Computations are decomposed into their constituent parts when they independently write
+ separate registers.
+ * TYP_LONG lclVars (and TYP_DOUBLE on ARM) are split (similar to promotion/scalar replacement of
+ structs) if and only if they are register candidates.
+ * Other TYP_LONG/TYP_DOUBLE lclVars are loaded into independent registers either via:
+ * Single GT_LCL_VAR that will translate into a pair load instruction (ldp), with two register
+ targets, or
+ * GT_LCL_FLD (current approach) or GT_IND (probably a better approach)
+ * Calls and loads that target multiple registers
+ * Existing gtLsraInfo has the capability to specify multiple destination registers
+ * Additional work is required in LSRA to handle these correctly
+ * If HFAs can be return values (not just call args), then we may need to support up to 4 destination
+ registers for LSRA
+
+Sample IR
+---------
+### Bug 98404
+#### Before
+
+The `getfoo` method initializes a struct of 4 bytes.
+The dump of the (single) local variable is included to show the change from `struct (8)` to
+`struct4`, as the "exact size" of the struct is 4 bytes.
+Here is the IR after Import:
+
+```
+; V00 loc0 struct ( 8)
+
+ ▌ stmtExpr void (top level) (IL 0x000... ???)
+ │ ┌──▌ const int 4
+ └──▌ initBlk void
+ │ ┌──▌ const int 0
+ └──▌ <list> void
+ └──▌ addr byref
+ └──▌ lclVar struct V00 loc0
+
+ ▌ stmtExpr void (top level) (IL 0x008... ???)
+ └──▌ return int
+ └──▌ lclFld int V00 loc0 [+0]
+```
+This is how it currently looks just before code generation:
+```
+ ▌ stmtExpr void (top level) (IL 0x000...0x003)
+ │ ┌──▌ const int 0 REG rax $81
+ │ ├──▌ &lclVar byref V00 loc0 d:3 REG NA
+ └──▌ storeIndir int REG NA
+
+ ▌ stmtExpr void (top level) (IL 0x008...0x009)
+ │ ┌──▌ lclFld int V00 loc0 u:3[+0] (last use) REG rax $180
+ └──▌ return int REG NA $181
+```
+And here is the resulting code:
+```
+ push rax
+ xor rax, rax
+ mov qword ptr [V00 rsp], rax
+ xor eax, eax
+ mov dword ptr [V00 rsp], eax
+ mov eax, dword ptr [V00 rsp]
+ add rsp, 8
+ ret
+```
+#### After
+Here is the IR after Import with the prototype First Class Struct changes.
+Note that the fixed-size struct variable is assigned and returned just as for a scalar type.
+
+```
+; V00 loc0 struct4
+
+ ▌ stmtExpr void (top level) (IL 0x000... ???)
+ │ ┌──▌ const int 0
+ └──▌ = struct4 (init)
+ └──▌ lclVar struct4 V00 loc0
+
+
+ ▌ stmtExpr void (top level) (IL 0x008... ???)
+ └──▌ return struct4
+ └──▌ lclVar struct4 V00 loc0
+```
+And here is how it looks just prior to code generation:
+```
+ ▌ stmtExpr void (top level) (IL 0x008...0x009)
+ │ ┌──▌ const struct4 0 REG rax $81
+ └──▌ return struct4 REG NA $140
+```
+Finally, here is the code that we were hoping to achieve:
+```
+ xor eax, eax
+```
+
+### Issue 1133:
+#### Before
+
+Here is the IR after Inlining for the `TestValueTypesInInlinedMethods` method that invokes a
+sequence of methods that are inlined, creating a sequence of copies.
+Because this struct type does not fit into a single register, the types do not change (and
+therefore the local variable table is not shown).
+
+```
+ ▌ stmtExpr void (top level) (IL 0x000...0x003)
+ │ ┌──▌ const int 16
+ └──▌ initBlk void
+ │ ┌──▌ const int 0
+ └──▌ <list> void
+ └──▌ addr byref
+ └──▌ lclVar struct V00 loc0
+
+ ▌ stmtExpr void (top level) (IL 0x008... ???)
+ │ ┌──▌ const int 16
+ └──▌ copyBlk void
+ │ ┌──▌ addr byref
+ │ │ └──▌ lclVar struct V00 loc0
+ └──▌ <list> void
+ └──▌ addr byref
+ └──▌ lclVar struct V01 tmp0
+
+ ▌ stmtExpr void (top level) (IL 0x008... ???)
+ │ ┌──▌ const int 16
+ └──▌ copyBlk void
+ │ ┌──▌ addr byref
+ │ │ └──▌ lclVar struct V01 tmp0
+ └──▌ <list> void
+ └──▌ addr byref
+ └──▌ lclVar struct V02 tmp1
+
+ ▌ stmtExpr void (top level) (IL 0x008... ???)
+ │ ┌──▌ const int 16
+ └──▌ copyBlk void
+ │ ┌──▌ addr byref
+ │ │ └──▌ lclVar struct V02 tmp1
+ └──▌ <list> void
+ └──▌ addr byref
+ └──▌ lclVar struct V03 tmp2
+
+ ▌ stmtExpr void (top level) (IL 0x008... ???)
+ └──▌ call help long HELPER.CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE
+ ├──▌ const long 0x7ff918494e10
+ └──▌ const int 1
+
+ ▌ stmtExpr void (top level) (IL 0x008... ???)
+ │ ┌──▌ const int 16
+ └──▌ copyBlk void
+ │ ┌──▌ addr byref
+ │ │ └──▌ lclVar struct V03 tmp2
+ └──▌ <list> void
+ │ ┌──▌ const long 8 Fseq[#FirstElem]
+ └──▌ + byref
+ └──▌ field ref s_dt
+
+ ▌ stmtExpr void (top level) (IL 0x00E... ???)
+ └──▌ return void
+```
+And here is the resulting code:
+```
+sub rsp, 104
+xor rax, rax
+mov qword ptr [V00 rsp+58H], rax
+mov qword ptr [V00+0x8 rsp+60H], rax
+xor rcx, rcx
+lea rdx, bword ptr [V00 rsp+58H]
+vxorpd ymm0, ymm0
+vmovdqu qword ptr [rdx], ymm0
+vmovdqu ymm0, qword ptr [V00 rsp+58H]
+vmovdqu qword ptr [V01 rsp+48H]ymm0, qword ptr
+vmovdqu ymm0, qword ptr [V01 rsp+48H]
+vmovdqu qword ptr [V02 rsp+38H]ymm0, qword ptr
+vmovdqu ymm0, qword ptr [V02 rsp+38H]
+vmovdqu qword ptr [V03 rsp+28H]ymm0, qword ptr
+mov rcx, 0x7FF918494E10
+mov edx, 1
+call CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE
+mov rax, 0x1FAC6EB29C8
+mov rax, gword ptr [rax]
+add rax, 8
+vmovdqu ymm0, qword ptr [V03 rsp+28H]
+vmovdqu qword ptr [rax], ymm0
+add rsp, 104
+ret
+```
+
+#### After
+After fginline:
+(note that the obj node will become a blk node downstream).
+```
+ ▌ stmtExpr void (top level) (IL 0x000...0x003)
+ │ ┌──▌ const int 0
+ └──▌ = struct (init)
+ └──▌ lclVar struct V00 loc0
+
+ ▌ stmtExpr void (top level) (IL 0x008... ???)
+ │ ┌──▌ lclVar struct V00 loc0
+ └──▌ = struct (copy)
+ └──▌ lclVar struct V01 tmp0
+
+ ▌ stmtExpr void (top level) (IL 0x008... ???)
+ │ ┌──▌ lclVar struct V01 tmp0
+ └──▌ = struct (copy)
+ └──▌ lclVar struct V02 tmp1
+
+ ▌ stmtExpr void (top level) (IL 0x008... ???)
+ │ ┌──▌ lclVar struct V02 tmp1
+ └──▌ = struct (copy)
+ └──▌ lclVar struct V03 tmp2
+
+ ▌ stmtExpr void (top level) (IL 0x008... ???)
+ └──▌ call help long HELPER.CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE
+ ├──▌ const long 0x7ff9184b4e10
+ └──▌ const int 1
+
+ ▌ stmtExpr void (top level) (IL 0x008... ???)
+ │ ┌──▌ lclVar struct V03 tmp2
+ └──▌ = struct (copy)
+ └──▌ obj(16) struct
+ │ ┌──▌ const long 8 Fseq[#FirstElem]
+ └──▌ + byref
+ └──▌ field ref s_dt
+
+ ▌ stmtExpr void (top level) (IL 0x00E... ???)
+ └──▌ return void
+```
+Here is the IR after fgMorph:
+Note that copy propagation has propagated the zero initialization through to the final store.
+```
+ ▌ stmtExpr void (top level) (IL 0x000...0x003)
+ │ ┌──▌ const int 0
+ └──▌ = struct (init)
+ └──▌ lclVar struct V00 loc0
+
+ ▌ stmtExpr void (top level) (IL 0x008... ???)
+ │ ┌──▌ const struct 0
+ └──▌ = struct (init)
+ └──▌ lclVar struct V01 tmp0
+
+ ▌ stmtExpr void (top level) (IL 0x008... ???)
+ │ ┌──▌ const struct 0
+ └──▌ = struct (init)
+ └──▌ lclVar struct V02 tmp1
+
+ ▌ stmtExpr void (top level) (IL 0x008... ???)
+ │ ┌──▌ const struct 0
+ └──▌ = struct (init)
+ └──▌ lclVar struct V03 tmp2
+
+ ▌ stmtExpr void (top level) (IL 0x008... ???)
+ └──▌ call help long HELPER.CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE
+ ├──▌ const long 0x7ffc8bbb4e10
+ └──▌ const int 1
+
+ ▌ stmtExpr void (top level) (IL 0x008... ???)
+ │ ┌──▌ const struct 0
+ └──▌ = struct (init)
+ └──▌ obj(16) struct
+ │ ┌──▌ const long 8 Fseq[#FirstElem]
+ └──▌ + byref
+ └──▌ indir ref
+ └──▌ const(h) long 0x2425b6229c8 static Fseq[s_dt]
+
+ ▌ stmtExpr void (top level) (IL 0x00E... ???)
+ └──▌ return void
+
+```
+After liveness analysis the dead stores have been eliminated:
+```
+ ▌ stmtExpr void (top level) (IL 0x008... ???)
+ └──▌ call help long HELPER.CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE
+ ├──▌ const long 0x7ffc8bbb4e10
+ └──▌ const int 1
+
+ ▌ stmtExpr void (top level) (IL 0x008... ???)
+ │ ┌──▌ const struct 0
+ └──▌ = struct (init)
+ └──▌ obj(16) struct
+ │ ┌──▌ const long 8 Fseq[#FirstElem]
+ └──▌ + byref
+ └──▌ indir ref
+ └──▌ const(h) long 0x2425b6229c8 static Fseq[s_dt]
+
+ ▌ stmtExpr void (top level) (IL 0x00E... ???)
+ └──▌ return void
+```
+And here is the resulting code, going from a code size of 129 bytes down to 58.
+```
+sub rsp, 40
+mov rcx, 0x7FFC8BBB4E10
+mov edx, 1
+call CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE
+xor rax, rax
+mov rdx, 0x2425B6229C8
+mov rdx, gword ptr [rdx]
+add rdx, 8
+vxorpd ymm0, ymm0
+vmovdqu qword ptr [rdx], ymm0
+add rsp, 40
+ret
+```
+
+Work Items
+----------
+This is a preliminary breakdown of the work into somewhat separable tasks. Those whose descriptions
+are prefaced by '*' have been prototyped in an earlier version of the JIT, and that work is now
+being re-integrated and tested, but may require some cleanup and/or phasing with other work items
+before a PR is submitted.
+
+### Mostly-Independent work items
+1. *Replace block ops with assignments & new nodes.
+
+2. *Add new fixed-size types, and normalize them in the importer (might be best to do this with or after #1, but not really dependent)
+
+3. LSRA
+ * Enable support for multiple destination regs for call nodes that return a struct in multiple
+ registers (for x64/ux, and for arm)
+ * Handle multiple destination regs for ldp on arm64 (could be done before or concurrently with the above).
+ Note that this work item is specifically intended for call arguments. It is likely the case that
+ utilizing ldp for general-purpose code sequences would be handled separately.
+
+4. X64/ux: aggressively promote lclVar struct incoming or outgoing args with two 8-byte fields
+
+5. X64/ux:
+ * modify the handling of multireg struct args to use GT_LIST of GT_IND
+ * remove the restriction to NOT promote things that are multi-reg args, as long as they match (i.e. two 8-byte fields).
+ Pass those using GT_LIST of GT_LCL_VAR.
+ * stop adding extra lclVar copies
+
+6. Arm64:
+ * Promote 16-byte struct lclVars that are incoming or outgoing register arguments only if they have 2 8-byte fields (DONE).
+ Pass those using GT_LIST of GT_LCL_VAR (as above for x64/ux).
+ Note that, if the types do not match, e.g. a TYP_DOUBLE field that will be passed in an integer register,
+ it will require special handling in Lowering and LSRA, as is currently done in the TYP_SIMD8 case.
+ * For other cases, pass as GT_LIST of GT_IND (DONE)
+ * The GT_LIST would be created in fgMorphArgs(). Then in Lower, putarg_reg nodes will be inserted between
+ the GT_LIST and the list item (GT_LCL_VAR or GT_IND). (DONE)
+ * Add support for HFAs.
+
+### Dependent work items
+
+7. *(Depends on 1 & 2): Fully enregister TYP_STRUCT[1|2|4|8] with no field accesses.
+
+8. *(Depends on 1 & 2): Enable value numbering and assertion propagation for struct types.
+
+9. (Depends on 1 & 2, mostly to avoid conflicts): Add support in prolog to extract fields, and
+ remove the restriction of not promoting incoming reg structs that have more than one field.
+ Note that SIMD types are already reassembled in the prolog.
+
+10. (Not really dependent, but probably best done after 1, 2, 5, 6): Add support for assembling
+ non-matching fields into registers for call args and returns. This includes producing the
+ appropriate IR, which may simply be shifts and or's of the appropriate fields (see the sketch after this list).
+ This would either be done during `fgMorphArgs()` and the `GT_RETURN` case of `fgMorphSmpOp()`
+ or as described below in
+ [Extracting and Assembling Structs](#Extract-Assemble).
+
+11. (Not really dependent, but probably best done after 1, 2, 5, 6): Add support for extracting the fields for the
+ returned struct value of a call, producing the appropriate IR, which may simply be shifts and
+ and's.
+ This would either be done during the morphing of the call itself, or as described below in
+ [Extracting and Assembling Structs](#Extract-Assemble).
+
+12. (Depends on 3, may replace the second part of 6): For arm64, add support for loading non-promoted
+ or non-local structs with ldp
+ * Either using TYP_STRUCT and special case handling, OR adding TYP_STRUCT16
+
+13. (Depends on 7, 9, 10, 11): Enhance struct promotion to allow full enregistration of structs,
+ even if some fields are accessed, if there are more call/return references than field references.
+ This work item should address issue #1161, by removing the automatic non-promotion
+ of structs with a single double field, and adding appropriate heuristics for when it
+ should be allowed.
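+
+As a concrete illustration of the "shifts and or's" mentioned in items 10 and 11, here is a
+sketch of assembling a struct with two non-matching 4-byte fields (say a `float` and an `int`)
+into the single 8-byte integer register it would occupy, and extracting the fields back out of
+a returned value. This is only meant to show the bit manipulation the generated IR would
+express; it assumes a little-endian layout and an ABI that passes such a struct in one integer
+register:
+
+```C++
+#include <cstdint>
+#include <cstring>
+
+// Pack { float f; int i; } into one 64-bit integer register value.
+uint64_t AssembleFields(float f, int32_t i)
+{
+    uint32_t fBits;
+    std::memcpy(&fBits, &f, sizeof(fBits));            // reinterpret the float's bits
+    return (uint64_t)fBits | ((uint64_t)(uint32_t)i << 32);
+}
+
+// Unpack the two fields from a 64-bit register value returned by a call.
+void ExtractFields(uint64_t reg, float* f, int32_t* i)
+{
+    uint32_t fBits = (uint32_t)reg;                    // low 32 bits: the float field
+    std::memcpy(f, &fBits, sizeof(*f));
+    *i = (int32_t)(reg >> 32);                         // high 32 bits: the int field
+}
+```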
+
+Related Work Item
+-----------------
+These changes are somewhat orthogonal, though will likely have merge issues if done in parallel with any of
+the above:
+* Unified API for ABI info
+ * Pass/Return info:
+ * Num regs used for passing
+ * Per-slot location (reg num / REG_STK)
+ * Per-slot type (for reg “slots”)
+ * Starting stack slot offset (if passed on stack)
+ * By reference?
+ * etc.
+ * We should be able to unify HFA handling into this model
+ * For arg passing, the API for creating the argEntry should take an arg state that keeps track of
+ what regs have been used, and handles the backfilling case for ARM
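+
+One possible shape for such an API, sketched with invented names (the placeholder typedefs
+stand in for the JIT's existing register and type enumerations):
+
+```C++
+// Hypothetical sketch of a unified ABI passing/return descriptor.
+typedef int regNumber;    // a register number, or REG_STK for a stack slot
+typedef int var_types;    // the type of the value placed in a register "slot"
+
+struct AbiSlotInfo
+{
+    regNumber reg;         // register for this slot, or REG_STK
+    var_types type;        // per-slot type (for reg "slots")
+};
+
+struct AbiPassingInfo
+{
+    unsigned    numRegs;       // number of registers used for passing
+    AbiSlotInfo slots[4];      // per-slot location and type (HFAs may need up to 4)
+    unsigned    stackOffset;   // starting stack slot offset, if passed on the stack
+    bool        passedByRef;   // true if the value is passed by reference
+};
+```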
+
+Open Design Questions
+---------------------
+### <a name="Extract-Assemble"/>Extracting and Assembling Structs
+
+Should the IR for extracting and assembling struct arguments from or to argument or return registers
+be generated directly during the morphing of call arguments and returns, or should this capability
+be handled in a more general fashion in `fgMorphCopyBlock()`?
+The latter seems desirable for its general applicability.
+
+One way to handle this might be:
+
+1. Whenever you have a case of mismatched structs (call args, call node, or return node),
+ create a promoted temp of the "fake struct type", e.g. for arm you would introduce three
+ new temps: one for the struct, and one for each of its TYP_LONG promoted fields.
+2. Add an assignment to or from the temp (e.g. as a setup arg node), BUT the structs on
+ both sides of that assignment can now be promoted.
+3. Add code to fgMorphCopyBlock to handle the extraction and assembling of structs.
+4. The promoted fields of the temp would be preferenced to the appropriate argument or return registers.
diff --git a/Documentation/design-docs/inline-size-estimates.md b/Documentation/design-docs/inline-size-estimates.md
new file mode 100644
index 0000000000..12286c7f5d
--- /dev/null
+++ b/Documentation/design-docs/inline-size-estimates.md
@@ -0,0 +1,526 @@
+# Inline Size Estimates
+
+Note this is work in progress....
+
+Inlining is a heuristic-based optimization. There are some cases where
+it's obvious that a particular inline is good or bad for performance,
+but more typically, things are not that clear. In those cases where
+the benefit is in doubt, the compiler will rely on heuristics, typically
+based on various estimates of an inline's size impact, speed impact,
+or other important factors.
+
+In this writeup we consider approaches to what should be the simplest
+of these estimates: the size impact of an inline.
+
+## Background
+
+There are a number of interesting facets to the inline size estimate
+problem, but let's start with some generalities. Suppose we have some
+size estimate for the caller `C` (ignoring for the moment exactly what
+we are measuring), say `CallerSize`, and some size estimate for the
+callee `E`, say `CalleeSize`. Now suppose we are contemplating what
+would happen if `E` is inlined into `C` to create `C'`. We'd like some
+sort of size estimate `CallerSize'`. The simplest estimate is that
+`CallerSize'` = `CallerSize + CalleeSize`.
+
+```
+(1) CallerSize' = CallerSize + CalleeSize
+```
+
+However, calling conventions impose some additional code overhead on
+both the caller and callee. The caller must set up arguments in
+registers or on the stack, and if there is a return value, might need
+to move it or store it somewhere. It also might need to spill values
+that are stored in caller-saved registers around the call.
+
+The callee must set up a stack frame and possibly spill callee-save
+registers it intends to use, and then ultimately must undo all this.
+
+When we inline `E` into `C`, none of this calling convention setup is
+necessary. So one can imagine that there is some additional size
+savings from inlining, one that depends on the calling convention and
+the number and kind of arguments passed from caller to callee.
+
+```
+(2) CallerSize' = CallerSize + CalleeSize - Overhead
+```
+
+Note that it's entirely possible that `Overhead > CalleeSize`, so
+that `CallerSize' < CallerSize`; that is, the inline not only results
+in faster code but also in smaller code. Indeed this state of affairs
+is increasingly common with the advent of modern programming styles
+that emphasize building functionality out of lots of small procedures.
+
+Alternatively, we can compute the size impact as the size change of `C`
+because of the inline of `E`:
+
+```
+(3) SizeImpact = CallerSize' - CallerSize
+```
+
+or, with a bit of math,
+
+```
+(4) SizeImpact = CalleeSize - Overhead
+```
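+
+Expressed as code, this naive model is just the following (a sketch of equation (4); the
+function name is invented):
+
+```C++
+// Naive model from equations (2) and (4): the predicted size impact depends
+// only on the callee size and the calling-convention overhead, not on the
+// particular caller or call site. A negative result predicts the caller shrinks.
+int PredictSizeImpact(int calleeSize, int overhead)
+{
+    return calleeSize - overhead;
+}
+```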
+
+Now let's look at some actual data to see how well this simple model
+for `SizeImpact` holds up. The data below shows the measured size
+impact of a single inline into various calling methods in mscorlib.
+This data was obtained by first compiling all the methods without
+inlining to obtain a baseline size as measured by the encoded
+instruction size in bytes. The inliner was then modified so it would
+only do a single inline per caller, with the candidate chosen by a
+parameter `k` that could be varied externally. mscorlib was then
+repeatedly compiled with varying values of `k`, and the inlines
+performed and the subsequent modified caller sizes recorded. This data
+was then permuted and histogrammed to show the variety of `SizeImpact`
+values for a fixed `E`. Since the calling convention and number and
+kind of parameters are fixed across all observations for a fixed `E`,
+the simple model above would predict that `SizeImpact` is the same in every
+case.
+
+In this first example, case `E` is `System.Object:.ctor()`, which is
+close to the simplest possible callee -- it takes one argument and has
+no return value. In all the observed cases the `SizeImpact` is
+negative, but it's far from the same in every case.
+
+
+```
+Inlining for System.Object:.ctor():this (size 6)
+Instances 1160 Mean SizeImpact -12.53 Min -52 Max -5
+Distribution
+ less than -50: 1
+ [-50,-41]: 0 1 1 0 0 1 0 0 1 1
+ [-40,-31]: 0 1 0 0 1 0 0 3 1 2
+ [-30,-21]: 1 3 2 2 1 8 76 5 22 18
+ [-20,-11]: 38 23 98 7 29 4 8 27 14 21
+ [-10, -1]: 1 362 373 0 0 3 0 0 0 0
+ [ 0, 9]: 0 0 0 0 0 0 0 0 0 0
+ [ 10, 19]: 0 0 0 0 0 0 0 0 0 0
+ [ 20, 29]: 0 0 0 0 0 0 0 0 0 0
+ [ 30, 39]: 0 0 0 0 0 0 0 0 0 0
+ [ 40, 49]: 0 0 0 0 0 0 0 0 0 0
+greater than 49: 0
+```
+
+It should be evident from this set of observations that `SizeImpact`
+cannot be completely characterized by a simple formula like `(2)`. For
+this inlinee, inlining always saves at least 5 bytes, roughly half the time
+it saves 8 or 9 bytes, on average it saves about 12.5 bytes, but it often
+saves considerably more.
+
+Other inlinees show similar spreads in `SizeImpact`. Here we see a case
+where sometimes the `SizeImpact` is negative and other times it's
+positive.
+
+```
+Inlining for System.Threading.CancellationToken:get_IsCancellationRequested():bool:this (size 29)
+Instances 42 SizeImpact Mean 11.33 Min -20 Max 28
+Distribution
+ less than -50: 0
+ [-50,-41]: 0 0 0 0 0 0 0 0 0 0
+ [-40,-31]: 0 0 0 0 0 0 0 0 0 0
+ [-30,-21]: 0 0 0 0 0 0 0 0 0 0
+ [-20,-11]: 1 0 0 0 0 0 0 1 0 0
+ [-10, -1]: 0 2 0 0 0 0 0 0 1 2
+ [ 0, 9]: 0 0 1 1 0 2 0 1 4 1
+ [ 10, 19]: 1 0 1 0 4 1 2 1 2 6
+ [ 20, 29]: 0 0 2 0 2 0 0 0 3 0
+ [ 30, 39]: 0 0 0 0 0 0 0 0 0 0
+ [ 40, 49]: 0 0 0 0 0 0 0 0 0 0
+greater than 49: 0
+```
+
+Not all inlinee `SizeImpacts` exhibit such wide distributions. Some spread
+just a little:
+
+```
+Inlining for System.Environment:GetResourceString(ref):ref (size 15)
+Instances 2238 SizeImpact Mean 0.01 Min -3 Max 6
+Distribution
+ less than -50: 0
+ [-50,-41]: 0 0 0 0 0 0 0 0 0 0
+ [-40,-31]: 0 0 0 0 0 0 0 0 0 0
+ [-30,-21]: 0 0 0 0 0 0 0 0 0 0
+ [-20,-11]: 0 0 0 0 0 0 0 0 0 0
+ [-10, -1]: 0 0 0 0 0 0 0 7 0 7
+ [ 0, 9]:2212 3 0 5 0 0 4 0 0 0
+ [ 10, 19]: 0 0 0 0 0 0 0 0 0 0
+ [ 20, 29]: 0 0 0 0 0 0 0 0 0 0
+ [ 30, 39]: 0 0 0 0 0 0 0 0 0 0
+ [ 40, 49]: 0 0 0 0 0 0 0 0 0 0
+greater than 49: 0
+```
+
+and some not at all:
+
+```
+Inlining for System.DateTime:get_Ticks():long:this (size 15)
+Instances 129 SizeImpact Mean 0.00 Min 0 Max 0
+Distribution
+ less than -50: 0
+ [-50,-41]: 0 0 0 0 0 0 0 0 0 0
+ [-40,-31]: 0 0 0 0 0 0 0 0 0 0
+ [-30,-21]: 0 0 0 0 0 0 0 0 0 0
+ [-20,-11]: 0 0 0 0 0 0 0 0 0 0
+ [-10, -1]: 0 0 0 0 0 0 0 0 0 0
+ [ 0, 9]: 129 0 0 0 0 0 0 0 0 0
+ [ 10, 19]: 0 0 0 0 0 0 0 0 0 0
+ [ 20, 29]: 0 0 0 0 0 0 0 0 0 0
+ [ 30, 39]: 0 0 0 0 0 0 0 0 0 0
+ [ 40, 49]: 0 0 0 0 0 0 0 0 0 0
+greater than 49: 0
+```
+
+So it appears there must be other -- perhaps many -- factors that
+influence the `SizeImpact` of a particular inline. The question is,
+what factors are important, how do we measure them, and how do they
+combine to influence the overall result? The remainder of this
+document will look into some of the ways we can answer this question.
+
+## Some Ground Rules
+
+Before we move on, however, we should address what kind of information
+will actually be available to observe and use in building a heuristic
+model.
+
+It's not realistic to base the heuristic on the actual encoded size of
+methods. While compiling a method `C`, obtaining the encoded size of
+`C` itself is problematic. It's also quite likely that the encoded size of
+the inlinee `E` is unknown, except possibly in some AOT scenarios
+where one could generally arrange for methods to be compiled in
+bottom-up order (that is, callees before callers). While one could
+imagine spending compile time doing experimental compilations of `C`'s
+and `E`'s to see what code size results, this is probably too costly
+to be practical.
+
+Second, even if we could obtain the actual size of prospective inline
+candidates, we might not want to use this data. The final code
+sequence emitted by the compiler depends intimately on details of the
+target archtecture, runtime conventions (ABIs), and capabilites of the
+compiler phases that run after inlining. If we allow feedback into hey
+heuristics by incorporating data from these "downstream" sources, we
+introduce various forms of coupling that have important
+consequences. For example, it would be challenging to work on a
+downstream phase (say, register allocation) if some trivial change to
+allocation in an inlinee `E` perturbs inlining decisions for other
+methods like `C`. Likewise we may or may not want to allow runtime
+conventions to influence inlining -- if our application can run cross
+platform (say in both Windows and Linux) we might not want the various
+versions to exhibit vastly different inlining patterns.
+
+For that reason, we intend to restrict the information we can use in
+forming heuristics to things that are generally true of the caller and
+callee, and the general capabilities of the downstream phases. For the
+caller and callee this largely means information that can be obtained
+by inspecting either the input to the compiler, or the compiler's IR
+in the stages leading up to inlining, and perhaps (if available) some
+carefully chosen information obtained from actual compilation of the
+callee.
+
+At the same time, we still intend for the predictions made by the
+heuristics to be indicative of the final binary size. We have no other
+reliable source of truth to guide us.
+
+Given all this, we can restate the central question as: how best can
+the compiler estimate the ultimate size impact of an inline while
+restricting itself to considering features that are generally true of
+caller, callee, and the capabilities of the compiler?
+
+## Building a Heuristic Manually
+
+The tried-and-true approach is to build the heuristic manually. There
+are two prongs to the approach: the first is case analysis of actual
+behavior, and the second is modelling based on the compiler writer's
+experience and intuition.
+
+### Some Case Analysis
+
+#### Case 1: Maximal Savings
+
+It's always instructive to look at actual behavior, so let's consider
+some of the cases. From the table above for `System.Object:.ctor()` we
+see a number of cases where there were large savings in byte size. The
+most significant of these is the following:
+
+```
+Inline System.Object:.ctor():this
+into System.IO.UnmanagedMemoryAccessor:.ctor(ref,long,long,int):this
+CalleeSize = 6
+CallerSize = 72
+CallerSize' = 24
+SizeImpact = -48
+```
+
+where `System.Object:.ctor():this` is simply:
+
+```Assembly
+; System.Object:.ctor():this
+; Total bytes of code 6, prolog size 5
+ 0F1F440000 nop
+ C3 ret
+```
+
+As an aside, one might wonder why there are 5 bytes of nops here --
+they are put there to support the *rejit* feature used by the
+application insights framework.
+
+The caller's source code shows it simply delegates immediately to
+another method to initialize the object:
+
+```C#
+ UnmanagedMemoryAccessor(
+ SafeBuffer buffer, Int64 offset,
+ Int64 capacity, FileAccess access) {
+ Initialize(buffer, offset, capacity, access);
+ }
+```
+
+Here's the code before the inline:
+
+```Assembly
+; BEFORE
+; System.IO.UnmanagedMemoryAccessor:.ctor(ref,long,long,int):this
+; Total bytes of code 72, prolog size 10
+ 4156 push r14
+ 57 push rdi
+ 56 push rsi
+ 55 push rbp
+ 53 push rbx
+ 4883EC20 sub rsp, 32
+ 488BF1 mov rsi, rcx
+ 488BFA mov rdi, rdx
+ 498BD8 mov rbx, r8
+ 498BE9 mov rbp, r9
+ 488BCE mov rcx, rsi
+ E800000000 call System.Object:.ctor():this
+ 448B742470 mov r14d, dword ptr [rsp+70H]
+ 4489742470 mov dword ptr [rsp+70H], r14d
+ 488BCE mov rcx, rsi
+ 488BD7 mov rdx, rdi
+ 4C8BC3 mov r8, rbx
+ 4C8BCD mov r9, rbp
+ 488D0500000000 lea rax, [(reloc)]
+ 4883C420 add rsp, 32
+ 5B pop rbx
+ 5D pop rbp
+ 5E pop rsi
+ 5F pop rdi
+ 415E pop r14
+ 48FFE0 rex.jmp rax
+```
+
+and here's the code after the inline:
+
+```Assembly
+; AFTER
+; System.IO.UnmanagedMemoryAccessor:.ctor(ref,long,long,int):this
+; Total bytes of code 24, prolog size 5
+ 0F1F440000 nop
+ 90 nop
+ 8B442428 mov eax, dword ptr [rsp+28H]
+ 89442428 mov dword ptr [rsp+28H], eax
+ 488D0500000000 lea rax, [(reloc)]
+ 48FFE0 rex.jmp rax
+```
+
+In both cases this method ends up tail calling to `Initialize` (via the
+`rex.jmp`), passing it the exact set of arguments that
+was passed to the method.
+
+In the before case, the un-inlined call changes the method from a leaf
+to a non-leaf. Leaf functions don't need to set up their own frame,
+while non-leaves do. So some of the extra code in the prolog and
+epilog is the frame setup. Beyond that, however, the un-inlined callee
+forces the compiler to move the arguments that will ultimately be
+passed to the tail-called method out of the volatile registers and
+into preserved registers before the call, then back again to
+the proper argument registers after the call. And in order to use the
+callee-saves, these registers must be pushed in the prolog and popped
+in the epilog.
+
+In the after case, no frame setup is required, no registers need to be
+preserved across the call, so no registers need saving and
+restoring. The two `mov`s that remain appear to be gratuitous. The
+first `nop` is there for rejit padding. It's not clear why the second one is
+there.
+
+So we can now see why this particular case has such extreme
+`SizeImpact`: the un-inlined call triggers frame creation and a fair
+amount of shuffling to accommodate the potential side effects of the
+call.
+
+#### Case 2: Typical Savings
+
+In the data above, the inline of `System.Object:.ctor()` typically
+saves either 8 or 9 bytes of code. Let's look at one such case.
+
+```
+Inline System.Object:.ctor():this
+into System.Security.SecurityElement:.ctor():this
+CalleeSize = 6
+CallerSize = 15
+CallerSize' = 6
+SizeImpact = -9
+```
+
+Here are before and after listings for the caller:
+
+```Assembly
+; BEFORE
+; System.Security.SecurityElement:.ctor():this
+; Total bytes of code 15, prolog size 5
+ 0F1F440000 nop
+ 488D0500000000 lea rax, [(reloc)]
+ 48FFE0 rex.jmp rax
+```
+
+```Assembly
+; AFTER
+; System.Security.SecurityElement:.ctor():this
+; Total bytes of code 6, prolog size 5
+ 0F1F440000 nop
+ C3 ret
+```
+
+In this case the call site was initially handled via a tail call that
+required a 7 byte `lea` and a 3 byte `rex.jmp`. With the inline all
+that's left is a single byte `ret`. This case covers 154 of the 187
+instances where inlining `System.Object:.ctor()` saved 9 bytes.
+
+#### Cases 3, 4: Size May Decrease or Increase
+
+Now let's look at a set of examples where an inline can either
+decrease or increase the caller size. Here's the good case:
+
+```
+Inline System.IntPtr:op_Explicit(long):long
+into System.Runtime.InteropServices.Marshal:WriteIntPtr(ref,int,long)
+CalleeSize = 35
+CallerSize = 50
+CallerSize' = 22
+SizeImpact = -28
+```
+
+Here's the callee code. (Note: the `CalleeSize` quoted above turns out to
+be wrong -- there are several overloaded versions and my script picks up
+the wrong one.) The true callee is only 9 bytes long:
+
+```Assembly
+; System.IntPtr:op_Explicit(long):long
+; Total bytes of code 9, prolog size 5
+ 0F1F440000 nop
+ 488BC1 mov rax, rcx
+ C3 ret
+```
+
+At any rate the `SizeImpact` is still correct.
+
+Here's the before caller code:
+
+```Assembly
+; BEFORE
+; System.Runtime.InteropServices.Marshal:WriteIntPtr(ref,int,long)
+; Total bytes of code 50, prolog size 12
+ 55 push rbp
+ 57 push rdi
+ 56 push rsi
+ 4883EC20 sub rsp, 32
+ 488D6C2430 lea rbp, [rsp+30H]
+ 488BF1 mov rsi, rcx
+ 8BFA mov edi, edx
+ 498BC8 mov rcx, r8
+ E800000000 call System.IntPtr:op_Explicit(long):long
+ 4C8BC0 mov r8, rax
+ 8BD7 mov edx, edi
+ 488BCE mov rcx, rsi
+ 488D0500000000 lea rax, [(reloc)]
+ 488D65F0 lea rsp, [rbp-10H]
+ 5E pop rsi
+ 5F pop rdi
+ 5D pop rbp
+ 48FFE0 rex.jmp rax // tail call WriteInt64
+```
+
+and the after caller code:
+
+```Assembly
+; AFTER
+; System.Runtime.InteropServices.Marshal:WriteIntPtr(ref,int,long)
+; Total bytes of code 22, prolog size 10
+ 55 push rbp
+ 4883EC20 sub rsp, 32
+ 488D6C2420 lea rbp, [rsp+20H]
+ E800000000 call System.Runtime.InteropServices.Marshal:WriteInt64(ref,int,long)
+ 90 nop
+ 488D6500 lea rsp, [rbp]
+ 5D pop rbp
+ C3 ret
+```
+
+Somewhat confusingly, the inline has stopped the jit from making a
+tail call, and the size savings should be even greater than they
+are. There appears to be some kind of code generation issue within the
+jit. The IL for the callee is
+
+```
+IL_0000 0f 00 ldarga.s 0x0
+IL_0002 7b 8f 02 00 04 ldfld 0x400028F
+IL_0007 6e conv.u8
+IL_0008 2a ret
+
+```
+and the JIT rejects the tail call because
+
+```
+Rejecting tail call late for call [000008]: Local address taken
+```
+
+so presumably the `ldarga.s` in the inlinee is causing trouble.
+
+Now let's see if that same inlinee can cause a size increase. Here's a
+case:
+
+(NOTE this is calling a different overload..., need to get this
+straightened out)
+
+```
+Inline System.IntPtr:op_Explicit(long):long [35]
+into EventData:get_DataPointer():long:this [18] size 35 delta 17
+CalleeSize = 35
+CallerSize = 18
+CallerSize' = 35
+SizeImpact = 17
+```
+
+```Assembly
+; BEFORE
+; EventData:get_DataPointer():long:this
+; Total bytes of code 18, prolog size
+ 0F1F440000 nop
+ 488B09 mov rcx, qword ptr [rcx]
+ 488D0500000000 lea rax, [(reloc)]
+ 48FFE0 rex.jmp rax
+```
+
+```Assembly
+; AFTER
+; EventData:get_DataPointer():long:this
+; Total bytes of code 35, prolog size 5
+ 4883EC28 sub rsp, 40
+ 488B11 mov rdx, qword ptr [rcx]
+ 33C9 xor rcx, rcx
+ 48894C2420 mov qword ptr [rsp+20H], rcx
+ 488D4C2420 lea rcx, bword ptr [rsp+20H]
+ E800000000 call System.IntPtr:.ctor(long):this
+ 488B442420 mov rax, qword ptr [rsp+20H]
+ 4883C428 add rsp, 40
+ C3 ret
+```
+
+Here the un-inlined case made a tail call. The inlined case was unable
+to do the same optimization for some reason, though it looks like it
+should have been able to do so as well.
diff --git a/Documentation/design-docs/inlining-plans.md b/Documentation/design-docs/inlining-plans.md
new file mode 100644
index 0000000000..2279c16e37
--- /dev/null
+++ b/Documentation/design-docs/inlining-plans.md
@@ -0,0 +1,438 @@
+# Proposed Plans for Inlining
+
+This document outlines a plan for improving the inlining capabilities
+of RyuJit as well as adding inlining capabilities to LLILC.
+
+## Background
+
+Inlining is a key optimization, directly reducing call overhead and
+indirectly exposing a wider scope for powerful intraprocedural
+optimizations.
+
+From an implementation standpoint, an inliner has the following aspects:
+
+* Machinery: transformations on the compiler's IR that physically
+incorporate the code for the callee into the caller.
+
+* Legality: a set of rules that describe when inlining is legal (that
+is, preserves program semantics). Here, the legality rules are largely
+dictated to us by the CoreCLR runtime.
+
+* Ability: a set of rules that describe whether the machinery is able
+to perform a legal inline and whether the result can be handled by
+downstream phases.
+
+* Profitability: a set of heuristic rules that describe which legal
+and able inlines are desirable.
+
+The general consensus is that the inlining in RyuJit can be improved,
+while carefully maintaining RyuJit's current lean compilation
+overhead. It is anticipated that this work will encompass Machinery,
+Ability, and Profitability.
+
+LLILC does no inlining today. Since we aspire to have LLILC be a
+high-performance .Net code generator, we need to enable inlining in
+LLILC. LLILC can likely leverage much of LLVM's built-in inlining
+Machinery and Ability, but will need new code for Legality and
+Profitability.
+
+We envision various scenarios for .Net Code generation that impact
+inlining: a first-tier JIT compiler, a higher-tier JIT compiler, a
+fast AOT compiler, and an optimizing AOT compiler. Each scenario calls
+for inlining, but the tradeoffs are different. For a given scenario,
+we may choose to deploy RyuJit or LLILC or both depending on the
+scenario requirements and code generator capabilities.
+
+There are also aspects of inlining that are runtime and target machine
+specific. Each runtime and target architecture will require some
+specialized tuning to accommodate the different costs of operations at
+runtime.
+
+Taking this all into account, the proposed scope of the work
+encompasses two different compiler code bases, a number of codegen
+scenarios, and a number of target architectures and runtime
+conventions. The goal is to come up with one conceptual framework that
+covers all these cases and avoids relying too much on time-intensive
+manual investigation and tuning.
+
+We will measure the quality of inlining along three axes: the time
+spent generating code (aka *throughput* -- TP), the
+time spent executing the code (aka *code quality* -- CQ), and the size
+of the generated code (CS).
+
+We also desire that the resulting decision-making machinery (in particular
+the Profitability heuristic) is expressed in terms that are sensible to
+compiler writers.
+
+## The Proposal
+
+### Machinery
+
+RyuJit's machinery may or may not require updates depending on how many
+key inlines are rejected because of lack of ability. This is something
+that requires further investigation. We may also decide to pursue
+enhancements here to ensure that in cascaded inline cases the deeper
+callsites are provided up-to-date information.
+
+LLILC should be able to leverage LLVM's existing inline machinery. In
+JIT-like scenarios, LLILC's reader will need some updates to properly
+handle cases where a method is being "read" as an inline candidate
+rather than a root method to be compiled. For AOT scenarios we ultimately
+will want to do something more elaborate where all of the methods
+being compiled are represented as LLVM IR.
+
+### Legality
+
+As mentioned above, the legality constraints are generally imposed by
+the CoreCLR runtime.
+
+These constraints are already captured in the RyuJit source
+code. However, we generally prefer to at least logically perform all
+the legality checks before any of the ability or profitability checks, so
+that we can reason more simply about why a particular inline didn't
+happen. This may not be entirely practical, since some of the
+profitability or ability early outs may be helping TP by keeping
+RyuJit from doing a lot of exploratory work before ultimately
+rejecting an inline candidate.
+
+LLILC lacks legality checking, and this needs to be implemented.
+
+### Ability
+
+RyuJit currently bails out of some legal inlining cases because of
+internal restrictions. For instance, if a callee modifies one of its
+parameters, RyuJit won't inline it because of limitations in how the
+IR is incorporated. We expect to work on reducing or removing key
+ability limiters in RyuJit.
+
+Assuming the LLVM inlining framework is robust, LLILC shouldn't have
+too many ability constraints.
+
+### Profitability
+
+Given the wide range of scenarios, targets, and the two distinct code
+bases, we propose to heavily rely on machine learning techniques for
+building and tuning the Profitability components of the inliners.
+
+The general approach we suggest is based on past experiences we've had
+developing inliners for various compilers and is inspired by the CGO
+paper [Automatic Construction of Inlining Heuristics using Machine
+Learning](http://dl.acm.org/citation.cfm?id=2495914) by Kulkarni,
+Cavazos, Wimmer, and Simon.
+
+Kulkarni et al. treat profitability as an unsupervised learning
+problem, and create a well-performing heuristic black box (neural
+network) using evolutionary programming techniques. They then turn
+around and use this black box as an oracle to label inline instances,
+and from this guide a supervised machine learning algorithm to produce
+a decision tree that expresses the profitability heuristics in terms
+sensible to the compiler writer.
+
+The inputs to the heuristic black box are various facts and estimates
+about an inline candidate, based on observations of the caller, callee,
+and call site (collectively, *features*). These features may be booleans,
+enumerates (categories), integers, or floating-point values. For example:
+
+* (boolean) `CalleeIsLeaf`, `CallerSiteInTry`
+* (enumerate) `CalleeReturnCorType`
+* (int) `CalleeILSize`,`CallSiteProfileWeight`
+* (float) `CalleeLoadStoreFrequency`
+
+The output is a yes/no inlining decision for a particular candidate.
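+
+A sketch of what the feature bundle and the black-box interface might look like (the struct
+and function are invented for illustration; the field names mirror the example features above):
+
+```C++
+// Illustrative feature bundle for one inline candidate.
+struct InlineFeatures
+{
+    bool   CalleeIsLeaf;               // boolean features
+    bool   CallerSiteInTry;
+    int    CalleeReturnCorType;        // enumerate (category)
+    int    CalleeILSize;               // integer features
+    int    CallSiteProfileWeight;
+    double CalleeLoadStoreFrequency;   // floating-point feature
+};
+
+// The trained black box: given the features of a candidate, produce the
+// yes/no inlining decision. Its body would be generated by the learning
+// tooling rather than written by hand.
+bool ShouldInline(const InlineFeatures& features);
+```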
+
+The evolutionary process is governed by a fitness metric that expresses the
+appropriate balance of TP, CQ, and CS for the scenario. For instance, in a JIT
+scenario we likely will want to keep TP and CS close to the current
+baseline values. For AOT we may want to allow higher TP and larger CS if we
+can demonstrate CQ improvements.
+
+## Implications
+
+For this approach to work, we will need to make frequent and robust
+measurements of the key fitness metrics across a range of
+representative benchmarks. So this presumes:
+
+* we can identify and capture these key benchmarks
+* we can develop methodology for doing appropriate measurements
+* we can do these measurements at appropriate scale
+
+### The Benchmarks
+
+Clearly in a machine learning approach the actual observations made
+are crucial. We need a solid, representative benchmark set. But this
+is a general problem with a heuristic based optimization and not an
+inherent limitation of this approach -- one can only measure what one
+can measure.
+
+We plan on capturing the results in standard schema and hope that with
+this and the regularization of the metric data (see next section) that
+interested 3rd parties can perhaps supply their own benchmark results.
+
+### The Metrics
+
+CS is typically very easy to measure and the measurements are "noise
+free". So the measurement only needs to be done once.
+
+If TP and CQ are both measured in terms of time (or cycles) then the
+measurements will be inherently noisy. To reduce the variance caused
+by noise some repeated runs may be necessary, and it might also be
+necessary to run on dedicated, matched hardware. This severely limits
+scalability.
+
+We instead propose to measure both TP and CQ in terms of instructions
+retired. While this may not be entirely noise-free, we expect the
+measurements to be low noise and fairly insensitive to other
+computations on the machine. Thus this should give us decent scale
+(for instance we can run scenarios in the cloud on VMs with varying
+underlying hardware, etc).
+
+For TP, instructions retired should be a reasonable proxy for time,
+since we expect the CPI for the compiler proper to be relatively
+static. For CQ, we realize there may not be as strong a correlation
+with time, but typically the inliner operates early in the compilation
+process and inlining is mainly focused on reducing instruction count
+and less about reducing cycle stalls.
+
+We'll do some experimentation early on to see how well these proxy
+metrics hold up.
+
+### Analysis
+
+To help prioritize work on profitability and ability (and perhaps find
+areas where we might discuss trying to remove legality constraints),
+we need some mechanism to correlate inlining outcomes with success and
+failure reasons.
+
+We propose to develop a textual, machine-readable inline log schema
+that describes a set of inlining decisions. This schema can be
+produced by a compiler as it makes decisions, and contains sufficient
+correlating data so that we can then match it up with runtime
+observations. For example, we might end up with the following sort of
+per-method output (though more likely it will be embedded in some kind of
+xml or json markup):
+
+```
+ Inlining for 9811 Lookup
+ [o] 22825 Lookup@Lattice
+ [o] 22827 ??0?$interior_ptr@PAVCell
+ [x] 22826 Lookup@Lattice@@@Z (reason: SizeLimit)
+ [o] 21728 ?get_CellList
+
+```
+
+where `[o]` is a successful inline, `[x]` a failed inline, and
+indentation shows the inlining tree. For .Net compilation we'll need
+some kind of persistent ID for methods, which may not be all that easy
+to come by.
+
+This inline log can also be read back by the code generator to enable
+*inline replay* and force the inliner to perform a particular pattern
+of inlines. This is useful for things like performance investigations,
+bug isolation, and for trying to have one compiler emulate the
+inlining behavior of another. Since the logs are textual they can also
+be edited by hand if necessary to perform experiments.
+
+### Obtaining Runtime Data
+
+If we can obtain runtime data about the frequency of call
+instructions, we can run this through the log to accumulate failure
+reasons and create a failure reason table like the following
+(categories and data here totally made up):
+
+```
+ FailedReason | Count | [%] | Sites | [%]
+---------------------|----------|-------|--------|------
+SpeculativeNonGround | 79347955 | 37.8% | 417 | 14.1%
+NonGroundShared | 28756634 | 13.7% | 217 | 07.4%
+SizeLimit | 22066033 | 10.5% | 760 | 25.8%
+ExternFunction | 21923402 | 10.5% | 300 | 10.2%
+... etc ...
+```
+
+This can be used to prioritize inlining work. For instance, if
+`SpeculativeNonGround` was an ability issue, and we wanted to improve
+CQ, we'd give it high priority to fix, since it impacts inlining most
+frequently.
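+
+As a rough illustration of the bookkeeping (the `FailedInline` record and the reason
+strings are hypothetical), the aggregation might weight each failed call site by its
+measured call count and report both dynamic and static shares, much like the table above:
+
+```C#
+using System;
+using System.Collections.Generic;
+using System.Linq;
+
+// Hypothetical record: one failed inline attempt plus its measured call frequency.
+record FailedInline(string Reason, long CallCount);
+
+static class FailureReport
+{
+    public static void Print(IReadOnlyList<FailedInline> failures)
+    {
+        if (failures.Count == 0) return;
+
+        double totalCalls = failures.Sum(f => (double)f.CallCount);
+        double totalSites = failures.Count;
+
+        var rows = failures
+            .GroupBy(f => f.Reason)
+            .Select(g => new { Reason = g.Key, Count = g.Sum(f => f.CallCount), Sites = g.Count() })
+            .OrderByDescending(r => r.Count);
+
+        foreach (var r in rows)
+        {
+            Console.WriteLine(
+                $"{r.Reason,-20} | {r.Count,10} | {100.0 * r.Count / totalCalls,5:F1}% " +
+                $"| {r.Sites,6} | {100.0 * r.Sites / totalSites,5:F1}%");
+        }
+    }
+}
+```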
+
+### Robustness
+
+Since we'll likely be altering inlining decisions, we may uncover bugs
+or performance blemishes in the compilers.
+
+It may make sense to build up a random inlining methodology (RyuJit
+evidently has something like this) to try and stress test the compiler
+in the face of inlining changes. Log replay can be used with automatic
+bisection tools to try and isolate the particular inline that exposes
+a bug.
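+
+A minimal sketch of that bisection loop, assuming a hypothetical `replayAndTest` helper
+that runs the compiler in replay mode with only the given inlines enabled and reports
+whether the bug still reproduces:
+
+```C#
+using System;
+using System.Collections.Generic;
+using System.Linq;
+
+static class InlineBisect
+{
+    // Narrow a set of inline decisions down to a smaller subset that still reproduces the bug.
+    public static List<int> FindCulprits(List<int> allInlineIds, Func<List<int>, bool> replayAndTest)
+    {
+        var suspects = allInlineIds;
+        while (suspects.Count > 1)
+        {
+            var firstHalf = suspects.Take(suspects.Count / 2).ToList();
+            var secondHalf = suspects.Skip(suspects.Count / 2).ToList();
+
+            if (replayAndTest(firstHalf))
+                suspects = firstHalf;
+            else if (replayAndTest(secondHalf))
+                suspects = secondHalf;
+            else
+                break; // the bug needs inlines from both halves; stop at the current set
+        }
+        return suspects;
+    }
+}
+```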
+
+Performance issues are harder to spot automatically, and will likely
+require some amount of manual analysis. However, we may be able to study
+inline instances that are not well-predicted by the inlining models -- for
+instance, cases where the CS is much greater than predicted, or CQ is lower --
+and see if the modelling shortfall is an indicator of problems within the
+compiler itself.
+
+## Development Stages
+
+Given the above, here's a rough draft outline of how the functionality
+could be built.
+
+### Preliminaries -- Measurement
+
+1. Make some benchmarks available for measurement.
+
+2. Demonstrate we can reliably measure the metrics of interest: CS,
+CQ, and TP (as bytes, instructions, and instructions, respectively) when jitting and
+pre-jitting the benchmarks. Get some idea of the noise levels for CQ
+and TP. Hopefully they are small.
+
+### Preliminaries -- RyuJit
+
+1. Refactor RyuJit code to clearly separate out legality, ability,
+and profitability aspects of the inliner. Hopefully this can be done
+with no CQ changes and no TP impact.
+
+2. Develop the inline log format.
+
+3. Produce inline logs from RyuJit.
+
+4. Develop methodology for runtime correlation.
+
+5. Produce failure reason analysis for the benchmark set. Decide
+which if any abilities need improving.
+
+6. Implement inline log replay for RyuJit.
+
+7. Modify logs by hand or via a tool and replay them to assess the
+landscape of TP/CQ/CS tradeoffs from inliner changes.
+
+### Preliminaries -- LLILC
+
+1. Add code to capture legality constraints when reading in a method.
+
+2. Generalize LLILC's MSIL reader to read in an inlinee in
+the context of the caller.
+
+3. Write a simple LLVM pass (or specialize an existing pass) to
+perform inlining. Initially just implement a very simple profitability
+heuristic. Verify it all works correctly.
+
+4. Implement inline log generation by LLILC.
+
+5. Implement inline log replay by LLILC.
+
+6. Modify logs by hand or via a tool and replay them to assess the
+landscape of TP/CQ/CS tradeoffs from inliner changes.
+
+### Develop the Heuristics (JIT case)
+
+1. Start by trying to produce code size estimates for a given inline
+candidate. Measure actual code size impacts for individual inlining
+decisions.
+
+2. Determine features to surface and produce a labelled data set of
+(feature*, size impact). Use this to build a machine-learning model
+estimator for code size.
+
+3. We probably can't just plug the new size estimate into the existing
+heuristic framework and expect good results (generally, better
+information in does not always result in better decisions out). So
+instead build up a parallel (off by default) heuristic path where we
+use the new size estimator.
+
+4. Surface features for the CQ and TP impacts. Build a pluggable model
+for the black box heuristic in the compiler. Likely this means that
+the model code is written by a tool from some textual description.
+
+5. Run evolutionary experiments to produce an optimized black box
+heuristic. This involves: (a) randomly or otherwise generating an
+initial population of models; (b) running each model on the
+benchmarks, gathering metrics; (c) combining metrics across benchmarks
+to produce fitness; (d) using fitness to create a new set of models
+using various genetic approaches; (e) iterating until satisfied with
+the result. (A simplified sketch of this loop appears after this list.)
+
+All of this setup should be automated and available in the CI system.
+
+6. Use the winning model to produce a labelled data set of (feature*,
+inlining decision) for all the cases in the benchmarks. Build a
+decision tree from this via standard techniques. Decide if any pruning
+is appropriate.
+
+7. Back-validate the results via measurement to show that (a) the final heuristic
+works as expected and metrics match expectations on the training examples, and
+(b) the results are not over-fitted, by cross-validating on untrained benchmarks.
+Flight this to various 3rd parties for independent confirmation.
+
+8. Enable new heuristics as defaults.
+
+9. Repeat the above process for each code base, scenario,
+architecture, etc.
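+
+To make step 5 above a bit more concrete, here is a heavily simplified sketch of the
+evolutionary loop; the model representation (a weight vector over surfaced features), the
+externally supplied fitness function, and the mutation scheme are all assumptions chosen
+only to illustrate steps (a) through (e):
+
+```C#
+using System;
+using System.Collections.Generic;
+using System.Linq;
+
+static class HeuristicEvolution
+{
+    // A "model" here is just a weight vector over the surfaced features.
+    public static double[] Evolve(Func<double[], double> fitness, int featureCount,
+                                  int populationSize = 20, int generations = 50)
+    {
+        var rng = new Random(42);
+
+        // (a) generate an initial random population of models
+        var population = Enumerable.Range(0, populationSize)
+            .Select(_ => RandomWeights(rng, featureCount))
+            .ToList();
+
+        for (int gen = 0; gen < generations; gen++) // (e) iterate until satisfied
+        {
+            // (b)+(c) run each model on the benchmarks and combine metrics into a fitness score
+            var scored = population
+                .Select(w => (weights: w, score: fitness(w)))
+                .OrderByDescending(p => p.score)
+                .ToList();
+
+            // (d) keep the best half and refill the population with mutated copies
+            var survivors = scored.Take(populationSize / 2).Select(p => p.weights).ToList();
+            population = survivors
+                .Concat(survivors.Select(w => Mutate(rng, w)))
+                .ToList();
+        }
+
+        return population.OrderByDescending(fitness).First();
+    }
+
+    static double[] RandomWeights(Random rng, int n) =>
+        Enumerable.Range(0, n).Select(_ => rng.NextDouble() * 2 - 1).ToArray();
+
+    static double[] Mutate(Random rng, double[] w) =>
+        w.Select(x => x + (rng.NextDouble() - 0.5) * 0.1).ToArray();
+}
+```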
+
+### AOT Considerations
+
+In the AOT case (particularly the optimized AOT) we will likely need
+to develop a different set of heuristics.
+
+The current frameworks for AOT (NGEN and Ready2Run) give the code
+generator the opportunity to preview the methods to be compiled. We'll
+take advantage of that to build a call graph and orchestrate
+compilation of methods in a "bottom-up" fashion (callees before
+callers). Because of that, when inlining, the caller typically has the
+opportunity to look at a richer set of data about the callee -- for
+instance, detailed usage summary for parameters and similar.
+
+The inliner also has the opportunity to plan inlining over a wider
+scope, perhaps adjusting heuristics for the relative importance of
+callers or the shape of the call graph or similar.
+
+## Vetting and Revalidation
+
+With an evaluation framework in place, we can now run carefully
+controlled experiments to see if further enhancements to the inliner
+are warranted. For example, the order in which candidates are
+considered may have an impact -- ideally the inliner would greedily
+look at the best candidates first. Other feature data may prove
+important, for instance actual or synthetic profile data.
+
+The rest of the compiler also changes over time. Periodically we
+should rerun the heuristic development to see if the current set of
+heuristics are still good, especially if major new capabilities are
+added to the compiler.
+
+We should also continually strive to improve the benchmark set.
+
+## Risks
+
+The plan calls for the inliner's Profitability logic to be constructed
+via machine learning. This exposes us to a number of risks:
+
+* We may not understand how to successfully apply ML techniques.
+* We may run into one of the common ML pitfalls, like having too many
+features, or strongly correlated features.
+* We may not be using the right features.
+* Some of the features may themselves be estimates (eg code size).
+* The benchmark set may not be sufficiently representative.
+* We may overfit the benchmark set. The plan here is to cross-validate and
+not use all our benchmarks as training examples.
+* The training process may be too slow to be practical.
+* There may be too much measurement noise.
+* The resulting models may not be stable -- meaning that small changes in
+the input programs might trigger large changes in the inlining decisions.
+* The supervised learning output may still be black-box like in practice.
+* The downstream phases of the compiler might not adapt well to inlining
+changes or may offer an uneven performance surface.
+* We've found it tricky in the past to produce a good combined fitness metric
+from a large set of benchmarks and metrics.
+* It's generally not possible to build a new heuristic which is better in
+every way than the old one -- some metrics on some benchmarks will inevitably
+regress.
+
+However, it's worth pointing out that a manually constructed heuristic faces
+many of these same risks.
+
+## Notes
+
+See this [early
+draft](https://github.com/AndyAyersMS/coreclr/commit/035054402a345f643d9dee0ec31dbdf5fadbb17c),
+now orphaned by a squash, for some interesting comments.
diff --git a/Documentation/design-docs/longs-on-32bit-arch.md b/Documentation/design-docs/longs-on-32bit-arch.md
new file mode 100644
index 0000000000..d04efb59fa
--- /dev/null
+++ b/Documentation/design-docs/longs-on-32bit-arch.md
@@ -0,0 +1,117 @@
+# Modeling Longs on 32-bit Architectures
+
+The challenge here is to model long operations in a way that reflects register lifetimes for the
+lo and hi halves of a 64-bit integer accurately enough that we can achieve good register
+allocation.
+
+## Background
+
+The current liveness model supports:
+* Internal registers (interfere with all sources, but no defs)
+* Sources that are not last use (interfere with all others)
+* Sources that are last use and NOT DelayFree (interfere with all sources, but no defs)
+* Sources that are last use AND DelayFree (interfere with all sources and defs)
+* Kills (interfere with all uses but no defs)
+
+Previously (eons ago) when we were working on arm32, the model had all but the last def
+interfering with the uses and internal registers (i.e. they went live during the first
+location for the node).
+* This supports the ability to reuse inputs, e.g. for long adds, subs or addresses, after the first def register goes live.
+* This doesn’t work for supporting multi-reg call nodes:
+ * All defs need to occur AFTER the kills, and they do not interfere with any sources.
+
+Considerations for an expanded model:
+* Must be “pay for play” (not make LSRA more expensive for non-long code generation)
+* It is not practical to precisely model nodes that generate multiple instructions:
+ * In the case of longs, we may have one source operand that could be overwritten by
+ the first result, but other operands that need to remain live.
+ * However, on x86 it will be really important not to waste registers.
+
+## Decomposition
+
+This is the approach currently being taken. In the initial implementation, the reordering
+of nodes to support the required "valid tree traversal" property of the IR was producing
+incorrect code, because for instructions like `add` the carry bit needs to be
+immediately consumed by the computation of the hi bits.
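+
+As a purely illustrative C# rendering of the semantics (not JIT code), the lo/hi split of
+a 64-bit add looks like the following; on x86 the pair maps to an `add`/`adc` sequence,
+which is why the hi computation must directly follow the lo computation:
+
+```C#
+static class LongDecomposeExample
+{
+    // Conceptual lo/hi split of a 64-bit add (illustration only).
+    static (uint Lo, uint Hi) AddInt64(uint xLo, uint xHi, uint yLo, uint yHi)
+    {
+        uint lo = xLo + yLo;             // x86: add
+        uint carry = lo < xLo ? 1u : 0u; // carry produced by the lo add
+        uint hi = xHi + yHi + carry;     // x86: adc -- must consume the carry immediately
+        return (lo, hi);
+    }
+}
+```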
+
+## Decomposition with temps
+
+In order to preserve the current decomposition approach and not change the fundamental
+tree ordering properties, it is necessary to evaluate sub-expressions into temps, to keep
+the lo and hi computations together.
+
+There are concerns about this, because it requires generating a number of extra temps
+in the case of nested expressions. However, mikedn has done some experimentation
+[here](https://github.com/mikedn/coreclr/blob/decompose/src/jit/lower.cpp#L424)
+that indicates that this approach may not be as problematic as we feared.
+
+The basic idea is that whenever we need to decompose hi/lo operations but keep them
+together (i.e. can’t preserve the tree traversal/linear order invariants), we create a temp.
+
+## Richer Liveness Model (No Decomposition)
+
+The idea here would be to retain `TYP_LONG` nodes in the IR, and to find a way to extend
+the liveness model used by Lowering, LSRA and CodeGen to ensure good register allocation.
+
+In the table below, there are several operations that have different register lifetime
+characteristics.
+Each is modeled with a sequence of "Locations" for which the changes in register lifetimes
+can be viewed as happening at the same time.
+All lifetimes participating in a given Location are considered to interfere.
+Note that the simplest model is that all the uses happen at one location, and then the
+def(s) happen at the next Location (i.e. the def does not interfere with the uses).
+The model becomes more complicated when we have constraints such as RMW (read-modify-write)
+operands, and even more so when we are actually modeling multiple target instructions
+with a single IR node.
+
+To avoid even more complication than is already inherent in this issue, we will assume
+that the evaluation order is fixed during Lowering (and in these examples we are assuming
+that the lo operand is evaluated first).
+
+In future we may want to consider the implications of relaxing that constraint, though
+note that the Decomposition approach also requires a predetermined evaluation order.
+
+| Operation | Location 1 | Location 2 | Location 3 |
+| --------------|:----------:| ----------:| ----------:|
+| x = y | use y.lo | def x.lo | |
+| | use y.hi | def x.hi | |
+| | | | |
+| x = y + z | use y.lo | def x.lo | def x.hi |
+| | use z.lo | use y.hi | |
+| | y.hi live | use z.hi | |
+| | z.hi live | | |
+| | | | |
+| x = *p | use p | def x.hi | |
+| (non-ldp) | def x.lo | | |
+| | | | |
+| x = *p | use p | def x.lo | |
+| (ldp) | | def x.hi | |
+| | | | |
+| x = \*(p+i*8) | use p | def x.hi | |
+| (non-ldp) | use i | | |
+| | def x.lo | | |
+| | | | |
+| x = call | use args | def x.lo | |
+| | kill (end) | def x.hi | |
+
+
+Both of the non-ldp cases (`ldp`, load pair, is available on Arm but not on x86) take
+advantage of the fact that all the sources must remain live with the first def; this is
+what allows the non-ldp case of `x = *p` to def x.lo in its first Location.
+However, that approach doesn't reduce the Location requirement for the `x = y + z` case,
+where, if we assume that all inputs must be live for the first def, then we keep the lo
+registers live longer than necessary, and therefore can't reuse them for the x.lo (or
+other) results.
+
+TO DO:
+* Extend the above to include more operations, and to show the actual instructions
+ generated, to determine how well we can approximate the "optimal" liveness model.
+
+## True Linear IR
+
+It would be really nice to break the “valid tree traversal” restriction, and allow
+arbitrary (or at least more flexible) reordering of nodes when in linear form.
+* Gets rid of embedded statements!!
+* Affects (at least) rationalizer and lsra, which implicitly depend on this property in their use of stacks.
+* Probably has many unknown other impacts
diff --git a/Documentation/design-docs/removing-embedded-statements.md b/Documentation/design-docs/removing-embedded-statements.md
new file mode 100644
index 0000000000..c7d63e7926
--- /dev/null
+++ b/Documentation/design-docs/removing-embedded-statements.md
@@ -0,0 +1,180 @@
+## Removing Embedded Statements From the RyuJIT Backend IR
+
+Pat Gavlin ([*pagavlin@microsoft.com*](mailto:pagavlin@microsoft.com))
+
+July 2016
+
+### Abstract
+
+RyuJIT’s IR comes in two forms: that produced and manipulated in the front end and that expected and manipulated by the
+back end. The boundary between these two forms of IR consists of rationalization and lowering, both of which
+transform the front-end IR into back-end IR. For the purposes of this paper, the relevant differences between the two
+IRs revolve around ordering constructs within basic blocks: top-level statements, trees, comma nodes, and embedded
+statements. The latter two constructs are used to represent arbitrary (often side-effecting) code that executes at a
+specific point in a tree but does not otherwise participate in the tree’s dataflow. Unfortunately, representational
+challenges with embedded statements make them difficult to understand and error-prone to manipulate. This paper proposes
+that we remove all statements--embedded and otherwise--from the backend IR by chaining the last linearly-threaded node
+within a statement to the first linearly-threaded node within its successor and vice versa as well as removing certain
+constraints on the shape of the nodes within a block.
+
+### Review: IR ordering semantics
+
+As previously mentioned, RyuJIT uses two forms of IR: the front-end IR (referred to hereafter as HIR) and the back-end
+IR (referred to hereafter as LIR). Aside from using different representations for operations such as stores, HIR and LIR
+differ in their ordering constructs within basic blocks.
+
+Within a basic block, the HIR is ordered first by statements. Each statement consists of a single tree that performs the
+computation associated with that statement. The nodes of that tree are executed in the order produced by a left-to-right
+post-order visit of the tree’s nodes (with the exception of binary operator nodes that have the GTF\_REVERSE\_OPS flag
+set, which reverses the order of their operand trees). As such, the edges in a tree represent both ordering and edges in
+a dataflow graph where every definition has a single use, with some exceptions:
+
+- Edges to nodes that represent defs of storage locations (e.g. edges to nodes on the LHS of assignment operators)
+  often represent neither ordering nor a use-to-def relationship: the def of the location happens as part of the
+ execution of its parent, and the edge does not represent a use of an SDSU temp.
+
+- Edges to unused values do not represent a use-to-def relationship: these edges exist only for ordering.
+
+The primary source of the latter are comma nodes, which are used in the HIR specifically to insert arbitrary code into a
+tree’s execution that does not otherwise participate in its SDSU dataflow.
+
+Similar to the HIR, the LIR is ordered first by statements. Each statement consists of a single tree and a linear
+threading of nodes that represent the SDSU dataflow and computation, respectively, that are associated with that
+statement. The linear threading of nodes must contain each node that is present in the tree, and all nodes that make up
+the tree must be present in the linear threading in the same order in which they would have been executed by the HIR’s
+tree ordering. Additional nodes may be present in the linear threading in the form of embedded statements, which are
+sub-statements whose nodes form a contiguous subsequence of their containing statement’s execution order, but do not
+participate in the containing statement’s SDSU dataflow (i.e. the embedded statement’s nodes are not present in the
+containing statement’s tree). Embedded statements otherwise have the same ordering semantics as top-level statements,
+and are used for the same purpose as comma nodes in the HIR: they allow the compiler to insert arbitrary code into a
+statement’s execution that does not otherwise participate in its SDSU dataflow.
+
+As mentioned, both comma nodes and embedded statements are used for the same purpose in the HIR and LIR, respectively.
+Each construct represents a contiguous sequence of code that executes within the context of a tree but that does not
+participate in its SDSU dataflow. Of the two, however, embedded statements are generally more difficult to work with:
+since they are not present in their containing statement’s tree, the developer must take particular care when processing
+LIR in order to avoid omitting the nodes that comprise embedded statements from analyses such as those required for code
+motion (e.g. the analysis required to make an addressing mode “contained”) and to avoid violating the tree order constraint
+placed upon the LIR’s linear threading.
+
+The problems with manipulating embedded statements have two relatively clear solutions: either replace embedded
+statements with a construct that is represented in its parent statement’s tree, or remove the tree order constraint and
+move the LIR to a linear ordering that is constrained only by the requirements of dataflow and side-effect consistency.
+We believe that the latter is preferable to the former, as it is consistent with the overall direction that the LIR has
+taken thus far and reduces the number of concepts bound up in tree edges, which would thereafter represent only the SDSU
+dataflow for a node. This would clarify the meaning of the IR as well as the analysis required to perform the
+transformations required by the backend.
+
+### Approach
+
+The approach that we propose is outlined below.
+
+#### 1. Add utilities for working with LIR.
+
+Efficiently working with the new IR shape will require the creation of utilities for manipulating, analyzing,
+validating, and displaying linearly-ordered LIR. Of these, validation and display of the LIR are particularly
+interesting. Validation is likely to check at least the following properties:
+
+1. The linear threading is loop-free
+2. All SDSU temps (i.e. temps represented by edges) are in fact singly-used
+3. All defs of SDSU temps occur before the corresponding use
+4. All SDSU defs that are used are present in the linear IR
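+
+A minimal sketch of such a checker, assuming a hypothetical node shape with a `Next` link
+for the linear threading and an `Operands` list for the SDSU defs a node consumes (the
+real node types will differ):
+
+```C#
+using System;
+using System.Collections.Generic;
+
+// Hypothetical LIR node shape, used only for this sketch.
+class LirNode
+{
+    public LirNode Next;                                   // linear threading
+    public List<LirNode> Operands = new List<LirNode>();   // SDSU defs consumed by this node
+}
+
+static class LirChecker
+{
+    public static void Check(LirNode first)
+    {
+        // Check 1: the linear threading is loop-free.
+        var visited = new HashSet<LirNode>();
+        for (var n = first; n != null; n = n.Next)
+            if (!visited.Add(n))
+                throw new InvalidOperationException("cycle in linear threading");
+
+        // Checks 2-4: every used SDSU def appears earlier in the linear IR, and is used exactly once.
+        var seenDefs = new HashSet<LirNode>();
+        var usedDefs = new HashSet<LirNode>();
+        for (var n = first; n != null; n = n.Next)
+        {
+            foreach (var def in n.Operands)
+            {
+                if (!seenDefs.Contains(def))
+                    throw new InvalidOperationException("use before def, or def missing from linear IR");
+                if (!usedDefs.Add(def))
+                    throw new InvalidOperationException("SDSU temp has more than one use");
+            }
+            seenDefs.Add(n);
+        }
+    }
+}
+```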
+
+In the short term, we propose that the LIR be displayed using the linear dump that was recently added to the JIT. These
+dumps would be configured such that all of the information typically present in today’s tree-style dumps (e.g. node
+flags, value numbers, etc.) is also present in the linear dump.
+
+#### 2. Stop Generating embedded statements in the rationalizer.
+
+This can be approached in a few different ways:
+
+1. Remove commas from the IR before rationalizing. There is already infrastructure in-flight in order to perform this
+ transformation, so this is attractive from a dev-hours point of view. Unfortunately, this approach carries both
+   throughput and code quality risks due to the addition of another pass and the creation of additional local vars.
+
+2. Change the rationalizer such that it simply does not produce embedded statements. This requires that the
+ rationalizer is either able to operate on both HIR and LIR or simply linearize commas as it goes.
+
+3. Remove embedded statements from the IR with a linearization pass between the rationalizer and lowering.
+
+We will move forward with option 2, as it is the most attractive from a throughput and code quality standpoint:
+it does not add additional passes (as does option 3), nor does it introduce additional local vars (as does option
+1).
+
+#### 3. Refactor decomposition and lowering to work with linear LIR.
+
+The bulk of the work in this step involves moving these passes from statement- and tree-ordered walks to linear walks
+and refactoring any code that uses the parent stack to instead use a helper that can calculate the def-to-use edge for a
+node. It will also be necessary to replace embedded statement insertion with simple linear IR insertion, which is
+straightforward.
+
+##### 3.i. Decomposition
+
+Transitioning decomposition should be rather simple. This pass walks the nodes of each statement in execution order,
+decomposing 64-bit operations into equivalent sequences of 32-bit operations as it goes. Critically, the rewrites
+performed by decomposition are universally expansions of a single operator into a contiguous sequence of operators whose
+results are combined to form the ultimate result. This sort of transformation is easily performed on linear IR by
+inserting the new nodes before the node being replaced. Furthermore, because nodes will appear in the linear walk in the
+same relative order that they appear in the execution-order tree walk, rewrites performed in linear order will occur in
+the same order in which they were originally performed.
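+
+For example, with a hypothetical doubly-linked node shape (not the actual RyuJIT types),
+inserting an expansion sequence ahead of the node being replaced is a simple splice:
+
+```C#
+// Hypothetical doubly-linked LIR node, used only for this sketch.
+class Node
+{
+    public Node Prev, Next;
+}
+
+static class LinearInsert
+{
+    // Insert the already-linked sequence [first..last] immediately before 'node',
+    // so the new nodes execute just ahead of the node being expanded.
+    public static void InsertBefore(Node node, Node first, Node last)
+    {
+        first.Prev = node.Prev;
+        if (node.Prev != null) node.Prev.Next = first;
+
+        last.Next = node;
+        node.Prev = last;
+    }
+}
+```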
+
+##### 3.ii. Lowering
+
+The picture for lowering is much the same as the picture for decomposition. Like decomposition, all rewrites are
+performed in execution order, and most rewrites are simple expansions of a single node into multiple nodes. There is one
+notable exception, however: the lowering of add nodes for the x86 architecture examines the parent stack provided by
+tree walks in order to defer lowering until it may be possible to form an address mode. In this case, the use of the
+parent stack can be replaced with a helper that finds the node that uses the def produced by the add node.
+
+##### 3.iii. General issues
+
+Both decomposition and lowering make use of a utility to fix up information present in the per-call-node side table that
+tracks extra information about the call’s arguments when necessary. This helper can be replaced by instead fixing up the
+side table entries when the call is visited by each pass.
+
+#### 4. Remove uses of statements and the tree-order invariant from LSRA.
+
+LSRA currently depends on the tree order invariant so that it can build a stack to track data associated with the defs
+consumed by an operator. Removing the tree order invariant requires that LSRA is able to accommodate situations where
+the contents of this stack as produced by a linear walk would no longer contain correct information due to the insertion
+of new nodes that produce values that are live across the operator and some subset of its operands. Analysis indicates
+that a simple map from defs to the necessary information should be sufficient.
+
+#### 5. Remove uses of statements from the rest of the backend.
+
+The rest of the backend--i.e. codegen--has no known dependencies on tree order, so there are no concerns with respect to
+the change in ordering semantics. However, statement nodes are used to derive IL offsets for debug information. There
+are a number of alternative methods of tracking IL offsets to consider:
+
+1. Insert “IL offset” nodes into the linearly-ordered LIR. These nodes would then be noted by code generation and used
+ to emit the necessary IP-mapping entries in the same way that statements are today. This has the disadvantage of
+ requiring additional work when performing code motion on LIR if the IL offset for a particular node needs to be
+ kept up-to-date. However, today’s backend does not perform this sort of code motion and even if it did, optimized
+ debugging is not currently a scenario, so a loss of debugging fidelity may be acceptable.
+
+2. Use a side table to map from each node to its IL offset. Although this increases the total working set of the
+ backend, this has the advantage of making it easy to keep IL offsets correct in the face of code motion.
+
+3. Add IL offset information directly to each node. This has the disadvantage of increasing the working set of the
+ entire compiler.
+
+Unless we expect to require correct debug information in the face of code motion in the future, our recommendation is
+for the first option, which comes at a minimum size and implementation cost.
+
+#### 6. Remove contained nodes from the LIR’s execution order.
+
+Once statements are removed from the LIR, there is one major issue left to address: the presence of contained nodes in
+the LIR’s execution order. The execution of these nodes logically happens as part of the node in which they are
+contained, but these nodes remain physically present in the IR at their original locations. As a result, the backend
+must often take care to skip contained nodes when manipulating the LIR in ways that must take execution order into
+account. Instead of leaving such nodes in the LIR, these nodes should be removed from execution order and instead be
+represented as (probably unordered) trees referenced only by the containing node.
+
+### Conclusion and Future Directions
+
+The changes suggested in this paper move RyuJIT’s LIR from a linear view of tree-ordered nodes with certain nodes
+represented only in execution order to a linearly ordered sequence of nodes. Furthermore, with this change, tree edges
+in LIR would represent either uses of SDSU temps that have a place in execution order (where the edge points from the
+use of the temp to the def) or uses of unordered expression trees that execute as part of the parent node. This form
+should be easier to work with due to the removal of the tree order constraint and embedded statements and its similarity
+to other linear IR designs.