1 files changed, 117 insertions, 0 deletions
diff --git a/Documentation/design-docs/longs-on-32bit-arch.md b/Documentation/design-docs/longs-on-32bit-arch.md
new file mode 100644
index 0000000000..d04efb59fa
--- /dev/null
+++ b/Documentation/design-docs/longs-on-32bit-arch.md
@@ -0,0 +1,117 @@
+# Modeling Longs on 32-bit Architectures
+
+The challenge here is to model long operations in a way that reflects register lifetimes for the
+lo and hi halves of a 64-bit integer accurately enough that we can achieve good register
+allocation.
+
+## Background
+
+The current liveness model supports:
+* Internal registers (interfere with all sources, but no defs)
+* Sources that are not last use (interfere with all others)
+* Sources that are last use and NOT DelayFree (interfere with all sources, but no defs)
+* Sources that are last use AND DelayFree (interfere with all sources and defs)
+* Kills (interfere with all uses but no defs)
+
+Previously (eons ago) when we were working on arm32, the model had all but the last def 
+interfering with the uses and internal registers (i.e. they went live during the first
+location for the node).
+* This supports the ability to reuse inputs, e.g. for long adds, subs or addresses, after the first def register goes live.
+* This doesn’t work for supporting multi-reg call nodes:
+  * All defs need to occur AFTER the kills, and they do not interfere with any sources.
+
+Considerations for an expanded model:
+* Must be “pay for play” (not make LSRA more expensive for non-long code generation)
+* It is not practical to precisely model nodes that generate multiple instructions:
+  * In the case of longs, we may have one source operand that could be overwritten by
+    the first result, but other operands that need to remain live.
+  * However, on x86 it will be really important not to waste registers.
+
+## Decomposition
+
+This is the approach currently being taken. In the initial implementation, the reordering
+of nodes to support the required "valid tree traversal" property of the IR was causing
+the code to be incorrect because for instructions like `add` the carry bit needs to be
+immediately consumed by the computation ofthe hi bits.
+
+## Decomposition with temps
+
+In order to preserve the current decomposition approach and not change the fundamental
+tree ordering properties, it is necessary to evaluate sub-expressions into temps, to keep
+the lo and hi computations together.
+
+There are concerns about this, because it requires generating a number of extra temps
+in the case of nested expressions. However, mikedn has done some experimentation
+[here](https://github.com/mikedn/coreclr/blob/decompose/src/jit/lower.cpp#L424) 
+that indicaates that this approach may not be as problematic as we feared.
+
+This basic idea is that whenever we need to decompose hi/lo operations but keep them
+together (i.e. can’t preserve the tree traversal/linear order invariants), we create a temp.
+
+## Richer Liveness Model (No Decomposition)
+
+The idea here woudl be to retain `TYP_LONG` nodes in the IR, and to find a way to extend
+the liveness model used by Lowering, LSRA and CodeGen to ensure good register allocation.
+o	
+
+In the table below, there are several operations that have different register lifetime
+characteristics.
+Each is modeled with a sequence of "Locations" for which the changes in register lifetimes
+can be viewed as happening at the same time.
+All lifetimes participating in a given Location are considered to interfere.
+Note that the simplest model is that all the uses happen at one location, and then the
+def(s) happen at the next Location (i.e. the def does not interfere with the uses).
+The model becomes more complicated when we have constraints such as RMW (read-modify-write)
+operands, and even more so when we are actually modeling multiple target instructions
+with a single IR node.
+
+To avoid even more complication than is already inherent in this issue, we will assume
+that the evaluation order is fixed during Lowering (and in these examples we are assuming
+that the lo operand is evaluated first).
+
+In future we may want to consider the implications of relaxing that constraint, though
+note that the Decomposition approach also requires a predetermined evaluation order.
+
+| Operation     | Location 1 | Location 2 | Location 3 |
+| --------------|:----------:| ----------:| ----------:|
+| x = y         | use y.lo   | def x.lo   |            |
+|               | use y.hi   | def x.hi   |            |
+|               |            |            |            |
+| x = y + z     | use y.lo   | def x.lo   | def x.hi   |
+|               | use z.lo   | use y.hi   |            |
+|               | y.hi live  | use z.hi   |            |
+|               | z.hi live  | use z.hi   |            |
+|               |            |            |            |
+| x = *p        | use p      | def x.hi   |            |
+| (non-ldp)     | def x.lo   |            |            |
+|               |            |            |            |
+| x = *p        | use p      | def x.lo   |            |
+| (ldp)         |            | def x.hi   |            |
+|               |            |            |            |
+| x = \*(p+i*8) | use p      | def x.hi   |            |
+| (non-ldp)     | use i      |            |            |
+|               | def x.lo   |            |            |
+|               |            |            |            |
+| x = call      | use args   | def x.lo   |            |
+|               | kill (end) | def x.hi   |            |
+
+
+Both the non-ldp  (load pair - available on Arm but not x86) cases take
+advantage of the fact that all the sources must remain live
+with the first def. The same can be done in the non-ldp case of `x = *p`.
+However, that approach doesn't reduce the Location requirement for the `x = y + z` case,
+where, if we assume that all inputs must be live for the first def, then we keep the lo
+registers live longer than necessary, and therefore can't reuse them for the x.lo (or
+other) results.
+
+TO DO:
+* Extend the above to include more operations, and to show the actual instructions
+  generated, to determine how well we can approximate the "optimal" liveness model.
+
+## True Linear IR
+
+It would be really nice to break the “valid tree traversal” restriction, and allow
+arbitrary (or at least more flexible) reordering of nodes when in linear form.
+* Gets rid of embedded statements!!
+* Affects (at least) rationalizer and lsra, which implicitly depend on this property in their use of stacks.
+* Probably has many unknown other impacts