First Class Structs
===================

Objectives
----------
Primary Objectives
* Avoid forcing structs to the stack if they are only assigned to/from, or passed to/returned
 from a call or intrinsic
 - Including SIMD types as well as other pointer-sized-or-less struct types
 - Enable enregistration of structs that have no field accesses
* Optimize these types as effectively as any other basic type
 - Value numbering, especially for types that are used in intrinsics (e.g. SIMD)
 - Register allocation

Secondary Objectives
* No “swizzling” or lying about struct types – they are always struct types
 - No confusing use of GT_LCL_FLD to refer to the entire struct as a different type

Struct-Related Issues in RyuJIT
-------------------------------
The following issues illustrate some of the motivation for improving the handling of value types
(structs) in RyuJIT:

* VSO Bug 98404: .NET JIT x86 - poor code generated for value type initialization
 * This is a simple test case that should generate simply `xor eax, eax; ret` on x86 and x64, but
   instead generates many unnecessary copies. It is addressed by full enregistration of
   structs that fit into a register:
 
```C#
struct foo { public byte b1, b2, b3, b4; }
static foo getfoo() { return new foo(); }
```

* [\#1133 JIT: Excessive copies when inlining](https://github.com/dotnet/coreclr/issues/1133)
 * The scenario given in this issue involves a struct that is larger than 8 bytes, so
   it is not impacted by the fixed-size types. However, by enabling assertion propagation
   for struct types (which, in turn, is made easier by using normal assignments), the
   excess copies can be eliminated.
   * Note that these copies are not generated when passing and returning scalar types,
     and it may be worth considering (in future) whether we can avoid adding them
     in the first place.
 
* [\#1161  RyuJIT properly optimizes structs with a single field if the field type is int but not if it is double](https://github.com/dotnet/coreclr/issues/1161)
  * This issue arises because we never promote a struct with a single double field, due to
    the fact that such a struct may be passed or returned in a general purpose register.
    This issue could be addressed independently, but should "fall out" of improved heuristics
    for when to promote and enregister structs.
  
* [\#1636 Add optimization to avoid copying a struct if passed by reference and there are no
  writes to and no reads after passed to a callee](https://github.com/dotnet/coreclr/issues/1636).
  * This issue is nearly the same as the above, except that in this case the desire is to
    eliminate unneeded copies locally (i.e. not just due to inlining), in the case where
    the struct may or may not be passed or returned directly.
  * Unfortunately, there is not currently a scenario or test case for this issue.
  
* [\#3144 Avoid marking tmp as DoNotEnregister in tmp=GT_CALL() where call returns a
  enregisterable struct in two return registers](https://github.com/dotnet/coreclr/issues/3144)
  * This issue could be addressed without First Class Structs. However,
    it will be easier with struct assignments that are normalized as regular assignments, and
    should be done along with the streamlining of the handling of ABI-specific struct passing
    and return values.
    
* [\#3539 RyuJIT: Poor code quality for tight generic loop with many inlineable calls](https://github.com/dotnet/coreclr/issues/3539)
(factor x8 slower than non-generic few calls loop).
  * I am still investigating this issue.

* [\#5556 RyuJIT: structs in parameters and enregistering](https://github.com/dotnet/coreclr/issues/5556)
  * This also requires further investigation, and it depends on the work item to "Add support in prolog to extract fields, and
    remove the restriction of not promoting incoming reg structs that have more than one field" - see [Dependent Work Items](https://github.com/dotnet/coreclr/blob/master/Documentation/design-docs/first-class-structs.md#dependent-work-items)

Normalizing Struct Types
------------------------
We would like to facilitate full enregistration of a struct that has all of the following properties:
1. Its fields are infrequently accessed,
2. The entire struct fits into a register, and
3. Its value is used or defined in a register
   (i.e. as an argument to or return value from calls or intrinsics).

In RyuJIT, the concept of a type is very simplistic (which helps support the high throughput
of the JIT). Rather than a symbol table to hold the properties of a type, RyuJIT primarily
deals with types as simple values of an enumeration. When more detailed information is
required about the structure of a type, we query the type system, across the JIT/EE interface.
This is generally done only during the importer (translation from MSIL to the RyuJIT IR), and
during struct promotion analysis. As a result, struct types are treated as an opaque type
(TYP_STRUCT) of unknown size and structure.

In order to treat fully-enregisterable struct types as "first class" types in RyuJIT, we
 create new types with fixed size and structure:
* TYP_SIMD8, TYP_SIMD12, TYP_SIMD16 and (where supported by the target) TYP_SIMD32
 - These types already exist, and represent some already-completed steps toward First Class Structs.
* TYP_STRUCT1, TYP_STRUCT2, TYP_STRUCT4 and (on 64-bit systems) TYP_STRUCT8
 - These types are new, and will be used where struct types of the given size are passed and/or
 returned in registers.
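
As a rough illustration of the size-based normalization described above, the following C++ sketch maps the exact size of a struct to one of the fixed-size types. The `VarType` enumeration and the `normalizeStructType` function are hypothetical names used only for this example, not the JIT's actual declarations, and the sketch deliberately ignores the SIMD types and any ABI-specific rules about which structs are actually passed in registers.

```cpp
#include <cassert>

// Hypothetical, simplified stand-in for the JIT's type enumeration.
enum class VarType
{
    Struct,  // opaque struct: the size is not encoded in the type
    Struct1,
    Struct2,
    Struct4,
    Struct8, // assumed to be available only on 64-bit targets
};

// Map the exact size of a struct (in bytes) to a fixed-size struct type.
// Assumption: only sizes 1, 2, 4 and (on 64-bit targets) 8 are normalized;
// every other size stays an opaque struct.
inline VarType normalizeStructType(unsigned sizeInBytes, bool is64BitTarget)
{
    assert(sizeInBytes > 0);
    switch (sizeInBytes)
    {
        case 1:
            return VarType::Struct1;
        case 2:
            return VarType::Struct2;
        case 4:
            return VarType::Struct4;
        case 8:
            return is64BitTarget ? VarType::Struct8 : VarType::Struct;
        default:
            return VarType::Struct;
    }
}
```

Under these assumptions, the 4-byte `foo` struct from VSO Bug 98404 would normalize to `Struct4`, while a 16-byte struct such as the one in issue #1133 would remain an opaque `Struct`.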

We want to identify and normalize these types early in the compiler, before any decisions are
made regarding whether they are constrained to live on the stack and whether and how they are
promoted (scalar replaced) or copied.

One issue that arises is that it becomes necessary to know the size of any struct type that
we encounter, even if we may not actually need to know the size in order to generate code.
The major cause of additional queries seems to be for field references. It is possible to
defer some of these cases. I don't know what the throughput impact will be to always do the
normalization, but in principle I think it is worth doing because the alternative would be
to transform the types later (e.g. during morph) and use a contextual tree walk to see if we
care about the size of the struct. That would likely be a messier analysis.

Current Struct IR Phase Transitions
-----------------------------------

There are three phases in the JIT that make changes to the representation of struct tree
nodes and lclVars:

* Importer
  * All struct type lclVars have TYP_STRUCT
  * All struct assignments/inits are block ops
  * All struct call args are ldobj
  * Other struct nodes have TYP_STRUCT
* Struct promotion
  * Fields of promoted structs become separate lclVars (scalar promoted) with primitive types
* Global morph
  * All struct nodes are transformed to block ops
    - Besides call args
  * Some promoted structs are forced to stack
    - Become “dependently promoted”
  * Call args
    - Morphed to GT_LCL_FLD if passed in a register
    - Treated in various ways otherwise (inconsistent)

Proposed Approach
-----------------
The most fundamental change with first class structs is that struct assignments become
just a special case of assignment. The existing block ops (GT_INITBLK, GT_COPYBLK,
 GT_COPYOBJ, GT_LDOBJ) are eliminated. Instead, the block operations in the incoming MSIL
 are translated into assignments to or from a new GT_OBJ node.

New fixed-size struct types are added (TYP_STRUCT[1|2|4|8]), which are somewhat similar
to the (existing) SIMD types (TYP_SIMD[8|12|16|32]). As is currently done for the SIMD types,
these types are normalized in the importer.

Conceptually, struct nodes refer to the object, not the address. This is important, as
the existing block operations all take address operands, meaning that any lclVar involved
in an assignment (including initialization) will be in an address-taken context in the JIT,
requiring special analysis to identify the cases where the address is only taken in order
to assign to or from the lclVar. This further allows for consistency in the treatment of
structs and simple types - even potentially enabling optimizations of non-enregisterable
structs.

### Struct promotion

* Struct promotion analysis
 * Aggressively promote pointer-sized fields of structs used as args or returns
 * Consider FULL promotion of pointer-size structs
   * If there are fewer field references than calls or returns
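
The last heuristic above could be sketched roughly as follows. This is only an illustration: `StructRefCounts` and `preferFullPromotion` are hypothetical names, and the sketch compares raw reference counts, whereas a real heuristic would presumably weight them (for example by block frequency), which this document does not specify.

```cpp
#include <cassert>

// Hypothetical per-local reference counts gathered before struct promotion.
struct StructRefCounts
{
    unsigned fieldRefs;  // accesses to individual fields
    unsigned callRefs;   // uses as a call (or intrinsic) argument or result
    unsigned returnRefs; // uses as the return value of the current method
};

// Decide whether a struct local should be kept whole in a single register
// (full promotion) rather than being split into its fields.
// Assumption: only structs no larger than a pointer are candidates.
inline bool preferFullPromotion(const StructRefCounts& counts, unsigned structSize, unsigned pointerSize)
{
    assert(pointerSize > 0);
    if (structSize == 0 || structSize > pointerSize)
    {
        return false;
    }
    // "Fewer field references than calls or returns".
    return counts.fieldRefs < (counts.callRefs + counts.returnRefs);
}
```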

### Assignments
* Struct assignments look like any other assignment
* GenTreeAsg (GT_ASG) extends GenTreeOp with:

```C++
// True if this assignment is a volatile memory operation.
bool IsVolatile() const { return (gtFlags & GTF_BLK_VOLATILE) != 0; }
bool gtAsgGcUnsafe;

// What code sequence we will be using to encode this operation.
enum
{
    AsgKindInvalid,
    AsgKindDirect,
    AsgKindHelper,
    AsgKindRepInstr,
    AsgKindUnroll,
} gtAsgKind;
```

### Struct “objects” as lvalues
* Lhs of a struct assignment is a block node or lclVar
* Block nodes represent the address and “shape” info formerly on the block copy:
 * GT_BLK and GT_STORE_BLK (GenTreeBlk)
   * Has a (non-tree node) size field
   * Addr() is op1
   * Data() is op2
 * GT_OBJ and GT_STORE_OBJ (GenTreeObj extends GenTreeBlk)
   * gtClass, gtGcPtrs, gtGcPtrCount, gtSlots
 * GT_DYN_BLK and GT_STORE_DYN_BLK (GenTreeDynBlk extends GenTreeBlk)
   * Additional child gtDynamicSize
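
To make the relationship between these node kinds concrete, here is a simplified C++ model of the hierarchy. It is a sketch only: the type and member names below are stand-ins loosely based on the list above, not the actual JIT declarations. The `canLowerToPlainBlk` helper is likewise hypothetical; it reflects the idea, proposed later in this document, that an object node without GC pointers can be treated as a plain block.

```cpp
#include <cassert>

// Simplified model of the node shapes listed above; not the real JIT classes.
struct Node
{
    virtual ~Node() = default;
};

// Models GT_BLK / GT_STORE_BLK: an address plus a statically known size.
struct BlkNode : Node
{
    Node*    addr = nullptr; // op1: address of the block
    Node*    data = nullptr; // op2: value being stored (store forms only)
    unsigned size = 0;       // size in bytes; a plain field, not a tree node
};

// Models GT_OBJ / GT_STORE_OBJ: a block that also carries class and GC layout.
struct ObjNode : BlkNode
{
    const void* classHandle = nullptr; // opaque handle for the struct's class
    unsigned    gcPtrCount  = 0;       // number of pointer-sized slots holding GC refs
    unsigned    slots       = 0;       // number of pointer-sized slots in the struct
};

// Models GT_DYN_BLK / GT_STORE_DYN_BLK: the size is an additional child node.
struct DynBlkNode : BlkNode
{
    Node* dynamicSize = nullptr;
};

// Hypothetical helper: an object with no GC pointers needs no GC-aware copy,
// so it can be handled like a plain block.
inline bool canLowerToPlainBlk(const ObjNode& obj)
{
    assert(obj.gcPtrCount <= obj.slots);
    return obj.gcPtrCount == 0;
}
```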

### Struct “objects” as rvalues
After morph, structs on rhs of assignment are either:
* The tree node for the object: e.g. call, retExpr
* GT_IND of an address (e.g. GT_LEA)

The lhs provides the “shape” for the assignment. Note: it has been suggested that these could 
remain as GT_BLK nodes, but I have not given that any deep consideration.

### Preserving Struct Types in Trees

Prior to morphing, all nodes that may represent a struct type will have a class handle.
After morphing, some will become GT_IND.

### Structs As Call Arguments

All struct args are imported as GT_OBJ, and are transformed as follows during morph:
* P_FULL promoted locals:
  * Remain as GT_LCL_VAR nodes, with the appropriate fixed-size struct type.
  * Note that these may or may not be passed in registers.
* P_INDEP promoted locals:
  * These are the ones where the fields don’t match the reg types.
  * GT_STRUCT (or something) for aggregating multiple fields into a single register:
    * Op1 is a lclVar for the first promoted field
    * Op2 is the lclVar for the next field, OR another GT_STRUCT
    * Bit offset for the second child
* All other cases (non-locals, OR P_DEP or non-promoted locals):
  * GT_LIST of GT_IND for each half

### Struct Return

The return of a struct value from the current method is represented as follows:
* GT_RET(GT_OBJ) initially
* GT_OBJ morphed, and then transformed similarly to call args

Proposed Struct IR Phase Transitions
------------------------------------

* Importer
  * Struct assignments are imported as GT_ASG
  * Struct type is normalized to TYP_STRUCT* or TYP_SIMD*
* Struct promotion
  * Fields of promoted structs become separate lclVars (as is)
  * Enregisterable structs (including Pair Types) may be promoted to P_FULL (i.e. fully enregistered)
  * As a future optimization, we may "restructure" multi-register argument or return values as a
    synthesized struct of appropriately typed fields, and then promote it in the normal manner.
* Global morph
  * All struct type local variables remain as simple GT_LCL_VAR nodes.
  * All other struct nodes are transformed to GT_IND (rhs of assignment) or remain as GT_OBJ
    * In Lowering, GT_OBJ will be changed to GT_BLK if there are no gc ptrs. This could be done
      earlier, but there are some places where the object pointer is desired.
    * It is not actually clear if there is a great deal of value in the GT_BLK, but it was added
      to be more compatible with existing code that expects block copies with gc pointers to be
      distinguished from those that do not.
  * Promoted structs are forced to stack ONLY if address taken
  * Call arguments
    * Fixed-size enregisterable structs: GT_LCL_VAR or GT_OBJ of appropriate type.
    * Multi-register arguments: GT_LIST of register-sized operands:
      * GT_LCL_VAR if there is a promoted field that exactly matches the register size and type
        (note that, if we have performed the optimization mentioned above in struct promotion,
        we may have a GT_LCL_VAR of a synthesized struct field).
      * GT_LCL_FLD if there is a matching field in the struct that has not been promoted.
      * GT_IND otherwise. Note that if this is a struct local that does not have a matching field,
        this will force the local to live on the stack.
* Lowering
  * Pair types (e.g. TYP_LONG on 32-bit targets) are decomposed as needed to expose register requirements.
    Although these are not strictly structs, their handling is similar.
    * Computations are decomposed into their constituent parts when they independently write
      separate registers.
    * TYP_LONG lclVars (and TYP_DOUBLE on ARM) are split (similar to promotion/scalar replacement of
      structs) if and only if they are register candidates.
    * Other TYP_LONG/TYP_DOUBLE lclVars are loaded into independent registers either via:
      * Single GT_LCL_VAR that will translate into a pair load instruction (ldp), with two register 
        targets, or
      * GT_LCL_FLD (current approach) or GT_IND (probably a better approach)
  * Calls and loads that target multiple registers
    * Existing gtLsraInfo has the capability to specify multiple destination registers
    * Additional work is required in LSRA to handle these correctly
    * If HFAs can be return values (not just call args), then we may need to support up to 4 destination
      registers for LSRA
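
The choice of operand for each register-sized piece of a multi-register argument, as listed under Global morph above, can be sketched as below. This is a minimal model under stated assumptions: `RegSlot`, `ArgOperandKind`, `chooseOperand` and `forcesLocalToStack` are hypothetical names, and the two boolean inputs stand in for analysis that the JIT would have to perform on the struct layout.

```cpp
#include <cassert>
#include <vector>

// The three operand forms named in the list above.
enum class ArgOperandKind
{
    LclVar, // GT_LCL_VAR of a promoted field
    LclFld, // GT_LCL_FLD of an unpromoted but matching field
    Ind,    // GT_IND of the struct's address
};

// Hypothetical summary of how one register-sized slot lines up with the struct's fields.
struct RegSlot
{
    bool hasMatchingField; // a field exactly matches the register's size and type
    bool fieldIsPromoted;  // that field has been promoted to its own lclVar
};

inline ArgOperandKind chooseOperand(const RegSlot& slot)
{
    // A promoted field only makes sense if the matching field exists.
    assert(slot.hasMatchingField || !slot.fieldIsPromoted);
    if (!slot.hasMatchingField)
    {
        return ArgOperandKind::Ind;
    }
    return slot.fieldIsPromoted ? ArgOperandKind::LclVar : ArgOperandKind::LclFld;
}

// Per the note above, a struct local with a slot that has no matching field
// must live on the stack, because that slot is read through GT_IND.
inline bool forcesLocalToStack(const std::vector<RegSlot>& slots)
{
    for (const RegSlot& slot : slots)
    {
        if (chooseOperand(slot) == ArgOperandKind::Ind)
        {
            return true;
        }
    }
    return false;
}
```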

Sample IR
---------
### Bug 98404
#### Before

The `getfoo` method initializes a struct of 4 bytes.
The dump of the (single) local variable is included to show the change from `struct (8)` to
`struct4`, as the "exact size" of the struct is 4 bytes.
Here is the IR after Import:

```
;  V00 loc0           struct ( 8) 

   ▌  stmtExpr  void  (top level) (IL 0x000...  ???)
   │  ┌──▌  const     int    4
   └──▌  initBlk   void  
      │  ┌──▌  const     int    0
      └──▌  <list>    void  
         └──▌  addr      byref 
            └──▌  lclVar    struct V00 loc0         

   ▌  stmtExpr  void  (top level) (IL 0x008...  ???)
   └──▌  return    int   
      └──▌  lclFld    int    V00 loc0         [+0]
```
This is how it currently looks just before code generation:
```
   ▌  stmtExpr  void  (top level) (IL 0x000...0x003)
   │  ┌──▌  const     int    0 REG rax $81
   │  ├──▌  &lclVar   byref  V00 loc0         d:3 REG NA
   └──▌  storeIndir int    REG NA

   ▌  stmtExpr  void  (top level) (IL 0x008...0x009)
   │  ┌──▌  lclFld    int    V00 loc0         u:3[+0] (last use) REG rax $180
   └──▌  return    int    REG NA $181
```
And here is the resulting code:
```
  push     rax
  xor      rax, rax
  mov      qword ptr [V00 rsp], rax
  xor      eax, eax
  mov      dword ptr [V00 rsp], eax
  mov      eax, dword ptr [V00 rsp]
  add      rsp, 8
  ret      
```
#### After
Here is the IR after Import with the prototype First Class Struct changes.
Note that the fixed-size struct variable is assigned and returned just as for a scalar type.

```
;  V00 loc0          struct4 

   ▌  stmtExpr  void  (top level) (IL 0x000...  ???)
   │  ┌──▌  const     int    0
   └──▌  =         struct4 (init)
      └──▌  lclVar    struct4 V00 loc0         


   ▌  stmtExpr  void  (top level) (IL 0x008...  ???)
   └──▌  return    struct4
      └──▌  lclVar    struct4    V00 loc0         
```
And here is the resulting IR just prior to code generation:
```
   ▌  stmtExpr  void  (top level) (IL 0x008...0x009)
   │  ┌──▌  const     struct4    0 REG rax $81
   └──▌  return    struct4    REG NA $140
```
Finally, here is the resulting code that we were hoping to achieve:
```
  xor      eax, eax
```

### Issue 1133:
#### Before

Here is the IR after Inlining for the `TestValueTypesInInlinedMethods` method that invokes a
sequence of methods that are inlined, creating a sequence of copies.
Because this struct type does not fit into a single register, the types do not change (and
therefore the local variable table is not shown).

```
   ▌  stmtExpr  void  (top level) (IL 0x000...0x003)
   │  ┌──▌  const     int    16
   └──▌  initBlk   void  
      │  ┌──▌  const     int    0
      └──▌  <list>    void  
         └──▌  addr      byref 
            └──▌  lclVar    struct V00 loc0         

   ▌  stmtExpr  void  (top level) (IL 0x008...  ???)
   │  ┌──▌  const     int    16
   └──▌  copyBlk   void  
      │  ┌──▌  addr      byref 
      │  │  └──▌  lclVar    struct V00 loc0         
      └──▌  <list>    void  
         └──▌  addr      byref 
            └──▌  lclVar    struct V01 tmp0         

   ▌  stmtExpr  void  (top level) (IL 0x008...  ???)
   │  ┌──▌  const     int    16
   └──▌  copyBlk   void  
      │  ┌──▌  addr      byref 
      │  │  └──▌  lclVar    struct V01 tmp0         
      └──▌  <list>    void  
         └──▌  addr      byref 
            └──▌  lclVar    struct V02 tmp1         

   ▌  stmtExpr  void  (top level) (IL 0x008...  ???)
   │  ┌──▌  const     int    16
   └──▌  copyBlk   void  
      │  ┌──▌  addr      byref 
      │  │  └──▌  lclVar    struct V02 tmp1         
      └──▌  <list>    void  
         └──▌  addr      byref 
            └──▌  lclVar    struct V03 tmp2         

   ▌  stmtExpr  void  (top level) (IL 0x008...  ???)
   └──▌  call help long   HELPER.CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE
      ├──▌  const     long   0x7ff918494e10
      └──▌  const     int    1

   ▌  stmtExpr  void  (top level) (IL 0x008...  ???)
   │  ┌──▌  const     int    16
   └──▌  copyBlk   void  
      │  ┌──▌  addr      byref 
      │  │  └──▌  lclVar    struct V03 tmp2         
      └──▌  <list>    void  
         │  ┌──▌  const     long   8 Fseq[#FirstElem]
         └──▌  +         byref 
            └──▌  field     ref    s_dt

   ▌  stmtExpr  void  (top level) (IL 0x00E...  ???)
   └──▌  return    void  
```
And here is the resulting code:
```
sub      rsp, 104
xor      rax, rax
mov      qword ptr [V00 rsp+58H], rax
mov      qword ptr [V00+0x8 rsp+60H], rax
xor      rcx, rcx
lea      rdx, bword ptr [V00 rsp+58H]
vxorpd   ymm0, ymm0
vmovdqu  qword ptr [rdx], ymm0
vmovdqu  ymm0, qword ptr [V00 rsp+58H]
vmovdqu  qword ptr [V01 rsp+48H], ymm0
vmovdqu  ymm0, qword ptr [V01 rsp+48H]
vmovdqu  qword ptr [V02 rsp+38H], ymm0
vmovdqu  ymm0, qword ptr [V02 rsp+38H]
vmovdqu  qword ptr [V03 rsp+28H], ymm0
mov      rcx, 0x7FF918494E10
mov      edx, 1
call     CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE
mov      rax, 0x1FAC6EB29C8
mov      rax, gword ptr [rax]
add      rax, 8
vmovdqu  ymm0, qword ptr [V03 rsp+28H]
vmovdqu  qword ptr [rax], ymm0
add      rsp, 104
ret      
```

#### After
After fginline:
(note that the obj node will become a blk node downstream).
```
   ▌  stmtExpr  void  (top level) (IL 0x000...0x003)
   │  ┌──▌  const     int    0
   └──▌  =         struct (init)
      └──▌  lclVar    struct V00 loc0         

   ▌  stmtExpr  void  (top level) (IL 0x008...  ???)
   │  ┌──▌  lclVar    struct V00 loc0         
   └──▌  =         struct (copy)
      └──▌  lclVar    struct V01 tmp0         

   ▌  stmtExpr  void  (top level) (IL 0x008...  ???)
   │  ┌──▌  lclVar    struct V01 tmp0         
   └──▌  =         struct (copy)
      └──▌  lclVar    struct V02 tmp1         

   ▌  stmtExpr  void  (top level) (IL 0x008...  ???)
   │  ┌──▌  lclVar    struct V02 tmp1         
   └──▌  =         struct (copy)
      └──▌  lclVar    struct V03 tmp2         

   ▌  stmtExpr  void  (top level) (IL 0x008...  ???)
   └──▌  call help long   HELPER.CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE
      ├──▌  const     long   0x7ff9184b4e10
      └──▌  const     int    1

   ▌  stmtExpr  void  (top level) (IL 0x008...  ???)
   │  ┌──▌  lclVar    struct V03 tmp2         
   └──▌  =         struct (copy)
      └──▌  obj(16)   struct
         │  ┌──▌  const     long   8 Fseq[#FirstElem]
         └──▌  +         byref 
            └──▌  field     ref    s_dt

   ▌  stmtExpr  void  (top level) (IL 0x00E...  ???)
   └──▌  return    void  
```
Here is the IR after fgMorph:
Note that copy propagation has propagated the zero initialization through to the final store.
```
   ▌  stmtExpr  void  (top level) (IL 0x000...0x003)
   │  ┌──▌  const     int    0
   └──▌  =         struct (init)
      └──▌  lclVar    struct V00 loc0         

   ▌  stmtExpr  void  (top level) (IL 0x008...  ???)
   │  ┌──▌  const     struct 0
   └──▌  =         struct (init)
      └──▌  lclVar    struct V01 tmp0         

   ▌  stmtExpr  void  (top level) (IL 0x008...  ???)
   │  ┌──▌  const     struct 0
   └──▌  =         struct (init)
      └──▌  lclVar    struct V02 tmp1         

   ▌  stmtExpr  void  (top level) (IL 0x008...  ???)
   │  ┌──▌  const     struct 0
   └──▌  =         struct (init)
      └──▌  lclVar    struct V03 tmp2         

   ▌  stmtExpr  void  (top level) (IL 0x008...  ???)
   └──▌  call help long   HELPER.CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE
      ├──▌  const     long   0x7ffc8bbb4e10
      └──▌  const     int    1

   ▌  stmtExpr  void  (top level) (IL 0x008...  ???)
   │  ┌──▌  const     struct 0
   └──▌  =         struct (init)
      └──▌  obj(16)   struct
         │  ┌──▌  const     long   8 Fseq[#FirstElem]
         └──▌  +         byref 
            └──▌  indir     ref   
               └──▌  const(h)  long   0x2425b6229c8 static Fseq[s_dt]

   ▌  stmtExpr  void  (top level) (IL 0x00E...  ???)
   └──▌  return    void  

```
After liveness analysis the dead stores have been eliminated:
```
   ▌  stmtExpr  void  (top level) (IL 0x008...  ???)
   └──▌  call help long   HELPER.CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE
      ├──▌  const     long   0x7ffc8bbb4e10
      └──▌  const     int    1

   ▌  stmtExpr  void  (top level) (IL 0x008...  ???)
   │  ┌──▌  const     struct 0
   └──▌  =         struct (init)
      └──▌  obj(16)   struct
         │  ┌──▌  const     long   8 Fseq[#FirstElem]
         └──▌  +         byref 
            └──▌  indir     ref   
               └──▌  const(h)  long   0x2425b6229c8 static Fseq[s_dt]

   ▌  stmtExpr  void  (top level) (IL 0x00E...  ???)
   └──▌  return    void  
```
And here is the resulting code, going from a code size of 129 bytes down to 58.
```
sub      rsp, 40
mov      rcx, 0x7FFC8BBB4E10
mov      edx, 1
call     CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE
xor      rax, rax
mov      rdx, 0x2425B6229C8
mov      rdx, gword ptr [rdx]
add      rdx, 8
vxorpd   ymm0, ymm0
vmovdqu  qword ptr [rdx], ymm0
add      rsp, 40
ret 
```
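
The effect seen in the dumps above, where the zero initialization is propagated through the chain of copies and the intermediate stores then become dead, can be reproduced on a toy model. The sketch below is not the JIT's implementation: `Assign`, `propagateZero` and `removeDeadStores` are hypothetical, and it assumes straight-line code, locals whose address is never taken, and no locals live at the end of the method.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <vector>

constexpr int kZero   = -1; // source operand: the constant zero
constexpr int kMemory = -2; // destination operand: a store through an address

// Toy statement "dst = src". Locals are identified by non-negative numbers.
struct Assign
{
    int dst; // local number, or kMemory
    int src; // local number, or kZero
};

// Forward pass: a copy from a local known to hold zero becomes a zero init.
inline void propagateZero(std::vector<Assign>& stmts)
{
    std::set<int> knownZero;
    for (Assign& stmt : stmts)
    {
        if (stmt.src != kZero && knownZero.count(stmt.src) != 0)
        {
            stmt.src = kZero;
        }
        if (stmt.dst != kMemory)
        {
            if (stmt.src == kZero)
                knownZero.insert(stmt.dst);
            else
                knownZero.erase(stmt.dst);
        }
    }
}

// Backward pass: drop stores to locals that are never read afterwards.
// Stores to memory are always kept.
inline std::vector<Assign> removeDeadStores(const std::vector<Assign>& stmts)
{
    std::set<int>       live;
    std::vector<Assign> kept;
    for (auto it = stmts.rbegin(); it != stmts.rend(); ++it)
    {
        if (it->dst != kMemory && live.count(it->dst) == 0)
        {
            continue; // dead store
        }
        if (it->dst != kMemory)
            live.erase(it->dst);
        if (it->src != kZero)
            live.insert(it->src);
        kept.push_back(*it);
    }
    std::reverse(kept.begin(), kept.end());
    return kept;
}
```

Feeding this model the sequence `V00 = 0; V01 = V00; V02 = V01; V03 = V02; [mem] = V03` leaves a single zero store to memory, mirroring the shape of the final IR above. Such reasoning is straightforward here only because each statement is an ordinary assignment to a named local rather than a block operation on an address.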

Work Items
----------
This is a preliminary breakdown of the work into somewhat separable tasks. Those whose descriptions
are prefaced by '*' have been prototyped in an earlier version of the JIT, and that work is now
being re-integrated and tested, but may require some cleanup and/or phasing with other work items
before a PR is submitted.

### Mostly-Independent work items
1.	*Replace block ops with assignments & new nodes.

2.	*Add new fixed-size types, and normalize them in the importer (might be best to do this with or after #1, but not really dependent)

3.	LSRA
    * Enable support for multiple destination regs, call nodes that return a struct in multiple
      registers (for x64/ux, and for arm)
    * Handle multiple destination regs for ldp on arm64 (could be done before or concurrently with the above).
      Note that this work item is specifically intended for call arguments. It is likely the case that
      utilizing ldp for general-purpose code sequences would be handled separately.

4.	X64/ux: aggressively promote lclVar struct incoming or outgoing args with two 8-byte fields

5.	X64/ux:
    * modify the handling of multireg struct args to use GT_LIST of GT_IND
    * remove the restriction to NOT promote things that are multi-reg args, as long as they match (i.e. two 8-byte fields).
      Pass those using GT_LIST of GT_LCL_VAR.
    * stop adding extra lclVar copies

6.	Arm64:
    * Promote 16-byte struct lclVars that are incoming or outgoing register arguments only if they have 2 8-byte fields (DONE).
      Pass those using GT_LIST of GT_LCL_VAR (as above for x64/ux).
      Note that, if the types do not match, e.g. a TYP_DOUBLE field that will be passed in an integer register,
      it will require special handling in Lowering and LSRA, as is currently done in the TYP_SIMD8 case.
    * For other cases, pass as GT_LIST of GT_IND (DONE)
    * The GT_LIST would be created in fgMorphArgs(). Then in Lower, putarg_reg nodes will be inserted between
      the GT_LIST and the list item (GT_LCL_VAR or GT_IND). (DONE)
    * Add support for HFAs.
    
### Dependent work items:
    
7.	*(Depends on 1 & 2): Fully enregister TYP_STRUCT[1|2|4|8] with no field accesses.

8.  *(Depends on 1 & 2): Enable value numbering and assertion propagation for struct types.

9.	(Depends on 1 & 2, mostly to avoid conflicts): Add support in prolog to extract fields, and
    remove the restriction of not promoting incoming reg structs that have more than one field.
    Note that SIMD types are already reassembled in the prolog.
    
10.	(Not really dependent, but probably best done after 1, 2, 5, 6): Add support for assembling
    non-matching fields into registers for call args and returns. This includes producing the
    appropriate IR, which may simply be shifts and or's of the appropriate fields.
    This would either be done during `fgMorphArgs()` and the `GT_RETURN` case of `fgMorphSmpOp()`
    or as described below in
    [Extracting and Assembling Structs](#Extract-Assemble).
    
11. (Not really dependent, but probably best done after 1, 2, 5, 6): Add support for extracting the fields for the
    returned struct value of a call, producing the appropriate IR, which may simply be shifts and
    and's.
    This would either be done during the morphing of the call itself, or as described below in
    [Extracting and Assembling Structs](#Extract-Assemble).

12.	(Depends on 3, may replace the second part of 6): For arm64, add support for loading non-promoted
    or non-local structs with ldp
    * Either using TYP_STRUCT and special case handling, OR adding TYP_STRUCT16

13.	(Depends on 7, 9, 10, 11): Enhance struct promotion to allow full enregistration of structs,
    even if some fields are accessed, if there are more call/return references than field references.
    This work item should address issue #1161, by removing the automatic non-promotion
    of structs with a single double field, and adding appropriate heuristics for when it
    should be allowed.
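
For items 10 and 11, the "shifts and or's" and "shifts and and's" can be illustrated on the 4-byte struct from VSO Bug 98404. The sketch below is plain C++ rather than JIT IR, and `Foo`, `assembleFoo` and `extractFoo` are hypothetical names; it assumes a little-endian layout in which `b1` occupies the lowest byte of the register.

```cpp
#include <cassert>
#include <cstdint>

// Mirrors the 4-byte struct from VSO Bug 98404.
struct Foo
{
    std::uint8_t b1, b2, b3, b4;
};

// Assemble the fields into a single 32-bit register value using shifts and or's.
// Assumption: little-endian layout, so b1 ends up in the lowest byte.
inline std::uint32_t assembleFoo(const Foo& f)
{
    return static_cast<std::uint32_t>(f.b1) |
           (static_cast<std::uint32_t>(f.b2) << 8) |
           (static_cast<std::uint32_t>(f.b3) << 16) |
           (static_cast<std::uint32_t>(f.b4) << 24);
}

// Extract the fields from a 32-bit register value using shifts and and's.
inline Foo extractFoo(std::uint32_t reg)
{
    Foo f;
    f.b1 = static_cast<std::uint8_t>(reg & 0xFF);
    f.b2 = static_cast<std::uint8_t>((reg >> 8) & 0xFF);
    f.b3 = static_cast<std::uint8_t>((reg >> 16) & 0xFF);
    f.b4 = static_cast<std::uint8_t>((reg >> 24) & 0xFF);
    return f;
}
```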

Related Work Item
-----------------
These changes are somewhat orthogonal, though will likely have merge issues if done in parallel with any of
the above:
* Unified API for ABI info
  * Pass/Return info:
    * Num regs used for passing
    * Per-slot location (reg num / REG_STK)
    * Per-slot type (for reg “slots”)
    * Starting stack slot offset (if passed on stack)
    * By reference?
    * etc.
  * We should be able to unify HFA handling into this model
  * For arg passing, the API for creating the argEntry should take an arg state that keeps track of
    what regs have been used, and handles the backfilling case for ARM
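
One possible shape for such an API is sketched below. Everything here is hypothetical: the names (`AbiSlot`, `AbiPassInfo`, `ArgState`, `describeIntArg`) are invented for illustration, only integer registers are modeled, and the all-or-nothing policy (an argument goes entirely in registers or entirely on the stack) is a simplifying assumption rather than the rule of any particular ABI. HFAs and ARM backfilling are not modeled.

```cpp
#include <cassert>
#include <vector>

constexpr int kStackSlot = -1; // stands in for REG_STK

enum class SlotType
{
    Int,
    Float,
};

// Per-slot location and type.
struct AbiSlot
{
    int      regNum; // register number, or kStackSlot
    SlotType type;   // meaningful only for register slots
};

// Pass/return info for one argument.
struct AbiPassInfo
{
    std::vector<AbiSlot> slots;
    unsigned             stackOffset = 0;     // starting stack offset, if passed on the stack
    bool                 byReference = false; // passed by reference?

    unsigned numRegsUsed() const
    {
        unsigned count = 0;
        for (const AbiSlot& slot : slots)
        {
            if (slot.regNum != kStackSlot)
                ++count;
        }
        return count;
    }
};

// Tracks which registers and stack space have been consumed so far.
struct ArgState
{
    explicit ArgState(unsigned maxRegs) : maxIntRegs(maxRegs) {}
    unsigned maxIntRegs;
    unsigned nextIntReg      = 0;
    unsigned nextStackOffset = 0;
};

// Describe how the next integer-classified argument would be passed.
inline AbiPassInfo describeIntArg(ArgState& state, unsigned slotCount, unsigned slotSize)
{
    assert(slotCount > 0 && slotSize > 0);
    AbiPassInfo info;
    const bool  fitsInRegs = (state.nextIntReg + slotCount) <= state.maxIntRegs;
    if (!fitsInRegs)
    {
        info.stackOffset = state.nextStackOffset;
        state.nextStackOffset += slotCount * slotSize;
    }
    for (unsigned i = 0; i < slotCount; i++)
    {
        const int reg = fitsInRegs ? static_cast<int>(state.nextIntReg++) : kStackSlot;
        info.slots.push_back(AbiSlot{reg, SlotType::Int});
    }
    return info;
}
```

Keeping the register-tracking state in a separate `ArgState` object, as suggested in the last bullet above, is what would let target-specific policies be swapped in without changing the shape of the per-argument result.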
    
Open Design Questions
---------------------
### <a name="Extract-Assemble"/>Extracting and Assembling Structs

Should the IR for extracting and assembling struct arguments from or to argument or return registers
be generated directly during the morphing of call arguments and returns, or should this capability
be handled in a more general fashion in `fgMorphCopyBlock()`?
The latter seems desirable for its general applicability.

One way to handle this might be:

1. Whenever you have a case of mismatched structs (call args, call node, or return node),
   create a promoted temp of the "fake struct type", e.g. for arm you would introduce three
   new temps for the struct, and for each of its TYP_LONG promoted fields.
2. Add an assignment to or from the temp (e.g. as a setup arg node), BUT the structs on
   both sides of that assignment can now be promoted.
3. Add code to fgMorphCopyBlock to handle the extraction and assembling of structs.
4. The promoted fields of the temp would be preferenced to the appropriate argument or return registers.