Optimizer Codebase Status/Investments
=====================================

There are a number of areas in the optimizer that we know we would invest in
improving if resources were unlimited.  This document lists them and some
thoughts about their current state and prioritization, in an effort to capture
the thinking about them that comes up in planning discussions.


Big-Ticket Items
----------------

### Improved Struct Handling

This is an area that has received recent attention, with the [first-class structs](https://github.com/dotnet/coreclr/blob/master/Documentation/design-docs/first-class-structs.md)
work and the struct promotion improvements that went in for `Span<T>`.  Work here
is expected to continue and can happen incrementally.  Possible next steps:

 - Struct promotion stress mode (test mode to improve robustness/reliability)
 - Promotion of more structs; relax limits on e.g. field count (should generally
   help performance-sensitive code where structs are increasingly used to avoid
   heap allocations)
 - Improve handling of System V struct passing (I think we currently insert
   some unnecessary round-trips through memory at call boundaries due to
   internal representation issues)
 - Implicit byref parameter promotion w/o shadow copy

We don't have specific benchmarks that we know would jump in response to any of
these.  We may well be able to find some with a little looking, though this may
be an area where current performance-sensitive code avoids structs.

There's also work going on in corefx to use `Span<T>` more broadly.  We should
make sure we are expanding our span benchmarks appropriately to track and
respond to any particular issues that come out of that work.


### Exception handling

This is increasingly important as C# language constructs like async/await and
certain `foreach` incantations are implemented with EH constructs, making them
difficult to avoid at source level.  The recent work on finally cloning, empty
finally removal, and empty try removal targeted this.  [Writethrough](https://github.com/dotnet/coreclr/blob/master/Documentation/design-docs/eh-writethru.md)
is another key optimization enabler here, and we are actively pursuing it.  Other
things we've discussed include inlining methods with EH and computing funclet
callee-save register usage independently of main function callee-save register
usage, but I don't think we have any particular data pointing to either as a
high priority.


### Loop Optimizations

We haven't been targeting benchmarks that spend a lot of time doing computations
in an inner loop.  Pursuing loop optimizations for the peanut butter effect
would seem odd.  So this simply hasn't bubbled up in priority yet, though it's
bound to eventually.  Obvious candidates include [IV widening](https://github.com/dotnet/coreclr/issues/9179),
[unrolling](https://github.com/dotnet/coreclr/issues/11606), load/store motion,
and strength reduction.
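
As a reminder of what one of these candidates buys, here is a minimal sketch of
strength reduction, written in Python purely as a model of the transformation
(the function names are hypothetical and nothing here reflects JIT internals).
The multiply by the induction variable is replaced with a running add:

```python
def sum_scaled_naive(n, stride):
    """Recompute i * stride on every iteration."""
    total = 0
    for i in range(n):
        total += i * stride
    return total


def sum_scaled_reduced(n, stride):
    """Same loop after strength reduction: the multiply becomes an add."""
    total = 0
    offset = 0  # always equal to i * stride for the current iteration
    for _ in range(n):
        total += offset
        offset += stride
    return total
```

Both functions return the same result for any `n` and `stride`; the second
simply trades the per-iteration multiply for an extra loop-carried value.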


### High Tier Optimization

We don't have that many knobs we can "crank up" (though we do have the tracked
assertion count and could switch inliner policies), nor do we have any sort of
benchmarking story set up to validate whether tiering changes are helping or
hurting.  We should get that benchmarking story sorted out and at least hook
up those two knobs.

Some of this may depend on register allocation work, as the RA currently has
some issues, particularly around spill placement, that could be exacerbated by
very aggressive upstream optimizations.


Mid-Scale Items
---------------

### More Expression Optimizations

We again don't have particular benchmarks pointing to key missing cases, and
balancing the CQ vs TP will be delicate here, so it would really help to have
an appropriate benchmark suite to evaluate this work against.


### Forward Substitution

This too needs an appropriate benchmark suite that I don't think we have at
this time.  The tradeoffs against register pressure increase and throughput
need to be evaluated.  This also might make more sense to do if/when we can
handle SSA renames.
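
To make the transformation concrete, here is a toy sketch over a made-up
three-address form (tuples of `(dest, expr)`, with `tmp`-prefixed names standing
in for compiler temps; all of this is hypothetical and far simpler than the
JIT's IR).  It folds a temp into the next statement only when that is its single
use, and it ignores side effects and ordering hazards entirely:

```python
def count_uses(expr, name):
    """Count occurrences of variable `name` in an expression tree."""
    if expr == name:
        return 1
    if isinstance(expr, tuple):
        # expr[0] is the operator; the rest are operands
        return sum(count_uses(e, name) for e in expr[1:])
    return 0


def substitute(expr, name, replacement):
    """Return `expr` with every use of `name` replaced by `replacement`."""
    if expr == name:
        return replacement
    if isinstance(expr, tuple):
        return (expr[0],) + tuple(substitute(e, name, replacement) for e in expr[1:])
    return expr


def forward_substitute(stmts):
    """Fold single-use temps into the immediately following statement."""
    work = list(stmts)
    out = []
    for i, (dest, expr) in enumerate(work):
        if i + 1 < len(work):
            next_dest, next_expr = work[i + 1]
            used_later = any(count_uses(e, dest) for _, e in work[i + 2:])
            if (dest.startswith("tmp")
                    and count_uses(next_expr, dest) == 1
                    and not used_later):
                # Drop the temp's definition and inline it into its only use.
                work[i + 1] = (next_dest, substitute(next_expr, dest, expr))
                continue
        out.append((dest, expr))
    return out
```

Running this on `tmp1 = a + b; tmp2 = tmp1 * c; r = tmp2 - d` leaves a single
statement with one larger tree, which is exactly where the register pressure
and throughput questions above come from.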


### Async

We've made note of the prevalence of async/await in modern code (and particularly
in web server code such as TechEmpower), and have some opportunities listed in
[#7914](https://github.com/dotnet/coreclr/issues/7914).  Some sort of study of
async peanut butter to find more opportunities is probably in order, but what
would that look like?


### If-Conversion (cmov formation)

This hits big in microbenchmarks where it hits.  There's some work in flight
on this (see [#7447](https://github.com/dotnet/coreclr/issues/7447) and
[#10861](https://github.com/dotnet/coreclr/pull/10861)).
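
For readers less familiar with the idea, the sketch below models it in Python
(hypothetical names, 32-bit unsigned values assumed).  The first function has a
control dependence on the condition; the second computes the same result as a
pure data dependence, which is the shape a `cmov` gives us in hardware:

```python
MASK32 = 0xFFFFFFFF  # model values as unsigned 32-bit integers


def select_branchy(cond, a, b):
    """Original shape: a conditional branch picks the result."""
    if cond:
        return a
    return b


def select_branchless(cond, a, b):
    """If-converted shape: no branch, just a data-dependent select."""
    mask = -int(bool(cond)) & MASK32  # all ones if cond is true, else zero
    return (a & mask) | (b & ~mask & MASK32)
```

The mask arithmetic is only a stand-in for the select; the point is that the
second form has nothing for the branch predictor to get wrong.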


### Address Mode Building

One opportunity that's frequently visible in asm dumps is that more address
expressions could be folded into memory operands' address expressions.  This
would likely give a measurable codesize win.  This needs some thought about
where to run in the phase list and how aggressive to be about, e.g., analyzing
across statements.


### Low Tier Back-Off

We have some changes we know we want to make here: morph does more than it needs
to in minopts, and tier 0 should be doing throughput-improving inlines, as
opposed to minopts, which does no inlining.  We should get the benchmarking
story set up so that we can measure the effect of such changes when they go in.


### Helper Call Register Kill Set Improvements

We have some facility to allocate caller-save registers across calls to runtime
helpers that are known not to trash them, but the information about which
helpers trash which registers is spread across a few places in the codebase,
and has some puzzling quirks like separate "GC" and "NoGC" kill sets for the
same helper.  Unifying the information sources and then refining the recorded
kill sets would help avoid more stack traffic.  See [#12940](https://github.com/dotnet/coreclr/issues/12940).

Low-Hanging Fruit
-----------------

### Switch Lowering

The MSIL `switch` instruction is actually encoded as a jump table, so (for
better or worse) intelligent optimization of source-level switch statements
largely falls to the MSIL generator (e.g. Roslyn), since encoding sparse
switches as jump tables in MSIL would be impractical.  That said, when the MSIL
has a switch of just a few cases (as in [#12868](https://github.com/dotnet/coreclr/issues/12868)),
or just a few distinct cases that can be efficiently checked (as in [#12477](https://github.com/dotnet/coreclr/issues/12477)),
the JIT needn't blindly emit these as jump tables in the native code.  Work is
underway to address the latter case in [#12552](https://github.com/dotnet/coreclr/pull/12552).
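
As an illustration of the "few distinct cases" situation, one standard technique
is to replace the table with a bit test when every case lands on one of two
targets.  The sketch below models that in Python with a made-up table; it is not
a description of what #12552 implements:

```python
# Hypothetical 8-entry switch where every case goes to one of two targets.
JUMP_TABLE = [True, False, True, True, False, False, True, False]

# Pack the table into one integer: bit i is set when case i hits the target.
BITMASK = sum(1 << i for i, hit in enumerate(JUMP_TABLE) if hit)


def is_selected_table(x):
    """Jump-table shape: range check, then an indexed load."""
    if 0 <= x < len(JUMP_TABLE):
        return JUMP_TABLE[x]
    return False  # default case


def is_selected_bittest(x):
    """Bit-test shape: range check, then shift and mask on a constant."""
    return 0 <= x < len(JUMP_TABLE) and ((BITMASK >> x) & 1) == 1
```

The bit-test form needs no table in memory and no indirect jump, at the cost of
only working when the number of distinct targets is this small.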


### Write Barriers

A number of suggestions have been made for having the JIT recognize certain
patterns and emit specialized write barriers that avoid various overheads --
see [#13006](https://github.com/dotnet/coreclr/issues/13006) and [#12812](https://github.com/dotnet/coreclr/issues/12812).


### Byref-Exposed Store/Load Value Propagation

There are a few tweaks to our value-numbering for byref-exposed loads and stores
to share some of the machinery we use for heap loads and stores that would
allow better propagation through byref-exposed locals and out parameters --
see [#13457](https://github.com/dotnet/coreclr/issues/13457) and
[#13458](https://github.com/dotnet/coreclr/issues/13458).

Miscellaneous
-------------

### Value Number Conservativism

We have some frustrating phase-ordering issues resulting from this, but the
opt-repeat experiment indicated that they're not prevalent enough to merit
pursuing changing this right now.  Also, using SSA def as the proxy for value
number would require handling SSA renaming, so there's a big dependency chained
to this.
Maybe it's worth reconsidering the priority based on throughput?


### Mulshift

RyuJIT has an implementation that handles the valuable cases (see [analysis](https://gist.github.com/JosephTremoulet/c1246b17ea2803e93e203b9969ee5a25#file-mulshift-md)
and [follow-up](https://github.com/dotnet/coreclr/pull/13128) for details).
The current implementation is split across Morph and CodeGen; ideally it would
be moved to Lower, which is tracked by [#13150](https://github.com/dotnet/coreclr/issues/13150).
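
For context, and assuming "mulshift" here means rewriting multiplication by a
constant as shifts and scaled adds (our reading; the linked analysis is the
authority), the identities involved look like the following.  The Python below
is only a model with hypothetical function names; the comments about `lea`
describe the typical x64 encoding, not a claim about what RyuJIT emits:

```python
def mul_by_10(x):
    """x * 10 as (x * 8) + (x * 2): two shifts and an add."""
    return (x << 3) + (x << 1)


def mul_by_24(x):
    """x * 24 as (x * 3) * 8: a scaled add followed by a shift."""
    times3 = x + (x << 1)  # x * 3; on x64 this fits a single lea [x + x*2]
    return times3 << 3
```

Factors of the form 3, 5, or 9 times a power of two are the ones that map most
cheaply onto scaled addressing, which is why they tend to be the valuable cases.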