1 files changed, 487 insertions, 0 deletions
diff --git a/Documentation/design-docs/finally-optimizations.md b/Documentation/design-docs/finally-optimizations.md
new file mode 100644
index 0000000000..d35d5a47bd
--- /dev/null
+++ b/Documentation/design-docs/finally-optimizations.md
@@ -0,0 +1,487 @@
+Finally Optimizations
+=====================
+
+In MSIL, a try-finally is a construct where a block of code
+(the finally) is guaranteed to be executed after control leaves a
+protected region of code (the try) either normally or via an
+exception.
+
+In RyuJit a try-finally is currently implemented by transforming the
+finally into a local function that is invoked via jitted code at normal
+exits from the try block and is invoked via the runtime for exceptional
+exits from the try block.
+
+For x86 the local function is simply a part of the method and shares
+the same stack frame with the method. For other architectures the
+local function is promoted to a potentially separable "funclet"
+which is almost like a regular function with a prolog and epilog. A
+custom calling convention gives the funclet access to the parent stack
+frame.
+
+In this proposal we outline three optimizations for finallys: removing
+empty trys, removing empty finallys and finally cloning.
+
+Empty Finally Removal
+---------------------
+
+An empty finally is one that has no observable effect. These often
+arise from `foreach` or `using` constructs (which induce a
+try-finally) where the cleanup method called in the finally does
+nothing. Often, after inlining, the empty finally is readily apparent.
+
+For example, this snippet of C# code
+```C#
+static int Sum(List<int> x) {
+    int sum = 0;
+    foreach(int i in x) {
+        sum += i;
+    }
+    return sum;
+}
+```
+produces the following jitted code:
+```asm
+; Successfully inlined Enumerator[Int32][System.Int32]:Dispose():this
+;    (1 IL bytes) (depth 1) [below ALWAYS_INLINE size]
+G_M60484_IG01:
+       55                   push     rbp
+       57                   push     rdi
+       56                   push     rsi
+       4883EC50             sub      rsp, 80
+       488D6C2460           lea      rbp, [rsp+60H]
+       488BF1               mov      rsi, rcx
+       488D7DD0             lea      rdi, [rbp-30H]
+       B906000000           mov      ecx, 6
+       33C0                 xor      rax, rax
+       F3AB                 rep stosd
+       488BCE               mov      rcx, rsi
+       488965C0             mov      qword ptr [rbp-40H], rsp
+
+G_M60484_IG02:
+       33C0                 xor      eax, eax
+       8945EC               mov      dword ptr [rbp-14H], eax
+       8B01                 mov      eax, dword ptr [rcx]
+       8B411C               mov      eax, dword ptr [rcx+28]
+       33D2                 xor      edx, edx
+       48894DD0             mov      gword ptr [rbp-30H], rcx
+       8955D8               mov      dword ptr [rbp-28H], edx
+       8945DC               mov      dword ptr [rbp-24H], eax
+       8955E0               mov      dword ptr [rbp-20H], edx
+
+G_M60484_IG03:
+       488D4DD0             lea      rcx, bword ptr [rbp-30H]
+       E89B35665B           call     Enumerator[Int32][System.Int32]:MoveNext():bool:this
+       85C0                 test     eax, eax
+       7418                 je       SHORT G_M60484_IG05
+
+; Body of foreach loop
+
+G_M60484_IG04:
+       8B4DE0               mov      ecx, dword ptr [rbp-20H]
+       8B45EC               mov      eax, dword ptr [rbp-14H]
+       03C1                 add      eax, ecx
+       8945EC               mov      dword ptr [rbp-14H], eax
+       488D4DD0             lea      rcx, bword ptr [rbp-30H]
+       E88335665B           call     Enumerator[Int32][System.Int32]:MoveNext():bool:this
+       85C0                 test     eax, eax
+       75E8                 jne      SHORT G_M60484_IG04
+
+; Normal exit from the implicit try region created by `foreach`
+; Calls the finally to dispose of the iterator
+
+G_M60484_IG05:
+       488BCC               mov      rcx, rsp
+       E80C000000           call     G_M60484_IG09      // call to finally
+
+G_M60484_IG06:
+       90                   nop
+
+G_M60484_IG07:
+       8B45EC               mov      eax, dword ptr [rbp-14H]
+
+G_M60484_IG08:
+       488D65F0             lea      rsp, [rbp-10H]
+       5E                   pop      rsi
+       5F                   pop      rdi
+       5D                   pop      rbp
+       C3                   ret
+
+; Finally funclet. Note it simply sets up and then tears down a stack
+; frame. The dispose method was inlined and is empty.
+
+G_M60484_IG09:
+       55                   push     rbp
+       57                   push     rdi
+       56                   push     rsi
+       4883EC30             sub      rsp, 48
+       488B6920             mov      rbp, qword ptr [rcx+32]
+       48896C2420           mov      qword ptr [rsp+20H], rbp
+       488D6D60             lea      rbp, [rbp+60H]
+
+G_M60484_IG10:
+       4883C430             add      rsp, 48
+       5E                   pop      rsi
+       5F                   pop      rdi
+       5D                   pop      rbp
+       C3                   ret
+```
+
+In such cases the try-finally can be removed, leading to code like the following:
+```asm
+G_M60484_IG01:
+       57                   push     rdi
+       56                   push     rsi
+       4883EC38             sub      rsp, 56
+       488BF1               mov      rsi, rcx
+       488D7C2420           lea      rdi, [rsp+20H]
+       B906000000           mov      ecx, 6
+       33C0                 xor      rax, rax
+       F3AB                 rep stosd
+       488BCE               mov      rcx, rsi
+
+G_M60484_IG02:
+       33F6                 xor      esi, esi
+       8B01                 mov      eax, dword ptr [rcx]
+       8B411C               mov      eax, dword ptr [rcx+28]
+       48894C2420           mov      gword ptr [rsp+20H], rcx
+       89742428             mov      dword ptr [rsp+28H], esi
+       8944242C             mov      dword ptr [rsp+2CH], eax
+       89742430             mov      dword ptr [rsp+30H], esi
+
+G_M60484_IG03:
+       488D4C2420           lea      rcx, bword ptr [rsp+20H]
+       E8A435685B           call     Enumerator[Int32][System.Int32]:MoveNext():bool:this
+       85C0                 test     eax, eax
+       7414                 je       SHORT G_M60484_IG05
+
+G_M60484_IG04:
+       8B4C2430             mov      ecx, dword ptr [rsp+30H]
+       03F1                 add      esi, ecx
+       488D4C2420           lea      rcx, bword ptr [rsp+20H]
+       E89035685B           call     Enumerator[Int32][System.Int32]:MoveNext():bool:this
+       85C0                 test     eax, eax
+       75EC                 jne      SHORT G_M60484_IG04
+
+G_M60484_IG05:
+       8BC6                 mov      eax, esi
+
+G_M60484_IG06:
+       4883C438             add      rsp, 56
+       5E                   pop      rsi
+       5F                   pop      rdi
+       C3                   ret
+```
+
+Empty finally removal is unconditionally profitable: it should always
+reduce code size and improve code speed.
+
+Empty Try Removal
+---------------------
+
+If the try region of a try-finally is empty, and the jitted code will
+execute on a runtime that does not protect finally execution from
+thread abort, then the try-finally can be replaced with just the
+content of the finally.
+
+Empty trys with non-empty finallys often exist in code that must run
+under both thread-abort aware and non-thread-abort aware runtimes. In
+the former case the placement of cleanup code in the finally ensures
+that the cleanup code will execute fully. But if thread abort is not
+possible, the extra protection offered by the finally is not needed.
+
+Empty try removal looks for try-finallys where the try region does
+nothing except invoke the finally. There are currently two different
+EH implementation models, so the try screening has two cases:
+
+* callfinally thunks (x64/arm64): the try must be a single empty
+basic block that always jumps to a callfinally that is the first
+half of a callfinally/always pair;
+* non-callfinally thunks (x86/arm32): the try must be a
+callfinally/always pair where the first block is an empty callfinally.
+
+The screening then verifies that the callfinally identified above is
+the only callfinally for the try. No other callfinallys are expected
+because this try cannot have multiple leaves and its handler cannot be
+reached by nested exit paths.
+
+When the empty try is identified, the jit modifies the
+callfinally/always pair to branch to the handler, modifies the
+handler's return to branch directly to the continuation (the
+branch target of the second half of the callfinally/always pair),
+updates various status flags on the blocks, and then removes the
+try-finally region.
+
+Finally Cloning
+---------------
+
+Finally cloning is an optimization where the jit duplicates the code
+in the finally for one or more of the normal exit paths from the try,
+and has those exit points branch to the duplicated code directly,
+rather than calling the finally.  This transformation allows for
+improved performance and optimization of the common case where the try
+completes without an exception.
+
+Finally cloning also allows hot/cold splitting of finally bodies: the
+cloned finally code covers the normal try exit paths (the hot cases)
+and can be placed in the main method region, and the original finally,
+now used largely or exclusively for exceptional cases (the cold cases)
+spilt off into the cold code region. Without cloning, RyuJit
+would always treat the finally as cold code.
+
+Finally cloning will increase code size, though often the size
+increase is mitigated somewhat by more compact code generation in the
+try body and streamlined invocation of the cloned finallys.
+
+Try-finally regions may have multiple normal exit points. For example
+the following `try` has two: one at the `return 3` and one at the try
+region end:
+
+```C#
+try {
+   if (p) return 3;
+   ...
+}
+finally {
+   ...
+}
+return 4;
+```
+
+Here the finally must be executed no matter how the try exits. So
+there are to two normal exit paths from the try, both of which pass
+through the finally but which then diverge. The fact that some try
+regions can have multiple exits opens the potential for substantial
+code growth from finally cloning, and so leads to a choice point in
+the implementation:
+
+* Share the clone along all exit paths
+* Share the clone along some exit paths
+* Clone along all exit paths
+* Clone along some exit paths
+* Only clone along one exit path
+* Only clone when there is one exit path
+
+The shared clone option must essentially recreate or simulate the
+local call mechanism for the finally, though likely somewhat more
+efficiently. Each exit point must designate where control should
+resume once the shared finally has finished.  For instance the jit
+could introduce a new local per try-finally to determine where the
+cloned finally should resume, and enumerate the possibilities using a
+small integer. The end of the cloned finally would then use a switch
+to determine what code to execute next. This has the downside of
+introducing unrealizable paths into the control flow graph.
+
+Cloning along all exit paths can potentially lead to large amounts of
+code growth.
+
+Cloning along some paths or only one path implies that some normal
+exit paths won't be as well optimized. Nonetheless cloning along one
+path was the choice made by JIT64 and the one we recommend for
+implementation. In particular we suggest only cloning along the end of
+try region exit path, so that any early exit will continue to invoke
+the funclet for finally cleanup (unless that exit happens to have the
+same post-finally continuation as the end try region exit, in which
+case it can simply jump to the cloned finally).
+
+One can imagine adaptive strategies. The size of the finally can
+be roughly estimated and the number of clones needed for full cloning
+readily computed. Selective cloning can be based on profile
+feedback or other similar mechanisms for choosing the profitable
+cases.
+
+The current implementation will clone the finally and retarget the
+last (largest IL offset) leave in the try region to the clone. Any
+other leave that ultimately transfers control to the same post-finally
+offset will also be modified to jump to the clone.
+
+Empirical studies have shown that most finallys are small. Thus to
+avoid excessive code growth, a crude size estimate is formed by
+counting the number of statements in the blocks that make up the
+finally. Any finally larger that 15 statements is not cloned. In our
+study this disqualifed about 0.5% of all finallys from cloning.
+
+### EH Nesting Considerations
+
+Finally cloning is also more complicated when the finally encloses
+other EH regions, since the clone will introduce copies of all these
+regions. While it is possible to implement cloning in such cases we
+propose to defer for now.
+
+Finally cloning is also a bit more complicated if the finally is
+enclosed by another finally region, so we likewise propose deferring
+support for this.  (Seems like a rare enough thing but maybe not too
+hard to handle -- though possibly not worth it if we're not going to
+support the enclosing case).
+
+### Control-Flow and Other Considerations
+
+If the try never exits normally, then the finally can only be invoked
+in exceptional cases. There is no benefit to cloning since the cloned
+finally would be unreachable. We can detect a subset of such cases
+because there will be no call finally blocks.
+
+JIT64 does not clone finallys that contained switch. We propose to
+do likewise. (Initially I did not include this restriction but
+hit a failing test case where the finally contained a switch. Might be
+worth a deeper look, though such cases are presumably rare.)
+
+If the finally never exits normally, then we presume it is cold code,
+and so will not clone.
+
+If the finally is marked as run rarely, we will not clone.
+
+Implementation Proposal
+-----------------------
+
+We propose that empty finally removal and finally cloning be run back
+to back, spliced into the phase list just after fgInline and
+fgAddInternal, and just before implicit by-ref and struct
+promotion. We want to run these early before a lot of structural
+invariants regarding EH are put in place, and before most
+other optimization, but run them after inlining
+(so empty finallys can be more readily identified) and after the
+addition of implicit try-finallys created by the jit.  Empty finallys
+may arise later because of optimization, but this seems relatively
+uncommon.
+
+We will remove empty finallys first, then clone.
+
+Neither optimization will run when the jit is generating debuggable
+code or operating in min opts mode.
+
+### Empty Finally Removal (Sketch)
+
+Skip over methods that have no EH, are compiled with min opts, or
+where the jit is generating debuggable code.
+
+Walk the handler table, looking for try-finally (we could also look
+for and remove try-faults with empty faults, but those are presumably
+rare).
+
+If the finally is a single block and contains only a `retfilter`
+statement, then:
+
+* Retarget the callfinally(s) to jump always to the continuation blocks.
+* Remove the paired jump always block(s) (note we expect all finally
+calls to be paired since the empty finally returns).
+* For funclet EH models with finally target bits, clear the finally
+target from the continuations.
+* For non-funclet EH models only, clear out the GT_END_LFIN statement
+in the finally continuations.
+* Remove the handler block.
+* Reparent all directly contained try blocks to the enclosing try region
+or to the method region if there is no enclosing try.
+* Remove the try-finally from the EH table via `fgRemoveEHTableEntry`.
+
+After the walk, if any empty finallys were removed, revalidate the
+integrity of the handler table.
+
+### Finally Cloning (Sketch)
+
+Skip over all methods, if the runtime suports thread abort. More on
+this below.
+
+Skip over methods that have no EH, are compiled with min opts, or
+where the jit is generating debuggable code.
+
+Walk the handler table, looking for try-finally. If the finally is
+enclosed in a handler or encloses another handler, skip.
+
+Walk the finally body blocks. If any is BBJ_SWITCH, or if none
+is BBJ_EHFINALLYRET, skip cloning. If all blocks are RunRarely
+skip cloning. If the finally has more that 15 statements, skip
+cloning.
+
+Walk the try region from back to front (from largest to smallest IL
+offset). Find the last block in the try that invokes the finally. That
+will be the path that will invoke the clone.
+
+If the EH model requires callfinally thunks, and there are multiple
+thunks that invoke the finally, and the callfinally thunk along the
+clone path is not the first, move it to the front (this helps avoid
+extra jumps).
+
+Set the insertion point to just after the callfinally in the path (for
+thunk models) or the end of the try (for non-thunk models).  Set up a
+block map. Clone the finally body using `fgNewBBinRegion` and
+`fgNewBBafter` to make the first and subsequent blocks, and
+`CloneBlockState` to fill in the block contents. Clear the handler
+region on the cloned blocks. Bail out if cloning fails. Mark the first
+and last cloned blocks with appropriate BBF flags. Patch up inter-clone
+branches and convert the returns into jumps to the continuation.
+
+Walk the callfinallys, retargeting the ones that return to the
+continuation so that they invoke the clone. Remove the paired always
+blocks. Clear the finally target bit and any GT_END_LFIN from the
+continuation.
+
+If all call finallys are converted, modify the region to be try/fault
+(interally EH_HANDLER_FAULT_WAS_FINALLY, so we can distinguish it
+later from "organic" try/faults).  Otherwise leave it as a
+try/finally.
+
+Clear the catch type on the clone entry.
+
+### Thread Abort
+
+For runtimes that support thread abort (desktop), more work is
+required:
+
+* The cloned finally must be reported to the runtime. Likely this
+can trigger off of the BBF_CLONED_FINALLY_BEGIN/END flags.
+* The jit must maintain the integrity of the clone by not losing
+track of the blocks involved, and not allowing code to move in our
+out of the cloned region
+
+Code Size Impact
+----------------
+
+Code size impact from finally cloning was measured for CoreCLR on
+Windows x64.
+
+```
+Total bytes of diff: 16158 (0.12 % of base)
+    diff is a regression.
+Total byte diff includes 0 bytes from reconciling methods
+        Base had    0 unique methods,        0 unique bytes
+        Diff had    0 unique methods,        0 unique bytes
+Top file regressions by size (bytes):
+        3518 : Microsoft.CodeAnalysis.CSharp.dasm (0.16 % of base)
+        1895 : System.Linq.Expressions.dasm (0.32 % of base)
+        1626 : Microsoft.CodeAnalysis.VisualBasic.dasm (0.07 % of base)
+        1428 : System.Threading.Tasks.Parallel.dasm (4.66 % of base)
+        1248 : System.Linq.Parallel.dasm (0.20 % of base)
+Top file improvements by size (bytes):
+       -4529 : System.Private.CoreLib.dasm (-0.14 % of base)
+        -975 : System.Reflection.Metadata.dasm (-0.28 % of base)
+        -239 : System.Private.Uri.dasm (-0.27 % of base)
+        -104 : System.Runtime.InteropServices.RuntimeInformation.dasm (-3.36 % of base)
+         -99 : System.Security.Cryptography.Encoding.dasm (-0.61 % of base)
+57 total files with size differences.
+Top method regessions by size (bytes):
+         645 : System.Diagnostics.Process.dasm - System.Diagnostics.Process:StartCore(ref):bool:this
+         454 : Microsoft.CSharp.dasm - Microsoft.CSharp.RuntimeBinder.Semantics.ExpressionBinder:AdjustCallArgumentsForParams(ref,ref,ref,ref,ref,byref):this
+         447 : System.Threading.Tasks.Dataflow.dasm - System.Threading.Tasks.Dataflow.Internal.SpscTargetCore`1[__Canon][System.__Canon]:ProcessMessagesLoopCore():this
+         421 : Microsoft.CodeAnalysis.VisualBasic.dasm - Microsoft.CodeAnalysis.VisualBasic.Symbols.ImplementsHelper:FindExplicitlyImplementedMember(ref,ref,ref,ref,ref,ref,byref):ref
+         358 : System.Private.CoreLib.dasm - System.Threading.TimerQueueTimer:Change(int,int):bool:this
+Top method improvements by size (bytes):
+       -2512 : System.Private.CoreLib.dasm - DomainNeutralILStubClass:IL_STUB_CLRtoWinRT():ref:this (68 methods)
+        -824 : Microsoft.CodeAnalysis.dasm - Microsoft.Cci.PeWriter:WriteHeaders(ref,ref,ref,ref,byref):this
+        -663 : System.Private.CoreLib.dasm - DomainNeutralILStubClass:IL_STUB_CLRtoWinRT(ref):int:this (17 methods)
+        -627 : System.Private.CoreLib.dasm - System.Diagnostics.Tracing.ManifestBuilder:CreateManifestString():ref:this
+        -546 : System.Private.CoreLib.dasm - DomainNeutralILStubClass:IL_STUB_WinRTtoCLR(long):int:this (67 methods)
+3014 total methods with size differences.
+```
+
+The largest growth is seen in `Process:StartCore`, which has 4
+try-finally constructs.
+
+Diffs generally show improved codegen in the try bodies with cloned
+finallys. However some of this improvement comes from more aggressive
+use of callee save registers, and this causes size inflation in the
+funclets (note finally cloning does not alter the number of
+funclets). So if funclet save/restore could be contained to registers
+used in the funclet, the size impact would be slightly smaller.
+
+There are also some instances where cloning relatively small finallys
+leads to large code size increases. xxx is one example.